Role of Bioinformatics Analysis and Databases in Determination of Function of Proteins with Unknown Functions (PUF)
Orphan genes are genes that have no recognizable homologues in other lineages. In other words, a gene is called an orphan when it is unique to a species (or, in some cases, to a taxon). Proteins encoded by these genes can be produced in significant amounts and have recently been found to be important, even essential, for some species, which makes their contribution difficult to overlook (Li, Foster et al. 2009). However, their lack of sequence similarity and unfamiliar origin make the search for their function difficult (Toll-Riera, Bosch et al. 2009). Fortunately, with the arrival of next-generation sequencing, a large body of publicly available online resources can assist in this pursuit. In this short essay, I use seven papers on one well-known plant orphan gene, qua-quine starch, as an example to show how bioinformatics tools and databases can help in investigating the function of such genes.
According to Arendsee et al., qua-quine starch (QQS) is the first plant orphan gene for which researchers were able to determine a biochemically characterized function (Arendsee, Li et al. 2014). The research team was trying to expand its understanding of the Arabidopsis starch metabolic pathway and, in the process, knocked out one of the starch synthase genes, SS3 (Li, Foster et al. 2009). To their surprise, even in the absence of a starch synthase gene, the group found a higher starch content in the knockout mutant. The knockout not only changed the AtSS3 transcript level but also influenced the expression of other transcripts in the mutants. Among the genes with altered transcript levels was At3g30720, later named QQS (qua-quine starch), which was noteworthy from both an expression and a functional point of view.
The observed increase in starch content in the Atss3 mutant initially led the researchers to conclude that QQS is a potential component of starch metabolism (Li, Foster et al. 2009). This discovery generated interest among researchers in the field and led to many subsequent studies. Several bioinformatics tools and databases were particularly helpful in taking this research to the next level.
Here is a short list of those tools and databases:
- BLAST: The Basic Local Alignment Search Tool is an algorithm for comparing primary biological sequence information, such as the amino acid sequences of proteins or the nucleotide sequences of DNA and RNA. A BLAST search lets a researcher compare a query sequence with a library or database of sequences and identify the library sequences that resemble the query above a certain threshold. The most popular and comprehensive publicly available BLAST service is provided by the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/BLAST). Genes are generally classified as orphans if they lack coding sequence similarity outside their species, usually as quantified by BLAST (Arendsee, Li et al. 2014). QQS shows no nucleotide or amino acid sequence similarity to any other Arabidopsis gene, or to any gene or sequence from other species in the NCBI database, which helped researchers recognize it as an orphan gene.
Researchers relied on BLAST at many steps of the QQS research. Arendsee, Li et al. (2014) used protein BLAST to establish a phylostratigraphy of the protein-coding genes of A. thaliana. Given a protein query, the protein BLAST program returns the most similar protein sequences from the protein database that the user specifies. The general approach is to select hierarchical taxonomic groups ascending from the focal species and, for each gene, to find the oldest taxon in which it has a homolog (Domazet-Lošo, Brajković et al. 2007). By describing the characteristics of increasingly ancient phylostrata, the path from genomic noise to mature protein is exposed (Figure 1). This helped the group understand the distribution of gene origins and brought researchers one step closer to understanding how genes develop over time.
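The phylostratum assignment described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the stratum names, the input format (one `(stratum, evalue)` pair per BLAST hit) and the function name `assign_phylostratum` are assumptions made for illustration, and the default cutoff of 1e-5 is taken from the Figure 1 caption.

```python
# Hypothetical, simplified phylostratum assignment (illustration only).
# Strata are ordered from youngest (the focal species) to oldest.
# The names below are illustrative, not the exact set used in the study.
STRATA = [
    "Arabidopsis thaliana",
    "Arabidopsis",
    "Brassicaceae",
    "Embryophyta",
    "Eukaryota",
    "cellular organisms",
]


def assign_phylostratum(hits, strata=STRATA, evalue_cutoff=1e-5):
    """Return the oldest stratum that contains a significant homolog.

    hits: iterable of (stratum_name, evalue) pairs for one query gene,
          where stratum_name is the clade shared by the focal species
          and the species in which the hit was found.
    A gene with no significant hit beyond the focal species is assigned
    to strata[0], i.e. it is treated as an orphan.
    """
    oldest = 0
    for stratum, evalue in hits:
        # Assumption: a hit counts when its E-value is at or below the cutoff.
        if evalue <= evalue_cutoff and stratum in strata:
            oldest = max(oldest, strata.index(stratum))
    return strata[oldest]


# A gene with no hits stays in the youngest stratum.
print(assign_phylostratum([]))
# A weak Eukaryota hit (E-value 1e-3) is ignored; the Brassicaceae hit counts.
print(assign_phylostratum([("Brassicaceae", 1e-20), ("Eukaryota", 1e-3)]))
```

In practice the hit list would come from parsed BLAST output and a taxonomy lookup; both steps are left out here to keep the sketch short.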
In 2015, Li and Wurtele used the NCBI BLink algorithm (http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?mode=query) to identify 1155 orphan genes in Arabidopsis, including 30 mitochondrial genes (Li and Wurtele 2015). Using BLink, the group also identified 839 genes that are shared only by A. thaliana and A. lyrata and are not identifiable outside the Arabidopsis genus; they referred to these as clade-specific or genus-specific genes (Li and Wurtele 2015). Orphan gene lengths were also predicted with NCBI BLink in the same study (Figure 2) (Li and Wurtele 2015). This gave the research group an opportunity to better understand gene maturation and evolution. The QQS research group also searched a patented protein sequence database for QQS-like proteins, using a partial QQS sequence, in order to discover which proteins QQS interacts with. Identifying QQS's partners would be a major advance in exploring its molecular basis and activity. To make the search more useful, they filtered the results with additional conditions, for example that the sequence must belong to a plant and must contain a certain number of amino acids (Qi, Zheng et al. 2018).
Researchers not only used this information but also enriched the databases with their own findings to help future researchers. For example, the RNA-Seq data for the SAQR-KO (saqr) line SALK_052233C were deposited in the NCBI Sequence Read Archive (accession number: SRP072428) (Jones, Zheng et al. 2016). The QQS-OE and QQS RNAi RNA-Seq data were deposited in the NCBI Sequence Read Archive (accession number: SRP072425) (Qi, Zheng et al. 2018) (https://www.ncbi.nlm.nih.gov/sra/).
Figure 1: Age stratification of all genes in Arabidopsis thaliana. Each gene is assigned to the oldest clade (or ‘phylostratum’) that contains a homolog, as inferred by a protein BLAST of each A. thaliana gene against a selected set of genomes with an E-value threshold of 10⁻⁵ (Arendsee, Li et al. 2014).
Figure 2: Distribution of orphan protein lengths in Arabidopsis. The median length of the predicted protein models from Arabidopsis thaliana orphan genes (blue bar; 57 amino acids) is smaller than the median length of all A. thaliana predicted protein models (orange bar; 349 amino acids). Range of orphan protein models: 16–445 amino acids; range of all protein models: 16–5393 amino acids. Orphan gene lengths were predicted by NCBI BLink (Li and Wurtele 2015).
- Protein Data Bank (PDB): A database of protein structures. It contains nearly 7000 protein structures and over 20 million homology models. To make sure the QQS protein has no similarity to other proteins or parts of proteins, QQS was searched against the PDB databases. QQS was not found in the structural database, which further supported its identity as an orphan (Li, Foster et al. 2009). The research group also looked for evidence of interactions between QQS and NF-YC in published crystal structure data in the PDB (Qi, Zheng et al. 2018), which helped in their search for QQS's partner (http://targetdb.pdb.org).
- MetaOmGraph: MetaOmGraph is a tool for plotting and analyzing large data sets while using as little memory as possible. It is mostly used for visualizing gene expression patterns, finding functional groups of genes, and determining which genes have expression patterns most or least correlated with a gene of interest. Using MetaOmGraph, Li et al. (2009) determined QQS mRNA accumulation in many plant organs under a wide variety of nutritional, environmental and stress conditions, across different developmental stages, organs, cell and tissue types, and genetic backgrounds (Figure 3) (Li, Foster et al. 2009). To determine the molecular basis of QQS's function and its contribution to different metabolic pathways, it was important to determine where and under which conditions it is most highly expressed. MetaOmGraph was also used to compute and visualize correlations between the expression of QQS and NF-Y genes in WT and mutant lines of Arabidopsis, which were found not to be significantly correlated (Li, Zheng et al. 2015). This information helped the group understand the relationship between QQS and the NF-YC genes in more detail.
MetaOmGraph was also used to analyze the transcriptomic expression pattern of SAQR, using the normalized experimental data and metadata (gene, experiment and sample annotations) from 71 experiments comprising 956 Affymetrix ATH1 microarrays (dataset “At956-2008”) (Jones, Zheng et al. 2016).
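As a rough illustration of what finding the genes "most or least correlated" with a gene of interest involves, the following Python sketch ranks expression profiles by their Pearson correlation with a target profile. This is not MetaOmGraph code: the choice of Pearson correlation, the function names and the toy expression values are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


def rank_by_correlation(target, profiles):
    """Sort genes from most to least correlated with the target profile.

    profiles: dict mapping a gene identifier to its expression values,
              measured on the same samples as the target.
    """
    scores = {gene: pearson(target, values) for gene, values in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy values only; real profiles would span hundreds of microarray samples.
target_profile = [10.0, 20.0, 30.0, 40.0]
other_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],   # rises with the target
    "geneB": [4.0, 3.0, 2.0, 1.0],   # falls as the target rises
}
for gene, r in rank_by_correlation(target_profile, other_profiles):
    print(gene, round(r, 3))
```

A significance test on each coefficient, which the studies would need in order to call a pair "not significantly correlated", is omitted from this sketch.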
- MOTIF Search: A tool for predicting sequence motifs using online resources. Cis-acting motifs in QQS were evaluated as previously described, using Motif Search (Li, Ilarslan et al. 2007) (http://motif.genome.jp).
- EMBOSS: The European Molecular Biology Open Software Suite is a free, open-source software analysis package developed for the needs of the molecular biology user community. It automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web, and it includes extensive libraries. Arendsee et al. (2014) prepared an overview of several cross-phylostrata traits in A. thaliana genes and compared them with non-genic ORFs with the help of the getorf program from the EMBOSS toolkit (version 6.4.0.0). The overview shows a several-fold increase in protein length from species-specific genes to universally conserved genes, a trend noted in several metazoans, yeast and A. thaliana, and a significant increase in exon size (about twofold) across the first several phylostrata (Arendsee, Li et al. 2014). It also shows that percent GC content increases gradually across the phylostrata for a number of species, including A. thaliana, which is similar to the characteristics of other orphan genes (Arendsee, Li et al. 2014).
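To make the GC content comparison across phylostrata concrete, here is a minimal Python sketch that computes percent GC for a sequence and averages it per stratum. It does not use EMBOSS or getorf; the function names, the stratum labels and the toy sequences are assumptions made for illustration.

```python
def gc_percent(seq):
    """Percent of G and C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def mean_gc_by_stratum(records):
    """Average GC percent per stratum.

    records: iterable of (stratum_label, nucleotide_sequence) pairs.
    """
    values = {}
    for stratum, seq in records:
        values.setdefault(stratum, []).append(gc_percent(seq))
    return {stratum: sum(v) / len(v) for stratum, v in values.items()}


# Toy sequences only, chosen so the arithmetic is easy to follow.
toy_records = [
    ("species-specific", "ATATAT"),
    ("conserved", "GCGC"),
    ("conserved", "ATGC"),
]
print(mean_gc_by_stratum(toy_records))
```

Ambiguous bases such as N are skipped rather than counted, which is one of several reasonable conventions.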
Figure 3: Accumulation of QQS RNA. Each point on the x axis represents a publicly available mRNA transcriptomics profiling data sample for a given experimental condition (Li et al., 2007). The y axis represents the normalized expression level for selected genes. The mean transcript accumulation level for each chip is normalized to a value of 100. The data were visualized using MetaOmGraph. (a) QQS (At3g30720, red line) has a complex pattern of RNA accumulation across a wide variety of experimental conditions. (b) QQS RNA accumulation is increased during pollen maturation.
- TAIR: The Arabidopsis Information Resource. TAIR maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. Data available from TAIR include the complete genome sequence along with gene structure, gene product information, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community. This database and its tools were very helpful in exploring QQS.
Li et al. (2014) predicted protein structural classes for Arabidopsis thaliana proteins using data retrieved from the TAIR website. This produced a very informative figure showing important structural features of Arabidopsis thaliana proteins according to phylostratum, a representation that provides valuable insight into the evolution of protein structure and function. The TAIR10 genome release also helped the group identify 1155 orphan genes in Arabidopsis, along with 839 genes, identified using BLink, that are unique to A. thaliana and A. lyrata (https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release).
TAIR has many reliable molecular data sets, such as AraCyc. To evaluate the origin of enzymes, the researchers searched the A. thaliana database AraCyc, one of the most complete sets of metabolic pathways for any organism. AraCyc includes genes encoding catalytic and non-catalytic subunits of enzymes of both core and specialized metabolism. The youngest gene encoding any catalytic subunit appears in the ancient Embryophyta phylostratum, and the vast majority can be traced back to a common origin in cellular organisms. This supported their hypothesis that catalytic proteins such as enzymes take longer to evolve (release 11.5, https://www.arabidopsis.org/biocyc/).
TAIR also has a seed and germplasm resource database that provides researchers with diverse seed and other stocks of Arabidopsis thaliana and related species. The facility is known as the ABRC (Arabidopsis Biological Resource Center). Li et al. acquired various seeds and germplasm from the ABRC for their experimental trials. For example, they used the NF-YC4 knockout line SALK_032163 and the homozygous ABRC QQS knockout line CS907367 to investigate QQS's interaction with NF-YC4 (Li, Zheng et al. 2015). The QQS team also used structural predictions based on a search against the SCOP (Structural Classification of Proteins) database, retrieved from the TAIR website, to examine the secondary structures of Arabidopsis thaliana proteins.
- DisProt: DisProt provides experimental evidence of protein disorder, manually collected from the literature. To better understand the structure of QQS, the team used DisProt, which indicated that the QQS protein has a disordered N-terminal tail (approximately the first 20 residues). The remainder of the QQS protein is predicted to contain two α-helices (http://www.disprot.org/predictors.php).
- Phytozome: The Plant Comparative Genomics portal of the Department of Energy’s Joint Genome Institute provides users and the broader plant science community with a hub for accessing, visualizing and analyzing JGI-sequenced plant genomes, as well as selected genomes and datasets that have been sequenced elsewhere. In the RNA-Seq analysis by Li et al., the cleaned reads were aligned to the reference genome of Arabidopsis thaliana (Li, Zheng et al. 2015). The cleaned reads were aligned in the same way for SAQR in the next study (Jones, Zheng et al. 2016) (Phytozome version 8.0; phytozome.jgi.doe.gov/pz/portal.html).
- HTSeq: HTSeq is a Python package that provides infrastructure for processing data from high-throughput sequencing assays. Although the main purpose of HTSeq is to let users write their own analysis scripts, customized to their needs, it also includes a couple of stand-alone scripts for common tasks that can be used without any Python knowledge. One of these, htseq-count, was used to count the TopHat-mapped reads (Li, Zheng et al. 2015) (www.huberembl.de/HTSeq/doc/overview.html).
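The idea behind counting mapped reads per gene can be sketched in plain Python. This toy example deliberately does not use the HTSeq API and is not how htseq-count is implemented; the function name `count_reads`, the interval format and the rule of skipping reads that overlap more than one gene are simplifying assumptions made for illustration.

```python
from collections import Counter


def count_reads(reads, genes):
    """Count reads per gene, skipping reads that are ambiguous or unassigned.

    reads: iterable of (chrom, start, end) alignments, half-open coordinates.
    genes: dict mapping gene_id -> (chrom, start, end), half-open coordinates.
    A read is counted only when it overlaps exactly one gene.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in reads:
        overlapping = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
    return dict(counts)


# Toy annotation with two overlapping genes (identifiers are hypothetical).
toy_genes = {"geneA": ("Chr3", 100, 200), "geneB": ("Chr3", 180, 300)}
toy_reads = [
    ("Chr3", 110, 150),  # overlaps geneA only
    ("Chr3", 190, 195),  # overlaps both genes, so it is skipped
    ("Chr3", 250, 280),  # overlaps geneB only
    ("Chr1", 110, 150),  # different chromosome, so it is unassigned
]
print(count_reads(toy_reads, toy_genes))
```

A real workflow would read alignments from SAM/BAM files and features from a GFF/GTF annotation, and would use an interval index rather than this linear scan.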
- ATHENA: The Analysis Tool for Heritable and Environmental Network Associations (ATHENA) is a software package that combines statistical and biological variable selection methods with machine learning modeling techniques to identify complex prediction models that can include non-linear interactions and different types of high-throughput data. Cis-acting motifs present within the SAQR promoter region upstream of the transcription start site were analyzed using Athena (Jones, Zheng et al. 2016) (https://ritchielab.org/athena-downloads).
- PlantCARE: PlantCARE is a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Cis-acting motifs present within the SAQR promoter region upstream of the transcription start site were analyzed using PlantCARE (Jones, Zheng et al. 2016) (http://bioinformatics.psb.ugent.be/webtools/plantcare/html/).
- Plant promoter db: The plant promoter database is a large database of genome sequence and annotation for selected plants. Cis-acting motifs present within the SAQR promoter region upstream of the transcription start site were analyzed using this database (Jones, Zheng et al. 2016) (http://ppdb.agr.gifu-u.ac.jp/ppdb/cgi-bin/index.cgi).
- PyMol: PyMOL is a molecular visualization system created by Warren Lyford DeLano. It is user-sponsored, open-source software, released under the Python License. PyMOL was used to render the structural models of the SAQR protein predicted by I-TASSER, with helices colored red, sheets yellow and loops green (Jones, Zheng et al. 2016) (Figure 4).
- I-TASSER: Iterative Threading Assembly Refinement is a hierarchical approach to protein structure and function prediction. It first identifies structural templates from the PDB using a multiple threading approach, and then constructs full-length atomic models by iterative template fragment assembly simulations. Structural models of the SAQR protein were predicted using I-TASSER (Jones, Zheng et al. 2016) (https://zhanglab.ccmb.med.umich.edu/I-TASSER/).
Figure 4: Senescence-Associated and QQS-Related (SAQR) gene and predicted protein. Structural models of the SAQR protein predicted using I-TASSER. Helices are colored red; sheets, yellow; loops, green. Image made using PyMol (Jones, Zheng et al. 2016).
- metadisordermd2: MetaDisorder MD2 is an online tool for predicting protein disorder. It is a meta method, which means that it tries to calculate a “consensus” from the results returned by other methods. Analysis of the SAQR protein sequence using MetaDisorder MD2 indicated that it has a largely disordered structure within two regions, between amino acids 1–29 and 71–85, a somewhat more ordered section within amino acids 43–57, and a global disorder tendency of 0.642 (Jones, Zheng et al. 2016).
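The notion of a consensus over several predictors can be illustrated with a short Python sketch. How MetaDisorder MD2 actually weights and combines its underlying methods is not described here, so the unweighted mean, the 0.5 threshold and the function names below are assumptions made purely for illustration.

```python
def consensus_scores(predictions):
    """Per-residue mean of disorder scores from several predictors.

    predictions: list of score lists, one per predictor, all of equal length,
                 with each score between 0 (ordered) and 1 (disordered).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one predictor, all of equal length")
    return [sum(column) / len(column) for column in zip(*predictions)]


def disordered_regions(scores, threshold=0.5):
    """Return 1-based inclusive (start, end) runs with score >= threshold."""
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold and start is None:
            start = position                        # a disordered run begins
        elif score < threshold and start is not None:
            regions.append((start, position - 1))   # the run ended just before
            start = None
    if start is not None:
        regions.append((start, len(scores)))        # run reaches the C-terminus
    return regions


# Toy scores for a five-residue peptide from two hypothetical predictors.
toy_predictions = [
    [0.9, 0.8, 0.2, 0.1, 0.7],
    [0.7, 0.6, 0.4, 0.3, 0.9],
]
consensus = consensus_scores(toy_predictions)
print(disordered_regions(consensus))
```

A single summary value comparable in spirit to a global disorder tendency could be taken as the mean of the consensus scores, although that too is only an assumption about how such a figure might be derived.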
- SoyBase Genome Annotation: This tool returns the complete set of SoyBase annotations, either for the entire list of JGI Williams 82 gene calls or for a user-submitted list, which can be pasted into the text box or uploaded as a text file (https://soybase.org/genomeannotation/). The research group used this tool to annotate their sequences of interest (Qi, Zheng et al. 2018).
- ArrayExpress: ArrayExpress, the Archive of Functional Genomics Data, stores data from high-throughput functional genomics experiments and provides these data for reuse to the research community. QQS transcript level data in TuMV infection regions were collected from here (Qi, Zheng et al. 2018) (https://www.ebi.ac.uk/arrayexpress/).
- Quick2D: Quick2D gives an overview of secondary structure features such as alpha-helices, extended beta-sheets, coiled coils, transmembrane helices and disordered regions. As part of the analysis of the QQS and NF-YC interaction, the secondary structure of QQS was predicted with Quick2D using various prediction methods (Qi, Zheng et al. 2018) (https://toolkit.tuebingen.mpg.de/#/tools/quick2d).
- CLUSTALW: A multiple sequence alignment program. The protein sequences of Arabidopsis NF-YC4, human NF-YC (NP_055038.2) and its N-terminal fragment were aligned with ClustalW and presented using Boxshade (v3.21) software (Qi, Zheng et al. 2018) (https://www.genome.jp/tools-bin/clustalw).
- BOXSHADE: BOXSHADE is in the public domain and available from SourceForge. It takes a multiple-alignment file in either GCG’s MSF format or Clustal ALN format and can create output in several formats. The ClustalW alignment of the protein sequences of Arabidopsis NF-YC4, human NF-YC and its N-terminal fragment was presented using Boxshade (v3.21) software (Qi, Zheng et al. 2018) (http://sourceforge.net/projects/boxshade/).
- BioGRID: BioGRID is an interaction repository with data compiled through comprehensive curation efforts. All data are freely provided via its search index and are available for download in standardized formats. To narrow down the binding region within QQS-41-59, the non-binding C-terminal peptide QQS-50-59 was trimmed off to form the peptide QQS-41-50, which the authors propose physically interacts with AtNF-YC4 and HsNF-YC. The AtNF-YC4 interactions were curated by BioGRID (Qi, Zheng et al. 2018) (https://thebiogrid.org/).
- PIC: Protein Interactions Calculator (PIC) is a server that recognizes various kinds of interactions, such as disulphide bonds, hydrophobic interactions, ionic interactions, hydrogen bonds, aromatic-aromatic interactions, aromatic-sulphur interactions and cation-π interactions, within a protein or between proteins in a complex. It also determines the accessible surface area as well as the distance of a residue from the surface of the protein. The interaction of two NF-YB N-terminal regions, NF-YB-51-57 and NF-YB-62-, with NF-YC and NF-YA was calculated using Protein Interactions Calculator (http://pic.mbu.iisc.ernet.in).
- MetNet: The MetNet database (MetNetDB) contains information on networks of metabolic and regulatory interactions in Arabidopsis. This information is based on input from biologists in their areas of expertise. MetNet Online provided the genes with significant changes in the QQS-OE and QQS RNAi mutants: the transcripts that varied significantly in QQS-OE plants compared with the WT control plants, the transcripts that varied significantly in QQS RNAi plants compared with the WT control plants, and the pathways over-represented among the transcripts that varied significantly in QQS-OE plants.
- ImageJ: ImageJ is a Java-based image processing program developed at the National Institutes of Health and the Laboratory for Optical and Computational Instrumentation (LOCI, University of Wisconsin). Each photographed GFP focal area was processed with the ImageJ measure tool and calibrated against the correct scaling (Qi, Zheng et al. 2018). (http://imagej.nih.gov/ij/)
These are the bioinformatics tools and databases that were used most frequently in the seven research papers discussed above. Most of the data are already available through online resources provided by previous researchers, and most of the software used is free. It is hard to imagine this much progress in orphan gene research without these tools and data resources.
References:
Arendsee, Z. W., et al. (2014). “Coming of age: orphan genes in plants.” Trends in Plant Science 19(11): 698-708.
Domazet-Lošo, T., et al. (2007). “A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages.” Trends in Genetics 23(11): 533-539.
Jones, D. C., et al. (2016). “A clade-specific Arabidopsis gene connects primary metabolism and senescence.” Frontiers in Plant Science 7: 983.
Li, L., et al. (2009). “Identification of the novel protein QQS as a component of the starch metabolic network in Arabidopsis leaves.” The Plant Journal 58(3): 485-498.
Li, L., et al. (2007). “Genome wide co-expression among the starch debranching enzyme genes AtISA1, AtISA2, and AtISA3 in Arabidopsis thaliana.” Journal of Experimental Botany 58(12): 3323-3342.
Li, L. and E. S. Wurtele (2015). “The QQS orphan gene of Arabidopsis modulates carbon and nitrogen allocation in soybean.” Plant Biotechnology Journal 13(2): 177-187.
Li, L., et al. (2015). “QQS orphan gene regulates carbon and nitrogen partitioning across species via NF-YC interactions.” Proceedings of the National Academy of Sciences 112(47): 14734-14739.
O’Conner, S., et al. (2018). From Arabidopsis to crops: the Arabidopsis QQS orphan gene modulates nitrogen allocation across species. Engineering Nitrogen Utilization in Crop Plants, Springer: 95-117.
Qi, M., et al. (2018). “QQS orphan gene and its interactor NF-YC4 reduce susceptibility to pathogens and pests.” Plant Biotechnology Journal.
Toll-Riera, M., et al. (2009). “Origin of Primate Orphan Genes: A Comparative Genomics Approach.” Molecular Biology and Evolution 26(3): 603-612.