ASE Using DNA-seq and RNA-seq Data
- ASE using DNA-seq and RNA-seq data
- Introduction
One of the major steps in determining ASE requires identification of heterozygous biallelic variant sites. From a NGS perspective, when reads generated by a sequencer are aligned to a reference genome, ‘variant calling’ is performed to identify the biallelic variant sites and ‘genotype calling’ determines the genotype or if the studied organism is heterozygous at that site or not. As mentioned in the previous chapter, RNA-seq variant calling is an added advantage that can be achieved in addition to expression analysis and identification/quantification of transcript isoforms. Unfortunately, RNA-seq variant calling will have caveats such as alignment errors due to splicing, increased error rate due to reverse transcription, failure to call variants and infer genotypes in lowly expressed genes and incorrectly inferring RNA-editing sites as genomic variation (1).
In the previous chapter, we have aligned RNA-seq reads to a reference genome to identify genomic variants. RNA-seq short reads are derived from mature-RNA (in case of mRNA-seq) which do not have introns on their sequence. A read generated from an mRNA-seq experiment will span two exons, while the reference genome will have an exon, followed by an intron leading to splitting of a single read (popularly known as ‘split-reads’ formed due to a splice junction). This would need an aligner to deal with issues such as mapping ambiguity due to short length of reads and sequencing errors leading to misaligned split reads. For this reason, specific aligners known as ‘splice-aware’ aligners are used and many of such aligners follow different strategies to identify splice junctions and also alternative splicing that enables a gene to code for multiple proteins through inclusion or exclusion of exons (2-4). It was also observed that technical error from reverse transcription and sequencing, and biological error caused from RNA polymerase look undistinguishable, challenging the identification of variants (5). RNA-seq read distribution on a reference genome (when aligned to one) is non-uniform in nature which means that reads will be concentrated on highly expressed genes than genes which have low or no expression. Enough depth is required at a particular locus in order to call genotype and variant confidently and lowly expressed genes or even unexpressed genes defy this (6). On the other hand, reads are distributed uniformly across the whole genome in DNA-seq and is helpful in calling genotypes as well as variants from any locus. RNA-editing refers to a post-transcriptional modification that makes a mature RNA sequence differ from its template DNA sequence by base insertion, deletions or substation leading to RNA mutations and altering gene regulation (7, 8). This means that when only RNA-seq is used for detecting variations after alignment to the reference genome, any change from the genomic DNA will be misinterpreted as a single nucleotide variation or SNV. Having the genomic sequence of the same organism under study will aid in identification and correct interpretation of such events which have been previously done in cow and humans (9-12).
This chapter deals with the determination of ASE by identifying and genotyping variants using DNA-seq data. The variant calling workflow has been ‘adapted from’ the GATK (The Genome Analysis Toolkit) (13) best practices workflow for Germline single nucleotide polymorphisms (SNPs) + Indels (Insertions or deletions) for our ‘non-human’ organism as mentioned in (14).
- Methods
- WGS Raw data description
As seen in table 1, the raw data consisted of Whole Genome sequencing (WGS) data from 81 water buffaloes from 7 breeds. Out of 7, 6 breeds are of Indian origin. Paired-end sequencing of the animals were done in two sequencing centres- Edinburgh Genomics, UK and SciGenom (India) in different depths-30x and 10x, respectively. The number of animals sequenced at 10x and 30x depth is mentioned in table 1. Edinburgh genomics, UK sequenced the animals using Illumina HiSeq X machine (read length: 150) whereas SciGenom used Illumina Illumina HiSeq 2500 machine (read length: 250). There was no particular reason for using different sequencing technologies for different animals, except from what was available during the time when the samples were ready for sequencing as they were sequenced in different times. Due to this reason, batch effect (15) could have hampered downstream analysis and its presence was checked as described further ahead in this chapter. The six Mediterranean breed animals consists of two animals from Lodi (Italy), and four animals from the same farm in Kirkcaldy, town in Fife, on the east coast of Scotland, UK. The Indian breed samples were collected from their respective breeding tracts as mentioned in table 2.
Breed | No. of animals | Sequencing Centre | Depth |
Italian (Mediterranean) | 6 | Edinburgh Genomics | 30x |
Jaffrabadi | 13 | SciGenom (India) and
Edinburgh Genomics |
6 animals-10x
7 animals-30x |
Pandharpuri | 13 | SciGenom (India) and
Edinburgh Genomics |
6 animals-10x
7 animals-30x |
Banni | 12 | SciGenom (India) and
Edinburgh Genomics |
6 animals-10x
6 animals-30x |
Surti | 12 | SciGenom (India) and
Edinburgh Genomics |
6 animals-10x
6 animals-30x |
Murrah | 12 | SciGenom (India) and
Edinburgh Genomics |
6 animals-10x
6 animals-30x |
Bhadawari | 13 | SciGenom (India) and
Edinburgh Genomics |
6 animals-10x
7 animals-30x |
7 BREEDS | 81 ANIMALS | 2 SEQUENCING CENTRES | 2 DEPTHS |
` Table 1 WGS data samples summary
Breed | Breeding Tract in India | Link |
Banni | Banni area of Kutchchh district of Gujarat | http://dairyknowledge.in/article/banni |
Bhadawari | Bhind and Morena districts of Madhya Pradesh and Agra and Etawah districts of Uttar Pradesh | http://dairyknowledge.in/article/bhadawari |
Jaffrabadi | Amreli, Bhavnagar, Jamnagar, Junagadh, Porbandar and Rajkot districts of Gujarat state | http://dairyknowledge.in/article/jaffarabadi |
Murrah | Hisar, Rohtak, Gurgaon and Jind district of Haryana and Delhi | http://dairyknowledge.in/article/murrah |
Pandharpuri | Solapur, Sangli and Kolhapur districts of Maharashtra | http://dairyknowledge.in/article/pandharpuri |
Surti | Vadodara, Bharuch, Kheda and Surat districts of Gujarat | http://dairyknowledge.in/article/surti |
Table 2 Breeding tract information of six Indian water buffalo breeds. Courtesy: Information System on Animal Genetic Resources of India (AGRI-IS) – developed at National Bureau of Animal genetic Resources, Karnal, Haryana, India
- Read alignment to the new water buffalo reference genome and pre-processing before variant calling
Raw paired-end reads from all 81 samples were aligned using BWA-MEM program (v0.1.17) (16) to a newly assembled water buffalo reference genome. In brief, the BWA-MEM algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending those seeds using Smith-Waterman alignment. The newly assembled reference genome was released by NCBI on 14th May 2018. We were fortunate to get access to the genome before it was released officially in NCBI RefSeq (17) FTP website (18). NCBI completed and released its annotation on 25th June 2018 (19). All analysis has been done using the unofficial copy of the currently available reference genome.
Statistic | Old water buffalo genome | New water buffalo genome |
Assembly name | UMD_CASPUR_WB_2.0 | UOA_WB_1 |
Submitter | University of Maryland | University of Adelaide |
Date | 9th Sep 2013 | 14th May 2018 |
Assembly level | Scaffold | Chromosome |
Assembly method | MaSuRCA v. 1.8.3 | Falcon-Unzip v. 1.8.7 |
Genome coverage | 70.0x | 69.0x |
Sequencing technology | Illumina GAIIx; Illumina HiSeq; 454 | PacBio |
Total scaffolds | 366982 | 509 |
Contig-N50 | 21938 | 22441509 |
Total-length | 2836150610 | 2655780776 |
Total-gap-length | 74388041 | 373500 |
Table 3 Difference between old water buffalo draft assembly and improved water buffalo new assembly (Data from NCBI RefSeq database)
The new assembly was assembled using reads generated using PacBio long read sequencing technology that gave it better contiguity than the old draft assembly. It is a chromosome level assembly as compared to the old scaffold level draft assembly. Both the assemblies are from the same animal i.e. a female Mediterranean water buffalo named ‘Olympia’. Table 3 describes the differences between the two assemblies. It can be clearly seen that the contig-N50 is much better in the new assembly. It is also observed that the total-length of the new assembly (2.65 Gb) is less than the old assembly (2.83 Gb) because the old assembly had too many gaps (total-gap-length=74388041) and ‘Total-length’ mentioned in the table is the total sequence length including bases and gaps that explains the larger length of the old assembly.
The new assembly has a total of 509 scaffolds, out of which there are 24 autosomes and 1 sex chromosome (X) and rest 484 are unplaced scaffolds. This is better than cow GCF_002263795.1_ARS-UCD1.2 assembly (2211 scaffolds) and human GCF_000001405.38_GRCh38.p12 assembly (874 scaffolds).
As mentioned before, we used the genome assembly (shared with us by Dr. Lloyd Low from University of Adelaide; personal communication) before it was online in NCBI RefSeq FTP website. The unofficial version of the assembly is extremely similar to the one that is currently present in NCBI RefSeq. Figure 1 depicts this point in a dotplot made by performing a pairwise alignment using minimap2 (20) and then using the ‘minidot’ program from miniasm (21). We observed a straight line throughout with only few differences in the top-right corner. The differences are due to the following reasons:
a) hic5004_arrow (unplaced scaffold) was removed by RefSeq as a contaminant
b) Mitochondrial genome (MT) was added by RefSeq (RefSeq-Accn: NC_006295.1) that belongs to a swamp buffalo from China (https://www.ncbi.nlm.nih.gov/nuccore/52220982)
c) NCBI annotation assigned a unique RefSeq accession id to all the chromosomes. Since the original assembly was used in our analysis (that was submitted to NCBI) before annotation, the chromosomes have been assigned numbers: 1,2,3,4 and so on. An already developed mapping file can be easily used to update the chromosome IDs to RefSeq assigned IDs if needed.
Figure 1 Dotplot between unofficial version and NCBI RefSeq version of water buffalo genome assembly created using minimap2 and miniasm. The straight line signifies that there are no differences in any contig after pairwise alignment between the two assemblies with only slight difference in the top- right corner
Coming back to the alignment process, the reference genome was first indexed using BWA index command (v0.1.17) followed by the actual alignment process using BWA-MEM v0.1.17 using the paired-end reads of each sample. For each sample, the BWA-MEM generated Sequence Alignment Map (SAM) output that was converted to Binary Alignment Map (BAM) output using Samtools view v1.6 with -b parameter. A total of 81 BAM files were generated from 81 samples. On completion of the alignment process, each BAM file was coordinate sorted using Samtools sort v1.6 with default parameters. Duplicates were marked using Picard MarkDuplicates v2.14.0 with parameters ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 TMP_DIR=<temp_directory> METRICS_FILE=<metrics_filename>. In brief, the tool locates and marks/tags duplicates in a BAM file (marks duplicate reads with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024) which allows downstream GATK tools to exclude duplicates from analysis as they have the probability of being non-independent measurements from the exact same template DNA. Then, each de-duped (or de-duplicated) BAM file was subjected to Picard AddOrReplaceReadGroups v2.14.0 with parameters RGLB=library RGPL=illumina RGPU=barcode RGSM=<sample_name> CREATE_INDEX=true for adding ‘read groups’ which refers to a set of reads that were generated from a single run of a sequencing instrument and is necessary for all downstream GATK tools. As seen from the parameters, the BAM index was generated on the fly during the tool execution. After performing these pre-processing steps, 81 sorted, de-duped and read group added and indexed BAM files were obtained that were used for variant calling further.
- Variant calling
For calling variants, GATK tool called Haplotypecaller v4.0.4.0 (22) was run per sample in GVCF (Genomic VCF) mode with parameter -ERC GVCF. Haplotypecaller utilizes reads from BAM/SAM files and performs variant calling per sample to produce unfiltered genotype likelihoods (22).
Briefly explaining from (22), in the default mode when only single sample is present, haplotypecaller determines ‘active regions’ in the genome using BWA-MEM aligned reads. Active regions are those regions that significantly differ from the reference and have base mismatches/Indels/soft-clipped bases. The reads from the each active region are split into smaller overlapping subsequences (called k-mers) and are reassembled into candidate haplotypes. The program then realigns each haplotype against the reference using the Smith-Waterman algorithm in order to identify potential variant sites. Then, using pairHMM algorithm, the program determines the likelihood for each combination of candidate haplotype and read data which are then used to obtain the likelihoods of alleles per read for each potential variant site. Then, finally, for each potential variant site, per-read haplotype likelihood output from pairHMM algorithm is used to calculate raw genotype likelihoods using Bayes’ rule.
On the other hand, the GVCF mode follows the reference confidence model which gives us the likelihood of all possible genotypes at all sites (including non-variant sites) for each sample. Then, joint genotype calling is performed for all samples in a cohort. Joint calling is always beneficial as compared to single sample calling because through joint calling it is possible to ‘rescue’ genotype calls where one sample has low coverage but other samples within the cohort have a confident variant at that locus.
Haplotypecaller produced 81 GVCFs in total for our cohort. Before joint genotyping, GenomicsDBImport v4.0.4.0 was used to aggregate the GVCF files from all 81 samples. It takes one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. In v4.0.4.0 of GenomicsDBImport, there was a caveat that it could be only run on a single genomic interval (i.e. max one contig) at a time (caveat removed in v4.0.6.0). Hence, it was run for 509 scaffolds separately that created 509 separate GenomicsDB. Then, GenotypeGVCFs v4.0.4.0 with -new-qual parameter was used to read from the created GenomicsDBs directly and output the final multi-sample VCF file per scaffold. GenotypeGVCFs performs joint genotype calling and assigns a quality score (QUAL) to each variant and removes low quality variants (22). Finally, Picard GatherVcfs v2.14.0 was used to concatenate variants called per chromosome/unplaced scaffold to get the final multisample VCF file for all scaffolds together.
Figure 2 Transition vs Transversion (Ti/Tv ratio) is calculated by dividing number of transition SNPs with number of Transversion SNPs. Transition is interchanging of nucleotides having same number of rings (Ex. A↔G or C↔T) and Transversion is interchange between one-ring and two-ring nucleotides (Ex. A↔T or G↔C or A↔C or G↔T). Image courtesy: “Transitions-transversions-v3” By Petulda [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons
- Variant Filtration
Sources of errors in NGS technologies can lead to discovery of false- positives in variant and genotype calling (23). False-positives may arise due to base calling errors or may creep in during read alignment (24). As per the GATK ‘best practices’ workflow for SNPs and Indels (14) , the raw variant callset should pass through, what can be called as ‘soft-filtering’ pipeline, performing variant quality score recalibration (VQSR) (25). Based on a ‘truth-set’ of known and validated good quality, true-positive variants (from sources such as HapMap (26), dbSNP (27), etc.), the pipeline analyses/studies variant annotations (properties or statistics that describe for each variant) of variants in the raw variant callset. Based on the analysis, it develops a machine learning model by learning which annotations describe good variants and which annotations describe the bad ones. The rules are then applied to the whole variant callset and a single score is assigned to each variant which is the log odds of being a true variant versus being false under the model.
Unfortunately, the VQSR pipeline requires the ‘truth-set’ which is absent for our organism. Hence, we opted ‘hard-filtering’ approach and have used it with a pinch of salt. In the hard-filtering approach, the distribution of each variant annotation is checked individually across the variant data and a threshold is chosen. Variants are filtered out above or below this threshold (based on the type of annotation). From this approach, if one of the annotations looks bad, we may filter out a good variant or some bad variants may remain if their annotations match with those of the good ones (28).
Transition vs Transversion (Ti/Tv ratio) can be a quality indicator of variation data produced from NGS experiments (29). As shown in figure 2, transitions are interchange of purines and pyrimidines whereas transversions are change from a purine to a pyrimidine and vice versa. Purines (Adenine or A; Guanine or G) consists of nucleotides with two-ring structure whereas pyrimidines (Cytosine or C; Thymine or T) consists of nucleotides with one-ring structure. Transition mutations are more common than transversion mutations because in the later, interchange between one-ring and two-ring systems require more energy than the former (29). Also, transitions are less likely to produce a difference in the amino acid sequence leading to a non-synonymous change compared to transversions (30). Another reason why transitions occur is because of cytosine methylation, the most common form of modification in eukaryotic DNA (31). When C is adjacent to G in CpG sites (5’—C—phosphate—G—3′), C can easily get methylated to form 5-Methylcytosine which in turn leads to spontaneous deamination turning it into a T (32). The Ti/Tv ratio differs between genomic regions and it can also be used to distinguish between exonic and non-exonic regions. For example, table 4 compiles respective Ti/Tv ratios calculated for five major genomic regional categories in humans as examined in (29). Higher value of Ti/Tv ratio in exons is due to the presence of higher number of methylated cytosine in CpG sites within exons (33). Previous studies have also found that for WGS data in humans, the Ti/Tv ratio should be around approx. 2.0 – 2.1 and a ratio lower than that can indicate poor quality data (34). Ti/Tv ratio for WGS data of Bos Taurus (cow) was also found to be in the same range as those in humans by Birgit et al (35). Although, the Ti/Tv range for WGS data will differ from species to species as found in (36).
Genomic Region | Ti/Tv ratio |
Exon | 3 |
Intron | 2.2 |
Intergenic | 2.06 |
lncRNA | 2.06 |
miRNA | 2.59-2.95 |
Table 4 Ti/Tv ratios showing different values for different genomic regions in humans
In our variant filtration analysis step, we have followed the GATK recommended six annotations that seem to be highly informative and robust (28). They are: QualByDepth (QD), FisherStrand (FS), StrandOddsRatio (SOR), RMSMappingQuality (MQ), MappingQualityRankSumTest (MQRankSum) and ReadPosRankSumTest (ReadPosRankSum). QD is basically the QUAL score (phred-scaled quality score for the assertion made in ALT, the variant confidence score) divided by the allele depth of non-hom-ref samples. This is used instead of the QUAL score because locus is deep coverage will have an artificially inflated QUAL score. FS is the phred scaled probability that a locus has strand bias. In paired-end sequencing, a real variant should be present in both forward and reverse reads. Strand bias is a type of sequencing bias in which one DNA strand is favoured over the other. The higher the FS value, the more likely there is to be bias. MQ is root mean square mapping quality over all the reads at the site. Including standard deviation allows the inclusion of variation in the dataset because reads of different mapping qualities may pile up at a site calling a variant. SOR is another metric to evaluate strand bias in the data but is considered as an updated form of FS which performs better in high coverage situations. ReadPosRankSum is a score from a rank sum test which is calculated by comparing the positions of reference or alternate allele within the reads. If an allele is present only at the end of the read, it may be an error as the sequencer produces erroneous base calls at the end. A negative value indicates the excess presence of alternate allele at the end of the reads than the reference allele where and it is vice versa for a positive value. The final metric is MQRankSum which is a score that is calculated from a rank sum test by comparing the mapping qualities of the reads supporting the reference allele with those supporting the alternate allele. A positive value suggests that the alternate allele reads are greater than reference allele reads whereas a negative value suggests that mapping qualities of reference allele reads are greater than alternate allele reads.
As we are only interested in biallelic sites for ASE studies, the final multisample VCF file was filtered using BCFtools view v1.6 with parameters -v snps -m2 -M2. The filtered biallelic SNPs were then filtered to remove possible false positives.
- Quantifying ASE (Again using MBASED, but now after removing mapping bias)
- Checking for any protein coding variation using SnpEff results in the four animals whose RNA-seq result is also present
- Results
- Alignment statistics
- Variant Calling statistics
- Difference between raw variants called by first and second approach (bcftools mpileup vs GATK)
- Variant annotation statistics (Snpeff results)
- Variant Filtration results
- Macrophage expressed genes ASE results
- Difference between the two ASE results
- If ASE is significantly more in macrophage expressed genes than other genes
- Exploring genes that show ASE, their role in TB and brucellosis, and if they show any imprinting (can see if they have been reported to be imprinted in other ruminants and humans)
- Protein coding variation results
Why ran-seq can’t be used to call genotypes
Why dna-seq is good for calling genotypes
Other sequencing protocols for calling genotypes (methyle-seq)
For detecting imprinting, dna-seq is important
Why adapters should not be removed by trimming before variant calling..
https://gatkforums.broadinstitute.org/gatk/discussion/11667/quality-trimming
During the first phase of alignment, we had access to the draft assembly of the water buffalo genome (GCA_000471725.1_UMD_CASPUR_WB_2.0) whose description has been already provided in the previous chapter. All the samples were aligned to this genome using default parameters of BWA-MEM.
1. Xu C. A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Computational and Structural Biotechnology Journal. 2018;16:15-24.
2. Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, The RC, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods. 2013;10:1185.
3. Williams AG, Thomas S, Wyman SK, Holloway AK. RNA-seq Data: Challenges in and Recommendations for Experimental Design and Analysis. Current protocols in human genetics / editorial board, Jonathan L Haines [et al]. 2014;83:11.3.1-.3.20.
4. Mapleson D, Venturini L, Kaithakottil G, Swarbreck D. Efficient and accurate detection of splice junctions from RNAseq with Portcullis. bioRxiv. 2017.
5. Carey LB. RNA polymerase errors cause splicing defects and can be regulated by differential expression of RNA polymerase subunits. eLife. 2015;4:e09945.
6. Muzzey D, Evans EA, Lieber C. Understanding the Basics of NGS: From Mechanism to Variant Calling. Current Genetic Medicine Reports. 2015;3(4):158-65.
7. Brennicke A, Marchfelder A, Binder S. RNA editing. FEMS microbiology reviews. 1999;23(3):297-316.
8. Ouyang Z, Liu F, Zhao C, Ren C, An G, Mei C, et al. Accurate identification of RNA editing sites from primitive sequence with deep neural networks. Scientific Reports. 2018;8(1):6005.
9. Bakhtiarizadeh MR, Salehi A, Rivera RM. Genome-wide identification and analysis of A-to-I RNA editing events in bovine by transcriptome sequencing. PLOS ONE. 2018;13(2):e0193316.
10. Bahn JH, Lee JH, Li G, Greer C, Peng G, Xiao X. Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome research. 2012;22(1):142-50.
11. Eisenberg E, Adamsky K, Cohen L, Amariglio N, Hirshberg A, Rechavi G, et al. Identification of RNA editing sites in the SNP database. Nucleic Acids Research. 2005;33(14):4612-7.
12. Park E, Guo J, Shen S, Demirdjian L, Wu YN, Lin L, et al. Population and allelic variation of A-to-I RNA editing in human transcriptomes. Genome Biology. 2017;18(1):143.
13. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research. 2010;20(9):1297-303.
14. Germline short variant discovery (SNPs + Indels) Best Practices Workflows [25/09/2018]. Available from: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145.
15. Tom JA, Reeder J, Forrest WF, Graham RR, Hunkapiller J, Behrens TW, et al. Identifying and mitigating batch effects in whole genome sequencing data. BMC bioinformatics. 2017;18(1):351.
16. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]2013.
17. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research. 2007;35(Database issue):D61-D5.
18. Water Buffalo NCBI RefSeq assembly [15/09/2018]. Available from: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Bubalus_bubalis/latest_assembly_versions/GCF_003121395.1_UOA_WB_1.
19. NCBI Bubalus bubalis Annotation Release Date [15/09/2018]. Available from: ftp://ftp.ncbi.nih.gov/genomes/Bubalus_bubalis/README_CURRENT_RELEASE.
20. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018:bty191-bty.
21. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103-10.
22. Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2018.
23. Ribeiro A, Golicz A, Hackett CA, Milne I, Stephen G, Marshall D, et al. An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome. BMC bioinformatics. 2015;16:382.
24. Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nature reviews Genetics. 2011;12(6):443-51.
25. Variant Quality Score Recalibration (VQSR) [16/09/2018]. Available from: https://software.broadinstitute.org/gatk/documentation/article?id=11084.
26. The International HapMap Project. Nature. 2003;426(6968):789-96.
27. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308-11.
28. Hard-filtering germline short variants [16/09/2018]. Available from: https://software.broadinstitute.org/gatk/documentation/article?id=11069.
29. Wang J, Raskin L, Samuels DC, Shyr Y, Guo Y. Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics. 2015;31(3):318-23.
30. Guo C, McDowell IC, Nodzenski M, Scholtens DM, Allen AS, Lowe WL, et al. Transversions have larger regulatory effects than transitions. BMC Genomics. 2017;18(1):394.
31. Cooper DN, Gerber-Huber S. DNA methylation and CpG suppression. Cell Differentiation. 1985;17(3):199-205.
32. Coulondre C, Miller JH, Farabaugh PJ, Gilbert W. Molecular basis of base substitution hotspots in Escherichia coli. Nature. 1978;274:775.
33. Hodges E, Smith AD, Kendall J, Xuan Z, Ravi K, Rooks M, et al. High definition profiling of mammalian DNA methylation by array capture and single molecule bisulfite sequencing. Genome research. 2009;19(9):1593-605.
34. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 2011;43:491.
35. Baes CF, Dolezal MA, Koltes JE, Bapst B, Fritz-Waters E, Jansen S, et al. Evaluation of variant identification methods for whole genome sequencing data in dairy cattle. BMC Genomics. 2014;15(1):948.
36. Keller I, Bensasson D, Nichols RA. Transition-Transversion Bias Is Not Universal: A Counter Example from Grasshopper Pseudogenes. PLOS Genetics. 2007;3(2):e22.