Chapter 5: Genome Biology in Tribolium castaneum
/ 5.1 Motivations for research
/ 5.2 Feasibility assay
/ 5.3 DamID: Attempt 1
/ 5.4 DamID: Attempt 2
/ 5.5 DamID: Attempt 3
/ 5.7 Discussion of results
5.1 Motivations for research
Next generation sequencing (NGS) technologies have precipitated a revolution in biological research (Schuster, 2008), enabling the sequencing of entire genomes in a matter of days (Schendure & Ji, 2008; Graveley, 2008). These developments have hastened novel research methods into genomic interactions (Mardis, 2008), bringing about the emergence of entirely new fields of study in genetics (Shendure & Ji, 2008; Mardis, 2008; Morozova & Marra, 2008). One particularly exiting field is the study of genome-wide interactions of transcription factors (TFs), which collectively have exposed a vast complexity in the regulatory network of the genome (Mardis, 2008; Celniker et al., 2009; Conaway, 2012; Dunham et al., 2012). For example, the modENCODE project aims to elucidate the functional and regulatory elements in the model species Drosophila melanogaster and Caenorhabditis elegans, and provide a publicly-accessible and comprehensive encyclopaedia for this data (Celniker et al., 2009). The vast majority of these studies utilise chromatin immunoprecipitation (ChIP) followed by hybridisation to a tiling array (Buck & Lieb, 2004) or high-throughput NGS (Johnson et al., 2007) (ChIP-on-chip, or ChIP-seq, respectively). ChIP involves cross-linking the protein of interest with DNA in vivo, fragmenting chromatin via sonication, and enriching bound fragments using a highly specific antibody, thereby enabling the targeted retrieval of DNA bound by the protein (Aparicio et al., 2005). At the time of writing, modENCODE has 343 entries of ChIP studies mapping chromatin binding sites of various TFs within the Drosophila melanogaster genome, vastly augmenting our understanding of the regulatory networks contained within the genome.
Another independent method of studying TF binding in vivo is dam identification (DamID). DamID aims to achieve a similar result to ChIP, albeit by using a very different method. The enzyme DNA-adenine-methyltransferase (dam) is an adenomethylase that acts only in the context of GATC motifs, and is found in natura in prokaryotes, but not in most eukaryotes (van Steensel & Henikoff, 2000). A fusion between the dam protein and a TF of interest is generated and introduced as a transgene into the target species. Thus, wherever the TF binds to in the genome, the tethered methyltransferase will methylate neighbouring adenine regions, leaving an historical ‘footprint’ of TF activity. Methylated adenine regions are recovered using methylation-sensitive restriction enzymes, and enriched DNA either hybridised to a tiling array or sequenced via an NGS platform. This technique has the advantages of not requiring a specific antibody, and of capturing historical binding events in the genome, instead of the snapshot of DNA binding yielded by ChIP experiments. DamID has been performed successfully in Drosophila melanogaster (van Steensel et al., 2000; Ferrero et al., 2014; Carl & Russell, 2015; Marshall et al., 2016), mammalian cells (Vogel et al., 2007), Caenorhabditis elegans (Schuster et al., 2010), and Arabidopsis thaliana (Germann & Gaudin, 2011). (A more thorough comparison of these two techniques is discussed in Chapter 1.)
These techniques mapping genome-wide binding patterns of TFs help elucidate regulatory regions of the target species’ DNA; however interspecific comparative studies enable researchers to explore evolutionary changes at the genomic level. For example, ChIP investigations in closely related fungal species reveal significant divergence of TF binding within the genomes of 3 species of the Saccharomyces genus (Borneman et al., 2007); this effect is yet more pronounced in more evolutionary distant species of fungi (Tuch et al., 2008). Comparative studies have also been conducted between humans and mice to explore mammalian TF divergence: e.g., despite the highly conserved function of four shared TFs, in vivo mapping of binding activity in hepatocyte cells reveals extensive variation in binding site turnover between humans and mice for each TF (Odom et al., 2007).
Within invertebrates, much work has been carried out in Drosophila species (MacArthur et al., 2009; Bradley et al., 2010; He et al., 2011; Paris et al., 2013; Villar et al., 2014; Ferrero et al., 2015; Carl & Russell, 2015). MacArthur et al. (2009) investigated the genome-wide interactions of 31 TFs involved in early embryonic patterning in Drosophila whilst simultaneously examining chromatin accessibility data. MacArthur et al. argue that chromatin accessibility, as opposed to TF specificity, is chiefly responsible for TF regulatory activity, at least within the Drosophila genome (MacArthur et al., 2009). Interspecific comparative studies within the Drosophila genus have also been conducted, with Bradley et al. (2010) mapping the genome-wide binding sites of 6 TFs: Bicoid, Hunchback, Krüppel, Giant, Knirps, and Caudal, in Drosophila melanogaster and Drosophila yakuba, using ChIP-seq. The researchers find evidence of binding conservation, gain and loss of binding, changes in binding location, and changes in binding intensity, between the two species studied (Figure 5.1.1). Moreover, He et al. (2011) investigated the binding of the TF Twist in 6 Drosophila species, showing that binding conservation recapitulates evolutionary distances, with the most closely related species exhibiting greater conservation of binding intervals than the least closely related species.
The genomic activity of the SoxB proteins in Drosophila has also been extensively studied in previous investigations by former members of this lab group (Aleksic et al., 2013; Ferrero et al, 2014; Carl & Russell 2015). For instance Ferrero et al. (2014) have identified areas of common binding for Dm’Dichaete and Dm’SoxNeuro in Drosophila melanogaster using DamID (Figure 5.1.2), and Carl & Russell (2015) find widespread examples of binding site turnover between 4 Drosophila species for Dichaete, which correlate with phylogenetic distances. The binding motifs identified across the 4 Drosophila species for Dichaete, and 2 Drosophila species for SoxNeuro, are also evolutionarily conserved (Figure 5.1.3). Moreover, sites commonly bound by Dichaete and SoxNeuro exhibit the strongest binding site conservation, implying that despite the redundancy of these genes, selection pressures have maintained the ability of these two proteins to bind at the same loci (Carl & Russell, 2015). The redundancy of these two proteins is demonstrated through an elegant experiment by Ferrero et al. (2014), whereby the genome-wide activities of each protein exhibit evidence of functional compensation, de novo binding, and loss-of-binding events in flies mutant for the orthologous gene. That is, in SoxNeuro mutant flies, Dichaete binding is shown to be stronger, novel, or absent at various loci when compared to wildtype flies. The same is true for SoxNeuro binding in Dichaete mutants (Figure 5.1.4). This extraordinary evidence for compensation, redundancy, and dependency builds upon previous research demonstrating their functional redundancy in embryonic development (Buescher et al., 2002; Overton et al., 2002)
The purpose of this investigation was twofold. First, I wished to establish Tribolium castaneum as a model organism for genomics research. The genome of T. castaneum shares significant homology with D. melanogaster with a similar genome size and many orthologous regions (Richards et al., 2008) (Figure 5.1.5), however at present, there have been no genome-wide TF binding studies conducted in Tribolium embryos, despite proposals for such experiments first appearing 9 years ago (Roth & Hartenstein, 2008).ChIP-seq has been carried out in Tribolium larvae on a trans viral histone, CpBV-H4, which is endogenous to the parasitoid wasp Cotesia plutellae (Hepat et al., 2013). The researchers introduced CpBV-H4 to late-stage larvae of T. castaneum in order to investigate its involvement in epigenetic control of gene expression in eukaryotic organisms, exploring its effect on total transcript content via RNA-seq, and its incorporation sites in insect chromosomes. However, the ChIP-seq assay identified just 16 sites of interaction within the genome of T. castaneum, with no conserved target motif detectable (Hepat et al., 2013). ChIP-on-chip has been performed in Tribolium embryos, however this was only on a specific locus of 240kb, using a custom-made tiling array (Cande et al., 2009).
Because ChIP investigations rely on a robust and highly specific antibody (Orlando, 2000; Buck & Lieb, 2004), these experiments are less suitable to non-model or emerging model organisms where the repertoire of specific polyclonal antibodies is less complete. DamID therefore represents an attractive method to study protein interactions in such genomes as it does not rely on antibodies. However, DamID presents challenges of its own: high expression of the dam protein in eukaryotes is almost invariably toxic, as adenomethylation in eukaryotes tends to be absent, and as such tolerance is poor (van Steensel & Henikoff, 2000; Southall et al., 2013). Consequently, low, ‘leaky’ levels of expression are only tolerated in the genomes of metazoans (Vogel et al., 2007; Southall et al., 2013), and a suitable promoter must be identified that will allow sufficient expression as to methylate the host genome in detectable quantities, whilst simultaneously avoiding toxicity and saturation of methylation (Vogel et al., 2007). Cytosine methylation has been identified in both Tribolium (Feliciello et al., 2013; Song et al., 2017) and Drosophila (see Takayama et al., 2014): in each species, both symmetrical and non-CpG methylation is observable, with the methylome revealing novel and unique methylation patterns in the animal kingdom which function in contrast to the methylomes of vertebrates (Song et al, 2017). However adenine methylation has not been thus far detected in many metazoans (van Steensel & Henikoff, 2000) (there is some evidence of adenomethylation in mouse embryonic stem cells – see Wu et al. (2016)). DamID is a well-established technique in Drosophila, however to the best of my knowledge it has yet to be established in another arthropod species.
The second aim of this investigation was to examine the genome-wide activity of the Tc’Dichaete and Tc’SoxNeuro proteins in Tribolium castaneum. Unlike Drosophila melanogaster, regulatory regions within T. castaneum are not well-represented in published genome annotations. However, given the affinity of Dm’Dichaete and Dm’SoxN with intronic regions in the Drosophila genome (Ferrero et al., 2014; Carl & Russell, 2015), mapping binding sites to the genomic features of Tribolium may prove more straightforward should this affinity be conserved across species.
Within the beetle genome, I hoped to identify whether there was significant overlap observed in the binding data of Tc’Dichaete and Tc’SoxNeuro, as is observed in Drosophila. The overlapping expression patterns of the transcripts of these two genes (discussed in Chapter 4) would suggest that there are areas of common binding between the two proteins; however Tc’Dichaete’s additional expression domains (see Chapter 4.X.X) would lead one to predict unique binding with regions associated with Tribolium segmentation. Evidence implicating functional properties for each protein would be obtained by performing gene ontology enrichment on the genes associated with Tc’Dichaete and Tc’SoxNeuro binding, elucidating whether enriched regions are associated most strongly with CNS development for both proteins as they are in Drosophila. Furthermore, I wished to identify de novo target motifs for both Tc’Dichaete and Tc’SoxNeuro within the beetle genome, and draw comparisons with the conserved motifs discovered in Drosophila species (Figure 5.1.3). The DNA-binding HMG domains of Tc’Dichaete and Tc’SoxNeuro are highly conserved with those of Dm’Dichaete and Dm’SoxNeuro (89.9% and 92.4% sequence identities, respectively), which suggests high conservation in the DNA binding mechanism: one might therefore predict very similar, if not identical, target motifs would be identified. Finally, by taking a selection of the gene loci associated with the most intense binding intervals for each protein, a comparison with the binding intervals observed in Drosophila may be drawn to quantify the level of conservation/divergence across 350 million years of evolution.
Figure 5.1.1. Binding intervals of 6 common transcription factors in two species of Drosophila: D.melanogaster and D. yakuba. (A) Common binding events, (B) Unique binding events, (C) Shift in binding peak, and (D) Change in binding intensity. BCD = Bicoid; HB = Hunchback; KR = Krüppel; GT = Giant; KNI = Knirps; CAD = Caudal. (Reproduced from Bradley et al., 2010.)
Figure 5.1.4. The redundancy of Dm’Dichaete and Dm’SoxNeuro is exhibited in Drosophila melanogaster mutants. Green = Dm’Dichaete binding, light green = Dm’Dichaete binding in SoxNeuro mutants. Blue = Dm’SoxNeuro binding, light blue = Dm’SoxNeuro binding in Dichaete mutants. Instances of (A) Compensation, (B) increased binding, (C) de novo binding, and (D) loss of binding are highlighted in the red boxes. (Figure reproduced from Ferrero et al. 2014.)
5.2 Feasibility assay
I first wished to assess the feasibility of performing DamID in Tribolium castaneum. The tethered dam protein used in DamID will only methylate adenine in the context of GATC, and is thus dependent on the abundance of GATC sites in the target genome (van Steensel & Henikoff, 2000; Vogel et al., 2007). In the genome of D. melanogaster, GATC density provides sufficient resolution to map to nearby genomic features, and GATC sites have been reported to occur on average every ~200-300 base pairs (bp) (van Steensel & Henikoff, 2000). I therefore wished to calculate GATC richness in T. castaneum and the average distances between occurrences. Using an R script, I calculated the mean and median, and minimum and maximum, distances in bp between each GATC site, as well as the total number of occurrences. This R script was applied to the genomes of both D. melanogaster and T. castaneum, as well as 10 other arthropod species, in order to ascertain typical arthropod GATC occurrences (Figure 5.2.1, Table 5.2.1). For all arthropods, the mean occurrence of GATC sites is 480bp, whereas the median is 272bp. For D. melanogaster, the mean occurrence of GATC sites is 355bp, and the median 195bp; for T. castaneum, the mean is 567bp and the median 330bp. This suggests that GATC occurrence in Tribolium is less frequent on average than in Drosophila and most other arthropods. There is also a significant negative correlation (R2 = -0.699; p = 0.115) between average GC content of the genome and the mean distance between GATC sites (Figure 5.2.1B). However, the methylation activity of tethered dam proteins has been shown to act significantly up to ~2.5kb upstream or downstream from the TF binding site (van Steensel & Henikoff, 2000), and 98.3% of GATC sites in the Tribolium genome occur within 2.5kb of one another (99.35% for Drosophila, and 98.61% on average across arthropods – see Figure 5.2.1A). This suggests that GATC occurrence is more than sufficient in the genome of T. castaneum for DamID experiments.
(Out of curiosity, I also used the same R script to investigate the occurrence of the Dm’Dichaete and Dm’SoxN target motifs (ACAATG and ACAAAG, respectively) in the Tribolium genome. I found that the Dm’Dichaete motif occurs 108,280 times, with a mean distance of 1505bp separating each motif, and the Dm’SoxN motif occurs 128,445 times, with a mean distance of 1268bp separating each occurrence.)
Next, I wished to test in vivo the feasibility of using Dm’HSP70, a basal promoter endogenous to D. melanogaster, to allow low level, ‘leaky’ expression in T. castaneum. This promoter has been used successfully in DamID experiments in different drosophilid species representing ~25 million years of divergence (Carl & Russell, 2015); however beetles and flies are separated by ~350 million years. Use of this promoter would have greatly streamlined the cloning required to generate Tc’Dichaete-Dam and Tc’SoxN-Dam constructs. Schinko et al. (2010) demonstrated Dm’HSP70’s ability to drive the GAL4/UAS system (Brand, 1994) in Tribolium, albeit inconsistently. Using piggyBac constructs (Horn & Wimmer, 2000) previously generated for DamID in D. melanogaster by Dr Sarah Carl (Carl & Russell, 2015), I sought to generate transgenic lines with Dm’SoxN-Dam in T. castaneum (and a Dam-only negative control), which were under the control of this promoter and UAS, but not GAL4. This not only served as a useful pilot experiment, but I was also interested if the binding data of Drosophila SoxB was similar to that of Tribolium SoxB in vivo. I sought the assistance of Professor Gregor Bucher’s expertise in T. castaneum, and Dr Julia Ulrich from Professor Bucher’s lab assisted me in the microinjections of T. castaneum embryos, with the above piggyBac constructs. These microinjections proved unsuccessful, however; the results from this pilot study are shown in Table 5.2.2. Zero transgenic lines were obtained, and the survival rate to adulthood was extremely poor (<2% for each construct).
Professor Bucher and Dr Ulrich expressed surprise at the low survival rates (especially as Dr Ulrich had performed ~50% of the injections herself), and suggested that the promoter might be unsuitable for DamID experiments. I was therefore directed towards using another promoter, Tc’HSP68, this time endogenous to T. castaneum. This is a basal promoter that has been shown to efficiently, and consistently, drive GAL4/UAS expression in Tribolium (Schinko et al., 2010). Using the principles of Gibson Assembly (Gibson et al., 2009), I assembled the following constructs with Tribolium Dichaete and SoxN, and the basal Tribolium promoter Tc’HSP68 and UAS: pBac[3xP3-EGFP;SV40;UAS-Tc’Hsp68-Tc’Dichaete-Myc-Dam;SV40], pBac[3xP3-EGFP;SV40;UAS-Tc’Hsp68-Tc’SoxNeuro-Myc-Dam;SV40], and pBac[3xP3-EGFP;SV40;UAS-Tc’Hsp68-Dam-Myc;SV40] (see Figure 5.2.2).
These constructs were then submitted to the Tribolium Genome Editing Service (TriGenES: http://trigenes.com) who performed the embryo microinjections and screenings. The TriGenES service experienced greater success with this new promoter, however survival rates were nonetheless well below those observed with the positive control (see Table 5.2.3). I was fortunate in that transgenic lines were obtained for all three constructs: for the Dichaete-Dam construct, 2 transgenic lines were obtained; for SoxNeuro-Dam, 3 lines, and for the Dam-only negative control, just 1 transgenic line. This was significantly lower than the positive control (just 0.1% of embryos injected with the Dam-only construct produced a transgenic line, in contrast to 2.8% with the positive control), suggesting that adenomethylation is indeed very poorly tolerated in T. castaneum.
Table 5.2.1. GATC occurrence across 12 arthropod species.
|Genome||Genome Size (Mb)||GC Content||Mean (bp)||Median (bp)||Min. (bp)||Max. (bp)||% <500bp||% <2500bp|
Table 5.2.2. Tribolium castaneum microinjection table, using the Drosophila melanogaster pigygyBac constructs with Dm’HSP70.
# Transgenic offspring
% Transgenic of injected
% Transgenic of hatched
% Transgenic of adults
Table 5.2.3. Tribolium castaneum microinjection table, using the Tribolium castaneum pigygyBac constructs with Tc’HSP68.
# Transgenic offspring
% Transgenic of injected
% Transgenic of hatched
% Transgenic of adults
5.3 DamID: Attempt 1
Transgenic populations were given time to grow and expand in optimum growth conditions at 32oC, with fresh media being regularly administered. However, adults from these populations appeared in poor health, and their lifespan was not comparable to wildtype beetles; population growth was consequently substantially slower than would be expected with wildtype populations. Given the absence of balancer chromosomes for T. castaneum, populations had to be continuously monitored for GFP expression to prevent allele loss; eventually, when populations were of a satisfactory size, GFP-expressing adults were positively selected until allele fixation. From these allele-fixed populations, three smaller populations were selected for the purpose of creating distinct replicates, and the parent population kept at 25oC for reserve stock purposes.
A pilot experiment was conducted on each population, taking 100 µl of embryos laid by each parent population, and one wildtype negative control (see Chapter 2.4 for embryo collection methodology), to determine methylation presence/absence. The protocol used is essentially as described in Vogel et al., (2007) with 17 cycles used for amplification (see Chapter 2.4.2-3 for genomic isolation and methylation enrichment methodology), where ‘No DpnI’ and ‘No T4 DNA ligase’ samples were included as double negative controls, in order to establish the presence/absence of genomic adenomethylation (see Vogel et al., 2007). This pilot study yielded mixed results (Figure 5.3.1); the Dichaete-Dam and Dam samples showed evidence of methylation, however the SoxNeuro-Dam sample appeared the same as the wildtype sample and negative controls, especially upon further amplification (Figure 5.3.1B). Fortunately, there were 3 independent transgenic lines generated for SoxNeuro-Dam, and so another population was selected. This population exhibited evidence of methylation in a similar pattern to Dichaete-Dam and Dam, and thus was used for all subsequent experiments. This pilot study also helped optimise the number of amplification cycles required during the PCR step; over-amplification is to be avoided to limit background amplification effects (van Steensel & Henikoff, 2000; Vogel et al., 2007; Marshall et al., 2016).
I also performed a PCR amplification for each of the 3 transgenic lines to determine the presence of each transgene, and amplified product was submitted for Sanger sequencing to establish integrity of the inserts. For the Dichaete-Dam and SoxNeuro-Dam transgenes, synonymous mutations were detected, however these were likely due to population-level variation with the published gene sequences. The Dam-only insert was identical to that expected.
Embryos were collected from the 3 populations for each transgenic line and methylation enrichment performed as described above, with 16 cycles of amplification (results in Figure 5.3.2). Libraries were prepared using the ThruPLEX® DNA-seq Kit, with 10 cycles of PCR amplification. BioAnalyzer analysis on these samples revealed sharp peaks in fragments <250bp (Figure 5.3.2A), indicating that there might be concatemer formation and contamination present. Libraries were pooled into a multiplex, and a size-selection step was successfully carried out using Agencourt AMPure XP beads to remove these smaller fragments (Figure 5.3.3B). Samples were submitted CRUK Cambridge Institute Genomics Core for 50bp single-end-reads on an Illumina HiSeq 4000, with a 50% PhiX control included to smooth the low complexity arising from the DamID adapters present at the start of each sequence.
Sequencing yielded 393,401,639 total sequences. A bowtie (Langmead et al., 2009) index was generated for the 2016 T. castaneum 5.2 genome assembly, and each library had the adapter sequences trimmed using the cutadapt tool (Martin, 2011). The libraries were then aligned to the reference genome using bowtie v0.12.8 (Table 5.3.1). The results of this mapping proved extremely disappointing; fewer than 0.3% of reads from each library mapped to the reference genome. Mapped sam files were converted to bam files, sorted and indexed using samtools (Heng et al., 2009). Reads were converted to bed files and extended using BEDtools (Quinlan & Hall, 2010) according to average fragment length prior to the library preparation stage. Reads were then visualized by converting to wig and then bigwig file formats, and viewed using the Integrated Genome Browser (IGB) (Freese et al., 2016). (Note: this pipeline is modified from Bardet et al. (2011)). The visualization revealed repeated sequences scattered throughout portions of the genome in no discernible pattern (Figure 5.3.4). These data proved unusable, as normally, the binding data from the Dam-only control is ‘subtracted’ from the Dam-fusion data, and the differential binding is recognized as authentic binding of the TF (van Steensel & Henikoff, 2000; Vogel et al., 2007). Since there is virtually no overlap between the Dichaete-Dam fusion and Dam-only control, I was unable to assess this differential binding. Moreover, the reads that are present are merely narrow ‘stacks’; a sequence being mapped repeatedly to the exact same locus.
A sample of 100 unmapped reads each from a replicate of Dichaete-Dam, SoxN-Dam, and Dam was queried using the Basic Local Alignment Search Tool (BLAST), however no single species was significantly represented in the unmapped sequences, and BLAST alignment scores were nonetheless poor for those that were discovered. Finally, a FastQC analysis (Andrews, 2010) was performed for each library, and this revealed significant contamination with overrepresented sequences – likely concatemers – which accounted for at least 22%-41% of each library (Table 5.3.2), thereby explaining the ‘stacked’ mapping observed in Figure 5.3.4.
Figure 5.3.2. Gel electrophoresis image of the DamID products used for library construction. D = Dichaete-Dam, S = SoxNeuro-Dam, B = Dam, WT = wild type; -D = no DpnI negative control; -T = no T4 ligase negative control.
Figure 5.3.3. BioAnalyzer traces of pooled libraries. (A) Initially, libraries showed a sharp peak in fragments below 250bp. (B) A size selection step was performed to remove fragments below 250bp.
Figure 5.3.4. Screenshot of reads mapped to the Tribolum castaneum genome, visualized using the Integrated Genome Browser (IGB; Freese et al., 2016). (A) Linkage Group 5 (LG5) of the T. castaneum genome. (B) Zoomed in screenshot of a locus on LG5 showing the dispersal of mapped reads as narrow stacks of the same sequence.
Table 5.3.1. Bowtie mapping of libraries from DamID Attempt 1 to the Tribolium castaneum genome.
|Reads Processed||Reads with at least 1 reported alignment||Reads that failed to align|
|Dichaete_1||20530628||26292 (0.13%)||20504336 (99.87%)|
|Dichaete_2||22454454||47647 (0.21%)||22406807 (99.79%)|
|Dichaete_3||21505392||59296 (0.28%)||21446096 (99.72%)|
|SoxN_1||22095068||14877 (0.07%)||22080191 (99.93%)|
|SoxN_2||30282886||39494 (0.13%)||30243392 (99.87%)|
|SoxN_3||18487888||12673 (0.07%)||18475215 (99.93%)|
|Dam_1||19892114||16687 (0.08%)||19875427 (99.92%)|
|Dam_2||20669641||17747 (0.09%)||20651894 (99.91%)|
|Dam_3||15719624||10213 (0.06%)||15709411 (99.94%)|
Table 5.3.2. FastQC report of overrepresented sequences (concatemers) present in each library from DamID Attempt 1.
|Reads Processed||Over-represented sequences (#)||Over-represented sequences (%)|
5.4 DamID: Attempt 2
As concatemer contamination was deemed to be the issue chiefly responsible for unmapped reads, a new protocol was sought to assuage this problem. Discussions with Dr Tony Southall revealed that the polymerase I had been using (Advantage cDNA Polymerase, Clontech) had also caused them issues with concatemer formation, and he instead suggested that I use the MyTaq polymerase (Bioline) system which yielded better results in their investigations (Southall, personal communication). Dr Southall further suggested I follow their protocol on targeted DamID (TaDa) for my isolation and amplification steps (published after I had my finished my first experiments (Marshall et al, 2016)). This protocol introduced an additional step where the restriction enzyme AlwI is used to cleave the DamID adapters used for amplification, upstream of library preparation (and sequencing); previously I had removed the adapters in silico. I also elected to use fewer cycles of amplification – 15 during DamID amplification, and 5 during library preparation, to further attempt to mitigate any over-amplification effects – and include a wildtype library as a positive control for bowtie mapping.
The results from this experiment were more illuminating than the last. FastQC analysis on each library proved more promising – concatemer contamination had been notably mitigated, with over-represented sequences accounting for just 1.73% of all reads (Table 5.4.1). However, reads were sill failing to satisfactorily align to the Tribolium genome (Table 5.4.2), with the majority of the Dam libraries exhibiting <1% mapped reads. More concerning was the wildtype control, where only 16.68% had mapped to the Tribolium genome. The Dichaete-Dam samples appeared to exhibit a higher percentage of mapped reads than the other DamID samples, and thus I performed a Student’s T test to determine if the mean percentage of mapped reads for Dichaete-Dam and SoxNeuro-Dam differed significantly; however they do not (p = 0.081343).
Upon examination of the unmapped sequences for the DamID samples, querying them with BLAST obtained similar results to those in the previous attempt – no species was notably more represented than others, and alignment scores were poor. However, querying the unmapped wildtype reads yielded a different story; the vast majority of the reads were reported as belonging to Triticum aestivum, the domesticated wheat species. This implicated an immediate source of contamination: the medium used to rear the beetles is organic flour produced from wheat grain.
Despite attempts to mitigate flour contamination during embryo collections, which involved manually removing flour debris with a paintbrush and rinsing 3x with ddH2O, wheat DNA appeared to be significantly overrepresented in DNA samples. The genome of the Chinese spring wheat variety has just been published earlier this year (see Clavijo et al., 2017). The wheat genome is 13,427,354,022bp in length, which is 81x larger when compared to the 165,944,000bp long genome of T. castaneum. Moreover, wheat is hexaploid, whereas T. castaneum is diploid, so wheat has 3x as many chromosomes as the beetle. This means that for a single beetle cell and a single wheat cell, there is 243x as much DNA present in the wheat cell. This represents a significant challenge for DNA isolation experiments in beetles (particularly in embryos which possess few cells), and, prior to the recent publication of the wheat genome, appears to have remained unnoticed in Tribolium research until now.
A wheat index was generated using bowtie, and reads from each library aligned to it. The results are shown in Table 5.4.3. 77.25% of reads from the wildtype library map to the wheat genome. Together with the beetle alignment, 93.93% of reads map to the beetle or wheat genome, verifying that sterile conditions were achieved throughout the experiments. However, very few reads from the DamID samples aligned to the wheat genome (less than 1% in all cases). This is likely because non-GAmTC DNA is removed during the DamID protocol, and there is no evidence suggesting adenomethylation occurs in wheat in natura.
I therefore hypothesized that if the vast majority of DNA in each sample was in fact wheat DNA, the input of methylated beetle DNA into the DamID protocol was insufficient, and indiscernible from any other background DNA that might be present during PCR amplification. This hypothesis accounts for the fact that the majority of the unmapped reads from Attempts 1 & 2 do not exhibit notable association with any particular organism, and instead are likely low-level background from various DNA fragments contaminating the samples. If true, this low-level background likely had a comparable presence to any methylated beetle DNA, and was therefore amplified in a comparable manner.
Table 5.4.1. FastQC report of overrepresented sequences (concatemers) present in each library from DamID Attempt 2.
|Reads Processed||Over-represented sequences (#)||Over-represented sequences (%)|
Table 5.4.2. Bowtie mapping of libraries from DamID Attempt 2 to the Tribolium castaneum genome.
|Reads Processed||Reads with at least 1 reported alignment||Reads that failed to align|
|Dichaete_1||47243317||922287 (1.95%)||46321030 (98.05%)|
|Dichaete_2||37525180||235483 (0.63%)||37289697 (99.37%)|
|Dichaete_3||32381347||441286 (1.36%)||31940061 (98.64%)|
|SoxN_1||35008284||19416 (0.06%)||34988868 (99.94%)|
|SoxN_2||37666808||22150 (0.06%)||37644658 (99.94%)|
|SoxN_3||33758970||38861 (0.12%)||33720109 (99.88%)|
|Dam_1||24728239||9989 (0.04%)||24718250 (99.96%)|
|Dam_2||33903946||14795 (0.04%)||33889151 (99.96%)|
|Dam_3||30598114||26203 (0.09%)||30571911 (99.91%)|
|Wildtype||36468766||6082372 (16.68%)||30386394 (83.32%)|
Table 5.4.3. Bowtie mapping of libraries from DamID Attempt 2 to the Triticum aestivum genome.
|Wheat||Reads Processed||Reads with at least 1 reported alignment||Reads that failed to align|
|Dichaete_1||47243317||340276 (0.72%)||46903041 (99.28%)|
|Dichaete_2||37525180||293194 (0.78%)||37231986 (99.22%)|
|Dichaete_3||32381347||238358 (0.74%)||32142989 (99.26%)|
|SoxN_1||35008284||120557 (0.34%)||34887727 (99.66%)|
|SoxN_2||37666808||143245 (0.38%)||37523563 (99.62%)|
|SoxN_3||33758970||183749 (0.54%)||33575221 (99.46%)|
|Dam_1||24728239||54791 (0.22%)||24673448 (99.78%)|
|Dam_2||33903946||86580 (0.26%)||33817366 (99.74%)|
|Dam_3||30598114||125985 (0.41%)||30472129 (99.59%)|
|Widltype||36468766||28171313 (77.25%)||8297453 (22.75%)|
5.5 DamID: Attempt 3
I sought to perform one final attempt at DamID in Tribolium castaneum. The egg collections that were performed for the previous 2 attempts took a significant amount of time due to the poor health of the transgenic populations and comparatively low fecundity. As I was nearing the completion of my 3 years of funding at this point, and did not feel I had sufficient time to collect enough biological material for another experiment with embryos, I elected to perform the experiments with adult heads instead. The purpose of this investigation was to illuminate the binding properties of Tc’Dichaete and Tc’SoxN in the central nervous system and developing brain of the beetle, and to establish DamID as a resource for genomic studies in the beetle. Since DamID identifies historical binding events in the genome, genomic DNA from adults should still possess the methylation signatures of these binding events during embryonic development (and subsequent stages of the beetle life cycle). Moreover, beetle heads were chosen as they house the brain, and the digestive tracts of beetle bodies likely contain a substantial amount of wheat flour. Finally, adult beetles possess significantly more cells than embryos, maximizing potential DNA yields.
I therefore dissected the heads of ~200 adults from each transgenic line (and a wildtype population as a control); however there was insufficient material for replicates. To mitigate wheat flour contamination, I removed residual flour with a paintbrush, and thoroughly rinsed the heads 3x in ddH2O, and then added additional washes 3x in 100% ethanol, in an attempt to wash clear any flour adhering to beetle mouthparts and the surface of the exoskeleton. DNA was then extracted from these samples, and processed according to the same DamID protocol used in Section 5.4, which had successfully diminished concatemer contamination effects. 100% of the DNA from each of the DamID samples was used at each step of the DamID protocol in an attempt to maximize DNA input.
Prior to submitting the samples for a final, yet expensive round of sequencing, I sought to test the extent of wheat as a contaminating factor in the DamID investigations by performing a quantitative real-time PCR. I selected the TaRca2-α locus of the wheat genome (Saeed et al., 2016), common to several wheat strains, in order to maximize the likelihood that the (unknown) wheat strain used in my flour medium contained the target amplicon. Primers were designed to generate a 134bp amplicon from the wheat genome, and the Dichaete locus of the T. castaneum genome was selected to target a 139bp amplicon (see primers used in Chapter 2.4.1).
The samples I tested were as follows: 1 replicate for the genomic DNA from embryos of Dichaete-Dam, SoxN-Dam, and Dam-only, and wildtype embryonic gDNA; each of these samples were used in the second experimental attempt described in Section 5.4. I also included the genomic DNA isolated from wildtype adult heads, which were prepared in parallel to the genomic DNA from the adult DamID samples. Serial dilutions of the target amplicons for both beetle and wheat DNA were performed in order to generate a standard curve.
The results of this experiment are shown in Figure 5.5.1. The qPCR revealed that wheat DNA is indeed present in detectable quantities in each sample. The Dichaete-Dam replicate shows approximately 2.5-3.7x as much target DNA present for the beetle amplicon (Figure 5.5.1A). This may be reflected by the fact that the amplicon used is from the Dichaete locus, and within the Dichaete-Dam transgenic line, there would be 2 such loci present in the genome. However, more promising was the fact that the heads from wildtype adults exhibited substantial presence of the beetle amplicon, and very low presence of the wheat amplicon (Figure 5.5.1A-B.
However, as these two amplicons are of comparable sizes (134bp and 139bp); this does not reflect the true quantities of wheat and beetle of DNA present, as the wheat genome is 81x as large as the beetle genome (discussed in Section 5.4). The size of the respective amplicon relative to the size of the entire genome can therefore be used to calculate total DNA presence. Once the total DNA presence is calculated for each species, the relative percentages of wheat and beetle DNA content for each sample can be calculated.
The following equation was used to determine total quantity of genomic DNA (Qt) in each sample, for wheat and beetle DNA respectively:
G = total genome size (bp), and A = the amplicon size (bp). The relative percentages of each can then be calculated by dividing each Qtvalue with the starting concentration of the sample (C) and multiplying by 100:
The results are summarised in Figure 5.5.2. These results demonstrate that, with the exception of the Dichaete-Dam sample, wheat DNA represents >90% of the total DNA for each of the embryo samples examined. This supports the hypothesis proposed at the end of Section 5.4 that wheat was a significant contaminating factor; beetle DNA is highly underrepresented in these samples, and consequently methylated DNA likely even more so.
Moreover, the finding that wheat DNA represents just 7.5% of total DNA in the heads of wildtype adults was very promising, as this DNA was isolated in in parallel with the DamID samples. I therefore decided to generate sequencing libraries for the 4 DNA samples isolated from adult heads, and submit them for one final round of Illumina sequencing.
However, the sequencing results disprove the hypothesis of wheat contamination being the chief confounding variable: once again, very few reads (<1%) mapped to the beetle genome from the DamID samples (Table 5.5.1). The percentage of mapped reads for the wildtype sample (85%) demonstrates that I successfully mitigated wheat as a contaminating factor, however, indicating that the more stringent washing conditions with 100% ethanol and the greater cell numbers present in the heads make a significant difference to DNA yields. With <1% of all reads mapping to the wheat genome (Table 5.5.2), and concatemers representing just 1.26% of the total reads across libraries, the troubleshooting experiments detailed here are shown to have been successful in mitigating contaminating factors.
Figure 5.5.1. Quantitative real-time PCR (qRT-PCR) results for a 139bp beetle amplicon (A,C), and 134bp wheat amplicon (B,D) represented in the genomes of Tribolium castaneum and Triticum aestivum, respectively. The standard curve for these experiments was extremely significant, with R2 values of 0.9954 and 0.9992, respectively, indicating high efficiency in PCR amplification.
|Reads Processed||Reads with at least 1 reported alignment||Reads that failed to align|
|Dichaete_AH||75676064||424505 (0.56%)||75251559 (99.44%)|
|SoxN_AH||77394517||578618 (0.75%)||76815899 (99.25%)|
|Dam_AH||92126641||490723 (0.53%)||91635918 (99.47%)|
|WT_AH||97887504||83074213 (84.87%)||14813291 (15.13%)|
Table 5.5.1. Bowtie mapping of libraries from DamID Attempt 3 to the Tribolium castaneum genome.
Table 5.5.2. Bowtie mapping of libraries from DamID Attempt 3 to the Triticum aestivum genome.
|Wheat||Reads Processed||Reads with at least 1 reported alignment||Reads that failed to align|
|Dichaete_AH||75676064||186366 (0.25%)||75489698 (99.75%)|
|SoxN_AH||77394517||42632 (0.06%)||77351885 (99.94%)|
|Dam_AH||92126641||27453 (0.03%)||92099188 (99.97%)|
|WT_AH||97887504||941281 (0.96%)||96946223 (99.04%)|
Table 5.5.3. FastQC report of overrepresented sequences in the libraries from DamID Attempt 3.
|Reads Processed||Over-represented sequences (#)||Over-represented sequences (%)|
5.7 Discussion of results
In this investigation, I have attempted to establish the first genomic-wide study of TFs in Tribolium castaneum embryos, and attempted the first use of DamID within arthropods beyond drosophilids. However, despite significant troubleshooting and several experimental attempts, I have been unsuccessful in achieving these aims. Nevertheless, DamID is a complex and sensitive technique, and the negative results generated here provide a significant contribution to establishing Tribolium castaneum as a resource for genomic studies.
I have tested the feasibility of DamID in T. castaneum by demonstrating that GATC occurrence is sufficiently abundant for DamID experiments, and I have tested two different core promoters, one endogenous to D. melanogaster and the other to T. castaneum, for their suitability with Sox-dam fusion transgenes. I have found that the Drosophila promoter Dm’HSP70 is unsuitable for DamID experiments, as injected embryos exhibited extremely poor survival rates and zero transgenic individuals. In contrast the Tribolium promoter Tc’HSP68 appeared to be more compatible with the Sox-dam fusions in yielding transgenic lines; however, these were still much more difficult to generate than would normally be expected from non-toxic constructs.
Preliminary observations in these lines appeared to confirm that adenomethylation was indeed present, as the isolation and enrichment of DNA via methylation-sensitive enzymes produced distinct gel distributions when compared with the negative controls and wildtype DNA (Figure 5.3.2). Following digestion with the methylation-sensitive enzyme DpnI, DNA is passed through a column meaning that only the methylated DNA cleaved by DpnI should pass through, whereas uncut genomic DNA is left behind. Double selection of methylated DNA occurs with DpnII, which in turn cleaves only unmethylated DNA, meaning it cannot be enriched by PCR amplification. Over-amplification can lead to background contaminants being amplified, as was observed in the initial pilot study (Figure 5.3.1), however this over-amplification has a distinct pattern on the gel that varies with the negative controls. These results lead me to conclude that some degree of adenomethylation was present in these beetle populations.
Concatemer formation also appeared to pose a significant challenge in Attempt 1 of the experiments, with FastQC reports suggesting that concatemers represented anywhere between 22%-41% of these reads. These challenges were mitigated by the adoption of a different polymerase for enrichment of methylated fragments, and also from the addition of a digestion step with the restriction enzyme AlwI, which recognises and cleaves the adapter sequence used for the purposes of PCR amplification (Marshall et al., 2016). Fewer cycles of amplification also likely helped reduce concatemer formation. Together, these steps reduced concatemer contamination from ~30% of the total reads to just 1.73%.
The most important novel finding, however, comes from the second experimental attempt: wheat is a significant contaminating factor in DNA preparations from T. castaneum. Initially I had elected not to include a wildtype library in Attempt 1 as I wanted to guarantee sufficient read depth across the samples. However, by including the wildtype as a positive control in Attempt 2, I was able to determine that the vast majority of this DNA library was actually wheat DNA (77.25%). The wheat genome is 81x larger than the beetle genome, and is hexaploid as opposed to diploid, meaning that for each wheat cell, there is 243x as much DNA present compared to each beetle cell. Research in embryos is likely to be significantly more challenging as a consequence given their low cell numbers.
These results may explain why the only genomic study to date in T. castaneum has been conducted in beetle larvae, which have many more cells than embryos; however the fact that the researchers identified just 16 binding sites in the genome (Hepat et al., 2013) leads one to speculate whether they too encountered similar problems in obtaining sufficient enrichment of genuine binding events in their investigations. In contrast, many RNA-seq studies have been performed in T. castaneum (e.g. see Preuss et al., 2012; Altincicek et al., 2013; Schmitt-Engel et al., 2015), where contamination from wheat transcripts does not appear to have proven an issue. This is likely to due to the fact that RNA is far less stable than DNA; the preparation of flour from wheat grain involves both rigorous heat and mechanical conditions which may lead to the loss of RNA; and the sterilisation techniques used to prepare the flour as a medium for the beetles (-80oC freezing or 80oC heating) is likely to lead to a further loss of wheat transcripts.
That wheat was a confounding factor in the investigations discussed here is unquestionable; the vast majority of genomic DNA extracted from embryos from each population (~90%) proved to be wheat. The quantity of input DNA to the DamID enrichment protocol is critical given you are positively selecting for methylated sequences only; lower input DNA means that the ratio of methylated DNA to inevitably-occurring background DNA is far lower, and thus they are amplified in similar proportions. Alternatively, when there is an abundance of adapter sequence present relative to genomic DNA, adapters are more likely to ligate to one another and be amplified at the expense of everything else, which explains the severe concatemer presence in each of the DamID samples in Attempt 1.
However, both of these confounding factors – concatemer formation and wheat contamination – were controlled for in the final DamID experiments I performed and yet they still proved unsuccessful. This suggests that there is likely to be a problem with the activity of the Sox-dam fusion itself. The transgenesis itself had been successful, as 100% of individuals exhibited GFP expression and the transgene was PCR amplified and sequenced for each population, showing only that no non-synonymous mutation had occurred and that the transgene was intact. Positional effects of the insertion site may provide an explanation in terms of silencing the transgene (for example if the insertion occurred in regions of open or closed chromatin). However, given the toxicity exhibited by dam in eukaryotes, and the poor survival rate under both the Drosophila and Tribolium promoters, it is more likely that a site of closed chromatin is preferable in order to limit expression. Moreover, the experiments for enrichment of methylated DNA indicate that methylation is indeed present. Determining the extent of this methylation is far more difficult, and it could be that the promoter, while not driving expression so much as to be toxic, may be too tight and only allows truly minute levels of expression, which are insufficient for detection via the protocol used above. In future, determining the expression levels of Sox-Dam fusions via an RT-PCR in both Drosophila and Tribolium, and comparing the two, might yield insights into whether expression is sufficient under the control of the Tribolium promoter.
Other speculative explanations for the lack of detection of methylated regions exist; i.e. the beetle T. castaneum might be biologically incompatible with DamID as a technique. Methylation is very poorly understand in invertebrates, and CpG methylation has only very recently been shown to occur in insects, albeit in a very dissimilar manner to vertebrates (Song et al., 2017). It is plausible for example that Tribolium may have evolved a derived genomic mechanism that mitigates adenomethylation in the genome, perhaps as a defence mechanism. However, this seems unlikely given the similarities of Drosophila and Tribolium in their known methylomes (Song et al., 2017). More likely perhaps is the hypothesis that T. castaneum is simply less tolerant to adenomethylation than D. melanogaster, even at minute levels, and that for the Sox-Dam fusion to sufficiently methylate the genome in an identifiable manner, such levels of methylation are toxic.
An alternative method might be to potentially detect DNA methylation directly. If methylation levels are insufficient to be enriched via traditional techniques, then directly sequencing genomic DNA to detect methylation might prove a more sensitive assay. Bisulfite sequencing is routinely used to detect 5-methylcytosine, however this technique relies on converting all unmethylated cytosine residues to uracil (reviewed by Fraga & Estella, 2002), and thus is inappropriate for detecting N6-methyladenine (Flusberg et al., 2010). An alternative method is to use single-molecule, real-time (SMRT) sequencing, which utilises the incorporation of fluorescent nucleotides. Fluorescent ‘pulses’ are measured at incorporation, which enables direct detection of modified nucleotides, and discrimination between N6-methyladenine, 5-methylcytosine, and 5-hydroxymethylcytosine modifications is possible (Flusberg et al., 2010).
In T. castaneum, cytosine methylation occurs in a mosaic pattern in the genome; hypermethylated regions are interspersed amongst larger unmethylated regions (Song et al., 2017). However, overall cytosine methylation is very low (in contrast to vertebrates). Research by Song et al., (2017) used deep sequencing to detect methylation. However, the authors argue that their data of T. castaneum methylation are likely incomplete, and that it may require sequencing at extreme depths to fully uncover the beetle methylome. This suggests that SMRT sequencing is likely to unsuitable because, despite its single-molecule sensitivity (Flusberg et al., 2010), scaling it up to the extreme depth likely required may prove very challenging.
DamID is a sophisticated and intricate method for detecting historical binding events of TFs in vivo, however this investigation reveals that significant optimisation must be performed for it to be established in a non-model organism. The two promoters tested here may have each been ‘too hot’ and ‘too cold’ respectively, and as such the search should be widened to discover a ‘Goldilocks’ promoter, allowing just the right amount of expression. However, whether such a Goldilocks promoter exists is difficult to determine, as adenomethylation in T. castaneum may not be sufficiently tolerated to allow the broad methylation across the genome necessary for its identification. A more viable option for future genomics research may well be to develop new antibodies, and test the specificity and robustness of existing ones, so that ChIP experiments may be performed. Nonetheless, the findings from this study implicating wheat DNA as a significant contaminant originating from the flour medium will be indispensable in such ChIP studies, and especially so where embryonic development is concerned. Therefore, the contributions from this research will hopefully aid future investigations that seek to establish Tribolium castaneum as a model organism for genomics research.