ABC has proved to be a powerful framework for comparing complex evolutionary scenarios on large datasets (Roux et al.). Altogether, currently available software already provides stimulating leads for future developments in phylogeny. Identifying candidate loci for natural selection is a central goal of adaptation genomics, explored by two traditional approaches: top-down approaches (GWA and QTL) and bottom-up genome scans.
With the advent of high-throughput sequencing, genome scans became a popular approach to detect candidate targets of selection. Such scans have the merit of identifying candidates without the a priori expectations of a candidate gene approach (Ellegren). However, they have various limitations, including false-positive issues (Mallick et al.).
Here, we detail how GC-content can lead to important additional bias during genome scans for detecting natural selection. Genome scans of positive selection often rely on methods that look for lineage-specific accelerations in the rate of protein evolution. Because GC alleles are actively selected by the repair systems of meiotic recombination, they are over-represented in the gamete pool and benefit from increased transmission to the next generation, in a way similar to beneficial mutations subject to positive selection.
Consequently, many accelerations of the substitution rate attributed to positive selection during genome scans are actually due to gBGC episodes (Galtier and Duret; Berglund et al.). When a mutation toward GC is deleterious, gBGC can counteract positive selection and maintain or fix deleterious alleles. Confusion between positive selection and gBGC could be avoided in two different ways.
The first is to filter the results of classical tests of positive selection and to consider with caution positive selection signatures in GC-rich regions. Several criteria can be used in both cases to differentiate gBGC from positive selection, such as the number of mutations toward GC in the surrounding non-coding regions (Galtier and Duret). Popular analytical methods in molecular evolution rely on a strong assumption: synonymous mutations are neutral.
However, natural selection was proposed to be superimposed on these two evolutionary forces at synonymous codons (Urrutia; Comeron; Plotkin et al.). Although initially challenged (Williamson et al.). This association is explained by selection for increased translational efficiency.
Translational efficiency would then be optimized by increasing the usage of the preferred synonymous codons. Such a process can be tested in coding sequences by measuring the effective number of codons (ENc) in a given gene. ENc takes a value of 61 when all codons of the genetic code (minus the three stop codons) are used without bias, and decreases to 20 (the number of amino acids) for the most biased genes.
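As an illustration of how ENc behaves between its two bounds, the following is a minimal sketch assuming Wright's classical homozygosity-based estimator and the standard genetic code; the text above does not specify which ENc estimator is used. The function name `effective_number_of_codons` is hypothetical, and the handling of missing degeneracy classes (raising an error) is a simplification chosen for this sketch.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

# Number of amino acids in each degeneracy class (2-, 3-, 4- and 6-fold).
CLASS_SIZES = ((2, 9), (3, 1), (4, 5), (6, 3))


def effective_number_of_codons(cds):
    """Estimate ENc for an in-frame coding sequence (assumed estimator: Wright's)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Collect codon counts per amino acid, ignoring stop codons.
    by_aa = defaultdict(dict)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            by_aa[aa][codon] = counts.get(codon, 0)

    # Codon homozygosity F for each amino acid, grouped by degeneracy class.
    f_by_class = defaultdict(list)
    for codon_counts in by_aa.values():
        k = len(codon_counts)
        n = sum(codon_counts.values())
        if k == 1 or n < 2:
            continue  # Met and Trp carry no information; n < 2 is undefined
        homozygosity = sum((c / n) ** 2 for c in codon_counts.values())
        f = (n * homozygosity - 1) / (n - 1)
        if f > 0:
            f_by_class[k].append(f)

    enc = 2.0  # Met and Trp each contribute exactly one codon
    for k, n_amino_acids in CLASS_SIZES:
        values = f_by_class.get(k)
        if not values:
            raise ValueError(f"no usable amino acid in the {k}-fold class")
        enc += n_amino_acids / (sum(values) / len(values))
    return min(enc, 61.0)  # finite-sample estimates can slightly exceed 61
```

With perfectly even codon usage the estimate reaches the upper bound of 61, and with a single codon per amino acid it falls to 20, matching the two limits described above. Because the estimator depends only on codon frequencies, any process that shifts base composition (such as gBGC) moves the value whether or not selection on translation is involved.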
In agreement with the hypothesis of selection for translational efficiency, population genetics analyses in Drosophila described signatures of selection on synonymous mutations (Akashi; Akashi and Schaeffer). A study of codon usage bias in Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana has shed light on the over-expression of genes featuring codon preference, with a large predominance of preferred codons ending with G or C (Duret and Mouchiroud). By locally increasing GC-content, gBGC mechanically restricts the number of used codons and reduces the measured ENc independently of selection for translational efficiency.
The measured ENc is thus biased by gBGC and must be corrected with local background nucleotide compositions. In addition, variation in GC-content also impacts measures of gene expression.
With the advent of high-throughput sequencing technologies, it is now standard practice to approximate gene expression levels by counting the number of reads mapping to a target in ChIP-seq or RNA-seq analysis. Testing selection for translational efficiency by measuring the correlation between ENc and gene expression levels therefore requires the use of both GC-corrected ENc and GC-corrected expression levels.
The ongoing surge of transcriptomic data will permit measurement of GC-content heterogeneity, preferred codon usage and expression levels across a large number of loci and species. This type of large-scale analysis could open the door to a better understanding of the relationship between effective population size (Ne) and codon usage.
As theoretically predicted (Bulmer), selection on synonymous codons might be stronger in species with large Ne. While the Ne-hypothesis to explain variation in selection on codon usage remains untested by empirical studies, a descriptive study of the Ne-effect on variation in gBGC will be necessary to avoid entangling the two effects.
Future projects aiming to test these hypotheses are expected to be strongly biased if GC-content biases in estimates of gene expression levels or codon usage are naively neglected. GC-content is associated with multiple biases of different nature (Figure 1). Whether for technological reasons (sequencing technology biases), biological reasons (GC-biased gene conversion) or methodological reasons (limitations of models of sequence evolution), all these biases affect the results of downstream analyses.
With the surge of genomic data from various non-model species, comparative genomics has the opportunity to solve many unresolved questions in evolution. However, one should be aware of the methodological challenges associated with the GC-content heterogeneity inherent to large-scale studies, whether across a large number of species or loci. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Akashi, H. Genetics.
Mutation pressure, natural selection, and the evolution of base composition in Drosophila. Genetica 10, 49-
Arbeithuber, B. Crossovers are associated with mutation and biased gene conversion at recombination hotspots.
Beaumont, M. Approximate Bayesian computation in population genetics.
Benjamini, Y. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res.
Berglund, J. Hotspots of biased nucleotide substitutions in human genes. PLoS Biol.
Bernardi, G. The mosaic genome of warm-blooded vertebrates. Science.
Betancur-R, R. Addressing gene tree discordance and non-stationarity to resolve a multi-locus phylogeny of the flatfishes (Teleostei: Pleuronectiformes).
Bierne, N. The coupling hypothesis: why genome scans may fail to map local adaptation genes.
Blanquart, S. A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution.
Recombination rate variation modulates gene sequence evolution mainly via GC-biased gene conversion, not Hill-Robertson interference, in an avian system.
Boussau, B. Efficient likelihood computations with nonreversible models of evolution.
Bulmer, M.

The polysome profile was largely unaffected by DDX6 silencing, implying that DDX6 depletion did not grossly disturb global translation (Figure 3—figure supplement 1B).
Polysomal RNA isolated from the sucrose gradient fractions (Figure 3—figure supplement 1B) and total RNA were used to generate libraries using random hexamers to allow for poly(A) tail-independent amplification. As polysomal accumulation can result from both regulated translation and a change in total RNA without altered translation, we then used the polysomal to total mRNA ratio as a proxy measurement of translation rate. Nevertheless, for a few transcripts, polysomal enrichment may reflect an elongation block rather than an increased rate of initiation.
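The reasoning behind the polysomal to total ratio can be sketched as follows. This is only an illustration under stated assumptions: the function names, the log2 scale and the pseudocount of 1 are hypothetical choices for this sketch, not the normalization described in Materials and methods.

```python
import math


def translation_proxy(polysomal, total, pseudocount=1.0):
    """Log2 ratio of polysomal to total mRNA abundance for one transcript.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for transcripts with no reads in one of the two libraries.
    """
    return math.log2((polysomal + pseudocount) / (total + pseudocount))


def translation_change(poly_ctrl, total_ctrl, poly_kd, total_kd):
    """Change in the translation proxy between knockdown and control cells."""
    return translation_proxy(poly_kd, total_kd) - translation_proxy(poly_ctrl, total_ctrl)


# A transcript whose polysomal and total abundance both double after silencing
# shows polysomal accumulation, but no change in the ratio-based proxy.
stabilized_only = translation_change(poly_ctrl=99, total_ctrl=99, poly_kd=199, total_kd=199)

# A transcript whose polysomal abundance rises while total abundance is constant
# shows a positive change in the proxy.
translation_up = translation_change(poly_ctrl=99, total_ctrl=99, poly_kd=199, total_kd=99)
```

Dividing by total mRNA is what separates the two cases: a rise in polysomal signal that merely follows a rise in total RNA cancels out.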
As a result, mRNAs with the most upregulated translation rate were the least stabilized, and vice versa (Figure 3—figure supplement 2B). Then, as we previously showed that DDX6 can oligomerize along repressed transcripts (Ernoult-Lange et al.). The analysis was performed as in Figure 1B. The analysis was performed as in A. The read coverage was analyzed in each duplicate experiment and normalized as described in Materials and methods.
The average value in control cells (gray lines) and after PAT1B silencing (peach lines) was plotted, with the bars representing the duplicate values. An expanded view of the dashed box is presented in the right panel. The data were analyzed as in C.
Raw GC content and log2-transformed ratios of the other datasets were used for the clustering of both transcripts (lines) and datasets (columns). As for DDX6, we assume that changes in steady-state mRNAs following PAT1B silencing generally reflect their increased stability though, again, we cannot exclude some changes at the transcription level. To gain insight into the mechanism of regulation by PAT1B, we analyzed the read coverage in the PAT1B silencing experiment (Figure 4C) and found it to be unchanged over the whole transcriptome.
To obtain a global visualization of the results, we conducted a clustering analysis of the various datasets (Figure 4E). Having shown that GC content is a distinctive feature of DDX6 and XRN1 versus PAT1B targets, we investigated the link between this global sequence determinant and a variety of sequence-specific post-transcriptional regulators for which relevant genome-wide datasets are available (Figure 5—figure supplement 1A). Furthermore, they shared common behavior in the various experiments.
This is summarized in Figure 5B in a heatmap representing their median value in each dataset, while Figure 5—figure supplements 1 and 2 provide detailed analysis, as described below.
The targets of the indicated factors were defined using CLIP experiments or motif analysis (see Materials and methods). The boxplots represent the distribution of the GC content of their gene. The distribution for all mRNAs is presented for comparison in gray and the red dashed line indicates its median value. (B) Heatmap representation of the different factors depending on the behavior of their mRNA targets in the different datasets.
The lines were ordered by increasing GC content, and the columns as in Figure 4E. The data are represented as in A. The data were represented as in B, using the same color code. These results were consistent with their high GC content and our global analysis above. It also pointed to a particular role of 4E-T in PB targeting or scaffolding. Our analysis is also informative on the link between DDX6-dependent decay and codon usage. Previous yeast studies have debated whether suboptimal codons could enhance DDX6 recruitment to trigger mRNA decay (Radhakrishnan et al.).
Thus, in HEK cells, this mechanism seemed to account for a minor part of DDX6-dependent decay, if any, as also found in mouse stem cells (Freimer et al.).
Overall, they also shared common behavior in the various silencing experiments and PB dataset, with nevertheless some differences. This is summarized in Figure 5D in a heatmap representing their median value in each dataset, while Figure 5—figure supplement 3 provides detailed analysis, as described below.
However, our analysis revealed some differences between miRNAs, particularly in terms of extent of PB storage (Figure 5—figure supplement 3H), which appeared associated with distinct GC content: at the two extremes, miRp targets were particularly AU-rich and strongly enriched in PBs, while the targets of miRb-5p, the most GC-rich in these sets, were not.
As the global GC content appeared closely linked to mRNA fate, but also to RBP and miRNA binding, as well as to translation activity, our analyses then aimed at ranking the importance of these various features. We first assessed the respective weight of the GC content and the binding capacity of particular RBPs. To this aim, we binned the whole transcriptome depending on its GC content bin size of transcripts. The median fold-changes of the bins in each RNAseq dataset were calculated and plotted as a function of their median GC content.
Median values were similarly calculated for the various group I and II target lists and overlaid for comparison (Figure 5—figure supplement 4). Surprisingly, the fold changes of the targets of particular RBPs generally fell very close to the tendency plot based on GC content only. Similarly, the fate of the miRNA targets was mostly in the range expected from their GC content (Figure 5—figure supplement 5).
Nevertheless, some miRNA-specific effects were observed. For instance, the targets of several miRNAs were more stabilized than expected after DDX6 depletion, including miRb-5p, 92a-3p, 16-5p, 18a-5p, 19a-3p, 19b-3p (Figure 5—figure supplement 5A), though this was not observed following XRN1 depletion (Figure 5—figure supplement 5B).
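The binning procedure described above (ranking the transcriptome by GC content, cutting it into bins of a fixed number of transcripts, and taking medians per bin) can be sketched as follows. The function name and the `bin_size` parameter are hypothetical, the bin size actually used is not assumed here, and letting the last bin hold fewer transcripts is a simplification of this sketch.

```python
from statistics import median


def median_fold_change_by_gc_bin(transcripts, bin_size):
    """Bin transcripts by GC content and summarize each bin by its medians.

    transcripts: iterable of (gc_content, log2_fold_change) pairs.
    bin_size: number of transcripts per bin (hypothetical parameter).
    Returns a list of (median_gc, median_log2_fold_change), in order of
    increasing GC content.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(transcripts, key=lambda t: t[0])
    summary = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        summary.append((median(gc for gc, _ in chunk),
                        median(fc for _, fc in chunk)))
    return summary


# Toy data: three AU-rich and three GC-rich transcripts.
toy = [(0.40, -1.0), (0.60, 1.0), (0.42, -0.5),
       (0.62, 0.5), (0.44, 0.0), (0.64, 1.5)]
trend = median_fold_change_by_gc_bin(toy, bin_size=3)
```

Applying the same function to the target list of a given RBP or miRNA and overlaying the result on the whole-transcriptome trend shows whether those targets deviate from what their GC content alone would predict.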
Similarly, while the targets of miR—3 p and miR—5 p were both particularly enriched in PBs Figure 5—figure supplement 5E , only miR—3 p targets were particularly dependent on PAT1B for stability Figure 5—figure supplement 5D.
Thus, despite their small size, the miRNA binding sites tend to have a GC content similar to that of their full-length host mRNA, which affects their fate in terms of PB localization and post-transcriptional control.
Thus, while mRNA localization in PBs is highly influenced by their GC content, it may also be outcompeted by retention on membranous organelles and plasma membrane. In a mirror analysis, we analyzed groups of transcripts with similar GC3. (A) General importance of the CDS. (C) The transcripts of haplo-insufficiency genes are enriched in PBs. The haplo-insufficiency score is the probability that a gene is haplo-insufficient, as taken from the Huang et al.
The results were similar using Steinberg et al. Error bars, SD. Representative cells are shown in E.
Arrows indicate the PBs enlarged above. The experiment was performed in duplicate exp. The percentage of PBs containing clusters of Rluc transcripts in the four experiments is represented in F. The scheme recapitulates the main steps of the assay. Fluorescence microscopy images show that PBs in cells, PBs after cell lysis, and reconstituted PB-like granules have similar size. We speculate that suboptimal translation of AU-rich CDS makes mRNAs optimal targets for translation regulation, since any control mechanism has to rely on a limiting step.
Conversely, optimally translated transcripts would be better controlled at the level of stability. One prediction is that proteins produced in limiting amounts, such as those encoded by haplo-insufficiency genes, are more likely to be encoded by PB mRNAs. Genome-wide haplo-insufficiency prediction scores have been defined for human genes, using diverse genomic, evolutionary, and functional properties trained on known haplo-insufficient and haplo-sufficient genes (Huang et al.).
To add experimental support to the importance of GC content for PB assembly, we conducted two assays. First, we analyzed the localization of reporter transcripts that differ only by the GC content of their CDS. After 24 h cells were analyzed for luciferase activity and transcript localization. In agreement with our previous analyses, Rluc protein yield was considerably reduced 4.
After lysis and elimination of preexisting PBs by centrifugation, addition of recombinant DDX6 triggered the formation of new granules on ice, in a dose-dependent manner Figure 6—figure supplement 2C—E. This reconstitution assay was surprisingly efficient, as granule formation required rather low concentrations of both the lysate components about fold lower than in cells, see Materials and methods and recombinant DDX6 0. Next, the cell-free extract was briefly treated with micrococcal nuclease to decrease the amount of cellular RNA, and the assay was repeated with or without addition of an either AU-rich or GC-rich nt-long synthetic RNA Figure 6H , Figure 6—figure supplement 2F.
Low GC content in the CDS likely acts, at least in part, through codon usage and low translation efficiency. Our combined analysis of the transcriptome of purified PBs together with transcriptomes following the silencing of broadly-acting storage and decay factors, including DDX6, XRN1 and PAT1B, provided a general landscape of post-transcriptional regulation in human cells, where mRNA GC content plays a central role.
Moreover, while the analysis was consistent in proliferating cells of various origins, giving rise to a general model, it is possible that changes in cell physiology, for instance at particular developmental stages or during differentiation, rely on a different mechanism.
While the redundancy of the genetic code should enable amino acids to be encoded by synonymous codons of different base composition, the wide GC content variation between PB-enriched and PB-excluded mRNAs has consequences on the amino acid composition of encoded proteins.
Interestingly, we showed that the absolute number of low usage codons per CDS best correlates with low protein yield. Thus, these results provide a molecular mechanism for a previously unexplained feature of PB mRNAs, that is, their particularly low protein yield, which we reported was an intrinsic property of these mRNAs and not simply the result of their sequestration in PBs (Hubstenberger et al.).
Interestingly, the mRNAs of haplo-insufficiency genes, which by definition are expected to have a limited protein yield, are indeed enriched in PBs (Figure 6C).
Our results may represent a solid basis for further investigation on human structural and functional genomics while also providing a framework for other genome comparative analysis.
The genome is the complete set of genetic information of a cell; in eukaryotes, and thus in humans, it is stored in the nucleus and mitochondria [1].
While the mitochondrial DNA (mtDNA) sequence has been known since [2], the draft sequence of the nuclear human genome was first published in February [3, 4]. The fact that very long molecules of human DNA can be contained, following accurate and multiple rounds of folding, within the very limited space of the nucleus has always attracted attention.
Traditionally, it has been roughly estimated over the last decades that the total length of human diploid DNA is around 2 m (Table 1) [7, 8, 9, 10, 11, 12, 13]. The base composition is usually specified by quoting the percentage of guanine (G) and cytosine (C) of a DNA molecule, or GC content [1], and was first estimated through buoyant density centrifugation [14]. The GC content has been well studied across organisms [15, 16, 17, 18, 19], showing its relationships with various genomic characteristics [20, 21, 22, 23, 24] and with gene structures such as exons and introns [25, 26, 27], for example showing that G-rich repeats are a consistent feature of human ultra-short introns [28, 29].
The availability of a high-quality reference sequence for the human genome currently offers the possibility to provide an accurate evaluation of these parameters. In this work we propose revised estimations for the length, weight and GC content of the reference human genome and of individual chromosomes, including mtDNA, in a standard human diploid cell and in a reference human being.
Moreover, we discuss the meaning of the obtained results and formulate a method to calculate the relative GC content in the whole set of messenger RNA sequences and in transcriptomes, comparing different tissues and organisms. Lengths in centimeters (cm) and weights in picograms (pg) of all 24 human chromosomes and of the mtDNA sequence were calculated as detailed in Additional file 1: Additional Methods.
The genomic GC content was calculated among the certain bases for the 24 chromosomes and for mtDNA as detailed in Additional file 1: Additional Methods. Human quantitative transcriptome maps were previously obtained from publicly available microarray datasets analysed through TRAM (Transcriptome Mapper) software [30] as described [31, 32, 33].
Since quantitative gene expression values may anticipate mutational effects that will most likely affect a given human tissue [34], we compared a pathologic cell type with its normal counterpart and a whole organ with one of its subregions (Additional file 1: Additional Methods). For each analysis, only genes for which an expression value is available in both biological conditions were used. Thus, we performed GC calculations on other representative species genomes: Danio rerio, Caenorhabditis elegans, Saccharomyces cerevisiae and Escherichia coli (Additional file 1: Additional Methods).
Individual chromosome lengths in bp and cm are given in Table 2. Certain base counts and uncertain base composition estimations, given in Additional file 2: Table S1, were used to calculate each chromosome weight, obtaining the results shown in Table 2.
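The conversion from base pairs to physical length, and the restriction of GC content to certain bases, can be illustrated with a short sketch. It assumes the commonly cited rise of about 0.34 nm per base pair for B-form DNA; the exact constants and procedure are those of Additional file 1: Additional Methods and may differ. The function names are hypothetical.

```python
NM_PER_BP = 0.34   # assumed rise per base pair of B-form DNA, in nanometers
CM_PER_NM = 1e-7   # unit conversion: 1 nm = 1e-7 cm


def dna_length_cm(base_pairs, nm_per_bp=NM_PER_BP):
    """Physical length in centimeters of a linear DNA molecule."""
    return base_pairs * nm_per_bp * CM_PER_NM


def gc_percent(sequence):
    """GC percentage computed among certain bases only (A, C, G, T).

    Uncertain bases such as N are excluded from the denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    certain = gc + seq.count("A") + seq.count("T")
    if certain == 0:
        raise ValueError("sequence contains no certain bases")
    return 100.0 * gc / certain
```

Under this assumption, a molecule of one billion base pairs would measure about 34 cm, and a sequence such as `GCATNN` has a GC content of 50% because the two uncertain bases are ignored.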
The length and weight sums of the 24 chromosomes (22 autosomes and the X and Y chromosomes) were used to proportionately estimate the length and weight of the unplaced bases, improving whole genome calculation accuracy (Table 2). Data for the previous assembly GRCh The chromosomes varying to a greater extent between the two assembly versions are chromosomes 9 and Y GRCh38 has 2. Considering a mean length in a diploid cell of Considering a mean weight in a diploid cell of 6.
Applying all the calculations previously performed for the nuclear genome, the human mtDNA length, weight and GC content were estimated Table 2.
Among the other investigated species, the calculated chromosome numbers, total genome bp lengths and genomic GC contents (Table 3) are in accordance with previous reports (Additional file 5: Table S4). This value for whole human hippocampus and whole brain transcriptome maps is of 17, genes. Among the other investigated species, this value is of genes for D. For each biological condition, each mRNA GC absolute count was then multiplied by the corresponding expression value.
The sum of these values related to each transcriptome map gives the transcriptomic GC content (Table 3). In this work we have determined, to the best of our knowledge, basic parameters describing the normal human reference genome: the length, expressed in both bp and units of length (cm, m), the weight (in units of mass, pg) and the relative GC content (expressed as a percentage), for the whole human nuclear genome, for each chromosome and for mtDNA.
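The expression-weighted GC calculation described above can be sketched as follows. This is not the TGCA implementation: the function name is hypothetical, and normalizing the weighted GC count by the expression-weighted count of certain bases to obtain a percentage is an assumption of this sketch.

```python
def transcriptomic_gc_percent(mrnas):
    """Expression-weighted GC percentage of a transcriptome.

    mrnas: iterable of (sequence, expression_value) pairs, one per gene.
    Each mRNA's absolute GC count is multiplied by its expression value and
    the products are summed; the sum is then divided by the expression-weighted
    number of certain bases (an assumed normalization).
    """
    weighted_gc = 0.0
    weighted_length = 0.0
    for sequence, expression in mrnas:
        seq = sequence.upper().replace("U", "T")
        gc = seq.count("G") + seq.count("C")
        certain = gc + seq.count("A") + seq.count("T")
        weighted_gc += gc * expression
        weighted_length += certain * expression
    if weighted_length == 0:
        raise ValueError("no expressed sequence with certain bases")
    return 100.0 * weighted_gc / weighted_length


# Two toy mRNAs of equal length: one fully GC, one fully AU.
toy_transcriptome = [("GGCC", 3.0), ("AUAU", 1.0)]
weighted = transcriptomic_gc_percent(toy_transcriptome)
unweighted = transcriptomic_gc_percent([(s, 1.0) for s, _ in toy_transcriptome])
```

In this toy case the unweighted mRNA GC content is 50%, whereas the transcriptomic value is 75% because the GC-rich transcript is three times more expressed, which is the distinction between the mRNA-level and transcriptome-level indexes.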
We have based our calculations on the GRCh38 assembly, which is longer and more contiguous than previous reference assembly versions and provides, for the first time, a sequence-based representation of genomic features such as centromeres and telomeres [5], which, although variable among cell types and ages, would affect our estimates to a small extent.
However, human genetic diversity ranges from single-nucleotide variation to large chromosomal events [41, 42]. Our results are not far from previous rough estimates (Table 1); however, the more accurate determination of the human genome length and weight might offer interesting possibilities. Applying our analysis to other genomes would be useful to update these indexes. Another interesting possibility offered by knowledge of the human nuclear genome length is the derivation of the total human DNA volume, in order to estimate the efficiency of DNA in data storage, which was found to be on the order of 10^4-fold superior to that of the most advanced current hard disks (Additional file 7: Discussion).
The genome weight is a useful parameter for correlation with the DNA extraction yields obtained through different methods [45]. Regarding GC content analysis at the genomic level, our results are in agreement with a recent study [6].
Through the implementation of TGCA software, we have also determined for the first time the GC content at the mRNA and transcriptomic levels. The transcriptomic GC content is a novel concept we propose here: the GC percentage calculated in the mRNA amount actually expressed in a tissue. This has been confirmed also in D. Overall, it seems that the GC composition of highly and poorly expressed genes in specific tissues affects the mRNA GC content to a small extent and a global compensation between them may exist. Recent works conducted on DS subjects showed typical alterations of the metabolome and whole transcriptome [46, 47].
Chromosome 21 GC content is one of the closest to the mean genomic GC content, thus the presence of a third copy of chromosome 21 would not cause a great change in GC composition at the genomic level. For example, a recent work showed a high expression of high-GC-content mRNAs in the psoriasis lesion transcriptome, while resolving lesions had a low expression of these mRNAs [49]. More in-depth analysis will be needed to validate the use of these indexes as indicators in the comparison of disease versus normal conditions.
Genomic, mRNA and transcriptomic GC content determination can be useful in DNA and RNA sequencing analyses, where a GC content bias for the Illumina sequencing technology has been documented as likely introduced at the library preparation step, confounding DNA copy number studies and expression fold-change estimates [50].
In conclusion, we provide an update on fundamental human genome parameters and a first characterisation of the mRNA and transcriptome GC contents. Our results may represent a solid basis for further investigations on human structural and functional genomics [29, 51] while also providing a framework for the comparative analysis of other genomes.
Determination of the length, weight and relative GC content of a genome is subject to the accuracy of the genome assembly and to the variability existing among individuals [41]. Regarding mtDNA, although its sequence has been exactly determined, the mtDNA copy number per cell is difficult to estimate [52].
Regarding GC content at the mRNA and transcriptomic levels, the analysis is limited to genes for which an expression value together with the corresponding longest mRNA nucleotide sequence is publicly available.

Strachan T, Read A. Human Molecular Genetics. Garland Science.
Sequence and organization of the human mitochondrial genome.
Initial sequencing and analysis of the human genome.
The sequence of the human genome.