DNA structural features of eukaryotic TATA‐containing and TATA‐less promoters

Eukaryotic genes can be broadly classified as TATA‐containing and TATA‐less based on the presence of TATA box in their promoters. Experiments on both classes of genes have revealed a disparity in the regulation of gene expression and cellular functions between the two classes. In this study, we report characteristic differences in promoter sequences and associated structural properties of the two categories of genes in six different eukaryotes. We have analyzed three structural features, DNA duplex stability, bendability, and curvature along with the distribution of A‐tracts, G‐quadruplex motifs, and CpG islands. The structural feature analyses reveal that while the two classes of gene promoters are distinctly different from each other, the properties are also distinguishable across the six organisms.

Gene expression is the most fundamental biological process, in which the genetic information is used to create a phenotype. The initiation of transcription is the first and most crucial step in the regulation of gene expression. Promoters are the genomic sequences where the transcriptional machinery assembles, and the core promoter activity is conferred by the presence of short sequence motifs at specific positions relative to the transcription start site (TSS). The TATA box, Inr (Initiator), BRE (TFIIB recognition element), DPE (downstream promoter element), MTE (motif ten element), TCT (polypyrimidine initiator), and Sp1 (specificity protein 1) are well-characterized sequence motifs reported in several eukaryotes (reviewed in [1][2][3][4][5][6][7][8]). The majority of core promoter motifs serve as binding sites for components of the basal transcription machinery, in particular, TFIID and TFIIB [1]. Furthermore, a few noncanonical promoter elements such as 'CpG islands' [9] and 'ATG deserts' [10] have also been well characterized in mammals and are more prominent than the canonical promoter elements [4]. Although a variety of core promoter architectures have been revealed, the precise biochemical mechanisms that govern transcription initiation events from the constituent elements are still being elucidated [4]. The TATA box is the best-characterized core promoter element and is considered the most ancient, since it is present in organisms ranging from yeast to plants and metazoans. The TATA box is usually located at the −30 or −31 position relative to the TSS in metazoans [1,6], and in the −120 to −40 region relative to the TSS [2] or the −200 to −50 region relative to the 'ATG' start codon in yeast [11]. A wide variation is found in the percentage of TATA box-containing promoters reported in several studies, which is due to differences in the definition used for the TATA box, the window size considered for extracting TATA-containing promoters, and, to a lesser extent, the datasets used. Analyses of human promoter sequences report 2.0-2.6% [12], 24% [13], and 27% [14] of sequences as TATA box-containing promoters.
In eukaryotes, genes can be broadly classified as TATA-containing and TATA-less based on the presence or absence of a TATA box in their promoter sequences [11]. They have been studied in depth in yeast, and it is reported that TATA-containing genes are expressed at extremely high or low levels (indicating high plasticity), are stress-induced, and are under evolutionary selective pressure, when compared to TATA-less genes [11]. The two classes of genes also vary in their usage of transcription factors (SAGA vs. TFIID) in yeast [15]. Furthermore, in yeast, TATA-containing genes have a preference for subtelomeric location in the genome and have more duplicates [11,16]. Promoter sequences of the two classes of genes in yeast and flies have also been shown to differ in their nucleosome occupancy: TATA-less genes display canonical nucleosome occupancy, with nucleosome-free regions immediately upstream of the TSS (core promoter region), while the TATA-containing promoter regions are occupied by nucleosomes [17][18][19]. In mammals, the two classes adopt different strategies for transcription initiation (focused in TATA-containing and dispersed in TATA-less) [5,20].
Differences in core promoter nucleotide composition and basic gene features such as length of gene, mRNA, and introns in the two classes of genes have been reported earlier [13,21]. However, primary sequence inspection alone provides limited information about the promoter activity. Since the initiation of transcription involves not only the orchestration of different factors but also DNA-protein recognition, formation of stable complexes, and finally open complex formation, studying DNA structure gives more insight into promoter function. Several studies have shown that promoter regions of both prokaryotic and eukaryotic genomes have distinct structural features compared to their neighboring regions, as well as coding regions [22][23][24][25][26][27][28][29]. Recently it has been reported that the DNA structural features of promoter sequences are linked to gene expression variability in Saccharomyces cerevisiae [30]. Furthermore, local intrinsic structural features like groove shape, flexibility, and topography are shown to be more informative than the simple nucleotide sequence in understanding the DNA-binding specificities of transcription factors [31,32]. Compared to a simple nucleotide sequence, structural features have more information content: similar sequences have similar structural properties in a majority of cases but very different structures in a few cases, while divergent sequences can sometimes adopt equivalent local structure [31]. Along with the unique structural features, promoter regions of human have been shown to possess an unusually high presence of G-quadruplex-promoting sequences [33].
An earlier study has reported differences in DNA bendability of TATA-containing and TATA-less promoters in 11 yeast species based on the translation start site (TLS) data of yeast that was available at that time [34]. The prevalence of various DNA structural features in the two classes of promoters in different domains of life has not been compared to date. The current study aims to analyze three distinct structural properties, DNA duplex stability, bendability, and curvature, in TATA-containing and TATA-less promoters in six eukaryotic systems: S. cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, zebrafish, mouse, and human. Along with the structural features, the hexamer composition and occurrence of structural motifs (A-tracts and G-quadruplexes) have been analyzed. The similarities and differences in structural features of the two classes of promoters are discussed in this report.

Extraction of TATA-containing and TATA-less promoters in six eukaryotes
Based on several criteria such as a maximum sequence length (8 bp), minimal consensus sequence, confined upstream location, and conservation across orthologous upstream regions, the TATA box was defined as TATA(A/T)A(A/T)(A/G) (TATAWAWR) [11]. TATA-containing promoters in S. cerevisiae have been defined as those which contain the TATA consensus sequence TATAWAWR in the upstream location −200 to −50 relative to the TLS. In the present study, we have considered TATA-containing promoters as those sequences which contain a TATA box within the −150 to −1 region relative to the TSS. Table 1 shows the number of promoter sequences for both classes of genes in S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human.
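The classification rule above can be illustrated with a minimal Python sketch. It assumes that a match must lie entirely inside the −150 to −1 window and that only the given strand is scanned; the text does not state either choice, and the function name `has_tata_box` and the `tss_index` convention are hypothetical.

```python
import re

# IUPAC codes: W = A or T, R = A or G, so TATAWAWR expands as below.
TATA_BOX = re.compile(r"TATA[AT]A[AT][AG]")

def has_tata_box(seq, tss_index, upstream=150):
    """Return True if TATAWAWR occurs within -upstream..-1 relative to the TSS.

    seq       : promoter sequence (given strand only; an assumption of this sketch)
    tss_index : 0-based index of the TSS base, so position -1 is seq[tss_index - 1]
    """
    start = max(0, tss_index - upstream)
    window = seq[start:tss_index].upper()
    return TATA_BOX.search(window) is not None

# Usage: a 1000-nt toy sequence with the TSS at index 500 and a TATA box at -30.
toy = "G" * 470 + "TATAAAAG" + "C" * 22 + "A" + "C" * 499
print(has_tata_box(toy, tss_index=500))
```

Changing `upstream` to a different value shows how the window size alters the fraction of promoters classified as TATA-containing, which is one source of the variation between studies noted in the Introduction.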

DNA structural feature calculations
The three DNA structural features, DNA duplex stability, bendability (two models; DNase 1 sensitivity and nucleosome positioning preference), and intrinsic curvature, have been chosen as they are biologically relevant and the information content of each feature is different. The stability of a double-stranded DNA molecule can be expressed as the sum of free energy or average free energy (AFE) of its constituent base-paired dinucleotides. The dinucleotide energy values corresponding to the 16 dinucleotide steps (or 10 unique dinucleotides) are taken from the unified parameters obtained from melting studies on 108 oligonucleotides [38]. The protein-induced bendability of a given sequence has been calculated by using experimentally derived bendability models. Two different trinucleotide models based on DNase 1 sensitivity [39] and nucleosome positioning preference [40] have been used to estimate bendability. Intrinsic static curvature has been computed with the in-house software NUCRADGEN [41], using wedge angles derived from gel retardation studies (BMHT parameters) [42]. Structural properties have been calculated using a one-nucleotide sliding window, converting each promoter sequence into overlapping di/trinucleotide feature values. Window sizes of 15, 30, and 75 have been used for calculating stability, bendability, and curvature, respectively [26,29]. To obtain the structural profiles, all promoter sequences in each class were aligned relative to their TSSs, and the sequence information was then converted to numerical values. The numerical values obtained for all sequences were averaged at each nucleotide position to get the mean structural property for each system.
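The profile calculation described above (overlapping di/trinucleotide values, a window that slides one nucleotide at a time, then position-wise averaging over TSS-aligned promoters) can be sketched as follows. This is only an illustration: `TOY_TABLE` is a made-up dinucleotide table, not the parameters of [38], and counting the window in steps rather than nucleotides is an assumption of this sketch.

```python
def step_values(seq, table, k):
    """Map a sequence to its overlapping k-mer values (k=2 or k=3)."""
    return [table[seq[i:i + k]] for i in range(len(seq) - k + 1)]

def sliding_mean(values, window):
    """Mean over a window that advances one position at a time."""
    if len(values) < window:
        return []
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide by one step
        out.append(total / window)
    return out

def mean_profile(profiles):
    """Average equal-length, TSS-aligned profiles position by position."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

# Hypothetical table: -1 unit per G/C base in the step (illustration only).
TOY_TABLE = {a + b: -1.0 * ((a in "GC") + (b in "GC"))
             for a in "ACGT" for b in "ACGT"}

profiles = [sliding_mean(step_values(s, TOY_TABLE, 2), window=3)
            for s in ("AAAACCCC", "CCCCAAAA")]
print(mean_profile(profiles))
```

With real parameters, the same three functions would be called with `window=15` for a dinucleotide stability table and `window=30` for a trinucleotide bendability table; curvature needs a geometric model and is not covered by this sketch.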
A-tract, G-tract, G4-motif, and CpG island calculations
A-tracts consist of stretches of a minimum of four consecutive A:T base pairs without a TA dinucleotide step. A-tracts of length more than five can act as antinucleosomal sequences [43]. In this study, A-tracts of length seven (A7 or T7) were searched in the promoter regions. A G-quadruplex is a four-stranded DNA structure with stacked guanine tetrads at its core [44]. G-quadruplex-forming sequences are predicted from the primary DNA sequence. Putative G-quadruplex motifs were computed using a simple pattern match, G(3-5)N(1-7)G(3-5)N(1-7)G(3-5)N(1-7)G(3-5) or C(3-5)N(1-7)C(3-5)N(1-7)C(3-5)N(1-7)C(3-5) [45], where the numbers in parentheses give the allowed run lengths and N indicates the loop regions, which can contain any nucleotide. CpG islands (CGIs) are short interspersed DNA chunks that deviate significantly from the average genomic pattern by being GC-rich, CpG-rich, and predominantly nonmethylated. CpG islands were calculated using the Takai and Jones (2002) algorithm [9], and were defined as regions longer than 500 bp in size, with a GC composition ≥ 55% and an observed/expected CpG ratio of ≥ 0.65.
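The three motif definitions above translate directly into pattern matches and simple counts. The sketch below is an assumption-laden illustration, not the software used in the study: the helper names are hypothetical, the observed/expected CpG ratio uses the common formula (CpG count × length) / (C count × G count), and `meets_cgi_criteria` only tests the thresholds on one given region rather than implementing the window scanning and merging of the algorithm in [9].

```python
import re

A_TRACT = re.compile(r"A{7}|T{7}")  # A7 or T7

def _g4_pattern(base):
    """Four runs of 3-5 `base` separated by loops of 1-7 arbitrary nucleotides."""
    run, loop = base + "{3,5}", "[ACGT]{1,7}"
    return loop.join([run] * 4)

# G-rich motif on the given strand, or its C-rich complement.
G4_MOTIF = re.compile(_g4_pattern("G") + "|" + _g4_pattern("C"))

def has_a_tract(seq):
    return A_TRACT.search(seq.upper()) is not None

def has_g4_motif(seq):
    return G4_MOTIF.search(seq.upper()) is not None

def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for one region."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = seq.count("CG") * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def meets_cgi_criteria(seq, min_len=500, min_gc=0.55, min_obs_exp=0.65):
    # Treating exactly min_len as acceptable is an assumption of this sketch.
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp

print(has_g4_motif("ATGGGAGGGTGGGAGGGTA"), cpg_stats("CGCGCGCG"))
```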

Results and Discussion
Six eukaryotic genomes, S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human, are considered for this analysis as they are good representative model systems for understanding aspects of eukaryotic transcription at different levels. These six systems differ in their genomic GC content and nucleotide composition, are well studied, and their experimentally validated TSS data have been published. The promoter sequences of the six eukaryotes are classified as TATA-containing and TATA-less promoters based on the presence or absence of a TATA box in the −150 to −1 promoter region relative to the TSS (as described in Datasets and methods). The TATA-containing core promoters constitute ~17% of the total promoters in S. cerevisiae, ~9% in C. elegans, 14% in D. melanogaster, ~10% in zebrafish, and 3% in human and mouse (Table 1). The percentage of TATA-containing promoter sequences varies from 46% in S. cerevisiae to ~14% in human if the −500 to +500 region spanning the TSS is considered [24]. The GC content of the −500 to +500 region in the two systems is found to increase from 38.5% to 53.4% [24]. A comparison of the GC content in the −500 to −1 region in TATA-containing and TATA-less promoters reveals that it is considerably lower in TATA-containing promoters than in TATA-less promoters in mouse and human, while the difference is much smaller for the other systems (Table 1). However, in S. cerevisiae, the GC content of this −500 to −1 upstream region in TATA-less promoters is found to be slightly lower than in TATA-containing promoters, due to a large number of TATA-less promoters containing TATA box elements outside the core promoter region. The structural and compositional features of TATA-containing and TATA-less promoters of the six eukaryotes, S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human, have been analyzed to understand similarities and differences between them.

Table 1. TSS information for S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human is obtained from different sources (Datasets and methods). TATA-containing promoters in the six systems are defined as those which contain the consensus motif TATAWAWR in the −150 to −1 region relative to the TSS. The percentages of TATA-containing and TATA-less promoters in the six systems are indicated in parentheses. GC percentages of the −500 to +500 and −500 to −1 regions relative to the TSS for the two classes of promoters are also given.

TATA-containing and TATA-less promoters have distinct structural properties
The average structural properties, DNA duplex stability (or average free energy), bendability (using two models; DNase 1 sensitivity and nucleosome positioning preference), and curvature, of TATA-containing and TATA-less promoter sequences of yeast, invertebrates, and mammals are computed as reported in earlier studies [26,29].
The average stability profiles of TATA-containing promoters are different from those of TATA-less promoter regions in all six eukaryotes (Fig. 1). The TATA-containing promoters are less stable compared to TATA-less promoters, but the span of the low stability region varies in each eukaryote. In S. cerevisiae, C. elegans, and zebrafish, the TATA-containing promoters show low stability in approximately the −150 to −1 region relative to the TSS compared to TATA-less promoters. In mouse and human, a significant difference between the two classes of promoters is observed across the whole region. The TATA-containing promoters are less stable in the entire −500 to +500 region, with two sharp peaks at the −30 and −1 regions. The stability of DNA is directly dependent on AT/GC content. The lower stability of TATA-containing promoters in mammals is due to their lower GC content as compared to TATA-less promoters (Table 1). Interestingly, the upstream region (−500 to −150) in TATA-containing promoters in S. cerevisiae shows greater stability, and this correlates with its higher GC content in the −500 to −1 region when compared to the TATA-less promoters, as discussed above and shown in Table 1. Although there is variation in the shapes of the AFE profiles, both classes of promoter regions show low stability regions or peaks irrespective of the genome. For the genetic information in DNA to be read by processes such as replication, repair, recombination, and transcription, the DNA has to be brought transiently into a single-stranded form. The presence of low stability in the promoter regions of both classes of gene promoters in eukaryotes shows the significance of DNA meltability in genome transcription.
The bendability and curvature profiles also show differences between TATA-containing and TATA-less promoters. The bendability profiles of the TATA-containing and TATA-less promoters have been analyzed using two models, the DNase 1 sensitivity model and the nucleosome positioning preference model, as seen in Fig. 1. The DNase 1 sensitivity model shows that the TATA-containing promoters are more flexible (or more bendable) compared to TATA-less promoters in the core promoter regions in S. cerevisiae, C. elegans, D. melanogaster, and zebrafish (Fig. 1), while Satchwell's nucleosome positioning preference model shows TATA-containing promoter regions in D. melanogaster, mouse, and human to be more rigid. The rigidity of TATA-less promoters (estimated by the DNase 1 sensitivity model) at ~100-200 bp upstream of the start codon in 11 yeast species has been reported earlier [34]. Our results with TSS data in S. cerevisiae and other invertebrates are consistent with this earlier result. DNA bendability describes the anisotropic bending of duplex DNA in the presence of various binding factors. Bendability or flexibility can have two completely different roles. Greater flexibility is important for several DNA-binding proteins, such as the TATA-binding protein [46], the catabolite gene activator protein (CAP) [47], and integration host factor (IHF) [48], which play a role in transcription and genome organization, and the dinucleotide steps TA and CA/TG are frequent in sequences with increased flexibility [49,50]. Rigid DNA in promoter sequences can play several roles in vivo. Rigid or less bendable regions disfavor the formation of nucleoids in prokaryotes and nucleosomes in eukaryotes, making these regions 'nucleosome depleted' and assisting in the assembly of the transcriptional machinery. The rigidity of DNA in promoter regions provides greater scope for sliding of DNA-binding proteins along its length [34]. Furthermore, the higher energy cost required for DNA bending may play a role in open complex formation during transcription initiation, by making the DNA resistant to bending and aiding easy escape of the transcription machinery from the promoter region [24].
The average intrinsic curvature profiles of TATA-containing and TATA-less promoter regions of S. cerevisiae, C. elegans, and D. melanogaster show that both classes of promoter sequences are curved in the vicinity of the TSS, with TATA-containing promoters being slightly more curved (Fig. 1). TATA-containing promoters in D. melanogaster are more curved compared to TATA-less promoters. The TATA-containing and TATA-less promoters in mouse and human differ distinctly, with TATA-containing promoters being more curved throughout the −500 to +500 region. The biological role of intrinsic curvature of DNA was established in the kinetoplast DNA of trypanosomes [51], and curved DNA can enhance the transcription rate in bacteria [52,53]. The importance of curvature was recognized in promoter regions of pathogenic bacteria and thermo-sensing bacteria [54,55], but it is less apparent in higher eukaryotic promoters.
Of the four structural features studied, stability and curvature show the most significant differences between the two classes of promoters in mouse and human. In order to get a quantitative estimate of differences in structural properties of TATA-containing and TATA-less promoters in the core promoter region, −150 to −1 relative to the TSS, in all six eukaryotes, cumulative distribution function plots were examined (Fig. S1). These plots, as well as the corresponding P-values (shown in Table 2), suggest that the four structural features, average free energy, bendability (two models), and curvature, are significantly different in the core promoter regions of all six systems, except for curvature in S. cerevisiae. The differences in the structural properties of these two classes of promoters may be attributed to differences in their nucleotide composition as well as sequence, and hence these were examined further.
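A comparison of two cumulative distributions can be illustrated with an empirical CDF and the maximum vertical distance between the two curves (the two-sample Kolmogorov-Smirnov statistic). The text does not name the statistical test behind the P-values in Table 2, so the choice of this statistic is purely an assumption of the sketch, and the input values are invented.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function giving the fraction of `sample` values <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(x) - fb(x)) for x in set(a) | set(b))

# Invented per-promoter mean feature values for the two classes.
tata_containing = [-1.10, -1.15, -1.20, -1.05]
tata_less = [-1.30, -1.35, -1.25, -1.40]
print(ks_statistic(tata_containing, tata_less))
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all; converting it to a P-value requires the sample sizes and is left out here.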

TATA-containing promoters have distinct nucleotide composition compared to TATA-less promoters
The characteristic differences in the structural properties of TATA-containing and TATA-less promoter regions in the six eukaryotes can arise from differential base composition and the prevalence of some selected oligonucleotides. Hexanucleotide composition has been calculated in the −150 to −1 core promoter region as well as the −500 to −1 region for the two classes of promoters. The unique hexamer words in promoter regions of TATA-containing and TATA-less promoters are calculated and compared (Figs 2 and S2). The TATA-containing promoters show biases for TA-containing hexamer steps (TATATA, ATATAT, ATATAA, and TATAAA) in the −150 to −1 region (Fig. 2) as well as proximal regions (Fig. S2) in all six systems. As expected, the −150 to −1 regions show a high prevalence of TATA motif-containing hexamers in all systems, while there is no common motif observed in TATA-less promoters. The core promoter regions of the TATA-less class in C. elegans show a preponderance of TTTTTT, whereas mammals show a preference for GGGCGG, CGCCCC, CGGGGC, CCCCGC, GGCGGC, and CGCCGC. Zebrafish core promoters in the TATA-less class show a different trend, with a high frequency of the hexamers CTCTCT and GAGAGA. The −500 to −1 regions in the TATA-containing class in mammals have a preference for AAAAAA (as well as TTTTTT in human), while GC-rich hexamers (SP1 elements) are overrepresented in TATA-less promoters (Fig. S2).
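Hexamer composition of a promoter class can be tallied with overlapping six-nucleotide words, as in the sketch below. The function names are hypothetical, and skipping words that contain ambiguous bases is an assumption; the text does not describe how such positions were handled.

```python
from collections import Counter

def kmer_counts(seqs, k=6):
    """Count overlapping k-mers over a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip words with N or other codes
                counts[kmer] += 1
    return counts

def kmer_frequencies(seqs, k=6):
    """Convert counts to fractions of all counted k-mers."""
    counts = kmer_counts(seqs, k)
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}

# Usage: rank the hexamers of a toy "core promoter" set.
core_regions = ["GCTATATATAGC", "CCTATAAAAGGG"]
print(kmer_counts(core_regions).most_common(3))
```

Running the same tally separately on the −150 to −1 and −500 to −1 slices of each class gives the two kinds of comparison described above.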
The hexamer distribution in the two classes of promoters is different and quite distinct in the six eukaryotes, even in the −500 to −1 region (Fig. 2). The results suggest that the TATA box-containing promoters differ in the composition of sequence motifs over a broader region that extends beyond the TATA box-containing core, especially in mammals. Furthermore, we have studied the frequency of occurrence of different structural motifs, A-tracts, G-tracts, and G-quadruplexes, along with CpG islands in the two classes of promoter sequences.
TATA-containing and TATA-less promoters have different structural motifs
Table 3 shows the distribution of oligo-A-tracts, G-quadruplex-forming sequences, and CpG islands in the two classes of promoter regions of the six eukaryotic organisms. The percentages of three promoter regions, spanning from −500 to +500, −150 to −1, and −50 to −1, with at least one occurrence of A-tracts (A7 or T7) and G-quadruplex-favoring sequences (G(3-5)N(1-7)G(3-5)N(1-7)G(3-5)N(1-7)G(3-5) or C(3-5)N(1-7)C(3-5)N(1-7)C(3-5)N(1-7)C(3-5)) have been calculated. CpG islands have been computed using a 500-nt window, so their occurrence is only given for the −500 to +500 region.
[...] A-tracts are more prevalent in TATA-less promoters. In mouse and human TATA-containing promoters, A-tracts are comparatively more frequent than in TATA-less promoters (Table 3). G-quadruplex motifs are rarely present in both TATA-containing and TATA-less promoter regions of S. cerevisiae, C. elegans, D. melanogaster, and zebrafish (Table 3). However, the TATA-less promoters of mouse and human are significantly enriched in G-quadruplexes as compared to TATA-containing promoters (Table 3). Figure 3 shows the positional distribution of G-quadruplex-favoring sequences for TATA-containing and TATA-less promoters in mouse and human. In mouse and human, ~45% and ~42% of TATA-less promoters (in the −500 to +500 region relative to the TSS) are associated with G-quadruplexes, while only ~27% and ~19% of TATA-containing promoters contain a G-quadruplex motif. The G-quadruplex density in TATA-less promoters is significantly higher than in TATA-containing promoters in both mouse and human, particularly in the −200 to −1 region (Fig. 3). G-quadruplex structures have the potential to influence transcription in both positive (when present on the anti-sense strand) and negative (when present on the sense strand) ways [44,56], and they could disfavor assembly into nucleosomal structures [57]. The TATA-containing and TATA-less promoters also show differences in the distribution of CpG islands, which are known to occur frequently in mouse and human. The TATA-less promoter regions (−500 to +500) of mouse and human are significantly enriched with CpG islands, with ~51% and 42% containing these elements, while only ~25% and ~19% of TATA-containing promoters contain these elements (Fig. 4). CpG islands and high GC content in the mammalian promoter region can favor open chromatin conformation and support paused transcription genome-wide [58]. Furthermore, GC-rich sequences can form left-handed Z-DNA at alternating purine-pyrimidine stretches [59].

Table 3. Percentage of promoter regions with at least one occurrence of structural motifs such as A-tracts (A7 or T7) and G-quadruplex-favoring sequences (G(3-5)N(1-7)G(3-5)N(1-7)G(3-5)N(1-7)G(3-5) or its complementary sequence). Three promoter regions, spanning −500 to +500, −150 to −1, and −50 to −1 with respect to the TSS (TSS at position '0'), have been considered. CpG islands are calculated using the 'CpG island search' program and a 500-nt window [9]. TATA-less promoters in mouse and human are enriched with G-quadruplex motifs and CpG islands as compared to TATA-containing promoters.

Fig. 3. Positional distribution of G-quadruplex motifs in TATA-containing and TATA-less promoter regions of mouse and human. The regular expression G(3-5)N(1-7)G(3-5)N(1-7)G(3-5)N(1-7)G(3-5) or C(3-5)N(1-7)C(3-5)N(1-7)C(3-5)N(1-7)C(3-5) is searched in the −500 to +500 region relative to the TSS and summed for each 10-nucleotide bin. To compare the two classes of promoters, G-quadruplex motif density has been calculated by dividing the number of promoters that contain a G-quadruplex motif in each bin by the number of promoter sequences in each class. TATA-less promoters of mouse and human are enriched with G-quadruplexes in the vicinity of the TSS, as compared to TATA-containing promoters.
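The per-bin density described for Fig. 3 can be sketched as follows. Assigning each motif to the bin of its first nucleotide, and counting non-overlapping matches only, are assumptions of this sketch; the function name `g4_bin_density` is hypothetical.

```python
import re

def _g4_pattern(base):
    """Four runs of 3-5 `base` separated by loops of 1-7 arbitrary nucleotides."""
    run, loop = base + "{3,5}", "[ACGT]{1,7}"
    return loop.join([run] * 4)

G4_MOTIF = re.compile(_g4_pattern("G") + "|" + _g4_pattern("C"))

def g4_bin_density(promoters, bin_size=10):
    """Fraction of promoters with a G4 motif starting in each bin.

    promoters: equal-length sequences aligned on the TSS, so that the same
    index corresponds to the same position relative to the TSS.
    """
    if not promoters:
        return []
    length = len(promoters[0])
    n_bins = (length + bin_size - 1) // bin_size
    hits = [0] * n_bins
    for seq in promoters:
        # A promoter contributes at most once to a given bin.
        bins = {m.start() // bin_size for m in G4_MOTIF.finditer(seq.upper())}
        for b in bins:
            hits[b] += 1
    return [h / len(promoters) for h in hits]

# Usage: two toy 50-nt "promoters", one with a motif starting at index 20.
with_motif = "A" * 20 + "GGGAGGGAGGGAGGG" + "A" * 15
without_motif = "A" * 50
print(g4_bin_density([with_motif, without_motif]))
```

Dividing by the class size, rather than reporting raw counts, is what makes the much smaller TATA-containing class comparable with the TATA-less class.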
Overall the composition and structural features of TATA-containing and TATA-less promoter regions are found to be distinctly different even in regions outside the TATA elements. The differences become more prominent going from S. cerevisiae to mammals and may be due to lower mutation rates in TATA-containing promoters. In the case of mammals, it has been reported that TATA-containing promoters tend to evolve more slowly in core promoter as well as upstream regions than the promoters that lack a TATA box [5,60].

Conclusions
The sequence-dependent structural properties of di- and trinucleotides in DNA lead to variations in the structure at a higher level and play a role in protein binding, DNA melting, nucleosome organization, and gene regulation. The structural features of TATA-containing and TATA-less promoters are distinctly different in lower eukaryotes. The TATA-containing core promoters are less stable, more flexible, and more curved compared to TATA-less promoters in S. cerevisiae, C. elegans, and D. melanogaster. In mouse and human, stability and curvature are distinguishing features of TATA-containing and TATA-less promoters. Significant differences are also observed in the distribution of sequence motifs, such as A-tracts, G-quadruplexes, and CpG islands, in TATA-containing and TATA-less promoters in mouse and human. The TATA-less promoters in mammals are characterized by a high prevalence of G-quadruplexes and CpG islands. Overall, the work reported in this article gives a broad picture of the DNA structural and compositional features of the two classes of promoters in different eukaryotes and provides interesting insight into their architecture.

Fig. 4. TATA-less promoters are enriched in CpG islands as compared to TATA-containing promoters in mouse and human. In mouse and human, ~51% and ~42% of TATA-less promoters (in the −500 to +500 region relative to the TSS) are associated with CpG islands, while TATA-containing promoters have only 25% and 19%, respectively.

Supporting information
Additional Supporting Information may be found online in the supporting information tab for this article: Figure S1. Cumulative distribution function of structural features for TATA-containing and TATA-less promoters in the six eukaryotic systems: S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human. Figure S2. Hexanucleotide composition of TATA-containing and TATA-less promoter of different eukaryotic systems.