Volume 583, Issue 11 p. 1713-1720
Minireview

CpG islands – ‘A rough guide’

Robert S. Illingworth, Adrian P. Bird

Wellcome Trust Centre for Cell Biology, Michael Swann Building, University of Edinburgh, Mayfield Road, Edinburgh EH9 3JR, United Kingdom

Corresponding author: Robert S. Illingworth. Fax: +44 0131 650 5379.
First published: 17 April 2009
Citations: 608

Abstract

Mammalian genomes are punctuated by DNA sequences, termed CpG islands (CGIs), that contain an atypically high frequency of CpG sites. CGIs generally lack DNA methylation and associate with the majority of annotated gene promoters. Many studies, however, have identified examples of CGI methylation in malignant cells, leading to improper gene silencing. CGI methylation also occurs in normal tissues and is known to function in X-inactivation and genomic imprinting. More recently, differential methylation has been shown between tissues, suggesting a potential role in transcriptional regulation during cell specification. Many of these tissue-specific methylated CGIs localise to regions distal to promoters, the regulatory function of which remains to be determined.

1 Introduction

In mammals, the majority of CpG pairs are chemically modified by the covalent attachment of a methyl group to the C5 position of the cytosine ring. This modified residue is distributed throughout the majority of the genome, including gene bodies, endogenous repeats and transposable elements, and functions to repress transcription [1-3]. Methylcytosine spontaneously deaminates to thymine, resulting in the under-representation of CpG (21% of that expected in the human genome) [4]. The genome is, however, punctuated by non-methylated DNA sequences called CpG islands (CGIs), which have an elevated G + C content and little CpG suppression [5-7]. These conspicuous unique sequences are approximately 1 kb in length and overlap the promoter regions of 60–70% of all human genes [4, 6, 8-10].

CGIs have been shown to colocalise with the promoters of all constitutively expressed genes and approximately 40% of those displaying a tissue restricted expression profile [8, 11]. CGI promoters appear to define a class of transcription start site (TSS) which can initiate from multiple positions. The more tissue restricted class of non-CGI promoters is generally associated with a single well defined initiation site (reviewed in [12]). Promoter association accounts for the uneven distribution of CGIs in the genome, showing preferential localisation to gene rich loci [4].

Consistent with promoter association, CGIs are generally characterised by a transcriptionally permissive chromatin state [10, 13, 14]. These findings suggest that CGIs may provide a means to distinguish gene promoter regions from the large proportion of transcriptionally irrelevant intergenic chromatin. Support for this idea was provided by an early study investigating the distribution of transcription factor (TF) binding sites in a small panel of human genes [15]. Whilst binding sites were slightly enriched in promoter proximal sequences, they were also highly abundant throughout the genome (approximately 16 sites per 100 bp). This study concluded that the presence of binding sites alone was insufficient to identify promoters, which supports the idea that CGIs may serve as TF “landing lights” in the darkness of the nucleus [15, 16].

Not all CGIs localise to annotated TSSs (Fig. 1 – example iii); however, it is interesting to note that detailed investigation of intragenic CGIs has led to the identification of previously unanticipated promoters [17-19]. This raises the possibility that all CGIs represent sites of transcriptional initiation, many of which have yet to be characterised. Indeed, it is possible that certain alternative transcriptional start sites are utilised in a highly tissue restricted fashion, and consequently have escaped annotation. Several transcripts initiate from intragenic CGIs and have been shown to be expressed during specific developmental stages [17, 19].

Fig. 1. CpG islands located within a region of human chromosome 19. The upper panel illustrates a 65 kb portion of human chromosome 19 (17195000–17260000) which contains five annotated genes (blue bars) and four CpG islands. The promoters of OCEL1, NR2F6 and ANKLE1 overlap with CGIs (i, iii and iv) and an additional CGI (ii) localises to the third exon of NR2F6. The classical sequence parameters applied to CGI prediction are illustrated (dashed red lines) for CpG (observed/expected; CpG[o/e] = 0.6) and G + C base composition (GC% = 50%). The lower panel represents an enlarged view of four 6 kb regions (i–iv) spanning each CGI and illustrates the distribution of CpG sites (vertical black strokes) relative to the annotated genes.

2 CGI identification

CGIs were first identified by digestion of mouse genomic DNA using the methyl-CpG sensitive restriction enzyme HpaII (CCGG recognition site). A small portion of the genome, composed of very highly fragmented DNA, was found to be derived from sequences containing clusters of non-methylated CpG sites [5, 6, 20]. Quantification of these digestion products, combined with sequence analysis and correction for contaminating DNA indicated that these were derived from approximately 26 300 discrete CGIs [21, 22]. These sequences were characterised as at least 200 bp in length and with a G + C content of 50% and a CpG frequency (observed/expected; [o/e]) of 0.6 (Fig. 1) [7, 8].
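
The classical sequence criteria can be made concrete with a short calculation. The following is a minimal sketch in Python rather than the procedure of any study cited here. It assumes the commonly used definition CpG[o/e] = (number of CpG × sequence length) / (number of C × number of G), and it treats all three thresholds as inclusive. The function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (G + C fraction, CpG observed/expected) for a DNA string.

    Assumed definition: CpG[o/e] = (#CpG * N) / (#C * #G), where N is
    the sequence length. Returns 0.0 for o/e if C or G is absent.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    cpg_oe = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, cpg_oe


def meets_classical_criteria(seq, min_len=200, min_gc=0.50, min_oe=0.6):
    """Check the thresholds quoted in the text (inclusive by assumption)."""
    gc_fraction, cpg_oe = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and cpg_oe >= min_oe


# Toy sequences, not real genomic DNA
cpg_rich = "ACGT" * 50   # 200 bp, 50% G + C, every C followed by G
at_rich = "AT" * 100     # 200 bp, no C or G at all
print(meets_classical_criteria(cpg_rich), meets_classical_criteria(at_rich))
```

On these toy inputs the CpG-rich sequence satisfies all three thresholds and the AT-only sequence fails. Real predictors apply such tests to windows sliding along a chromosome rather than to isolated fragments.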

The completion of the human genome project in 2001 facilitated in silico CGI prediction [4]. Values for length and base composition similar to those identified by Gardiner-Garden and Frommer are routinely employed by the major genome browsers to annotate CGIs (Table 1). These thresholds are somewhat arbitrary, however, and varying them can profoundly alter prediction accuracy [23-25]. To reduce the extraneous inclusion of non-CGI sequences, Takai and Jones investigated the effect of increasing the minimum length, CpG[o/e] and G + C composition to 500 bp, 0.65 and 55%, respectively. This increased stringency reduced the number of identified islands by approximately 90% and largely excluded contaminating Alu elements. This algorithm also reduced the number of gene promoter associated islands, suggesting that bona fide CGIs were also being discarded [24].

Table 1. Overview of CpG island prediction algorithms
Database/prediction | Length | G + C | CpG[o/e] | RM a | Comments | Reference
ENSEMBL | ⩾400 | ⩾50% | ⩾0.6 | N | Stringent length constraint | [88]
NCBI relaxed | ⩾200 | ⩾50% | ⩾0.6 | N | Total CGIs = 307 193 | -
NCBI strict | ⩾500 | ⩾50% | ⩾0.6 | N | Total CGIs = 24 163 | -
UCSC b | >200 | ⩾50% | >0.6 | Y | Total CGIs = 28 226 | [89]
EMBOSS | UD c | UD | UD | NA | Variable parameters | [90]
CpGProD | >500 | >50% | >0.6 | Y | Total CGIs = 76 793 | [23]
CpGcluster | NA | NA | NA | N | Clustering; total = 197 727 | [25]
  • a RM, repeat masked; Y, yes; N, no; NA, not applicable.
  • b Parameters used for CGI identification for the ENCODE project although totals vary due to repeat masking differences between hg17 and hg18 builds [87].
  • c UD, user defined.

Repeat elements such as “young” Alus resemble the base composition of CGIs and significantly contribute to the number of false positives identified [24]. Preliminary computational analysis of the human genome sequence identified 50 267 CGIs, of which only 28 890 were unique [4]. Many of the multi copy sequences could be removed by screening against known classes of repeats identified in the Repbase database [26]. This database is iteratively improved as the repeat repertoire is updated. Reanalysis of the human genome sequence in 2002 resulted in the loss of a further 1890 false positives, suggesting a more conservative estimate of 27 000 CGIs [27]. The beneficial consequences of repeat masking can be illustrated by the example of a low copy repetitive element that is related to the adenovirus sequence located on human chromosomes 4 and 19 [28]. This element is identified as a single CGI or a tandem cluster of repeated CGIs by ENSEMBL and NCBI, but is recognised as a repeat and eliminated by the algorithm employed by the UCSC browser (Table 1).

The total number of predicted CGIs is highly variable depending on the exact sequence parameters applied. NCBI Mapview maintains two different permutations of these parameters to provide a relaxed and stringent identification of CpG islands (Table 1). “NCBI strict” predicts 24 163 unique CGIs whereas their relaxed criterion identifies more than 307 000. This variability arises due to the following factors: (1) the application of arbitrary thresholds, (2) no account being taken of the heterogeneity of CGIs and (3) the fact that DNA sequence based prediction methods necessarily ignore DNA methylation status.
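
The sensitivity of island counts to the chosen thresholds can be illustrated with a naive sliding-window scan. This is a sketch under stated assumptions and not the algorithm used by NCBI or any other browser: the window length is set to the minimum island length, passing windows that overlap are merged, and the step size of 50 bp is arbitrary. The two parameter sets are the “NCBI relaxed” and “NCBI strict” rows of Table 1. The function names and the toy sequence are hypothetical.

```python
def cpg_stats(seq):
    """Return (G + C fraction, CpG observed/expected) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    gc = (c + g) / n if n else 0.0
    oe = seq.count("CG") * n / (c * g) if c and g else 0.0
    return gc, oe


def scan_islands(seq, min_len, min_gc, min_oe, step=50):
    """Slide a window of min_len bp and merge overlapping passing windows.

    Returns a list of (start, end) tuples, 0-based and end-exclusive.
    """
    islands = []
    for start in range(0, len(seq) - min_len + 1, step):
        gc, oe = cpg_stats(seq[start:start + min_len])
        if gc >= min_gc and oe >= min_oe:
            end = start + min_len
            if islands and start <= islands[-1][1]:
                # Extend the previous island instead of opening a new one
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands


# Toy sequence: a 200 bp CpG-rich stretch flanked by 1 kb of AT repeats
toy = "AT" * 500 + "CG" * 100 + "AT" * 500

relaxed = scan_islands(toy, min_len=200, min_gc=0.50, min_oe=0.6)
strict = scan_islands(toy, min_len=500, min_gc=0.50, min_oe=0.6)
print(relaxed, strict)  # [(900, 1300)] []
```

In this toy case the relaxed parameters call one island, whose boundaries extend beyond the CpG-rich stretch itself, whereas the strict parameters call none because no 500 bp window reaches 50% G + C. Changing only the minimum length is therefore enough to remove a short CpG cluster from the predicted set.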

To overcome these problems we recently developed a novel technique to select CGI sequences based on the empirical criterion of non-methylated CpG clustering [29]. A recombinant CXXC domain from murine MBD1 with specific affinity for non-methylated CpG pairs was used to purify CGIs from total genomic DNA [29-32]. The sequenced library identified in excess of 17 000 CGIs in human blood DNA. Extrapolating the identified CGI sequences to annotated genes suggests that the complete human somatic cell CGI complement is approximately 25 000 [29].

Most computational prediction and sequence selection techniques identify a CGI complement of between 24 000 and 27 000. Despite the apparent concordance between these methods, many identified CGIs are not common between the different sets. This inconsistency may be addressed by the incorporation of multiple layers of information into prediction methods, including DNA methylation status and chromatin modifications. CGIs generally associate with domains of chromatin containing hyperacetylated nucleosomes consistent with a transcriptionally permissive state [10, 14, 33]. In the future, epigenetic information may facilitate detection methods allowing current, somewhat arbitrary, thresholds to be replaced by accurate contextual information [34].

3 The origin of CpG islands

The mechanism by which CGIs remain hypomethylated during the period of global de novo methylation during early development remains unclear [35, 36]. The characteristic clustering of CpG sites is a consequence of immunity against de novo methylation during the earliest stages of mammalian development. A simple suggestion would be that CGIs are intrinsically refractory to de novo methylation by DNA methyltransferases (DNMT) due to their DNA sequence (Fig. 2A). This seems unlikely, however, as CGIs contain a substantially elevated density of CpG sites, the preferred substrate of the DNMT enzymes [37]. Moreover, CGIs located on the female inactive X chromosome and those of certain cultured mammalian cells readily acquire DNA methylation [2, 38].

Fig. 2. Potential mechanisms leading to CGI hypomethylation. (A) CGIs remain hypomethylated via intrinsic sequence properties which exclude the action or association of DNA methyltransferases (DNMT; blue ovals). (B) CGIs acquire DNA methylation normally but are targeted by a demethylating activity (DM). (C) The basal transcriptional machinery (RNApolII and TF) and histone H3 lysine 4 trimethylation (H3K4me3) exclude the DNMTs from sites of transcriptional initiation (dashed green line). (A–C) Methylated and unmethylated CpGs are denoted by filled and open lollipops, respectively.

A second possibility is that CGIs are targeted by a DNA demethylation mechanism, which specifically removes the methyl moiety from the cytosine base (Fig. 2B) [39]. Various protein factors, including CGBP (CpG-binding protein), possess a CXXC domain, which can specifically bind to non-methylated CpG sites [40, 41]. This protein has been shown to associate with the MLL complex, which mediates the formation of transcriptionally permissive chromatin via histone modifying activities [42]. It is possible that an equivalent recruitment mechanism could target a demethylation activity to CGIs. However, no such demethylase activity has thus far been identified in somatic tissues.

A plausible alternative is that bound transcription factors sterically preclude DNMT association at CGI sequences (Fig. 2C) [43]. Such a mechanism is supported by mouse transgenic experiments in which ablation of binding sites for the ubiquitous transcription factor Sp1 was shown to facilitate de novo methylation of the APRT promoter CGI [44, 45]. Consistent with this idea, α-globin is transcribed in the embryo and contains a promoter CGI, whilst the related β-globin gene is not embryonically transcribed and has no CGI [46]. Analysis of a panel of genes expressed during mouse embryogenesis found that 93% are associated with a 5′ CGI [47]. These data raise the possibility that CGIs are footprints of the basal transcription machinery localised during embryonic de novo methylation (Fig. 2C). Furthermore, global run-on expression analysis indicated that bidirectional transcription initiation frequently occurs at gene promoters [48]. This observation could account for the relatively large region of steric hindrance that would be required to generate a typical CGI. However, this model does not account for the observation that CGIs are intrinsically sensitive to nuclease digestion and therefore more accessible than the majority of the genome [14, 49]. Moreover, the majority of CGIs remain hypomethylated in terminally differentiated cells irrespective of transcriptional activity [1, 8, 10].

Recent studies investigating the methyltransferase-like factor DNMT3L suggest a rather speculative mechanism for the persistence of hypomethylated CGIs. This protein associates with, and facilitates the action of, the de novo methyltransferases [50-53]. However, DNMT3L cannot bind to chromatin in which the histone H3 tails are tri-methylated at the lysine 4 position [54]. Genome wide determination indicated that the majority of protein coding gene promoters are occupied by RNA polymerase II and possess islands of trimethylated H3K4 even in the absence of transcriptional elongation in ES cells [13, 55, 56]. The presence of this active mark at CGI-promoters may render them refractory to de novo methylation via repulsion of DNMT3L (Fig. 2C).

It is conceivable that more than one of these models is involved in the establishment of CGIs and the hypomethylation which usually persists during subsequent differentiation. Global analysis of chromatin modifications, transcriptional activity, transcription factor binding and DNA methylation analysis will help to determine the origin of CGIs and the mechanism that maintains them.

4 CpG island methylation

The majority of CGIs are hypomethylated, but a small percentage acquires methylation during normal development. Some of these examples are known to play a key role in X-inactivation and genomic imprinting [57, 58]. Disruption of CGI methylation patterns has also been well-documented as a hallmark of neoplastic cells [59]. Recently, DNA “methylome” characterisation has been the basis for an increasing number of investigations due to significant advances in analytical technologies [1, 2, 10, 29, 60]. A major focus of this work has centered on CGIs as they represent a tractable fraction of the genome with obvious regulatory potential.

Several studies have recently improved our understanding of DNA methylation at CGI-promoters. This is of particular interest as it is known that hypermethylation of CGI promoters results in stable transcriptional repression [3]. Microarrays probed with DNA enriched for methyl-CpGs identified 3–4% of CGI-promoters as hypermethylated in a panel of somatic tissues [2, 10, 61]. In contrast, promoters with relatively reduced CpG content were more frequently hypermethylated [10]. This is consistent with the observation that a methylated fraction purified from human whole blood was found to be enriched for DNA sequences with a CpG density intermediate between CGIs and bulk genomic DNA [62].

The above studies focused on gene promoters; however, CGIs distal to TSSs have also been implicated in transcriptional regulation [58, 63]. Systematic analysis of all predicted CGIs (149) on the q arm of human chromosome 21 determined that 22% were hypermethylated in peripheral blood DNA [64]. An independent investigation characterised methylation at 2524 regions of human chromosomes 6, 20 and 22 across 12 tissues using high resolution bisulfite sequencing [1]. This study identified 9.2% of predicted CGIs as methylated at more than 80% of CpG sites in one or more somatic tissues [1]. More recently, global CGI methylation has been characterised through affinity purification of methylated DNA and microarray screening. MBD affinity purified (MAP) DNA identified 11.6% of islands as hypermethylated in a panel of somatic tissues [29]. In a similar manner, Rauch and coworkers identified approximately 25% of CGIs as being heavily methylated in human B cells [63].

These global studies indicate that sites of CGI methylation frequently localise to genomic regions distal to promoters. Consistent with this observation, bisulfite analysis identified 2.1% of promoter-associated CGIs as hypermethylated (>80% of CpGs) relative to more than 9% of the complete CGI complement [1]. However, despite this observation the exact proportion of hypermethylated CGIs varies widely between these studies (9–25%). The discrepancies between these studies may be attributed to three key experimental factors:
  • (1) Variable detection: The relative methylation levels required for detection differ between the analytical techniques. For example, bisulfite analysis provides single base pair resolution, allowing the determination of intermediate levels of methylation (>20 and <80% meCpG), which are imperceptible by techniques such as Methyl-DNA Immunoprecipitation (MeDIP) [2].

  • (2) Inconsistent CGI classification: The number of CGIs identified can vary widely depending on the sequence parameters applied to their identification. This is illustrated by the study of Rauch et al., which included CGIs with a relatively low CpG density; such sequences are more frequently methylated than CGIs identified by more stringent criteria [1, 7, 63]. These sequences are arguably not bona fide CGIs.

  • (3) Tissue specific CGI methylation: A proportion of all CGIs are differentially methylated between tissues. Studies investigating multiple tissues will consequently identify a greater total number of hypermethylated CGIs.
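
The effect of detection thresholds in point (1) can be sketched as a simple classification of per-CGI methylation levels. The example below is illustrative only and does not reproduce the analysis pipeline of any cited study. It assumes that each CpG site in an island has already been called as methylated or unmethylated, for instance from bisulfite sequencing, and it applies the >80% and <20% boundaries quoted in the text. Assigning values of exactly 20% or 80% to the intermediate class is an assumption, as are the function names.

```python
def methylation_fraction(calls):
    """Fraction of CpG sites called methylated.

    `calls` is an iterable of booleans, one per CpG site in the island
    (True = methylated). This input format is a hypothetical convention.
    """
    calls = list(calls)
    if not calls:
        raise ValueError("at least one CpG call is required")
    return sum(calls) / len(calls)


def classify_cgi(fraction, low=0.20, high=0.80):
    """Classify an island by its methylated CpG fraction.

    > high  -> hypermethylated
    < low   -> hypomethylated
    otherwise intermediate (boundary values included, by assumption)
    """
    if fraction > high:
        return "hypermethylated"
    if fraction < low:
        return "hypomethylated"
    return "intermediate"


# Toy islands with ten CpG sites each
examples = {
    "island_a": [True] * 9 + [False] * 1,   # 90% methylated
    "island_b": [True] * 5 + [False] * 5,   # 50% methylated
    "island_c": [False] * 10,               # 0% methylated
}
for name, calls in examples.items():
    print(name, classify_cgi(methylation_fraction(calls)))
```

A technique that cannot resolve the intermediate class would have to assign an island such as the second toy example to one of the two extremes, which is one way in which the reported proportions of hypermethylated CGIs could diverge between studies.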

Two recent studies have combined bisulfite conversion with next generation sequencing technology to characterise DNA methylation at CGIs [60, 65]. This technology, although presently limited to the characterisation of a small fraction of the genome, provides unparalleled resolution and a greater insight into the distribution of DNA methylation in the mammalian genome.

5 Differential CGI methylation

A small but significant proportion of CGIs are differentially methylated between normal tissues and cell types [1, 29, 66-70]. Characterisation of these differences identifies the existence of tissue specific CGI methylation fingerprints which may demarcate cellular functions [69, 71].

5.1 Germ line specific hypomethylation

A number of CGIs have been found to be unmethylated in cells of the germ line, but methylated in all tested somatic cell types. For example, germ line specific genes of the MAGE (melanoma antigen encoding genes) family acquire promoter-CGI methylation during embryogenesis and are silent in all somatic tissues [72]. Promoter demethylation correlates with the ectopic expression of these genes in various cancer cells suggesting that DNA methylation is the primary silencing mechanism [73]. Genome wide characterisation of a synthetic mouse differentiation model identified accumulation of de novo methylation and transcriptional silencing at the promoters of many germ line specific genes [74]. Restriction landmark genome scanning (RLGS) in a panel of mouse tissues identified 5% of CGIs as differentially methylated [75]. Candidate analysis of 15 of these islands confirmed that 14 were specifically hypomethylated in mature sperm and heavily methylated in somatic cells [75]. A similar trend was identified in human tissues, where promoter arrays probed with methylated DNA from brain, testis and monocytes identified testis specific hypomethylation [67]. Furthermore, CGIs shown to be hypermethylated in human blood were also completely devoid of methylation in sperm DNA [61].

Sperm specific hypomethylation suggests that certain germline-specific genes may be irrevocably silenced via CGI methylation in somatic cells. Since mature sperm cells are transcriptionally inactive, the genes which are regulated by CGI methylation must be expressed during sperm maturation. Consistent with this suggestion, somatic acquisition of DNA methylation correlated well with transcriptional activity in human testis (containing primordial germ cells) and gene silencing in somatic cells [76]. Knowledge of specific gene expression profiles in immature germ cells will be required to determine if CGI methylation acts as a primary repressor of germ line specific genes in somatic cells.

5.2 Differential methylation in embryonic stem cells

Embryonic stem cells are pluripotent and might therefore be expected to lack any CGI methylation. Characterisation of DNA methylation patterns in mouse embryonic stem cells (ES cells) has however identified hypermethylation at approximately 3% of CGI-promoters. These included various developmental genes such as Rhox2 and many genes involved in testis and oocyte specific functions [56]. Rhox genes are temporally and spatially regulated during post-implantation development in mice and are expressed specifically in the extraembryonic tissues [77]. The Rhox cluster has been shown to associate with embryo specific hypermethylation and transcriptional analysis in DNMT deficient embryonic cells indicates that this provides the primary silencing mechanism for these genes [56, 77]. Minimal overlap between genes repressed by DNA promoter methylation and those targeted by PcG and Nanog/Oct4 suggests that there are multiple complementary regulatory mechanisms which maintain correct expression during mammalian embryogenesis [56].

5.3 Differential CGI methylation in somatic cells

Recent studies have revealed a significant fraction of CGIs that show tissue specific DNA methylation. It is tempting to hypothesise that this differential methylation serves to regulate gene expression during cellular differentiation. Consistent with this notion, the promoter-CGIs of rSPHK1 and hSLC6A8 have been shown to be specifically methylated in non-expressing tissues [68, 78]. The CpG-rich promoter of the human gene MASPIN was shown to be differentially methylated in a panel of 10 somatic tissues and cell types. Although this promoter sequence represents a weak CGI, DNA methylation levels correlate well with the transcriptional activity of the gene [66].

Analysis of human chromosomes 6, 20 and 22 by bisulfite genomic sequencing identified eleven CGIs which were differentially methylated between 8 somatic tissues [1]. Interestingly, methylation of these CGIs displayed a relatively poor correlation with gene expression levels in these tissues. Global methylation studies also identified a limited concordance between differentially methylated CGIs and gene expression [29, 63]. It is not yet clear whether this represents limited sensitivity of the transcriptional assays or an independent repression mechanism which functions irrespective of methylation status (discussed below).

A clearer understanding of the role of differential CGI methylation may be gained by characterising the function of specific genes in more detail. Strikingly, genes involved in developmental processes are frequently associated with differentially methylated CGIs. PAX6, OSR1 and various members of the Homeobox (HOX) super family have been shown to exhibit cell type-specific DNA methylation at CGIs [29, 63, 69, 79]. HOX genes are highly conserved and function to dictate the positional identities of cells within the embryo, representing key regulators of mammalian development. Similarly, the PAX6 transcription factor is required for neural and ocular development and its expression is temporally and spatially partitioned within the mammalian brain [19]. Further work is needed to test the hypothesis that tissue specific CGI methylation at genes of this kind plays an important role in cell type specification.

6 CGI methylation and transcription

6.1 CGI methylation and transcriptional regulation

There is extensive evidence to support a functional role for promoter-CGI methylation in transcriptional repression (see, for example [10, 72, 80]). DNA methylation of the CpG-rich promoters of MASPIN and GATA2 correlates with tissue specific gene silencing [66, 75]. In light of this evidence, it is tempting to hypothesise that the major function of CGI methylation is to repress transcription. However many genes display a relatively poor correlation between CGI hypermethylation and the transcriptional status of associated genes [29, 63, 81].

There are several potential explanations for this lack of correlation, as illustrated in Fig. 3. In a simple example such as that depicted in Fig. 3A, hypermethylation of the single promoter associated CGI would lead to stable transcriptional silencing. The majority of methylated CGIs are located within intragenic regions where the effect on transcription is less clear [1, 29, 63]. Many genes can generate multiple transcripts by utilising alternative transcription start sites. Rauch and colleagues identified expression of PARP12 despite hypermethylation of its primary CGI promoter. Rapid amplification of cDNA ends (5′ RACE), however, identified transcription initiation from an intragenic promoter downstream of the methylated CGI [63]. Alternative promoters (e.g. P1–3 in Fig. 3B) could be inactivated by CGI methylation (Fig. 3B – CGIs (i) and (ii)).

Fig. 3. Schematic representation of CGI gene association. (A) A simple mammalian gene with a single promoter associated CGI (high density of vertical black strokes). (B) A more complex gene structure including alternative promoters (P1–3), multiple intragenic CGIs (i–v), a single intergenic CGI and an antisense transcript (dashed red arrow).

Where intragenic islands do not associate with a known TSS, it is possible that their methylation could prevent spurious gene body transcription which could otherwise interfere with the correct expression of the parent gene (Fig. 3B – CGI (iii and iv)). As yet there is no evidence for this conjecture.

There is evidence that intragenic CGIs can localise to sites of antisense non-coding RNA (ncRNA) transcription initiation, and that these ncRNAs negatively regulate the expression of the sense transcript (Fig. 3B – CGI (v)). Both the Air and Tsix ncRNA transcripts originate from CGIs and are involved in the regulation of the sense transcript [82-84]. The HOXD cluster is repressed in trans by the action of HOTAIR, a ncRNA transcribed from the HOXC locus [85]. In each case, CGI methylation results in the derepression of genes silenced by ncRNAs.

Many hypermethylated CGIs are located in intergenic DNA outside coding sequences and therefore have no obvious regulatory role in gene transcription (Fig. 3B – CGI (vi)). In the case of the H19/IGF2 imprinted locus, however, parent specific methylation at an intergenic CGI upstream of the H19 ncRNA gene determines the expression of the imprinted locus. CGI methylation prevents the association of the insulator protein CTCF and promotes expression of IGF2 from the paternal allele [58]. This illustrates a potential mechanism whereby hypermethylation of intergenic CGIs can elicit a transcriptional effect.

These examples illustrate the complexity in determining the effect of DNA methylation at CGIs. Characterisation of transcription initiation using RNA polymerase chromatin immunoprecipitation and RACE will provide a better understanding of the function of CGI methylation at these sites.

6.2 Initiation or maintenance

Does hypermethylation of TSS associated CGIs act as the initial silencing mechanism or as a secondary event to provide stable, heritable gene repression? Several germ line and embryonic specific genes associate with methylated CGI promoters and can be reactivated by depletion of DNA methylation levels [56, 72, 73, 77]. This observation supports the former possibility, although it is conceivable that once a gene is silenced, the initial repressive event is lost and DNA methylation merely acts as a maintenance device. Several studies have identified differential CGI methylation between somatic tissues associated with constitutively repressed genes. This suggests that methylation is stochastically accumulated in different cell types in the absence of transcription. This fits with the observation that CGI methylation is a relatively late event during X-inactivation following gene repression [86].

Absence of TFs at silenced promoters could facilitate transient de novo methylation. This possibility would align with the notion that methylation may be regarded as the basal state of the genome and is excluded from specific regions by the presence of bound factors. Alternatively DNMT recruitment could be mediated by initial repressive events to target DNA methylation and irrevocably silence transcription of the associated gene. To dissect these possibilities it will be necessary to measure gene transcription levels, chromatin modification, transcription factor binding and DNA methylation during cellular differentiation to determine the order of events leading to transcriptional repression.

7 Concluding remarks

The completion of the human and mouse genome projects has revealed an unexpectedly small number of genes [4, 27]. The mammalian transcriptome, however, is highly complex, with many genes generating multiple, often functionally distinct transcripts [12]. This is the result of many factors, including alternative splicing, differential promoter usage, TF availability, and the expression of regulatory ncRNAs. Several recent studies have identified differential patterns of DNA methylation across the genome. This evidence indicates that CGI methylation may provide an important epigenetic component of mammalian development and cellular differentiation. Interestingly, one recent study identified extensive tissue-specific methylation localised to regions which flank CGIs (<2 kb). Differential DNA methylation of these CGI “shores” correlates well with tissue specific gene expression [69]. These findings illustrate the complex role played by DNA methylation in the regulation of mammalian transcription.

There are many remaining questions. Do all CGIs colocalise to sites of transcriptional initiation? Do tissue-specific methylation patterns have a mechanistic role in “hard wiring” expression patterns in terminally differentiated cells? How prevalent is inter-individual differential CGI methylation? These questions must be addressed before we can begin to understand the role of CGIs in transcriptional regulation and consequently the aberrant events associated with disease.