
Volume 589, Issue 20, Part A, p. 2966-2974
Review article
Open Access

Contact genomics: scaffolding and phasing (meta)genomes using chromosome 3D physical signatures

Jean-François Flot

Department of Genetics, Evolution and Environment, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK

Hervé Marie-Nelly

Institut Pasteur, Department of Genomes and Genetics, Groupe Régulation Spatiale des Génomes, 75015 Paris, France

CNRS, UMR 3525, 75015 Paris, France

Romain Koszul

Institut Pasteur, Department of Genomes and Genetics, Groupe Régulation Spatiale des Génomes, 75015 Paris, France

CNRS, UMR 3525, 75015 Paris, France

First published: 29 April 2015
Corresponding author at: Institut Pasteur, Department of Genomes and Genetics, Groupe Régulation Spatiale des Génomes, 75015 Paris, France.
Current address: Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720-3370, USA.

Abstract

High-throughput DNA sequencing technologies are fuelling an accelerating trend to assemble de novo or resequence the genomes of numerous species as well as to complete unfinished assemblies. While current DNA sequencing technologies remain limited to reading stretches of a few hundred or a few thousand base pairs, experimental and computational methods are continuously improving with the goal of assembling entire genomes from large numbers of short DNA sequences. However, the algorithms that piece together DNA strands face important limitations due, notably, to the presence of repeated sequences or of multiple haplotypes within one genome, thus leaving many assemblies incomplete. Recently, the realization that the physical contacts experienced by a portion of a DNA molecule could be used as a robust and quantitative assay to determine its genomic position has led to the emerging field of contact genomics, which promises to revolutionize current genome assembly approaches by exploiting the flexible polymer properties of chromosomes. Here we review the current applications of contact genomics to genome scaffolding, haplotyping and metagenomic assembly, then outline the future developments we envision.

1 Introduction

Advances in sequencing technologies have led to a tremendous increase in the catalog of sequenced species[1]. However, although it is now relatively easy to recover a massive amount of sequence data from the genome of a given species, producing a fully assembled genome sequence remains a serious challenge. This is notably because current DNA sequencing technologies remain limited to reading stretches of only a few hundred to a few thousand base pairs. These short sequences (called reads) have to be pieced together by sophisticated computer programs called assemblers into longer stretches of continuous DNA sequence called contigs[2, 3]. In an ideal world, one would recover after assembly as many contigs as there are chromosomes in the species being sequenced. However, this is hardly ever the case: most published genome sequences are called “unfinished” as they are heavily fragmented, whereas the number of truly “finished” genomes remains remarkably low. Only the small, compact genomes of a few so-called “model” organisms have been fully assembled so far, mostly bacteria (such as Haemophilus influenzae and Escherichia coli[4, 5]) and fungi (e.g., Saccharomyces cerevisiae[6]) along with a single metazoan to date, the nematode Caenorhabditis elegans[7]. All other sequences consist of “drafts” of varying quality, including the human genome, which still contains numerous gaps but is nevertheless the most complete mammalian reference assembly available[8, 9]. Assembly algorithms often lead to fragmented draft assemblies for several reasons, including heterozygosity, the presence of repeated sequences of various sizes/proportions, or strong sequence composition biases. In addition, there is so far no rigorous, quantitative metric to evaluate the quality of an assembly. As a result, genome assembly remains shrouded in magic and it is typical to try a variety of assembly algorithms on a given dataset and look for the “best” solution in a semi-empirical way[10].

Today's assembly algorithms fall roughly into two large categories that employ different paradigms[11, 12]. Assemblers based on the Overlap-Layout-Consensus (OLC) paradigm look for overlaps between reads that allow the gradual construction of extended sequences. These assemblers (such as CELERA[13] and MIRA[14]) were initially developed for the long, high-quality reads produced by Sanger sequencers[15]. They were subsequently adapted to deal with the shorter, more error-prone reads produced by 454 pyrosequencers[16], but the computational cost of applying them to the huge number of tiny reads (typically smaller than 150 bp[17]) produced by the new generations of sequencing machines became unbearable at the turn of the millennium. New assembly approaches were therefore explored, leading to the development of assemblers using De Bruijn Graphs (DBGs), named after the work of Nicolaas de Bruijn[18]. First, the reads are split into smaller sequences of k elements (k-mers). The aim of the program is then to find the superstring of nucleotides that best recapitulates the available k-mers. In a DBG this task is achieved by finding an Eulerian path through the graph[18]. VELVET[19], ABySS[20], SOAPdenovo[21], ALLPATHS-LG[22], IDBA-UD[23] and SPAdes[24] are among the most popular algorithms using a DBG representation.
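The DBG principle can be illustrated with a minimal sketch. It assumes error-free reads, a single strand and a graph that actually contains an Eulerian path; the function names (kmers, de_bruijn, eulerian_path, spell) are hypothetical and do not correspond to any of the assemblers cited above, which handle sequencing errors, reverse complements and repeats in far more elaborate ways.

```python
from collections import defaultdict

def kmers(read, k):
    """Return every overlapping substring of length k found in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm (assumption: an Eulerian path exists)."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start from the node with one more outgoing than incoming edge, if any.
    start = next((n for n in graph if len(graph[n]) - in_degree[n] == 1),
                 next(iter(graph)))
    edges = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Turn a path of overlapping (k-1)-mers back into a nucleotide string."""
    return path[0] + "".join(node[-1] for node in path[1:])

# Toy example: two overlapping reads, k = 4 (illustrative values).
reads = ["ATGGCGT", "GCGTGCA"]
assembled = spell(eulerian_path(de_bruijn(reads, 4)))
print(assembled)  # ATGGCGTGCA
```

In this toy case every (k-1)-mer is unique, so the path is unambiguous; a repeat longer than k-1 would create a branching node, which is exactly the situation that breaks contigs in practice.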

Despite their empirical efficiency, these algorithms encounter a number of important limitations that in some cases strongly affect their results. Confronted with perfectly repeated sequences longer than the reads, all current approaches will lead in the best case to contig disruption and in the worst case to misassembled regions. In addition, the level of ploidy of the genome adds several layers of complexity to the problem. Finally, the outputs of existing assemblers are often fragmented and may contain many errors (ranging in size from single nucleotide substitutions to artifactual large-scale rearrangements or copy number variations) but statistical tools to robustly assess the validity of the assemblies are still missing, the development of which represents an active field of research[11, 25-28].

Since no present-day assembler is able to directly produce superstrings that correspond to complete chromosomal sequences of eukaryotes, a second step called “scaffolding” is generally attempted once contigs have been generated. Scaffolding aims at ordering and orienting the contigs as accurately as possible into “supercontigs”, or “scaffolds”, as well as estimating the distances between them in order to generate a more global sequence backbone representative of the genome sequenced. In recent years, new techniques have been developed that strongly improved this step: mate-pair sequencing[29], optical mapping[33] and single-molecule, real-time (SMRT) sequencing[30]. Mate-pair sequencing consists in cutting the genome into long DNA fragments (typically 1 to 5 kb, but sometimes up to 10–20 kb) and sequencing their extremities. The resulting pairs of sequences are therefore known to be separated by a genomic distance roughly equal to the size that was selected for, and this information can be used to detect structural variants[29] or to connect contigs over repeated regions (thereby improving de novo genome assembly; e.g.[31]). Optical mapping is another approach for scaffolding contigs[32-34] that can also be used to validate and/or correct contigs as well as to phase them into haplotypes[35]. In this approach, the DNA molecules of interest are first labeled with fluorescent probes directed toward specific sequences. These molecules are then stretched and elongated using microfluidic devices so that optical imaging of the probes allows determining the relative positions of their target sequences. Optical mapping has been successfully used in several genome sequencing projects, such as the domestic goat[36] and rice[37]. Finally, SMRT sequencing[30] generates long reads (up to 20 kb long) that can be used either to generate de novo assemblies of small genomes[38] or to scaffold contigs and fill gaps in scaffolds obtained from large genomes[8], thereby efficiently solving many of the problems posed by repeats in assembling genomes.

Although mate-pair sequencing, optical mapping and SMRT sequencing alleviate some of the problems posed by repeats and structural complexity, they are usually unable to solve them all. First, mate pairs can only bridge regions up to a few tens of kilobases long and cannot easily resolve complex structural variations; besides, mate-pair libraries are usually contaminated with erroneous paired-end reads, leading to even more misassemblies. Second, optical mapping requires a complex and costly experimental set-up not readily accessible to many labs involved in genomic projects, and this approach is unable to order and orient small contigs. Last, SMRT sequencing is plagued by a high error rate (about 15%, mostly indels), because of which even 20-kb reads may not map unambiguously to a single genomic location: notably, this approach cannot resolve gaps caused by large repeats (>20 kb) of nearly identical sequences[8]. Overall, improving the quality of an assembly remains tedious, time-consuming, and costly: as a result, de novo draft genomes usually contain numerous errors and gaps, including some that users may not be aware of. Therefore, new methods are actively sought that would allow the de novo assembly of finished genomes, using objective, quantitative and hypothesis-free approaches.

Over the last year, approaches have been proposed that exploit the three-dimensional (3D) physical signature of chromosomes to bring a new level of resolution to scaffolding and haplotyping as well as to metagenomic assembly. As these genomic approaches exploit the quantification of 3D contacts along the chromosomes, we dub this burgeoning field “contact genomics”. These new techniques rely primarily on chromosome conformation capture (3C), which was originally developed to characterize the average 3D organization of chromosomes (see accompanying reviews;[39]). In the present review, we first introduce the supporting theory behind these methods before detailing their practical applications to genome scaffolding. We then briefly present contact genomics approaches to haplotyping and to metagenomic assembly, and conclude by outlining the future developments we envision in this field.

2 The theoretical foundations of contact genomics

As mentioned above, a typical assembly program generates a set of contigs that are subsequently scaffolded in an attempt to approximate the complete sequences of the chromosomes. Unlike mate-pair sequencing, optical mapping and SMRT sequencing, contact frequency data potentially provide a full spectrum of distances ranging from local to chromosome scale. Besides, because of the flexible nature of the chromatin fiber, loci that are in close proximity along its sequence are expected to interact much more than others that are farther apart[40, 41]: hence, quantifying these contacts using genomic derivatives of 3C[42-44] makes it possible to estimate interaction frequencies between all loci within a genome and from there to infer genomic distances. Although many 3C derivatives exist (most notably Hi-C[42], but also 3C-seq[45] and Chicago[46]), they will all be referred to as “3C” in this review unless specified otherwise.

Typically, 3C starts with a crosslinking step that aims at “freezing” the organization of all the cellular components within a population of cells, including the chromosomes[39]. Crosslinked cells are incubated in the presence of a restriction enzyme, and the resulting complexes of proteins and DNA restriction fragments (RFs) are then ligated intramolecularly. The more frequently RFs are trapped together (because of their spatial proximity during crosslinking), the more likely they are to become ligated to one another and generate a molecule that is chimeric with respect to the genome sequence. The quantification of these religation events is currently best achieved through high-throughput paired-end sequencing of the 3C library, allowing the computation of detailed contact matrices (or contact maps) that reflect the contact frequencies between all the RFs in a genome[42, 43].
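To make the notion of a contact matrix concrete, the sketch below counts ligation events between restriction fragments from already-mapped read pairs. It is a simplified illustration under stated assumptions: a single chromosome, read pairs reduced to two mapped coordinates, and fragments defined by a sorted list of start positions. The names fragment_index and contact_matrix are hypothetical, and none of the bias corrections discussed later in this review are applied.

```python
from bisect import bisect_right

def fragment_index(fragment_starts, position):
    """Index of the restriction fragment containing a mapped position.

    fragment_starts is a sorted list of fragment start coordinates,
    the first of which is 0 (assumption of this sketch).
    """
    return bisect_right(fragment_starts, position) - 1

def contact_matrix(fragment_starts, read_pairs):
    """Count how often each pair of restriction fragments is ligated together."""
    n = len(fragment_starts)
    matrix = [[0] * n for _ in range(n)]
    for position_a, position_b in read_pairs:
        i = fragment_index(fragment_starts, position_a)
        j = fragment_index(fragment_starts, position_b)
        if i == j:
            continue  # both ends in the same fragment: no contact information
        matrix[i][j] += 1
        matrix[j][i] += 1  # contacts are symmetric
    return matrix

# Toy example: four fragments and five mapped read pairs (illustrative values).
starts = [0, 100, 250, 400]
pairs = [(10, 120), (130, 20), (260, 390), (50, 410), (110, 300)]
matrix = contact_matrix(starts, pairs)
for row in matrix:
    print(row)
```

Read pairs whose two ends fall within the same fragment are discarded here because they behave like regular paired-end reads and carry no contact information.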

In all organisms studied, genome-wide contact maps display a strong diagonal signal reflecting the frequent 3D contacts between RFs located near each other along the chromosome(s) (Fig. 1a). As a result, one may assume a direct relationship between genomic distance and interaction frequency: loci that are in close proximity to each other along the chromosomes interact frequently, yielding a strong 3C signal, and conversely, strong 3C signals imply close genomic proximity. This relationship makes it possible to use contact genomics to establish the synteny (i.e., collinearity) of DNA loci across large distances, hence overcoming the current limitations of scaffolding, haplotyping, and even metagenomics analyses. In other words, the synteny information contained in a 3C library is similar to what mate-pair libraries can bring but spans distances that are up to 2 or 3 orders of magnitude larger, therefore potentially connecting loci across the entire length of each chromosome.
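The strength of the diagonal signal can be summarized by averaging the contact counts over all pairs of loci separated by the same number of bins. The sketch below does this for a square, symmetric matrix; the function name and the toy values are illustrative assumptions, and equally sized bins are assumed so that a separation in bins is proportional to a genomic distance.

```python
def mean_contacts_by_separation(matrix):
    """Average contact count for each separation s = 1 .. n-1 (in bins)."""
    n = len(matrix)
    means = []
    for separation in range(1, n):
        values = [matrix[i][i + separation] for i in range(n - separation)]
        means.append(sum(values) / len(values))
    return means

# Toy matrix in which counts decay with the distance to the diagonal.
toy = [[0 if i == j else 8 // abs(i - j) for j in range(4)] for i in range(4)]
decay = mean_contacts_by_separation(toy)
print(decay)  # [8.0, 4.0, 2.0]
```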

Fig. 1. Principle of genome assembly using chromosome contact data. (a) The flexible polymer properties of the DNA molecule explain the strong diagonal observed in the genome-wide contact maps of all species studied using genomic 3C derivatives so far, as illustrated here for the genomes of Bacillus subtilis (a bacterium), Naumovozyma castellii (a fungus) and Homo sapiens. (b) 3C contact data mapped on the de novo assembly of chromosomes 4 (red bars) and 15 (blue bars) of the yeast Saccharomyces cerevisiae (panel i). The de novo assembly comprised errors and was fragmented, but one could easily re-order the fragments to produce an intermediate contact map (panel ii) in which the assembly errors were corrected, then from there scaffold the contigs in order to retrieve the correctly assembled genome (panel iii). In this experiment, the diagonal signal was two orders of magnitude stronger than the signal originating from inter-centromeric repeats (white squares), whereas the other extra-diagonal signals initially detected (pink squares) disappeared as the initial assembly was corrected and scaffolded.

3 Application of contact genomics to scaffolding

The most direct application of contact genomics is to scaffold genome data, for example in order to identify large-scale chromosomal structural variations. When paired-end reads from a genomic 3C library are mapped on a reference genome sequence, strong incongruities (i.e., 3D signals outside of the expected diagonal) in the contact map are indicative of structural differences (Fig. 1bi), as was for instance noted in studies of oncogenic cell lines[47]. As in a jigsaw puzzle, reordering the pieces (Fig. 1bii) and reorienting them (Fig. 1biii) to minimize the number of incongruities results in a reconstruction of the true genome structure of the isolate that was sequenced. In this application, contact genomics provides strong hints about the connections between the contigs, revealing both their order and their orientation with respect to one another. In simple cases such as the example shown in Fig. 1b, the resulting jigsaw puzzle can be easily solved in a visual way thanks to the obvious incongruities in the pattern. However, the complexity of this procedure increases non-linearly with the number of contigs and assembly errors.

A simple, intuitive approach to improve a complex assembly using contact data is a “greedy”, recursive algorithm that finds the best neighbors of a DNA region based on their contact frequencies. This approach represents each contig as an ordered string of oriented RFs, with each fragment having at most two adjacent (left and right) neighbors. The two RFs that interact most frequently with a given fragment are determined and then recursively connected to each other until an incompatibility arises. Such a local method discards most of the long-range contact information contained in the 3C data: hence, although greedy approaches may perform well on ideal, simulated data without repeated elements, their performance drops quickly when data get sparser and genomes get more complex[48].
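One possible greedy scheme is sketched below as an illustration; it is not the algorithm evaluated in [48]. The assumptions are that fragment orientation is ignored, that the strongest contacts are joined first, and that a join is refused when a fragment already has two neighbors or when it would close a loop. The function name greedy_order is hypothetical.

```python
def greedy_order(matrix):
    """Chain fragments by joining the most frequently contacting pairs first."""
    n = len(matrix)
    pairs = sorted(((matrix[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)), reverse=True)
    neighbors = {i: [] for i in range(n)}
    parent = list(range(n))  # union-find structure used to refuse loops

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for count, i, j in pairs:
        if count == 0:
            break  # no contact evidence left
        if len(neighbors[i]) < 2 and len(neighbors[j]) < 2 and find(i) != find(j):
            neighbors[i].append(j)
            neighbors[j].append(i)
            parent[find(i)] = find(j)

    # Walk each chain starting from one of its two ends.
    chains, visited = [], set()
    for start in range(n):
        if start in visited or len(neighbors[start]) == 2:
            continue
        chain, previous, node = [], None, start
        while node is not None:
            chain.append(node)
            visited.add(node)
            following = [m for m in neighbors[node] if m != previous]
            previous, node = node, (following[0] if following else None)
        chains.append(chain)
    return chains

# Toy example: fragments 0..3 whose true order along the chromosome is 2, 0, 3, 1.
true_order = [2, 0, 3, 1]
rank = {fragment: r for r, fragment in enumerate(true_order)}
counts = {1: 10, 2: 3, 3: 1}  # noise-free counts by separation (illustrative)
toy = [[0 if a == b else counts[abs(rank[a] - rank[b])] for b in range(4)]
       for a in range(4)]
chains = greedy_order(toy)
print(chains)  # [[1, 3, 0, 2]], i.e. the true order read from the other end
```

On such noise-free data the chain is recovered, but only the largest counts are ever consulted, which illustrates why most of the long-range information is left unused.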

Things become even more complicated if simulations include statistical variations reflecting the fact that 3C is first and foremost a counting procedure. Indeed, 3C experiments quantify a signal that results from two complex, overlapping stochastic processes. First, the multistep experimental protocols used generate biases and artifacts (possibly linked directly to the DNA sequence itself) that must be taken into account when interpreting the result[49-51]. Second, the experiments are performed on dynamic objects: chromosomes are dynamic polymers whose physical properties are likely to vary in time, in space, and even locally over their monomers[39, 43, 52, 53]. Importantly, in the near-perfect situation of a 3C dataset simulated by considering the experiment as the output of a Poisson process, recursive algorithms fail to reconstruct the original contigs[48]. This failure illustrates the fact that even when there are no experimental artifacts and a hypothetical “shortest common supersequence” does exist, the raw contact counts cannot be used directly as a robust indication that two restriction fragments are located on the same chromosome. Therefore, the main challenge when using contact genomics to scaffold genomes appears to be distinguishing true interactions from background and statistical noise in order to reorder and reorient DNA regions properly. To reach this aim, the algorithms described in the rest of this section adopt different strategies to handle contact data and generate outputs; all of them succeed in improving scaffold sizes by several orders of magnitude, leading for instance to scaffolds spanning the entire length of human chromosomes.
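The effect of counting noise can be explored with a small simulation such as the sketch below. It assumes, purely for illustration, that the expected number of contacts between two fragments is scale × separation^exponent and that the observed count is an independent Poisson draw around this expectation; the parameter values and function names are hypothetical and are not those used in [48]. The Poisson sampler uses Knuth's multiplication method, which is only practical for small means.

```python
import math
import random

def poisson_sample(rng, mean):
    """Draw one Poisson variate (Knuth's multiplication method, small means only)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_contacts(n_fragments, scale, exponent, seed=0):
    """Simulate a symmetric contact matrix with Poisson counting noise.

    Assumption of this sketch: the expected count between fragments i and j
    is scale * |i - j| ** exponent, with a negative exponent.
    """
    rng = random.Random(seed)
    matrix = [[0] * n_fragments for _ in range(n_fragments)]
    for i in range(n_fragments):
        for j in range(i + 1, n_fragments):
            expected = scale * (j - i) ** exponent
            matrix[i][j] = matrix[j][i] = poisson_sample(rng, expected)
    return matrix

noisy = simulate_contacts(6, 20.0, -1.0, seed=1)
for row in noisy:
    print(row)
```

With low expected counts, a distant pair of fragments can by chance receive more contacts than an adjacent pair, which is enough to mislead a method that trusts raw counts.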

3.1 Clustering methods

Clustering solutions offer a quick and practical approach to group DNA contigs or fragments that are likely to be in the vicinity of each other because they are part of the same chromosome. To produce scaffolds, this first step has to be followed by a second one that aims at ordering and, ideally, orienting these DNA segments with respect to each other within a cluster (Fig. 2a). The programs dnaTri (the name of which stands for “DNA triangulation”;[54]) and Lachesis[55] use this strategy to explore the ability of genomic 3C to scaffold human chromosomes. Notable differences exist between the two approaches. For instance, Lachesis requires prior knowledge of the expected number of clusters to proceed, and this approach often clusters small chromosomes together. dnaTri, on the other hand, applies an average-linkage hierarchical clustering algorithm directly to a distance matrix approximated from the contact matrix, without making any prior assumption regarding the expected number of clusters. Despite these differences, these two programs were reported to successfully scaffold both simulated and de novo contigs into full-length chromosomes, paving the way for further development and applications.
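The clustering step can be sketched as a naive average-linkage agglomeration that stops at a distance cutoff instead of a preset number of clusters. This is an illustration of the general technique, not the dnaTri implementation: the conversion of contact counts into distances as 1 / (1 + count), the cutoff value and the function name average_linkage are all assumptions made for this example.

```python
def average_linkage(distance, cutoff):
    """Merge the two closest clusters until their average distance exceeds cutoff."""
    clusters = [[i] for i in range(len(distance))]

    def average(a, b):
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, x, y = min((average(clusters[p], clusters[q]), p, q)
                         for p in range(len(clusters))
                         for q in range(p + 1, len(clusters)))
        if best > cutoff:
            break  # remaining clusters are too far apart to be merged
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(cluster) for cluster in clusters]

# Toy example: contigs 0-2 and contigs 3-4 belong to two different chromosomes.
contacts = [
    [0, 20, 20, 1, 1],
    [20, 0, 20, 1, 1],
    [20, 20, 0, 1, 1],
    [1, 1, 1, 0, 20],
    [1, 1, 1, 20, 0],
]
# Hypothetical contact-to-distance conversion: frequent contacts mean short distances.
distance = [[1 / (1 + count) for count in row] for row in contacts]
clusters = average_linkage(distance, cutoff=0.2)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Ordering and orienting the contigs within each cluster would still require a second step, as described above.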

Fig. 2. Scaffolding application of contact genomics. (a) Simplified representation of the pipelines used by published algorithms to perform contact genomic scaffolding (dnaTri, Lachesis, GRAAL and HiRISE[46, 54, 55, 57]). Blue, red, and green arrows represent contigs/scaffolds from an assembly presenting discrepancies with the genome of the species or cell line from which 3C data were obtained. (b) Left: contact map of a bacterial chromosome. Right: when plotted against genomic distances, contact frequencies between DNA regions exhibit a power law. This distribution can vary quantitatively depending on the experimental conditions or species, but its overall shape remains highly conserved. In the absence of contigs long enough to compute a distribution over large distances, one can initialize an assembly algorithm using a published distribution and then gradually replace it with one inferred from the actual dataset at hand. (c) Schematic representation of scaffolding using contact frequency distributions. Duplications within genomes (black lines) can be identified based on their contacts with their neighboring regions and then repositioned correctly using adequate algorithms such as GRAAL.
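The power-law relationship mentioned in panel (b) can be estimated from data by a straight-line fit in log-log space. The sketch below is a minimal illustration assuming that contact frequencies follow f(s) = a × s^b for a genomic separation s; the function name fit_power_law and the toy values are hypothetical, and real datasets would call for binning and bias correction before fitting.

```python
import math

def fit_power_law(separations, frequencies):
    """Least-squares fit of log(f) = log(a) + b * log(s); returns (a, b)."""
    points = [(math.log(s), math.log(f))
              for s, f in zip(separations, frequencies) if s > 0 and f > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Toy data following f(s) = 100 / s exactly (illustrative values).
separations = [1, 2, 4, 8, 16]
frequencies = [100 / s for s in separations]
prefactor, exponent = fit_power_law(separations, frequencies)
print(round(prefactor, 3), round(exponent, 3))
```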

Although clustering approaches are clear improvements over the greedy approach mentioned previously, several limitations remain. Notably, their two-step process potentially results in cumulative errors: unless specific care is taken to tackle such problems, a contig misplaced during the clustering step will not be reassigned to its correct chromosome during the second step. Also, these programs do not account for duplications and do not attempt to correct the assembly errors that may be present in the contigs that are fed to them.

3.2 Probabilistic methods

Alternatively, genome assembly based on contact data can be approached from a probabilistic perspective. Probabilistic approaches using Bayesian inference provide a robust framework to assess the validity of a genome in an objective and quantitative fashion[56]. Such an approach was implemented in the GRAAL program[57], which uses the highly redundant information encapsulated in genomic 3C data together with an analytic model inspired by polymer physics to compute the likelihood of a genomic structure (namely, the probability of observing the contact matrix at hand given the genome structure being evaluated). The program is initialized with a set of DNA sequences, a pool of “bins” from which the program repeatedly draws. For each bin, the program uses 3D contacts to find candidate neighbors among the other bins (Fig. 2b), then determines their most likely relationships within the genome by testing a large number of biologically inspired structural variations (including duplications, inversions, etc.; Fig. 2c). For each structure the program computes a likelihood score, and one of the structures with the highest scores is retained for the next iteration. As a result, after thousands of iterations the procedure converges toward what is expected to be the most likely structure given the data. Once initialized with a set of contigs and the contact data, the program iterates automatically without further user intervention, with each bin being processed as many times as defined by the user (Fig. 2a). GRAAL was validated on both human and fungal genomes, and on both simulated and de novo datasets[45, 57]. One disadvantage is that, at least in its current implementation, it does not attempt to guess the size of the remaining gaps in an assembly. Another assembler using a probabilistic approach based on likelihood comparisons was recently published (HiRISE;[46]) and appears similar to GRAAL in terms of its capabilities (Fig. 2a), but a detailed comparison of the performance of these two programs is still wanting.
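The idea of scoring a candidate structure by its likelihood can be illustrated with a deliberately simplified model. The sketch below assumes that bins are labeled 0 to n-1, that they all lie on a single linear chromosome, that the expected number of contacts between two bins is scale × separation^exponent, and that observed counts are independent Poisson draws; it then compares all possible orderings exhaustively, which is only feasible for a handful of bins. It is not the GRAAL model or its sampling scheme, and the names and parameter values are hypothetical.

```python
import math
from itertools import permutations

def log_likelihood(order, contacts, scale, exponent):
    """Poisson log-likelihood of a contact matrix given a linear ordering of bins."""
    position = {bin_id: rank for rank, bin_id in enumerate(order)}
    total = 0.0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            separation = abs(position[a] - position[b])
            expected = scale * separation ** exponent
            observed = contacts[a][b]
            # log of the Poisson probability mass function
            total += (observed * math.log(expected) - expected
                      - math.lgamma(observed + 1))
    return total

# Toy contact matrix generated from the order 2, 0, 3, 1 (illustrative counts).
true_order = [2, 0, 3, 1]
rank = {bin_id: r for r, bin_id in enumerate(true_order)}
counts = {1: 20, 2: 10, 3: 7}
contacts = [[0 if a == b else counts[abs(rank[a] - rank[b])] for b in range(4)]
            for a in range(4)]

best = max(permutations(range(4)),
           key=lambda order: log_likelihood(order, contacts, 20.0, -1.0))
print(best)
```

Because a chromosome read from either end yields the same separations, an ordering and its mirror image obtain the same score, so the search returns one of the two.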

4 Application of contact genomics to chromosome-scale haplotyping

Most animal and plant species have diploid or polyploid genomes, the characterization of which poses challenges far beyond those of haploid organisms such as bacteria. Characterizing the genetic variations along all homologous chromosomes/sequences present in a diploid (or polyploid) cell is important not only for biomedical applications such as linkage analyses[58, 59] but also for population genomics and evolutionary studies[60]. However, haplotype reconstruction remains limited by current sequencing technologies, by cost, and, in many instances, by the lack of robust genome scaffolds[61]. Here again, genome 3D physical signatures open new perspectives to phase single-nucleotide polymorphisms (SNPs) and indels among homologous chromosomes as well as to discriminate among different paralogous copies arising from copy-number variations. The underlying principle of these approaches remains the same as for scaffolding: physical linkage, i.e. the fact that two nucleotide variants are carried by the same chromosome and therefore belong to the same haplotype, can be assessed based on the frequency of the contacts between these positions. The basic tenet is that two variants present on the same haplotype (in cis positions) are much more likely, up to a certain distance, to be captured together in a 3C experiment than variants present on the two different haplotypes (in trans positions). Instead of ordering restriction fragments as in scaffolding applications, one can therefore use 3D contacts as long-distance anchors to cluster SNPs or indels, thereby unveiling the haplotypes (Fig. 3a).
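The cis/trans principle can be sketched as a simple voting scheme. The example below assumes a diploid genome, heterozygous sites with alleles coded 0 and 1, and read pairs summarized as (site_a, allele_a, site_b, allele_b); each read pair votes for the two observed alleles being on the same chromosome, and phases are then propagated from site to site. This is a toy illustration with hypothetical names, not the optimisation performed by HapCUT or HaploSeq.

```python
from collections import defaultdict, deque

def phase_variants(n_sites, links):
    """Return the alleles carried by one haplotype (the other is its complement)."""
    weight = defaultdict(int)
    for site_a, allele_a, site_b, allele_b in links:
        vote = 1 if allele_a == allele_b else -1
        weight[(site_a, site_b)] += vote
        weight[(site_b, site_a)] += vote

    adjacency = defaultdict(list)
    for (a, b), w in weight.items():
        if w != 0:  # ties carry no phase information
            adjacency[a].append((b, w))

    phase = [None] * n_sites
    for seed in range(n_sites):
        if phase[seed] is not None:
            continue
        phase[seed] = 0  # arbitrary choice for the first site of each block
        queue = deque([seed])
        while queue:
            a = queue.popleft()
            for b, w in adjacency[a]:
                if phase[b] is None:
                    phase[b] = phase[a] if w > 0 else 1 - phase[a]
                    queue.append(b)
    return phase

# Toy example: one haplotype carries alleles 0,1,0,1 and the other 1,0,1,0.
links = [
    (0, 0, 1, 1), (0, 1, 1, 0),  # cis contacts between sites 0 and 1
    (0, 0, 1, 0),                # one trans contact (noise)
    (1, 1, 2, 0),
    (2, 0, 3, 1), (2, 1, 3, 0),
    (0, 1, 3, 0),                # long-range cis contact
]
haplotype = phase_variants(4, links)
print(haplotype)  # [0, 1, 0, 1]
```

The single trans contact is outvoted by the two cis contacts between the same sites, which mirrors the expectation that cis contacts dominate up to a certain distance.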

Fig. 3. Haplotyping and metagenomic applications of contact genomics. (a) Example of haplotype deconvolution based on 3D contacts. Two genomic variants occurring in cis will exhibit more contacts (red arrows) than variant positions located in trans on two different chromosomes (blue arrows). The syntenic variations can then be identified and positioned appropriately. (b) Illustration of the application of contact genomics to metagenomics. A metagenomic 3C (meta3C) experiment performed directly on a mix of species reveals that 3D contacts are more frequent between DNA regions belonging to the same cellular compartment (red arrows) than between chromosomal sets in different compartments (blue arrows). This discrepancy can be exploited to distinguish the different chromosomal sets in each compartment, thereby separating the genomes of the different species present in the mix. Right panel: 3D reconstruction of the contact matrix recovered from an experiment performed on a controlled mix of species[45]. Each bead represents a ∼30 kb DNA region and is positioned according to its contacts with the other beads. Each cluster of beads that appears on the figure corresponds to one species, illustrating the low amount of noise in these experiments.

Several teams recently investigated the potential of contact genomics for resolving haplotypes. Bing Ren and colleagues performed a Hi-C experiment on a diploid mouse cell line with two homologous chromosomal sets originating from homozygous strains whose genome sequences were already known[62]. As expected, most Hi-C contacts resulted from cis interactions (as a result of the spatial segregation of chromosomes within nuclei). By adapting the HapCUT program (originally developed to phase haplotypes from shotgun or mate-pair data[63]) to exploit the broader range of long-distance contacts generated by Hi-C libraries, they successfully phased de novo more than 99% of the known heterozygous sites along each chromosome in their mouse system. When applied to a human cell line carrying about ten times fewer heterozygous sites than their mouse strain, their approach (called HaploSeq) was able to resolve haplotypes with an average accuracy of 98%[62]. Another 3C derivative, targeted-locus amplification (TLA), focuses on specific regions of the genome and allows identification of structural variations, SNPs and other variations affecting the surroundings of these positions of interest[64].

5 Application of contact genomics to metagenomics

From the applications described above, it is easy to see how the principles of contact genomics can be extended to the analysis of genomes of different species living together, i.e. “metagenomes”. When performing a 3C experiment directly on a mix of species, one generally observes a very low frequency of intergenomic ligation events, which makes it possible to use 3D contact signatures to distinguish DNA segments of chromosomes belonging to different organisms: instead of scaffolding contigs into chromosomes from a single genome, one should simply consider individual genomes as DNA entities to be characterized based on their 3D contacts within a metagenome (Fig. 3b, left). Three studies have recently provided proof-of-concept experiments showing that genome-wide 3C of a controlled mixed population can indeed generate sufficient contact information to infer the genomes of the species present in the mix[45, 65, 66] (Fig. 3b, right). However, these published approaches used general contact-genomic algorithms and not algorithms dedicated to metagenomics, leaving ample room for future improvements. Two of these proof-of-concept studies were performed using simulated contigs or prior assemblies of separate shotgun libraries as templates on which to align 3C reads[65, 66], whereas the third article exploited the fact that 3C-seq libraries contain about 80% regular paired-end reads to generate contigs that were subsequently clustered and reassembled using GRAAL (during which some assembly errors present in the initial contigs were corrected)[45]. Importantly, contact genomics also allows detection of extra-genomic elements sharing the same compartment as a given genome, as was shown both for controlled mixes of species[45, 65] and for a complex microbial community isolated from the environment[45]. For instance, a correlation analysis of the 3D contacts originating from an F plasmid (the fertility factor allowing bacterial conjugation[67]) detected in 3C data from a mix of three bacterial species revealed not only that this plasmid belonged to E. coli, but also that it carried a 140-kb copy of a portion of the genome of this bacterium[45]. A similar analysis performed on bacteriophage sequences in the same dataset also revealed which ones among these elements were extra-chromosomal and which ones had become integrated as prophages in the genome (RK, unpublished data). Finally, the genome haplotyping strategies described above may also be applied to phase closely related genomic variants occurring in a metagenome[65].
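Because intergenomic ligation events are rare, even a very simple rule can separate contigs by genome in a toy setting: link two contigs whenever they share enough contacts, then report the connected groups. The sketch below illustrates this idea under the assumption that contact counts between contigs are already available as a dictionary; the threshold, the contig names and the function name bin_contigs are hypothetical, and the published studies[45, 65, 66] relied on more elaborate clustering procedures.

```python
def bin_contigs(contigs, contact_counts, min_contacts):
    """Group contigs connected by at least min_contacts contacts (union-find)."""
    parent = {contig: contig for contig in contigs}

    def find(contig):
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for (a, b), count in contact_counts.items():
        if count >= min_contacts:
            parent[find(a)] = find(b)

    bins = {}
    for contig in contigs:
        bins.setdefault(find(contig), []).append(contig)
    return sorted(sorted(members) for members in bins.values())

# Toy example: contigs c1-c3 come from one species, c4-c5 from another.
contigs = ["c1", "c2", "c3", "c4", "c5"]
contact_counts = {
    ("c1", "c2"): 40,
    ("c2", "c3"): 35,
    ("c4", "c5"): 50,
    ("c3", "c4"): 1,  # rare intergenomic ligation event
}
genome_bins = bin_contigs(contigs, contact_counts, min_contacts=5)
print(genome_bins)  # [['c1', 'c2', 'c3'], ['c4', 'c5']]
```

A plasmid or phage contig would join the group of its host in the same way, provided it shares the host's cellular compartment.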

6 Conclusion

Taking advantage of the spatial signature of chromosomes to improve genomic analysis holds great promise, although its prospects may shift in light of continuous technological developments. For instance, novel sequencing technologies such as nanopore membranes may alleviate the remaining challenges encountered to “fill the gaps” in repeated or otherwise complex regions of genomes[68]. However, we envision that the emerging contact genomics approaches described in this review will remain important for several applications. First, physical contacts make it possible to assess the quality of an assembly using an objective, independent source of information and to correct errors in the assembly[46, 57]. Second, the application of contact genomics to haplotype resolution is likely to develop in the future, not only for single genomes but also for metagenomic analysis and for characterizing the multiple strains within a population of a given species.

Originally, contact genomics analyses were performed using genomic 3C datasets generated in vivo. Emancipation from the sometimes complex manipulation of living cells by performing 3C directly on purified DNA in vitro appears to be a natural extension of this approach, which may follow either one of two possible paths. The first would be to develop chemicals that crosslink the DNA molecule itself. To our knowledge, few chemicals have been synthesized and specifically used to perform interhelical DNA–DNA crosslinking. One such chemical was used in the early 1980s to study the packaging of the lambda phage genome[69, 70]. The synthesis of this product remains tedious, but it may be possible to develop crosslinking chemicals that are easier to synthesize, such as two intercalation molecules linked together by a long carbonate chain. The alternative consists in simply reconstituting chromatin in vitro by mixing the molecules of interest with histones, which can be achieved using commercial kits. The latter approach was recently applied to DNA isolated from human and alligator with apparently good results[46] (although it remains difficult to assess its efficiency relative to other published approaches since the available preprint does not include such a comparison). One potential advantage of using an in vitro procedure is to remove the 3D signal induced by biologically meaningful contacts (such as the clustering of centromeres in yeast (see Fig. 1b) and gene loops in mammals), as the latter may interfere with the signal originating from the linear structure of chromosomes. However, these contacts do not seem to present a challenge to scaffolding algorithms such as GRAAL or dnaTri[54, 57], given that they have been reported to successfully scaffold hundreds of kilobases of individual chromosomes. Another potentially interesting feature of in vitro approaches is that they do not require living tissues but can be applied to mere DNA extracts, which is certainly advantageous when performing post mortem analyses. The advantage of in vivo experiments, on the other hand, is that beyond scaffolding, haplotyping and metagenomic analysis they also provide insights into the 3D structure of the genomes under scrutiny[45, 57]. This structure, in turn, can eventually reveal the positions of functional elements such as point centromeres[71, 72]. In addition, it is possible that the in vivo packaging of chromatin in the Hi-C and 3C-seq approaches improves the capture and assembly of long, repeated elements by increasing long-distance contacts.

Beyond refinements in experimental procedure, we expect contact genomics to benefit greatly from the development of dedicated software. At present, the steps of assembling the reads into contigs, scaffolding them and phasing them are performed by different programs that may not take into account all the information available for each step (e.g., 3D contacts are not taken into account during contig formation) and that may have different input/output formats. Hence, developing a single, user-friendly program able to take as input regular paired-end and mate-pair reads as well as 3C reads and possibly other types of information (such as PacBio and nanopore reads), and then to assemble, scaffold and phase them using an explicitly probabilistic framework such as that of GRAAL, should be a priority direction for future research (Fig. 4). From a more technical viewpoint, the exploitation of graphics processing units (GPUs[57]), combined with the development of new mathematical treatments of contact matrices, will likely prove essential to the democratization of these methods by allowing them to run on cheaper computers.

Fig. 4. Integrated contact genomics pipeline, from metagenomic assembly to scaffolding to haplotyping. The schematic representation illustrates how, from a set of DNA sequences recovered from a mixed population, one could theoretically exploit DNA physical contacts to scaffold the chromosomes of the species present in the mixture and phase their haplotypes. The color code illustrates how this process is gradually achieved.

Acknowledgments

The authors thank Ronnie de Jonge and Ken Kraaijeveld for fruitful comments. This research was supported by funding to R.K. from the under the 7th Framework Program (FP7/2007-2013)/ERC grant agreement 260822. J.F.F. is supported by the European Research Council (ERC-2012-AdG 322790).