Trendspotting in the Protein Data Bank

The Protein Data Bank (PDB) was established in 1971 as a repository for the three-dimensional structures of biological macromolecules. Since then, more than 85 000 biological macromolecule structures have been determined and made available in the PDB archive. Through analysis of the corpus of data, it is possible to identify trends that can be used to inform us about the future of structural biology and to plan the best ways to improve the management of the ever-growing amount of PDB data.


Introduction
The establishment of the Protein Data Bank (PDB) in 1971 [1] was the culmination of several years of community discussion about how best to archive and distribute the results of structure determinations of biological macromolecules. Led at first by Walter Hamilton and then by Tom Koetzle at Brookhaven National Laboratory, the young resource solicited data from the early pioneers in the field and distributed them on magnetic tapes to the scientists who requested them [2]. In 1989, following many years of discussion within the structural biology community, guidelines were established for the timing of data deposition [3]. These guidelines led to the now almost universal journal requirement that data are deposited before a manuscript is accepted and then released upon publication.
In 1998, the management of the PDB was taken over by the Research Collaboratory for Structural Bioinformatics (RCSB) [4]. At about the same time, data centers at the European Bioinformatics Institute in the United Kingdom (now PDBe [5,6]) and Osaka University in Japan (now PDBj [7]) expanded from being distribution sites to also accepting and processing data. The collaboration among the three sites was formalized in 2003 with the formation of the Worldwide PDB (wwPDB) [8,9]. In 2006, BioMagResBank joined the organization [10]. The mission of the wwPDB is to ensure that standards are set and met for data representation and data quality in the archive. To help accomplish this, the wwPDB established Task Forces of experts in X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, 3D electron microscopy (3DEM), and small angle scattering. These Task Forces make recommendations about which data should be collected and how these data should be best validated [11,12].
Data are reviewed across the archive on a regular basis and remediated when appropriate [13,14]. In recent years, atom and residue nomenclatures have been aligned with International Union of Pure and Applied Chemistry (IUPAC) standards. An enriched Chemical Component Dictionary has enhanced the representation of small molecule ligands in the PDB archive. Most recently, the representation of complex peptides has been standardized [15].
The PDB is a well-curated archive that evolves with new developments in structural biology. In this paper, the current contents of the archive are analyzed in order to quantify some of these developments and better understand the trends.

Growth patterns
The holdings in the PDB continue to grow (Fig. 1) [16]. As early as 1978, Dick Dickerson had modeled the growth of crystallographic entries as exponential, n = exp(0.19 y), where n is the number of new structures per year and y is the number of years since 1960. Overall, this model is largely correct [17]. More recently, Cele Abad-Zapatero reanalyzed the growth statistics in more detail and discovered that the overall growth rate has remained surprisingly close to Dickerson's prediction through 2005, with some decrease in the growth rate between 2006 and 2010 [18]. This is consistent with an analysis of PDB depositions that shows a yearly acceleration in data deposition, with the notable exception of 2008. Based upon the rate of increase since the year 2000, our analysis predicts that PDB holdings will increase 1.5-fold between 2012 (current holdings of 85 000) and the end of 2017 (projected holdings of 134 000).
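The arithmetic behind these growth figures can be sketched in a few lines of Python. This is an illustrative calculation only, not the analysis used for the projection above: the function names are hypothetical, and the five-year span (end of 2012 to end of 2017) is an assumption made here to turn the two holdings figures into an annual rate.

```python
import math

DICKERSON_RATE = 0.19    # exponent in n = exp(0.19 y), as quoted in the text
DICKERSON_ORIGIN = 1960  # y is counted in years since 1960


def dickerson_new_structures(year: int) -> float:
    """New structures per year predicted by n = exp(0.19 y)."""
    return math.exp(DICKERSON_RATE * (year - DICKERSON_ORIGIN))


def doubling_time(rate: float = DICKERSON_RATE) -> float:
    """Years needed for an exponential with the given rate to double."""
    return math.log(2) / rate


def fold_increase(start: float, end: float) -> float:
    """Ratio of final to initial holdings."""
    return end / start


def implied_annual_growth(start: float, end: float, years: float) -> float:
    """Constant yearly growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Holdings figures are taken from the text; the 5-year span is an assumption.
    print(f"Doubling time of the model: {doubling_time():.2f} years")
    print(f"Projected fold increase:    {fold_increase(85_000, 134_000):.2f}")
    rate = implied_annual_growth(85_000, 134_000, years=5)
    print(f"Implied annual growth:      {rate:.1%}")
```

Under these assumptions, the 1978 model doubles the yearly number of new structures roughly every 3.6 years, and the projected holdings correspond to total archive growth of a little under 10% per year.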
The usage of PDB data is also growing. There were 380 million downloads of data from the wwPDB FTP sites in 2011 as compared to 226 million downloads in 2008. Download statistics for the overall archive and for individual entries are available from the wwPDB website (http://www.wwpdb.org/downloadStats.php).
Data are also accessed from the individual wwPDB member websites. The RCSB PDB website is accessed by about 250 000 unique visitors per month from 140 countries. Around 750 gigabytes of data are transferred each month from the website. The breadth of PDB usage can be seen in the more than 11 000 citations to the original RCSB PDB reference [4] in journal subject areas ranging from medical informatics and surgery to art to physics (wokinfo.com).
Fig. 2 shows an overall increase in depositions from each continent, but with a notable dip in 2008. Since then, the number of depositions has resumed growing: North American depositions are continuing to grow steadily; despite a slight decline in 2011, analysis of 2012 statistics indicates that European depositions are growing overall; and in Asia, a slower growth rate of Japanese depositions is compensated for in part by a faster growth rate of Chinese depositions.
The number of structures released without a corresponding publication is growing. Information about publications associated with PDB entries is updated regularly. Ninety-eight percent of the structures released by the PDB in 2001 were published in journals. In 2011, that percentage decreased to 74%. Part of the reason for this drop was the establishment of the Protein Structure Initiative (PSI [19]), which requires data release within one month of structure determination. As a result of this requirement, the percentage of PSI entries with corresponding publications is necessarily much lower than the rest of the PDB archive. Overall, approximately 20% of PSI structures have an associated citation, as compared to almost 80% of all PDB depositions released between 2001 and 2011 [20].

Structure determination methods
Most structures in the PDB (currently 88% of the entire archive) have been determined using X-ray crystallography. There has been steady growth in the number of these depositions (Fig. 1B). Synchrotron radiation is now the predominant source of X-rays used for data collection (Fig. 3A). The use of either Single-wavelength Anomalous Dispersion (SAD) or Multi-wavelength Anomalous Dispersion (MAD) methods for phasing peaked in 2009. Since 1996, when MAD and SAD began to be used, approximately 15% of all X-ray structures deposited through 2011 have been phased using one of these methods (Fig. 3B). Molecular replacement or Fourier phasing methods continue to be used for the majority of X-ray structure determinations.
The average resolution of X-ray structures has remained constant at about 2.0 Å. However, with the large volume of data available, there are now substantial numbers of structures determined to very high resolution, including at least one virus structure [21]. At the same time, as more large macromolecular machines are being studied using X-ray methods, there are many examples of very low-resolution structures [22-24].
The use of NMR methods for structure determination began in the 1980s (Fig. 1C). After an initial period of growth, the number of structures deposited per year began to decrease in 2008. The average molecular weight for NMR depositions is about 10 000 daltons.
Electron microscopy (3DEM) has been used for structure determination since the 1990s, and the number of map and coordinate depositions is increasing (Fig. 1D) [25]. The rapid growth in 3DEM map depositions points towards future growth in deposition of model coordinates from this method. The most popular 3DEM method is single particle reconstruction (for structures such as viruses), with some representation of helical reconstruction, electron crystallography, and subtomogram averaging methods (Fig. 4).

Overall
More than 90% of the PDB's holdings are proteins. Over the years the average molecular weight of the asymmetric unit for crystal structures has increased from less than 30 000 daltons to over 110 000 daltons (Fig. 5A). The number of biopolymer chains has increased at a somewhat faster rate than the number of entries (Fig. 5B). The number of non-redundant sequence clusters is also growing constantly. Analysis of the top 20 sequence clusters in the PDB shows that the most studied proteins overall are lysozyme, human immunodeficiency virus (HIV) protease, carbonic anhydrase, and trypsin. However, the trends in recent years have changed, with HIV protease, major histocompatibility complex (MHC), carbonic anhydrase, beta secretase, and mitogen-activated protein (MAP) kinase being the most commonly deposited protein structures since 2007 (Table 1). This is most likely because of the important roles these proteins play in biomedical research.
The number of ligands available in the PDB continues to increase; there are now more than 14 000 ligands in the wwPDB Chemical Component Dictionary, including some important drugs (Figs. 5C, 6A and B). Of the 85 000 entries currently available in the PDB, 70% are complexes containing small molecule ligands. Peptide antibiotics and peptide inhibitors compose a special class of ligands, many of which have pharmaceutical value and whose numbers continue to increase (Figs. 5D and 6C,D,F). In addition to the peptide-like antibiotics, there are several examples in the archive of other complex antibiotics such as aminoglycosides (Fig. 6E).

Nucleic acid-containing entries
There are three major classes of nucleic acid-containing entries in the PDB archive: RNA, DNA, and protein-nucleic acid complexes. Nucleic acid crystallography took longer to become established than protein crystallography in large part due to the difficulties of isolating and purifying samples. The first nucleic acid structure to be deposited in the PDB was yeast phe tRNA [26,27] (Fig. 7A). The first DNA structure determined was a short fragment of left-handed Z-DNA [28]. The first full turn of B-DNA was published in 1981 [29] (Fig. 7C). There was steady growth in the number of DNA structures deposited in the PDB until the mid-1990s, when the growth rate plateaued. Around that time, ribozymes were discovered [30] (see example in Fig. 7B), and RNA structure depositions increased and then leveled off. In the 1980s, the first structures of protein-DNA complexes were deposited, followed by the first single-crystal protein-RNA complex structures in the early 1990s (see examples in Figs. 7D and E, respectively).
Fig. 6. Examples of molecules in the PDB that are or have been used as drugs, shown in ball and stick. For each, the corresponding 3-character code from the Chemical Component Dictionary is listed. Blockbuster drugs shown are (A) atorvastatin bound to HMG-CoA reductase, a key enzyme in the cholesterol biosynthesis pathway (PDB ID 1hwk [49]) and (B) clopidogrel bound to cytochrome P450 2B4, which activates the prodrug (PDB ID 3me6 [50]); peptidomimetic inhibitors shown are (C) remikiren bound to human renin (PDB ID 3d91 [51]) and (D) saquinavir bound to HIV protease (PDB ID 1hxb [52]); (E) aminoglycoside antibiotic shown is neomycin bound to extended duplex RNA (PDB ID 3c7r [53]); (F) peptide-like antibiotic/antitumor agent actinomycin D structure (PDB ID 1a7y [54]).
The growth rate of the deposition of protein-nucleic acid complexes continues to increase (Fig. 8), partly as a consequence of continuing investigations of the structure of ribosomes complexed with drugs.

Carbohydrate-containing entries
Carbohydrates are known to play key roles in energy generation, cell signaling, cellular recognition, and cellular and extracellular matrix formation [31]. While the building blocks, interactions, structures, and organization of proteins and nucleic acids are relatively well understood, carbohydrates have yet to be fully characterized at either the structural or functional level. In addition, carbohydrate polymers, unlike proteins and nucleic acids, do not have a standard backbone structure and are not synthesized based on a genetic code. Carbohydrate polymers in protein glycosylations are subject to the activity of enzymes and to the availability of specific saccharide substrates, leading to considerable variability.
More than 7000 PDB entries contain carbohydrate polymers and/or individual saccharides. They are present as single sugars (monosaccharides) that are either unbound (see example in Fig. 9A) or covalently linked to proteins (as seen in some glycoproteins) and as polymers of various lengths that are either unbound (structural components or substrates of specific enzymes) or covalently linked to proteins (glycoproteins, example in Fig. 9B). While monosaccharides are key components of nucleotides, mono- and polysaccharides also form key components of several antibiotics such as mithramycin (Fig. 9C, [32]) and other biologically important molecules, such as peptidoglycans (Fig. 9D, [33]), proteoglycans, and glycolipids. Because the PDB was originally designed as an archive for proteins, some important components of macromolecules such as carbohydrates are not well defined, making search and analysis of them difficult. This situation is recognized and is being remedied.

Complex biological assemblies
The PDB contains many examples of multi-subunit biological assemblies (Fig. 10). Analysis shows that fewer structures have an odd number of subunits than have an even number. Some assemblies are particularly overrepresented, such as those with 6, 8, 12, 24, and n × 60 subunits. One plausible reason for this distribution is that the over-represented values correspond to complexes with regular point symmetries, such as the n × 60 icosahedral viruses [14]. Further analysis of these assemblies yields some additional interesting observations. Multi-subunit assemblies can be used to facilitate the formation of nanoparticles within their cavities, as with octahedral ferritin (PDB ID 2z6m [34]). In other cases, nanoscale structures have been designed via self-assembly, including a ~13 nm octahedral cage (PDB ID 4ddf [35]) and a 16 nm cavity with a tetrahedral arrangement (PDB ID 3vdx [36]). The first atomic structures of viruses were published about 35 years ago [37], and there are now about 400 virus structures in the PDB. The vast majority are icosahedral viruses solved by either X-ray crystallography or cryo-electron microscopy.
Fig. 9. Examples of carbohydrate-containing entries, with the carbohydrates shown in ball and stick. (A) Single unbound monosaccharide, rhamnose, in the structure of rhamnose-binding lectin, a pattern recognition protein with a role in innate immunity (PDB ID 2zx2 [58]); (B) polymeric glycoprotein in glycosylated human lactotransferrin N2 fragment (purple) in complex with legume lectin chains (cyan and red, PDB ID 1lg2 [59]); (C) polysaccharide antitumor drug mithramycin bound to a DNA fragment (PDB ID 1bp8 [32]); (D) mixed polymers: bacterial cell wall muramyl peptide (peptidoglycan) bound to legume isolectin chains (cyan and red, PDB ID 1loc [33]).
Because success in this distinct area of structural biology critically depends on expertise in highly specialized methods [38-41], it is perhaps not surprising that it is practiced by a relatively small and interconnected group of scientists worldwide. Network cluster analysis was used to investigate interconnectedness and growth of this research community relative to the first structures determined between 1978 and 1985. The early structures directly nucleated three major author clusters that have each contributed between 30 and 100 icosahedral virus structures to the PDB (Fig. 11: central blue, right purple, and lower red clusters). The community has now evolved into thirteen distinct author clusters; most of these are strongly interconnected by several entries with shared deposition authors.
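The idea of grouping deposition authors through shared entries can be illustrated with a minimal Python sketch. It is not the procedure used for Fig. 11, which was produced with Gephi; here a much simpler technique is swapped in, namely connected components of the co-authorship graph computed with a union-find structure. That technique would merge interconnected clusters that a modularity-based method keeps apart. The function name, entry labels, and author labels are all hypothetical placeholders.

```python
from collections import defaultdict


def author_clusters(entries):
    """Group authors into connected components of the co-authorship graph.

    `entries` maps an entry label to the list of its deposition authors.
    Two authors land in the same cluster if a chain of shared entries
    connects them.
    """
    parent = {}

    def find(author):
        # Register unseen authors as their own root, then walk to the root.
        parent.setdefault(author, author)
        while parent[author] != author:
            parent[author] = parent[parent[author]]  # path halving
            author = parent[author]
        return author

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for authors in entries.values():
        authors = list(authors)
        if not authors:
            continue
        find(authors[0])  # make sure single-author entries are registered
        for other in authors[1:]:
            union(authors[0], other)

    clusters = defaultdict(set)
    for author in list(parent):
        clusters[find(author)].add(author)

    # Largest clusters first; ties broken alphabetically for stable output.
    return sorted(
        (sorted(members) for members in clusters.values()),
        key=lambda members: (-len(members), members),
    )


# Hypothetical toy data: labels are placeholders, not real PDB entries or authors.
toy_entries = {
    "entry_1": ["A", "B"],
    "entry_2": ["B", "C"],
    "entry_3": ["D", "E"],
    "entry_4": ["F"],
}

if __name__ == "__main__":
    for members in author_clusters(toy_entries):
        print(members)
```

In the toy data, authors A, B, and C form one cluster because B appears on two entries, while D and E form a second cluster and F stands alone.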

Looking forward
Structural biology is unique in that the PDB archive provides a quantitative indicator of research productivity. Our analysis of these trends shows that the PDB has had an overall steady growth since its inception in 1971. The slight decline in the number of depositions in 2008 coincides with the discontinuation of a major program in Japan [42] as well as a decline in the use of NMR for structure determination. However, other factors such as global economic developments and changes in science funding may also be involved.
... [47]), southern bean mosaic virus (PDB ID 4sbv [63]), satellite tobacco necrosis virus (PDB ID 2buk [37]), rhinovirus (PDB ID 4rhv [64]), and poliovirus (PDB ID 2plv [65]). Gephi [66] was used for cluster analysis of 375 icosahedral virus PDB entries connected by 364 deposition authors.
Analyses of these trends may help inform development of many aspects of the archive such as the data dictionaries, annotation practices, software development, and remediation efforts. For example, the current development of a Common Tool for Deposition and Annotation will allow the wwPDB to manage an increased data load without an increase in resources [43]. This tool will provide for distribution of the data load worldwide and incorporates the best practices for annotation developed by the wwPDB.
As another example, the increased complexity and size of the entries being deposited has led to the adoption of the PDBx format, which has far fewer restrictions than the legacy PDB format [44,45]. Current work with structure determination software developers to incorporate PDBx ensures that data will be imported into and exported from the PDB without loss of information. In addition, efforts to review and remediate special categories of entries such as those containing complex peptides or carbohydrates will improve the usability of the PDB by other scientists. Similarly, the diversity of methods used for structure determination has led to the creation of Task Forces that are making recommendations for data collection and validation.
These trends also inform the development of external resources. The decline in the percentage of depositions with accompanying publications strongly suggests the need to consider data as a type of publication. This is, in fact, being done by the Web of Knowledge's Data Citation Index (wokinfo.com).
Continued surveillance and analysis of the PDB holdings can provide new directions and opportunities for structural biology and will also allow the archive to evolve along with the science it represents.