Entropic criteria for protein folding derived from recurrences: Six residues patch as the basic protein word
Abstract
Some research has suggested that patches of six residues constitute an important amino acid window length for conveying information in proteins. We present database evidence that supports this conjecture, together with additional recurrence-based data suggesting that these words, once characterized and quantified, affect the folding/aggregation features of proteins. Other indirect evidence is presented and discussed.
1 Introduction
The subject of protein folding continues to be a topic of major interest, eliciting multiple approaches to its understanding [1]. Given its importance for the functioning of biological systems, it is assumed that the principles governing folding exhibit regularities. Of particular interest is the question of what determines the folding core and resultant collapse of globular proteins in aqueous solutions. That there may be latent regularities in the process is suggested by research demonstrating that the fractional size of the hydrophobic core remains relatively constant across protein lengths [2, 3].
In previous work we presented evidence to suggest that six-residue patches were important “words” for understanding protein dynamics [4]. In the present Letter, we extend the previous findings with results from a different database, which indicate that these observations relate to the fact that nucleation sites are constrained by patches of approximately six residues. That these “words” are important for folding dynamics is supported by additional research into nucleation energy, unstructured protein regions, and allergenicity recognition, and by an ansatz developed by our group which is correlated with protein solubility in a number of different situations.
2 Materials and methods
A set of 1977 single-chain PDB files solved by X-ray diffraction from CATH v2.6.0 (April 2005) in Cath List Format (CLF) 1.0 was obtained from http://cathwww.biochem.ucl.ac.uk/latest/lists/index.html. Each protein sequence (amino acid single-letter code) was transformed into a numerical profile by means of the hydrophobicity scale of Miyazawa and Jernigan, and subjected to recurrence quantification analysis (RQA) (Table 1, Supplementary Material).
RQA has been extensively described in the literature [5-7], but briefly, the method is based upon detecting recurrences of similar segments along a given series, where two segments are considered similar if the Euclidean distance between them is below a given radius. The Euclidean distances between all possible patches of consecutive residues of predefined length are reported in a square, symmetric matrix whose rows and columns correspond to the ordered positions of the residues along the chain. The graphical representation of such a matrix is called a recurrence plot (RP), where a dot is placed if the Euclidean distance between two segments of the chosen physico-chemical property falls below a predefined radius. The fraction of recurrences (dots) in the RP is called REC, and the percentage of recurrent points forming lines of at least a predefined length parallel to the main diagonal is called determinism (DET). This last index was recently discussed by Marwan et al. [8] in the realm of dynamical systems theory and found to be directly linked to the presence of “rule obeying” patterns in time series. Additional variables included in the calculation are MAXL (length of the longest diagonal line); LAM (percentage of recurrence points which form vertical lines); ENT (Shannon entropy of the histogram of line lengths); TREND (paling of the RP towards its edges); and TT, trapping time (average length of vertical lines).
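As an illustration of the procedure just described, the following minimal Python sketch builds a recurrence matrix from a hydrophobicity profile and computes REC. It is not the software used for the analysis reported here. The Kyte-Doolittle hydropathy values stand in for the Miyazawa and Jernigan scale, which is not reproduced in this Letter; the function names, the exclusion of the main diagonal from the mean distance and from REC, and the use of "less than or equal to" for the radius test are all assumptions made for the example.

```python
import numpy as np

# Kyte-Doolittle hydropathy values: a stand-in scale for illustration only.
# The analysis in the text uses the Miyazawa-Jernigan scale instead.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def embed(profile, m):
    """Return every window of m consecutive values as one row."""
    x = np.asarray(profile, dtype=float)
    n = len(x) - m + 1
    return np.stack([x[i:i + m] for i in range(n)])


def recurrence_matrix(profile, m, radius_fraction):
    """Boolean matrix marking pairs of windows closer than the radius.

    The radius is radius_fraction times the mean Euclidean distance
    between distinct windows (main diagonal excluded: an assumption).
    """
    windows = embed(profile, m)
    diff = windows[:, None, :] - windows[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    off_diagonal = ~np.eye(dist.shape[0], dtype=bool)
    radius = radius_fraction * dist[off_diagonal].mean()
    return dist <= radius


def percent_recurrence(rec):
    """REC: percentage of recurrent points in the upper triangle."""
    n = rec.shape[0]
    upper = np.triu(rec, k=1)
    return 100.0 * upper.sum() / (n * (n - 1) / 2)


if __name__ == "__main__":
    sequence = "AILAILAILAIL"  # toy periodic sequence, not real data
    profile = [KD[residue] for residue in sequence]
    rec = recurrence_matrix(profile, m=4, radius_fraction=0.2)
    print(f"REC = {percent_recurrence(rec):.1f}%")
```

The toy sequence is periodic, so windows separated by one full period are identical and are marked as recurrent regardless of the radius chosen.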
We set the length of segments along protein sequences to 4, and the radius to 20% of the mean Euclidean distance between segments. The minimum number of consecutive recurrences to be considered as deterministic was set to 2.
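The role of the minimum line length can be made concrete with a short Python sketch that extracts DET, MAXL and ENT from a boolean recurrence matrix. It is an illustrative reading of the definitions given in the text, not the implementation used for the reported results: the function names are hypothetical, only the upper triangle is scanned, the main diagonal is ignored, and the entropy is expressed in bits (base-2 logarithm), all of which are assumptions.

```python
import numpy as np


def diagonal_line_lengths(rec):
    """Lengths of all runs of recurrent points along the diagonals
    above the main diagonal (the matrix is assumed symmetric)."""
    rec = np.asarray(rec, dtype=bool)
    lengths = []
    for offset in range(1, rec.shape[0]):
        run = 0
        for hit in np.diagonal(rec, offset=offset):
            if hit:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths


def line_metrics(rec, min_line=2):
    """DET, MAXL and ENT from diagonal lines of at least min_line points."""
    lengths = np.array(diagonal_line_lengths(rec), dtype=int)
    total_points = lengths.sum()
    lines = lengths[lengths >= min_line]
    # DET: percentage of recurrent points that belong to diagonal lines.
    det = 100.0 * lines.sum() / total_points if total_points else 0.0
    # MAXL: length of the longest diagonal line.
    maxl = int(lines.max()) if lines.size else 0
    # ENT: Shannon entropy of the histogram of line lengths, in bits.
    if lines.size:
        _, counts = np.unique(lines, return_counts=True)
        p = counts / counts.sum()
        ent = float(-(p * np.log2(p)).sum())
    else:
        ent = 0.0
    return {"DET": det, "MAXL": maxl, "ENT": ent}


if __name__ == "__main__":
    # Hand-built 6 x 6 example: runs of length 3 and 1 on the first
    # off-diagonal and one run of length 2 on the second.
    rec = np.eye(6, dtype=bool)
    for i, j in [(0, 1), (1, 2), (2, 3), (4, 5), (0, 2), (1, 3)]:
        rec[i, j] = rec[j, i] = True
    print(line_metrics(rec, min_line=2))
```

With min_line set to 2, as in the text, isolated recurrent points count towards REC but not towards DET, MAXL or ENT.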
3 Results and discussion
Of immediate interest was the finding that the average MAXL value was 6.18 ± 1.71 (SD) (Fig. 1), which approximates Strait and Dewey's finding [9] that the maximum entropy value of amino acid patches in proteins is achieved with six-residue “words”. This is not to say that different word lengths cannot occur; rather, this length conveys the most information per patch length, thus representing the minimal patch length encapsulating the maximal “information complexity”. It is worth noting that this peculiar character of six-residue words was also recognized in a completely independent analysis we performed on a different protein data set with different RQA measurement parameters [4], which makes a database selection bias unlikely. Of additional interest was the finding that RQA ENT also reached its maximum at a MAXL of six. Since this ENT is based on the line length distribution (as opposed to the amino acid distribution of Strait and Dewey), there is the suggestion that entire protein chains are maximizing their information content with six-residue windows (Fig. 2).


The plot of MAXL vs. protein sequence length (Fig. 3) shows that the length of deterministic words (i.e. words repeated in different portions of the molecule) does not depend on the length of the protein sequence. Rather, MAXL, and in particular a MAXL of six residues, is spread across the whole length distribution. This is exactly what we expect of real words, whose length is independent of the length of the text in which they are embedded.

The relevance of a characteristic length of six residues is supported by other avenues of research. Schwartz et al. [10] demonstrated a strong bias against long blocks of hydrophobic residues, with deviations from expected frequencies beginning at a block length of about six residues. In another study evaluating the potential for protein “knotting,” Lua and Grosberg [11] point out that knots are relatively rare, and that chains beyond six residues quickly increase their chances of interpenetration, thus promoting aggregation. Finally, an analysis of the nucleation cores found by Compiani et al. [12] on the basis of a structural entropy criterion shows an average length of 6.12 [4, 12]. This is a crucial point: Compiani et al. looked for sequence elements that maintained their local folding irrespective of the sequence in which they are embedded. The characteristic length of 6.12 of these elements points to the fact that six-residue patches maintain their “individual features”, thus possibly giving rise to mutual recognition both inside the same protein (folding cores) and between different proteins (aggregation cores).
In their most recent paper, Schwartz and King [13] have pointed out several factors limiting long patches. The most obvious is the promotion of aggregation, but the position of the patch in the string was also suggested. What was not clearly identified is that net charge may also be a limiting factor, as Chiti et al. have pointed out [14].


4 Conclusion
Based on these observations, we are led to conclude that patches of approximately six residues contain the maximum information for hydrophobicity nucleation of globular proteins. Although longer “words” occur, their greater length increases the probability of interpenetration and aggregation. The one additional variable to consider is net charge, which typically occurs when no deterministic patches are available, thus also promoting aggregation. Aggregated proteins would have died out simply because they were useless.
Although “information” entropy is commonly used as a measure of complexity, it should be recalled that at least according to one prominent physicist, the entire universe is composed of the flow of information bits [20]. There is no reason to suggest that the biomolecular scale does not also derive its function in a similar way. Proteins interact with themselves as well as with other proteins because they provide “signals;” i.e., information triggering characteristic motions. Units of biological signal transduction vary (e.g., voltage and pressure), and in the case of proteins, six amino acid words may be the characteristic unit.
This concept of “meaning” may also be related to the elicitation of an immune response: if a peptide elicits an immune response when injected into a host, this means it retains a sort of “signature” or “trade mark” of the biological system it comes from, and is recognized as “non-self” by the host. There is some evidence that the minimum length necessary for a peptide to elicit an allergenic response and molecular mimicry (a patch of a protein eliciting an immune response equivalent to that of the entire protein) is around six [21]. Moreover, the FAO requirement for food safety relative to an allergenic response begins at a minimum six-residue homology with a known allergen [22]. This implies the existence of a fragment of biologically meaningful information about six residues in length, and opens the way to extremely interesting speculations on the origin of the protein world from the recognition and subsequent aggregation of ancestral hexapeptides.
Acknowledgements
This work was supported by a joint DMS/DGMS initiative to support mathematical biology, from the National Science Foundation and National Institutes of Health (NSF DMS #0240230), to J.P.Z.
Appendix A
Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.febslet.2006.07.076.




