A genomics method to identify pathogenicity-related proteins. Application to aminoacyl-tRNA synthetase-like proteins
Abstract
During their extended evolution, genes coding for aminoacyl-tRNA synthetases (ARS) have experienced numerous instances of duplication, insertion and deletion of domains. The ARS-related proteins that have resulted from these genetic events are generally known as aminoacyl-tRNA synthetase-like proteins (ARS-like). This heterogeneous group of polypeptides carries out an equally varied set of functions that need not be related to gene translation. At least 16 different ARS-like proteins have been identified to date, but several remain uncharacterized and the functions of the others are incompletely understood. Here we review the individual phylogenetic distribution of these proteins in bacteria, and apply a new genomics method to determine their potential implication in pathogenicity.
1 Introduction
Aminoacyl-tRNA synthetases represent an extraordinary example of functional and structural conservation [1]. Across all living species most of these enzymes display an almost identical structure, providing one of the few cases where phylogenetic and structural analyses can be expected to yield information about the first evolutionary steps of cellular life on earth [2-4]. As would be expected from a large group of enzymes with complicated modular structures and extremely long evolutionary lives, a large group of related proteins has formed as a result of total or partial duplications of ARS genes [5, 6]. In addition, some ARS-like proteins may be coded by ancestral genes that were later fused to a pre-existing ARS. Differentiating between these two possibilities can be difficult.
Functionally speaking, ARS-like proteins are not a homogeneous class. However, a global analysis of their distribution is interesting because it provides information on the evolutionary history of ARS, and it might help to identify tendencies in the functional roles that ARS-related domains adopt when they diverge from their ancestral enzymes. Moreover, the species distribution of each ARS-like protein is likely to provide information on its biological role. More specifically, the search for correlations between gene distribution and complex biological phenotypes can be a powerful tool for the identification of biological function.
Here we combine the analysis of the phylogenetic distribution of bacterial ARS-like proteins with a simple and rapid algorithm for the identification of proteins that are over-represented in human pathogenic organisms. First, we have re-examined the different ARS-like proteins found in bacteria, clustering them according to a sequence-similarity profile. Secondly, we have analyzed whether each of the 11 bacterial ARS-like proteins that we obtain is functionally linked to bacterial virulence (Fig. 1). Our method positively identifies AsnA as over-represented in pathogenic species. AsnA has already been described as important in bacterial pathogens of plants and animals [7, 8]. We suggest that its importance in infection may extend to human microbial infections.
2 Methods
2.1 Protein profile generation and determination of phylogenetic distributions
We selected 16 well-documented ARS-like proteins for our study (Table 1). For each of them, a multiple alignment was built with ClustalW [29] using the Gonnet protein matrix, and a hidden Markov model profile was then built from the alignment using the HMMER package [30]. Each protein profile was used as a query to find all existing homologues in the UniProt database (www.uniprot.org). In order to apply a consistent criterion to the determination of each protein's distribution, we applied a cutoff value to the search for homologues (per-sequence E-value cutoff of 10.0). This procedure identified clusters of proteins that were considered evolutionarily related and treated as a single family. Those families present in bacteria were selected for further analysis. The distribution found for each bacterial ARS-like family was displayed graphically by counting all its homologous sequences in the main bacterial phyla and representing these frequencies on a model phylogenetic tree of bacteria [31] (Fig. 2).
Synthetase-like | aaRS paralog | Reference |
---|---|---|
Ybak | ProRS | [9, 10] |
HisZ | HisRS | [11, 12] |
AlaX | AlaRS | [5, 13] |
PrdX (ProX) | ProRS | [13, 14] |
GluX (YadB) | GluRS | [15] |
CTP | Class I ARS | [16] |
ATPS | Class I ARS | [16] |
EMAP-II | MetRS, TyrRS | [17-19] |
Arc1p | MetRS, TyrRS | [20] |
Trbp111 | MetRS, TyrRS | [21] |
BirA | SerRS | [22, 23] |
AsnA | AspRS, AsnRS | [24, 25] |
ThrRS-ed | ThrRS | [26] |
Gcn2 | HisRS | [11] |
Pol gamma B | GlyRS | [27] |
PoxA/GenX | LysRS | [28] |
Since a protein of a given species may be represented more than once in the Uniprot database – e.g. same protein from different strains –, only semi-quantitative values can be obtained from this analysis. Nevertheless, the calculation is accurate enough to provide an estimation of the distribution of each ARS-like protein among bacterial phyla.
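The counting step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the `(accession, species, phylum)` tuple layout, the function name `phylum_counts` and the example records are all hypothetical. The optional species-level deduplication shows one way the strain redundancy mentioned above could be reduced, at the cost of turning counts into a presence/absence view.

```python
from collections import Counter

def phylum_counts(hits, dedupe_by_species=False):
    """Count homologous sequences per phylum.

    hits: iterable of (accession, species, phylum) tuples.
    This layout is an assumption made for illustration.
    """
    seen_species = set()
    counts = Counter()
    for accession, species, phylum in hits:
        if dedupe_by_species:
            # Keep only the first hit seen for each species.
            if species in seen_species:
                continue
            seen_species.add(species)
        counts[phylum] += 1
    return counts

# Hypothetical hits for one ARS-like family.
hits = [
    ("P1", "Escherichia coli", "Proteobacteria"),
    ("P2", "Escherichia coli", "Proteobacteria"),  # second strain, same species
    ("P3", "Bacillus subtilis", "Firmicutes"),
]
raw = phylum_counts(hits)
per_species = phylum_counts(hits, dedupe_by_species=True)
```

With the raw counts, the two E. coli entries inflate the Proteobacteria total, which is why only semi-quantitative conclusions are drawn from database-wide searches.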
2.2 Correlation analysis of protein distributions and pathogenicity
2.2.1 Database preparation and construction of the set of human pathogens
In order to identify proteins over-represented in pathogenic species the curated set of complete proteomes from the Integr8 database (2069 complete proteomes) was used (www.ebi.ac.uk/integr8). This collection was further modified to obtain our final proteome dataset (viral proteomes were removed and only one proteome per species was used) of 910 complete proteomes.
In over-representation studies a carefully curated dataset is essential to avoid artificial over-representation of data (e.g. fragments of proteins, point mutations, more than one strain per species) that leads to unreliable enrichment values. From the final dataset of 910 complete proteomes, 168 were identified as belonging to human pathogens. This was done with the help of different curated databases: HAMAP database (http://www.expasy.ch/sprot/hamap/), pathogenic bacteria database (bac.hs.med.kyoto-u.ac.jp), national microbiology pathogen data resource (www.nmpdr.org), pathogenic fungi database (www.pfdb.net), eukaryotic pathogens database (eupathdb.org), and pathogen portal (www.pathogenportal.org). The final list of human pathogens includes 146 bacteria, 11 fungi and 12 protozoa.
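The curation steps above (dropping viral proteomes, keeping one proteome per species, and flagging human pathogens) might look like the sketch below. The record fields (`species`, `kingdom`), the rule of keeping the first proteome encountered for a species, and the example names are assumptions for illustration; the text does not say which strain was retained.

```python
def build_dataset(proteomes, pathogen_species):
    """Return {species: is_human_pathogen} for the curated dataset.

    proteomes: list of dicts with 'species' and 'kingdom' keys (hypothetical layout).
    pathogen_species: set of species names compiled from curated pathogen lists.
    """
    dataset = {}
    for record in proteomes:
        if record["kingdom"] == "Viruses":
            continue  # viral proteomes are removed
        species = record["species"]
        if species in dataset:
            continue  # assumption: keep only the first proteome per species
        dataset[species] = species in pathogen_species
    return dataset

# Hypothetical input records.
proteomes = [
    {"species": "Vibrio cholerae", "kingdom": "Bacteria"},
    {"species": "Vibrio cholerae", "kingdom": "Bacteria"},  # second strain
    {"species": "Bacillus subtilis", "kingdom": "Bacteria"},
    {"species": "Phage lambda", "kingdom": "Viruses"},
]
dataset = build_dataset(proteomes, {"Vibrio cholerae"})
n_pathogens = sum(dataset.values())
```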
2.2.2 Construction of a control dataset
Both positive and negative controls were included in the study for external validation of the method. Negative controls used are proteins not expected to be over-represented in human pathogens (tubulin, enolase, alanyl-tRNA synthetase, lactate dehydrogenase, and pyruvate dehydrogenase). Positive controls were built with proteins known to be linked to pathogenicity – e.g. virulence factors – (haemolysin, gamma-glutamyl transpeptidase, CapC, fim2 fimbrial subunit precursor, lipopolysaccharide transferase, sycE secretion chaperone, heme exporter protein CcmC, long polar fimbrial chaperone, adhesin, cholera enterotoxin, streptococcal exotoxin I and lipoteichoic acid synthase).
2.2.3 Calculation of over-representation indices
Although both ratios (ER-proteins and ER-species) quantify the over-representation of a given protein among pathogen species, they may produce different values because a species can have one or more homologues of the queried protein. Thus, enrichment must be quantified both in terms of number of proteins and number of species.
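The defining formulas for the two ratios are not reproduced in this text, so the sketch below rests on an assumed but common definition: each ratio divides the fraction of hits found in pathogens by the fraction of the dataset that pathogens represent, counted either in proteins or in species. Under that assumption the largest possible ER-species is 910/168 ≈ 5.42, which matches the largest ER-species value reported in Table 2. The function name, argument layout and toy data are hypothetical.

```python
def enrichment_ratios(hits_per_species, proteome_sizes, pathogens):
    """Assumed definitions of ER-proteins and ER-species.

    hits_per_species: {species: number of homologues of the query protein}
    proteome_sizes:   {species: number of proteins in its proteome}
    pathogens:        set of species flagged as human pathogens
                      (assumed to be a subset of proteome_sizes)
    """
    # Protein-level ratio: share of hits that fall in pathogen proteomes,
    # relative to the share of all proteins that belong to pathogens.
    total_hits = sum(hits_per_species.values())
    pathogen_hits = sum(n for s, n in hits_per_species.items() if s in pathogens)
    total_proteins = sum(proteome_sizes.values())
    pathogen_proteins = sum(n for s, n in proteome_sizes.items() if s in pathogens)
    er_proteins = (pathogen_hits / total_hits) / (pathogen_proteins / total_proteins)

    # Species-level ratio: share of hit-carrying species that are pathogens,
    # relative to the share of all species that are pathogens.
    carriers = {s for s, n in hits_per_species.items() if n > 0}
    er_species = (len(carriers & pathogens) / len(carriers)) / (
        len(pathogens) / len(proteome_sizes)
    )
    return er_proteins, er_species

# Toy dataset: four species of equal proteome size, two of them pathogens.
sizes = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}
hits = {"A": 2, "B": 1, "C": 1}  # species A carries two homologues
er_proteins, er_species = enrichment_ratios(hits, sizes, {"A", "B"})
```

In the toy data the two ratios differ (1.5 versus about 1.33) only because species A carries two homologues, which illustrates why both measures are reported.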
3 Results
3.1 Distribution of bacterial ARS-like proteins
Analysis of the phylogenetic distributions among the different bacterial phyla was performed for the complete set of ARS-like proteins (Fig. 2). Of the 16 ARS-like proteins initially analyzed (Table 1), Arc1p, Gcn2, ThrRS-ed, Polγβ and AlaX2 were excluded because their distribution was found to be limited to eukarya (Gcn2 and Arc1p), archaea (ThrRS-ed), or eukarya and archaea (AlaX2, Polγβ). EMAP-II and Trbp111 sequences were merged into a single class because 90% of the sequences identified as Trbp111 are also present in the EMAP-II profile. The distributions of the resulting 11 ARS-like proteins present in bacterial phyla are shown in Fig. 2. Minority phyla have not been represented in order to simplify the presentation of the results.
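The overlap check behind the merging of the two profiles can be illustrated with a short sketch. The text reports the observed overlap (90%) but does not state a formal merging rule, so the `threshold` parameter, the function names and the accession identifiers below are assumptions made for illustration only.

```python
def overlap_fraction(family_a, family_b):
    """Fraction of the sequences in family_a that are also found in family_b."""
    a, b = set(family_a), set(family_b)
    return len(a & b) / len(a) if a else 0.0

def merge_if_redundant(family_a, family_b, threshold=0.9):
    """Return the union of both families if family_a is largely contained
    in family_b, otherwise None. The threshold is an assumed decision rule."""
    if overlap_fraction(family_a, family_b) >= threshold:
        return set(family_a) | set(family_b)
    return None

# Hypothetical accession sets: 9 of the 10 sequences in the smaller
# family are also retrieved by the larger profile.
small_family = {f"S{i}" for i in range(10)}
large_family = {f"S{i}" for i in range(9)} | {f"L{i}" for i in range(20)}
fraction = overlap_fraction(small_family, large_family)
merged = merge_if_redundant(small_family, large_family)
```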
3.2 Identification of pathogenicity-related ARS-like proteins
We have constructed a simple and fast algorithm to determine whether a given protein is significantly over-represented in pathogenic organisms, and we have applied the method to bacterial ARS-like proteins. We consider a protein pathogenicity-related if it is over-represented in a set of proteomes from human pathogens compared with what would be expected by chance.
We computed the enrichment values (ER-proteins and ER-species, see Section 2), both for the set of controls and for the ARS-like proteins (Table 2). By plotting the enrichment rates (Fig. 3), we can clearly distinguish two differently distributed populations, corresponding to the negative and positive controls. The negative control distribution is centered around ER-proteins = 1 and ER-species = 1, whereas the positive control distribution (pathogenicity-related) has a higher variance and ranges from non-enrichment values to high enrichment values. ARS-like proteins mostly fall within the negative control distribution, with the exception of AsnA, which clusters with pathogenicity-related proteins.
Protein | ER-proteins | ER-species |
---|---|---|
Negative controls | ||
Tubulin | 1.1 | 1.55 |
Enolase | 1.01 | 0.96 |
Alanyl-tRNA synthetase | 1.14 | 1.02 |
Lactate dehydrogenase | 0.89 | 0.86 |
Pyruvate dehydrogenase | 0.93 | 0.92 |
Positive controls | ||
Lipoteichoic acid synthase | 1.24 | 1.17 |
Adhesin yadA | 6.17 | 5.42 |
Haemolysin | 2.06 | 1.81 |
Glutamyl transpeptidase | 0.95 | 0.91 |
CapC | 1.68 | 1.55 |
Fimbrial subunit precursor | 2.26 | 2.71 |
LPS transferase | 2.92 | 2.56 |
SycE secretion chaperone | 3.08 | 2.71 |
Heme exporter protein | 1.32 | 1.17 |
Fimbrial chaperone | 2.46 | 1.87 |
Cholera enterotoxin | 6.17 | 5.42 |
Streptococcal exotoxin | 6.17 | 5.42 |
Coagulase | 5.11 | 4.42 |
HifA – pilin | 3.08 | 2.71 |
ARS-like proteins | ||
AlaX | 0.5 | 0.51 |
ThrX | 1.11 | 1 |
AsnA | 2.53 | 1.79 |
ATPS | 0.76 | 0.69 |
BirA | 0.97 | 0.87 |
CTP | 1 | 0.91 |
GluX | 1.21 | 1.43 |
HisZ | 0.44 | 0.4 |
PoxA | 1.29 | 0.97 |
PrdX (ProX) | 0.86 | 0.77 |
Ybak | 1.2 | 1.06 |
EMAP-II | 1.13 | 0.98 |
Significance testing on the distribution results for AsnA was performed using a one-tailed test as described above. Since the ER-proteins mean for the negative controls is 1.014 ± 0.107, the thresholds corresponding to false positive (FP) rates of 5%, 1% and 0.1% are 1.19, 1.26 and 1.34, respectively. Taking this into account, AsnA (ER-proteins = 2.53) is not a member of the negative control distribution, with a P-value that approaches zero even at 0.1% FP. Thus, our results suggest that AsnA might be correlated with pathogenicity. GluX deviates slightly from the negative control set; however, the ER-proteins and ER-species values for GluX are below their respective cutoffs for a 1% false positive rate. We therefore conclude that this deviation is not statistically significant and that GluX is not over-represented in human pathogens.
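The thresholds quoted above can be reproduced from the negative-control ER-proteins values in Table 2 if one assumes that the ± value is the sample standard deviation and that the negative controls follow a normal distribution. Both points are inferences from the reported numbers rather than statements in the text, and the variable names below are illustrative.

```python
from statistics import NormalDist, mean, stdev

# ER-proteins values of the five negative controls (Table 2).
negative_controls = [1.10, 1.01, 1.14, 0.89, 0.93]

mu = mean(negative_controls)      # about 1.014
sigma = stdev(negative_controls)  # sample standard deviation, about 0.107

# One-tailed cutoffs: mu + z * sigma, where z is the standard normal
# quantile for each accepted false positive rate (assumed normal model).
standard_normal = NormalDist()
thresholds = {
    fp: mu + standard_normal.inv_cdf(1 - fp) * sigma
    for fp in (0.05, 0.01, 0.001)
}

# One-tailed P-value for AsnA (ER-proteins = 2.53) under the same model.
asna_p_value = 1 - NormalDist(mu, sigma).cdf(2.53)

# GluX (ER-proteins = 1.21) lies above the 5% cutoff but below the 1% cutoff.
glux_exceeds_1pct = 1.21 > thresholds[0.01]
```

With only five negative controls the normal approximation is coarse, so the cutoffs are best read as rough guides rather than exact significance levels.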
4 Discussion
The evolutionary relationships between ARS and ARS-like proteins have been analyzed previously through the use of phylogenetic methods [3, 34, 35]. This approach represents the best available strategy for the identification of cladistic relationships, but it is easily confounded by the extremely long evolutionary times experienced by aminoacyl-tRNA synthetases and their related proteins. Irrespective of clade relationships, the species distribution of genes represents important information that can be linked to function and, indirectly, to evolutionary origin. Here we have analyzed the distribution of ARS-like protein families in bacteria and built a simple algorithm to analyze correlations between the distribution of a given protein and the pathogenicity of the species where it is present. The 11 ARS-like protein families that we have analyzed display very different distribution patterns among bacterial phyla. Broadly, we can distinguish between proteins that are universally or almost universally present, those that are present in the majority of phyla, and those that are present only in a minority of the main bacterial groups.
A wide distribution of a protein possibly reflects an ancient origin of the gene, but lateral gene transfer, which is particularly widespread among bacteria, should always be considered as an alternative explanation. This is the case for the proteins CTP, EMAP II, YadB, HisZ, and PoxA. Among this group are enzymes whose function is completely unrelated to gene translation (CTP, HisZ, and PoxA) and others that remain linked to tRNA biology (EMAP II and YadB). Interestingly, PoxA is a well-known pathogenicity factor in Salmonella [28]. However, its wide distribution suggests that its biological function is not exclusively linked to the establishment of infection, and the protein does not appear to be over-represented in pathogenic species (Table 2, Fig. 3). Obviously, negative values for enrichment in pathogens do not eliminate the possibility that a protein is a virulence factor. However, significant positive enrichment rates should be indicative of proteins whose function is pathogenicity-related.
Abundant but not universally distributed bacterial ARS-like families represent an important fraction of the set analyzed here (AlaX, ATPS, BirA, YbaK). Interestingly, two trans editing domains are present in this group, indicating that the need for misacylation correction may not be universal among bacteria. The scattered distribution of these enzymes may suggest that lateral gene transfer occurred among those species where the fidelity of the genetic code is particularly compromised and benefits from the function of in-trans editing domains [26].
It should be stressed that this situation need not be related to the specific kinetic behavior of the ARS concerned, but can be caused by environmental conditions that, for instance, change the relative availability of similar amino acids. This situation would clearly favor the lateral transfer of these genes among species under similar environmental stresses.
Finally, a small set of proteins (AsnA and PrdX) shows a very limited distribution among bacteria. PrdX was originally described as the trans-editing enzyme ProX from Clostridium sticklandii, and shown to specifically deacylate alanyl-tRNAPro [13, 36]. PrdX and YbaK are two different trans-editing enzymes that hydrolyze different forms of mischarged tRNAPro [13]. Consistent with previous reports, YbaK and PrdX groups do not overlap in our analysis. However, they do display overlapping distributions at the phylum level, as would be expected from two independent editing domains that recognize different substrates. Despite its more limited distribution, PrdX is not over-represented in pathogenic bacteria (Fig. 3).
Asparagine synthetase (AsnA) is a paralog of asparagine- and aspartyl-tRNA synthetases that displays a limited distribution among bacterial phyla. AsnA is unique among the ARS-like proteins analyzed here because it is significantly over-represented in human pathogenic bacteria. AsnA has been shown to act as a virulence factor in fish and plant pathogens, although the molecular bases for this role in virulence remain unknown [7, 8]. From our data it is reasonable to predict that AsnA may also be a virulence factor among human pathogens that, as such, deserves further analysis and consideration as a potential therapeutic target.
Acknowledgements
This work has been supported by Grants BIO2006-01551 from the , and HEALTH-F3-2009-223024 (MEPHITIS) from the European Union.