
Volume 584, Issue 2 p. 460-466
Short communication

A genomics method to identify pathogenicity-related proteins. Application to aminoacyl-tRNA synthetase-like proteins

Eva Maria Novoa
Institute for Research in Biomedicine (IRB), c/ Baldiri Reixac 15-21, 08028 Barcelona, Spain

Manuel Castro de Moura
Institute for Research in Biomedicine (IRB), c/ Baldiri Reixac 15-21, 08028 Barcelona, Spain

Modesto Orozco
Institute for Research in Biomedicine (IRB), c/ Baldiri Reixac 15-21, 08028 Barcelona, Spain

Lluís Ribas de Pouplana (corresponding author)
Institute for Research in Biomedicine (IRB), c/ Baldiri Reixac 15-21, 08028 Barcelona, Spain
Catalan Institution for Research and Advanced Studies (ICREA), Passeig Lluís Companys 23, 08010 Barcelona, Spain
Corresponding author. Address: Institute for Research in Biomedicine (IRB), c/ Baldiri Reixac 15-21, 08028 Barcelona, Spain.

First published: 12 November 2009

Abstract

During their extended evolution, genes coding for aminoacyl-tRNA synthetases (ARS) have experienced numerous instances of duplication, insertion and deletion of domains. The ARS-related proteins that have resulted from these genetic events are generally known as aminoacyl-tRNA synthetase-like proteins (ARS-like). This heterogeneous group of polypeptides carries out an equally varied set of functions that need not be related to gene translation. At least 16 different ARS-like proteins have been identified to date, but several remain uncharacterized and their functions are incompletely understood. Here we review the individual phylogenetic distribution of these proteins in bacteria, and apply a new genomics method to determine their potential implication in pathogenicity.

1 Introduction

Aminoacyl-tRNA synthetases represent an extraordinary example of functional and structural conservation [1]. Across all living species most of these enzymes display an almost identical structure, providing one of the few cases where phylogenetic and structural analyses can be expected to yield information about the first evolutionary steps of cellular life on Earth [2-4]. As would be expected from a large group of enzymes with complicated modular structures and extremely long evolutionary lives, a large group of related proteins has formed as a result of total or partial duplications of ARS genes [5, 6]. In addition, some ARS-like proteins may be coded by ancestral genes that were later fused to a pre-existing ARS. Differentiating between these two possibilities can be difficult.

Functionally speaking ARS-like proteins are not a homogeneous class. However, a global analysis of their distribution is interesting because it provides information on the evolutionary history of ARS, and it might help to identify tendencies in the functional roles that ARS-related domains adopt when they diverge from their ancestral enzymes. Moreover, the species distribution of each ARS-like protein is likely to provide information on its biological role. More specifically, the search for correlations between gene distribution and complex biological phenotypes can be a powerful tool for the identification of biological function.

Here we combine the analysis of the phylogenetic distribution of bacterial ARS-like proteins with a simple and rapid algorithm for the identification of proteins that are over-represented in human pathogenic organisms. First, we have applied our method to re-examine the different ARS-like proteins found in bacteria, clustering them according to a sequence-similarity profile. Secondly, we have analyzed whether each of the 11 bacterial ARS-like proteins that we obtain is functionally linked to bacterial virulence (Fig. 1). Our method positively identifies AsnA as over-represented in pathogenic species. AsnA has already been described as important in bacterial pathogens of plants and animals [7, 8]. We suggest that its importance in infection may be extended to human microbial infections.

Fig. 1. Schematic representation of the over-representation analysis performed in this work.

2 Methods

2.1 Protein profile generation and determination of phylogenetic distributions

We selected 16 well-documented ARS-like proteins for our study (Table 1). For each of them, a multiple alignment was built with ClustalW [29] using the Gonnet protein matrix, and a hidden Markov model profile was then built from the alignment using the HMMER package [30]. Each protein profile was used as a query to find all existing homologues in the UniProt database (www.uniprot.org). To determine the distribution of each protein under a consistent criterion, we applied the same cutoff to every homologue search (per-sequence E-value cutoff of 10.0). This procedure identified clusters of proteins that were considered evolutionarily related and treated as a single family. Those families present in bacteria were selected for further analysis. The distribution found for each bacterial ARS-like family was displayed graphically by counting all its homologous sequences in the main bacterial phyla and representing these frequencies on a model phylogenetic tree of bacteria [31] (Fig. 2).

Fig. 2. Phylogenetic distribution and relative abundance of the 11 bacterial ARS-like proteins considered in this work. Each tree is labeled according to the protein whose distribution is being analyzed. The tree labeled UNIPROT shows the number of proteins per phylum that are included in the database used. The relative abundance (r.a.) of each protein in each phylum is represented by a colored circle at the end of the phylum's branch: blue (r.a. ≥ 25), salmon (10 ≤ r.a. < 25) and yellow (0 < r.a. < 10). Only bacterial relative abundances are shown.

Table 1. List of the 16 ARS-like proteins considered in this study

Synthetase-like | aaRS paralog | Reference
YbaK | ProRS | [9, 10]
HisZ | HisRS | [11, 12]
AlaX | AlaRS | [5, 13]
PrdX (ProX) | ProRS | [13, 14]
GluX (YadB) | GluRS | [15]
CTP | Class I ARS | [16]
ATPS | Class I ARS | [16]
EMAP-II | MetRS, TyrRS | [17-19]
Arc1p | MetRS, TyrRS | [20]
Trbp111 | MetRS, TyrRS | [21]
BirA | SerRS | [22, 23]
AsnA | AspRS, AsnRS | [24, 25]
ThrRS-ed | ThrRS | [26]
Gcn2 | HisRS | [11]
Pol gamma B | GlyRS | [27]
PoxA/GenX | LysRS | [28]

Because not all bacterial phyla are equally represented in the UniProt database, the counts obtained for each ARS-like protein were standardized so that the final values are comparable among bacterial phyla. The relative abundance of each protein in a phylum was computed by dividing the number of protein hits found in that phylum by the total number of proteins found for the phylum in the UniProt database:
\[ \mathrm{r.a.}(X, \text{phylum}) = \frac{\text{number of hits of protein } X \text{ in the phylum}}{\text{total number of proteins of the phylum in UniProt}} \qquad (1) \]

Since a protein of a given species may be represented more than once in the UniProt database (e.g. the same protein from different strains), only semi-quantitative values can be obtained from this analysis. Nevertheless, the calculation is accurate enough to provide an estimate of the distribution of each ARS-like protein among bacterial phyla.
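
The standardization described above can be sketched in a few lines of Python. This is only an illustration of the ratio stated in the text: the function name, the phylum totals and the hit counts are invented, and no scaling factor is applied because none is given in the text.

```python
from collections import Counter


def relative_abundance(hit_phyla, phylum_totals):
    """Hits of one protein family per phylum, divided by the total
    number of database proteins assigned to that phylum."""
    hits = Counter(hit_phyla)
    return {
        phylum: hits.get(phylum, 0) / total
        for phylum, total in phylum_totals.items()
        if total > 0  # skip phyla with no proteins in the database
    }


# Toy numbers, invented for illustration only (not from the study).
phylum_totals = {"Proteobacteria": 200_000, "Firmicutes": 100_000, "Chlamydiae": 5_000}
# One entry per homologue found, labeled with the phylum of its source organism.
hit_phyla = ["Proteobacteria"] * 40 + ["Firmicutes"] * 5

ra = relative_abundance(hit_phyla, phylum_totals)
for phylum, value in ra.items():
    print(f"{phylum}: {value:.6f}")
```

Dividing by the phylum total keeps a heavily sequenced phylum from looking enriched simply because it contributes more entries to the database.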

2.2 Correlation analysis of protein distributions and pathogenicity

2.2.1 Database preparation and construction of the set of human pathogens

To identify proteins over-represented in pathogenic species, we used the curated set of complete proteomes from the Integr8 database (2069 complete proteomes; www.ebi.ac.uk/integr8). This collection was filtered by removing viral proteomes and keeping only one proteome per species, which gave a final dataset of 910 complete proteomes.

In over-representation studies a carefully curated dataset is essential to avoid artificial over-representation of data (e.g. fragments of the proteins, point mutations, more than one strain per species), which leads to unreliable enrichment values. From the final dataset of 910 complete proteomes, 168 were identified as belonging to human pathogens. This was done with the help of different curated databases: HAMAP database (http://www.expasy.ch/sprot/hamap/), pathogenic bacteria database (bac.hs.med.kyoto-u.ac.jp), national microbiology pathogen data resource (www.nmpdr.org), pathogenic fungi database (www.pfdb.net), eukaryotic pathogens database (eupathdb.org), and pathogen portal (www.pathogenportal.org). The final list of human pathogens includes 146 bacteria, 11 fungi and 12 protozoa.

2.2.2 Construction of a control dataset

Both positive and negative controls were included in the study for external validation of the method. The negative controls are proteins not expected to be over-represented in human pathogens (tubulin, enolase, alanyl-tRNA synthetase, lactate dehydrogenase, and pyruvate dehydrogenase). The positive controls are proteins known to be linked to pathogenicity, such as virulence factors (haemolysin, gamma-glutamyl transpeptidase, CapC, fim2 fimbrial subunit precursor, lipopolysaccharide transferase, sycE secretion chaperone, heme exporter protein CcmC, long polar fimbrial chaperone, adhesin, cholera enterotoxin, streptococcal exotoxin I and lipoteichoic acid synthase).

2.2.3 Calculation of over-representation indices

Protein profiles were built for both controls and test cases (ARS-like proteins) following the same procedure explained above. Each protein profile was compared to our curated set of Integr8 proteomes, to obtain the complete list of homologues for each of the proteins among the 910 proteomes. We considered a protein as pathogenicity-related if it was found over-represented in the set of human pathogens compared to what is expected by chance. Over-representation was measured using two different indices: enrichment rate of the number of proteins (ER-proteins) and enrichment rate of the number of species (ER-species), which are computed as follows:
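\[ \text{ER-proteins}(X) = \frac{\text{number of } X \text{ homologues in pathogen proteomes} \,/\, \text{total number of proteins in pathogen proteomes}}{\text{number of } X \text{ homologues in all proteomes} \,/\, \text{total number of proteins in all proteomes}} \qquad (2) \]
\[ \text{ER-species}(X) = \frac{\text{number of pathogen species with } X \,/\, \text{total number of pathogen species}}{\text{number of species with } X \,/\, \text{total number of species}} \qquad (3) \]
where “X” is the queried protein of interest.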

Although both ratios quantify the over-representation of a given protein among pathogen species, their values may differ because a species can have one or more homologues of the queried protein. Thus, enrichment must be quantified both in terms of number of proteins and number of species.
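
As a rough illustration, both indices can be computed with the same helper. The sketch below assumes that an enrichment rate is the fraction observed in the pathogen set divided by the fraction observed in the whole dataset. That reading is consistent with the largest ER-species value in Table 2 (5.42, which equals 910/168), but the function name and the example counts are hypothetical.

```python
def enrichment_rate(x_in_pathogens, total_in_pathogens, x_in_all, total_in_all):
    """Fraction of X in the pathogen set divided by its fraction in the full set.

    The same helper serves ER-species (counts are species) and
    ER-proteins (counts are proteins); only the units of the inputs change.
    """
    if min(total_in_pathogens, x_in_all, total_in_all) <= 0:
        raise ValueError("totals and the overall count of X must be positive")
    return (x_in_pathogens / total_in_pathogens) / (x_in_all / total_in_all)


# Dataset sizes taken from the text: 910 proteomes, 168 of them human pathogens.
N_SPECIES_ALL = 910
N_SPECIES_PATHOGENS = 168

# Hypothetical protein found in 30 species, 12 of which are pathogens.
er_species = enrichment_rate(12, N_SPECIES_PATHOGENS, 30, N_SPECIES_ALL)

# Hypothetical protein found only in pathogens: the index reaches its ceiling.
er_species_max = enrichment_rate(20, N_SPECIES_PATHOGENS, 20, N_SPECIES_ALL)

print(round(er_species, 2), round(er_species_max, 2))
```

A protein distributed independently of pathogenicity gives a value near 1, while a protein restricted to pathogens gives the ceiling of 910/168 for ER-species, whatever the number of species carrying it.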

Significance testing on the protein distribution results was performed using a one-tailed test, and threshold values were computed for false positive rates (FP) of 1% and 0.1% [32, 33]. In a one-tailed test, the threshold (cutoff) value depends on the FP that we accept:
\[ \text{threshold}_{\mathrm{FP}} = \mu + z_{\mathrm{FP}}\,\hat{\sigma} \qquad (4) \]
\[ \text{threshold}_{1\%} = \mu + 2.33\,\hat{\sigma} \qquad (5) \]
\[ \text{threshold}_{0.1\%} = \mu + 3.09\,\hat{\sigma} \qquad (6) \]
where \(\mu\) is the population mean, \(\hat{\sigma}\) is the estimator of the standard deviation of the population, and \(z_{\mathrm{FP}}\) is the standard normal quantile at \(1 - \mathrm{FP}\). By plotting ER-species as a function of ER-proteins, control proteins that are not linked to pathogenicity should be clustered around the (1, 1) coordinates. A protein that is not over-represented is expected to fall into the normal distribution of the negative controls, with cutoff values that depend on the rate of false positives that we accept.

3 Results

3.1 Distribution of bacterial ARS-like proteins

Analysis of the phylogenetic distributions among the different bacterial phyla was performed for the complete set of ARS-like proteins (Fig. 2). Of the 16 ARS-like proteins initially analyzed (Table 1), Arc1p, Gcn2, ThrRS-ed, Polγβ and AlaX2 were excluded because their distribution was found to be limited to eukarya (Gcn2 and Arc1p), archaea (ThrRS-ed), or eukarya and archaea (AlaX2, Polγβ). EMAP-II and Trbp111 sequences were merged into a single class because 90% of the sequences identified as Trbp111 are also present in the EMAP-II profile. The distributions of the resulting 11 ARS-like proteins present in bacterial phyla are shown in Fig. 2. Minority phyla are not shown, to simplify the presentation of the results.

3.2 Identification of pathogenicity-related ARS-like proteins

We have constructed a simple and fast algorithm to determine whether a given protein is significantly over-represented in pathogenic organisms, and we have applied the method to bacterial ARS-like proteins. We consider a protein as pathogenicity-related if it is over-represented in a set of proteomes from human pathogens compared with what would be expected by chance.

We computed the enrichment values (ER-proteins and ER-species, see Section 2), both for the set of controls and for the ARS-like proteins (Table 2). By plotting the enrichment rates (Fig. 3), we can clearly distinguish two differently distributed populations, corresponding to the negative and positive controls. The negative control distribution is centered around ER-proteins = 1 and ER-species = 1, whereas the positive control distribution (pathogenicity-related) has a higher variance and goes from non-enrichment values to high enrichment values. ARS-like proteins are mainly distributed among the negative control distribution, with the exception of AsnA, which clusters with pathogenicity-related proteins.

Fig. 3. Distribution of over-representation values for all ARS-like proteins (yellow boxes), and positive or negative controls for pathogenicity (red squares and blue diamonds, respectively). The position of AsnA is marked by an arrow and labeled accordingly.

Table 2. Over-representation values for the different ARS-like proteins, including the negative and positive controls used in this study

Protein | ER-proteins | ER-species

Negative controls
Tubulin | 1.1 | 1.55
Enolase | 1.01 | 0.96
Alanyl-tRNA synthetase | 1.14 | 1.02
Lactate dehydrogenase | 0.89 | 0.86
Pyruvate dehydrogenase | 0.93 | 0.92

Positive controls
Lipoteichoic acid synthase | 1.24 | 1.17
Adhesin yadA | 6.17 | 5.42
Haemolysin | 2.06 | 1.81
Glutamyl transpeptidase | 0.95 | 0.91
CapC | 1.68 | 1.55
Fimbrial subunit precursor | 2.26 | 2.71
LPS transferase | 2.92 | 2.56
SycE secretion chaperone | 3.08 | 2.71
Heme exporter protein | 1.32 | 1.17
Fimbrial chaperone | 2.46 | 1.87
Cholera enterotoxin | 6.17 | 5.42
Streptococcal exotoxin | 6.17 | 5.42
Coagulase | 5.11 | 4.42
HifA – pilin | 3.08 | 2.71

ARS-like proteins
AlaX | 0.5 | 0.51
ThrX | 1.11 | 1
AsnA | 2.53 | 1.79
ATPS | 0.76 | 0.69
BirA | 0.97 | 0.87
CTP | 1 | 0.91
GluX | 1.21 | 1.43
HisZ | 0.44 | 0.4
PoxA | 1.29 | 0.97
PrdX (ProX) | 0.86 | 0.77
YbaK | 1.2 | 1.06
EMAP-II | 1.13 | 0.98

Significance testing on the distribution results for AsnA was performed using a one-tailed test as described above. Since the ER-proteins mean for the negative controls is 1.014 ± 0.107, the thresholds corresponding to 5%, 1% and 0.1% FP are 1.19, 1.26 and 1.34, respectively. With an ER-proteins value of 2.53, AsnA lies well above even the 0.1% FP threshold and is therefore not a member of the negative control distribution. Thus, our results suggest that AsnA might be correlated with pathogenicity. GluX deviates slightly from the negative control set; however, the ER-proteins and ER-species values for GluX are below their respective cutoffs for a 1% false positive rate. We therefore conclude that this deviation is not statistically significant and that GluX is not over-represented in human pathogens.
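
The ER-proteins thresholds quoted above can be re-derived from the negative controls in Table 2. The sketch below assumes the sample standard deviation (n − 1 denominator) and standard normal quantiles for the one-tailed cutoff; both are assumptions, chosen because together they reproduce the reported mean, deviation and thresholds. It is not the authors' code.

```python
from statistics import NormalDist, mean, stdev

# ER-proteins values of the negative controls, copied from Table 2.
negative_controls = {
    "Tubulin": 1.1,
    "Enolase": 1.01,
    "Alanyl-tRNA synthetase": 1.14,
    "Lactate dehydrogenase": 0.89,
    "Pyruvate dehydrogenase": 0.93,
}

mu = mean(negative_controls.values())      # 1.014
sigma = stdev(negative_controls.values())  # sample SD, about 0.107


def threshold(fp):
    """One-tailed cutoff: mean plus the normal quantile at 1 - fp times the SD."""
    return mu + NormalDist().inv_cdf(1 - fp) * sigma


thresholds = {fp: round(threshold(fp), 2) for fp in (0.05, 0.01, 0.001)}
print(thresholds)

# ER-proteins values for the two proteins discussed in the text (Table 2).
asna, glux = 2.53, 1.21
print("AsnA above 0.1% cutoff:", asna > threshold(0.001))
print("GluX above 1% cutoff:", glux > threshold(0.01))
```

Under these assumptions the script gives 1.19, 1.26 and 1.34 for the 5%, 1% and 0.1% false positive rates, with AsnA above the strictest cutoff and GluX below the 1% cutoff.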

4 Discussion

The evolutionary relationships between ARS and ARS-like proteins have been analyzed previously through the use of phylogenetic methods [3, 34, 35]. This approach represents the best available strategy for the identification of cladistic relationships, but it is easily confounded by the extremely long evolutionary times experienced by aminoacyl-tRNA synthetases and their related proteins. Irrespective of clade relationships, the species distribution of genes represents important information that can be linked to function and, indirectly, to evolutionary origin. Here we have analyzed the distribution of ARS-like protein families in bacteria and built a simple algorithm to analyze correlations between the distribution of a given protein and the pathogenicity of the species where it is present. The 11 ARS-like protein families that we have analyzed display very different distribution patterns among bacterial phyla. Broadly, we can distinguish between proteins that are universally or almost universally present, those that are present in the majority of phyla, and those that are present only in a minority of the main bacterial groups.

A wide distribution of a protein possibly reflects an ancient origin of the gene, but lateral gene transfer, which is particularly widespread among bacteria, should always be considered as an alternative explanation. This is the case for the proteins CTP, EMAP-II, YadB, HisZ, and PoxA. Among this group are enzymes whose function is completely unrelated to gene translation (CTP, HisZ, and PoxA) and others that remain linked to tRNA biology (EMAP-II and YadB). Interestingly, PoxA is a well-known pathogenicity factor in Salmonella [28]. However, its wide distribution suggests that its biological function is not exclusively linked to the establishment of infection, and the protein does not appear to be over-represented in pathogenic species (Table 2, Fig. 3). Obviously, negative values for enrichment in pathogens do not eliminate the possibility that a protein is a virulence factor. However, significant positive enrichment rates should be indicative of proteins whose function is pathogenicity-related.

Abundant but not universally distributed bacterial ARS-like families represent an important fraction of the set analyzed here (AlaX, ATPS, BirA, YbaK). Interestingly, two trans editing domains are present in this group, indicating that the need for misacylation correction may not be universal among bacteria. The scattered distribution of these enzymes may suggest that lateral gene transfer occurred among those species where the fidelity of the genetic code is particularly compromised and benefits from the function of in-trans editing domains [26].

It should be stressed that this situation need not be related to the specific kinetic behavior of the ARS concerned, but can be caused by environmental conditions that, for instance, change the relative availability of similar amino acids. This situation would clearly favor the lateral transfer of these genes among species under similar environmental stresses.

Finally, a small set of proteins (AsnA and PrdX) present a very limited distribution among bacteria. PrdX was originally described as the trans-editing enzyme ProX from Clostridium sticklandii, and shown to specifically deacylate alanyl-tRNAPro [13, 36]. PrdX and YbaK are two different trans-editing enzymes that hydrolyze different forms of mischarged tRNAPro [13]. Consistent with previous reports, the YbaK and PrdX groups do not overlap in our analysis. However, they do display overlapping distributions at the phylum level, as would be expected from two independent editing domains that recognize different substrates. Despite its more limited distribution, PrdX is not over-represented in pathogenic bacteria (Fig. 3).

Asparagine synthetase (AsnA) is a paralog of asparagine- and aspartyl-tRNA synthetases that displays a limited distribution among bacterial phyla. AsnA is unique among the ARS-like proteins analyzed here because it is significantly over-represented in human pathogenic bacteria. AsnA has been shown to act as a virulence factor in fish and plant pathogens, although the molecular bases for this role in virulence remain unknown [7, 8]. From our data it is reasonable to predict that AsnA may also be a virulence factor among human pathogens that, as such, deserves further analysis and consideration as a potential therapeutic target.

Acknowledgements

This work has been supported by Grants BIO2006-01551 from the , and HEALTH-F3-2009-223024 (MEPHITIS) from the European Union.