Improving the accuracy of recombinant protein production through integration of bioinformatics, statistical and mass spectrometry methodologies
Abstract
Heterologous protein production is a key technology for biotechnological, health sciences and many other research fields. Various approaches have been developed for its optimization, but the research emphasis has been on optimization of protein yield rather than protein quality. In this study, we have established a workflow for synthetic gene optimization for heterologous protein expression that combines bioinformatics, laboratory experiments, mass spectrometry and statistical analysis. Two gene primary structure analysis platforms, Anaconda and EuGene, and multivariate optimization methods were employed to re‐design the Plasmodium falciparum lysyl‐tRNA synthetase gene for optimal expression in Escherichia coli. Synthetic genes were expressed from common vectors, and amino acid mis‐incorporations in the expressed proteins were detected and quantified using mass spectrometry. The association between the identified amino acid mis‐incorporations and 23 gene variables was then analysed. The synthetic genes yielded significantly higher levels of protein relative to the wild‐type gene, but 71 amino acid mis‐incorporation sites were observed along the whole protein and across the synthetic genes that were statistically associated with specific codons and protein secondary structures. The optimization method that led to production of the most accurate protein was based on a multivariate approach that combined variables that are known to influence mRNA translation.
Abbreviations
- CAI: codon adaptation index
- CC: codon context
- CPB: codon pair bias
- HARM: codon harmonization
- LysRS: lysyl‐tRNA synthetase
- MFE: minimum free energy
- MS: mass spectrometry
- RNAf: RNA folding energy
- RSCU: relative synonymous codon usage
Introduction
Heterologous protein production is an important tool for biophysical studies, immunological applications, and for the screening and profiling of candidate drugs [1, 2]. As gene sequences co‐evolve with host cells, their expression often occurs under sub‐optimal conditions when expressed in heterologous hosts, due to unbalanced tRNA populations and ribosome stalling [3, 4]. Another recurrent problem in high‐level expression of heterologous genes is increased translational error due to tRNA mis‐reading, which affects protein folding and disrupts protein function [5-8].
Various strategies have been employed to improve the quality of recombinant proteins, using in silico algorithms for gene optimization and in vitro gene synthesis that take advantage of the degeneracy of the genetic code to maintain the primary structure of the recombinant proteins [9]. Codon optimization methods may be used to reduce differences in codon usage and G+C content between recombinant genes and expression hosts [9-12]. Other methods are based on the codon adaptation index (CAI), codon context, codon harmonization or mixed approaches [10]. As the CAI is a measure of codon usage in highly expressed genes, optimizing a gene to improve its CAI is expected to improve its level of expression in the surrogate host [13]. However, a gene with a maximum CAI value (CAI = 1) contains only the most frequent codon of each amino acid codon family, dramatically reducing the codon diversity of the gene. Codon context (CC) refers to non‐random utilization of adjacent codon pairs [14-16], and relates to tRNA/tRNA steric interactions in the ribosome [17, 18] that influence the translation elongation rate and accuracy in Escherichia coli [16, 19, 20]. Therefore, if a gene primary structure is re‐designed according to the host cell CC rules, its expression is expected to be more efficient. Gene optimization using codon harmonization rules is based on the concept that, during protein synthesis, functionally relevant rare codons (low‐usage codons, decoded by low‐abundance tRNAs) slow down the translation rate, allowing for correct protein folding [11]. Using this approach, genes have been re‐designed to maintain these rare codons and thus the translation dynamics of the wild‐type gene in the expressing host [11].
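To make the CAI argument concrete, the index can be sketched as the geometric mean of per‐codon weights, where each weight is the usage of a codon relative to the most used synonymous codon in a reference set of highly expressed genes. The snippet below is a minimal illustration under stated assumptions: the function names, the two codon families and the reference counts are hypothetical and are not taken from Anaconda, EuGene or the data of this study.

```python
import math

def relative_adaptiveness(reference_counts, synonymous_families):
    """Weight each codon by its count relative to the most used codon of its family."""
    weights = {}
    for family in synonymous_families:
        top = max(reference_counts.get(codon, 0) for codon in family)
        if top == 0:
            continue  # family absent from the reference set
        for codon in family:
            weights[codon] = reference_counts.get(codon, 0) / top
    return weights

def cai(sequence, weights):
    """Geometric mean of the weights of the codons in `sequence`.

    Codons without a positive weight are skipped (a simplifying assumption).
    """
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs))

# Hypothetical counts from a made-up reference set (two codon families only).
reference_counts = {"AAA": 30, "AAG": 10, "CTG": 50, "CTT": 10}
families = [("AAA", "AAG"), ("CTG", "CTT")]
weights = relative_adaptiveness(reference_counts, families)

cai_best = cai("AAACTG", weights)  # only the most frequent codon of each family
cai_rare = cai("AAGCTT", weights)  # only the less frequent codons
print(round(cai_best, 3), round(cai_rare, 3))
```

In this toy case the sequence built only from the most frequent codons reaches CAI = 1, which is also the situation in which codon diversity is lowest.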
Another important variable that affects the efficiency of gene expression is the ‘ramp’ effect, which is due to the increased occurrence of functionally relevant rare codons immediately after the initiator codon (~ 30–50 codons after the AUG). These rare codons appear more frequently in highly expressed genes, and are thought to slow down translation to a degree that allows for more efficient ribosome binding to the mRNA, avoiding ribosomal bottlenecks [21]. The ramp effect also appears to allow chaperone recruitment at the end of the ribosome exit tunnel, which is crucial for facilitating correct protein folding [21]. Another important aspect of gene optimization is the availability of tRNAs to decode the most abundant codons, which is largely determined by the tRNA gene copy number [22]. The decoding efficiency of tRNAs also depends on species‐specific base modifications, in particular modifications of base 34 of the anticodon, which modulate codon pairing and improve translation efficiency [23]. Finally, the formation of stable mRNA secondary structures, a consequence of base pairing in the mRNA strand, is particularly important during translation initiation [24-26]. The formation of stable structures is related to the minimum free energy (MFE) in the sense that the lower the MFE, the more stable the secondary structure is. Thus, increasing the MFE of nucleotide sequences enhances gene expression [27]. The ramp effect and mRNA secondary structure may also act synergistically, as the ramp appears to reduce the formation of stable structures at the mRNA level, hence improving the yield of recombinant protein [21].
Most of the above‐mentioned gene primary structure features have been investigated extensively in E. coli. For example, Raghavan et al. [28] demonstrated that highly expressed heterologous genes affect bacterial fitness, and revealed a positive association between increasing G+C content and growth rate, mostly in genes expressed at higher levels in E. coli. However, there was no apparent association between G+C content and protein yield. In a similar study [26], using a library of 154 GFP genes synthesized with random codon usage, fluorescence levels varied 250‐fold across the library, but were not correlated with the CAI, the frequency of optimal codons, or the number of rare codons (including pairs of consecutive rare codons). The most important variable was the RNA folding energy (RNAf) of the first third of the mRNA sequence. This result was later confirmed in a broad survey of 340 genomes, including bacteria, fungi, plants, insects and fishes; in birds and mammals, it was confirmed only in the most G+C‐rich genes [29]. The relevance of the mRNA secondary structure for protein synthesis has also been investigated in ‘long‐term evolutionary experiments’ in E. coli, in which synonymous mutations that change mRNA secondary structure were found to affect fitness [30]. In a further study, Welch et al. [31] synthesized and expressed two sets of 40 genes, and discovered that synonymous changes in codon usage altered protein yield more than 40‐fold, but this alteration was not correlated with the CAI or the RNAf close to the ribosomal binding site. Instead, the differences in gene expression were highly correlated with the codons used. Interestingly, those codons were not the most used codons in E. coli, but those that correspond to tRNAs that are most efficiently charged during amino acid starvation [31]. In other words, high‐level recombinant gene expression may deplete intracellular amino acids and charged tRNA pools with a direct effect on the translation rate. 
This may explain why gene optimization experiments performed in different laboratories have produced contradictory results.
A fundamental issue in heterologous protein production is translation accuracy and protein quality. Various approaches have been employed to quantify amino acid mis‐incorporations [32], namely determination of the isoelectric point of proteins [33, 34] and use of reporter systems that produce a signal when a specific amino acid is mis‐translated [35]. However, measuring translation error rates is a difficult task because most mis‐incorporations occur at low level and become diluted in heterologous mixtures of polypeptides [3]. Use of tandem mass spectrometry (MS/MS), which potentially allows detection of low‐frequency mis‐incorporations in polypeptides against a background of wild‐type molecules, combined with data analysis and databases containing all possible single amino acid substitutions, may finally overcome this hurdle [3, 36-40]. Moreover, the use of MS/MS analysis on purified recombinant proteins with known DNA sequence enables detection of the type of amino acid mis‐incorporated, and also the specific position where they occur along the protein. In the present study, we used MS/MS to determine the performance of various in silico gene optimization methods with respect to the accuracy and expression level of recombinant genes in E. coli. More specifically, we investigated differences in translational accuracy between the wild‐type gene (not optimized) and synonymous genes optimized using various methods. To do so, we re‐engineered the lysyl‐tRNA synthetase (LysRS) gene of Plasmodium falciparum for heterologous expression in E. coli. The resulting recombinant proteins were purified and analysed using matrix‐assisted laser desorption/ionization (MALDI) to identify and quantify the amino acid mis‐incorporations.
Results
Gene optimization methods
To test the yield and fidelity of heterologous protein production in E. coli, the P. falciparum LysRS gene was optimized using two gene primary structure analysis platforms, Anaconda 2.0 [16, 19, 41] and EuGene [12] (Fig. 1). Six synonymous genes were generated using various optimization criteria (Fig. 1 and Table 1): (a) the CAI gene, created by maximization of the CAI; (b) the CC gene, created by maximization of the codon context [16], plus a variant of the CC gene (DIG2) that harbours the same 6xHis tag used for the CAI and CAI/CC genes; (c) the CAI/CC gene, created using a mixed approach combining maximization of CAI and CC; (d) the HARM gene, created by optimization through codon usage harmonization; and (e) the NF2B gene, created using a mixed approach that combines various gene features and optimization strategies, such as codon context, codon usage harmonization, removal of repeated nucleotides, and removal of codons that are read by unmodified tRNA [27]. Further details are provided in Experimental procedures.
| Gene name | Optimization method | Tag | Nc | GC% | RSCU | CPB | ENC | CAI | RNAf | MFE |
|---|---|---|---|---|---|---|---|---|---|---|
| WT | None | His + FLAG | 584 | 29.91 | 0.972 | −0.018 | 37.929 | 0.5534 | −367.93 | −331.4 |
| CAI | CAI | His + FLAG | 584 | 46.75 | 1.424 | 0.02 | 20.224 | 0.8344 | −453.16 | −422 |
| CAI/CC | CAI/CC | His + FLAG | 584 | 46.23 | 1.375 | 0.088 | 24.303 | 0.8008 | −451.59 | −422.2 |
| CONT | CC | His + FLAG | 584 | 43.78 | 1.126 | 0.198 | 50.634 | 0.6514 | −553.63 | −523.3 |
| DIG2 | CC | His + FLAG | 584 | 43.78 | 1.126 | 0.198 | 50.634 | 0.6519 | −548.69 | −518.2 |
| HARM | H | His + FLAG | 584 | 34.53 | 0.968 | −0.088 | 27.817 | 0.5353 | −363.54 | −331.3 |
| NF2B | CC, H, RRnuc and UtRNA | His | 584 | 42.47 | 1.127 | 0.229 | 51.568 | 0.6378 | −487.2 | −455.3 |
- Optimization methods comprise the codon adaptation index (CAI), codon context (CC), harmonization (H), removal of repeated nucleotides (RRnuc), codon usage (CU) and removal of codons read by unmodified tRNA in E. coli (UtRNA). ‘Tag’ indicates the tags linked to the recombinant proteins; ‘Nc’ indicates the number of codons; GC% indicates the G+C content as a percentage; RSCU represents the mean relative synonymous codon usage; CPB indicates the codon pair bias; ENC indicates the effective number of codons; RNAf indicates the RNA fold energy (kcal·mol−1); MFE indicates the minimum free energy (kcal·mol−1). Gene names are defined as follows: WT, wild‐type; CAI, optimized gene based on the codon adaptation index; CAI/CC, optimized gene based on the CAI and codon context; CC, optimized gene based on codon context; HARM, optimized gene based on harmonization of codon usage; DIG2, hybrid gene between the CAI/CC and CC genes; NF2B, gene optimized based on harmonization of codon usage, removal of repeats and removal of codons decoded by unmodified tRNAs in E. coli.

Gene characterization
The recombinant genes derived from the LysRS are characterized in Table 1. Six of the seven genes, including the LysRS wild‐type (WT) gene, had FLAG and 6xHis tags. The FLAG tag was removed from the NF2B gene to avoid excessive manipulation of the synthetic gene (Table 1). The total length of PfLysRS is 584 amino acids, and the WT gene has 29.91% G+C, which is rather different from the genome G+C composition of E. coli (50.8%). All optimized genes have a G+C content closer to that of E. coli, except the HARM gene, which has an intermediate value of G+C (34.5%) (Table 1). Similarly, the mean relative synonymous codon usage (RSCU) is lower in the WT and HARM genes than in the other optimized genes. The overall evaluation of the codon context of the recombinant genes (estimated as codon pair bias, CPB [16], in Table 1) was positive for most of the genes, compared to the WT gene, except for the HARM gene, which also yielded a negative CPB value. The effective number of codons (ENC) for the recombinant genes follows a trend that is opposite to the values obtained for the CAI, i.e. the CAI and CAI/CC genes have the lowest ENC and the highest values of CAI, followed by the WT and HARM genes. The highest ENC was observed for the CC, DIG2 and NF2B genes. These values reflect how optimization by increasing CAI reduces the overall codon variability. These optimization approaches did not improve the RNAf or the MFE relative to the WT, except for the HARM gene, suggesting that the harmonization algorithm is specific in the way it influences gene expression.
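As an illustration of the simplest of these descriptors, the overall G+C content and the G+C content at each codon position can be computed directly from a coding sequence. The sketch below uses a hypothetical nine‐nucleotide sequence and hypothetical function names; it is not the code used to produce Table 1.

```python
def gc_content(sequence):
    """Percentage of G and C bases in the whole sequence."""
    seq = sequence.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def gc_by_codon_position(sequence):
    """Percentage of G and C bases at codon positions 1, 2 and 3."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    percentages = []
    for position in range(3):
        bases = seq[position:usable:3]
        percentages.append(100.0 * sum(base in "GC" for base in bases) / len(bases))
    return percentages

# Toy coding sequence (ATG AAA GCC), for illustration only.
toy_gene = "ATGAAAGCC"
overall = gc_content(toy_gene)
per_position = gc_by_codon_position(toy_gene)
print(round(overall, 2), [round(p, 2) for p in per_position])
```

Splitting the calculation by codon position is useful here because the third position can usually be changed synonymously, so it is the position that optimization methods alter most freely.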
Growth rate and protein yield
Growth curves were determined for the seven recombinant LysRS genes using three clones in triplicate; the WT PfLysRS gene was used as control. The growth rates before induction of gene expression were similar in all cases (Fig. 2A); however, differences in the growth rate were observed after inducing gene expression. In particular, strains transformed with the CAI/CC, DIG2, NF2B or CAI genes showed significantly higher growth rates than strains expressing WT and HARM genes (Fig. 2A). The CC mRNA, which has the same gene sequence as DIG2, except for the tag, produced a stable secondary structure (according to the RNAfold website, http://www.tbi.univie.ac.at/RNA/RNAfold.html), strongly reducing the growth rate and the protein yield. Because of this, the strain expressing the CC gene was not considered for further analysis, and only the DIG2 gene was studied further. To evaluate whether the growth rate of E. coli was correlated with the G+C content, mean RSCU, CAI, CPB, ENC, RNAf and MFE, a series of linear regressions was performed. The only significant associations found were between growth rate and the G+C content (R² = 0.737, P = 0.018; Fig. 2B), the presence of adenosine (A) at the third codon position (R² = 0.66, P = 0.049) and the presence of uracil (U) at the third codon position (R² = 0.67, P = 0.045). Thus, higher growth rates were obtained for genes with higher G+C content and lower AU content at the third codon position.

The WT LysRS gene yielded a lower amount of recombinant protein relative to the other constructs, as shown by both SDS/PAGE (Fig. 3A) and western blot analysis (Fig. 3B). The CAI/CC and CAI genes produced the highest level of recombinant protein, followed by the HARM, DIG2 and NF2B genes (Fig. 3C). No significant association was observed between protein yield and the above‐mentioned variables, except for G+C content at the first codon position (R² = 0.72, P = 0.019). These results support the hypothesis that there is a tight link between the G+C content of heterologous genes, the growth rate of the recombinant strains and protein yield.
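The coefficient of determination reported for these regressions can be illustrated with a minimal ordinary least‐squares fit. The sketch below assumes a simple one‐predictor model; the function name and the three data points are hypothetical placeholders, not measurements from this study, and the P value (which requires the t distribution) is not computed.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Placeholder values standing in for a gene variable (x) and a growth rate (y).
toy_x = [1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 2.0]
slope, intercept, r_squared = linear_fit(toy_x, toy_y)
print(slope, intercept, r_squared)
```

With only six or seven genes per regression, as in this study, a single construct can strongly influence R², which is one reason to read such associations cautiously.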

Identification of amino acid mis‐incorporations by MS
To clarify whether gene sequence optimization interfered with mRNA translation accuracy and protein quality, we analysed the recombinant proteins by mass spectrometry (Fig. 4). The amino acid mis‐incorporations found in the recombinant proteins are shown in Table 2, and validation of spectra is shown in Fig. S1. We identified a total of 71 amino acid mis‐incorporation positions over the full length of the PfLysRS gene. Interestingly, 17 of these mis‐incorporation positions were located in α‐helix regions, 13 in β‐sheet regions and the other 41 in random coil regions of PfLysRS (Table 2), suggesting that protein structure may influence decoding errors. However, the influence of secondary structure on the amino acid mis‐incorporation requires further clarification (see below).

Some mis‐incorporations occurred in all genes (common mis‐incorporations), others occurred in most genes, and a smaller number were detected only once across the recombinant proteins (singletons) (Table 2). The number of amino acids that were replaced at least once was 16; four of these were replaced by one amino acid only, while the remaining 12 were replaced by more than one amino acid (Table 3). Among the common mis‐incorporations, seven occurred at high frequency across all recombinant proteins (Table 2), at the following structural positions along the protein: 5 (α‐helix), 18 (β‐sheet), 124 (random coil), 246 (random coil), 314 (β‐sheet), 476 (random coil) and 498 (α‐helix).
| Number | Mis‐incorporation | Mismatch | AA modification | Editing | Occurrence | Mean (%) |
|---|---|---|---|---|---|---|
| 1 | Ala→Asn | | Carbamyl | | 3 | 0.221 |
| 2 | Ala→Ser | G1/A | Hydroxylat | Class II | 1 | 0.439 |
| 3 | Asn→Asp | A1/C | Deamidation | | 31 | 0.193 |
| 4 | Asn→Leu | | – | | 1 | 0.013 |
| 5 | Asn→Pro | | – | | 2 | 0.030 |
| 6 | Asp→His | G1/G | – | | 1 | 0.017 |
| 7 | Asp→Leu | | – | | 3 | 0.029 |
| 8 | Asp→Trp | | – | | 1 | 0.204 |
| 9 | Gln→Arg | A2/G | – | | 1 | 0.055 |
| 10 | Gln→Lys | C1/U | – | | 11 | 0.140 |
| 11 | Gly→Gln | | Propionamide | | 9 | 0.066 |
| 12 | Gly→Lys | C1/A | Deoxyhypusine | | 1 | 0.016 |
| 13 | Gly→Thr | | caused by ethanol | | 7 | 0.094 |
| 14 | Gly→Trp | | – | | 1 | 0.045 |
| 15 | His→Met | | – | | 1 | 0.028 |
| 16 | Ile→Arg | U2/C | – | | 11 | 0.078 |
| 17 | Ile→Asn | U2/U | – | | 1 | 0.007 |
| 18 | Ile→Glu | | – | | 8 | 0.024 |
| 19 | Leu→Arg | U2/C | – | | 6 | 0.080 |
| 20 | Leu→Asn | | – | | 1 | 0.128 |
| 21 | Leu→Glu | | – | | 17 | 0.075 |
| 22 | Lys→Gln | A1/G | – | | 18 | 0.161 |
| 23 | Lys→Glu | A1/C | – | | 1 | 0.238 |
| 24 | Lys→Gly | | – | | 1 | 0.194 |
| 25 | Met→Asp | | – | | 1 | 0.076 |
| 26 | Met→Phe | | – | | 3 | 0.166 |
| 27 | Phe→Ile | U1/U | – | | 2 | 0.013 |
| 28 | Pro→Arg | C2/A | – | | 1 | 0.153 |
| 29 | Ser→Ala | U1/C | – | | 4 | 0.162 |
| 30 | Ser→Asp | | Formyl | | 32 | 0.046 |
| 31 | Ser→Cys | A1/U or C2/C | Thiocarboxy | | 8 | 0.164 |
| 32 | Ser→Gln | | – | | 4 | 0.114 |
| 33 | Ser→Met | | S‐ethyl | | 4 | 0.061 |
| 34 | Thr→Glu | | Formyl | | 13 | 0.045 |
| 35 | Thr→Phe | | – | | 1 | 0.022 |
| 36 | Tyr→His | U1/G | – | | 1 | 0.012 |
| 37 | Val→Asn | | – | | 1 | 0.012 |
| 38 | Val→Asp | | – | | 2 | 0.115 |
| 39 | Val→Pro | | Didehydro | | 5 | 0.055 |
- ‘Mis‐incorporation’ indicates amino acid substitutions relative to the WT sequence. ‘Mismatch’ indicates base pair mismatch between mRNA codon and tRNA anticodon; the numbers represent the mismatched nucleotide position in the mRNA codon. ‘AA modification’ indicates possible amino acid modifications with identical molecular mass to the amino acid substitutions. ‘Editing’ indicates editing activity of class I and class II aaRS. ‘Occurrence’ indicates the number of times an amino acid mis‐incorporation was observed across all recombinant proteins. ‘Mean (%)’ indicates the relative mean level of amino acid mis‐incorporation across all recombinant proteins.
As it is technically difficult to distinguish between some amino acid substitutions and amino acid chemical modifications using MS, we cross‐checked which of the above amino acid mis‐incorporations had molecular mass differences identical to known amino acid modifications, producing a second dataset (Tables 2 and 3). This reduced the total number of amino acid mis‐incorporation positions to 52 (Table 2, grey columns), with 13 located in α‐helices, 10 in β‐sheets and 29 in random coils. For simplicity, we refer to this subset of amino acid substitutions as the second dataset.
Quantification of amino acid mis‐incorporations
The relative level of amino acid mis‐incorporation was quantified by integrating the peak areas of the validated MS spectra (Tables 2 and 3). The mis‐incorporation levels ranged from 0.007% to 0.9%, with a mean of 0.1% across the identified sites. Moreover, the mean level of mis‐incorporations was higher at random coils than at α‐helices and β‐sheets, with values of 0.135%, 0.083% and 0.077%, respectively. Similar results were obtained for the second dataset, for which the level of amino acid mis‐incorporation was ~ 0.139% at random coils, 0.081% at α‐helices and 0.075% at β‐sheets. On average, the genes with the highest level of amino acid mis‐incorporations were the DIG2 and WT genes, while the lowest value was recorded for the CAI gene (Table 4). After removing putative amino acid modifications (second dataset), the genes with the highest level of mis‐incorporations were the WT and HARM genes, while the NF2B and CAI genes had the lowest level (Table 4).
| Gene | Overall | No AA modification |
|---|---|---|
| WT | 0.942 | 0.508 |
| CAI | 0.391 | 0.231 |
| CAI/CC | 0.500 | 0.317 |
| CC | 0.543 | 0.369 |
| DIG2 | 1.003 | 0.260 |
| HARM | 0.674 | 0.481 |
| NF2B | 0.660 | 0.207 |
- Gene names are defined in the footnote to Table 1. ‘Overall’ data represent dataset 1, i.e. the mean level of error (%) for the full set of data. ‘No AA modification’ values represent dataset 2, i.e. the mean level of error after removing the mis‐incorporations that have identical molecular mass to amino acid modifications (%). The results for dataset 2 show that the NF2B gene has a lower level of error than the other genes, and that the WT gene has the highest level of error.
To understand the mechanism of amino acid mis‐incorporations, we analysed the interactions between each codon base of the engineered mRNA and each anticodon base of the tRNAs responsible for the mis‐incorporations (Fig. 5). Of the 39 types of amino acid mis‐incorporations observed (Table 3), 15 were associated with a single nucleotide mismatch between anticodon/codon pairings (amino acid mis‐incorporations underlined in Tables 2 and 3). We also analysed the mis‐incorporations that involved substitutions between amino acids with similar chemical properties, but a correlation between these variables was not observed (data not shown). The remaining mis‐incorporations involved chemically unrelated amino acids, with three codon/anticodon mismatches. The second dataset, which excluded possible amino acid modifications, produced similar results, with 12 mis‐incorporations resulting from single nucleotide mismatches and the other 16 originating from more than one mismatch (Table 3). The genes that had the highest number of mismatches at the first codon position were the HARM and WT genes, those with the highest number of mismatches at the second codon position were the WT, CC and HARM genes, and those with the highest number of mismatches at the third codon position were the WT, CC and HARM genes (Table 5). The second dataset produced slightly different results: those with the highest number of mismatches at the first codon position were the WT and CAI genes, those with the highest number of mismatches at the second codon position were the WT and CAI/CC genes, and those with the highest number of mismatches at the third codon position were the CC and NF2B genes (Table S1). These data highlight the importance of manually assessing the amino acid mis‐incorporations to ensure that putative amino acid modifications are excluded from the datasets and do not introduce biases in the analysis. 
Interestingly, mismatches more often occurred at the first codon position than at the second and third positions in both datasets (Table 5 and Table S1). Mis‐incorporations were also related to the presence at the wobble position of inosine (I), which is able to pair with A, U or C, particularly in the WT gene, followed by the CAI and HARM genes, in both datasets (Table 5 and Table S1). Mis‐incorporations involving G–U pairing between the codon and the mis‐reading tRNA anticodon were also more frequent at the third codon position, followed by the first and second codon positions, again in both datasets (Table 5 and Table S1).
| Gene | Codon position | Total mismatch | Inosine | GU |
|---|---|---|---|---|
| WT | 1 | 31 | * | 6 |
| | 2 | 22 | * | 2 |
| | 3 | 12 | 2 | 6 |
| CAI | 1 | 27 | * | 3 |
| | 2 | 11 | * | 3 |
| | 3 | 5 | 3 | 9 |
| CAI/CC | 1 | 19 | * | 3 |
| | 2 | 21 | * | 1 |
| | 3 | 6 | 0 | 12 |
| CC | 1 | 20 | * | 5 |
| | 2 | 22 | * | 3 |
| | 3 | 14 | 0 | 9 |
| DIG2 | 1 | 19 | * | 2 |
| | 2 | 15 | * | 8 |
| | 3 | 8 | 0 | 10 |
| HARM | 1 | 30 | * | 2 |
| | 2 | 21 | * | 3 |
| | 3 | 14 | 1 | 3 |
| NF2B | 1 | 17 | * | 2 |
| | 2 | 11 | * | 1 |
| | 3 | 8 | 0 | 2 |
| Total | 1 | 163 | * | 23 |
| | 2 | 123 | * | 21 |
| | 3 | 67 | 6 | 51 |
- Gene names are defined in the footnote to Table 1. ‘Codon position’ indicates each of the three possible anticodon positions. ‘Total mismatch’ indicates the total number of mismatched nucleotides at each position, considering the codon that was present in each mis‐incorporation detected and the tRNA that was more likely to have originated that mis‐incorporation. ‘Inosine’ and ‘GU’ refer to non‐canonical codon–anticodon interactions. Asterisks indicate positions that were not considered.
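The position‐by‐position comparison summarized in this table can be sketched as follows. The snippet pairs each codon base (5′→3′) with the corresponding base of the reversed anticodon and labels the pair as a Watson–Crick match, a G–U pair, an inosine pair or a mismatch. The function name, the labels and the example codon/anticodon pairs are assumptions chosen for illustration; they are not the assignments used to build the table.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
GU_PAIRS = {("G", "U"), ("U", "G")}

def classify_pairing(codon, anticodon):
    """Label the pairing at codon positions 1-3.

    Both sequences are RNA written 5'->3', so the anticodon is reversed
    before pairing (codon position 1 faces the last anticodon base).
    """
    labels = []
    for codon_base, anticodon_base in zip(codon, anticodon[::-1]):
        pair = (codon_base, anticodon_base)
        if pair in WATSON_CRICK:
            labels.append("match")
        elif pair in GU_PAIRS:
            labels.append("GU")
        elif anticodon_base == "I" and codon_base in "AUC":
            labels.append("inosine")
        else:
            labels.append("mismatch")
    return labels

# Illustrative cases (anticodons assumed for the example):
lys_read_by_gln = classify_pairing("AAA", "UUG")  # Lys codon, Gln-type anticodon
gly_wobble = classify_pairing("GGU", "GCC")       # G-U pair at the third position
arg_inosine = classify_pairing("CGA", "ICG")      # inosine at the wobble position
print(lys_read_by_gln, gly_wobble, arg_inosine)
```

Under this reading, the first example has a single mismatch at the first codon position (codon A facing anticodon G), which corresponds to the A1/G style of notation used for single‐nucleotide mismatches.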

Statistical analysis of the amino acid mis‐incorporation data
In an attempt to differentiate the optimization methods with respect to error frequency, we compared all pairs of optimization methods to search for errors that occurred in the same position (Fig. 6A, top right). This showed that, while each method produced a relatively small number of errors, there was a significant correlation with respect to the occurrence of errors in the same positions for the different optimization methods (Cramér coefficient > 0.40; Fig. 6A, bottom left) [42]. Moreover, analysis of the empirical distribution of errors, i.e. comparison between the error frequencies observed for each position and the cumulative distribution of errors corresponding to each of those frequencies (Fig. 6B), showed that all optimization methods produced the same type of result, i.e. most of the errors appear at low frequency and are scattered at many different positions rather than accumulating with high frequency at a small number of positions. Therefore, in view of the homogeneity observed among optimization methods, we decided to group all methods together for the subsequent analyses. Also, as the parallel analysis of the second dataset comprising non‐ambiguous amino acid mis‐incorporations produced similar results to the whole dataset, and in order to increase the number of cases for statistical purposes, we performed subsequent analyses on all 71 mis‐incorporation positions identified by the MS/MS method.

We evaluated the association between gene characteristics and error occurrence using several statistical methodologies, including logistic regression and the Mann–Whitney U test (for quantitative variables). None of these methodologies identified significant association effects (data not shown), either because most of the variables showed strong multicollinearity or because the overall association effect was rather weak. Finally, in an attempt to rationalize the mis‐translations identified, we performed a χ2 test between error occurrence and 23 gene variables (Table 6). Nine of these variables were quantitative and produced no significant results (data not shown). The results for the 14 categorical variables are presented in Table 7, with five of them being significantly associated with error occurrence (P value of ~ 0) and with a Cramér coefficient ranging from 0.137 to 0.271. These variables were: protein secondary structure prediction (for the WT protein and the mutated protein with mis‐incorporations), V15 and V16; anticodon, V18; amino acid in the wild‐type protein, V19; codon, V17; amino acid in the protein with mis‐incorporations, V22. Among these five categorical variables, the type of amino acid involved in mis‐incorporations showed the strongest effect according to the Cramér coefficient (Table 7).
| Variable | Variable description | Type |
|---|---|---|
| V1 | 5′ context bias (adjusted residue for the whole genome [16]) | Quantitative |
| V2 | 3′ context bias (adjusted residue for the whole genome [16]) | Quantitative |
| V3 | Codon usage bias (RSCU calculated for the whole genome) | Quantitative |
| V4 | Number of times that the codon appears in the whole genome | Quantitative |
| V5 | Codon frequency (V4/total number of codons in the genome) | Quantitative |
| V6 | Rarity of the codon (yes if V5 < 0.005; no if otherwise) | Categorical |
| V7 | Contribution of that codon to calculating CAI (geometrical mean of RSCUs calculated for a group of highly expressed genes); the value given is the RSCU of that codon in highly expressed genes | Quantitative |
| V8 | tRNA gene copy number for the tRNAs that decode that codon | Categorical |
| V9 | Codon–anticodon interaction (C, canonical; GU or UG, wobble; IX or XI, based on inosine) | Categorical |
| V10 | Conversion of V9 into a numerical value, depending on the predicted efficiency of the interaction | Categorical |
| V11 | G and C content for the codon | Categorical |
| V12 | G and C content for the third codon position | Categorical |
| V13 | mRNA secondary structure evaluation (possible pairing with other bases; or free base) | Quantitative |
| V14 | Number of points from V13 (3, easiest reading by the ribosome; 1, higher pairing probability) | Categorical |
| V15 | Protein secondary structure prediction (for the WT protein) | Categorical |
| V16 | Protein secondary structure prediction (for the mutated protein, i.e. with mis‐incorporations) | Categorical |
| V17 | Codon | Categorical |
| V18 | Anticodon | Categorical |
| V19 | Amino acid (WT protein) | Categorical |
| V20 | Number of times that amino acid appears in the whole proteome | Quantitative |
| V21 | Amino acid frequency (V20/total of amino acids of the proteome) | Quantitative |
| V22 | Amino acid (with mis‐incorporations) | Categorical |
| V23 | Error frequency (among replicates) | Quantitative |
| y | x | P value (χ2 test) | Cramér coefficient |
|---|---|---|---|
| V24* | V12 | 0.807 | 0.0038 |
| V24* | V9 | 0.352 | 0.0155 |
| V24* | V14 | 0.536 | 0.0231 |
| V24* | V10 | 0.519 | 0.0236 |
| V24* | V6 | 0.178 | 0.0243 |
| V24* | V16 | 0.252 | 0.026 |
| V24* | V11 | 0.110 | 0.0385 |
| V24* | V15 | 0.006 | 0.0503 |
| V24* | V8 | 0.099 | 0.0512 |
| V24* | V15_16 | 0.000 | 0.1379 |
| V24* | V18 | 0.000 | 0.1648 |
| V24* | V19 | 0.000 | 0.2002 |
| V24* | V17 | 0.000 | 0.2202 |
| V24* | V22 | 0.000 | 0.271 |
- P value (χ2 test) and Cramér coefficients for evaluation of the association between error occurrence (y = V24) and the 14 categorical variables (x, as explained in Table 6). Categorical variables significantly associated with error occurrence are those with a Cramér coefficient > 0.1 (the last five rows).
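For readers unfamiliar with the Cramér coefficient, the sketch below shows how it is derived from the χ2 statistic of a contingency table: V = sqrt(χ2 / (n · (min(rows, columns) − 1))). The function names and the 2 × 2 table are hypothetical and serve only to illustrate the calculation; the P value, which requires the χ2 distribution, is not computed here.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic and total count for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count under independence
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramér coefficient: association strength scaled to the range 0-1."""
    chi2, n = chi_square_statistic(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Hypothetical table: rows = error / no error, columns = two categories of a variable.
toy_table = [[10, 20], [20, 10]]
toy_v = cramers_v(toy_table)
print(round(toy_v, 4))
```

Because the coefficient is normalized by the sample size, it indicates the strength of an association independently of the P value, which is why a very small P value can coexist with a modest coefficient.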
The five categorical variables that showed a significant association with the amino acid mis‐incorporations (Table 7) were further explored to highlight biased distributions that may explain the previous result. To do this, each variable was analysed by comparing the actual error distribution with the one that was expected by chance (Fig. 7). For example, for variable V17 (Fig. 7A), which corresponds to the codon at each position, we quantified the number of times that there was a mis‐incorporation involving each codon. We also calculated the expected number of mis‐incorporations for each codon based on the frequency with which each codon appeared in the gene being studied. The values shown in Fig. 7 correspond to the difference between observed and expected values, i.e. we show only those errors that occurred above or below the expected level. This analysis showed that the mis‐incorporations tended to appear preferentially at codons starting with A or U, and were less frequent at codons starting with C or G (Fig. 7A, V17). Additionally, they were also associated with anticodons ending in U or A rather than with anticodons ending with G or C (Fig. 7B, V18), i.e. with serine, glutamine, lysine, asparagine and isoleucine rather than with glutamate, phenylalanine, arginine or proline (Fig. 7E, V19), and with β‐sheets and random coils rather than with α‐helices (Fig. 7C, V15). However, the errors introduced mainly aspartate and glutamine (Fig. 7F, V22), reinforcing the random coil content of the protein (Fig. 7D, V16). Removing those mis‐incorporations that may have resulted from chemical amino acid modifications instead of amino acid substitutions gave very similar results (Fig. S2). The main difference was that the errors appeared to be associated with glutamine, serine, lysine and isoleucine only (Fig. S2E), introducing mostly glutamine and arginine (Fig. S2F).
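The observed-versus-expected comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `observed_minus_expected` and the toy codon lists are hypothetical, and the expected count is taken to be the total number of errors multiplied by the relative frequency of each codon in the gene, as the text describes for V17.

```python
from collections import Counter


def observed_minus_expected(gene_codons, error_codons):
    """Return {codon: observed - expected} mis-incorporation counts.

    gene_codons: codons of the gene, one entry per position.
    error_codons: codons at which a mis-incorporation was detected.
    Expected counts assume that errors fall on codons in proportion to
    how often each codon appears in the gene (the null expectation).
    Codons absent from gene_codons are ignored.
    """
    usage = Counter(gene_codons)
    observed = Counter(error_codons)
    total_codons = len(gene_codons)
    total_errors = len(error_codons)
    return {
        codon: observed.get(codon, 0) - total_errors * count / total_codons
        for codon, count in usage.items()
    }


# Toy data (hypothetical, not taken from the study)
gene = ["AAA"] * 6 + ["GCG"] * 4
errors = ["AAA", "AAA", "AAA", "GCG", "AAA"]
print(observed_minus_expected(gene, errors))
```

A positive value marks a codon with more errors than its frequency alone would predict, and a negative value marks one with fewer.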

Discussion
Many gene features are important for accurate synthesis of recombinant proteins. For example, codon usage biases are very important for gene expression yield [28], while codon context at the 5′ and 3′ ends of genes is essential for the elongation phase and overall accuracy of translation [16, 20]. The codon adaptation index (CAI) is frequently used to optimize gene expression, as it selects optimal codons for translation. Similarly, codon usage of the host genome is important for both protein yield and synthesis accuracy. Moreover, the efficiency and accuracy of codon translation may also be affected by the tRNA gene copy number, which is indicative of the tRNA concentration within the cell and hence tRNA availability during protein synthesis [22]. Other variables are related to mRNA secondary structure, which may reduce the gene expression rate especially if such structures are present at the beginning of the mRNA [26]. Variables that primarily relate to the accuracy of protein synthesis include codon–anticodon pairing interactions between nucleotides (i.e. the presence of the canonical interaction, G/U pairing or inosine at the wobble position) [43, 44].
The goal of the present study was to test the effects of various gene optimization methods on the accuracy of expression of recombinant genes in E. coli, and to determine which of the above‐mentioned variables are associated with mis‐incorporations during protein synthesis. Expression of most of our optimized genes increased the growth rate of the host relative to expression of the wild‐type gene, confirming the positive effect of heterologous gene optimization on host fitness (Fig. 2A,B). Moreover, we observed a positive association between growth rate and the G+C content, especially at the third codon position of the optimized genes. These results are in agreement with those reported by Raghavan et al. [28], and suggest that optimization of the G+C content, in particular at the third codon position, is important for synthetic gene design. This is reinforced by the observation that, in bacterial genomes, the G+C content of synonymous sites of coding regions is higher than that of non‐coding regions [45-48]. However, Raghavan et al. [28] did not find any significant association between the G+C content or the CAI of re‐designed GFP genes and protein yield. In our case, the G+C content at the first codon position was positively correlated with the protein yield. Interestingly, the global G+C content (34.53%) of the HARM gene was lower than that of the other optimized genes and closer to that of the WT gene (29.90%). However, at the first codon position, the HARM gene had a G+C content (50.43%) that was similar to the global G+C content of E. coli, and the protein yield was similar to that of the CAI and CAI/CC genes. Expression of the CAI/CC and CAI genes resulted in the highest protein yield (Fig. 3), supporting the idea that optimization based on CAI enhances the translation elongation rate, but with a possible increase in error rate. Such an increase is indeed suggested by the western blot results for CAI and CAI/CC genes (Fig. 3B), which showed various bands of smaller molecular mass than the expected 69 kDa band for full‐length LysRS. These bands most likely correspond to truncated proteins produced by frame‐shifting or ribosome drop‐off [49]. For the WT, CC and NF2B genes, such low molecular mass bands were faint and appeared less frequently. For the WT and CC genes, this was probably due to the very low protein yield, but in the case of the NF2B gene, it is likely that the optimization, which removed repeated nucleotides, reduced the level of frame‐shifting and thus the probability of synthesis of truncated proteins.
Only a few studies, using various methods, have investigated amino acid mis‐incorporations in proteins, and these mis‐incorporations have not been quantified [3, 37-43, 49-57]. We observed 71 mis‐incorporation sites over the entire length of the PfLysRS protein (584 amino acids), with a mean level of amino acid mis‐incorporation of 0.1% (Tables 2 and 3). Almost half of the identified mis‐incorporations were shared by all gene versions, independently of their codon or codon context biases, and the other half was mostly present in one version of the PfLysRS gene only. As mis‐incorporations that appeared in all optimized genes cannot be related to the mRNA primary structure, they are probably related to the structure of the growing polypeptide. A higher number of mis‐incorporations occurred at random coils (41), while fewer were observed in α‐helices (17) and β‐sheets (13) (Fig. 7C,D and Table 2), suggesting that certain sites along the mRNA, corresponding to specific protein structured domains, are more prone to amino acid mis‐incorporations than others. Moreover, random coils had the highest level of mis‐incorporations both before and after removing sites affected by putative amino acid modifications. Thus, it is possible that the higher number of mis‐incorporations found in random coils is associated with the presence of rare or less accurate codons. However, the higher number of mis‐incorporations observed in random coils may also be due to protein quality control mechanisms in E. coli. Mis‐incorporations occurring in well‐defined secondary structures, i.e. α‐helices and β‐sheets, may have a higher disruptive effect on protein structure, activating protein quality control processes that destroy the mis‐translated protein, while mis‐incorporations in disordered structures, i.e. random coils, may have a weaker effect on protein stability and degradation, resulting in their more frequent detection.
It has been reported that different protein secondary structures are associated with different codon usage patterns, with α‐helices being mostly encoded by fast codons, while β‐sheets, loops and disordered structures show enrichment in rare and slow codons [58]. Different amino acids appear to have a different propensity to adopt certain protein structures [59]. Among the various mis‐incorporations observed, only a few corresponded to changes between amino acids with similar secondary structure propensity, and this may imply that the absence of a selective force to maintain protein secondary structure may indeed allow random mis‐incorporations. Thus, information about the protein secondary structure should also be taken into account when optimizing genes from different genomes for heterologous protein expression.
Many amino acid mis‐incorporations probably produced protein molecular mass alterations that are identical to those associated with known amino acid modifications. We manually curated our mass spectra and removed the false positives that we were able to identify. For example, at position 124, the MS workflow identified a mis‐incorporation of Asp instead of Asn that may also be explained by deamidation of the Asn residue during protein digestion [3, 43]. Similarly, six of the most frequent amino acid mis‐incorporations that were found across our recombinant proteins may be explained by amino acid modifications. A possible explanation for such events is that these residues are present in exposed sites of the recombinant protein, making them more prone to chemical modification.
To understand the mechanism of the amino acid mis‐incorporations, we determined the most probable codon–anticodon pairs that may cause the mis‐incorporations, and found that the majority of mismatches probably occurred at the first codon position, followed by the second and third positions (Fig. 5 and Table 5). Among the mis‐incorporations observed, ~ 16 of 39 involved amino acid replacements associated with single nucleotide mismatches (Tables 2 and 3). We further identified mis‐incorporations caused by G/U mismatches at the wobble position as well as at the first and second codon positions [43]. The G/U mismatch was higher at the third codon position, especially in the CAI, CAI/CC, CC and DIG2 genes. Mis‐incorporations due to the presence of inosine at the wobble position were also investigated, as I34 permits base pairing with A, U and C at the third codon position in E. coli [44]. Such mis‐incorporations due to the presence of I34 occurred mainly in the WT, CAI and HARM genes. The remaining amino acid mis‐incorporations involved mismatches at all three codon–anticodon positions, indicating that the mis‐incorporations are probably related to tRNA mis‐charging [43].
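The idea of attributing a mis‐incorporation to a single nucleotide mismatch can be illustrated with a short Python sketch. It is not the procedure used in the study: it simply checks, using the standard genetic code, at which codon positions a single base change would turn the original codon into a codon for the mis‐incorporated amino acid. The function name `single_mismatch_positions` is hypothetical, and wobble or inosine pairing rules are deliberately not modelled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order with the third base varying fastest
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_mismatch_positions(codon, wrong_aa):
    """Return the codon positions (1-3) at which one base change yields
    a codon that encodes the mis-incorporated amino acid `wrong_aa`
    (one-letter code). An empty list means no single change suffices."""
    codon = codon.upper().replace("U", "T")  # accept mRNA or DNA codons
    hits = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            neighbour = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[neighbour] == wrong_aa:
                hits.add(pos + 1)
    return sorted(hits)


print(single_mismatch_positions("AAU", "D"))  # Asn codon read as Asp: [1]
print(single_mismatch_positions("AAA", "R"))  # Lys codon read as Arg: [2]
print(single_mismatch_positions("GCG", "W"))  # Ala codon read as Trp: []
```

Replacements that return an empty list would need mismatches at more than one position, which is the situation the text attributes to possible tRNA mis‐charging.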
Finally, we investigated the relationship between mis‐incorporations and 23 variables associated with the protein synthesis machinery. We have observed that the highest frequency of errors was associated with codons starting with A or U or anticodons ending with A or U (Fig. 7A,B and Fig. S2A,B). This may be due to the higher relative stability of the G–C pairing due to the presence of three rather than two hydrogen bonds. The amino acids decoded by these codons are serine, glutamine, lysine and isoleucine, with errors mostly introducing glutamine and arginine (Fig. 7E,F). All the mis‐incorporated amino acids are decoded by highly frequent codons in E. coli, but these optimal codons may be over‐used under over‐expression conditions, creating tRNA–codon imbalances that decrease coding accuracy. Interestingly, the mis‐incorporations observed at codons ending with A or U (Fig. 7A and Fig. S2A) were positively associated with lower growth rate of the respective host cells (Fig. 2B), suggesting that the presence of an A or U at the third codon position affects both gene expression and protein fidelity during heterologous protein production in E. coli.
In conclusion, we wished to establish a workflow for synthetic gene optimization for heterologous protein expression, using bioinformatics, laboratory experiments, mass spectrometry and statistical analysis. Overall, our optimized genes resulted in a higher expression level and lower error rates than the WT gene. Importantly, we were able to identify an optimization method that produced better results than the others. Use of the NF2B gene resulted in the lowest number of mis‐incorporations and the lowest level of error after removing putative amino acid modifications (Tables 2 and 4), supporting the idea that this optimization approach improved translational accuracy.
Experimental procedures
Gene optimization methods
The P. falciparum LysRS gene was optimized for expression in E. coli (Table 1) using Anaconda 2.0 [16, 19, 41] and EuGene [12]. Three of a total of six synonymous genes were optimized using Anaconda with the following criteria: (a) maximization of the codon adaptation index [19, 13] (CAI gene), (b) maximization of codon context‐adjusted residues [16, 41] (CC gene), and (c) a simultaneous maximization of CAI and CC (CAI/CC gene). A fourth variant, named DIG2, which has the same coding sequence as the CC gene, was constructed by replacing the original CC‐optimized His tag (mainly encoded by the CAT codon) with a CAC‐encoded His tag, as in the CAI and CAI/CC genes. This modification was performed because the original His tag suppressed protein expression almost entirely. The other two genes were optimized using EuGene on the basis of codon usage harmonization [11] (HARM gene) and using a mixed approach (NF2B gene), comprising codon context maximization, codon usage harmonization, removal of repeated nucleotides [8], and removal of codons that are read by unmodified tRNA and are associated with genes expressed at a low level in E. coli [23] (Fig. 1, panel 1). Gene sequences are available in Appendix S1.
Gene synthesis, plasmid construction, strains and growth
All genes, including the His and FLAG tag sequences, were synthesized de novo by a commercial service provider (GeneArt/Life Technologies, Thermo Fisher Scientific Inc., Waltham, MA, USA). The delivery plasmids, containing the gene constructs, and the over‐expression vector pET19b (Novagen, Madison, WI, USA) were double‐digested using NcoI and XhoI for 2 h at 37 °C. The restriction enzymes were heat‐inactivated, the samples were treated with shrimp alkaline phosphatase, and both the target genes and vectors were purified using PCR clean‐up or gel purification kits (Qiagen, Duesseldorf, Germany). Ligations of target genes to the vector were performed overnight at 16 °C using T4 ligase (New England Biolabs, Ipswich, MA, USA), followed by heat inactivation of the enzyme at 65 °C for 10 min. The constructed plasmid was a slight variation of the pET19b vector because the original His tag and the enterokinase site were replaced by the de novo synthesized shorter His tag (6xHis) and an additional FLAG tag in six of the seven genes (see Table 1).
For the protein over‐expression assays, the E. coli BL21 (DE3) strain was used [genotype: F− ompT, hsdSB(rB‐, mB‐), gal, dcm, λDE3 (lacI, lacUV5‐T7 gene 1, ind1, sam7, nin5)] because it is suitable for high‐level protein expression using T7 promoter‐driven vectors such as the pET vectors. Overnight pre‐cultures were inoculated (initial attenuance of 0.02 at 600 nm) into either 10 or 200 mL of LB medium containing 75 μg·mL−1 ampicillin, and grown at 37 °C with stirring at 180 rpm to determine growth rates and to produce protein for MS analysis, respectively. After 3 h, when the various strains had reached an attenuance of ~ 0.5, 1 mM isopropyl‐β‐D‐thiogalactopyranoside was added, and the temperature was reduced to 25 °C for the next 5 h (Fig. 1, panel 2). The attenuance was measured every hour after the pre‐inoculum addition until 5 h after the isopropyl‐β‐D‐thiogalactopyranoside induction using a microplate reader (IMARK, Bio‐Rad, Hercules, CA, USA).
Protein extraction
For total protein extraction, pelleted cells (centrifuged at 5000 g for 5 min) were thawed on ice and resuspended at a 1 : 10 ratio (w/v) in lysis buffer (50 mM NaH2PO4, 300 mM NaCl, 5 mM imidazole and 6 M urea), sonicated four times for 30 s, and then centrifuged again for 30 min at 16 100 g at 4 °C. The supernatant was filtered through a 45 μm filter and stored at −20 °C. Total protein concentration was determined using a Nanodrop (NanoDrop products, Wilmington, DE, USA) spectrophotometer (Fig. 1, panel 3).
Protein purification
The His‐tagged proteins were purified by immobilized metal affinity chromatography. The chelating ligand was charged with nickel, resulting in high selectivity for the target proteins. Proteins were purified by gravity flow using Poly‐Prep chromatography columns (Bio‐Rad). Briefly, Profinity IMAC resin (Bio‐Rad) was added to the samples and incubated overnight with agitation. The following day, the flow‐through was first collected into Falcon tubes (Orange Scientific, Braine‐l'Alleud, Belgium), and the resin with the sample was transferred into columns. The Falcon tubes were washed in wash/binding buffer, at 4 °C, until all the resin was in the column under denaturing conditions, according to the manufacturer's instructions (Profinity IMAC resins, Instruction Manual, Bio‐Rad). The samples were eluted using five volumes each of three wash buffers containing increasing concentrations of imidazole (5, 20 and 50 mM), and samples were finally washed with elution buffer (100 mM imidazole). All fractions were analysed using SDS/PAGE and then concentrated using 3K centrifugal filter units (Merck Millipore, Darmstadt, Germany).
SDS/PAGE and western blotting
Total protein fractions were separated by 12% SDS/PAGE, and blotted onto nitrocellulose membranes using the Trans‐Blot® Turbo™ transfer system (Bio‐Rad). The seven constructs were detected using a mouse antibody against FLAG tag (F1804‐1MG, Sigma‐Aldrich, St Louis, MO, USA) and an antibody against His tag (18184, Abcam, Cambridge, UK) at 1 : 5000 dilution. Bound antibody was visualized by incubating membranes with an IRDye680‐labelled goat secondary antibody against mouse (Li‐Cor Biosciences, Lincoln, NE, USA) at 1 : 10 000 dilution. Detection was performed using an Odyssey infrared imaging system (Li‐Cor Biosciences).
Protein and post‐translational modifications identification by nano‐HPLC‐MALDI‐TOF/TOF
After SDS/PAGE, the protein bands corresponding to those identified by western blot analysis were selected for analysis by MS. Briefly, after band excision, the gel bands were de‐stained by washing with 25 mM ammonium bicarbonate/50% acetonitrile solution (twice), and dried under vacuum using a SpeedVac® (Savant, Thermo Scientific, Shah Alam, Malaysia) before trypsin addition. The dried gel pieces were further rehydrated using 25 μL of 10 μg·mL−1 trypsin solution in 50 mM ammonium bicarbonate, and incubated overnight at 37 °C. The tryptic peptides generated were then extracted from the gel by incubation with 10% formic acid/50% acetonitrile (repeated three times). The collected supernatant was then dried in a vacuum concentrator, and resuspended in 10 μL of solvent A (5% acetonitrile/0.1% trifluoroacetic acid). Tryptic peptides were separated by nano‐HPLC using an Ultimate 3000 (Dionex, Amsterdam, the Netherlands) and a Pepmap100 C18 column (3 μm particle size, 0.75 μm internal diameter, 15 cm in length). Peptide separation was accomplished using a linear gradient of 5–50% solvent B (10 : 90 : 0.045 v/v/v water/acetonitrile/trifluoroacetic acid) for 30 min, 50–70% B for 10 min and 70–5% A for 5 min, with a flow rate of 0.3 μL·min−1. The eluted peptides were applied directly to a MALDI plate in 7 s fractions using a Probot automatic fraction collector (Dionex).
Mass spectra were obtained on a MALDI‐TOF/TOF mass spectrometer (4800 Proteomics Analyzer, Applied Biosystems, Foster City, CA, USA) in positive ion reflector mode in the mass range 700–4500 Da with 800 laser shots. A data‐dependent acquisition method was created to select the 16 most intense peaks in each sample spot for subsequent tandem mass spectrometry (MS/MS), excluding those derived from the matrix, those due to trypsin autolysis and acrylamide peaks. The Glu‐1‐fibrinopeptide B (Glu‐fib) peptide mass standard peak (m/z 1570.68) was used for internal calibration of the mass spectra.
MS/MS data were searched against an internal database comprising contaminant proteins, E. coli proteins and the recombinant protein sequences in FASTA format (68 759 entries), using an in‐house installation of Mascot software (version 2.1.0.4, Matrix Science Ltd., London, UK) for protein/peptide identification. Mis‐translations were confirmed in the Mascot Modfile (Table S2). The database search was performed utilizing MS/MS data and multiple combinations of up to nine modifications; the ‘variable modifications’ option was preferred. Search parameters included the ‘trypsin’ option for enzyme selection, and allowed mass tolerances of 40 ppm for parent ions and 0.3 Da for fragment ions. Identifications were considered positive for individual ion scores above 40 and a default P value < 0.05. For every engineered gene, between four and six protein samples were analysed. The recombinant proteins were initially identified by searching against an internal database using Mascot; searches were performed for all possible mis‐incorporations using the MS/MS data. The total number of possible amino acid mis‐incorporations was 384, i.e. six mis‐incorporations per run, each sample being run 64 times. In addition, as some mis‐incorporations and certain modifications have identical molecular masses (e.g. amino acid deamidations and carbamylations), we analysed the samples again to search for such events (Table S2). All the resulting analyses were merged in a single file, and the mis‐incorporations with higher scores were manually validated using the GPS Explorer software (Applied Biosystems, Life Technologies, Thermo Fisher Scientific Inc., Waltham, MA, USA) in order to remove false positives, as shown in Fig. 4. Matched sequences were validated if all major peaks in the MS/MS spectrum were explained by the candidate sequence and the spectrum contained peaks from a, b and/or y fragmentation ion series to confirm the peptide mis‐incorporation.
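The acceptance criteria listed above (40 ppm parent‐ion tolerance, ion score above 40, P value below 0.05) can be written as a small filter. The Python sketch below is only an illustration of these thresholds, not a reproduction of how Mascot applies them internally. The function names `ppm_error` and `is_accepted` are hypothetical, and the m/z values in the example are illustrative, borrowing the Glu‐fib standard mass as a convenient reference.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Mass error of an observed m/z relative to the theoretical value, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def is_accepted(observed_mz, theoretical_mz, ion_score, p_value,
                tol_ppm=40.0, min_score=40.0, alpha=0.05):
    """Apply the thresholds quoted in the text to a single identification.

    Default values mirror the search parameters stated above; the way they
    are combined here is an assumption made for illustration.
    """
    within_tolerance = abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm
    return within_tolerance and ion_score > min_score and p_value < alpha


# Illustrative values only (not measurements from the study)
print(round(ppm_error(1570.72, 1570.68), 1))
print(is_accepted(1570.72, 1570.68, ion_score=55, p_value=0.01))
print(is_accepted(1570.78, 1570.68, ion_score=55, p_value=0.01))
```

A deviation of 0.04 at m/z 1570.68 is roughly 25 ppm and passes the 40 ppm window, whereas a deviation of 0.10 is roughly 64 ppm and does not.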
Validated mis‐incorporations were quantified using Peak Explorer™ software (version 1.0, ABSciex, Framingham, MA, USA). For each individual MALDI run, a heatmap and peak list were generated. The calculated integrated area for each peak was exported to Microsoft Excel (Microsoft, Redmond, WA, USA), and the relative abundance (%) was calculated as follows: peak integrated area/sum of all peak integrated areas × 100.
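The relative abundance formula can be expressed directly in code. The following Python sketch assumes that "all peak integrated areas" refers to the peaks supplied in one peak list; the function name `relative_abundance`, the peak labels and the areas are hypothetical.

```python
def relative_abundance(peak_areas):
    """Convert integrated peak areas into relative abundances (%).

    peak_areas: mapping of peak label -> integrated area.
    Each value is area / sum of all areas x 100, as in the formula above.
    """
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("the sum of peak areas must be positive")
    return {label: area / total * 100 for label, area in peak_areas.items()}


# Hypothetical areas for one peptide and its mis-incorporated variant
areas = {"expected peptide": 9990.0, "mis-incorporated peptide": 10.0}
for label, percent in relative_abundance(areas).items():
    print(f"{label}: {percent:.2f}%")
```

With these toy areas the variant accounts for about 0.1% of the total, which is the order of magnitude reported for the mean mis‐incorporation level.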
Gene characterization and statistical analysis
G+C content, mean RSCU, CAI value, rare codons content, CPB and ENC values [60] for each gene were estimated using Anaconda 2.0 [16, 19, 41] and EuGene [12]. The RNAf, MFE and secondary structures were estimated using the RNAfold web server (http://www.tbi.univie.ac.at/RNA/RNAfold.html). Codon context and codon usage analysis were performed using Anaconda, uploading single genes as genomes and with the RSCU and tRNA gene copy number of E. coli strain BL21(DE3). The protein secondary structures of wild‐type LysRS and of the same protein with mis‐incorporations were estimated using EuGene [12, 61].
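As an illustration of one of the gene descriptors listed above, the G+C content overall and at each codon position can be computed as follows. This Python sketch is independent of Anaconda and EuGene, whose internal calculations are not described here; the function name `gc_by_codon_position` and the example sequence are hypothetical.

```python
def gc_by_codon_position(cds):
    """Return G+C content (%) overall and at codon positions 1, 2 and 3.

    cds: coding sequence (DNA or RNA) whose length is a multiple of three.
    """
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of 3")

    def percent_gc(bases):
        return 100 * sum(base in "GC" for base in bases) / len(bases)

    result = {"overall": percent_gc(cds)}
    for pos in range(3):
        # Every third base starting at `pos` belongs to codon position pos + 1
        result[f"GC{pos + 1}"] = percent_gc(cds[pos::3])
    return result


# Hypothetical three-codon sequence: ATG GCC AAA
for name, value in gc_by_codon_position("ATGGCCAAA").items():
    print(f"{name}: {value:.2f}%")
```

Reporting the three positions separately is what allows statements such as the first‐position and third‐position comparisons made in the Discussion.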
Analysis of variance (ANOVA) and regression for the growth rate, protein yield and different gene variables were performed using the SPSS Statistics 20 package (IBM, Armonk, NY, USA).
Based on our experience and previously published papers, we identified a set of 23 variables with relevance for translation fidelity (Fig. 1, panel 5, and Table 6). These variables took into account various aspects of the protein synthesis machinery, such as codon usage and context, the number and frequency of codons, the contribution of each codon to the CAI, the tRNA gene copy number, codon–anticodon interactions, mRNA and protein secondary structures, codon, anticodon and amino acid names, amino acid frequencies in the whole proteome, and error frequencies among replicates. Table 6 provides a detailed description of the variables. We used χ2 tests and Cramér coefficients to select the significant variables according to the strength of association between each variable and error occurrence. χ2 tests, calculation of Cramér coefficients, logistic regression and the Mann–Whitney U test (for quantitative variables) were performed using SPSS, and the remaining exploratory analysis was performed using R (version 2.15.1, http://www.R-project.org).
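For readers unfamiliar with the Cramér coefficient, the following Python sketch computes the χ2 statistic and Cramér's V for a contingency table of a categorical variable against error occurrence. It is a minimal illustration and not the SPSS procedure used in the study: the function name `chi_square_and_cramers_v` and the toy counts are hypothetical, the P value is not computed, and every row and column is assumed to have a non‐zero total.

```python
from math import sqrt


def chi_square_and_cramers_v(table):
    """Return (chi-square statistic, Cramér's V) for an r x c table.

    table: list of rows of observed counts, e.g. one row per category of
    the variable and one column per outcome (error / no error).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # V = sqrt(chi2 / (n * (min(r, c) - 1))), bounded between 0 and 1
    k = min(len(table), len(table[0])) - 1
    return chi2, sqrt(chi2 / (n * k))


# Toy 2 x 2 table (hypothetical counts): rows = categories, columns = error / no error
chi2, v = chi_square_and_cramers_v([[10, 20], [20, 10]])
print(f"chi2 = {chi2:.3f}, Cramér's V = {v:.3f}")
```

Because V rescales χ2 by the sample size and table dimensions, it measures the strength of an association rather than only its statistical significance, which is why a cut‐off on the coefficient can be used alongside the P value.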
Acknowledgements
The RNomics Laboratory is supported by the Portuguese Foundation for Science and Technology through projects PTDC/BIA‐GEN/110383/2009 and FCT‐ANR/IMI‐ANR/0041/2012.
Author contributions
GM, JLO and MAS planned the experiments; JLO and PG developed the software; LR and JF performed the experiments; RV performed mass spectrometry analysis and validations; RV and LR analysed the mass spectrometry data; GM and LR analysed data; LRP provided materials and helped with writing the paper; LR, GM and MAS wrote the paper.
References