Challenges in the annotation of pseudoenzymes in databases: the UniProtKB approach

The universal protein knowledgebase (UniProtKB) collects and centralises functional information on proteins across a wide range of species. In addition to the functional information added to all protein entries, for enzymes, which represent 20–40% of most proteomes, UniProtKB provides additional information about Enzyme Commission classification, catalytic activity, cofactors, enzyme regulation, kinetics and pathways, all based on critical assessment of published experimental data. Computer‐based analysis and structural data are used to enrich the annotation of the sequence through the identification of active sites and binding sites. While the annotation of enzymes is well‐defined, the curation of pseudoenzymes in UniProtKB has highlighted some challenges: how to identify them, how to assess their lack of catalytic activity, how to annotate their lack of catalytic activity in a consistent way and how much can be inferred and propagated from experimental data obtained from other species. Through various examples, we illustrate some of these issues and discuss some of the changes we propose to enhance the annotation and discovery of pseudoenzymes. Ultimately, improving the curation of pseudoenzymes will provide the scientific community with a comprehensive resource for pseudoenzymes, which in turn will lead to a better understanding of the evolution of these molecules, the aetiology of related diseases and the development of drugs.


Introduction
During enzyme evolution, gene duplication and the accumulation of mutations affecting residues involved in catalysis have given rise to a group of enzyme-related proteins that have lost their capacity to catalyse biochemical reactions [1,2] (Thornton et al., this issue). Despite the loss of their original catalytic function, these proteins, known as pseudoenzymes, are remarkably well-conserved. They are found in almost all enzyme families, where they represent between 10% and 15% of the members, and are distributed across the whole tree of life. Recent years have witnessed a surge in pseudoenzyme research uncovering their biological roles, particularly those belonging to the most abundant enzyme groups, namely kinases [3][4][5], phosphatases [4] and proteases [6,7]. These studies have revealed that, despite the lack of enzymatic activity, these proteins have evolved essential catalytic-independent functions, explaining why there has been a selective pressure to retain them. These roles, which are described in more detail in Ref. [8,9], include: (a) allosteric activation of an active enzyme, for example, myotubularin-related pseudophosphatase MTMR9 [universal protein knowledgebase (UniProtKB) Q96QG7] binds to MTMR6 and increases MTMR6 lipid phosphatase activity [10]; (b) control of the localisation and/or assembly of macromolecular complexes, for example, pseudophosphatase STYX (serine/threonine/tyrosine-interacting protein, UniProtKB Q8WUJ0) anchors the mitogen-activated protein kinases MAPK1 and MAPK3 in the nucleus [11]; (c) assemblage of signalling cascades, for example, kinase suppressor of Ras 1 (KSR1; UniProtKB Q8IVT5) recruits various components of the MAPK/Erk signalling cascade [12]; and (d) competition for substrate binding or complex assembly, for example, Caenorhabditis elegans pseudophosphatase egg-4 (UniProtKB O01767) sequesters and inhibits phosphorylated kinase mbk-2 [13].
It has become apparent from these studies that some pseudoenzymes are also linked to diseases [4,14]. A well-characterised case is Charcot-Marie-Tooth disease, a neurodegenerative disorder caused by mutations affecting pseudophosphatases SBF2/MTMR13 (UniProtKB Q86WG5) and SBF1/MTMR5 (UniProtKB O95248) [15,16]. In part due to their capacity to regulate enzymes, pseudoenzymes have also attracted interest as potential targets for therapeutic treatments [14].
The growing interest in pseudoenzymes led to two successful international meetings in 2016 and 2018 where various topics were discussed, including how bioinformatics tools could advance pseudoenzyme study. Among these tools, protein databases play an instrumental role by providing repositories for protein-related data where functional information and protein sequences are brought together. For example, the Protein Kinase Ontology resource [17] has established a list of all known and predicted pseudokinases across all kingdoms of life [18]. Similarly, the peptidase database MEROPS includes pseudoproteases where they are defined as nonpeptidase homologues [19]. While these resources provide invaluable data, they focus only on one specific enzyme family.
The UniProt Knowledgebase (UniProtKB) provides the scientific community with free access to more than 150 million protein sequences (release 2019_05) annotated with high-quality functional information [20]. Reviewed entries (also known as UniProtKB/Swiss-Prot entries) have been enriched with information extracted from peer-reviewed literature by expert curators. Unreviewed entries (also known as UniProtKB/TrEMBL entries) have functional information added automatically by transferring annotation from well-studied, closely related orthologs.
UniProtKB records are regularly assessed and revised to integrate new advances in the protein biology field. This ensures that we provide users with accurate and up-to-date information. The recent advances made in the pseudoenzyme field prompted us to revisit those records in UniProtKB describing pseudoenzymes and update their content.
In this study, we present an outline of the process and the challenges faced in reviewing pseudoenzymes including how they are identified, how the information related to their loss of activity is captured and presented in a concise manner and finally, how we improve their discoverability. The ongoing improvements to pseudoenzyme annotation will provide the scientific community with a valuable resource to facilitate pseudoenzyme biology and the study of pseudoenzyme and enzyme evolution.

Identification of pseudoenzymes
Demonstrating unequivocally that an enzyme is catalytically inactive is notoriously challenging. In UniProtKB, curators use three main types of evidence: (a) evidence based on sequence analysis and/or structural data; (b) evidence based on experimental assays; and (c) evidence based on sequence similarity or orthology. Each of these evidence types provides some information but also has its own caveats and may conflict with other evidence types. All of this evidence is combined and carefully assessed before a decision is made regarding the activity of the protein.

Sequence analysis-based evidence
Computer-based protein sequence analysis is commonly used to predict the lack of catalytic activity. The recent curation of the C. elegans kinome [21] and phosphatome (R. Zaru et al., manuscript in preparation) showed that, among the reviewed members that have been functionally characterised, > 95% of the proteins identified as inactive are classified as pseudokinases or pseudophosphatases based on sequence analysis evidence only (Fig. 1A,B). Usually, this method is based on the absence of essential residues that have been shown experimentally to be critical for the enzymatic reaction. For example, the myotubularin-related phosphatase family contains five members in C. elegans. They are involved in the dephosphorylation of the D3 position of phosphatidylinositol 3-phosphate and phosphatidylinositol 3,5-bisphosphate [22]. The reaction mechanism involves a highly conserved C-X5-R (C: cysteine, X: any amino acid, R: arginine) motif containing the essential cysteine and arginine residues which stabilise the substrate by forming a thiol-phosphate intermediate [23]. An alignment of their sequences shows that two of the five members lack the essential cysteine residue in the phosphatase domain and thus are predicted to be inactive (Fig. 1C).
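The motif check described above can be illustrated with a minimal sketch. It assumes the C-X5-R motif is searched as a plain regular expression over a domain sequence; the function name and the two toy fragments are invented for illustration and are not real myotubularin sequences. Real annotation relies on alignments against characterised family members, so a bare pattern match of this kind is only a first-pass indication.

```python
import re

# C-X5-R: a cysteine, any five residues, then an arginine.
CX5R = re.compile(r"C.{5}R")


def has_cx5r_motif(domain_sequence: str) -> bool:
    """Return True if the sequence contains at least one C-X5-R match."""
    return CX5R.search(domain_sequence.upper()) is not None


# Hypothetical toy fragments (NOT real protein sequences).
toy_active = "VHCSDGWDRTAQ"    # cysteine present, arginine six positions later
toy_inactive = "VHGSDGWDRTAQ"  # cysteine replaced by glycine

for name, seq in [("toy_active", toy_active), ("toy_inactive", toy_inactive)]:
    status = "motif present" if has_cx5r_motif(seq) else "motif absent, predicted inactive"
    print(f"{name}: {status}")
```

A missing motif only supports a prediction of inactivity; as the limitations discussed in this section show, it cannot replace experimental evidence.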
While this method is valuable in predicting the lack of catalytic activity, it has some limitations and can sometimes be misleading. Firstly, sequence analysis software relies on a good understanding of residues and/or motifs implicated in the reaction. Thus, for enzymes for which the residues involved in catalysis have not yet been identified, the capacity of sequence analysis methods to predict the lack of catalytic activity will be low.
Secondly, the catalytic mechanism may have evolved to result in the use of alternative residues. One of the best characterised examples of this is the serine/threonine-protein kinase WNK1 (protein kinase with no lysine 1; UniProtKB Q9JIH7). Based on sequence analysis, WNK1 is predicted to be inactive as it lacks the catalytic lysine in the kinase subdomain II that is crucial for binding to ATP. However, in kinase assays, WNK1 was proven to be catalytically active due to an alternative lysine at position 233 in the kinase subdomain I becoming involved in ATP binding [24].
Thirdly, sequence analysis may mistakenly predict the lack of catalytic activity or predict the wrong enzymatic activity. This has been elegantly shown for two bacterial enzymes SelO [25] and SidJ [26]. Based on sequence analysis, these two proteins contain a domain that resembles the protein kinase domain. Only by combining experimental data with 3D structure analysis was their actual catalytic activity determined; SelO turns out to use ATP to AMPylate proteins and SidJ acts as a protein polyglutamase. These last examples illustrate the importance of combining sequence information with experimental data to assess the enzymatic activity of a protein.

Experimental evidence
The most convincing method for confirming the predicted lack of enzymatic activity is to test the protein in a biological assay, usually comparing the predicted pseudoenzyme with a closely related active enzyme. Often, site-directed mutagenesis of the missing catalytic site is used to restore activity. For example, mutating the glycine residue at position 120 in the C-X5-R motif of human pseudophosphatase STYX to the catalytic cysteine restores its phosphatase activity [11]. This method, although not without its own caveats, confirms that the lack of catalytic activity predicted by sequence analysis is due to the replacement of the catalytic site residue. While convincing experimental confirmation of the lack of catalytic activity is highly desirable, some caution is nonetheless required when interpreting the results, as enzymatic assays can be misleading. The lack of detectable activity can be the result of inappropriate experimental conditions. pH and temperature can affect the activity of an enzyme, as demonstrated by lysosomal proteases which require an acidic environment for their activity, whereas thermophilic DNA polymerases are active only at high temperature. Often, the physiological substrate is unknown or, despite being closely related, two enzymes can have very different targets. For example, Nat8f2 (N- member 2; UniProtKB Q8CHQ9) is predicted to be an acetyltransferase but, so far, no histone acetyltransferase activity has been detected, although histone proteins are well-characterised substrates of other Camello family members [27]. Enzymes are rarely constitutively active and often require either post-translational modifications such as phosphorylation and/or binding to other protein partner(s) or small molecules. The possible contamination of the assay with an active enzyme can also result in the wrong attribution of activity, a common problem when the pseudoenzyme is obtained via immunoprecipitation.
This is an important issue to consider as pseudoenzymes often associate with and regulate the activity of their active counterparts. For example, the pseudophosphatase SBF2/MTMR13 binds to MTMR2 to promote MTMR2 phosphatidylinositol phosphatase activity [28]. Sometimes, the activity detected is very low. In this circumstance, the pseudoenzyme designation is made on a case-by-case basis, considering all the available evidence. For example, in the kinase domain of KSR2 (UniProtKB Q6VAB6), the lysine residue in the VAIK motif is replaced by an arginine, suggesting that the protein is inactive, but low protein kinase activity has been detected in vitro [29]. The interaction with BRAF is proposed to induce a conformational change that increases the low intrinsic kinase activity. In this specific case, KSR2 has been recorded as active in UniProtKB with a comment added to explain that KSR2 kinase activity is currently uncertain.

Orthology-based evidence
UniProt makes use of orthology to allow the propagation of functional information between similar proteins in different species and to provide consistent information across orthologs. To identify putative orthologs, curators combine results from reciprocal Blast searches with data from other resources including scientific literature, sequence analysis tools, phylogenetic and comparative genomics databases, and other specialised databases such as species-specific collections.
In some cases, orthology and sequence analysis prediction give rise to apparently contradictory results. This is particularly true when, for example, orthologs use alternative residues instead of the canonical catalytic sites [30]. The proteolytic activity of serine proteases is based on a Ser/His/Asp triad where the serine residue acts as a nucleophile (Fig. 2A). Among the 32 mammalian PRSS50/TSP50 (testis-specific protease-like protein 50) protein entries in UniProtKB, 25% have a threonine instead of a serine residue, suggesting that they may be devoid of proteolytic activity (Fig. 2B). However, it has been shown that the threonine can replace the serine residue in the reaction mechanism [31]. This case illustrates the ability of reaction mechanisms to evolve by exploiting closely related residue substitutions and the importance of experimental evidence to support sequence analysis. Evolution can result in residue changes that lead to the loss of catalytic activity, a situation that becomes apparent when comparing distant homologues. For example, C. elegans ddr-1 (UniProtKB Q18163) and ddr-2 (UniProtKB Q95ZV7) are predicted homologues of human DDR1 (discoidin domain receptor 1; UniProtKB Q08345) and DDR2 (UniProtKB Q16832). In human DDR1 and DDR2 and C. elegans ddr-2, the catalytic site is conserved, whereas in C. elegans ddr-1, the aspartic acid residue has been replaced by a histidine, suggesting that ddr-1 is inactive. These two examples illustrate the importance of combining various types of evidence when deciding whether a protein has catalytic activity or not.

Specific annotation for pseudoenzymes
Once a protein sequence has been identified as a potential pseudoenzyme and all the available evidence has been assessed, the next step in the curation process is to translate this information into meaningful annotation. This annotation also needs to reflect the type of evidence used and enable pseudoenzyme discovery using the UniProtKB search engine. UniProtKB provides a wealth of protein-related information including function, subcellular location, expression and interacting partners, as well as key residues within the protein sequence such as those which are post-translationally modified. This concise summary uses a combination of controlled vocabularies and free text which facilitates the retrieval and discoverability of proteins matching specific criteria.
For active enzymes, which represent 45% of the reviewed entries, we add enzyme-specific information including the catalytic activity, the regulation mechanism, whether a cofactor is required and the positions of active site(s), cofactor and substrate binding sites (Fig. 3A). While the annotation of enzymes is well-established [21], the current annotation of pseudoenzymes needed to be revised to integrate new advances in the field. The revision process involved addressing various challenges to make sure that the new annotation workflow was appropriate. For this task, we considered two perspectives: the user point of view and the curator point of view. To provide the best information to our users, the challenges were: (a) where and how to display the information about the lack of catalytic activity; (b) how to provide the user with an evidence-supported reason such as lack of sites important for catalysis or cofactor binding; in other words, how to convey that despite sharing a similar catalytic domain with active enzymes, the domain of the pseudoenzyme is not functional; (c) how to ensure consistency in the annotation of pseudoenzymes; (d) how to highlight the fact that, despite their lack of catalytic activity, they share sequence homology with their active counterparts; and (e) how to ensure that the annotation is sufficiently unique to facilitate their discoverability. Curators needed to understand: (a) which criteria to use to identify bona fide pseudoenzymes; (b) how to evaluate the evidence available; (c) how to deal with conflicting results; and (d) how to efficiently apply the revised annotation to 'existing' reviewed pseudoenzyme entries. Ultimately, the re-evaluation of the existing pseudoenzyme annotation resulted in various improvements which are described in more detail below and highlighted in Fig. 3B.

Protein name
A protein name is often what provides researchers with a first hint about a protein function. Usually, when authors name a protein, they devise a meaningful name that offers a first indication of the protein function. During the curation of a UniProtKB entry, an official recommended protein name based on the name(s) provided by the literature and/or nomenclature committees is added. When the protein is known by more than one name, these names are included as synonyms. By providing users with a comprehensive list of protein names, the mining of the scientific literature is thus facilitated.
Whilst the names given to active enzymes often reflect their catalytic activity, naming their inactive counterparts has proven to be more challenging and various approaches have been used. Some names reflect the noncatalytic function of the pseudoenzymes (e.g. PPAF2 name is phenoloxidase-activating factor 2, UniProtKB Q9GRW0), while others use names that highlight their lack of enzymatic activity by including words such as 'inactive', '-like' or 'homologue' (e.g. DPP10 name is inactive dipeptidyl peptidase 10, UniProtKB Q8N608). To standardise pseudoenzyme names and avoid ambiguities that 'like' or 'homologue' could cause, curators follow the International Protein Nomenclature Guidelines (https://www.uniprot.org/docs/International_Protein_Nomenclature_Guidelines.pdf) and now include the word 'inactive' followed by the missing enzymatic activity in the official name or in a synonym (Fig. 3B).

Caution
The basis for the lack of catalytic activity is reported in the 'Function' section in a caution comment highlighted in yellow in the entry view on the UniProt website (Fig. 3B). The comment describes the nature of the conserved active site residues which are changed, any experimental evidence of inactivity, if available, and conflicting results. Importantly, the evidence used to infer this information is provided (Fig. 4 and below).

Sequence features
UniProtKB indicates important residues and regions within the protein sequence such as catalytic sites, functional domains and post-translational modifications obtained from computer-based sequence analysis in combination with experimental evidence. For both enzymes and pseudoenzymes, the position of the catalytic domain is usually provided based on sequence analysis tools such as InterPro (www.ebi.ac.uk/interpro/). For active enzymes, the position of the active site(s) is annotated, while, for pseudoenzymes, they are omitted even when the residue is conserved, as illustrated by C. elegans pseudokinase kin-32 (UniProtKB Q95YD4), which has been experimentally proven to be inactive [32]. However, when residues involved in cofactor binding or substrate binding are conserved, these are indicated, especially when they are supported by experimental evidence such as a 3D structure, as they can be important to stabilise the structure or to enable the protein to perform its noncatalytic function. For example, for pseudokinases, ATP binding to the inactive kinase domain is essential in maintaining their correct folding or in promoting their binding to other proteins [33]. Similarly, the annotation of substrate-binding residues is important as one of the functions of pseudoenzymes is to sequester substrates, as illustrated by C. elegans pseudophosphatase egg-4 mentioned previously.

Protein family
Although pseudoenzymes lack catalytic activity, they retain sequence similarities with active enzymes of the same protein family. For example, both active MTMR6 (UniProtKB Q9Y217) and inactive MTMR9 (UniProtKB Q96QG7) belong to the 'protein-tyrosine phosphatase family, nonreceptor class myotubularin subfamily'. To enable the identification of proteins with similar sequences, UniProtKB provides this information in the 'Sequence similarities' subsection of the 'Family and domains' section (Fig. 3A,B). Proteins are assigned to families using a range of sources including protein family databases, sequence analysis tools, scientific literature and sequence similarity search tools.

Inactive isoforms
In some rare cases, alternative RNA splicing during expression of enzyme-coding genes can result in the production of inactive isoforms. For example, HDAC9 (histone deacetylase 9, UniProtKB Q9UKV0) produces 11 isoforms. Isoform 1 displays histone deacetylase activity, whereas isoform 3 is inactive due to the loss of the domain containing the catalytic site residue [34]. The lack of enzymatic activity is indicated in a note attached to the isoform sequence.

Protein with catalytic and noncatalytic domains
Interestingly, some proteins that contain multiple catalytic domains have one domain that is inactive. These domains appear to have conformational roles either in stabilising the protein or by providing a mechanism to regulate the activity of the other domains. Such proteins are found in many enzyme families. For example, the receptor guanylate cyclases, members of both the guanylate cyclase and protein kinase families, contain one active guanylate cyclase domain and one inactive kinase domain [35]. These entries are annotated as active enzymes; however, curators also report the lack of catalytic activity of one of the catalytic domains in the caution comment together with an alternative name describing the lost enzymatic function (Fig. 5).

Conflicting results
To provide our users with an accurate identification of pseudoenzymes, curators ensure that the annotation reflects as much as possible the evidence available. This is particularly crucial when the various pieces of evidence described previously appear to contradict each other. The most common case is when a protein is predicted to be inactive based on sequence analysis but shows activity when tested experimentally, or vice versa. For example, C. elegans kinase drl-1 (UniProtKB Q86ME2) is predicted to be inactive as the catalytic site is not conserved. However, in an in vitro assay, some kinase activity has been detected [36]. After carefully assessing the evidence, drl-1 was annotated as inactive with a caution comment highlighting the discrepancy: 'Although the residues involved in the catalytic activity are absent, suggesting that the kinase is inactive, some kinase activity has been detected'. Similarly, the mannosidase activity of EDEM1 (ER degradation-enhancing alpha-mannosidase-like protein 1, UniProtKB Q92611) and EDEM2 (UniProtKB Q9BV94), which belong to the glycosyl hydrolase 47 family, is controversial [37]. In this case, they have been annotated as inactive, while mentioning that some mannosidase activity has been detected, until further evidence becomes available.

GO annotation
As part of the manual curation process, UniProtKB entries are enriched with Gene Ontology (GO) terms which describe gene products in terms of their associated biological processes, molecular functions and cellular components in a species-independent manner [38,39]. UniProtKB curators assign GO terms to all reviewed entries based on experimental data from the curated literature. The 'molecular function' ontology contains GO terms for most of the Enzyme Commission (EC) numbers. There is no GO term as such to describe the lack of catalytic activity. Instead, the NOT qualifier is used in combination with the GO term corresponding to the expected specific enzymatic activity. For example, the GO annotation for inactive MTMR5 is NOT + GO term phosphatase activity (GO:0016791). Ideally, the annotation is supported by experimental manual evidence, but for most pseudoenzymes, an evidence code based on sequence analysis only (inferred from key residues) is used.
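To make the negated annotation concrete, the following is a minimal sketch of how a GO term combined with the NOT qualifier might be represented in code. The `GoAnnotation` class and its field names are assumptions made for illustration and do not reflect the UniProtKB or GO data formats; the accession, GO identifier and term are those given in the text for MTMR5, and the evidence string is only illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoAnnotation:
    """Hypothetical container for one GO annotation on a protein entry."""
    accession: str   # UniProtKB accession of the annotated protein
    go_id: str       # GO term identifier
    go_term: str     # GO term name
    negated: bool    # True when the NOT qualifier is applied
    evidence: str    # free-text evidence description (illustrative only)


def describe(annotation: GoAnnotation) -> str:
    """Render the annotation, prefixing NOT when the term is negated."""
    qualifier = "NOT " if annotation.negated else ""
    return f"{annotation.accession}: {qualifier}{annotation.go_term} ({annotation.go_id})"


# Inactive MTMR5 (SBF1), as described in the text.
mtmr5 = GoAnnotation(
    accession="O95248",
    go_id="GO:0016791",
    go_term="phosphatase activity",
    negated=True,
    evidence="inferred from key residues",
)

print(describe(mtmr5))
```

Keeping the qualifier as a separate flag, rather than folding it into the term name, means the same GO term can be used to group both active enzymes and their inactive relatives.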

Data evidence
As demonstrated above, the evidence source is crucial to assess the strength of the information used to support the lack of catalytic activity. For each piece of information that we annotate, UniProtKB provides a direct link to its original source so that users can easily identify its origin and evaluate it. UniProtKB makes use of a subset of evidence codes from the Evidence and Conclusion Ontology (ECO) to indicate data origin [40]. These ECO codes are shown directly in the text version of the entries, while on the UniProtKB website, they are transformed into user-friendly, easy-to-understand labels (Fig. 4) [21]. For instance, for information inferred from experimental data, we provide a link to the original paper. For information which has been transferred from a related experimentally characterised protein, the accession number of the characterised protein is indicated, providing a link to the entry with experimental evidence. Similarly, information based on computer-based sequence analysis is indicated as such. An analysis of the serine endopeptidase (S1 protease) family showed that, out of the 874 reviewed UniProt entries, 74 are annotated as inactive (Fig. 2C,D). Strikingly, only one of them has experimental evidence for the lack of catalytic activity, whereas for the active S1 proteases, more than 30% have experimental evidence to support their catalytic activity. This is in agreement with what we found for the C. elegans pseudokinome and pseudophosphatome described earlier where the predominant evidence for the loss of enzymatic activity comes from sequence analysis prediction.
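The translation of ECO codes into reader-friendly labels can be sketched as a simple lookup. The three codes below are given as recalled from the UniProt evidence documentation and should be checked against it before use; the label wording, the fallback text and the function name are assumptions for illustration rather than the exact strings shown on the website.

```python
# Assumed subset of ECO codes used for manual assertions in UniProtKB.
# The label text is paraphrased for illustration.
ECO_LABELS = {
    "ECO:0000269": "experimental evidence (linked publication)",
    "ECO:0000250": "by similarity (transferred from a characterised protein)",
    "ECO:0000255": "sequence analysis",
}


def evidence_label(eco_code: str) -> str:
    """Map an ECO code to a short label, with a generic fallback."""
    return ECO_LABELS.get(eco_code, "other evidence")


for code in ("ECO:0000269", "ECO:0000250", "ECO:0000255", "ECO:9999999"):
    print(code, "->", evidence_label(code))
```

Grouping annotations by such labels is one way to reproduce the kind of breakdown reported for the S1 protease family, where sequence analysis dominates the evidence for inactivity.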

Prediction and automatic annotation
The advances in sequencing techniques in the last decades have led to an explosion in the number of sequenced genomes. In 2018 alone, 29 316 new proteomes were imported into UniProt and the flow of new sequenced genomes is not slowing down. In the first half of 2019, 28 180 proteomes had already been integrated. These newly imported sequences are presented as unreviewed entries (> 150 million entries in the 2019_05 release). Due to the time required for manual curation and the lack of available experimental characterisation, most of them will remain unreviewed. Yet, UniProt does provide functional information for these entries using rule-based systems to automatically annotate and classify them. Together with predictions from a suite of sequence analysis methods, they enrich the records with information describing protein names, function, catalytic activity, pathway and family memberships, and subcellular location, along with sequence-specific information. These rules are kept up-to-date and all predictions are refreshed with each UniProtKB release to ensure the latest state-of-knowledge is applied. The Unified Rule system, or UniRule, contains rules designed and tested by curators using experimental data from manually reviewed entries [20]. These rules use the presence of specific protein signatures together with taxonomy to predict the biochemical features and biological role of a protein (Fig. 6).
Out of the 7222 UniRules implemented in the UniProt automatic annotation pipeline, 2580 rules (36%; release 2019_05) are specific for annotating enzymes. These rules provide annotation for the name, EC number, catalytic activity, active sites, cofactor, enzyme-related keywords and GO terms for more than 20 million unreviewed entries (13% of the total) covering the four superkingdoms (bacteria, eukaryotes, viruses and archaea). While enzyme and pseudoenzyme identification is linked, there is no rule yet for the annotation of pseudoenzymes as such. At present, if an entry does not meet the criteria for an active enzyme, that is, the presence of critical residues such as active site(s) in a specific family, no annotation is made, and the following caution comment is usually added: 'Lacks conserved residue(s) required for the propagation of feature annotation'. Could rules be designed to automatically annotate pseudoenzymes? Or could the existing enzyme prediction rules be improved by including additional and/or more stringent criteria? Answering these questions is not an easy task as there are many challenges affecting the design of these rules that need to be considered, including (a) reliable criteria for prediction; (b) well-characterised templates; and (c) conservation of these criteria across species. For enzymes where the reaction mechanism is well-known, such as protein kinases, it could be possible to update the existing rules adding new conditions which would enable the identification and labelling of potential pseudokinases. While addressing the challenges described above is still an ongoing project, in the end, these rules will provide an invaluable tool to expand the prediction and identification of potential pseudoenzymes in UniProtKB.
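The kind of extended rule discussed above can be sketched as follows. This is not the UniRule format or its implementation: the `Rule` class, the signature identifier, the outcome strings and the toy sequences are all hypothetical, and residue positions are treated as fixed sequence positions, whereas a real rule would map them through an alignment to a characterised template.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Rule:
    """Hypothetical enzyme rule: a family signature plus required residues."""
    signature: str               # made-up family signature identifier
    active_site: Dict[int, str]  # 1-based position -> required residue


def apply_rule(rule: Rule, signatures: Set[str], sequence: str) -> str:
    """Return an annotation decision for one unreviewed sequence."""
    if rule.signature not in signatures:
        return "rule not applicable"
    # Collect required positions that are out of range or carry another residue.
    missing = [
        pos for pos, residue in sorted(rule.active_site.items())
        if pos > len(sequence) or sequence[pos - 1] != residue
    ]
    if not missing:
        return "annotate as active enzyme"
    # Extra condition: label the entry instead of leaving it unannotated.
    return "candidate pseudoenzyme: lacks residue(s) at " + ", ".join(map(str, missing))


toy_rule = Rule(signature="SIG_TOY", active_site={3: "C", 9: "R"})
print(apply_rule(toy_rule, {"SIG_TOY"}, "VHCSDGWDRTAQ"))
print(apply_rule(toy_rule, {"SIG_TOY"}, "VHGSDGWDRTAQ"))
print(apply_rule(toy_rule, set(), "VHGSDGWDRTAQ"))
```

The sketch shows only the control flow. The hard parts named in the text, namely reliable criteria, well-characterised templates and conservation across species, determine whether the "candidate pseudoenzyme" label would be trustworthy.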

Searching for pseudoenzymes in UniProt
One important goal behind the revision of the pseudoenzyme annotation was to improve their discoverability. UniProtKB can be queried using the search box on the top of the website page either by typing terms directly into the box or by using the advanced search options. The advanced search allows our users to restrict search terms to specific fields in a UniProtKB entry and, if required, to combine multiple fields using Boolean logic. For example, using the search term 'inactive' in the 'Protein name' field allows the retrieval of pseudoenzymes. This search retrieved 455 reviewed entries (release 2019_05). Although this number is far from reflecting the total number of existing inactive enzymes, the analysis of these members offers a preliminary insight in terms of what type of information can be retrieved. As shown by previous studies, the majority of pseudoenzymes identified so far are from eukaryotic species, but a substantial number are also found in bacteria and viruses. They belong to over 100 enzyme families confirming that inactive members are present in almost all families. Twenty-one of them have an inactive domain combined with an active domain. So far, 269 have a caution comment providing an explanation for the lack of catalytic activity.
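The name-based search described above can also be expressed programmatically. The sketch below only builds a query URL and does not contact any server. The endpoint and the `protein_name` and `reviewed` field names are assumptions based on the current UniProt REST interface, which postdates release 2019_05, so they should be checked against the UniProt documentation. The number of entries returned today will differ from the 455 reported here.

```python
from urllib.parse import urlencode

# Assumed endpoint of the current UniProt REST interface.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query(term: str = "inactive", reviewed_only: bool = True) -> str:
    """Build a search URL restricting `term` to the protein name field."""
    query = f"protein_name:{term}"
    if reviewed_only:
        # Combine two fields with Boolean logic, as in the advanced search.
        query = f"({query}) AND (reviewed:true)"
    return BASE_URL + "?" + urlencode({"query": query, "format": "tsv"})


print(build_query())
```

Searching on the word 'inactive' only finds entries whose names follow the naming convention described earlier, which is why the count underestimates the true number of pseudoenzymes.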

Discussion
In the era of high-throughput experiments, databases play an instrumental role in the analysis of large datasets. They not only provide a tool for identification but are also often the initial source of functional information. In the protein field, UniProt is a unique resource that currently gives access to more than 150 million sequences belonging to over 800 000 species combined with functional information based on expert curation and automatic predicted annotation.
To ensure that users are provided with the latest knowledge, the annotation is constantly revised and updated. This is made possible by keeping up-to-date with the latest advances in specific protein fields through the literature, conferences, workshops and, most importantly, through discussion with scientific experts. Curators play an active role in community workshops and are involved in activities such as writing nomenclature guidelines or classification systems which can then be adopted by the UniProt database [41,42]. One such collaboration has been particularly fruitful, leading to improvements in the curation of pseudoenzymes, their description and in enhancing their discoverability [41]. Researchers with specialist knowledge are actively encouraged to contribute to the manual curation process by highlighting key publications and critical information which should be included in specific entries. To this end, we provide mechanisms by which users can give feedback on UniProtKB entries, for example, enabling researchers to submit additional bibliography to UniProt entries, with ORCIDs used to both validate and credit contributions (http://insideuniprot.blogspot.com/2019/07/) and by providing direct feedback links from every protein record. To expand the information contained in a UniProt entry, we also integrate data from other specialised databases, including several enzyme resources. Thus, in each UniProt entry, users can find direct links to relevant external resources (UniProt release 2019_07 provides cross-references to 170 specialised external resources), which they can use to find further information on their protein of interest.
The criteria used to identify pseudoenzymes are intricately linked to how catalytic activity is assessed in their active counterparts. The capacity to predict accurately that a protein is devoid of catalytic activity correlates with how well the reaction mechanism, in terms of the residues involved, is known in the active members of the related family. Among the various methods used to identify pseudoenzymes, the most commonly used are, by far, based on sequence analysis prediction (up to 95%). This highlights a need to improve and extend our understanding of the molecular mechanism of active enzymes, and manually curated repositories such as the Mechanism and Catalytic Site Atlas (M-CSA) reaction database are instrumental in this [43]. Comparison of structural data between enzymes and pseudoenzymes has been instrumental in understanding the evolution of the catalytic domain and the reaction mechanism, in particular which residues and/or motifs are key, and how pseudoenzymes achieved their catalytic-independent functions. Importantly, the increasing number of experimentally solved 3D structures (> 150 000 in the protein structure database PDB) together with structural protein domain evolution databases such as CATH/Gene3D [44,45] and the advances in 3D prediction model software will facilitate their study. A better understanding of the enzymatic reaction mechanism at the molecular level will contribute to the development of accurate prediction tools or rules to identify and automatically annotate putative pseudoenzymes.
While this review focuses mainly on how pseudoenzymes are identified and how the lack of catalytic activity is reported in UniProt, we obviously also annotate their catalytic-independent roles which are described in the 'Function' section of an entry. The effort invested in reporting a 'non'-function may appear trivial. However, the molecular reasons behind the loss of catalytic activity often provide crucial clues to understand the actual functions of a pseudoenzyme.
UniProt provides researchers with a unique resource for the study of pseudoenzymes, providing a snapshot of the magnitude of the biological processes they are involved in and helping to explain why their catalytic domain is no longer functional. Importantly, these data will lead to a better understanding of the evolution of pseudoenzymes and their active counterparts and the aetiology of related diseases. It will also support the ongoing quest to target pseudoenzymes for therapeutic treatments and offer some insight into the expanding field of enzyme engineering.