The modular structure of α/β‐hydrolases

The α/β‐hydrolase fold family is highly diverse in sequence, structure and biochemical function. To investigate the sequence–structure–function relationships, the Lipase Engineering Database (https://led.biocatnet.de) was updated. Overall, 280 638 protein sequences and 1557 protein structures were analysed. All α/β‐hydrolases consist of the catalytically active core domain, but they might also contain additional structural modules, resulting in 12 different architectures: core domain only, additional lids at three different positions, three different caps, additional N‐ or C‐terminal domains and combinations of N‐ and C‐terminal domains with caps and lids respectively. In addition, the α/β‐hydrolases were distinguished by their oxyanion hole signature (GX‐, GGGX‐ and Y‐types). The N‐terminal domains show two different folds, the Rossmann fold or the β‐propeller fold. The C‐terminal domains show a β‐sandwich fold. The N‐terminal β‐propeller domain and the C‐terminal β‐sandwich domain are structurally similar to carbohydrate‐binding proteins such as lectins. The classification was applied to the newly discovered polyethylene terephthalate (PET)‐degrading PETases and MHETases, which are core domain α/β‐hydrolases of the GX‐ and the GGGX‐type respectively. To investigate evolutionary relationships, sequence networks were analysed. The degree distribution followed a power law with a scaling exponent γ = 1.4, indicating a highly inhomogeneous network which consists of a few hubs and a large number of less connected sequences. The hub sequences have many functional neighbours and therefore are expected to be robust toward possible deleterious effects of mutations. The cluster size distribution followed a power law with an extrapolated scaling exponent τ = 2.6, which strongly supports the connectedness of the sequence space of α/β‐hydrolases.

Introduction
α/β-Hydrolases represent a rapidly growing enzyme family with a common fold and a similar active site. The α/β-hydrolase fold consists of a central β-sheet packed between two layers of α-helices [1]. The active site consists of a catalytic triad, formed by a nucleophile (serine, aspartate or cysteine), a histidine and a catalytic acid (aspartate or glutamate), and two or three amino acids which form the oxyanion hole. In addition to this conserved, catalytically active core domain, most α/β-hydrolases contain further modules: a lid, a cap, an N-terminal domain, or a C-terminal domain. Despite the high structural similarity of the core domain and the catalytic machinery, α/β-hydrolases show a high sequence diversity and include enzymes with a broad variety of catalytic activities such as acetylcholinesterases, acyltransferases, amidases, dehalogenases, dienelactone hydrolases, epoxide hydrolases (EH), esterases, hydroxynitrile lyases, lipases, peroxidases, proteases and thioesterases [2][3][4][5]. Enzymes with different catalytic activities might be expected to be separated in sequence space and to form separate subfamilies. However, the catalytic activities overlap rather than being clearly separated in sequence space. Many α/β-hydrolases have multiple catalytic activities, and single amino acid positions discriminate between activities such as lipase and amidase [5]. The exchange of single amino acids switched a lipase into an EH [6] or a catalyst for Michael additions [7], a hydrolase into an acyltransferase [8], an esterase into an amidase [9,10], an esterase into an EH [11] and an esterase into a hydroxynitrile lyase [12].
Lipases and other α/β-hydrolases are widely used in biocatalysis because of their high catalytic activity, their regio- and stereoselectivity, their stability in nonpolar media and their ability to catalyse alcoholysis, esterification and transesterification reactions [13,14]. Metagenomic screenings have discovered an ever-increasing stream of new sequences [15,16]. Therefore, there is a growing need for the classification of newly discovered sequences into protein families and for the prediction of biochemical and biophysical properties from sequence information [17]. Protein family databases on α/β-hydrolases are versatile tools to systematically compare protein sequences, to annotate functionally relevant amino acids and to assign new sequences to already known subfamilies. Protein family databases such as the Lipase Engineering Database (LED) [18], the database of the α/β-hydrolase fold superfamily ESTHER [19] and the α/β-Hydrolase Fold 3DM Database ABHDB [20] provide a large collection of sequences and structures of α/β-hydrolases, group them into protein families, perform multiple sequence alignments, and derive sequence profiles.
Although sequence alignment allows for a systematic analysis of close homologues, it misses structural and functional relationships between remote homologues. Therefore, a classification based on structural and functional properties is needed. The systematic comparison of the amino acids forming the oxyanion hole allowed the assignment of α/β-hydrolases to two classes, the GX- and GGGX-types [18]. In GX-types, the first part of the oxyanion hole is formed by the backbone N-H of an amino acid (X) in a GX motif, whereas in GGGX-types it is formed by the backbone N-H of the third glycine in a GGGX motif. Later, a third type of oxyanion hole (Y-type) was discovered [21]. In Y-type hydrolases, the oxyanion hole is formed by the sidechain of a bulky amino acid, mainly tyrosine or aspartate. The classification by GX- and GGGX-types has high predictive value: in contrast to GX-types, most of the GGGX-types are active toward tertiary alcohols [22]. A systematic comparison of the structures of EHs revealed the modular structure of this subfamily of α/β-hydrolases with a high similarity within the core domains and within the caps [23]. Cytosolic and microsomal EHs mainly differ by the length of a single loop inside the cap and by their N-terminal domains. Because of their modular structure, the global sequence similarity is misleading, resulting in an overestimation of diversity. Using this classification, the structures of core domains and caps of all EHs and the cap loops of most of the EHs could be reliably modelled [23]. Thus, modularity is key to sequence classification, structure modelling and prediction of function. This analysis was extended to the families of haloalkane dehalogenases and prolyl iminopeptidases, which are homologous to EH despite their different catalytic activities [24].
In the meantime, additional domains were identified in other hydrolases, and their functional role was discussed, such as the N-terminal β-propeller domain in acyl aminoacyl peptidases [25], the role of subdomains in the catalytic behaviour of lipases and acyltransferases [26], the stabilizing N-terminal domain in the murine liver EH (MLEH) [27] or the importance of the C-terminal domain in colipase binding of pancreatic lipases [28].
In this paper, we analysed the architectures of all known α/β-hydrolase structures. The modular structure of α/β-hydrolases can be described as a combination of three core domains (GX, GGGX and Y-types) with caps, N-terminal or C-terminal domains. In addition, a mobile element (the lid) is located at five different positions in lipases. For the analysis, the LED was updated and now contains 280 638 sequences and 1557 structures (https://led.biocatnet.de).

Update of the LED
The LED (https://led.biocatnet.de) was updated starting from the previous version (release 3.0, December 2009) which contained 24 783 sequences from 18 585 proteins and 1117 structures. All structures from this release together with newly identified structures of MHETases [29] were clustered. The resulting centroids were then used as seed sequences (query sequences) for a BLAST search in the NCBI non-redundant protein database and in the Protein Data Bank (PDB), which led to the identification of more than 450 000 putative α/β-hydrolase sequences. One homologous family was created for every centroid sequence, and sequence homologues were added. The centroids of each homologous family and the unclassified structures were analysed visually, revealing a similar core structure which contains the active site, the oxyanion hole residues, and 10 additional modules that can be attached to the core domain: a lid between β-strands β+1 and β+2, β−1 and β0, β−4 and β−3, β+3 and β+4, or between the N terminus and β−3, a single cap, a double cap, an N-terminal cap, two N-terminal domains or a C-terminal domain. These additional modules determine the architecture of the α/β-hydrolase. Interestingly, some of these elements, such as the lid between β+3 and β+4 and the lid between the N terminus and β−3, only occur in combination with a C-terminal domain (Fig. 1). One superfamily was created for each architecture. The newly classified sequences were clustered with a similarity threshold of 60%, and new homologous families were created and assigned to their respective superfamilies. Superfamilies were named according to the architecture, whereas the homologous families were named according to their centroid sequence, and groups were formed by the oxyanion hole types (GX, GGGX and Y). Similarly, the superfamilies were assigned to five groups (core, lid, cap, one additional domain or two additional domains).
The update resulted in a new version of the LED, which comprises 280 638 individual sequence entries assigned to 198 844 protein entries by a threshold of 98% sequence identity and contains 1557 protein structures in 2772 homologous families and 12 superfamilies. The largest superfamily in the current version of the LED is the N-terminal domain superfamily (GX- and Y-types), containing 25% of the α/β-hydrolases, followed by the single cap superfamily (GX-, GGGX- and Y-types, 23%), the N-terminal cap family (GX- and GGGX-types, 21%), and the core domain family (GX- and GGGX-types, 13%). The smallest protein in the LED with known structure is a lipase of Bacillus subtilis (LED sequence ID 60, PDB-ID: 2QXT and 2QXU) with a length of 179 amino acids and a molecular weight of 19 kDa; the largest protein is a dipeptidyl peptidase 8 (LED sequence ID 456274, PDB-ID: 6EOP) with a sequence length of 900 amino acids and a molecular weight of 104 kDa.
Of each homologous family, one protein with available structure was visually analysed, and the catalytic triad, the oxyanion hole residues and structural modules such as lids, single caps, double caps, N-terminal caps, N-terminal domains and C-terminal domains were annotated. The annotation information was then transferred from the centroids to the other members of a homologous family based on a multiple sequence alignment. In total, 162 737 sequences were annotated (58% of all sequences of the LED). Based on all annotation entries in the LED, the lengths of the additional modules and domains were analysed. Lids are the shortest modules with a length of 20 ± 7 amino acids (mean ± standard deviation), N-terminal caps and the N-terminal part of the double caps are similar in size (40 ± 8 and 53 ± 14 amino acids respectively), single caps are larger (75 ± 18 amino acids), and the additional C- and N-terminal domains are largest and vary considerably in size (367 ± 129 amino acids).
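The transfer of annotations from a centroid to other family members can be illustrated with a minimal sketch. It assumes that the centroid and the member are available as two rows of the same multiple sequence alignment with `-` as the gap character. The function name `transfer_annotation`, the labels and the toy sequences are hypothetical and are not taken from the LED.

```python
from typing import Dict, Optional


def transfer_annotation(
    centroid_row: str,
    member_row: str,
    centroid_positions: Dict[str, int],
) -> Dict[str, Optional[int]]:
    """Map 1-based residue positions annotated on the centroid onto a member.

    Both rows must come from the same alignment (equal length, '-' = gap).
    A label maps to None if the member has a gap in the annotated column.
    """
    if len(centroid_row) != len(member_row):
        raise ValueError("rows must come from the same alignment")

    # Alignment column of every centroid residue (1-based residue numbering).
    column_of_centroid = {}
    residue = 0
    for column, char in enumerate(centroid_row):
        if char != "-":
            residue += 1
            column_of_centroid[residue] = column

    # Member residue number found in every non-gap alignment column.
    member_at_column = {}
    residue = 0
    for column, char in enumerate(member_row):
        if char != "-":
            residue += 1
            member_at_column[column] = residue

    transferred = {}
    for label, position in centroid_positions.items():
        column = column_of_centroid.get(position)
        transferred[label] = (
            member_at_column.get(column) if column is not None else None
        )
    return transferred


# Toy alignment (hypothetical sequences, not real hydrolases).
centroid = "MK-GHSQG"
member = "M-AGHS-G"
annotation = {"nucleophile": 5, "site_a": 2, "site_b": 7}
mapped = transfer_annotation(centroid, member, annotation)
```

In this toy case the annotated position 5 of the centroid maps to position 5 of the member, position 7 maps to position 6, and position 2 falls on a gap in the member and therefore cannot be transferred.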

Cluster size distribution
Pairs of homologous sequences were derived for 280 638 sequence entries from the updated LED (release 4.0). The sequences were assigned to different clusters (communities) by setting a threshold of global sequence identity to 60%, 70%, 80% or 90%. The number of clusters N(s) with a cluster size s decreased with increasing s following a power-law distribution N(s) ∼ s^(−τ) (Fig. 2). Logarithmic histograms were formed for s ≥ 2 and s ≤ 10, s ≥ 11 and s ≤ 100, ..., s ≥ 1001 and s ≤ 10 000 sequences [30]. The histogram exponent τ_h was determined from the slope of the histograms as 0.8, 1.0, 1.2 and 1.4 at a threshold of global sequence identity of 60%, 70%, 80% and 90%, respectively (Table 1), and the underlying τ was modelled as described previously [30]. For comparison, τ was also determined by linear regression of the actual distribution for s ≤ 100 (Fig. S1), and both methods resulted in the same values of τ, ranging from 1.7 to 2.3 between 60% and 90% global sequence identity. τ was then extrapolated to 100% global sequence identity, yielding τ_100 = 2.6 (Fig. S2), which is identical to the value derived previously for α/β-hydrolases [30].
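The clustering and the logarithmic histograms can be sketched as follows. This is a minimal illustration and not the pipeline used for the LED: it assumes that clusters are the connected components of a graph whose edges are the sequence pairs above the identity threshold, and it estimates only the histogram slope. The modelling step that converts the histogram slope into the underlying exponent (reference [30]) is not reproduced. All function names are hypothetical.

```python
import math
from collections import Counter


def cluster_sizes(n_sequences, pairs):
    """Sizes of connected components; `pairs` are index pairs above threshold."""
    parent = list(range(n_sequences))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return sorted(Counter(find(i) for i in range(n_sequences)).values())


def decade_histogram(sizes):
    """Count clusters per decade bin k: 2-10 (k=1), 11-100 (k=2), ..."""
    histogram = Counter()
    for size in sizes:
        if size < 2:
            continue  # singletons are not part of the histograms
        k, upper = 1, 10
        while size > upper:
            upper *= 10
            k += 1
        histogram[k] += 1
    return dict(histogram)


def histogram_exponent(histogram):
    """Negative least-squares slope of log10(count) against the bin index k."""
    ks = sorted(histogram)
    ys = [math.log10(histogram[k]) for k in ks]
    mean_k = sum(ks) / len(ks)
    mean_y = sum(ys) / len(ys)
    numerator = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    denominator = sum((k - mean_k) ** 2 for k in ks)
    return -numerator / denominator


# Toy input: 7 sequences, 3 pairs above the identity threshold.
sizes = cluster_sizes(7, [(0, 1), (1, 2), (3, 4)])
# Synthetic histogram that loses a factor of 10 per decade.
slope = histogram_exponent({1: 1000, 2: 100, 3: 10})
```

For an ideal continuous power law, integrating over decade-wide bins lowers the exponent by one, which is why the histogram slope is smaller than the exponent of the underlying distribution.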

Degree distribution
As a further property of the network topology, the number of neighbouring sequences (degree) was analysed. Neighbouring sequences were defined by a threshold of global sequence identity of 95%. The number N(n) of sequences having n neighbours followed a power-law distribution N(n) ∼ n^(−γ). Although many sequences had few neighbours, a small number of sequences had a high degree and were located in a hub region. Linear regression was performed for n ≤ 80, resulting in a scaling exponent γ = 1.4 (Fig. 3), which is similar to the values of γ between 1.1 and 1.3 determined previously for five different protein families [31]. All sequences in the hub region (n ≥ 300) belong to a single homologous family, the protease 2 homologous family number 1011, whose members are Y-type α/β-hydrolases from the N-terminal domain superfamily number 8 (Table S1).
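The degree analysis can be sketched in the same spirit. The example assumes that the neighbour relation is given as a list of unique, unordered sequence pairs above the identity threshold, and it fits the scaling exponent by ordinary least squares on the log-log distribution up to a cutoff. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(pairs):
    """Return {n: N(n)}, the number of sequences having n neighbours."""
    degree = Counter()
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    # Sequences without any neighbour never appear in `pairs` and are omitted.
    return dict(Counter(degree.values()))


def scaling_exponent(distribution, n_max=80):
    """Negative slope of log10 N(n) against log10 n, fitted for n <= n_max."""
    points = [
        (math.log10(n), math.log10(count))
        for n, count in distribution.items()
        if 1 <= n <= n_max and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two points for a regression")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return -numerator / denominator


# Toy network: sequence "a" is a small hub with three neighbours.
toy = degree_distribution([("a", "b"), ("a", "c"), ("a", "d")])
# Synthetic distribution that follows N(n) = 1000 * n^-2 exactly.
exponent = scaling_exponent({1: 1000, 10: 10})
```

The cutoff `n_max` mirrors the restriction of the regression to n ≤ 80, which keeps the sparsely populated hub region from dominating the fit.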

Modular structure of α/β-hydrolases
One representative α/β-hydrolase structure for each architecture and oxyanion hole type was visually analysed. The structures of centroids were selected if they were fully resolved and had a resolution of 3 Å or better. Otherwise, a structure with high similarity to the centroid was selected. All α/β-hydrolases have a similar core structure, the α/β-hydrolase fold, which contains the catalytic triad and the oxyanion hole residues (Fig. 1). The β-strands of the central β-sheet were numbered starting from β0, the β-strand preceding the nucleophilic elbow, where the catalytic serine is located. All β-strands in the direction of the N terminus were assigned negative numbers (..., β−3, β−2, β−1), whereas the β-strands in the direction of the C terminus were assigned positive numbers (β+1, β+2, β+3, ...) (Figs S3-S9). As an additional module, α/β-hydrolases might contain a mobile lid consisting of one or two α-helices. The lid can only be unambiguously assigned if open and closed conformations of the same or two homologous proteins have been crystallized. Therefore, in the absence of both conformations, α/β-hydrolases with a lid might be misclassified as core domain α/β-hydrolases. The lid is located at different positions within the central β-sheet: between β-strands β+1 and β+2, β−1 and β0, β−4 and β−3, β+3 and β+4, or between the N terminus of the protein and β−3. In contrast to the mobile lid, a cap is immobile and consists of more than two α-helices which are stacked on top of the core structure. Three cap arrangements were found: a single cap located between β-strands β+1 and β+2, a double cap where an additional N-terminal cap is stacked on top of the single cap, and an architecture with an N-terminal cap only. In addition, α/β-hydrolases might contain an N-terminal domain with a β-propeller fold or a Rossmann fold, or a C-terminal domain with a β-sandwich fold.
The N-terminal Rossmann fold domain and C-terminal domains can be combined with a cap or a lid, respectively. Thus, in total 12 different architectures were identified (Fig. 1). In addition, α/β-hydrolases can be distinguished by their oxyanion hole (Table 2). Ten of the 12 architectures contain GX-types, but only four architectures contain GGGX-types (core, core with lid between β−3 and β−4, core with a single cap or with an N-terminal cap) and three architectures contain Y-types (core with a single cap, core with an N-terminal β-propeller domain or core with a C-terminal β-sandwich domain).

Core domain α/β-hydrolases
The core domain of α/β-hydrolases is composed of up to 13 β-strands forming a central parallel β-sheet. The β-strands alternate with varying numbers of α-helices, and the β-sheet is packed between two layers of α-helices. The central β-strand β0 is connected to the subsequent α-helix by a sharp turn, the nucleophilic elbow, which harbours the catalytic serine and one of the oxyanion hole residues (Fig. S3). α/β-Hydrolases consisting only of the core domain were found in the GX-types (29 712 sequences, representative structure: lipase A from B. subtilis, BSLA) and in the GGGX-types (5376 sequences, representative structure: carboxylesterase from Lactobacillus plantarum, LPC). As of now, no Y-type α/β-hydrolases have been found that consist exclusively of the core domain (Table 2). The active site of core domain α/β-hydrolases is completely exposed to the solvent (Fig. S10A). In contrast to GX-types, additional α-helices restrict the substrate access to a specific direction in most GGGX-types, thus contributing to substrate specificity (Figs S11A and S12A), because GGGX-types are larger (200-500 amino acids, 8-13 β-strands) than GX-types (180-300 amino acids, 5-8 β-strands) and thus have more helices close to the substrate access tunnel that can shield the active site (Table S3).
Lids as an opening and closing mechanism for the active site
[...] (Table 2). GGGX-type α/β-hydrolases can have a lid between β−4 and β−3 (9079 sequences, representative structure: lipase from Candida rugosa in open and closed conformation, CRL open and CRL closed) (Fig. S4C and Table 2). No Y-type α/β-hydrolase with a lid has been found yet. Regardless of where the lid emerges, its position with respect to the active site is almost identical. In its closed conformation, it covers the active site from the N-terminal side of the protein and moves towards the N terminus of the protein upon opening (Figs 1B-D, S10 and S11B-D). When the lid is in the closed conformation, the active site residues are not visible and the contact area of the lid surrounds the active site completely. This suggests that the lid covers the active site in the closed conformation, making the active site inaccessible to solvent or substrate molecules. Upon opening of the lid, the contact area of the lid moves away from the active site, thus making the active site fully accessible to the solvent and thereby allowing substrate molecules to bind to the active site (Figs 4 and 5).
Table 2. Basic architectures of α/β-hydrolases. Representative structures were chosen for the different oxyanion types and architectures. Abbreviations of protein names can be found in Table S2. In addition to the architecture, the position of the additional module in the central β-sheet is indicated. The letter in brackets following the protein name indicates the amino acid of the active site nucleophile.

Caps covering the active site
Caps are immobile elements, which cover the active site either partly or completely. α/β-Hydrolases can have single caps, N-terminal caps or double caps added to the core domain (Fig. 1E-G). Single caps consist of three or more α-helices on top of the proteins (Figs S5A and S11E). N-terminal caps consist of two or more long α-helices formed by the N terminus folding back over the active site (Figs S5B and S11F).
In α/β-hydrolases with a double cap, two caps are stacked on top of the active site. The lower cap is similar to the single cap, the upper cap to the N-terminal cap (Figs S5C and S11G). α/β-Hydrolases with a single cap were found as GX-types (61 596 sequences, representative structure: haloalkane dehalogenase from Xanthobacter autotrophicus, XAHD), GGGX-types (2581 sequences, representative structure: HsaD from Mycobacterium tuberculosis, MTH), and Y-types (22 sequences, lipase A from Candida antarctica, CALA) (Table 2). The single cap is the only architecture that is found for all oxyanion hole types. The position of the single cap is conserved: it is always located between strands β+1 and β+2 (Fig. S5A). The active site of Y-type α/β-hydrolases with a single cap is completely covered by the cap, thereby restricting access to the active site. In GX-types, the active site is mostly covered; however, the catalytic triad histidine is partially exposed to the solvent. In GGGX-types, the entrance to the active site from the top is completely covered by the cap, but there is a narrow tunnel between the cap and the core domain which leads to the active site (Figs 6A and S10E).
In N-terminal caps, the helices restrict the access of solvent or substrate molecules to the active site (Figs 6B, S10F and S12B). α/β-Hydrolases with an N-terminal cap could be found as GX-types (585 sequences, representative structure: cutinase from Trichoderma reesei, TRC) and as GGGX-types (57 900 sequences, representative structure: esterase from Pyrobaculum calidifontis, PCE), but not as Y-types. In GX-types, the N-terminal cap blocks the active site completely, suggesting that the substrate access to the active site is restricted (Fig. S13A). In contrast, in GGGX-types, the cap does not block the active site completely, resulting in a narrow tunnel in the side of the protein leading towards the back of the central helix of the active site, which might allow substrate access (Fig. 6B).
Until now, double caps have only been found in GX-type α/β-hydrolases (14 874 sequences, representative structure: EH from Streptomyces carzinostaticus, SCEH) (Table 2). The combination of two caps covers the active site completely without leaving a tunnel between the caps (Figs S10G, S12C and S13B).
α/β-Hydrolase fusion proteins with additional N- or C-terminal domain
The modules described so far are relatively small and interfere with the entrance to the active site. Another group of α/β-hydrolases displays additional domains that are attached to their N or C terminus (Fig. 1H, I). α/β-Hydrolases with an additional N-terminal β-propeller domain were found for GX-types (15 436 sequences, representative structure: acylaminoacyl peptidase from Aeropyrum pernix K1, APAP) and Y-types (53 669 sequences, representative structure: human dipeptidyl peptidase IV [DPPIV]) (Table 2). The β-propeller domain consists of seven to eight blades, and each blade is formed by a four-stranded β-sheet (Figs 7A and S6). The β-propeller domain is stacked on top of the core domain, and the blades are arranged to form a channel through the middle of the propeller towards the active site (Fig. S10I). In Y-types, the propeller domain does not block the active site completely, suggesting there might be an additional substrate access between the propeller and the core domains (Figs 7A and S13C).
α/β-Hydrolases with an additional C-terminal β-sandwich domain were found in the Y-types only (13 624 sequences, representative structure: cocaine esterase from Rhodococcus sp. MB1, RCE) and not in GX-types or GGGX-types (Figs 7B, S7 and Table 2). In the CATH database, the C-terminal β-sandwich domain is annotated as galactose-binding domain-like, CATH superfamily 2.60.120.260 [32]. The C-terminal β-sandwich domain does not interfere with the active site and is located at the side of the protein. Therefore, α/β-hydrolases with a C-terminal β-sandwich domain have a freely accessible active site (Figs 7B, S10H and S13D).
Proteins with two additional modules
α/β-Hydrolases with more than one additional module are rare and only found for GX-types (Fig. 1J,L): α/β-hydrolases with a C-terminal domain and a lid [...] Although the opening/closing transition and the size of the lids are similar, the location of the lid differs. In HPL/HPLRP1, it is located between strands β+3 and β+4, whereas in PSML it is located between the N terminus and β−3 (Fig. S8A,B). Interestingly, although the lids emerge from different locations within the central β-sheet, they are situated in a similar position covering the active site from the C terminus of the protein, rather than from the N terminus as observed for other α/β-hydrolases (Fig. 1). The N-terminal domain of MLEH consists of a Rossmann fold formed by alternating α-helices and β-strands, annotated by CATH classification as HAD superfamily/HAD-like domain, or CATH Superfamily 3.40.50.1000 [32]. Unlike the N-terminal β-propeller domains and similar to the C-terminal β-sandwich domains, this N-terminal Rossmann fold domain is attached to the side of the protein and does not interfere with substrate access to the active site (Figs 8B and S12D). The cap of MLEH is comparable to other single caps and is located between β+1 and β+2 (Figs 8B and S9). Although it covers the active site, there seems to be a tunnel underneath the cap, which allows access of substrates to the active site (Figs 8B and S10L).

Similarity of the N- and C-terminal domains to other proteins
In order to analyse whether the N- and C-terminal domains in superfamilies 8-12 are similar to domains in other proteins, one representative protein from each of these superfamilies and each oxyanion hole type was selected: from the N-terminal domain superfamily number 8, the GX-type A. pernix acylaminoacyl peptidase (APAP) and the Y-type human DPPIV, from the C-terminal domain superfamily number 9, the Y-type Rhodococcus sp. cocaine esterase (RCE), from the lid (β+3 and β+4) and C-terminal domain superfamily number 10, the GX-type HPL, from the lid (N-terminal) and C-terminal domain superfamily number 11, the GX-type Pseudomonas sp. MIS38 lipase (PSML), and from the cap and N-terminal domain superfamily number 12, the GX-type human epoxide hydrolase (HEP) (Table S4).
The N-terminal β-propeller domains of DPPIV and APAP are structurally similar, as well as the β-sandwich domains of RCE and HPL. All α/β-hydrolase core domains are similar to each other. In addition, they are similar to the additional N-terminal Rossmann fold domain of HEP (Fig. S14).
In a second step, the PDB was searched for domains with structural similarity to the N- or C-terminal domains (https://doi.org/10.18419/darus-458).
The search with the β-propeller domains of DPPIV and APAP resulted in 103 and 173 structurally similar proteins, respectively, which consist of a single propeller, two fused propellers, or fusion proteins of a β-propeller and other domains. They cover a broad range of biological functions, such as transcription and translation initiation factors, DNA damage-binding proteins, elongation complexes, export factors, splicing factors, ribosomal proteins, proteins involved in cell cycle, apoptosis, and intracellular transport, and also β-propeller lectins, although the carbohydrate-binding site does not seem to be conserved. The search with the Rossmann fold domain of HEP resulted in 154 hits of structurally similar proteins. Because of its similarity to the α/β-hydrolase core domain, 54% of all hits belonged to α/β-hydrolases, mostly single cap α/β-hydrolases from LED superfamily 5. All other hits were mainly proteins consisting of a single Rossmann fold domain, two fused Rossmann fold domains or fusion proteins of Rossmann fold with other domains. They cover a broad range of enzymatic functions, such as phosphatases, haloacid dehalogenases, phosphoglucomutases, reductases and glycosyltransferases, but also regulators of transcription or replication.
The search with the β-sandwich domains of RCE, PSML and HPL resulted in 41, 35 and 10 proteins, respectively, which consist of a single β-sandwich or are fusion proteins of β-sandwich domains with mainly α-helical domains. Proteins with structural similarity to HPL cover a broad range of annotations such as lipoxygenases, glucanases, xylanases, the GTPase Rab6 and carbohydrate-binding modules. Proteins with structural similarity to RCE are involved in DNA repair, flagellar biosynthesis or receptor binding, but also include lectins and enzymes involved in carbohydrate degradation, such as xylanases, glycoside hydrolases, mannanases and endoglucanases. Although the structural similarity suggests a carbohydrate-binding function, the carbohydrate-binding site of lectins is not present in the β-sandwich domains of RCE and HPL. Proteins with structural similarity to PSML include hemagglutinin, adhesins, toxin A and antifreeze proteins.

Naming conventions
Despite their high variability in sequence and structure, α/β-hydrolases can be assigned to a small number of classes. Two criteria were used: the sequence motif of the oxyanion hole and the presence of lids, caps and N- or C-terminal domains. However, there is no consistent naming of these structural modules in the literature, especially for the definition of lids and caps. The definition of a 'cap' as a large immobile module covering the active site is widely accepted [33][34][35][36]. However, in some publications it is referred to as 'lid' [2,37,38]. Similarly, small mobile elements covering the active site are mostly referred to as 'lid' [35,39-43], but they might also be called 'flap' [44][45][46][47]. To improve communication about α/β-hydrolases, we suggest the following naming conventions: A 'lid' is a small mobile structure composed of one or two α-helices, which can undergo a conformational transition, thereby opening and closing the entrance to the active site. Lids are located between β-strands β−3 and β−4, β0 and β−1, β+1 and β+2, β+3 and β+4 or between the N terminus and β−3.
A 'cap' is an immobile module and consists of three or more α-helices, which cover the active site. Caps are located between strands β+1 and β+2. In proteins with double caps, a second cap consisting of two or three α-helices is located at the N terminus. This second cap is similar to N-terminal caps. These are formed by two or more α-helices emerging from the N terminus of the protein, which fold over the active site and thereby cover it.
N-terminal domains have either a β-propeller fold or a Rossmann fold. The C-terminal domains have a β-sandwich fold. These additional domains can be located on top of the core domain covering the active site, on its side or separated by a long loop.
Although caps and the N-terminal or C-terminal domains can be easily identified in protein structures, the unambiguous identification of one or two α-helices as lid requires the presence of an open and a closed conformation, such as in C. rugosa lipase (PDB-ID: 1CRL and 1TRH), Rhizomucor miehei triacylglyceride lipase (PDB-ID: 3TGL and 4TGL), or Pseudomonas sp. MIS38 lipase (PDB-ID: 2Z8X and 2ZVD).
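The suggested naming conventions can be written down as a small set of rules. The following is a toy encoding for illustration only: in this work modules were assigned by visual analysis of structures, and the `mobile` flag presupposes that both an open and a closed conformation are available. The function name `name_module` and the location strings are hypothetical.

```python
# Positions at which lids were observed, written as "strand/strand".
LID_LOCATIONS = {
    "β-4/β-3",
    "β-1/β0",
    "β+1/β+2",
    "β+3/β+4",
    "N-terminus/β-3",
}


def name_module(mobile: bool, n_helices: int, location: str) -> str:
    """Return 'lid', 'cap', 'N-terminal cap' or 'unassigned'.

    mobile:    True only if open and closed conformations show a transition.
    n_helices: number of α-helices forming the module.
    location:  one of LID_LOCATIONS or "N-terminus".
    """
    if mobile:
        # Lid: small mobile element of one or two α-helices.
        if n_helices in (1, 2) and location in LID_LOCATIONS:
            return "lid"
        return "unassigned"
    # Immobile modules.
    if location == "N-terminus" and n_helices >= 2:
        # Two or more helices from the N terminus folding over the active site.
        return "N-terminal cap"
    if location == "β+1/β+2" and n_helices >= 3:
        # Three or more helices between strands β+1 and β+2.
        return "cap"
    return "unassigned"


examples = [
    name_module(True, 1, "β-4/β-3"),
    name_module(False, 4, "β+1/β+2"),
    name_module(False, 2, "N-terminus"),
    name_module(True, 4, "β+1/β+2"),
]
```

A double cap would correspond to a protein for which both the 'cap' and the 'N-terminal cap' rule apply to two separate modules.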
The suggested classification by oxyanion hole motif and by the presence of lids, caps and N- or C-terminal domains is based on the analysis of protein structures. However, a few putative α/β-hydrolases with known structure could not be assigned because of deviations of their oxyanion hole sequence from the conserved motifs, which did not allow us to assign them to GX-, GGGX- or Y-types: The N-acyl homoserine lactone degrading enzyme (PDB-ID: 5EGN) and a hydrolase from Pseudomonas aeruginosa PA01 (PDB-ID: 3OM8) had a cap, but their oxyanion hole motifs were PF and SI respectively. Similarly, the putative dienelactone hydrolase from Klebsiella pneumoniae (PDB-ID: 3F67) and an uncharacterized protein from Escherichia coli (PDB-ID: 4ZV9) are core domain α/β-hydrolases; however, their oxyanion hole motif was EX.
In addition, the database update resulted in ≈ 170 000 discarded sequences which could not be assigned to one of the superfamilies because of their low sequence similarity to proteins with known structure, which made it impossible to decide about the presence of modules.

Substrate access
The knowledge of the structural elements of a protein allows a prediction about the access of substrate to the active site. For a/b-hydrolases that consist only of the core domain, the active site is fully exposed to the solvent. This was already shown for the lipase from B. subtilis, the cutinase from Fusarium solani or the carboxylesterase from Pseudomonas fluorescens [48][49][50]. In a/b-hydrolases with a lid, the substrate enters between core and lid. In the closed conformation, the active site is covered by the lid, but becomes accessible to the substrate upon a conformational transition of the lid to an open conformation, as shown for the lipases from C. rugosa, P. aeruginosa or Rhizopus niveus [51][52][53]. In contrast, the cap covers the active site. In order to access to the active site, a substrate molecule has to pass through a tunnel located between the core and the cap, such as in the HEP, the prolyl aminopeptidase from Serratia marcescens, or the C-C hydrolase from M. tuberculosis [54][55][56], or through a tunnel in the cap, such as in the fluoroacetate dehalogenase from Rhodopseudomonas palustris or the hydroxynitrile lyase from Hevea brasiliensis [57,58]. Access to the tunnel is controlled by gatekeepers such as the sidechain of Leu262 in the haloalkane dehalogenase from X. autotrophicus [59]. Engineering of the tunnel can result in changes of substrate specificity [60]. Interestingly, in GGGX-type proteins with a single cap or an N-terminal cap, the active site is partially solvent accessible, which, however, might be a crystallization artefact. For Y-type a/b-hydrolases with an N-terminal b-propeller domain, substrate access is under discussion. For dipeptidyl peptidases, the substrate was shown to enter the active site through a side opening, along the interface between N-terminal and core domain [61,62]. The function of the tunnel in these proteins is not clear, but was suggested to aid the release of product from the active site [63,64]. 
For prolyl oligopeptidases, which are also Y-type α/β-hydrolases with an N-terminal β-propeller domain, a crystal structure with a tilted propeller domain was identified, resulting in a large opening between the core domain and the propeller domain, similar to dipeptidyl peptidases [37]. It has therefore been discussed whether substrate molecules enter prolyl oligopeptidases through the tunnel of the propeller domain [65,66] or through the entry site that opens upon conformational change of the proteins [37,67,68].

Sequence space
Despite the large number of 280 638 sequence entries in the updated LED, we still know only a tiny fraction of the whole extant sequence space of α/β-hydrolases [69]. The limited coverage of sequence space and the even lower coverage of structure space might explain why GX-types were found in combination with most modules and multiple lid positions, but only few combinations and lid positions were found for GGGX- and Y-types. Nevertheless, the analysis of the sequence network confirmed previous results on the properties of protein sequence space. The scale-free distribution of the number of neighbours of a protein sequence demonstrates the existence of a few hubs and a large number of loosely connected sequences, as found previously for five different protein families. The scaling exponent of the α/β-hydrolase sequence network (γ = 1.4) is similar to those of other protein families, which had scaling exponents between 1.1 and 1.3 [31]. The members of the homologous family 1011 (annotated as 'protease 2' or 'oligopeptidase B') had the highest number of neighbours (Table S1) and thus formed the largest hub region of the α/β-hydrolase network. Since hub sequences have many functional neighbours, they are expected to be robust towards deleterious mutations, which makes them highly evolvable [70]. Because mutations might readily induce new functions, sequences with a large number of neighbours are promising starting points in directed evolution experiments.
A second property of the α/β-hydrolase network, the scale-free cluster size distribution, is also similar to other protein families [30]. The extrapolated scaling exponent (τ_100 = 2.6) was within the range of 2.3 to 3.3 that was previously determined for six protein families [30] and indicates percolation, that is, connectedness, of protein sequence space, even though extant sequence space still covers only a tiny fraction of theoretical sequence space. The connectedness of protein sequence space implies that various evolutionary pathways between two homologous sequences are possible, favouring substrate ambiguity or promiscuity [71][72][73].

Structure space
Although global sequence similarity is a widely used measure of the relationship between proteins, it can be misleading. In addition to exchanges, insertions and deletions of single amino acids, there is a second evolutionary mechanism: the recombination of structural modules or domains, which has been observed in many protein families, such as thiamine diphosphate-dependent enzymes [74] or glycoside hydrolases and carbohydrate-binding modules [75]. Although the identification of structural and functional domains and modules in a protein structure might be challenging [76], many proteins that are distant in terms of their global sequence and structure share highly similar fragments, which resulted in the view that protein structure space is continuous rather than discrete [77].
The sequence similarity between α/β-hydrolases is generally low, and their similarity relies on their structure [3]. The similarity of domains can be hidden by the global similarity, especially if different modules are recombined and exist in different orders [74]. Although the global structural similarity is quite low or undetectable between some α/β-hydrolases with additional C- and N-terminal domains, the structural similarity of the core domains to each other is remarkable. The additional modules, in contrast, are mostly structurally quite diverse. However, there are exceptions, such as the β-propeller domains of Y-type and GX-type α/β-hydrolases, which are structurally very similar, as are the corresponding core domains, pointing to coevolution. The core domain and the additional C-terminal β-sandwich domain of HPL and RCE also show significant structural similarity, although RCE is a Y-type with an additional C-terminal β-sandwich domain, whereas HPL belongs to the GX-types and has an additional C-terminal β-sandwich domain and a lid attached. This suggests that existing modules can be exchanged or newly added to proteins. In fact, the addition or deletion of N- and C-terminal domains was described as one of the most frequent domain rearrangements observed [78]. The substitution of domains inside a protein was, however, shown to be rare [79], which raises the question of whether lids and caps can also be exchanged between proteins. Interestingly, the folds of the additional N- and C-terminal domains also occur in several other proteins with very different functions. Although the sequence similarity between these modules is very low, they share structural similarity, showing how evolution reuses suitable folds for completely new functions.
Continuous protein space is supported by another finding in the LED: some proteins could clearly be assigned to one of the homologous families although they are enzymatically inactive and display no hydrolase activity. One such example is the gibberellin receptor gibberellin insensitive dwarf1 (GID1) (PDB-ID: 2ZSI and 3ED1). This protein can clearly be assigned to the GGGX-types with an N-terminal cap, but it lacks the catalytic histidine. This is interesting, because the histidine is the most conserved residue of the catalytic triad of α/β-hydrolases [3]. Indeed, it was shown that although GID1 is similar to hormone-sensitive lipases, it does not display any hydrolytic activity [80][81][82]. Further examples are the neuroligins, for example, the extracellular domain of neuroligin 2A from mouse (PDB-ID: 3BL8) or neuroligin-1 from rat (PDB-ID: 3BIW). Both can clearly be assigned to the GGGX-type α/β-hydrolases with an additional lid. However, they lack the catalytic serine and instead show a GXGXG motif, which leads to an enzymatically inactive protein [83,84]. This demonstrates how the structure of α/β-hydrolases can be exploited and used for other functions.
A few years ago, plastic-degrading enzymes were discovered as an interesting possibility for the bioconversion of polyethylene terephthalate (PET) [85]. These so-called PETases are able to convert PET to mono-(2-hydroxyethyl) terephthalate (MHET), which is then hydrolysed by MHETases to terephthalate and ethylene glycol [29]. PETases and MHETases are members of the α/β-hydrolase family [29,85] and can be found in the LED. PETases are GX-types, whereas MHETases are GGGX-types, but both are core domain α/β-hydrolases. MHETases were suggested to have a lid that confers substrate specificity [29]. However, the domain specified by Palm and colleagues only shields, but does not cover, the active site. Furthermore, it consists of 8 α-helices and 242 amino acids and thus is too large to be classified as a lid according to the naming convention suggested in this paper. In the LED, PETases and MHETases are found in superfamily number 1. There are 1099 PETases and homologues thereof combined in homologous family number 49, and 13 MHETases and homologues thereof in homologous family number 2923. Therefore, the LED can aid in identifying possible PETases and MHETases. It further allows the identification of related proteins that could be used in protein engineering to develop highly active plastic-degrading enzymes.

Setup of the Lipase Engineering Database (LED)
The sequences of 1117 α/β-hydrolases with structure information in the previous version of the LED (release 3.0, December 2009) were clustered using USEARCH (version 11.0.667) with an identity threshold of 90%, resulting in a list of representative sequences (named centroids) [86]. The structures of the centroids were checked visually for their architecture using the PyMOL Molecular Graphics System (version 1.8.0.5, Schrödinger, LLC, New York, NY, USA). Structures lacking the α/β-hydrolase fold were discarded, and the remaining centroid sequences were reclustered by USEARCH with an identity threshold of 60% to reduce the number of queries for later BLAST searches. In addition, the sequences of three recently identified structures of MHETases, which also belong to the family of α/β-hydrolases, were added to the centroids [29]. Each centroid served as a seed sequence of a homologous family and was assigned to a superfamily [87], which was named by its respective architecture. Using the centroid sequences as queries, homologous proteins were searched in the NCBI non-redundant protein database and the PDB by BLAST searches with an E-value cut-off of 10⁻¹⁰ [88,89], resulting in more than 450 000 putative α/β-hydrolase sequences. A threshold of 98% global sequence identity was used to assign individual sequences to proteins. Sequences were sorted into homologous families at a global sequence similarity threshold of 60%. For each of the homologous families, a profile hidden Markov model (HMM) was created using HMMER 3.1b2 (http://hmmer.org/). Iterative HMM searches with these profiles in the remaining unassigned sequences were performed with decreasing E-values from 1.5 × 10⁻⁵ to 5 × 10⁻¹⁰, and the hits were assigned to the respective superfamily after a visual inspection of their architecture. The unassigned sequences were excluded from further analyses. The newly assigned sequences were clustered with an identity threshold of 60% using USEARCH.
For each cluster containing more than nine sequences, a new homologous family was created in the respective superfamily. All sequences in a superfamily, which were not assigned to a homologous family, were summarized in the 'singleton family'.
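The rule for creating new homologous families from clusters can be illustrated with a minimal Python sketch. It is not the code used to build the LED: the function name `assign_families`, the consecutive family numbering and the toy sequence identifiers are assumptions made for illustration only. The sketch takes clusters (lists of sequence identifiers, as they might be returned by a clustering tool) and applies the "more than nine sequences" criterion described above.

```python
def assign_families(clusters, min_size=10):
    """Split clusters into new homologous families and a singleton family.

    clusters: iterable of lists of sequence identifiers (hypothetical input).
    min_size: smallest cluster that becomes its own family; 10 corresponds
              to "more than nine sequences".
    Returns (families, singletons), where families maps an illustrative
    running number to the members of each new family.
    """
    families = {}
    singletons = []
    for members in clusters:
        if len(members) >= min_size:
            # Large cluster: create a new homologous family.
            families[len(families) + 1] = list(members)
        else:
            # Small cluster: collect its sequences in the singleton family.
            singletons.extend(members)
    return families, singletons


# Toy clusters with made-up identifiers (12, 9 and 1 sequences).
clusters = [
    [f"seqA{i}" for i in range(12)],
    [f"seqB{i}" for i in range(9)],
    ["seqC0"],
]
families, singletons = assign_families(clusters)
print(len(families), "new family;", len(singletons), "sequences in the singleton family")
```

In this toy input, only the first cluster is large enough to form a family, whereas the sequences of the two smaller clusters end up together in the singleton family.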

Visual analysis of α/β-hydrolase structures
The PyMOL Molecular Graphics System (version 1.8.0.5, Schrödinger, LLC) was used for the visual analysis of α/β-hydrolases and the generation of figures. Contact areas of the modules with the core domains were calculated using a PyMOL script by Martin Christen, in which the contact radius was increased to 6 Å (Martin Christen, 2013, Contact Surface, v.3.0, https://pymolwiki.org/index.php/Contact_Surface). Explosion graphs were created by disassembling the structures and pulling the modules (lids, caps, N- or C-terminal domains) away from the core protein.

Analysis of protein sequence networks: clusters and degrees
Pairwise global sequence alignments for the 280 638 sequence entries from the updated LED were used to derive edge weights of pairwise sequence identity for protein sequence networks. Instead of aligning all sequence pairs, the calc_distmx command from USEARCH (version 11.0.667) was used to heuristically determine pairs of homologues, thereby reducing computational effort [86]. Subsequently, the sequence pairs were aligned using the Needleman-Wunsch algorithm implemented in the EMBOSS software suite (version 6.6.0, EMBL-EBI, Hinxton, UK) with gap opening and gap extension penalties of 10 and 0.5 respectively [90,91]. Protein sequence networks were constructed as undirected graphs with edge weights of global sequence identity. Thresholds of global sequence identity were applied to form clusters (communities) of homologous sequences. The number of nodes N(s) of a cluster size s was fitted by a power law N(s) ~ s^(−τ), and the Fisher exponent τ was determined from the slope of the distribution in a log-log plot [30]. Logarithmic histograms were formed for 2 ≤ s ≤ 10, 11 ≤ s ≤ 100, ..., 1001 ≤ s ≤ 10 000 sequences. The slopes of these histograms, τ_h, were determined at varying thresholds of global sequence identity. The slopes of the actual distribution, τ, were derived by fits of τ_h against a power-law model distribution as described previously [30]. The number of neighbouring nodes n at a given threshold is called the degree of a node. The number of nodes N(n) having a degree of n was fitted by a power law N(n) ~ n^(−γ), and the scaling exponent γ was derived from a log-log plot [31]. Distributions of cluster sizes and degrees were analysed by linear fitting via the fitlm function from the Statistics and Machine Learning Toolbox (version 11.5) in MATLAB (version R2019a, The MathWorks, Natick, MA, USA).
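The network analysis described above can be illustrated by a minimal Python sketch. It is a simplified stand-in, not the MATLAB workflow of this study: it fits the unbinned distributions by ordinary least squares on a log-log scale and omits the logarithmic histograms and the extrapolation against a model distribution. The function names and the toy edge list (pairs of hypothetical sequences with their global sequence identity in per cent) are assumptions for illustration.

```python
import math
from collections import Counter


def degree_counts(edges, threshold):
    """Degree of each node, counting only edges with identity >= threshold."""
    degree = Counter()
    for a, b, identity in edges:
        if identity >= threshold:
            degree[a] += 1
            degree[b] += 1
    return degree


def cluster_sizes(nodes, edges, threshold):
    """Sizes of the connected components (clusters), found by union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity in edges:
        if identity >= threshold:
            parent[find(a)] = find(b)
    return sorted(Counter(find(n) for n in nodes).values(), reverse=True)


def loglog_slope(values):
    """Least-squares slope of log10(frequency) against log10(value).

    Needs at least two distinct values; the scaling exponent of a power
    law is the negative of this slope.
    """
    hist = Counter(values)
    xs = [math.log10(v) for v in hist]
    ys = [math.log10(c) for c in hist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy network: hypothetical sequences A-F with made-up identities (%).
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B", 90), ("A", "C", 80), ("A", "D", 70),
         ("B", "C", 50), ("E", "F", 95)]

degrees = degree_counts(edges, threshold=60)
sizes = cluster_sizes(nodes, edges, threshold=60)
gamma = -loglog_slope(degrees.values())
print("degrees:", dict(degrees))
print("cluster sizes:", sizes)
print("gamma estimate:", round(gamma, 3))
```

In the toy network, sequence A acts as a hub with three neighbours, whereas all other sequences have a single neighbour. A six-node example is of course far too small for a meaningful exponent; it only shows how degrees, clusters and the log-log slope relate to each other.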

Annotation of protein sequences
If available, a representative protein with structure information was selected from each homologous family, and the catalytic nucleophile (serine, aspartate or cysteine), histidine and the catalytic acid (aspartate or glutamate) were annotated, as well as the residues forming the oxyanion hole, the lid, single cap, secondary cap (N-terminal part of double caps), N-terminal cap, N-terminal domain and C-terminal domain. For each homologous family, a multiple sequence alignment was generated using Clustal Omega (version 1.2.1) [92]. The annotations were then transferred to the respective positions in the aligned sequences. Thus, the sequences of all homologous families that contain at least one protein structure were annotated in the updated release of the LED (release 4.0).
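The transfer of annotations through a multiple sequence alignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the LED implementation: the function name `transfer_annotations`, the annotation labels and the short toy sequences are hypothetical, and the two aligned rows are assumed to come from the same alignment, with '-' marking gaps.

```python
def transfer_annotations(aligned_ref, aligned_target, ref_positions):
    """Map annotated residues of a reference onto an aligned target.

    aligned_ref, aligned_target: rows of one alignment ('-' marks a gap).
    ref_positions: {label: 1-based residue position in the ungapped
                    reference sequence}.
    Returns {label: 1-based position in the ungapped target, or None if
    the target has a gap in the annotated alignment column}.
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")

    # Several labels may point to the same reference position.
    labels_at = {}
    for label, pos in ref_positions.items():
        labels_at.setdefault(pos, []).append(label)

    result = {label: None for label in ref_positions}
    ref_idx = 0  # residues seen so far in the reference
    tgt_idx = 0  # residues seen so far in the target
    for ref_char, tgt_char in zip(aligned_ref, aligned_target):
        if ref_char != "-":
            ref_idx += 1
        if tgt_char != "-":
            tgt_idx += 1
        if ref_char != "-" and tgt_char != "-":
            for label in labels_at.get(ref_idx, []):
                result[label] = tgt_idx
    return result


# Toy alignment rows (made-up sequences, not taken from the LED).
reference = "MGS-HD"
target = "-GSAH-"
annotations = {"nucleophile": 3, "histidine": 4, "acid": 5}
print(transfer_annotations(reference, target, annotations))
```

Returning `None` for columns in which the target has a gap makes missing catalytic residues explicit instead of silently mapping them to a neighbouring position.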

Structural similarities of the N- and C-terminal domains
To analyse the structural similarities of the N- and C-terminal domains, representative protein structures of each oxyanion hole type from the proteins of superfamilies 8-12 were selected. The structures of the N- and C-terminal domains and the core domains were extracted and saved in separate .pdb files, which were then used to compare the domains by the 'all against all' structure comparison tool on the Dali server [93]. In a second step, the PDB search tool on the Dali server was used to compare the N- and C-terminal domains against all structures in PDB25, a subset resulting from clustering the whole PDB database with an identity threshold of 25%. The Dali server provides different measures of similarity between proteins. The Z-score is a measure of structural similarity, where a Z-score > 2 implies significant structural similarity and thus a similar fold [94]. Besides the Z-score, the output also contains information about the rmsd and the sequence identity.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article.
Table S1. Exemplary hub sequences with their annotated source organisms and descriptions.
Table S2. List of the protein names and abbreviations of the representative structures.
Table S3. Sequence length of the representative protein structures retrieved from the PDB.
Table S4. Representative structures used for the structural comparison of additional modules to other proteins using the Dali web server.
Fig. S1. Cluster distributions N(s) with linear regressions for cluster sizes s ≤ 100 at thresholds of global sequence identity of 60% (A), 70% (B), 80% (C) and 90% (D).
Fig. S2. Fitted exponents τ (dots), derived from fits of the slopes of the histograms, τ_h, against the model distribution, determined for different thresholds of global sequence identity.
Fig. S3. Topology diagram of the core domain of α/β-hydrolases.
Fig. S4. Topology diagram of α/β-hydrolases with an additional lid (green).
Fig. S5. Topology diagram of α/β-hydrolases with an additional single cap (red), N-terminal cap (pale red) or double cap.
Fig. S6. Topology diagram of α/β-hydrolases with an additional N-terminal propeller domain (blue).
Fig. S7. Topology diagram of α/β-hydrolases with an additional C-terminal β-domain (blue).
Fig. S8. Topology diagram of α/β-hydrolases with two additional domains: an additional lid (green) and C-terminal β-sandwich domain (blue).