Recognition of the ligand‐type specificity of classical and non‐classical MHC I proteins

Functional characterization of proteins belonging to the MHC I superfamily involves knowing their cognate ligands, which can be peptides, lipids or none. However, the experimental identification of these ligands is not an easy task and generally requires some a priori knowledge of their chemical nature (ligand‐type specificity). Here, we trained k‐nearest neighbor and support vector machine classifiers that predict the ligand‐type specificity MHC I proteins with great accuracy. Moreover, we applied these classifiers to human and mouse MHC I proteins of uncharacterized ligands, obtaining some results that can be instrumental to unravel the function of these proteins.


Introduction
The major histocompatibility complex class I (MHC I) protein superfamily encompasses a large number of glycoproteins including classical MHC I molecules and non-classical and MHC I-like molecules [1]. Because of their crucial role in graft rejection and antigen presentation, classical MHC I molecules (hereafter MHC Ia) were the first to be discovered and studied. In humans, classical MHC molecules are for historical reasons, known as human leukocyte antigens (HLAs).
MHC Ia molecules are cell surface expressed glycoproteins consisting of an a chain (encoded inside the MHC gene region) paired with b2-microglobulin (b2m), and their function is to present peptides for immune recognition by CD8 T cells [2]. MHC I-bound peptides are nested in the a1a2 domain of the MHC I molecule (MHC I a1a2 domain). The structural presence of this a1a2 domain is the signature that relates all the members of the MHC I superfamily.
Non-classical and MHC I-like molecules (hereafter MHC Ib molecules) were discovered later and differ in many aspects from MHC Ia molecules. First of all, MHC Ib molecules do not conform a single protein family but comprise several highly divergent protein families that are structurally related to MHC Ia molecules. MHC Ib molecules can be encoded inside or outside the MHC loci, display a wide range of functions and, in contrast to MHC Ia molecules, are either non-polymorphic or exhibit little polymorphism [3,4]. Moreover, while MHC Ia molecules can only bind peptides, there are known examples of MHC Ib molecules that bind peptides (e.g., HLA-E and HLA-G, and their mouse functional counterparts Qa1 (H2-T23) and Qa2 (H2-Q9), respectively), others that bind lipids (e.g., CD1 antigens; ZAG, zinc-binding alpha-2-glycoprotein 1; EPCR, Endothelial Protein C Receptor) and some that do not have any ligand; they have an empty groove (e.g., MICA&B, mouse TL antigens, ULBP1,2,&3, HFE and FcRn) [4].
Functional characterization of MHC I proteins can be challenging and generally requires knowing their cognate ligands. However, the identification of ligands of MHC I proteins, if any, is a difficult task, which generally requires having some a priori knowledge of the chemical nature of the ligand (ligand-type specificity) to guide the experimental efforts. Therefore, in this study, we built machine learning-based classifiers to predict the ligand-type specificity of these proteins. Subsequently, we applied such models to a number of human and mouse MHC Ib molecules of uncharacterized ligands obtaining some interesting results, such as the binding of lipids of the H2-M1 family and the lack of ligand of MR1, that could be instrumental to unravel the function of these proteins.

MHC I dataset
We used a dataset (MHCI 556 dataset) consisting of 556 nonoverlapping and unique sequences of MHC I proteins of known ligand-type specificity of which 355 bind peptides (P), 84 bind lipids (L) and 117 do not bind anything (N). Sequences in the MHCI 556 dataset were collected and processed as previously described [5].
The sequences only comprise the a1a2 domain and all range between 170 and 189 amino acids and were subjected to a sequence-similarity reduction schema, setting the maximum sequence similarity allowed between two sequences to an e-score of 925 using a BLOSUM62 matrix [5]. Sequence identity within the P, L and N groups is 63.69 ± 17.62, 41.8 ± 15.37 and 36.37 ± 21.07, respectively.

Building and evaluation of machine learning-based models
We used the Waikato Environment for Knowledge Analysis (WEKA) [6] to built and evaluate machine learning (ML)-based classification models. As input for features for training, we used the amino acid composition of the sequences (attributes) and their known ligand-type specificity (P: Peptides, L: Lipids and N: No ligand) (classification instances). WEKA provides a large collection of ML algorithms for classification, and in this study, we selected k-Nearest Neighbor algorithm (kNN) and support vector machines (SVMs) [7,8]. Briefly, kNN classifies objects based on the majority class of their k nearest neighbors in the training sets. The vicinity between objects is computed as a Euclidean distance. SVMs were first introduced to classify linear data and are based on decision planes that define decision boundaries. A decision plane is one that separates between a set of objects having different class memberships. For non-linear data, SVMs first use a function (kernel) to map the input data onto a higher m-dimensional space, where a linear model based on decision planes can then achieve an optimal separation of the data. Here we used a Gaussian Radial Basis Function (RBF-kernel) and a Polynomial function (P-kernel), as kernels for SVMs.
Model refinement was achieved in 10-fold cross-validation experiments varying the relevant parameters of the ML algorithms. In a 10-fold cross-validation, the data are randomly partitioned into 10 sets, and each set is tested using classifiers trained on the sum of the remaining sets. Thus, kNN were refined with regard to k, the number of neighbours, while SVMs were refined with regard to C, the complexity parameter -allows one to trade off training error versus model complexityin combination with c for the RBF-kernel (defines de width of the Gaussian function) and E for the P-kernel (exponent of the polynomial function).
As measures of performance to evaluate the models, we used sensitivity (SE), specificity (SP) and accuracy (ACC) in percentages, which for a 3-class classification can be computed using Eqs. (1)-(3), respectively, P, L, N, are the total number of instances (in our case, MHC I proteins) belonging to the corresponding class and P i , L i , N i with i = (P, L, N), represents predicted instances and their class. For example, P P numbers P instances predicted as P, while P N and P L , number P instances classified as N and L, respectively. Note that SE and SP are computed for each of the tree classes (P, L, N) while ACC corresponds to the percentage of properly predicted instances. Because we run each 10-fold cross-validation 10 times, for each model we obtained 100 different estimates of the noted parameters of performance, computing the mean and standard deviations. To compare the performance of the models, we carried our paired ttest in WEKA over the ACC [6].

Other procedures
We used MULTIPROT [9] for superimposing protein 3D-structures of MHC I proteins and STACCATO [10] for deriving a multiple sequence alignment (MSA) from the superimposed structures. MHC Ia and Ib proteins were aligned with TCOFFEE [11], using a seed structural alignment. We obtained dendrograms from the relevant MSAs using the Neighbor-joining method [12], and we addressed their reliability using bootstrapping [13] with 1000 replications. We also used bootstrapping with 1000 replications to evaluate the performance of ML-based models on holdout protein sequences with an increasing number of random mutations at random variable sites (PERL script to mutate sequences will be provided upon request). BLAST searches were executed with default settings against a BLAST-formatted database derived from the MHCI 556 dataset in FASTA format in which each distinct MHC I sequence had a header with a label indicating the nature of its ligand (P: peptides, L: lipids and N: No ligand).

MHC I sequence similarity space
In order to investigate whether one could define the ligand-type specificity of MHC I proteins (P, L, N) by sequence similarity approaches, we superimposed the 3D-structures of the a1a2 domain of representative MHC I proteins (Fig. 1A), obtained a structuredbased alignment (Fig. 1B) and subsequently derived a sequence similarity-based dendrogram (Fig. 1C). We found that the selected MHC I proteins fail to cluster in just three distinguishable groups matching their ligand-type specificity (Fig. 1C). Thus, while some MHC I proteins group according to their ligand-type specificity (e.g., the lipid-binding CD1 antigens and EPCRs, and the peptidebinding classical and non-classical MHC I proteins), other molecules like ZAG, that bind lipids, and TLA, HFE, and FcRN, that have no ligand, appear unrelated to the molecules of the relevant groups. Also, TLA, which does not bind any ligand, is much closer to the group of peptide-binding MHC I proteins than to those that do not bind anything. Likewise, using TreeDet [14], a popular and robust program to explore sequence space, we were unable to identify key signature residues allowing the distinction of MHC I proteins according to the nature or their ligand. These results are likely due to the fact that the division between MHC I proteins by ligand-type specificity is functionally relevant but it is not phylogenetic. For example, TLA and MICA, both incapable of binding any ligand, are, beyond having the MHC I fold, completely unrelated and shared sequence identity of just 24.06%. The sequence identity between all the proteins considered in the structure-based alignment is shown in Supplementary file 1.
Since the ligand-type specificity of MHC I proteins could not be properly distinguished in the MHC I sequence similarity space, in this work, we approached the task of predicting the type of ligand of MHC I proteins as a classification problem for machine learning (ML).

Evaluation of ML-based classifiers in cross-validation
We built our ML-based models by training several ML algorithms on the amino acid composition of 556 non-overlapping and unique MHC I protein sequences (a1a2 domain) of know ligand-type specificity (MHCI 556 dataset): 355 bind peptides (P), 84 bind lipids (L) and 117 do not bind anything (N) ( Table 1). Specifically, we trained k-Nearest Neighbor algorithm (kNN) and support vector machines (SVMs) with polynomial (SVM-Pk) and RBF-kernels (SVM-RBFk). We selected these algorithms because of their reliability, simplicity and speed [7]. We used 10-fold cross-validations to built, evaluate and optimize the models (Fig. 2).
The best performance was obtained using SVM-RBFk, which reached an ACC of 100.0 ± 0.0%; not a single MHC I protein was misclassified (Fig. 2B). Next to these results were those obtained using kNN (ACC = 99.93 ± 0.35%, SE = 99.98 ± 0.03%, SP = 99.82 ± 1.16%), which misclassified one protein (Fig. 2B). The performance achieved using SVM-Pk (ACC = 99.46 ± 0.87, SE = 99.96 ± 0.05, SP = 99.51 ± 1.82) was slightly lower than that obtained with kNN; three proteins were misclassified (Fig. 2B). It could be well argued that these results are conditioned by the fact that the MHCI 556 dataset is imbalanced; the peptide-binding group is much larger than the other two. However, all MHC I proteins were properly classified regardless of their class (Fig. 2B). Moreover, we virtually obtained the same results ( Supplementary Fig. S1) using a dataset (MHCI 334 dataset) encompassing only 133 MHC I peptide-binding proteins (P). The MHCI 334 dataset is described in Supplementary Table S1.
In sum, classifying the highly divergent MHC I protein families into the three defined groups was a surprisingly simple task for ML, which on the one hand highlights the quality of the assembled dataset and on the other suggests that the ligand-type specificity of MHC I proteins is readily imprinted in their amino acid composition. In comparison, accurate classification of biological sequences using ML often requires many more input features such as using dipeptide composition [15] However, it is true that those studies involved binary classifications of a group of related proteins (e.g., Histones) from the remaining universe of proteins (e.g., non-Histones). Instead, here we performed a multiclass classification.

Evaluation of ML-based classifiers on holdout MHC I proteins
We carried out several validations on distinct independent datasets consisting on entire groups of MHC I proteins (holdout sequences) that were drawn from the MHCI 556 dataset prior to model building. Our goal was to explore whether ML-based classifiers built and optimized in cross-validation can predict the ligand-type specificity of MHCI proteins differing entirely from those used for model building as well as to identify MHC I proteins that are critical to guarantee the robustness of the method. The results of these analyses are depicted on Table 2 and summarized next.   In each distinct training dataset, the ML-based classifiers reached in cross-validation an extraordinary accuracy (ACC P 99.3%) that mirrored the results obtained on the full MHCI 556 dataset (ACC SVM-RBFk > ACC kNN > ACC SVM-Pk). In many occasions, ML-based classifiers were able to predict the right ligand-type specificity of the proteins that were removed prior to model building (Table 2). Moreover, we noted that classifiers that performed best in cross-validation classified the holdout proteins with fewer errors (see results involving HFE, ULBP, and FcRn holdout tests in Table 2). However, we also found MHC I proteins, such as CD1 antigens ZAG and TLA proteins (Table 2), that could not be classified appropriately, indicating that it is critical to include them in the training datasets. ML-based models were also unable to predict peptide binding for all MHC I proteins form fish (Table 2). Although there is not enough experimental evidence, it is reasonable, yet arguable, to think that all these fish MHC I should bind peptides; they have been classified as classical MHC I molecules in specialized databases (see http://www.ebi.ac.uk/ipd/mhc/fish/index. html).
We also investigated the potential tolerance of ML-models to mutations in the testing data. Specifically, we evaluated the ability of the three SVM-RBFk models trained on datasets lacking HLA I, EPCR and MICA&B proteins to predict the correct class of the relevant proteins modified with an increasing number of random mutations at random variable sites. The results indicate that at least these models are quite tolerant to mutations in the tested proteins (decreasing the percentage of properly predicted instances to 80% required more than 30 mutations) (see Supplementary Fig. S2).

Predicted ligand-type specificity of MHC Ib molecules of uncharacterized ligands
There are a number of mouse and human MHC Ib molecules whose ligands, if any, have not been characterized yet. In humans, the MHC Ib molecules of unverified ligands are HLA-F and MR1, whereas in mouse are MR1 and a number molecules encoded by the H2-T, H2-Q and H2-M loci. Here, we predicted the ligand-type specificity of these proteins (a1a2 domain) using the ML-based classifiers trained on the MHCI 556 dataset and compared the results with those resulting from a BLAST search using the corresponding a1a2 domain sequences against the MHCI 556 dataset (ligand-type specificity was assigned to that of the closest hit). To verify the generalization power of our classifiers, we also predicted the ligand-type specificity of two MHC I molecules from chicken (BF2⁄2101 and YF1⁄7.1) as well as UL18, H2-T9, H2-T10 and H2-T22. All these proteins, despite their known binding ability, were not used for model building (they were not included in the MHCI 556 dataset). UL18 is a viral MHC I-like molecule from human cytomegalovirus that is known to bind peptides [16], whereas H2-T9, H2-T10 and H2-T22 are closely related murine MHC Ib molecules, which are incapable of binding any ligand because of having a truncated a1a2 domain [17,18]. As for the chicken MHC I proteins, BF2⁄2101 binds peptides [19] while YF1⁄7.1 appears to bind some unknown non-classical ligand (perhaps a lipid) [20]. A phylogenetic tree depicting the analyzed proteins is shown Fig. 3. The predictions are summarized in Table 3 and we will next highlight some of the results.
UL18 and both chicken MHC I proteins were predicted to bind peptides using BLAST and the ML-based classifiers (Table 3). However, only ML-based classifiers predicted the lack of ligand of H2-T9, H2-T10 and H2-T22 (Table 3). In fact, BLAST predicted peptide-binding for all the MHC Ib proteins that were tested (Table 3). These results suggest that the ML approach is more suitable to predict the ligand-type specificity of MHC I proteins than the BLAST approach. Nevertheless, the two approaches are not necessarily exclusive but complementary.
ML-based classifiers revealed MHC Ib proteins with ligand-type specificities that could not have been anticipated using sequence similarity analyses. Thus, ML-based classifiers predicted that the mouse H2-M1 and H2-M10 proteins (Fig. 3), which are expressed in vomeronasal sensory neurons (VNS) [21], bind lipids and have no ligand, respectively ( Table 3). The lack of ligand of H2-M10 proteins is consistent with the available crystal structure of H2-M10.5   a Sequence length shorter than training sequences. b Genbank accession numbers. c Gene annotated in NCBI as MR1 major histocompatibility complex, class I-related, most likely by mistake. [22]. Functional V2R pheromone receptors express concomitantly with H2-M1 proteins in VNS [21], and we speculate that likely these receptors recognize lipidic pheromones presented by H2-M1 proteins. ML-based classifiers also predicted that MR1 does not bind any ligand. MR1 (MHCI-related, class I, molecule) is a highly conserved MHC Ib molecule, which has been shown to restrict a sub-population of ab T cells named MAIT (Mucosal-associated invariant T cells) [23]. The expression of MR1 appears to be TAP and proteasome independent but yet some authors have found evidence supporting antigen-presentation function for MR1, possible of peptides [24][25][26]. Moreover, a group has indicated that MR1 presents a-mannosyl glycolipids to invariant V a19-J a33 MAIT cells [27], although others have failed to confirm these findings [28]. While it is possible that MR1 binds molecules that are different from peptides or lipids, our results indicate that MR1 function and restriction of MAIT cells might be independent of antigen presentation.
Because it is not clear whether all of the classical MHC I proteins from fish included in the MHCI 556 dataset (Table 1) do really bind peptides, we repeated all these predictions using ML-based models trained without the fish proteins (MHCI 500 dataset). The results were largely the same, as shown in Supplementary Table S2.

Conclusions and limitations
Currently, there is a plethora of methods to predict peptide binding to specific MHC Ia molecules [29], but surprisingly, until now no method was available to predict whether any uncharacterized member of the MHC I protein family can bind any ligand at all, and if so, the nature of such ligand (peptides or lipids). This information is key to lead the experiments that allow the identification of the relevant ligands. Upon an original multi-class classification approach, we developed here ML-based classifiers that achieve such task with great accuracy and generalization power. Prediction of the ligand-type specificity of MHC I proteins using our classifiers is available for free public use at http://imed.med.ucm.es/MHCLIG/. It is important to stress that the classifiers developed here can only predict the three known ligand-type specificities of MHC I proteins (P, L and N). If there would be MHC I proteins with other, yet to be characterized, type of ligands (e.g., sugars, nucleotides, etc.) the breath of our predictions will be clearly limited.