Benchmarking HLA genotyping and clarifying HLA impact on survival in tumor immunotherapy

Human leukocyte antigen (HLA) genotyping gains intensive attention due to its critical role in cancer immunotherapy. It is still a challenging issue to generate reliable HLA genotyping results through in silico tools. In addition, the survival impact of HLA alleles in tumor prognosis and immunotherapy remains controversial. In this study, the benchmarking of HLA genotyping on TCGA is performed and a ‘Gun‐Bullet’ model which helps to clarify the survival impact of HLA allele is presented. The performance of HLA class I genotyping is generally better than class II. POLYSOLVER, OptiType, and xHLA perform generally better at HLA class I calling with an accuracy of 0.954, 0.949, and 0.937, respectively. HLA‐HD obtained the highest accuracy of 0.904 on HLA class II alleles calling. Each HLA genotyping tool displayed specific error patterns. The ensemble HLA calling from the top‐3 tools is superior to any individual one. HLA alleles show distinct survival impact among cancers. Cytolytic activity (CYT) was proposed as the underlying mechanism to interpret the survival impact of HLA alleles in the ‘Gun‐Bullet’ model for fighting cancer. A strong HLA allele plus a high tumor mutation burden (TMB) could stimulate intensive immune CYT, leading to extended survival. We established an up to now most reliable TCGA HLA benchmark dataset, composing of concordance alleles generated from eight prevalently used HLA genotyping tools. Our findings indicate that reliable HLA genotyping should be performed based on concordance alleles integrating multiple tools and incorporating TMB background with HLA genotype, which helps to improve the survival prediction compared to HLA genotyping alone.

Human leukocyte antigen (HLA) genotyping gains intensive attention due to its critical role in cancer immunotherapy. It is still a challenging issue to generate reliable HLA genotyping results through in silico tools. In addition, the survival impact of HLA alleles in tumor prognosis and immunotherapy remains controversial. In this study, the benchmarking of HLA genotyping on TCGA is performed and a 'Gun-Bullet' model which helps to clarify the survival impact of HLA allele is presented. The performance of HLA class I genotyping is generally better than class II. POLY-SOLVER, OptiType, and xHLA perform generally better at HLA class I calling with an accuracy of 0.954, 0.949, and 0.937, respectively. HLA-HD obtained the highest accuracy of 0.904 on HLA class II alleles calling. Each HLA genotyping tool displayed specific error patterns. The ensemble HLA calling from the top-3 tools is superior to any individual one. HLA alleles show distinct survival impact among cancers. Cytolytic activity (CYT) was proposed as the underlying mechanism to interpret the survival impact of HLA alleles in the 'Gun-Bullet' model for fighting cancer. A strong HLA allele plus a high tumor mutation burden (TMB) could stimulate intensive immune CYT, leading to extended survival. We established an up to now most reliable TCGA HLA benchmark dataset, composing of concordance alleles generated from eight prevalently used HLA genotyping tools. Our findings indicate that reliable HLA genotyping should be performed based on concordance alleles integrating multiple tools and incorporating TMB background with HLA genotype, which helps to improve the survival prediction compared to HLA genotyping alone.
reported in latest IMGT/HLA database [1] (Release 3.42.0). Typically, HLA system contains two copies of HLA-A, B, C, DRB1, DQB1, DPB1, DQA1, etc. Such combination of different HLA alleles makes the dramatically distinct immune response between individuals, even under the same internal/external stimulation. Therefore, accurate HLA genotyping and stringent matching of patient and donor are extremely important in organ transplantation. Any slight mismatches may lead to a serious graft-versus-host rejection. Currently, capillary electrophoresis-based dye-terminator Sanger sequencing is served as the gold-standard approach for high resolution HLA genotyping in clinical laboratories. Since two copies of alleles are sequenced together in one reaction, it is difficult to determine whether the detected SNPs come from the same chromosome (cis) or from opposite chromosome (trans). Several rounds of additional sequencing are needed to identify cis/trans polymorphism. In addition, the increasing quantity of new HLA alleles would introduce more laborious efforts on primer designing and sequencing. Due to such limitations, the concordance rate of Sanger sequencing-based HLA genotyping is about 84% among different laboratories [2]. Obviously, the traditional labor-intensive and accuracy-low methods are far behind clinical requirement.
The development of next-generation sequencing (NGS) technology has changed the landscape of HLA genotyping. Since the DNA fragments are independently amplified and sequenced, it could dramatically reduce the ambiguities of cis/trans polymorphism occurs in Sanger sequencing. Generally, three typical processes are implemented in NGS-based HLA genotyping: (a) reads alignment toward known HLA reference sequences, such as IMGT/HLA sequence database; (b) candidate HLA alleles identification through assembling the aligned reads into contigs; and (c) HLA allele pair inference by scoring the coverage balance of any two identified candidate alleles. Up to now, dozens of bioinformatics tools have been developed for NGS-based HLA genotyping. All the tools were smartly designed to optimize the accuracy and speed on the mentioned three processes. For example, POLYSOLVER [3] adopted a two-step Bayesian classification model to infer HLA allele pair, with careful consideration of sequence quality, insert sizes, and ethnicity-dependent allele prior probability. OptiType [4] constructed a binary hit matrix indicating the alleles that a specific reads sequence could be aligned with minimum mismatches. Integer linear programming is subsequently used to maximum the number of explainable reads by selecting one or two alleles. A recently developed tool called Kourami [5] introduced a graphguided assembly strategy to infer the most likely HLA allele pair. Since the gene level partial order graphs were prebuilt from known HLA sequences and remodified by reads alignment, it confers the ability of discovering novel alleles that do not appear in the database. Most of these HLA genotyping tools displayed the best performance in their releasing demo dataset, an unbiased and comprehensive benchmarking with large scale curated benchmark datasets on these tools, however, is still lacking in this field.
In addition, due to the essential role in the molecular mechanism of immune recognition, HLA gains a lot of scientific interests on immune oncology in recent years. Rachel [6,7] found that HLA allele provide a tumor evolutionary pressure through restricting the oncogenic mutational landscape in TCGA cohort. The frequent driver mutations are universally poorly presented by HLA, since the tumor clones carrying the strongly presented mutations had already been killed by immune system. HLA allele-specific loss of heterozygous (LOH) was verified as an important immune evasive mechanism of nonsmall-cell lung cancer (NSCLC), which could impair the ability of recognizing tumor antigens in immune system [8]. Furthermore, HLA may serve as an independent biomarker for immune checkpoint blocker (ICB) therapy. Positive correlation between diversity of HLA class I alleles and clinical benefit was found by Chowell [9] in ICB-treated melanoma cohort. Since a broader antigen profile would be presented in heterozygous HLA allele than homozygous, the likelihood of benefit would be subsequently increased. What's more, the presence of HLA class I supertype B44 was correlated with a longer overall survival (OS). However, it remains controversial on the prognostic effect of HLA under ICB therapy. In a recently released study, Negrao et al. [10] found HLA supertype is not correlated with OS in ICB-treated advanced NSCLC cohort collected in MD Anderson Cancer Center. Therefore, it is worthy to investigate why such conflict occurs in these studies.
Cytolytic activity (CYT) is the ultimate effect of HLA-antigen stimulated immune response, which initially quantified by Rooney [11] as the geometric mean expression of GZMA and PRF1. Theoretically, neoantigen is likely to drive CYT. It has been partially verified through the positive correlation between neoantigen load and CYT across multiple tumor types in TCGA. Moreover, a high CYT is associated with a modest but significant pan-cancer survival benefit. Therefore, it is worthy to investigate that whether CYT could bridge the HLA-antigen initialized immune response and the OS.
Taking together, an objective benchmark of available HLA genotyping tools as well as a systematic investigation of the survival impact of HLA genotype based on the well curated TCGA data and clinical cohorts treated with ICBs are needed. We investigated these two problems in one study since the accurate HLA calling and the following analysis of their survival impact are fundamental for immunotherapy. In this study, we performed a comprehensive HLA benchmark in TCGA cohorts, composing of concordance results generated from eight frequently used or latest developed tools, including POLYSOLVER, OptiType, xHLA, HLA-HD, hla-genotyper, SOAP-HLA, HLA-VBSeq, and Kourami. In addition, the influence of HLA allele on immune CYT and survival is further investigated. Moreover, a 'Gun-Bullet' model was proposed to interpret the survival impact of HLA alleles through CYT. Specifically, the relationship of HLA and tumor mutation burden (TMB) was similar to 'Gun' and 'Bullet'. The integrative effect of HLA allele and TMB was likely to induce CYT and subsequently influence survival. We set out to address several following questions, which is closely related to the clarifying of the controversial issue of survival impact of HLA allele in immunotherapy, including: (a) systematically investigation and comparison of the performance of individual HLA genotyping tools and the ensemble one; (b) what is the frequency error pattern in each HLA genotyping tool? and (c) what is the underlying explanation to address the controversial issue on the prognostic effect of HLA allele in survival and tumor immunotherapy?

Study design
As shown in Fig. 1, the benchmark dataset was composed of the most concordance HLA alleles generated by eight prevalently used HLA genotyping tools based on peripheral blood mononuclear cells (PBMCs) WES data in TCGA. In addition, HLA allele-specific LOH were called from tumor WES data in TCGA and the benchmark dataset. Then, the HLA genotyping performance of each tools was evaluated, including recall, accuracy, and error patterns. What's more, the survival impact of HLA alleles was also investigated on TCGA cohorts. Specifically, univariate and multivariate CoxPH regression were used to detect the HLA alleles that significantly influencing the OS of patient. Since the survival benefit of CYT had been proved in TCGA pan-cancer cohorts in Rooney's study [11], we further investigated the survival impact of CYT on independent cohorts under ICB treatment. Finally, a 'Gun-Bullet' model was proposed to interpret the survival impact of HLA based on CYT. The study methodologies conformed to the standards set by the Declaration of Helsinki. Detailed clinical information regarding the cohorts collected in the study can be found at Table S1.

TCGA benchmark dataset of HLA class I and class II
As no golden standard dataset is available for HLA genotyping, it is critical to create an HLA benchmark dataset representing the benchmark in silico. Herein, we take advantages of the basic idea of ensemble learning by equally treat each HLA genotyping tool as an expert. Then, each copy of HLA gene including A, B, C, DRB1, DPB1, DQA1, and DQB1 was, respectively, voted by these 'experts' with an ensemble way. The most concordance allele would obtain the highest votes, and they are selected as the 'ground truth' in the benchmark analysis. In order to avoid the ambiguous calling, only the most concordance allele with at least two votes would be curated into the benchmark dataset. Then, each sample was assigned a group of HLA genes that composed of the most concordance alleles with highest votes. Twelve HLA supertypes were also conferred to TCGA samples, strictly following the HLA class I allele classification approach mentioned in Chowell's study [9] and Sidney' study [16]. It should be noted that the benchmark dataset derived from such highest concordance HLA alleles are not exactly equal to the ground truth; however, the ambiguous calling alleles against such benchmark dataset would nevertheless indicate the risk of genotyping the specific allele in that tool. The HLA alleles with high error rate genotyped by single tool would be taken with caution.

HLA-LOH status assessment
The HLA allele-specific LOH calling was implemented by LOHHLA [8] (https://bitbucket.org/mcgranaha nlab/lohhla/src/master/). Since only allele sequence at genome level is supported by LOHHLA program, the consensus 4-digit protein level HLA allele generated above was transformed to higher resolution allele with longest genome sequence. For example, A01:09:01:02 with 3359bp in genome length was used for LOH assessment on 4-digit allele A01:09. Allele-specific LOH event was detected, according to the allelic coverage imbalance status (P value ≤ 0.05).

Evaluation metrics for HLA genotyping benchmark
HLA class I and class II genotyping performance were separately evaluated, since not all tools could detect two HLA classes. Recall and accuracy defined by Thorne's study [2] were used for performance evaluation. These two metrics were calculated as the following formula: Recall ¼ number of right alleles number of right alleles þ number of wrong allele (1) Taking HLA class I alleles as an example, two copy of HLA-A, B, and C alleles predicted by eight tools were compared with the corresponding benchmark dataset for each patient, respectively. Then, the predicted alleles could be classified into three types. Right allele indicates the allele is covered by the benchmark Accuracy ¼ number of right alleles number of right alleles þ number of wrong alleles þ number of uncalled alleles (2) dataset, while the wrong allele means the allele is absent in the benchmark dataset. The alleles that failed for genotyping by the tool were defined as uncalled alleles.

HLA genotyping error pattern in each tool
In order to investigate the error calling bias in each tool, general error rate and specific error rate were calculated through formula (3) and (4)

CYT calculation
Releasing perforin and granzymes into target cell is the major molecular mechanism underlying the destroy ability of immune effector cells [17,18]. Since perforin and granzymes were exclusively secreted by immune effector cytotoxic T lymphocytes and natural killer (NK) cells, the expression of GZMA and PRF1 gene detected in bulk tumor tissue could be quantified as an indicator of the intratumoral immune CYT [19,20]. In this study, we retrieved the mRNA expression GZMA and PRF1 among TCGA tumor samples from UCSC Xena website (https://xenabrowser.net/datapages/). Then, the CYT of each sample was defined as the geometric mean of GZMA and PRF1 (as expressed in TPM, 0.01 offset), following Rooney's study [11].
Regarding to each specific cancer condition, the patients were equally separated into two groups (high CYT and low CYT) by the median CYT value under the specific cancer condition.

Clinical cohort treated with immune checkpoint blockers
Analyses were conducted on five cohorts (including three melanoma cohorts and two NSCLC cohorts) treated with ICBs. The Van Allen cohort [23] consisted of 41 melanoma patients treated with ipilimumab (anti-CTLA4 therapy) (Table S1A). The Tuba cohort [24] consisted of 73 melanoma patients treated with Nivolumab/Pembrolizumab (anti-PD-1 monotherapy) or combined anti-PD-1 and anti-CTLA-4 immunotherapy. Patient outcomes were classified as responding to therapy (CR or PR, n = 39) or not responding to therapy (SD or PD, n = 34) (Table S1B). Both RNA-seq and WES data were available in Van Allen's melanoma cohort [23]. The processing results (including HLA, TMB, CYT) of the two melanoma cohorts were retrieved from our previous studies [25][26][27]. Chowell's melanoma cohort [9] contains 164 patients under anti-CTLA4 treatment. The HLA and TMB information were collected from the supplementary of the original paper. (Table S1C). Jae-Won Cho's NSCLC cohort [28] contained five patients with durable clinical benefit (DCB) and 11 patients in no durable benefit (NDB) group under anti-PD1 therapy. Jeong Yeon Kim's NSCLC cohort [29] consisted of eight patients with DCB and 19 patients with NDB under anti-PD1 treatment. The gene expression profiling results were generated through STAR [30] alignment and subsequent expression quantification in RSEM [31]. (Table S1D, E). OS data utilized in the survival analyses and TMB information were also retrieved from original study. The HLA alleles and LOH of HLA were generated as described in Materials and methods.

Statistical analysis
Survival analysis was implemented by python package lifelines (https://github.com/CamDavidsonPilon/lifeline s). In detail, hazard ratio was calculated by Cox proportional hazards regression model with the function CoxPHFitter. Subsequently, R package forestplot was utilized for results visualization. Kaplan-Meier survival plot was generated by function KaplanMeierFitter in python package lifelines.

The performance of HLA class I genotyping is generally better than class II
Since no golden standard HLA genotyping results are available in TCGA, we try to computationally curate a comprehensive benchmark data taken as the ground truth with ensemble strategy. In detail, the 4-digit HLA class I and II benchmark dataset were generated by the concordance analysis on HLA genotyping results of eight tools. Such HLA benchmark dataset contains 10 479 samples with HLA class I results and 10 440 samples with class II results, respectively. Each allele in the benchmark dataset derives from the most concordance alleles that successfully reported by the eight tools. The population frequency of the top 20 HLA class I alleles, class II alleles, and 12 supertypes in TCGA cohort was displayed in Fig. 2A (see the full population frequency list in Table S2). The population frequency across the alleles is quite imbalance. A02:01 is the dominant HLA class I allele in TCGA cohort with the population frequency at 41.99% (allelic frequency is 8.03%), which was detected in 4400 individual patients. Six thousand and forty-nine patients carrying DPB1:04:01 that occupied 57.94% in TCGA cohort (the corresponding allelic frequency in TCGA is 9.00%). B07, A03, A02, B44, and A01 are top five HLA supertypes with population frequency more than 40%.
We compared the prediction results of each tool against the benchmark dataset. The recall and accuracy of eight tools on HLA class I genotyping were displayed in Fig. 2B. The best performance of HLA class I genotyping was found in POLYSOLVER. Both recall and accuracy are detected above 0.954. 3/10 479 samples were failed for HLA class I genotyping on POLYSOL-VER. However, OpiType, xHLA, HLA-HD, and SOAP-HLA were successful to generate HLA class I results on all the samples. Specially, OpiType displayed a very good performance with accuracy at 0.949, which is slightly close to POLYSOLVER.
Since the top two HLA class I predictors (POLYSOL-VER and OpiType) do not support HLA class II genotyping, the performance of the other six tools was evaluated. Specifically, HLA-HD, SOAP-HLA, and HLA-VBSeq are able to call four major class II genes, including DRB1, DPB1, DQA1, and DQB1. DRB1, DQB1, and DPB1 were supported by xHLA. Kourami, hla-genotyper, and HLA-VBSeq were designed for calling DRB1, DQA1, and DQB1. As shown in Fig. 2C, HLA-HD obtained the highest accuracy at 0.904 on calling DRB1, DPB1, DQA1, and DQB1. The highest recall was achieved by xHLA at 0.948. However, the accuracy of xHLA was dramatically dropped to 0.708, due to its disability on DQA1 calling. If it is specifically focused on the three genes (DRB1, DQB1, and DPB1) supported by xHLA, the corresponding accuracy could increase to 0.944, which indicates that xHLA shows a high performance on calling the supported class II genes.

Frequently miscalled HLA class I alleles in the tested eight tools
In order to investigate the error bias in each tool, we calculated the general error rate of HLA alleles through comparing the HLA calling results with the benchmark dataset. Thirty-nine frequently miscalled class I alleles with high general error rates were picked out according to two criterions: (a). the allelic frequency in benchmark dataset ≥ 200 and (b) the general error rate is no less than 0.05 in at least one tool. As shown in Fig. 3A and

Frequently miscalled HLA class II alleles in the tested six tools
Using the same processing approach mentioned above, 57 HLA class II alleles with high general error rates were displayed in Fig. 3B and

The preference of miscalled HLA allele pairs
With the aim of gaining insights on the ambiguous calling HLA pairs in each tool, we further investigated the error components of top two miscalled alleles in each tool. As shown in Fig. 3C

Integration of HLA calling from the top-3 tools is superior to individual one
Generally, none of the tested tools achieved the high accuracy that claimed in their respective papers. Probably such situation is caused by the relatively small benchmark data size. For example, only three trio samples were used in HLA-VBSeq performance assessment [14]. Kourami was tested on 12 trios samples [15]. However, the large TCGA HLA benchmark dataset covering~10 479 samples curated in our study would ensure a robust and unbiased performance assessment. POLYSOLVER, OpiType, and xHLA ranked as top three HLA class I genotyping tools with the high accuracy ranging from 0.937 to 0.954. However, HLA-HD, SOAP-HLA, and xHLA obtain the relatively high performance (accuracy 0.708-0.904) on HLA class II alleles DRB1, DPB1, DQA1, and DQB1 calling. In addition, distinct error patterns were found among the HLA genotyping tools. POLYSOLVER tends to make mistakes on A25:01 calling, but such error is unlikely to happen on OpiType. Therefore, it is unwise to make decision on HLA genotyping based on individual tool. Such distinct error pattern may suggest that the incorrectly called alleles would be efficiently avoided through concordance analysis from multiple HLA genotyping tools. To this end, we try to create an ensemble result from the top three tools following the approach that quite similar to benchmark dataset curation. If the allele is reported by no less than two tools, it will be considered as the true allele. Otherwise, if no overlapping allele is found in the three tools, the allele reported by the tool with a highest accuracy will be assigned as the true allele. Regarding to HLA class I, the combining A, B, and C allele results were generated from POLYSOLVER, OpiType, and xHLA. HLA-HD, SOAP-HLA, and xHLA were utilized for creating ensemble results on DRB1, DQB1, DPB1, and DQA1. As shown in Fig. 4A,B, the accuracy of ensemble results on HLA class I genes (A, B, C) and HLA class II genes (DRB1, DQB1, DPB1 and DQA1) would reach 0.981 and 0.907, respectively. Taking together, the ensemble results of the top 3 tools are superior to individual ones. It is highly recommended to generate reliable HLA results based on the concordance alleles integrated from multiple tools.

The impact of HLA on survival
The univariate CoxPH regression model was firstly utilized to analyze the impact of specific HLA supertypes on OS in TCGA cohort. In this study, HLA supertypes are defined as the groups of HLA alleles sharing specific amino residues at anchor positions of the B and F pockets in the peptide-binding region. The HLA alleles within same supertypes were expected to have similar presentation ability. In detail, regarding to each specific HLA supertype, two subgroups could be defined based on allele present or absent on the patients. Subsequently, risk comparison was implemented among the two subgroups. As shown in Fig. 5 A, eight HLA supertypes were observed with a significant impact on survival (Wald test P < 0.05, see full list in Table S6). Each record with hazard ratio (HR) < 1 means the specific HLA supertype could serve as a beneficial factor under that condition. However, records with HR > 1 probably indicate that such specific HLA supertype increase the death risk on the cancer condition. It seems that the impact of HLA on OS may be disease-specific. The same HLA supertype may generate distinct influence among cancer conditions. For example, different survival impact of B44 was found in early-stage skin cutaneous melanoma (SKCM), advanced stage rectum adenocarcinoma (READ), advanced stage ovarian serous cystadenocarcinoma (OV), and early-stage breast invasive carcinoma (BRCA). Specifically, better survival was observed in the B44-present rather than B44-absent on the previous two cancer conditions (early-stage SKCM: HR = 0.66, Wald test P = 0.039; advanced stage READ: HR = 0.35, Wald test P = 0.049). On the contrary, B44-present patients showed worse survival in advanced stage OV and early-stage BRCA (advanced stage OV: HR = 1.32, Wald test P = 0.03; early-stage BRCA: HR = 1.66, Wald test P = 0.03).
Since many cofounding factors may interact with HLA allele on OS, a more precise multivariate CoxPH model analysis was further performed on the HLA supertype and tumors as presented in Fig. 5A. The age at initial diagnosis of tumor may played a dominant harmful role in advanced stage OV and early-stage BRCA (advanced stage OV: HR = 1.59, Wald test P = 0.001; early-stage BRCA: HR = 3.50, Wald test P = 7.18E-06), comparing with HLA supertype B44 (advanced stage OV: HR = 1.26, Wald test P = 0.112; early-stage BRCA: HR = 1.55, Wald test P = 0.104), while the beneficial effect of B44 on survival was still observed in early-stage SKCM and advanced stage READ. Specially, no significant survival impact of TMB or age was detected in advanced stage READ (Table 1). It is interesting that the beneficial effect of B44 on SKCM survival is exactly consistent with B44 on ICBs treated melanoma survival [9]. Since ICBs is not the dominant treatment in TCGA SKCM cohort, it seems that the beneficial effect of B44 on the SKCM survival is independent of ICBs treatment. In addition, B62 was reported to show negative correlation with survival of ICBs treated melanoma in Chowell's study [9]. Such harmful effect of B62 on survival was also observed in TCGA advanced liver hepatocellular carcinoma (LIHC) cohort.
The antigen presentation role of HLA might influence survival through increasing immune response. Based on such assumption, we further investigate the corresponding CYT among the HLA supertypes-present/absent cohorts in different cancer conditions as displayed in Fig. 5A. The average CYT of B44-present patients in early-stage SKCM, advanced stage READ were slightly higher than that of B44-absent patients (Fig. 5B, Table S7). Specifically, such difference trend reached P value at 0.057 in early-stage SCKM, which is very close to the statistic confidence cutoff. However, no difference of mean CYT was detected among the patients in early-stage BRCA and advanced stage OV regardless of B44 status. In addition, since the age was detected as a dominant risk factor in the multivariate CoxPH regression, such observed harmful impact of B44 on the two cancer conditions was probably not caused by immune response. It should be noted that nearly 88.22% OV patients in TCGA received carboplatin treatment. Since chemotherapy is most likely to weaken the immune system, such survival risk observed in TCGA OV cohorts is probably not caused by HLA alleles. We further checked the survival impact of CYT on the four cancer conditions in TCGA cohorts (see full list in Fig. S1). High-CYT patients displayed extended survival in early-stage SKCM with HR   (13) 40 (5) 57 (8) 69 (10) 109 (42) 188 (119) 117 (61) 335 (43) 20 (15) 33 (18) 58 (11) 26 (15) Abs.Pts(death) 43 (19) 34 (12) 127 (39) 204 (55) 124 (66) 212 (127) 116 (47) 425 (33) 152 (79) 57 (29) 331 (30) 85 (   According to the supertypes status, either 'present' or 'absent' were assigned to each patient. Only the supertypes with significant influence on OS were displayed. (B) Immune cytolytic activity analysis on HLA supertypes under specific cancer conditions mentioned in Fig. 4A. The patients with specific HLA supertype were marked in blue. While, orange bar represents the supertype-absent patients. Violin plot and one-tailed t test were generated by python package seaborn and statsmodels, respectively. "*" indicate significant with t test P value ≤ 0.05. Since HLA allele-specific LOH event also indicates a type of allele-absent conditions, it could precisely describe the allele status in the patient. We further evaluated the survival impact between the B44-present patients without any B44-LOH events and patients with either B44-absent or B44-LOH using both univariate and multivariate CoxPH analyses, the HR value was achieved at 0.64 (Wald test P = 0.025) and 0.63 (Wald test P = 0.025) in early-stage SKCM, respectively (Fig. S2A, Table S8). In addition, the mean CYT of B44-present group is significantly greater than B44-LOH/B44-absent group (Fold change = 1.83, P = 0.033) (Fig. S2B, Table S9). It indicates that CYT may partially clarify the survival impact of HLA alleles.

The 'Gun-Bullet' model for interpreting the impact of HLA on survival
In canonical immune process, the immune signaling is stimulated by the recognition of TMB-derived antigen and HLA allele [32,33]. TMB and HLA allele were both essential contributing for immune signaling that finally lead to CYT of effector T cells. It is quite controversial on the impact of the HLA in survival, since opposite conclusions were generated in two previous studies. In Chowell's study [9], HLA supertype B44 was associated with extended survival in patients with melanoma receiving ICBs from two independent cohorts. However, no HLA supertypes were observed with significant impact on survival, regarding to an ICB-treated NSCLC cohort in M. D. Anderson Cancer Center [10]. How can we interpret such diseasespecific impact of HLA supertype on survival? The distinct mutational signatures of amino acids across cancer types may provide clues on such issue [34]. In the molecular level, HLA-B44 share a preference for glutamic acid (E) at anchor position 2, which is exactly match the most enriched amino acid mutations (G>E) in melanoma. It may indicate that HLA-B44 prefers to presenting such 'E' enriched mutant peptides in melanoma. The subsequently stimulated immune signaling is the ultimate effector that response for the survival extension of patients receiving ICBs. CYT, deriving from the expression of GZMA and PRF1 that specifically released by effector T cells, could well mimic the intratumoral cytolytic T-cell activity in microenvironment. According to Rooney's study [11], high CYT was proofed as a beneficial factor for modest but significant pan-cancer survival in previous TCGA pan-cancer study. Moreover, in the two independent ICB-treated melanoma cohorts [35,23], the significant extended OS were observed in the high-CYT group (median OS in Van Allen's cohort 27 vs 7 m, P = 0.0011; median OS in Tuba's cohort 732 vs 506 d, P = 0.0006, see details in Fig. S3). Furthermore, we also found that significant higher CYT was detected in DCB group than NDB group from two independent NSCLC cohorts under anti-PD1 therapy. Especially, in Jae-Won Cho's cohort [28], the median OS in high-CYT group is significant longer than low-CYT group (median OS: 14.1 vs 3.1 m, P = 0.0048) (Fig. S4). Therefore, CYT may provide us novel insights on the survival impact of HLA alleles.
Herein, we proposed a 'Gun-Bullet' model for interpreting the impact of HLA on survival. As shown in Fig. 6A, the impact of HLA and TMB on CYT is quite similar to the relationship between gun and bullets on force. A powerful gun plus high quantity bullets could maximize the force. Similarly, a strong HLA allele plus a high TMB could stimulate intensive immune response (CYT), leading to longer survival. 'Stronger' HLA alleles are more able to present antigens (TMB) and subsequently induce higher CYT. As shown in Fig. 6B, regarding to a typical 'strong' HLA supertype B44 in early-stage SKCM, the maximum distinct CYT and OS were found between B44-present and high-TMB group and B44-absent & Low-TMB group, while no difference on both CYT and OS was observed in B44-present and low-TMB group and B44-absent and high-TMB group. Since much less antigens are available in low-TMB group, it is unexpected to achieve high CYT in such B44-present patients. On the contrary, the 'weak' allele A02-present and low-TMB patients displayed the minimum CYT and short OS. No obvious difference on CYT and OS was found between A02-present and high-TMB group and A02-absent and low-TMB group (Fig. 6C). Although much more antigens were available for A02presenting in high-TMB group, no dramatically increase of CYT could occur due to the weak antigen presentation ability of A02. That means TMB is a critical factor for the immune response that regulated by HLA allele. The patients with a 'strong' HLA allele as well as a high TMB were likely to induce a strong CYT and a better survival. However, since insufficient antigens could be generated for HLA presentation in cancers with low-TMB background, HLA may be not Typical "Weak" allele: A02 "Strong" The "Gun-Bullet" model for fighting cancer qualified as an independent prognostic biomarker for survival. At least, the signal of HLA impact on survival is much weaker in cancers with a low-TMB background than that of a high-TMB background. Melanoma had always been reported as a typical high-TMB cancer with median TMB larger than 14 in several large pan-cancer cohorts, including Foundation cohort [36], TCGA cohort [37], MSK-IMPACT cohort [38], and Yarchoan's cohort [39]. NSCLC displayed much lower TMB in the studies mentioned above. This may partially explain the debates of HLA as a prognostic biomarker occurred in Chowell's [9] melanoma cohort and Negrao's [10] NSCLC cohort.
The 'Gun-Bullet' model can be used to clarify the survival impact of HLA in ICB-treated cohort as well. As we mentioned above, a higher CYT generally indicates a longer survival in two melanoma cohorts [35,23] and two NSCLC cohorts [28,29] under ICB treatment. In addition, B44 and TMB were detected with positive correlation with survival in Chowell's cohort [9]. Furthermore, the effect of B44 on extended survival was greater when TMB was also considered. (P = 0.00003, median OS: 23.7 vs 8.2 m) (Fig. S5).
Since the ensemble effect of B44 and TMB on survival was greater than any individual one, it may indicate that complementary roles of B44 and TMB on survival.
However, B44-present patients did not display any survival advantage, comparing with the B44-absent patients in Van Allen's cohort [23] (Fig. S6A). The same situation was also detected for TMB as an individual prognostic biomarker (Fig. S6B), although the trend of clinical benefit was observed in survival curve between B44-present and high-TMB group and B44absent and low-TMB group (Fig. S6C). However, it did not achieve statistical significance. Regarding to the 'Gun-Bullet' model, such situation may be caused by the similar CYT among these comparison conditions. As expected, no significant higher CYT was found in B44-present group (Fig. S6D), high-TMB group (Fig. S6E), or B44-present and high-TMB group (Fig. S6F). Since CYT may serve as an underlying mechanism in 'Gun-Bullet' model for interpreting the survival impact of HLA and TMB, such phenomena are explainable. No survival advantage could be observed among the patients with statistically equivalent CYT, regardless of the TMB or B44 status.

Discussion
In this study, we have evaluated eight frequently used and publicly available NGS-based HLA genotyping tools on TCGA WES data. Since no golden standard HLA dataset is available, constructing a HLA benchmark dataset is the critical point ahead of performance evaluation. With the assumption that the 'True' HLA alleles are likely to be frequently reported by different HLA genotyping tools, a TCGA HLA benchmark dataset covering seven major HLA genes: HLA-A, B, C, DRB1, DPB1, DQA1, and DQB1, was generated based on the concordance allele processing. Actually, we tried to implement a comprehensive performance evaluation on WES-based HLA genotyping tools as more as possible. More than 11 tools were included in the original study design, and some of them had been discarded in pilot test due to different issues. For example, HLAscan [40] is not supported for NGS data with reads length < 76 bp. That means more than 41% TCGA WES data (reads length < 76 bp) were unable to generate the HLA genotyping results. However, the source code of PHLAT [41] is not convenient to access. HLA-PRG-LA [42] utilizes the precise graph alignment during HLA genotyping which requires extremely high computational resource. In our pilot test, it costs nearly 4 h to process a WES data (see the running time in Table S11). That makes it unsuitable Fig. 6. The "Gun-Bullet" model for illustrating the survival impact of HLA. (A) The brief chart illustrates how TMB and HLA exert the impact of survival through CYT. The impact of HLA alleles on OS may partially due to CYT. A strong HLA plus a high TMB could stimulate intensive immune response (CYT), leading to a better survival. Stronger HLA alleles are more able to present antigens (TMB) and subsequently induce higher CYT. The relationship between TMB and HLA allele is quite similar to the bullet and gun. A powerful gun plus high quantity bullets could maximize the force. The clipart in the figure were downloaded from http://clipart-library.com with the licence of copyright free for non-commercial use. (B) The typical 'strong' HLA B44 in early stage SKCM survival. Kaplan-Meier survival analysis and corresponding CYT comparison were implemented for the following three cohort pairs: B44-present vs B44-absent; B44-present&HighTMB vs B44-absent&LowTMB; B44-present&LowTMB vs B44-absent&HighTMB. (C) The typical 'weak' HLA A02 in early stage SKCM survival. Survival analysis and corresponding CYT comparison were implemented for the following three cohort pairs: A02-present vs A02-absent; A02-present&HighTMB vs A02-absent&LowTMB; A02-present&LowTMB vs A02-absent&HighTMB. Survival curves and violin plot were generated by python package lifelines and seaborn, respectively. One-tailed t test was implemented by R. High and Low TMB groups were determined by whether the TMB values higher or lower than the median TMB of the specific cancer condition. For example, the sample size of patients in early stage SKCM TCGA cohort is 237. The median TMB in early stage SKCM is 9.521. (see the full median TMB information in Table S10).
for performance assessment on the whole TCGA dataset. Since such limitations exist on tools selection, the tested eight tools may not represent the highest accuracy achieved by existing NGS-based HLA genotyping tools. However, they can be treated as the robust and efficient methods to generate the HLA genotyping results.
Distinct error patterns were detected in each HLA genotyping tools. Such error may not derive from the allele inference models utilized in the tools. The outdated HLA allele database used in the initiate reads alignment would be one of the issues. For example, SOAP-HLA was failed on calling DRB1:08, DRB1:11 alleles at the 4-digit resolution, since the background allele database in SOAP-HLA was updated 5 years ago. Although the background allele database is important for accurate genotyping, the documents on updating the database are always lacking in most of the tools. It would be a typical issue in current NGSbased HLA genotyping tools.
CYT was proposed as the underlying mechanism to interpret the survival impact of HLA alleles in the 'Gun-Bullet' model for fighting cancer. Since CD8+ T cell and NK cell were preferred immune cells for killing cancer in the microenvironment, it is important to investigate whether CYT detected in bulk tumor tissue could reflect the cell fractions of CD8+ T cell and NK cell in TILs. Referring to the deconvolution-based approach in CIBERSORT [43] with absolute mode, we retrieved the cell fraction of 22 immune cell types of each TCGA sample (downloaded at http://timer.c istrome.org/). As shown in Fig. S7, CYT was highly correlated with the cell fractions of CD8+ T cell and activated NK cell evaluated in the corresponding bulk tumor tissues (Pearson correlation coefficient = 0.73, P < 1E-20), while extremely low correlation was detected between the cell fraction of resting NK cell and CYT (Pearson correlation coefficient = 0.07). It indicates that CYT detected in the bulk tumor tissue could probably severed as an indicator of CD8+ T cell and activated NK cell in TILs. It is biologically reasonable to use CYT to interpret the survival impact of HLA alleles.
The clinical benefit effect of CYT was verified in 4 independent NSCLC and melanoma cohorts under ICB treatment as well as TCGA pan-cancer cohorts. In addition, the relationship among B44, TMB, and CYT on survival has been proved in TCGA earlystage SKCM cohort. It still lacks of a positive case from ICB-treated cohorts. In Allen's [23] melanoma cohort under anti-CTLA4 therapy, no significant higher CYT was detected in B44-present group, high-TMB group, and B44-present and High-TMB group. Therefore, no significant extended survival could be observed in any of these groups. It could be treated as a negative case that explained by 'Gun-Bullet' model. In Chowell's cohort [9], the extended survival was greater in B44-present and high-TMB group than each isolated situation. However, the CYT information is not available in such cohort. More ICB-treated cohorts with both RNA-seq (for CYT calculation) and WES (for HLA genotyping and TMB) data were needed to verified the 'Gun-Bullet' model.

Conclusions
In summary, we established an up to now most reliable TCGA HLA benchmark dataset, composing of concordance alleles generated from eight prevalently used HLA genotyping tools. Each HLA genotyping tool displayed specific error pattern, and it is better to generate reliable results with multiple tools. Regarding to the survival analysis, HLA alleles show distinct survival impact among cancers. CYT, as the readout of immune signaling, could partially bridge the HLA allele and survival. According to the 'Gun-Bullet' model, a strong HLA allele plus a high TMB could stimulate intensive immune CYT, leading to an extended survival. Although TMB is a critical factor in the stimulating immune response, HLA allele could amplify or attenuate the effect of TMB. In most of cancers with the low-TMB baseline, HLA allele is unlikely to serve as a prognostic biomarker on survival, since minimum antigens were provided for HLA presenting. Similar to the relationship between gun and bullet, the integrative effect of TMB and HLA allele was likely to stimulate immune response and influence survival. It indicated that incorporating TMB with HLA genotype helps to improve the survival prediction compared to HLA genotyping alone.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article. Fig. S1. The survival impact of CYT in TCGA cancer conditions. Fig. S2. The survival impact of HLA alleles (with consideration of LOH status) in TCGA cancer conditions. Fig. S3. The survival impact of CYT in two ICB-treated melanoma cohorts. Fig. S4. The survival impact of CYT in two ICB-treated NSCLC cohorts. Fig. S5. The survival curve in Chowell's ICB-treated melanoma cohort. Fig. S6. The survival plot and corresponding violin plot on CYT in Van Allen's melanoma cohort. Fig. S7. The correlation between CYT and cell fraction evaluated by CIBERSORT in absolute model. Table S1. The clinical information of cohorts under ICB therapy. Table S2. The statistics of HLA alleles in TCGA benchmark dataset. Table S3. Error rate of HLA Class I/II alleles. Table S4. The HLA class I error calling information in the eight tools. Table S5. The HLA class II error calling information in the six tools. Table S6. The HR in all of the TCGA cancer conditions generated by univariate CoxPH regression. Table S7. CYT of the HLA supertypes-present/absent cohorts in different cancer conditions. Table S8. The multivariable CoxPH analysis of HLA alleles and tumors in Fig. S2A. Table S9. CYT of the HLA supertypes-LOH/NotLOH cohorts in different cancer conditions. Table S10. The median TMB and CYT in each cancer condition. Table S11. The running time of tested HLA genotyping tools.