Potential prognostic impact of EBV RNA‐seq reads in gastric cancer: a reanalysis of The Cancer Genome Atlas cohort

Epstein–Barr virus (EBV)‐associated gastric cancer (EBVaGC), whose prognosis remains controversial, is diagnosed by in situ hybridization of EBV‐derived EBER1/2 small RNAs. In The Cancer Genome Atlas (TCGA) Stomach Adenocarcinoma (STAD) project, the EBV molecular subtype was determined through a combination of multiple next‐generation sequencing methods, but not by the gold standard in situ hybridization method. This leaves unanswered questions regarding the discordance of EBV positivity detected by different approaches and the threshold of sequencing reads. Therefore, we reanalyzed the TCGA‐STAD RNA sequencing (RNA‐seq) dataset including 375 tumor and 32 normal samples, using our analysis pipeline. We defined a reliable threshold for EBV‐derived next‐generation sequencing reads by mapping them to the EBV genome with three different random arbitrary alignments. We analyzed the prognostic impact of EBV status on the histopathological subtypes of gastric cancer. EBV‐positive cases identified by reanalysis comprised nearly half of the cases (49.6%) independent from infiltrating lymphocyte signatures, and showed significantly longer overall survival for adenocarcinomas of the ‘not‐otherwise‐specified’ type [P = 0.016 (log‐rank test); hazard ratios (HR): 0.476; 95% CI: 0.260–0.870, P = 0.016 (Cox univariate analysis)], but shorter overall survival for the tubular adenocarcinoma type [P = 0.005 (log‐rank test); HR: 3.329; 95% CI: 1.406–7.885, P = 0.006 (Cox univariate analysis)]. These results demonstrate that the EBV positivity rates were higher when determined by RNA‐seq than when determined by EBER1/2 in situ hybridization. The RNA‐seq‐based EBV positivity demonstrated distinct results for gastric cancer prognosis depending on the histopathological subtype, suggesting its potential to be used in clinical prognoses.

Epstein-Barr virus (EBV), formally designated as human herpesvirus 4, whose genome length is 172 kbp, infects~90% of Homo sapiens worldwide and establishes a lifelong persistent infection, typically with no observable symptoms in immunocompetent hosts [1][2][3][4]. Memory B cells are usually the host cell species responsible for establishing latent infections [4]. Interestingly, EBV can be detected in lymphoma cells that originate from B cells, and in the rare type of lymphomas originating from T cells or NK cells, nasopharyngeal cancer, and gastric cancer (GC) [3,4], suggesting that EBV is capable of infecting normal and/or malignant cells, including B, T, NK, and epithelial cells. Numerous studies on EBV have shown carcinogenic activities in EBV genome-encoded products [4,5]. Hypermethylation of tumor suppressor genes has been observed in EBV-associated cancers [4,[6][7][8], suggesting that aberrant methylation might be a critical mechanism of EBV-related tumorigenesis.
Gastric cancer is a heterogenous disease, with subtypes, such as diffuse or intestinal, generally being based on histopathological criteria [9]. EBV is detected in tumor cells in 2-20% of GC cases [5,7], and most cases of classical EBV-associated (EBVa) GC exhibit diffuse histology accompanied by lymphocyte accumulation, which is known as lymphoepithelioma-like carcinoma [10,11]. However, EBV-associated GC (EBVaGC) includes other histological subtypes [12], indicating its intrinsic heterogeneity.
Classical EBVaGC has traditionally been diagnosed via in situ hybridization, which detects the presence of EBV genome-derived EBV-encoded RNA (EBER)1/2 small RNAs in GC cells [13,14]. Based on molecular evidence, as outlined by The Cancer Genome Atlas (TCGA) Stomach Adenocarcinoma (STAD) study [15], four molecular subtypes of GC have been proposed, including tumors positive for EBV, microsatellite-unstable tumors, genomically stable tumors, and tumors with chromosomal instability. The EBV-positive subtype in the TCGA-STAD study was determined by pairwise comparisons of EBV read counts obtained using four sequencing platforms [whole genome, exome, RNA sequencing (RNA-seq), and micro-RNA-seq] but not by EBER in situ hybridization [15,16]. Despite the discordant results for EBV positivity among these sequencing methods, positivity was defined by dichotomous values from pairwise plots between two sequencing methods without careful consideration of the small number of sequencing reads [15]. Although the concordance between EBV positivity obtained by the TCGA methods and in situ hybridization has been validated [16], it remains unclear whether the small amounts of EBV sequencing reads are sufficient for determining prognosis. Additionally, in some reanalysis studies of the TCGA RNA-seq data, the cutoff levels of next-generation sequencing (NGS) reads for EBV positivity were determined by arbitrary criteria that excluded small amounts of EBV reads [17][18][19].
Several studies have shown that the overall survival (OS) for EBVaGC is longer than in other types of GC that are not associated with EBV [12,20,21]. However, it has shown that the prognosis of EBVaGC is identical [22][23][24][25], or shorter [26,27]. The first report based on the TCGA-STAD dataset showed that GC prognosis did not differ between the EBV molecular subtype and the other three subtypes [15]. However, when a gene expression data-based subtype prediction model was used to define the EBV molecular subtype, this subtype was associated with a better prognosis in two large independent cohorts [28]. The underlying cause of these discrepancies has yet to be determined.
Recently, Liu et al. [29] reported a high-quality validated survival outcome based on the entire TCGA dataset: The median follow-up time of STAD was calculated as 14.0 months. Like other TCGA datasets, the OS data from the STAD study were used for survival analysis. Furthermore, 'primary diagnosis' data corresponding to histopathological information in the clinical TCGA datasets were harmonized with the terms in the WHO International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3; GDC data release 13.0 on September 27, 2018), which allowed us to perform analysis based on current histopathological classifications, rather than on the Lauren scheme. We thus reanalyzed the TCGA-STAD RNA-seq data using our own analysis pipeline and successfully obtained wide-ranging EBV-derived NGS reads. The reliability of a small number of these reads was examined with the randomness properties of the mapping tool. Considering the heterogeneity of EBVaGC, we performed survival analysis on the EBV-positive and EBV-negative groups based on the high-quality survival outcome [29] and the updated histopathological diagnosis data.

Ethics statement
This study was approved by the Ethics Review Board of Tokyo Metropolitan Institute of Medical Science (IRB#18-18) and the Tokyo Metropolitan Cancer and Infectious Diseases Center Komagome Hospital (IRB#1563). Approval for access to anonymized RNA-seq BAM files was obtained from dbGaP [30].

TCGA-STAD data and molecular subtype
Sequence data in BAM file format (407 files) and accompanying clinical data were downloaded from the GDC data portal (https://portal.gdc.cancer.gov). We then linked the obtained files using patient identifiers. Among these files, 375 files were derived from tumor samples and the rest from normal samples. BAM files were converted into FASTQ with bam2fastq (https://github.com/jts/bam2fastq). The data used were from cancer specimens and noncancer control specimens.

EBV detection
We used hg38, constructed by the Genome Reference Consortium, as a human genome reference sequence. We utilized virus detection methods for cancer RNA-seq data as described elsewhere [17,18,31,32]. Briefly, reads that were not mapped to the human genome reference sequence were excised from the BAM files and mapped to virus references. Raw sequencing reads were mapped to reference human genome hg38 with STAR v.2.4.2a [33], and unmapped reads from the first step were again mapped to hg38 with BOWTIE2 v.2.2.8 [34] using the '--very-sensitive-local' option to perform high sensitivity mapping. After trimming low-quality bases and adaptor sequences with CUTADAPT v.1.8.1 [35], unmapped reads from the previous step were mapped to the virus reference FASTA files listed in Table S2 using BOWTIE2 with the '--very-sensitive-local' option. SAMTOOLS v.1.8 [36] was used to convert the SAM file to BAM and sort the resulting BAM file.

Verification of EBV reads
In order to verify the small number of detected EBV reads obtained from the analysis pipeline in this study, we implemented our pipeline three times for all samples using the '-non-deterministic' option of BOWTIE2, which yielded different results due to the arbitrary choice of alignment (http:// bowtie-bio.sourceforge.net/bowtie2/manual.shtml).

Visualization of EBV reads
Mapped EBV reads were visualized using the Integrative Genomics Viewer [37] with the EBV reference genome.

Evaluation of lymphocyte infiltration into tumors
We obtained TCGA-Pan-Cancer Clinical Data, including extended prognostic data from the supplementary information of a previous report [29] and TCGA immunogenomics data reported in the supplemental table in Thorsson et al. [38], which includes intratumoral infiltrating lymphocyte data and intratumoral immune cell scores estimated using CIBERSORT [39].

Statistical analysis
Statistical analyses were performed with R (The R Foundation for Statistical Computing, Vienna, Austria). In multigroup comparisons, we used the Kruskal-Wallis test with a post hoc Dunn's test to estimate the difference between two groups. To test for correlation between EBV read and intratumoral immune cell scores, we used the Spearman correlation coefficients. Survival curves with mortality hazard ratios (HRs) were generated using the Kaplan-Meier method and the univariate Cox proportional hazard model unless otherwise mentioned, applying inverse probability weighting (IPW) in order to adjust for potential imbalances due to confounding factors [40,41]. Propensity scores for calculating IPW were obtained from logistic regression using a set of covariates deemed likely to have affected the outcome, including age, gender, race, tissue or organ of origin, and tumor stage. We acknowledge that there may have been other differences that the TCGA-STAD project did not measure. The log-rank test was used to estimate statistical significance in survival analyses.

Reanalysis of TCGA-STAD RNA-seq data and threshold of EBV reads
Clinicopathological characteristics of patients in the TCGA-STAD dataset are shown in Table S1. We first mapped TCGA-STAD RNA-seq reads to the human genome and obtained nonhuman reads; the analysis pipeline is shown in Fig. 1. Unmapped reads were mapped to the EBV reference genomes listed in Table S2. Figure 2A shows the number of EBV-positive cases both in the TCGA-STAD study and in our analysis. Our analysis pipeline detected EBV in 49.6% cases, in contrast to the TCGA-STAD frequency of 9% ( Fig. 2A). Cumulative transcriptional profiling of the EBV genome revealed EBV reads that were in accordance with a previous report (Fig. S1) [15]. To confirm the reliability of the small number of EBV reads in this study, we focused on the mapping process of our analysis pipeline by using BOWTIE2, which employs a random process (BOWTIE2 manual, URL: bowtie-bio.sourceforge.net/bowtie2/manual. shtml) that yields different results that are filtered out for each calculation. To exclude ambiguous mapping results, we carried out three rounds of mapping with BOWTIE2 for all samples; Fig. 2B shows the proportion of results in which EBV was absent. The proportion of cases with negative results was the highest (35%) when the maximum EBV read count was 1 (Fig. 2B). reads was 2 or more, the proportion of negative results was 16% and 5%, respectively. Figure 2C shows corrected EBV read counts with respect to the total read counts in each case (particles per billion reads, ppb). The maximum corrected EBV level of one-read cases was equivalent to approximately half that of the two-read cases and was lower than that of all three-read cases. We thus provisionally adopted the maximum level in one-read cases as the threshold (7.1 ppb) (Fig. 2C), which resulted in the identification of 186 EBV-positive cases and 189 EBV-negative cases (Fig. 2D). Furthermore, we analyzed the normal sample BAM files in TCGA-STAD using the same analysis pipeline (Fig. 2E). The obtained EBV reads from the normal samples were comparable to those from the EBV-negative tumor samples. The number of EBV reads from EBV-positive tumor samples was significantly higher than that of the normal samples and EBV-negative tumor samples (P < 0.001). The lymphocyte infiltration score of EBV-positive cases in the TCGA-STAD report was significantly higher than that of negative cases. The lymphocyte infiltration scores in the present study, excluding positive cases from the TCGA-STAD report and > 200 ppb cases in this study, did not differ from those of the negative cases (Fig. 2F). Available intratumoral immune cell scores showed that there was no correlation between EBV reads and the number of intratumoral B-cell indicators [38] (Fig. 2G,H). We further analyzed the correlation between EBV reads and CD4 + T-cell (Fig. S2A), CD8 + T-cell (Fig. S2B), and natural killer (NK) cell ( Fig. S2C) scores, but no correlation was found.

Survival analysis in reclassified groups according to histopathological type
We performed a Kaplan-Meier survival analysis of whole cases. Covariates were adjusted by IPW. There was no difference in adjusted OS between the EBVpositive and EBV-negative groups in this study [P = 0.956 (log-rank test); HR: 1.008, 95% confidence interval (CI) 0.723-1.406, P = 0.961 (Cox univariate analysis); Fig. S3A]. We confirmed that there were no differences between the molecular EBV subtype and others, as previously reported [15] (Fig. S3B). Since EBVaGC is associated with diffuse-type GC [42][43][44], we examined whether EBV positivity in each histopathological type affects survival. According to the criterion of whether a luminal structure (i.e., intestinal) was involved, we provisionally reclassified the ICD-O-3-harmonized classification into diffuse, intestinal, mixed, and not-otherwise-specified (NOS) types (Table S3). This reclassification is used hereinafter. The average number of EBV reads in EBVpositive cases did not differ among the reclassified pathological groups (Fig. 3A). Lymphocyte infiltration scores [38] were higher in the diffuse type of EBV-positive GC than in the intestinal type, irrespective of age (Fig. S4). Statistically significant differences in lymphocyte infiltration scores were observed between EBVpositive and EBV-negative cases only in the intestinal type ( Fig. 3B univariate analysis); Fig. 3G] but a favorable prognostic effect in the NOS type [P = 0.027 (log-rank test); HR 0.520; 5% CI: 0.2907-0.9291, P = 0.027 (Cox univariate analysis); Fig. 3H]. The statistical power in the mixed type was insufficient due to the small sample size (Fig. 3I). The unadjusted Kaplan-Meier curves were comparable to the IPW-adjusted results described above (Fig. S5). To validate the prognostic effects of a small number of EBV reads, we carried out a Kaplan-Meier survival analysis for groups classified by the number of EBV reads: those with a small number (7.1-100 ppb) and a large number (more than 100 ppb) of reads. Unlike in Fig. 3G,H, there was no statistically significant difference between the two groups, likely due to the reduction in sample size (Fig. 3J,K). In multivariate Cox survival analysis, the EBV-positive group with a small number of reads (7.1-100 ppb) showed favorable prognosis in NOStype GC, as shown in Fig. 3H (HR: 0.528, 95% CI: 0.283-0.983, P = 0.044) ( Table 1).  shows distinct proportions of G1-GX in the provisional histopathological categories with their original ICD-O-3-harmonized histopathological types, suggesting a difference between the histopathological classification and tumor grade. The Kaplan-Meier curves in Fig. 3F-I were further subdivided by G1/G2 (well/moderately differentiated) and G3 (poorly differentiated). Statistically significant differences in prognosis (shorter OS in intestinal, longer OS in NOS) were observed only in G1/G2 (P = 0.008, 0.015, respectively) but not in G3 (P = 0.441, 0.275, respectively; Fig. 4). The unadjusted Kaplan-Meier curves (Fig. S7) were comparable to the IPW-adjusted results (Fig. 4).
We continued to investigate the correlation between EBV reads and tumor grade. There was no statistically significant difference between G1/G2 and G3 based on the number of EBV reads (Fig. S8A). The lymphocyte infiltration score also demonstrated no significant difference between EBV-negative and EBV-positive samples in G1/G2 and G3 (Fig. S8B,C). The Kaplan-Meier survival analysis demonstrated no difference in OS between the EBV-positive and EBV-negative groups in this study both in G1/G2 and in G3, regardless of IPW adjustment (Fig. S8D-I).

Survival analysis in original ICD-O-3-harmonized histopathological types with TCGA-provided tumor grade
We next analyzed the prognosis of the original ICD-O-3-harmonized histopathological types that correspond to intestinal and NOS types. Of the assigned ICD-O-3-harmonized histopathological types in each case, we excluded 'Adenocarcinoma with mixed type' and 'Papillary adenocarcinoma NOS' because of their extremely small sample sizes (n = 1 and 5, respectively). Survival analysis was performed for the   remaining cases after adjusting for explanatory variables, although some types had low statistical power as a result of their very small sample size (mucinous adenocarcinoma, n = 19; signet ring cell carcinoma, n = 12). For 'tubular adenocarcinoma' (n = 65), which corresponds to intestinal-type GC, EBV-positive cases demonstrated significantly shorter OS [P = 0.005 (log-rank test); HR 3.329; 95% CI: 1.406-7.885, P = 0.006 (Cox univariate analysis); Fig. 5B], but this was not the case for the 'adenocarcinoma intestinal' type (n = 71; Fig. 5A). For 'Adenocarcinoma NOS' (n = 120), EBV-positive cases showed significantly longer OS [P = 0.016 (log-rank test); HR: 0.476; 95% CI: 0.260-0.870, P = 0.016 (Cox univariate analysis); Fig. 5C]. As seen in the analysis of provisional histopathological categories (Fig. 4), these statistically significant differences in prognoses were observed only in G1/G2 (P = 0.004, 0.008, respectively) but not in G3 (P = 0.148, 0.275, respectively; Fig. 5D-I). The unadjusted Kaplan-Meier curves were comparable to the IPW-adjusted results (Fig. S9A-I). For the 'carcinoma diffuse' type (n = 59), which was placed in the diffuse classification, we observed no difference between EBV-positive and EBV-negative groups [P = 0.376 (log-rank test); HR: 0.416, 95% CI: 0.156-1.109, P = 0.079 (Cox univariate analysis); Fig. S10A]. In the 'signet ring cell carcinoma' and the 'mucinous adenocarcinoma' types, we were unable to obtain statistically meaningful results due to the small sample size and because patients were censored early on (Fig. S10B,C). The unadjusted Kaplan-Meier curves were once again comparable to the IPW-adjusted results (Fig. S10A-C).

Discussion
This is the first report to examine the reliability of EBV reads in very small quantities and the prognostic impact of RNA-seq-determined EBV positivity. Our reanalysis of the TCGA-STAD RNA-seq dataset yielded a wide range of EBV reads, irrespective of the degree of intratumoral lymphocyte infiltration. EBV positivity had contrasting prognostic effects on the 'tubular adenocarcinoma' and 'adenocarcinoma NOS' types of GC, which may explain previous conflicting reports on the prognostic value of EBV in GC.
RNA-seq-based EBV-positive GC may exist as a distinct entity that includes EBER1/2 in situ hybridization-positive EBVaGC. The 9% positivity rate of EBV in the original TCGA-STAD study [15] or the 7-10% as determined by in situ hybridization [45,46] were much lower than the 49.6% observed here and thẽ 40% in another study [47] that were obtained by reanalysis of a portion of the TCGA-STAD data. These variable results suggest that the assignment of an appropriate cutoff value is essential and that the proportion of EBVaGC is higher than previously recognized. We should note that the definition of EBV positivity depends on the EBV detection method-for example, EBER1/2 in situ hybridization vs. RNA-seq. The latter detects various RNA species derived from the EBV genome, but not small RNAs such as EBER1/2. On the other hand, EBER1/2 in situ hybridization detects a predetermined target and may only detect small RNAs when there are a sufficient number of copies. There have been no previous reports that EBER expression is necessary and sufficient for confirming the existence of EBV in cancer tissue. In our study, we could not detect any correlation between EBV transcripts and intratumoral immune scores of T cells, B cells, and NK cells, which are the primary or potential hosts of latent EBV [4], suggesting that EBV reads were not derived from these intratumoral immune cells. These results also imply that EBV reads could be obtained from tumor cells, even at low read counts. The small number of EBV reads in tumor tissues may be partly explained by the hit-andrun hypothesis of herpesvirus [48][49][50], which posits that the virus inflicts genetic or epigenetic injury to infected cells. The agreement between the reliable results obtained from repeated stochastic mapping and the transcriptional profile of the EBV genome [15] (Fig. S1) allowed us to consider the EBV reads in this study as actual EBV transcripts. To validate the method for both false-positive and false-negative EBV transcripts in the future, an internal standard RNA sample that includes the relevant EBV transcripts may be a prerequisite for EBV detection in cancer tissues.
The high rate (46.9%) of RNA-seq-based EBV positivity in this study may be useful for estimating GC prognosis. We have shown that the prognoses of cases with EBV reads were poor in the tubular adenocarcinoma type of GC, but were improved in adenocarcinoma NOS. The variable prognostic significance of EBV positivity may be attributable to the specific genetic and/or epigenetic features that underlie tumor histopathology and the immune response. Given the inconsistency between the G1/G2/G3 scale and the ICD-O-3-harmonized histopathological type for 'adenocarcinoma NOS' (Figs 4 and 5, and Fig. S6), this GC type is most likely intrinsically heterogeneous and may be further classified by distinct tumor biology and immune features. Additional studies are needed to substantiate and confirm this at the molecular level.
It is important to note that the GC classifications made by the Lauren and WHO schemes are discordant. According to the Lauren classification, GCs are separated into two main histological types, diffuse and intestinal, in addition to the mixed and indeterminate types. In this study, we recategorized eight ICD-O-3-derived groups into four new groups: diffuse, intestinal, NOS, and mixed (Table S3). These groups resemble but are not identical to the Lauren scheme. As such, extrapolation of the results of this study to GC studies performed in different pathological contexts must be done with caution.
Additionally, while IPW adjustment can reduce the impact of potentially confounding factors, it is subject to biases from unobserved differences. Furthermore, it cannot be ruled out that EBV read detection is due to systematic contamination from laboratory reagents and processes and/or the environment, although, considering the high EBV infection rate in humans, our results can be regarded as plausible. Moreover, EBV has not been detected in reports on contamination [51][52][53]. Lastly, EBV positivity requires validation in different cohorts.

Conclusion
A reanalysis of the TCGA-STAD RNA-seq dataset revealed a wide range of EBV reads that are independent of the degree of infiltrating lymphocyte signatures. We found that EBV-positive cases had variable prognoses depending on their histopathological subtype. Thus, RNA-seq-based EBV positivity may be a useful tool for estimating the prognosis of specific histopathological types of GC.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article.          Table S1. Baseline characteristics of patients in the Cancer Genome Atlas-Stomach Adenocarcinoma (TCGA-STAD) dataset. Table S2. List of EBV DNA reference sequences for mapping next-generation sequencing short reads. Table S3. Provisional regrouped histopathological categories.