Genetic factors associated with cancer racial disparity – an integrative study across twenty‐one cancer types

Based on an integrative analysis using genomic data of 21 cancer types from TCGA, we identified biological factors contributing to racial disparity in cancer: XKR9 gene expression is significantly associated with overall survival for all cancers combined as well as for several individual cancers, suggesting that XKR9 could be a novel drug target for cancer immunotherapy.

It is well known that different racial groups have significantly different incidence and mortality rates for certain cancers. It has been suggested that biological factors play a major role in these cancer racial disparities. Previous studies on the biological factors contributing to cancer racial disparity have generated a very large number of candidate factors, although there is modest agreement among the results of the different studies. Here, we performed an integrative analysis using genomic data of 21 cancer types from TCGA, GTEx, and the 1000 Genomes Project to identify biological factors contributing to racial disparity in cancer. We also built a companion website with additional results for cancer researchers to freely mine. Our study identified genes, gene families, and pathways displaying similar differential expression patterns between different racial groups across multiple cancer types. Among them, XKR9 gene expression was found to be significantly associated with overall survival for all cancers combined as well as for several individual cancers. Our results point to the interesting hypothesis that XKR9 could be a novel drug target for cancer immunotherapy. Bayesian network modeling showed that XKR9 is linked to important cancer-related genes, including FOXM1, cyclin B1, and RB1CC1 (RB1 regulator). In addition, metabolic pathways, neural signaling pathways, and several cancer-related gene families were found to be significantly associated with cancer racial disparities for multiple cancer types. Single nucleotide polymorphisms (SNPs) discovered through integrating data from the TCGA, GTEx, and 1000 Genomes databases provide biologists the opportunity to test highly promising, targeted hypotheses to gain a deeper understanding of the genetic drivers of cancer racial disparity and cancer biology in general.

Introduction
Cancer is the second most common cause of death in the United States after heart disease [1]. It has a greater economic impact from premature death and disability than all causes of death combined worldwide [2]. It is well known that different racial groups have significantly different incidence and mortality rates for certain cancers [3][4][5][6][7][8][9]. For all cancer types combined, the incidence rates in the United States are 465.3, 463.9, and 291.7 per 100 000 people, for non-Hispanic white (NHW), non-Hispanic black (NHB), and Asian/Pacific Islander (API) populations, respectively [1]. The mortality rates for all cancers combined are 165. 4, 190.6, and 100.4 per 100 000 people for NHW, NHB, and API populations, respectively [1]. NHWs and NHBs have the highest incidence rates; NHBs have the highest mortality rates, followed by NHWs, and APIs have much lower incidence and mortality rates than both NHWs and NHBs.
Previous studies on the biological factors contributing to cancer racial disparity have generated a very large number (usually at the level of hundreds) of candidate factors, both genetic and epigenetic, with modest agreement among the results of these studies. This is likely due to the limited sample size in some of the previous studies as well as the differences in the populations from which the samples were selected. Each of the previous studies has addressed individual cancer types, and studies on different cancers have identified quite varied genetic and epigenetic factors. Furthermore, few studies have been conducted to investigate less common cancers.
Although the gene expression differences between tumors of different racial groups have been well established by previous studies in multiple cancer types, the drivers behind the gene expression differences have been poorly understood. To identify the potential genetic drivers of cancer racial disparity, we applied an integrative analysis approach using multiple datasets as illustrated in Fig. 1. We started from gene expression data (RNA-seq) in TCGA by conducting differential gene expression analysis (DGEA) comparing gene expressions of tumor samples from patients of different races. This generated differentially expressed (DE) genes between two specific groups which are defined by the races of the groups and the cancer type (or all cancer combined). Since some of the DE genes may be associated with race only, but not cancer, survival analysis using the DE genes was conducted to find DE genes that are significantly associated with overall survival of the cancer patients (XKR9 and CST1 were found in this study). To identify the driver genetic factors, we then used GTEx data to find SNPs correlated significantly with the expression of XKR9 or CST1. To ensure the SNPs are able to explain the KXR9 or CST1 differential expression, the consistency between differential gene expression with the proportions of the alternative allele in different race groups (using 1000 Genomes Project data) and the effect of the alternative allele to gene expression (using GTEx data) are checked, which further narrowed the list of SNPs down to a much smaller set of SNPs. Database (UCSC genome browser, GeneCards, etc.) searches were performed to collect more information of the genes and SNPs. The final step is literature search using BioKDE platform (https://biokde.com) to validate the discoveries and generate specific hypotheses which can be tested experimentally.
Differential gene expression analysis using all cancer types allowed us to substantially increase the sample size to effectively eliminate the number of false positives in identified differentially expressed genes (DEGs). Pathway analyses identified cancer-related pathways, which allow us to focus on genes and pathways that are truly associated with cancer racial disparities, instead of being associated with race only. We performed analyses using data with all cancers combined and data from individual cancer types. The majority of our analyses are focused on results taken from all cancers combined, and the companion website (https://rep.insilicom.com/cancer_ racial_disparity/index.html) contains details of all the results.
Our study is the first such large-scale study on the genetic factors of cancer racial disparity. It provides a comprehensive picture of the genetic factors associated with cancer racial disparity across multiple cancers. Furthermore, SNPs identified from the integrative analysis point to potential genetic drivers, which can be tested in future validation studies.

Materials and Methods
The data used in this study are described in Supporting Information.

Differential gene expression analysis
For the overall comparison with all cancers combined, Limma package [41] was used to perform three pairwise comparisons: Asian Americans (AS) vs. Caucasian Americans (CA), AS vs. African Americans (AA), and AA vs. CA. For the DGEA of individual cancer types, DESeq2 package [42] was used.

Heatmap and clustering analysis
We used Ward's method [43] for hierarchical clustering and Euclidean distance for clustering rows and columns. R package pheatmap (Perry M (2020). heatmaps: Flexible Heatmaps for Functional Genomics and Sequence Features. R package version 1.12.0) was used for plotting the heatmap.

Association of SNPs with gene expressions
Combining data obtained from GTEx eQTL studies, the 1000 Genomes Project, and TCGA, we infer specific SNPs that may be associated with the differential expression patterns observed from the differential gene expression analysis of TCGA gene expression data following a previously developed protocol [11,29]. Specifically, we searched for SNPs that may be associated with the initially identified DEGs. The conditions used for selecting SNPs required that the following should  all be consistent with one another: (1) the allele frequencies of the SNPs in different races (1000 Genomes Project), (2) the association between the genotype of the SNPs with differential expression patterns (GTEx), and (3) the differential expression patterns in different races (TCGA). For example, from the TCGA data we found that XKR9 is down-regulated in AS patient samples compared to AA and CA samples, and its expression is lower in AA samples compared to CA samples (Table 1). From the GTEx, we identified a SNP, rs17689585, which is associated with XKR9 expression, and the effect of this common allele is increased expression of the XKR9 gene. The frequencies of the most common allele are 0.0744, 0.2247, and 0.5149 for AS, AA, and CA racial groups, respectively. This SNP can potentially explain the differential expression of XKR9 across the three races. Manhattan plot showing the SNPs was generated using R package qqman (Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. biorXiv https://doi.org/10.1101/005165).

Bayesian network modeling
We used a biological network inference method, GRASP, which we recently developed [44] to build a Bayesian network (BN) model for selected genomic features using several types of genomic data available from TCGA, including mRNA-seq, microRNA-seq, protein expression, and DNA methylation data. We first used RNA-seq data to identify transcripts highly correlated with a feature of interest (XKR9 gene expression in this study). There are hundreds of genes with significant correlation coefficients with XKR9 as measured by adjusted p-values (padj ≤ 0.05). We chose relatively large correlation coefficient cutoffs to give us a small number of variables to control the size of the resulting network. Six transcripts (mRNA-seq) were selected whose absolute values of correlation coefficients with that of XKR9 are greater than 0.38. We then included other genomic features, including microRNAs, DNA methylations, and protein expressions that are highly correlated with these transcripts, which produced 10 micro-RNAs (absolute correlation coefficient > 0.45), 10 DNA methylation regions aggregated around genes (absolute correlation coefficient > 0.4), and 3 protein expressions (absolute correlation coefficient > 0.3). The details of the BN structure learning method are given in Ref. [44].
The companion website, https://rep.insilicom.com/ca ncer_racial_disparity/index.html, provides more details on the data and methods we used in this study.

Results
In this study, we performed integrative genomics analysis across three races: Asian American (AS), Caucasian American (CA), and African American (AA) for all 21 cancer types combined as well as each individual cancer type. The patients are races matched with AS:AA:CA of 1:1:4. There are totally 289 AS, 289 AA, and 1156 CA patients. From TCGA data, racial disparity in overall survival is evident when pooling patients from all the cancer types: AS has much higher survival probability than CA and AA, while CA and AA have comparable survival probability to each other ( Fig. 2A). For most of the analyses,  (Fig. 2B). A survival analysis was performed on the DEGs, and XKR9 was found to have a significant effect on survival when considering all of the cancer types combined (Fig. 2C) as well as in three specific cancer types (kidney renal papillary cell carcinoma (KIRP), kidney chromophobe (KICH), and liver hepatocellular carcinoma (LIHC)) (Figs S1A, S1B, and S1C). Reduced expression of XKR9 is always associated with higher survival probabilities. XKR9 was found to be down-regulated in AS cancer patients in both the AS vs. CA comparison (log2 fold change = À1.58; adjusted p-value = 1.50e-39) and the AS vs. AA comparison (log2 fold change = À1.08; adjusted p-value = 6.01e-08). CST1 was found to have different effect on survival in three individual cancer types with elevated express of CST1 associated with higher survival probability in BLCA and BRCA (the higher the expression of CST1, the higher the overall survival time of the patients), while reduced expression of CST1 associated with higher survival probability in LGG (the lower the expression of CST1, the higher the overall survival of the patients) (Figs S1D, S1E, and S1F). Several other genes known to be involved in cancer were also found to be differentially expressed (Table 1).
We then looked at the differentially expressed genes in individual cancer type and identified top 30 DEGs between AS and CA tumors in individual cancer types as shown in Fig. 2D,E based on a total of 63 comparisons (21 cancer types and three comparisons in each, see below for more details). We can see that most of the DEGs are down-regulated in AS tumors, which is consistent with what we have observed in previous studies [11,29]. The correlation matrix of the genes in Fig. 2D was calculated and plotted to provide an overview on the correlation of these DEGs (Fig. S3 in Supplementary Materials). Results of the other two pairwise comparisons are given in Supplementary Materials (Fig. S2).

Pathway and upstream regulator analysis
A pathway analysis using the Ingenuity Pathway Analysis (IPA) suite using differentially expressed genes from overall comparison showed that a number of pathways are dysregulated in multiple cancers in the AS vs. CA comparison (Fig. 3A) as well as in the other two comparisons. The majority of the pathways are down-regulated in AS tumors except for the SPINK1 pancreatic cancer pathway and the SPINK1 general cancer pathway. Among the dysregulated pathways are several metabolism-related pathways such as nicotine degradation II, LXR/RXR activation, serotonin degradation, melatonin degradation I, the superpathway of melatonin degradation, nicotine degradation III, thyroid hormone metabolism, and acetone degradation I. Several neural pathways were also found to be significantly dysregulated, including those known as the neuroprotective role of THOP1 in Alzheimer's disease, synaptogenesis signaling pathway, neuroinflammation signaling pathway, endocannabinoid neuronal synapse pathway, synaptic long-term depression, and neuropathic pain signaling in Dorsal horn neurons.

Differential gene expression analysis for individual cancer types
To identify DEGs that may be only specific to certain individual cancers, we performed a DGEA for individual cancers. Surprisingly, unlike the overall comparison with all cancer types, the comparisons of individual cancers generally produced a much larger number of DEGs with larger-fold changes and smaller p-values, despite utilizing much smaller sample sizes. To further focus the results of our analysis, we manually checked the DEGs and selected a small number of gene families whose members have similar differential expression patterns in more than one type of cancer. These gene families are more likely to be truly differentially expressed given these two conditions: (1) more than one member of these gene families have similar differential expression patterns in a particular comparison in one cancer type; and (2) members of these families have similar differential expression patterns in more than one cancer type. The gene families selected are kallikreins (KLKs), metallothioneins (MTs), mucins (MUCs), melanoma antigen gene (MAGEs), and C-X-C motif chemokine ligands (CXCLs). Figure 3B shows the heatmap of the KLK gene family for cancers that include at least one KLK gene detected as a DEG.
More discussions of this study are provided in Supporting Information.

Genetic factors associated with breast cancer racial disparity-a case study
In addition to the analysis of the combined data from all 21 cancer types, we performed similar analyses on individual cancers and made all of the results available on the companion website. Each cancer type has a separate webpage documenting all of the results related to that cancer type. Researchers are encouraged to explore the website to mine new discoveries on cancer racial disparities specific to individual cancers. Here, we use breast cancer as an example illustrating how the individual cancer results can be explored. For breast cancer, there are a total of 781 CA, 184 AA, and 61 AS patients. The differential gene expression analysis returned 1319, 1309, and 1458 DEGs, for the AS vs. CA, AA vs. CA, and AS vs. AA comparisons, respectively, using a fold change of 2.0 (|log2 fold change|>= 1.0) and an adjusted p-value <= 0.05. Clicking on gene names in the table will display the box plot and histogram for the expression of the genes across the three racial groups. This can be quite informative as some DEGs are false positives driven by some outlier genes, which could not be effectively detected by the DESeq2 program. Visual inspection will identify such genes. The 'NCBI' and 'GeneCards' links can also take users to the corresponding webpage of the genes at NCBI or the GeneCards database. The Venn diagram shows the overlap among the three comparisons, the heatmap shows the expression of DEGs, and the survival analysis plot shows the survival curves for patients across the three racial groups. In addition, we also performed a pathway analysis using the IPA suite, which includes dysregulated pathways, upstream regulators, and diseases/functions related to the DEGs. A graph containing upstream regulators and the genes they regulate is shown in Fig. 3C.

SNPs associated with differentially expressed genes
A fundamental question of disease studies is to understand how genotypes affect the phenotypes of the disease of interest. It has been well established that common genetic variants account for differences in gene expression among ethnic groups [45]. In this study, we focused on two phenotypes of interest: the outcome disparity among CA, AA, and AS cancer patients and the genome-wide expression profiles of all the transcripts from tissue samples of twenty-one cancers. To identify genotypes associated with these phenotypes, we combined gene expression data with eQTL data from the GTEx project and SNP allele frequency information from the 1,000 Genomes Project [11,29] (see Materials and Methods for details). Figure 3D displays the SNPs identified from the AS vs. CA comparison using a Manhattan plot. We can see that many SNPs are clustered around certain regions, indicating they are likely to be truly associated with cancer racial disparity. The determination of whether or not they are driver SNPs would require further experimental studies.
3.6. Understanding the mechanism of the association between XKR9 and cancer health disparity using Bayesian network models XKR9 was found to be differentially expressed in all three pairwise comparisons of combined data for all cancers. XKR9 expression was also found to be associated with overall survival. Several SNPs associated with the differential expression of XKR9 were discovered using an integrative analysis combining data from the TCGA, GTEx, and 1000 Genomes Project databases. The allele frequencies in different races are also consistent with their differential expression patterns, indicating the SNPs are potential genetic drivers for cancer racial disparities across multiple cancer types.
A literature search for XKR9 returned little information on its function and its relationship with cancer. Using mRNA-seq, microRNA-seq, protein expression, and DNA methylation data obtained from TCGA, we inferred a Bayesian network (BN) structure (Fig. 3E) for the selected genomic features using our BN structure learning method, GRASP, described in Method. The learned network suggests that XKR9 is regulated by a regulatory protein, cyclin B1, which was regulated by forkhead box M1 (FOXM1). Regulation of cyclin B1 by FOXM1 is well established in literature [46,47]. FOXM1 (forkhead box protein M1) is an important proliferation-associated transcription factor widely spatiotemporally expressed during cell cycle. It is involved in the processes of cell proliferation, self-renewal, and tumorigenesis [48]. Due to its potential role as molecular target in cancer therapy, FOXM1 was named the Molecule of the Year in 2010 [49,50]. In the BN, XKR9 directly regulates LACTB2, which regulates other genes and microRNAs including a tumor suppressor gene, RB1CC1. RB1CC1 is a potential regulator of the tumor suppressor, RB1. From both upstream and downstream proteins/genes identified by the BN, XKR9 likely plays some intriguing functions in cancer. The BN model links XKR9 expression with tumor growth and apoptosis, which may explain its effect on overall survival of cancer patients. The model also provides guidance for formulating well-defined hypotheses for future validation studies.

Discussion
Despite extensive previous studies on the topic, a substantial proportion of cancer racial disparities still cannot be explained by known factors. Many previous studies have identified biological factors associated with the disparities. However, the agreement among these studies is rather poor. Limited sample sizes, sampling population differences, and/or randomness in multiple comparisons may have all contributed to the discrepancy. Most previous studies have focused on individual cancers and identified quite different genetic factors for different cancer types. A natural question one may ask is as follows: Are these racial disparities in different cancer types isolated events? It is well known that cancers from different organs can share very similar characteristics if they have similar driver mutations. The genetic factors affecting cancer racial disparities likely act across multiple cancers as well. Another question is as follows: With so many candidate genetic factors discovered in the past, which ones are the drivers? Identification of driver genetic factors that cause cancer racial disparity would help researchers design effective approaches to reduce or eliminate the disparities.
In this study, we used TCGA genomic data from a total of 7,789 patient samples representing twenty-one cancer types. Our study discovered a number of genes which are differentially expressed in multiple cancers between different racial groups. These genes are more likely to be truly associated with cancer racial disparity compared to the genes identified using smaller sample sizes from studies examining individual cancer types. One of the DEGs, XKR9, was found to be associated with overall survival for all of the cancers combined as well as for three individual cancer types. We further identified pathways differentially regulated across multiple cancer types. An additional differential expression analysis using individual cancers identified more DEGs, gene families, and pathways in individual cancer types. Combining data from three major public databases-TCGA, GTEx, and 1000 Genomes-allowed us to find promising candidate SNPs, which could be the genetic drivers of cancer racial disparity. We have put all of the results on a companion website available at https://rep. insilicom.com/cancer_racial_disparity/index.html. There is still much to be discovered from our analysis results, and we welcome cancer researchers to utilize these resources to make new discoveries. Below, we discuss some of our discoveries in more detail.
We used UCSC genome browser to find the location of the SNP, rs17689585, which was found to be associated with XKR9 gene expression and may be a potential driver SNP for cancer racial disparity (Fig. S4). The SNP overlaps with gene, AC015687.1, which is a novel transcript. Searching AC015687.1 at GeneCards database (https://www.genecards.org/cgi-bin/carddisp. pl?gene=ENSG00000285579) showed that this gene is an enhancer for XKR9. A few other important transcription factors bind this region of the genome, including FOXA2, RXRA, HNF4A, YY1, ZNF644, FOXA1, SP1, SOX13, and HNF4G. Among them, HNF4A, HNF4G, and FOXA2 correlate relatively strongly to XKR9, indicating they may play regulatory roles on XKR9 (Fig. S5). These discoveries point to some actionable hypotheses, which can be tested experimentally. For example, site-directed mutagenesis could be done to see whether the SNPs in this region indeed affect XKR9 expression and how they affect the physiology of cells. One can also check whether the mutations regulate XKR9 expression through modifying the binding of any of the transcription factors.
Given that XKR9 (XK related 9) could be the gene directly regulated by one or more of the transcription factors (FOXA2, RXRA, HNF4A, YY1, ZNF644, FOXA1, SP1, SOX13, and HNF4G) through one or more of the SNPs (multiple SNPs around chr8:71579362 and chr8:71917527) to cause some of the observed cancer racial disparities, we conducted literature search to find more about the links between XKR9 and cancer. It has been shown that apoptotic cells expose phosphatidylserine (PtdSer) on their surface as an 'eat me' signal and mammalian Xk-related (Xkr) proteins promote apoptotic PtdSer exposure [51,52]. A recent study found that this pathway can be hijacked by tumor cells to suppress the development of immunity toward tumor cells, which implies that PtdSer blockade by the therapeutic administration of PS-targeting agents may restore pathogen and tumor immunity [53]. Our study indicates another opportunity by targeting on XKR9 to regulate the exposure of PtdSer on cell surface. The observation that higher expression of XKR9 is associated with lower survival probability also suggests that targeting on XKR9 may be worth exploring as a potentially new therapeutic direction in cancer immunotherapy.

Conclusions
In this study, we found that higher expression of CST1 is positively correlated with longer overall survival for breast cancer (BRCA) and bladder cancer (BLCA), while it is negatively correlated with overall survival for low-grade glioma (LGG). This is in contradiction with a previous study indicating that elevated CST1 expression promotes breast cancer progression and predicts a poor prognosis [54]. Literature review showed that CST1 may play different roles in different cancers [55][56][57][58], but failed to reconcile the discrepancy between previous studies and ours. Further experimental work is needed to shed more light on the role CST1 plays in cancer.
Our study identified several neural pathways differentially regulated across multiple cancer types. These include the pathways known as the neuroprotective role of THOP1 in Alzheimer's disease, synaptogenesis signaling pathway, neuroinflammation signaling pathway, endocannabinoid neuronal synapse pathway, synaptic long-term depression, and neuropathic pain signaling in Dorsal horn neurons. Recent studies have found that the central nervous system and sympathetic nervous systems may play important roles in cancer's genesis, development, metastasis, and response to treatment [59][60][61]. The neurobiology of cancer is an emerging discipline spawning new perspectives in oncology. The discovery of multiple pathways related to neural systems that are associated with different cancer types indicates that neural systems may contribute to cancer racial disparities. It also confirms that these neural pathways may indeed play important roles in cancer. There have been few discoveries in previous cancer racial disparity studies pointing to neural pathways as associated genetic factors. Whether they act as drivers will require additional investigation in the future.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article. Fig. S1. Survival plots of XKR9 and CST1. Fig. S2. Heatmap of the selected top 30 differentially expressed genes. Fig. S3. The correlation matrix of the expressions of the genes in Figure 1D. Fig. S4. Location of the SNP, rs17689585, associated with XKR9 gene expression. Fig. S5. The correlation matrix of the expressions of transcription factors that may regulate XKR9. Table S1. Genetic and epigenetic factors reported in previous studies that may be associated with racial disparity of cancer .  Table S2. The pathways associated with genes identified through Bayesian network modeling of XKR9. Table S3. Sequence homology-based prediction of damaging coding SNPs using SIFT. Table S3. Sequence homology-based prediction of damaging coding SNPs using SIFT.