Development and validation of a four‐lipid metabolism gene signature for diagnosis of pancreatic cancer

Abnormal lipid metabolism is closely related to the malignant biological behavior of tumor cells. Such abnormal lipid metabolism provides energy for rapid proliferation, and certain genes related to lipid metabolism encode important components of tumor signaling pathways. In this study, we analyzed pancreatic cancer datasets from The Cancer Genome Atlas and searched for prognostic genes related to lipid metabolism in the Molecular Signature Database. A risk score model was built and verified using the GSE57495 dataset and International Cancer Genome Consortium dataset. Four molecular subtypes and 4249 differentially expressed genes (DEGs) were identified. The DEGs obtained by Weighted Gene Coexpression Network Construction analysis were intersected with 4249 DEGs to obtain a total of 1340 DEGs. The final prognosis model included CA8, CEP55, GNB3 and SGSM2, and these had a significant effect on overall survival. The area under the curve at 1, 3 and 5 years was 0.72, 0.79 and 0.87, respectively. These same results were obtained using the validation cohort. Survival analysis data showed that the model could stratify the prognosis of patients with different clinical characteristics, and the model has clinical independence. Functional analysis indicated that the model is associated with multiple cancer‐related pathways. Compared with published models, our model has a higher C‐index and greater risk value. In summary, this four‐gene signature is an independent risk factor for pancreatic cancer survival and may be an effective prognostic indicator.

Abnormal lipid metabolism is closely related to the malignant biological behavior of tumor cells. Such abnormal lipid metabolism provides energy for rapid proliferation, and certain genes related to lipid metabolism encode important components of tumor signaling pathways. In this study, we analyzed pancreatic cancer datasets from The Cancer Genome Atlas and searched for prognostic genes related to lipid metabolism in the Molecular Signature Database. A risk score model was built and verified using the GSE57495 dataset and International Cancer Genome Consortium dataset. Four molecular subtypes and 4249 differentially expressed genes (DEGs) were identified. The DEGs obtained by Weighted Gene Coexpression Network Construction analysis were intersected with 4249 DEGs to obtain a total of 1340 DEGs. The final prognosis model included CA8, CEP55, GNB3 and SGSM2, and these had a significant effect on overall survival. The area under the curve at 1, 3 and 5 years was 0.72, 0.79 and 0.87, respectively. These same results were obtained using the validation cohort. Survival analysis data showed that the model could stratify the prognosis of patients with different clinical characteristics, and the model has clinical independence. Functional analysis indicated that the model is associated with multiple cancer-related pathways. Compared with published models, our model has a higher C-index and greater risk value. In summary, this four-gene signature is an independent risk factor for pancreatic cancer survival and may be an effective prognostic indicator.
As a malignancy of the digestive system, pancreatic cancer is one of the most aggressive malignancies in the world. In recent years, pancreatic cancer morbidity and mortality rates have steadily increased, with an overall 5-year survival rate of about 8% for patients [1]. Despite the rapid development of diagnosis and treatment of pancreatic cancer over the past 20 years, the mortality rate of patients remains high [2][3][4]. The combination of tumor markers and imaging helps to diagnose the disease in a timely and accurate manner [5]; however, due to the insidious nature of the disease and the lack of early clinical signs, it is not easy to diagnose. It has been reported that about 50% of patients have confirmed that the cancer has metastasized [6,7]. Therefore, revealing the molecular mechanisms of pancreatic cancer progression and developing corresponding targeted therapies are critical to improving pancreatic cancer outcomes.
Lipids play an important function in maintaining normal cell function and homeostasis; they are not only an important part of the cell membrane but also provide precursors for important molecules needed in the growth and differentiation pathways [8,9]. Intracellular lipids come from two sources: one is food intake, and the other is lipid synthesis from scratch by hepatocytes and cells in need. Normal cells in the body acquire lipids mainly from diet and rarely from lipid synthesis from scratch. For most normal cells, lipids that meet cellular needs rarely are synthesized from scratch because of slow cell growth [10,11]. However, Medes et al. [12] found in the 1950s that tumor cells synthesize fatty acids primarily by synthesizing them from scratch. In tumor cells, on the one hand, lipids and cholesterol are often activated to meet the needs of tumor cells that are rapidly proliferating, and on the other hand, lipids alter the properties of biofilms and protect cells from oxidative damage from internal and external sources [13]. Lipogenesis is an important feature of rapid malignancy growth [14]. Normal cell lipid synthesis from scratch is rare, and about 90% of fatty acids are synthesized from scratch in tumor cells [12]. Activated lipid scratch synthesis was found to be associated with poorer prognosis and shorter diseasefree survival in tumor patients [15,16]. At the molecular level, increased lipid synthesis from scratch in tumors is often accompanied by increased lipid synthase and enhanced activity [17,18]. Thus, aberrant activation of lipid synthesis from scratch is a common feature of tumor cells. In addition, lipid metabolic reprogramming that promotes increased lipogenesis is associated with the abnormal development and progression of pancreatic adenocarcinoma [19]. Mammalian target of rapamycin complex 2 (mTORC2) stimulates the synthesis of sphingomyelin (glucoceramide) and glycerophospholipids (cardiolipin) to promote tumor progression [20]. Studies have shown that mTORC1 stimulates the synthesis of fatty acids and sterols by regulating the expression of SREBP1c, a major adipogenic transcription factor [20][21][22][23]. The active form of SREBP1c is sensitive to proteasomal degradation but can enter the nucleus and participate in its transcriptional targets, including its own gene promoter and the promoter of the major enzyme encoding fatty acid synthesis [24]. However, a deeper understanding of lipid metabolism-related genes in the prognosis and treatment of pancreatic cancer is needed.
In this study, lipid metabolism-related gene expression in pancreatic cancer was analyzed to identify key genes that could predict patient prognosis. A differentially expressed gene (DEG) analysis, Weighted Gene Coexpression Network Construction Analysis (WGCNA) and Cox proportional risk model were used to finally construct a signature based on the expression of several key genes as a prognostic signature for pancreatic cancer. This prognostic model can be used as an effective tool to predict the prognosis of patients with pancreatic cancer. These findings will also help identify new therapeutic targets for pancreatic cancer.

Expression spectral data and preprocessing
Human lipid metabolics-related pathways were downloaded from Molecular Signature Database v7.0 [25], and a total of 776 genes related to lipid metabolism were sorted out from the six lipid metabolic pathways in Table 1. Pancreatic cancer RNA sequencing (RNA-seq) expression data and corresponding clinical follow-up data were obtained from the public database The Cancer Genome Atlas (TCGA) (https://portal.gdc.cancer.gov/) [26], which contained RNAseq data of 182 patients and clinical information of 171 patients on December 3, 2019. GSE57495 is a microarray dataset from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) [27], containing expression profile data and clinical sample information from 62 patients with early pancreatic cancer. The International Cancer Genome Consortium (ICGC) validation dataset included 257 patients with pancreatic cancer with expression profile data and clinical follow-up information. For TCGA dataset, (a) samples without clinical data and overall survival (OS) <30 days were removed, (b) normal tissue sample data were removed, (c) genes with fragments per kilobase of exon per million of zero in half of the samples were removed, and (d) the expression profile of genes related to lipid metabolism was preserved. For GEO datasets, (a) normal tissue sample data were removed, (b) OS data from months was converted to days, (c) samples with OS <30 days were removed, (d) Chip probes were mapped to the human gene SYMBOL using Bioconductor package, and (e) the expression profile of genes related to lipid metabolism was preserved. For the ICGC dataset, (a) sample data without survival status were removed, (b) samples with OS <30 days were removed, and (c) the expression profile of genes related to lipid metabolism was preserved. The GSE57495 and ICGC datasets were considered as the validation dataset. The clinical data information is shown in Table 2. The workflow chart is shown in Fig. 1.

Identification of prognostic genes
The expression profile of 776 lipid metabolism genes was extracted from TCGA dataset. However, 15 genes were not found. Furthermore, we keep genes that are not zero in more than half of the samples. As a result, 730 genes were used for subsequent analysis. Next, univariate Cox analysis of coxph function in R package was performed to obtain genes related to prognosis of pancreatic cancer with P < 0.05.

Identification of molecular subtypes
Cluster analysis of pancreatic cancer samples was performed by nonnegative matrix clustering algorithm (NMF), and the standard 'brunet' was selected by NMF method for 50 iterations. The clustering number k was set as 2-10, and the average contour width of the common member matrix was determined through the R package 'NMF'. The minimum member of each subclass was set as 10. According to the cophenetic, dispersion and silhouette index were used to determine the optimal clustering number.

Difference of tumor-infiltrating immune cells in molecular subtypes
Six types (B_cell, CD4_Tcell, CD8_Tcell, neutrophil, macrophage cell and dendritic cell) of tumor-infiltrating immune cell were retrieved from Tumor Immune Estimation Resource (https://cistrome.shinyapps.io/timer/) [28]. Immunity, matrix score, and tumor purity of each sample were calculated in R package estimate. These indicators were compared on molecular subtypes.

Identification of DEGs
R package differentially expressed Seq2 (DESeq2) [29] was applied to calculate the DEGs in molecular subtypes with a false discovery rate (FDR) <0.05 and |log 2 FC| > 1.

WGCNA
Based on expression profiles of DEG, the WGCNA coexpression algorithm was used to mine the coexpression module using the R package WGCNA (http://www.r-project. org/) [30]. First, the appropriate soft threshold is determined by approximate scale-free topology criteria. The adjacency matrix was transformed into a topological matrix, and the genes were clustered using average-linkage hierarchical clustering. Lastly, the dynamic tree cut method was used to determine module eigengenes, at least 30 coexpressed genes.

Functional enrichment and pathway enrichment analysis
Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis   were performed for DEG based on 'WebGestaltR' [31] in R. FDR < 0.05 was defined as significant.

Construction of a prognostic risk model based on differentially coexpressed genes
First, 90% of samples were randomly selected from the preprocessed 171 TCGA samples as the training set for model construction. To avoid the random allocation bias affecting the stability of subsequent modeling, we repeatedly sampled 100 samples with replacement in advance to ensure that the randomly selected samples were consistent with all samples in age, stage and TNM staging. Univariate Cox regression analysis for OS was performed to identify prognostic DEGs with P < 0.05 using survival coxph function in R. Lasso (least absolute shrinkage and selection operator) Cox regression analysis was performed to find characteristic genes using R package glmnet. Subsequently, the multivariate Cox proportional hazards regression model was used to build a prognostic model in the training group. The risk formula was as follows: RiskScore 4 = À0.0666 9 CA8 + 0.0413 9 CEP55 À 0.2189 9 GNB3 À 0.0339 9 SGSM2. Next, the Kaplan-Meier (KM) survival curve was used to compare prognosis between the low-and high-risk groups, which were classified by the median risk score as the cutoff value in all patients. The receiver operating characteristic (ROC) curve was applied to assess diagnostic accuracy through comparing the areas under the ROC curves (AUCs) using timeROC package in the training and validation groups.

Gene set enrichment analysis
The R software package Gene Set Variation Analysis (GSVA) [32] was used for single-sample gene set enrichment analysis (ssGSEA), and the function with correlation >0.45 was selected.

Identification of four molecular subtypes of pancreatic cancer
Univariate Cox survival analysis of lipid metabolism genes using coxph revealed 189 genes associated with the prognosis of pancreatic cancer. Pancreatic cancer samples were clustered by NMF algorithm, and the optimal number of clusters was determined to be four based on cophenetic, dispersion and silhouette metrics ( Fig. 2A,B). The expression of lipid metabolism-related genes showed that the expression of C2 genes was lower than that of C1, C3 and C4 genes (Fig. 2C). Analysis of the prognostic relationship among the four subtypes showed that C1 had the worst prognosis and a significant difference [P < 0.0001; hazard ratio (HR), 2.264 (1.582-4.352)], and C2 had the best prognosis (Figs 2D and S1). Subsequently, we compared the clinical characteristics of the four subtypes and found significant differences in T, TNM, cancer grade and age (Fig. S2). The immune evaluation of the four subtypes showed significant differences in immune scores and stroma (Fig. S3).

Identification of differential expression genes and functional analysis
DESeq2 was conducted to calculate the DEGs between C1 and C2, C3 and C4 molecular subtypes; a total of 4249 were obtained (Fig. 3). Next, according to the expression profile of coding genes, the WGCNA coexpression algorithm was used to mine the coexpressed coding genes and coexpression modules, and hierarchical clustering analysis was performed on the samples to show that there were no outlier samples (Fig. 4A). A soft threshold of 10 was selected (Fig. 4B,C). Genes were clustered using the averages-linkage hierarchy clustering method, and 14 modules were obtained by setting height = 0.25, deepSplit = 2 and minModuleSize = 30 (Fig. 4D). The correlation of each module with sex, age, ethnicity and clusters 1, 2, 3 and 4 was further analyzed. The results showed that the modules significantly related to clusters 1, 2, 3 and 4 were magenta, green, brown and turquoise, respectively (Fig. 4E-H). Finally, 4249 DEGs and genes with significant coexpression modules were intersected to obtain 1340 differentially expressed cogenes. To determine whether differential genes are related to pancreatic cancer function, we performed GO molecular function and KEGG function enrichment analysis of 1340 differential genes using R software package WebGestaltR (https://www.r-project.org/help. html). The significant pathways enriched by KEGG are related to insulin secretion and dopaminergic synapse pathways (Fig. 5A). GO enriched 233 GO cellular component (CC) (Fig. 5B), 195 GO molecular function (MF) (Fig. 5C) and 977 GO biological process (BP) (Fig. 5D).

Construction and risk prediction of Gene
Signature four-gene signature Ninety percent of the 171 TCGA samples were randomly selected as the training set for model construction. Univariate Cox proportional hazard analysis was conducted for the expression profile of each differentially expressed cogene, and R package survival coxph function was used to obtain 369 genes with significant prognostic differences. To further narrow the range of genes and construct the prognostic model with high accuracy, we used R software package glmnet for Lasso Cox regression analysis. First, the analysis of the change trajectory of each independent variable shows that as the lambda gradually increases, the number of independent variable coefficients tending to zero also gradually increases (Fig. 6A). Then the confidence interval (CI) under each lambda is analyzed, and the model reaches the optimal value when lambda = 0.1753479 (Fig. 6B). Four genes were selected at lambda = 0.1753479 as target genes ( Table 3). The four-mRNA signature formula is as follows:\def\mybox{\vrule depth -0.5mm height 4mm width 8mm} RiskScore4 = À0.0666 9 CA8 + 0.0413 9 CEP55 À 0.2189 9 GNB3 À 0.0339 9 SGSM2.
The RiskScore of each sample is calculated according to the expression level of the sample, and the RiskScore of the sample is plotted. The survival time of the samples with high RiskScore was significantly lower than that with low RiskScore. The gene expression changes with the increase of risk value showed that CEP55 was a risk factor, whereas CA8, GNB3 and SGSM2 were protective factors (Fig. 6C). ROC analysis of RiskScore by R software package timeROC showed that AUC of 1, 3 and 5 years was >0.70 (Fig. 6D). Finally, we carried out z-score for RiskScore and divided the samples with z-score-based RiskScore greater than zero into the high-risk group and the samples with less than zero into the low-risk group. KM prognostic analysis showed a significant difference between the two groups (Fig. 6E).

Robustness of four-gene signature
To determine the robustness of the model, we used the whole TCGA dataset, GSE57495 dataset and ICGC verification set as the verification dataset, and the same model and coefficient as the training set were adopted. Similarly, the high RiskScore sample had a worse prognostic capability, CEP55 was a risk factor, and CA8, GNB3 and SGSM2 were protective factors (Fig. 7A,D,G). ROC analysis showed that the model had high AUC (Fig. 7B,E,H). The results of the KM curve showed that there were significant marginal differences between the two groups (Fig. 7C,F,I). Moreover, we obtained three additional pancreatic cancer datasets from the GEO database, GSE28735, GSE62452, and GSE85916, and we used the same method to score the risk for each patient in these three cohorts. First, we performed ROC analysis on the GSE28735 dataset. Because of the short follow-up period, it was not possible to calculate the 5-year AUC, in which the AUC of 1 and 3 years reached more than 0.78 (Fig. S4A), and there was a significant prognostic difference between the high-and low-risk groups (Fig. S4D). ROC analysis in the GSE62452 dataset showed the highest AUC of 0.77 for 3 years ( Fig. S4B), with significant prognostic differences between the high-and low-risk groups (Fig. S4E). ROC analysis in the GSE85916 dataset showed the highest 5-year AUC of 0.84 (Fig. S4C), with a significant prognostic difference between the high-and lowrisk groups (Fig. S4F).

Prognostic analysis of risk models and clinical features
Survival analysis showed that only age, N stage and OS were significantly correlated in the TCGA training set sample (P < 0.05), and TNM stage presented significant margin (P = 0.05464; Fig. 8). It was further found that four-mRNA signature could distinguish the young and old groups, female, stage I + II, T1 + T2 and T3 + T4 patients from high-and low-risk groups (P < 0.05; Fig. 9). These data further illustrate that our model still has good predictive ability in different clinical signs.

Clinical independence and regulatory pathway of four-mRNA signature
To identify the independence of the four-mRNA signature model in clinical applications, we used univariate and multivariate Cox regression analysis to analyze relevant HR, 95% CI of HR and P value in the clinical information carried by the whole TCGA data. In TCGA dataset, the univariate COX regression analysis found that sex, T3, T4 versus T1/T2, stage III versus stage I, II and IV, and RiskScore are significantly associated with survival, but the corresponding multivariate Cox regression analysis found that age, stage of N and risk score (HR, 3.606; 95% CI: 1.659-7.839; P = 0.007) were significantly associated with survival (Fig. 10A,B). The earlier conditions indicate that our model four-mRNA signature has good predictive performance in clinical application value. To observe the relationship between risk scores of different samples and biological function, we used the R software package GSVA for ssGSEA analysis. The function with a correlation >0.45 was selected, from which it can be seen that most of them are negatively correlated with the risk score of the sample, while a few are positively correlated with the risk score of the sample (Fig. 10C). Cluster analysis results showed that among the 17 pathways, KEGG_P53_-SIGNALING_PATHWAY, KEGG_SYSTEMIC_LU-PUS_ERYTHE MATOSUS, KEGG_CELL_CYCLE and other metabolic-related pathways increased with the increase of RiskScore, and KEGG_pentose_phos-phoate_pathway declined with the increase of RiskScore (Fig. 10D). This also suggests that the dysfunction of these pathways is closely related to tumor development.

Advantages of risk models
Four prognosis-related risk models, 15-gene signature (Chen), 7-gene signature (Cheng), 5-gene signature (Raman) and 9-gene signature (Wu), were selected and compared with our four-gene model. To make the models comparable, we calculated the risk scores of each pancreatic cancer sample in TCGA using the same   method. Samples were divided into the risk-H and risk-L groups according to the median risk score, and the KM prognosis curve showed that there were significant differences in OS prognosis of samples from the four models in the risk-H and risk-L groups (P < 0.05; Fig. 11A-D). The ROC analysis results of the model showed that the prediction effect of the four models was worse than that of the four-gene signature models (Fig. 11E-H). The RMS curve was further drawn using the R language RMS package, indicating that the AUC of the four gene models was higher than that of the four models (Fig. 11I). Curves of risk coefficients also show that the model is relatively high (Fig. 11J).

Discussion
Pancreatic cancer is by default the king of all cancers, placing a heavy burden on families and the world [38]. Accurate prediction of the prognosis of pancreatic cancer is important for the choice of treatment and the improvement of the prognosis. In this study, four independent datasets were used to identify DEGs associated with lipid metabolism between pancreatic cancer and normal cervical tissue. A total of 189 DEGs were identified, and four molecular isoforms were identified based on lipid metabolism genes combined with WGCNA to finally identify four gene signatures. The prognostic value of various gene expression profiles has been studied in pancreatic cancer over the past decade. For example, Demirkol Canli et al. [39]  identified a gene signature composed of 20 prognostic genes (PPS20), indicating OS and event-free survival of pancreatic cancer. Wolfe et al. [40] developed a four-miRNA molecular signature that is associated with risk for local-regional recurrence and OS after pancreatic cancer resection. Meng et al. [41] constructed a novel eight-mRNA signature to predict the prognosis of PAAD patients by applying ESTIMATE scoring to RNA-seq-based transcriptome data. Chen et al. [33] developed a prognostic 15-gene signature to know OS by analyzing microarray data from 63 patients with early pancreatic ductal adenocarcinoma (PDAC) (stages IB, IIA and IIB) in the Moffitt cohort. Although these reports are promising, the proposed genome is either too large or traditionally genetically composed. Therefore, focusing on key biological processes, such as lipid metabolism, which have a significant impact on cancer occurrence and progression, may introduce potential biomarkers for pancreatic cancer screening, as well as new treatment strategies and targets. Abnormalities in signaling pathways are one of the important advances in cancer. With this in mind, abnormal activity of energy metabolism, such as lipid metabolism, has a unique role in cancer development, and its expression and active state have become of interest to researchers for screening and therapeutic inventions [42,43]. Based on this, we used a bioinformatics approach to mine the correlation between lipid gene status and prognostic prediction in patients with pancreatic cancer. We found significant aberrant expression of CEP55, CA8, GNB3 and SGSM2 as the key dysregulated metabolic factors within the study population. Overexpression of CEP55 activates p21 and enhances the cell-cycle transition. Also, CEP55 upregulation promotes PANC cell aggressiveness via activating pancreatic cancer [44]. Many studies have shown that CA8 is associated with poor prognosis in a variety of tumors, [45][46][47], but it has not been reported in pancreatic cancer. GNB3 was reported to influence development of metastasis in low-grade tumors [48]. SGSM2 downregulation promoted estrogen receptor-positive breast cancer cell migration via modulating cell adhesion and cytoskeleton dynamics [49]. These published reports reinforce the potential of these genes as a comprehensive prognosis. We screened prognostic genes from lipid metabolism-related genes and divided four molecular subtypes to select four genes that are likely to be involved in lipid metabolic processes, although these genes have not been studied in depth in lipid metabolism studies. On this basis, we suggest that four gene signatures are likely to serve as prognostic biological indicators of pancreatic cancer, and that these genes may be involved in important lipid metabolism processes.
Inevitably, there are some limitations in the research, and we hope to address these in future work. First, although three study cohorts were included in this study, our findings should be confirmed in a separate cohort. Second, the prognostic value of lipid metabolism genes was studied using gene microarrays, and this single assay should also be validated by other methods, such as real-time quantitative RT-PCR. Third, the majority of genes in our prognostic model have not been reported in studies related to lipid metabolism. Their specific clinical significance, biological function and potential mechanism of action should be studied in further experiments. In conclusion, more experimental evidence is needed to determine the function of these genes in pancreatic cancer.

Conclusions
Our study found that four lipid metabolism-related genes were significantly associated with prognosis in patients with pancreatic cancer; therefore, the four-gene signature with some clinicopathological characteristics could be a useful biomarker for the prognosis of pancreatic cancer.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article.