Combined RNA/tissue profiling identifies novel Cancer/testis genes

Cancer/Testis (CT) genes are induced in germ cells, repressed in somatic cells, and derepressed in somatic tumors, where these genes can contribute to cancer progression. CT gene identification requires data obtained using standardized protocols and technologies. This is a challenge because data for germ cells, gonads, normal somatic tissues, and a wide range of cancer samples stem from multiple sources and were generated over substantial periods of time. We carried out a GeneChip-based RNA profiling analysis using our own data for testis and enriched germ cells, data for somatic cancers from the Expression Project for Oncology, and data for normal somatic tissues from the Gene Expression Omnibus repository. We identified 478 candidate loci that include known CT genes, numerous genes associated with oncogenic processes, and novel candidates that are not referenced in the Cancer/Testis Database (www.cta.lncc.br). We complemented RNA expression data at the protein level for SPESP1, GALNTL5, PDCL2, and C11orf42 using cancer tissue microarrays covering malignant tumors of breast, uterus, thyroid, and kidney, as well as published RNA profiling and immunohistochemical data provided by the Human Protein Atlas (www.proteinatlas.org). We report that combined RNA/tissue profiling identifies novel CT genes that may be of clinical interest as therapeutic targets or biomarkers. Our findings also highlight the challenges of detecting truly germ cell-specific mRNAs and the proteins they encode in highly heterogeneous testicular, somatic, and tumor tissues.


Introduction
Cancer/Testis (CT) genes are expressed in testicular cells and in somatic cancers but typically not in their corresponding normal somatic tissues, reviewed in Ref. [1]. The majority of CT genes function in gametogenesis and fertility, and their abnormal expression in somatic cancer cells can contribute to malignant properties. Indeed, previous work has identified CT genes that are essential for cancer cell division and that affect regulatory signal transduction pathways [2]. Gaining insight into the biological roles of CT genes helps to better understand how cancer cells proliferate, form metastases, repair DNA damage, suppress apoptosis, alter signaling pathways, and invade normal tissues; for review, see Ref. [3]. More recently, CT genes were proposed to be biomarkers for cancer stem cells that are thought to play roles in the maintenance of tumor growth and resistance to chemotherapy, reviewed in Ref. [4].
Molecular biological and genomic approaches have led to the discovery of several hundred CT genes referenced in the CT database (www.cta.lncc.br) [5]. These loci were initially classified into testis-specific, testis/brain-specific, and testis-selective groups (whereby low expression in two nontesticular tissues is observed). It is noteworthy that the majority of CT genes have been identified via mRNA expression in limited somatic control sample sets, which may explain why they are frequently not genuinely testis-specific; for a detailed discussion, see Ref. [6]. It is therefore currently unclear how many CT genes are indeed expressed only in male gonads and somatic cancers at both the RNA and protein levels. More recent work carried out by the group developing the Human Protein Atlas, which monitors protein localization in all normal human tissues and numerous cancers, has made a major contribution to determining the human testicular proteome. This study also revealed the somatic cancer expression profile of testicular proteins, some of which are relevant for non-small-cell lung cancer [7,8].
We report a combined microarray-based RNA/tissue profiling analysis of somatic cancers, normal tissues, prepubertal and adult testis biopsies, total testis samples, and enriched meiotic and postmeiotic germ cells. The approach identified most known and also novel CT genes in addition to oncogenes and cancer-associated genes that had not been profiled in the male germline before. We selected promising cases and further characterized the proteins they encode using testicular sections and cancer tissue microarrays. Our results represent a rich source for further functional analyses of CT genes in the field of molecular oncology.

GeneChip RNA profiling data assembly
The entire dataset was generated with Affymetrix Human Genome U133 Plus 2.0 GeneChips (Thermo Fisher, Courtaboeuf, France). Expression data for human testis (total testis and isolated seminiferous tubules), prepubertal biopsies (high, intermediate, and low infertility risk; HIR, IIR, and LIR), adult testicular biopsies with different Johnsen scores indicating the steps where spermatogenesis is disrupted (JS1, JS2, JS3, JS5, JS7, JS8, and JS10), and enriched germ cells (pachytene spermatocytes and round spermatids) were described in reference [9]. Expression data for 45 normal somatic control tissues were downloaded from the NCBI's Gene Expression Omnibus (GEO: GSE7307, GSE6565, and GSE11839) repository [10]; see Supplemental File S1 columns BZ to DR for tissue-type annotation data. Cancer expression data for 214 cancer subtypes produced by the Expression Project for Oncology (expO; www.intgen.org) were retrieved from GEO (GSE2109) and combined with two other datasets (GSE10802 and GSE6891); see Supplemental File S1 columns DT to LN for cancer types.

GeneChip data processing and analysis
GeneChip U133 Plus 2.0 expression data were quality-controlled, processed, and normalized as in references [9,11]. After quality control, raw data CEL files were normalized, background-corrected, and summarized with the robust multiarray average function implemented in AMEN [12].
Next, we defined classes of transcripts according to their expression pattern in normal and cancer tissue samples. Transcripts specifically expressed (SE) or preferentially expressed (PE) in a given normal tissue were identified by applying three filtration steps. First, the intensity signal had to be above the background expression cutoff (BEC = 5.5, corresponding to the overall median log2-transformed intensity) in the tissue of interest and below this threshold in all the other normal tissues (with three exceptions for PE transcripts). Second, we required at least a twofold change between the signal in the tissue of interest and those of all other tissues, with three exceptions for PE transcripts. Third, statistically significant changes across the samples were identified using a LIMMA statistical test with the false discovery rate (FDR) adjustment method (P ≤ 0.01). Transcripts preferentially or specifically expressed in testis as compared to the other somatic normal tissues are termed PET and SET, respectively. Furthermore, SET and PET transcripts whose expression signals in testis or germ cells are in the upper quartile of the overall log2-transformed expression matrix are termed specifically expressed and highly expressed in testis (SEHET) and preferentially expressed and highly expressed in testis (PEHET), respectively.
Transcripts upregulated in a given cancer subtype (UC) were identified by applying three filtration steps. First, the intensity signal had to be above the background expression cutoff (BEC = 5.5, corresponding to the overall median log2-transformed intensity) in the cancer subtype of interest. Second, we required a twofold change between the signal in the cancer subtype and that of the corresponding somatic tissue. Third, statistically significant changes across the samples were identified using a LIMMA statistical test with the FDR adjustment method (P ≤ 0.01).
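The three filtration steps for testis can be illustrated with a minimal Python sketch. It is not the AMEN implementation: the function name, the way an "exception" is counted (a somatic tissue that is detected or that lies within twofold of the testis signal), and the default upper-quartile boundary of 6.9 are assumptions made for illustration, and the FDR-adjusted P value is taken as an input computed upstream (for example with LIMMA).

```python
BEC = 5.5            # background expression cutoff (log2), as stated in the text
MIN_LOG2_FC = 1.0    # a twofold change equals 1 unit on the log2 scale
MAX_FDR = 0.01       # threshold for the FDR-adjusted P value
PE_EXCEPTIONS = 3    # exceptions tolerated for preferentially expressed transcripts
HIGH_CUTOFF = 6.9    # assumed upper-quartile boundary of the expression matrix


def classify_testis_probeset(testis, somatic, fdr, high_cutoff=HIGH_CUTOFF):
    """Return 'SET', 'PET', 'SEHET', 'PEHET', or None for one probeset.

    testis  : log2 signal in testis or germ cells
    somatic : list of log2 signals, one per normal somatic tissue
    fdr     : FDR-adjusted P value from an upstream statistical test
    """
    # The probeset must be detected in testis and pass the significance filter.
    if testis <= BEC or fdr > MAX_FDR:
        return None
    # Assumption: a somatic tissue counts as an exception when it is detected
    # (signal at or above BEC) or when it is less than twofold below testis.
    exceptions = sum(
        1 for s in somatic if s >= BEC or (testis - s) < MIN_LOG2_FC
    )
    if exceptions == 0:
        label = "SE"
    elif exceptions <= PE_EXCEPTIONS:
        label = "PE"
    else:
        return None
    # Add the "highly expressed" qualifier for upper-quartile testis signals.
    if testis >= high_cutoff:
        label += "HE"
    return label + "T"
```

For example, a probeset with a testis signal of 8.0 and somatic signals that are all below 5.5 would be labeled SEHET in this sketch, whereas the same probeset with four detected somatic tissues would not be classified.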
Upregulated and highly expressed in cancer (UHEC) corresponds to UC transcripts for which the expression signal in a given cancer subtype is in the upper quartile. Upregulated in cancer and not detected in healthy tissues (UCNDH) transcripts correspond to UC transcripts for which the expression signal in the corresponding somatic tissue is below the BEC. Finally, upregulated and highly expressed in cancer and not detected in healthy tissues (UHECNDH) transcripts correspond to UC transcripts for which the expression signal in a given cancer subtype is in the upper quartile and the expression signal in the corresponding somatic tissue is below the BEC.
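The cancer-related classes defined above can be expressed as a short Python sketch. The function name and the default upper-quartile boundary of 6.9 are assumptions made for illustration, and the FDR-adjusted P value is assumed to come from an upstream LIMMA-type test; the sketch only restates the thresholds given in the text and is not the original analysis code.

```python
BEC = 5.5            # background expression cutoff (log2), as stated in the text
MIN_LOG2_FC = 1.0    # a twofold change equals 1 unit on the log2 scale
MAX_FDR = 0.01       # threshold for the FDR-adjusted P value
HIGH_CUTOFF = 6.9    # assumed upper-quartile boundary of the expression matrix


def classify_cancer_probeset(cancer, normal, fdr, high_cutoff=HIGH_CUTOFF):
    """Return the set of class labels that apply to one probeset.

    cancer : log2 signal in the cancer subtype of interest
    normal : log2 signal in the corresponding normal somatic tissue
    fdr    : FDR-adjusted P value from an upstream statistical test
    """
    labels = set()
    upregulated = (
        cancer > BEC
        and (cancer - normal) >= MIN_LOG2_FC
        and fdr <= MAX_FDR
    )
    if not upregulated:
        return labels
    labels.add("UC")
    highly_expressed = cancer >= high_cutoff   # upper quartile in the cancer subtype
    not_detected = normal < BEC                # below background in healthy tissue
    if highly_expressed:
        labels.add("UHEC")
    if not_detected:
        labels.add("UCNDH")
    if highly_expressed and not_detected:
        labels.add("UHECNDH")
    return labels
```

Returning a set rather than a single label reflects that the classes are nested: every UHECNDH transcript is also UC, UHEC, and UCNDH.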

Bulk and single-cell RNA-sequencing data processing and visualization
We integrated RNA-sequencing data for total testis samples and enriched testicular cells (Sertoli cells, Leydig cells, peritubular cells, spermatocytes, and round spermatids) published in Refs [13,14] and single-cell RNA sequencing (scRNA-Seq) data for adult testicular cells published in Ref. [15]. The single-cell expression profiles (scatter plots) for individual genes were generated online via the Reproductive Genomics Viewer (https://rgv.genouest.org) [16,17].
RNA-sequencing data for total testis, somatic tissues, and cancer samples from the Human Protein Atlas (www.proteinatlas.org [18]) were processed by adding a pseudocount of 1 and applying a log2 transformation. The signals were visualized using the heatmap.2 function (gplots package) in R (CRAN); samples and genes were grouped using the default hclust algorithm, with default color scaling for normal and testis tissues and row-wise color scaling for cancer data provided by the TCGA consortium [19].
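The transformation described above can be sketched in a few lines of Python. This is an illustration only: the original analysis was run in R, the function names are hypothetical, and row-wise scaling is assumed here to mean centering each gene on its mean and dividing by its sample standard deviation.

```python
import math
import statistics


def log2_pseudocount(matrix, pseudocount=1.0):
    """Apply log2(value + pseudocount) to every entry of a genes x samples matrix."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]


def scale_rows(matrix):
    """Center and scale each row (gene) to mean 0 and sample standard deviation 1.

    Rows without variation are returned as zeros to avoid division by zero.
    """
    scaled = []
    for row in matrix:
        mean = statistics.mean(row)
        sd = statistics.stdev(row) if len(row) > 1 else 0.0
        if sd == 0:
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(v - mean) / sd for v in row])
    return scaled
```

The pseudocount keeps genes with zero reads defined after the log transformation, and row-wise scaling makes the relative pattern of each gene across cancer samples visible regardless of its absolute expression level.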

Immunohistochemistry analysis using testicular sections
Regarding human samples, the local ethics committee approved the experimental protocol "Study of Normal and Pathological Human Spermatogenesis" (registration PFS09-015) at the French Biomedicine Agency; informed consent was obtained from all donors. The study's methodologies adhere to the standards set by the Declaration of Helsinki.

Experimental rationale
In earlier work, we used GeneChips to determine the testicular transcriptome in biopsies from prepubertal and adult individuals, total testicular samples, and adult meiotic and postmeiotic germ cells. These analyses included an extensive set of somatic samples to classify genes into testis-specific, preferentially expressed in testis, and ubiquitous [9,21]. We integrated our data with normal somatic controls from the NCBI's GEO repository and high-quality GeneChip cancer expression data provided by the expO project (www.intgen.org) [22]. Our RNA profiling study is thus based on robust expression data from multiple sources that were produced using a highly standardized RNA profiling method. Our work also distinguishes itself from similar analyses by comprehensive total testis, testicular biopsy, and male germ cell sampling in combination with antibody-based tissue profiling [23][24][25]. We analyzed 61 testis-associated samples including total testis (2), seminiferous tubules (2), meiotic (2) and postmeiotic (2) germ cells, and prepubertal (15) and adult (38) testicular biopsies. Furthermore, we processed 544 samples from 45 normal somatic tissues that we used as controls and 2281 samples corresponding to 214 cancer subtypes from 23 distinct tissue origins.

CT gene identification and definition of expression-level cutoff values
We assembled our testicular and germ cell data [9], cancer data from the expO project, and data for normal somatic control tissues from the GEO repository (Fig. 1A). The dataset comprises 2998 GeneChips (Human Genome U133 Plus 2.0), among which 2887 passed quality control (see methods for details; Fig. 1A). They were processed, normalized, and used in a differential gene expression analysis as in reference [9]. The median log2 expression value was 5.5, and the 25th and 75th percentiles, which define the lower and upper boundaries of a window of gene expression, were 4.4 and 6.9, respectively. Values above the 75th percentile represent high expression, and transcripts associated with values below the 25th percentile were considered to be undetectable (Fig. 1B). In step 1, among 54613 probesets we selected 2140 (corresponding to 1433 genes) that displayed significant expression in testis or germ cells, including 1285 probesets that were SE in testis. In step 2, we identified 2819 probesets (2025 genes) as being upregulated in at least one somatic cancer subtype and not expressed in the corresponding normal somatic tissue (UCNDH). The intersection of steps 1 and 2 identified 602 probesets (478 genes) that displayed a pattern broadly corresponding to CT genes (Fig. 2; see filtering options in Supplemental File S1).
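The two operations described in this paragraph, deriving expression-level cutoffs from percentiles and intersecting the probeset lists from steps 1 and 2, can be sketched in Python as follows. The function names and the toy identifiers in the usage note are hypothetical, and the quartile method of the standard library may differ slightly from the one used in the original analysis.

```python
import statistics


def expression_window(log2_values):
    """Return the 25th, 50th, and 75th percentiles of log2 expression values."""
    q1, median, q3 = statistics.quantiles(log2_values, n=4)
    return q1, median, q3


def candidate_ct_probesets(testis_probesets, ucndh_probesets, probeset_to_gene):
    """Intersect step 1 (testis) and step 2 (UCNDH) probesets.

    Returns the shared probesets and the set of genes they map to; several
    probesets can map to the same gene, so the gene count can be smaller.
    """
    shared = set(testis_probesets) & set(ucndh_probesets)
    genes = {probeset_to_gene[p] for p in shared if p in probeset_to_gene}
    return shared, genes
```

With toy inputs such as step 1 = {p1, p2, p3}, step 2 = {p2, p3, p4}, and p2 and p3 both mapping to the same gene, the sketch returns two shared probesets but a single gene, which mirrors why 602 probesets correspond to only 478 genes.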

Extended somatic control sample sets are critical for identifying bona fide CT genes
Many putative CT genes are not testis-specific [6]. We therefore determined expression patterns of 176 probeset-associated mRNAs referenced in the CT database using our sample set (www.cta.lncc.br [5]). We find that 39% (69/176 probesets) are SEHET and 12% (22/176) are SET (Fig. 3; CT genes referenced in CT database are available via Supplemental File S1). However, 14% (25/176) are only PEHET and 5% (8/176) are preferentially expressed (PET), which indicates that their mRNA is detected in at least one somatic sample. For 13% (22/176), we find intermediate expression (IE, expressed in 4-10 somatic tissues) and a large group of 16% (28/176) even show ubiquitous expression (UE) in all somatic controls (Fig. 3). This pattern is unsurprising because testis contains not only germ cells but also Sertoli cells, Leydig cells, peritubular cells, smooth muscle cells, and immune cells. We note that among 1079 testicular proteins, 261 were also detected in 22 normal somatic tissues, including fallopian tube (109), cerebral cortex (46), and epididymis (28) (Fig. S1) [7]. These results underline that comprehensive control sample sets are critical for the identification of testis-specific CT genes.

RNA profiling using GeneChip and RNA-sequencing data identifies novel CT genes
To validate our filtration method, we selected 182 probesets (corresponding to 124 unique genes) that show testis-specific expression (SET and SEHET classes), upregulation in at least one cancer subtype, and no expression in the corresponding somatic tissue (UCNDH; select SET plus SEHET and UCNDH filter options in columns G and L, respectively, in Supplemental File S1). To confirm and extend our initial GeneChip expression data, we analyzed testicular expression with our RNA-sequencing data from total testis samples and enriched meiotic spermatocytes, postmeiotic round spermatids, and Sertoli, Leydig, and peritubular cells that were available for 115 core CT genes [13,14]. As expected, we found that the vast majority of the core CT genes are highly induced in the male germline (see RNA-Seq data in Fig. S2A). We then extended the analysis using single-cell RNA-sequencing data for testicular somatic cells (Sertoli, Leydig, and peritubular cells and macrophages), mitotic germ cells (dividing, differentiating, and differentiated spermatogonia), meiotic germ cells (leptotene, zygotene, pachytene, diplotene, and diakinesis spermatocytes), and postmeiotic germ cells (spermatids) [15]. The result confirmed that nearly all CT genes are expressed in germ cells at different stages of differentiation, including mitotic, meiotic, and postmeiotic phases of male gametogenesis (see scRNA-Seq data in Fig. S2B).
Next, we confirmed the testis-specific or testis-enriched expression pattern determined with GeneChip data for core CT genes by using RNA-sequencing data available to us [18]. We compared expression levels in male and female gonads to 35 normal somatic tissues and found that the majority of the genes show the expected testis-specific, preferential, or testis/brain expression patterns (Fig. 4A; see Supplemental File S2 for gene annotation and signal intensities). This result underlines that GeneChip RNA profiling data obtained with samples that were processed using highly standardized methods are reproducible in the majority of the cases, even across RNA profiling methods based on fundamentally different technologies. It is unclear why some transcripts detectable in normal somatic tissues by RNA sequencing fail to be scored as expressed by GeneChip. Different threshold levels of detection, signal processing procedures, and evolving genome annotation data that analysis procedures are based on may at least in part explain the discrepancies. Finally, we analyzed RNA-sequencing data from the TCGA consortium to explore the expression patterns of core CT genes in testicular and ovarian cancer versus 15 selected somatic malignancies [19,26]. We again found the majority of them to be expressed in at least one tumor. The identification of CT genes in ovarian and testicular cancer points to meiotic functions shared by male and female gonads (Fig. 4B).
We further investigated expression patterns of novel CT gene candidates, for which the available literature either shows testicular roles or reports critical molecular functions in somatic tumors but not both. The SEHET class gene DCAF4L1 (DDB1- and CUL4-associated factor 4-like protein 1) has no currently annotated molecular function. However, its mRNA is testis-specific in our sample set and peaks in embryonic ovary germ cell tumor and adult seminomas (compare RNA-sequencing data in Fig. 4 with GeneChip data in Fig. 5A). Interestingly, genetic variations in this locus were associated with hemangioblastoma, a benign brain tumor [42]. DCAF4L1 is detected in ovarian and testicular germ cell tumors and shows significant differential expression in 14 cancers versus normal controls, including kidney cancer and bladder cancer. Interestingly, high expression of DCAF4L1 is associated with a decreased probability of survival in kidney cancer patients and an increased probability in the case of bladder cancer (TCGA Consortium, http://timer.cistrome.org [43], Figs S3A and S4A).
Another SEHET-type gene is DMRT1 (doublesex- and mab-3-related transcription factor 1), which belongs to a highly conserved family of DNA-binding transcription factors important for development and sex differentiation. The mouse protein controls germ stem cell differentiation and the transition from mitotic growth to meiotic development in the germline (for review, see Ref. [44]). We find elevated levels of the human mRNA notably in endometrium, ovary, and breast cancer samples as well as testicular cancer (Figs 4 and 5B). The latter is in keeping with genetic data that associate DMRT1 with testicular germ cell tumor susceptibility [45,46]. Immunohistochemical data from the Human Protein Atlas (HPA) confirm that pattern in the case of breast cancer (see www.proteinatlas.org [47]). DMRT1 is upregulated in nine cancers and also shows strong signals in testicular germ cell cancers in both GeneChip and RNA-Seq datasets. This includes endometrial cancers, for which we detect a strong expression peak that corresponds to the findings reported by the TCGA consortium (Figs 5B and S3B). We note that high expression in uterine corpus endometrial carcinoma is associated with decreased survival (Fig. S4B).
Transmembrane protein 217 (TMEM217) also shows a SEHET pattern, encodes a predicted transmembrane protein, and transcriptionally responds to an antiproliferative agent [48]. The mRNA accumulates in leukemia samples in our dataset and in various somatic malignancies, including thyroid cancer, as reported by HPA (Figs 4 and 5C; www.proteinatlas.org [47]). TMEM217's function and its expression in normal and cancer tissues are currently unknown. The gene is significantly differentially expressed in 14 cancers versus controls, including kidney renal papillary cell carcinoma and thyroid carcinoma where it is induced. We note that TMEM217 is highly induced in leukemia samples assayed with GeneChips and RNA sequencing (Figs 5C and S3C) and patients diagnosed with acute myeloid leukemia (LAML) show a decreased survival rate when the gene is highly expressed (Fig. S4C).
The SET-type NLRP7 (NACHT, LRR, and PYD domain-containing protein 7) is involved in the inflammatory response and was associated with myometrial invasion in human endometrial cancer [49]. The mRNA peaks in different ovarian tumors and appears to be strongly induced in testicular cancer (Figs 4 and 5D). NLRP7 is differentially expressed in 14 tumors and is also induced in ovarian cancer and testicular germ cell cancers in particular, as in our RNA profiling datasets (Figs 5D and S3D).
Finally, the SEHET class gene synaptonemal complex protein 2 (SYCP2) shows a particularly striking CT gene pattern because it is derepressed in a variety of somatic tumors, especially breast and cervical cancer to an unusually high level (Figs 4 and 5E). The protein interacts with other components of the synaptonemal complex, which ensures the separation of homologous chromosomes during the first meiotic division in male germ cells [50]. Normal expression of SYCP2 is essential for male fertility [51]. Importantly, SYCP2 was recently reported to be a biomarker for luminal A/B breast cancer [52]. SYCP2 is differentially expressed in 11 cancers and shows much stronger RNA-Seq signals in breast and cervical cancer as compared to normal controls, which is coherent with the GeneChip data (Figs 5E and S3E).
These five cases exemplify potential CT genes that are promising candidates for functional analyses since they have been broadly associated with cell growth, differentiation, and cancer.

CT gene analysis at the protein level by tissue microarrays
We next sought to further investigate the RNA/protein profiles of new CT genes for which no direct evidence was reported in the scientific literature (referenced in PubMed) that links them to altered (benign or malignant) mitotic cell division [22]. To this end, we selected four candidates that showed promising mRNA/protein profiling patterns using our GeneChip expression data and IHC assays from HPA (www.proteinatlas.org [47]). We first employed published scRNA-sequencing data to explore their expression patterns within testicular tissue [15] (Fig. 6A). The results are consistent with broad expression in the germline (SPESP1), induction in spermatocytes and peak expression in spermatids (the core gene GALNTL5 and PDCL2), and mostly spermatid-specific expression (C11orf42) (Fig. 6B-E).
SPESP1 is an interesting gene because we classified it as specifically and highly expressed in testis (SEHET) and it encodes a membrane protein, which localizes to the sperm acrosome (see Fig. 6B for scRNA-sequencing data and Fig. 7A for GeneChip data). Infertile male patients were found to make antibodies against SPESP1, and the mouse ortholog was shown to be involved in male fertility [53][54][55]. These results suggest a similarly important function in mammalian spermiogenesis and male fertility for human SPESP1. The gene is transcribed in several malignancies, among them skin, liver, and vulval cancer, which is confirmed at the protein level for skin and liver cancer by HPA (www.proteinatlas.org; Fig. 7A). We first performed an IHC assay using an HPA antibody and confirmed the protein's presence in round and elongated spermatids, which corresponds to the gene's expression in testicular samples and enriched germ cells (Fig. 7B); see also www.proteinatlas.org [47]. Next, to analyze SPESP1's staining pattern, we employed a commercial tissue microarray (TMA; SK803a) covering 84 benign and malignant skin tumors, 12 samples from other tumors (breast, ear, fibrous tissue, parotid gland, vulva), and four normal skin samples (Fig. S5A). We observed that 67% (62/93) of the cancer samples on the TMA showed variable immunohistochemical signals, while the remaining cases were not stained (Fig. 7C and samples A7/8 and C7/8 in panel D). Unexpectedly, we also detected cytoplasmic staining of keratinocytes in four normal skin samples (Fig. 7D samples J7/8). The faint signal is also present in epidermal cells of normal skin samples published by HPA that were analyzed with the antibody we employed (HPA051040); however, a second antibody (HPA045936) does not yield a signal. This is in contrast with the acrosomal staining pattern that is similar for both antibodies (www.proteinatlas.org [47]).
Our GeneChip data and published RNA-sequencing data using skin samples do not indicate expression values for SPESP1 that are above background (www.proteinatlas.org [47]). Moreover, a mouse model lacking Spesp1 shows male infertility but no defect in any somatic tissues, including skin, which argues in favor of testis-specific roles (hence expression) [54]. It is therefore unclear whether weak signals in keratinocytes annotated as normal are true and physiologically relevant. Taken together, the data support the notion that SPESP1 is a testicular protein that accumulates in a substantial fraction of skin cancer samples.
We next analyzed GALNTL5, which encodes a membranous inactive polypeptide N-acetylgalactosaminyltransferase-like protein likely important for acrosome function and proteolysis in sperm [56]. A mutation in the gene was associated with abnormal spermatogenesis in human [57]. Our GeneChip data indicate strong expression of GALNTL5 in adult testis, enriched spermatocytes, and round spermatids (Figs 6C and 8A). To test the commercially available HPA antibody (HPA011140), we first assayed GALNTL5 on adult testicular sections and observed membrane and cytoplasmic staining in spermatocytes, nuclear staining in round spermatids, and cytoplasmic staining in Leydig cells (Fig. 8B). These signals broadly correspond to patterns displayed in HPA (www.proteinatlas.org [47]). GALNTL5's RNA/protein expression profile is thus consistent with a role in the male germline as suggested by earlier work [57,56].
GeneChip profiling data for GALNTL5 show a classical CT gene pattern in an adenocarcinoma of the endometrium. HPA's data confirm expression in endometrium at the protein level and also reveal thyroid and skin cancer samples as positively stained by IHC (Fig. 8A; www.proteinatlas.org [47]). We therefore confirmed the testicular expression of GALNTL5 using sections (Fig. 8B), and then, we employed tumor TMAs for endometrium (EM1021a, UT721) and thyroid papillary carcinoma (HThy-Pap120CS-01) for further analysis (Fig. S5B). Consistently, we detected cytoplasmic staining of variable intensity in 23/97 (24%) endometrium cancer samples. We also detected weak staining in 1/5 (20%) controls (EM1021a, Fig. 8C cancer samples G8-11 and normal controls H8-11). In a similar experiment using a different TMA, 33/83 (40%) malignant samples contained detectable levels of the protein, while 0/9 controls were stained (UT721 A1, Fig. 8D cancer samples E and F1-4). We also observed that 5/8 (63%) of the normal adjacent tissue (NAT) controls showed variable levels of staining. This again suggests that histologically normal tissues can transcribe and translate CT genes and therefore possess a molecular feature that may mark them out as (pre)malignant in spite of their normal histological appearance. Finally, we detected typically strong cytoplasmic staining in 54/58 (93%) cancerous thyroid samples, while only 14/62 (23%) of the NAT samples showed almost exclusively weak signals (HThy-Pap120CS-01; Fig. 8E cancer samples A1/B1 and A3/B3 and normal adjacent tissue A2/B2 and A4/B4). We conclude that histological and molecular data concur in more than three quarters of the cases and it appears that a substantial number of histologically normal tissues accumulate unphysiological levels of GALNTL5.
In summary, GALNTL5 is a testicular protein that strongly accumulates in a substantial fraction of endometrium and thyroid cancer samples. We note that HPA reports weak RNA expression signals in brain samples and two out of three brain sections show GALNTL5-positive neuronal cells (www.proteinatlas.org [47]). This pattern is reminiscent of the testis-brain-specific gene class, although the reliability and biological relevance of nontesticular RNA/protein expression signals for GALNTL5 remain to be determined [6]. From our current study and earlier work published by others, it emerges that RNA profiling alone is a suboptimal approach to identify testis-specific CT genes that are likely relevant for somatic cancer progression. Major caveats of transcriptomic approaches are that RNA signal intensities used to identify target genes largely depend on the technologies and data normalization methods used and that transcribed CT gene mRNAs are not necessarily translated into physiologically relevant protein levels. Moreover, most current RNA profiling data from cancer tissues yield no information about how many and which cells in the sample express the target gene. Future work using improved single-cell RNA-sequencing approaches will alleviate this critical issue. Large-scale protein profiling data obtained via IHC assays of testis, tumor, and normal somatic samples might also be a promising method complementary to RNA profiling, provided that the antibodies yield specific and reproducible results.
To further explore this protein-based approach, we selected two testis-specific genes for which we did not observe a typical CT gene pattern but that showed strong signals in cancer samples in HPA (www.proteinatlas.org [47]). PDCL2 encodes a phosducin-like testis-specific protein [58]. Human PDCL2 is induced in the male germline and expressed at the RNA and protein levels in male meiotic and postmeiotic germ cells (Fig. 6D). Furthermore, its mRNA peaks in a frequent form of testicular cancer (seminoma, Fig. 9A,B).
PDCL2 is a membrane protein that accumulates in round spermatids and in kidney cancer (www.proteinatlas.org [47]). Using an HPA antibody (HPA048260) and a TMA for renal cancer (KD1503), we found that 92/100 (92%) of malignant samples show variable levels of staining (including 16 cases for which we observed strong signals) (Fig. S5C; in Fig. 9C, compare cancer samples C7-8/D7-8 and normal adjacent tissues C9/D9). While 31/50 (62%) of the NAT samples were positive, only two samples displayed strong staining (Figs S5 and 9C). This demonstrates that PDCL2 protein signals for renal cancer are reproducible using custom-made and commercial TMAs [26]. The results also reveal that a substantial proportion of NAT samples appear to accumulate low levels of PDCL2. Since healthy kidney samples displayed no PDCL2 staining on our custom-made TMAs, while testicular germ cells are clearly marked, it is conceivable that some histologically normal tissues might already be precancerous at the molecular level (Fig. 9D testis sample D6 and kidney samples A-C5). It is noteworthy that high expression of PDCL2 appears to be associated with decreased survival of kidney cancer patients (Fig. S6A).
C11orf42 mRNA is moderately expressed in adult testis and upregulated in enriched round spermatids; although it is likely expressed in endometrium adenocarcinoma, its RNA profile in control tissues does not mark it out as a bona fide CT gene (Figs 6E and 10A,B). The protein is annotated as testis-specific, and C11orf42 is detected in lung and thyroid cancer (www.proteinatlas.org [47,26]). Consistently, we observed cytoplasmic signals for C11orf42 in 3/40 (8%) thyroid cancer samples on a commercial TMA (TH481), while 0/8 of the normal controls showed cellular staining (Figs S5D and 10C and cancer sample A3 versus normal sample F3 in panel D).
Taken together, these results underline the robustness of combined RNA/protein-based methods to identify novel Cancer/testis genes and highlight the limits of approaches based on RNA profiling alone.

Discussion
We combined RNA/protein expression data from testis, male germ cells, normal controls, and numerous somatic cancers to identify novel CT genes suitable for biomarker discovery and mechanistic analyses in the field of molecular oncogenesis.

The difficulty of identifying Cancer/Testis-specifically expressed genes
The male gonad is a complex organ and, together with the brain, expresses the largest number of known genes among all tissues analyzed so far [59,18]. However, identifying bona fide testis-specific genes is a challenging task because somatic tissues that are used as negative controls in profiling studies are typically also composed of different cell types. When only a small subpopulation of cells in such a tissue expresses the testicular gene, its mRNAs may be diluted below the threshold level of detection, thereby yielding a false-negative somatic control sample. This can lead to results that are inconsistent with protein-based assays, especially immunohistochemistry, which detects signals in any cell population (or layer) of a given somatic organ. An analysis at the single-cell resolution level of human somatic and reproductive tissues will facilitate tackling this critical issue [60,61,15].
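The dilution effect described above can be made concrete with a small numerical sketch in Python. It rests on a deliberately simple assumption that is not part of the original analysis: the bulk signal is modeled as a linear mixture, on the non-log scale, of an expressing subpopulation and a background population. The function name and the example values are hypothetical; only the cutoff of 5.5 is taken from the text.

```python
import math

BEC = 5.5  # background expression cutoff (log2) used in the GeneChip analysis


def bulk_log2_signal(fraction_expressing, log2_expressing, log2_background):
    """Estimate the bulk log2 signal of a tissue under a linear mixing assumption.

    fraction_expressing : fraction of cells expressing the gene (0 to 1)
    log2_expressing     : log2 signal in the expressing subpopulation
    log2_background     : log2 signal in all remaining cells
    """
    if not 0.0 <= fraction_expressing <= 1.0:
        raise ValueError("fraction_expressing must lie between 0 and 1")
    # Mix on the linear scale, then convert the result back to log2.
    linear = (
        fraction_expressing * 2 ** log2_expressing
        + (1.0 - fraction_expressing) * 2 ** log2_background
    )
    return math.log2(linear)
```

Under these assumptions, a gene with a signal of 9 in 2% of the cells and a background of 4 in the remaining cells yields a bulk signal of roughly 4.7, which is below the cutoff of 5.5, so the tissue would be scored as not expressing the gene even though a subpopulation expresses it strongly.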

CT genes are promising candidates for prognostic biomarkers
CT genes represent a rich source of genes that confer oncogenic properties when abnormally expressed in somatic cancer cells [62], reviewed in Ref. [3]. A growing body of evidence links CT gene expression levels to unfavorable or favorable outcomes in the progression of a variety of somatic cancers, which underlines the clinical importance of CT genes as potential biomarkers and oncogenes [3]. Among core CT genes, we identified 31 cases for which such data revealed various prognostic outcomes, including three MAGEA family members (for more details, see www.proteinatlas.org [26]). Additional examples among the core CT genes are CMTM1 (signaling molecule) and SLC1A6 (amino acid transporter), which are unfavorable expression markers for pancreatic and urothelial cancer, while KHDRBS3 (RNA splicing) and GLUD2 (glutamate dehydrogenase) are favorable markers for kidney cancer. Interestingly, the type of prognosis appears to depend on the affected tissue, because LEMD1 (signaling molecule) is a favorable marker for ovarian cancer but an unfavorable one for pancreatic cancer (see Supplemental File S2 and Fig. S7 for the complete list of genes). Such dual-function genes have been found to be involved in cell cycle regulation; for review, see Ref. [63].

Novel CT genes may act as oncogenes or tumor suppressors
SPESP1 was associated with homologous recombination repair [64] (referenced in www.genomernai.org [65]) and binds LYN, a tyrosine protein kinase important for cell proliferation and the response to DNA damage [66] (http://thebiogrid.org [67] and www.nextprot.org [68]). Mouse SPESP1 interacts with CENPC1, a centromere-binding protein that plays a role in mitotic chromosome segregation [68,69], referenced in IntAct [70]. In summary, misexpression of SPESP1 may contribute to genetic instability and altered growth properties in somatic cancer cells.
GALNTL5 interacts with RHOU, a Rho-related GTP-binding protein implicated in cancer cell migration (reviewed in Refs [71-73]), and TP53BP1, a protein involved in double-strand break repair, response to DNA damage, and telomere dynamics [68,74]; http://thebiogrid.org [67]. This points to a potential role for GALNTL5 in cancer cell division and resistance to chemotherapy based on drugs that introduce DNA breaks, such as 5-fluorouracil and cisplatin [75].
A genome-wide RNAi screen identified C11orf42 as being important for normal mammary epithelial cell growth in vitro [76] (www.genomernai.org [65]). In light of this potential role in cell division, it is noteworthy that C11orf42 physically interacts with the protein transporter SNX5 [77] (IntAct [70]), which is expressed in the male germline (www.proteinatlas.org [47]; www.germonline.org [78]; https://rgv.genouest.org [16]). Given that SNX5 is a negative prognostic marker for liver cancer and plays a role in promoting thyroid cancer progression by stabilizing growth factor receptors, C11orf42 may contribute to these pathological processes via its interaction with SNX5 [79,26].
PDCL2 interacts with ACTRT1 and REST (RE1-silencing transcription factor) (http://thebiogrid.org [67]). ACTRT1 is associated with sporadic basal cell carcinoma [80]. Mutations in REST predispose to Wilms tumor (the most common form of childhood renal cancer), suggesting that the gene acts as a tumor suppressor in this pediatric cancer [81]. We note that high REST expression correlates with increased survival in kidney cancer, in contrast to PDCL2, which shows the opposite effect (http://timer.cistrome.org; Fig. S5A,B). This raises the intriguing possibility that PDCL2 may act as a negative regulator of a renal tumor suppressor gene via direct protein-protein interaction with REST.

Conclusions
The accumulating evidence underlines that CT gene products, which have been touted as major targets for tumor neoantigen-based immunotherapies, are also interesting from an oncogenic perspective [82]. Further mechanistic studies of testicular proteins abnormally expressed in somatic cancer cells will help gain insight into molecular oncogenic processes. Such work may therefore facilitate efforts to optimize existing treatments or even open up novel therapeutic opportunities.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article.
Fig. S1. Detection pattern of testicular proteins in somatic tissues.
Fig. S2. Core CT gene expression in testis and male germ cells.
Fig. S3. TCGA expression data.
Fig. S4. Kaplan-Meier (KM) plots for CT genes.
Fig. S5. Commercial and custom cancer TMA sample annotation.
Fig. S6. Kaplan-Meier (KM) plot for PDCL2 and REST.
Fig. S7. Gene expression/cancer prognosis matrix.
Supplemental File S1. Searchable annotation and expression data.
Supplemental File S2. Core CT gene annotation, cancer prognosis and RNA-Sequencing data.