Mining disease genes using integrated protein–protein interaction and gene–gene co-regulation information

Highlights • An eQTL-based gene–gene co-regulation network was constructed.• We adopted a random walk with restart (RWR) algorithm to mine for Alzheimer-disease related genes.• The integrated HPRD PPI and GGCRN network had faster convergence than using HPRD PPI alone.• The integrated network also revealed new disease-related genes.


Introduction
In humans, despite the rapid increase in the discovery of disease-associated genes, the molecular basis of many diseases is still known. Even for diseases for which the molecular basis is partially understood, a large proportion of the associated genes are still unknown. The known disease-associated genes have been reported to represent only a very small proportion of the actual number of disease-associated genes [1,2]. Hence, mining for disease genes remains important.
Expression quantitative trait loci (eQTLs) analyses of DNA use hundreds of thousands of single-nucleotide polymorphism (SNP) markers that capture human genetic variation [22]. This strategy has been successfully applied to several diseases, such as celiac disease [23], asthma [24] and type 2 diabetes [25]. An eQTL is a locus that regulates a gene expression phenotype [26]. If two genes are regulated by one or more of the same SNPs, they are considered to be co-regulated. Obviously, this co-regulation is only one type of gene interactions. We constructed a gene-gene co-regulation network (GGCRN) using eQTL data and believe that it will be useful for disease gene mining.
In this study, we developed a GGCRN, integrated it with the PPI network, and used the RWR method to mine for candidate disease genes. Using Alzheimer disease (AD) as an example, we demonstrated that this newly developed GGCRN is an effective resource for disease gene mining.

Protein-protein interaction data
The Human Protein Reference Database (HPRD) describes interaction networks in the human proteome [27]. All information in the HPRD has been manually extracted from the literature by expert biologists who read, interpreted and analyzed the published data. For this study, we used HPRD, release 9, which contains 38,989 protein-protein interactions among 9605 proteins.

EQTL data
We used human brain tissue data for this disease gene mining study. The data were obtained from a series of 193 neuropathologically normal human brain samples using the Affymetrix GeneChip Human Mapping 500 K Array Set and Illumina HumanRefseq-8 Expression BeadChip platforms [28]. The eQTLs were determined by Matrix eQTL [29]. In this study, the cis-eQTL definition was a SNP within the gene body +1 Mb up/down stream of the gene body. We calculated cis-eQTLs and trans-eQTLs and performed FDR adjustment (q value < 0.1) separately; then we combined the cis-eQTLs and trans-eQTLs. Finally, we obtained 25,866 significant SNP-gene association pairs of 3709 genes. The results can be downloaded from the seeQTL database [30].

AD-related genes
Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. AD is classified as a neurodegenerative disorder, and it is associated with plaques and tangles in the brain [31]. We obtained 29 AD-related terms from OMIM [32]. After removing the terms with no approved gene symbol, we obtained 15 AD-related genes. Of these 15 genes, 14, 4 and 14 genes were present in the HPRD PPI, the GGCRN, and the HPRD PPI and GGCRN integrated network (Union network). We used the 14 genes that were present in the Union network for the subsequent analyses. See Supplemental Table 1.

Gene-gene co-regulation network construction
The human brain data included 25,866 significant SNP-gene association pairs of 3709 genes. For each gene, we first extracted the SNPs that regulate it, and we called these significant related SNPs. If a SNP regulated two genes, we called it a common SNP of the two genes. We considered two genes to be co-regulated if a specific proportion of SNPs regulated both genes. Mathematically, for any 2 genes (G i and G j ), there are n 1 and n 2 significant related SNPs, respectively. The gene-gene co-regulation coefficient is defined as Where # (A) is the element number in set A. In other words, #ðSNPs in G i \ SNPs in G j Þ is the number of common SNPs that regulate both gene G i and G j ; and #ðSNPs in G i [ SNPs in G j Þ is the number of SNPs that regulate gene G i or G j . For example, if G i and G j have 100 and 80 significant related SNPs, respectively, and 30 of them are common SNPs for G i and G j , then the co-regulation coefficient is 30/(100 + 80 À 30) = 0.2. After calculating all co-regulation coefficients for all gene pairs, a reasonable threshold value for filtering the significant co-regulated gene pairs had to be established; for this purpose, we used the clustering coefficient difference maximization method [33]. The main function of this method is the determination of the difference in the maximum clustering coefficient difference between a real network and a random network if the real network is highly credible. Finally, we obtained the GGCRN using the significant co-regulated gene pairs.

Random walk with restart algorithm
In this paper, we focused on the genetic data resources rather than on statistical methods. Therefore, we adopted a classic and efficient method. The random walk algorithm (RW) for graphs is defined as an iterative walker's random transition from its current node to a neighboring node, and this is initiated at a given source node [34,35]. The random walk with restart algorithm (RWR) [36] is a variant of the random walk that allows for the restart of the walk at every time step at source node s with probability r. Formally, the RWR is defined as: where W is the column-normalized adjacency matrix of the graph and p t is a vector in which the ith element holds the probability of being at node i at time step t. A special case is the initial probability vector, p 0 , which is the probability of being at source node, s. In our application, p 0 was constructed such that equal probabilities were assigned to the known disease genes, with the sum of the probabilities equal to 1. Genes were ranked according to the values in the steady-state probability vector p N . This was obtained by performing the iteration until the change between p t and p t+1 fell below 10 À6 . The results of RWR are affected by the restart probability, r. We perform a numerical experiment to select the proper r value.
The flow chart is illustrated in Fig. 1.

Direct neighborhood algorithm
We also compared the RWR method with the direct neighborhood (DN) method. In the DN method, the interaction partners in the network were determined for each known disease gene. The more linkages that exist between a gene and known disease genes for a particular disease, the greater the possibility that it is related to that disease. In this study, genes with more than 1 linkage with known disease genes were considered candidate disease genes.

Gene-gene co-regulation network
From the brain data, we identified 181,906 co-regulated gene pairs of 2830 genes with nonzero co-regulation coefficients. We set the threshold to 0 to 1, a distance of 0.01. We calculated the trends in the clustering coefficients, and found the first extreme point of clustering coefficient difference (the real clustering coefficient minus the background clustering coefficient, Fig. 2). As a result, we identified 25,937 gene pairs of 1444 genes under a threshold value 0.63.

Comparison between HPRD and brain co-regulation network
The comparison of the HPRD PPI and GGCRN network revealed only 4 common gene-gene pairs between 8 genes (Table 1). This indicates that the gene-gene regulation patterns in the GGCRN were different from the protein-protein interactions in the HPRD PPI network. Therefore, the GGCRN can provide new information about gene pairs.
In the following analysis, we used 3 networks, the HPRD PPI network, the GGCRN and the Union network.

Disease gene mining
In the RWR method, to obtain a reasonable restart probability, r, we performed a numerical experiment. First, we set r to 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9. When r was 0, the RWR could not reach steady state, so we added a stop condition of walking no more than 100,000 times. To mine for disease risk genes, we selected the genes with a steady-state probability greater than the initial known disease genes and considered them candidate genes. The results of this analysis are described in Table 2. We found that when r was 0, we obtained too many candidate genes and that when r was more than 0.1, we obtained too few candidate genes. Then, we set r to 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045 and 0.05. We did not know exactly how many disease genes existed; thus we defined a standard for mining risk genes of mining less than 2 times of the known initial genes. For AD, we examined 14 initial genes in the HPRD PPI and Union networks and 4 initial genes in the co-regulation network. Using the co-regulation network, we obtained 41 candidate genes with r set at 0.015 and 2 candidate genes with r set at 0.02. As there were only 4 initial genes, this was not a good result, and this is because the co-regulation network is small, only containing 1444 genes. Therefore, it is unsuitable for candidate gene mining using only the GGCRN. Then, we compared the results using the HPRD PPI and Union network. With r set at 0.01, the numbers of candidate genes obtained were still insufficient. While, with an r value larger than 0.015, we obtained more candidate genes using the HPRD PPI network than the Union network with the same restart probability. In other words, to obtain the same number of candidate genes using these two networks, a higher restart probability is needed for the HPRD PPI network than for the Union network. This indicates that more convergent results can be obtained using the Union network than the HPRD PPI network alone. Therefore, the GGCRN constructed by eQTL data is a useful resource for mining disease genes.   In the DN method, genes with more than 1 linkage with known disease genes were considered candidate disease genes. In the coregulation network, there were only 4 initial known AD genes, and no gene had more than 1 linkage with known disease genes. In the HPRD PPI and Union networks, there were 14 initial known AD genes, and there were 38 and 39 genes with more than 1 linkage with known disease genes, respectively. These results show that the co-regulation network is not suitable for gene mining by the DN method, because it is too small. Furthermore, the results of gene mining with DN method using the combination of the co-regulation network and the HPRD PPI network were almost the same as those for the HPRD PPI network alone. The detailed results of this analysis are shown in Supplemental Table 2.
For gene mining with the RWR method using the Union network we obtained 27 candidate genes with r set at 0.015, and for gene mining with the DN method using the Union network we obtained 39 genes with more than 1 linkage with known disease genes. Only 1 gene was common between these two gene sets. Even when we included all genes that neighbored known AD genes with the DN method, only 11 genes were common between these two gene sets. These results show that the RWR method tends to identify more indirect neighborhood genes, and this is consistent with the gene function pattern of biological networks: genes that interact indirectly may perform the same functions as those that interact directly. Additionally, it was previously reported that the RWR method has the best performance in terms of precision and recall compared to other neighborhood methods [20,21].

Gene annotation
To test the usefulness of the candidate genes mined by RWR, we compared them to those published in the literature. We first verified the 4 most likely AD risk genes (CP, EP300, TP53, and YWHAG, with a restart probability 0.035 in the HPRD PPI and 0.025 in the Union network). The protein encoded by CP is a metalloprotein that binds most of the copper in plasma and is involved in the peroxidation of Fe(II) transferrin to Fe(III) transferrin. Mutations in this gene cause aceruloplasminemia, which is associated with neurologic abnormalities. Some studies confirmed that ceruloplasmin is increased in patients with AD [37,38]. EP300 encodes the adenovirus E1A-associated cellular p300 transcriptional co-activator protein, which mediates cAMP-gene regulation by binding specifically to phosphorylated CREB protein and has been shown to be related to AD [39,40]. TP53 (P53) encodes a tumor suppressor protein that contains transcriptional activation, DNA binding, and oligomerization domains. TP53 is involved in neurodegenerative syndromesis, and is up-regulated in AD [41][42][43]. YWHAG has been shown to interact with RAF1 and protein kinase C and is involved in various signal transduction pathways. A gene expression analysis showed that YWHAG is down-regulated in AD [44]. These results indicate that these 4 genes are related to AD with high confidence.
To compare between the results of gene mining using the HPRD PPI and Union networks, we analyzed the 27 genes detected in the HPRD PPI with r set at 0.02 and the 27 genes detected in the Union network with r set at 0.015. The detailed results are shown in Supplemental Table 3. Twenty-two genes were common between the two networks, and they were in front of the 27 genes detected in the HPRD PPI. We then analyzed the 10 genes that were uncommon between the two networks, and we analyzed the ranks of these genes in the results. The 5 genes that only appeared in the results of the HPRD PPI network ranked from 28 to 37 in the results of the Union network, which was just below the level of significance (rank 1-27). However, the 5 genes that only appeared in the results of the Union network ranked from 66 to 547 in the results of the HPRD PPI network. Thus, the 5 genes that only appeared in the results of the Union network are new discoveries based on this gene-gene co-regulation information ( Table 3).
We verified these 5 genes by comparison with those published in the literature. RAF1 is the cellular homolog of the viral raf gene (v-raf). It is one of the physiological activators of the ERK pathway, which plays key roles in several steps of tumorigenesis, including cancer cell proliferation, migration, and invasion [45]. Raf-1 activation effectively mediates Ras-dependent signals in AD [46]. The protein encoded by MCM2 is one of the highly conserved minichromosome maintenance proteins (MCM) that are involved in the initiation of eukaryotic genome replication. Phosphorylated MCM2 (pMCM2) is markedly associated with neurofibrillary tangles, neuropil threads, and dystrophic neurites in AD [47]. RPS6KA1 encodes a member of the RSK family of serine/threonine kinases. This kinase contains 2 nonidentical kinase catalytic domains and phosphorylates various substrates, including members of the mitogen-activated kinase (MAPK) signaling pathway. Deviation from the strict control of MAPK signaling pathways has been implicated in the development of many human diseases, including AD [45]. MDM2 encodes a nuclear-localized E3 ubiquitin ligase. The encoded protein can promote tumor formation by targeting tumor suppressor proteins, such as p53, for proteasomal degradation. This gene is transcriptionally regulated by p53 [48], which is involved in neurodegenerative syndromes, and is up regulated in AD [41][42][43]. MAP3K3 (mitogen-activated protein kinase kinase kinase 3) is a 626-amino acid polypeptide. This protein directly regulates the stress-activated protein kinase (SAPK) and extracellular signal-regulated protein kinase (ERK) pathways. MAP3K3 phosphorylates RCAN1, and RCAN1-1L is overexpressed in neurons in AD patients [49]. Furthermore, two polymorphisms of the RCAN1 gene are associated with AD [50]. These results indicated that these 5 genes are related to AD.
Compared with using the HPRD PPI network alone, more disease genes can be identified by integrating the HPRD PPI and GGCRN networks. The results of this study indicate that eQTLbased gene-gene regulation information is a useful resource for disease gene mining.

Conclusion
There are certain relationships between genes that are associated with the same SNPs. The eQTL-based GGCRN developed here provides extra gene-gene interaction information than the HPRD PPI network. We integrated the GGCRN and HPRD PPI networks and used the RWR method and DN method to mine for disease gene candidates. The RWR method identified more indirect neighborhood candidate genes than the DN method. Applying the AD data, compared to using the HPRD PPI network, using the integrated network obtained better results. With the same restart probability, the integrated network provided faster convergence and identified some new disease-related genes. Table 3 The 10 candidate genes uncommon between the HPRD PPI and Union networks.

Genes
Rank in the results of HPRD   Rank in the results of Union  network   CTNNB1  23  28  RB1  24  36  TRAF2  25  30  AKT1  26  34  PIK3R1  27  37  RAF1  66  16  MCM2  423  20  RPS6KA1 476  24  MDM2  126  26  MAP3K3  547  27 Therefore, disease gene mining using an RWR approach with an integrated network (HPRD PPI and GGCRN) is an effective method. This eQTL-based GGCRN also has some disadvantages such as its coverage, the genes in this co-regulation network are not sufficient.
Author contribution statement J.L. and M.G. conceived the study. J.L., L.W. and Q.D. did most of the experiments. R.Z., X.L., C.W., Z.T., P.X. and M.Z. analyzed the data. All authors reviewed the manuscript.