Abstract
Background: Pancreatic cancer has a very high mortality rate and requires novel molecular targets for diagnosis and therapy. Genetic association studies over databases offer an attractive starting point for gene discovery. Materials and Methods: The National Center for Biotechnology Information (NCBI) Phenotype-Genotype Integrator (PheGenI) tool was queried for pancreatic cancer-associated traits. The genes associated with the trait were characterized using diverse bioinformatics tools for Genome-Wide Association (GWA), transcriptome and proteome profiles, and protein motif and domain classes. Results: Two hundred twenty-six genes were identified that had a genetic association with pancreatic cancer in the human genome. These included 25 uncharacterized open reading frames (ORFs). Bioinformatics analysis of these ORFs identified putative druggable proteins and biomarkers including enzymes, transporters and G-protein-coupled receptor signaling proteins. Secreted proteins including a neuroendocrine factor and a chemokine were identified. Five of these ORFs were non-coding RNAs. The ORF protein expression was detected in numerous body fluids, such as ascites, bile, pancreatic juice, milk, plasma, serum and saliva. Transcriptome and proteome analyses showed a correlation of mRNA and protein expression for nine ORFs. Analysis of the Catalogue of Somatic Mutations in Cancer (COSMIC) database revealed a strong correlation between copy number variations and mRNA over-expression for four ORFs. Mining of the International Cancer Genome Consortium (ICGC) database identified somatic mutations in a significant number of pancreatic cancer patients’ tumors for most of these ORFs.
The pancreatic cancer-associated ORFs were also found to be genetically associated with other neoplasms, including leukemia, malignant melanoma, neuroblastoma and prostate carcinomas, as well as other unrelated diseases and disorders, such as Alzheimer’s disease, Crohn’s disease, coronary diseases, attention deficit disorder and addiction. Conclusion: Based on Genome-Wide Association Studies (GWAS), copy number variations, somatic mutational status and correlation of gene expression in pancreatic tumors at the mRNA and protein level, expression specificity in normal tissues and detection in body fluids, six ORFs emerged as putative leads for pancreatic cancer. These six targets provide a basis for accelerated drug discovery and diagnostic marker development for pancreatic cancer.
- Alzheimer’s disease
- biomarkers
- body fluids
- clinical variations
- dark matter proteome
- druggable targets
- expression quantitative trait loci
- gene ontology
- genome-wide association studies
- leukemia and lymphoma
- malignant melanoma
- neuroblastoma
- pancreatic cancer
- phenome-genome association
- open reading frames
Pancreatic cancer is currently the fourth leading cause of cancer-related death in the United States (US) and is anticipated to become the second by 2020. Pancreatic cancer accounts for about 3% of all cancers in the US and 7% of cancer deaths. The American Cancer Society’s most recent estimates for pancreatic cancer in the US are that in 2014 about 46,420 people (23,530 men and 22,890 women) will be diagnosed with pancreatic cancer and 39,590 people will die of it. Current therapeutics for pancreatic cancer still revolve around chemotherapy (1). Thus, novel molecular targets are urgently required for this cancer type. Rational drug discovery for pancreatic cancer requires new drug targets in order to move beyond conventional chemotherapy (2-4). In addition, diagnosis of pancreatic cancer can be greatly aided by the discovery of pancreatic cancer-associated secreted proteins in body fluids.
The genetic association databases offer an attractive starting point for disease-relevant target discovery (5). The phenotypic traits associated with diseases can be readily enriched for associated genes using the Phenotype-Genotype Integrator (PheGenI) tool (6). The genes identified from the association studies can be verified for gene ontology and pathways using diverse bioinformatics tools. The expression specificity of the genes can be verified at the transcriptome and proteome level. The recent availability of human proteome expression datasets, the Human Proteome Map (HPM) (7) and the Proteomics DB (8), which encompass protein expression data from diverse body fluids, provides a valuable avenue for verifying expression relevance of the targets identified.
In the present study, using the PheGenI genetic association tool, 226 pancreatic neoplasm-associated genes were identified in the human genome. Among this list of genes, 196 were known proteins. Twenty-five previously uncharacterized open reading frames (ORFs) were identified. A large number of the human proteins are uncharacterized ORFs; these ORFs have been termed the Dark Matter of the human proteome (9-11). The Dark Matter offers an opportunity to discover new targets. Using a streamlined approach, we have recently demonstrated discovery of cancer- (12-14) and diabetes-associated proteins (15) among the uncharacterized ORFs to aid in diagnosis and therapy. A novel secreted protein ORF, termed Secreted Glycoprotein in Chromosome X (SGPX) (14), and a putative druggable calcium-binding transporter, Carcinoma Related EF-Hand Protein (CREF) (12), were discovered in these studies.
Reasoning that the pancreatic neoplasm-associated ORFs may offer a rationale for novel pancreatic cancer target discovery, detailed bioinformatics and proteomics approaches were undertaken. Results indicate a potential for druggableness in these ORFs, which encompass enzymes, transporters and receptor-binding classes of proteins. Further, ORFs with a signal peptide, including a chemokine, were identified from these efforts. The ORF expression was detected in diverse body fluids, including ascites, bile, plasma, saliva, milk and pancreatic juice. Lead ORFs were identified based on copy number variations (CNV), correlation of protein and mRNA expression and protein motif and domain analysis for druggableness. These results provide further evidence of the power of mining the human proteome for accelerated novel target discovery.
Materials and Methods
The bioinformatics and proteomics tools used in the study have been described elsewhere (12-14). The following genome-wide association tools were used: the Genetic Association Database (GAD) (5), the Catalogue of Somatic Mutations in Cancer (COSMIC) (16), the International Cancer Genome Consortium (ICGC) (https://dcc.icgc.org/), the cBioPortal (17), the Integrated Cancer Drug Discovery Platform (canSAR v2 database) (18), the Database for Annotation, Visualization and Integrated Discovery (DAVID v6.7 from the NCBI) (19), GeneALaCart (LifeMap discovery) from the GeneCards (20), the National Center for Biotechnology Information (NCBI) Phenotype-Genotype Integrator (PheGenI) (6), the Expression Quantitative Trait (eQTL) browser (21), the Database of Genomic Variants (DGV) (22), Clinical Variations (ClinVar) (23), the Oncomine database (24) and the International HapMap project (http://hapmap.ncbi.nlm.nih.gov/).
The entire GAD, Human Protein Atlas (HPA) (25) and UniGene databases were downloaded and the Excel filtering tool was used to scan for the ORFs. Batch analysis of the ORF database was performed using canSAR, the Multi Omics Protein Expression Database (MOPED) (26), the DAVID annotation tool, the Human Proteome Map, Proteomics DB, the Human Protein Reference Database (HPRD) (27), PheGenI and the eQTL browser.
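The filtering step described above could also be scripted rather than done in a spreadsheet. The following is a minimal sketch, assuming the downloaded table has already been reduced to a list of gene symbols. The regular expression and the function names `is_uncharacterized_orf` and `filter_orfs` are illustrative assumptions based on the symbol styles reported in this study (C#orf#, FAM#, KIAA#); they are not the filter actually used.

```python
import re

# Assumed naming patterns for uncharacterized ORFs, inferred from the symbols
# reported in this study (e.g. C7orf10, FAM19A5, KIAA1217). Hypothetical filter.
ORF_PATTERN = re.compile(r"^(C(\d{1,2}|X|Y)orf\d+|FAM\d+[A-Z]\d*|KIAA\d+)$")


def is_uncharacterized_orf(symbol: str) -> bool:
    """Return True if the gene symbol looks like a placeholder ORF name."""
    return ORF_PATTERN.match(symbol.strip()) is not None


def filter_orfs(symbols):
    """Keep only the symbols that match the assumed ORF naming patterns."""
    return [s for s in symbols if is_uncharacterized_orf(s)]


if __name__ == "__main__":
    # Mixed list of named genes and ORF-style symbols, for illustration only.
    example = ["C7orf10", "KRAS", "FAM19A5", "TP53", "KIAA1217", "C20orf45"]
    print(filter_orfs(example))
```

A pattern-based filter of this kind only flags placeholder names; whether a flagged gene is truly uncharacterized still has to be checked against current annotation.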
All bioinformatics mining was verified in two independent experiments. Large datasets were downloaded on two independent occasions and the outputs were verified for consistency. Only results that were statistically significant according to each tool’s requirements are reported. Prior to using a given bioinformatics tool, a series of control query sequences was tested to confirm that the tool returned the expected results.
Results
Mining the human genome for pancreatic neoplasm association. The NCBI PheGenI association tool was used to establish an initial database of genes (Figure 1). A list of 226 genes was generated, including known proteins, uncharacterized genes and non-coding RNAs (ncRNAs). These genes showed positive genetic association for 233 single nucleotide polymorphisms (SNPs). The known protein classes included enzymes, growth factors and receptors, transcription factors, apoptotic proteins, oncogenes and tumor suppressor genes, ribosomal proteins, hormones and hormone receptors.
The study further identified 25 novel uncharacterized ORFs with genetic association evidence (see Table I for details on the ORFs). These ORFs included five ncRNAs including long intergenic RNA and antisense RNAs. As these uncharacterized ORFs offer a potential for new pancreatic cancer target discovery, detailed bioinformatics and proteomics characterization of these ORFs was undertaken.
Genome-Wide Association Studies (GWAS) of the pancreatic neoplasm-associated ORFs. To establish a definitive link for these ORFs with pancreatic neoplasms, diverse GWAS tools were used. The International Cancer Genome Consortium (ICGC) database was screened for the ORFs and the number of mutations present in pancreatic cancer patient samples was identified (Table II).
Somatic mutations were identified for 24/25 ORFs in pancreatic tumors. The mutations for six of these ORFs (C1orf95, C7orf8, C7orf10, C20orf39, C20orf45 and FAM19A5) were seen in more than 10% of the patients analyzed. The largest percentage of mutations was seen for the ORFs FAM19A5 (42.86%) and C7orf10 (38.1%).
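The 10% cut-off used above amounts to a simple proportion test per ORF. The sketch below shows the arithmetic, assuming per-ORF counts of affected donors and a shared donor total are available. The donor counts and the placeholder names `ORF_A` and `ORF_B` are invented for illustration and are not taken from the ICGC data used in this study.

```python
def percent_mutated(affected: int, total: int) -> float:
    """Percentage of donors carrying at least one somatic mutation in an ORF."""
    if total <= 0:
        raise ValueError("total donors must be positive")
    return round(100.0 * affected / total, 2)


def frequently_mutated(counts, total, threshold=10.0):
    """Return ORFs mutated in more than `threshold` percent of donors."""
    result = {}
    for orf, affected in counts.items():
        pct = percent_mutated(affected, total)
        if pct > threshold:
            result[orf] = pct
    return result


if __name__ == "__main__":
    # Hypothetical counts: 9 and 2 affected donors out of 21.
    hypothetical_counts = {"ORF_A": 9, "ORF_B": 2}
    print(frequently_mutated(hypothetical_counts, total=21))
```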
The ICGC mutations were also verified by using the COSMIC database for copy number variations (CNV) and over-expression. Seven of the ORFs (C1orf94, C7orf4, C7orf10, C9orf123, C10orf67, FAM91A1 and KIAA1549) showed elevated copy number and over-expression in pancreatic tumor patients. The cBioPortal GWAS tool identified coding sequence mutations with a somatic mutation score of 1 for C10orf67 (H147N), C20orf45 (R844C) and C20orf174 (E112K). The latter two mutations may be functionally important because they change the charge of the amino acid at that position.
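The charge argument for the three substitutions above can be checked mechanically. The following is a small sketch, assuming a simplified side-chain charge table near physiological pH in which histidine is treated as neutral; that simplification, the table `SIDE_CHAIN_CHARGE` and the helper names are assumptions made for illustration, not part of the cBioPortal scoring.

```python
import re

# Assumed side-chain charges near physiological pH. Histidine is treated as
# neutral here, which is a simplification; all unlisted residues count as 0.
SIDE_CHAIN_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def parse_substitution(mutation: str):
    """Split a substitution such as 'R844C' into (reference, position, variant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"not a single amino acid substitution: {mutation!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def changes_charge(mutation: str) -> bool:
    """Return True if the substitution alters the assumed side-chain charge."""
    ref, _position, alt = parse_substitution(mutation)
    return SIDE_CHAIN_CHARGE.get(ref, 0) != SIDE_CHAIN_CHARGE.get(alt, 0)


if __name__ == "__main__":
    for mutation in ("H147N", "R844C", "E112K"):
        print(mutation, changes_charge(mutation))
```

Under these assumptions R844C removes a positive charge and E112K swaps a negative charge for a positive one, while H147N leaves the charge unchanged.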
An additional hint of clinical relevance for these ORFs was established by mining the NCBI ClinVar database. The C7orf10 was associated with glutaryl-CoA oxidase deficiency (28). The C10orf67, FAM19A5, KIAA1217 and KIAA1549 were associated with lung cancer. The ORFs FAM149A, FAM19A5, FAM91A1, KIAA0232, KIAA1217 and KIAA1549 were associated with malignant melanoma. The C1orf44 was associated with left ventricular hypertrophy. These results helped establish strong correlative evidence for the pancreatic association of the ORFs. Further, clinical relevance to other tumor types and disorders was also indicated.
Characterization of pancreatic neoplasm-associated ORFs by Gene Ontology (GO). To develop a hint of function, cell location and the processes involved, the pancreatic neoplasm-associated ORFs were characterized using the canSAR and the GeneALaCart tools by batch analysis (Figure 2). ORFs related to signaling (including insulin and G-protein coupled receptor), nucleic acid and protein binding, transporters (vesicle and proteins), extracellular proteins (including secreted), transmembrane proteins (including receptors), cytoplasmic and mitochondrial proteins, developmental proteins (brain and embryo) and regulators (immune and neuronal) were identified by GO analysis.
Protein expression analysis in body fluids. The availability of diverse proteomics databases for expression studies, such as the MOPED (26), the Human Proteome Map (7), the Proteomics DB (8), the HPRD (27) and the HPA (25), enabled gene expression analysis of the pancreatic neoplasm-associated ORFs in diverse body fluids (Figure 3). In HPA tissue microarray samples, 18/20 protein-coding ORFs were detected in pancreatic tumors by immunohistochemistry. ORF expression was detected in ascites (C7orf49, FAM91A1), bile (KIAA1217), serum (KIAA1217), plasma (KIAA1217, KIAA1549, C1orf94), saliva (C7orf43), proximal fluid (C1orf39, C7orf8, FAM84B, FAM91A1, C14orf180), pancreatic juice (C7orf49) and milk (FAM84B, FAM91A1). In addition, membrane and nuclear localization was predicted for some of the ORFs.
The Human Proteome Map analysis helped establish a degree of selectivity in normal tissues. The C1orf94 and FAM19A5 expression was detected only in fetal and not in adult tissues. The C7orf9 expression was restricted to adult retina. The C10orf67, C11orf44 and C14orf180 expression was not detected in any tissue analyzed. The C20orf39 expression was restricted to adult frontal cortex. The C10orf84 expression was highly restricted to testis (fetal, adult) and B-cells. The C20orf174 expression was highly enriched in CD4+ T cells. None of the above-mentioned ORFs were detected in adult pancreas. The only ORFs detected in the normal adult pancreas were C1orf9, FAM84B, FAM91A1, KIAA0232 and KIAA1217. The lack of expression of these ORFs in diverse normal human tissues suggests a high level of specificity to the pancreatic neoplasms.
Protein classes of the pancreatic neoplasm-associated ORFs. The ORFs were next characterized for druggableness and secreted biomarker potential using diverse protein motif and domain analysis tools including the Meta Analysis tools GeneALaCart, the DAVID functional annotation tool, the Protein Family (Pfam) (29), the InterPro (30), the Signal P (31), the SMART domain analysis tool (32), the Prosite (33) and HPRD (27). A summary of these results is shown in Table III. Putative druggable proteins (enzymes, transporters, receptors, protein binding and transmembrane proteins) and biomarkers (secreted proteins and chemokines), signaling (hormonal, insulin), neuropeptide and transcription factor classes were predicted for the ORFs. Five of the ORFs belong to the ncRNA class including long intergenic RNA (linc RNA) and antisense RNAs.
Correlation of gene expression: mRNA versus protein. The ORF expression in pancreatic tumors was investigated using the HPRD (27), the Oncomine Microarray Database (34), the HPA (34), the MOPED (26) and the COSMIC database (16) (Table IV). Expression correlation at mRNA and protein levels was seen for 6/21 ORFs. These included ORFs with up-regulated gene expression (C1orf94, C7orf10, FAM19A5 and KIAA1217) and down-regulated gene expression (C7orf8, C20orf45). Analysis of the COSMIC database revealed a strong correlation between copy number variations and mRNA over-expression for four ORFs (C1orf94, C7orf10, C10orf67 and FAM91A1). These results are consistent with several findings that show limited correlation of gene expression between mRNA and protein (35-38).
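The mRNA versus protein comparison above reduces to checking whether the direction of change agrees at both levels. The sketch below illustrates that check, assuming each ORF has a simple "up" or "down" call per level. The six concordant calls follow the ORFs named above; the entry `ORF_X` is a hypothetical discordant example, and the function name `concordance` is an assumption for illustration rather than a step from any of the databases used.

```python
def concordance(mrna_calls, protein_calls):
    """Split ORFs measured at both levels into concordant and discordant lists."""
    shared = sorted(set(mrna_calls) & set(protein_calls))
    concordant = [orf for orf in shared if mrna_calls[orf] == protein_calls[orf]]
    discordant = [orf for orf in shared if mrna_calls[orf] != protein_calls[orf]]
    return concordant, discordant


if __name__ == "__main__":
    # Direction calls as described in the text; ORF_X is hypothetical.
    mrna = {
        "C1orf94": "up", "C7orf10": "up", "FAM19A5": "up", "KIAA1217": "up",
        "C7orf8": "down", "C20orf45": "down", "ORF_X": "up",
    }
    protein = dict(mrna, ORF_X="down")  # same calls except the hypothetical ORF
    agree, disagree = concordance(mrna, protein)
    print(len(agree), "concordant;", disagree, "discordant")
```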
Pancreatic ORFs in other diseases. The PheGenI tool allowed for identification of pancreatic neoplasm-associated traits in other diseases (Table V). Association was seen for other neoplasms (neuroblastoma, leukemia and lymphoma and prostate cancer); psychiatry (attention deficit disorder, neuropsychological tests); cardiac (coronary artery disease, echocardiography, left ventricular hypertrophy, heart failure); Alzheimer’s disease and amyloid beta peptides; metabolic (cholesterol, potassium, ferritin, body mass, height and index, vitamin D, fibrinogen, breath tests) and insulin. These results demonstrate the involvement of the pancreatic neoplasm-associated ORFs in a complex landscape of diseases.
Lead ORFs associated with pancreatic neoplasms. Based on genome-wide association, mutational prevalence and expression correlation in pancreatic tumors, six ORFs emerged as putative lead ORFs (Table VI). These included two secreted products (a neuroendocrine secretory protein, C20orf45/GNAS, and a chemokine, FAM19A5) and an enzyme (C7orf10/SUGCT). Further, two ORFs belonging to protein-binding classes (C1orf94 and C7orf8/CTTNBP2) might offer druggableness. The KIAA1217, a Sickle tail protein homolog, is a serine- and proline-rich protein involved in actin binding.
Discussion
We have recently embarked on a systematic approach to deciphering the Dark Matter proteome of the human genome for accelerated drug discovery and diagnostic marker development (12). Utilizing GWAS, proteomics and transcriptome approaches, we have identified a fingerprint of cancer and diabetes-associated ORFs (12-15). In the present study, an effort was made to identify novel molecular targets for pancreatic cancer. In view of the relative paucity of druggable targets and diagnostic markers for pancreatic cancer, which is rarely diagnosed at an early stage and has a very high mortality rate, new targets are urgently needed for this cancer (2-4).
Disease target discovery is greatly aided by the availability of Phenome to Genome information mining from the human genome (6). An advantage of target discovery using this approach is the ability to amass the genetic association evidence for the predicted genes. Using the NCBI PheGenI association tool, an initial list of pancreatic neoplasm-associated genes was predicted. Among the 226 genes found to be associated with pancreatic cancer, 196 known proteins (encompassing oncogenes, tumor suppressor genes, growth factors and receptors, enzymes and adhesion molecules), 20 previously uncharacterized protein-coding ORFs and 5 ncRNAs were identified.
Comprehensive profiling of these twenty ORFs enabled the identification of six putative lead ORFs for drug discovery, as well as biomarker development.
The C7orf10, succinyl-CoA:glutarate-CoA transferase, is a likely druggable target. Mutations in this gene are associated with glutaric aciduria type III (28, 39).
Inhibition of 3-hydroxy-3-methylglutaryl-coenzyme A (HMG-CoA) reductase by fluvastatin and lovastatin has been shown to reduce human pancreatic cancer cell invasion and metastasis (40). The novel association of the C7orf10 with pancreatic neoplasms identified in this study provides a basis for drug discovery. The C7orf10 is also associated with other diseases and traits, including precursor cell leukemia and lymphoma, prostate neoplasm, coronary artery disease and cardiomegaly.
A novel protein, family with sequence similarity 19 (chemokine (C-C motif)-like), member A5 (FAM19A5), also emerged from this study. FAM19A5 is a member of the TAFA family, which encodes small secreted proteins. These proteins contain conserved cysteine residues at fixed positions and are distantly related to MIP-1alpha, a member of the CC-chemokine family (41). The TAFA proteins are predominantly expressed in specific regions of the brain and are postulated to function as brain-specific chemokines or neurokines that act as regulators of immune and nervous cells. The FAM19A5 was found to be one of the five loci associated with pancreatic cancer in a Chinese population (42). CC chemokines have been extensively implicated in pancreatic neoplasm (43). In the ICGC database, somatic mutations in FAM19A5 were seen in 42.86% of the pancreatic cancer cases. Eventual identification of the cognate receptor for the FAM19A5 might lead to a drug therapy potential. Additional targets identified herein, C1orf94 (Topoisomerase II protein binding), C7orf8 (Cortactin-Binding Protein 2 (CTTNBP2)), C20orf45 (Guanine Nucleotide Binding Protein (G Protein), Alpha Stimulating Activity Polypeptide 1 (GNAS)) and KIAA1217, a skeletal developmental actin-binding protein, expand the scope of druggable targets discovered in this study.
In summary, the results presented in this study demonstrate the feasibility of mining the human proteome from genetic association to target discovery for pancreatic neoplasms. The Phenome to Genome approach offers a rational basis for accelerated target discovery for drug therapy and diagnostic use.
Acknowledgments
This work was supported, in part, by the Genomics of Cancer Fund, Florida Atlantic University Foundation. I thank Dr. Stein of the GeneCards team for generous permission to use the powerful GeneALaCart tool; Dr. Montague, Kolker Laboratory of the MOPED Team, for batch analysis of the ORFs; and the Human Proteome Map and Proteomics DB for the datasets. I thank Jeanine Narayanan for editorial assistance.
Footnotes
- Conflicts of Interest: None.
- Data Availability: The detailed data of this study as a supplemental table is available upon request.
- Received November 8, 2014.
- Revision received November 13, 2014.
- Accepted November 17, 2014.
- Copyright© 2015, International Institute of Anticancer Research (Dr. John G. Delinasios), All rights reserved