1 Introduction

Colorectal cancer (CRC) is a major health concern worldwide. In Europe, CRC ranks second in terms of cancer incidence and mortality, after lung cancer [1]. Most often the disease is only diagnosed at a late stage, when clinical symptoms arise.

CRC develops in a multi-step manner from normal epithelium, through a pre-malignant lesion (so-called adenoma), into a malignant lesion (carcinoma), which invades surrounding tissues and eventually can spread systemically (metastasis). While pre-malignant lesions, adenomas, can easily be detected by colonoscopy, most of these adenomas have a low risk of progressing into cancer. It is estimated that this happens in about 5% of adenomas [2, 3]. Unravelling the biology of adenoma to carcinoma progression is crucial for identifying biomarkers for adenomas that are truly at high risk of progression.

By genome-wide analysis of DNA copy number changes we have shown that adenomas harbouring a focus of carcinoma (progressed adenomas), already contain in their phenotypically still benign components most chromosomal aberrations that are present in their invasive counterpart [4, 5]. This indicates that these chromosomal regions harbour genes that play a role in the adenoma-to-carcinoma progression.

Many genes have been implicated in the development of CRC. Initially, APC, KRAS, TP53 and DCC have been described in the progression model introduced by Fearon and Vogelstein [6]. More recently, by sequencing a panel of 11 colorectal cancers, Wood et al [7] identified 140 candidate cancer genes (CAN genes) mutated in CRC. As gene expression can be affected in multiple ways, including DNA copy number dosage and epigenetic silencing, studies integrating both copy number changes and point mutations [8, 9] as well as mutations and hypermethylation of promoter regions of genes [10] have enabled the discovery of additional genes and pathways relevant to CRC development. One example of a clinically relevant gene involved in CRC progression is AURKA, which is both amplified and overexpressed in the majority of CRCs [5, 11, 12] and is a target for anti-cancer drugs [13].

Microarray expression profiling is a robust technique to analyse the expression of thousands of genes simultaneously. Gene expression in CRC has been widely studied with microarrays, either comparing carcinomas to normal colon tissue [1418] or by comparing microsatellite instable with microsatellite stable CRCs [19, 20]. In addition, the mechanisms of metastasis [2125], prediction of recurrence risk of stage II and stage III CRCs [2629] and response to treatment in advanced CRC patients [30] have been investigated using expression profiling. To date, a limited number of studies have focused on the differential expression between colorectal adenomas and carcinomas and in most of them the number of samples analysed was rather limited, especially concerning the number of adenomas [3134].

After having studied differential mRNA expression of designated cancer pathways between colorectal adenomas and carcinomas [35], we set out to identify in an unbiased approach which genes show altered expression during progression from adenoma to carcinoma, using the same data set of 37 colorectal adenomas and 31 colorectal carcinomas analysed by oligonucleotide microarrays.

Gene Set Enrichment Analysis (GSEA) was used to identify altered expression of sets of genes associated with specific biological processes in order to identify the molecular mechanisms that drive colorectal adenoma to carcinoma progression.

2 Material and methods

2.1 Samples data set

Expression microarray data available from 68 snap-frozen colorectal tumour samples (37 adenomas and 31 carcinomas) prospectively collected at the VU University medical center (VUmc), Amsterdam, the Netherlands [5] were analysed in the present study. The study was carried out in accordance with the ethical guidelines of our institution concerning informed consent about the use of patient’s material after endoscopic or surgical procedures. The 68 frozen specimens corresponded to 31 females and 34 males (3 patients had multiple tumours). The mean age was 69 (range 47–89). Supplementary Table 1 shows all relevant clinical information about the samples used. Evaluation of tumour content was done by a pathologist (G.A.M.) on 4 μm haematoxylin and eosin-stained cryo sections obtained before and after (i.e. sandwich method) the tissue part from which RNA was isolated. Only cases containing at least 70% of tumour cells in both 4 μm sections were considered for further analysis [5]. All expression microarray data are available at Gene Expression Omnibus (GEO) http://www.ncbi.nlm.nih.gov/geo/ [36], accession number GSE8067.

2.2 Microarray data analysis

As all hybridisations were performed dual channel using a common reference, all comparisons were relative between colorectal adenomas and carcinomas.

Supervised analysis for comparing carcinomas to adenomas was done using the Wilcoxon signed rank test. Genes were considered to be differentially expressed in a genome-wide setting when the p-value was less than 1e-5 (False discovery rate; FDR < 0.1).

2.3 Prediction Analysis for Microarrays (PAM)

PAM [37] was used to classify our samples based on the gene expression signatures. Leave One Out Cross Validation (LOOCV) was used to evaluate the classification accuracy of the entire data set. The cross validation process was run 100 times. This strategy was also applied on an independent micoarray expression data set (GEO database accession number GSE20916) [34] containing 45 adenomas and 36 carcinomas.

2.4 Gene Set Enrichment Analysis (GSEA)

Association of the differential expression with biological processes was performed using Gene Set Enrichment Analysis (GSEA v2.0, http://www.broad.mit.edu/gsea/) [38]. GSEA makes use of the fact that changes in biological characteristics require coordinate variation in expression of groups of genes, i.e. gene-sets, which regulate biological activity. While absolute changes in expression levels of the majority of individual genes are often modest to small and do not reach statistical significance, GSEA allows to estimate group-wise variation in expression of pre-defined gene-sets, indicative for pathway activity levels. In brief, GSEA performs a competitive analysis of pre-defined gene-sets, whereby all genes analysed by expression arrays are ranked according to their differential expression between two sample groups (i.e. colorectal adenomas and carcinomas). GSEA calculates a pathway Enrichment Score (ES) that estimates whether genes from pre-defined gene-sets are enriched among the highest- (or lowest-) ranked genes or whether they are distributed randomly. Default settings were used. Thresholds for significance were determined by permutation analysis (1000 permutations). False Discovery Rate (FDR), i.e. correction for multiple testing, q-values < 0.1 were considered significant.

For GSEA analysis, a total of 3,732 pre-defined gene-sets were available. We analysed the data using three different types of gene-sets, namely ontology gene-sets (n = 1454; of which 1,057 were selected using the default setting -microarray results for at least 15 genes per gene-set), curated gene-sets (n = 1892; of which 1,434 were selected using the default setting) and gene-sets organized by chromosomal position (n = 386; of which 257 were selected using the default setting).

2.5 Identification of key genes for pathway activity

Individual genes from cancer-related gene-sets for which mRNA expression levels differed most between colorectal adenomas and CRCs, were considered key genes for pathway activity. For individual genes within cancer related gene-sets that contributed to the enrichment score, Wilcoxon-ranking test p-values < 1e-5, which in our setting implied FDR < 0.1, of the differential expression at the single gene level were considered significant.

3 Results

Microarray expression analysis was performed on snap-frozen samples from 37 adenomas and 31 carcinomas.

Expression of 248 genes were significantly different between adenomas and carcinomas (Wilcoxon rank test, p < 1e-5), after the analysis of the whole data set (28830 probes). Of these, 96 were significantly upregulated in carcinomas compared to adenomas and 152 were significantly downregulated in carcinomas compared to adenomas. The top 20 of both upregulated and downregulated genes are shown in Table 1. A complete list of the differentially expressed genes, upregulated and downregulated in carcinomas is presented in Supplementary Tables 2 and 3, respectively. The most upregulated genes in carcinomas in comparison to adenomas were Spondin 2 (SPON2) and Regulator of G-Protein signalling 16 (RGS16), located at chromosome arms 4p16.3 and at 1q25.3, respectively. The most downregulated genes in carcinomas in comparison to adenomas were family with sequence similarity 55, member D (FAM55D) and protein kinase, cAMP-dependent, catalytic, beta (PRKACB), located at chromosome arms 11q23.1 and 1p36.1, respectively.

Table 1 Top 20 significantly upregulated and downregulated genes in colorectal carcinomas in comparison with adenomas

PAM analysis using LOOCV to accurately classify the tumours as adenomas or carcinomas, on the whole gene data set, revealed a set of eight genes that were frequently selected in the 100 runs (Table 2). All eight genes were included in the 248 differentially expressed genes, namely the top upregulated (SPON2) and the top downregulated (FAM55D) genes.

Table 2 Frequently selected features (in 100 runs) with Leave One Out Cross Validation (LOOCV) applied to the whole data set

Next, we wanted to check if the expression of the differentially expressed genes (n = 248 genes) would be accurate enough to classify adenomas and carcinomas in an independent adenomas and carcinomas expression dataset. We used the expression dataset of Skrzypczak and collaborators [34], from which we were able to match 169 genes out of 248. From this analysis, we observed that this set of genes was able to classify adenomas and carcinomas with 71% accuracy (Table 3). The same analysis using only the set of upregulated genes (n = 96) was performed, as these are suitable candidates to be used as biomarkers in clinical practice. From this analysis, we observed that this set of upregulated genes, from which we were able to match 71 genes out of 96, was able to classify adenomas and carcinomas with 89% accuracy (Table 4).

Table 3 Average classification percentage (100 runs) of the 169 matched genes (out of 248 differentially expressed genes) on the independent expression data set from Skrzypczak et al. [34] containing 45 adenomas and 36 carcinomas
Table 4 Average classification percentage (100 runs) of the 71 matched genes (out of 96 upregulated genes) on the independent expression data set from Skrzypczak et al. [34] containing 45 adenomas and 36 carcinomas

To compare mRNA expression between colorectal adenomas and carcinomas at the pathway level, we performed gene set enrichment analysis (GSEA). Data were analysed using three different types of gene-sets; i.e. ontology gene-sets (gene-sets organised by biological process, molecular function and cellular component), curated gene-sets (gene-sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts) and gene-sets organised by chromosomal position (gene-sets corresponding to each human chromosome and each cytogenetic band that has at least one gene).

Within these biological processes, pathways or locations, those genes whose expression was also significantly different between adenomas and carcinomas at an individual level were considered as biomarkers or key genes reflecting the changes occurring within these pathways during progression from adenoma to carcinoma.

Of the 1,057 ontology gene-sets analysed, twenty one were significantly differentially expressed between adenomas and carcinomas (FDR < 0.1) (Table 5) These were all upregulated in carcinomas compared to adenomas and most were involved in chromosome binding, chromosome segregation, mitosis or cell cycle. Key genes recurrently observed within these pathways are Structural maintenance of chromosomes 1A (SMC1A), Aurora kinase A (AURKA), TPX2, microtubule-associated, homolog (Xenopus laevis) (TPX2), polo-like kinase 1(PLK1), Sjogren syndrome/scleroderma autoantigen 1 (SSSCA1), and Polymerase (DNA directed), delta 1 (POLD1).

Table 5 Key genes in biological processes associated with colorectal adenoma to carcinoma progression (FDR < 0.10)a

Of the 1,434 curated gene-sets analysed, two gene-sets (progeria and old age) showed significantly increased expression in CRCs compared to adenomas and one gene-set (fatty acid metabolism) showed significantly decreased expression (FDR < 0.1) (Table 5).

In the progeria and old age pathways, prostaglandin-endoperoxide synthase 2 (PTGS2 or COX2) was significantly overexpressed in carcinomas in relation to adenomas (p < 1.7e-6) and therefore considered a key gene for these pathways. PLK1 was a significantly upregulated gene (p < 3.7e-7) in the old age pathway and therefore considered as well as key gene in this pathway (Table 5). Alcohol dehydrogenase 1 C (ADH1C) was significantly downregulated in carcinomas in relation to adenomas (p < 4.1e-6) and therefore considered as a key gene for the fatty acid metabolism pathway (Table 5).

Chromosomal instability, reflected in gross chromosomal copy number and structural changes, plays an important role in CRC progression. For this reason we also looked at gene-sets organised by genomic positions. From a total of 257 gene-sets based on genomic positions, those that were differentially expressed between adenomas and carcinomas were selected. Considering a cut-off value of FDR < 0.1, the gene-sets on 20q13 and 20q11 were significantly upregulated in carcinomas compared to adenomas while the gene set on 4q22 was significantly downregulated (Table 5). Key genes (p < 1e-5) within the 20q11 gene set are Chromosome 20 open reading frame 24 (C20orf24) and TPX2. AURKA, Bone morphogenetic protein 7 (BMP7), Chromosome 20 open reading frame 20 (C20orf20), Chromosome 20 open reading frame 11 (C20orf11), breast carcinoma amplified sequence 4 (BCAS4) and RNA binding motif protein 38 (RBM38) are key genes in the 20q13 position set. In the gene set on position 4q22 no individual key gene was significantly differentially expressed between carcinomas and adenomas (Table 5).

4 Discussion

In the present study we compared genome-wide gene expression in a collection of 37 colorectal adenomas and 31 carcinomas. Previously, this data set has been studied for 20q associated candidate oncogenes as well as for exploring the involvement of carcinogenic pathways in colorectal adenoma to carcinoma progression [5, 35]. The present study aimed to unravel, in an unbiased way, which genes show altered expression in the progression from adenoma to carcinoma, and could thus play a role in CRC progression. In addition, the effect of this altered expression on tumour biology was evaluated using GSEA pathway analysis with three different types of predefined gene-sets, i.e. ontology gene-sets, curated gene-sets, and gene-sets organized by chromosomal position.

4.1 Differentially expressed genes

mRNA expression was significantly different between adenomas and carcinomas (p < 1e-5) for 248 genes, of which 96 were upregulated and 152 downregulated in carcinomas compared to adenomas. Of these, some have previously been found to be upregulated in colon cancer, like, Frizzled-related protein 4 (SFRP4) [39], AURKA [40] Bone morphogenetic protein 7 (BMP7) [41] and Centromere protein H (CENPH) [42] or downregulated in cancer like, prostaglandin E receptor 4 (PTGER4) [43] and inhibitor of DNA binding 4 (ID4) [44], supporting the validity of the arrays and analyses procedures used in the present study. However, most genes were not been described in previous expression microarray studies comparing colorectal adenomas to carcinomas [3134]. This discrepancy could be explained by the fact that these studies used different platforms (either Affymetrix arrays or cDNA microarrays) and different strategies for gene selection compared to the present study. Nonetheless, 8 of the 248 differentially expressed genes observed in the present study, including 2 of the top downregulated genes, are also described to be differentially expressed through CRC progression by Skrzypczak and collaborators [34].

The most upregulated genes in carcinomas in comparison to adenomas were SPON2 and RGS16. SPON2 encodes for an adhesion molecule present in the extra cellular matrix. RGS16 encodes for a protein that inhibits signal transduction by increasing the GTPase activity of G protein alpha subunits. The most downregulated genes in carcinomas in comparison to adenomas were FAM55D and PRKACB. Not much is known about FAM55D while PRKACB is involved in the regulation of several cellular processes like proliferation, cell cycle and microtubule dynamics.

Of the top 20 most upregulated genes, 12 have already been found to be overexpressed in epithelial cancers namely, colon, prostate or breast cancer, i.e. SPON2 [45], RGS16 [46], SFRP4 [39], AURKA [5, 40] TIE1 [47], CTHRC1 [48], BMP7 [41], C20orf20 [5, 49], ISG15 [50], CCL21 [51], TCFL5 [5, 52] and CENPH [42]. Of the top 20 most downregulated genes, five have been associated with epithelial cancer before (including colon cancer), i.e. ENTPD5 [53], PTGER4 [43], CACNA2D2 [54], ID4 [44] and UGT1A6 [55]. Downregulation of UGT1A6 as well as PRKACB genes was also observed by others [34]. In addition, ten of the genes differentially expressed between adenomas and carcinomas (TPX2, TCFL5, SLC26A10 and CENPH upregulated and SPTBN2, SOCS6, SLITRK6, HR, ETFDH and APOB48R downregulated) have previously been described to be mutated in CRC [7]. This overlap with existing data corroborated the validity of the approach that was used.

Some of these genes have already been quite extensively studied and in the case of AURKA even used as target for anticancer therapy. Very recently, in a census of amplified and overexpressed human cancer genes, AURKA was described as one of the genes for which good evidence of involvement in the development of human cancer was shown [56]. In CRC in particular, AURKA was shown to be amplified [11] and overexpressed [40]. We also have shown that overexpression of AURKA in CRC was correlated with the copy number increase of 20q13.2 genomic region and was implicated in the chromosomal instability-related adenoma to carcinoma progression [5]. AURKA has a role in centrosome duplication as well as proper chromosome alignment and segregation and its abnormal expression leads to centrosome anomalies and chromosomal instability [57]. Moreover, overexpression of AURKA induces malignant transformation [58]. Another well studied gene is BMP7. It has been associated with e.g. lung, prostate, breast, cervical and colorectal cancer [41, 5962]. BMP7 can act as suppressor of tumour development by e.g. inhibiting cellular motility in lung cancer cells [59] or promoting growth arrest by repressing telomerase activity in cervical cancer cells [62]. In colorectal cancer, BMP7 has been shown to promote invasive potential [41] and to be associated with liver metastasis and poor prognosis of CRC patients [63].

The PAM analysis revealed a set of 8 genes (all included in the 248 differentially expressed genes), which helped the classifier to achieve best performance. Moreover, when looking at the performance of differentially expressed genes in an independent expression data set, we observed that these genes classified adenomas and carcinomas with an accuracy of 71%, which was improved to 89%, if using only the upregulated genes. This result is very promising, given the fact that upregulated genes are suitable to be used as biomarkers in a clinical setting.

4.2 Upregulated pathways

Analysis of the differences in gene expression between adenomas and carcinomas at the level of specific biological processes showed that ageing (defined by the curated gene-sets progeria and old age), ontology based gene-sets associated with chromosomal instability and the chromosomal position gene-sets 20q11 and 20q13 were upregulated in colon adenoma to carcinoma progression. Not surprisingly, several of the differentially expressed genes discussed above emerged as key genes in differentially expressed pathways.

The notion that cancer is associated with ageing is not new, as in both conditions DNA damage plays a crucial role [64]. One of the cellular features associated with ageing is cellular senescence, a process by which cells enter a state of permanent cell cycle arrest [65]. DNA alterations associated with early neoplasia, including mutations in oncogenes like KRAS can promote senescence and may thus contribute to the oncogene-induced senescence (OIS) [66, 67]. Genes that provoke bypass of OIS will drive oncogenic-mediated transformation enhancing tumor progression [68].

Senescence is a potent tumour suppressive mechanism but it can also contribute to malignant progression, as senescent cells secrete pro-inflammatory cytokines which may be harmful for the surrounding microenvironment [65, 69]. Interestingly, in the present study one of the key genes in the activity of both ageing-related gene-sets is PTGS2 (COX-2). PTGS2 is responsible for the biosynthesis of prostanoids involved in inflammation and mitogenesis and it is known to be overexpressed in colorectal cancer [70].

Analysis of the ontology gene-sets revealed that the genes contributing most to the difference between adenomas and carcinomas were involved in chromosome binding, chromosome segregation, mitosis and cell cycle. These functions are frequently disrupted in chromosomal instable tumours [71]. Interestingly, these genes overlap with upregulation of key genes within the gene-sets located on 20q (20q11 and 20q13). This result is consistent with the observation that gain of the long arm of chromosome 20 is the most frequent chromosomal event in colon cancer. Moreover, we have previously reported on an association between 20q gain and progression from colorectal adenoma to carcinoma [4, 5]. Within these locations eight key genes were identified, expression of which contributed most to the difference between adenomas and carcinomas. Among those, four have previously been described by us to have a putative role in progression, namely C20orf24, AURKA, C20orf20 and RBM38 [5]. The other four were BMP7, TPX2, BCAS4 and C20orf11.

Another gene ontology set that was overexpressed in carcinomas in comparison to adenomas was the collagen gene set. This emphasizes the role of the stroma in the progression from premalignant adenomas to invasive cancer, in which the tumor stroma response, which involved proliferation of fibroblasts, is a key factor. We previously showed that gain of 8q24 was associated with increased stroma percentage in CRCs with 20q gain, which is a key step in adenoma to carcinoma progression [72].

4.3 Downregulated pathways

Fatty acid metabolism (curated set) and the chromosome position set 4q22 were the only gene-sets that were downregulated in carcinomas when compared to adenomas.

The role of fatty acid metabolism in cancer development is rather complex. On one hand, fatty acid metabolism is involved in cancer development and cell growth for the necessary synthesis of e.g. cell and organelle membranes [73]. On the other hand, fatty acid metabolism, through beta oxidation in complex interaction with glucose metabolism and hypoxia, plays a role in the energy supply for the cancer cells [74]. These metabolic changes are stimulated or inhibited by several oncogenes or tumour suppressor genes respectively. These genes are known to be involved in proliferative and survival signalling pathways [75].

One of the top genes contributing to the differential expression of the fatty acid metabolism pathway is the alcohol dehydrogenase ADH1C, which is also significantly downregulated in colorectal carcinomas compared to adenomas at the single gene level and therefore considered a key gene in the activity of this pathway. Alcohol dehydrogenase produces carcinogenic acetaldehyde through ethanol oxidation. Polymorphisms in ADH1C have been associated with CRC risk in combination with alcohol consumption [76]. The exact role of ADH1C in adenoma to carcinoma progression, however, is not immediately obvious from these data.

GSEA analysis revealed the chromosome 4q22 gene-set as significantly downregulated in colorectal carcinomas compared to adenomas. However, in the 4q22 gene-set no key gene (significantly downregulated in carcinomas compared to adenomas) was identified and no 4q22 genes are in the top 248 list of differentially expressed genes. This indicates that rather subtle changes at the individual gene level may contribute to the difference observed for the overall set of genes. Deletions of 4q occur in colorectal carcinomas but not in adenomas [5, 77] and no significant gene dosage effects have been identified for genes at this locus, which could mean that other mechanisms like mutations and/or epigenetic regulation are involved. Nonetheless, we have shown that 4q22.1–4q35.2 deletion is associated with a worse outcome in stage II colon cancer patients [78].

4.4 ‘Vogelgram’ genes

In the present study, no significant differences in mRNA expression between adenomas and carcinomas were found for genes traditionally attributed to CRC development like APC, KRAS, TP53 and DCC [6]. It is not surprising that initial events like altered expression of APC or KRAS do not appear in this analysis because expression of these genes is already altered in many adenomas. On the other hand, to some extent this may be due to technical reasons. When base line expression levels of genes are not very high, loss of expression would result in only a relatively minor signal that could remain undetected.

In summary, whole genome mRNA expression patterns of colorectal adenomas and carcinomas were compared and a set of 248 genes were significantly up or downregulated in carcinomas compared to adenomas, suggesting a functional role of these genes in the progression from colorectal adenoma to carcinoma. Gene-sets at chromosomes 4q22 and 20q and biological processes like ageing (which is associated with senescence), chromosomal instability and fatty acid metabolism were differentially expressed in colorectal adenoma to carcinoma progression.

These data are consistent with the notion that adenomas and carcinomas are distinct biological entities. Disruption of specific biological processes like senescence, maintenance of chromosomal instability and altered metabolism are key factors in the progression from adenoma to carcinoma.