Main

Gene-expression studies attempt to extrapolate biologically and clinically relevant hypotheses from gene expression patterns. However, many current studies make little use of existing knowledge such as gene function within specific pathways, and prognostic signatures are often derived with no reference to the functional roles of their components.

One increasingly popular method that aims to make use of prior knowledge is gene set enrichment analysis (GSEA) (Subramanian et al, 2005). It first conducts a supervised analysis by ranking genes according to their ability to discriminate between different sample groups, and then maps them onto previously defined gene sets, typically formed according to common function using annotation sources. The goal is to identify sets containing a statistically significant number of highly ranked genes, and then to use this information to provide functional characterisations for the samples in question. Although powerful, GSEA relies on stratification of the experimental samples into distinct groups, often making it unsuitable for use with heterogeneous clinical data sets.

Another approach often applied to microarray data involves creation of a co-expression network within which each ‘node’ represents a gene, and ‘edges’ are created between genes when their expression patterns are significantly correlated. Co-expression networks have been used to formulate functional and clinical hypotheses from in vivo data (Butte and Kohane, 2003; Hahn and Kern, 2005; Wolfe et al, 2005). A disadvantage with the approach is that it can be susceptible to the multiple testing issues that arise due to the large number of genes represented on a typical microarray. Setting a low threshold for a significant correlation between genes will result in the inclusion of many spurious links, whereas a high threshold will control the false-positive rate at the expense of omitting many genuine edges.

Here we illustrate and validate a network-based approach with parallels to both GSEA and co-expression networks; for a workflow of the method see Supplementary Material and Methods. It can be applied directly to clinical data, even when the samples cannot be partitioned in advance into distinct groups. The algorithm begins with a collection of ‘seed’ genes that are then used as starting point from which to build an association network. Rather than simply connect gene pairs with high correlation between their expression profiles, the approach defines a ‘neighbourhood of co-expression’ around each seed gene, and then connects seeds that have a significant degree of overlap between their neighbourhoods. This approach is relatively robust against the inclusion of spurious edges, as edges are only added when there is consistently high correlation to many intermediate genes that form the intersection between seeds. We previously used a seed-based approach successfully to predict hypoxia-related genes (Winter et al, 2007); this study develops the method in a meta-analysis context to produce robust signatures requiring fewer genes, making them more suitable for clinical use, for example in quantitative RT-PCR analyses of biopsies at presentation.

Hypoxia has a key role in defining the behaviour of many cancers including head and neck squamous cell carcinomas (HNSCCs) (Nordsmark et al, 2005) and breast carcinomas (BCs) (Fox et al, 2007); thus the identification of common hypoxia-regulated genes is important both for understanding of cancer evolution, and for improved prognosis or development of novel therapies. The described approach was applied to a large meta-analysis of HNSCCs and BCs to define successfully a common and robust hypoxia signature.

Materials and methods

Seed clustering

The process begins with k seed genes, ∏={π1,π1,…πk} (‘gene’ is used throughout for convenience, although ‘transcript’ is generally more accurate). Spearman's correlation, ρ, is computed between seeds and genes Y={y1, y2,…ym} in a data set of n samples, X={x1, x2,…xn}. For each seed/gene pair, their ‘affinity’ is defined as:

where θt and θs define extent and sharpness of the cluster. When θs → 0, δ reduces to the step function with δ =0 if ρ2<θt, δ=1 if ρ2>θt. In this limit, the method is parameter free, and this will be used in this study. θt is defined objectively using a probability threshold, α, of observing a given correlation if the null hypothesis (i.e. no association) was true. This needs to be corrected for multiple testing (Hastie et al, 2001) to account for the size of Y; here, α=0.05 after Bonferroni correction was considered. Finally, a membership function is defined:

An increasing γ indicates stronger membership of a gene to a seed cluster.

Shared neighbourhood

The shared neighbourhood, S, between two seeds is defined as:

where γ is the membership (Eq. 2). Two seeds are considered to carry a high degree of related information if their clusters share many genes (high S values). A sign function is also defined:

where sgn(x) is the sign function: sgn(x)=1 if x>0, sgn(x)=−1 if x<0. If two seeds are correlated with their shared features in the same direction, F=1 (seeds are fully concordant); if they are correlated with their shared features in opposite direction, F=−1.

Seed-dependent connectivity

The strength of the relationship between a gene and the whole set of seeds is estimated using the connectivity function:

where γ is defined in Eq. 2 and w are weights that regulate the importance of each seed. In this study, we consider w=1, unless yi is one of the seeds, or a probe set biding to the same transcript as the seed; in this case, to avoid bias, w=0 for that seed.

A connectivity score is defined as the fractional rank of C; that is the ranking normalised between 0 (lowest C) and 1 (highest C).

Bootstrapping, Monte Carlo and meta-connectivity score

Random sets of seeds are generated by Monte Carlo sampling, clusters are aggregated around them, and C and S are calculated. This procedure is repeated to generate null distributions and it provides an estimate of the probability of observing by chance a given value of C and S.

Bootstrapping is re-sampling with replacement of the original population; it is used to provide maximum likelihood best estimates when an analytical approach is not feasible (Hastie et al, 2001). Here, it is used to provide best estimates and confidence limits for C and S. These are used in a meta-analysis across several data sets to define a meta-connectivity score as:

where R[C(yi)]k is the fractional rank of C (Eq. 5), Nd is the number of datasets, σ2k is the variance of the ranked C, R[C(yi)]k, in dataset k for gene yi.

A common metagene between tumour types is derived by taking the Ĉ scores product, πĈ. This is effectively a rank product, as Ĉ is an average rank (Eq. 6).

Cumulative forest plots based on connectivity score

A summary expression score, E, is defined in each sample as the median of the absolute expression of the genes in the signature. The median is used as summary statistics to reduce the effect of outliers. A cumulative forest plot is defined: genes are added to the signature, one by one, in order of their connectivity, C, score so that genes that are introduced first have the highest connectivity. At each step, a summary expression, E, is derived using the new gene and genes from the previous steps. Samples are then ranked by their E value; this assigns a hypoxia score (HS) from lowest (least hypoxic) to highest (most hypoxic). Hypoxia score is then re-normalised between 0 and 1; introduced into a Cox multivariate analysis that includes the other significant clinical covariates and the hazard ratio (HR) of the HS is calculated.

Data sets, data processing and annotation

NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) was searched for gene expression studies in cancer, published in peer-reviewed journals, where microarray were performed on frozen material extracted before chemotherapy, radiotherapy or adjuvant treatment. Eight data sets (Table 1) were selected that used similar platforms (Affymetrix U133A, B and plus2, www.affymetrix.com). Processing was performed using ‘simpleaffy’ (Wilson and Miller, 2005); the ‘gcrma’ function was used to estimate expression values, data were quantile-normalised and logged (base2). Other data sets were identified for validation in which different technologies were used (Table 1); non-Affymetrix data sets were processed as described in the original publications. More details on pre-processing and annotation are given in the Supplementary Methods.

Table 1 Data sets used to train and validate the hypoxia signature

Results

Derivation of a hypoxia expression network

A hypoxia expression network was built first in a data set comprising 59 HNSCC tumour samples (Vice 125; Table 1) using well-characterised hypoxia-related genes identified from the literature covering a comprehensive set of hypoxia-induced pathways (set A, Supplementary Table S1). These were adrenomedullin (ADM), adenylate kinase 3-like 1 (AK3L1), BCL2/adenovirus E1B 19 kDa interacting protein 3 (BNIP3), carbonic anhydrase IX (CA9), enolase 1 (ENO1), hexokinase 2 (HK2), lactate dehydrogenase A (LDHA), phosphoglycerate kinase 1 (PGK1), solute carrier family 2 member 1 (SLC2A1) and solute carrier family 2 (VEGFA). The resultant network (Figure 1) was observed to map distinct regions of the Reactome (www.reactome.org) network and several hypoxia-related pathways (Figures 2; Supplementary Figure S1). The method was applied to additional HNSCC and BC training data sets (Table 1) with similar results (Supplementary Table S2).

Figure 1
figure 1

Hypoxia gene-expression network in HNSCC (Vice 125 data set). Seeds (yellow) and learnt genes (blue) are shown; circle size is proportional to C score. Genes with top 20% C scores are shown. Solid edges connect cluster members with seeds; length is proportional to membership, colour represents Spearman correlation (blue, −1; red, +1). Green dotted edges connect seeds; their length is proportional to the shared neighbourhood, S. This figure appears in colour in the HTML version.

Figure 2
figure 2

Hypoxia network mapped onto Reactome pathways (A) coloured by increasing C score from dark blue to bright red; and validation of up-regulated HNSCC (B) and BC (C) signatures by comparison with the literature. The proportion of literature-validated genes is shown as function of the number of top-ranked (by C score) genes considered; standard errors estimated by bootstrap. This figure appears in colour in the HTML version.

In the resulting expression networks, high shared neighbourhood, S (Eq. 3), values between seed pairs were generally associated with a high pair-wise correlation. However, this relationship did not always hold. An example is given in Supplementary Figure S2, where genes in a published 245-gene literature list (LL) (Winter et al, 2007) were used as starting seeds. Many of the seeds with high pair-wise S but low correlation appeared in the same KEGG (http://www.genome.jp/kegg/) pathway but could not be detected in a straightforward correlation analysis (Supplementary Figure S2). Furthermore some seeds showed markedly different in vivo and in vitro behaviours; for example, PFKFB3 (set B, Supplementary Table S1) did not have significant overlap with any other seeds, whereas CCNG2 showed a consistent inverse correlation with other seeds (F<0; Eq. 4), supporting results from previous studies (Choi and Chen, 2005). Thus, the method was able to identify seeds that behave differently from their peers; for the rest of this study, only the conservative seed set A was used. This set showed higher pair-wise S values than any other set of randomly selected seeds (repeated 1000 times) from the 245-gene LL.

Seed-dependent connectivity identifies a hypoxia signature

Genes in the co-expression networks were ranked by their connectivity score, C (Eq. 5), and compared with the hypoxia 245-gene LL. As the latter is biased towards up-regulated genes (Harris, 2002), only genes showing consistent positive correlation with the initial seeds were considered. To avoid bias, the initial seeds were excluded from this comparison. The relative proportion of known hypoxia genes increased with increasing connectivity, C, score (Figure 2), confirming its benefit as a metric for predicting functional relationships. Similar results were observed with different clustering and pre-processing methods (Supplementary Figure S3). However, differences were observed between data sets. Much of this inter-experimental variation is likely to reflect differences in both the patient populations and the processing of the biological material. For example, both data sets GSE6791 and GSE3494, which showed a lower level of enrichment for hypoxia genes than others, featured samples with the highest proportions of tumour cells selected either by microdissection or visual scoring.

Next we selected a subset of ‘hub’ genes from the hypoxia network, with the goal of using them as a hypoxia signature. Genes with high connectivity, C (Eq. 5), score (P<0.01, estimated by Monte Carlo simulation) were considered (Supplementary Table S2). Each of these genes had a greater-than-expected overlap with the neighbourhoods of all other genes in the hypoxia network (Supplementary Figure S4). The seeds were only selected if they were hubs with respect to all other seeds. Using the Reactome database, we confirmed that pathways known to be regulated by hypoxia, such as glycolysis, gluconeogenesis, glucose metabolism and Cori cycle (recycling of lactic acid), were consistently over-represented in these genes (Figure 2; Supplementary Table S3). Similarly, GO analysis (http://genecodis.dacya.ucm.es) found over-representation (false discovery rate <0.05) of pathways such as glycolysis, phosphoinositide-mediated signalling, nuclear mRNA splicing, translational initiation, regulation of cell cycle, ubiquitin-dependent protein catabolism, apoptosis and regulation of cell proliferation. Over-represented molecular functions included ATP binding, nucleotide binding, lipoic acid binding, oxidoreductase and L-lactate dehydrogenase activity.

Meta-signature enrichment and the prognostic value of compact signatures

We selected genes that showed consistent high connectivity across data sets and derived meta-signatures for hypoxia in HNSCC and BC. Interestingly, although some of the data sets performed poorly on their own, meta-analysis signatures were robust to their inclusion and performed well (Figures 2B and C).

We assessed the prognostic relevance of meta-signatures in four independent data sets (Table 1). Samples were ranked using a summary expression score, E, of the genes in the signature; this produced a hypoxia score, which assigns a hypoxic status to the tumours in the validation data sets. Multivariate Cox analysis including available clinical factors was carried out using each data set; clinical variables were selected using backward-stepwise maximum likelihood. The HS was introduced into the reduced clinical model to estimate the prognostic significance of the meta-signatures independently from other clinical variables (Supplementary Figure S5 and Table S4).

To address whether smaller signatures with equal prognostic ability could be derived by using a more stringent C score, cumulative forest plots were generated in which genes were introduced into the HS calculation one by one, in decreasing order of their meta-Ĉ score (Supplementary Figure S5). Only a few genes were needed before the HR stabilised and a reduced signature was found to be at least as prognostic as a larger one (Supplementary Figure S5 and Table S4). Interestingly, when genes were introduced into the cumulative plots in random order, rather than by their ranked Ĉ score, more genes were needed to reach equivalent prognostic significance (Supplementary Figure S5).

A common hypoxia metagene across cancer types

Common hubs in HNSCC and BC were selected by considering, for each gene, the product, πĈ, of the Ĉ scores between the HNSCC and BC meta-analyses. A common metagene was derived by considering genes with πĈ>0.5 (Table 2; Supplementary Table S5). This hard cut-off was chosen because a gene with a πĈ score approaching that which would be expected by chance (πĈ≈0.5) in one tumour site would have to achieve a maximal score in the other tumour site to be included.

Table 2 Top-ranked genes of the common hypoxia metagene

We investigated in cell lines potential regulation of genes in the common metagene by hypoxia and by HIF1a, the main mediator of the hypoxia response in cancer. We considered two data sets: a hypoxia time course in a panel of epithelial and endothelial non-malignant cells (Chi et al, 2006), and an HIF1a and HIF2a siRNA experiment in MCF7 BC cells (Elvidge et al, 2006) exposed to hypoxia. For details of these data we refer to the original publications. Although differences between cell lines and BC in vivo are expected, a high proportion of genes in the common metagene (38 out of 51) showed either regulation in the hypoxia time course or in the siRNA experiment (Figure 3A, B; Supplementary Table S5). Several of these genes were also predicted as HIF1a targets and showed potential HIF1a binding sites (Supplementary Table S5). Furthermore, 22 had already been found hypoxia regulated by previously published report (Supplementary Table S5). Overall approximately 80% (42 out of 51) of genes in the common metagene were confirmed by at least one validation, several of them by more than one.

Figure 3
figure 3

Common hypoxia signature of 51 genes. (A) Hypoxia/normoxia expression ratio in endothelial, smooth muscle, human mammalian epithelial, renal proximal tubule epithelial cells (EC, SMC, HMEC, RPTEC); and in (B) HIF1a/HIF2a siRNA experiment. (C, D) Connectivity-ranked forest plots: metastases- and recurrence-free survival (MFS, RFS) hazard ratio (HR) (red) with 95% confidence intervals, and HRs if permuted list (black). Control: random sampling of N=51 genes ( × 100 resampling).

The common hypoxia metagene was prognostic in independent data sets of different cancer types (Table 3) and showed greater prognostic power than (1) an in vitro derived hypoxia signature (Chi et al, 2006, 2) the initial seeds and (3) our 99-gene HNSCC hypoxia metagene derived previously (Winter et al, 2007). A signature derived by selecting genes co-expressed with VEGF in BC (Desmedt et al, 2008) had no independent prognostic significance (data not shown), in agreement with the published study. In a further validation using Oncomine (http://www.oncomine.org), all but one of the 15 top-ranked (by πĈ score) genes showed prognostic significance in at least one tumour site (P<0.0001). The only top gene for which prognostic significance was not reported in Oncomine, SLC2A1 (GLUT1), is prognostic in other studies (Oliver et al, 2004).

Table 3 Prognostic significance of the common hypoxia metagene versus other hypoxia signatures

Finally, cumulative forest plots based on connectivity score (Figure 3) showed no further improvement in HR after addition of a small number of genes. Although differences were observed between HNSCC, BC and lung cancers, we found in all cases that a common signature reduced to a small number of πĈ score top-ranked genes was at least as prognostic as the full signature (Figure 3C, D; Table 3).

Discussion

Hypoxia is a frequent feature of poor-prognosis tumours, and the identification of common in vivo hypoxia-related genes is desirable both for prognostic stratification of patients and development of novel therapies. Although prognostic markers of hypoxia have been identified, there are discrepancies between studies and powerful methods used in large meta-analyses are needed to define generally applicable signatures. A method is described for defining a hypoxia signature that combines previous knowledge derived from in vitro experiments, with co-expression data produced from in vivo samples. We show that by constructing a gene expression network and then extracting core ‘hub’ (high connectivity) genes it is possible to define signatures that are significantly enriched for phenotype-specific genes, and pathways. Although we have used this method to derive a compact and clinically relevant signature of hypoxia in cancer, the approach is likely to have broader applicability.

Specifically, we used the described method in a meta-analysis of 1136 HNSCCs and BCs to derive tissue-specific and common signatures of hypoxia by including only genes that are consistently useful across multiple experiments or tissue types. The ability of the method to derive highly prognostic hypoxia signatures despite differences between data sets highlights its robustness.

The gene expression network used to construct the signature was found to be biologically relevant and to map to a discrete set of biochemical pathways, which is significantly enriched for hypoxia-regulated genes and pathways. This finding highlights that not only in vitro data can assist understanding of clinical data, but also the reverse, that clinical data can be used to formulate specific biological hypotheses.

Remarkably, a reduced common hypoxia metagene containing as few as three genes, namely VEGFA, SLC2A1 and PGAM1, was as prognostic as a large signature in independent BC and HNSCC series. Furthermore, it was more prognostic than several reported signatures when tested in a set of independent data sets, suggesting a level of general applicability. Specifically, genes with highest connectivity were also the most prognostic across a panel of cancers. This further validates the method, as prognosis was not used to select genes that were only ranked by their connectivity; and this ranking was derived in independent data sets. Although a reduced signature was prognostic in all tumour sites tested, the number of genes before convergence was lower in HNSCC and BC than lung cancer. This offers another positive control as this was a common signature between HNSCC and BC, thus it is expected to reflect their biology to a better extent; however, it also indicates a degree of tumour specificity. The common signature and the tumour-type-specific signatures are being evaluated in prospective prognostic and predictive studies in HNSCC and breast cancer.

In summary, this study uses information from in vitro experiments regarding the function of multiple genes combined with in vivo co-expression patterns to derive a common hypoxia metagene in multiple cancers that is highly prognostic, while being compact and robust.