Abstract
Background: Known breast cancer-predisposing genes account for fewer than 25% of all familial breast cancer cases and further studies are required to find the remaining high- and moderate-risk genes. We set out to couple linkage analysis using microsatellite marker data with sequence analysis of linked regions in 13 non-BRCA1/2 families in order to find novel susceptibility loci and high-penetrance genes. Materials and Methods: Genotyping with 540 fluorescently-labeled microsatellite markers located on the 22 autosomes and the X chromosome at 7.25 cM resolution was used for primary linkage analysis, and an additional 40 markers were used for fine-mapping of loci with a logarithm of odds (LOD) or heterogeneity LOD (HLOD) score greater than one. Whole-exome sequencing data of 28 members from all 13 families were used for the bioinformatics sequence analysis of the linked regions of these families. Results: Linkage analysis identified three loci on chromosome 18q as putative regions of interest (overall LOD=1, HLOD=1.2). Sequencing analysis of the three linked regions on 18q and mutation prediction algorithms revealed three probably damaging variants. Conclusion: Overall, our study identified three weakly linked loci on 18q and three probably damaging variants of interest in the 13 families with breast cancer.
- Familial breast cancer
- microsatellite markers
- linkage analysis
- next generation sequencing
- sequencing analysis
Breast cancer is the most common type of cancer among females, and the most important risk factor is a family history of breast cancer. Epidemiological studies have shown that first-degree relatives of patients with breast cancer have a two-fold higher risk of breast cancer compared to the general population (1). Even though this could be, at least in part, due to shared environmental or lifestyle factors, twin studies found that up to 30% of all breast cancer cases may be due to genetic factors (2). Nonetheless, only approximately 5% of all breast cancer cases are attributed to the segregation of a germline mutation of a highly penetrant gene within the family (3). The two major high-risk genes are breast cancer 1 and 2 (BRCA1 and BRCA2), which together account for ~16% of the familial risk of breast cancer, while mutations in other high-risk genes, such as phosphatase and tensin homolog (PTEN), serine/threonine kinase 11 (STK11), cadherin 1 (CDH1) and tumor protein p53 (TP53), or in the moderate-risk genes ATM serine/threonine kinase (ATM), checkpoint kinase 2 (CHEK2), partner and localizer of BRCA2 (PALB2) and BRCA1 interacting protein C-terminal helicase 1 (BRIP1), explain a further 5% of familial cases (4). Thus, the great majority of families, in whom the genetic cause remains unexplained, are referred to as non-BRCA1/2 families.
Linkage analysis has been the method of choice for finding genes responsible for monogenic diseases and led to the identification of BRCA1 and BRCA2 (5, 6). However, many additional linkage studies have been performed in non-BRCA1/2 families using either short tandem repeat/microsatellite or single nucleotide polymorphism (SNP) markers without identifying any additional susceptibility genes (7-14). The lack of success could be attributed to extensive locus heterogeneity, where only a small proportion of the studied cases are linked to a particular locus. Greater statistical power could be achieved by analyzing subsets of families from more homogeneous populations (for example isolates, such as the Finnish, Icelandic or Ashkenazi Jewish populations), in which the number of risk loci might be smaller. This approach has proved successful in a few cases, such as the identification of transmembrane protease, serine 6 (TMPRSS6) and RAD50 homolog (RAD50) as susceptibility genes in the Finnish population (15, 16).
In the past 4 years, new susceptibility alleles have been found through large global collaborative approaches, either via the analysis of individual genes or, recently, through genome-wide association studies (GWAS) using tag SNPs. So far, 27 different breast cancer susceptibility variants have been detected (17-30) by GWAS, with an additional locus (caspase 8, apoptosis-related cysteine peptidase; CASP8) being identified through a candidate-gene approach (31). In total, these loci account for approximately 30% of familial breast cancer. Recently, Michailidou et al. identified SNPs at 41 new breast cancer susceptibility loci (32). Despite their high frequency, the fraction of the familial risk explained by all known risk alleles is only 30% in European populations (23), suggesting that other loci remain to be identified.
Recently, an exome-sequencing study of families with multiple breast cancer-affected individuals identified two families with mutations in the X-ray repair complementing defective repair in Chinese hamster cells 2 gene (XRCC2), and further association studies confirmed that rare mutations in this gene increase the risk of breast cancer (33). An increasing number of studies use next-generation sequencing as an explorative tool to screen the whole genome or exome in order to find susceptibility variants.
Exome sequencing, developed from next-generation sequencing (NGS; also known as massively parallel sequencing), is revolutionizing our ability to characterize cancer at the genomic level by cataloging all mutations, copy number variations and somatic rearrangements at base-pair resolution. The exome covers ~1% of the human genome (human exome ~30 Mb) and is so far the part most functionally relevant to phenotypic variation. NGS combined with efficient DNA capture enables the use of whole-exome sequencing (WES) to target exons and is emerging as an efficient tool for testing the association of rare coding variants with complex diseases. Short tandem repeats, or highly polymorphic microsatellite markers, are distributed widely and evenly in the genome, are relatively easy to score, and information about the markers is readily accessible through several databases, such as those of the Marshfield Institute and deCODE. Therefore, in the present study, we set out to utilize linkage analysis using microsatellite data to identify candidate regions in non-BRCA1/2 families and, furthermore, to use WES data to find potential high-penetrance genes in the identified linked regions.
Materials and Methods
Patients. The present study is based on a cohort of 13 large non-BRCA1/2 hereditary breast cancer families. All families are of Swedish origin and were recruited from the Genetic Counseling Unit at the Department of Clinical Genetics, Karolinska University Hospital, Stockholm. The study was undertaken in accordance with the Swedish legislation of ethical permission and according to the decision made by the Stockholm Regional Ethics Committee (97/205 and 00/291 and 08/125-31.2). Informed consent from all the families was obtained prior to study initiation. DNA was obtained from 96 family members (Table I). A total of 50 family members were diagnosed with breast cancer, ranging between two and nine affected individuals in each family. All families screened negative for mutations in the BRCA1 and BRCA2 genes. The number of genotyped family members in each family ranged from 2 to 16, and for the WES study, one to three members from each family were chosen from the available DNA so that they represented the most distantly related affected women.
Genotyping. Genomic DNA was extracted from peripheral blood by standard procedures. Quality control of the samples and microsatellite genotyping were performed at deCODE Genetics (Reykjavik, Iceland). The deCODE markers were validated and originally selected from the Marshfield genetic map (34). A total of 540 fluorescently-labeled microsatellite markers located on the 22 autosomes and the X chromosome were genotyped. The average distance between the markers was 7.25 cM, and the genetic map used in the analyses was the one provided by deCODE. The overall successful genotyping rate was 94.3% for the samples.
For the fine-mapping analysis, an additional 5, 10, 19, 2 and 4 highly informative markers on chromosomes 8, 11, 18, 22 and X, respectively, were chosen from the Marshfield genetic map (D8S1785, D8S1775, D8S1807, D8S1475, D8S525, D11S1392, D11S4102, D11S905, D11S1993, D11S1983, D11S1765, D11S4178, D11S4136, D11S4184, D11S4081, D18S1157, D18S1145, D18S460, D18S1119, D18S1144, D18S55, D18S1131, D18S58, D18S1121, D18S454, D18S65, D18S970, D18S455, D18S467, D18S1126, D18S363, D18S473, D18S470, D18S1110, D22S686, GATA198B05, DXS1003, DXS8032, DXS8031 and DXS8082). Primers for these markers were obtained from the uniSTS database (NCBI), with the forward primers labeled with a 6-FAM fluorescein modification at the 5' end. Markers were amplified according to an in-house polymerase chain reaction (PCR) protocol (available on request). The amplified products were analyzed on either an Applied Biosystems 3130xL Genetic Analyzer or a 3500xL Genetic Analyzer (Thermo Fisher Scientific, Waltham, MA, USA) after being mixed with Genescan 400HD ROX size standard (Applied Biosystems, Foster City, CA, USA) and Hi-Di Formamide (Applied Biosystems, Warrington, UK). Subsequently, Genemapper software (version 3.7 or 4.1) (Thermo Fisher Scientific, Waltham, MA, USA) was used to analyze the peaks.
Linkage and fine-mapping analysis. Linkage analysis was performed under two sets of criteria: the more stringent breast cancer-strict criteria and the less stringent breast cancer-loose criteria. Both sets of criteria coded all patients with breast cancer as affected, healthy spouses as unaffected, and all other cancer cases as unknown. Under the breast cancer-loose criteria, healthy women over 50 years old were coded as unaffected and those less than 50 years old were coded as unknown, whereas under the breast cancer-strict criteria all healthy women were coded as unknown.
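The two coding schemes described above can be illustrated with a minimal Python sketch. This is not the code used in the study; the function name, its arguments and the numeric status codes (2 = affected, 1 = unaffected, 0 = unknown, a common convention in linkage file formats) are assumptions made for illustration. The text does not specify how healthy male blood relatives or women aged exactly 50 were coded, so the sketch treats both as unknown.

```python
from typing import Optional

# Status codes in a common linkage-file convention (assumed here).
AFFECTED, UNAFFECTED, UNKNOWN = 2, 1, 0


def code_status(has_breast_cancer: bool, has_other_cancer: bool,
                is_healthy_spouse: bool, sex: str, age: Optional[int],
                criteria: str) -> int:
    """Return the affection status under 'strict' or 'loose' criteria."""
    if criteria not in ("strict", "loose"):
        raise ValueError("criteria must be 'strict' or 'loose'")
    if has_breast_cancer:
        return AFFECTED          # same under both criteria
    if has_other_cancer:
        return UNKNOWN           # other cancer cases are uninformative
    if is_healthy_spouse:
        return UNAFFECTED        # married-in healthy individuals
    # Healthy blood relatives: only the loose criteria use age.
    if criteria == "loose" and sex == "F" and age is not None and age > 50:
        return UNAFFECTED
    # Assumption: everyone else (including age exactly 50) is unknown.
    return UNKNOWN
```

For example, a healthy 62-year-old woman from the pedigree would be coded as unaffected under the loose criteria but as unknown under the strict criteria.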
All genotyping data were checked for Mendelian inconsistencies between parents and offspring using the Pedcheck (version 1.00) software (35). Pedcheck executes three different error detection algorithms at three levels 0, 1 and 2 to report all the possible ambiguous genotypes violating Mendelian rules. Next, Mega2 (versions 4.4.3 and 4.5.4) was used to prepare the files for the genetic analysis of the family-based study (36). Mega2 calculates the marker allelic frequencies from all the genotyped individuals from pedigree, locus, map and marker input data while generating the input files for subsequent Simwalk2 (version 2.91) analysis (37).
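The basic idea behind a Mendelian consistency check can be sketched for a single autosomal marker in a parent-offspring trio. This is a simplified illustration only and does not reproduce the Pedcheck algorithms, which also handle larger pedigrees and untyped individuals; the function name and the use of 0 for a missing allele are assumptions.

```python
from typing import Tuple

Genotype = Tuple[int, int]  # two allele labels; 0 denotes a missing allele


def is_mendelian_consistent(child: Genotype, father: Genotype,
                            mother: Genotype) -> bool:
    """Check one autosomal marker in a trio for Mendelian consistency."""
    if 0 in child or 0 in father or 0 in mother:
        # Missing data cannot be tested in this simple trio check.
        return True
    a, b = child
    # One child allele must come from the father and the other from the mother.
    return (a in father and b in mother) or (b in father and a in mother)
```

A child with genotype (1, 1) whose mother is (2, 3) would be flagged as inconsistent, because neither maternal allele can account for the child's second copy of allele 1.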
The available linkage analysis programs use Markov chain Monte Carlo simulation algorithms for computing location scores, which are directly comparable with multipoint logarithm of odds (LOD) scores and are presented in log10 units. For the parametric linkage analysis, we assumed a dominant mode of inheritance and a disease allelic frequency of 0.0001. The penetrance for homozygous-normal, heterozygous, and homozygous-affected individuals was set to 5%, 80% and 80%, respectively, and phenocopies were assumed to account for 5% of the observed affected individuals. The Simwalk2 linkage analysis program was run for both the breast cancer-loose and -strict criteria to calculate multipoint parametric linkage LOD (PL LOD) scores and locus heterogeneity LOD (HLOD) scores, as well as nonparametric linkage LOD (NPL LOD) scores for the 22 autosomes. A similar multipoint analysis of chromosome X was performed using Merlin (version 1.1.2) software (38).
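To make the relationship between per-family LOD scores and the HLOD concrete, the following Python sketch uses the standard admixture formulation, in which a proportion alpha of families is assumed to be linked at a given position: HLOD(alpha) is the sum over families of log10(alpha x 10^LOD_i + 1 - alpha), maximized over alpha between 0 and 1. It is an illustration under that assumption, not a description of the internal computations of Simwalk2 or Merlin, and the per-family LOD values in the usage note are hypothetical.

```python
import math
from typing import Sequence, Tuple


def hlod(family_lods: Sequence[float], alpha: float) -> float:
    """Heterogeneity LOD at one position for a given linked proportion alpha."""
    return sum(math.log10(alpha * 10.0 ** z + (1.0 - alpha))
               for z in family_lods)


def max_hlod(family_lods: Sequence[float],
             steps: int = 1000) -> Tuple[float, float]:
    """Grid search over alpha in [0, 1]; returns (best HLOD, best alpha)."""
    best_value, best_alpha = 0.0, 0.0   # alpha = 0 always gives HLOD = 0
    for i in range(steps + 1):
        alpha = i / steps
        value = hlod(family_lods, alpha)
        if value > best_value:
            best_value, best_alpha = value, alpha
    return best_value, best_alpha
```

With alpha = 1 the HLOD equals the summed (overall) LOD, and with alpha = 0 it is zero, so the maximized HLOD is never lower than the overall LOD or than zero. For hypothetical family scores such as `max_hlod([0.8, 0.5, -1.0])`, a family with a negative score therefore lowers the overall LOD more than it lowers the HLOD, which is why some loci can exceed a threshold only in the HLOD analysis.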
WES and bioinformatics analysis. WES of 28 individuals from these 13 families was performed at the Science for Life Laboratory, Stockholm, as 100-bp paired-end reads on an Illumina HiSeq2000 instrument (Illumina, San Diego, CA, USA). Bioinformatics analysis of the raw exome sequencing data included alignment of sequence reads to the reference human genome (with chromosome coordinates GRCh37/hg19) using the tools BWA (Wellcome Trust Sanger Institute, Cambridge, UK), Picard (Broad Institute, Cambridge, MA, USA) and SAMtools (Wellcome Trust Sanger Institute, Cambridge, UK), followed by GATK (39) base quality score recalibration, indel realignment, duplicate removal, and SNP and indel discovery. Genotyping was performed across all 30 samples simultaneously using standard hard filtering parameters or variant quality score recalibration (40), and the output was generated as an annotated variant call format (VCF) file. VCF is a generic format for storing next-generation sequence polymorphism data such as SNPs, insertions, deletions and other variants, including structural variants, together with annotated information. ANNOVAR (Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, USA) is a software tool that functionally annotates the genetic variants present in the VCF file. Sequence analysis by ANNOVAR was preceded by generation of a compatible file from the VCF, followed by gene-based annotation using the University of California, Santa Cruz (UCSC) Genome Browser (41). The large number of hits from the bioinformatics analysis was reduced to a manageable number through sequential reduction steps for final manual validation. First, deleterious and missense hits flagged with the LowCoverage, VeryLowQual, LowQual, LowQD, HARD_TO_VALIDATE or SNPCluster filters, which suggest either false-positives or potential artifacts, were eliminated.
In the next step, all hits not shared by all the cases of at least one family among the LOD score-contributing linked families were excluded. Subsequently, for each SNP/variant position, the minor allelic frequency (MAF) was taken from the 1000 Genomes Project or, when not available there, from an available Swedish population of patients with colon cancer. Hits with an MAF of more than 10% in the normal population were removed. The calculated allelic frequency (CAF) used one and the same case from our 13 families each time. Finally, variants with a CAF at least twofold higher than the MAF were shortlisted as candidate variants. Furthermore, the variants were analyzed with Alamut (Interactive Biosoftware; version 2.3 on 2014-08-21).
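The sequential reduction steps described above can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the in-house pipeline: the `Variant` record, the `shortlist` function and the sample and family identifiers in the usage note are hypothetical, the CAF is taken as a precomputed value, and variants with no available MAF are kept, a case the text does not address.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Filter flags treated in the text as likely false-positives or artifacts.
ARTIFACT_FLAGS = {"LowCoverage", "VeryLowQual", "LowQual", "LowQD",
                  "HARD_TO_VALIDATE", "SNPCluster"}


@dataclass
class Variant:
    variant_id: str
    filters: Set[str]        # filter flags attached to the call
    carriers: Set[str]       # IDs of sequenced cases carrying the variant
    maf: Optional[float]     # reference-population MAF, None if unavailable
    caf: float               # calculated allelic frequency among cases


def shortlist(variants: List[Variant],
              linked_family_cases: Dict[str, Set[str]],
              max_maf: float = 0.10,
              min_fold: float = 2.0) -> List[Variant]:
    """Apply the reduction steps in order and return the candidates."""
    kept = []
    for v in variants:
        # Step 1: drop calls carrying any artifact flag.
        if v.filters & ARTIFACT_FLAGS:
            continue
        # Step 2: keep only variants shared by all sequenced cases of at
        # least one linked family.
        if not any(cases and cases <= v.carriers
                   for cases in linked_family_cases.values()):
            continue
        # Steps 3 and 4 need a reference MAF (assumption: keep if missing).
        if v.maf is not None:
            if v.maf > max_maf:               # common in the population
                continue
            if v.caf < min_fold * v.maf:      # not enriched in cases
                continue
        kept.append(v)
    return kept
```

As a usage example, with `linked_family_cases = {"F1": {"a", "b"}}`, a variant carried by both `a` and `b` with an MAF of 0.05 and a CAF of 0.15 passes every step, whereas the same variant with an MAF of 0.20 is removed at the 10% threshold.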
Results
Linkage and fine-mapping analysis. Genome-wide linkage analysis was performed using the breast cancer-strict and -loose criteria, and no statistically significant results (LOD >3) were obtained. We chose an LOD threshold of one to define regions of interest for further study. PL LOD, NPL LOD and HLOD scores of interest are reported in Table II, which shows either individual markers or sets of markers with a PL LOD or HLOD greater than one in any of the analyses, together with the corresponding NPL scores. For sets of markers, the highest score in the region is shown. The six regions selected on this basis, located on five chromosomes, were 8q12.3-q13.3, 11q13.2, 18q21.1, 18q21.32-q22.3, 22q11.1-q11.21 and Xq13.1. These loci were further fine-mapped with an additional 40 markers and the results are displayed in Table III. After fine mapping, three regions with LOD scores greater than one were suggested on chromosome 18. Indeed, the whole of chromosome 18 below the marker D18S464 could not be ruled out. The families contributing prominently to this overall LOD score (Family ID: 862, 2060, 6100) each have an individual family LOD score of at least 0.8 in these regions. None of the loci on chromosomes 8, 11, 22 and X resulted in positive LOD scores after fine mapping. The primary NPL analysis identified two other regions on chromosome 18 and one each on chromosomes 19 and X with LOD scores greater than one. As none of these loci were supported by the overall PL LOD or HLOD analysis, these regions were not fine-mapped.
Sequencing and bioinformatics analysis. WES data from 28 patients with breast cancer from the 13 non-BRCA1/2 families were used for analyzing the three linked regions. The exome sequencing data were analyzed using ANNOVAR, followed by the sequential variant reduction steps developed in our laboratory (see Materials and Methods). The bioinformatics analysis of the three putative candidate marker regions on 18q from our current linkage and fine-mapping analysis (Table III) resulted in several genes with either deleterious or missense mutations. Three families contributing to the positive LOD score on chromosome 18 were taken into consideration for finding genes harboring mutations in the linked regions (Family ID: 860, 2060, and 6100). The variants obtained before and after our variant reduction steps are summarized in Table IV. Among the total of 257 variants found on chromosome 18, two deleterious and nine missense mutations were validated further using Alamut (Interactive Biosoftware) (Table V).
Discussion
To the best of our knowledge, the present study represents the first single-center genome-wide linkage analysis in Swedish non-BRCA1/2 families combined with sequence analysis published to date. We set out to test the hypothesis that further susceptibility loci exist that confer moderate or high risk of breast cancer.
The current study was carried out in 13 extended non-BRCA1/2 breast cancer families. The advantage of our study is that large non-BRCA1/2 families were used. However, as breast cancer is a heterogeneous disease, a study of only 13 families was not likely to detect all breast cancer-predisposing genes involved, but we hypothesized that the larger size of the families would make up for their small number, in particular since Sweden can be considered a fairly homogeneous population. The pedigrees of the families included in the current study indicated that the risk alleles involved were likely to be dominant. Such unusually high aggregation was not likely to be explained by shared environmental factors or low-penetrance variants alone, but rather by genes with a main effect, with the aforementioned factors acting as modifiers. Hence our primary analysis assumed a dominant mode of inheritance. In order to avoid overlooking a possible linkage signal due to misspecification of the model, we also used an allele-sharing approach (model-free or NPL analysis). However, this approach did not identify any linked regions.
The study showed no statistically significant results (LOD greater than three), but did find several putative candidate regions with an overall PL LOD or HLOD greater than one on chromosome 18q. HLOD scores also supported the PL LOD scores, but some loci were identified only by the HLOD analysis. None of these loci overlapped with the regions previously identified in linkage studies or reported among the 25 SNP loci from the GWAS (17-30). Since only a very limited part of the total genetic contribution is currently known, this is not surprising, especially as our study used a small number of families with breast cancer, where chance alone can result in suggested risk loci.
As several sets of markers throughout chromosome 18 had LODs or HLODs greater than one without clear boundaries, the whole of chromosome 18 below marker D18S464 is of interest. The bioinformatics analysis of exome sequencing data on the 18q candidate loci identified two deleterious and nine missense variants. Among the shortlisted variants, all except those in desmocollin 2 (DSC2), desmoglein 2 (DSG2), solute carrier family 14 (urea transporter), member 2 (SLC14A2) and alpha-kinase 2 (ALPK2; rs7234999 and rs33910491) were shown by Alamut to be neutral/benign and not evolutionarily conserved. The DSC2 variant is reported as likely pathogenic in the Human Gene Mutation Database (HGMD), but only in arrhythmogenic right ventricular dysplasia/cardiomyopathy (CM0910201). Its clinical significance in breast cancer has not yet been studied. The DSG2 variant was reported to be clinically pathogenic but, conversely, was predicted to be benign by mutation- and conservation-prediction algorithms. The SLC14A2 variant was predicted to be probably damaging (PolyPhen2) and conserved (Grantham and GERP), but no clinical significance was found. Moreover, the two ALPK2 variants (rs7234999 and rs33910491) were predicted to be possibly damaging and conserved, with no clinical significance being found. These breast cancer families will be further studied since we have whole-exome data for all of them. The present exome sequence analysis was a follow-up of the linkage analysis and therefore focused only on the linked regions of chromosome 18q, looking for potential risk variants in our candidate region.
Conclusion
Overall, the present study found three weakly linked loci in the chromosome 18q region. Our findings are consistent with the results of several other genetic studies, in that obtaining an overall LOD score greater than 3 is not possible with such a small number of families. The three variants predicted to be probably or possibly damaging are of interest, but further confirmatory studies are needed to establish their clinical significance in causing breast cancer. Our data provide further support for the hypothesis that multiple loci harboring common low-penetrance or rare moderate-risk genetic variants are more likely to be responsible for familial breast cancer.
Acknowledgements
The Authors are grateful to all the families for their cooperation and commitment to this study.
Footnotes
This article is freely accessible online.
Conflicts of Interests
The Authors declare that they have no competing interests with regard to this study.
Financial Support
AL (three grants): The Swedish Cancer Society; The King Gustav V Jubilee Foundation; Department of Research, Education and Development at Karolinska University Hospital (FoUU). JR (two grants): Anders Otto Swärd, and Nilsson-Ehle Foundation. TA: Magnus Bergvall Foundation.
- Received March 9, 2015.
- Revision received March 23, 2015.
- Accepted March 26, 2015.
- Copyright© 2015 International Institute of Anticancer Research (Dr. John G. Delinassios), All rights reserved