from a presentation by Scott Gleim
Genetics 144: Oncogenomics
Dartmouth Medical School
January 24, 2005
TOP Fundamentals Penetrance Haplotypes Candidates Screening Conclusions References
A single nucleotide polymorphism (SNP) refers to a single base change in DNA. SNPs comprise the most abundant form of human genetic variation, with frequency estimates ranging from as many as 1 in 300 to about 1 in 2000 base pairs. In a genome of approximately 3.2 billion base pairs (1), this would equate to anywhere from roughly 1.6 to 11 million human SNPs. Other estimates suggest as many as 15 million SNPs (2). With more than 1.4 million SNPs identified in the course of initial human genome sequencing (1), it is plausible that these estimates may continue to grow.
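The size of this range can be checked by simple division. The following is a minimal sketch of that arithmetic, assuming the 3.2 billion base pair genome size and the two density estimates quoted in the text; the function name is illustrative only. A density of 1 SNP per 300 base pairs works out to about 10.7 million SNPs.

```python
GENOME_SIZE_BP = 3_200_000_000  # approximate genome size cited in the text

def estimated_snp_count(genome_size_bp: int, bp_per_snp: int) -> float:
    """Expected number of SNPs if one occurs every `bp_per_snp` base pairs."""
    return genome_size_bp / bp_per_snp

low = estimated_snp_count(GENOME_SIZE_BP, 2000)  # sparse estimate: 1 SNP per 2000 bp
high = estimated_snp_count(GENOME_SIZE_BP, 300)  # dense estimate: 1 SNP per 300 bp
print(f"{low / 1e6:.1f} to {high / 1e6:.1f} million SNPs")
```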
The considerable difference in SNP estimates is partially explained by variation in the polymorphic tendency of genomic regions, while some of this discrepancy might also be explained by differences in the definition of what constitutes a SNP. Nucleotide changes (Figure 1) between purine and pyrimidine bases are transversions, while same-class changes (purine to purine or pyrimidine to pyrimidine) are called transitions. The prevailing SNP definition requires that an inherited allelic variant have a population frequency greater than 1% in order to be classified as a SNP. Also, while some definitions include methylated and deaminated dinucleotides, others do not. Regardless of the specific definition or an accurate accounting of all population SNPs, the potential importance of these polymorphisms is phenomenal.
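The distinction between transitions and transversions can be expressed in a few lines. This is an illustrative sketch only; the function name and error handling are assumptions, not part of the original presentation.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base DNA change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

for ref, alt in [("A", "G"), ("C", "T"), ("A", "T"), ("G", "C")]:
    print(ref, "->", alt, classify_substitution(ref, alt))
```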
As a testament to the inherent value of identifying single nucleotide polymorphisms, the National Center for Biotechnology Information (NCBI) has established an impressive online database to collect reported polymorphisms (dbSNP). Similarly, The SNP Consortium, a collaboration between pharmaceutical corporations, biotech companies, and The Wellcome Trust, has been established to further SNP tracking. Numerous other useful resources have also been constructed to aid in SNP and related genomic research including creation of the National Human Genome Research Institute (NHGRI), the Whitehead Institute SNP database, and the UCSC Genome Browser.
The impact of nucleotide differences is variable and elusive, but is clearly dependent upon the location of the polymorphism in the genome. Although the majority of SNPs are likely to occur outside of actual gene-encoding regions, polymorphisms located within the context of a gene (Figure 2) need not be involved in protein encoding to result in a functional change. Nucleotide differences in regions upstream of the protein-encoding gene regions may influence the binding of transcriptional activators or repressors, resulting in differential regulation of transcription. Polymorphisms at intron/exon boundaries may affect exonic or intronic splicing enhancer or silencer positions, or the especially conserved GT donor or AG acceptor positions, modifying the resulting polypeptide. There is even demonstrated potential for phenotypic effects from non-coding or synonymous SNPs through alteration of RNA secondary structure (3). Similarly, differences in the distal 3' untranslated region may have additional effects, including interruption of polyadenylation, which would alter the effectiveness of the transcript as a template.
Polymorphic impairment of the APC tumor suppressor gene, for example, can lead to autosomal-dominant familial adenomatous polyposis through direct missense mutations or silent mutations resulting in aberrant splice variants (4). A less obvious example of untranslated polymorphic influence comes from a 3920 Thymine to Adenine change in the APC gene which is neither a truncation mutation nor directly deleterious to gene function; however, this change establishes a hypermutable mononucleotide repeat (5).
Figure 3. Disease-Related Mutations
Despite these additional polymorphism effects, most SNPs are generally thought of in the context of the protein product. In fact, based upon diseases associated with Mendelian inheritance, the majority of intragenic mutations are missense and nonsense mutations (Figure 3), with the smallest number of mutations found in regulatory gene regions (2). Although it is likely that these numbers reflect a significant experimental bias towards coding regions, it is also possible that regulatory mutations are less often implicated in Mendelian diseases because such changes tend to result in low-penetrance phenotypic changes.
Modification of a nucleotide within the coding section of a gene may be synonymous, resulting in no amino acid change, or non-synonymous, resulting in production of a different residue (missense) or premature termination (nonsense) of the polypeptide chain (Figure 4). The precise impact of a missense SNP can be equally variable, depending upon the physico-chemical properties of the residue and the functional and/or structural importance of the residue in the resulting protein. Obviously, some missense polymorphisms are more conservative than others. For instance, changing CUU (leucine) to AUU (isoleucine) should have minimal structural impact, whereas modification of CAU (histidine) to CCU (proline), or GGU (glycine) to UGU (cysteine), would be expected to have dramatic structural and/or functional influence.
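The synonymous, missense, and nonsense categories can be illustrated with the standard genetic code. The sketch below builds the codon table from the conventional UCAG ordering and classifies the codon changes mentioned in the text; the function name and the extra "stop-loss" label are assumptions added for completeness.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in UCAG order for each codon position ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon substitution by its effect on the encoded residue."""
    ref_aa = CODON_TABLE[ref_codon.upper().replace("T", "U")]
    alt_aa = CODON_TABLE[alt_codon.upper().replace("T", "U")]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # premature termination
    if ref_aa == "*":
        return "stop-loss"  # not discussed in the text; included for completeness
    return "missense"

for ref, alt in [("CUU", "AUU"), ("CAU", "CCU"), ("GGU", "UGU"), ("CUU", "CUC"), ("UAU", "UAA")]:
    print(ref, "->", alt, classify_codon_change(ref, alt))
```

Note that this classification says nothing about how conservative a missense change is; that depends on the properties of the residues involved, as described above.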
Performing disease association studies with genotyping is primarily limited by the current technological capacity of genomic sequencing and the availability of genomic samples. Although some difficulties can be partially alleviated through PCR minimization from sample pooling (6), and through the availability of a plethora of sequencing methodologies, there remain many hurdles to widespread utilization of genotyping association studies. A comprehensive review of genotyping technologies (7) identifies the ideal attributes of a genotyping technique: fast and easy assay development at low cost, robust and automated operation with simple data analysis, and a flexible, scalable reaction that yields a low per-sample genotyping cost. Genotyping can employ methods such as hybridization, primer extension, ligation, or invasive cleavage and can be performed as homogeneous or solid phase reactions. Detection and measurement can be performed using luminescence, fluorescence, mass spectrometry, pyrosequencing, or electrical detection.
Among the first of these technologies to reach wide-scale use is the Affymetrix HuSNP GeneChip, which combines solid phase hybridization with fluorescence detection but apparently yields lower-confidence results than other methods (7). A currently popular approach seems to be the quantitative and highly discriminating 5' nuclease assay, as employed in example 1 (below), whose major drawback is the high cost of labeled probes. Arguably the gold standard is direct chemical measurement via mass spectrometry, although it offers lower throughput and requires considerable sample purification. Such methodology is exemplified by the MassEXTEND method (6) from Sequenom, which analyzes primer extension products by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-ToF MS), as in example 2 (below).
Historically, the identification of genes related to human cancer relied upon genetic inheritance of alleles which demonstrated familial linkage, transmitting a high likelihood of disease susceptibility to progeny. A popular example of genetic cancer susceptibility comes from the correlation of familial breast cancer risk with mutations in BRCA1 and BRCA2. These well known mutations result in dysfunctional proteins with a very high penetrance risk for breast cancer. As with numerous other oncogenic mutations, abnormal BRCA1 and BRCA2 would be expected to result in high penetrance cancer risk owing to their apparent roles in DNA repair, along with other roles important to cell cycle progression (8).
Lichtenstein et al. performed a retrospective analysis of nearly 45,000 pairs of monozygotic and dizygotic twins to establish the relative contribution of environmental and hereditary factors to cancer incidence (9), enabling partial dissection of causal influences. Where certain neoplasms show higher concordance in monozygotic twins than in dizygotic twins, heritability can be inferred as a predominant risk factor. In circumstances where dizygotic and monozygotic concordance is similar, the shared environment of the twins would appear to provide the dominant contributing factor. Nonshared environmental factors, such as viral infections and sporadic mutations, are implicated in situations lacking similarity between twins.
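The logic of comparing monozygotic and dizygotic twins can be sketched with the classical Falconer approximation. This is a swapped-in textbook technique used here only for illustration, not the analysis reported in the cited study. It assumes the inputs are twin-pair correlations in disease liability, and the example values are hypothetical.

```python
def falconer_components(r_mz: float, r_dz: float) -> dict:
    """Split variance into heritable, shared and nonshared environmental parts.

    r_mz, r_dz: liability correlations for monozygotic and dizygotic twin pairs.
    """
    heritable = 2 * (r_mz - r_dz)  # MZ twins share twice the segregating genes of DZ twins
    shared_env = 2 * r_dz - r_mz   # similarity not explained by genes
    nonshared_env = 1 - r_mz       # whatever makes even MZ twins differ
    return {
        "heritable": heritable,
        "shared_environment": shared_env,
        "nonshared_environment": nonshared_env,
    }

# Hypothetical correlations, chosen only to show the calculation.
parts = falconer_components(r_mz=0.40, r_dz=0.25)
for name, value in parts.items():
    print(f"{name}: {value:.2f}")
```

When the two correlations are similar, the heritable term shrinks toward zero and the shared environment term dominates, which mirrors the reasoning in the paragraph above.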
Differences in the contribution of heritable and environmental factors were apparent between various cancer types. Not surprisingly, most cancers showed significantly greater monozygotic than dizygotic concordance, indicating genetic inheritance to be a significant causal factor. Interestingly, Hodgkin's disease, non-Hodgkin's lymphoma, and cancers of the lips, oral cavity, pharynx, kidney, thyroid, bone, and soft tissue demonstrated no concordance between twins. Similarly, the vast majority of influences appear to come from non-shared environment. Nevertheless, the authors found that nearly 30% of breast cancer cases involved heritable factors (Figure 5), a frequency far higher than can be explained by known susceptibility genes.
The finding that a significant proportion of breast cancer can be attributed to unknown genetic susceptibility supports the notion of multiple gene contributions to susceptibility, or polygenic susceptibility (10). The polygenic hypothesis of susceptibility is particularly implicated in cases of unexplained familial risk, where incomplete penetrance of multiple genes is more likely than the occurrence of widespread unidentified high-penetrance alleles (11).
Despite the discovery of inherited alleles which predominantly result in oncogenesis, cancer can be considered a primarily sporadic (or stochastic) event. This is due to the concept of genetic penetrance, or the likelihood of a given genotype to result in a particular phenotype, in this context the phenotype of oncogenesis. In the case of a gene with complete penetrance, all cells carrying this gene would develop cancer. Owing to the cell cycle regulatory involvement of cancer-related genes, such a situation is unlikely to occur, as the carrying organism would have a very low probability of adequate development and/or survival. With incomplete penetrance, the genes can be disseminated into the general population, as additional genetic modifications are required for phenotypic expression. As low-penetrance alleles require further genetic modifications to occur over time, these alleles generally have little or no negative impact on reproductive fitness, allowing these genotypes to accumulate with much higher prevalence than high-penetrance alleles.
Taken together, the concept of low-penetrance gene accumulation and unaccounted-for heritable susceptibility support the notion behind the common disease/common variant (CDCV) theory of disease. CDCV simply proposes that many commonly occurring diseases are caused by commonly occurring alleles, presumably alleles of relatively low penetrance and high prevalence. Although cancer is well recognized as a genetic disease, the CDCV hypothesis opens the discussion to cover countless other elusive health problems including cardiovascular disease, neurological disorders, and auto-immune tendencies. Since low-penetrance alleles are insufficient for phenotypic development, such alleles fail to exhibit traditional Mendelian inheritance, and only confer a degree of susceptibility to a particular phenotype.
Because linkage mapping is familial in nature, such analyses are low resolution, covering loci containing hundreds of genes. Traditional familial linkage analysis is not capable of assessing non-Mendelian phenotypic inheritance and is therefore not suitable for the discovery of low-penetrance susceptibility genes. An interesting property of genetic inheritance, however, enables the use of Mendelian principles in evaluating low-penetrance genes without familial genetic mapping. Linkage disequilibrium (LD) refers to a population phenomenon where two alleles are inherited together at greater frequency than would be predicted by random recombination. Through identification of shared haplotypes among multiple strains of inbred mice, blocks of non-random variable diversity were discovered (12), indicating that the occurrence of linkage disequilibrium and single nucleotide polymorphisms is not uniform throughout the genome. Although the majority of SNPs likely have no direct functional effect, leveraging the phenomenon of linkage disequilibrium enables association of these non-functional SNPs with disease through LD with a functional variant.
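Linkage disequilibrium between two biallelic loci is commonly quantified with the statistics D, D' and r², which compare the observed frequency of a two-locus haplotype with the frequency expected if the alleles were inherited independently. The sketch below is illustrative only: the function name is an assumption and the haplotype frequencies are hypothetical.

```python
def ld_statistics(f_AB: float, f_Ab: float, f_aB: float, f_ab: float) -> dict:
    """Compute D, |D'| and r^2 from the four two-locus haplotype frequencies."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = f_AB + f_Ab  # frequency of allele A at the first locus
    p_B = f_AB + f_aB  # frequency of allele B at the second locus
    if not (0 < p_A < 1 and 0 < p_B < 1):
        raise ValueError("both loci must be polymorphic")
    # D: excess of the AB haplotype over its expectation under independence.
    d = f_AB - p_A * p_B
    # D' rescales D by the largest magnitude it could take for these allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return {"D": d, "D_prime": d_prime, "r_squared": r_squared}

# Hypothetical frequencies: complete LD versus no LD.
complete = ld_statistics(0.5, 0.0, 0.0, 0.5)
independent = ld_statistics(0.25, 0.25, 0.25, 0.25)
print(complete)
print(independent)
```

A marker SNP with high r² to a functional variant carries nearly the same association signal, which is why a non-functional SNP can serve as a proxy in an association study.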
Association studies are the primary means of establishing the correlation between a given gene and the risk of having a particular disease. In order to demonstrate such a gene-disease correlation, a study must attain statistical significance, achieve replication of the results in an independent sample, indicate the biological relevance of the gene in the context of the disease of interest, and demonstrate alteration of gene function or regulation (5). The use of single nucleotide polymorphisms in association with disease is primarily concerned with the propensity for such polymorphisms to affect the gene in which they are located, through modification of either the expression or the function of the protein product. However, a large proportion of polymorphisms may demonstrate linkage disequilibrium with other functionally implicated genes (14).
Capitalizing upon linkage disequilibrium phenomena, it becomes possible to enhance the statistical power of an association study by evaluating co-occurrence of multiple nucleotide polymorphisms, or SNP haplotypes (13). The number of haplotypes is inversely related to the degree of linkage disequilibrium within a particular region. For a given region, each individual carries two haplotypes, each inherited as an independent unit. Thus, for n biallelic SNPs, there are 2^n possible haplotypes. Too few haplotypes in a study would cause a lack of discrimination power, whereas too many haplotypes lead to a loss of statistical power and information. Taking a counter-intuitive approach, Evans et al. developed a method to predict individual genotypes at hidden SNP loci based upon linkage disequilibrium with surrogate markers (15). Such techniques will likely become important in testing association hypotheses, while highlighting less evident benefits of constructing haplotype maps. While this method may have a slight bias toward regions of high SNP density, regions of low SNP density can also be leveraged for genotyping studies. Mitra et al. demonstrated that large LD regions in unrelated members of the Ashkenazi Jewish population could enable genome-wide SNP evaluation to identify novel cancer-associated genes (16).
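The 2^n figure, and the way linkage disequilibrium reduces the number of haplotypes actually seen, can be shown with a short enumeration. This sketch assumes biallelic SNPs coded as 0 and 1; the "observed" haplotypes are a hypothetical sample, not data from any cited study.

```python
from itertools import product

def possible_haplotypes(n_snps: int, alleles: str = "01") -> list:
    """Enumerate every allele combination across n biallelic SNPs."""
    return ["".join(combo) for combo in product(alleles, repeat=n_snps)]

all_three = possible_haplotypes(3)
print(len(all_three), all_three)     # 2**3 combinations
print(len(possible_haplotypes(10)))  # 2**10 combinations

# Hypothetical region in strong LD: only a few of the possible haplotypes occur.
observed = {"000", "011", "111"}
print(f"{len(observed)} of {len(all_three)} possible haplotypes observed")
```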
The International HapMap Project has been established in an effort to catalogue the growing body of haplotype data. As SNP-disease association studies grow, understanding the complexity of polygenic influence will require genome-wide haplotype analyses to adequately explain the full heritability of susceptibility. One simple example demonstrating how an obscurely related gene can have an impact on cancer progression is the reduction in bladder cancer risk conferred by three non-coding SNPs identified in two negative regulators of G protein signaling (RGS) (17). Implication of RGS polymorphisms broadens the number of protein classes typically associated with cancer susceptibility. Furthermore, consideration of non-coding SNPs broadens the genomic window through which associations must be evaluated.
In the context of study design, there appear to be two primary approaches to SNP-based disease association studies. As it remains technically unfeasible, and ethically questionable, to perform complete genomic sequencing of every individual involved in a study, association studies require hypothesis driven localization of the region(s) under investigation. The investigated region can be guided by the candidate gene approach, or objectively derived through genomic screening.
One approach to the discovery of low-penetrance cancer-related genes is the candidate gene approach. This approach requires one to begin with a hypothesized functional relationship between the genes of interest and known disease etiology. Prior understanding of the functional specificity helps filter the target list down to a manageable level, minimizing the risk of a type II error associated with these targets. While utilization of intragenic haplotypes bolsters the statistical power of association, minimizing the chance of a false negative would be expected to increase the possibility of a type I error. Based upon the known differential responsiveness of breast tumors to estrogen treatment, Gold et al. utilized haplotype analysis to evaluate the potential involvement of steroid receptor mutations in the progression and treatment prognosis of breast cancers (18).
Using a 5' nuclease assay on the Assays-on-Demand platform from Applied Biosystems, the authors analyzed SNPs representing non-synonymous mutations, splicing modifications, or transcription factor binding site variants that were available on this platform and had >10% Caucasian allele frequency. Their analysis found a small number of haplotypes showing relatively large linkage disequilibrium (average distance >25 kb) within the estrogen receptor beta (ESR2) and progesterone receptor (PGR) genes. They also identified 17 SNPs within the estrogen receptor alpha (ESR1) gene. Fifteen of these SNPs are found in dbSNP and the two additional polymorphisms are exclusive to the Celera proprietary database. Of the evaluated polymorphisms, three were significantly associated with breast cancer in Ashkenazi Jewish patients: one within a putative promoter site, an S10S synonymous mutation in exon 1, and another SNP within intron 1. Two haplotypes were found to associate significantly with higher breast cancer susceptibility, including a rare susceptibility haplotype having an apparent proximity to the estrogen binding domain of ESR1. Perhaps most importantly, this study also identified three protective haplotypes.
Another perspective on identifying low-penetrance genes involved in cancer etiology assumes no a priori knowledge of gene function. Presumably, taking an agnostic approach to SNP-based association studies makes use of larger screening areas, enabling the discovery of unanticipated gene associations. Of course, such an advantage depends heavily upon the ability to evaluate a larger region of DNA, a limitation still to be resolved. However, as larger areas are considered, the overall degree of linkage disequilibrium may diminish, requiring larger numbers of samples to be analyzed to gain statistical confidence in association (11). Kammerer et al. used such a screening approach to identify novel genes related to breast cancer susceptibility (19). Their study included 254 age-matched German breast cancer patients and 268 controls, and utilized the MassEXTEND method with pooled samples to look at over 25,000 SNPs within approximately 16,000 genes.
In order to reduce the potential for type I statistical error (false positives), the study employed a three-step SNP selection strategy. In step one, PCR and primer extension were performed for each SNP in the pool of cases and in the control pool. The 5% most statistically significant associations advanced to the next step, where associations were measured in triplicate for each pool. Again, the most significant 5% were carried forward for individual genotyping. Through this rigorous SNP filtration procedure, 52 SNPs demonstrated confirmed association. To further discriminate against reporting of false positives, the final 52 SNPs were genotyped in independent case and control collections. Through this procedure they identified breast cancer status to be significantly associated with a 20 kb span. This region was localized to chromosome 19p13.2, containing three intercellular adhesion molecule (ICAM) genes, among others. Included in this group is a K469E variant of ICAM1, previously implicated in tumor progression. ICAM5 demonstrated the strongest association of the genes found within the region; however, due to tissue-specific expression of ICAM5, it is thought that ICAM1 may be a more favorable candidate to investigate further.
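The effect of repeatedly keeping only the top 5% of associations can be sketched with simple arithmetic. This is an approximation under stated assumptions: the starting count is taken as exactly 25,000 (the text says "over 25,000"), and the function name and integer rounding are illustrative choices rather than details of the study.

```python
def staged_filter(n_snps: int, keep_percent: int, n_stages: int) -> list:
    """Number of SNPs remaining after each stage of keeping the top `keep_percent`."""
    counts = [n_snps]
    for _ in range(n_stages):
        # Integer arithmetic avoids floating-point rounding surprises.
        counts.append(counts[-1] * keep_percent // 100)
    return counts

# Two pooled screening stages precede individual genotyping.
remaining = staged_filter(25_000, keep_percent=5, n_stages=2)
print(remaining)
```

Under these assumptions roughly 60 SNPs would reach individual genotyping, which is in line with the 52 confirmed associations described above.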
In this particular example, it appears that the authors have cautiously avoided false positive associations through a judicious screening process. However, perhaps the most fundamental difficulty in large-scale screening efforts of any kind is a high likelihood of overlooking true associations. The reduced sensitivity that comes with wide-scale approaches increases the probability of false negatives, suggesting that many potentially interesting associations may have been missed in this experiment. While broadening the search area could allow for the discovery of non-coding genetic elements in linkage disequilibrium with a functional correlate, the authors chose to reduce their data load by evaluating only coding regions, further contributing to the likelihood of missing potential contributing factors.
Despite a wide selection of sequencing methodologies and multiple perspectives on study design, it remains clear that there is no ideal approach available for the identification of SNPs responsible for low-penetrance contribution to cancer development and maintenance. In order to screen for new polymorphic associations, countless potential candidates of interest must be filtered out in order to make the experimentation manageable. In attempting to demonstrate candidate association, the propensity for false positive associations remains unfavorably high. The choice of statistical burden depends upon whether the objective is discovery of a new genetic association (screening) or demonstration of a supposed association (candidate). As such, if the candidate gene approach is criticized through an adage like "can't see the forest for the trees", the screening approach can be equally criticized through the corollary "can't see the trees for the forest". Nevertheless, it
1. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001. 409(6822): p. 860-921.
2. Botstein, D. and N. Risch, Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature Genetics Supplement, 2003. 33: p. 228-237.
3. Shen, L.X., J.P. Basilion, and V.P. Stanton, Jr., Single-nucleotide polymorphisms can cause different structural folds of mRNA. PNAS, 1999. 96(14): p. 7871-7876.
4. Aretz, S., et al., Familial Adenomatous Polyposis: Aberrant Splicing Due to Missense or Silent Mutations in the APC Gene. Human Mutation, 2004. 24: p. 370-380.
5. Bonnen, P.E. and D.L. Nelson, SNPs and Functional Polymorphisms in Cancer, in Oncogenomics: Molecular Approaches to Cancer, C. Brenner and D. Duggan, Editors. 2004, John Wiley & Sons, Inc.: Hoboken, NJ. p. 57-75.
6. Bansal, A., et al., Association testing by DNA pooling: An effective initial screen. PNAS, 2002. 99(26): p. 16871-16874.
7. Kwok, P.-Y., Methods for Genotyping Single Nucleotide Polymorphisms. Annual Review of Genomics and Human Genetics, 2001. 2(1): p. 235-258.
8. Welcsh, P.L., K.N. Owens, and M.-C. King, Insights into the functions of BRCA1 and BRCA2. Trends in Genetics, 2000. 16(2): p. 69-74.
9. Lichtenstein, P., et al., Environmental and Heritable Factors in the Causation of Cancer: Analyses of Cohorts of Twins from Sweden, Denmark, and Finland. N Engl J Med, 2000. 343(2): p. 78-85.
10. Pharoah, P.D.P., et al., Polygenic susceptibility to breast cancer and implications for prevention. Nature Genetics, 2002. 31: p. 33-36.
11. Houlston, R.S. and J. Peto, The search for low-penetrance cancer susceptibility alleles. Oncogene, 2004. 23: p. 6471-6476.
12. Wiltshire, T., et al., Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse. PNAS, 2003. 100(6): p. 3380-3385.
13. Hirschhorn, J.N., et al., A comprehensive review of genetic association studies. Genetics in Medicine, 2002. 4(2): p. 45-61.
14. Collins, A., C. Lonjou, and N.E. Morton, Genetic epidemiology of single-nucleotide polymorphisms. PNAS, 1999. 96(26): p. 15173-15177.
15. Evans, D.M., L.R. Cardon, and A.P. Morris, Genotype Prediction Using a Dense Map of SNPs. Genetic Epidemiology, 2004. 27: p. 375-384.
16. Mitra, N., et al., Localization of Cancer Susceptibility Genes by Genome-wide Single-Nucleotide Polymorphism Linkage-Disequilibrium Mapping. Cancer Res, 2004. 64(21): p. 8116-8125.
17. Berman, D.M., et al., A Functional Polymorphism in RGS6 Modulates the Risk of Bladder Cancer. Cancer Res, 2004. 64(18): p. 6820-6826.
18. Gold, B., et al., Estrogen Receptor Genotypes and Haplotypes Associated with Breast Cancer Risk. Cancer Res, 2004. 64(24): p. 8891-8900.
19. Kammerer, S., et al., Large-Scale Association Study Identifies ICAM Gene Region as Breast and Prostate Cancer Susceptibility Locus. Cancer Res, 2004. 64(24): p. 8906-8910.
Monday, March 14, 2005 5:09 PM