Genome-wide Searches for Mutations in Human Cancers

Oncogenomics

Wood et al. The genomic landscapes of human breast and colorectal cancers. Science 2007.

Sjöblom et al. The consensus coding sequences of human breast and colorectal cancers. Science 2006.


Overview

The convergence of pan-genomic techniques with traditional biochemical and genetic approaches has provided the foundation for great insight into the neoplastic process. The full range of mutated genes that participate in carcinogenesis has, until now, been unknown. Recent work by Sjöblom et al. and Wood et al., using high-throughput sequencing of the protein-coding regions of tumor genomes, indicates that cancer is far more complex than previously assumed. By sequencing a cohort of breast and colon tumors, the authors derived a catalog of somatic mutations. These data depict tumorigenesis as a polygenic disease in which only a few genes are mutated at high frequency (gene mountains), while the majority of mutated genes that give rise to the neoplastic phenotype are mutated at very low frequencies (gene hills). Thus, it appears that cancers are not dominated by recurrent, highly frequent driver mutations in the same genes; rather, each tumor harbors a unique signature of approximately 15 driver mutations. Although the signature of mutations differs from tumor to tumor, each signature confers a selective advantage on the tumor over the surrounding tissue and thereby drives neoplasia.

Discovery of Oncogenes and Tumor Suppressors

One of the most important questions in cancer research is: which mutated genes are responsible for the cancer phenotype? Answers to this question offer clues to the molecular wiring of tumorigenesis and can reveal many aspects of normal gene function. To this end, several techniques have been seminal in identifying candidate cancer genes. Prior to the availability of genomic sequencing, oncogenes could be uncovered using in vitro transformation assays (Fig. 1). For example, human tumor DNA could first be transfected into rodent cells, and any cell incorporating DNA that conferred a growth advantage would be selected. DNA from these clones could then be isolated and packaged into phage. Upon infection, the DNA would be transferred to bacteria, propagated, and replica plated. Using a probe specific to the original tumor DNA, the oncogene conferring the growth advantage could then be isolated. In contrast, tumor suppressor genes have mostly been uncovered by large-scale segregation analyses. These analyses allow the identification of disease allele frequencies and the subsequent dissection of families with near-Mendelian pedigrees (Fig. 2). Linkage analysis is then performed on these families to map the susceptibility gene of interest to a gross chromosomal location. From there, the candidate gene is mapped by positional cloning, using deletion fragments from individuals within the families to find the gene.


Advantages of Genomic Sequencing for Uncovering Cancer Susceptibility Genes

Although the above techniques have uncovered a variety of cancer susceptibility genes, they suffer from several pitfalls. Chiefly, cancers driven by low-penetrance or polygenic loci cannot always be dissected by in vitro transformation assays or large-scale segregation analysis. Genomic sequencing does not suffer from these limitations because the identification of mutated genes does not depend on penetrance or on the number of contributing loci. Only with the arrival of whole-genome sequence assemblies has it become possible to investigate the functions and genomic constellations of mutations within tumors. Large-scale genomic sequencing promises great insight into the molecular definition of tumors, which may aid clinical diagnosis as well as the choice of treatment.

Discovery Screen

The genomic sequencing studies by Sjöblom et al. and Wood et al. pursue three goals: i) to develop a framework that allows somatic tumor mutations to be uncovered; ii) to characterize the spectrum of somatic lesions in colon and breast cancer; and iii) to identify common pathways and interacting genes that unify the prevalent mutations in individual tumors, in order to derive themes as to how the neoplastic process occurs. To this end, both studies apply the same rigorous method to identify mutations, using 11 colon tumors, 11 breast tumors, and matched control DNA from the same patients. Both Sjöblom et al. and Wood et al. use highly annotated, manually curated gene sets that represent the most accurate datasets available. Sjöblom et al. use the Consensus Coding Sequence (CCDS) database, comprising 14,661 transcripts representing 13,203 genes, while Wood et al. use data from the Reference Sequence (RefSeq) database, which provides higher quality and coverage of gene sequences and more annotation data. Because the CCDS and RefSeq databases overlap extensively, Wood et al. sequence only an additional 6,196 transcripts that were either absent from the CCDS data or had failed to be sequenced by Sjöblom et al.

The methodology used by Sjöblom et al. and Wood et al. to discover candidate cancer genes (CAN genes) was as follows (Fig. 3). Sequences of the coding regions and the splice donor and acceptor sites from each tumor were analyzed to identify putative mutations. Somatic mutations were distinguished from germline mutations and polymorphisms by sequencing matched non-tumor DNA from the same patients: a somatic mutation appears as a discrepancy between the tumor and non-tumor sequences. Any sequence variant that was identical in the tumor and normal tissue but differed from the database's annotated sequence was removed, since such variants likely represent germline mutations or polymorphisms not important to the tumor. The somatic mutations identified were then resequenced to confirm their presence.
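The tumor/normal/reference comparison described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the authors' pipeline: the `Call` record, the function names, and the toy gene names and bases are all hypothetical, and real data would also require alignment, quality filtering, and resequencing.

```python
from typing import List, NamedTuple


class Call(NamedTuple):
    """One sequenced position (hypothetical record layout)."""
    gene: str
    position: int
    tumor_base: str
    normal_base: str      # matched non-tumor DNA from the same patient
    reference_base: str   # annotated database sequence


def classify_call(call: Call) -> str:
    """Label a position using the tumor/normal/reference comparison."""
    if call.tumor_base != call.normal_base:
        # Tumor differs from the patient's own normal tissue.
        return "somatic"
    if call.tumor_base != call.reference_base:
        # Tumor and normal agree with each other but not with the database.
        return "germline_or_polymorphism"
    return "no_change"


def somatic_candidates(calls: List[Call]) -> List[Call]:
    """Keep only the positions that would go on to resequencing."""
    return [c for c in calls if classify_call(c) == "somatic"]


# Toy data; gene names and bases are invented for illustration.
calls = [
    Call("GENE_A", 101, "T", "C", "C"),  # tumor-specific change
    Call("GENE_B", 202, "G", "G", "A"),  # shared by tumor and normal
    Call("GENE_C", 303, "A", "A", "A"),  # matches the reference
]
candidates = somatic_candidates(calls)
print([c.gene for c in candidates])
```

Only the first toy call survives the filter; the second is discarded as a likely germline variant or polymorphism, which mirrors the removal step described in the text.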


Validation Screen

In the second phase of isolating CAN genes, a validation screen was performed on the genes identified in the discovery screen (Fig. 3). Here, the genes found to harbor mutations in the discovery screen were resequenced in an additional 24 breast or colon cancers to assess the frequency and spectrum of mutations in these genes. As before, putative mutations in these genes were cataloged and resequenced for validation, and somatic mutations were determined. These methods led to the identification of 921 breast and 751 colon cancer genes by Sjöblom et al.; when added to the mutations identified by Wood et al., a total of 1,718 somatically mutated genes were identified across the two cancer types. The majority of the nonsynonymous mutations discovered are missense mutations, while the remaining small fraction comprises nonsense mutations, small insertions, deletions, and amplifications (Fig. 4). Interestingly, statistically significant differences in the types of mutation exist between colon and breast cancer. For example, C:G to T:A transitions are enriched in colon cancer. The authors speculate that this difference is due to differences in the mutagens to which each tissue is exposed. In the gut, several dietary mutagens are known to lead specifically to C:G to T:A transitions.
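To make the notion of a mutation spectrum concrete, the following Python sketch groups single-base substitutions into base-pair classes such as C:G to T:A. It assumes each substitution is given as a (reference base, mutant base) pair on either strand; the function names and the toy input are hypothetical, and the statistical comparison between tissues performed by the authors is not reproduced here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref: str, alt: str) -> str:
    """Name a substitution in base-pair notation, e.g. 'C:G>T:A'."""
    if ref == alt:
        raise ValueError("not a substitution")
    # A change seen on one strand implies the complementary change on the
    # other, so express every event with a pyrimidine (C or T) as reference.
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


def spectrum(substitutions: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of substitutions falling in each base-pair class."""
    counts = Counter(substitution_class(r, a) for r, a in substitutions)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy input, invented for illustration: C>T and G>A are the same class.
toy = [("C", "T"), ("G", "A"), ("A", "G"), ("C", "A")]
toy_spectrum = spectrum(toy)
print(toy_spectrum)
```

Computing such a table separately for breast and colon tumors would give the two spectra whose difference the authors report; whether a difference is significant would still need a formal test.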

Distinguishing Passenger Genes

A central difficulty in interpreting genomic sequence data derived from cancers is distinguishing driver mutations from passenger mutations, that is, separating functional mutations from background mutations. During the neoplastic process, several key genes are likely to suffer mutations that confer a selective growth advantage on the tumor, while a number of other genes are likely to accrue neutral mutations that bestow no selective advantage. To account for passenger mutations, Sjöblom et al. use a statistical metric based on the probability that the number of mutations in a given gene is greater than expected from the background mutation rate. The output of this analysis is called the Cancer Mutation Prevalence (CaMP) score. Wood et al., on the other hand, estimate passenger rates using a more complete method based on empirical Bayes simulation (a more sophisticated version of the CaMP metric) as well as experimental data derived from measuring the mutation rate of colon cancers that exhibit loss of heterozygosity (LOH) at chromosome 8. This analysis led to a more accurate estimate of the passenger mutation rate and to the removal of one sample that exhibited an abnormally high level of passenger mutations. Sjöblom et al. then separated CAN genes from passengers by taking genes identified in the validation screen with CaMP scores higher than 1.0, whereas Wood et al. define CAN genes as genes that harbored at least one nonsynonymous mutation in both the discovery and validation screens and whose total number of mutations per nucleotide sequenced exceeded a minimum threshold (Fig. 5). These filters defined 189 and 280 CAN genes in Sjöblom et al. and Wood et al., respectively. Importantly, the two metrics identified essentially the same population of mutant genes.
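The idea behind scoring a gene against the background mutation rate can be illustrated with a simple binomial tail probability. The sketch below is not the CaMP score or the empirical Bayes method used in the papers; it only shows, under the simplifying assumption that every sequenced base mutates independently at one uniform background rate, how unlikely a given mutation count would be if all mutations were passengers. The rate, the number of bases, and the observed count are placeholder values chosen for illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), via the complement of P(X < k)."""
    below = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - below


# Placeholder inputs (assumptions, not values taken from the studies).
background_rate = 1.2e-6   # passenger mutations per base sequenced
bases_sequenced = 50_000   # coding bases of one gene summed over all tumors
observed = 3               # mutations observed in that gene

p_value = prob_at_least(observed, bases_sequenced, background_rate)
print(f"P(>= {observed} passenger mutations) = {p_value:.2e}")
```

A small probability suggests the gene is mutated more often than the background alone would explain. The published metrics are more elaborate than this single uniform rate, so the sketch should be read only as the underlying intuition.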


Secondary Validation

Further validation of these genes was performed using several analyses. The prevalence of somatic mutations in 40 CAN genes with high statistical scores was analyzed in an additional 96 colon cancers. Each of these genes was sequenced to determine the frequency and spectrum of lesions in each tumor. Over half of the CAN genes analyzed were found to be mutated in these tumors; however, each mutant gene was present at only a very low frequency across the tumors sequenced. In addition, several other aspects of the somatic mutations identified in the discovery and validation screens were studied. First, 622 structural models were analyzed for mutations predicted to destabilize protein folding. A number of mutations in enzyme active sites, and mutations clustering along particular protein-protein interaction surfaces, were discovered. Second, interaction data for each gene were examined to identify genes whose products interact with one another and somatic mutations predicted to disrupt these interactions. Several genes were discovered whose protein products were predicted to interact with a large number of the somatically mutated genes. Interestingly, a large number of these genes are known schizophrenia susceptibility genes or genes involved in DNA repair.

Investigation of Pathways Enriched with CAN Genes

Finally, CAN genes were assessed for enrichment in signal transduction, metabolic, or other cellular processes to determine the pathways and molecular complexes frequently targeted in tumors. Many genes that function in specific pathways were identified. For example, the PI3K, NF-κB, and Wnt pathways all harbored a large number of mutations (Figs. 6, 7, and 8; blue asterisks, breast cancer; orange asterisks, colon cancer). While many genes, such as APC, were selectively mutated in one cancer type, other genes, such as TP53, were mutated in both tumor types. In addition, genes such as B-Raf and K-Ras generally exhibited mutually exclusive mutations in individual tumors, suggesting that deregulation of a specific pathway is important but that the particular gene mutated within it is less so. These data indicate that certain pathways are selectively dysregulated in tumors and provide important information for targeting cancer therapies effectively.
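Mutual exclusivity of two genes can be checked by tabulating, tumor by tumor, whether each gene is mutated. The Python sketch below assumes the data are available as a mapping from tumor identifier to the set of mutated genes; the tumor identifiers and their mutation sets are invented for illustration and are not data from the studies.

```python
from typing import Dict, Set


def cooccurrence_table(tumors: Dict[str, Set[str]],
                       gene_a: str, gene_b: str) -> Dict[str, int]:
    """Count tumors by mutation status of two genes (a 2x2 table)."""
    table = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for mutated in tumors.values():
        has_a, has_b = gene_a in mutated, gene_b in mutated
        if has_a and has_b:
            table["both"] += 1
        elif has_a:
            table["only_a"] += 1
        elif has_b:
            table["only_b"] += 1
        else:
            table["neither"] += 1
    return table


# Hypothetical tumors and mutation sets, for illustration only.
tumors = {
    "tumor_1": {"KRAS", "APC"},
    "tumor_2": {"BRAF", "TP53"},
    "tumor_3": {"KRAS", "TP53"},
    "tumor_4": {"APC"},
    "tumor_5": {"BRAF"},
}
table = cooccurrence_table(tumors, "KRAS", "BRAF")
print(table)
```

A pattern with many tumors in the "only" cells and few or none in "both" is what mutually exclusive mutation looks like; with real cohort sizes, a test such as Fisher's exact test on the same table would indicate whether the pattern exceeds chance.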



Conclusion

The general theme of these studies is that while a few genes are frequently mutated in most tumors (gene mountains, e.g., APC and TP53), the majority of mutated genes are mutated at much lower frequencies (gene hills), where each hill represents a gene that is mutated at a low frequency (Fig. 9). This model suggests that tumors can arise through many different constellations of mutations and that the individual genes that are mutated are less important than the effect of the total collection of mutations on the tumor. Cataloging the total spectrum of nonsynonymous somatic mutations within the coding regions of the genome is only the first step. More robust metrics, coupled with functional screens, will be necessary to define driver mutations. Additionally, to truly capture all the salient driver mutations within a tumor, the entire genome of the tumor must be sequenced. miRNAs, other ncRNA species, cis-regulatory elements, and synonymous mutations all represent genomic features outside the scope of these screens that cancer cells can target to drive neoplasia. Moreover, gross chromosomal abnormalities such as large translocations, deletions, and amplifications cannot be discovered by the sequencing strategy used in these studies. Thus, more advanced methods must be applied to address these problems and capture the total cohort of mutations within tumors.
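The distinction between gene mountains and gene hills comes down to how often each gene is mutated across a set of tumors. The Python sketch below tallies that frequency and splits genes at a cutoff. The cutoff value, the function names, and the toy tumors are assumptions made purely for illustration; the studies do not define mountains and hills by this rule.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def mutation_frequencies(tumors: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of tumors in which each gene carries a mutation."""
    counts = Counter(gene for mutated in tumors.values() for gene in mutated)
    return {gene: n / len(tumors) for gene, n in counts.items()}


def split_landscape(freqs: Dict[str, float],
                    mountain_cutoff: float) -> Tuple[List[str], List[str]]:
    """Separate frequently mutated genes from rarely mutated ones."""
    mountains = sorted(g for g, f in freqs.items() if f >= mountain_cutoff)
    hills = sorted(g for g, f in freqs.items() if f < mountain_cutoff)
    return mountains, hills


# Hypothetical tumors; GENE_W to GENE_Z stand in for rarely mutated genes.
tumors = {
    "tumor_1": {"APC", "TP53", "GENE_X"},
    "tumor_2": {"APC", "TP53", "GENE_Y"},
    "tumor_3": {"APC", "GENE_Z"},
    "tumor_4": {"TP53", "GENE_W"},
}
freqs = mutation_frequencies(tumors)
mountains, hills = split_landscape(freqs, mountain_cutoff=0.5)  # arbitrary cutoff
print("mountains:", mountains)
print("hills:", hills)
```

In this toy cohort the two mountains recur across tumors while every hill appears only once, which is the qualitative picture the landscape model describes.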


1) Strachan, T. and A. Read (1999). "Human Molecular Genetics 2." New York: John Wiley & Sons Inc.

2) Wood, L. and B. Vogelstein (2007). "The genomic landscapes of human breast and colorectal cancers." Science 318(5853): 1108-13.