gene expression analysis
overview of SAGE
Serial analysis of gene expression, or SAGE, is a technique designed to take advantage of high-throughput sequencing technology to obtain a quantitative profile of cellular gene expression.
graphical protocol of how serial tags are made
SAGE quantifies a tag which represents the transcription product of a gene.
1. extract mRNA
2. make cDNA from this mRNA population by extending from biotinylated-oligo-dT
3. cleave biotinylated cDNA with anchoring enzyme (NlaIII recognizes CATG and cuts at CATG|)
4. bind biotinylated cDNA to streptavidin magnetic beads to purify
5. ligate linkers to bound DNA
6. release cDNA TAGs using tagging enzyme (FokI, cuts at GGATG..8..|)
7. make blunt ends using Klenow
8. ligate blunt ended fragments
9. PCR amplify the ditag - should have a size of about 100 bp.
10. isolate ditags by cleaving with anchoring enzyme
11. purify ditags
12. concatenate pool of ditags
13. clone and sequence
an example of how SAGE works
Making a ditag serves several purposes
A ten base tag is not a perfect representation of a gene's entire transcript.
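As a rough sketch of how a tag could be predicted from a transcript sequence, the Python below assumes the tag is the ten bases immediately 3' of the CATG site closest to the poly-A tail. The function name and the example sequence are made up for illustration and are not taken from any SAGE software.

```python
def sage_tag(transcript, anchor="CATG", tag_length=10):
    """Return the tag that follows the 3'-most anchoring site, or None."""
    seq = transcript.upper()
    # Search from the 3' end, because the cDNA is held by its poly-A end
    # and only the fragment next to the last anchoring site stays bound.
    pos = seq.rfind(anchor)
    if pos == -1:
        return None  # no anchoring site: this transcript yields no tag
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    # A site too close to the 3' end cannot give a full-length tag.
    return tag if len(tag) == tag_length else None


# Hypothetical transcript with two CATG sites; only the last one counts.
example = "GGCATGAAACCCGGGTTTACATGTTGACCAGTCAAAAAAAAAA"
print(sage_tag(example))  # TTGACCAGTC
```

Two different transcripts that share the same ten bases after their last CATG would give the same tag, which is one reason the tag is an imperfect representation of a gene.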
human pancreas tags
gene to Tag mapping at NCBI
an entry in the SAGE tag database
some libraries containing SAGE Tags for a specific unigene entry
selecting libraries for comparison studies
NCBI ratios of expression of tags in cancer vs. normal
individual listings for some hits in this comparison between two libraries
looking at profile neighbors - have similar expression profiles
DNA MicroArray technology - uses
how do gene arrays work - steps involved
making cDNA microarrays - graphic
16 slide series about making Affymetrix Chips
probes and hybridization - two colors used
red and green images as grayscale TIFFs
glucose vs. galactose in yeast - some questions
How do we measure levels of expression?
bit depth for images - how much information to save
resolution of image scanner
dealing with background
Spot quality can be measured by taking multiple intensity readings within a single spot. The more even those readings are, the better the spot quality.
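One possible way to put a number on spot evenness is the coefficient of variation of the pixel intensities inside the spot. This is only a sketch: the choice of statistic and the pixel values below are assumptions for illustration.

```python
from statistics import mean, pstdev


def spot_cv(pixel_values):
    """Coefficient of variation (standard deviation / mean) within one spot."""
    m = mean(pixel_values)
    if m == 0:
        return float("inf")  # an empty spot has no meaningful evenness
    return pstdev(pixel_values) / m


# Hypothetical pixel intensities sampled inside two spots.
even_spot = [980, 1010, 1000, 995, 1015]
uneven_spot = [200, 1800, 950, 400, 1650]

print(spot_cv(even_spot) < spot_cv(uneven_spot))  # True
```

A lower value means a more even spot; where to set the cutoff for rejecting a spot is a judgment call for each experiment.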
red vs. green intensity
what are we trying to learn
We will look at a time course experiment in which probes are generated at various times during the experiment. Each time point will consist of hybridization to an array with the RNAs/cDNAs from that time point. The time course might represent times during development, times after treatment with a drug, or perhaps just times during the cell cycle.
A series of experiments is done, resulting in array data for each time point. Levels of expression are measured and the data are plotted for each gene.
data from time course experiment (multiple genes)
This discussion relies heavily on Brian Fristensky's web site. It is used with his generous permission.
Note that the discussion we just had about data quality applies to each array used and must also apply for the same gene in each different array!
Each little graph represents the level of expression of one particular gene (one location in an array) across all the arrays (time points) used in the experiment.
gene expression profile for two genes
Experimental error, as indicated in the error bars, can be due to many factors, each contributing a small error but when taken together might add up to a significant error.
The goal is to set up the experiment to minimize the standard error of the observations.
In the figure, some time points show clearly different levels for the different genes. For other time points, the error is great enough that the two levels cannot be reliably distinguished.
sources of experimental variation
long hybridization times are important
In any hybridization experiment, the time required for hybridization to go to completion is inversely proportional to the concentration of the probe: the lower the concentration, the longer it takes.
For example, at the time indicated by the dotted line on the X-intercept, the moderately abundant transcript would be estimated with only a small error, while the abundance of the rare transcript would probably be greatly underestimated.
possibilities to solve problem are presented on the slide
best is to allow hybridization to go to completion... long hyb times
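The effect of concentration on hybridization time can be illustrated with a simple model. The sketch below assumes pseudo-first-order kinetics, where the fraction hybridized after time t is 1 - exp(-k*c*t). The rate constant k and the concentrations c are arbitrary made-up values, not measurements.

```python
import math


def fraction_hybridized(c, t, k=1.0):
    """Fraction of complete hybridization reached at time t (assumed model)."""
    return 1.0 - math.exp(-k * c * t)


def time_to_fraction(c, fraction=0.95, k=1.0):
    """Time needed to reach the given fraction; scales as 1/c in this model."""
    return -math.log(1.0 - fraction) / (k * c)


# Arbitrary relative concentrations: the rare species is 100-fold lower.
abundant, rare = 1.0, 0.01
t = 3.0  # a hybridization time that is nearly long enough for 'abundant'

print(round(fraction_hybridized(abundant, t), 3))
print(round(fraction_hybridized(rare, t), 3))
print(time_to_fraction(rare) / time_to_fraction(abundant))
```

With these numbers the abundant species is about 95% hybridized while the rare one is only about 3% hybridized, so stopping at this time would badly underestimate the rare transcript. In this model the rare species needs about 100 times longer to reach the same fraction.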
Washing stringency - For genes that are members of multigene families, hybridization results could vary depending on hybridization and washing stringency.
Image acquisition - The acquisition of the image data carries built-in sources of variation similar to those of hybridization. Within a certain intensity range, the amount of signal detected is linearly proportional to the time of exposure.
Usually, data acquisition entails accumulation of signal by a CCD camera. Data is saved as a TIFF image, where intensity of a given pixel is proportional to the amount of signal coming from part of the filter or slide.
It is important to recognize that these errors of detection are compounded on top of the errors associated with hybridization time!
Raw intensities from each spot on an array are not directly comparable. Depending on the types of experiments done, a number of different approaches to normalization may be needed. Not all types of normalization are appropriate in all experiments. Some experiments may use more than one type of normalization.
subtracting negative controls
negative controls might be DNA that is not present in the mRNA population
normalize to positive control
To allow comparison of genes from one filter to the next, it is often useful to spike the labeling reaction with some foreign RNA or DNA that is not normally in the RNA population.
This should provide a reliable constant signal level that actually normalizes for the differences between arrays.
While in principle some presumably constitutive genes like actin, tubulin, ribosomal proteins, or ubiquitin might serve as controls, careful experiments often show that these genes are not really constitutive and can vary from experiment to experiment or tissue to tissue.
Therefore, foreign DNA sequences, known not to be present in the species being studied, are better controls.
For example, a human RNA population might be spiked with plant RNA, and plant genes used as positive controls on the array.
Normalization of signal for each gene as a ratio makes it possible to compare ratios between experiments, provided that the spiked controls are the same in all experiments.
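A minimal sketch of these two steps, subtracting the negative-control background and then dividing by the spiked positive control, might look like this. The function name, the gene names, and all intensities are hypothetical.

```python
def normalize_to_spike(raw, negative_controls, spike_controls):
    """Return background-subtracted signals as ratios to the spike-in signal.

    raw: dict mapping gene name -> raw spot intensity on one array
    negative_controls, spike_controls: lists of raw intensities on that array
    """
    background = sum(negative_controls) / len(negative_controls)
    spike = sum(spike_controls) / len(spike_controls) - background
    if spike <= 0:
        raise ValueError("spike-in signal does not exceed background")
    # Signals below background are set to zero rather than going negative.
    return {gene: max(value - background, 0.0) / spike
            for gene, value in raw.items()}


# Two hypothetical arrays; the second was labeled about twice as strongly.
array1 = normalize_to_spike({"geneA": 1100, "geneB": 350}, [90, 110], [2100, 2100])
array2 = normalize_to_spike({"geneA": 2150, "geneB": 650}, [140, 160], [4150, 4150])

print(array1)  # {'geneA': 0.5, 'geneB': 0.125}
print(array2)  # {'geneA': 0.5, 'geneB': 0.125}
```

Although the raw intensities differ about two-fold between the arrays, the ratios to the spiked control agree, which is the point of using the same spike in every experiment.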
Normalization to a positive control is typically used in single-label experiments. One experiment can be compared to another either by plotting the signals directly on a graph, or by converting the signals from two experiments into a ratio, usually by choosing one treatment as a control.
For example, in a timecourse, a 0 hour timepoint might be chosen, and signal from all other timepoints divided by the signal for the 0 hour timepoint, to give a ratio.
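The time-zero ratio can be sketched in a few lines. The signal values are invented and are assumed to be already background-corrected and normalized.

```python
def ratios_to_baseline(signals, baseline_index=0):
    """Divide every time point by the chosen baseline time point."""
    base = signals[baseline_index]
    if base <= 0:
        raise ValueError("baseline signal must be positive")
    return [s / base for s in signals]


# Hypothetical normalized signals for one gene at 0, 1, 2 and 4 hours.
timecourse = [200.0, 400.0, 800.0, 100.0]
print(ratios_to_baseline(timecourse))  # [1.0, 2.0, 4.0, 0.5]
```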
from: Renu A. Heller, Mark Schena, Andrew Chai, Dari Shalon, Tod Bedilion, James Gilmore, David E. Woolley, and Ronald W. Davis (1997) Discovery and analysis of inflammatory disease-related genes using cDNA microarrays Proc. Natl. Acad. Sci. USA 94:2150-2155.
In this experiment, they were interested in learning how cells respond during an inflammatory response. This array contains multiple sections of control genes.
layout of genes and controls
spot data from induction
graphic of spot data
Because of the many sources of variation from experiment to experiment, one of the best possible controls is to choose some experimental condition as a baseline, to use as a control against all other experimental conditions or treatments.
For example, the level of expression in a wild type organism might be the baseline, for comparison with expression levels in mutants.
An excellent control can then be implemented by labeling the control RNA population with one dye (e.g. Cy3) and all other RNA populations with a different dye (e.g. Cy5).
Each labeled experimental population is then mixed with an equal quantity of the labeled control RNA, and the mixed sample is hybridized with a gene array. The array is scanned at the wavelength for each dye, and the ratio of experiment to control is the ratio of the background-corrected intensities of the two dyes.
In this case, ratio to the control genes is used to determine signal strength for each gene spot on the array. This ratio vs. control genes might be different for each color. Then the ratio of red to green signal is determined.
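A sketch of the two-color calculation for a single spot follows. It assumes each channel is first corrected for its own background, then optionally scaled by that channel's control-gene signal, and only then is the red/green ratio taken. The function names, the floor value, and the intensities are assumptions.

```python
def corrected(signal, background, floor=1.0):
    """Background-corrected signal, kept at or above a small floor."""
    return max(signal - background, floor)


def red_green_ratio(cy5, cy5_bg, cy3, cy3_bg, cy5_control=1.0, cy3_control=1.0):
    """Ratio of Cy5 (red) to Cy3 (green) for one spot.

    cy5_control and cy3_control are the control-gene signals in each
    channel; leaving them at 1.0 skips the per-channel scaling.
    """
    red = corrected(cy5, cy5_bg) / cy5_control
    green = corrected(cy3, cy3_bg) / cy3_control
    return red / green


# Hypothetical spot: 1600 vs 400 after background correction.
print(red_green_ratio(1700, 100, 500, 100))  # 4.0

# Same spot when the red channel's controls read twice as bright as green's.
print(red_green_ratio(1700, 100, 500, 100, cy5_control=2000, cy3_control=1000))
```

The floor keeps a spot that is at or below background from producing a zero or negative ratio; whether to use a floor or to discard such spots is a design choice.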
from: Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent JM, Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283(5398):83-87.
cDNA samples were made from human fibroblasts treated with serum (Cy5-dUTP, red) or from serum-deprived cells (Cy3-dUTP, green). Three replicate arrays were scanned, and the same regions from the replicate arrays are shown. Serum-inducible genes appear red, while serum-suppressible genes appear green.
1 - protein disulfide isomerase-related protein P5;
2 - IL-8 precursor;
3 - EST AA057170;
4 - vascular endothelial growth factor.
To facilitate easier mathematical handling of the data, as well as comparisons over a wide range of expression levels, ratios are usually expressed as logs.
For example, if a gene is expressed at 16-fold greater level in the control than in the mutant, log2 (1/16) = -4.
A log ratio of 0 is therefore indicative of a gene whose expression is the same in both conditions or treatments.
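A short example shows why logs are convenient: a 16-fold decrease and a 16-fold increase become -4 and +4, symmetric about zero. The function name is made up for this sketch.

```python
import math


def log2_ratio(experiment, control):
    """log2 of the experiment/control expression ratio."""
    return math.log2(experiment / control)


print(log2_ratio(1, 16))   # -4.0  (16-fold lower than control)
print(log2_ratio(16, 1))   # 4.0   (16-fold higher than control)
print(log2_ratio(5, 5))    # 0.0   (no change)
```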
Clustering allows for the identification of genes with similar expression profiles.
It is sometimes useful to manipulate DNA microarray data before clustering. The preliminary manipulations usually involve centering or normalizing the data in some way.
How you choose to perform the initial manipulations depends on the question that you are trying to answer and the way in which the experiment was performed.
normalizing vs. centering
similarity in profiles?
Consider two sine curves, one of amplitude 1 (gene A) and one of amplitude 5 (gene B). These might represent the time course of expression of each gene during a particular experiment.
These values represent the log2 ratio of the gene compared to some standard value.
How do we compare these two genes to see if they are behaving the same way?
These are not simple questions because the answers depend on the experimental design and what you want to define as similar. For example, the profiles are similar in that they have the same shape, but they are different in magnitude and in average value.
If the experiment is a cell cycle experiment that involves exactly one cell cycle, then it is reasonable to center the results because the average value throughout one cell cycle should be the same for all genes.
They might go up and down, but they will all end up at the same place they started. We want to know how they vary relative to each other during the cell cycle. Centering would place gene A ranging from -1 to +1 and leave gene B from -5 to +5.
Further, if we want to look for genes that have the same pattern of expression (they go up and down together), it would make sense to normalize the data first so that differences in magnitude do not influence the clustering. This would make each gene's results range from -1 to +1.
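The two manipulations can be tried on the sine-curve example. In this sketch, centering subtracts each gene's mean, and normalizing then divides by the largest absolute value so the profile spans -1 to +1. The offsets added to the curves and the function names are assumptions for illustration.

```python
import math


def center(profile):
    """Subtract the mean so the profile averages to zero."""
    m = sum(profile) / len(profile)
    return [x - m for x in profile]


def scale_to_unit_range(profile):
    """Center, then divide by the peak so values span -1 to +1."""
    c = center(profile)
    peak = max(abs(x) for x in c)
    return [x / peak for x in c] if peak else c


# Eight time points across exactly one cycle.
times = [i * 2 * math.pi / 8 for i in range(8)]
gene_a = [1 * math.sin(t) + 2 for t in times]  # amplitude 1, made-up offset
gene_b = [5 * math.sin(t) - 1 for t in times]  # amplitude 5, made-up offset

print(round(max(center(gene_a)), 6), round(max(center(gene_b)), 6))
print(all(abs(a - b) < 1e-9
          for a, b in zip(scale_to_unit_range(gene_a), scale_to_unit_range(gene_b))))
```

After centering alone the two genes still differ five-fold in magnitude; after normalizing they are indistinguishable, which is what you want only if the shape of the profile is the question being asked.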
The bottom line is that you need to be careful how you choose to treat your data. Often you might do manipulations of the data in a spreadsheet before trying to cluster them using one of the clustering programs.
The best advice is to think about what you are doing and what questions you really are asking.
series of 5 slides on PCA
To facilitate mathematical manipulations, it is useful to think about each gene in an experiment as a multidimensional vector. For example, if there are 10 experiments (time points or doses or other conditions), we can think of a 10 dimensional space in which a vector is drawn. This vector starts at the origin and extends to a point in space corresponding to the value along each axis of that particular experimental value (e.g. - a specific time point).
This is not easy to visualize. Try to think of a graph that has 10 orthogonal (right angle) axes by extension of what you already can visualize as 2 and 3 dimensional axes. The reason for thinking in vector terminology is that it allows for easy mathematical manipulations.
time points as a table
converting to vectors
Each vector then represents a set of data points, or experimental conditions. The vector represents the entire set of experimental results for a particular gene.
Normalizing the vectors would make them all have a length of one, although they can point in any direction.
Those vectors pointing in the same general direction represent genes whose behavior is similar.
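As a small sketch, a table of expression values can be turned into one vector per gene and each vector scaled to a length of one. The table contents and function names below are hypothetical.

```python
import math

# Hypothetical table: one row per gene, one column per time point.
table = {
    "geneA": [3.0, 4.0, 0.0],
    "geneB": [6.0, 8.0, 0.0],
}


def length(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def unit_vector(vector):
    """Scale a vector to length one, keeping its direction."""
    n = length(vector)
    if n == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in vector]


units = {gene: unit_vector(v) for gene, v in table.items()}
print(units["geneA"])
print(units["geneB"])
```

geneB is twice as long as geneA but points in the same direction, so both give the same unit vector.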
Clustering involves trying to group together genes or experiments (arrays) that have similar behaviors. Thinking in terms of vectors, we want to find genes whose vectors are similar in direction and perhaps length.
This comparison can be accomplished using either of two metrics.
Euclidean metrics measure the distance between the end points of the vectors, and therefore take into consideration the direction AND the magnitude of each vector. This means that genes having the same shaped profiles (going up and down at the same experimental points) but differing in magnitude would not be considered very similar.
Alternatively, normalize first so that each vector has a length of one, and then compare the directions of the vectors. This is the basis of the Pearson correlation, which also centers each vector about its mean. Any genes having the same behavior (going up and down in tandem) are then considered similar, regardless of magnitude. This is the same as looking at the angle between the vectors.
Clustering attempts to take those vectors that are most similar (in either Euclidean or Pearson space) and group them together.
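The difference between the two metrics shows up clearly in a toy example. The three profiles below are invented: gene_a and gene_b have the same shape but different magnitudes, while gene_c has a different shape.

```python
import math


def euclidean(u, v):
    """Distance between the end points of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Pearson correlation: +1 same shape, 0 unrelated, -1 opposite."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return numerator / denominator


gene_a = [0.0, 1.0, 0.0, -1.0]
gene_b = [0.0, 5.0, 0.0, -5.0]   # same shape as gene_a, 5x the magnitude
gene_c = [1.0, 0.0, -1.0, 0.0]   # same magnitude as gene_a, shifted shape

print(euclidean(gene_a, gene_b), euclidean(gene_a, gene_c))
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

By Euclidean distance gene_a is closer to gene_c than to gene_b; by Pearson correlation gene_a and gene_b are a perfect match and gene_c is unrelated.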
Create a table containing similarity scores for every pair of genes (how far apart each pair of vectors are).
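Building the table of scores for every pair of genes could be sketched as follows, here using Euclidean distance. The gene names and values are hypothetical, and any other metric could be substituted.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Distance between the end points of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_table(profiles):
    """Map every unordered pair of gene names to the distance between them."""
    return {(g, h): euclidean(profiles[g], profiles[h])
            for g, h in combinations(sorted(profiles), 2)}


profiles = {
    "geneA": [0.0, 0.0],
    "geneB": [3.0, 4.0],
    "geneC": [0.0, 1.0],
}

for pair, d in sorted(distance_table(profiles).items(), key=lambda item: item[1]):
    print(pair, d)
```

For n genes the table has n(n-1)/2 entries, so it grows quickly as genes are added to the array.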
K Means Clustering
One problem with this approach is in choosing the correct number of nodes to start with. The best advice is to take a guess and then try using more or fewer nodes based on the results of the clustering. Trial and error will lead you to the best solution.
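A bare-bones version of k-means is sketched below to make the role of the number of nodes (k) concrete. It is a teaching sketch, not the code of any particular clustering program: it starts from the first k profiles as centers, which is an arbitrary choice, and the expression profiles are invented.

```python
import math


def distance(u, v):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(points, k, iterations=100):
    """Return (assignment, centers) after plain k-means on lists of floats."""
    centers = [list(p) for p in points[:k]]  # arbitrary starting centers
    assignment = None
    for _ in range(iterations):
        # Step 1: assign every gene to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Step 2: move each center to the mean of its member genes.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


profiles = [
    [1.0, 2.0, 1.0],
    [1.1, 2.1, 0.9],
    [-1.0, -2.0, -1.0],
    [-0.9, -2.2, -1.1],
]
assignment, centers = kmeans(profiles, k=2)
print(assignment)
```

Rerunning with a different k, and looking at whether the resulting groups make biological sense, is the trial and error described above.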
Self Organizing Maps (SOM) Clustering
Vector Space Clustering
Example 1: Michael B. Eisen et al., PNAS 95:14863 (1998)
human fibroblasts after serum stimulation
Example 2: gene expression mapping of CNS development in rat; Wen et al., PNAS 95:334 (1998)
categories of genes used
clustered expression patterns
hierarchical clustering was done based on direction of vectors and then tree was created by measuring distance between vectors
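To show how a tree can be built from pairwise comparisons, here is a small sketch of agglomerative clustering. It uses 1 minus the Pearson correlation as the distance and average linkage to compare groups. Both choices are assumptions for this example and are not necessarily what was used in the paper. The profiles are invented.

```python
import math


def pearson(u, v):
    """Pearson correlation between two profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    return numerator / math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))


def average_linkage_tree(profiles):
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    clusters = {name: [name] for name in profiles}  # tree node -> member genes

    def dist(a, b):
        # Average of (1 - correlation) over all gene pairs between two clusters.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(1 - pearson(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        keys = list(clusters)
        a, b = min(((p, q) for i, p in enumerate(keys) for q in keys[i + 1:]),
                   key=lambda pq: dist(*pq))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


profiles = {
    "g1": [0.0, 1.0, 0.0, -1.0],
    "g2": [0.0, 5.0, 0.0, -5.0],
    "g3": [1.0, 0.0, -1.0, 0.0],
    "g4": [2.0, 0.1, -2.0, 0.0],
}
print(average_linkage_tree(profiles))  # (('g1', 'g2'), ('g3', 'g4'))
```

The nested tuples give the order of merging, which is the same information a tree view displays as branch structure.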
tree view of data
PCA analysis plot
Example 3: Molecular Portraits of Human Breast Tumors; Perou, et al Nature 406:747 (2000)
some of the data, centered about median of expression level for that gene
clustering by gene expression profiles
clustering profiles by tumor type
conclusions - tumor portraits
some difficult ethical questions
1. assaying for diabetes and finding a breast cancer gene - tell patient? doctor's legal obligation? does insurance company need to know?
2. If pharmacogenomics tests are not done prior to treatment, can doctor/drug company be sued? give Prozac example
3. If insurance company pays for test do they have a right to the results?
4. If patient pays for test is she/he hiding information from insurance company and subject to having policy cancelled?
5. If tests are expensive, and patients pay for them, is this a kind of economic discrimination?