8/9/01
Lecture #16 - Computational Biology
DNA and Protein Sequence Databases
- GenBank: when a new sequence is found, it is put into this
database; set up in the USA by the National Library of Medicine
- EMBL database: set up by European Molecular Biology Laboratory
- DDBJ: DNA Data Bank of Japan
- the above three exchange sequences daily
- PIR and SwissProt: protein sequence databases
- GenPept: protein sequences translated from GenBank (a DNA
sequence can be translated in all 6 reading frames)
- http://ncbi.nlm.nih.gov is the National Center for Biotechnology
Information (NCBI); contains links to the databases mentioned above
- for other kinds of databases, see handout
Analysis of Sequence Information
- algorithm: a finite sequence of well-defined actions whose
purpose is to accomplish a given task
- example: take DNA and find an ORF (open reading frame); there
are 6 possible reading frames to check
- need to know the mechanism of the algorithm, otherwise you may
misinterpret your results
- simple algorithm: sequence assembly
- cut up the DNA into fragments, sequence them, and then try to
re-assemble the fragments into one big piece; if the DNA contains
repeats, it is ambiguous where a fragment belongs
- could have sequencing errors or mismatches. How much mismatch
should you allow?
- sliding window algorithm - view the property of the sequence
as a function of position along the sequence
- example: see slide #4 - get a picture of how the base
composition varies along the DNA
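A minimal sketch of the sliding window idea, applied to base composition. The function name, the default window of 10, and the choice of GC fraction as the plotted property are assumptions for illustration; the slide may use a different window or property.

```python
def sliding_window_gc(seq, window=10, step=1):
    """Return (start position, GC fraction) for each window along seq."""
    seq = seq.upper()
    results = []
    # Move the window along the sequence one step at a time.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        results.append((start, gc / window))
    return results


if __name__ == "__main__":
    # Plotting the second value against the first gives the picture
    # of how base composition varies along the DNA.
    for pos, frac in sliding_window_gc("ATATATGCGCGCGCATATAT", window=4, step=4):
        print(pos, frac)
```

A larger window gives a smoother curve but hides short features; a smaller window is noisier.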
Hydropathy
- look at side chain of all amino acids, look at relative
solubility (how hydrophilic/ hydrophobic)
- different types of hydropathy:
- 1. Hopp-Woods: window of 6
- look up the value of each amino acid in a table, and
average the values over the window, and plot those numbers
- above 0 = hydrophilic (outside face of protein)
- below 0 = hydrophobic (inside face of protein, interact
with hydrophobic parts of other proteins)
- used to find out the property of a segment of the protein,
not just one amino acid
- 2. Kyte-Doolittle
- for the entire protein and its structure (not looking at
just the side chains of the a.a.)
- 3. antigenic index
- if you superimpose the three plots, they differ a little, so
the regions where all three agree are the most reliable
- window size is usually found by trial and error: try several
and keep the one that works best
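The Hopp-Woods procedure above (look up each amino acid, average over a window of 6, plot) can be sketched as follows. The table values are the Hopp-Woods hydrophilicity values as commonly tabulated and should be checked against the handout; the function name is an assumption.

```python
# Hopp-Woods hydrophilicity values (one-letter amino acid codes),
# as commonly tabulated; verify against the table used in class.
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3,
    "N": 0.2, "Q": 0.2, "G": 0.0, "P": 0.0, "T": -0.4,
    "A": -0.5, "H": -0.5, "C": -1.0, "M": -1.3, "V": -1.5,
    "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def hydropathy_profile(protein, window=6, scale=HOPP_WOODS):
    """Average the scale values over each window along the protein."""
    protein = protein.upper()
    profile = []
    for start in range(len(protein) - window + 1):
        values = [scale[aa] for aa in protein[start:start + window]]
        profile.append(sum(values) / window)
    return profile


if __name__ == "__main__":
    # Above 0 = hydrophilic stretch, below 0 = hydrophobic stretch.
    for i, value in enumerate(hydropathy_profile("KKDEESLLIVFW")):
        print(i, round(value, 2))
```

Passing a different table as `scale` and a different `window` lets the same loop be reused for other hydropathy scales.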
Predicting protein structure
- 1. CF prediction graph: Chou and Fasman devised an algorithm
for predicting protein folding; find a.a. stretches that tend to
form alpha helices or beta sheets
- plot the probability that a stretch would form an alpha helix
or beta sheet
- the resulting plot of the protein structure is about 60%
accurate
- 2. GOR Structure Prediction (squiggles) - employs a slightly
different mechanism to find the alpha helices and beta sheets
- sliding window of 17, look at all 17 values (weighted)
- plotted as a squiggle
Neural nets
- example: train it to read a page of text and speak it;
output = phoneme (a specific sound of the English language)
- has 3 levels (1-3-7): 1 output fed by 3 middle units, which
are in turn fed by 7 detectors
- on the first read-through, the computer goes through the page
of text and makes random sounds
- then the values are adjusted, the page is reread, and new
output is generated
- network learns spaces between words, vowels, then consonants
- same mechanism for a neural net for proteins: you train the
neural net with protein sequences of known structure so that it
will predict the structure of a new protein
- it's about 70-75% accurate
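A toy sketch of the train-and-reread loop with a 7-3-1 network. Everything specific here is an assumption: the notes only say the values are adjusted between passes, so the update rule (backpropagation with a sigmoid unit), the learning rate, and the toy task (output 1 when most of the 7 inputs are on) are stand-ins, not the network from the lecture and not a protein predictor.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """7 detectors -> 3 middle units -> 1 output (weights start random)."""

    def __init__(self, n_in=7, n_hidden=3, seed=0):
        rng = random.Random(seed)
        # The extra weight in each row is a bias term.
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        inputs = x + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w1]
        out = sigmoid(sum(w * v for w, v in zip(self.w2, hidden + [1.0])))
        return hidden, out

    def train_pass(self, data, rate=0.5):
        """One read-through of the data; returns the mean squared error."""
        total = 0.0
        for x, target in data:
            hidden, out = self.forward(x)
            err = out - target
            total += err * err
            d_out = err * out * (1 - out)
            # Hidden deltas are computed before the output weights change.
            d_hidden = [d_out * self.w2[j] * h * (1 - h)
                        for j, h in enumerate(hidden)]
            for j, h in enumerate(hidden + [1.0]):
                self.w2[j] -= rate * d_out * h
            for j, row in enumerate(self.w1):
                for i, v in enumerate(x + [1.0]):
                    row[i] -= rate * d_hidden[j] * v
        return total / len(data)


def majority_data():
    """Hypothetical toy task: all 7-bit inputs, target 1 if 4+ bits are on."""
    data = []
    for n in range(128):
        bits = [float(b) for b in format(n, "07b")]
        data.append((bits, 1.0 if sum(bits) >= 4 else 0.0))
    return data


if __name__ == "__main__":
    net, data = TinyNet(), majority_data()
    for epoch in range(200):
        loss = net.train_pass(data)
        if epoch % 50 == 0:
            print(epoch, round(loss, 4))
```

The first pass is close to random output; the error should fall as the passes are repeated, which is the behaviour the notes describe for the reading example.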
Codon usage
- there's different preference for codons depending on the
organism
- codon usage table: use it for codon preference analysis
- slide #14:
- curve shows use of codons in ORF (how the use of codons
matches the known codon preference in the Drosophila codon table)
- above the line = 95% likelihood that the codon is used
- rare codon = used less than 10%
- few rare codons appear in the middle of the ORF (good, because
it will be translated better); an example of biased codon usage
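A small sketch of codon preference analysis: count the codons in an ORF and flag the positions that use a rare codon. The usage table passed in is hypothetical (a real one would come from the organism's codon table, e.g. Drosophila), and the 10% cutoff is read here as a table frequency below 0.10.

```python
from collections import Counter


def split_codons(orf):
    """Split an ORF into codons, ignoring any incomplete codon at the end."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def codon_counts(orf):
    """Count how many times each codon is used in the ORF."""
    return Counter(split_codons(orf))


def rare_codon_positions(orf, usage_table, threshold=0.10):
    """Return codon indices whose table frequency is below the threshold.

    usage_table maps codon -> fraction (hypothetical values here).
    Codons missing from the table are treated as frequency 0.0, so
    they are reported as rare.
    """
    return [index for index, codon in enumerate(split_codons(orf))
            if usage_table.get(codon, 0.0) < threshold]


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    table = {"ATG": 1.0, "GCC": 0.45, "GCG": 0.05, "TAA": 0.5}
    print(codon_counts("ATGGCCGCGTAA"))
    print(rare_codon_positions("ATGGCCGCGTAA", table))
```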
Dot matrix comparisons:
- comparison of sequences of myoglobin from pig and that of cow
- slide a window of 10 along the top sequence, and scan it
against the side sequence
- straight diagonal line = clear pattern, the sequences are
related
- leghemoglobin dot matrix: don't see one continuous diagonal; a
gap in the diagonal means an intron (it's not in the cDNA), so the
plot shows the locations of introns and exons
- ribosomal DNA dot matrix #1 = allow up to 2 mismatches, used
to find repeated sequences
- ribosomal DNA dot matrix #2 = allow up to 5 mismatches, window
of 15, see repeated sequences in background (splotches)
- Dot Matrix (Protein - Identity Table): alpha and beta
globin; gaps = different codons specifying the same a.a.?
- PAM250 table:
- Lys + Phe = -5 (because very diff a.a.)
- Lys + Arg = 3 (both have + charge, similar)
- Lys + Tyr = -4
- Ser + Ser = 2 (low number because Ser is very common, so
you can substitute Ser for many a.a.)
- Trp + Trp = 17 (very high number because it's a rare a.a.,
it's a very important match)
- Dot Matrix (Protein - PAM250 Table): plot values using
the PAM250 table, get a nice diagonal
- can compare your sequence with the sequences in a database
(e.g. with BLAST). Gives you a value (it's a scoring system) so
you can know how similar the sequences are
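A minimal sketch of the identity dot matrix with a sliding window and an allowed number of mismatches. The function names and the text rendering are assumptions for illustration; scoring with a PAM250 table instead of identity would replace the mismatch count with a sum of table values and a score cutoff.

```python
def dot_matrix(top, side, window=10, max_mismatches=0):
    """Return (i, j) pairs where the window starting at top[i] and the
    window starting at side[j] differ in at most max_mismatches places."""
    dots = []
    for j in range(len(side) - window + 1):
        for i in range(len(top) - window + 1):
            mismatches = sum(
                1 for a, b in zip(top[i:i + window], side[j:j + window])
                if a != b
            )
            if mismatches <= max_mismatches:
                dots.append((i, j))
    return dots


def render(top, side, dots, window):
    """Draw the matrix as text: '*' for a dot, '.' for no match."""
    dot_set = set(dots)
    rows = []
    for j in range(len(side) - window + 1):
        rows.append("".join(
            "*" if (i, j) in dot_set else "."
            for i in range(len(top) - window + 1)
        ))
    return "\n".join(rows)


if __name__ == "__main__":
    a, b = "ACGTACGGTA", "ACGTTCGGTA"
    print(render(a, b, dot_matrix(a, b, window=4, max_mismatches=1), 4))
```

Raising `max_mismatches` or shortening `window` brings out weaker similarities but also adds background dots, as in the two ribosomal DNA plots.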
Sequence similarity
- alignments and gaps: devising a scoring system is difficult
- need gap penalties
- problem with end gaps- if the gap's length is important, you
need to give a high penalty
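One way to see how gap penalties shape an alignment score is a small global alignment by dynamic programming (the Needleman-Wunsch approach, which the notes do not name; it is used here as a stand-in). The scores for match, mismatch and gap, and the separate `end_gap` parameter, are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2, end_gap=None):
    """Best global alignment score of a and b with a per-position gap
    penalty. end_gap is the penalty for gaps before the first or after
    the last residue of the other sequence; it defaults to gap."""
    if end_gap is None:
        end_gap = gap
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Leading end gaps fill the first row and column.
    for i in range(1, n + 1):
        score[i][0] = i * end_gap
    for j in range(1, m + 1):
        score[0][j] = j * end_gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # a[i-1] against a gap; trailing end gap once all of b is used.
            up = score[i - 1][j] + (end_gap if j == m else gap)
            # b[j-1] against a gap; trailing end gap once all of a is used.
            left = score[i][j - 1] + (end_gap if i == n else gap)
            score[i][j] = max(diag, up, left)
    return score[n][m]


if __name__ == "__main__":
    # Same pair of sequences, with penalized and with free end gaps.
    print(global_alignment_score("ACGTTT", "ACG"))
    print(global_alignment_score("ACGTTT", "ACG", end_gap=0))
```

Setting `end_gap=0` suits a short sequence placed inside a longer one; keeping it equal to `gap` penalizes overhanging ends when the full length is expected to match.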
Database searches
- GenBank database has grown very fast
- NCBI website has BLAST search
- finding genes on the website: feed your DNA sequence into the
algorithm; it gives you possible genes and predicts where those
sequences are; different algorithms give you different results; if
you link that with SwissProt, you can get the protein
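The simplest kind of gene finding is the ORF search mentioned earlier: scan all 6 reading frames (3 on each strand) for a start codon followed by a stop codon in the same frame. This is only a sketch under that assumption; the gene prediction programs on the website use more than this, and the function names and the `min_codons` cutoff are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Return (start, end) of each ATG...stop stretch in one frame.
    end is exclusive and includes the stop codon; min_codons counts
    the codons before the stop."""
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                orfs.append((start, i + 3))
            start = None
    return orfs


def six_frame_orfs(dna, min_codons=1):
    """Return (strand, frame, ORF sequence) for all 6 reading frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame, min_codons):
                results.append((strand, frame, seq[start:end]))
    return results


if __name__ == "__main__":
    for hit in six_frame_orfs("CCATGAAATAGGCTATTTCATGG"):
        print(hit)
```

Raising `min_codons` filters out short chance ORFs, which is one reason different settings and different algorithms report different genes.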