Panel 2


Why BEAM Works

The bases mediating the contact between a cis-regulatory element and a transcription factor are invariant, as they form the basis of recognition for the transcription factor. Bases in the binding site that are not required for the contact are subject to genetic drift. Non-critical bases are thus more variable than critical ones. BEAM is designed to recognize the invariant cores of cis-regulatory elements, using the following principle: if a word of a given length is over-represented in a set of upstream sequences, it will contain within it words of shorter length that are also over-represented.

This “core” motif property can be demonstrated empirically by tracing all possible paths that arise during the expansion of all 256 4-mers in a data set where five motifs have been planted at high levels of over-representation (Figure 2). The objective function used here, the Sig value, is a statistical measure of the degree of over-representation. In this example, motifs corresponding to prefixes for the planted motifs rapidly rise above background Sig values. At each motif length w, the highest scoring w-mers come from the highest scoring (w-1)-mers. As soon as expansions pass the length of the planted motifs, scores suddenly drop.

Figure 3 shows how effectively a beam search reduces the search space. Five randomly selected motifs from the TRANSFAC database were planted in each of five randomly selected genes. BEAM was run on this test set using a beam width of 1000. For various levels of the tree generated in this process, the fraction of w-mers that contain as prefixes any one of the top 1000 (w-1)-mers was measured directly. It can be seen that the turnover between zone C and zone B (see Figure 1) occurs largely at their boundary, and the top of the ranked list of motifs at every length w is remarkably stable. Thus, for a large enough size of zone B relative to zone A, the error rate can be made extremely small.

Genome Background Models

In order to be able to assess the significance of any candidate motif in a set of genes, one needs to be able to determine the background frequency of that motif within the rest of the genome (this contributes to the Sig score). Therefore, we measured the perplexity of two different motif distributions: known cis-regulatory elements (Pcre) and motifs sampled randomly from upstream sequences (Prand). Perplexity can be roughly thought of as the number of viable options that remain after the test model has been taken into account – the higher the perplexity, the less able the model is to explain the distribution. The results are shown in Figure 4. Twelve models were constructed: a maximum entropy model (maxEnt), in which each base is equally likely, ten Markov models drawn from the set of all upstream regions and ranging in order from 0 to 9, and a variable-length Markov model (varLen), which is equivalent to Maximum Likelihood Estimation generated by directly looking up the frequency of occurrence of each motif in all upstream regions. Pcre was estimated by randomly sampling 200 non-degenerate yeast cis-regulatory elements from the yeast transcription factor database TRANSFAC. These motifs ranged in length from 4 bp to 26 bp, with a median of 9 bp. For each selected cis-regulatory element, a motif of the same length was randomly sampled from the set of all upstream sequences. As a control, motifs of matching lengths were generated using 0th (P0) and 4th (P4) order Markov models.

As expected, the motifs drawn from P0 have a minimum perplexity when modeled by a 0th order Markov model (red line). Similarly, the perplexities on P4 indicate that motifs generated from a 4th order Markov model are best modeled by a 4th order Markov model (black line). Remarkably, the perplexity for cis-regulatory elements modeled by low order Markov models generated from the upstream regions as a whole is greater than that of the naive MaxEnt model (blue line). This suggests that Pcre and Prand are distinct distributions, even at the genomic level. Furthermore, an attempt to model an unknown motif with a low order Markov model includes an unknown amount of error that will depend on whether the motif is a cis-regulatory element or not – the very question that motif-finding programs seek to answer. The perplexities also indicate that both cis-regulatory elements and random motifs are most accurately modeled by high order Markov models, which are closely approximated by a variable order Markov model. Of particular interest is the fact that a 9th order Markov model drawn from upstream sequences models Pcre and Prand equally well, and that a Variable Length Markov model employing direct lookups is a good approximation to the unwieldy 9th order Markov model.

Abstract

Panel 1

Panel 3

Panel 5

Panel 2

Panel 4

Panel 6