We were interested in comparing the performance of the pattern-based programs to that of the profile-based programs. To this end, we averaged the scores of BEAM-PRISM and YMF and compared them against the average scores of MEME, AlignACE and SuperPosition. In order to visualize trends more clearly, we smoothed the results by averaging over a sliding window of three regulons (Figure 10). Interestingly, when the results of pattern-based and profile-based methods are averaged and plotted against the degeneracy of the reported motifs, pattern-based methods consistently outperform profile-based methods. When the cis-regulatory elements have an average degeneracy of 0.63 (corresponding to a motif such as BBHADND), pattern-based methods are close to twice as accurate as profile-based methods (for reference, HSWTAWS and CRTGTWWWW have degeneracies of 0.80 and 0.56 respectively). As the degeneracy of the cis-regulatory elements increases, the performance gap diminishes as the Phi scores of both sets of programs decrease to zero. This suggests that the limited search space of pattern-based methods is a significant advantage in finding highly degenerate motifs. We also ran PRISM on all 33 SCPD regulons and compared pattern-driven programs and profile-driven programs across a full range of degeneracies. The results show that the pattern-based algorithms outperformed the profile-based algorithms at all levels of degeneracy for S. Cerevisiae cis-regulatory elements.
The identification of potential protein binding sites (cis-regulatory elements) in the upstream regions of genes is key to understanding the mechanisms that regulate gene expression. To this end, we present a simple, efficient algorithm, BEAM, aimed at the discovery of cis-regulatory elements in the DNA sequences upstream of a related group of genes. This algorithm dramatically limits the search space of expanded sequences, converting the problem from one that is exponential in the length of motifs sought to one that is linear. Unlike sampling algorithms, BEAM converges and is capable of finding statistically over-represented motifs with a low failure rate. Further, our algorithm is not dependent on the objective function or the organism used. Limiting the space of candidate motifs enables the algorithm to focus only on those motifs that are most likely to be biologically relevant and enables the algorithm to use direct evaluations of background frequencies instead of resorting to probabilistic estimates. In addition, limiting the space of candidate motifs makes it possible to use computationally expensive objective functions that are able to correctly identify biologically relevant motifs.
In theory, the identification of statistically over-represented sequences in the upstream regions of coregulated genes should serve to identify potential cis-regulatory elements. However, in practice, many cis-regulatory elements are highly degenerate, with many different non-degenerate sequences (instantiations) serving as functional binding sites for the same transcription factor. This poses problems for both pattern-based and profile-based algorithms owing to the size of the search space. Pattern-based algorithms are unable to exhaustively enumerate the entire search space, while profile-based algorithms that use sampling techniques tend to get stuck in local maxima. We present a novel pattern-driven algorithm, PRISM, that identifies highly degenerate cis-regulatory elements. PRISM employs a pruned search, based on our finding that the statistical over-representation of a highly degenerate cis-regulatory element is a linear combination of the statistical over-representation of its non-degenerate instantiations. PRISM takes non-degenerate motifs as its input, uses them to construct a consensus representing a set of closely related sequences, and generates a Position Weight Matrix from the consensus. PRISM is extremely robust to noise, and identifies cis-regulatory elements of arbitrary length and degeneracy in seconds. In comparison tests, BEAM-PRISM outperformed 7 other motif-finding algorithms on 28 S. Cerevisiae regulons.
Pevzner, P.A. and Sze, S.H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8:269-78.
Shinozaki, D, Akutsu, T., Maruyama, O. (2003). Finding optimal degenerate patterns in DNA sequences. Bioinformatics 19 Suppl 2:II206-II214.
Sinha, S. and Tompa, M. (2000). A statistical method for finding transcription factor binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8:344-54.
Sinha, S. and Tompa, M. (2003) Performance comparison of algorithms for finding transcription factor binding sites. In Third IEEE Symposium on Bioinformatics and Bioengineering, IEEE Press, Los Alamitos, pp. 214-220.