We planted the same highly degenerate motif at several levels of over-representation and compared the sum of the Sig scores of its top n non-degenerate instantiations with the Sig score of the planted degenerate motif. One run of this experiment is shown in Figure 7. The planted motif, WWWWGCGCG, has 16 possible non-degenerate instantiations. When the Sig scores of the top non-degenerate motifs are summed and compared against the Sig score of the planted motif, we observe the surprising result that the Sig scores of even the top two motifs are a strong predictor of the score of the planted motif. In general, for any value of n, the sum of the scores of the top n non-degenerate motifs correlates very strongly with the degree of significance of the planted motif (R² values were greater than 0.96 for all values of n ranging from 2 to 5). The experiments were repeated with other planted motifs, giving identical results.
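The count of 16 instantiations and the "sum of the top n scores" quantity can be illustrated with a minimal Python sketch. The function names `instantiations` and `top_n_sum` are hypothetical helpers introduced here for illustration, not part of PRISM, and the Sig scores themselves are assumed to be computed elsewhere and passed in as a dictionary.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def instantiations(motif):
    """Return every non-degenerate sequence matched by an IUPAC motif."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in motif))]


def top_n_sum(scores, n):
    """Sum the n largest values in a {motif: score} dictionary.

    The scores are assumed to be Sig values computed by some other routine;
    this helper only performs the ranking and the summation.
    """
    return sum(sorted(scores.values(), reverse=True)[:n])


print(len(instantiations("WWWWGCGCG")))  # 16, since each W is either A or T
```

Given Sig scores for the 16 instantiations at each planting level, `top_n_sum(scores, n)` for n from 2 to 5 would yield the quantity that is compared against the Sig score of the planted motif.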
When a highly degenerate motif such as WWWW was planted at a high degree of over-representation in a random group of upstream sequences, a number of its non-degenerate instantiations (such as TTAT, TTAA and TAAT) were also over-represented relative to the background. Each column in Figure 8 shows motifs containing a different number of degenerate bases. The numerator of each fraction in parentheses is the rank of that motif relative to all other instantiations that arose from the motif WWWW, while the denominator is the total number of possible motifs at that level of degeneracy (e.g. there are 16 non-degenerate instantiations and 32 singly degenerate instantiations). Combining a small subset of the most over-represented non-degenerate instantiations of WWWW is sufficient to reconstruct the motif. For instance, combining the top two non-degenerate motifs (TTAT and TTAA) yields the most over-represented motif with one degenerate base (TTAW). As can be seen in Figure 8, this relationship is repeated at every level of degeneracy: over-represented motifs of any given degeneracy combine to produce over-represented motifs at higher levels of degeneracy. This general result follows naturally from the earlier result that the score of a degenerate motif can be predicted as a linear combination of the scores of its non-degenerate instantiations.
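The combination step and the denominators quoted above can be sketched as follows. This is an illustrative reading of "combining" as a column-wise union of bases, which is an assumption and not necessarily how PRISM implements it; `combine` and `count_at_level` are hypothetical names.

```python
from math import comb

# Standard IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def combine(motifs):
    """Merge equal-length motifs by taking the union of bases in each column."""
    if len({len(m) for m in motifs}) != 1:
        raise ValueError("motifs must be non-empty and of equal length")
    merged = []
    for column in zip(*motifs):
        bases = frozenset().union(*(IUPAC[c] for c in column))
        merged.append(CODE_FOR[bases])
    return "".join(merged)


def count_at_level(width, k, options=2):
    """Count motifs with exactly k degenerate positions out of `width`.

    Assumes every position of the parent motif has the same number of
    possible bases (`options`, which is 2 for W).
    """
    return comb(width, k) * options ** (width - k)


print(combine(["TTAT", "TTAA"]))                    # TTAW
print([count_at_level(4, k) for k in range(5)])     # [16, 32, 24, 8, 1]
```

The first two counts match the 16 non-degenerate and 32 singly degenerate instantiations of WWWW mentioned in the text; the remaining counts follow from the same formula under the stated assumption.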
These results are consistent with the assumptions that over-represented degenerate motifs contain over-represented non-degenerate instantiations and that a single instantiation is enough to derive the degenerate motif.
Table 2 shows the best motifs found by the PRISM algorithm on the planted datasets. The columns show the Sig value and the sequence of the planted motif, compared to the Sig value and the sequence of the best motif identified by PRISM. In general, PRISM performs well on these datasets: the motif it identifies matches the planted motif exactly in 9 cases out of 15 and substantially overlaps the planted motif in 4 more cases. When the planted motif has a high Sig value, the motif identified by PRISM is identical to the planted motif. When the planted motif has a low Sig value, random variations in nucleotide frequencies are likely to give rise to motifs that are more highly over-represented than the planted motif. An example of this is seen when the motif BBHADBND is planted at a Sig value of 5. PRISM returns the motif DNDNAR, which only loosely resembles the planted motif but has a Sig value of 16, considerably higher than that of the planted motif.