|
Download
Download the two datasets of 3'-end sequences used in our 3'-end
processing sequence analysis:
Surrounding dataset
Upstream dataset
Introduction
After processing, two datasets of sequences were constructed. The
"Upstream" dataset includes 7662 sequences covering the 150 bases
upstream of the putative poly(A+) site. The "Surrounding" dataset
comprises 1693 sequences of 250 bases, spanning positions -150 to
+100 relative to the putative cleavage site. Note that the poly(A+)
site itself is position -1. All sequences were aligned at the
putative poly(A+) site. Only sequences in our non-redundant dataset
were selected for further statistical analysis, to avoid skewing the
statistics. To avoid ambiguity, sequences containing an "N" were
also eliminated.
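The filtering and coordinate convention described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used to build the datasets: the function names and the `(name, sequence)` record layout are hypothetical, and the absence of a position 0 is inferred from the fact that a window from -150 to +100 holds exactly 250 bases. Redundancy removal is not shown because its method is not described here.

```python
def filter_sequences(records, expected_length):
    """Keep full-length sequences that contain no ambiguous base 'N'.

    `records` is an iterable of (name, sequence) pairs (hypothetical layout).
    """
    kept = []
    for name, seq in records:
        seq = seq.upper()
        if len(seq) != expected_length:
            continue  # window is truncated or too long
        if "N" in seq:
            continue  # ambiguous base, dropped as in the text
        kept.append((name, seq))
    return kept


def position_labels(n_upstream, n_downstream):
    """Coordinates of an aligned window, assuming there is no position 0.

    With 150 upstream and 100 downstream bases this gives -150..-1 followed
    by +1..+100, so the poly(A+) site is the last negative position (-1).
    """
    return list(range(-n_upstream, 0)) + list(range(1, n_downstream + 1))
```

For the "Surrounding" dataset one would call `filter_sequences(records, 250)` and `position_labels(150, 100)`; for the "Upstream" dataset, `filter_sequences(records, 150)` and `position_labels(150, 0)`.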
To identify signal sequences, we measured the position distribution
of 6-mer words in the sequences flanking the putative poly(A+) site
in each of the two datasets. A Markov chain model was used to
measure the overrepresentation of words; it provides a reliable
basis for estimating expected word frequencies in large sequence
sets. A chi-square statistic was calculated to screen for words with
a biased position distribution, and parallel analysis of the two
datasets was used to filter out false positives.
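As a rough illustration of the two statistics named above, the sketch below estimates the expected count of a word under a Markov chain and computes a chi-square value for its position distribution. Several choices are assumptions, since they are not specified here: the Markov chain is taken to be first order, positions are grouped into fixed-size bins, and the null hypothesis for the chi-square is a uniform distribution over start positions. All function names are hypothetical.

```python
from collections import Counter


def count_words(seqs, k):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def expected_count_markov1(word, mono, di):
    """Expected count of `word` under a first-order Markov chain (assumed order).

    E(w) = N(w1w2) * N(w2w3) * ... / (N(w2) * N(w3) * ...), where N are the
    observed dinucleotide (`di`) and mononucleotide (`mono`) counts.
    """
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= di[word[i:i + 2]]
    denominator = 1.0
    for base in word[1:-1]:
        denominator *= mono[base]
    return numerator / denominator if denominator else 0.0


def position_counts(seqs, word, bin_size):
    """Occurrences of `word` per position bin, for equal-length aligned sequences.

    Returns (counts per bin, number of start positions per bin).
    """
    k = len(word)
    n_starts = len(seqs[0]) - k + 1
    n_bins = (n_starts + bin_size - 1) // bin_size
    bins = [0] * n_bins
    widths = [0] * n_bins
    for i in range(n_starts):
        widths[i // bin_size] += 1
    for s in seqs:
        for i in range(n_starts):
            if s[i:i + k] == word:
                bins[i // bin_size] += 1
    return bins, widths


def chi_square_uniform(observed, widths):
    """Chi-square statistic against a uniform distribution over start positions."""
    total = sum(observed)
    span = sum(widths)
    chi2 = 0.0
    for o, w in zip(observed, widths):
        e = total * w / span  # expected count is proportional to bin width
        if e > 0:
            chi2 += (o - e) ** 2 / e
    return chi2
```

A word whose observed count clearly exceeds `expected_count_markov1` and whose `chi_square_uniform` value is large in both datasets would be a candidate signal under these assumptions; the thresholds used in the actual analysis are not given here.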
Based on the position distribution profiles and base frequency
statistics, we propose a model for the distribution and features of
the cis-elements involved in mRNA 3'-end processing (Figure 1). The
core components of the model are a T-rich region surrounding the
poly(A+) site, a Near Upstream Element (NUE) and a Far Upstream
Element (FUE). The poly(A+) site itself is a YA (Y: C, T)
dinucleotide, and the T-rich region downstream of it is generally
more conserved than the one upstream. The NUE is an A- and T-rich
region situated 10 to 30 nt upstream of the poly(A+) site and
includes two specific sequences, AATAAA and TATATA. AATAAA is a
typical positional element that determines the poly(A+) site
downstream of it. The FUE lies 50~70 nt upstream of the poly(A+)
site; one kind is ATGTAA-like, with a core consensus motif TGTA, and
the other is T/GT rich. In addition, some mRNAs have a run of
consecutive A residues immediately downstream of the poly(A+) site.
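To make the positional part of the model concrete, the following sketch checks a single upstream sequence for the elements listed above. It is a simplified illustration under stated assumptions, not a validated predictor: the sequence is assumed to end at the poly(A+) site (its last base is position -1), the YA dinucleotide is assumed to occupy positions -2 and -1, each motif must lie entirely inside its window, and the T-rich regions and the T/GT-rich FUE variant are not modelled. The names `window` and `scan_elements` are hypothetical.

```python
def window(seq, start, end):
    """Bases at positions start..end (negative, inclusive); last base is -1."""
    n = len(seq)
    return seq[n + start:n + end + 1]


def scan_elements(seq):
    """Report which model elements are present in one upstream sequence."""
    seq = seq.upper()
    if len(seq) < 70:
        raise ValueError("need at least 70 nt upstream of the poly(A+) site")
    nue = window(seq, -30, -10)  # Near Upstream Element region
    fue = window(seq, -70, -50)  # Far Upstream Element region
    return {
        "ya_site": seq[-2] in "CT" and seq[-1] == "A",
        "nue_aataaa": "AATAAA" in nue,
        "nue_tatata": "TATATA" in nue,
        "fue_tgta": "TGTA" in fue,
    }
```

Applied to each sequence of the "Upstream" dataset, a tally of these flags would give a rough count of how many sequences match each component of the model under the assumptions above.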
 |
| Figure 1. General structure of mRNA 3'-end processing related sequences in rice |
|