|
Transcribed and expressed sequences occupy only roughly 2-5 percent
of a genome, but sequencing them can lead to the discovery of genes
and help mine the key information in the genome. Studying the
spatiotemporal expression profiles of genes involved in specific
phenotypes and crucial biological processes is important for
revealing the functions of genes and gene families, the structure
of gene networks, and pathways (Donson et al., 2002).
Gene expression levels can indicate the type, developmental
stage and response state of the corresponding cell. Expression
patterns can be studied systematically and globally through
expression profiles, which can also provide clues for physiological
research. Recent research showed that the cancerous tendency of
normal cells could be predicted early from gene expression profiles
(Shoemaker et al., 2002), which means that the regulatory network
of gene expression can indicate the direction a cell will take before
symptoms emerge. Understanding precisely how the transcriptional
network regulates life processes can help reveal the principles of
systematic gene expression, the informational characteristics of
development and the theoretical basis of the dynamic balance of
regulatory networks. The common tools for processing genome-scale
expression data are descriptive. The EST (Expressed Sequence
Tag) concept, originated by Adams (Adams et al., 1991), is
a classical technology complementary to nucleotide
arrays and SAGE (Serial Analysis of Gene Expression). Gene
expression profiling based on ESTs has matured in both
theory and practice since the work of Okubo (Okubo et al., 1992)
a decade ago. A sufficiently large collection of ESTs can represent
gene expression in the source tissue or cell (Mekhedov
et al., 2000) and can be used to explore the complex relationship
between gene expression patterns and genome sequence (Iseli
et al., 2002). ESTs have been a cost-efficient and valuable
resource for genome annotation and have played an important role in
functional genomics research (Ewing et al., 1999; Fernandes
et al., 2002; Lee et al., 2002; Okano et al., 2001; Qutob et
al., 2000). With this technology, specific genes can be identified
(Zhan et al., 2000), metabolic pathways can be interpreted
(Ohlrogge et al., 2000) and key genes can be cloned efficiently
(Runnsley et al., 1996). For example, Cahoon (Cahoon et al., 1999)
obtained the gene for fatty-acid conjugase, the
key enzyme of lipid-linked biosynthesis, by analyzing
EST datasets from the oil-producing tissues of Momordica charantia
and Impatiens balsamina. Tissue-specific and biologically specific
ESTs identified from the original EST datasets are an essential
source for high-quality cDNA arrays (Lofftus et al., 1999).
Data mining of EST datasets can also reveal potential
information or rules about pre-mRNA processing mechanisms,
such as signal elements, alternative splicing or alternative
3' end processing sites (Kan et al., 2001).

Gene regulation operates on five levels: the DNA level, the
transcriptional level, the post-transcriptional level, the
translational level and the post-translational level. In eukaryotes,
formation of mature mRNA requires post-transcriptional
modification of the pre-mRNA, essentially 5' capping,
intron splicing and 3' end processing. The
general structure of a mature transcript comprises the 5' Untranslated
Region (5'UTR), the Open Reading Frame (ORF) and the 3' Untranslated
Region (3'UTR). The 3'UTR is transcript-specific (Coulson et al.,
1997; Wu et al., 2000) and plays crucial regulatory roles in
post-transcriptional regulation, intracellular localization and
transport, mRNA stability and translation efficiency
(Mignone et al., 2002). Cis-elements in the 3'UTR and the 3'clip
region regulate 3' end processing by interacting
with the specific trans-elements involved (Pauws
et al., 2001). Research on mammals shows that 3' end processing
consists of two key steps: cleavage at a specific
site, followed by addition of a Poly(A+) tail to the new end.
The core cis-elements have been identified by deletion experiments
and sequence analysis (Barabino et al., 1999). There are three
kinds of elements: the Poly(A+) site, the Position Element (PE) and
the Downstream Element (DSE). The Poly(A+) site, also called the
cleavage site (CS), is a dinucleotide with the conserved composition
YA (Y = C or T) (Chen et al., 1995; Zhao et al., 1999). The PE lies
10-30 nt upstream of the Poly(A+) site and has the conserved
motif AATAAA. The DSE is a T- or GT-rich element downstream
of the Poly(A+) site that plays an important role in stabilizing
the complex of trans-elements.

In plants, the elements involved in 3' end processing are more
dispersed, less conserved and more complicated (Rothnie, 1996).
A simple model of the distribution of the elements is shown in
Fig.1 (Zhao et al., 1999). No highly conserved pattern has been
found for the PE in plant mRNAs; different mRNAs have diverse PEs,
each efficient only for its own transcript (Zheng et al., 2000).
The PE in plants is called the Near Upstream Element (NUE). A poorly
conserved element that is crucial for the efficiency of 3' end
processing lies upstream of the NUE; it is called the Far Upstream
Element (FUE) or Efficiency Element (EE). The composition
of the Poly(A+) site is similar to that in mammals. The structure
and distribution of 3' end processing elements vary widely
among mRNAs and plant species. Even for the same
gene, the elements that function in 3' end processing can differ,
which can cause polymorphism at the 3' end of the mature mRNA.
Some genes have multiple NUEs, e.g. the pea rbcS-E9 gene (Fig.2).
At least four kinds of trans-elements are known to participate
in 3' end processing (Zheng et al., 2000): Cleavage and
Polyadenylation Specificity Factor (CPSF), which recognizes the NUE;
Cleavage stimulation Factor (CstF), which recognizes the FUE;
Cleavage Factors (CFs), which are responsible for cleaving the
pre-mRNA 3' end; and Poly(A+) Polymerase (PAP), which produces the
Poly(A+) tail. A potential model of how these factors assemble
is shown in Fig.3.
Notably, not all mature mRNAs have a Poly(A+) tail,
e.g. histone mRNA.
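
The mammalian element layout described above (an AATAAA position element 10-30 nt upstream of a YA cleavage dinucleotide) can be sketched as a simple sequence scan. This is only an illustration and not the method used in this work: the function name `find_candidate_sites` is hypothetical, and measuring the distance from the start of the hexamer to the A of the YA dinucleotide is an assumption made for the example.

```python
import re

# Consensus taken from the text: hexamer AATAAA located 10-30 nt
# upstream of a YA (Y = C or T) cleavage dinucleotide.
PE_MOTIF = "AATAAA"
# Assumption: the gap is counted from the hexamer start to the A of YA.
MIN_GAP, MAX_GAP = 10, 30


def find_candidate_sites(seq):
    """Return (pe_start, cleavage_pos) pairs, both 0-based.

    cleavage_pos is the index of the A in a YA dinucleotide that lies
    MIN_GAP..MAX_GAP nt downstream of the start of an AATAAA hexamer.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    # The lookahead lets overlapping occurrences of the motif be found.
    for match in re.finditer(f"(?={PE_MOTIF})", seq):
        pe_start = match.start()
        for gap in range(MIN_GAP, MAX_GAP + 1):
            a_pos = pe_start + gap
            if a_pos < len(seq) and seq[a_pos] == "A" and seq[a_pos - 1] in "CT":
                hits.append((pe_start, a_pos))
    return hits


if __name__ == "__main__":
    example = "GGG" + "AATAAA" + "G" * 10 + "CA" + "GGGG"
    print(find_candidate_sites(example))
```

Because plant NUEs are far less conserved than the mammalian hexamer, an exact-match scan of this kind would miss most plant signals; it only makes the mammalian consensus concrete.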
 |
| Fig.1.
Plant cis-elements in mRNA 3' end processing (from Zhao
et al., 1999) |
 |
| Fig.2.
Multiple Poly(A+) sites of the pea rbcS-E9 gene (from http://www.uky.edu/~aghunt00/polya.signal.html). |
 |
| Fig.3.
Plant trans-elements in mRNA 3' end processing, redrawn
from Zheng et al., 2000. CPSF recognizes the NUE directly,
CstF recognizes the FUE, CFs are required for the cleavage
reaction, and PAP is required for generating the Poly(A+) tail. |
Data mining of 3' end processing cis-elements has provided many
clues for further experiments (van Helden et al., 2000; Graber
et al., 1999). In silico experiments based on Poly(A+) EST
datasets can help identify and characterize significant
cis-elements (Pauws et al., 2001). Accumulating data and analyses
of the primary structure of the 3'UTR and 3'clip region can also
support research on secondary structure (Pesole et al., 1999),
promoting understanding of the sequence characteristics of the
3' end region. The algorithms applied in this kind of research
focus on the statistics and analysis of sequence composition.
They identify potential elements by determining
the statistical significance of nucleotide words (referred to
as "words" in the following text), by characterizing the positional
distribution of each word, and by aligning words that are similar in
composition and distribution. Statistical models,
clustering and discriminant models, Markov models and other methods
have been employed in data mining (van Helden et al., 2000). Some
research has also dealt with the relationship between
sequence and other biological characteristics, e.g. the association
between sequence patterns and gene function (Conklin et al.,
2002). Comparison of mature mRNAs from the same gene expressed
in distinct tissues indicated that the distribution of Poly(A+)
sites is tissue-specific to some extent (Beaudoing et al., 2001).
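
The word-based statistics mentioned above (word significance and positional distribution) can be illustrated with a minimal sketch. It is not the algorithm of any cited study: the function names are hypothetical, and the background model is assumed to be the simplest possible one, in which the expected frequency of a word is the product of its single-base frequencies.

```python
from collections import Counter
from math import prod


def word_counts(seqs, k):
    """Count all overlapping words of length k across the sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def observed_over_expected(seqs, k):
    """Ratio of each word's observed count to its expected count.

    Assumption: expectation comes from independent base frequencies
    (a zero-order background), not from a higher-order Markov model.
    """
    base = Counter("".join(seqs))
    total_bases = sum(base.values())
    freq = {b: n / total_bases for b, n in base.items()}
    counts = word_counts(seqs, k)
    n_windows = sum(counts.values())
    return {
        w: c / (n_windows * prod(freq[b] for b in w))
        for w, c in counts.items()
    }


def position_profile(seqs, word):
    """Histogram of word start positions, measured from each 3' end."""
    profile = Counter()
    for s in seqs:
        start = s.find(word)
        while start != -1:
            profile[len(s) - start] += 1
            start = s.find(word, start + 1)
    return profile
```

A word with a ratio well above 1 and a sharply peaked position profile would be a candidate element. A real analysis would also need a significance test and a richer background model, for example the Markov models mentioned above.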
Rice (Oryza sativa) is one of the most important cereal crops
in the world. It has become a model plant because of its economic
value, small genome size (430 Mb), high gene density and syntenic
relationships with other cereals (Serageldin, 2002). Now that
draft genome sequences for Japonica and Indica rice have been
released in public databases, the next arduous and important
task is functional genomics research, that is, revealing the gene
regulation and interaction network on the basis of precise
annotation of the 30,000-50,000 genes (Yu et al., 2002; Goff
et al., 2002). As an important part of functional genomics,
large-scale EST analysis has been under way for rice. An early
large-scale sequencing and analysis project of the rice genome
generated an enormous collection of ESTs (Sasaki et al., 1994;
Sasaki et al., 1996; Yamamoto et al., 1997). A total of 202,290 rice
ESTs had been released in dbEST at NCBI (as of 2003.5.2, http://www.ncbi.nlm.nih.gov/dbEST_summary.html),
and the number is growing ever more rapidly. However, the data
remain limited in three respects. In terms of size, the number of
ESTs is low relative to the progress of the genome sequencing
projects and to other plants that lack a genome sequence. In terms
of sources, large-scale EST libraries generated from specific
biological processes or tissues, e.g. from the interaction between
plant and microbial pathogen, are lacking. In terms of
characteristics, most ESTs are from the 5' end. The lack of
transcript-specific 3' ESTs hinders further EST-based research,
including not only expression profiling, cDNA arrays and genome
sequence analysis, but also research on elements related to
3' end processing.

For these reasons, as part of our rice gene discovery
plan, we generated 25,160 high-quality ESTs by large-scale
3' sequencing of three cDNA libraries. Analysis was performed
on the three EST datasets, derived respectively from leaf induced by
Magnaporthe grisea, stem at the 3- to 5-leaf stage and endosperm
10-15 days after anthesis. Each library reflects an important
tissue expression pattern under specific conditions. Comparative
analysis of the expression patterns among the three EST datasets
was performed to further investigate similarities and differences
among the three libraries. A systematic and detailed
analysis of 3' end processing elements in rice has not yet been
performed, and apart from general analyses of expression
patterns, there has been little research on the general structure
of the rice 3'UTR. Hence, we investigated sequence
features in the 3'UTR and 3'clip region using our non-redundant
3' EST database and the published rice genome sequence.
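
The idea of comparing expression patterns across EST libraries can be made concrete with a small sketch. It is a hypothetical illustration, not the procedure used in this work. It assumes each EST has already been assigned to a gene or cluster ID, and it normalizes counts to ESTs per 10,000 so that libraries of different sizes are comparable. The function names and the example IDs are invented for the example.

```python
from collections import Counter


def expression_profile(library_ests, per=10_000):
    """Map library name -> {cluster ID: ESTs per `per` sequenced ESTs}.

    library_ests maps a library name to a list of cluster IDs,
    one entry per EST (assumed to come from a prior clustering step).
    """
    profile = {}
    for lib, clusters in library_ests.items():
        total = len(clusters)
        if total == 0:
            profile[lib] = {}
            continue
        counts = Counter(clusters)
        profile[lib] = {c: n * per / total for c, n in counts.items()}
    return profile


def library_specific(profile, lib):
    """Cluster IDs observed in `lib` but in no other library."""
    others = set()
    for name, clusters in profile.items():
        if name != lib:
            others.update(clusters)
    return sorted(set(profile[lib]) - others)


if __name__ == "__main__":
    toy = {
        "leaf": ["g1", "g1", "g2", "g3"],
        "stem": ["g1", "g4"],
        "endosperm": ["g5", "g5"],
    }
    prof = expression_profile(toy)
    print(prof["leaf"]["g1"], library_specific(prof, "leaf"))
```

With real libraries, absence of a cluster from one library may only reflect limited sampling depth, so library-specific calls of this kind are best treated as candidates rather than conclusions.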
|