Our large-scale
EST project sequenced the 3' ends of mRNA transcripts. The ESTs were
derived from specific biological processes and tissues for which
the sample size in the public databases is not yet large enough.
Bioinformatics analysis, including systematic annotation and data
organization, was performed on three large EST datasets generated
from three representative cDNA libraries,
with a total of 25,160 high-quality 3' ESTs.
Source of cDNA library
1. Leaf cDNA library: constructed from rice leaf induced
by Magnaporthe grisea, by Chaozu He (Chaozu He et al., 1999).
2. Stem cDNA library: constructed from rice stem at the 3-
to 5-leaf stage, by GIBCO Company.
3. Endosperm cDNA library: constructed from the endosperm of immature
rice seed, about 10-15 days after anthesis, by Hong Mengmin
(Member of CAS; Institute of Plant Physiology & Ecology,
CAS, P.R. China).
3' End Sequencing
Recombinant colonies were randomly picked from each cDNA library.
Each clone was assigned a unique, library-specific name for
identification, consistent with its 96-well plate number and
position; e.g., clone S001A07 was found at position A07 in plate
001 from the stem cDNA library, H020F06 was from the endosperm
cDNA library, and clones whose names begin with any other letter
in [A-Z] (e.g., A005C10) were from the leaf cDNA library. All
selected clones were sequenced in a single pass from the 3' end only,
using automated sequencing reactions on a MegaBACE 1000 automated
sequencer.
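The naming scheme above can be decoded mechanically. The following is a minimal sketch, not the project's own tooling: the function name `parse_clone_name` is hypothetical, the prefixes S (stem) and H (endosperm) come from the examples in the text, and mapping every other leading letter to the leaf library is an assumption.

```python
import re

# One library letter, a 3-digit plate number, then a 96-well position
# (rows A-H, columns 01-12).
CLONE_NAME = re.compile(r"^([A-Z])(\d{3})([A-H])(0[1-9]|1[0-2])$")

# Prefixes taken from the examples in the text. Treating any other
# leading letter as "leaf" is an assumption.
KNOWN_PREFIXES = {"S": "stem", "H": "endosperm"}


def parse_clone_name(name):
    """Split a clone name such as 'S001A07' into library, plate and well."""
    match = CLONE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized clone name: {name!r}")
    prefix, plate, row, column = match.groups()
    library = KNOWN_PREFIXES.get(prefix, "leaf")
    return {"library": library, "plate": plate, "well": row + column}


print(parse_clone_name("S001A07"))
print(parse_clone_name("H020F06"))
```

Names that do not fit the pattern raise `ValueError` rather than being silently assigned to a library.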
Sequence processing
Raw data from the sequencer were processed manually and carefully
to obtain high-quality ESTs. First, sequences
were exported after poor-quality sequences had been removed with
reference to the original trace data, using Chromas 1.45 software.
Second, ESTs shorter than 160 bp and those containing more than 3%
Ns were considered not useful for further data analysis and were
excluded. Finally, all ESTs with at least 16 adenines among the
final 20 bases were assumed to contain poly(A+) tails and
were selected for further analysis after vector trimming, because
the 3' UTR can represent unique gene expression. The ESTs without
poly(A+) tails were organized as a separate dataset but were not included
in further analysis. Sequences originating from M. grisea were removed
by comparison with the M. grisea genome sequence. To evaluate the quality
of sequencing, length-frequency distribution analysis was performed
for each library.
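The length, N-content and poly(A+) criteria above can be sketched as a small filter. This is an illustration under stated assumptions, not the procedure actually used (which was manual): the function names are hypothetical, a read is discarded if it fails either the length or the N-content criterion, and each read is assumed to be oriented so that the poly(A+) tail sits at the end of the string. Trace inspection, vector trimming and removal of M. grisea sequences are not modeled.

```python
def n_fraction(seq):
    """Fraction of ambiguous bases (N) in a sequence; 1.0 for an empty read."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def has_poly_a_tail(seq, window=20, min_a=16):
    """True if at least `min_a` of the final `window` bases are adenines."""
    tail = seq.upper()[-window:]
    return len(tail) == window and tail.count("A") >= min_a


def classify_est(seq, min_len=160, max_n=0.03):
    """Return 'discard', 'poly_a' or 'non_poly_a' for one processed read.

    Assumption: failing either the length or the N-content test is
    enough to discard a read.
    """
    if len(seq) < min_len or n_fraction(seq) > max_n:
        return "discard"
    return "poly_a" if has_poly_a_tail(seq) else "non_poly_a"


# Hypothetical reads for illustration only.
reads = {
    "with_tail": "ACGT" * 50 + "A" * 20,
    "no_tail": "ACGT" * 50,
    "too_short": "ACGT" * 10,
}
for name, seq in reads.items():
    print(name, classify_est(seq))
```

Reads classified as `non_poly_a` correspond to the separate dataset that was kept but excluded from further analysis.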
ESTs from the same library were taken as one dataset. To provide
basic expression pattern analysis and to deduce tissue- or biological
process-specific redundancy, ESTs showing significant homology were
partitioned manually into non-redundant sets of clusters with DNAtools
software. This EST assembly was performed first on each
dataset. Within each dataset, each cluster was assumed to represent
a tentative unique transcript (TUT).
For further analysis of a total non-redundant dataset, we performed
additional assembly, based on the cluster results from each dataset,
checking for redundancy and detecting overlap between pairs of
datasets and among all three datasets.
The most abundant TUTs from each library were investigated,
on the view that the distribution of redundancy
in each library can show tissue and biological process specificity.
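Once ESTs have been assigned to clusters, the redundancy figures follow from simple counting. The sketch below assumes a mapping from EST name to cluster identifier (both hypothetical here) and treats a cluster with one EST as a singleton and a cluster with two or more ESTs as a contig; it does not perform the clustering itself, which was done manually with DNAtools.

```python
from collections import Counter


def summarize_clusters(est_to_cluster):
    """Count singletons, contigs and TUTs for one dataset.

    `est_to_cluster` maps each EST name to its cluster identifier.
    Every cluster is counted as one tentative unique transcript (TUT).
    """
    if not est_to_cluster:
        raise ValueError("dataset contains no ESTs")
    sizes = Counter(est_to_cluster.values())
    singletons = sum(1 for n in sizes.values() if n == 1)
    total = len(est_to_cluster)
    return {
        "ests": total,
        "singletons": singletons,
        "contigs": len(sizes) - singletons,
        "tuts": len(sizes),
        "tut_percent": round(100 * len(sizes) / total, 1),
        "most_abundant": sizes.most_common(3),
    }


# Hypothetical assignments for illustration only.
example = {"e1": "c1", "e2": "c1", "e3": "c1", "e4": "c2", "e5": "c3", "e6": "c3"}
print(summarize_clusters(example))
```

The `most_abundant` entry lists the clusters with the most member ESTs, which is the quantity of interest when looking for tissue- or process-specific redundancy.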
Annotation
All TUTs were searched for homology, annotation and chromosomal
location in GenBank with the BLASTn 2.0.5 program at NCBI
(http://www.ncbi.nlm.nih.gov), using both manual and automated methods
and querying the nr sequence database for Oryza sativa. The BLASTn
score (bits) was used to classify the alignments into
strong, medium and weak homology. Alignments with an E value below
10e-12 and an HSP (High Scoring Pair, extracted from the BLAST report)
score higher than 100 were considered significant. TUTs with known
chromosomal location information were also organized. All the
results were organized in a data sheet (Excel, Microsoft). TUTs
with known gene matches were categorized into different functional
groups, mainly according to the categories described by MIPS (Munich
Information Center for Protein Sequences). The "Manually assigned
functional catalog" table from MIPS is available at
(http://mips.gsf.de/cgi-bin/proj/thal/filter_funcat.pl?all).
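The significance criterion for BLAST hits can be expressed as a short filter. This is a sketch under assumptions: the E-value cutoff is read as 10e-12 (1e-11 as a Python float), the field names `e_value` and `hsp_score` are hypothetical, and the hits are made-up records rather than parsed BLAST reports. The score boundaries for strong, medium and weak homology are not given in the text, so that classification is not attempted here.

```python
E_VALUE_CUTOFF = 10e-12  # assumed reading of the cutoff; equals 1e-11
MIN_HSP_SCORE = 100


def is_significant(e_value, hsp_score):
    """A hit is significant if it passes both the E-value and HSP cutoffs."""
    return e_value < E_VALUE_CUTOFF and hsp_score > MIN_HSP_SCORE


def significant_hits(hits):
    """Keep only hits that pass both cutoffs."""
    return [h for h in hits if is_significant(h["e_value"], h["hsp_score"])]


# Hypothetical hits for illustration only.
hits = [
    {"tut": "tut_a", "e_value": 1e-40, "hsp_score": 250.0},
    {"tut": "tut_b", "e_value": 1e-5, "hsp_score": 180.0},
    {"tut": "tut_c", "e_value": 1e-20, "hsp_score": 80.0},
]
print([h["tut"] for h in significant_hits(hits)])
```

Requiring both conditions at once means that a very low E value does not rescue a short alignment with a low HSP score, and the reverse.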
With comprehensive characterization of the expression patterns
of the three datasets, an abundant resource is provided for
further research, such as gene network exploration aided by
microarrays. The leaf dataset was generated from the biological
interaction between the plant and a pathogenic microbe. This
dataset provides many stress-response ESTs and protection-related
genes. The stem dataset provides clues for research on genes
related to the formation of solidly built stems. The composition
of the endosperm dataset points to some genes related to the
quality of rice grain.
(1) Endosperm dataset: Moderately and highly expressed genes
in endosperm were mainly involved in encoding storage proteins,
starch synthesis, photorespiration and related biochemical
processes. 35% of the moderately or highly expressed genes were
not found in current databases and may be specifically associated
with the physiological function of the endosperm.
| Basic statistics of the Endosperm dataset |
| Clones | 10245 | Total bases (Mb) | 4.35 |
| Poly(A+) EST | 9369 | Poly(A+) EST percent (%) | 90.6 |
| Longest (nt) | 1060 | Average (nt) | 462 |
| Singleton | 2092 | Contig | 912 |
| TUTs | 3004 | TUTs percent (%) | 32.1 |
| A: identified TUTs | 576 | Percent of A (%) | 19.2 |
| B: mapped TUTs | 921 | Percent of B (%) | 30.7 |
| A∩B | 182 | TUTs found only in this project | 653 |
Navigation:
Browse all 3004 TUTs and related information for the endosperm dataset
here.
(2) Leaf dataset: Moderately and highly expressed genes in leaf
(20.1%) accounted for 58.5% of total ESTs, of which 60% were unknown
genes. Two categories of genes were detected with high redundancy:
leaf tissue-specific genes involved in photosynthesis,
and stress resistance genes induced by the pathogen threat. The
protection-related genes comprised 66 kinds of known genes covering
171 TUTs and 993 ESTs.
| Basic statistics of the Leaf dataset |
| Clones | 15396 | Total bases (Mb) | 6.2 |
| Poly(A+) EST | 13316 | Poly(A+) EST percent (%) | 86.5 |
| Longest (nt) | 830 | Average (nt) | 488 |
| Singleton | 3483 | Contig | 2150 |
| TUTs | 5633 | TUTs percent (%) | 42.3 |
| A: identified TUTs | 642 | Percent of A (%) | 11.4 |
| B: mapped TUTs | 1560 | Percent of B (%) | 27.7 |
| A∩B | 197 | TUTs found only in this project | 916 |
Navigation:
Browse all 5633 TUTs and related information for the leaf dataset
here.
(3) Stem dataset: Two groups of genes were prominent in stem:
signal transduction-related genes, covering 8.3% of the identified
ESTs, and plant-specific transcription factors, e.g. ethylene
responsive element binding protein (EREBP) and others.
| Basic statistics of the Stem dataset |
| Clones | 2683 | Total bases (Mb) | 1.23 |
| Poly(A+) EST | 2485 | Poly(A+) EST percent (%) | 92.6 |
| Longest (nt) | 1040 | Average (nt) | 522 |
| Singleton | 1591 | Contig | 312 |
| TUTs | 1903 | TUTs percent (%) | 76.6 |
| A: identified TUTs | 353 | Percent of A (%) | 18.5 |
| B: mapped TUTs | 593 | Percent of B (%) | 31.2 |
| A∩B | 109 | TUTs found only in this project | 259 |
Navigation:
Browse all 1903 TUTs and related information for the stem dataset
here.
Functional Classification
A comparison of the TUTs falling into each functional
class for each library, according to known or putative biological
function, is shown in the following figure.
Figure: Distribution of functional classification.
From inner to outer, the circles represent the Leaf library,
Stem library and Endosperm library. The codes are as follows.
BIO: Cellular biogenesis and cellular
organization; COM: Cellular communication/signal transduction;
DEV: Development; ENG: Energy; GRO: Cell growth, cell
division and DNA synthesis; MET: Metabolism; MIS: Miscellaneous:
unclassified or unknown gene; PLD: Transposable elements,
viral and plasmid proteins; PRO: Protein synthesis and
destination; RES: Cell rescue, defense, cell death and
aging; SCR: Transcription; TRA: Transport facilitation
and mechanism. Notably, a high percentage
of TUTs in the endosperm dataset is assigned to the functional
class "Protein synthesis and destination", which is
consistent with the physiological function of rice endosperm.
Detection of Dataset Overlaps and TUTs Found Only in Our Project
To detect overlaps of the TUTs between pairs and among all three
datasets, we performed assembly manipulations on all TUTs by
clustering according to representative ESTs. To find the TUTs
that occur only in our general dataset, we compared our TUTs against
all rice ESTs in GenBank using BLAST. Within the general dataset of 25170
poly(A+) ESTs and 9321 TUTs, 1787 TUTs were identified as found
only in our database. An indication of the overlaps
of unique transcripts from each EST dataset at the sequence
alignment level is shown in the following table and figure. Among
the 143 TUTs common to all three datasets, 43 TUTs were assigned
a function and grouped by the functional categories mentioned
in the "Functional Classification" section. Among these 43
TUTs, we found relatively high percentages in groups ENG (13.95%),
RES (16.28%), MET (18.60%), and PRO (18.60%). These results
are consistent with the physiological activity of rice, although
most TUTs were not assigned a function annotation.
The genes in groups ENG, MET, and PRO are necessary in many
different tissues, whereas genes in group RES prepare the system
for defense. Among 96 TUTs shared by the three datasets,
redundancy differed from dataset to dataset. These differences
are consistent with the specific tissue or biological process
of the cDNA library used to generate the corresponding EST dataset.
For example, TUTs S058A08, H123C11 and A005C10, from the stem,
endosperm and leaf datasets respectively, were aligned to one
cluster annotated as "O. sativa hsp82 gene
for heat shock protein 82", with corresponding percentages in each
dataset of 0.04%, 0.05% and 0.22%. The protein encoded by this
gene is produced by stressed plants and performs important roles
in plant survival. The comparatively high percentage in the leaf
dataset is consistent with that dataset having been generated from leaf
tissue undergoing a biological defense process induced by M. grisea.
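Overlap categories of this kind can be illustrated with set operations. The sketch assumes that, after the cross-dataset assembly, each TUT carries a merged cluster identifier and that each dataset is represented by the set of identifiers it contains; the function name and the identifiers are hypothetical. Pairwise categories exclude the TUTs shared by all three datasets.

```python
def partition_overlaps(leaf, stem, endosperm):
    """Split shared cluster identifiers into pairwise-only and three-way sets."""
    all_three = leaf & stem & endosperm
    return {
        "L,S": (leaf & stem) - all_three,
        "S,H": (stem & endosperm) - all_three,
        "L,H": (leaf & endosperm) - all_three,
        "L,S,H": all_three,
    }


# Hypothetical merged cluster identifiers for illustration only.
leaf = {"g1", "g2", "g3", "g4"}
stem = {"g2", "g3", "g5"}
endosperm = {"g3", "g4", "g6"}

for category, members in partition_overlaps(leaf, stem, endosperm).items():
    print(category, sorted(members))
```

Here `g3` is shared by all three toy datasets, so it appears only under `L,S,H` and is removed from each pairwise category.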
| Leaf |
| | TUT | Function | Only TUT | EST |
| L,S overlap * | 359 | 38 | 17 | 1345 |
| S,H overlap * | \ | \ | \ | \ |
| L,H overlap * | 448 | 40 | 30 | 1269 |
| L,S,H overlap | 143 | 43 | 1 | 875 |
| Respective | 5633 | 642 | 916 | 13316 |
| Stem |
| | TUT | Function | Only TUT | EST |
| L,S overlap * | 359 | 38 | 17 | 455 |
| S,H overlap * | 126 | 30 | 6 | 173 |
| L,H overlap * | \ | \ | \ | \ |
| L,S,H overlap | 143 | 43 | 1 | 237 |
| Respective | 1903 | 353 | 259 | 2485 |
| Endosperm |
| | TUT | Function | Only TUT | EST |
| L,S overlap * | \ | \ | \ | \ |
| S,H overlap * | 126 | 30 | 6 | 352 |
| L,H overlap * | 448 | 40 | 30 | 820 |
| L,S,H overlap | 143 | 43 | 1 | 461 |
| Respective | 3004 | 576 | 653 | 9369 |
*The statistics do not include the TUTs shared by all three datasets.
H represents the Endosperm dataset; L represents the Leaf dataset; S
represents the Stem dataset; "only TUT" represents TUTs found only
in our d
Figure: Venn diagram showing overlaps among the three datasets. The number
of shared TUTs between datasets is shown where each Venn
circle intersects with another. "Known" means TUTs with
functional annotation.
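The overlap counts can be checked against the total of 9321 non-redundant TUTs with an inclusion-exclusion calculation. In the sketch below the numbers are taken from the overlap table, where pairwise counts exclude TUTs shared by all three datasets; the variable and function names are illustrative. A TUT shared by exactly two datasets is counted twice in the per-dataset sum and is subtracted once, and a TUT shared by all three is counted three times and is subtracted twice.

```python
# Values from the overlap table (L = leaf, S = stem, H = endosperm).
per_dataset = {"L": 5633, "S": 1903, "H": 3004}
pairwise_only = {"L,S": 359, "S,H": 126, "L,H": 448}  # excludes three-way TUTs
all_three = 143


def non_redundant_total(per_dataset, pairwise_only, all_three):
    """Size of the union of three sets from disjoint overlap counts."""
    return (
        sum(per_dataset.values())
        - sum(pairwise_only.values())
        - 2 * all_three
    )


print(non_redundant_total(per_dataset, pairwise_only, all_three))  # 9321
```

The result agrees with the 9321 TUTs reported for the general dataset, which suggests the table and the text use the same overlap definitions.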
Download data and related information: Overlap of TUTs in
three datasets
If you are interested in browsing or downloading other data
and related information for the Poly(A+) Dataset, the Non Poly(A+)
Dataset, or our data processed by NCBI UniGene, please go
to the "Navigation" and "Download" pages.