EST database
Our large-scale EST project sequenced the 3' ends of mRNA transcripts. ESTs were collected from specific biological processes and tissues for which the sample size in public databases is not large enough. Bioinformatics analysis, including systematic annotation and data organization, was performed on three large EST datasets generated from three representative cDNA libraries, with a total of 25,170 high-quality 3' ESTs.
Source of cDNA library
1. Leaf cDNA library: constructed from leaves of rice induced by Magnaporthe grisea, by Chaozu He (Chaozu He et al., 1999).
2. Stem cDNA library: constructed from stems of rice at the 3- to 5-leaf stage, by GIBCO.
3. Endosperm cDNA library: constructed from the endosperm of immature rice seeds, about 10-15 days after anthesis, by Hong Mengmin (Member of CAS; Institute of Plant Physiology & Ecology, CAS, P.R. China).
3'end Sequencing
Recombinant colonies were randomly picked from each cDNA library. Each clone was assigned a unique, library-specific name that encodes its 96-well plate number and position; e.g., clone S001A07 is at position A07 in plate 001 of the stem cDNA library, and H020F06 is from the endosperm cDNA library. Clones whose names begin with any other letter in [A-Z] are from the leaf cDNA library. All selected clones were sequenced in a single pass from the 3' end only, using automated sequencing reactions on a MegaBACE 1000 automated sequencer.
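The naming scheme can be illustrated with a small parser. This is only a sketch: the function and type names are hypothetical, the three-digit plate number is inferred from the example names on this page, and the rule that any prefix other than S or H means "leaf" is an assumption based on the description above.

```python
import re
from typing import NamedTuple

# Prefixes stated in the text. Any other capital letter is ASSUMED to mark
# a clone from the leaf library.
LIBRARY_BY_PREFIX = {"S": "stem", "H": "endosperm"}

# One library letter, a three-digit plate number (as in the examples),
# then a 96-well position: row A-H followed by a two-digit column.
CLONE_NAME = re.compile(r"^([A-Z])(\d{3})([A-H])(\d{2})$")


class Clone(NamedTuple):
    library: str
    plate: int
    row: str
    column: int


def parse_clone_name(name: str) -> Clone:
    """Split a clone name such as 'S001A07' into library, plate and well."""
    match = CLONE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized clone name: {name!r}")
    prefix, plate, row, column = match.groups()
    column_number = int(column)
    if not 1 <= column_number <= 12:  # a 96-well plate has 12 columns
        raise ValueError(f"column out of range in clone name: {name!r}")
    library = LIBRARY_BY_PREFIX.get(prefix, "leaf")
    return Clone(library, int(plate), row, column_number)
```

For example, `parse_clone_name("S001A07")` yields the stem library, plate 1, row A, column 7.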
Sequence processing
Raw data from the sequencer were carefully processed by hand to obtain high-quality ESTs. First, poor-quality sequences were removed with reference to the original trace data and the remaining sequences were exported, using Chromas 1.45 software. Second, ESTs shorter than 160 bp or containing more than 3% Ns were considered unsuitable for further analysis and excluded. Finally, after vector trimming, all ESTs with at least 16 adenines among the final 20 bases were assumed to contain poly(A+) tails and were selected for further analysis, because the 3' UTR can represent the expression of a unique gene. ESTs without poly(A+) tails were organized as a separate dataset and were not analyzed further. Sequences originating from M. grisea were removed by comparison with the M. grisea genome sequence. To evaluate sequencing quality, a length-frequency distribution analysis was performed for each library.
ESTs from the same library were treated as one dataset. To provide a basic expression pattern analysis and to deduce tissue- or biological process-specific redundancy, ESTs showing significant homology were manually partitioned into non-redundant clusters with DNAtools software. This EST assembly was performed first on each dataset separately. Within each dataset, each cluster is assumed to represent a tentative unique transcript (TUT).
For further analysis of the total non-redundant dataset, we performed assembly on the cluster results from each dataset, checking for redundancy and detecting overlaps between pairs of datasets and among all three.
The most abundant TUTs from each library were investigated, because the distribution of redundancy in each library can reveal tissue and biological process specificity.
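The length, N-content and poly(A+) criteria above can be sketched as follows. This is an illustration, not the project's actual pipeline (which was manual): the function names are hypothetical, the length and N-content rules are read as independent exclusion criteria, and trace-based quality removal, vector trimming and removal of M. grisea sequences are assumed to have been done already.

```python
MIN_LENGTH = 160          # ESTs shorter than this are excluded
MAX_N_FRACTION = 0.03     # ESTs with more than 3% Ns are excluded
TAIL_WINDOW = 20          # number of final bases inspected for a tail
MIN_TAIL_ADENINES = 16    # adenines required within that window


def passes_quality(seq: str) -> bool:
    """Apply the length and N-content thresholds to one EST."""
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return False
    return seq.count("N") / len(seq) <= MAX_N_FRACTION


def has_poly_a_tail(seq: str) -> bool:
    """True if at least 16 of the final 20 bases are adenines."""
    tail = seq.upper()[-TAIL_WINDOW:]
    return tail.count("A") >= MIN_TAIL_ADENINES


def partition_ests(ests: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Split quality-passing ESTs into poly(A+) and non-poly(A+) datasets."""
    poly_a: dict[str, str] = {}
    non_poly_a: dict[str, str] = {}
    for name, seq in ests.items():
        if not passes_quality(seq):
            continue  # dropped entirely, as in the second processing step
        target = poly_a if has_poly_a_tail(seq) else non_poly_a
        target[name] = seq
    return poly_a, non_poly_a
```

Only the first dictionary returned by `partition_ests` would go on to clustering; the second corresponds to the separate non-poly(A+) dataset.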
Annotation
All TUTs were searched for homology, annotation and chromosomal location in GenBank with the BLASTn 2.0.5 program at NCBI (http://www.ncbi.nlm.nih.gov), using both manual and automated methods and querying the nr sequence database for Oryza sativa. The BLASTn score (bits) was used to classify the alignments into strong, medium and weak homology. Alignments with an E value below 10e-12 and an HSP (high-scoring segment pair, extracted from the BLAST report) score higher than 100 were considered significant. TUTs with known chromosomal locations were also recorded. All results were organized in a spreadsheet (Microsoft Excel). TUTs with known gene matches were categorized into functional groups, mainly according to the categories described by MIPS (Munich Information Center for Protein Sequences). The "Manually assigned functional catalog" table from MIPS is available at (http://mips.gsf.de/cgi-bin/proj/thal/filter_funcat.pl?all ).
With comprehensive characterization of the expression patterns of the three datasets, an abundant resource is provided for further research, such as microarray-aided exploration of gene networks. The leaf dataset was generated from the biological interaction between the plant and a pathogen, and provides many stress-response ESTs and protection-related genes. The stem dataset provides clues for research on genes related to the formation of solidly built stems. The composition of the endosperm dataset points to genes related to the quality of rice grain.
(1) Endosperm dataset: Moderately and highly expressed genes in endosperm were mainly involved in encoding storage proteins, starch synthesis, photorespiration and related biochemical processes. 35% of the moderately or highly expressed genes were not found in current databases and may be specifically associated with the physiological function of the endosperm.
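The significance rule can be written as a simple predicate. This is a minimal sketch with hypothetical names; it reads the E-value threshold literally as 10e-12 and does not attempt the strong/medium/weak classification, whose bit-score boundaries are not given on this page.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 10e-12   # taken literally as written; numerically 1e-11
MIN_HSP_SCORE = 100       # HSP score must be higher than this


@dataclass
class BlastHit:
    """Hypothetical container for one hit parsed from a BLAST report."""
    tut: str
    subject: str
    e_value: float
    hsp_score: float


def is_significant(hit: BlastHit) -> bool:
    """Both conditions must hold: small E value and high HSP score."""
    return hit.e_value < E_VALUE_CUTOFF and hit.hsp_score > MIN_HSP_SCORE


def significant_hits(hits: list[BlastHit]) -> list[BlastHit]:
    """Keep only the hits that meet the significance rule."""
    return [hit for hit in hits if is_significant(hit)]
```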
Basic statistics of Endosperm dataset
Clones 10245 Total bases (Mb) 4.35
Poly(A+) EST 9369 Poly(A+) EST percent (%) 90.6
Longest (nt) 1060 Average (nt) 462
Singleton 2092 Contig 912
TUTs 3004 TUTs percent (%) 32.1
A: identified TUT 576 Percent of A (%) 19.2
B: mapped TUTs 921 Percent of B (%) 30.7
A∩B 182 TUTs found only in this project 653

Navigation:
Browse all 3004 TUTs and related information for the endosperm dataset here.
(2) Leaf dataset: Moderately and highly expressed genes in leaf (20.1%) covered 58.5% of total ESTs, and 60% of them were unknown genes. Two categories of genes were detected with high redundancy: leaf tissue-specific genes involved in photosynthesis, and stress resistance genes induced by the pathogen threat. The protection-related genes comprised 66 kinds of known genes covering 171 TUTs and 993 ESTs.
Basic statistics of Leaf dataset
Clones 15396 Total bases (Mb) 6.2
Poly(A+) EST 13316 Poly(A+) EST percent (%) 86.5
Longest (nt) 830 Average (nt) 488
Singleton 3483 Contig 2150
TUTs 5633 TUTs percent (%) 42.3
A: identified TUT 642 Percent of A (%) 11.4
B: mapped TUTs 1560 Percent of B (%) 27.7
A∩B 197 TUTs found only in this project 916


Navigation:
Browse all 5633 TUTs and related information for the leaf dataset here.

(3) Stem dataset: Two groups of genes were prominent in stem: signal transduction-related genes, covering 8.3% of the identified ESTs, and plant-specific transcription factors, such as ethylene-responsive element binding protein (EREBP) and others.

Basic statistics of Stem dataset
Clones 2683 Total bases (Mb) 1.23
Poly(A+) EST 2485 Poly(A+) EST percent (%) 92.6
Longest (nt) 1040 Average (nt) 522
Singleton 1591 Contig 312
TUTs 1903 TUTs percent (%) 76.6
A: identified TUT 353 Percent of A (%) 18.5
B: mapped TUTs 593 Percent of B (%) 31.2
A∩B 109 TUTs found only in this project 259

Navigation:
Browse all 1903 TUTs and related information for the stem dataset here.
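The derived columns in the three statistics tables can be recomputed from the raw counts. The sketch below uses counts copied from the tables; the choice of denominators (TUT percentage relative to poly(A+) ESTs, percentages of A and B relative to TUTs) is an inference that appears to reproduce the tabulated values, not a documented definition, and the names are hypothetical.

```python
# Counts copied from the endosperm, leaf and stem statistics tables.
DATASETS = {
    "endosperm": {"polya_ests": 9369, "singletons": 2092, "contigs": 912,
                  "identified": 576, "mapped": 921},
    "leaf": {"polya_ests": 13316, "singletons": 3483, "contigs": 2150,
             "identified": 642, "mapped": 1560},
    "stem": {"polya_ests": 2485, "singletons": 1591, "contigs": 312,
             "identified": 353, "mapped": 593},
}


def derived_stats(counts: dict[str, int]) -> dict[str, float]:
    """Recompute TUT count and percentages from the raw counts."""
    tuts = counts["singletons"] + counts["contigs"]
    return {
        "tuts": tuts,
        # ASSUMED denominator: poly(A+) ESTs
        "tut_percent": round(100 * tuts / counts["polya_ests"], 1),
        # ASSUMED denominator: TUTs
        "identified_percent": round(100 * counts["identified"] / tuts, 1),
        "mapped_percent": round(100 * counts["mapped"] / tuts, 1),
    }


for name, counts in DATASETS.items():
    print(name, derived_stats(counts))
```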
Functional Classification
A comparison of the TUTs falling into each functional class for each library, according to known or putative biological function, is shown in the following figure.
Distribution of functional classification.
From inner to outer, the circles represent the Leaf library, Stem library and Endosperm library. The following are code descriptions. BIO: Cellular biogenesis and cellular organization; COM: Cellular communication/signal transduction; DEV: Development; ENG: Energy; GRO: Cell growth, cell division and DNA synthesis; MET: Metabolism; MIS: Miscellaneous: unclassified or unknown gene; PLD: Transposable elements, viral and plasmid proteins; PRO: Protein synthesis and destination; RES: Cell rescue, defense, cell death and aging; SCR: Transcription; TRA: Transport facilitation and mechanism. Notably, a high percentage of TUTs in the endosperm dataset is assigned to the functional class "Protein synthesis and destination", which is consistent with the physiological function of rice endosperm.
Detection of Dataset Overlaps and TUTs Found Only in Our Project
To detect overlaps of TUTs between pairs of datasets and among all three, we performed assembly on all TUTs by clustering their representative ESTs. To find TUTs present only in our general dataset, we ran BLAST searches of our TUTs against all rice ESTs in GenBank. Within the general dataset of 25170 Poly(A+) ESTs and 9321 TUTs, 1787 TUTs were identified as found only in our database. The overlaps of unique transcripts among the EST datasets at the sequence alignment level are shown in the following figure and table.
Among the 143 TUTs common to all three datasets, 43 were assigned a function and grouped by the functional categories described in the "Functional Classification" section. Among these 43 TUTs, we found relatively high percentages in groups ENG (13.95%), RES (16.28%), MET (18.60%), and PRO (18.60%). These results are consistent with the physiological activity of rice, although most TUTs were not assigned a functional annotation. The genes in groups ENG, MET, and PRO are necessary in many different tissues, whereas genes in group RES prepare the system for defense.
Among 96 TUTs shared by the three datasets, redundancy varied between datasets. These differences are consistent with the specific tissue or biological process of the cDNA library used to generate each EST dataset. For example, TUTs S058A08, H123C11 and A005C10, from the stem, endosperm and leaf datasets respectively, were aligned into one cluster annotated as "O. sativa hsp82 gene for heat shock protein 82", which accounts for 0.04%, 0.05% and 0.22% of the respective datasets. The protein encoded by this gene is produced by stressed plants and plays an important role in plant survival. The comparatively high percentage in the leaf dataset is consistent with its origin in leaf tissue undergoing a biological defense process induced by M. grisea.
Leaf
                   TUT     Function   only TUT   EST
L,S overlap *      359     38         17         1345
S,H overlap *      \       \          \          \
L,H overlap *      448     40         30         1269
L,S,H overlap      143     43         1          875
Respective         5633    642        916        13316
Stem
                   TUT     Function   only TUT   EST
L,S overlap *      359     38         17         455
S,H overlap *      126     30         6          173
L,H overlap *      \       \          \          \
L,S,H overlap      143     43         1          237
Respective         1903    353        259        2485
Endosperm
                   TUT     Function   only TUT   EST
L,S overlap *      \       \          \          \
S,H overlap *      126     30         6          352
L,H overlap *      448     40         30         820
L,S,H overlap      143     43         1          461
Respective         3004    576        653        9369

*These statistics do not include the TUTs shared by all three datasets.
H represents Endosperm Dataset; L represents Leaf Dataset; S represents Stem Dataset; "only TUT" represents TUTs only find in our d
Venn diagram showing overlaps among three datasets. The number of shared TUTs between datasets is shown where each Venn circle intersects with another. "Known" means TUTs with functional annotation.
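The TUT counts in the overlap table can be cross-checked against the total of 9321 non-redundant TUTs by inclusion-exclusion. The sketch below uses the counts from the table; because the pairwise figures exclude the TUTs shared by all three datasets, the triple overlap is subtracted twice. Function and variable names are hypothetical.

```python
# TUT counts per dataset (L = leaf, S = stem, H = endosperm).
TUTS = {"L": 5633, "S": 1903, "H": 3004}
# Pairwise overlaps, EXCLUDING TUTs shared by all three datasets.
PAIR_ONLY = {("L", "S"): 359, ("S", "H"): 126, ("L", "H"): 448}
# TUTs shared by all three datasets.
TRIPLE = 143


def non_redundant_total(per_set: dict, pair_only: dict, triple: int) -> int:
    """Size of the union: a pair-only TUT is counted twice in the sum
    (remove one copy), a triple TUT three times (remove two copies)."""
    return sum(per_set.values()) - sum(pair_only.values()) - 2 * triple


def dataset_specific(label: str) -> int:
    """TUTs found in one dataset and in neither of the other two."""
    shared = sum(n for pair, n in PAIR_ONLY.items() if label in pair)
    return TUTS[label] - shared - TRIPLE


print(non_redundant_total(TUTS, PAIR_ONLY, TRIPLE))  # 9321
```

The result agrees with the 9321 TUTs reported for the general dataset.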


Download data and related information: Overlap of TUTs in three datasets
If you are interested in browsing or downloading other data and related information for the Poly(A+) dataset, the non-Poly(A+) dataset, or our data processed by NCBI UniGene, please go to the "Navigation" and "Download" pages.