The Representative Sequences DataBase

The Importance of Representatives in Protein Sequence Families

A sequence database can effectively be reduced to 50% mutual sequence identity,
at which point it is one third of its original size.

Bioinformatics paper: PDF

(1) About: The Importance of Representatives in Protein Families
Motivation: Biological sequence databases are highly redundant for two main reasons: (1) the various databanks hold many identical and nearly identical sequences, and (2) natural sequences often share high sequence identity because of gene duplication. We wanted to know how many sequences can be removed before the databases start to lose homology information. Can a database of sequences with mutual sequence identity of 50% or less provide the same amount of biological information as the original full database?

Results: Comparisons of nine representative sequence databases (RSDBs) derived from full protein databanks showed that the information content of a sequence database is not linearly proportional to its size. An RSDB reduced to a mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the size of the full database, which made iterative profile searching six times faster. The RSDBs are produced at different granularities for efficient homology searching.
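
To make "mutual sequence identity of 50% or less" concrete, here is a minimal Python sketch of one way a representative set could be picked. It is an illustration only, not the procedure used to build the RSDBs: the greedy longest-first strategy, the function names `identity` and `pick_representatives`, and the crude identity measure (longest common subsequence divided by the shorter length, standing in for a real alignment) are all assumptions made for this example.

```python
def identity(a, b):
    """Crude identity estimate (assumption, not a real alignment score):
    longest-common-subsequence length divided by the shorter sequence length."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)          # LCS lengths for the previous row
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / min(len(a), len(b))


def pick_representatives(seqs, threshold=0.5):
    """Greedy selection: visit sequences longest first (ties broken by name)
    and keep one only if its identity to every kept sequence is <= threshold."""
    kept = []
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        if all(identity(seqs[name], seqs[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy input: "b" is nearly identical to "a", "c" is unrelated.
toy = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQK", "c": "GGGGGGGGGG"}
print(pick_representatives(toy))
```

In this toy input, "b" differs from "a" at a single position and is dropped, while the unrelated "c" is kept. A real pipeline would use a proper alignment program and a much faster comparison strategy than this all-against-kept loop.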
 

(2) BiO Authors
Jong Park, Liisa Holm, Andreas Heger and Cyrus Chothia

The European Bioinformatics Institute, EMBL Outstation, Cambridge CB10 1SD, UK 
LMB, MRC Centre, Hills Road, Cambridge, CB2 2QH, UK 

Warning: this is a CopyTheft page! Restrictions apply. See: CopyTheft, CopyFree, Copyleft.
jong@biosophy.org