Contact
 • Dr.R.Sowdhamini
mini@ncbs.res.in

GenDiS

Genomic Distribution of structural Superfamilies (GenDiS) identifies and classifies evolutionarily related proteins. The database has been curated in direct correspondence with SCOP and represents 1194 structural superfamilies. GenDiS aims to associate the primary structure of protein sequences with 3-dimensional homologous superfamilies. Sequences showing identifiable homology to entries in PASS2 have been obtained from the non-redundant protein sequence database and aligned. Similar alignments of the superfamily members are provided at the genome level, thus creating a sequence platform for cross-genome comparison at the superfamily level.
Flowchart representation of the steps involved in the curation of GenDiS


Superfamily code :

This is a unique identifier (5-digit code) for each superfamily described
within the SCOP database.

Superfamily name :

Name of the superfamily as in SCOP

Protein Code :

    Each superfamily member has a unique 6-character identifier that starts with a number.
Details of the protein in various databases can be reached through the link
provided on the 6-character code.

First 4 characters :PDB code (starts with a number)
Fifth character    :Chain identifier (- indicates that the protein has no chains)
Sixth character    :Domain number as given in SCOP (- indicates that the whole
                    protein is considered as a single domain)
For example

1. 1fpoa1

   1fpo       -----> pdbcode
   a          -----> Chain A
   1          -----> Domain 1 of 1fpo


2. 1gh6a-

   1gh6       -----> pdbcode
   a          -----> Chain A
   -          -----> Single domain
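
The naming scheme above can be split mechanically. The following is a minimal sketch in Python, not part of GenDiS itself; the names DomainCode and parse_protein_code are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class DomainCode(NamedTuple):
    """Parts of a 6-character protein code (hypothetical helper type)."""
    pdb_code: str            # first four characters, e.g. "1fpo"
    chain: Optional[str]     # chain identifier, or None when the code has "-"
    domain: Optional[int]    # domain number, or None when the code has "-"


def parse_protein_code(code: str) -> DomainCode:
    """Split a code such as '1fpoa1' into PDB code, chain and domain."""
    if len(code) != 6:
        raise ValueError("protein code must have exactly 6 characters")
    pdb_code, chain, domain = code[:4], code[4], code[5]
    if not pdb_code[0].isdigit():
        raise ValueError("PDB code must start with a number")
    return DomainCode(
        pdb_code=pdb_code,
        chain=None if chain == "-" else chain.upper(),
        domain=None if domain == "-" else int(domain),
    )


print(parse_protein_code("1fpoa1"))
print(parse_protein_code("1gh6a-"))
```

For the two examples above, the first call yields chain A and domain 1, and the second yields chain A with no domain number (single domain).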


Genome name:

Species name as given in NCBI taxonomy database.

Structural members of alpha-helical ferredoxin superfamily :
Structural domains assigned to the superfamily in SCOP. Relevant information on each domain in various resources can be reached through the link given.

Taxonomy :
Taxonomy is the scientific discipline of categorizing the various species of organisms into
conveniently sized groups (each referred to as a taxon; plural taxa) whose members share common,
identifiable traits. To accomplish this, taxonomists have defined a hierarchy of taxa,
each level of which is given a Latin name. This hierarchy provides a minimum of seven
Latin names for each species. For the animal kingdom, they are ranked as follows
(the plant kingdom uses slightly different terminology):
  Kingdom
    Phylum
      Class
        Order
          Family
            Genus*
              species*
* Genus and species comprise the scientific name.
To further facilitate grouping similar or closely related groups, these taxa may be further
divided with up to three named intermediate-level taxa, as required. For example:
  Class     Major division  (required)
    Subclass     1st optional  subdivision
      Infraclass     3rd optional  subdivision
        Superorder     2nd optional  subdivision
          Order          Major division  (required)


Phylum statistics :
The number of the various taxa listed within a given phylum. The classes, orders, families and genera within the phylum are listed.

Sequence in phyla :
The number of sequences listed within a phylum.

No of genomes :
Number of genomes assigned to a given superfamily.

No of sequences :
Number of sequences in a genome that belong to a particular superfamily.

Pruned sequences :
Number of sequences in a superfamily after the following procedure has been applied:

i) Removal of 100% sequence identity:
Sequences having 100% sequence identity to each other are considered redundant and are filtered using cd-hit.
ii) Removal of false positives:
Sequences shorter than 40% of the query length are considered false positives and are removed from the dataset.
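
The two pruning steps can be illustrated with a short Python sketch. It is a simplified stand-in, not the GenDiS pipeline: exact string comparison replaces cd-hit here, and the function name prune_hits and its parameters are assumptions made for the example.

```python
from typing import Dict


def prune_hits(hits: Dict[str, str], query_length: int,
               min_fraction: float = 0.4) -> Dict[str, str]:
    """Drop duplicate sequences, then drop hits that are too short.

    hits maps a sequence name to its amino-acid sequence. Exact string
    equality is used as a stand-in for the 100% identity filter that the
    database performs with cd-hit.
    """
    seen = set()
    kept: Dict[str, str] = {}
    for name, sequence in hits.items():
        if sequence in seen:
            continue  # step i: redundant copy of an earlier sequence
        seen.add(sequence)
        if len(sequence) < min_fraction * query_length:
            continue  # step ii: shorter than 40% of the query length
        kept[name] = sequence
    return kept


example = {
    "seq1": "ACDEFGHIKL",
    "seq2": "ACDEFGHIKL",  # identical to seq1
    "seq3": "ACD",         # 3 residues, below 40% of a 10-residue query
    "seq4": "ACDEFG",
}
print(list(prune_hits(example, query_length=10)))
```

With a 10-residue query, only seq1 and seq4 survive in this toy input.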

Total sequences :
Total number of hits obtained using HMM, PSI-BLAST and interacting motif based PHI-BLAST. Sequences having 100% sequence identity to each other are considered redundant and are filtered using cd-hit.

Alignment :

Superfamily alignment:
The alignment of all homologues listed within a particular structural superfamily.
The alignment has been performed using the CLUSTALW program.

Genome alignment:
Alignment of the member proteins belonging to a given structural superfamily and 
a particular genome. The alignment has been performed using the CLUSTALW program.

Dendrogram :
The dendrogram has been obtained using the drawtree program of the PHYLIP
package. The clustering of the sequences in a given genome provides a treefile,
which has been used as input for the drawtree program.
No dendrogram is provided for genomes having more than 20 sequences.

Treefile :

The treefile is built using PHYLIP. PHYLIP uses a distance matrix containing the pairwise
distances between all the sequences to create the treefile. The distance matrix is provided
by the multiple sequence alignment program CLUSTALW. A sample treefile is shown below.

((dmras1,ddrasa),((hschras,spras),(scras1,scras2)));
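
A treefile of this kind is a nested, parenthesised list of sequence names. As a rough illustration only, the Python sketch below pulls the sequence names out of such a string; the function name treefile_leaves is hypothetical, and the pattern assumes names contain no parentheses, commas, colons or spaces.

```python
import re
from typing import List


def treefile_leaves(tree: str) -> List[str]:
    """Return the sequence names (leaves) in a parenthesised tree string.

    A leaf is taken to be any name that directly follows "(" or ",".
    Branch lengths written as ":0.12" after a name are ignored.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", tree)


sample = "((dmras1,ddrasa),((hschras,spras),(scras1,scras2)));"
print(treefile_leaves(sample))
```

For the sample treefile this lists the six sequences in the order in which they appear.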

The PHYLIP input file is laid out as follows:

1. The first line contains the number of species, the number of characters and, possibly,
   one or more program options.

2. If any option requires extra information, add it using one line per option: start the
   line with the option and follow with the data.

3. Next comes the species and character data in separate lines. Each line starts with 10
   letters or symbols reserved for the species name and is followed by the characters to
   analyze. If the characters require more than one line you may use either sequential or
   interleaved format. See the documentation for details.

HMM profile :

Structural alignments of evolutionarily related proteins have been used to
build sensitive HMMs for the different superfamilies. The HMM build step has been
used to produce both global and local models.

The basic strengths of profile HMMs are:

  1. the model can be used to search a database and/or parse sequences for
     the presence of similar domains

  2. profile HMMs can be used to maintain alignments of huge numbers of
     sequences, starting from carefully constructed "seed" alignments of a 
     representative set of sequences.  


Structural Domain Architecture
Substrate specificity and selective recruitment of proteins in a pathway
are influenced by adjacent domains. Adjacent domains for
every member of a superfamily have been assigned using a
sequence-to-profile matching method. Using IMPALA, every superfamily
member has been queried against a database of profiles comprising
PASS2 domains.

All genomes :

List of all genomes, including complete and incomplete genomes.

Complete genomes :

List of genomes that are completely sequenced. This list is downloaded from
the NCBI database.

Overlap score

The overlap score between two genomes is calculated by dividing the number of superfamilies common to both genomes by the number of unique superfamilies present in either genome.
Example: Overlap score between Aeropyrum pernix and Agrobacterium tumefaciens
Genome 1            : Aeropyrum pernix
No of Superfamilies : 294 [Aeropyrum pernix occurs in 294 superfamilies]
Genome 2            : Agrobacterium tumefaciens
No of Superfamilies : 521 [Agrobacterium tumefaciens occurs in 521 superfamilies]
Total superfamilies : 584 [Unique superfamilies from 294+521 superfamilies]
Common superfamilies: 231 [231 superfamilies with at least one member are identified in both genomes]
Overlap score = Common superfamilies/Total superfamilies
              = 231/584
              = 0.396
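
The same calculation can be written as a ratio of set sizes. The Python sketch below is illustrative only: the function name overlap_score is an assumption, and the integer sets are placeholders built to reproduce the counts in the example, not real superfamily codes.

```python
from typing import Set


def overlap_score(superfamilies_1: Set[int], superfamilies_2: Set[int]) -> float:
    """Common superfamilies divided by unique superfamilies in either genome."""
    total = superfamilies_1 | superfamilies_2   # unique superfamilies overall
    common = superfamilies_1 & superfamilies_2  # superfamilies found in both
    if not total:
        return 0.0
    return len(common) / len(total)


# Placeholder sets sized like the example: 294 and 521 members, 231 shared.
genome_1 = set(range(0, 294))
genome_2 = set(range(63, 584))

print(len(genome_1 & genome_2), len(genome_1 | genome_2))
print(round(overlap_score(genome_1, genome_2), 3))
```

The total follows from 294 + 521 - 231 = 584, so the score is 231/584, which rounds to 0.396 as above.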


Identify structural domain architecture

The query sequence provided is assigned structural domains by probing it against sequence profiles of the structural members
of the respective protein superfamilies, using IMPALA with a user-defined e-value. The default is 10^-5.

1. Paste sequence in FASTA format.
2. Press "Submit query" button.

Output will be in the following format.

Structural Domain architecture


   46548   -----    54862    


     alpha-helical ferredoxin
     4Fe-4S ferredoxins




alpha-helical ferredoxin [scop]   <------------- Hyperlink to this superfamily in SCOP
Superfamily code : 46548          <------------- Hyperlink to this superfamily in GenDiS
Super family     : alpha-helical ferredoxin
Fold             : Globin-like

Query :   1 APVLSKDVADIESILALNPRTQSHAALHSTLAKKLDKKHWKRNPDKNCFHCEKLENNFDD  60
1h7wa1:   1 APVLSKDVADIESILALNPRTQSHAALHSTLAKKLDKKHWKRNPDKNCFHCEKLENNFDD  60

Query :  61 IKHTTLGERGALREAMRCLKCADAPCQKSCPTHLDIKSFITSISNKNYYGAAKMIFSDNP 120
1h7wa1:  61 IKHTTLGERGALREAMRCLKCADAPCQKSCPTHLDIKSFITSISNKNYYGAAKMIFSDNP 120

Query : 121 LGLTCGMVCPTSDLCVGGCNLYATEEGSINIGGLQQFASEVFKAMNIPQIRNPCLPSQEK 180
1h7wa1: 121 LGLTCGMVCPTSDLCVGGCNLYATEEGSINIGGLQQFASEVFKAMNIPQIRNPCLPSQEK 180

Query : 181 MP 182
1h7wa1: 181 MP 182

4Fe-4S ferredoxins [scop]
Superfamily code : 54862
Super family     : 4Fe-4S ferredoxins
Fold             : Ferredoxin-like

Query : 183 ELQGWDGQSPGTESHQKGKPVPRIAELMGKKLPNFGPYLEQRKKIIAEEKMRLKEQNAAF 242
1h7wa5:   1 ELQGWDGQSPGTESHQKGKPVPRIAELMGKKLPNFGPYLEQRKKIIAEEKMRLKEQNAAF  60

Query : 243 PPLERKPFIPKKPIPAIKDVIGKALQYLGTFGELSNIEQVVAVIDEEMCINCGKCYMTCN 302
1h7wa5:  61 PPLERKPFIPKKPIPAIKDVIGKALQYLGTFGELSNIEQVVAVIDEEMCINCGKCYMTCN 120

Query : 303 DSGYQAIQFDPETHLPTVTDTCTGCTLCLSVCPIIDCIRMVSRTTPYEPKRG 354
1h7wa5: 121 DSGYQAIQFDPETHLPTVTDTCTGCTLCLSVCPIIDCIRMVSRTTPYEPKRG 172

Search GenDiS:
                                                                                
The database can be browsed efficiently using several search facilities. The
database may be queried using the superfamily code, superfamily name, genome
name, phyla and keywords.
1. The search pattern should contain at least four characters.
2. The protein code should contain at least four characters.
3. Select the search option.
Search by superfamily code
To search by superfamily code,
i) enter a valid superfamily code (5-digit code as mentioned in SCOP)
ii) select the "Superfamily code" option from the radio buttons given
iii) submit the form.
The result page contains the superfamily code (with link) and the superfamily name. The chosen entry can be reached using the link provided on the superfamily code.
Search by superfamily name:
To search by superfamily name,
i) enter a valid superfamily name (as mentioned in SCOP)
ii) select the "Superfamily name" option from the radio buttons given
iii) submit the form.
The result page contains the superfamily code (with link) and the superfamily name. The chosen entry can be reached using the link provided on the superfamily code.
Search by genome name :
To search by genome name,
i) enter a valid genome name (as mentioned in NCBI)
ii) select the "Genome name" option from the radio buttons given
iii) submit the form.
The result page contains the genome name and the number of superfamilies (with link). The chosen entry can be reached using the link provided on the number.
Search by Phyla name :
To search by Phyla name,
i) enter a valid phyla name (as mentioned in NCBI)
ii) select the "Phyla name" option from the radio buttons given
iii) submit the form.
The result page contains the superfamily code and the genome name within the given phylum. The chosen entry can be reached using the link provided on the superfamily code.
Search by Keyword :
To search by keyword,
i) enter a keyword
ii) select the "Key word" option from the radio buttons given
iii) submit the form.
The result page contains the superfamily code (with link) and the genome name. The chosen entry can be reached using the link provided on the superfamily code.

Align two genomes of a given superfamily
This option aligns the sequences present in two genomes of a particular superfamily. The sequences are aligned using CLUSTALW.
1. Select the superfamily name from the dropdown menu
2. Enter the first genome name
3. Enter the second genome name
4. Press the "Align" button


Align your sequence with genome in GenDiS
This option aligns a query sequence with a particular genome of a superfamily. The query is aligned with the members of the given genome using CLUSTALW.
1. Select the superfamily name from the dropdown menu
2. Enter the genome name
3. Paste the query sequence in PIR format
4. Press the "Align" button
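
For users whose sequence is not yet in PIR format, the Python sketch below writes a plain sequence in the common PIR layout: a ">P1;" header line, a description line, and the sequence terminated by "*". The function name to_pir is hypothetical, and whether the GenDiS form expects exactly this variant of PIR is an assumption.

```python
def to_pir(identifier: str, sequence: str, description: str = "") -> str:
    """Format a protein sequence as a PIR entry (common layout, assumed)."""
    # Remove whitespace and line breaks, use upper case, add the "*" terminator.
    cleaned = "".join(sequence.split()).upper() + "*"
    # Wrap the sequence at 60 characters per line.
    lines = [cleaned[i:i + 60] for i in range(0, len(cleaned), 60)]
    return ">P1;{}\n{}\n{}\n".format(identifier, description, "\n".join(lines))


print(to_pir("query", "APVLSKDVADIESILALNPRTQSHAALHSTLAKKLDKKHWKRNPDKNCFHCEKLENNFDD"))
```

The identifier "query" is a placeholder; any short name without spaces can be used.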