GenDiS
Genomic Distribution of structural Superfamilies (GenDiS) identifies and classifies
evolutionarily related proteins. The database has been curated in direct
correspondence with SCOP and represents 1194 structural superfamilies.
GenDiS aims to associate the primary structure of protein sequences with
3-dimensional homologous superfamilies. Sequences showing identifiable
homology to entries in PASS2 have been obtained from the non-redundant protein
sequence database and aligned. Similar alignments of the superfamily members
are provided at the genome level, thus creating a sequence platform for
cross-genome comparison at the superfamily level.
Flowchart representation of the steps involved in the curation of GenDiS
Superfamily code :
This is a unique identifier (5-digit code) for each superfamily described
within the SCOP database.
Superfamily name :
Name of the superfamily as in SCOP
Protein Code :
Each superfamily member has a unique 6-character identifier that starts with a number.
Details of the protein in various databases can be reached through the link
provided on the 6-character code.
First 4 characters : PDB code (starts with a number)
Fifth character : Chain identifier (- indicates that the protein has no chain identifier)
Sixth character : Domain number as given in SCOP (- indicates that the whole
protein is considered a single domain)
For example
1. 1fpoa1
1fpo -----> pdbcode
a -----> Chain A
1 -----> Domain 1 of 1fpo
2. 1gh6a-
1gh6 -----> pdbcode
a -----> Chain A
- -----> Single domain
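The layout of the protein code can be illustrated with a small parser. This is a minimal sketch, not part of GenDiS itself: the names `ProteinCode` and `parse_protein_code` are hypothetical, and the checks assume only what is stated above (6 characters, PDB code starting with a number, "-" for a missing chain or domain).

```python
from typing import NamedTuple, Optional


class ProteinCode(NamedTuple):
    """Hypothetical container for the three parts of a 6-character protein code."""
    pdb_code: str
    chain: Optional[str]   # None when the fifth character is "-"
    domain: Optional[int]  # None when the sixth character is "-"


def parse_protein_code(code: str) -> ProteinCode:
    """Split a code such as '1fpoa1' into PDB code, chain and domain number."""
    if len(code) != 6:
        raise ValueError("protein code must have exactly 6 characters")
    pdb_code, chain, domain = code[:4], code[4], code[5]
    if not pdb_code[0].isdigit():
        raise ValueError("PDB code must start with a number")
    if domain != "-" and not domain.isdigit():
        raise ValueError("domain must be a digit or '-'")
    return ProteinCode(
        pdb_code=pdb_code,
        chain=None if chain == "-" else chain.upper(),
        domain=None if domain == "-" else int(domain),
    )


# The two examples given above.
print(parse_protein_code("1fpoa1"))
print(parse_protein_code("1gh6a-"))
```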
Genome name:
Species name as given in NCBI taxonomy database.
Structural members of alpha-helical ferredoxin superfamily :
Structural domains assigned to the superfamily in SCOP. Relevant
information on each domain in various resources is available through the link given.
Taxonomy :
Taxonomy is the scientific discipline of categorizing various species of organisms into
a convenient sized group (referred to as a taxon; plural taxa) which shares common,
identifiable traits. To accomplish this, taxonomists have defined a hierarchy of taxa,
each level of which is given a Latin name. This hierarchy provides a minimum of seven
Latin names for each species. For the animal kingdom, they are ranked as follows
(the plant kingdom uses slightly different terminology):
Kingdom
Phylum
Class
Order
Family
Genus*
species*
* Genus and species comprise the scientific name.
To further facilitate grouping similar or closely related groups, these taxa may be further
divided with up to three named intermediate-level taxa, as required. For example:
Class Major division (required)
Subclass 1st optional subdivision
Infraclass 3rd optional subdivision
Superorder 2nd optional subdivision
Order Major division (required)
Phylum statistics :
The number of taxa of each rank listed within a given phylum. The ranks counted are
Class, Order, Family and Genus.
Sequence in phyla :
The number of sequences listed within a Phylum.
No of genomes :
Number of genomes assigned to a given superfamily.
No of sequences :
Number of sequences of a particular superfamily present in a genome.
Pruned sequences :
Number of sequences in a superfamily after the following procedure has been applied:
i) Removal of 100% sequence identity:
Sequences having 100% sequence identity to each other are considered redundant
and are filtered using cd-hit.
ii) Removal of false positives:
Sequences shorter than 40% of the query length are considered false positives
and are removed from the dataset.
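The two pruning steps can be sketched as follows. This is an illustration only: GenDiS uses cd-hit for the redundancy step, whereas the sketch below simply drops exact duplicate strings as a stand-in, and the function name `prune_hits` and its parameters are hypothetical.

```python
def prune_hits(hits, query_length, min_fraction=0.4):
    """Return the hits that survive both pruning steps.

    hits          -- dict mapping a sequence name to its sequence
    query_length  -- length of the query used to find the hits
    min_fraction  -- hits shorter than this fraction of the query are dropped
    """
    seen = set()
    kept = {}
    for name, sequence in hits.items():
        sequence = sequence.upper()
        # Step i: keep only the first of any set of identical sequences
        # (a simplification of what cd-hit does at 100% identity).
        if sequence in seen:
            continue
        seen.add(sequence)
        # Step ii: discard hits shorter than 40% of the query length.
        if len(sequence) < min_fraction * query_length:
            continue
        kept[name] = sequence
    return kept


example = {
    "hit_a": "ACDEFGHIKL",
    "hit_b": "acdefghikl",  # identical to hit_a, removed as redundant
    "hit_c": "ACD",         # shorter than 40% of a 10-residue query
    "hit_d": "MNPQRSTVWY",
}
print(sorted(prune_hits(example, query_length=10)))
```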
Total sequences :
Total number of hits obtained using HMM, PSI-BLAST and interacting-motif-based PHI-BLAST.
Sequences having 100% sequence identity to each other are considered redundant and
are filtered using cd-hit.
Alignment :
Superfamily alignment:
The alignment of all homologues listed within a particular structural superfamily.
The alignment has been performed using the CLUSTALW program.
Genome alignment:
Alignment of the member proteins belonging to a given structural superfamily and
a particular genome. The alignment has been performed using the CLUSTALW program.
Dendrogram :
The dendrogram has been obtained using the drawtree program of the PHYLIP
package. Clustering the sequences of a given genome produces a treefile,
which is used as input for the drawtree program.
A dendrogram is not provided for genomes having more than 20 sequences.
Treefile :
The treefile is built using PHYLIP. PHYLIP uses a distance matrix containing the pairwise
distances between all the sequences to create the treefile. The distance matrix is provided
by the multiple sequence alignment program CLUSTALW. A sample treefile is shown below:
((dmras1,ddrasa),((hschras,spras),(scras1,scras2)));
The PHYLIP input file is organised as follows:
1. The first line contains the number of species, the number of characters and, possibly,
one or more program options.
2. If any option requires extra information, add it using one line per option: start the
line with the option and follow with the data.
3. Next comes the species and character data in separate lines. Each line starts with 10
letters or symbols reserved for the species name and is followed by the characters to
analyze. If the characters require more than one line you may use either sequential or
interleaved format. See the documentation for details.
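The sample treefile shown above is written in the nested-parenthesis (Newick) notation. The following minimal sketch reads such a string into nested tuples and lists its leaves. It is an illustration under a stated assumption: it handles only plain topologies like the sample, without branch lengths or internal node labels, and the function names are hypothetical.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("treefile must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")"
            return tuple(children)
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        if start == pos:
            raise ValueError("empty leaf name at position %d" % pos)
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text after the tree")
    return tree


def leaf_names(tree):
    """Return the leaf names of a parsed tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaf_names(child))
    return names


tree = parse_newick("((dmras1,ddrasa),((hschras,spras),(scras1,scras2)));")
print(tree)
print(leaf_names(tree))
```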
HMM profile :
Structural alignments of evolutionarily related proteins have been used to
build sensitive HMMs for the different superfamilies. Both global and local
models have been built.
The basic strengths of profile HMMs are:
1. the model can be used to search a database and/or parse sequences for
the presence of similar domains
2. profile HMMs can be used to maintain alignments of huge numbers of
sequences, starting from carefully constructed "seed" alignments of a
representative set of sequences.
Structural Domain Architecture
Substrate specificity and selective recruitment of proteins in a pathway
are channeled under the influence of adjacent domains. Adjacent domains for
every member of a superfamily have been assigned using a
sequence-to-profile matching method. Using IMPALA, every superfamily
member has been queried against a database of profiles comprising the
PASS2 domains.
All genomes :
List of all genomes, including both complete and incomplete genomes.
Complete genomes :
List of genomes that are completely sequenced. This list is downloaded from
the NCBI database.
Overlap score
The overlap score between two genomes is calculated by dividing the number of superfamilies
common to both genomes by the total number of unique superfamilies present in either genome.
Example:
Overlap score between Aeropyrum pernix and Agrobacterium tumefaciens
Genome 1             : Aeropyrum pernix
No of Superfamilies  : 294 [Aeropyrum pernix occurs in 294 superfamilies]
Genome 2             : Agrobacterium tumefaciens
No of Superfamilies  : 521 [Agrobacterium tumefaciens occurs in 521 superfamilies]
Total superfamilies  : 584 [Unique superfamilies among the 294 + 521 superfamilies]
Common superfamilies : 231 [231 superfamilies with at least one member are identified in both genomes]
Overlap score = Common superfamilies / Total superfamilies
              = 231/584
              = 0.396
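The calculation above can be written as a short function on two sets of superfamily codes. This is a minimal sketch: the function name `overlap_score` is hypothetical, and the identifiers in the usage example are synthetic placeholders chosen only to reproduce the counts in the example (294, 521 and 231), not real SCOP codes.

```python
def overlap_score(superfamilies_a, superfamilies_b):
    """Common superfamilies divided by the unique superfamilies of both genomes."""
    a, b = set(superfamilies_a), set(superfamilies_b)
    total = a | b   # unique superfamilies found in either genome
    common = a & b  # superfamilies found in both genomes
    if not total:
        return 0.0
    return len(common) / len(total)


# Synthetic placeholder identifiers: 294 and 521 superfamilies, 231 shared.
genome_1 = set(range(0, 294))
genome_2 = set(range(63, 584))
print(round(overlap_score(genome_1, genome_2), 3))  # 231/584
```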
Identify structural domain architecture
The query sequence is assigned structural domains by probing it against sequence profiles of the structural members
of the respective protein superfamilies, using IMPALA with a user-defined e-value. The default is 10^-5.
1. Paste sequence in FASTA format.
2. Press "Submit query" button.
Output will be in the following format.
Structural Domain architecture
| 46548 | ----- | 54862 |
46548 : alpha-helical ferredoxin
54862 : 4Fe-4S ferredoxins
alpha-helical ferredoxin [scop] <------------- Hyperlink to this superfamily in SCOP
Superfamily code : 46548 <------------- Hyperlink to this superfamily in GenDiS
Super family : alpha-helical ferredoxin
Fold : Globin-like
Query : 1 APVLSKDVADIESILALNPRTQSHAALHSTLAKKLDKKHWKRNPDKNCFHCEKLENNFDD 60
1h7wa1: 1 APVLSKDVADIESILALNPRTQSHAALHSTLAKKLDKKHWKRNPDKNCFHCEKLENNFDD 60
Query : 61 IKHTTLGERGALREAMRCLKCADAPCQKSCPTHLDIKSFITSISNKNYYGAAKMIFSDNP 120
1h7wa1: 61 IKHTTLGERGALREAMRCLKCADAPCQKSCPTHLDIKSFITSISNKNYYGAAKMIFSDNP 120
Query : 121 LGLTCGMVCPTSDLCVGGCNLYATEEGSINIGGLQQFASEVFKAMNIPQIRNPCLPSQEK 180
1h7wa1: 121 LGLTCGMVCPTSDLCVGGCNLYATEEGSINIGGLQQFASEVFKAMNIPQIRNPCLPSQEK 180
Query : 181 MP 182
1h7wa1: 181 MP 182
4Fe-4S ferredoxins [scop]
Superfamily code : 54862
Super family : 4Fe-4S ferredoxins
Fold : Ferredoxin-like
Query : 183 ELQGWDGQSPGTESHQKGKPVPRIAELMGKKLPNFGPYLEQRKKIIAEEKMRLKEQNAAF 242
1h7wa5: 1 ELQGWDGQSPGTESHQKGKPVPRIAELMGKKLPNFGPYLEQRKKIIAEEKMRLKEQNAAF 60
Query : 243 PPLERKPFIPKKPIPAIKDVIGKALQYLGTFGELSNIEQVVAVIDEEMCINCGKCYMTCN 302
1h7wa5: 61 PPLERKPFIPKKPIPAIKDVIGKALQYLGTFGELSNIEQVVAVIDEEMCINCGKCYMTCN 120
Query : 303 DSGYQAIQFDPETHLPTVTDTCTGCTLCLSVCPIIDCIRMVSRTTPYEPKRG 354
1h7wa5: 121 DSGYQAIQFDPETHLPTVTDTCTGCTLCLSVCPIIDCIRMVSRTTPYEPKRG 172
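The query range covered by each assigned domain can be read off a report in the format shown above. The following minimal sketch assumes the text layout is exactly as in the sample ("Superfamily code : NNNNN" followed by "Query : start sequence end" lines); the function name `domain_ranges` is hypothetical and the sequences in the usage example are abbreviated.

```python
import re

CODE_RE = re.compile(r"^Superfamily code\s*:\s*(\d{5})")
QUERY_RE = re.compile(r"^Query\s*:\s*(\d+)\s+([A-Za-z]+)\s+(\d+)\s*$")


def domain_ranges(report):
    """Return (superfamily code, first query residue, last query residue) per domain."""
    ranges = []
    for line in report.splitlines():
        line = line.strip()
        code = CODE_RE.match(line)
        if code:
            # A new superfamily block starts here.
            ranges.append([code.group(1), None, None])
            continue
        query = QUERY_RE.match(line)
        if query and ranges:
            start, end = int(query.group(1)), int(query.group(3))
            entry = ranges[-1]
            entry[1] = start if entry[1] is None else min(entry[1], start)
            entry[2] = end if entry[2] is None else max(entry[2], end)
    return [tuple(entry) for entry in ranges]


# Abbreviated version of the sample output above.
sample = """
Superfamily code : 46548
Query : 1 APVLSKDVAD 60
Query : 181 MP 182
Superfamily code : 54862
Query : 183 ELQGWDGQSP 242
Query : 303 DSGYQAIQFD 354
"""
print(domain_ranges(sample))
```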
Search GenDiS:
The database can be browsed efficiently using several search facilities. The
database may be queried using the superfamily code, superfamily name, genome
name, phylum name and keywords.
1. The search pattern should contain at least four characters.
2. The protein code should contain at least four characters.
3. Select the search option.
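The input rules above can be checked before a query is submitted. This is a minimal sketch, not the server's own validation: the function name `validate_search` and the option label "superfamily_code" are hypothetical, and the 5-digit check reflects the superfamily code format described on this page.

```python
import re


def validate_search(pattern, option):
    """Return an error message for an invalid query, or None if it looks valid."""
    pattern = pattern.strip()
    if len(pattern) < 4:
        return "search pattern must contain at least four characters"
    # Hypothetical option label; superfamily codes are 5-digit SCOP identifiers.
    if option == "superfamily_code" and not re.fullmatch(r"\d{5}", pattern):
        return "superfamily code must be a 5-digit number"
    return None


print(validate_search("46548", "superfamily_code"))
print(validate_search("fer", "keyword"))
```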
Search by superfamily code
To search by superfamily code,
i) enter a valid superfamily code (5-digit code as given in SCOP),
ii) select the "Superfamily code" option from the radio buttons,
iii) submit the form.
The result page contains the superfamily code (with link) and superfamily name.
The chosen entry can be reached through the link provided on the superfamily code.
Search by superfamily name:
To search by superfamily name,
i) enter a valid superfamily name (as given in SCOP),
ii) select the "Superfamily name" option from the radio buttons,
iii) submit the form.
The result page contains the superfamily code (with link) and superfamily
name. The chosen entry can be reached through the link provided on the superfamily
code.
Search by genome name :
To search by genome name,
i) enter a valid genome name (as given in NCBI),
ii) select the "Genome name" option from the radio buttons,
iii) submit the form.
The result page contains the genome name and the number of superfamilies (with link).
The chosen entry can be reached through the link provided on the number.
Search by Phyla name :
To search by phylum name,
i) enter a valid phylum name (as given in NCBI),
ii) select the "Phyla name" option from the radio buttons,
iii) submit the form.
The result page contains the superfamily code and genome name within the given
phylum. The chosen entry can be reached through the link provided on the superfamily code.
Search by Keyword :
To search by keyword,
i) enter a keyword,
ii) select the "Key word" option from the radio buttons,
iii) submit the form.
The result page contains the superfamily code (with link) and genome name.
The chosen entry can be reached through the link provided on the superfamily code.
Align two genomes of a given superfamily
This option aligns the sequences present in two
genomes for a particular superfamily. The sequences are aligned using CLUSTALW.
1. Select the superfamily name from the dropdown menu
2. Enter first genome name
3. Enter second genome name
4. Press "Align" button
Method:
Align your sequence with genome in GenDiS
This option lets the user align a query sequence with a particular genome of a superfamily.
The query is aligned with the members of the given genome using CLUSTALW.
1. Select the superfamily name from the dropdown menu
2. Enter genome name
3. Paste query sequence in PIR format
4. Press "Align" button
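Since the query must be pasted in PIR format, a sequence held in FASTA format needs to be converted first. The following minimal sketch performs that conversion using the general PIR layout (a ">P1;" identifier line, a description line, and a sequence terminated by "*"). The function name `fasta_to_pir` is hypothetical, and whether the server expects any further details of the format is an assumption to verify.

```python
def fasta_to_pir(fasta, width=60):
    """Convert a single-sequence FASTA record into a PIR record."""
    lines = [line.strip() for line in fasta.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA header line")
    header = lines[0][1:].strip()
    # First word of the header becomes the identifier, the rest the description.
    name, _, description = header.partition(" ")
    name = name or "query"  # placeholder identifier when the header is empty
    sequence = "".join(lines[1:]).replace(" ", "").rstrip("*")
    if not sequence:
        raise ValueError("no sequence found after the header")
    sequence += "*"  # PIR sequences are terminated by an asterisk
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return ">P1;%s\n%s\n%s\n" % (name, description.strip(), body)


print(fasta_to_pir(">q1 test protein\nAPVLSK\nDVADIE\n"))
```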