Gendis Central Help

About

Genomic Distribution of structural Superfamilies (GenDiS; Pugalenthi G, Bhaduri A and Sowdhamini R, 2005) identifies and classifies sequence homologues of structural members defined at the SCOP superfamily level. This version, GenDiS+, corresponds directly to SCOP 1.75, PASS2.4, NR (May 2015) and UniProt 2016. GenDiS+ aims to associate protein sequences with homologous superfamilies of known three-dimensional structure.

The database reports the analysis of the hits at the superfamily, fold and class levels, together with their domain architecture (DA) and taxonomic occurrence. We also provide a library of the DA observed in the homologues, at the level of SCOP and Pfam, for each superfamily and organism. Compared to the previous version, we have now computed the DA of the hits at the Pfam level, identified the correspondence between the Pfam and SCOP domain definitions of the hits, and classified all the Pfam and SCOP DA for the hits of each superfamily. Users can also compare superfamily homologues from different organisms, and different DA within the same organism.

Fig. 1. A simplified workflow for the identification of sequence homologues and their validation.

Methodology

Queries were obtained from PASS2.4, which lists, for each SCOP (v1.75) superfamily, the member domains of known structure that share less than 40% sequence identity with each other. Searches were carried out against the NCBI NR database (May 2015) with an e-value and inclusion threshold of 10^-3, for 20 iterations or until convergence. Two sequence search approaches, as described in Joshi et al., 2013, were used to identify more sequence homologues:

  1. Multi-query (MQ) approach
     All PASS2.4 superfamily members were used as queries.

  2. Best representative sequence (BRS) multi-pattern (MP) approach
    1. Identification of BRS
       A PSI-BLAST search (e-value and inclusion threshold 10^-3, 20 iterations) was carried out using all PASS2.4 members as queries. The member that picked up all other SCOP members and had the highest number of true positive hits (please see the section on validation of hits) was taken to be the BRS.

    2. Identification of patterns for PHI-BLAST
       A stringent PSI-BLAST search (e-value and inclusion threshold) was carried out to identify closely related homologues of the BRS of a superfamily. The hits having 60-90% identity with the query were aligned using ClustalW2.0. The MSA was used to generate patterns in PROSITE format using an in-house tool, MOTIFS (available upon request). If more than three different residues occurred at a position, the position was denoted as X (any amino acid).

    3. PHI-BLAST search
       Searches were carried out using the BRS and each pattern as queries.
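The pattern rule described above (a position with more than three different residues becomes "any amino acid") can be illustrated with a short Python sketch. This is not the MOTIFS tool, whose behaviour is not documented here. The function name `prosite_pattern`, the handling of gap characters and the toy alignment are assumptions made for illustration. The sketch writes the wildcard as lowercase `x`, as in PROSITE syntax, and does not merge consecutive wildcards into `x(n)`.

```python
def prosite_pattern(aligned_seqs, max_residues=3):
    """Build a PROSITE-style pattern from equal-length aligned sequences.

    Assumption: a column that contains a gap, or more than `max_residues`
    different residues, is written as x (any amino acid).
    """
    if not aligned_seqs:
        raise ValueError("no sequences given")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    elements = []
    for column in zip(*aligned_seqs):
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_residues:
            elements.append("x")                      # too variable: wildcard
        elif len(residues) == 1:
            elements.append(residues[0])              # fully conserved
        else:
            elements.append("[" + "".join(residues) + "]")  # allowed set
    return "-".join(elements)


# Toy alignment (hypothetical sequences, not taken from GenDiS+).
toy_alignment = ["GKSTL", "GRSAL", "GKTCL", "GKSDL"]
pattern = prosite_pattern(toy_alignment)
print(pattern)
```

For the toy alignment, the fourth column holds four different residues and is therefore reduced to a wildcard, while the second and third columns are kept as residue sets.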


The data is arranged in three categories of superfamilies:

  1. Single-membered superfamilies (SMS): A single SCOP superfamily member in the PASS2.4 database.
  2. Two-membered superfamilies (TMS): Two SCOP superfamily members in the PASS2.4 database.
  3. Multi-membered superfamilies (MMS): More than two SCOP superfamily members in the PASS2.4 database.
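The three categories depend only on the number of PASS2.4 members in a superfamily. A minimal Python sketch of that rule follows; the function name `superfamily_category` and the member counts in the example are hypothetical.

```python
from collections import Counter


def superfamily_category(n_members):
    """Return SMS, TMS or MMS for a superfamily with n_members PASS2.4 members."""
    if n_members < 1:
        raise ValueError("a superfamily needs at least one member")
    if n_members == 1:
        return "SMS"
    if n_members == 2:
        return "TMS"
    return "MMS"


# Hypothetical member counts keyed by made-up superfamily labels.
member_counts = {"sf_A": 1, "sf_B": 2, "sf_C": 7, "sf_D": 1}
categories = {name: superfamily_category(n) for name, n in member_counts.items()}
summary = Counter(categories.values())
print(categories, dict(summary))
```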

Fig. 2. The work-flow followed for sequence search.

Each hit was validated using structure-based sequence alignments from PASS2.4.

  1. Creation of HMM libraries
     A structure-based sequence alignment is available for all superfamilies with two or more members in PASS2.4. We also used the sequences of all PASS2.4 superfamily members to create HMM libraries for each superfamily.

    1. SF-HMM
       Alignment of all PASS2.4 members of a superfamily.

    2. SQ-HMM
       Sequences of the PASS2.4 members of a superfamily.

    SMS have only SQ-HMM libraries.

    The following table shows the statistics of the SQ-HMM and SF-HMM components of the superfamily HMM libraries:

    Type of HMM library   Number of SMS HMM   Number of TMS HMM   Number of MMS HMM   Total
    SF-HMM                -                   366                 714                 1180
    SQ-HMM                864                 732                 8973                10,569
    Total                 864                 1098                9687                11,794

  2. Screening the hits for superfamily HMM matches
     HMMSCAN from HMMER3.1 was used with an e-value of 0.01. All the domain matches that met an independent e-value (i e-value) threshold of 10^-2 and an HMM model coverage threshold of 0.7 were extracted and are provided on the superfamily page, in the structural domains tab.

    Assigning a superfamily domain to a region:

    1. An SF-HMM or SQ-HMM belonging to the same superfamily as the query should match the sequence.
    2. For matches of HMMs belonging to the same superfamily, the SF-HMM is given precedence over the SQ-HMM.
    3. If HMMs from different superfamilies match the same region of a sequence, the HMM with the lower i e-value is given precedence. An overlap of 25 residues is allowed.
    4. Discontinuous domains were identified in HMM matches with breaks, using the HMM and envelope co-ordinates.
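The precedence rules for assigning a superfamily domain to a region can be sketched in Python as below. This is one possible reading of the rules, not the GenDiS+ pipeline itself. The `Hit` record, the two-stage resolution order, the use of the same 25-residue allowance within a superfamily and the toy hits are assumptions. The query-superfamily check and discontinuous domains are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    superfamily: str   # label of the superfamily the matching HMM belongs to
    kind: str          # "SF" (alignment-based) or "SQ" (sequence-based)
    start: int         # first residue of the match, 1-based inclusive
    end: int           # last residue of the match, inclusive
    i_evalue: float    # independent e-value of the domain match
    coverage: float    # fraction of the HMM model covered by the match


def overlap(a, b):
    """Number of residues shared by two matches."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def keep_best(hits, priority, allowed_overlap):
    """Accept hits in priority order, skipping those that overlap a kept hit too much."""
    kept = []
    for hit in sorted(hits, key=priority):
        if all(overlap(hit, other) <= allowed_overlap for other in kept):
            kept.append(hit)
    return kept


def assign_domains(hits, max_i_evalue=1e-2, min_coverage=0.7, allowed_overlap=25):
    passed = [h for h in hits
              if h.i_evalue <= max_i_evalue and h.coverage >= min_coverage]

    # Stage 1: within a superfamily, SF-HMM matches outrank SQ-HMM matches.
    by_superfamily = {}
    for hit in passed:
        by_superfamily.setdefault(hit.superfamily, []).append(hit)
    survivors = []
    for group in by_superfamily.values():
        survivors.extend(
            keep_best(group, lambda h: (h.kind != "SF", h.i_evalue), allowed_overlap))

    # Stage 2: between superfamilies, the lower i e-value wins.
    final = keep_best(survivors, lambda h: h.i_evalue, allowed_overlap)
    return sorted(final, key=lambda h: h.start)


# Toy hits with made-up superfamily labels.
hits = [
    Hit("sf_A", "SQ", 10, 120, 1e-30, 0.90),   # loses to the SF-HMM of sf_A
    Hit("sf_A", "SF", 12, 118, 1e-20, 0.85),
    Hit("sf_B", "SQ", 100, 200, 1e-10, 0.80),  # overlaps sf_A by 19 residues: kept
    Hit("sf_C", "SF", 150, 260, 1e-5, 0.95),   # overlaps sf_B by 51 residues: dropped
    Hit("sf_D", "SF", 300, 400, 1e-8, 0.50),   # fails the coverage threshold
]
domains = assign_domains(hits)
print([(d.superfamily, d.kind, d.start, d.end) for d in domains])
```

Resolving conflicts in two stages keeps the two precedence rules separate, so an SF-HMM match is never displaced by an SQ-HMM match of its own superfamily on e-value alone.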

Fig. 3. A schematic representation of the validation using the HMM libraries.

DA was computed at the structural (SCOP) and sequence (Pfam) level for the hits which passed the above validation.

Pfam DA was calculated as follows:

  1. HMMSCAN was carried out against the Pfam (v28.0) HMM libraries using an e-value of 10^-2.
  2. Matches that met an i e-value threshold of 1 and a Pfam family HMM coverage threshold of 0.7 were considered.
  3. Overlaps were resolved using i e-value.
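As an illustration of the three steps above, the following Python sketch turns a list of domain matches into a DA string. Several details are assumptions rather than documented behaviour: coverage is taken as the matched span of the model divided by the model length, no residue overlap is tolerated between Pfam matches, and the DA is written as family names joined by `~`. The family names and coordinates are hypothetical.

```python
def model_coverage(hmm_from, hmm_to, model_length):
    """Fraction of the HMM model covered by a match (assumed definition)."""
    return (hmm_to - hmm_from + 1) / model_length


def pfam_architecture(hits, max_i_evalue=1.0, min_coverage=0.7):
    """Return the DA of one sequence as an N- to C-terminal string of families."""
    passed = [
        h for h in hits
        if h["i_evalue"] <= max_i_evalue
        and model_coverage(h["hmm_from"], h["hmm_to"], h["model_length"]) >= min_coverage
    ]
    kept = []
    # Resolve overlaps by i e-value: the best match claims its region first.
    for hit in sorted(passed, key=lambda h: h["i_evalue"]):
        if all(hit["end"] < k["start"] or hit["start"] > k["end"] for k in kept):
            kept.append(hit)
    kept.sort(key=lambda h: h["start"])
    return "~".join(h["family"] for h in kept)


# Toy matches with made-up family names.
toy_hits = [
    {"family": "PF_A", "start": 5, "end": 90, "hmm_from": 1, "hmm_to": 95,
     "model_length": 100, "i_evalue": 1e-12},
    {"family": "PF_B", "start": 60, "end": 150, "hmm_from": 10, "hmm_to": 100,
     "model_length": 110, "i_evalue": 1e-3},   # overlaps PF_A, worse i e-value
    {"family": "PF_C", "start": 200, "end": 300, "hmm_from": 1, "hmm_to": 40,
     "model_length": 100, "i_evalue": 1e-6},   # coverage 0.4, filtered out
    {"family": "PF_D", "start": 160, "end": 250, "hmm_from": 5, "hmm_to": 98,
     "model_length": 100, "i_evalue": 0.5},
]
architecture = pfam_architecture(toy_hits)
print(architecture)
```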

To understand the diversity of homologues for a given superfamily, the domain regions identified by SF-HMM and SQ-HMM matches were extracted and aligned using CLUSTAL Omega. The domain alignments can be downloaded for each superfamily.

There are 314 superfamilies with a single domain architecture, of which 310 have a single-domain DA. For the other 1646 superfamilies, the diversity of the homologues was examined at the level of structural (SCOP) DA. A distance matrix of the DA was computed using the Alignment-free Domain Architecture Similarity Search (ADASS) tool (available upon request). DAD trees were constructed using Neighbor Joining (NJ) from PHYLIP and are available for download.
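To show what a distance matrix over domain architectures looks like, the Python sketch below uses the Jaccard distance between the sets of domains in two DA. This is a stand-in measure chosen for illustration and is not the ADASS method, which is not described here. It ignores domain order and repeats. The DA names and domain labels are hypothetical.

```python
def jaccard_distance(da_a, da_b):
    """1 - |shared domains| / |all domains| for two domain architectures."""
    a, b = set(da_a), set(da_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def distance_matrix(architectures):
    """Return (names, matrix) with one row and column per architecture."""
    names = sorted(architectures)
    matrix = [[jaccard_distance(architectures[r], architectures[c]) for c in names]
              for r in names]
    return names, matrix


# Toy architectures: each is a list of made-up domain labels.
architectures = {
    "DA1": ["dom_A"],
    "DA2": ["dom_A", "dom_B"],
    "DA3": ["dom_B", "dom_C"],
}
names, matrix = distance_matrix(architectures)
for name, row in zip(names, matrix):
    print(name, ["%.3f" % value for value in row])
```

A symmetric matrix of this kind is the usual input to a neighbour-joining program.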

From the BLAST results, alternate accession identifiers having identical sequences were extracted. We also extracted taxonomic details for all the hits. The different accessions for the hits were retained to gather information about the different strains and variants that contained the homologue of the superfamily domain.
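The idea of keeping alternate accessions for identical sequences can be sketched in Python as follows. The source extracts these from the BLAST results; the sketch instead groups records by sequence, which is a simplification. The function name `group_identical` and the accession labels are hypothetical.

```python
from collections import defaultdict


def group_identical(records):
    """Map the first accession seen for each sequence to its alternate accessions.

    records: iterable of (accession, sequence) pairs.
    """
    by_sequence = defaultdict(list)
    for accession, sequence in records:
        by_sequence[sequence.upper()].append(accession)
    return {accessions[0]: accessions[1:] for accessions in by_sequence.values()}


# Toy records with made-up accession labels.
records = [
    ("ACC_1", "MKTAYIAK"),
    ("ACC_2", "MSTNPKPQ"),
    ("ACC_3", "mktayiak"),   # same sequence as ACC_1, different case
]
alternates = group_identical(records)
print(alternates)
```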

For each hit, we provide NCBI accession identifiers and identifiers from other databases such as SCOP v1.75, SCOPe v2.06, UniProt and InterPro, taken from the mapping files provided by UniProt and SCOP. PDB details were extracted from the sequence defline as provided by NCBI.

Using the website

Users can browse the data using:

  1. SCOP hierarchy
     The database is organised according to the SCOP (Structural Classification of Proteins) hierarchy. The levels are:

    • Superfamily
      Protein families sharing common structural features are grouped under the same superfamily. As the first level of the hierarchy, the superfamily browsing page lists all the protein superfamilies along with a short description and other details. Internal (GenDiS+) links to each superfamily's details are also provided.

      Fig. . Screenshot help 1.
      • The superfamily details page provides links to the domain architectures (at both the Pfam and SCOP superfamily levels) of the protein homologues, genome and taxa details, and details of the structural members of the superfamily.
      • Links to download the PASS2 HMM files, extracted domain regions and full-length sequences are also available. Two types of HMM files are available for each superfamily: alignment of all PASS2 superfamily members (SF-HMM) and sequences of all PASS2 members (SQ-HMM).

        Fig. . Screenshot help 2.
    • Fold
      Protein superfamilies sharing common structural features are grouped under the same fold. When browsing by fold, users can view the fold to which each superfamily belongs, along with the fold description and the class to which the fold belongs. External links to the SCOP database are provided for each fold and class.

      Fig. . Screenshot help 3.
    • Class
      Folds sharing common structural features are grouped under the same class. When browsing by class, users can view the class to which each fold belongs, along with the class description and the superfamilies that fall under that class. External links to the SCOP database are provided for each class and fold.

      Fig. . Screenshot help 4.
  2. NCBI Taxonomy
     Users can browse the database by NCBI taxonomy. The three levels are:

    1. Genome (NCBI organism name: scientific name, with or without strain information)
       The browse-by-genome page has a search bar with several parameters for advanced search. After selecting a parameter and entering a complete or partial query in the search bar, the user is redirected to the results page.

    2. Phyla
    3. Class

The users may search the entire database at different levels:

  1. Using specific keywords.
  2. Using superfamily name or code.
  3. Using NCBI accession IDs to search for homologues or other information in Pfam or SCOP domain architecture.


For ease of searching, each table also has its own local search bar in addition to the global search tool.

For each superfamily, the following are provided: SCOP superfamily, fold and class codes and descriptions; the number of organisms in which homologues are found; the number of hits; the number of true positives; the total number of accession IDs; the number of SCOP DA and associated domains; the number of Pfam DA; and the average domain size. Users can click the taxonomy and DA tabs to obtain detailed information. The DAD tree, SF-HMM and SQ-HMM files, domain sequences and domain alignments are available for download.

  1. Domain architecture prediction
     Users can upload a sequence to obtain structural domain predictions. All the matching domains are shown, and the results can be viewed at different values of i e-value and HMM model coverage.

     Please contact us if you would like to run the GenDiS+ pipeline for a large number of sequences.

  2. Align domains from genomes
     Users can view domain alignments from two different genomes for a user-specified superfamily.

  3. Align sequences with domains
     Users can align their sequences with homologues of a superfamily from different genomes.

Please refer to the Downloads tab for a detailed description of the files for download. The user can download the following files:

  1. Combined SF-HMM and SQ-HMM files for all superfamilies.
  2. All sequence homologues identified in GenDiS+.
  3. Taxonomic details file which provides details for all true positive sequence accession identifiers in different organisms with the superfamily details. Our validated sequences belonged to 67,377 organisms out of ~160,000 organisms listed in NCBI Taxonomy.
  4. SCOP DA details file which contains details of all the SCOP DA assigned to true positive sequences identified in the study.
  5. Pfam DA details file which provides Pfam DA details for all true positive sequences.
  6. All the true positive sequences.
  7. File containing the domain sequences extracted from the true positive sequences.
  8. Sequence alignment of the domain sequences for each superfamily.