Help

Overview

GenDiS 3.0 is a database designed to bridge the gap between vast amounts of sequence data and known protein structures. The database catalogs over 151 million homologous sequences from 2,060 superfamilies, helping researchers perform functional annotations and evolutionary studies.

Below is the workflow diagram that outlines the process for searching and validating homologues in the GenDiS 3.0 database:

GenDiS Workflow

The GenDiS 3.0 database offers an intuitive interface for browsing data by:

  • Superfamily: View a list of all superfamilies, including SCOP codes, descriptions, and associated data. The interface allows searching and filtering by description or SCOP ID.
  • Fold: Browse folds and their associated superfamilies. Each fold entry includes a dropdown to explore superfamilies within the selected fold.
  • Class: Explore broader structural classes containing folds and their respective superfamilies.

Screenshots:

Browse Dropdown Menu

Fig: Dropdown menu to browse by superfamily, fold, or class.

Superfamily Table

Fig: Superfamily table showing SCOP codes and descriptions.

Fold Table with Dropdowns

Fig: Fold table with expandable dropdowns to view associated superfamilies.

Class Table

Fig: Class table showing folds and their hierarchical organization.

Each superfamily in GenDiS 3.0 has a detailed page that provides comprehensive data and tools, including:

  • Summary: Basic information about the superfamily, including SCOP code, description, and sequence statistics.
  • Taxonomic Distribution: A colorful sunburst chart visualizing the taxonomy of sequences within the superfamily.
  • Domain Architectures: Interactive visualizations of SCOP and Pfam domain architectures. Clicking on a domain redirects to the respective database for more details.
  • SF homologues: A table of all homologues identified for the superfamily. Users can search by NCBI accession ID or NCBI taxonomy ID, and each entry links to the corresponding NCBI page.

Screenshots:

Taxonomic Distribution Sunburst Chart

Fig: Taxonomic distribution visualized as a sunburst chart.

Domain Architecture Visualization

Fig: SCOP domain architecture visualization.

Superfamily Homologues Table

Fig: Table of homologues for the selected superfamily.

Methodology

GenDiS 3.0 uses sequences from SCOPe version 2.08, filtered at 40% sequence identity, as queries. DELTA-BLAST searches these queries against the NCBI Non-Redundant (NR) protein database (September 2022 version), with an E-value threshold of 0.001 for significant hits.
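
The E-value filter described above can be sketched in a few lines. This is only an illustration, not the GenDiS pipeline itself: it assumes the hits are available in the standard 12-column BLAST+ tabular layout (-outfmt 6), that the threshold is inclusive, and the accessions in the example rows are made up.

```python
from typing import Iterable, List, Tuple

EVALUE_CUTOFF = 1e-3  # threshold quoted in the text for significant hits


def significant_hits(
    lines: Iterable[str], cutoff: float = EVALUE_CUTOFF
) -> List[Tuple[str, str, float]]:
    """Return (query, subject, evalue) for hits at or below the cutoff.

    Assumption: 12-column BLAST+ tabular output, where the 11th column
    (index 10) holds the E-value.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= cutoff:  # inclusive comparison is an assumption
            hits.append((fields[0], fields[1], evalue))
    return hits


# Hypothetical rows for illustration only; not real GenDiS results.
example = [
    "d1cvra1\tWP_000001.1\t35.2\t200\t120\t4\t1\t198\t10\t205\t2e-30\t120",
    "d1cvra1\tWP_000002.1\t22.0\t90\t65\t3\t5\t92\t40\t128\t0.5\t30.1",
]
print(significant_hits(example))
```

Only the first row survives the filter, because the E-value of the second row (0.5) is above 0.001.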

Time Required for DELTA-BLAST and HMMSCAN:

Class                  Total #SFs  Total DELTA-BLAST hits  Total validated hits  DELTA-BLAST time  HMMSCAN time
Mostly alpha           519         33M                     20M                   ~4000 h           ~624 h
Mostly beta            374         36M                     21M                   ~3500 h           ~984 h
Alpha and beta         578         48M                     34M                   ~1900 h           ~840 h
Alpha or beta          247         67M                     53M                   ~3600 h           ~1512 h
Small proteins         73          5M                      3M                    ~500 h            ~260 h
Multidomain proteins   130         5M                      3M                    ~620 h            ~380 h
Membrane proteins      139         6M                      4M                    ~500 h            ~400 h
Total                  2060        202M                    142M                  ~6 months         ~4 months

Overall, the DELTA-BLAST searches took about 6 months and the HMMSCAN validation about 4 months, reflecting the computational cost of searching and validating hundreds of millions of candidate hits.
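
The hit counts in the table also show how strongly the validation step prunes the initial search results in each class. The sketch below computes the retained fraction per class. The numbers are the rounded values (in millions) copied from the table, so the percentages are approximate.

```python
# (class, DELTA-BLAST hits in millions, validated hits in millions),
# copied from the table above; values are rounded, so ratios are approximate.
ROWS = [
    ("Mostly alpha", 33, 20),
    ("Mostly beta", 36, 21),
    ("Alpha and beta", 48, 34),
    ("Alpha or beta", 67, 53),
    ("Small proteins", 5, 3),
    ("Multidomain proteins", 5, 3),
    ("Membrane proteins", 6, 4),
]


def validated_fraction(hits: float, validated: float) -> float:
    """Fraction of initial hits that remain after validation."""
    if hits <= 0:
        raise ValueError("hits must be positive")
    return validated / hits


for name, hits, validated in ROWS:
    print(f"{name:<22}{validated_fraction(hits, validated):.0%}")
```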

The initial hits from DELTA-BLAST are validated with HMMSCAN against Hidden Markov Models (HMMs) from the PASS2 database. In addition, single-query HMMs are built from Astral sequences at 70% sequence identity. The HMMSCAN E-value threshold for validation is 0.01, which helps filter out false positives.
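
A minimal sketch of this validation logic is shown below. It assumes that a candidate counts as validated when an HMM hit at or below the 0.01 threshold points to the same superfamily that the sequence search assigned. The exact acceptance rule used by GenDiS is not stated here, and the `HmmHit` type and the `XP_` accessions are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Set

HMM_EVALUE_CUTOFF = 1e-2  # validation threshold quoted in the text


class HmmHit(NamedTuple):
    accession: str    # sequence that was scanned
    superfamily: str  # superfamily of the matching HMM
    evalue: float


def validate(
    candidates: Dict[str, str],
    hmm_hits: Iterable[HmmHit],
    cutoff: float = HMM_EVALUE_CUTOFF,
) -> Set[str]:
    """Return accessions whose HMM hit confirms the candidate superfamily.

    `candidates` maps accession -> superfamily assigned by the sequence
    search. Requiring the superfamilies to match is an assumption.
    """
    validated = set()
    for hit in hmm_hits:
        same_sf = candidates.get(hit.accession) == hit.superfamily
        if same_sf and hit.evalue <= cutoff:
            validated.add(hit.accession)
    return validated


# Hypothetical data for illustration only.
candidates = {"1CVR_A": "52129", "XP_0001": "52129", "XP_0002": "52129"}
hmm_hits = [
    HmmHit("1CVR_A", "52129", 1e-40),   # confirmed
    HmmHit("XP_0001", "52129", 0.3),    # too weak
    HmmHit("XP_0002", "81296", 1e-12),  # different superfamily
]
print(validate(candidates, hmm_hits))
```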

HMMSCAN is also used to determine the domain architectures of homologous sequences. The PASS2 HMM library (versions 2.4 and 2.7) and the Pfam-A HMM library are utilized to map domain architectures. This step is critical in identifying both single-domain and multi-domain architectures.
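
One common way to turn a set of HMM domain hits into a domain architecture is to keep the best-scoring hits that do not overlap and then order them along the sequence. The sketch below illustrates that idea. It is not necessarily the overlap-resolution rule used by GenDiS, and the model names and the E-value cutoff parameter are assumptions.

```python
from typing import Iterable, List, NamedTuple


class DomainHit(NamedTuple):
    model: str   # HMM name (hypothetical labels in the example)
    start: int   # first aligned residue, 1-based
    end: int     # last aligned residue, inclusive
    evalue: float


def resolve_architecture(
    hits: Iterable[DomainHit], cutoff: float = 1e-2
) -> List[DomainHit]:
    """Greedily keep the most significant non-overlapping hits.

    Assumption: hits are ranked by E-value and any positional overlap
    with an already accepted hit disqualifies a weaker hit.
    """
    chosen: List[DomainHit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > cutoff:
            continue
        if all(hit.end < c.start or hit.start > c.end for c in chosen):
            chosen.append(hit)
    # Report domains in N-terminal to C-terminal order.
    return sorted(chosen, key=lambda h: h.start)


# Hypothetical hits on a single sequence.
example_hits = [
    DomainHit("sf_A", 111, 350, 1e-50),
    DomainHit("sf_B", 351, 432, 1e-20),
    DomainHit("sf_C", 300, 400, 1e-5),  # overlaps both stronger hits
]
architecture = resolve_architecture(example_hits)
print("-".join(h.model for h in architecture))
```

A sequence with a single accepted hit yields a single-domain architecture, while two or more accepted hits yield a multi-domain architecture.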

Tools and Features

The Predict Domain Architecture tool predicts the domain architecture of user-submitted sequences using SCOPe or Pfam. It can annotate new sequences or verify the domain structure of known proteins, supporting functional annotation efforts.

Screenshot of Predict Domain Architecture Tool:

Predict Domain Architecture Tool Interface

Fig: Predict Domain Architecture Tool Interface.

Tool Options:

Users must select between two options for domain architecture prediction:

  • SCOP: Uses SCOP hidden Markov models (HMMs) for domain prediction.
  • Pfam: Utilizes Pfam HMMs for domain prediction.

Help Button Functionality:

The "Help" button provides guidance on valid FASTA formats required for input, similar to other tools.
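
Before submitting a sequence, it can be useful to check locally that the input is well-formed FASTA. The sketch below is one possible check. The exact rules that the GenDiS input forms enforce are not documented here, so the accepted residue alphabet and the error conditions are assumptions.

```python
from typing import Dict

# Assumption: 20 standard amino acids plus common ambiguity codes and '*'.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse protein FASTA text into {identifier: sequence}.

    Raises ValueError for missing headers, duplicate identifiers,
    unexpected characters, or records without sequence.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line has no identifier")
            current = parts[0]
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = ""
        else:
            if current is None:
                raise ValueError("sequence data before first '>' header")
            residues = line.upper()
            if not set(residues) <= VALID_RESIDUES:
                raise ValueError(f"unexpected characters in: {line}")
            records[current] += residues
    if not records:
        raise ValueError("no FASTA records found")
    empty = [name for name, seq in records.items() if not seq]
    if empty:
        raise ValueError(f"records without sequence: {empty}")
    return records


print(parse_fasta(">query1 example protein\nMKVLAA\nGTRW\n"))
```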

The Align Two Genomes tool compares the superfamily hits between two different genomes to explore evolutionary conservation and divergence. It is particularly useful for researchers studying comparative genomics.

Screenshot of Align Two Genomes Tool:

Align Two Genomes Tool Interface

Fig: Align Two Genomes Tool Interface.

Dropdown Functionality:

This tool features three dropdowns:

  • Superfamily Dropdown: Allows the user to select a superfamily.
  • Genome 1 Dropdown: Populates with the genomes available in the selected superfamily for comparison.
  • Genome 2 Dropdown: Populates similarly to Genome 1, enabling selection of the second genome for alignment.

The interaction ensures precise and meaningful alignment between the chosen genomes within the selected superfamily.
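
The dependent dropdown behavior can be mimicked offline with the downloadable homologue table. The sketch below groups homologues by superfamily and genome, lists the genomes available for a superfamily, and collects the hits for two chosen genomes. Treating the NCBI taxonomy ID as the genome identifier is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Index = Dict[str, Dict[str, Set[str]]]


def index_homologues(rows: Iterable[Tuple[str, str, str]]) -> Index:
    """Build superfamily -> taxid -> accessions from (sf_id, accession, taxid) rows."""
    index: Index = defaultdict(lambda: defaultdict(set))
    for sf_id, accession, taxid in rows:
        index[sf_id][taxid].add(accession)
    return index


def genomes_for(index: Index, sf_id: str) -> List[str]:
    """Genomes (taxids) with at least one hit in the superfamily."""
    return sorted(index.get(sf_id, {}))


def hits_for_pair(index: Index, sf_id: str, genome1: str, genome2: str) -> Dict[str, List[str]]:
    """Accessions of superfamily hits in each of the two selected genomes."""
    by_genome = index.get(sf_id, {})
    return {g: sorted(by_genome.get(g, set())) for g in (genome1, genome2)}


# Rows taken from the example homologue file shown in this help page.
rows = [
    ("52129", "1CVR_A", "837"),
    ("52129", "1F1J_A", "9606"),
    ("52129", "1F9E_A", "9606"),
]
index = index_homologues(rows)
print(genomes_for(index, "52129"))
print(hits_for_pair(index, "52129", "837", "9606"))
```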

The Align User Sequence with Genome tool lets users align their own sequence against the sequences from a genome of a given superfamily to identify homologous regions and potential functional elements.

Screenshot of Align Tool:

Align User Sequence with Genome Tool

Fig: Align User Sequence with Genome Tool Interface.

Dropdown Functionality:

This tool features two dropdowns:

  • Superfamily Dropdown: Allows the user to select a superfamily.
  • Genome Dropdown: Gets populated dynamically based on the selected superfamily, showing the genomes available within the chosen superfamily.

The dropdown interaction ensures that users only work with relevant genomes for their selected superfamily, streamlining the alignment process.

Help Button Functionality:

The "Help" button provides a pop-up description of valid FASTA formats required for input, similar to the BLAST search tool.

The BLAST Search Superfamily tool lets users run a BLAST search against the homologous sequences in the GenDiS database. It can help identify homologues by comparing a query sequence to those within superfamilies.

Screenshot of BLAST Search Tool:

BLAST Search Superfamily Tool

Fig: BLAST Search Superfamily Tool Interface.

Dropdown Selection and Performance Warning:

Superfamily Dropdown Selection

Fig: Superfamily Dropdown for BLAST Search.

The dropdown menu allows users to select a superfamily and displays the number of sequences that will be searched. This number directly affects the search time. For example, a search against 30,000 sequences may take approximately 20 minutes.
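
For a rough idea of how long a search may take, the single reference point above can be extrapolated. The sketch assumes that search time scales linearly with the number of sequences, which the help page does not state. Actual times will also depend on query length and server load.

```python
REFERENCE_SEQUENCES = 30_000  # example size quoted in the text
REFERENCE_MINUTES = 20.0      # approximate time quoted for that size


def estimated_minutes(n_sequences: int) -> float:
    """Linear extrapolation from the single reference point (an assumption)."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return n_sequences / REFERENCE_SEQUENCES * REFERENCE_MINUTES


for n in (3_000, 30_000, 60_000):
    print(f"{n:>7} sequences: about {estimated_minutes(n):.0f} min")
```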

Help Button Functionality:

The "Help" button provides a pop-up description of valid FASTA formats required for input.

Users can download the list of homologues identified by GenDiS 3.0 for each SCOPe superfamily, along with the SCOP domain architectures of those homologues.

Screenshots of Download Options:

Superfamily sequence homologues download

Fig: Superfamily sequence homologues identified in GenDiS.

Superfamily SCOP domain architectures download

Fig: Superfamily SCOP Domain Architectures identified in GenDiS.

Description of Output Files:

The downloaded files open as text in a new tab and are formatted as TSV files. Below are the descriptions of the two types of output files:

  • Superfamily Sequence Homologues:
    • Columns: #SCOPe_sf_id, ncbi_accession, ncbi_taxid
    • Description: This file lists the homologous sequences identified in GenDiS for a selected superfamily. Each row corresponds to a sequence and provides its SCOPe superfamily ID, NCBI accession number, and taxonomy ID.
    • Example:
      #SCOPe_sf_id	ncbi_accession	ncbi_taxid
      52129	1CVR_A	837
      52129	1F1J_A	9606
      52129	1F9E_A	9606
  • SCOP Domain Architectures:
    • Columns: ncbi_accession_id, scop_da, domain_boundaries, length
    • Description: This file provides the SCOP domain architectures for the sequences, including domain boundaries and total sequence length. The positions in the domain boundaries correspond to residue positions in the protein sequences obtained from NCBI (identified by NCBI accession ID).
    • Example:
      #ncbi_accession_id	scop_da	domain_boundaries	length
      1CVR_A	52129-81296 d1cvra1	111-350-351-432	435
      1F9E_A	52129_pass27	1-152	153
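
The downloaded domain architecture file can be loaded with a short script. The sketch below is based only on the example rows above. It assumes that the fields are tab-separated, that the header line starts with '#', and that the domain_boundaries column lists consecutive start-end pairs joined by hyphens (so 111-350-351-432 is read as two domains, 111-350 and 351-432). The scop_da value is kept as an opaque string.

```python
from typing import List, NamedTuple, Tuple


class Architecture(NamedTuple):
    accession: str
    scop_da: str
    boundaries: List[Tuple[int, int]]
    length: int


def parse_boundaries(text: str) -> List[Tuple[int, int]]:
    """Split 'a-b-c-d' into [(a, b), (c, d)] (pairing is an assumption)."""
    numbers = [int(part) for part in text.split("-")]
    if len(numbers) % 2 != 0:
        raise ValueError(f"odd number of boundary positions: {text}")
    return list(zip(numbers[0::2], numbers[1::2]))


def parse_architectures(tsv_text: str) -> List[Architecture]:
    """Parse the domain architecture TSV, skipping '#' header lines."""
    records = []
    for raw in tsv_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        accession, scop_da, bounds, length = line.split("\t")
        records.append(
            Architecture(accession, scop_da, parse_boundaries(bounds), int(length))
        )
    return records


# Rows copied from the example above.
example_tsv = (
    "#ncbi_accession_id\tscop_da\tdomain_boundaries\tlength\n"
    "1CVR_A\t52129-81296 d1cvra1\t111-350-351-432\t435\n"
    "1F9E_A\t52129_pass27\t1-152\t153\n"
)
for record in parse_architectures(example_tsv):
    print(record.accession, record.boundaries, record.length)
```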