GenDiS 3.0

GenDiS 3.0 is a database designed to bridge the gap between vast amounts of sequence data and known protein structures. The database catalogs over 151 million homologous sequences from 2,060 superfamilies, helping researchers perform functional annotations and evolutionary studies.

Below is the workflow diagram that outlines the process for searching and validating homologues in the GenDiS 3.0 database:

The GenDiS 3.0 database offers an intuitive interface for browsing data by:

Superfamily: View a list of all superfamilies, including SCOP codes, descriptions, and associated data. The interface allows searching and filtering by description or SCOP ID.
Fold: Browse folds and their associated superfamilies. Each fold entry includes a dropdown to explore superfamilies within the selected fold.
Class: Explore broader structural classes containing folds and their respective superfamilies.

Screenshots:

Fig: Dropdown menu to browse by superfamily, fold, or class.

Fig: Superfamily table showing SCOP codes and descriptions.

Fig: Fold table with expandable dropdowns to view associated superfamilies.

Fig: Class table showing folds and their hierarchical organization.

Each superfamily in GenDiS 3.0 has a detailed page that provides comprehensive data and tools, including:

Summary: Basic information about the superfamily, including SCOP code, description, and sequence statistics.
Taxonomic Distribution: A colorful sunburst chart visualizing the taxonomy of sequences within the superfamily.
Domain Architectures: Interactive visualizations of SCOP and Pfam domain architectures. Clicking on a domain redirects to the respective database for more details.
SF homologues: A table of all homologues obtained for the superfamily. Users can search it by NCBI accession ID or NCBI taxonomic ID, and each entry links to the corresponding NCBI page.

Screenshots:

Fig: Taxonomic distribution visualized as a sunburst chart.

Fig: SCOP domain architecture visualization.

Fig: Table of homologues for the selected superfamily.

GenDiS 3.0 uses sequences from SCOPe version 2.08, filtered at 40% sequence identity, as queries. DELTA-BLAST runs the initial search against the NCBI Non-Redundant (NR) protein database (version September 2022), with an E-value threshold of 0.001 for significant hits.
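The search step can be sketched as a small command builder. This is a minimal illustration, not the authors' pipeline: the file names, the `cdd_delta` profile database, and the thread count are assumptions, and only the 0.001 E-value threshold comes from the text. The flags are standard BLAST+ `deltablast` options.

```python
from typing import List


def build_deltablast_command(
    query_fasta: str,
    output_tsv: str,
    nr_db: str = "nr",            # assumed name of a local NR database
    cdd_db: str = "cdd_delta",    # assumed profile database for DELTA-BLAST
    evalue: float = 0.001,        # threshold stated in the text
    threads: int = 4,             # assumed; not given in the source
) -> List[str]:
    """Return an argv list for one DELTA-BLAST run (nothing is executed here)."""
    return [
        "deltablast",
        "-query", query_fasta,
        "-db", nr_db,
        "-rpsdb", cdd_db,
        "-evalue", str(evalue),
        "-outfmt", "6",           # tabular output, one hit per line
        "-num_threads", str(threads),
        "-out", output_tsv,
    ]


# Hypothetical file names for a single SCOPe domain query.
command = build_deltablast_command("d1cvra1.fasta", "d1cvra1_hits.tsv")
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine with BLAST+ and the databases installed.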

Time Required for DELTA-BLAST and HMMSCAN:

| Class | Total #SFs | Total DELTA-BLAST Hits | Total Validated Hits | Time Required for DELTA-BLAST | Time Required for HMMSCAN |
|---|---|---|---|---|---|
| Mostly alpha | 519 | 33M | 20M | ~4000 h | ~624 h |
| Mostly beta | 374 | 36M | 21M | ~3500 h | ~984 h |
| Alpha and beta | 578 | 48M | 34M | ~1900 h | ~840 h |
| Alpha or beta | 247 | 67M | 53M | ~3600 h | ~1512 h |
| Small proteins | 73 | 5M | 3M | ~500 h | ~260 h |
| Multidomain proteins | 130 | 5M | 3M | ~620 h | ~380 h |
| Membrane proteins | 139 | 6M | 4M | ~500 h | ~400 h |
| Total | 2060 | 202M | 142M | ~6 months | ~4 months |

Overall, the DELTA-BLAST searches took around 6 months and the HMMSCAN validation around 4 months, reflecting the cost of analyzing hundreds of millions of candidate sequences.
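As a quick reading aid for the table above, the share of DELTA-BLAST hits that survive validation can be computed per class. The counts below are the rounded figures (in millions) copied from the table, so the percentages are approximate.

```python
# (DELTA-BLAST hits, validated hits) in millions, as listed in the table.
hits_millions = {
    "Mostly alpha": (33, 20),
    "Mostly beta": (36, 21),
    "Alpha and beta": (48, 34),
    "Alpha or beta": (67, 53),
    "Small proteins": (5, 3),
    "Multidomain proteins": (5, 3),
    "Membrane proteins": (6, 4),
    "Total": (202, 142),
}


def validated_fraction(delta_blast_hits: float, validated_hits: float) -> float:
    """Fraction of initial hits that passed validation."""
    return validated_hits / delta_blast_hits


for name, (raw, validated) in hits_millions.items():
    share = validated_fraction(raw, validated)
    print(f"{name}: {share:.0%} of DELTA-BLAST hits validated")
```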

The initial sequence hits obtained from DELTA-BLAST are validated with HMMSCAN, using Hidden Markov Models (HMMs) from the PASS2 database. Additionally, single-query HMMs are built from Astral sequences at 70% sequence identity. The HMMSCAN E-value threshold for validation is set to 0.01, which limits the number of false positives among the retained hits.
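The validation filter can be sketched as a pass over HMMER's `--tblout` table, keeping rows at or below the 0.01 threshold. This is an assumption-laden illustration: the text does not say whether the cutoff applies to the full-sequence or the per-domain E-value (the full-sequence column is used here), and the model and sequence names in the example are made up.

```python
from typing import Iterable, List, Tuple


def significant_hits(
    tblout_lines: Iterable[str], max_evalue: float = 0.01
) -> List[Tuple[str, str, float]]:
    """Return (query, model, evalue) for rows at or below the threshold."""
    kept = []
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HMMER comment lines
        fields = line.split()
        # hmmscan --tblout: col 0 = model name, col 2 = query sequence,
        # col 4 = full-sequence E-value.
        model, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            kept.append((query, model, evalue))
    return kept


# Hypothetical rows in --tblout layout.
example = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "sf_52129  -  seqA  -  1.2e-30  105.3  0.1  2.0e-30  104.6  0.1  1.0  1  0  0  1  1  1  1  -",
    "sf_81296  -  seqB  -  0.5  8.1  0.0  0.7  7.6  0.0  1.2  1  0  0  1  1  0  0  -",
]
print(significant_hits(example))
```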

HMMSCAN is also used to determine the domain architectures of homologous sequences. The PASS2 HMM library (versions 2.4 and 2.7) and the Pfam-A HMM library are utilized to map domain architectures. This step is critical in identifying both single-domain and multi-domain architectures.
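One common way to turn a set of domain hits into an architecture is to keep the best non-overlapping hits and order them along the sequence. The sketch below uses that greedy heuristic as an assumption; the source does not describe how GenDiS resolves overlapping hits. The coordinates mirror the 1CVR_A example in the download section, while the E-values and the third hit are invented for illustration.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    model: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    evalue: float


def assemble_architecture(hits: List[DomainHit]) -> List[DomainHit]:
    """Greedily keep the lowest-E-value non-overlapping hits, ordered by position."""
    chosen: List[DomainHit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        # Accept the hit only if it overlaps none of the hits kept so far.
        if all(hit.end < kept.start or hit.start > kept.end for kept in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


hits = [
    DomainHit("52129", 111, 350, 1e-40),
    DomainHit("81296", 351, 432, 1e-12),
    DomainHit("99999", 300, 400, 1e-3),  # hypothetical weaker hit overlapping both
]
architecture = assemble_architecture(hits)
print("-".join(h.model for h in architecture))
print("-".join(f"{h.start}-{h.end}" for h in architecture))
```

A sequence with one retained hit yields a single-domain architecture; two or more retained hits yield a multi-domain one.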

The Predict Domain Architecture tool lets users predict the domain architecture of their own sequences using SCOPe or Pfam models. It can annotate new sequences or verify the domain structure of known proteins, supporting functional annotation efforts.

Screenshot of Predict Domain Architecture Tool:

Fig: Predict Domain Architecture Tool Interface.

Tool Options:

Users must select between two options for domain architecture prediction:

SCOP: Uses SCOP hidden Markov models (HMMs) for domain prediction.
Pfam: Utilizes Pfam HMMs for domain prediction.

Help Button Functionality:

The "Help" button provides guidance on valid FASTA formats required for input, similar to other tools.
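Users who want to check their input before submitting can run a small local validator. This is a sketch under assumptions: the exact rules the GenDiS tools enforce are not given here, so the accepted alphabet (20 standard residues plus the ambiguity codes B, X, Z, U, O) and the error messages are illustrative choices.

```python
from typing import List, Tuple

# Assumed alphabet: standard amino acids plus common ambiguity/rare codes.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "BXZUO")


def parse_protein_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, raising ValueError if malformed."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
            if not header:
                raise ValueError(f"line {number}: empty FASTA header")
        else:
            if header is None:
                raise ValueError(f"line {number}: sequence data before any '>' header")
            residues = line.upper()
            unexpected = set(residues) - ALLOWED_RESIDUES
            if unexpected:
                raise ValueError(f"line {number}: unexpected characters {sorted(unexpected)}")
            chunks.append(residues)
    if header is not None:
        records.append((header, "".join(chunks)))
    if not records:
        raise ValueError("no FASTA records found")
    for name, sequence in records:
        if not sequence:
            raise ValueError(f"record '{name}' has no sequence")
    return records


print(parse_protein_fasta(">seq1 example\nMKT\nAAL\n"))
```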

The Align Two Genomes tool compares the superfamily hits between two genomes to explore evolutionary conservation and divergence. It is particularly useful for researchers studying comparative genomics.

Screenshot of Align Two Genomes Tool:

Fig: Align Two Genomes Tool Interface.

Dropdown Functionality:

This tool features three dropdowns:

Superfamily Dropdown: Allows the user to select a superfamily.
Genome 1 Dropdown: Populates with the genomes available in the selected superfamily for comparison.
Genome 2 Dropdown: Populates similarly to Genome 1, enabling selection of the second genome for alignment.

Because both genome dropdowns list only genomes that have hits in the selected superfamily, every alignment is made between genomes that share that superfamily.

The Align User Sequence with Genome tool lets users align their own sequence against the sequences from a genome of a given superfamily to identify homologous regions and potential functional elements.

Screenshot of Align Tool:

Fig: Align User Sequence with Genome Tool Interface.

Dropdown Functionality:

This tool features two dropdowns:

Superfamily Dropdown: Allows the user to select a superfamily.
Genome Dropdown: Gets populated dynamically based on the selected superfamily, showing the genomes available within the chosen superfamily.

The dropdown interaction ensures that users only work with relevant genomes for their selected superfamily, streamlining the alignment process.
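The dependent dropdowns used by the alignment tools can be pictured as a lookup from superfamily to the genomes that contain at least one homologue. The sketch below is hypothetical: it assumes a genome is identified by its NCBI taxonomy ID and builds the lookup from rows shaped like the homologue download file; the site's actual back end is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def genomes_by_superfamily(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Map each superfamily ID to the taxonomy IDs that have at least one homologue."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for sf_id, _accession, taxid in rows:
        index[sf_id].add(taxid)
    return dict(index)


def genome_options(index: Dict[str, Set[str]], sf_id: str) -> List[str]:
    """Entries for the genome dropdown once a superfamily is selected."""
    return sorted(index.get(sf_id, set()), key=int)


# Rows in (SCOPe_sf_id, ncbi_accession, ncbi_taxid) order.
rows = [
    ("52129", "1CVR_A", "837"),
    ("52129", "1F1J_A", "9606"),
    ("52129", "1F9E_A", "9606"),
]
index = genomes_by_superfamily(rows)
print(genome_options(index, "52129"))
```

An unknown superfamily yields an empty list, so the genome dropdown would simply have nothing to offer.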

Help Button Functionality:

The "Help" button provides a pop-up description of valid FASTA formats required for input, similar to the BLAST search tool.

The BLAST Search Superfamily tool allows users to run a BLAST search against the homologous sequences in the GenDiS database. It can help identify homologues by comparing a query sequence with the sequences within superfamilies.

Screenshot of BLAST Search Tool:

Fig: BLAST Search Superfamily Tool Interface.

Dropdown Selection and Performance Warning:

Fig: Superfamily Dropdown for BLAST Search.

The dropdown menu allows users to select a superfamily and displays the number of sequences that will be searched. This number directly affects the search time: for example, a search against 30,000 sequences may take approximately 20 minutes.
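For a rough idea of waiting time, the single data point above (30,000 sequences in about 20 minutes) can be extrapolated. The linear scaling used below is purely an assumption for illustration; actual run times depend on query length, server load, and other factors not documented here.

```python
REFERENCE_SEQUENCES = 30_000   # from the example in the text
REFERENCE_MINUTES = 20.0       # from the example in the text


def estimated_minutes(sequence_count: int) -> float:
    """Linear extrapolation from the one reference point; a rough guide only."""
    if sequence_count < 0:
        raise ValueError("sequence_count must be non-negative")
    return sequence_count * REFERENCE_MINUTES / REFERENCE_SEQUENCES


for count in (3_000, 30_000, 90_000):
    print(f"{count} sequences: about {estimated_minutes(count):.0f} minutes")
```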

Help Button Functionality:

The "Help" button provides a pop-up description of valid FASTA formats required for input.

Users can download the list of homologues that GenDiS 3.0 identified for each SCOPe superfamily, along with the SCOP domain architectures of those homologues.

Screenshots of Download Options:

Superfamily sequence homologues download

Fig: Superfamily sequence homologues identified in GenDiS.

Superfamily SCOP domain architectures download

Fig: Superfamily SCOP Domain Architectures identified in GenDiS.

Description of Output Files:

The downloaded files open as text in a new tab and are formatted as TSV files. Below are the descriptions of the two types of output files:

Superfamily Sequence Homologues:
- Columns: #SCOPe_sf_id, ncbi_accession, ncbi_taxid
- Description: This file lists the homologous sequences identified in GenDiS for a selected superfamily. Each row corresponds to a sequence and provides its SCOPe superfamily ID, NCBI accession number, and taxonomy ID.
- Example:
```
#SCOPe_sf_id	ncbi_accession	ncbi_taxid
52129	1CVR_A	837
52129	1F1J_A	9606
52129	1F9E_A	9606
```
SCOP Domain Architectures:
- Columns: ncbi_accession_id, scop_da, domain_boundaries, length
- Description: This file provides the SCOP domain architectures for the sequences, including domain boundaries and total sequence length. The sequence positions in the domain boundaries correspond to residue positions in the protein sequences obtained from NCBI (identified by NCBI accession ID).
- Example:
```
#ncbi_accession_id	scop_da	domain_boundaries	length
1CVR_A	52129-81296 d1cvra1	111-350-351-432	435
1F9E_A	52129_pass27	1-152	153
```
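A downloaded SCOP domain architectures file can be read with a few lines of Python. This sketch assumes, based only on the two example rows above, that `domain_boundaries` is a flattened list of start-end pairs (so `111-350-351-432` means two domains, 111-350 and 351-432); the `scop_da` field is kept as an opaque string because its internal format is not documented here.

```python
import csv
import io
from typing import Dict, List, Tuple


def parse_boundaries(field: str) -> List[Tuple[int, int]]:
    """Split 'a-b-c-d' into [(a, b), (c, d)]; assumes flattened start-end pairs."""
    numbers = [int(part) for part in field.split("-")]
    if len(numbers) % 2:
        raise ValueError(f"odd number of positions in {field!r}")
    return list(zip(numbers[0::2], numbers[1::2]))


def read_architectures(text: str) -> List[Dict[str, object]]:
    """Parse the tab-separated architecture file into one dict per sequence."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        fields = [f.strip() for f in fields if f.strip()]  # tolerate stray leading tabs
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the header line
        accession, scop_da, boundaries, length = fields
        rows.append({
            "ncbi_accession_id": accession,
            "scop_da": scop_da,
            "domain_boundaries": parse_boundaries(boundaries),
            "length": int(length),
        })
    return rows


sample = (
    "#ncbi_accession_id\tscop_da\tdomain_boundaries\tlength\n"
    "1CVR_A\t52129-81296 d1cvra1\t111-350-351-432\t435\n"
    "1F9E_A\t52129_pass27\t1-152\t153\n"
)
for row in read_architectures(sample):
    print(row["ncbi_accession_id"], row["domain_boundaries"], row["length"])
```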
