HELP | 3PFDB - PSSM Profiles of Protein Families :

Sensitive sequence search techniques play a pivotal role in the post-genome era. The deluge of sequence data generated by high-throughput experiments needs to be annotated rapidly and effectively using sensitive sequence search methods in order to understand the biological implications of individual sequences. Because biochemical validation of every sequence from a genome project is impractical, bioinformatics tools are used to enhance the functional annotation of sequence data. The BLAST suite of programs is the first choice for such annotation of individual protein sequences.

Position-Specific Iterated BLAST (PSI-BLAST) is one of the most sensitive programs in the BLAST suite; it searches using Position-Specific Scoring Matrices (PSSMs). PSI-BLAST can be used effectively to measure residue conservation in a set of sequences. It finds protein sequences similar to a query sequence, constructs a PSSM from the resulting alignment, and can save the PSSM built up over its iterations. 3PFDB provides a collection of profiles generated using PSI-BLAST and processed using the FASSM method. The FASSM (Function Association using Sequence & Structure Motifs) algorithm is used to assess the ability of each individual sequence in a given sequence family to generate the PSSM profile. The method is especially useful for difficult relationships, such as discontinuous domains encountered during whole-genome surveys, and has been demonstrated to perform accurate family associations at sequence identities as low as 15%.


Methodology:

Pfam-based family-specific profiles were generated by taking each individual sequence in a family in turn as the reference sequence. The FASSM method, together with a coverage analysis score based on FASSM, is used as the filtering step to identify the best hit among the members belonging to the seed or full category. Depending on the situation, full-length sequences or domain sequences were used to generate the final profile for a given family.


PSSMs for Pfam families are selected using Coverage Analysis of the FASSM Score:

A detailed explanation of the FASSM method is available here - FASSM: Enhanced Function Association in whole genome analysis using Sequence and Structural Motifs. In Silico Biol. 5, 425-38. Each sequence from a Pfam family is used as a query against the family itself with PSI-BLAST at a relaxed E-value of 10. The PSI-BLAST alignments are then used to generate a PSSM profile for each entry.
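The search step described above could be set up roughly as follows. This is a minimal sketch that only builds a command line for the `psiblast` program of NCBI BLAST+; the text does not state which PSI-BLAST executable or how many iterations were used to build 3PFDB, so the iteration count, the function name and all file names here are assumptions for illustration.

```python
import shlex
from typing import List


def psiblast_command(query_fasta: str, family_db: str, pssm_out: str,
                     evalue: float = 10.0, iterations: int = 3) -> List[str]:
    """Build a BLAST+ psiblast command that writes an ASCII PSSM.

    evalue=10 follows the text; iterations=3 is an assumed value,
    because the source does not say how many rounds were run.
    """
    return [
        "psiblast",
        "-query", query_fasta,            # one family member used as the reference
        "-db", family_db,                 # BLAST database built from the same family
        "-evalue", str(evalue),           # relaxed E-value threshold
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_out,      # human-readable PSSM output file
    ]


# Hypothetical file names, for illustration only.
cmd = psiblast_command("family_member.fasta", "family_db", "family_member.pssm")
print(shlex.join(cmd))
```

With BLAST+ installed and the family database created with `makeblastdb`, the list can be passed to `subprocess.run(cmd, check=True)`; repeating this for every member of a family would give one candidate PSSM per reference sequence.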


3PFDB - Database Content:

The Details section of a 3PFDB entry provides the following information:

  • Pfam ID
  • Pfam Domain Description
  • Source of Profile Generation (SEED alignment derived sequences / FULL alignment derived sequences)
  • Type of sequence used for Profile Generation (Full length Sequence / Only Domain sequence)
  • Number of Sequences in the SEED/FULL dataset
  • Pairwise Identity of Sequences in SEED/FULL and Number of reference sequences assessed to generate the final profile
  • FASSM based Coverage Analysis Results
  • PSIMOT-Motifs extracted using PSIMOT routine of FASSM
  • PSIMOT Motifs marked on PSSM
  • Sequence based PCA plot of the Protein Family
  • Alignment of the Protein Family using the best representative sequence as the pilot sequence
  • Download PSSM, HMM Model, Alignment and FASSM coverage analysis files

3PFDB - Database Terminologies:

  • SEED alignment : A hand-edited multiple alignment representing the family.
  • FULL alignment : An automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
  • Independent Dataset = Full Data – Seed Data.
  • In 3PFDB, PSSMs are generated from either the Seed alignment or the Full alignment, using the FASSM Score and Coverage Analysis based on the FASSM Score.
  • Coverage of Best Profile : A profile that annotates seed queries with the highest FASSM score in the results for a family.
  • Best Profile Based on Coverage Score : This profile may not be the one with the highest FASSM probability of annotating a query, but it associates a larger set of seed sequences with the profiles of the family.

Coverage Analysis:

Seed Dataset:
A FASSM run was performed for all the independent sequences in a Pfam family, using the PSSM profiles generated from the seed sequences of the family as the FASSM library. If a PSSM profile annotates an independent sequence (irrespective of the FASSM score), this is counted as a hit for that profile. In this way, the number of hits (the number of independent sequences annotated) was calculated for every PSSM profile in the family. The number of hits of a profile was divided by the total number of independent sequences to give the coverage of that profile as a percentage. The coverage values of all the PSSM profiles were summed and divided by the total number of PSSM profiles in the family (i.e., the number of seed sequences in the family) to give the average coverage of the family. Using these values, we performed a coverage analysis to select the representative PSSM for each Pfam family. Only families with an average coverage above 50.0% were taken into the seed dataset. Within these families, the best profile was retained as the representative if its coverage was greater than 50.0%; otherwise, the PSSM with the highest coverage value was selected as the representative of the Pfam family. All the remaining families were subjected to full dataset generation.
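The arithmetic of the coverage analysis can be illustrated with a small sketch. It assumes that the FASSM hits are already available as a mapping from each seed-derived profile to the set of independent sequences it annotates; the function names, identifiers and data below are hypothetical, and the selection rule is one reading of the description above rather than the actual 3PFDB code.

```python
from typing import Dict, Optional, Set

THRESHOLD = 50.0  # percent, the cut-off quoted in the text


def profile_coverage(hits: Set[str], independent: Set[str]) -> float:
    """Percentage of independent sequences annotated by one PSSM profile."""
    if not independent:
        return 0.0
    return 100.0 * len(hits & independent) / len(independent)


def analyse_family(hits_by_profile: Dict[str, Set[str]],
                   independent: Set[str],
                   best_profile: Optional[str] = None) -> dict:
    """Return per-profile coverage, family average and chosen representative.

    best_profile is the profile with the highest FASSM score (assumed to be
    known from the FASSM results). The representative is None when the
    family average does not exceed the threshold.
    """
    coverages = {p: profile_coverage(h, independent)
                 for p, h in hits_by_profile.items()}
    average = sum(coverages.values()) / len(coverages) if coverages else 0.0
    representative = None
    if average > THRESHOLD:
        if best_profile is not None and coverages.get(best_profile, 0.0) > THRESHOLD:
            representative = best_profile
        else:
            # Fall back to the profile with the highest coverage value.
            representative = max(coverages, key=coverages.get)
    return {"coverages": coverages, "average": average,
            "representative": representative}


# Toy family: four independent sequences and three seed-derived profiles.
independent = {"s1", "s2", "s3", "s4"}
hits = {"p1": {"s1", "s2", "s3"}, "p2": {"s1", "s2"}, "p3": {"s1", "s2", "s3", "s4"}}
result = analyse_family(hits, independent, best_profile="p2")
print(result["average"], result["representative"])
```

In this toy family the coverages are 75%, 50% and 100%, so the family average is 75%. The profile with the best FASSM score (`p2`) does not exceed 50%, so the highest-coverage profile `p3` becomes the representative.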

Full Dataset:
In the full dataset, all the full sequences in the family were used, instead of the independent sequences, both for PSSM generation and for the FASSM run, and the coverage of each profile and the average coverage of the family were calculated in the same way. Only families with an average family coverage greater than 50.0% were included in the full dataset, and the PSSM profile with the highest coverage value, which must itself be greater than 50%, was selected as the representative of the family.

3PFDB Data Curation Flowchart:


Pfam Families not included in the current version of 3PFDB:

We generated 1.8 million PSSMs and performed a rigorous coverage analysis using the FASSM score. Because of poor performance at different levels of the FASSM score based coverage analysis, we excluded 794 Pfam families from the current version of 3PFDB. Various factors could contribute to this, but we have observed two main reasons.
1. Because of the large number of sequences in the independent/full sequence set, the seed-based PSSM profiles were not able to annotate all the sequences in the family, so the average family coverage fell below 50% and the family was not included in the database.
2. No PSSM profile in the family reached a coverage value of 50% (i.e., none was able to annotate half of its family members).
A list of the Pfam families not included in the current version of 3PFDB is available here.


3PFDB - Database Features:
  • FASSM based Coverage Analysis Results
  • Details about the best representative member of the Family
  • PSIMOT-Motifs extracted using PSIMOT routine of FASSM
  • PSIMOT Motifs marked on PSSM
  • Sequence based PCA plot of the Protein Family
  • Alignment of Protein Family
  • Download PSSM, HMM Model and alignment
  • Details about Pfam Families

3PFDB Database Architecture:

Applications of 3PFDB:
The availability of a wide collection of family-specific PSSMs, each created using the best representative reference sequence, will be useful for large-scale protein family analysis.
  • PSSMs are commonly used in the detection of remote homologues
  • PSSM Profiles from 3PFDB can be used as input to NCBI-BLAST
  • PSSMs corresponding to a given family can be used in an RPS-BLAST search
  • Machine Learning - PSSMs can be used as an effective feature in pattern recognition
  • PSIMOT-based motifs will be useful as fingerprints of protein families
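As an illustration of the machine-learning use listed above, the sketch below reads the score columns of a PSSM written in the ASCII layout produced by PSI-BLAST and flattens them into a feature vector. The file format of the 3PFDB downloads is not specified here, so this layout is an assumption, and the two-row sample is made up (its percentage columns are omitted for brevity).

```python
from typing import List, Tuple

# Column order used in PSI-BLAST ASCII PSSM files.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def parse_ascii_pssm(text: str) -> Tuple[str, List[List[int]]]:
    """Return the query sequence and one row of 20 log-odds scores per position."""
    residues: List[str] = []
    rows: List[List[int]] = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a position number and a one-letter residue.
        if len(parts) >= 22 and parts[0].isdigit() and len(parts[1]) == 1:
            residues.append(parts[1])
            rows.append([int(x) for x in parts[2:22]])
    return "".join(residues), rows


# Made-up sample in the assumed layout.
SAMPLE = """
Last position-specific scoring matrix computed
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 M    -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
    2 K    -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
"""

sequence, scores = parse_ascii_pssm(SAMPLE)
features = [s for row in scores for s in row]  # flat vector, 20 values per position
print(sequence, len(features))
```

Real profiles differ in length from family to family, so a fixed-size input for a classifier would need an extra step such as windowing or per-column averaging; that choice is left open here.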


3PFDB Team :
Prof. R. Sowdhamini (Contact : mini@ncbs.res.in)
K. Shameer, P. Nagarajan, Gaurav Kumar