PURE [Prediction of Unassigned REgions] Server - Help :

PURE [Prediction of Unassigned REgions] is a method to identify putative domains in the unassigned regions of proteins. PURE is now available as a web server. Users can submit the sequence of an unassigned region or linker region to predict the domains encoded in it. This page describes the input options, output details and example files available for PURE Server.


Help Index :
1. Input Method - 1 Upload Sequence File

2. Input Method - 2 Paste Sequence File

3. Details about various Input Options

4. PURE Server Output Details

5. Example Runs

6. Pre-Computed Results


Input Method : 1. Upload Sequence File
Input Option 1 : Upload a FASTA format file of the unassigned region from a protein
Step 1 : Upload the FASTA format file of the unassigned region
Step 2 : Type your valid email address
Step 3 : Select the options for PSI-BLAST Database, PSI-BLAST E-Value, CD-HIT Threshold Value, HMMPFAM E-Value and HMMSEARCH Cut-off.
Step 4 : Click the PURE button to submit the process.

Input Method : 2. Paste Sequence File
Input Option 2 : Paste the FASTA format sequence of the unassigned region in the textarea
Step 1 : Type your valid email address
Step 2 : Type Query Name
Step 3 : Paste the FASTA format sequence of the unassigned region in the textarea
Step 4 : Select the options for PSI-BLAST Database, PSI-BLAST E-Value, CD-HIT Threshold Value, HMMPFAM E-Value and HMMSEARCH Cut-off.
Step 5 : Click the PURE button to submit the process.

Output Details:

PURE Server examines unassigned regions for the presence of coiled coils, transmembrane helices, an appropriate extent of predicted secondary structural content and the presence of homologous sequences before the assignment of probable structural domains. The results of these checks are also provided as links to the location where the output is stored.

The output is divided into two main parts:
A. Consensus output:
This section of the output provides a summary of the overall results. The page displays a table listing the Indirect Domain Associations of the Unassigned Region, Domain Frequency and Pfam Link, along with a Bio::Graphics image of the results and a link to the Detailed Output page.

[Click here for the Consensus output from PURE Server for Query Sequence : Q8F152_1-566.fasta]

B. Detailed output: For a successful PURE run, this section is divided into the sections listed below (i to xi). Each of the files provided in the detailed output gives the background details behind the final result file.
[Click here for the Detailed output from PURE Server for Query Sequence : Q8F152_1-566.fasta]

i. Disorder Prediction Results:

PURE provides disorder prediction within the PURE workflow, using Globplot and Disopred. Users may filter out disordered regions and resubmit the query sequence to PURE. NB : These disorder prediction programs are likely to underpredict or overpredict disordered regions in the query sequence. Users are encouraged to analyse disordered regions using different programs before drawing a conclusion.

ii. PEPCOILS Results:

This file records the coiled coils predicted by the PEPCOILS (EMBOSS) program, which is based on the COILS algorithm. Details about the coiled coils identified in the query sequence are available in this file. Another file, in which the coiled coil regions of the query sequence are substituted with '=', is also provided.

iii. TMAP Results:
This file records the probable transmembrane helices in the query sequence, as identified by the TMAP (EMBOSS) program. Another file, in which the transmembrane regions identified in the query sequence are substituted with 'x', is also provided to make the location of transmembrane regions easier to see.

iv. Integrated Filter Results:
Both the COILS and TMAP files are processed using a Perl program. This program integrates the two predictions with the query sequence: this file provides the query sequence with coiled coil and transmembrane regions substituted with '=' and 'x' respectively. These regions are not considered for further analysis. The sequence is split into segments based on the presence of transmembrane and coiled coil regions. Only sequence segments having >70 residues are considered for further analysis, in order to avoid spurious hits in subsequent PSI-BLAST jobs.
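
The masking and splitting described above can be illustrated with a short Python sketch. This is not the Perl program used by PURE Server; the function names, the 1-based inclusive region coordinates and the assumption that the query is written in upper-case letters (so that a lower-case 'x' can only be a mask character) are all illustrative choices.

```python
import re

MIN_SEGMENT_LENGTH = 70  # segments must have more than this many residues


def mask_regions(sequence, regions, symbol):
    """Replace each 1-based inclusive (start, end) region with `symbol`."""
    chars = list(sequence)
    for start, end in regions:
        for i in range(start - 1, end):
            chars[i] = symbol
    return "".join(chars)


def split_segments(masked, min_length=MIN_SEGMENT_LENGTH):
    """Split a masked sequence on runs of '=' or 'x'.

    Returns (start, end, segment) tuples with 1-based inclusive
    coordinates, keeping only segments longer than `min_length`.
    """
    segments = []
    for match in re.finditer(r"[^=x]+", masked):
        if len(match.group()) > min_length:
            segments.append((match.start() + 1, match.end(), match.group()))
    return segments
```

For example, masking a hypothetical coiled coil with `mask_regions(seq, [(81, 100)], "=")` and then calling `split_segments` on the result returns only those unmasked stretches that are longer than 70 residues.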

v. PSIPRED Results:
We use the PSIPRED program for secondary structure prediction. Only sequence segments with at least 15% predicted secondary structural content are considered for further analysis.
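
As a rough illustration of this filter, the sketch below computes the fraction of residues predicted as helix (H) or strand (E) in a PSIPRED-style three-state string (H, E, C). The function names are hypothetical, and treating the 15% value as an inclusive lower bound is an assumption, not a documented detail of the server.

```python
def secondary_structure_fraction(ss_string):
    """Fraction of residues predicted as helix (H) or strand (E)."""
    if not ss_string:
        return 0.0
    structured = sum(1 for state in ss_string if state in "HE")
    return structured / len(ss_string)


def passes_ss_filter(ss_string, min_fraction=0.15):
    """True if the segment has enough predicted secondary structure.

    Assumption: the 15% threshold is inclusive.
    """
    return secondary_structure_fraction(ss_string) >= min_fraction
```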

vi. PSI-BLAST Result file:
Sequence segments that pass the filtering criteria above are fed to PSI-BLAST for similarity search. The PSI-BLAST output is provided so that users can check the details. An MView-based visualization of the PSI-BLAST results is also provided for better insight into them. [Click here for a sample MView output]

vii. ScanProsite Results:
PROSITE scan results are supplied as a supplement to our method. ScanProsite scans the query sequence for the occurrence of patterns, profiles and rules (motifs) stored in the PROSITE database [22].

viii. CD-HIT Results:
The CD-HIT results report the clustering of hits extracted from the PSI-BLAST search using a fixed sequence identity threshold. They also list the cluster representatives that are used for the subsequent hmmpfam search against the Pfam database.

ix. Hmmpfam Results:
The results obtained by running hmmpfam on homologous sequences, and the assignment of Pfam domains to these homologues, are provided here. Only representative sequences, as recognized after clustering by CD-HIT, are considered in the search for domains.

x. Domains assigned using PURE:
This is the major output file. It integrates the PSI-BLAST and hmmpfam search output and provides the probable domains identified using the PURE protocol for the unassigned regions in the query. A Bio::Graphics-based image is provided for a better overview of the detailed result. [Click here for a sample Bio::Graphics image from PURE Server]

xi. Concluding Remarks:
The concluding remark is derived at the end of each run after analyzing all the files that are generated. Various output possibilities and concluding remarks are discussed in the manuscript.


Example Run :

Typically, for a given sequence with a length of <200 residues, PURE Server will take about 90 minutes to process the query (HMMSEARCH Cut-off = 50). If a user would like to see an output from PURE without submitting a query, we have provided results from several examples.
Click here for the Pre-computed PURE Server results for different examples
Click here for Example Run for Input Method - 1 : Upload File
Click here for Example Run for Input Method - 2 : Paste File
Click here for Example Input files


Description about Options:

Enter your valid email address:
Please submit your valid, preferably academic, email address. Check your email address for any typos before submitting the PURE process to the server. We remind users that PURE Server uses a computationally intensive protocol, and we encourage users to submit their academic email address to the PURE Server.


Query Name:
Query name is a one-word description that users can use to recognise the query. For example: in the protein sequence NP_852793.1, residues 1-261 are an unassigned region. Users can give the identifier NP_852793.1.1-261 as the name of the query. Please avoid blank spaces; any blank spaces in the query name will be replaced with '_'.
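
The blank-space replacement can be pictured with the following minimal Python sketch. The function name is hypothetical, and the server may handle other whitespace characters differently; only the replacement of blank spaces with '_' is taken from the description above.

```python
def normalise_query_name(name):
    """Replace every blank space in a query name with '_'."""
    return name.replace(" ", "_")
```

A name such as `NP_852793.1.1-261` is returned unchanged, while `NP_852793.1 1-261` becomes `NP_852793.1_1-261`.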


Upload Sequence File:
Click on the browse button to select the sequence file (Unassigned Regions in FASTA Format only). Select the options and submit the process. Detailed description about FASTA file is available here.


Paste Sequence of Unassigned Regions in FASTA format:
Copy and paste the Unassigned Regions in FASTA Format into the given textarea. Select the options and submit the process. Detailed description about FASTA file is available here.
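
Before uploading or pasting, it can help to check locally that the text is well-formed FASTA (a '>' header line followed by one or more sequence lines). The parser below is a minimal sketch for that purpose; it is not part of PURE Server, and the function name is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Raises ValueError if sequence data appears before any '>' header.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

If `parse_fasta` returns exactly one record with a non-empty sequence, the text has the basic shape expected for a single unassigned region.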


Switch-off TMAP/COIL Filter:
PURE Server uses TMAP and COILS (EMBOSS Package) to predict transmembrane regions and coiled coils in the query sequence. Users have the option to switch off the TMAP/COIL filter, so that the query sequence is not filtered using the TMAP and COILS programs.


PSI-BLAST Search Database:
PURE Server processes the query sequence using a rigorous PSI-BLAST search. Users can select a database from the dropdown menu. To identify homologues for a given sequence, we employ BLAST for searching sequence databases. The search can be performed on SWISS-PROT, the non-redundant sequence database (NCBI-NR), SCOP - Sequence, PDB - Sequence and selected model organism genome databases. Though a search against a large database may provide a better opportunity to select homologues, such searches are time consuming.
We therefore warn that searches against the NR database may take a long time!


Databases currently available with PURE Server :
NR - Non redundant : All non-redundant GenBank CDS translations + RefSeq Proteins + PDB + SwissProt + PIR + PRF
SWISS-PROT : Last major release of the SWISS-PROT protein sequence database
pdbaa : PDB Amino Acid Sequence Database
SCOP : SCOP Sequence Database
The following Model Organism Genome Databases are also available : E.coli, A.thaliana, D.melanogaster, H.sapiens, M.musculus, S.cerevisiae, C.elegans

PSI-BLAST E-Value :
The E-value, or expectation value, is a statistical significance threshold for reporting matches against a database of sequences. If the E-value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold allows less stringent matches to be shown.
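
The effect of the threshold can be shown with a small Python sketch. The hit identifiers and E-values used with it are made up for illustration, and the function is not part of PSI-BLAST or PURE Server; it only mirrors the rule that a match whose E-value is greater than the threshold is not reported.

```python
def filter_hits(hits, evalue_threshold):
    """Keep (identifier, evalue) hits whose E-value does not exceed the threshold."""
    return [hit for hit in hits if hit[1] <= evalue_threshold]
```

With hypothetical hits `[("hitA", 1e-30), ("hitB", 0.001), ("hitC", 5.0)]`, a threshold of 0.001 keeps `hitA` and `hitB`, while a threshold of 10 keeps all three.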


CD-HIT Threshold Value :
CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence.


1    = 100% sequence identity (maximum value)
0.9
0.8
0.7
0.6
0.5
0.4  = 40% sequence identity (minimum value)


HMMPFAM - E-Value :
hmmpfam reads a single sequence from seqfile and compares it against all the HMMs in hmmfile, looking for significantly similar sequence matches. The E-value, or expectation value, is a statistical significance threshold for reporting matches against a database of sequences. If the E-value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold allows less stringent matches to be shown.


HMMSEARCH - Cut-off :
PURE Server runs a PSI-BLAST search with the query sequence after the filtering steps. After the PSI-BLAST search, users can run CD-HIT with a chosen threshold. PURE permits hmmpfam searches for at most 200 homologous sequences: because of the extensive computing time, PURE Server currently processes only the first 200 sequences after CD-HIT. Users can reduce the computing time by choosing a smaller number of sequences for the HMM search. Most of the computational time is required for this step, so running hmmpfam on a limited number of hits processes the query in less time. An average hmmpfam run takes around 2 minutes. Depending on the number of hits used for the hmmpfam search, the processing time varies between 1 hour and 3 hours.
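
To see how the cut-off influences the waiting time, the sketch below multiplies the number of sequences passed to hmmpfam by the average of about 2 minutes per run quoted above. It is a back-of-the-envelope estimate for the hmmpfam stage only; the function and constant names are hypothetical, and actual run times on the server will differ.

```python
MAX_SEQUENCES = 200      # server-side limit on sequences after CD-HIT
MINUTES_PER_RUN = 2      # approximate average duration of one hmmpfam run


def estimate_hmmpfam_minutes(num_representatives, cutoff=50):
    """Rough estimate of the time spent in the hmmpfam stage, in minutes."""
    num_runs = min(num_representatives, cutoff, MAX_SEQUENCES)
    return num_runs * MINUTES_PER_RUN
```

For instance, with 120 cluster representatives and a cut-off of 50, the estimate is 50 runs of about 2 minutes each, or roughly 100 minutes, which is of the same order as the typical processing time quoted in the Example Run section.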



Contact:
Prof. R. Sowdhamini (mini@ncbs.res.in)
Dr. Bernard Offman
Chilamakuri Chandra Sekhar Reddy
K. Shameer