Help page

Harmony

Harmony is a server for assessing the compatibility of an amino acid sequence with a proposed three-dimensional structure. Structural descriptors such as backbone conformation, solvent accessibility and hydrogen bonding are used to characterise the structural environment of each residue position. Propensity and substitution values are used together to predict the occurrence of an amino acid at each position in the sequence on the basis of the local structural environment. We demonstrate that the information from amino acid substitutions among homologous sequences (in the form of environment-dependent amino acid substitution tables) is a powerful tool for identifying errors that may be present in a protein structure.

Methodology

Derivation of propensity and substitution tables for a large dataset of aligned protein families

The tendency of an amino acid to adopt a putative structural environment is quantified by its occurrence in a large number of known structures. The structural environment of an amino acid is described by its backbone conformation (9 types), hydrogen bonding (2 types) and solvent accessibility (3 types), giving rise to 9 × 2 × 3 = 54 combinations. Raw counts of amino acid occurrence are obtained by examining more than 70 aligned homologous families. After a suitable normalization (as described in Wako and Blundell 1994), the frequencies are arranged as a propensity table. Similarly, the frequencies of amino acid replacements are first obtained from aligned homologues. Exchange frequencies are weighted by a factor related to the sequence identity between the query structure and the homologous sequence, so that closely related homologues do not bias the result. After suitable normalization, 54 amino acid exchange matrices are derived and arranged as amino acid substitution tables.
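As an illustration of how the 54 structural environments arise, the following minimal Python sketch enumerates every combination of the three descriptors and assigns each a table index. The class labels (`backbone_0`, `hbond_0`, `access_0`, and so on) and the ordering of the index are placeholders chosen for this example; this page states only the number of classes per descriptor, not their names.

```python
from itertools import product

# Placeholder labels: only the counts (9, 2 and 3) come from the text above.
BACKBONE = [f"backbone_{i}" for i in range(9)]
HBOND = [f"hbond_{i}" for i in range(2)]
ACCESSIBILITY = [f"access_{i}" for i in range(3)]

# Every (backbone, hydrogen bonding, accessibility) combination is one environment.
ENVIRONMENTS = list(product(BACKBONE, HBOND, ACCESSIBILITY))
ENV_INDEX = {env: i for i, env in enumerate(ENVIRONMENTS)}


def environment_index(backbone, hbond, accessibility):
    """Return the 0-based index of one residue environment (hypothetical ordering)."""
    return ENV_INDEX[(backbone, hbond, accessibility)]


print(len(ENVIRONMENTS))  # number of distinct environments
```

An index of this kind could be used to select the row of a propensity table or one of the 54 substitution matrices for a given residue position.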



Amino acid propensity calculation:



Amino acid substitution calculation:



Scoring Scheme

The propensity and substitution values derived for the query are compared with those of a large number of unrelated globular domains by a chi-square test. Low propensity or substitution scores at individual residue positions indicate strain or an error in structural assignment. These values are smoothed over a 21-residue window to recognize possible local errors.
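The 21-residue smoothing step can be sketched as a centred moving average over the per-residue scores. This is only an illustration: the function name `smooth_scores` is hypothetical, and the treatment of the chain ends (the window is simply truncated there) is an assumption, since this page does not describe how the termini are handled.

```python
def smooth_scores(scores, window=21):
    """Centred moving average over per-residue scores.

    Assumption: near the chain ends the window is truncated to the
    residues that exist, rather than padded.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        segment = scores[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Toy example with a 3-residue window so the arithmetic is easy to follow.
print(smooth_scores([0.0, 3.0, 6.0], window=3))
```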

Identification of errors in protein structure

Global errors can be easily recognized by plotting the total propensity or substitution scores of the query on a calibration plot derived from other protein crystal structures of high resolution.

Detection of local errors is possible by examining propensity and substitution scores along the entire protein length. The reverse sequence is employed as a control. Regions where the reverse sequence acquires a higher score than the actual sequence may contain local errors.
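The reverse-sequence control can be illustrated with a short Python sketch that reports the stretches where the control scores higher than the actual sequence. It assumes that both score profiles are already smoothed and have one value per residue position; the function name `flag_local_errors` and the `min_length` parameter are hypothetical and not part of the server.

```python
def flag_local_errors(actual, reverse, min_length=1):
    """Return (start, end) index pairs, end exclusive, where reverse > actual."""
    if len(actual) != len(reverse):
        raise ValueError("profiles must have the same length")
    regions = []
    start = None
    for i, (a, r) in enumerate(zip(actual, reverse)):
        if r > a:
            if start is None:
                start = i  # a candidate region opens here
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a region that runs to the end of the chain.
    if start is not None and len(actual) - start >= min_length:
        regions.append((start, len(actual)))
    return regions


actual = [5, 5, 1, 1, 5, 0]
reverse = [2, 2, 3, 3, 2, 1]
print(flag_local_errors(actual, reverse))
```

Raising `min_length` suppresses isolated positions, which may be useful when the profiles are noisy.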

Searching for homologues

Sequence homologues for each query sequence are identified from the SWISS-PROT database (Bairoch 1996) or the Non Redundant protein sequence database using BLAST (Altschul 1997). The hits are filtered for redundancy (Weizhong 2001), and proteins no more than 90% identical are aligned with the query using MALIGN (Johnson et al., 1993). The server accepts a maximum of 75 homologous sequences when examining amino acid substitutions for a query. Hence, the sequence identity cut-off used to remove redundant homologues is set to one of three values (100%, 90% or 70%), depending on the number of hits obtained in the BLAST run.
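One possible way to choose among the three identity cut-offs is sketched below in Python. The exact rule used by the server is not given on this page, so the logic here is an assumption: cut-offs are tried from the most permissive (100%) to the strictest (70%) until no more than 75 homologues remain, and the list is truncated if even the strictest cut-off leaves too many. The redundancy filter itself is passed in as a function; `choose_cutoff`, `remove_redundant` and `toy_filter` are hypothetical names.

```python
MAX_HOMOLOGUES = 75       # limit stated in the text above
CUTOFFS = (100, 90, 70)   # the three identity cut-offs, in percent


def choose_cutoff(hits, remove_redundant):
    """Pick the first cut-off that leaves at most MAX_HOMOLOGUES hits.

    remove_redundant(hits, cutoff) is a caller-supplied function that
    returns the hits remaining after redundancy removal at that cut-off.
    """
    kept = list(hits)
    cutoff = CUTOFFS[0]
    for cutoff in CUTOFFS:
        kept = remove_redundant(hits, cutoff)
        if len(kept) <= MAX_HOMOLOGUES:
            return cutoff, kept
    # Assumption: truncate if even the strictest cut-off leaves too many hits.
    return cutoff, kept[:MAX_HOMOLOGUES]


def toy_filter(hits, cutoff):
    """Stand-in for a real redundancy filter, for demonstration only."""
    remaining = {100: 200, 90: 120, 70: 60}[cutoff]
    return hits[:remaining]


hits = [f"hit{i}" for i in range(200)]
cutoff, kept = choose_cutoff(hits, toy_filter)
print(cutoff, len(kept))
```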


Flow chart of the methodology

Figure1.pdf

Calibration plot for 4020 scop domains (Thangudu et al., 2005, Proteins)


The following figure shows validation scores for models of two-disulfide-bonded polypeptides using HARMONY (Topham et al.[38]; Sowdhamini, R., Srinivasan, N., and Blundell, T.L., unpublished results). a: Actual HARMONY scores plotted as a function of protein length in residues. Points marked in blue correspond to 4020 nonredundant globular domains as described in the SCOP database.[31] Points marked in green and red correspond to good and poor models of two-disulfide polypeptides, respectively (see Materials and Methods for a definition of good and poor models). The best-fit straight line through the origin for all 4020 protein domains has a slope of 5.3. In general, points corresponding to poor models appear below the fitted straight line, indicative of strained conformations or misfolds. b: The inset shows a close-up of the same plot, zoomed to show only values corresponding to small folds. c: Extent of deviation of HARMONY scores from ideal expected values. -m is the difference between the ideal value and the observed HARMONY score after normalisation for protein length. The panel shows the percentage of SCOP protein domains, good models and poor models falling in different bins of -m. The colour scheme is the same as in (a). Non-redundant SCOP protein domains deviate very little from ideality (+1 to -1). Good models show smaller deviations than poor models. High values of -m are associated with strained or incorrect models.
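The calibration line and the length-normalised deviation can be sketched in Python as follows. The least-squares formula for a line through the origin is standard; the deviation function assumes that the "ideal value" for a protein is the calibration slope multiplied by its length, which is one reading of the caption rather than a confirmed definition. The function names are hypothetical.

```python
def slope_through_origin(lengths, scores):
    """Least-squares slope of the line score = slope * length (no intercept)."""
    numerator = sum(x * y for x, y in zip(lengths, scores))
    denominator = sum(x * x for x in lengths)
    if denominator == 0:
        raise ValueError("at least one non-zero length is required")
    return numerator / denominator


def length_normalised_deviation(length, score, slope=5.3):
    """(ideal - observed) / length, taking ideal = slope * length (assumption)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return (slope * length - score) / length


# Toy data lying exactly on a line of slope 2.
print(slope_through_origin([1, 2, 3], [2, 4, 6]))
# A 100-residue model scoring 100 units below the calibration line.
print(length_normalised_deviation(100, 430.0))
```

Under this reading, a model on the calibration line has a deviation of zero, and larger positive values correspond to models that score below the line.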




Input methods

Upload your structure

In order to assess the quality of the query protein, the user may upload the query structure in PDB format. Please provide the full-length protein only, and not individual domains, since the solvent accessibility patterns have to be properly described!

1. Structure should be in pdb format.
2. Use "Browse" button to upload the structure
3. Select database (swissprot or nr). Default is swissprot.
4. Select E-value. Default is 0.001.
5. Click on the "Submit" button to initiate the validation process.

Upload structure and alignment

1. Upload structure in pdb format.

Please provide the full-length protein only and not individual domains since the solvent accessibility patterns have to be properly described!

2. Upload alignment file.


The first sequence should be query sequence followed by its homologues.
The alignment should be in PIR or FASTA format

Example:
PIR format 
>P1;1a7v
structureX:
QTDVIAQRKAILKQMGEATKPIAAMLKGEAKFDQAVVQKSLAAIADDSKKLPALFPADSKTGGDTAALPKIWEDK
AKFDDLFAKLAAAATAAQGTI-KDEASLKANIGGVLGNCKSCHDDFRA*
>P1;4
sequence
----VEKREGMMKQIGGAMGSLAAISKGEKPFDADTVKAAVTTIGTNAKAFPEQFPAGTETG--SAAAPAIWENF
EDFKAKAAKLGTDADIVLANLPGDQAGVATAMKTLGADCGTCHQTYR-*
>P1;6
sequence
----VEKREGMMKQIGGSMGALAAISKGQKPYDAEAVKAAVTTISTNAKAFPDQFPPGSETG--SAAAPAIWENF
DDFKSKAAKLGADADKVLASLPADQAGVTAAMQTLGADCGACHQTYR-*
FASTA format

>1a7v
QTDVIAQRKAILKQMGEATKPIAAMLKGEAKFDQAVVQKSLAAIADDSKKLPALFPADSKTGGDTAALPKIWEDK
AKFDDLFAKLAAAATAAQGTI-KDEASLKANIGGVLGNCKSCHDDFRA
>4
----VEKREGMMKQIGGAMGSLAAISKGEKPFDADTVKAAVTTIGTNAKAFPEQFPAGTETG--SAAAPAIWENF
EDFKAKAAKLGTDADIVLANLPGDQAGVATAMKTLGADCGTCHQTYR-
>6
----VEKREGMMKQIGGSMGALAAISKGQKPYDAEAVKAAVTTISTNAKAFPDQFPPGSETG--SAAAPAIWENF
DDFKSKAAKLGADADKVLASLPADQAGVTAAMQTLGADCGACHQTYR-

3. Press the "Submit" button to initiate the validation process.
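Before uploading, it may help to check an alignment file locally. The Python sketch below reads a FASTA-format alignment and verifies the points mentioned in this help page: the query comes first, and there are no more than 75 homologues. The equal-length check reflects the general property of an alignment that every row has the same number of columns; it is not a documented server rule, and the function names are hypothetical.

```python
def read_fasta_alignment(text):
    """Parse FASTA text into a list of (name, aligned_sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_alignment(records, max_homologues=75):
    """Return (query, homologues) after basic sanity checks."""
    if not records:
        raise ValueError("alignment is empty")
    if len({len(seq) for _, seq in records}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    if len(records) - 1 > max_homologues:
        raise ValueError("too many homologues")
    return records[0], records[1:]


toy = ">query\nAC-D\nEF\n>hom1\nACGD\nEF\n"
query, homologues = check_alignment(read_fasta_alignment(toy))
print(query[0], len(homologues))
```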

Database:

To identify homologues for a given sequence, we employ BLAST to search sequence databases. The search can be performed on the Protein Data Bank (PDB), SWISS-PROT and the Non redundant sequence database (NR). Although a search against a larger database may provide a better opportunity to select homologues, such searches are time consuming. Please note, therefore, that searches against the NR database may take a long time!

Non redundant: All Non redundant GenBank CDS translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF

SWISS-PROT : Last major release of the SWISS-PROT protein sequence database (no updates)

E-value:

The E-value, or expectation value, is a statistical significance threshold for reporting matches against a database of sequences. If the E-value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold allows less stringent matches to be reported.
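The effect of the threshold can be illustrated with a few lines of Python. The hit names and E-values below are invented for the example, and `report_hits` is a hypothetical helper, not part of BLAST or of this server; the default of 0.001 matches the default E-value of the upload form.

```python
def report_hits(hits, expect=0.001):
    """Keep only hits whose E-value does not exceed the EXPECT threshold."""
    return [(name, evalue) for name, evalue in hits if evalue <= expect]


# Invented hits: (name, E-value)
hits = [("hitA", 1e-30), ("hitB", 5e-4), ("hitC", 0.02), ("hitD", 3.0)]

print(report_hits(hits))              # default threshold
print(report_hits(hits, expect=0.1))  # less stringent: more hits reported
```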


Output

Detection of gross error

Misfolds and folds with extensive regions of error can be identified by plotting the score of the query protein structure on a calibration plot. A protein with misfolded regions will have a significantly low HARMONY score and will fall well below the fitted straight line.

The following table shows the list of proteins used for calibration.



Example output

Detection of local error

This graph shows the smoothed scores of the query sequence alongside those of the reverse sequence. The reverse sequence and its scores are used as a control to identify local errors in the proposed protein model. Regions where the reverse sequence acquires a better substitution score could indicate possible local errors.

Example output



Local error mapped on the structure

Molscript image

The difference between the scores of the actual and reverse sequences is mapped onto the structure, which is coloured according to the degree of local error. A static image of the 3D structure is generated using MOLSCRIPT.
Red    ----> Strongly significant local error
Orange ----> Significant local error
Yellow ----> Probable local error 



Example output



Rasmol view

The coordinates of query protein colored by HARMONY can be downloaded and conveniently displayed with structural viewers such as RASMOL.

How to view

1. Click on the link "Rasmol view"

2. Save the file. The file name will be "rasmol.cgi".

3. Run the command: rasmol -script rasmol.cgi

Contact

Dr. R. Sowdhamini (mini@ncbs.res.in)