Help Page: 3dswap-pred | Prediction of 3D domain swapping from protein sequence

3dswap-pred is a web server for predicting the structural phenomenon "3D domain swapping" from protein sequence data. It uses a machine learning approach based on the ensemble classifier "Random Forest". The server provides a first-pass method for scanning sequences to identify putative proteins that may be involved in 3D domain swapping.

3dswap-pred - Help Index:

3dswap-pred - Introduction
3D-domain swapping
3dswap-pred - Dataset
3dswap-pred - Features
3dswap-pred - Random Forest
3dswap-pred - Input / Output Features
3dswap-pred - FASTA format guidelines & example input files
3dswap-pred - References

3dswap-pred - Introduction:

3D domain swapping is a protein structural phenomenon that evolved as a mechanism for oligomeric assembly. It mediates the formation of higher-order oligomers in a variety of proteins with different structural and functional properties, and plays an important role in functions ranging from oligomerization to pathological conformational diseases. Several experimental and computational studies have been performed to understand various aspects of domain swapping. However, 3D domain swapping can be observed only when a protein structure is solved in the swapped conformation in the oligomeric state, which limits the study of this phenomenon on a large scale.

3dswap-pred is designed to classify a given protein sequence as "swapping" or "non-swapping" using a Random Forest classifier. Literature-curated sequences of proteins involved in 3D domain swapping were used as the positive dataset. The negative dataset was derived using a new sequence mining method to improve the accuracy of the algorithm. A set of 126 sequence-based features is used as the feature vector for the classifier. On an independent validation dataset of 68 positive and 313 negative sequences, 3dswap-pred achieved an accuracy of 63.78% in testing, compared with 62.34% during training.
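As a rough illustration of how the accuracy figure relates to the validation set size, the sketch below computes accuracy as the percentage of correctly classified sequences. The count of correct predictions is an assumption back-calculated from the reported 63.78; it is not a number stated on this page.

```python
def accuracy_percent(n_correct: int, n_total: int) -> float:
    """Return classification accuracy as a percentage of all predictions."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_correct / n_total


# Validation set sizes quoted in the introduction.
N_POSITIVE = 68
N_NEGATIVE = 313
N_TOTAL = N_POSITIVE + N_NEGATIVE

# Assumption: estimated from the reported accuracy, not a published count.
approx_correct = round(0.6378 * N_TOTAL)

print(N_TOTAL, approx_correct, round(accuracy_percent(approx_correct, N_TOTAL), 2))
```

Accuracy alone does not show how errors are split between the "swapping" and "non-swapping" classes; that would need the per-class counts, which are not given here.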

3D-domain swapping:

3D domain swapping is a protein structural phenomenon that mediates the formation of higher-order oligomers in a variety of proteins with different structural and functional properties. A protein structure is reported to be in a swapped conformation when at least two chains of an oligomeric structure exchange a structural segment to form a stable structure. 3D domain swapping is mainly facilitated by the hinge region and the swapped region. A hinge region is generally defined as a short stretch of amino acids, mostly in a loop conformation, that links the swapped segment to the remaining core of the protein. A swapped region is the structural segment following the hinge region that is shared with the other chain.

Figure 1: Chains are colored in cyan and green.
Figure 2: The hinge region is colored in red and the swapped region in coffee brown.

3dswap-pred - Dataset:

The positive sequence dataset was obtained from a curated database of protein structures reported to be involved in 3D domain swapping. This dataset is currently being compiled as a database of 3D domain swapping in proteins (3Dswap: Knowledgebase of 3D Domain swapping in proteins). Based on literature curation and structure analyses, protein structures with well-defined "hinge regions" and "swapped regions" were included in the positive dataset. A total of 805 sequences were extracted from 299 structures using custom Perl scripts. Redundant sequences were removed using CD-HIT at a 40% cut-off.

The negative dataset was derived using a novel data mining approach. To add diversity to the negative dataset and avoid potential bias, we retrieved the representative sequence of one structure from each SCOP superfamily, considering only the four major structural classes: all-β, all-α, α+β and α/β. From this sequence pool, we removed superfamily representatives that are present in the positive dataset, and then removed redundant sequences using CD-HIT at a 40% cut-off. As a validation step, we used the DIAL server to scan the structural coordinates of the proteins in the negative dataset to ensure that it contains only non-swapped cases. Only proteins reported by the DIAL server as single continuous domains were retained in the final dataset used for training and testing the ensemble classifier.
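The superfamily filtering step described above can be sketched roughly as follows. This is an illustration only, not the pipeline used to build the server's dataset: the function name, the dictionary layout, and the placeholder identifiers and sequences are all hypothetical, and the sketch assumes that overlap with the positive dataset is decided at the level of the superfamily identifier. The CD-HIT and DIAL steps are external tools and are not reproduced here.

```python
from typing import Dict, Set

# SCOP class letters for the four major classes:
# a = all-alpha, b = all-beta, c = alpha/beta, d = alpha+beta.
MAJOR_CLASSES = {"a", "b", "c", "d"}


def build_negative_pool(
    representatives: Dict[str, str],
    positive_superfamilies: Set[str],
) -> Dict[str, str]:
    """Filter one-per-superfamily representatives for the negative pool.

    `representatives` maps a superfamily identifier in class.fold.superfamily
    form to a representative sequence (hypothetical layout).
    """
    pool = {}
    for superfamily, sequence in representatives.items():
        scop_class = superfamily.split(".")[0]
        if scop_class not in MAJOR_CLASSES:
            continue  # outside the four major structural classes
        if superfamily in positive_superfamilies:
            continue  # superfamily already represented in the positive set
        pool[superfamily] = sequence
    return pool


# Placeholder identifiers and sequences, for illustration only.
representatives = {
    "a.1.1": "MKTAYIAK",
    "b.1.1": "GSSHHHHH",
    "c.2.1": "MSLLNVPA",
    "e.1.1": "MAEQVALS",
}
positive_superfamilies = {"b.1.1"}

negative_pool = build_negative_pool(representatives, positive_superfamilies)
print(sorted(negative_pool))
```

In this toy input, one entry is dropped because its class is not among the four major classes and another because its superfamily appears in the positive set.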

3dswap-pred - Features:

The sequence-derived features used in this study can be broadly classified as composition-based features, sequence-derived fusion features, and physicochemical features derived from the AAINDEX database. Composition-based features are derived from the amino acid composition of the sequences in the positive and negative datasets. Sequence-derived fusion features are a new class of hybrid features in which two distinct features are combined to define a new feature. Physicochemical features are computed from amino acid properties listed in the AAINDEX database.
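To make the idea of a composition-based feature concrete, here is a minimal sketch that computes the percentage of each of the 20 standard amino acids in a sequence. It illustrates the feature type only; the exact definition of the 126 features used by 3dswap-pred is not given on this page, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the percentage of each standard residue in `sequence`.

    Characters outside the 20 standard residues are ignored.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


composition = amino_acid_composition("AAGK")
print(composition["A"], composition["G"], composition["K"])
```

Each sequence yields a fixed-length vector of 20 values regardless of its length, which is what makes composition convenient as classifier input.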

3dswap-pred - Random Forest:

Decision trees model a decision process as a tree-like graph and predict the probable outcome of a system by analysing its associated parameters. Random Forest classification extends the concept of decision trees and has been successfully employed in developing solutions for a variety of problems in biology. Random Forest is an ensemble decision tree classifier that incorporates two effective machine learning techniques, bagging and random subspace feature selection, into a single method. A random forest is a collection of decision trees, where each tree is grown using a subset of the possible attributes in the input feature vector. Instead of using all features in all trees, Random Forest randomly selects a subset of features as split candidates at each node when growing a tree, and the final classification is obtained by combining the results from all the trees via "voting". It has been shown that combining multiple decision trees built in randomly selected subspaces can improve generalization accuracy. The random subspace method is used to avoid overfitting the training set while preserving maximum accuracy when training a decision tree classifier.

The prediction performance of the 3dswap-pred classifier is assessed by cross-validation using the out-of-bag (OOB) samples of the Random Forest. The 3dswap-pred Random Forest model is implemented using the randomForest R package.
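The three ingredients described above (bootstrap sampling with out-of-bag samples, a random feature subset per split, and voting) can be sketched in plain Python. This is not the server's implementation, which uses the randomForest R package; the function names are hypothetical and no trees are actually grown. The subset size of 11 is an assumption, taken as the integer part of the square root of 126 features.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def bootstrap_split(n_samples: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw a bootstrap sample with replacement.

    Returns (in_bag, out_of_bag) index lists; the out-of-bag samples are
    those never drawn, and can be used to assess the tree grown on in_bag.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def random_feature_subset(n_features: int, k: int, rng: random.Random) -> List[int]:
    """Pick k distinct feature indices as split candidates for one node."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions: Sequence[str]) -> str:
    """Combine per-tree class labels into one prediction by voting."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(10, rng)
candidates = random_feature_subset(126, 11, rng)  # assumed subset size
label = majority_vote(["swapping", "non-swapping", "swapping"])
print(len(in_bag), len(candidates), label)
```

Because each tree sees only its own bootstrap sample, the samples left out of that draw act as a built-in held-out set for that tree, which is what OOB assessment relies on.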

3dswap-pred - server input/output options:

Input Method : Paste Sequence File
Step 1 : Paste a protein sequence in FASTA format into the text area of the 3dswap-pred server page
Step 2 : Click the "3dswap-pred" button to submit the sequence for prediction
Step 3 : Results are usually returned within 1-2 minutes, depending on the load on the server
Output / Prediction results from 3dswap-pred Server:
For a successfully submitted protein sequence in FASTA format, the server returns either "Domain-swap" or "non domain-swap", according to the result of the Random Forest based prediction model.
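Before pasting a sequence, it can help to check locally that it is well-formed FASTA. The sketch below parses FASTA text and reports characters outside the 20 standard amino acids. It is a hypothetical pre-check, not part of the server: the alphabet the server actually accepts is not documented on this page, and the example record is made up.

```python
from typing import List, Tuple

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumed accepted alphabet


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' header line")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence: str) -> List[str]:
    """Return the sorted characters that are not standard amino acids."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Made-up example record, for illustration only.
example = """>example_protein hypothetical entry
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq), invalid_residues(seq))
```

A record with an empty list of invalid residues contains only standard amino acid letters; sequence lines split across several rows are joined into one string.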

3dswap-pred - References:

Bennett MJ, Choe S, Eisenberg D (1994) Domain swapping: entangling alliances between proteins. Proc Natl Acad Sci U S A 91: 3127-3131.
Bennett MJ, Choe S, Eisenberg D (1994) Refined structure of dimeric diphtheria toxin at 2.0 Å resolution. Protein Sci 3: 1444-1463.
Bennett MJ, Schlunegger MP, Eisenberg D (1995) 3D domain swapping: a mechanism for oligomer assembly. Protein Sci 4: 2455-2468.
Bennett MJ, Eisenberg D (2004) The evolving role of 3D domain swapping in proteins. Structure 12: 1339-1341.
Bennett MJ, Sawaya MR, Eisenberg D (2006) Deposition diseases and 3D domain swapping. Structure 14: 811-824.
Ho TK (2002) A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors. Pattern Analysis & Applications 5: 102-112.
Breiman L (2001) Random Forests. Machine Learning 45: 5-32.
Liaw A, Wiener M (2002) Classification and Regression by randomForest. R News 2: 18-22.
Ho TK (1998) The Random Subspace Method for Constructing Decision Forests. IEEE Trans Pattern Anal Mach Intell 20: 832-844.
Mitchell TM (1997) Machine Learning. McGraw-Hill.
McGuffin LJ, et al. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16: 404-405.

3dswap-pred is Powered by : 3Dswap: Knowledgebase of 3D-domain swapping in proteins, PSIPRED, R, randomForest, Perl, Apache HTTP Server & JavaScript
The 3dswap-pred server can be accessed with different browsers, but we recommend Mozilla Firefox and Apple Safari

Contact:
Prof. R. Sowdhamini (Contact : mini@ncbs.res.in)

3dswap-pred - Team:
K. Shameer, G. Pugalenthi & Prof. R. Sowdhamini