3dswap-pred - Dataset:
The positive sequence dataset was obtained from a curated database of protein structures reported to be involved in 3D domain swapping. This dataset is currently being compiled as a database of 3D domain swapping in proteins (3Dswap: Knowledgebase of 3D Domain swapping in proteins). Based on literature curation and structure analyses, protein structures with well-defined "hinge regions" and "swapped regions" were included in the positive dataset. A total of 805 sequences were extracted from 299 structures using custom Perl scripts. Redundant sequences were removed using CD-HIT at a 40% sequence identity cut-off.

The negative dataset was derived using a novel data mining approach. To add diversity to the negative dataset and to avoid potential bias, we retrieved the sequence of one representative structure from each SCOP superfamily. We considered only the four major structural classes: all-β, all-α, α+β and α/β. From this large pool of candidate negative sequences, we removed the representatives of superfamilies that are present in the positive dataset. We then removed redundant sequences using CD-HIT at a 40% sequence identity cut-off. As a validation step, we used the DIAL server to scan the structural coordinates of the proteins in the negative dataset, to ensure that it contains only non-swapped cases. Only proteins reported by the DIAL server as single continuous domains were retained in the final dataset used for training and testing the ensemble classifier.
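
The superfamily-level filtering of the negative dataset can be illustrated with a minimal Python sketch. It is not the code used to build the dataset (the original extraction was done with custom Perl scripts). The `ScopEntry` record, the function names, and the rule of keeping the first entry seen per superfamily are assumptions made for illustration; the source does not state how each representative was chosen. The sketch relies only on the SCOP convention that a classification string such as `b.1.1.1` reads class.fold.superfamily.family, with the class letters a, b, c and d denoting all-α, all-β, α/β and α+β.

```python
from typing import Iterable, List, NamedTuple, Set

# SCOP class letters for the four major structural classes:
# a = all-alpha, b = all-beta, c = alpha/beta, d = alpha+beta.
MAJOR_CLASSES = {"a", "b", "c", "d"}


class ScopEntry(NamedTuple):
    seq_id: str  # hypothetical identifier of a domain sequence
    sccs: str    # SCOP classification string, e.g. "b.1.1.1"


def superfamily_of(sccs: str) -> str:
    """Return the class.fold.superfamily prefix of a SCOP string."""
    return ".".join(sccs.split(".")[:3])


def select_negative_candidates(
    entries: Iterable[ScopEntry],
    positive_superfamilies: Set[str],
) -> List[ScopEntry]:
    """Keep one representative per superfamily, restricted to the four
    major classes and excluding superfamilies seen in the positive set."""
    chosen = {}
    for entry in entries:
        if entry.sccs.split(".")[0] not in MAJOR_CLASSES:
            continue  # skip classes outside the four major ones
        sf = superfamily_of(entry.sccs)
        if sf in positive_superfamilies:
            continue  # superfamily already represented among swapped proteins
        # Assumption: the first entry seen becomes the representative.
        chosen.setdefault(sf, entry)
    return [chosen[sf] for sf in sorted(chosen)]


# Toy input with made-up identifiers, for illustration only.
entries = [
    ScopEntry("d1", "a.1.1.1"),
    ScopEntry("d2", "a.1.1.2"),  # same superfamily as d1, dropped
    ScopEntry("d3", "b.1.1.1"),  # superfamily in the positive set, dropped
    ScopEntry("d4", "e.1.1.1"),  # not one of the four major classes, dropped
    ScopEntry("d5", "c.2.1.1"),
]
negatives = select_negative_candidates(entries, {"b.1.1"})
print([e.seq_id for e in negatives])
```

The later steps described above (redundancy removal with CD-HIT and the single-domain check with the DIAL server) rely on external tools and are not reproduced in this sketch.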