Intelligent data analysis for protein disorder prediction
Article written by: Romero, P.; Obradovic, Z.; Keith Dunker, A.
Abstract: Although an ordered 3D structure is generally considered a necessary precondition for protein function, disordered counterexamples with biological activity have been found. The objectives of our data mining project are: (1) to generalize from the limited set of counterexamples and then apply this knowledge to large databases of amino acid sequences in order to estimate how common disordered protein regions are in nature, and (2) to determine whether there are different types of protein disorder. For general disorder estimation, a neural network based predictor was designed and tested on data built from several public domain data banks through a nontrivial search, statistical analysis and data dimensionality reduction. In addition, predictors for identifying family-specific disorder were developed by extracting knowledge from databases generated through multiple sequence alignments of a known disordered sequence to other highly related proteins. Family-specific predictors were also integrated into hybrid prediction systems to test the quality of general protein disorder identification by such systems. Out-of-sample cross-validation performance of several predictors was computed first, followed by tests on an unrelated database of proteins with long disordered regions, and by the application of a few selected predictors to two large protein data banks: Nrl-3D, currently containing more than 10,000 protein fragments of known 3D structure, and Swiss Protein, with almost 60,000 protein sequences. The results provide evidence that long disordered regions are common in nature, with an estimate that 11% of all the residues in the Swiss Protein data bank belong to disordered regions of length 40 or greater.
The hypothesis that different types of protein disorder exist is supported by the high specificity and low sensitivity of two family-specific predictors, by hybrid systems outperforming general models on a two-family test, and by the existence of significant gaps between the Swiss Protein and Nrl-3D disorder frequency estimates for both families. These findings call for a revision of the current understanding of protein structure and function, as well as for the development of improved disorder predictors, which should have important uses in biotechnology applications.
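As an illustration of the kind of statistic quoted in the abstract (the share of residues lying in disordered regions of length 40 or greater), the following minimal sketch computes that fraction from per-residue predictions. It assumes each protein is represented as a list of booleans, with True meaning "predicted disordered"; the function name `long_disorder_fraction` and the toy data are hypothetical and do not come from the article, and the sketch says nothing about how the authors' neural network produced its predictions.

```python
from itertools import groupby
from typing import Iterable, List


def long_disorder_fraction(
    predictions: Iterable[List[bool]], min_len: int = 40
) -> float:
    """Return the fraction of all residues that lie in predicted
    disordered runs of at least `min_len` consecutive residues.

    `predictions` holds one list per protein; True marks a residue
    predicted as disordered (an assumed representation).
    """
    total = 0
    in_long_runs = 0
    for per_residue in predictions:
        total += len(per_residue)
        # groupby yields maximal runs of identical consecutive labels
        for is_disordered, run in groupby(per_residue):
            run_len = sum(1 for _ in run)
            if is_disordered and run_len >= min_len:
                in_long_runs += run_len
    return in_long_runs / total if total else 0.0


# Toy data (hypothetical): one protein with a 40-residue disordered run,
# one with only a short 10-residue run that does not count.
toy = [
    [False] * 60 + [True] * 40,
    [True] * 10 + [False] * 90,
]
fraction = long_disorder_fraction(toy)
```

On the toy data, 40 of 200 residues fall in a qualifying run, so `fraction` is 0.2. Counting residues rather than proteins matches the way the abstract states its estimate.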
Language:
English