Paper

Automated motif discovery in protein structure prediction

The protein structure prediction problem (PSP) is one of the central problems in molecular and structural biology. A computational method that could produce a correct detailed three-dimensional structural model for a protein, given its linear sequence of amino acids, would greatly accelerate progress in the biomedical sciences and industries. This thesis presents PSP as a combinatorial optimization problem, the straightforward formulations of which require search of an exponentially large conformation space and are known to be NP-hard. This otherwise intractable search can in practice be reduced or eliminated through the discovery and use of motifs. Motifs are abstractions of observed patterns that encode structurally important relationships among constituent parts of a complex object like a protein tertiary structure. Motif discovery is accomplished by particular combinatorial search and statistical estimation methods. This thesis explores in detail two particular motif discovery subproblems, and discusses how their solutions can be applied to the overall structure prediction problem: (1) For a complex multi-stage prediction task, what makes a good intermediate representation language? We address this question by presenting and analyzing methods for the discovery of protein secondary structure classes that are more predictable from amino acid sequence than the standard classes of $\alpha$-helix, $\beta$-sheet, and random coil. (2) Given a database of $M$ objects, each characterized by values $a_{ij} \in \mathcal{A}_j$ for each of $N$ discrete variables $\{c_j\}_{j=1}^{N}$, return the list of most interesting higher-order features $\gamma_l$, i.e., sets of $k_l$ variables with highest estimated correlation, for any $2 \le k_l \le N$. In the PSP context, the problem is the detection of correlations between amino acid residues in an aligned set of evolutionarily related protein sequences. We present and analyze a fast procedure, based on multinomial sampling and a novel coding scheme, that avoids the exhaustive search, prior limits on the order $k$, and exponentially large parameter space of other methods. The focus of this thesis is PSP, but the techniques and analysis are also aimed at wider application to other hard, multi-stage prediction problems.
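
To make subproblem (2) concrete, the following is a minimal sketch of the exhaustive baseline for the smallest order, k = 2. It is not the procedure presented in the thesis, which the abstract says is based on multinomial sampling and a coding scheme and avoids exhaustive search. The sketch assumes empirical mutual information as the correlation estimate; the function names and the toy alignment are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_a, col_b):
    """Empirical mutual information (in bits) between two equal-length columns."""
    m = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written in terms of raw counts
        mi += (n_ab / m) * log2(n_ab * m / (count_a[a] * count_b[b]))
    return mi


def top_correlated_pairs(alignment, top=3):
    """Score every pair of columns (k = 2) exhaustively and return the best ones."""
    columns = list(zip(*alignment))  # N columns, each holding M symbols
    scores = [
        (mutual_information(columns[i], columns[j]), (i, j))
        for i, j in combinations(range(len(columns)), 2)
    ]
    # Highest score first; ties broken by column indices for determinism
    scores.sort(key=lambda s: (-s[0], s[1]))
    return scores[:top]


# Toy alignment, made up for illustration: columns 0 and 2 covary perfectly.
toy_alignment = ["ACDG", "ACDG", "KCEG", "KLEG", "ALDG", "KCEG"]
best = top_correlated_pairs(toy_alignment, top=1)
print(best)
```

This baseline already scores N(N-1)/2 column pairs. Extending it to order k means scoring every one of the C(N, k) variable sets, each with a joint table of up to |A|^k cells, which is the exhaustive search and exponentially large parameter space that the abstract says the thesis's procedure avoids.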

Published 1997-01-01

Authors: Geoffrey E. Hinton · Evan W. Steeg


