Amino acid usage and protein structure-function
.....The de novo design of predictably
folded synthetic polypeptides, using different amino acid combinations and
permutations which satisfy the classic three-state model, has been accomplished
using advanced computational systems. In nature too, the conservation of
protein structure has been recognized to exceed the conservation of amino
acid sequence, where proteins with little or no recognizable sequence homology
have been found to exhibit similar 3D structure. Conversely, the evolution
of new biological function by modification of existing protein predecessors,
rather than creation of new protein structure and function ab initio, has
led to widespread sequence similarities in functionally different proteins
with common protein folds. This balance between the forces of evolutionary
change and the biochemical requirements of protein structure-function has
brought about the existence of large families of polypeptide/protein motifs,
modules and domains having similar protein structures while at the same
time exhibiting substantial sequence diversity.
.....Based on the prediction of an upper
limit to the number of evolutionarily successful protein structure/sequence
families in nature, it is plausible to say that only a fraction of the nearly
infinite number of amino acid combinations and permutations possible in
a random world have survived into modern genomes. The retention of certain
elements of protein sequence and/or structure, despite sometimes substantial
duplication, amplification, modification and mobilization of primitive protein
sequences, is likely a result of the evolutionary pressures imposed by the requirement
for coordination between interacting protein functions in overlapping physiological
and metabolic pathways. The development of computational methods for the
systematic searching/alignment of similar sequences and structures has been
instrumental in achieving the current level of knowledge with regard to
protein families, and will play an even more prominent role in the characterization
and assignment of protein sequences as numerous genome sequencing initiatives
near completion.
.....As the sequences of more and more newly
identified gene products are characterized and classified into families
of proteins with similar sequences and/or protein conformations, these established
groups of proteins will be subjected to various computational methods for
the secondary analysis or "mining" of amino acid functionalities
and molecular arrangements that are responsible for either shared or unique
molecular function.
Mining for conserved elements of protein structure-function
.....DanPatGenomics has developed a simple computational tool, AAPAIR.TAB, for the systematic analysis of protein motif/sequence families at the two-amino-acid level. Automated dipeptide frequency/bias analysis detects preferences in the distribution of amino acids within established protein families by determining which "ordered dipeptides" occur most frequently in comprehensive motif-specific sequence data sets. Graphical display of the dipeptide frequency/bias data can reveal not only family-specific preferences but also preferences shared by multiple protein families. Evaluating the distribution of high frequency/bias dipeptides frequently directs attention to common sites of localization for conserved sequence elements in the consensus sequence of diverse protein families. The similar use of these high frequency/bias dipeptides across protein sequence families has also been correlated with the concurrence of these shared molecular determinants at similar positions within the 3D scaffolds of their respective protein structures.
Theory and Development of Dipeptide Frequency/Bias Analysis
.....The potential usefulness of dipeptide
analysis for the evaluation of biologically relevant sequence data is most
easily recognized when contrasted against the results from the artificial
hypothetical case where amino acid selection and distribution in proteins
occurs purely at random. Examination of a sequence data set comprised of
peptides generated by random coupling of the twenty different common amino
acids, at the two-amino-acid level, would reveal a set of dipeptides each
occurring with a frequency equal to the product of the frequencies of the two
independent amino acids (i.e., 1/20 * 1/20 = 1/400). As all amino acid combinations
are equally likely to occur in the random hypothetical case, the likelihood
of finding any specific dipeptide is the same as the likelihood of finding
its inversion, giving a bias of 1 [i.e., (1/400) / (1/400) = 1]. Any significant deviation from
these theoretical random-case values (frequency of 1 in 400, or bias of
1.0), observed in data sets comprised of native protein sequences, indicates
some degree of preference in the selection and/or distribution of amino
acids, as required for optimal structure and function.
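
The frequency and bias calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the AAPAIR.TAB implementation: the function names are hypothetical, dipeptides are assumed to be counted over overlapping two-residue windows, frequency is taken as a dipeptide's share of all dipeptides counted, and bias is taken as the count of an ordered dipeptide divided by the count of its inversion.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty common amino acids


def dipeptide_counts(sequences):
    """Count ordered dipeptides over overlapping two-residue windows (assumed)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for first, second in zip(seq, seq[1:]):
            # Skip windows containing non-standard symbols (X, gaps, etc.).
            if first in AMINO_ACIDS and second in AMINO_ACIDS:
                counts[first + second] += 1
    return counts


def frequency_and_bias(sequences):
    """Return {dipeptide: (frequency, bias)} for all 400 ordered dipeptides."""
    counts = dipeptide_counts(sequences)
    total = sum(counts.values())
    table = {}
    for first in AMINO_ACIDS:
        for second in AMINO_ACIDS:
            forward = counts[first + second]
            reverse = counts[second + first]
            frequency = forward / total if total else 0.0
            if reverse:
                bias = forward / reverse  # 1.0 in the random case
            elif forward:
                bias = math.inf  # the inversion was never observed
            else:
                bias = math.nan  # neither order was observed
            table[first + second] = (frequency, bias)
    return table


# Toy data set, invented purely for illustration.
table = frequency_and_bias(["GSGSG", "GSA"])
print(table["GS"])  # frequency and bias of the ordered dipeptide G-S
```

In a real sequence family, values well away from a frequency of 1/400 and a bias of 1.0 are the ones of interest; how to treat dipeptides whose inversion is never observed is a design choice, handled here by reporting an infinite bias.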
.....Using the automated search program,
AAPAIR.TAB, on different motif/sequence
data sets, the relative frequency at which each of the four hundred possible
ordered dipeptides occurs in any data set can be recorded and any measurable
directional bias evaluated. Plots of the frequency vs bias data facilitate
the recognition of both shared and unique dipeptide elements in each family.
The non-conserved "dipeptide noise" is clearly separated from those
dipeptides occurring with the greatest frequency and bias, which become
readily visible.
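
The separation of prominent dipeptides from the "dipeptide noise" is described here as a visual step on a plot. A rough programmatic equivalent, assuming a table of (frequency, bias) values per dipeptide is already available, might look like the sketch below. The function name and both cut-off values are hypothetical choices made for illustration and are not taken from AAPAIR.TAB.

```python
RANDOM_FREQUENCY = 1 / 400  # random-case frequency of any ordered dipeptide


def prominent_dipeptides(table, min_enrichment=2.0, min_bias=2.0):
    """Return dipeptides that stand out from the noise, strongest first.

    table maps a dipeptide such as "GS" to a (frequency, bias) pair.
    min_enrichment and min_bias are illustrative cut-offs (assumptions).
    """
    selected = [
        (dipeptide, frequency, bias)
        for dipeptide, (frequency, bias) in table.items()
        if frequency >= min_enrichment * RANDOM_FREQUENCY and bias >= min_bias
    ]
    # Sort by frequency, then bias, so the strongest signals come first.
    return sorted(selected, key=lambda row: (row[1], row[2]), reverse=True)


# Invented values, used only to show the filtering behaviour.
example = {
    "GS": (0.02, 3.0),
    "SG": (0.0067, 0.33),
    "AL": (0.004, 1.1),
    "CP": (0.01, 2.5),
}
for dipeptide, frequency, bias in prominent_dipeptides(example):
    print(dipeptide, frequency, bias)
```

Only a bias above 1.0 is tested because a dipeptide with a bias below 1.0 is, by definition, the inversion of one with a bias above 1.0, which is reported in its own right.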
.....The dipeptide frequency/bias analysis
performed using AAPAIR.TAB represents
a simple-to-use computational method for reducing massive amounts of sequence
data into a quantitative and easily manageable format. This novel method
approaches the analysis of protein structure-function much differently than
other tools of modern protein bioinformatics, examining entire protein sequence
families two amino acids at a time. The resulting plots of dipeptide frequency
vs bias data readily relate information on the most highly conserved elements
of structure and function, and facilitate the identification, prioritization,
and comparison of two-amino-acid sequence elements both within and across
protein families. The tabulated frequency and bias data collected by this
automated process directly reflect the consensus evolutionary requirements
of protein families by "averaging out" the non-conserved primary
structure present in individual subsets of familial sequence, thereby permitting
the observance of preferred family-wide dipeptides. In addition to identifying
motif-specific dipeptides, the analysis of a broad range of sequence families
is revealing trends relating to more general aspects of protein structure-function
and will permit assignment of both common and rare dipeptides to definable
molecular conformations and environments, and/or to conserved physiological
function.