Amino acid usage and protein structure-function

.....The de novo design of predictably folded synthetic polypeptides, using different amino acid combinations and permutations that satisfy the classic three-state model, has been accomplished using advanced computational systems. In nature too, the conservation of protein structure has been recognized to exceed the conservation of amino acid sequence: proteins with little or no recognizable sequence homology have been found to exhibit similar 3D structure. Conversely, the evolution of new biological function by modification of existing protein predecessors, rather than by creation of new protein structure and function ab initio, has led to widespread sequence similarities in functionally different proteins with common protein folds. This balance between the forces of evolutionary change and the biochemical requirements of protein structure-function has brought about large families of polypeptide/protein motifs, modules and domains that have similar protein structures while exhibiting substantial sequence diversity.
.....Based on the prediction of an upper limit to the number of evolutionarily successful protein structure/sequence families in nature, it is plausible to say that only a fraction of the nearly infinite number of amino acid combinations and permutations possible in a random world have survived into modern genomes. The retention of certain elements of protein sequence and/or structure, despite sometimes substantial duplication, amplification, modification and mobilization of primitive protein sequences, is likely a result of the evolutionary pressures imposed by the requirement for coordination between interacting protein functions in overlapping physiological and metabolic pathways. The development of computational methods for the systematic searching and alignment of similar sequences and structures has been instrumental in achieving the current level of knowledge of protein families, and will play an even more prominent role in the characterization and assignment of protein sequences as numerous genome sequencing initiatives near completion.
.....As the sequences of more and more newly identified gene products are characterized and classified into families of proteins with similar sequences and/or protein conformations, these established groups of proteins will be subjected to various computational methods for the secondary analysis, or "mining", of the amino acid functionalities and molecular arrangements that are responsible for either shared or unique molecular function.


Mining for conserved elements of protein structure-function

.....DanPatGenomics has developed a simple computational tool, AAPAIR.TAB, for the systematic analysis of protein motif/sequence families at the two-amino-acid level. Automated dipeptide frequency/bias analysis detects preferences in the distribution of amino acids within established protein families by determining which "ordered dipeptides" occur most frequently in comprehensive motif-specific sequence data sets. Graphical display of the dipeptide frequency/bias data can reveal not only family-specific preferences but also preferences shared by multiple protein families. Evaluating the distribution of high frequency/bias dipeptides frequently directs attention to common sites of localization for conserved sequence elements in the consensus sequences of diverse protein families. The similar use of these high frequency/bias dipeptides across protein sequence families has also been correlated with the occurrence of these shared molecular determinants at similar positions within the 3D scaffolds of their respective protein structures.


Theory and Development of Dipeptide Frequency/Bias Analysis

.....The potential usefulness of dipeptide analysis for the evaluation of biologically relevant sequence data is most easily recognized when contrasted with the artificial, hypothetical case in which amino acid selection and distribution in proteins occur purely at random. Examination, at the two-amino-acid level, of a sequence data set composed of peptides generated by random coupling of the twenty common amino acids would reveal a set of dipeptides each occurring with a frequency equal to the product of the frequencies of its two independent amino acids (i.e. 1/20 * 1/20 = 1/400). As all amino acid combinations are equally likely to occur in the random hypothetical case, the likelihood of finding any specific dipeptide is the same as the likelihood of finding its inversion, so the ratio of the two, the bias, is 1 [i.e. (1/400) / (1/400) = 1]. Any significant deviation from these theoretical random-case values (frequency of 1 in 400, or bias of 1.0) observed in data sets composed of native protein sequences indicates some degree of preference in the selection and/or distribution of amino acids, as required for optimal structure and function.
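.....The two quantities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AAPAIR.TAB implementation, whose internals are not described here: the function names are hypothetical, frequency is taken as a dipeptide's share of all overlapping two-residue windows in the data set, bias is taken as the count of a dipeptide divided by the count of its inversion, and windows containing anything other than the twenty common amino acids are skipped.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty common amino acids

def dipeptide_counts(sequences):
    """Count ordered dipeptides over all overlapping two-residue windows."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for first, second in zip(seq, seq[1:]):
            # Assumption: windows with non-standard symbols are ignored.
            if first in AMINO_ACIDS and second in AMINO_ACIDS:
                counts[first + second] += 1
    return counts

def frequency_and_bias(sequences):
    """Map each of the 400 ordered dipeptides to (frequency, bias).

    frequency = count / total dipeptides counted (random case: 1/400)
    bias      = count(XY) / count(YX)            (random case: 1.0)
    bias is None when the inverted dipeptide never occurs.
    """
    counts = dipeptide_counts(sequences)
    total = sum(counts.values())
    table = {}
    for first, second in product(AMINO_ACIDS, repeat=2):
        forward = counts[first + second]
        reverse = counts[second + first]
        freq = forward / total if total else 0.0
        bias = forward / reverse if reverse else None
        table[first + second] = (freq, bias)
    return table

# Toy input, invented for illustration only (not real protein sequences).
table = frequency_and_bias(["ACACA", "AC"])
```

.....In this toy input the dipeptide AC is counted three times and its inversion CA twice, out of five windows in total, so AC has a frequency of 0.6 and a bias of 1.5. Under this definition a dipeptide of two identical residues always has a bias of 1.0 whenever it occurs at all.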
.....Using the automated search program, AAPAIR.TAB, different motif/sequence data sets can be searched, the relative frequency at which each of the four hundred possible ordered dipeptides occurs in each data set recorded, and any measurable directional bias evaluated. Plots of the frequency vs bias data facilitate the recognition of both shared and unique dipeptide elements in each family. In these plots the non-conserved "dipeptide noise" is clearly separated from those dipeptides occurring with the greatest frequency and bias, which become readily visible.
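.....The separation of prominent dipeptides from "dipeptide noise" can be approximated numerically as well as by eye. The sketch below is an assumption about how such a filter might look, not a description of AAPAIR.TAB: the function name, the input layout (dipeptide mapped to a frequency and bias pair) and both cutoff values are hypothetical, and the numbers in the example table are invented for illustration.

```python
RANDOM_FREQUENCY = 1 / 400  # random-case frequency of any ordered dipeptide

def select_prominent(table, min_fold=2.0, min_bias=1.5):
    """Return dipeptides well above the random case, strongest first.

    table    maps dipeptide -> (frequency, bias); bias may be None.
    min_fold is the minimum frequency as a multiple of 1/400.
    min_bias is the minimum ratio of a dipeptide to its inversion.
    Both cutoffs are arbitrary illustrative choices.
    """
    selected = []
    for pair, (freq, bias) in table.items():
        if bias is None:  # inversion never observed, ratio undefined
            continue
        fold = freq / RANDOM_FREQUENCY
        if fold >= min_fold and bias >= min_bias:
            selected.append((pair, fold, bias))
    # Sort by fold over random frequency, then by bias, highest first.
    selected.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return selected

# Invented values, for illustration only.
example = {
    "GP": (0.02, 3.0),
    "CC": (0.01, 1.0),
    "AL": (0.003, 1.1),
}
prominent = select_prominent(example)
```

.....With these invented values only GP passes both cutoffs: CC is four times more frequent than the random case but shows no directional bias, and AL is close to the random case on both measures.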
.....The dipeptide frequency/bias analysis performed using AAPAIR.TAB is a simple-to-use computational method for reducing massive amounts of sequence data to a quantitative and easily manageable format. This novel method approaches the analysis of protein structure-function much differently than other tools of modern protein bioinformatics, examining entire protein sequence families two amino acids at a time. The resulting plots of dipeptide frequency vs bias data readily relate information on the most highly conserved elements of structure and function, and facilitate the identification, prioritization, and comparison of two-amino-acid sequence elements both within and across protein families. The tabulated frequency and bias data collected by this automated process directly reflect the consensus evolutionary requirements of protein families by "averaging out" the non-conserved primary structure present in individual subsets of familial sequence, thereby permitting the observation of preferred family-wide dipeptides. In addition to identifying motif-specific dipeptides, the analysis of a broad range of sequence families is revealing trends relating to more general aspects of protein structure-function, and will permit the assignment of both common and rare dipeptides to definable molecular conformations and environments, and/or to conserved physiological function.
