|
BioInformatics
|
||||||||||||||||||||||||||||||||||||
| The World of BioInformatics | ||||||||||||||||||||||||||||||||||||
|
BioInformatics
(Molecular) bio-informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying “informatics techniques” (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.
One of the central goals of bioinformatics is the prediction of protein function, and ultimately of structure, from the linear amino acid sequence. Given a newly determined sequence, we want to know: what is my protein? To what family does it belong? What is its function? And how can we explain its function in structural terms?
Aims of bioinformatics
The aims of bioinformatics are three-fold.
First, at its simplest, bioinformatics organises data in a way that allows researchers to access existing information and to submit new entries as they are produced, e.g. the Protein Data Bank for 3D macromolecular structures. While data curation is an essential task, the information stored in these databases is essentially useless until analysed. Thus the purpose of bioinformatics extends far beyond mere volume control.
The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterised sequences. This requires more than just a straightforward database search. As such, programs such as FASTA and PSI-BLAST must consider what constitutes a biologically significant resemblance. Development of such resources requires extensive knowledge of computational theory, as well as a thorough understanding of biology.
The third aim is to use these tools to analyse the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared them with a few that are related. In bioinformatics, we can also conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight features that are unique to some.
The Information
We start with an overview of the sources of information: these may be divided into 1) raw DNA sequences, 2) protein sequences, 3) macromolecular structures, 4) genome sequences, and 5) other whole genome data.
Raw DNA sequences are strings of the four base-letters comprising genes, each typically 1,000 bases long.
At the next level are protein sequences comprising strings of 20 amino acid-letters. At present there are about 300,000 known protein sequences, with a typical bacterial protein containing approximately 300 amino acids.
Macromolecular structural data represents a more complex form of information. There are currently 13,000 entries in the Protein Data Bank, PDB, most of which are protein structures. A typical PDB file for a medium-sized protein contains the xyz coordinates of approximately 2,000 atoms.
Protein Sequence Database
Protein sequence databases are categorised as primary, composite or secondary.
Primary databases contain over 300,000 protein sequences and function as a repository for the raw data. Some of the more common repositories, such as SWISS-PROT and PIR-International, annotate the sequences and also describe the proteins’ functions, their domain structures and their post-translational modifications.
Composite databases such as OWL and the NRDB compile and filter sequence data from different primary databases to produce combined non-redundant sets that are more complete than the individual databases. They also include protein sequence data from the translated coding regions in DNA sequence databases.
Secondary databases contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family. One of the most popular is PROSITE, a database of short sequence patterns and profiles that characterise biologically significant sites in proteins. PRINTS expands on this concept and provides a compendium of protein fingerprints: groups of conserved motifs that characterise a protein family. Motifs are usually separated along a protein sequence, but may be contiguous in 3D-space when the protein is folded. By using multiple motifs, fingerprints can encode protein folds and functionalities more flexibly than PROSITE. Finally, Pfam contains a large collection of multiple sequence alignments and profile Hidden Markov Models covering many common protein domains. Pfam-A comprises accurate manually compiled alignments while Pfam-B is an automated clustering of the whole SWISS-PROT database. These different secondary databases have recently been incorporated into a single resource named InterPro.
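To make the idea of a "short sequence pattern" concrete, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The function names and the test sequence are illustrative assumptions, not part of PROSITE itself, and only the basic syntax is handled (x, [..], {..}, repeat counts and the < and > anchors). The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    element_re = re.compile(r"(<?)([^()<>]+)(?:\((\d+(?:,\d+)?)\))?(>?)")
    parts = []
    for element in pattern.split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = match.groups()
        if core == "x":                       # any amino acid
            regex = "."
        elif core.startswith("[") and core.endswith("]"):
            regex = core                      # one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # any residue except those listed
        elif len(core) == 1 and core.isalpha():
            regex = core                      # a single fixed residue
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if repeat:                            # x(3) or x(2,4)
            regex += "{" + repeat + "}"
        parts.append(("^" if start else "") + regex + ("$" if end else ""))
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> list:
    """Return 0-based start positions of non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_sites("N-{P}-[ST]-{P}.", "AANGSAANPTA"))  # made-up test sequence
```

A regular expression only captures the pattern part of PROSITE; profiles and the fingerprints of PRINTS need scoring schemes that this sketch does not attempt.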
Understand And Organise The Information
For raw DNA sequences, investigations involve separating coding and non-coding regions, and identification of introns, exons and promoter regions for annotating genomic DNA.
For protein sequences, analyses include developing algorithms for sequence comparisons, methods for producing multiple sequence alignments, and searching for functional domains from conserved sequence motifs in such alignments.
Investigations of structural data include prediction of secondary and tertiary protein structures, producing methods for 3D structural alignments, examining protein geometries using distance and angular measurements, calculations of surface and volume shapes and analysis of protein interactions with other subunits, DNA, RNA and smaller molecules.
The increasing availability of annotated genomic sequences has resulted in the introduction of computational genomics and proteomics: large-scale analyses of complete genomes and the proteins that they encode. Research includes characterisation of protein content and metabolic pathways between different genomes, identification of interacting proteins, assignment and prediction of gene products, and large-scale analyses of gene expression levels.
In addition to finding relationships between different proteins, much of bioinformatics involves the analysis of one type of data to infer and understand the observations for another type of data. An example is the use of sequence and structural data to predict the secondary and tertiary structures of new protein sequences. These methods, especially the former, are often based on statistical rules derived from structures, such as the propensity for certain amino acid sequences to produce different secondary structural elements. | ||||||||||||||||||||||||||||||||||||
| BIO DATABASES | ||||||||||||||||||||||||||||||||||||
|
What is Redundancy?
DATABASES
|
||||||||||||||||||||||||||||||||||||
| BIOTOOL - BLAST | ||||||||||||||||||||||||||||||||||||
|
BLAST stands for Basic Local Alignment Search Tool; it provides a method for rapid searching of nucleotide and protein databases. Since the BLAST algorithm detects local as well as global alignments, regions of similarity embedded in otherwise unrelated proteins can be detected. Both types of similarity may provide important clues to the function of uncharacterized proteins.
blastp compares an amino acid query sequence against a protein sequence database.
blastn compares a nucleotide query sequence against a nucleotide sequence database.
blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
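The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAST's own code: it assumes the standard genetic code, an input containing only A, C, G and T, and hypothetical function names. Stop codons are shown as "*" and unknown codons as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six translated strings can then be compared at the protein level, which is what allows a nucleotide query to find protein-level similarity.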
A simpler view: Protein Level Comparison
DNA Level Comparison
TYPES OF BLASTS
1. BLAST score-based single-linkage clustering (BLASTCLUST)
   1. Procedure. BLASTCLUST automatically and systematically clusters protein or DNA sequences based on pairwise matches found using the BLAST algorithm in the case of proteins, or the Mega BLAST algorithm for DNA. In the latter case a single Mega BLAST search is
   2. Input formats. The primary input format for BLASTCLUST is a FASTA-format sequence file. Each sequence should have a unique identifier (as defined by formatdb). BLASTCLUST formats this sequence set into a BLASTable database (in the directory pointed to by the environment variable TMPDIR or in the current directory), then removes the database. Instead of a FASTA file, a database prepared by formatdb with the -o option set to TRUE can be supplied as input. Another type of input is a sequence hit-list previously saved by BLASTCLUST (in this case BLASTCLUST will use pre-computed HSP data instead of making de novo comparisons). You can restrict clustering to a subset of your data by supplying an ID list file (IDs separated by spaces, tabs, newlines, commas or semicolons). This is intended for re-clustering subsets of sequences using the previously computed hit-list file.
   3. Output format. BLASTCLUST prints out clusters of sequence IDs, sorted from largest to smallest cluster (alphabetically by ID of the first sequence if of the same size), separating clusters by a newline character. Sequence identifiers within a cluster are space-separated and sorted from longest to shortest sequence (alphabetically by IDs if of the same length).
2. STAND-ALONE BLAST
Stand-Alone BLAST is for users who wish to run BLAST at their own institution. One reason to do so might be the wish to use private databases.
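The single-linkage clustering and the output ordering described for BLASTCLUST under item 1 can be sketched as follows. This is a minimal Python illustration, not BLASTCLUST's implementation: it assumes the pairwise hits have already been filtered by whatever score and coverage thresholds apply, and the function name, sequence IDs and lengths are made up for the example.

```python
def single_linkage_clusters(lengths: dict, hits: list) -> list:
    """Cluster sequence IDs so that any two IDs joined by a hit share a cluster.

    lengths maps sequence ID -> sequence length.
    hits is a list of (id_a, id_b) pairs that passed the match thresholds.
    """
    parent = {seq_id: seq_id for seq_id in lengths}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # one link is enough to merge two clusters

    groups = {}
    for seq_id in lengths:
        groups.setdefault(find(seq_id), []).append(seq_id)

    # Within a cluster: longest sequence first, ties broken alphabetically.
    clusters = [
        sorted(members, key=lambda s: (-lengths[s], s))
        for members in groups.values()
    ]
    # Between clusters: largest first, ties broken by the first ID.
    clusters.sort(key=lambda c: (-len(c), c[0]))
    return clusters


if __name__ == "__main__":
    lengths = {"a": 100, "b": 120, "c": 90, "d": 50, "e": 50}
    hits = [("a", "b"), ("b", "c"), ("d", "e")]
    for cluster in single_linkage_clusters(lengths, hits):
        print(" ".join(cluster))  # one cluster per line, IDs space-separated
```

Note that "a" and "c" end up together even though they have no direct hit; that chaining through "b" is what single linkage means.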
3. BLASTALL
Blastall may be used to perform all five flavors of blast comparison. One may obtain the blastall options by executing 'blastall -' (note the dash). A typical blastall to perform a blastn search (nucl. vs. nucl.) of a file called QUERY would be:
    blastall -p blastn -d nr -i QUERY -o out.QUERY
The output is placed into the output file out.QUERY and the search is performed
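When blastall is driven from a script, it can help to build the command line in one place. The Python sketch below uses only the four options shown in the blastall example (-p, -d, -i, -o) and the five program names listed earlier; the helper name blastall_command is hypothetical, and actually running the command assumes that blastall is installed and on the PATH.

```python
import shlex

# The five comparison programs blastall can run.
BLAST_PROGRAMS = {"blastp", "blastn", "blastx", "tblastn", "tblastx"}


def blastall_command(program: str, database: str,
                     query_file: str, output_file: str) -> list:
    """Build the argument list for a blastall search."""
    if program not in BLAST_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program!r}")
    return [
        "blastall",
        "-p", program,      # which comparison to perform
        "-d", database,     # database to search
        "-i", query_file,   # file holding the query sequence
        "-o", output_file,  # where the report is written
    ]


if __name__ == "__main__":
    cmd = blastall_command("blastn", "nr", "QUERY", "out.QUERY")
    print(shlex.join(cmd))
    # To run the search (assumes blastall is installed):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a file name contains spaces.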
4. MEGABLAST
MegaBlast implements a greedy algorithm for the DNA sequence gapped alignment search. Since MegaBlast can only work with DNA sequences, the only program it supports is Blastn. Unlike BLAST, MegaBlast is most efficient in both speed and memory requirements with non-affine gap penalties. This program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". It is up to 10 times faster than more common sequence similarity programs and therefore can
5. Reversed Position Specific BLAST (RPS-BLAST)
RPS-BLAST (Reverse PSI-BLAST) searches a query sequence against a database of profiles. This is the opposite of PSI-BLAST, which searches a profile against a database of sequences, hence the 'Reverse'. RPS-BLAST uses a BLAST-like algorithm, finding single- or double-word hits and then performing an ungapped extension on these candidate matches. If a sufficiently high-scoring ungapped alignment is produced, a gapped extension is performed and those (gapped) alignments with a sufficiently low expect value are reported. This procedure is in contrast to IMPALA, which performs a Smith-Waterman calculation between the query and
RPS-BLAST uses a BLAST database, but also has some other files that contain a precomputed lookup table for the profiles to allow the search to proceed faster. Unfortunately it was not possible to make this lookup table architecture-independent (like the BLAST databases themselves), so one cannot take an RPS-BLAST database prepared on a big-endian system (e.g., Solaris Sparc) and run it on a little-endian system (e.g., NT).
Links: BLAST 2.0 Release Notes - not complete documentation on BLAST (it does not cover RPS-BLAST), but still very useful. | ||||||||||||||||||||||||||||||||||||
| BIOTOOL - FASTA | ||||||||||||||||||||||||||||||||||||
|
The FASTA (pronounced "fast A") format is commonly used to capture sequence information, along with some primary annotation of that sequence. It is also called the "Pearson format" after its creator, William Pearson, who developed the format to be used with the FASTA alignment program. The format stipulates that a line beginning with the ">" symbol contains the sequence identifier or primary annotation. The sequence data then begins on the next line and continues until the next line starting with the ">" symbol or the end of the file. The FASTA program itself is designed to answer the question: which entries in the database are similar to my sequence?
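The format rules just described translate almost directly into code. The following Python sketch reads FASTA-formatted text into (header, sequence) pairs; the function name and the two short records are made up for illustration, and it assumes the whole input fits in memory.

```python
def parse_fasta(text: str) -> list:
    """Split FASTA-formatted text into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = """>seq1 first example record
ACGTACGT
ACGT
>seq2 second example record
TTGACA
"""
    for header, sequence in parse_fasta(example):
        print(header, len(sequence))
```

The text after ">" is kept whole as the header; splitting it into an identifier and a description is left to the caller because conventions differ between databases.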
Databases available
For our project, we use the FASTA3 package. Although there are a large number of programs in this package, they belong to three groups:
(1) "Conventional" library search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3, TFASTY3, SSEARCH3 (remember A, X and Y);
(2) Programs for searching with short fragments: FASTS3, FASTF3, TFASTS3, TFASTF3 (remember S and F);
(3) Statistical significance: PRSS3.
Programs that start with fast search protein databases, while tfast programs search translated DNA databases.
A Simpler View: Comparison programs in the FASTA3 package
|
||||||||||||||||||||||||||||||||||||
| BIOTOOL - CLUSTALW (multiple alignment) | ||||||||||||||||||||||||||||||||||||
|
Clustal W is a general purpose multiple alignment program for DNA or proteins.
1st: All pairs of sequences are aligned separately (pairwise
alignments) in order to calculate a distance matrix giving the divergence
of each pair of sequences;
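The first stage can be illustrated with a small Python sketch: align every pair of sequences, then turn each alignment into a distance. This is not Clustal W's code. It assumes a plain Needleman-Wunsch global alignment with a linear gap penalty and takes the distance to be 1 minus the fraction of identical residues over the aligned, non-gap columns; the scoring values and function names are illustrative choices.

```python
def global_align(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_distance(a: str, b: str) -> float:
    """1 - fraction of identical residues among aligned (non-gap) columns."""
    aligned_a, aligned_b = global_align(a, b)
    compared = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not compared:
        return 1.0
    identical = sum(1 for x, y in compared if x == y)
    return 1.0 - identical / len(compared)


def distance_matrix(sequences: list) -> list:
    """Symmetric matrix of pairwise distances."""
    return [[pairwise_distance(s, t) for t in sequences] for s in sequences]


if __name__ == "__main__":
    for row in distance_matrix(["ACGT", "ACGT", "ACGA"]):
        print(row)
```

A matrix like this is the input a tree-building step would need in order to decide which sequences to align first.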
Links: |
||||||||||||||||||||||||||||||||||||
| BIOTOOL - PHYLIP | ||||||||||||||||||||||||||||||||||||
|
PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees). Phylogeny is defined as "the evolutionary development of an organ or other part of an organism" (source: www.dictionary.com). PHYLIP's programs fall mainly into five groups: 1) the molecular sequence programs, 2) the distance matrix programs, 3) the gene frequency and continuous characters programs, 4) the discrete characters programs, and 5) the tree drawing programs.
1. The molecular sequence programs: estimate phylogenies from protein sequence or nucleic acid sequence data.
| ||||||||||||||||||||||||||||||||||||
| BIOTOOL - PRIMER3 | ||||||||||||||||||||||||||||||||||||
|
Primer3 picks primers for Polymerase Chain Reactions. Primer means "short pre-existing polynucleotide chain to which new deoxyribonucleotides can be added by DNA polymerase".
INPUT
By default, Primer3 reads its input and writes its output in Boulder-IO format, a text-based format used for program-to-program data interchange.
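A Boulder-IO record is a series of TAG=VALUE lines, and a line containing only "=" ends the record. The Python sketch below writes and reads such records. The helper names are hypothetical, and the tags SEQUENCE_ID and SEQUENCE_TEMPLATE are used here only as examples; the tag names Primer3 expects depend on the Primer3 version, so check its documentation before relying on them.

```python
def format_boulder_io(record: dict) -> str:
    """Serialise one record as TAG=VALUE lines followed by a lone '='."""
    lines = [f"{tag}={value}" for tag, value in record.items()]
    lines.append("=")  # record terminator
    return "\n".join(lines) + "\n"


def parse_boulder_io(text: str) -> list:
    """Parse Boulder-IO text into a list of {tag: value} dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        if line == "=":
            records.append(current)  # a lone '=' closes the current record
            current = {}
        elif "=" in line:
            tag, value = line.split("=", 1)  # values may themselves contain '='
            current[tag] = value
    return records


if __name__ == "__main__":
    # Example tags only; the sequence is made up.
    record = {
        "SEQUENCE_ID": "example",
        "SEQUENCE_TEMPLATE": "GTAGTCAGTAGACGATGACTACTGACGATGCAGACGACACACATTCTCGG",
    }
    text = format_boulder_io(record)
    print(text, end="")
    print(parse_boulder_io(text))
```

Because the format is line-based and self-delimiting, several records can be streamed to the program one after another and its output parsed the same way.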
|
||||||||||||||||||||||||||||||||||||
| BIOTOOL - WISE2/DYNAMITE | ||||||||||||||||||||||||||||||||||||
|
Wise2 is an informatics analysis package focused on comparisons of DNA sequence and protein sequence. Dynamite is a code-generating language whose main purpose is to produce efficient code for dynamic programming.
Wise2's particular forte is the comparison of DNA sequence at the level of its conceptual protein translation, even if that DNA has introns or sequencing errors present. This comparison allows the simultaneous prediction of, say, gene structure with homology-based alignment.
Wise2 has also been written with an eye for reuse and maintainability. Although it is a pure C package, you can access its functionality directly in Perl. Parts of the package (or the entire package) can be used by other C or C++ programs without name space clashes, as all externally linked variables have the unique identifier Wise2 prepended. Java and CORBA ports are being considered - see the API section. Finally, Wise2, although implemented in C, makes heavy use of the Dynamite code-generating language. | ||||||||||||||||||||||||||||||||||||
| BIOTOOL - EMBOSS | ||||||||||||||||||||||||||||||||||||
|
Introduction
Within EMBOSS you will find around 100 programs (applications). These are just some of the areas covered:
Links: EMBOSS.org -
EMBOSS's organisation web site. Contains lots of
information. |
||||||||||||||||||||||||||||||||||||
| BIOTOOL - HMMER | ||||||||||||||||||||||||||||||||||||
|
What HMMs are
What HMMs can do for you Multiple sequence alignment
Programs
|
||||||||||||||||||||||||||||||||||||
| BIOTOOL - PHRAP & PHRED | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| Some Basic BioInformatics Terms | ||||||||||||||||||||||||||||||||||||
|
algorithm - a fixed procedure embodied in a computer program. The Basic Local Alignment Search Tool or BLAST is a sequence comparison algorithm that NCBI uses to search sequence databases for optimal local alignments with a query sequence. FASTA is another type of algorithm used for database similarity searching.
Allele - Alternative form of a genetic locus; a single allele for each locus is inherited from each parent (e.g., at a locus for eye color the allele might result in blue or brown eyes).
conservation - when the substitution of one amino acid for another preserves the physicochemical properties of the original residue, for example when a hydrophobic amino acid residue is replaced by another hydrophobic residue.
E value - the number of different alignments with a score equal to or better than S that can be expected to occur simply by chance. Also referred to as the expectation value.
FASTA format - sequence format that begins with a single-line description followed by lines of sequence data. This format can be used as query input when searching with bioinformatic tools such as BLAST or Clustal W. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. Blank lines are not allowed in the middle of FASTA input. An example of a protein sequence in FASTA format is:
>GI|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
gap - A space introduced into an alignment to compensate for insertions or deletions in one sequence relative to another.
global alignment - when two nucleic acid or amino acid sequences are lined up along their entire length. See also local alignment.
homology - similarity in sequence that is based on descent from a common ancestor.
identity - the extent to which two sequences are invariant.
local alignment - the alignment of portions (rather than the entire sequence length) of two nucleic acid or amino acid sequences.
Locus (pl. loci) - The position on a chromosome of a gene or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean expressed DNA regions.
masking - the removal of repeated or low complexity regions from a sequence so that sequences are compared
orthologous - homologous sequences in different species that result from a common ancestral gene during speciation. Orthologous genes may or may not have similar functions.
paralogous - homologous sequences within a single species that are the result of gene duplication.
query - the input sequence (in FASTA format or as bare sequence data) or sequence identifier with which all the sequences in a database are compared during a BLAST search.
similarity - how related one nucleotide or protein sequence is to another. The extent of similarity between two sequences is based on the percent of sequence identity and/or conservation. | ||||||||||||||||||||||||||||||||||||
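The glossary entries for gap and local alignment can be made concrete with a short Python sketch of Smith-Waterman scoring, the exact dynamic-programming method for local alignment. The scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; BLAST and FASTA use faster heuristics and substitution matrices rather than this simple scheme.

```python
def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(
                0,                     # start a fresh local alignment here
                diagonal,              # align a[i-1] with b[j-1]
                previous[j] + gap,     # gap in b
                current[j - 1] + gap,  # gap in a
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    # The shared core ACGT is found even though the flanking regions differ.
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
    print(local_alignment_score("AAAA", "CCCC"))
```

Allowing a cell to reset to zero is what makes the alignment local: a poorly matching region is dropped instead of dragging down the score of a well-matching portion.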