My bioinformatics





BioInformatics

The World of BioInformatics

BioInformatics

(Molecular) bio – informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying “informatics techniques” (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

One of the central goals of bioinformatics is the prediction of protein function, and ultimately of structure, from the linear amino acid sequence. Given a newly determined sequence, we want to know: what is my protein? To what family does it belong? What is its function? And how can we explain its function in structural terms?

Today, although we don't yet have all the answers, we can at least begin to address some of these questions. By searching secondary databases, which house abstractions of functional and structural sites characteristic of particular proteins, we may recognise patterns that allow us to infer relationships with previously characterised families. Similarly, by searching fold libraries, which house templates of known structures, it is possible to recognise a previously characterised fold.

Aims of bioinformatics

The aims of bioinformatics are three-fold. First, at its simplest, bioinformatics organises data in a way that allows researchers to access existing information and to submit new entries as they are produced, e.g. the Protein Data Bank for 3D macromolecular structures. While data curation is an essential task, the information stored in these databases is essentially useless until analysed. Thus the purpose of bioinformatics extends far beyond mere volume control. The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterised sequences. This requires more than just a straightforward database search: programs such as FASTA and PSI-BLAST must consider what constitutes a biologically significant resemblance. Development of such resources requires extensive knowledge of computational theory, as well as a thorough understanding of biology. The third aim is to use these tools to analyse the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared them with a few that are related. In bioinformatics, we can also conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting features that are unique to some.

The Information

We start with an overview of the sources of information: these may be divided into 1) raw DNA sequences, 2) protein sequences, 3) macromolecular structures, 4) genome sequences, and 5) other whole genome data. Raw DNA sequences are strings of the four base-letters comprising genes, each typically 1,000 bases long. At the next level are protein sequences comprising strings of 20 amino acid-letters. At present there are about 300,000 known protein sequences, with a typical bacterial protein containing approximately 300 amino acids. Macromolecular structural data represents a more complex form of information. There are currently 13,000 entries in the Protein Data Bank, PDB, most of which are protein structures. A typical PDB file for a medium-sized protein contains the xyz coordinates of approximately 2,000 atoms.

Protein Sequence Database

Protein sequence databases are categorised as primary, composite or secondary. Primary databases contain over 300,000 protein sequences and function as a repository for the raw data. Some more common repositories, such as SWISS-PROT and PIR-International, annotate the sequences and describe the proteins’ functions, domain structures and post-translational modifications. Composite databases such as OWL and the NRDB compile and filter sequence data from different primary databases to produce combined non-redundant sets that are more complete than the individual databases and also include protein sequence data from the translated coding regions in DNA sequence databases. Secondary databases contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family. One of the most popular is PROSITE, a database of short sequence patterns and profiles that characterise biologically significant sites in proteins. PRINTS expands on this concept and provides a compendium of protein fingerprints – groups of conserved motifs that characterise a protein family. Motifs are usually separated along a protein sequence, but may be contiguous in 3D-space when the protein is folded. By using multiple motifs, fingerprints can encode protein folds and functionalities more flexibly than PROSITE. Finally, Pfam contains a large collection of multiple sequence alignments and profile Hidden Markov Models covering many common protein domains. Pfam-A comprises accurate manually compiled alignments while Pfam-B is an automated clustering of the whole SWISS-PROT database. These different secondary databases have recently been incorporated into a single resource named InterPro.
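
To make the idea of a PROSITE-style pattern concrete, here is a minimal sketch that turns such a pattern into a Python regular expression and scans a sequence with it. The function names are made up for this page, and only the basic syntax is handled (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > as anchors outside brackets). The pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site pattern; the test sequence is invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> .{2,4}
        element = element.replace("<", "^").replace(">", "$")   # sequence anchors
        parts.append(element)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (start, matched_text) for every match, including overlapping ones."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets matches overlap, which plain finditer would skip.
    return [(m.start(), m.group(1))
            for m in re.finditer("(?=(" + regex + "))", sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "AANGTAANPSA"))
```

A regular expression is only a stand-in here: real secondary databases also use profiles and HMMs, which score a match rather than giving a yes/no answer.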

Understand And Organise The Information

For raw DNA sequences, investigations involve separating coding and non-coding regions, and identification of introns, exons and promoter regions for annotating genomic DNA. For protein sequences, analyses include developing algorithms for sequence comparisons, methods for producing multiple sequence alignments, and searching for functional domains from conserved sequence motifs in such alignments. Investigations of structural data include prediction of secondary and tertiary protein structures, producing methods for 3D structural alignments, examining protein geometries using distance and angular measurements, calculations of surface and volume shapes and analysis of protein interactions with other subunits, DNA, RNA and smaller molecules. The increasing availability of annotated genomic sequences has resulted in the introduction of computational genomics and proteomics – large-scale analyses of complete genomes and the proteins that they encode. Research includes characterisation of protein content and metabolic pathways between different genomes, identification of interacting proteins, assignment and prediction of gene products, and large-scale analyses of gene expression levels.

In addition to finding relationships between different proteins, much of bioinformatics involves the analysis of one type of data to infer and understand the observations for another type of data. An example is the use of sequence and structural data to predict the secondary and tertiary structures of new protein sequences. These methods, especially the former, are often based on statistical rules derived from structures, such as the propensity for certain amino acid sequences to produce different secondary structural elements.

BIO DATABASES

What is Redundancy?
A key concept in comparing databases is the issue of redundancy. Many databases try to be "non-redundant". Unfortunately, biological data is too complex to fit a simple definition of redundancy. Are two alleles of the same locus redundant? Two isozymes in the same organism? The same locus in two closely related organisms? Hence, each "non-redundant" database has its own definition of redundancy. Some use automated measures, while others use manual culling; the former are amenable to large projects, the latter give higher quality. Other databases don't attempt to be non-redundant, but rather sacrifice this goal in favor of ensuring completeness.

DATABASES

Nucleotide (DNA & RNA)

nr (NCBI)
The nr nucleotide database maintained by NCBI as a target for their BLAST search services is a composite of GenBank, GenBank updates, and EMBL updates.
Non-redundant: Entries with absolutely identical sequences have been merged.
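
The simplest automated definition of redundancy, merging entries whose sequences are absolutely identical, can be sketched in a few lines. This is only an illustration of the idea, not NCBI's actual procedure; the function name and the example identifiers are hypothetical.

```python
def merge_identical(records):
    """Merge entries with exactly identical sequences.

    records: iterable of (identifier, sequence) pairs.
    Returns a list of (list_of_identifiers, sequence), in first-seen order.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # treat upper and lower case as the same sequence
        merged.setdefault(key, []).append(identifier)
    return [(ids, sequence) for sequence, ids in merged.items()]


if __name__ == "__main__":
    entries = [
        ("db1|X1", "ACGTACGT"),
        ("db2|Y7", "acgtacgt"),   # identical apart from case, so it is merged
        ("db1|X2", "ACGTACGA"),   # one base differs, so it stays separate
    ]
    for ids, sequence in merge_identical(entries):
        print(",".join(ids), sequence)
```

Note that a single differing base keeps two entries apart, which is why alleles and close homologues still appear as separate entries under this definition.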

GenBank / EMBL / DDBJ
In theory, GenBank, the EMBL Data Library, and the DNA Databank of Japan (DDBJ) are just names for the same database. In reality, small time lags in propagating data between the database centers cause minor differences between these databases. However, if one of these libraries is merged with the updates to all of these databases, a complete set of sequences is formed.
Redundant: little to no attempt is made to reduce redundancy.

Protein

nr (NCBI)
The nr protein database maintained by NCBI as a target for their BLAST search services is a composite of SwissProt, SwissProt updates, PIR, PDB. Entries with absolutely identical sequences have been merged.

SwissProt
SwissProt is maintained by Amos Bairoch at the University of Geneva. SwissProt is a highly-curated, highly-crossreferenced, non-redundant database. Unfortunately, the cost of this labor-intensive quality enhancement process is that not every sequence is in SwissProt. If you wish to look up information about a sequence, SwissProt is the first place to look.
Non-redundant: manual curation used to provide only one entry per protein product; variants are annotated in entry.
Highly-cross-referenced to other databases.

PIR
The Protein Identification Resource was originated by the late Margaret Dayhoff. It attempts to enjoy the advantages of a complete and a non-redundant database.
Non-redundant: PIR1 section contains only one entry per protein product.
Redundant: Complete database (PIR1+PIR2+PIR3) has many redundancies

PDB
The Protein Data Bank, maintained by Brookhaven National Laboratory (Long Island, New York, USA), contains all publicly available solved protein structures. Searches against the PDB can be used to ask whether any known 3D structures are similar to your query protein.
Non-redundant: Only the "best" determination of a given structure is left in the database; however, multiple structures for one molecule may exist due to other components (i.e. one entry uncomplexed, one complexed).

OWL
A non-redundant composite of four publicly available primary sources: SWISS-PROT, PIR (1-3), GenBank (translation) and NRL-3D. SWISS-PROT is the highest-priority source, all others being compared against it to eliminate identical and trivially different sequences.
Non-redundant: Automatically generated from component database

BIOTOOL - BLAST

BLAST stands for Basic Local Alignment Search Tool; it provides a method for rapid searching of nucleotide and protein databases. Since the BLAST algorithm detects local as well as global alignments, regions of similarity embedded in otherwise unrelated proteins can be detected. Both types of similarity may provide important clues to the function of uncharacterized proteins.

Blast Family of Programs

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

blastp compares an amino acid query sequence against a protein sequence database.

blastn compares a nucleotide query sequence against a nucleotide sequence database.

blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
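
The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be sketched as follows: three reading frames on the given strand plus three on its reverse complement. This is a minimal illustration assuming the standard genetic code; the function names and the frame labels (+1 to -3) are choices made for this example, and incomplete trailing codons are simply dropped.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    frames = {}
    for strand, sequence in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six protein strings is then compared at the protein level, which is why these searches cost more than a plain blastn or blastp run.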

A simpler view:

Protein Level Comparison

		DATABASE
		DNA	Protein
QUERY	DNA	tblastx	blastx
QUERY	Protein	tblastn	blastp

DNA Level Comparison

		DATABASE
		DNA	Protein
QUERY	DNA	blastn	N/A
QUERY	Protein	N/A	N/A
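
The two tables above amount to a small lookup, which can be written down directly. The sketch below is just the tables restated as a Python dictionary; the function name and the "dna"/"protein" labels are hypothetical.

```python
# (query type, database type, comparison level) -> program, as in the tables above.
BLAST_PROGRAMS = {
    ("dna", "dna", "protein"): "tblastx",
    ("dna", "protein", "protein"): "blastx",
    ("protein", "dna", "protein"): "tblastn",
    ("protein", "protein", "protein"): "blastp",
    ("dna", "dna", "dna"): "blastn",
}


def choose_blast_program(query: str, database: str, level: str = "protein") -> str:
    """Return the BLAST program for a query/database pair, or raise ValueError."""
    key = (query.lower(), database.lower(), level.lower())
    try:
        return BLAST_PROGRAMS[key]
    except KeyError:
        raise ValueError(f"no BLAST program for {key}") from None


if __name__ == "__main__":
    print(choose_blast_program("DNA", "protein"))       # DNA query, protein database
    print(choose_blast_program("DNA", "DNA", "dna"))    # plain nucleotide search
```

The N/A cells in the DNA-level table show up here as a ValueError, for example when a protein query is paired with a DNA-level comparison.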

TYPES OF BLASTS

1. BLAST score-based single-linkage clustering (BLASTCLUST)

1. Procedure

BLASTCLUST automatically and systematically clusters protein or DNA sequences based on pairwise matches found using the BLAST algorithm in the case of proteins, or the Mega BLAST algorithm for DNA. In the latter case a single Mega BLAST search is performed for all the sequences combined against a database created from the same sequences. BLASTCLUST finds pairs of sequences that have statistically significant matches and clusters them using single-linkage clustering.

2. Input formats.

The primary input format for BLASTCLUST is a FASTA-format sequence file. Each sequence should have a unique identifier (as defined by formatdb). BLASTCLUST formats this sequence set into a BLASTable database (in the directory pointed to by the environment variable TMPDIR or in the current directory), then removes the database.

Instead of a FASTA file, a database prepared by formatdb with -o option set to TRUE can be supplied as an input.

Another type of input is a sequence hit-list previously saved by BLASTCLUST (in this case BLASTCLUST will use pre-computed HSP data instead of making de novo comparisons).

You can restrict clustering to a subset of your data by supplying an ID list file (IDs separated by spaces, tabs, newlines, commas or semicolons). This is supposed to be used for re-clustering subsets of sequences using the previously computed hit-list file.

3. Output format.

BLASTCLUST prints out clusters of sequence IDs, sorted from largest to smallest cluster (alphabetically by ID of the first sequence if of the same size), separating clusters by a newline character. Sequence identifiers within a cluster are space-separated and sorted from longest to shortest sequence (alphabetically by IDs if of the same length).
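
Single-linkage clustering and the output ordering described above can be sketched as below. This is not BLASTCLUST's own code: it assumes the significant pairs have already been found by some search, and the function names and example data are made up. Two sequences end up in the same cluster as soon as a chain of significant pairs connects them.

```python
def single_linkage_clusters(lengths, significant_pairs):
    """Cluster sequence IDs by single linkage.

    lengths: dict mapping sequence ID -> sequence length.
    significant_pairs: iterable of (id_a, id_b) with a significant match.
    Returns clusters sorted as in the output format described above.
    """
    parent = {seq_id: seq_id for seq_id in lengths}

    def find(seq_id):
        # Follow parent links to the cluster representative (with path halving).
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]
            seq_id = parent[seq_id]
        return seq_id

    for id_a, id_b in significant_pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a  # one link is enough to merge two clusters

    clusters = {}
    for seq_id in lengths:
        clusters.setdefault(find(seq_id), []).append(seq_id)

    ordered = []
    for members in clusters.values():
        # Longest sequence first, then alphabetically by ID.
        members.sort(key=lambda seq_id: (-lengths[seq_id], seq_id))
        ordered.append(members)
    # Largest cluster first, then alphabetically by the ID of the first sequence.
    ordered.sort(key=lambda members: (-len(members), members[0]))
    return ordered


def format_clusters(clusters):
    """One cluster per line, IDs separated by spaces."""
    return "\n".join(" ".join(members) for members in clusters)


if __name__ == "__main__":
    lengths = {"a": 100, "b": 120, "c": 90, "d": 50, "e": 60}
    pairs = [("a", "b"), ("b", "c")]  # a and c are linked only through b
    print(format_clusters(single_linkage_clusters(lengths, pairs)))
```

In the example, a and c land in the same cluster even though they were never matched directly, which is the defining property (and the main weakness) of single linkage.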

2. STAND-ALONE BLAST

Stand-alone BLAST is for users who wish to run BLAST at their own institution. One reason to do so might be the wish to use private databases.

3. BLASTALL

Blastall may be used to perform all five flavors of blast comparison. One may obtain the blastall options by executing 'blastall -' (note the dash). A typical blastall to perform a blastn search (nucl. vs. nucl.) of a file called QUERY would be:

blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed in the file out.QUERY and the search is performed against the 'nr' database. If a protein vs. protein search is desired, 'blastn' should be replaced with 'blastp', and so on for the other programs.
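
If you run many searches, it can help to build the command line in a script. The sketch below only assembles the blastall command shown above from its parts (it does not run anything); it uses just the -p, -d, -i and -o options from the example, and the function name and the default output name are choices made for this page.

```python
import shlex

BLAST_FLAVORS = {"blastp", "blastn", "blastx", "tblastn", "tblastx"}


def blastall_command(program, database, query_file, output_file=None):
    """Return the blastall argument list for one search."""
    if program not in BLAST_FLAVORS:
        raise ValueError(f"unknown BLAST program: {program}")
    if output_file is None:
        output_file = f"out.{query_file}"  # same naming as the example above
    return ["blastall", "-p", program, "-d", database,
            "-i", query_file, "-o", output_file]


if __name__ == "__main__":
    args = blastall_command("blastn", "nr", "QUERY")
    # Quote each argument so the printed line is safe to paste into a shell.
    print(" ".join(shlex.quote(arg) for arg in args))
```

The argument list can be handed to subprocess.run on a machine where blastall is installed; keeping it as a list avoids shell quoting problems with unusual file names.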

4. MEGABLAST

MegaBlast implements a greedy algorithm for the DNA sequence gapped alignment search. Since MegaBlast can only work with DNA sequences, the only program it supports is blastn. Unlike BLAST, MegaBlast is most efficient in both speed and memory requirements with non-affine gap penalties. This program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". It is up to 10 times faster than more common sequence similarity programs and can therefore be used to swiftly compare two large sets of sequences against each other.

5. Reverse Position-Specific BLAST (RPS-BLAST)

RPS-BLAST (Reverse PSI-BLAST) searches a query sequence against a database of profiles. This is the opposite of PSI-BLAST, which searches a profile against a database of sequences, hence the 'Reverse'. RPS-BLAST uses a BLAST-like algorithm, finding single- or double-word hits and then performing an ungapped extension on these candidate matches. If a sufficiently high-scoring ungapped alignment is produced, a gapped extension is performed and those (gapped) alignments with a sufficiently low expect value are reported. This procedure is in contrast to IMPALA, which performs a Smith-Waterman calculation between the query and each profile, rather than using a word-hit approach to identify matches that should be extended.

RPS-BLAST uses a BLAST database, but also has some other files that contain a precomputed lookup table for the profiles to allow the search to proceed faster. Unfortunately it was not possible to make this lookup table architecture-independent (like the BLAST databases themselves), so one cannot take an RPS-BLAST database prepared on a big-endian system (e.g., Solaris Sparc) and run it on a little-endian system (e.g., NT).
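
Why byte order matters for a binary lookup table can be seen with a single 32-bit integer. The sketch below says nothing about the real RPS-BLAST file layout; it only shows, using Python's struct module, that the same four bytes give a different number when read with the wrong byte order. The helper name is hypothetical.

```python
import struct

value = 0x0A0B0C0D

big_endian_bytes = struct.pack(">I", value)     # most significant byte first
little_endian_bytes = struct.pack("<I", value)  # least significant byte first


def read_uint32(raw: bytes, byte_order: str) -> int:
    """Read one unsigned 32-bit integer; byte_order is '>' (big) or '<' (little)."""
    return struct.unpack(byte_order + "I", raw)[0]


if __name__ == "__main__":
    print(big_endian_bytes.hex(), little_endian_bytes.hex())
    # Reading big-endian data as if it were little-endian scrambles the value.
    print(hex(read_uint32(big_endian_bytes, ">")))
    print(hex(read_uint32(big_endian_bytes, "<")))
```

A file format that fixes one byte order on disk avoids the problem; a table dumped straight from memory inherits the byte order of the machine that wrote it.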

Links:

BLAST 2.0 Release Notes - not complete documentation on BLAST (RPS-BLAST is not covered), but still very useful.
BLAST: Further Information - a further, simple explanation of the individual programs.

BIOTOOL - FASTA

The FASTA (pronounced "fast A") format is commonly used to capture sequence information, along with some primary annotation of that sequence. It is also called the "Pearson format" after its creator, William Pearson, who developed the format to be used with the FASTA alignment program. The format stipulates that a line beginning with the ">" symbol contain the sequence identifier or primary annotation. The sequence data then begins on the next line and continues until it concludes with another line starting with the ">" symbol.
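
The format rules above are simple enough to parse by hand. Here is a minimal sketch of a FASTA reader following exactly those rules (a ">" line starts a record, sequence lines follow until the next ">" line); the function name and the example records are invented, and real files may need extra checks.

```python
def parse_fasta(text: str):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines rather than failing
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 first example\nACGTAC\nGTAC\n>seq2\nTTGACA\n"
    for header, sequence in parse_fasta(example):
        print(header, len(sequence), sequence)
```

Sequence lines belonging to one record are joined, so the line width used in the file does not affect the result.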

FASTA is designed to answer the question: which entries in the database are similar to my sequence?

Databases available

Nucleotide sequence databases:

EMBL database,
GenBank database,

Protein sequence databases:

Swiss-Prot database,
PIR/NBRF database.

For our project, we use the FASTA3 package. Although there are a large number of programs in this package, they belong to three groups:

(1) "Conventional" Library search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3, TFASTY3, SSEARCH3; (remember A, X and Y)

(2) Programs for searching with short fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (remember S and F)

(3) Statistical significance: PRSS3.

Programs whose names start with "fast" search protein databases, while the "tfast" programs search translated DNA databases.

A Simpler View:

Comparison programs in the FASTA3 package

Program	Query sequence	Database type	Remarks
fasta3	Protein/DNA	Protein/DNA
fastx3/fasty3	DNA	Protein	fastx3 uses a simpler, faster algorithm for alignments that allows frameshifts only between codons; fasty3 is slower but produces better alignments with poor quality sequences because frameshifts are allowed within codons.
tfastx3/tfasty3	Protein	DNA	Calculate similarities allowing for frameshifts; preferred over tfasta3 for that reason.
tfasta3	Protein	DNA	Calculates similarities without frameshifts.
fastf3/tfastf3	Ordered peptide	Protein/DNA	The ordered peptide mixture is obtained by Edman degradation of a CNBr cleavage of a protein. It is compared against a protein (fastf3) or DNA (tfastf3) database.
fasts3/tfasts3	Short peptide	Protein/DNA	The short peptides are obtained from mass-spec. analysis of a protein. They are compared against a protein (fasts3) or DNA (tfasts3) database.
ssearch3	Protein/DNA	Protein/DNA	Is 10 times slower than fasta3 but is more sensitive for full-length protein sequence comparison.

BIOTOOL - CLUSTALW (multiple alignment)

Clustal W is a general purpose multiple alignment program for DNA or proteins.

Multiple alignments are carried out in 3 stages:

1st: All pairs of sequences are aligned separately (pairwise alignments) in order to calculate a distance matrix giving the divergence of each pair of sequences;
2nd: A guide tree (like a phylogenetic tree) is constructed from the distance matrix;
3rd: The sequences are progressively aligned according to the branch order in the guide tree.
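
The first two stages can be sketched in miniature. This toy version makes two loud assumptions: it scores each pair with Python's difflib.SequenceMatcher instead of a real pairwise alignment, and it builds the tree by repeatedly joining the closest pair of clusters (average linkage) rather than with the neighbour-joining method that Clustal W uses. The third stage, the progressive alignment itself, is left out. All names and sequences are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(sequences):
    """Stage 1: distance = 1 - similarity ratio for every pair of sequences."""
    distances = {}
    for name_a, name_b in combinations(sorted(sequences), 2):
        ratio = SequenceMatcher(None, sequences[name_a], sequences[name_b],
                                autojunk=False).ratio()
        distances[frozenset((name_a, name_b))] = 1.0 - ratio
    return distances


def guide_tree(sequences):
    """Stage 2: join the closest clusters until one nested tuple remains."""
    distances = pairwise_distances(sequences)
    # Each tree (a name or a nested tuple) maps to the list of its leaf names.
    trees = {name: [name] for name in sorted(sequences)}
    while len(trees) > 1:
        best = None
        for (tree_a, leaves_a), (tree_b, leaves_b) in combinations(trees.items(), 2):
            total = sum(distances[frozenset((a, b))]
                        for a in leaves_a for b in leaves_b)
            average = total / (len(leaves_a) * len(leaves_b))
            if best is None or average < best[0]:
                best = (average, tree_a, tree_b)
        _, tree_a, tree_b = best
        trees[(tree_a, tree_b)] = trees.pop(tree_a) + trees.pop(tree_b)
    return next(iter(trees))


if __name__ == "__main__":
    example = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "TTTTGGGGCC"}
    print(guide_tree(example))
```

The nested tuple gives the order of the progressive alignment: the innermost pair is aligned first, and each enclosing level adds a sequence or a group to the growing alignment.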

Links:

CLUSTALW tutorial

README for Clustal W

BIOTOOL - PHYLIP

PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees). Phylogeny means "the evolutionary development of an organ or other part of an organism" (source: www.dictionary.com). PHYLIP's programs fall into five main groups: 1) the molecular sequence programs, 2) the distance matrix programs, 3) the gene frequency and continuous characters programs, 4) the discrete characters programs, and 5) the tree drawing programs.

1. The molecular sequence programs: estimate phylogenies from protein sequence or nucleic acid sequence data.
2. The distance matrix programs: deal with data which comes in the form of a matrix of pairwise distances between all pairs of taxa, such as distances based on molecular sequence data, gene frequency genetic distances, amounts of DNA hybridization, or immunological distances.
3. The gene frequency and continuous characters programs: use gene frequencies and quantitative character values.
4. The discrete characters programs: are intended for the use of morphological systematists who are dealing with discrete characters, or by molecular evolutionists dealing with presence-absence data on restriction sites.
5. The tree drawing programs: are interactive tree-plotting programs that take a tree description in a file and read it, and then let you interactively make various settings and then plot the tree on a laser printer, plotter, or dot matrix printer.

BIOTOOL - PRIMER3

Primer3 picks primers for Polymerase Chain Reactions. Primer means "short pre-existing polynucleotide chain to which new deoxyribonucleotides can be added by DNA polymerase".

INPUT

By default, Primer3 accepts input and produces output in Boulder-io format, a text-based input/output format used as a program-to-program data interchange format.
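
Boulder-io is a plain TAG=VALUE format, one pair per line, with each record ended by a line containing only "=". A minimal sketch of reading and writing such records is shown below. The function names are hypothetical, and the tag names in the example (SEQUENCE_ID, SEQUENCE_TEMPLATE) are given for illustration only; the tags accepted depend on the Primer3 version, so check its documentation.

```python
def parse_boulder_io(text: str):
    """Return a list of records, each a dict of TAG -> VALUE."""
    records, current = [], {}
    for line in text.splitlines():
        if line == "=":           # a lone "=" terminates the current record
            records.append(current)
            current = {}
        elif line:
            tag, _, value = line.partition("=")  # split at the first "=" only
            current[tag] = value
    return records


def format_boulder_io(record: dict) -> str:
    """Serialise one record, including its terminating "=" line."""
    return "".join(f"{tag}={value}\n" for tag, value in record.items()) + "=\n"


if __name__ == "__main__":
    text = "SEQUENCE_ID=example\nSEQUENCE_TEMPLATE=ACGTACGT\n=\n"
    records = parse_boulder_io(text)
    print(records)
    print(format_boulder_io(records[0]), end="")
```

Because values are split at the first "=" only, a value may itself contain "=" characters without confusing the parser.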

BIOTOOL - WISE2/DYNAMITE

Wise2 is an informatics analysis software package focused on comparisons of DNA sequence and protein sequence. Dynamite is a code generating language whose main purpose is to produce efficient code for dynamic programming.

Wise2's particular forte is the comparison of DNA sequence at the level of its conceptual protein translation, even if that DNA contains introns or sequencing errors. This comparison allows the simultaneous prediction of, say, gene structure with homology-based alignment. Wise2 has also been written with an eye for reuse and maintainability. Although it is a pure C package, you can access its functionality directly in Perl. Parts of the package (or the entire package) can be used by other C or C++ programs without name space clashes, as all externally linked variables have the unique identifier Wise2 prepended. Java and CORBA ports are being considered - see the API section. Finally, Wise2, although implemented in C, makes heavy use of the Dynamite code generating language.

BIOTOOL - EMBOSS

Introduction
EMBOSS stands for "The European Molecular Biology Open Software Suite". EMBOSS is a new, free Open Source software analysis package specially developed for the needs of the molecular biology user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.

The EMBOSS suite
Provides a comprehensive set of sequence analysis programs (approximately 100)

Provides a set of core software libraries (AJAX and NUCLEUS)

Integrates other publicly available packages

Encourages the use of EMBOSS in sequence analysis training.

Encourages developers elsewhere to use the EMBOSS libraries.

Supports all common Unix platforms including Linux, Digital Unix, Irix, Tru64Unix and Solaris.

Within EMBOSS you will find around 100 programs (applications). These are just some of the areas covered:

Sequence alignment

Rapid database searching with sequence patterns

Protein motif identification, including domain analysis

Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats.

Codon usage analysis for small genomes

Rapid identification of sequence patterns in large scale sequence sets.

Presentation tools for publication

Links:

EMBOSS.org - EMBOSS's organisation web site. Contains lots of information.
EMBOSS - The EMBOSS Administrators Guide.

BIOTOOL - HMMER

What HMMs are
Hidden Markov models (HMMs) are statistical models of the primary structure consensus of a sequence family.

What HMMs can do for you

Multiple sequence alignment
HMMs can be ``trained'' by a learning procedure, given unaligned example sequences. Typically this is somewhat less effective than building an HMM from a trusted, structurally-based alignment. However, HMM alignments can be quite accurate. This package includes a simulated annealing procedure for training HMMs which has produced alignments quite close to several trusted structural alignments.
Database searching
The real power of HMMs is their sensitivity in database searches. This package includes software for several different searching tasks. The most useful search programs have been the local alignment programs hmmsw and hmmfs.

Programs
There are currently nine programs included in the package, namely:

hmmalign           Align sequences to an existing model.
hmmbuild           Build a model from a multiple sequence alignment.
hmmcalibrate       Take an HMM and empirically determine parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values).
hmmconvert         Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.
hmmemit          Emit sequences probabilistically from a profile HMM.
hmmfetch         Get a single model from an HMM database.
hmmindex         Index an HMM database.
hmmpfam         Search an HMM database for matches to a query sequence.
hmmsearch      Search a sequence database for matches to an HMM.

HMMER also provides a number of utility programs which are not HMM programs, but may be useful. These programs are from the SQUID sequence utility library that HMMER uses:

afetch        Retrieve an alignment from an alignment database
alistat         Show some simple statistics about a sequence alignment file.
seqstat      Show some simple statistics about a sequence file.
sfetch        Retrieve a (sub-)sequence from a sequence file.
shuffle       Randomize sequences in a sequence file.
sreformat     Reformat a sequence file into a different format.

BIOTOOL - PHRAP & PHRED

Some Basic BioInformatics Terms

algorithm - a fixed procedure embodied in a computer program. The Basic Local Alignment Search Tool or BLAST is a sequence comparison algorithm that NCBI uses to search sequence databases for optimal local alignments with a query sequence. FASTA is another type of algorithm used for database similarity searching.

Allele - Alternative form of a genetic locus; a single allele for each locus is inherited from each parent (e.g., at a locus for eye color the allele might result in blue or brown eyes).

conservation - when the substitution of one amino acid for another preserves the physicochemical properties of the original residue, for example when a hydrophobic amino acid residue is replaced by another hydrophobic residue.

E value - the number of different alignments with a score equal to or better than S that can be expected to occur simply by chance. Also referred to as the expectation value.
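
For local alignment searches the E value is usually computed from Karlin-Altschul statistics as E = K * m * n * exp(-lambda * S), where m is the query length, n the total database length and S the raw score. A minimal sketch follows; the function name is hypothetical, and K and lambda depend on the scoring matrix and gap costs, so the numbers used in the example are placeholders rather than real BLAST parameters.

```python
import math


def expect_value(score, query_length, database_length, lam, k):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return k * query_length * database_length * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameters, chosen only to show how E falls as the score rises.
    for score in (30, 50, 70):
        e = expect_value(score, query_length=300, database_length=1_000_000,
                         lam=0.3, k=0.1)
        print(score, f"{e:.3g}")
```

The formula shows why the same score is less impressive against a bigger database: E grows in proportion to m and n but falls exponentially with the score.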

FASTA format - a sequence format that begins with a single-line description followed by lines of sequence data. This format can be used as query input for bioinformatics tools such as BLAST or Clustal W. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. Blank lines are not allowed in the middle of FASTA input.

An example of a protein sequence in FASTA format is:

>GI|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP

gap - A space introduced into an alignment to compensate for insertions or deletions in one sequence relative to another

global alignment - when two nucleic acid or amino acid sequences are lined up along their entire length. See also local alignment

homology - similarity in sequence that is based on descent from a common ancestor

identity - the extent to which two sequences are invariant

local alignment - the alignment of portions (rather than the entire sequence length) of two nucleic acid or amino acid sequences
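
The difference between global and local alignment shows up clearly in the dynamic programming used to score them. The sketch below computes only the best score, not the alignment itself, in the style of Needleman-Wunsch (global) and Smith-Waterman (local). The scoring values (match +1, mismatch -1, gap -2) and the function name are arbitrary choices for illustration.

```python
def alignment_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-2):
    """Best global (default) or local alignment score with a linear gap cost."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the ends of the sequences too.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,   # align the two residues
                       score[i - 1][j] + gap,        # gap in seq_b
                       score[i][j - 1] + gap)        # gap in seq_a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    a, b = "AAAACGTAAAA", "CCCCGTCCC"
    print("global:", alignment_score(a, b))
    print("local: ", alignment_score(a, b, local=True))
```

In the example the two sequences share only the short region CGT: the local score rewards that region alone, while the global score is dragged down by the mismatched flanks.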

Locus (pl. loci) - The position on a chromosome of a gene or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean expressed DNA regions.

masking - the removal of repeated or low complexity regions from a sequence so that only the remaining, more informative regions are compared

orthologous - homologous sequences in different species that result from a common ancestral gene during speciation. Orthologous genes may or may not have similar functions.

paralogous - homologous sequences within a single species that are the result of gene duplication

query - the input sequence (in FASTA format or as bare sequence data) or sequence identifier with which all the sequences in a database are compared during a BLAST search

similarity - how related one nucleotide or protein sequence is to another. The extent of similarity between two sequences is based on the percent of sequence identity and/or conservation.
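
Percent identity, the simpler of the two measures mentioned above, can be computed directly from a pair of aligned sequences. In this sketch the function name is hypothetical, "-" is assumed to mark gaps, and columns containing a gap are left out of the count; other conventions for the denominator exist, so percentages from different programs are not always comparable.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of gap-free alignment columns in which both residues agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a.upper(), aligned_b.upper())
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Five gap-free columns, four of them identical.
    print(percent_identity("ACGT-A", "ACCTTA"))
```

Measuring conservation as well would need a table saying which substitutions count as conservative, for example a grouping of residues by physicochemical properties.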
