Software Architecture

Original Approach:

Our first approach was to design a Java program that linked directly to NCBI's BLAST search engine and submitted queries to that site. The software would then parse the results returned by the BLAST searches and feed them, in a readable form, into our matrix algorithm. The matrix algorithm calculates distances relative to a source and a target organism and outputs the data graphically as a "tree". We chose a graphical tree because our focus was ease of use and readability for the end user.

Modified Approach:

The problem with the original approach was the difficulty of designing a web interface that interacts with NCBI's BLAST site. At the suggestion of Prof. Arkin, we decided instead to run our own BLAST searches locally, downloading the necessary genome data and the BLAST source code from the internet.

Application Design:

The design is modular and divided into the following portions
(details of each are discussed in the PowerPoint presentation):

  1. The design allows components to be reused in different applications with minor changes
  2. The design allows individual subroutines to be used recursively to generate the desired results with minimal changes

Input Description:

The input portion contains HTML/CGI files for user input and a Perl program that runs BLAST and parses the results.

1. The HTML/CGI part has two files: blast.htm and blast.cgi.

http://sahara.lbl.gov/~lyan/cgi-bin/blast.htm

The web interface above allows the user to enter a pathway name, organism names, protein names, multiple protein sequences, and a threshold. blast.cgi saves this information into five files: Pathwayname, org.file.txt, Proteinnames, Proteinsequences, and Threshold. It then runs the bp.pl9 program.
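The hand-off from the form to the five files can be sketched as follows. This is only an illustration in Java, not the original blast.cgi, which is not shown here. The class name FormInputWriter, the method save, and the sample field values are hypothetical; only the five file names come from the description above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: mimics the "one form field per file" hand-off described above.
class FormInputWriter {

    // Writes each form field to its own file inside dir and returns the paths written.
    static Map<String, Path> save(Path dir, String pathway, String organisms,
                                  String proteinNames, String proteinSequences,
                                  String threshold) throws IOException {
        // File names are taken from the text; insertion order is preserved.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Pathwayname", pathway);
        fields.put("org.file.txt", organisms);
        fields.put("Proteinnames", proteinNames);
        fields.put("Proteinsequences", proteinSequences);
        fields.put("Threshold", threshold);

        Files.createDirectories(dir);
        Map<String, Path> written = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            Path target = dir.resolve(e.getKey());
            Files.write(target, e.getValue().getBytes(StandardCharsets.UTF_8));
            written.put(e.getKey(), target);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Sample values are made up for illustration only.
        Path dir = Files.createTempDirectory("blast-input");
        Map<String, Path> files = save(dir, "example pathway", "organismA\norganismB",
                "wg", ">wg\nMDISYIF", "1e-5");
        System.out.println(files.keySet());
    }
}
```

Keeping one field per file means the downstream program only needs to read plain text files and does not have to know anything about the web form.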

2. Protein databases and BLAST tools (e.g., blastall, fastacmd) for 5 organisms are located in /usr2/people/lyan/public_html. These databases and tools are used by "bp.pl9", which does the following:

  1. Get the threshold value from the first command line argument
  2. Divide the Proteinsequences file into individual sequence files, taking the name of each protein (for example, wg.seq) from "Proteinnames". These are the first seed genes
  3. For each gene (e.g., wg.seq), it will do the following:
  4. Run the remaining first seed genes one by one
  5. The final result will be several *_out files with names saved in gabe.txt

Output Display Description:

The output portion contains five classes in four files: a) intree, b) dphy, c) Orthotable, d) dismat2 and e) newtree. Each class is stored in its own file except dphy, which is saved as part of newtree.

  1. Intree.java reads a text file containing the output of the clustering algorithm and extracts the organisms, names and distances needed to draw the phylogenetic tree.
  2. dphy creates a JPanel and contains the method that draws the phylogenetic tree on that panel.
  3. Dismat2.java takes organism names and their distances from Metric.java and paints a graphical display of the distance matrix in a second JPanel.
  4. Orthotable.java takes organism names and a hashmap of vectors of orthologs in the different organisms from Metric.java and counts the number of orthologs to the seed genes in each organism. The counts are displayed in a JTable in a third JPanel.
  5. Newtree.java is the main display class. It calls methods from all the other classes and creates a tabPane containing the three JPanels above inside a JFrame.

Newtree is then called by Phylogenetics to display the desired results.
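The counting step performed by Orthotable can be sketched as follows. This is a minimal illustration, not the original Orthotable.java. It assumes that the hashmap is keyed by organism name and that each value is the list of orthologs found in that organism; the class OrthologCounter, its method names, and the sample identifiers are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the ortholog-counting step.
class OrthologCounter {

    // Returns, for each organism in the given order, how many orthologs were found.
    // Organisms missing from the map are reported with a count of 0.
    static Map<String, Integer> countPerOrganism(List<String> organisms,
                                                 Map<String, List<String>> orthologs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String organism : organisms) {
            List<String> found = orthologs.get(organism);
            counts.put(organism, found == null ? 0 : found.size());
        }
        return counts;
    }

    // Converts the counts into rows of {organism, count}, a shape that a
    // table component can display directly.
    static Object[][] toTableRows(Map<String, Integer> counts) {
        Object[][] rows = new Object[counts.size()][2];
        int i = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            rows[i][0] = e.getKey();
            rows[i][1] = e.getValue();
            i++;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample data is made up for illustration only.
        Map<String, List<String>> orthologs = new HashMap<>();
        orthologs.put("organismA", Arrays.asList("geneA1", "geneA2"));
        orthologs.put("organismB", Arrays.asList("geneB1"));
        Map<String, Integer> counts =
                countPerOrganism(Arrays.asList("organismA", "organismB", "organismC"), orthologs);
        for (Object[] row : toTableRows(counts)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

The rows returned by toTableRows could be handed to the Swing constructor JTable(Object[][] rowData, Object[] columnNames) to produce the kind of table described above.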
