Program Description: bas2prot - 
Program for converting a file of nucleotide bases into a file of likely proteins

Algorithm:
The program asks the user for input parameters (see below), then loops over every character in the input file. For each character read, the program will:
1. Check whether it is one of A, T, G, or C; any other character is ignored and reading continues.
2. Append the character to two 3-base buffers. Each buffer thus holds the last 3 bases read: one in the order read, the other in reverse order (so if the last 3 bases read were "ATG", the second buffer would contain "GTA").
3. Look up the 3 bases in each buffer against a table of base triplets to get the resulting amino acid character. For efficiency, the three characters are packed into three bytes of a long integer and the lookup is done in one step.
4. Append the two amino acid characters to the current pair of the 6 frame buffers. Buffers 4 through 6 are added to from the front, so they are kept in the reverse of the order read.
5. Cycle the numbers of the current frames (so if the last amino acid characters were added to frames 3 and 6, the next are added to frames 1 and 4).
6. If the character added to a frame marks a stop codon (TAA / TAG / TGA), check whether the protein completed this way should be written to the output file:
a) Search for a sequence beginning with a methionine ('M'), running to the end of the protein, and not containing any specifically excluded amino acids.
b) If no such sequence is found, or if the resulting sequence has fewer than the specified minimum number of amino acids, discard the protein without writing it.
c) Slide a 15 amino acid "window" from the beginning of the sequence for as many steps as specified. At each step, sum the weights of the 15 amino acids in the window, taken from the specified table (eukaryotic or prokaryotic), and remember the maximum weight found over all steps.
d) If the maximum weight found is greater than the minimum specified to write a protein, write the protein sequence to the output file. Otherwise return to step a) and look further along the sequence.
e) In any case, discard the protein from the frame in memory, and start recording a new protein sequence in that frame.
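Steps 2 and 5 can be illustrated with a small sketch. This is not the program's actual code: the Reader structure and the reader_* function names are hypothetical and exist only to show how the two 3-base buffers and the frame numbers might be maintained.

```c
#include <string.h>

/* Hypothetical state for steps 2 and 5; all names are illustrative only. */
typedef struct {
    char fwd[4];   /* last 3 bases in the order read, NUL-terminated   */
    char rev[4];   /* the same 3 bases in reverse order, NUL-terminated */
    int  count;    /* number of legal bases pushed so far               */
    int  frame;    /* current forward frame, 1..3                       */
} Reader;

static void reader_init(Reader *r)
{
    memset(r, 0, sizeof *r);
    r->frame = 1;
}

/* Step 2: push one base into both buffers. */
static void reader_push(Reader *r, char base)
{
    /* Forward buffer: drop the oldest base, append the newest at the end. */
    r->fwd[0] = r->fwd[1];
    r->fwd[1] = r->fwd[2];
    r->fwd[2] = base;
    /* Reverse buffer: shift right, put the newest base at the front. */
    r->rev[2] = r->rev[1];
    r->rev[1] = r->rev[0];
    r->rev[0] = base;
    r->count++;
}

/* Step 5: forward frames cycle 1 -> 2 -> 3 -> 1. */
static void reader_cycle(Reader *r)
{
    r->frame = r->frame % 3 + 1;
}

/* The reverse frame paired with the current forward frame (4..6). */
static int reader_reverse_frame(const Reader *r)
{
    return r->frame + 3;
}
```

In this sketch the buffers only hold a complete triplet once count has reached 3; a caller would skip the lookup of step 3 until then.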
Upon reaching the end of the input file, the program will act as if all six of the frames being read were terminated with a stop codon, considering any sequences in the buffers as completed proteins.
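Whether a protein is completed by a stop codon or by the end of the input file, it goes through the candidate search of step 6a. The sketch below shows one possible way to do that search; find_start and its parameters are hypothetical names, and the excluded amino acids are assumed to arrive as a plain string of letters.

```c
#include <stddef.h>

/* Returns the index of the first 'M' at or after `from` such that
 * prot[index .. len-1] contains no excluded amino acid, or -1 if there
 * is no such position. `excluded` is a NUL-terminated string of letters. */
static long find_start(const char *prot, size_t len, size_t from,
                       const char *excluded)
{
    unsigned char banned[256] = {0};
    size_t i, first_ok = from;

    /* Build a lookup table so each residue is tested in one step. */
    for (; *excluded != '\0'; ++excluded)
        banned[(unsigned char)*excluded] = 1;

    /* A valid start must lie after the last excluded residue, because
     * the candidate sequence runs to the end of the protein. */
    for (i = len; i > from; --i) {
        if (banned[(unsigned char)prot[i - 1]]) {
            first_ok = i;
            break;
        }
    }

    for (i = first_ok; i < len; ++i) {
        if (prot[i] == 'M')
            return (long)i;
    }
    return -1;
}
```

The `from` argument models step d): when a candidate fails the weight test, the search can be repeated starting one position past the previous 'M'.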
Though no formal performance measurement has been done, the program should be fairly efficient: all of its decisions are made by table lookups and not sequential comparisons, and it only scans the input file once to produce all the output. 
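As an illustration of a single-step codon lookup, here is a minimal sketch. It uses a different technique from the one described in step 3: rather than packing three characters into a long integer, it maps each base to 2 bits and indexes a 64-entry table of the standard genetic code. The names CODON_TABLE, base_code and translate are hypothetical, '*' is assumed to mark a stop codon, and '?' is an assumed marker for an invalid triplet.

```c
#include <stddef.h>

/* Standard genetic code, indexed with bases ordered T, C, A, G
 * (first base is the most significant). '*' marks a stop codon. */
static const char CODON_TABLE[65] =
    "FFLLSSSSYY**CC*W"   /* T.. */
    "LLLLPPPPHHQQRRRR"   /* C.. */
    "IIIMTTTTNNKKSSRR"   /* A.. */
    "VVVVAAAADDEEGGGG";  /* G.. */

/* Maps a base character to its 2-bit code, or -1 if it is not a base. */
static int base_code(char b)
{
    switch (b) {
    case 'T': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
    }
}

/* Translates one triplet to an amino acid character, '*' for a stop
 * codon, or '?' if any of the three characters is not A, T, G or C. */
static char translate(const char codon[3])
{
    int a = base_code(codon[0]);
    int b = base_code(codon[1]);
    int c = base_code(codon[2]);

    if (a < 0 || b < 0 || c < 0)
        return '?';
    return CODON_TABLE[(a << 4) | (b << 2) | c];
}
```

For example, translate("ATG") yields 'M', and translate("TAA"), translate("TAG") and translate("TGA") all yield '*'.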

Usage & Interface:
The program has a text-only interface. It asks the user for seven program parameters and file names, and then parses the input file, reporting to the screen every 1000 bases the number of bases read and proteins written.  For a long input file, the program may be left to run on its own - it is only interactive at start-up.
The program parameters, in the order requested, are:
1. Name of input file (including path)?
2. Name of output file (including path)?
3. Weight table to be used: eukaryotic(0) or prokaryotic(1)?
4. Minimum weight to write a protein (0 to 774)?
5. Maximum number of times to slide weighing window?
6. Minimum length (in amino acids) to write a protein?
7. Any amino acids to be excluded from database (especially W, P, N, R, K, D, or C)?
There is no on-line help, so the user should already understand the use and purpose of the program. The program will, however, echo the parameters as they are input, and will try to describe any error condition it encounters.
The input file should contain the bases to be scanned.  Any characters other than A, T, G, or C in the file will be ignored, so files containing spaces or line breaks are acceptable.  The output file will contain the amino acids of the proteins found, up to 60 per line, each protein preceded by a blank line and a header line with the format:
> <frame> / <position in input> (<length> amino acids long, weight <weight>, from position <window start in protein>) 
The weight table parameter selects which table of signal sequence weights from von Heijne's paper is used to decide whether to write an amino acid sequence as a protein.  If the summed table weights of the first 15 amino acids in a sequence starting with a methionine ('M') are greater than the minimum weight parameter specified, then the protein starting with that sequence is written to the output file. If the weighing window is allowed to slide 1 time, then the 2nd through 16th amino acids in the protein sequence will also be considered, and so forth.  Finally, any proteins that are shorter than the specified number of amino acids, or that contain excluded amino acids, will not be written.
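The sliding window can be sketched as follows. This is an illustration under stated assumptions, not the program's weigh() function: it assumes the weight table is position-specific, stored as a flat array of 15 x 26 entries indexed by window position and amino acid letter, and it does not reproduce the actual weights from von Heijne's paper. The names WINDOW and max_window_weight are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define WINDOW 15   /* window length in amino acids */

/* Slides a WINDOW-long window from the start of prot, at most
 * max_slides steps, and returns the largest summed weight found.
 * table holds WINDOW * 26 entries: table[p * 26 + (letter - 'A')]
 * is the assumed weight of that letter at window position p.
 * If best_start is not NULL it receives the offset of the best window.
 * Returns INT_MIN if the sequence is too short for even one window. */
static int max_window_weight(const char *prot, size_t len,
                             const short *table,
                             unsigned long max_slides, size_t *best_start)
{
    int best = INT_MIN;
    size_t s, p;

    for (s = 0; s <= max_slides && s + WINDOW <= len; ++s) {
        int sum = 0;
        for (p = 0; p < WINDOW; ++p) {
            char c = prot[s + p];
            if (c >= 'A' && c <= 'Z')   /* other characters weigh 0 */
                sum += table[p * 26 + (size_t)(c - 'A')];
        }
        if (sum > best) {
            best = sum;
            if (best_start != NULL)
                *best_start = s;
        }
    }
    return best;
}
```

A caller would write the protein only if the returned value is greater than the minimum weight parameter. With max_slides set to 0, only the first 15 amino acids are weighed; with 1, the 2nd through 16th are considered as well.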

Coding Specifics:
The source code for the program was written mostly in ANSI C, borrowing only the "//" comment style from C++. As such, it should be able to compile under most C/C++ compilers and operating systems without modification. The specific executable was compiled with Borland Turbo C++ for Windows 3.1, and has been run under Windows 3.1 and on a Power Macintosh running SoftWindows. Simple possible improvements include: compiling under a Win32 compiler to enable the executable to handle long file names; and enabling the program to take command line arguments, to more conveniently run from a batch file or script, rather than interactively.
The source code consists of one file, bas2amin.cpp, which includes the eu- and prokaryotic weight tables, and seven functions:
1. char pair(char base): returns 'A' on input 'T', 'C' on input 'G' and vice versa
2. char lookup(char base[]): returns amino acid character for input base triplet
3. int getBase(FILE *in): returns next legal base character, ignoring others
4. char add(char amin, char frame): adds amino acid to frame, writes protein if stop codon
5. void writeProtein(...): writes a sequence to the output file if it meets the global criteria
6. short weigh(const char *prot, unsigned long len): returns weight of protein from table
7. main(): gets global parameters, runs main loop reading bases from input file
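
To make the first and third entries concrete, here is a minimal sketch of what functions like pair() and getBase() might look like. These are not taken from the source file; the _sketch suffixes mark them as illustrations. Returning unknown characters unchanged from pair_sketch, and treating only upper-case letters as legal bases, are assumptions.

```c
#include <stdio.h>

/* Returns the complementary base: A <-> T and C <-> G.
 * Any other character is returned unchanged (an assumption). */
static char pair_sketch(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return base;
    }
}

/* Reads characters from `in` until a legal base (A, T, G or C) is
 * found and returns it; returns EOF when the input is exhausted. */
static int get_base_sketch(FILE *in)
{
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == 'A' || c == 'T' || c == 'G' || c == 'C')
            return c;
    }
    return EOF;
}
```

Because everything else is skipped, an input such as "A c", a line break, then "TNG" would yield the bases A, T and G in this sketch.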

George Ruban
February 15, 1997
