Scripts

-- last updated 31 October 2006 --

UNDER CONSTRUCTION



GenBank to supermatrices pipeline

To study how well the sequence data in GenBank can resolve a tree for a clade central to my research, the Papilionoid legumes, I developed a "pipeline" in collaboration with Mike Sanderson and the Phylota project.  See the paper (McMahon and Sanderson, 2006, Phylogenetic supermatrix analysis of GenBank Sequences from 2228 Papilionoid Legumes, Systematic Biology 55:818-836) for a full description of the methods.

A few of the many scripts involved are available here; contact me for additional tools mentioned in the paper.


blast2blink.pl

Download current version:  v. 0.37

This tool controls the level of length heterogeneity when building single-linkage clusters of sequences.  It parses BLAST's tabular output (produced with the option -m 8) and compares it to a table that contains the length of each sequence.  Used in combination with blink (M. J. Sanderson, written in C) or blinkPerl (below), it is similar in conception to NCBI's BLASTClust.  However, keeping the BLAST step separate from the clustering step allows full use of BLAST's various programs (blastn, etc.).  A critical difference (as of this writing) is that BLASTClust considers one hit at a time when calculating proportional overlap, whereas blast2blink considers the entire set of hits between a pair of sequences.  This means that a pair of sequences with, e.g., a string of N's or other low-complexity runs in the middle can still be counted as a hit if the surrounding regions are similar enough.
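The idea of pooling all hits between a pair of sequences before measuring overlap can be illustrated with a minimal Python sketch. This is not the blast2blink.pl source: the function names, the min_coverage threshold, and the rule that both sequences must meet it are assumptions made for illustration. Only the column layout of BLAST's -m 8 output (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score) is taken as given.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total number of positions covered by 1-based, inclusive intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def covered_pairs(blast_lines, lengths, min_coverage=0.8):
    """Yield (query, subject) pairs whose pooled hits cover enough of both.

    blast_lines: lines of BLAST -m 8 tabular output.
    lengths: dict mapping sequence id to sequence length.
    min_coverage: hypothetical threshold, not a blast2blink parameter.
    """
    query_spans = defaultdict(list)
    subject_spans = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # ignore self-hits
        q_start, q_end, s_start, s_end = (int(x) for x in fields[6:10])
        # Minus-strand hits report start > end, so normalize the order.
        query_spans[(query, subject)].append(
            (min(q_start, q_end), max(q_start, q_end)))
        subject_spans[(query, subject)].append(
            (min(s_start, s_end), max(s_start, s_end)))
    for pair in query_spans:
        query, subject = pair
        q_cov = merged_length(query_spans[pair]) / lengths[query]
        s_cov = merged_length(subject_spans[pair]) / lengths[subject]
        if q_cov >= min_coverage and s_cov >= min_coverage:
            yield pair
```

With two 40-base hits flanking an unaligned 20-base stretch in a pair of 100-base sequences, the pooled coverage is 0.8 for each sequence, so the pair passes a 0.8 threshold. Either hit alone covers only 0.4 and would fail. Run the script with no arguments to see the parameters it actually uses.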

[Figure: sigma_phi_schematic]


Running the program with no arguments produces detailed instructions:  ./blast2blink.pl
After downloading the script, it may be necessary to make it executable:  chmod +x blast2blink.pl

The script has been tested with a wide range of parameter values and commands, but not exhaustively, so it is still in beta.  If errors occur, please contact me.
NOTE:  a bug in v. 0.35 has been fixed, and the instructions have been updated. 


blinkPerl.pl

Download current version:  v. 1.0

This tool builds clusters of sequences.  Input consists of pairs of sequence identifiers that "hit" each other; output is a list of clusters and their constituent sequences.  A sequence is added to a cluster if it hits any other sequence in the cluster (single-linkage clustering).
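Single-linkage clustering from a list of hit pairs can be sketched with a union-find structure. The Python below is an illustration of the technique only, not the blinkPerl.pl source. The function name and the in-memory input and output formats are assumptions; read the script itself for its actual file formats.

```python
def single_linkage_clusters(pairs):
    """Group identifiers so that any two linked by a chain of hits share a cluster.

    pairs: iterable of (id_a, id_b) tuples, one per hit.
    Returns a sorted list of clusters, each a sorted list of identifiers.
    """
    parent = {}

    def find(x):
        # Register unseen identifiers as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in clusters.values())
```

For example, the pairs (s1, s2), (s2, s3) and (s4, s5) produce two clusters: s1, s2 and s3 together, because s3 reaches s1 through s2, and s4 with s5.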

Running the program with no arguments produces a list of commands; read the source for formatting details.  Please contact me with any errors or problems.
After downloading the script, it may be necessary to make it executable:  chmod +x blinkPerl.pl