performance question

Tue Feb 19 16:14:46 CST 2002

Thanks all for your help. I ended up combining several ideas into two 
subroutines. Besides our own pdx-perl ideas, I found Beginning Perl 
for Bioinformatics, by James Tisdall, very useful.
The main program takes a fasta format sequence file and a glimmer 
"gene predictor" output file.
I can then process the putative genes directly, identified by the 
"glimmer id#" as the key to my genes_hash, or create a secondary 
input file to plug into the pipeline (not shown).
[The latter refers to the GCG "Wisconsin package" of bioinformatics 
programs. It's a suite of utilities and analysis tools that many 
universities buy for analyzing biological data. It includes such 
things as fragment assembly, the BLAST GenBank database query 
tool, and many other programs.]

sub create_segment_list
{
	my ($glim_file, $seq_file, $prefix) = @_;
	print "in sub, glim_file and prefix($prefix) are: $glim_file of $seq_file\n";  ## sanity check
	my ($annotation, $putative_genes, @putative_genes);
	open GLIMMER_O, '<', $glim_file or die "Can't open $glim_file: $!\n";
	my $record;
	{
		local $/;	# slurp mode; $/ is restored when this block exits
		$record = <GLIMMER_O>;
	}
	($annotation, $putative_genes) = ($record =~ /^(.*Putative Genes:\s*\n)(.*)/s);
	close GLIMMER_O;
	@putative_genes = split "\n", $putative_genes;
	return $annotation, \@putative_genes;
}
sub create_genes_hash
{
	#input: list of putative genes (last section of glimmer output)
	my $array_ref = shift;
	my @input = @{$array_ref};
	my ($id, $start, $stop, $comment, %pairs);
	foreach my $line (@input)
	{
		if ( $line =~ m/^\s+(\d+)\s+(\d+)\s+(\d+)\s+\[(.*)\]/ )
		{
			$id = $1; $start = $2; $stop = $3; $comment= $4;
			$pairs{$id} = [ $start, $stop, $comment ];
			## hash, key=id, value=array_ref to list of start, stop, and comment
		}
	}
	return \%pairs;
}
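
A minimal sketch of how the hash returned by create_genes_hash might be used to pull each putative gene out of the sequence. The helper extract_gene, the sample sequence, and the sample gene entries are all made up for illustration and are not part of the program described above. It assumes 1-based inclusive coordinates and that start > stop marks a gene on the reverse strand; genes that wrap around the origin of a circular genome are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): returns the DNA for one
# putative gene, given the full sequence and a [start, stop, comment] entry.
# Assumes 1-based inclusive coordinates; start > stop is taken to mean the
# gene lies on the reverse strand.
sub extract_gene
{
	my ($sequence, $gene) = @_;
	my ($start, $stop) = @{$gene};
	my ($low, $high) = $start <= $stop ? ($start, $stop) : ($stop, $start);
	my $dna = substr($sequence, $low - 1, $high - $low + 1);
	if ($start > $stop)
	{
		# reverse strand: reverse the string, then complement each base
		$dna = reverse $dna;
		$dna =~ tr/ACGTacgt/TGCAtgca/;
	}
	return $dna;
}

# Made-up data shaped like the hash that create_genes_hash returns:
# key=id, value=array_ref to list of start, stop, and comment
my $sequence = "AAACCCGGGTTTACGT";
my %genes = (
	1 => [ 4, 9, "sample forward gene" ],
	2 => [ 12, 9, "sample reverse gene" ],
);

foreach my $id (sort { $a <=> $b } keys %genes)
{
	my ($start, $stop, $comment) = @{ $genes{$id} };
	my $dna = extract_gene($sequence, $genes{$id});
	print "$id\t$start\t$stop\t[$comment]\t$dna\n";
}
```

Sorting the keys numerically keeps the output in glimmer id order, since a plain hash has no ordering of its own.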

Thanks for your help.
Tom
-- 
Thomas J. Keller, Ph.D.
MMI Research Core Facility
Oregon Health & Science University
3181 SW Sam Jackson Park Rd
Portland, Oregon  97201
TIMTOWTDI