lexical problems (i think)

Peter Scott peter at PSDT.com
Mon Aug 26 16:42:27 CDT 2002


At 02:30 PM 8/26/02 -0700, nkuipers wrote:
>Hello all,
>
>I need help with variable persistence.  As preamble, carefully consider an
>input file whose contents look like this:
>
>start of file
> >header1
>sequence...
>sequence...
> >header2
>.
>.
>.
> >headerN
>sequenceN...
>end of file
>
>My goal is a parsing one: parse the file contents into a hash keyed by
>header.  Here is the subroutine.
>
>
>sub parse_fasta_file {
>         my ($fh, $href) = @_;
>         my $header      = undef;
>         my $sequence    = undef;

The "= undef" is redundant.
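
A minimal sketch of why, using the same two variable names outside the sub: a lexical declared with "my" and no initializer already starts out undefined.

```perl
use strict;
use warnings;

my $header;            # no initializer: starts out undef
my $sequence = undef;  # explicit, but equivalent

print "both undefined\n" if !defined $header && !defined $sequence;
```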

>         #hash: header   => 'sequence'
>         while (<$fh>) {
>                 if    ( /^>(.*)\n$/ && !defined $header ) { ##first '>' in file
>                         $header = $1 }

No need to test whether $header is defined; all the headers come before 
their contents.  Just do

         /^>(.*)/ and $header = $1

$ matches at the end of the string or just before a trailing newline, and
. doesn't match a newline, so this captures the same thing as what you have.
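
A quick way to check that, as a sketch: the header line below is made up, and the two patterns are the one from the original code and the shorter one.

```perl
use strict;
use warnings;

# Hypothetical header line, as <$fh> would return it (newline intact).
my $line = ">header1\n";

# Pattern from the original code: explicit \n before the end of string.
my ($long)  = $line =~ /^>(.*)\n$/;

# Shorter pattern: . stops at the newline, so no anchor is needed.
my ($short) = $line =~ /^>(.*)/;

print "long:  '$long'\n";
print "short: '$short'\n";
print "same capture\n" if $long eq $short;
```

One difference worth knowing: the shorter pattern also matches a final line that has no trailing newline, which the original pattern would skip.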

>                 elsif ( /^>(.*)\n$/ &&  defined $header ) {

No need for the elsif, unless you think the data may contain lines that 
look like sequences before the first line that looks like a header.

>                         $sequence =~ s/\s//g;

Um, this is more complicated than you need.  A lot more.

>                         $href->{$header} = $sequence;
>                         $sequence = undef;
>                         $header   = $1; } #want persistence here
>                 elsif ( /^[acgtACGT\n]+$/ )               { $sequence .= $_ }
>         }
>         #last sequence (no '>' signal followed for dumping into hash)
>         $sequence =~ s/\s//g;         #This gets done, last sequence is perfect.
>         $href->{$header} = $sequence; #Header undefined for hashing!
>}
>
>I get the following error messages:
>
>"Use of uninitialized value in hash element at REPfind line 121, <IN> line
>252702." (get this one twice in a row)
>"Use of uninitialized value in concatenation (.) at REPfend line 121, <IN>
>line 252702."
>
>Line 252702 is the very last line in the file, consisting only of letters
>acgt.
>
>Printing $sequence to STDOUT gives what I expect.  It's $header that is
>undefined.  I don't understand why the value of $header is apparently not
>retained after the while loop, while the value of $sequence is.  I've looked
>at what the last header is and it is absolutely equivalent to all the other
>headers as far as format goes, so my regex is not breaking.

Let me suggest that this will do what you want, and hopefully you'll 
agree that it's not worth trying to figure out where your code is going wrong:

sub parse_fasta_file {
   my ($fh, $href) = @_;
   my $header;
   while (<$fh>) {
     /^>(.*)/ and $header = $1;    # remember the current header
     next unless /^[acgtACGT]+$/;  # I guess there must be some junk lines
     s/\s//g;                      # drop the trailing newline
     $href->{$header} .= $_;       # append to this header's sequence
   }
}

That, I think, is it...
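
If you want to try it without the real input, here is a small self-contained sketch. The sample data and the %seq hash are made up for illustration, the in-memory filehandle stands in for your file, and the sub is essentially the one above, repeated so the script runs on its own.

```perl
use strict;
use warnings;

sub parse_fasta_file {
  my ($fh, $href) = @_;
  my $header;
  while (<$fh>) {
    /^>(.*)/ and $header = $1;    # remember the current header
    next unless /^[acgtACGT]+$/;  # skip headers and anything non-sequence
    s/\s//g;                      # drop the trailing newline
    $href->{$header} .= $_;       # append to this header's sequence
  }
}

# Made-up sample data, read through an in-memory filehandle.
my $data = ">header1\nacgtacgt\nACGT\n>header2\nttttgggg\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my %seq;
parse_fasta_file($fh, \%seq);
close $fh;

print "$_ => $seq{$_}\n" for sort keys %seq;
```

Because each sequence line is appended straight onto $href->{$header}, there is nothing left over to flush after the loop, which is the step where the original lost its header.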

--
Peter Scott
Pacific Systems Design Technologies
http://www.perldebugged.com/




More information about the Victoria-pm mailing list