performance question

Fri Feb 15 12:35:41 CST 2002

Tom Keller wrote:
> 
> Parse a list of start and stop positions for a fairly large (217061
> chars) file 

Do you mean a 217*k* file or a 217*MB* file?  If it's 217k, I wouldn't
call that large :-).

> of DNA sequence data (m/[acgt]+/i - hopefully Not
> alphabetized!). I have about 200 putative genes demarcated with these
> start and stop positions within that sequence that I wish to further
> analyze.

Do you mean something like this:

A string in a data file  (obviously not real data):

   gggCCCggggTTTTgggg

An array of start/stop pairs: ( [3,5], [10,13] )

And you want to get the parts of the string demarcated by those
positions (e.g. the first would be 'CCC' and the second 'TTTT')?

If so, then you could:

  1. sort your start_stop codes to be in order by start position
  2. open the data file
  3. foreach (start-stop-pair)
        3a. seek start pos
        3b. read (stop_pos - start_pos + 1) chars, since both
            positions are inclusive in the example above
        3c. process or store the resulting string
        3d. repeat from 3a as needed

This never slurps the entire data file, and because the pairs are
sorted it only moves forward through the file once (although overlapping
start/stop positions will cause some backward seeks).
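
Here is a rough sketch of those steps, written in Python purely for
illustration (the same seek/read calls exist in Perl). It assumes the
positions are 0-based and inclusive, as in the toy example above, and
that the sequence sits on one line with no newlines or header to throw
off the offsets. The name extract_regions and the temp-file demo are
made up for this example.

```python
import os
import tempfile

def extract_regions(path, pairs):
    """Return the text for each inclusive, 0-based (start, stop) pair."""
    regions = []
    # Binary mode, so seek offsets are plain byte offsets.
    with open(path, "rb") as fh:
        for start, stop in sorted(pairs):      # step 1: order by start
            fh.seek(start)                     # step 3a: jump to start
            data = fh.read(stop - start + 1)   # step 3b: inclusive length
            regions.append(data.decode("ascii"))  # step 3c: store it
    return regions

# Demo with the toy string from above, written to a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("gggCCCggggTTTTgggg")
    demo_path = tmp.name

result = extract_regions(demo_path, [(10, 13), (3, 5)])
os.remove(demo_path)
print(result)  # ['CCC', 'TTTT']
```

If your real file has line breaks or a header line, the offsets would
need adjusting before seeking.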

-- 
Jeff
TIMTOWTDI