performance question
Jeff Zucker
jeff at vpservices.com
Fri Feb 15 12:35:41 CST 2002
Tom Keller wrote:
>
> Parse a list of start and stop positions for a fairly large (217061
> chars) file
Do you mean a 217*k* file or a 217*mb* file? If it's 217k, I wouldn't
call that large :-).
> of DNA sequence data (m/[acgt]+/i - hopefully Not
> alphabetized!). I have about 200 putative genes demarcated with these
> start and stop positions within that sequence that I wish to further
> analyze.
Do you mean something like this:
A string in a data file (obviously not real data):
gggCCCggggTTTTgggg
An array of start/stop pairs: ( [3,5], [10,13] )
And you want to get the parts of the string demarcated by those
positions (e.g. the first would be 'CCC' and the second 'TTTT')?
If so, then you could
1. sort your start_stop codes to be in order by start position
2. open the data file
3. foreach (start-stop-pair)
3a. seek start pos
3b. read (stop_pos - start_pos + 1) chars, since the positions in the example are inclusive
3c. process or store the resulting string
3d. repeat from 3a as needed
This would never slurp the entire data file and would only go through it
once. (Although if you have overlapping start/stop positions, it will do
some backtracking.)
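
The steps above could be sketched roughly like this. It is only a sketch under some assumptions: positions are 0-based and inclusive (as in the 'CCC'/'TTTT' example), they count bytes from the very start of the file (so any header line or newlines in the file count too), and each character is one byte. The name extract_regions and the temp file in the demo are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical helper: returns [start, stop, string] for each pair.
# Assumes 0-based, inclusive byte offsets into $file.
sub extract_regions {
    my ($file, @pairs) = @_;

    open my $fh, '<', $file or die "Can't open $file: $!";

    my @regions;
    # Step 1: visit the pairs in order of start position.
    for my $pair (sort { $a->[0] <=> $b->[0] } @pairs) {
        my ($start, $stop) = @$pair;
        my $length = $stop - $start + 1;    # inclusive positions

        # Steps 3a and 3b: jump to the start, read only what is needed.
        seek($fh, $start, SEEK_SET) or die "Can't seek to $start: $!";
        my $got = read($fh, my $chunk, $length);
        die "Read error at $start: $!" unless defined $got;

        # Step 3c: store the string together with its coordinates.
        push @regions, [$start, $stop, $chunk];
    }
    close $fh;
    return @regions;
}

# Demo with the made-up sequence from above, written to a temp file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} 'gggCCCggggTTTTgggg';
close $tmp_fh;

for my $region (extract_regions($tmp_name, [10, 13], [3, 5])) {
    print "$region->[0]-$region->[1]: $region->[2]\n";
}
# prints:
# 3-5: CCC
# 10-13: TTTT
```

Because seek() takes an absolute offset here, overlapping pairs just cause a short jump backwards. If the sequence is wrapped across lines, the offsets would have to account for the newline characters, or the newlines would have to be stripped first.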
--
Jeff
TIMTOWTDI
More information about the Pdx-pm-list
mailing list