[Purdue-pm] RADpools
Phillip San Miguel
pmiguel at purdue.edu
Wed Oct 20 05:45:58 PDT 2010
From the manual:
-f, --fuzzy_MIDs
If a MID sequence in a read contains an error, the read is usually thrown
away. With this option, these reads will be accepted and assigned to the
nearest pool. If the MID could be assigned to more than one pool, a new
pool is created, named after all the possible pools for the ambiguous MID.
"MID", standing for "Multiplex Identifier", is Roche-speak for "bar
code". Strangely, this code processes Illumina reads, and Illumina calls
its bar codes "indexes". Not important, though.
The code in question (with extra comments):

    for my $i ( 1 .. $mid_length ) {
        for my $base (qw{A C G T}) {
            my $fuzzycode  = $mid;
            my $prebase_i  = $i - 1;
            my $postbase_i = $mid_length - $i;
            $fuzzycode =~ s{
                ^([ACGT]{$prebase_i})     #capture bases, if any, before current base
                ([ACGT])                  #current base
                ([ACGT]{$postbase_i})$}   #capture bases, if any, after current base
                {$1$base$3}xms;           #replace current base with $base
            push @{ $mid_pools{$fuzzycode} }, $pool_name;
        }
    }

Actually, I don't see any problem with this code. You might get extra
speed using substr(), but the number of bar codes (probably no more than
100 or so) is drastically smaller than the number of sequence reads that
will be processed (millions, probably). So looking for speedups in this
part of the code is unlikely to yield much.
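
For what it's worth, here is a minimal sketch of what a substr() version might look like. It is not taken from RADpools: the helper name fuzzy_variants, the two pools and their MIDs are made up for illustration, and joining pool names with "+" is only my guess at what "named after all the possible pools" could mean. One deliberate difference: the sketch deduplicates variants, whereas the loop quoted above pushes the unmodified MID once per position.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RADpools): return every sequence that
# differs from $mid at no more than one position, including $mid itself.
sub fuzzy_variants {
    my ($mid) = @_;
    my %seen;
    for my $i ( 0 .. length($mid) - 1 ) {
        for my $base (qw{A C G T}) {
            my $fuzzycode = $mid;
            substr( $fuzzycode, $i, 1 ) = $base;    # overwrite base at position $i
            $seen{$fuzzycode} = 1;                  # hash keys drop duplicates
        }
    }
    return sort keys %seen;
}

# Made-up pools and MIDs, for illustration only.
my %pool_mid = ( pool_a => 'ACGT', pool_b => 'ACCA' );

# Map every one-error variant of each MID back to the pool(s) it could be from.
my %mid_pools;
for my $pool_name ( sort keys %pool_mid ) {
    for my $fuzzycode ( fuzzy_variants( $pool_mid{$pool_name} ) ) {
        push @{ $mid_pools{$fuzzycode} }, $pool_name;
    }
}

# Look up a few read MIDs: exact match, ambiguous, and unassignable.
for my $read_mid (qw{ACGT ACGA TTTT}) {
    my $pools = $mid_pools{$read_mid};
    print "$read_mid => ", ( $pools ? join( '+', @$pools ) : 'no pool' ), "\n";
}
```

With these two made-up MIDs, ACGA is one substitution away from both, so it ends up listed under both pools, which is the ambiguous case the manual describes.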
Phillip