[Purdue-pm] RADpools

Phillip San Miguel pmiguel at purdue.edu
Wed Oct 20 05:45:58 PDT 2010

 From the manual:

    -f, --fuzzy_MIDs
    If a MID sequence in a read contains an error, the read is usually
    thrown away. With this option, these reads will be accepted and assigned to the
    nearest pool. If the MID could be assigned to more than one pool, a new
    pool is created, named after all the possible pools for the
    ambiguous MID.
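
To make that description concrete, here is a rough sketch of the behaviour as I read it. This is not RADpools code: the pool names, the MIDs, the helper names mismatches() and assign_pool(), the one-mismatch limit, and the "." used to join ambiguous pool names are all assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical pools and their MIDs, invented for this example.
my %mid_of = ( pool1 => 'ACGT', pool2 => 'ACCA', pool3 => 'TTTT' );

# Count the positions at which two equal-length strings differ.
sub mismatches {
    my ( $x, $y ) = @_;
    return scalar grep { substr( $x, $_, 1 ) ne substr( $y, $_, 1 ) }
        0 .. length($x) - 1;
}

# Return a pool name for a read's MID, tolerating at most one mismatch.
# An ambiguous MID gets a name built from every candidate pool; the "."
# separator is an assumption, not necessarily what RADpools uses.
sub assign_pool {
    my ($read_mid) = @_;
    my @hits = grep { mismatches( $read_mid, $mid_of{$_} ) <= 1 }
        sort keys %mid_of;
    return @hits ? join( '.', @hits ) : undef;
}

print assign_pool('ACGT'), "\n";    # exact match to pool1
print assign_pool('ACGA'), "\n";    # one mismatch from both pool1 and pool2
```

A MID that is more than one mismatch from every pool (say 'GGGG' here) comes back as undef, which corresponds to the read being thrown away.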

"MID", standing for "Multiplex Identifier", is Roche-speak for "bar
code". Strangely, this code processes Illumina reads, and Illumina calls
its bar codes "indexes". Not important, though.

The code in question (with extra comments):

    for my $i ( 1 .. $mid_length ) {
         for my $base (qw{A C G T}) {
             my $fuzzycode  = $mid;
             my $prebase_i  = $i - 1;
             my $postbase_i = $mid_length - $i;
             $fuzzycode =~ s{
              ^([ACGT]{$prebase_i})     # capture bases, if any, before current base
               ([ACGT])                 # current base
               ([ACGT]{$postbase_i})$}  # capture bases, if any, after current base
             {$1$base$3}xms;            # replace current base with $base
             push @{ $mid_pools{$fuzzycode} }, $pool_name;
         }
    }

Actually, I don't see any problem with this code. You might get extra
speed using substr(), but the number of bar codes (probably no more than
100 or so) is drastically smaller than the number of sequence reads that
will be processed (millions, probably). So looking for speed-ups in this
part of the code is unlikely to yield much.
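
For what it's worth, a substr() version might look something like the sketch below. It is untested against RADpools itself: fuzzy_mid_pools() and the example pools are made-up names, and only %mid_pools, $fuzzycode, $base and $pool_name are borrowed from the excerpt above. It also skips repeats, since the unmodified MID comes up once per position; whether RADpools deals with that elsewhere I can't tell from the excerpt.

```perl
use strict;
use warnings;

# Map every MID within one substitution of a pool's MID to the pool
# name(s) it could belong to. fuzzy_mid_pools() is a hypothetical name.
sub fuzzy_mid_pools {
    my (%mid_of) = @_;    # pool name => MID
    my %mid_pools;
    for my $pool_name ( sort keys %mid_of ) {
        my $mid = $mid_of{$pool_name};
        my %seen;         # codes already recorded for this pool
        for my $i ( 0 .. length($mid) - 1 ) {
            for my $base (qw{A C G T}) {
                my $fuzzycode = $mid;
                substr( $fuzzycode, $i, 1 ) = $base;    # overwrite one base
                next if $seen{$fuzzycode}++;            # skip repeats
                push @{ $mid_pools{$fuzzycode} }, $pool_name;
            }
        }
    }
    return \%mid_pools;
}

# Example pools, invented for illustration.
my $mid_pools = fuzzy_mid_pools( pool1 => 'ACGT', pool2 => 'ACCA' );
print "$_ => @{ $mid_pools->{$_} }\n" for sort keys %$mid_pools;
```

Any code that lists more than one pool (ACGA and ACCT in this example) is an ambiguous MID in the manual's sense.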

