[San-Diego-pm] selective matching (was: Re: Recompiling global substitution RE?)

Sat Mar 20 22:36:47 PDT 2010

Reuben Settergren <ruberad at gmail.com> writes:
> 
> I should give you more clarification that my actual problem
> (still simplified) looks more like:
> 
>    DIST - Distance
>         DISTANCE:       12.3456 ft              4.5678 m
>         STD DEV:        3.4567 ft               1.2345 m
> 
>    AREA - Area
>         AREA:           2345.6789 ft^2          345.6789 m^2
>         PERIMETER:      234.5678 ft             89.0123 m
>         STD DEV:        3.4567 ft               1.2345 m
> 
>    DIST - Distance
>         DISTANCE:       12.3456 ft              4.5678 m
>         STD DEV:        3.4567 ft               1.2345 m
> 
>    DIST - Distance
>         DISTANCE:       8.9012 ft               3.0123 m
>         STD DEV:        1.2345 ft               0.4567 m
> 
> ...etc

To be honest, neither the original problem description, nor this
clarification, made much sense to me.  I suspect that others on the
list had similar problems.

What are you actually checking against what?  Where are you getting
your "good" values?  Where are you getting "columns", and their
colors?

> So in any one situation, I will have a set of numbers/colors for a
> particular combination of function/metric/unit (i.e. DIST/'STD DEV'/ft:
> (3.4567,3.4567,1.2345),(red,green,yellow)). So I have to avoid the
> other functions (e.g. AREA) that might have the same numbers. 

Ok, this makes a bit more sense.  See below for my opinion of a good
way to tackle this problem.

> But with Mark Johnson's idea, I can chomp through my whole
> file-in-one-string bit-by-bit, something like:
> 
>   $head = '';
>   $tail = $s;
>   for my $i ( 0 .. $#nums ) {
>       $tail =~
>         s/(.*?$function.*?$metric.*?)($nums[$i]\s+$unit)(.*)/$3/;
>       $head .= qq{$1<span color="$cols[$i]">$2</span>};
>   }
>   $s = "$head$tail";
> 
> If there isn't already a name for this cool kind of trick, there should
> be: train cars? chewandswallow? sausage grinder? Any other ideas?

It's a very common pattern in functional and declarative languages
(Lisp and Prolog, if you want representative examples).

In Lisp, the pattern is to grab however many args off the front of the
list, and use the "&rest" specifier to put the remaining args into a
named list.  Then you do whatever you need to do with the head args,
and call yourself on the remainder:

| (defun process (head1 head2 head3 &rest tail)
|   ;; munge the first three args, then recurse on whatever is left
|   (cons (munge head1 head2 head3)
|         (when tail
|           (apply #'process tail))))

Prolog would do that somewhat like:

| process( [], [] ).
| process( [ Head1, Head2, Head3 | Tail ], [ MungedHead | MungedTail ] ) :-
|   munge( Head1, Head2, Head3, MungedHead ),
|   process( Tail, MungedTail ).

(I'm quite rusty on both languages, but hopefully you get the idea.)
Both of these languages (and this style of programming) go back at
least to the early 70s; Lisp goes back to the late 50s.  Or you
could use Mathematica, which combines the two...
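
For comparison, here is a rough sketch of the same head/tail idea in
Perl.  The munge() function is a hypothetical stand-in for whatever
you do with each group, and I'm assuming the list length is a
multiple of three:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real per-group work.
sub munge {
    my ( $h1, $h2, $h3 ) = @_;
    return join '-', $h1, $h2, $h3;
}

# Recursive form: peel three items off the front, recurse on the rest.
sub process {
    return () unless @_;
    my ( $h1, $h2, $h3, @tail ) = @_;
    return ( munge( $h1, $h2, $h3 ), process( @tail ) );
}

# Iterative form: splice removes and returns the first three
# elements, so @items shrinks on every pass.
sub process_iter {
    my @items = @_;
    my @out;
    while ( @items >= 3 ) {
        push @out, munge( splice @items, 0, 3 );
    }
    return @out;
}

my @munged = process( 'a' .. 'f' );    # ( 'a-b-c', 'd-e-f' )
```

The iterative form is what most Perl programmers would reach for,
since Perl doesn't optimise deep recursion.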

My first approach was to use text matching to build domain-level
entities, and then operate on those entities.  You don't have to go
full-bore OOP, but that's basically the end point of following this
line of thinking.  In this case, I'll just build up a hash for each
"function", probably with some nested hashes for each metric within
the function.  Then you can match those iteratively against your
criteria.  (And yes, I really would write comments like this.)
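
A minimal sketch of that first step, building one hash per function
with nested hashes per metric.  This is not the code in the linked
program; parse_report() and the hash keys are names I'm making up
here, and the line layout is assumed from your sample data:

```perl
use strict;
use warnings;

# Assumed layout: a "NAME - Description" header line, then indented
# "METRIC:  value unit  value unit" lines, then a blank line.
sub parse_report {
    my ( $text ) = @_;
    my ( @functions, $cur );

    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^\s*(\w+)\s+-\s+(.+?)\s*$/ ) {
            # Header line: start a new function record.
            $cur = { name => $1, desc => $2, metrics => {} };
            push @functions, $cur;
        }
        elsif ( $cur && $line =~ /^\s+([A-Z][A-Z ]*?):\s+(.*)$/ ) {
            my ( $metric, $rest ) = ( $1, $2 );
            # Store each "number unit" pair, keyed by unit.
            while ( $rest =~ /([\d.]+)\s+(\S+)/g ) {
                $cur->{metrics}{$metric}{$2} = $1;
            }
        }
        elsif ( $line =~ /^\s*$/ ) {
            undef $cur;    # blank line ends the record
        }
    }
    return \@functions;
}

my $funcs = parse_report( <<'END' );
DIST - Distance
     DISTANCE:       12.3456 ft              4.5678 m
     STD DEV:        3.4567 ft               1.2345 m

AREA - Area
     AREA:           2345.6789 ft^2          345.6789 m^2
END

print $funcs->[0]{metrics}{'STD DEV'}{ft}, "\n";    # prints 3.4567
```

Keeping the records in an array (rather than a hash keyed by name)
matters here, because the same function name shows up more than once.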

You can see the result here:

  http://scrye.com/~tkil/perl/reuben1.plx

I didn't care for how that forced the highlighting functions to know
the output format, though; about halfway through that implementation
(although I did finish it), I realised that the "hard part" was
encoding the criteria.  Both of my programs use this structure:

| my @CRITERIA =
| (
| 
|  { function => 'DIST',
|    metric => 'STD DEV',
|    units => 'ft',
|    ranges => [ 3.4567,            # center value
|                ''      => 0.001,  # center tolerance
|                green   => 0.01,   # color => tolerance
|                yellow  => 0.2,
|                red     => -1 ] }, # default
| 
|  { function => 'AREA',
|    metric => 'STD DEV',
|    units => 'ft^2',
|    ranges => [ 2.78,
|                ''      => 0.01,
|                blue    => 0.1,
|                orange  => 0.3,
|                red     => -1 ] },
| 
|  { function => 'DIST2',
|    metric => 'DISTANCE',
|    units => 'm',
|    ranges => [ 3.2,
|                ''      => 0.01,
|                green   => 0.1,
|                orange  => 0.3,
|                red     => -1 ] },
| 
|  # if multiple criteria match the same measurement, the
|  # later criteria will be nested inside the earlier ones.
|  { function => 'DIST2',
|    metric => 'DISTANCE',
|    units => 'm',
|    ranges => [ 3.2,
|                ''      => 0.01,
|                green2   => 0.1,
|                orange2  => 0.3,
|                red2     => -1 ] }
| 
| ); # end of @CRITERIA

(Although only my second effort handles the stacked-rules case
described in the last criterion.)
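
To make the "ranges" layout concrete, here is one way it could be
interpreted.  This is a sketch, not the code from either program:
classify() is a hypothetical name, and treating the tolerances as
inclusive (<=) is my assumption.

```perl
use strict;
use warnings;

# Return the color for $value under one criterion's ranges list, or
# '' for "leave it alone".  The list is the center value followed by
# color => tolerance pairs, tightest first; a negative tolerance
# marks the default band.
sub classify {
    my ( $value, $ranges ) = @_;
    my ( $center, @pairs ) = @$ranges;
    my $delta = abs( $value - $center );

    # Walk the pairs in order; the first band wide enough wins.
    while ( my ( $color, $tol ) = splice @pairs, 0, 2 ) {
        return $color if $tol < 0 || $delta <= $tol;
    }
    return '';    # no default band was given
}

my @ranges = ( 3.2, '' => 0.01, green => 0.1, orange => 0.3, red => -1 );

print classify( 3.0123, \@ranges ), "\n";    # prints orange
```

The ranges have to stay an array rather than a hash, since the order
of the bands is what decides which color applies.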

The second one works on the individual lines as they come in, using
the latest function name as a primitive state variable.  The main loop
there:

| my @cur_crit;
| 
| while ( my $line = <DATA> )
| {
|     if ( $line =~ $func_start_re )
|     {
|         my $func = $1;
|         @cur_crit = grep { $_->{function} eq $func } @CRITERIA;
|     }
|     elsif ( $line =~ $blank_line_re )
|     {
|         undef @cur_crit;
|     }
|     elsif ( $line =~ $metric_line_re )
|     {
|         my ( $indent, $metric, $space, $vals ) =
|           (       $1,      $2,     $3,    $4 );
| 
|         foreach my $crit ( grep { $_->{metric} eq $metric } @cur_crit )
|         {
|             $vals = highlight $vals, $crit;
|         }
| 
|         $line = join '', $indent, $metric, $space, $vals;
|     }
|     print $line;
| }
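
The three patterns aren't shown above, so here are guesses at what
they might look like, based only on the sample data.  They are
assumptions rather than the definitions in the actual program, and
line_kind() is a hypothetical helper added just to exercise them:

```perl
use strict;
use warnings;

# "NAME - Description"; $1 is the function name.
my $func_start_re = qr/^\s*(\w+)\s+-\s+\S/;

# Empty or whitespace-only line.
my $blank_line_re = qr/^\s*$/;

# Indent, metric name, colon plus padding, then the rest of the line.
# The /s lets the last group keep the trailing newline, so joining
# the four captures reproduces the original line exactly.
my $metric_line_re = qr/^(\s+)([A-Z][A-Z ]*?)(:\s+)(.*)$/s;

# Report which branch of the main loop a line would take.
sub line_kind {
    my ( $line ) = @_;
    return "function $1" if $line =~ $func_start_re;
    return 'blank'       if $line =~ $blank_line_re;
    return "metric $2"   if $line =~ $metric_line_re;
    return 'other';
}

print line_kind( "DIST - Distance\n" ), "\n";    # prints function DIST
```

Note that the third capture has to include the colon, otherwise the
join in the main loop would drop it from the output.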

The whole program can be found at:

  http://scrye.com/~tkil/perl/reuben2.plx

That's the one that I think is the better code, depending on how
closely your sample data and criteria match your real-world problem.

With this input data:

| DIST - Distance
|      DISTANCE:       12.3456 ft              4.5678 m
|      STD DEV:        3.4567 ft               1.2345 m
| 
| AREA - Area
|      AREA:           2345.6789 ft^2          345.6789 m^2
|      PERIMETER:      234.5678 ft             89.0123 m
|      STD DEV:        3.4567 ft               1.2345 m
| 
| DIST - Distance
|      DISTANCE:       12.3456 ft              4.5678 m
|      STD DEV:        3.4567 ft               1.2345 m
| 
| DIST - Distance
|      DISTANCE:       8.9012 ft               3.0123 m
|      STD DEV:        1.2345 ft               0.4567 m
| 
| DIST2 - Distance
|      DISTANCE:       8.9012 ft               3.0123 m
|      STD DEV:        1.2345 ft               0.4567 m

And with @CRITERIA as shown above, here's the output:

| $ ./reuben2.plx
| DIST - Distance
|      DISTANCE:       12.3456 ft              4.5678 m
|      STD DEV:        3.4567 ft               1.2345 m
| 
| AREA - Area
|      AREA:           2345.6789 ft^2          345.6789 m^2
|      PERIMETER:      234.5678 ft             89.0123 m
|      STD DEV:        3.4567 ft               1.2345 m
| 
| DIST - Distance
|      DISTANCE:       12.3456 ft              4.5678 m
|      STD DEV:        3.4567 ft               1.2345 m
| 
| DIST - Distance
|      DISTANCE:       8.9012 ft               3.0123 m
|      STD DEV:        <span color="red">1.2345 ft</span>               0.4567 m
| 
| DIST2 - Distance
|      DISTANCE:       8.9012 ft               <span color="orange"><span color="orange2">3.0123 m</span></span>
|      STD DEV:        1.2345 ft               0.4567 m

The next time you hit a problem like this, you might want to spend
more time up-front on specification (and possibly writing test cases
and expected outputs) before diving right into the regexes.

(Also, remember that Perl offers many different data manipulation
techniques; regexes aren't always the right answer.)

Happy hacking,
t.