[Phoenix-pm] Newbie Code

Scott Walters scott at illogics.org
Wed Sep 14 19:14:06 PDT 2005


Hi John,

>  #!/usr/local/bin/perl -w

I like to do:

#!/usr/local/bin/perl

use strict;
use warnings;

Just by the way. Then I can turn around and neatly say 
no warnings 'uninitialized' ;)

>  #Mods and Global constants
>  use Readonly;

Curious. I hadn't seen that module before. Check out constant
(perldoc constant) too and use whichever you like best.
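For comparison, here's a quick sketch of the core-module route (the names here are just examples, not from your code):

```perl
use strict;
use warnings;

# constant ships with Perl, so there's nothing to install.
use constant FIELD_SEPARATOR => '|';

# Constants are really inlined subs, so they don't interpolate into
# strings -- that's the main tradeoff against Readonly's plain-scalar feel.
my $line = join FIELD_SEPARATOR, 'a', 'b', 'c';   # "a|b|c"
```

Readonly variables interpolate like any other scalar; constants don't. That's mostly what the choice comes down to.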

>  use tie::File;

Isn't that Tie::File? This might half-way work on Windows, but it'll
bust completely on Unix. Also, though Perl finds and loads the file, 
its import() routine won't run because Perl will try to call 
tie::File::import() when it needs to be calling Tie::File::import().

>  #use strict;
>  Readonly my $FIELD_SEPERATOR =>  q{|};
>  Readonly my $TEXT_QUALIFIER   =>  q{^};
>  Readonly my $BACKTICK         =>  q{`};
>  Readonly my $BLANKLINE        => qq{     =====> THIS LINE WAS LEFT INTENTIONALLY BLANK!!!!! <=====\n};

IBM =)

>  # This script shoulld take some large text file possibly containing records
>  # spanning multiple lines and making it possible to view a selection of them
>  # in a specified manner.
>  # Things to note:
>  # 1: The input file contains a header that is a list of field names, which I am
>  # just using to count how many fields there are in this particular file as this
>  # is subject to change.
>  # 2: I am formatting the output because I don't know another way to display
>  # the data.
>  # 3: I want 100 lines from the beginning of the file, 100 from the middle and
>  # 100 lines from the end with some line declared using Readonly above to
>  # indicate a jump in records.
>  # 4: Formatting: I want to find the max value for each field and pad each
>  # value of that field to the max value unless it is greater than 40.  If the
>  # max value is greater than 40 then I want the first 20 character a backtick
>  # and the last 20 characters.
>  
>  #Prepares input and output files
>  print "\nEnter file to view: ";
>  my $input_file = <>;
>  chomp($input_file);
>  
>  
>  # I do this alot too creating variables at this level, so I can pass them
>  # around instead of learning how to actually use references.  <- I think that
>  # makes some sense
>  my $beg_line;
>  my $end_line;
>  my @messy_lines;

Considering that, maybe you should use Perl6::Contexts after all. You'll have
to patch your B::Generate before it'll build with newer Perls, but the
instructions are on http://perl6now.com in the "module statuses" section.
Whoa, I feel a nap coming on.

  use Perl6::Contexts;

  some_function($arg, @arr);

  sub some_function {
      my $arg = shift();
      my @arr = @{ shift() };   # the () is actually needed here
      # ...
  }

Of course, you don't want to copy large arrays (a plain = assignment
copies), so if the argument is large, use Data::Alias:

  use Perl6::Contexts;
  use Data::Alias;

  some_function($arg, @arr);

  sub some_function {
      my $arg = shift();
      alias my @arr = @{ shift() };   # under alias, = doesn't copy but instead aliases >=)
      # ...
  }

Data::Alias does something even more evil than Perl6::Contexts... it
re-implements several virtual machine operations and uses its own
run-loop with the customized VM ops to achieve this behavior. Wicked.

Regardless, the @{ ... } syntax takes an expression or a reference
and returns an array from that expression or reference.
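A tiny self-contained illustration of that:

```perl
use strict;
use warnings;

my @arr = (1, 2, 3);
my $ref = \@arr;              # take a reference to the array

my @copy  = @{ $ref };        # dereference back to the full array
my $count = scalar @{ $ref }; # 3 -- scalar context gives the length

push @{ $ref }, 4;            # writes go through the reference,
                              # so @arr is now (1, 2, 3, 4)
```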

>  # This section could go in a subroutine or even a mod for later calling...
>  #Create an array of jumbled lines to be cleaned
>  tie(my @lines, "Tie::File", $input_file, mode => 0) 
>          or die "Can't tie $input_file: $!";

or die $!. Good.

>  my $max_lines = $#lines;
>  if ($max_lines <= 300) {
>      # Tie::File chomps lines so have to reappend
>      for my $ctr (0..$max_lines) {
>          $messy_lines[$ctr] = $lines[$ctr]."\n";
>      }

This seems kind of unnecessary and strange... 

Also, you should be using strict and declaring your variables 
in the smallest reasonable scope with 'my'. That is, don't
declare them at the top of the file if they're just used in
one block -- declare them in that block.
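Something like this -- wide scope only for what's genuinely shared (the names here are made up):

```perl
use strict;
use warnings;

my @messy_lines;              # shared, so declared at the outer level

{
    my $max_lines = 10;       # only needed inside this block
    push @messy_lines, "line $_" for 0 .. $max_lines;
}
# $max_lines is gone here; under strict, touching it is a compile-time error.
```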

>  }
>  else {
>      my $count    = 0;
>      my $mid_beg  = $max_lines / 2 - 50;
>      my $mid_end  = $mid_beg + 99;
>      my $last_beg = $max_lines - 99;
>      my $last_end = $max_lines;
>      
>      # As mentioned above this should give me three chunks of 100 lines with 
>      # $BLANKLINE inserted between chunks.
>      for my $ctr (0..99,$mid_beg..$mid_end,$last_beg..$last_end) {
>          $messy_lines[$count++] = $lines[$ctr]."\n";
>          $messy_lines[$count++] = $BLANKLINE if (($count == 100) or ($count == 201));
>      }
>  }

You can always re-add the \n when you print the lines. You don't
actually have to add it to the raw data. That'll just waste RAM and CPU.

>  my ($record,$field_count,$cum_line_count,$var_temp);
>  my $line_count = 0;
>  
>  # I am thinking about working on putting this in a mod, so I can use it later

This needs some useful comments. What's the intention of this? What requirements
are there on the input data (what are the limits of the forms it can take)?
What's returned?

>          $line_count++ while $line =~ /\^\|\^/g;

At first glance this looks like an endless loop waiting to happen, but
the /g modifier in scalar context picks up each match where the last one
left off, so the while does stop once the matches run out -- it's counting
occurrences of the separator. What will bite you is that $line_count is
a global that never gets reset between lines, so the counts pile up.

I guess I'm having a hard time separating the intention of this routine
from my interpretation of it, and comments are needed for that reason ;)
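For what it's worth, if the goal is just to count separators in a line, there's an idiom that skips the counter variable entirely:

```perl
use strict;
use warnings;

my $line = 'a^|^b^|^c^|^d';

# The inner () = forces the match into list context (all matches at once);
# the outer scalar assignment then counts that list.
my $separators = () = $line =~ /\^\|\^/g;   # 3
```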

>  sub cleaner {
>      my @messy_records = @_;
>      my $header = $messy_records[0];
>      $field_count++ while ($header =~ /\^\|\^/g);
>      for my $ctr (1..$#messy_records) {
>          my $line = $messy_records[$ctr];
>          $line_count++ while $line =~ /\^\|\^/g;
>          $cum_line_count += $line_count;
>          if (($cum_line_count != $field_count)) {
>              chomp $messy_records[$ctr];
>              $messy_records[$ctr] .= splice @messy_records, (++$ctr), 1;
>              $ctr--;
>          }
>          else {
>              $cum_line_count = 0;
>          }
>      }
>      return @messy_records;
>  }
>  
>  my @field_lengths;
>  sub max_field_length {
>      my @clean_records = @_;
>      for my $ctr (0..$#clean_records) {
>          my $record = $clean_records[$ctr];
>  
>  ####### Problem with blanklines ########
>  ####### I want it to loop to the next for iteration when it encounters a $BLANKLINE
>  
>          if ($record eq $BLANKLINE) {
>              next;
>          }

I've lost track of what's-what at this point, but off the top of my head,
print out $record in a debug statement and see what it winds up containing.

>          my @field_values = split /\^\|\^/,$record;

Is max_field_length() the place to be actually splitting the records? And what
is cleaner() cleaning up? 

>          if ($ctr == 0) {
>              for my $count (0..$#field_values) {
>                  $field_lengths[$count] = length $field_values[$count];
>              }

Here's a little trick...

    my @field_lengths = map length, @field_values;

A little less tersely:

    my @field_lengths = map { length $_ } @field_values;

map() kind of translates values. For each input value, it runs the code
(the first argument) on it and outputs the result. This might seem
excessively terse, but you get used to it quickly, and then all of the
for, 0..$#arr, etc. machinery just seems like distraction that has
nothing to do with the algorithm.

>          }
>          else {

Rather than placing if ($ctr == 0) inside of the loop, just place its
logic before the loop, and then loop on 1 .. $#clean_records. That way,
the if test isn't done once for each line. It's a bit faster, but more
importantly, a human reading it will see this logic -- that it runs once,
then the loop does the rest of the lines.
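That restructuring might look like this (a sketch reusing the field separator from the posted code; the sample data is made up):

```perl
use strict;
use warnings;

my @clean_records = ('^ab^|^cde^', '^abcd^|^ef^');   # made-up sample data
my $sep = qr/\^\|\^/;

# Seed the lengths from the first record, once, before the loop...
my @field_lengths = map { length } split /$sep/, $clean_records[0];

# ...so the loop itself only does the keep-the-max work.
for my $record (@clean_records[1 .. $#clean_records]) {
    my @values = split /$sep/, $record;
    for my $i (0 .. $#values) {
        $field_lengths[$i] = length $values[$i]
            if length $values[$i] > $field_lengths[$i];
    }
}
```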

>              for my $count (0..$#field_values) {
>                  if (($field_lengths[$count]) < (length ($field_values[$count]))) {
>                      $field_lengths[$count] = length $field_values[$count];
>                  }
>              }
>          }
>      }
>      # Adjusting for the first and last text qualifiers
>      $field_lengths[0]--;
>      $field_lengths[$#field_lengths]--;
>  }
>  
>  my @cleaned_records = cleaner(@messy_lines);
>  max_field_length(@cleaned_records);

Here's a suggestion. Read one line at a time. Make a routine that takes care of
line continuations (partial lines, whatever). Call that from a loop. Then, in that
loop, split the line into fields. Then pass that array of fields to various routines
as needed. That'll probably only be a routine to format them for print. And the
loop has the option of not calling that format-for-print routine if the line
range is outside of what you want printed as part of your sample data. That'll
save having to use Tie::File, which you mentioned at the meeting as not being
any faster than slurping the whole file into memory (which sucks for large files).
For large files, line-at-a-time is the way to go. Otherwise, Perl pulls
the whole file into main memory and fills it up, so the OS has to push
pages out to swap on disk, and then as you use that memory, the disk
grinds away swapping the right pieces back in. You're at least three
times as bad off, memory-wise, as going line-by-line.

One rule of thumb when designing routines -- don't make them return meta-data
that has to be passed in somewhere else. For example, don't make a routine
that returns a list of field lengths. Wrap it all up in a neat little bundle
that's easy to use. Then break the bundle down into its parts if appropriate.
In other words, it's okay to call a() and have a() call b() and then c(),
but the main program shouldn't ever have to call b() and then c() if b()
is only useful if you immediately call c() afterwards and c() is only useful
if you've just called b(). If you organize the functions this way, you'll
also find having lots of global data is far less useful. You'll also wind
up with approximately one while() loop rather than dozens, making it all
more concise and easier to follow.

Since you want to know how many lines are in the file, and since you
want to know the maximum size of each field, you could do this in two
passes with two while loops: the first would just count lines and count
field sizes, the second would do the actual printing.
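A rough sketch of pass one, wrapped in a sub so it's callable (the file name and data layout here are placeholders, not your real format):

```perl
use strict;
use warnings;

my $sep = qr/\^\|\^/;

# Pass 1: count lines and track the widest value seen in each field.
sub scan_file {
    my ($file) = @_;
    open my $in, '<', $file or die "$file: $!";
    my $line_total = 0;
    my @field_lengths;
    while (my $line = <$in>) {
        chomp $line;
        my @fields = split /$sep/, $line;
        for my $i (0 .. $#fields) {
            $field_lengths[$i] = length $fields[$i]
                if !defined $field_lengths[$i]
                or length $fields[$i] > $field_lengths[$i];
        }
        $line_total++;
    }
    close $in;
    return ($line_total, \@field_lengths);
}

# Pass 2 would reopen the file, use $. (the current line number) to skip
# to the wanted ranges, and format each line with the known widths.
```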

>  my $output_file = "Processed_$input_file";
>  open OUTPUT, ">$output_file";

Please get in the habit of doing:

  open my $outfile, '>', $outputfile or die "$outputfile: $!";

Lexical (my) filehandles close their associated file when they go out of
scope, which takes care of a bunch of possible errors with unwritten
data and filehandle stomping. Using three arguments rather than two
avoids a whole class of security problems where malicious input can
execute commands at the shell (doesn't apply in this case, but you 
don't want to get in the habit of using the two-argument open), and you
always want to die as close to the failure as possible, otherwise
finding the point of failure is a major chore.

>  for my $ctr (0..$#cleaned_records) {
>      my $record = $cleaned_records[$ctr];
>      my @entered_fields = split /\^\|\^/, $record;

Eeek, why are we splitting on this again? By the way, you can store
precompiled regular expressions in variables with the qr{ } quote-like
operator:

    my $fieldsep = qr{\^\|\^};
    my @fields = split /$fieldsep/, $record;

Good for re-use and for building larger regexen out of smaller ones.

>      for my $count (0..$#field_lengths) {
>          if ($field_lengths[$count] <= 40) {

I'd take this kind of approach...

   for my $field (@fields) {
       $field = substr $field, 0, 40 if length $field > 40;
   }

>  ############# something like printf OUTPUT %(40-len(field))s,$field
>  ############# but %( or %$ cause problems
>              my $padding_length = 40 - $field_lengths[$count];
>              printf OUTPUT "%-${$padding_length}s", $entered_fields[$count];

Tempted to suggest just marking it up with HTML <table><tr><td>...</td><td>...</td></tr></table>
and then viewing it in lynx or w3m or something ;)

That 40 - $field_lengths[$count] seems kind of odd. Don't you just want
to use $field_lengths[$count] directly, as in a "%-${padding_length}s"
thing? Also, you seem to have an extra $ -- it looks like you're doing a
soft reference on the number. 'use strict' would have caught that.
Soft refs are almost always an error in Perl 5 code, and 'use strict'
makes them go away =)
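Two ways to get a width into a format correctly (variable names here are illustrative):

```perl
use strict;
use warnings;

my $width = 12;
my $field = 'value';

# Interpolate the width into the format -- ${width}, no extra $:
my $padded_a = sprintf "%-${width}s", $field;

# Or let sprintf take the width as its own argument via *:
my $padded_b = sprintf '%-*s', $width, $field;

# Either way, $field is left-justified in a 12-character column.
```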

>          }
>          else {
>              my $begin_str  = substr($entered_fields[$count], 0, 20);
>              my $ending_str = substr($entered_fields[$count], 0, -20);

Ahh, a little more complicated than what I suggested. Still, I'd munge
all of the fields before counting lengths. 

>              my $verified_string = $begin_str." $BACKTICK ".$ending_str;
>              if (length($verified_string) < 43) {

Dumb but simple:

    $verified_string = " $verified_string" while length $verified_string < 43;

Smarter but ugly:

    $verified_string = join '', ' ' x (43 - length $verified_string), $verified_string;

Ahh, heck, use sprintf:

    $verified_string = sprintf '%43s', $verified_string;

Remember, program clutter becomes mental clutter, so the short way is often the
best way =)

>  ################# pad this string to 43 digits
>              }
>              print OUTPUT $verified_string;
>          }
>          if ($count < $#field_lengths) {
>              print OUTPUT '^|^';
>          }
>      }
>  }
>  print OUTPUT "^\n";
>  
>  #Holds the program open
>  print "\n\nEnter <Return> to continue:\n";

Hope this helps. More than anything else, this program needs a re-org. 
When it's working line-by-line and calling subroutines with those lines
or with the split out records from those lines, it'll be easier to follow
and easier to think about. I suggest you hack it all to heck and then
post it again =) When it's a third the size, everything will be more
clear.

Regards,
-scott

>  
>  
>  On  0, "Jonathan K. Smith" <jksmith at lexsolutio.com> wrote:
>  > 
>  >    Well here's something that I've been working on, for entirely too
>  >    long.  It is supposed format some sampling of another text file to
>  >    easily spot problems.  The sections that don't work at all are the one
>  >    with ############ comments above them, #This is just standard
>  >    notation.  For those who actually dig through this, drop me a line if
>  >    you have any questions.
>  >    
>  >    
>  >    
>  >    Jonathan Smith
>  >    
>  >    Encore Lex Solutio
>  >    
>  >    [1]www.lexsolutio.com
>  >    
>  >    1-888-389-1658
>  > 
>  > References
>  > 
>  >    1. http://www.lexsolutio.com/
>  
>  

