[Cincinnati-pm] November Virtual cinci.pm - Nov. 10th

Wed Nov 10 19:06:11 PST 2021

Thanks, Jon, for the perlform[0] presentation!

Apropos of that, here's another example of using Perl formats (see 
attached, line 497), and the output they produce.

The idea was to give me a monthly report of disk usage by username, 
path, and file extension. The script uses multiple formats, including 
one for a page header. The formats in this case don't have any weird 
ANSI stuff going on, so data lines line up with the picture lines.
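
For anyone who hasn't used picture lines before, here is a minimal sketch of how one behaves. It uses formline and $^A (the same approach as the swrite helper in perlform) so the result comes back as a string. The picture line, the format_row name, and the sample values are made up for illustration; they are not taken from the attached script.

```perl
use strict;
use warnings;

# Hypothetical picture line: a 20-column left-justified field, one space,
# then a 10-column right-justified field.
my $PICTURE = '@<<<<<<<<<<<<<<<<<<< @>>>>>>>>>' . "\n";

# Fill in the picture line with the given values and return the result as
# a string instead of writing it to a filehandle (cf. 'swrite' in perlform).
sub format_row {
    my @fields = @_;
    $^A = "";                      # clear the format accumulator
    formline($PICTURE, @fields);   # one value per '@' field, left to right
    return $^A;
}

print format_row('.fastq.gz', '14.7T');
print format_row('.bam',      '11.5T');
```

Each '@' field is as wide as the '@' plus the '<' or '>' characters that follow it, which is why the data lines come out aligned with the picture line.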

In a cron job, I pipe this into another script utilizing MIME::Lite[1] 
under the hood to create a MIME multipart email, with the HTML part 
being the plain text part wrapped in <html><body><pre> tags so it looks 
right for people using Outlook.
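
A minimal sketch of that wrapping step, assuming the report text is already in a string. The mailer script itself isn't attached, so the wrap_as_html name and the HTML escaping are assumptions, and handing the two parts to MIME::Lite is left out here.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a plain-text report so that HTML mail clients
# show it in a fixed-width font with the whitespace preserved.
sub wrap_as_html {
    my $text = shift;

    # escape the characters that are special in HTML ('&' has to go first)
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    return "<html><body><pre>\n$text</pre></body></html>\n";
}

print wrap_as_html("extension   disk utilization\n<none>                3.6T\n");
```

The escaping matters for a report like this one, since a literal "<none>" row would otherwise be read as an unknown HTML tag and disappear.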

Here's what one of the formats looks like:

format BYEXT_TOP =
                      Top disk utilization by file extension
================================================================================

               extension                         disk utilization
               ---------                         ----------------
.

format BYEXT =
               @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<... @>>>>>>>>>>>>>>>
               @LINE
.

…and the corresponding chunk of output would look something like this:

                      Top disk utilization by file extension
================================================================================

               extension                         disk utilization
               ---------                         ----------------
               .bed                                         32.5T
               .fastq.gz                                    14.7T
               .bam                                         11.5T
               .fq.gz                                        8.7T
               .vcf                                          7.1T
               .fastq                                        4.9T
               <none>                                        3.6T
               .vcf.gz                                       3.1T
               .body.sam                                     2.6T
               .bgen                                         2.3T
               .txt                                          1.4T
               .pgen                                         1.3T
               .body.sorted.sam                            885.7G
               .fq                                         874.8G
               .read2.fastq.gz                             865.7G
               .read1.fastq.gz                             850.7G
               .trim.srt.bam                               806.0G
               .pvar                                       750.6G
               .bwt2glob.unmap.fastq                       735.4G
               .sam                                        678.1G

…where the input file used to generate the report would've been created 
in advance with a 'find -printf' command line like this:

     find . -type f -printf '%p\t%s\t%T@\t%A@\t%u\n' > usage.tsv
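
To make the data flow concrete, here is a stripped-down sketch of turning records in that five-column format into per-extension totals. This is not the attached script's code: size_by_extension is a hypothetical name, the extension rule is simplified to a single trailing ".ext", and the sample records are invented.

```perl
use strict;
use warnings;

# Sum file sizes per extension from 'find -printf' records of the form
#   path <TAB> size <TAB> mtime <TAB> atime <TAB> owner
sub size_by_extension {
    my @records = @_;
    my %total;

    foreach my $record (@records) {
        chomp $record;
        my ($path, $size) = split /\t/, $record;
        next unless defined $size and $size =~ /^\d+$/;  # skip broken lines

        # simplified rule: the last '.something' in the name, else '<none>'
        my ($ext) = $path =~ m{(\.\w+)$};
        $total{ defined $ext ? $ext : '<none>' } += $size;
    }
    return \%total;
}

# invented sample records
my $totals = size_by_extension(
    "./a/reads.bam\t2048\t1636570000.0\t1636570000.0\talice\n",
    "./b/more.bam\t1024\t1636570000.0\t1636570000.0\tbob\n",
    "./README\t100\t1636570000.0\t1636570000.0\talice\n",
);

# largest first, like the report
foreach my $ext (sort { $totals->{$b} <=> $totals->{$a} } keys %$totals) {
    print "$ext\t$totals->{$ext}\n";
}
```

The attached script does more than this (multiple stacked extensions, owners, directory depth), but the core is the same hash-of-running-totals idea.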

Now, disclaimer: I don't earn a living writing Perl, so I wouldn't hold 
this up as an example of any kind of best practices or anything. But I 
do *really* like Perl formats, and every few years I find a new reason 
to bust them out for some little reporting task like this one!

--Kevin

[0]: https://perldoc.perl.org/perlform
[1]: https://metacpan.org/pod/MIME::Lite#Create-a-multipart-message

On 11/10/21 at 8:58 AM, Jon Gentle wrote:
> Good Morning,
> 
> Quick reminder that we will be doing a virtual cinci.pm this evening
> about perlform. Hope to see or hear you then.
-------------- next part --------------
#!/usr/bin/env perl
##
##  Summarize disk usage based on directory, owner, file extension
##
##  Input:    the results of a previous 'find -printf' invocation, the details
##            of which are described in the manual (at the bottom of this file)
##
##  Output:   (by default) top five subdirectories by disk utilization, two
##            levels below the top-level directory in the 'find' results; other
##            options are configurable; see the output of '--help' or the manual
##
##  Author:   Kevin Ernst <kevin.ernst -at- cchmc.org>
##  Date:     29 October 2019; major updates ~17 August 2020
##

use strict;
use warnings;
use autodie;

use Carp;     # for 'croak'
use English;  # for $ACCUMULATOR, $FORMAT_NAME, etc.

# someday...
#package ReportUsage;
#require Exporter;
#@ISA = qw(Exporter);
#@EXPORT_OK = qw( humanize volume_free_space );

# someday...
#BEGIN { $ENV{ANSI_COLORS_DISABLED} = 1 unless -t STDOUT }
#use Term::ANSIColor ':constants';

our $TSV_HEADER = "path\tsize\tmtime\tctime\towner";
our $HEADER_SCAN_LINES = 0;         # header must appear in this many lines
                                    # or '0' to disable checking for a header
our $DEFAULT_EXTENSION_LENGTH = 8;  # max. length of extension(s) to consider
our $DEFAULT_EXTENSION_NUMBER = 3;  # don't consider more than 3 extensions
our $DEFAULT_LIMIT = 5;             # limit top "n" to this many (must be > 0)
our $DEFAULT_DEPTH = 2;             # depth from "root" of 'find' command
our $DEFAULT_SEPARATOR = "\t";      # default output separator for '--raw'
our ($QUIET, $DEBUG, $WEND);        # $WEND: print warnings/errors w/ context?

# need to use these in formats, so they need to be package global
our @LINE;      # holds format line; see perlform "WARNINGS"
our $BASEPATH;  # longest common parent path
our $DEPTH;     # how deep to descend from there
our $DISKFREE;  # stats on disk utilization

# allow this script to be used like a library, but also run like a script
main() if not caller();

sub main {
    use Getopt::Long;
    use Pod::Usage;

    my $help;
    my $man;
    my $filename = '';
    my $human;
    my $raw;
    my $deptharg = '';
    my @depthlist;

    # output modes ("by depth" is default)
    my $bydepth = 1;
    my $byowner = 0;
    my $byext = 0;

    my $limit = $DEFAULT_LIMIT;
    my $sep = $DEFAULT_SEPARATOR;
    my $depths = {};
    my $owners = {};
    my $exts = {};
    my $usage = {};

    Getopt::Long::Configure ('bundling');

    # source: "Documentation and help texts" section of Getopt::Long
    GetOptions(
        'help|?' => \$help,
        'manual' => \$man,
        'human-readable|human|h!' => \$human,
        'limit|l=i' => \$limit,
        'depth|subdirs|subdirectories|d=s' => \$deptharg,
        'all|a' => sub { $bydepth = $byowner = $byext = 1; },
        'by-depth|by-directory|by-dir!' => \$bydepth,
        'by-owner!' => \$byowner,
        'by-extension|by-ext!' => \$byext,
        'raw|r' => \$raw,
        'quiet|q!' => \$QUIET,
        'debug' => \$DEBUG,
        'separator|sep|s=s' => \$sep,
    ) or pod2usage(-exitval => 2);

    pod2usage(-exitval => 0) if $help;
    pod2usage(-exitval => 0, -verbose => 2) if $man;

    $filename = shift @ARGV;
    # fall back to reading from stdin implicitly if stdin is not a terminal
    $filename = '-' if not defined($filename) and not -t STDIN;  ## no critic

    @depthlist = $deptharg ? split /,/, $deptharg : ($DEFAULT_DEPTH);
    # add keys to $depths hashref for each one of the specified depths
    $depths = { map { $_ => {} } @depthlist };

    # if $DEBUG is *not* set, then append a new line to errors/warnings so that
    # the line number won't be shown
    $WEND = defined $DEBUG ? "" : ".\n";

    die "ERROR: an input file generated by 'find' is required. Try "
      . "'--help'$WEND" unless $filename;

    die "ERROR: the file '$filename' does not exist or is unreadable.$WEND" 
        unless -r $filename or $filename eq '-';

    if ($raw) {
        $QUIET = 1 unless defined $QUIET;

        warn "WARNING: '-h' / '--human' is ignored for \"raw\" output mode$WEND"
            if $human;

        die "ERROR: the '-r' / '--raw' option requires one and only one of the\n"
          . "       '--by-owner', '--by-extension', or '--by-depth' options"
          . "$WEND" if not ($byowner or $byext or $bydepth);

        # only accept *one* display option and *one* depth for "raw" output; other
        # combinations don't make sense because you'll have different kinds of
        # statistics concatenated together with no separators in between
        die "ERROR: the '-r' / '--raw' option accepts no more than ONE display "
          . "option.\n"
          . "       See '--help' or '--manual' for details$WEND"
            if $raw and $byowner + $byext + $bydepth > 1;

        die "ERROR: the '-r' / '--raw' option accepts no more than ONE depth.\n" .
            "       See '--help' or '--manual' for details$WEND"
            if scalar(@depthlist) > 1;
    } # if $raw

    # default to human-readable figures
    $human = 1 unless defined $human;

    # returns hash reference of disk usage stats, and longest common prefix
    warn "Parsing input file '$filename'...\n" unless $QUIET;
    ($usage, $BASEPATH) = parse_usage($filename);

    # check to see that we actually got something back
    die "ERROR: parsing input file failed (empty results)$WEND"
        unless keys %$usage and $BASEPATH;

    if ($bydepth) {
        my $depthadd = 0;

        # consider "depth" to mean "this many subdirs from the longest common
        # path"; count how many path elements are in the $BASEPATH and add that
        # many to $depth argument ('grep' filters empty list elements)
        $depthadd = grep { $_ and $_ ne '.' } split /\//, $BASEPATH;

        foreach my $depth (@depthlist) {
            warn "Computing usage by depth=$depth in hierarchy...\n"
                unless $QUIET;

            $depths->{$depth} = usage_by_depth(
                $usage,
                depth=>$depth + $depthadd
            );

            # it's possible to go too deep with '-d' and get no results; check
            # FIXME: see GitLab #47
            if ($depths->{$depth}) {
                $depths->{$depth} = reverse_sort_and_take_n(
                    $depths->{$depth},
                    n=>$limit
                );
            } else {
                warn "WARNING: '-d' / '--depth' option ($depth) was *too* deep ".
                     "and yielded no results$WEND";
                delete $depths->{$depth};
            }
        } # for each @depthlist
    } # if '--by-depth'

    if ($byowner) {
        warn "Computing usage by file owner...\n" unless $QUIET;
        $owners = reverse_sort_and_take_n( usage_by_owner($usage), n=>$limit );
    }

    if ($byext) {
        warn "Computing usage by file extension...\n" unless $QUIET;
        $exts = reverse_sort_and_take_n( usage_by_ext($usage), n=>$limit );
    }

    print "\n" unless $QUIET;

    # if it's a mounted filesystem, get stats for it
    $DISKFREE = disk_free($BASEPATH);
    if ($DISKFREE) {
        $FORMAT_NAME = 'DISKFREE';
        write;
        # start a new page
        $FORMAT_LINES_LEFT = 0;
    }

    if ($raw) {
        if ($bydepth) {
            foreach my $depth (@depthlist) {
                print_delimited($depths->{$depth}, sep=>$sep)
                    if $depths->{$depth};
            }
        } elsif ($byowner) {
            print_delimited($owners, sep=>$sep);
        } elsif ($byext) {
            print_delimited($exts, sep=>$sep);
        } else {
            croak "Shouldn't get here!";
        }
    }

    else {
        if ($bydepth) {
            foreach my $depth (@depthlist) {
                # set package-global $DEPTH because it's in the format header
                $DEPTH = $depth;
                print_formatted($depths->{$depth}, format=>'BYDEPTH',
                                human=>$human, trim=>$BASEPATH);
            }
            if (not ($byowner or $byext)) {
                warn "\nHint: Try adding '--by-owner' and/or '--by-extension'"
                   . ".\n" unless $QUIET;
            }
        }

        if ($byowner) {
            print_formatted($owners, format=>'BYOWNER', human=>$human);
        }
        #? else {
        #?     warn "\nHint: Try adding '--by-owner' for per-user utilization.\n"
        #?         unless $QUIET;
        #? }

        if ($byext) {
            print_formatted($exts, format=>'BYEXT', human=>$human);
        }
        #? else {
        #?     warn "\nHint: Try adding '--by-extension' for per-extension "
        #?        . "utilization.\n" unless $QUIET;
        #? }
    } # if '--raw'

    print "\n";
} # main

##############################################################################
##                    h e l p e r     f u n c t i o n s                     ##
##############################################################################

sub parse_usage {
    my $filename = shift;
    my $usage = {};
    my ($fh, $path, $size, $mtime, $ctime, $owner, $prefix);
    my ($has_header, $count) = (0, 0);

    # format is: path ⇥ size ⇥ mtime ⇥ ctime ⇥ owner
    if ($filename eq '-') {
        $fh = *STDIN;
    } else {
        open $fh, '<', $filename;
    }

    while (<$fh>) {
        $count++;
        next if /^#/;  # skip comments

        # require header within the first $HEADER_SCAN_LINES lines (or 0=don't)
        if ($HEADER_SCAN_LINES && $count > $HEADER_SCAN_LINES && !$has_header) {
            die "ERROR: Invalid input file. See '--manual' for required " .
                "format.\n";
        }

        if (/$TSV_HEADER/) {
            $has_header = 1;
            next;
        }

        chomp;

        # add leading './' if it's a relative path without one
        $_ =~ s/^/\.\// if /^\w+/;

        ($path, $size, $mtime, $ctime, $owner) = split /\t/;

        # funny story: this literally happened to me when 'find' came across
        # files with embedded newlines in the filename and wrote the '-printf'
        # format over two lines
        if (grep { !defined } $path, $size, $mtime, $ctime, $owner) {
            warn "WARNING: record for $path had empty fields; skipping$WEND"; 
            next;
        }

        # sanity checks:
        # if you wanted to terminate on non-existent files
        #? croak "ERROR: Non-existent file '$path'" unless -f $path;
        croak "ERROR: Bad size '$size'" unless $size =~ /^\d+$/;
        croak "ERROR: Bad mtime '$mtime'" unless $mtime =~ /^-?[.\d]+$/;
        croak "ERROR: Bad ctime '$ctime'" unless $ctime =~ /^-?[.\d]+$/;
        croak "ERROR: Bad owner '$owner'" unless $owner =~ /^[-\w]+$/;

        $usage->{$path} = {
            size => $size,
            mtime => $mtime,
            ctime => $ctime,
            owner => $owner,
        };

        # h/t https://rene.seindal.dk/2005/09/09/longest-common-prefix-in-perl/
        $prefix ||= $path;
        chop $prefix while ($path !~ /^\Q$prefix\E/);  # \Q,\E = escape meta
    }

    close $fh;
    return ($usage, $prefix);
} # parse_usage

# if path argument is absolute, return free space for mounted volume
sub disk_free {
    my $path = shift;
    return if $path =~ /^\./;  # reject relative paths

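    # run 'df' on its own so that $? reflects df's exit status (not tail's)
    # and the stderr redirect applies to df
    my $output = `LC_ALL=C df -h '$path' 2>/dev/null`;

    if ($? or not $output) {
        warn "WARNING: Unable to get free space for '$path'$WEND" if $DEBUG;
        return;
    }

    # drop the header line, then split the rest on whitespace:
    # [0] device, [1] size, [2] used, [3] free, [4] percent, [5] mount point
    $output =~ s/\A[^\n]*\n//;
    my @stats = split ' ', $output;
    return "$stats[3] of $stats[1] free ($stats[4] full)";
} # disk_free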

sub usage_by_owner {
    my $usage = shift;
    my ($owner, $size);
    my $ownersizes = {};
    croak "Got an empty '\$usage' hashref" unless keys %$usage;

    # make a list of file owners and total byte sizes
    foreach my $path (keys %$usage) {
        $owner = $usage->{$path}->{owner};
        $size = $usage->{$path}->{size};
        $ownersizes->{$owner} += $size;
    }
    return $ownersizes;
} # usage_by_owner

sub usage_by_ext {
    my $usage = shift;
    my ($ext, $size);
    my $extsizes = {};
    croak "Got an empty '\$usage' hashref" unless keys %$usage;

    # make a list of file extensions and total byte sizes
    foreach my $path (keys %$usage) {
        # strip off './' if it's there
        ($ext = $path) =~ s/^\.\///;

        # Pro Tip: m/// returns list of capture subexpressions in list context
        ($ext) = ($ext =~ /
            (
                (?:\.\w{1,$DEFAULT_EXTENSION_LENGTH})  # a dot, then <= 8 chars
                {1,$DEFAULT_EXTENSION_NUMBER}          # up to 3 of them
            )$                                         # at EOL; capture #1
        /x);
        $ext = '<none>' if not $ext;
        $size = $usage->{$path}->{size};
        $extsizes->{$ext} += $size;
    }
    return $extsizes;
} # usage_by_ext

sub usage_by_depth {
    my $usage = shift;
    my %opts = @_;
    my ($subpath, $size);
    my $depthsizes = {};

    # sum up disk utilization by the parent directory up to $opts{depth}
    # the regex matches $opts{depth} path elements followed by a filename
    foreach my $path (keys %$usage) {
        # Pro Tip: m/// returns list of capture subexpressions in list context
        ($subpath) = ($path =~ qr(
            ^(                    # at the beginning of the line
                \.?               # maybe a period (relative paths)
                (?:/[^/]+)        # a slash, then a bunch of non-'/' chars
                {$opts{depth}}    # exactly <depth> of them
                /                 # followed by a literal '/'
            )                     # capture #1
        )x);
        next unless $subpath;
        $size = $usage->{$path}->{size};
        $depthsizes->{$subpath} += $size
    }
    return $depthsizes;
} # usage_by_depth

# sorts a hash based on value; returns an array ref
sub reverse_sort_and_take_n {
    my $hashref = shift; 
    my %opts = @_;
    my $result = [];

    croak "Hashref references an empty hash" unless keys %$hashref;
    croak "Need 'n' option" unless exists $opts{n} and $opts{n};

    # sort entries based on the value, in reverse order
    my @ranked = sort { $hashref->{$b} <=> $hashref->{$a} } keys %$hashref;

    # subset the ranked keys, only if the "n" option is smaller than # of keys
    @ranked = @ranked[0 .. $opts{n}-1] if $opts{n} < scalar(@ranked);

    # build [key, value] pairs for the entries that were kept
    foreach my $entry (@ranked) {
        push @$result, [$entry, $hashref->{$entry}];
    }
    return $result;
} # reverse_sort_and_take_n

sub print_delimited {
    my $arrayref = shift;
    my %opts = @_;

    croak "Got an empty arrayref" unless @$arrayref;
    $opts{sep} = $DEFAULT_SEPARATOR unless exists $opts{sep};

    foreach my $entry (@$arrayref) {
        print join($opts{sep}, @$entry), "\n";
    }
} # print_delimited

sub print_formatted {
    my $arrayref = shift;
    my %opts = @_;
    my (@pwent, $fullname);

    croak "Got an empty arrayref" unless @$arrayref;
    croak "Need 'format' option" unless exists $opts{format} and $opts{format};

    $FORMAT_NAME = $opts{format};
    $FORMAT_TOP_NAME = $opts{format} . '_TOP';
    $FORMAT_FORMFEED = "\n\n";

    foreach my $entry (@$arrayref) {
        if ($opts{format} eq 'BYOWNER') {
            @pwent = getpwnam $$entry[0];
            $fullname = @pwent ? $pwent[6] : '<unknown>';
            $$entry[0] = "$$entry[0] ($fullname)";
        } elsif ($opts{format} eq 'BYDEPTH') {
            $$entry[0] =~ s/^\Q$opts{trim}\E//;
        }

        @LINE = ($$entry[0]);
        push @LINE, $opts{human} ? humanize($$entry[1]) : $$entry[1];
        write;
    }

    # start a new page
    $FORMAT_LINES_LEFT = 0;
} # print_formatted

# source: http://perldoc.perl.org/5.26.1/perlform.html
sub swrite {
    croak "usage: swrite PICTURE ARGS" unless @_;
    my $format = shift;
    $ACCUMULATOR = "";
    formline($format, @_);
    return $ACCUMULATOR;
}

sub humanize {
    my $bytes = shift;
    # a value smaller than the next-higher unit is its own remainder modulo it
    return $bytes                             if $bytes % 1024 == $bytes;
    return sprintf("%0.1fK", $bytes/1024)     if $bytes % 1024**2 == $bytes;
    return sprintf("%0.1fM", $bytes/1024**2)  if $bytes % 1024**3 == $bytes;
    return sprintf("%0.1fG", $bytes/1024**3)  if $bytes % 1024**4 == $bytes;
    return sprintf("%0.1fT", $bytes/1024**4)  if $bytes % 1024**5 == $bytes;
    return sprintf("%0.1fP", $bytes/1024**5)  if $bytes % 1024**6 == $bytes;
    return sprintf("%0.1fE", $bytes/1024**6); # otherwise
} # humanize

##############################################################################
##                              f o r m a t s                               ##
##############################################################################

format DISKFREE =
================================================================================
Disk usage for volume @<<<<<<<<<<<<<<<<<<<<<<<<<...  @>>>>>>>>>>>>>>>>>>>>>>>>>>
                      $BASEPATH,                     $DISKFREE
================================================================================

.

format BYDEPTH_TOP =
Top disk utilization in @*, @* level(s) deep
                        $BASEPATH, $DEPTH
================================================================================

subpath                                                         disk utilization
-------                                                         ----------------
.

format BYDEPTH =
@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<... @>>>>>>>>>>>>>>>
@LINE
.

format BYOWNER_TOP =
                       Top disk utilization by file owner
================================================================================

              file owner                        disk utilization
              ----------                        ----------------
.

format BYOWNER =
              @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<... @>>>>>>>>>>>>>>>
              @LINE
.

format BYEXT_TOP =
                     Top disk utilization by file extension
================================================================================

              extension                         disk utilization
              ---------                         ----------------
.

format BYEXT =
              @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<... @>>>>>>>>>>>>>>>
              @LINE
.

1;

__END__

=encoding utf8

=head1 NAME

reportusage - Generate a report of disk utilization

=head1 SYNOPSIS

Processes the results of a previous C<find -printf> command (see
L<"Input File Format"> for complete details) into a concise report of total disk
utilization at arbitrary depth(s) in the directory hierarchy, by file owner, and
file extension.

  reportusage [-?|--help] [--manual] [-h] [-l LIMIT] [-d DEPTH[,...]] [-r]
              [-s SEP] [--quiet] [--debug] [FILE]

where:

  -?, --help; --manual     prints a brief help message; displays the manual
  -h, --human-readable     uses 'K', 'M', 'G', and 'T' suffixes if appropriate
  -l, --limit LIMIT        limits results to LIMIT (default: 5)
  -d, --depth DEPTH[,...]  displays disk usage DEPTH levels into hierarchy
                           (default: 2); separate multiple DEPTHs with commas
  --[no-]by-depth          [suppresses] displays usage by depth in hierarchy
  --[no-]by-owner          [suppresses] displays usage by file owner
  --[no-]by-extension      [suppresses] displays usage by extension
  -a, --all                print usage in all three of the above formats
                           (use '--no-by-X' options to exclude specific ones)
  -r, --raw                means no fancy reports; plain text delimited records
  -s, --separator SEP      is the separator to use for '-r' / '--raw' records
  -q, --[no-]quiet         suppress progress messages (implied for '-r')
  --debug                  prints detailed errors/warnings for troubleshooting
  FILE                     is the output of a previous 'find DIR -printf ...'
                           invocation (see the manual); read stdin if omitted

Run C<reportusage --manual> for further details. Please report bugs at
L<https://tf.cchmc.org/s/ykbdo>.

=head1 DESCRIPTION

Command line options may be "cuddled" together (I<e.g.>, C<-hd5>), and
option/arguments may be separated with an C<=> if desired (I<e.g.>,
C<--limit=10>).

If the C<FILE> argument is omitted, C<reportusage> will read input from stdin
(or print an error message if run interactively). You can also specify C<-> as
the filename, if that makes you happy, but only one file or stream will be read
at a time.

If C<-r> / C<--raw> (raw output) is specified, then only a single C<-d> /
C<--depth> option may be specified, and only one of C<--by-depth>,
C<--by-owner>, or C<--by-extension> may be specified.

If C<--all> is given (all three report formats), you can "subtract" formats you
don't want with the C<--no-by-X> options; I<e.g.>, C<--all --no-by-owner>.

If the longest common path of all the files in the report is 1) an absolute
path; and 2) corresponds to a mounted filesystem, a summary of used/free space
for the mounted volume will be reported at the top. For relative paths, this is
automatically suppressed.

=head2 Input File Format

The C<FILE> parameter is expected to be produced by running C<find> like this:

  # header is only required if $HEADER_SCAN_LINES is set (see below)
  echo -e "path\tsize\tmtime\tctime\towner" > output.tsv

  # format string is:
  # path, size, mtime (epoch secs), atime, file owner, newline (LF)
  # (the header labels the fourth column "ctime", but %A@ is the access time)
  find /path/to/dir -type f -printf "%p\t%s\t%T@\t%A@\t%u\n" >> output.tsv

Don't forget the C<-type f>, or you might get some odd results, since every line
in the input file is expected to be a file. You may wish to add some other
criteria to restrict the number of results from C<find>, such as C<-mtime +100>
or C<-size +10M>, because this will speed up the report generation.

Comments with the C<#> character are allowed (these lines are ignored when
parsing). The tab-delimited header row is not required unless
C<$HEADER_SCAN_LINES> is set to a non-zero value, and in that case, it must
appear within the first C<$HEADER_SCAN_LINES> lines of the input file or the
script will terminate with an error.

Files of the appropriate format are routinely generated and stored in
C</data/CAGE_clusterdata/.accounting> and C</data/weirauchlab/.accounting>,
with a C<.tsv> extension. Use the newest one you find there, or use the symlink
named C<latest.tsv>, if present.

=head1 EXAMPLES

=head2 Summary disk usage in current working directory, by file owner only

The C<--quiet> option suppresses printing of progress messages, which you may
not care about if the number of input records is small (in the tens of
thousands).

  find . -type f -printf '%p\t%s\t%T@\t%A@\t%u\n' \
    | reportusage --quiet --no-by-dir --by-owner

Remember that the C<--by-dir> option is the default, so you always have to
switch that off if you don't want it, before adding one of the other two
options.

=head2 Summary disk usage by file extension only, machine-readable

Adding C<--raw> suppresses the normal progress messages (unless you also give
C<--no-quiet>) and prints in a tab-delimited output format by default.

This is a good output format for processing the results with another tool.

  find . -type f -printf '%p\t%s\t%T@\t%A@\t%u\n' \
    | reportusage --raw --no-by-dir --by-extension > byext.tsv

=head2 Summary disk usage by file extension only, CSV output

The C<--separator> option can be used to specify a different output field
separator than the default of tab.

  find . -type f -printf '%p\t%s\t%T@\t%A@\t%u\n' \
    | reportusage --raw --limit=20 --sep=, --no-by-dir --by-ext > byext.csv

The C<--limit> option raises the number of entries printed above the default
top 5, which you might want if you're processing the output with some other
tool (I<e.g.>, plots with Excel). There is currently no way to ask for
"unlimited"; see #51 in the L<GitLab issue tracker|https://tf.cchmc.org/s/ykbdo>.

The longer "long" options have reasonable abbreviations, too, like C<--sep> for
C<--separator> and C<--by-ext> for C<--by-extension>. Have a look at the
C<GetOptions> invocation in the source for all the supported ones.

=head2 Summary disk usage for files > 1 GiB, not modified in last 100 days

The C<--limit> option will behave as described in the previous example, and the
C<--depth> option will show three different levels of hierarchy so you can zero
in on where the big files are.

  find / -type f -size +1G -mtime +100 -printf '%p\t%s\t%T@\t%A@\t%u\n' \
    | reportusage --limit=20 --depth=1,2,3

=head1 TROUBLESHOOTING

If you suspect problems with your input file, try the C<--debug> switch.

This can possibly reveal corrupted/partial records that could have arisen due
to, for example, embedded newlines or other funny characters in filenames
uncovered by your C<find> command (true story!).

Finally, double-check the L<"Input File Format"> section to make sure that your
input file has the correct columns.

=head1 BUGS

If you discover some behavior that could be a bug, report that behavior
L<here|https://tf.cchmc.org/s/ykbdo>.

Please include the exact command line invocation you tried and any relevant
error messages verbatim, in a L<Markdown code block|https://tf.cchmc.org/s/gitlab-markdown>.

=head1 AUTHOR

Kevin Ernst (kevin.ernst at cchmc.org)

=head1 LICENSE

MIT.

=cut

# vim: tw=80 colorcolumn=80