[pm-h] Perl merging many large files into one

B. Estrade estrabd at gmail.com
Mon Mar 31 07:06:46 PDT 2014


Just a few points:

0. If you are not appending to "out.txt" each time the script is run,
use ">" rather than ">>" in the shell command; I am assuming ">" from
here on, because it keeps the final shell command much shorter (">>"
requires an additional temporary-file step to remain atomic)
1. Separate the commands with a "&&" since this will require the
former to succeed before the latter is executed
2. You may want to capture STDERR, making the shell command look like
the following (untested; note tail's -q, which suppresses the
"==> file <==" headers it prints when given more than one file):

(cat X && tail -q --lines=+2 Y Z) 2>> error.out 1> out.txt #
out.txt is overwritten here each time

3. If you plan to rely on "out.txt" existing fully formed before a
future process uses it, I'd ">" to a temporary file, then do an
atomic move (/bin/mv) to the final file name, bringing your shell
command to look something like:

(cat X && tail -q --lines=+2 Y Z) 2>> error.out 1> temp.out &&
mv temp.out out.txt;
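For reference, here is a minimal, self-contained sketch of the pattern
above; a.csv, b.csv, and c.csv are hypothetical stand-ins for X, Y,
and Z:

```shell
# Hypothetical sample CSVs standing in for X, Y, and Z
printf 'h1,h2\n1,2\n' > a.csv
printf 'h1,h2\n3,4\n' > b.csv
printf 'h1,h2\n5,6\n' > c.csv

# First file whole, remaining files minus their header rows (-q stops
# tail from printing "==> file <==" banners between multiple files);
# write everything to a temp file, then move it into place atomically.
(cat a.csv && tail -q --lines=+2 b.csv c.csv) 2>> error.out 1> temp.out &&
mv temp.out out.txt
```

Because mv within the same filesystem is a rename, readers of out.txt
never see a half-written file.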

In general, I have absolutely no issue using shell scripts when what I
need is purely shell in nature. So if this is all you need to do, I
would recommend a shell script over Perl without hesitation.

The issue of keeping "out.txt" visible only in a complete state is
complicated if it needs to grow each time the concatenation is done.
If out.txt does need to grow, you will need to create an additional
temporary file that is the new content appended to the old out.txt;
then move that temp file to be the new out.txt.
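That grow-then-swap step might look like the following sketch, where
new.csv is a hypothetical stand-in for the freshly arrived content:

```shell
# Hypothetical starting state: an existing out.txt plus new rows to fold in
printf 'h1,h2\n1,2\n' > out.txt
printf 'h1,h2\n9,9\n' > new.csv

# Build the would-be new out.txt in a temp file: old contents first,
# then the new rows with their header stripped; only then swap it in,
# so out.txt is never visible in an incomplete state.
cat out.txt > grow.tmp &&
tail --lines=+2 new.csv >> grow.tmp &&
mv grow.tmp out.txt
```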

Brett

On Mon, Mar 31, 2014 at 8:10 AM, Michael R. Davis <mrdvt92 at yahoo.com> wrote:
> G. Wade,
>
>>> Can anyone tell me if the diamond operator is optimized in a print
>>> statement or does it really read the file into memory then print it?
>>
>>The real question is are you doing more than just concatenating the
>>files?
>
> I guess this comment sparked research and `tail` actually does what I need
> to do.
>
> qx{cat X > out.txt; tail --lines=+2 Y Z >> out.txt};
>
> Maybe I'll just head out to the command line; it just won't be portable.
> There will be about 300 CSV files to merge in the process but I only need
> the header row on the first one.
>
> But, performance is not really as big of a concern as memory for this
> process.
> Thanks,
> Mike
>
> _______________________________________________
> Houston mailing list
> Houston at pm.org
> http://mail.pm.org/mailman/listinfo/houston
> Website: http://houston.pm.org/
