[kw-pm] Last night's meeting

John Macdonald john at perlwolf.com
Fri Jun 19 11:34:11 PDT 2009


On Fri, Jun 19, 2009 at 12:37:34PM -0400, abez wrote:
> On Fri, 19 Jun 2009, John Macdonald wrote:
>
>>>
>>>
>>> So, that's down to 18 chars, with no supporting cheats of any kind.
>>
>> However, I haven't seen the original challenge - does it require
>> that the program handle input that is too large to fit in memory?
>
> The purpose of the original challenge was for members to implement the
> randomized algorithm that I presented, which allows for O(1) memory use
> (in terms of lines) while ensuring the result is uniformly random (with
> some assumptions, of course).
>
> This algorithm needs O(N) calls to rand, but rand is pretty fast compared
> to I/O, so it will work very well on huge inputs.
>
> Perhaps the golf should've been written more clearly, but it was given a
> full presentation in the meeting :)

No such algorithm can run in O(1) time.  At best it can be O(n)
in the number of lines.

In fact, even if you were told the number of lines in the input,
you'd still have to read an amount of data that was not O(1)
unless there were also severe constraints on the length of a line.

Imagine that the input file is created by the following program,
which generates one 1000-character line and 999 one-character lines:

    my $big = int(rand(1000));      # choose which of the 1000 lines is long
    for my $line (0..999) {
        print 'x' x 999             # the long line: 999 x's plus the newline below
            if $line == $big;
        print "\n";                 # every line, long or blank, ends with a newline
    }

If you knew that such a file was being searched, you could
process it in O(log n): use a binary search to find one end of
the non-blank line [O(log n)]; determine which line of the file
that is [O(1) - take the position in the file of the "\n" that
follows an "x" and subtract 999; that is the line number of the
long line, and all of the rest are short]; pick a single random
number [O(1)]; and generate the required line based on whether
the result of the binary search matches the randomly chosen
line number.  But even for this much simpler case, you cannot
get an O(1) algorithm.
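
To make that concrete, here is a rough sketch of the binary
search (assuming the 1000-line file written by the generator
above; "input.txt" is just a stand-in name):

    use strict;
    use warnings;

    # File layout: $big bytes of "\n", then 999 bytes of 'x', then a "\n"
    # and the remaining blank lines - 1999 bytes in all.
    sub probe {                       # read the single byte at a given offset
        my ($fh, $pos) = @_;
        seek $fh, $pos, 0 or die "seek: $!";
        read $fh, my $byte, 1;
        return $byte;
    }

    open my $fh, '<', 'input.txt' or die "open: $!";

    my $big;                          # line number of the long line
    if (probe($fh, 999) ne 'x') {
        $big = 0;                     # the run of x's misses offset 999 only when it starts the file
    } else {
        my ($lo, $hi) = (0, 999);     # invariant: offset $hi holds 'x'; offsets below $lo hold "\n"
        while ($lo < $hi) {           # O(log n) probes to find the first 'x'
            my $mid = int(($lo + $hi) / 2);
            if (probe($fh, $mid) eq 'x') { $hi = $mid } else { $lo = $mid + 1 }
        }
        $big = $lo;                   # the first 'x' sits at offset $big, which is also its line number
    }

    my $chosen = int(rand(1000));     # one random draw decides which line to emit - O(1)
    print $chosen == $big ? 'x' x 999 . "\n" : "\n";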

The actual case, with an input of an unknown number of lines,
each of unknown length, has to be O(n) in the length of the
input (which is worse than the number of lines; but we've been
assuming that the input does not have any "lines" that are too
large to fit into memory, or even close, so O(n) in the number
of lines is a reasonable approximation).

How would *any* O(1) algorithm be able to work if the input file
it was given was not some HTML file, but /dev/hda instead?  And
it would still have to work at whatever size /dev/hda might be
20 years from now.  If /dev/hda is a 1 TB disk drive, it could
contain a single line of 1 TB of "text", or 10 billion lines of
100 characters each, or anything in between - you basically need
to read the entire input to know where lines start and what
their line numbers are.

I would take the O(1) requirement to be a measure of the amount
of work done for each line read (which at least matches the cost
of the readline operation).
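
For what it's worth, I'm assuming the presented algorithm is the
classic reservoir-sampling trick (keep line k with probability
1/k); under that assumption, here's a sketch that does O(1) work
per line read, makes one rand call per line, and holds only one
line in memory:

    use strict;
    use warnings;

    my $pick;
    while (my $line = <>) {
        # $. is the number of lines read so far; keeping the current line
        # with probability 1/$. leaves every line equally likely at EOF
        $pick = $line if rand($.) < 1;
    }
    print $pick if defined $pick;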

