[kw-pm] Last night's meeting

John Macdonald john at perlwolf.com
Fri Jun 19 11:43:01 PDT 2009


I just re-read the thread and realized that Abez was talking about
both memory size and execution time in his discussion, while I had
read both as being about execution time.

For the solutions so far (except for my misguided
read-everything-into-an-array version), the results are:

memory use is O(l), where l is the length of the longest line in
the input, which is effectively O(1) if you assume a "reasonable"
bound on the length of the longest line.

execution time is O(n), where n is the number of lines in the input
(still assuming that bound on line length, so that each readline
is O(1)).
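
For reference, here is a minimal sketch of that kind of one-pass
solution; this is just the standard idiom, not anyone's actual golf
entry.  Each line read replaces the saved choice with probability
1/$., so only one line is ever held in memory and the work per line
is constant:

    #!/usr/bin/perl
    # Pick one line uniformly at random from the input in one pass.
    my $choice;
    while (my $line = <>) {
        # Line number $. replaces the current choice with
        # probability 1/$., which yields a uniform selection.
        $choice = $line if rand($.) < 1;
    }
    print $choice if defined $choice;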

On Fri, Jun 19, 2009 at 02:34:11PM -0400, John Macdonald wrote:
> On Fri, Jun 19, 2009 at 12:37:34PM -0400, abez wrote:
> > On Fri, 19 Jun 2009, John Macdonald wrote:
> >
> >>>
> >>>
> >>> So, that's down to 18 chars, with no supporting cheats of any kind.
> >>
> >> However, I haven't seen the original challenge - does it require
> >> that the program handle input that is too large to fit in memory?
> >
> > The purpose of the original challenge was for members to implement the
> > randomized algorithm that I presented, which allows for O(1) memory use
> > in terms of lines, while ensuring the result is uniformly random (with
> > some assumptions, of course).
> >
> > This algorithm needs O(n) calls to rand, but rand is pretty fast compared
> > to I/O, so this algorithm will work very well on huge inputs.
> >
> > Perhaps the golf should've been written more clearly but it was given a full
> > presentation in the meeting :)
> 
> No algorithm can run in O(1) time.  At best it can be O(n) in the
> number of lines.
> 
> In fact, even if you were told the number of lines in the input,
> you'd still have to read an amount of input that was not O(1),
> unless there were also severe constraints on the length of a line.
> 
> Imagine that the input file is created by the following program
> that generates one 1000-char line and 999 1-char lines:
> 
>     my $big = int(rand(1000));   # index of the one long line
>     for my $line (0..999) {
>         print 'x' x 999          # the long line gets 999 x's
>             if $line == $big;
>         print "\n";              # every line ends in a newline
>     }
> 
> If you knew that such a file was being searched, you could
> process it in O(log n): use a binary search to find one end of
> the non-blank line [O(log n)], determine which line of the file
> that is [O(1): take the position in the file of the "\n" that
> follows an "x" and subtract 999; that is the line number of the
> long line, and all of the rest are short], pick a single random
> number [O(1)], and generate the required line based on whether
> the result of the binary search matches the randomly chosen
> line number.  But even for this much simpler case, you cannot
> get an O(1) algorithm.
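
A rough sketch of that binary search, to make the steps concrete:
it assumes the exact 1999-byte layout produced by the generator
above (the file name is just illustrative), and the two fixed
probes work because the 999-byte run of x's covers about half the
file, so one of them must land inside it.

    use strict;
    use warnings;

    open my $fh, '<', 'input.txt' or die $!;   # illustrative name
    my $size = -s $fh;                         # 1999 for this layout

    # One of these two probes is guaranteed to hit the run of x's.
    my $in_run;
    for my $probe (int($size / 4), int($size / 2)) {
        seek $fh, $probe, 0;
        read $fh, my $ch, 1;
        if ($ch eq 'x') { $in_run = $probe; last }
    }
    defined $in_run or die "layout not as assumed";

    # Binary search for the "\n" that ends the run; the byte at $lo
    # is always an 'x' and the byte at $hi is always a "\n".
    my ($lo, $hi) = ($in_run, $size - 1);
    while ($hi - $lo > 1) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid, 0;
        read $fh, my $ch, 1;
        if ($ch eq 'x') { $lo = $mid } else { $hi = $mid }
    }
    my $long = $hi - 999;     # line number of the long line

    # Pick a line at random and emit it without reading the rest.
    my $pick = int(rand(1000));
    print $pick == $long ? ('x' x 999) . "\n" : "\n";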
> 
> The actual case, with an input of an unknown number of lines,
> each of unknown length, has to be O(n) in the length of the input
> (which is worse than O(n) in the number of lines; but we've been
> assuming that the input does not have any "lines" that are too
> large to fit into memory, or even close, so O(n) in the number of
> lines is a reasonable approximation).
> 
> How would *any* O(1) algorithm be able to work if the input file
> it was given was not some html file, but was /dev/hda instead?
> And it would still have to work for whatever size /dev/hda might
> be 20 years from now.  If /dev/hda is a 1 TB disk drive, it could
> contain a single line of 1 TB of "text", or 10 billion lines of
> 100 characters each, or anything in between; you basically need
> to read the entire input to know where lines start and what
> their line numbers are.
> 
> I would take the O(1) requirement to be a measure of the amount
> of work done for each line read (which at least matches the cost
> of the readline operation).

