[DFW.pm] SpamMeNot design considerations - unicode? sysread vs. read? parsing? security?

Tue Mar 19 17:34:03 PDT 2013

If anyone's watched the git repo for the SpamMeNot project as introduced
and discussed in our meetings for the last 2 months...

You may have noted some things.  You may have noted that it now has
both a functional daemon and a functional Catalyst backend.  Profit!

      Questions

But while hacking on the codebase recently I've started to ask myself
some questions.  Questions like:

  * I've started this out with full unicode support for both incoming
    streams and what will eventually be passed to the message storage
    mechanism (maildir), but do I really need unicode?  It's already
    probably base64 when it comes in, right?  But what about headers with
    unicode characters?  Domain names can have *ü*nicode chars
    nowadays!
  * The original codebase literally reads until *EOF*; this can't be
    good!  Someone could send me an email as big as my server's hard
    drive!  In the newest code you'll see that I decided to *read()* one
    utf8-encoded char at a time, and limit the maximum size of each
    header and the email's message content.  It has allowed for
    effective header parsing, but it's tedious and probably dreadfully
    inefficient.  Wouldn't it be better to *read STDIN, $buffer,
    $max_msg_size* and if there's anything left -- kick it back?  That's
    still DoS resistant, and much faster.  I could do post-processing
    and parse the headers/message after the fact.
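
A minimal sketch of the bounded-read idea from the second bullet, not
the project's actual code.  The helper name slurp_limited is
hypothetical, and the limit is assumed to be counted in bytes.  It asks
for one byte more than the limit, so an oversized message can be
detected without reading all of it:

```perl
use strict;
use warnings;

# Hypothetical helper: read at most $max bytes from $fh.
# Returns the buffer, or undef if the stream holds more than $max bytes.
sub slurp_limited {
    my ($fh, $max) = @_;
    binmode $fh;                      # count raw bytes, not characters
    my $buffer = '';

    # Request up to $max + 1 bytes in total; if that extra byte ever
    # arrives, the message is over the limit.
    while (length($buffer) <= $max) {
        my $want = $max + 1 - length($buffer);
        my $got  = read($fh, $buffer, $want, length($buffer));
        die "read failed: $!" unless defined $got;
        last if $got == 0;            # EOF
    }

    return length($buffer) > $max ? undef : $buffer;
}
```

A daemon could call slurp_limited(\*STDIN, $max_msg_size) and kick the
message back when it gets undef.  Memory use stays capped at the limit
plus one byte no matter how much the sender tries to push.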

      The First SpamMeNot RFC

_*What do you mongers think?*_ (That's the RFC)

While you think about it, I'm going to *'git branch'* the code (again)
and try:

  * *sysread*-ing using a *:unix:encoding(UTF-8)* IO layer stack into a
    buffer that is *$max_msg_size* in one go (there's an inherent bug in
    this -- can you identify it?)
  * parsing/splitting the message and its headers after the fact (and
    using a trusty CPAN module to do this for me, because as you know
    this makes one's coolness grow larger)
  * ...we'll take it from there.
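
A rough sketch of that plan using core Perl only.  The helper names
sysread_limited, split_message, and parse_headers are hypothetical, and
a CPAN parser such as Email::Simple could stand in for the last two.
Two things to keep in mind, which may or may not be the bug the list
above has in mind: a single sysread can return fewer bytes than
requested, so it has to be looped; and sysread on a handle carrying a
:utf8 or :encoding layer was deprecated in Perl 5.24 and became fatal
in 5.30, so this sketch reads raw bytes instead:

```perl
use strict;
use warnings;

# Hypothetical helper: pull up to $max raw bytes with sysread, looping
# because one sysread call may return less than was asked for.
sub sysread_limited {
    my ($fh, $max) = @_;
    binmode $fh;                      # raw bytes, no encoding layer
    my $buffer = '';
    while (length($buffer) < $max) {
        my $got = sysread($fh, $buffer, $max - length($buffer),
                          length($buffer));
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;            # EOF
    }
    return $buffer;
}

# Hypothetical helper: split a raw message at the first blank line.
sub split_message {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    return ($head, defined $body ? $body : '');
}

# Hypothetical helper: unfold continuation lines, then return a list
# of [name, value] pairs in their original order.
sub parse_headers {
    my ($head) = @_;
    $head =~ s/\r?\n(?=[ \t])//g;     # a line starting with space/tab continues the previous one
    my @headers;
    for my $line (split /\r?\n/, $head) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        push @headers, [$name, $value];
    }
    return @headers;
}
```

Reading bytes first and decoding later also avoids cutting a
multi-byte UTF-8 character in half at the buffer boundary, since no
decoding happens until the whole message is in hand.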

      Things to consider

  * You can always downgrade unicode strings later, but you can't always
    go the other way after you've saved a file with the wrong encoding. 
    Isn't unicode the /"right"/ thing to do?
  * Dovecot seems to store files with ASCII encoding (i.e., not
    necessarily UTF-8) --

    [/var/vmail/tommybutler.me/ace/cur]
    # file -i 1363734898.M59401P5961.peedoo,S=6676,W=6785:2,
    1363734898.M59401P5961.peedoo,S=6676,W=6785:2,: _*message/rfc822;
    charset=us-ascii*_
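
One possible explanation for the us-ascii result (an assumption about
this particular message, not something verified here): mail
traditionally travels as 7-bit text, with non-ASCII header text wrapped
in RFC 2047 encoded-words and bodies in base64 or quoted-printable.  If
so, the file on disk can be pure ASCII and still carry unicode.  A
small sketch with the core Encode module; the Subject value is made up
for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# An RFC 2047 encoded-word: nothing but ASCII bytes as stored.
my $stored = 'Subject: =?UTF-8?B?w7xuaWNvZGU=?=';

# Decoding yields a Perl character string that contains U+00FC.
my $decoded = decode('MIME-Header', $stored);

# Encode explicitly before printing to a byte-oriented handle.
print encode('UTF-8', $decoded), "\n";
```

In that reading, storing the raw message as received and decoding only
when displaying or searching would keep both options open.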

--Tommy Butler