<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    If you've been watching the git repo for the SpamMeNot project, which
    we introduced and discussed in our meetings over the last 2 months...<br>

    <br>

    You may have noticed some things.  You may have noticed that it now
    has both a functional daemon and a functional Catalyst backend.
    Profit!<br>

    <h3>Questions</h3>

    But while hacking on the codebase recently I've started to ask

    myself some questions.  Questions like:<br>

    <ul>

      <li>I started this out with full Unicode support for both
        incoming streams and whatever will eventually be passed to the
        message storage mechanism (maildir), but do I really need
        Unicode?  The content is probably already base64 when it comes
        in, right?  But what about headers with Unicode characters?
        D<b>ŏ</b>main names can have <b>ü</b>nicode chars
        nowad<b>ẵ</b>ys!<br>

      </li>

      <li>The original codebase literally reads until <b><font

            face="Courier New, Courier, monospace">EOF</font></b>; this

        can't be good!  Someone could send me an email as big as my

        server's hard drive!  In the newest code you'll see that I

        decided to <b><font face="Courier New, Courier, monospace">read()</font></b>

        one UTF-8-encoded character at a time, and to cap the maximum
        size of each header and of the email's message content.  That
        has made header parsing work well, but it's tedious and probably

        dreadfully inefficient.  Wouldn't it be better to <b><font

            face="Courier New, Courier, monospace">read STDIN, $buffer,

            $max_msg_size</font></b> and if there's anything left --

        kick it back?  That's still DoS resistant, and much faster.  I

        could do post-processing and parse the headers/message after the

        fact.</li>

    </ul>
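    The bounded read from the second question could be sketched roughly
    like this.  The helper name <font face="Courier New, Courier, monospace">slurp_bounded</font>
    and the 10 MiB limit are made up for illustration and are not from
    the SpamMeNot codebase.  The idea is to ask for one byte more than
    the limit, so that an oversized message can be told apart from one
    that fits exactly:<br>

```perl
use strict;
use warnings;

# Hypothetical limit for illustration; the real value is a project decision.
my $max_msg_size = 10 * 1024 * 1024;    # bytes

# Read at most $limit bytes from $fh. Returns the buffer, or undef when
# the sender supplied more than $limit bytes.
sub slurp_bounded {
    my ($fh, $limit) = @_;

    binmode $fh;    # count raw bytes, not decoded characters

    my $buffer = '';

    # read() may hand back less than was asked for, so keep appending
    # until the buffer holds one byte more than the limit, or until EOF.
    until (length($buffer) > $limit) {
        my $want = $limit + 1 - length($buffer);
        my $got  = read($fh, $buffer, $want, length($buffer));
        die "read failed: $!" unless defined $got;
        last if $got == 0;    # EOF
    }

    return length($buffer) > $limit ? undef : $buffer;
}

# Usage sketch (commented out so the block does not wait on STDIN):
# my $msg = slurp_bounded(\*STDIN, $max_msg_size);
# exit 1 unless defined $msg;    # too big: kick it back
```

    Memory use stays bounded by the limit plus one byte no matter how
    much the sender pushes, and header parsing can then happen on the
    buffer afterwards.<br>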

    <h3><b>The First SpamMeNot RFC<br>

      </b></h3>

    <u><b>What do you mongers think?</b></u> (That's the RFC)<br>

    <br>

    While you think about it, I'm going to <font face="Courier New,

      Courier, monospace"><b>'git branch'</b></font> the code (again)

    and try:<br>

    <ul>

      <li><font face="Courier New, Courier, monospace"><b>sysread</b></font>-ing

        using a<b><font face="Courier New, Courier, monospace">

            :unix:encoding(UTF-8)</font></b> IO layer stack into a

        buffer that is <font face="Courier New, Courier, monospace"><b>$max_msg_size</b></font>

        in one go (there's an inherent bug in this -- can you identify

        it?)<br>

      </li>

      <li>parsing/splitting the message and its headers after the fact
        (and using a trusty CPAN module to do this for me, because as
        you know this makes one's coolness grow larger)</li>

      <li>...we'll take it from there.</li>

    </ul>
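    For anyone who wants to play along, here is a rough sketch of the
    plan above, with one deliberate change: it calls
    <font face="Courier New, Courier, monospace">sysread</font> on a raw
    handle instead of one carrying an encoding layer.  Perl deprecated
    <font face="Courier New, Courier, monospace">sysread</font> on
    <font face="Courier New, Courier, monospace">:utf8</font> handles in
    5.24 and made it fatal in 5.30, and a fixed-size read can also stop
    in the middle of a multi-byte character.  Either of those may or may
    not be the bug hinted at in the list.  The helper names
    <font face="Courier New, Courier, monospace">sysread_bounded</font>
    and <font face="Courier New, Courier, monospace">split_message</font>
    are hypothetical:<br>

```perl
use strict;
use warnings;

# Read up to $limit raw bytes with sysread. Returns the bytes, or undef
# if the handle still had data beyond the limit.
sub sysread_bounded {
    my ($fh, $limit) = @_;

    binmode $fh;    # raw bytes only: no :encoding layer under sysread

    my $bytes = '';

    # sysread can return short counts (pipes, sockets), so loop until
    # one byte past the limit has arrived, or until EOF.
    until (length($bytes) > $limit) {
        my $want = $limit + 1 - length($bytes);
        my $got  = sysread($fh, $bytes, $want, length($bytes));
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;    # EOF
    }

    return length($bytes) > $limit ? undef : $bytes;
}

# Split a raw message into its header block and body at the first
# blank line. A message without a blank line is treated as all headers.
sub split_message {
    my ($bytes) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $bytes, 2;
    return (defined $head ? $head : '', defined $body ? $body : '');
}
```

    Decoding to characters would then happen after the whole message is
    in memory, once the headers say which charset applies.  A CPAN
    parser such as Email::Simple could stand in for
    <font face="Courier New, Courier, monospace">split_message</font>;
    the hand-rolled split is only here to keep the sketch free of
    non-core dependencies.<br>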

    <h3>Things to consider<br>

    </h3>

    <ul>

      <li>You can always downgrade unicode strings later, but you can't

        always go the other way after you've saved a file with the wrong

        encoding.  Isn't unicode the <i>"right"</i> thing to do?<br>

      </li>

      <li>Dovecot seems to store files with ASCII encoding (i.e., not
        necessarily UTF-8):</li>

    </ul>

    <blockquote><font face="Courier New, Courier, monospace">[/var/vmail/tommybutler.me/ace/cur]<br>

        # file -i 1363734898.M59401P5961.peedoo,S=6676,W=6785:2,<br>

        1363734898.M59401P5961.peedoo,S=6676,W=6785:2,: <u><b>message/rfc822;

            charset=us-ascii</b></u><br>

      </font></blockquote>
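    That us-ascii result fits how mail usually travels: non-ASCII
    header text is carried as RFC 2047 encoded-words, so the bytes on
    disk can stay ASCII while the characters are recovered on demand.
    A small sketch using the core Encode module follows.  The sample
    header is invented for illustration, and
    <font face="Courier New, Courier, monospace">decode_header</font>
    is a hypothetical helper name:<br>

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A raw header as it might sit in a maildir file: pure ASCII bytes.
# The value is made up; the base64 payload is UTF-8 for "unicode"
# spelled with a u-umlaut.
my $raw = 'Subject: =?UTF-8?B?w7xuaWNvZGU=?=';

# Turn RFC 2047 encoded-words into a Perl character string.
sub decode_header {
    my ($line) = @_;
    return decode('MIME-Header', $line);
}

my $decoded = decode_header($raw);

# Encode again on the way out so the terminal receives UTF-8 bytes.
print encode('UTF-8', "$decoded\n");
```

    So storing bytes as they arrive and decoding only when a header is
    displayed or searched might be enough, without committing the
    storage layer to any one encoding.<br>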

    <br>

    --Tommy Butler<br>

  </body>

</html>