[DFW.pm] SpamMeNot design considerations - unicode? sysread vs. read? parsing? security?

Tue Mar 19 19:32:07 PDT 2013

On 03/19/2013 07:34 PM, Tommy Butler wrote:
...
>   * I've started this out with full unicode support for both incoming
>     streams and what will eventually be passed to the message storage
>     mechanism (maildir), but do I really need unicode?  It's already
>     probably base64 when it comes in right? But what about headers with
>     unicode characters?  D*ŏ*main names can have *ü*nicode chars nowad*ẵ*ys!

Here's some advice. Since I don't know what you already know, I'll try 
to give a general overview. Ignore what you already know. :)

It depends. :) Do you plan on communicating with people whose languages 
don't fit into the ASCII (or even the latin1 {iso-8859-1}) charset 
spaces? (I don't mean that sarcastically but seriously.) If yes, then 
yes you do need to take utf8 into account. If you plan to go that route, 
you need to do the 3-arg open() like:

   open(my $fh, "<:encoding(UTF-8)", "filename") or die "open failed: $!";

As long as you're using a recent version of Perl, like 5.14 or, better, 
5.16, you should have pretty good unicode support. Also remember that 
with utf8 a byte != a char, so reading 1000 chars could give you many 
more bytes (up to 4000 under RFC 3629, which caps a UTF-8 character at 
4 bytes; the older definition of UTF-8 allowed up to 6).
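
To make the byte-vs-char point concrete, here is a minimal sketch using 
the core Encode module. The sample strings and the helper name 
char_and_byte_counts are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample strings, written with escapes so the source stays ASCII.
my %samples = (
    ascii    => "Domain",          # 6 chars, all ASCII
    accented => "D\x{14F}main",    # o-breve (U+014F) needs 2 bytes in UTF-8
    emoji    => "\x{1F600}",       # outside the BMP, needs 4 bytes in UTF-8
);

# Return (character count, UTF-8 byte count) for a decoded Perl string.
sub char_and_byte_counts {
    my ($text) = @_;
    my $encoded = encode('UTF-8', $text);
    return (length($text), length($encoded));
}

for my $name (sort keys %samples) {
    my ($nchars, $nbytes) = char_and_byte_counts($samples{$name});
    print "$name: $nchars chars, $nbytes bytes\n";
}
```

The practical consequence is that a limit expressed in chars and a limit 
expressed in bytes are not the same limit.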

"Base64" data encoding has nothing to do with charset encoding, confuse 
the 2 at your own peril. :) A base64 encoded blob can hold anything: 
ASCII text, binary files, utf8 text, whatever. And no, you don't have to 
base64 encode a utf8 message; we send utf8 email where I work all the 
time using the MIME::Lite and Sendmail modules.
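
A small sketch of how the two layers stack, assuming a made-up message 
body. It uses only the core Encode and MIME::Base64 modules and is not 
meant as a description of how any particular mailer does it:

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical message body containing a non-ASCII character (u-umlaut).
my $body = "\x{FC}nicode";

# Layer 1: charset encoding turns characters into bytes.
my $utf8_bytes = encode('UTF-8', $body);

# Layer 2: transfer encoding turns arbitrary bytes into plain ASCII.
my $wire = encode_base64($utf8_bytes, '');   # '' = no trailing newline

# Reading undoes the layers in the opposite order.
my $round_trip = decode('UTF-8', decode_base64($wire));

print "base64: $wire\n";
print "round trip ok\n" if $round_trip eq $body;
```

Base64 only ever sees bytes; it neither knows nor cares which charset 
produced them.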

Also, I recently learned there is a mail header that tells you the 
charset. Look for it if you're reading email, and be sure to set 
it if you're writing email. It's "Content-Type: text/html; 
charset=UTF-8" (or "text/plain") if you've never done it before.
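
On the reading side, a rough sketch of pulling the charset out of that 
header value could look like this. The function name 
charset_from_content_type is hypothetical, and the regex is a 
simplification rather than a full MIME parameter parser; falling back to 
us-ascii follows the MIME default for text parts:

```perl
use strict;
use warnings;

# Pull the charset parameter out of a Content-Type header value.
# Returns undef when no charset parameter is present.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;
    if ($value =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i) {
        return $1;
    }
    return;
}

# Example header value taken from the text above.
my $header  = 'text/html; charset=UTF-8';
my $charset = charset_from_content_type($header) // 'us-ascii';
print "charset: $charset\n";
```

For real mail, a dedicated MIME parsing module is probably a safer bet 
than a hand-rolled regex.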

>   * The original codebase literally reads until *EOF*; this can't be
>     good!  Someone could send me an email as big as my server's hard
>     drive!  In the newest code you'll see that I decided to *read()* one
>     utf8-encoded char at a time, and limit the maximum size of each
>     header and the email's message content.  It has allowed for
>     effective header parsing, but it's tedious and probably dreadfully
>     inefficient.  Wouldn't it be better to *read STDIN, $buffer,
>     $max_msg_size* and if there's anything left -- kick it back?  That's
>     still DoS resistant, and much faster.  I could do post-processing
>     and parse the headers/message after the fact.
...

Yes, it's theoretically possible for someone to send you an email so big 
that it gives you problems; however, that's unlikely if it's really 
email. Most email servers have limits on what they'll accept to pass on 
... or so I've always understood but I'll defer to someone who knows 
more on this. Still, as a practical matter, I would set a max message 
size that you'll accept. Common sizes that I've seen in email servers 
are 1-4G, although I have seen email as much as 12G in size passed in 
rare instances. If the message is greater than $max_msg_size, I believe 
you can just drop the rest (as in close the file/stream and "too bad so 
sad" for the rest of the message). You could even immediately close the 
file, accept nothing, and send an auto-reply back with an "email 
too big" message -- although I have no idea if that's RFC compliant or not.
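
Here is one way the "read up to a cap, then stop" idea might be 
sketched. The name read_capped, the 64K chunk size, and the 64-byte cap 
in the demo are all made up for illustration; the cap is counted in raw 
bytes, with decoding left for later:

```perl
use strict;
use warnings;

# Read at most $max_bytes from $fh. Returns (data, too_big_flag).
sub read_capped {
    my ($fh, $max_bytes) = @_;
    binmode $fh;                      # count raw bytes, decode later
    my $buffer = '';
    while (length($buffer) <= $max_bytes) {
        # Ask for one byte past the cap so oversize input is detectable.
        my $want = $max_bytes + 1 - length($buffer);
        $want = 65536 if $want > 65536;
        my $got = read($fh, $buffer, $want, length($buffer));
        die "read failed: $!" unless defined $got;
        last if $got == 0;            # EOF
    }
    my $too_big = length($buffer) > $max_bytes ? 1 : 0;
    substr($buffer, $max_bytes) = '' if $too_big;   # drop the extra byte
    return ($buffer, $too_big);
}

# Demo on an in-memory handle; a mail filter would pass \*STDIN instead.
my $input = 'x' x 100;
open(my $fh, '<', \$input) or die "open: $!";
my ($msg, $too_big) = read_capped($fh, 64);
print length($msg), " bytes kept, too big: $too_big\n";
```

Headers and body can then be parsed from the returned buffer after the 
fact, and the too-big flag decides whether to reject the message.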

HTH,
Kevin