[VPM] link 'bot' protection

mock mock at obscurity.org
Mon Feb 23 19:52:41 PST 2009


On Mon, Feb 23, 2009 at 04:57:39PM -0800, Jer A wrote:
> 
> 
> If I were to ban by ip, what if it were only one bad machine in a large network behind a router.....will it block the entire network?
> 

If you do this by IP you'll end up banning AOL users and anyone else who
shares an address with a large pool of users behind NAT, plus it's relatively
trivial to get around.

The short answer is, you can't accomplish what you're trying to accomplish.
The longer, more complex answer is, it depends on the value of your data.  There
are a number of techniques you can use to make it more annoying to write bots,
but there is no silver bullet that will prevent automation.  If someone is
motivated enough to want to scrape your data, they can, and nothing you can do
will stop them.  Also, it only takes one motivated person to release a library
and all the other less motivated people will be able to do it too.  If your
business model relies on this being impossible, you probably should rethink
things.  Now, all that said, here are some ways you can put up stumbling blocks
and their various flaws.

Obfuscation - You can use JavaScript to deobfuscate the contents of the page
on the fly.  This relies on the fact that most bots don't understand
JavaScript, and thus the page will be unreadable to them.  The flaws are that
it will totally screw over your Google ranking (Google relies on bots), and
CPAN has at least a couple of modules that will let you build a bot that
either automates a real browser (and thus understands JavaScript) or adds
JavaScript functionality to WWW::Mechanize.  Plus it's trivial to just use
Firebug to figure out what's actually going on.
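
Here's a rough sketch of the obfuscation idea in Perl (the page content is
made up for the example; MIME::Base64 ships with Perl, and atob() is the
browser's built-in base64 decoder):

#!/usr/bin/perl
# Sketch: ship the real content base64-encoded and let a little inline
# JavaScript decode it in the browser.  A bot that doesn't run
# JavaScript only ever sees the encoded blob.
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $secret_html = '<p>The data you are trying to protect.</p>';
my $blob        = encode_base64($secret_html, '');   # '' = no line breaks

print "Content-type: text/html\n\n";
print <<"HTML";
<html><body>
<div id="payload">JavaScript required.</div>
<script>
  document.getElementById('payload').innerHTML = atob('$blob');
</script>
</body></html>
HTML

View source shows gibberish, but Firebug (or any bot that runs the script)
sees the decoded page, which is exactly the weakness above.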

Captcha - You can make a "human only understandable" test and require users
to fill one out before entering the protected part of your site.  The flaws
are that it screws the Google bot again, it will annoy your users, and as
far as I know, every captcha out there has been broken by OCR software by
now.  If your data is valuable enough, people will use either porn or
Mechanical Turk to incent real people to solve your captchas and build a
library which they can use as a lookup table.  The only time this actually
works is if your test is "enter a credit card to be charged".  Having a valid
credit card and being willing to part with money is (almost) always a good
sign that something isn't a bot.
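
For what it's worth, here's a toy version of a home-grown challenge, just to
show the moving parts (the secret and the arithmetic test are stand-ins; a
real deployment would also want a nonce and an expiry on the token):

#!/usr/bin/perl
# Sketch: an arithmetic challenge whose answer is carried in an
# HMAC-signed token, so the server doesn't need to store state.
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

my $SECRET = 'change-me';          # made-up key, keep it server-side

# Hand out a question plus a signature of the expected answer.
sub make_challenge {
    my ($a, $b) = (int(rand(10)), int(rand(10)));
    return ("What is $a + $b?", hmac_sha256_hex($a + $b, $SECRET));
}

# Check the reply against the signature we handed out earlier.
sub check_answer {
    my ($answer, $sig) = @_;
    return hmac_sha256_hex($answer, $SECRET) eq $sig;
}

my ($question, $sig) = make_challenge();
print "$question\n";
my $reply = 7;                     # pretend this came back from the form
print check_answer($reply, $sig) ? "looks human\n" : "try again\n";

Of course a bot can just parse the question and do the arithmetic, which is
the point: anything this simple gets coded around.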

Throttling - You can require that requests from a given IP, session cookie,
subnet, or anything else you think you can use to differentiate browsers
only come in at a given rate.  The flaws are that anyone who cares will use
either Tor or proxies to get around your IP restrictions, large chunks of
the net (AOL as an example) are behind NAT so you'll have a ton of false
positives, and most robot code understands cookies anyway.  Identifying
unique users is at best a statistical exercise.  Plus anyone who cares will
just trickle in under your rate limit and slowly leech your data anyway.
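
The mechanics are simple enough; something like this sliding window is the
usual shape of it (in-memory here for the sketch, but in real life you'd keep
the counts in memcached or a database so they're shared across servers):

#!/usr/bin/perl
# Sketch: allow at most $LIMIT requests per key per $WINDOW seconds.
use strict;
use warnings;

my %hits;                      # key (e.g. IP) => list of request times
my $WINDOW = 60;               # seconds
my $LIMIT  = 30;               # requests allowed per window

sub allowed {
    my ($key) = @_;
    my $now = time;
    # keep only the timestamps still inside the window
    $hits{$key} = [ grep { $_ > $now - $WINDOW } @{ $hits{$key} || [] } ];
    return 0 if @{ $hits{$key} } >= $LIMIT;
    push @{ $hits{$key} }, $now;
    return 1;
}

print allowed('192.0.2.7') ? "serve it\n" : "429, slow down\n";

The hard part isn't the code, it's picking a key that actually maps to one
user, which is exactly the NAT problem above.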

Watermarking - You can seed your data with unique fingerprints that can be
identified by you.  Bonus points if you make the watermark findable using the
Google bot so you can use Google to find your leeches and then sue them for
copyright violation.  The flaws are that once people find out about your
watermarks, they're always trivial to remove.  Depending on who is leeching 
your data, copyright law might be useless anyway (good luck suing in China).
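
A sketch of what I mean by seeding (the secret, the subscriber ids, and the
bogus record are all invented for the example):

#!/usr/bin/perl
# Sketch: derive a reproducible marker per subscriber and tuck it into
# the copy of the data they receive, so a leaked copy points back at
# its source.
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

my $SECRET = 'change-me';

sub watermark_for {
    my ($subscriber_id) = @_;
    # short, innocuous-looking token derived from the subscriber
    return substr(hmac_sha256_hex($subscriber_id, $SECRET), 0, 8);
}

sub seed_records {
    my ($subscriber_id, @records) = @_;
    my $mark = watermark_for($subscriber_id);
    # a bogus entry that only this subscriber's copy contains
    push @records, "Acme Widgets ($mark), 123 Example St";
    return @records;
}

print "$_\n" for seed_records('customer-42', 'Real record 1', 'Real record 2');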

Tarpits - You can make an invisible, never-ending link generator which
spiders will descend into infinitely.  This can be a nice way of spotting
bots and then banning their IPs.  However, it will only work once.  A
sufficiently motivated attacker will just code around your tarpit.
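
Something along these lines (the /trap/ path and log file name are made up;
you'd hide the entry link from humans with CSS and disallow /trap/ in
robots.txt so well-behaved crawlers like Googlebot stay out):

#!/usr/bin/perl
# Sketch: a CGI tarpit.  Anything that crawls under /trap/ gets its IP
# logged and a fresh batch of links that lead only deeper into the pit.
use strict;
use warnings;

my $ip    = $ENV{REMOTE_ADDR} || 'unknown';
my $path  = $ENV{PATH_INFO}   || '';
my $depth = ($path =~ tr{/}{});          # how deep this bot has crawled

open my $log, '>>', '/var/log/tarpit.log' or warn "tarpit log: $!";
print {$log} scalar(localtime), " $ip depth=$depth\n" if $log;

print "Content-type: text/html\n\n";
print "<html><body>\n";
for my $n (1 .. 5) {
    my $next = join '/', map { int(rand(1e6)) } 1 .. $depth + 1;
    print qq{<a href="/trap/$next">page $n</a><br>\n};
}
print "</body></html>\n";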

AUP - You can publish an acceptable use policy and threaten to sue anyone who
breaks it.  This might work, assuming they're in a jurisdiction where you can
enforce it, and you have deep enough pockets to make it stick.  And assuming
you can identify who's doing it.  This works well for data you don't want
republished (copyright law is fairly universal and well understood by
providers) but sucks for things like white papers, where you're attempting to
control exclusivity.

The real answer is to figure out a way to make bots a benefit for you rather
than a problem.  Google, for example, publishes an API which is access
controlled by a user key.  Requests to the API cost money, so it doesn't
matter if bots hit the API as long as the cost of serving a request is less
than the price the user pays for it.
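
In its simplest form that's just a key check plus metering, something like
this (the keys, names, and in-memory hashes all stand in for your real
billing database):

#!/usr/bin/perl
# Sketch: every request must carry a valid API key, and each key's
# usage is counted so it can be billed.
use strict;
use warnings;

my %valid_keys = ( 'abc123' => 'Example Corp' );
my %usage;                                   # key => number of requests

sub handle_request {
    my ($api_key, $query) = @_;
    return { error => 'invalid or missing API key' }
        unless exists $valid_keys{$api_key};
    $usage{$api_key}++;                      # this is what gets billed
    return { customer => $valid_keys{$api_key},
             result   => "results for '$query'" };
}

my $res = handle_request('abc123', 'perl bots');
print "$res->{result} (request #$usage{'abc123'})\n";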

Hope that helps.


mock (who has written a fair number of bots)

