SPUG: Chip Salzenberg Defense Fund

jlb jlb at io.com
Fri Aug 5 13:56:29 PDT 2005


This is probably getting a bit off-topic for this list (feel free to email 
me directly if you'd like more information), but put simply, robots.txt is 
a way for website operators to state which bots are allowed to spider 
which parts of their site.

What it means for robots that ignore the rules laid out in this file is 
open to interpretation, but most "good netizen" internet providers would 
probably put a stop to a badly behaved robot if all other options were 
exhausted.

Here's more information on robots.txt:

http://www.robotstxt.org/wc/robots.html
http://www.robotstxt.org/wc/norobots.html
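
As a quick sketch (the bot name here is just made up), a site that wanted 
to shut one particular crawler out entirely, and keep everyone else out of 
a couple of directories, might publish this as /robots.txt:

    # /robots.txt -- example only; "ExampleBadBot" is a hypothetical name
    User-agent: ExampleBadBot
    Disallow: /

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /search

A well-behaved crawler fetches this file first and honors it; as noted 
above, a rude one just ignores it.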

You can block the IP address the bot comes from, but there is a virtually 
unlimited number of misconfigured proxies in probably every country in the 
world.  If the bot's operators are abusing these (which could be illegal in 
and of itself if the operator of the proxy did not intend it to be public), 
it's not a simple matter to stop.
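
Just as a sketch of what "blocking the IP" usually looks like in practice 
(this assumes the site is running Apache with mod_access, and the addresses 
below are only placeholders):

    # .htaccess -- hypothetical example; addresses are placeholders
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 203.0.113.0/24

The catch is exactly the one above: every time the bot hops to another open 
proxy, you're back to adding another Deny line.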

It's not like a huge DDoS that's dropping entire ISPs off the net, which 
backbone providers would be interested in stopping; it's generally a much 
more isolated incident than that.

If you wanted to put a stop to a misbehaving bot, and you couldn't track 
down the people who were running it (or their ISP was not cooperative), 
you'd probably need to contact individual proxy operators to fix their 
proxy configurations, which is pointless given the number of proxies 
available.  Your only other remedy is likely an expensive and difficult 
legal process, involving network resources probably located in other 
countries.

An example of a similar situation (although it really was an attack as 
opposed to a rude bot) is what happened to Kuro5hin.org.  I may be 
remembering this incorrectly, but I believe this was the reason K5 had to 
completely disable their search feature: someone wrote an abusive bot to 
make random search requests very rapidly over open proxies.  Searches were 
very expensive under Scoop, and this basically made the site unusable.

It can be difficult to track people down behind proxies, and the legal 
option is obviously more expensive than many website operators can afford.
In K5's case, the easiest solution was just to disable the feature the bot 
was abusing, and in-site search on kuro5hin has been disabled ever 
since--to the detriment of the legitimate users of the site.

Jon

On Fri, 5 Aug 2005, Ken Meyer wrote:

> Again, please note that this is all education for me.
>
> So, how do the misbehaving bot, and the possible responses to it, differ
> from the ways that any old DoS attack is countered?  Since we haven't had
> any virtual Internet shut-downs recently, it suggests to me that effective
> measures have been developed.  Also again, I don't understand whether a
> robots.txt file has the same status as password protection in establishing
> criminal activity.
>
> Ken Meyer
>
>
> -----Original Message-----
>
> From: jlb [mailto:jlb at io.com]
> Sent: Thursday, August 04, 2005 6:24 PM
> To: Ken Meyer
> Cc: SPUG Members
>
> Subject: Re: SPUG: Chip Salzenberg Defense Fund
>
> The methods IRC servers and mail servers use aren't appropriate for web
> sites because they introduce significant delay to each and every
> connection.  Even if the information were cached, the initial connection
> would be slow enough to drive many web users away.
>
> Just speaking as someone who has run a website and encountered some of
> these issues:  Frequently "web bots" are very poorly behaved, issuing
> bursts of thousands of requests in a row, at frequent intervals.  This
> can often impact other legitimate users of the site, as well as
> potentially costing the site money in bandwidth and hosting.
>
> There is a big difference between some person running a recursive wget on
> your site once to mirror it for their own personal use, and someone
> frequently and aggressively running screen scraping bots against it in an
> automated fashion.
>
> If a web site has a robots.txt indicating its operators don't wish it to
> be spidered, and they have gone so far as to ban a single misbehaving bot
> multiple times, only to have those bans evaded... well, at what point does
> it become "bad"?
>
> On Thu, 4 Aug 2005, Ken Meyer wrote:
>
>> My desire to understand this has trumped my desire to conceal my lack of
>> geeky sophistication.
>>
>> Is an "open proxy" used simply to evade an attempt by sites to
> specifically
>> block this company's bots and not others?
>>
>> What about robots.txt?  It seems to me that this file implements no more
>> than a "gentlemen's agreement", rather than a legal barrier such as a
>> password to access a computer on a network that is not intended for
>> public access, such as a web server.
>>
>> Here is an excerpt from Wikipedia:
>>
>> "Because proxies are implicated in abuse, system administrators have
>> developed a number of ways to refuse service to open proxies. IRC networks
>> such as the Blitzed network automatically test client systems for known
>> types of open proxy. [1] Likewise, a mail server may be configured to
>> automatically test mail senders for open proxies, using software such as
>> Michael Tokarev's proxycheck. [2]"
>>
>> So why have these techniques not been effective against the "scraping" in
>> question (by the way, I thought that "scraping" referred to getting text
>> off a screen shot that is in raster format, i.e. via OCR, not actually
>> snarfing the ASCII)?
>>
>> So, when is one hacking into a system and when is one simply accessing
>> material that is exposed and fair game, whether that is desirable or not?
>>
>> What sort of material was this company harvesting?  Does it bear on
>> privacy, which is a very tight subject in the case of medical information
>> -- HIPAA philosophy is highly prevalent.
>>
>> Where are Mr. Salzenberg's computers now?  Are their contents intact?  Who
>> has control of any files copied from them?
>>
>> It is unwise to address this problem via an organization called
>> "geeksunite", which is certainly off-putting to the majority of the
>> population, who, if they are not actually repelled by the geek image,
>> will presume that the subject is beyond their comprehension.  If there
>> are truly illegal acts going on, isn't a counterattack possible?  If
>> civil liberties have been violated, certainly the usual organizations
>> will be alarmed and will provide support.  What about the ACLU and the
>> EFF to defend Mr. Salzenberg?  I would rather support a well-known
>> champion of the individual than give directly to an individual who has
>> not defined the problem, or his approach to addressing it, in other than
>> vague terms -- or is the vagueness simply a product of my lack of
>> understanding of the technical details of what is going on here?
>>
>> By the way, I don't consider this to be "OT" at all, as subjects that bear
>> on the livelihoods of the computing technical community are subsumed by
>> any and all more specific technical discussions -- IMHO.
>>
>> Ken Meyer
>
> _____________________________________________________________
> Seattle Perl Users Group Mailing List
>     POST TO: spug-list at pm.org
> SUBSCRIPTION: http://mail.pm.org/mailman/listinfo/spug-list
>    MEETINGS: 3rd Tuesdays, Location: Amazon.com Pac-Med
>    WEB PAGE: http://seattleperl.org/
>

