SPUG: Web bots and authentication cookies

Sanford Morton smorton at pobox.com
Wed Aug 16 17:33:12 CDT 2000


Recently I've seen more web sites that use cookies for password
authentication. You log in through a login page (a web form) on the
site, and if the login succeeds, a cookie is returned and saved. A
valid cookie then becomes your authentication for content pages on the
site. This has some advantages over standard htpasswd-type
authentication: you can control the expiration more finely (e.g.,
saving the cookie across sessions), and you can do the login on a
secure server, then serve further pages from a faster, unsecured
server. The Wall Street Journal does the former, my futures broker
(Lind-Waldock) the latter, though I'm not confident the latter is very
secure, since the cookie itself then travels in the clear.

But this makes for more complex web robots. I've got it working, so I
thought I'd share the logic. (I also have a Net::IRC interface if
anyone's interested.) Essentially, a login function submits the web
form with your username and password, and saves the cookie contained
in the response header in a cookie jar. Then, to browse other pages,
you retrieve the cookie from the jar and add it to each request.
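
As a minimal offline sketch of that round trip (separate from the
script below), the following fakes the login response by hand instead
of contacting a server. The host www.example.com, the cookie name
"session" and its value are made up for illustration; extract_cookies
and add_cookie_header are the same HTTP::Cookies calls the script
uses.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# in-memory cookie jar (no file, nothing is written to disk)
my $jar = HTTP::Cookies->new;

# pretend we posted the login form to a hypothetical host ...
my $login_req = HTTP::Request->new(POST => 'http://www.example.com/login.html');

# ... and that the server answered with a made-up session cookie
my $login_resp = HTTP::Response->new(200, 'OK');
$login_resp->header('Set-Cookie' => 'session=abc123; path=/');

# the jar needs to know which request this response answers
$login_resp->request($login_req);

# step 1: save the cookie from the response into the jar
$jar->extract_cookies($login_resp);

# step 2: build a request for a content page and add the saved cookie
my $page_req = HTTP::Request->new(GET => 'http://www.example.com/news.html');
$jar->add_cookie_header($page_req);

print "Cookie header sent: ", $page_req->header('Cookie'), "\n";
```

The jar only adds a cookie to requests whose host and path match the
ones the cookie was set for, so a request to a different site would
leave here without a Cookie header.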

Hope this is useful--comments are welcome. --Sandy

#!/usr/bin/perl -w
use strict;

use LWP::UserAgent;
use HTTP::Cookies;

my $username = 'xxx';
my $password = 'yyy';

# unbuffered output, so the heartbeat below shows up immediately
$| = 1;

# create a cookie jar in a file
my $jar = HTTP::Cookies->new ('file' => 'my.LWP.cookies', 'autosave' => 1);
# url of the login page
my $login_url = 'http://.../login.html';
# login, get and save cookie
login($login_url, $jar, $username, $password);

# page we want to repeatedly poll for new news items
my $url = 'http://.../news.html';
my $req = HTTP::Request->new(GET => $url);

# add the cookie into the current request
$jar->add_cookie_header($req);

my $ua = LWP::UserAgent->new;
my $resp;
my ($s,$m,$h);

while (1) {

  $resp = $ua->request($req);
  if ($resp->is_success) {  
    # analyze the body of the news page, printing only new items
    # (as_string would also include the status line and headers)
    process_page( $resp->content );
  } else { 
    die $resp->as_string; 
  }

  # report the time as a counter to see if we're still alive
  # (bots often time out or otherwise fail)
  ($s,$m,$h) = localtime(time);
  printf " %02d:%02d ", $h, $m;
  sleep 60;
}

# logs in, sets cookies in cookie jar
sub login {

  my ($url, $jar, $username, $password) = @_;
  my $ua = LWP::UserAgent->new;
  my $req = HTTP::Request->new(POST => $url);

  # The login page is a web form, where you enter username/password.
  # Set the content for this request by looking carefully at the source
  # of the form, including hidden elements.
  $req->content_type('application/x-www-form-urlencoded');
  $req->content("Username=$username&Password=$password&x=1&y=1");
  
  my $resp = $ua->request($req);
  if ($resp->is_success or $resp->is_redirect) {

    # extract the cookie from the response and save to cookie jar
    $jar->extract_cookies($resp);
    $jar->save;
  } else {
    die $resp->as_string;
  }
  1;
}

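# process the news page: print items not seen before, remember the newest
# in this case, new items are listed in order above older items on the page
# BEGIN runs the initialization at compile time; the polling loop above
# never ends, so a plain block down here would never be executed
BEGIN { my $last_seen_item = '';

  sub process_page {

    # split on whatever is the news item separator on the page;
    # the first chunk is header material before the first item, so drop it
    my @items = split /<SPAN class=subhead><B>/i, $_[0];
    shift @items;
    return 0 unless @items;

    # the first item on the page is the newest one
    my $newest = $items[0];

    for (@items) {

      # stop at the newest item from the previous poll;
      # everything from here down has already been printed
      last if $_ eq $last_seen_item;

      # remove html, format it nicely
      # .....

      print "\n$_\n\n";
    }

    # remember the newest item for the next poll
    $last_seen_item = $newest;
    1;
  }
}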
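
One caveat with the login sub above: it interpolates the username and
password straight into the form body, so a character such as & or = in
a password would corrupt it. A minimal sketch of one way around that,
assuming HTTP::Request::Common (part of the LWP distribution) is
available, lets its POST function do the URL-encoding. The field names
Username, Password, x and y are copied from the script; the URL and
the credentials here are made up.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# hypothetical credentials containing characters that must be escaped
my $username = 'sandy@example.com';
my $password = 'p&q=1';

# POST with an array ref builds an application/x-www-form-urlencoded
# body, escaping each name and value and keeping the field order
my $req = POST 'http://www.example.com/login.html',
  [ Username => $username, Password => $password, x => 1, y => 1 ];

print $req->content_type, "\n";
print $req->content, "\n";
```

The resulting request object can be handed to $ua->request($req) in
the same way as the hand-built one in login.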

