From rjbs-perl-abe at lists.manxome.org Tue Nov 8 12:41:37 2005
From: rjbs-perl-abe at lists.manxome.org (Ricardo SIGNES)
Date: Tue, 8 Nov 2005 15:41:37 -0500
Subject: [ABE.pm] next meeting: beer, perl, and meat
Message-ID: <20051108204136.GO27557@manxome.org>

Yes, I believe it is time that we should once again celebrate three of
life's great pleasures: good beer, good coding, and good meat.

Our last meeting was (good grief!) September 21, at J.P. McGrady's, and I
think a good time was had by all. I think a next meeting there would be
excellent. I will have an Anchor Steam and a Fahy Bridge burger, because I
am predictable, and I will talk about the awesome code I'm getting to poke
at, lately. You are free to order anything you like, and to ignore me or
talk over me.

I'm giving plenty of notice, because the next few weeks are so busy, not
just for me but for everyone. Here is what you need to know:

  COME TO J.P. McGRADY'S
  AT 3rd St AND ADAMS STREET IN SOUTH BETHLEHEM
  WEDNESDAY, NOVEMBER 30
  19:00 (that's seven o'clock in the evening)

If you want to come drink beer with me, John, and whoever else wants to
show up -- Mark? Faber? Dann? -- but you can't make that time, whine now
and it can be moved around.

I say that starting in January, we should set up a recurring schedule.
More on that later!

--
rjbs
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://mail.pm.org/pipermail/abe-pm/attachments/20051108/a6874fe9/attachment.bin

From thehead at patshead.com Wed Nov 9 10:38:26 2005
From: thehead at patshead.com (Pat Regan)
Date: Wed, 09 Nov 2005 13:38:26 -0500
Subject: [ABE.pm] next meeting: beer, perl, and meat
In-Reply-To: <20051108204136.GO27557@manxome.org>
References: <20051108204136.GO27557@manxome.org>
Message-ID: <43724222.7070603@patshead.com>

Ricardo SIGNES wrote:
> COME TO J.P. McGRADY'S
> AT 3rd St AND ADAMS STREET IN SOUTH BETHLEHEM
> WEDNESDAY, NOVEMBER 30
> 19:00 (that's seven o'clock in the evening)
>

What kind of crazy town is Bethlehem that two streets can intersect? :)

Pat
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: OpenPGP digital signature
Url : http://mail.pm.org/pipermail/abe-pm/attachments/20051109/d00ef109/signature.bin

From waltman at pobox.com Wed Nov 9 11:06:26 2005
From: waltman at pobox.com (Walt Mankowski)
Date: Wed, 9 Nov 2005 14:06:26 -0500
Subject: [ABE.pm] next meeting: beer, perl, and meat
In-Reply-To: <43724222.7070603@patshead.com>
References: <20051108204136.GO27557@manxome.org> <43724222.7070603@patshead.com>
Message-ID: <20051109190626.GA21577@waltman.dnsalias.org>

On Wed, Nov 09, 2005 at 01:38:26PM -0500, Pat Regan wrote:
> Ricardo SIGNES wrote:
> > COME TO J.P. McGRADY'S
> > AT 3rd St AND ADAMS STREET IN SOUTH BETHLEHEM
> > WEDNESDAY, NOVEMBER 30
> > 19:00 (that's seven o'clock in the evening)
> >
>
> What kind of crazy town is Bethlehem that two streets can intersect? :)

I hear some of their streets are *paved*, too!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://mail.pm.org/pipermail/abe-pm/attachments/20051109/ad6cb48b/attachment.bin

From faber at linuxnj.com Mon Nov 14 17:07:19 2005
From: faber at linuxnj.com (Faber Fedor)
Date: Mon, 14 Nov 2005 20:07:19 -0500
Subject: [ABE.pm] Screen-scraping
Message-ID: <20051115010719.GA23764@neptune.faber.nom>

Guys,

I've got to pull data off of two websites. I plan on using WWW::Mechanize
and HTML::TokeParser. The one website seems to be easy enough, but the
HTML from the second website is a lot of tags, embedded tables, etc. with
no identifying tags on the data.

My Qs are:

1) Is there a better tool for this than HTML::TokeParser?

2) Does anyone know of a tool that will parse HTML and build a DOM-like
   object? It would be easier to walk through that than the actual
   text/HTML.

--
Regards,
Faber Fedor
President
Linux New Jersey, Inc.
908-320-0357
800-706-0701
http://www.linuxnj.com

From faber at linuxnj.com Tue Nov 15 13:55:57 2005
From: faber at linuxnj.com (Faber Fedor)
Date: Tue, 15 Nov 2005 16:55:57 -0500
Subject: [ABE.pm] Using TreeBuilder
Message-ID: <20051115215557.GA29664@neptune.faber.nom>

There's something (well, many things, I'm sure!) I'm not getting about
TreeBuilder. Here's an example of what I'm trying to do:

I've got a table that looks like this:

Who's Next
Kiss Alive! Dressed To Kill
Kate Bush The Kick Inside Hounds of Love Ariel
The Who The Kids are Alright

I want to loop through the table and if the first column is equal to
"Kate Bush", put the rest of the row into variables/arrays/hashes/whatever.

Not only can I not get past the first row, the code to print the contents
of the data cells is ugly, IMO:

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;

# Let's walk through the forms....
my $tree = HTML::TreeBuilder->new;
$tree->parse_file("simple.html");

# we found a table row
my $node = $tree->look_down("_tag", "tr") ;

#this prints out the first row
foreach my $item ($node->content_list()) {
    foreach my $jtem ($item->content_list()){
        print "jtem is $jtem\n";
    }
}

# find the next row
print "Looking for the next row\n";

# I assumed this would increment to the next row, but it doesn't. :-(
# it prints the first row again.
$node = $tree->look_down("_tag", "tr");
foreach my $item ($node->content_list()) {
    foreach my $jtem ($item->content_list()){
        print "jtem is $jtem\n";
    }
}

#the first row, again!
$node = $tree->look_down("_tag", "td");
print @{$node->content()}[0]."\n";
if (@{$node->content()}[0] eq "Kate Bush") {
    print @{$node->content()}[0]."\n";
    print @{$node->content()}[1]."\n";
    print @{$node->content()}[2]."\n";
}

$tree->delete();
exit(0);

--
Regards,
Faber Fedor
President
Linux New Jersey, Inc.
908-320-0357
800-706-0701
http://www.linuxnj.com

From faber at linuxnj.com Tue Nov 15 14:31:25 2005
From: faber at linuxnj.com (Faber Fedor)
Date: Tue, 15 Nov 2005 17:31:25 -0500
Subject: [ABE.pm] Using TreeBuilder
In-Reply-To: <20051115215557.GA29664@neptune.faber.nom>
References: <20051115215557.GA29664@neptune.faber.nom>
Message-ID: <20051115223125.GA29964@neptune.faber.nom>

Well, I found something that works. I know why this works, but I don't
know why the previous stuff doesn't work.

Here's what I'm going to use:

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file("simple.html");

my @artists = ( "Kate Bush", "Peter Gabriel");

foreach my $artist (@artists) {
    my $node = $tree->look_down('_tag', 'tr',
        sub { $_[0]->as_text =~ m{$artist} } );

    foreach my $item ($node->content_list()) {
        foreach my $jtem ($item->content_list()){
            print "jtem is $jtem\n";
        }
    }
}

$tree->delete();
exit(0);

Comments?

--
Regards,
Faber Fedor
President
Linux New Jersey, Inc.
908-320-0357
800-706-0701
http://www.linuxnj.com

From faber at linuxnj.com Tue Nov 15 20:20:13 2005
From: faber at linuxnj.com (Faber Fedor)
Date: Tue, 15 Nov 2005 23:20:13 -0500
Subject: [ABE.pm] Using TreeBuilder
In-Reply-To: <20051115223125.GA29964@neptune.faber.nom>
References: <20051115215557.GA29664@neptune.faber.nom> <20051115223125.GA29964@neptune.faber.nom>
Message-ID: <20051116042013.GA31484@neptune.faber.nom>

For all of you following along at home, you'll be pleased to hear that
I've got a working system going; I can pull the data from the website,
parse it into a hash of hashes, and load it into a database.

After spending a day and a half RTFM for HTML::TokeParser,
HTML::TreeBuilder and the like, and manually parsing tables within tables
with s without s, the winning method is....

    my @content = `lynx --dump $url`

Everything old is new again.

--
Regards,
Faber Fedor
President
Linux New Jersey, Inc.
908-320-0357
800-706-0701
http://www.linuxnj.com

From faber at linuxnj.com Wed Nov 30 08:58:51 2005
From: faber at linuxnj.com (Faber Fedor)
Date: Wed, 30 Nov 2005 11:58:51 -0500
Subject: [ABE.pm] backticking problem
Message-ID: <20051130165851.GA29641@neptune.faber.nom>

I'm trying to get Perl to read the output of some shell commands that
involve pipes and I'm missing something basic.

I have this:

    lynx -dump $url | grep 'S&P 500' | head -5 | head -1

which works correctly on the command line.
When I put it in a Perl script thusly

    my $line = `lynx -dump $url | grep 'S&P 500' | head -5 | head -1`

I get the output of the lynx command or the equivalent of

    my $line = `lynx -dump $url`

How do I get the proper output into $line? Or do I have to process the
lynx output separately inside of Perl?

--
Regards,
Faber Fedor
President
Linux New Jersey, Inc.
908-320-0357
800-706-0701
http://www.linuxnj.com

From waltman at pobox.com Wed Nov 30 09:59:16 2005
From: waltman at pobox.com (Walt Mankowski)
Date: Wed, 30 Nov 2005 12:59:16 -0500
Subject: [ABE.pm] backticking problem
In-Reply-To: <20051130165851.GA29641@neptune.faber.nom>
References: <20051130165851.GA29641@neptune.faber.nom>
Message-ID: <20051130175916.GH5398@waltman.dnsalias.org>

On Wed, Nov 30, 2005 at 11:58:51AM -0500, Faber Fedor wrote:
> I'm trying to get Perl to read the output of some shell commands that
> involve pipes and I'm missing something basic.
>
> I have this:
>
>     lynx -dump $url | grep 'S&P 500' | head -5 | head -1
>
> which works correctly on the command line. When I put it in a Perl
> script thusly
>
>     my $line = `lynx -dump $url | grep 'S&P 500' | head -5 | head -1`
>
> I get the output of the lynx command or the equivalent of
>
>     my $line = `lynx -dump $url`
>
> How do I get the proper output into $line? Or do I have to process the
> lynx output separately inside of Perl?

That's odd. At first I thought you needed to escape the &, but then I
wrote this little test script and it appears to work correctly:

#!/usr/local/bin/perl -w
use strict;

my $url = 'http://finance.yahoo.com/q?s=%5EGSPC';
my $line = `lynx -dump $url | grep 'S&P 500' | head -5 | head -1`;
print "$line\n";

So I guess you must be doing something else wrong...

BTW why are you doing "head -5 | head -1"? If you want to print the first
line, "head -1" is sufficient. If you want to print the 5th line, you
could do something like "sed -n 5p".

Walt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://mail.pm.org/pipermail/abe-pm/attachments/20051130/c557159f/attachment.bin

From rjbs-perl-abe at lists.manxome.org Wed Nov 30 11:01:11 2005
From: rjbs-perl-abe at lists.manxome.org (Ricardo SIGNES)
Date: Wed, 30 Nov 2005 14:01:11 -0500
Subject: [ABE.pm] backticking problem
In-Reply-To: <20051130165851.GA29641@neptune.faber.nom>
References: <20051130165851.GA29641@neptune.faber.nom>
Message-ID: <20051130190111.GD1075@manxome.org>

* Faber Fedor [2005-11-30T11:58:51]
> which works correctly on the command line. When I put it in a Perl
> script thusly
>
>     my $line = `lynx -dump $url | grep 'S&P 500' | head -5 | head -1`
>
> I get the output of the lynx command or the equivalent of
>
>     my $line = `lynx -dump $url`

I'm with Walt: I don't see anything immediately wrong. Did you copy and
paste that code, or did you maybe insert an error (fixing the problem) in
your transcription?

Alternately:

    my ($line) = grep { /S&P/ } split /\n/, `lynx -dump $url`;

I've just eaten, so my brain may be slightly laggy, but that should do
what you want, too.

Could you show us the whole script you're running -- or, much better, show
us a pared down version that can be run, producing the same problematic
results?

Are we going to see you at McGrady's tonight? :)

--
rjbs
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://mail.pm.org/pipermail/abe-pm/attachments/20051130/c0c16a72/attachment.bin

From waltman at pobox.com Wed Nov 30 11:08:52 2005
From: waltman at pobox.com (Walt Mankowski)
Date: Wed, 30 Nov 2005 14:08:52 -0500
Subject: [ABE.pm] backticking problem
In-Reply-To: <20051130190111.GD1075@manxome.org>
References: <20051130165851.GA29641@neptune.faber.nom> <20051130190111.GD1075@manxome.org>
Message-ID: <20051130190852.GI5398@waltman.dnsalias.org>

On Wed, Nov 30, 2005 at 02:01:11PM -0500, Ricardo SIGNES wrote:
> Are we going to see you at McGrady's tonight? :)

Sadly, no. I have class tonight. But hopefully I'll make it to one of
your meetings eventually... :)

Walt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://mail.pm.org/pipermail/abe-pm/attachments/20051130/240b14b3/attachment.bin
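
The backticking thread ends without a confirmed cause. One thing that
could produce the reported symptom is a $url containing a shell
metacharacter such as &. This is only an assumption, since the URL
Faber actually used is never shown. Backticks hand the interpolated
string to the shell, which would split the pipeline at that character.
Below is a minimal sketch that sidesteps the shell with the list form of
a piped open and does the filtering in Perl, in the spirit of the
grep/split one-liner in the thread. The helper name first_matching_line
is hypothetical, and the demo feeds it made-up lines from a child perl
process instead of lynx so that it runs without network access:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): run @command without a shell
# and return the first output line matching $pattern, or undef if none does.
sub first_matching_line {
    my ($pattern, @command) = @_;

    # List-form piped open: with more than one element in @command, each
    # argument reaches the program verbatim, so characters such as & or ?
    # in a URL are never interpreted by a shell.
    open(my $fh, '-|', @command)
        or die "cannot run $command[0]: $!";

    my $match;
    while (my $line = <$fh>) {
        chomp $line;
        # Remember only the first hit, but keep reading so the child can
        # finish writing normally.
        $match = $line if !defined $match && $line =~ $pattern;
    }
    close $fh;
    return $match;
}

# Demo with made-up data: a child perl stands in for "lynx -dump $url".
my @fake_lynx = ($^X, '-e', 'print "Dow 1\nS&P 500 first\nS&P 500 second\n"');
my $line = first_matching_line(qr/S&P 500/, @fake_lynx);
print "$line\n";
```

With lynx installed, the equivalent call would be
first_matching_line(qr/S&P 500/, 'lynx', '-dump', $url). Whether this
would have resolved the original report is not known from the thread.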