SPUG: extracting text between <a> and </a>
Tim Maher/CONSULTIX
tim at consultix-inc.com
Thu Oct 5 06:09:14 CDT 2000
This might be the second time you're seeing this; if so, I apologize 8-}
(I may have responded only to the author in the first transmission.)
-Tim
Date: Thu, 5 Oct 2000 10:42:21 +0000
From: Tim Maher/CONSULTIX <tim at consultix-inc.com>
To: Todd Wells <toddw at wrq.com>
Subject: Re: SPUG: extracting text between <a> and </a>
On Thu, Oct 05, 2000 at 08:54:59AM -0700, Todd Wells wrote:
> I'm working on a little web automation routine and I've used HTML::LinkExtor
> to extract the links from a web page, then I'm processing each of those
> links.
>
> What I'd like to know is if there's some easy way that I could get the
> original text that accompanied that link. e.g., <a href =
> "http://thislink"> this text here I want </a>.
You need to "Use Damian" 8-) !
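His Text::Balanced module exports a function called extract_tagged() that
will find and parse your anchor tag and, in list context, return each part
(the whole match, the remaining text, the skipped prefix, the opening tag,
the enclosed text, and the closing tag) in a separate list element.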
For example:
$ cat extract
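#! /usr/bin/perl -w
use strict;
use Text::Balanced 'extract_tagged';

my $text = '<a href = "http://thislink"> this text here I want </a> MORE STUFF';
my $skip = undef;    # undef selects the default prefix pattern

# In list context, extract_tagged() returns its pieces in this order.
my %parts;
( $parts{whole_match},
  $parts{remnants},
  $parts{skipover},
  $parts{first_tag},
  $parts{enclosed},
  $parts{last_tag} ) =
    extract_tagged($text, undef, undef, $skip);    # undef tags: match any tag pair

print map "$_\t=>'$parts{$_}'\n", sort keys %parts;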
$ ./extract
enclosed =>' this text here I want '
first_tag =>'<a href = "http://thislink">'
last_tag =>'</a>'
remnants =>' MORE STUFF'
skipover =>''
whole_match =>'<a href = "http://thislink"> this text here I want </a>'
$
---
Check the documentation of the *latest version* for details on setting
the $skip parameter, which controls skipping over text on the way to
finding the tag; you might find its behavior counter-intuitive.
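To make the $skip point concrete, here is a minimal sketch rather than a definitive recipe. It assumes the behavior described in the Text::Balanced documentation I am familiar with: when the fourth argument is undef, only leading whitespace is skipped, so any other text in front of the tag makes the extraction fail. The input string with "Click " in front, the prefix pattern '[^<]*', and the names $html and %found are my own illustrations, not part of the original question; check your installed version's documentation before relying on this.

```perl
#! /usr/bin/perl -w
use strict;
use Text::Balanced 'extract_tagged';

# Hypothetical input: the same anchor as before, but with text in front of it.
my $html = 'Click <a href = "http://thislink"> this text here I want </a> MORE STUFF';

# 1) Default prefix (undef): only leading whitespace may be skipped, so the
#    leading "Click " is expected to make the extraction fail.
my $copy1   = $html;
my @default = extract_tagged($copy1, undef, undef, undef);
print "default skip: ", ($default[0] ? "matched" : "no match"), "\n";

# 2) Explicit prefix pattern (an assumption chosen for this example):
#    skip any run of non-'<' characters before looking for the tag.
my $copy2 = $html;
my $skip  = '[^<]*';
my %found;
@found{qw(whole_match remnants skipover first_tag enclosed last_tag)} =
    extract_tagged($copy2, undef, undef, $skip);

for my $key (sort keys %found) {
    my $value = defined $found{$key} ? $found{$key} : 'undef';
    print "$key\t=>'$value'\n";
}
```

Each call gets its own copy of the string so that neither depends on any match position left behind by the other. Note that '[^<]*' stops at the first '<' of any kind, so if other tags come before the anchor you would need to pass an explicit opening-tag pattern as well.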
-Tim
*========================================================================*
| Dr. Tim Maher, CEO, Consultix (206) 781-UNIX/8649; ask for FAX# |
| Email: tim at consultix-inc.com Web: http://www.consultix-inc.com |
|Training- TIM MAHER: Unix, Perl DAMIAN CONWAY: Adv. Perl, OOP, Parsing |
|CLASSES: 10/9: Adv OO-Perl/Parsing 10/16: Int. Perl 10/23 Perl Prog. |
*========================================================================*