SPUG: extracting text between <a> and </a>

Tim Maher/CONSULTIX tim at consultix-inc.com
Thu Oct 5 06:09:14 CDT 2000


This might be the second time you see this and if so, I apologize 8-}
(I may have responded only to the author in the first transmission.)
-Tim

Date: Thu, 5 Oct 2000 10:42:21 +0000
From: Tim Maher/CONSULTIX <tim at consultix-inc.com>
To: Todd Wells <toddw at wrq.com>
Subject: Re: SPUG: extracting text between <a> and </a>
X-Mailer: Mutt 0.95.1i
In-Reply-To: <1654BC972546D31189DA00508B318AC801C88257 at charmander.wrq.com>; from Todd Wells on Thu, Oct 05, 2000 at 08:54:59AM -0700

On Thu, Oct 05, 2000 at 08:54:59AM -0700, Todd Wells wrote:
> I'm working on a little web automation routine and I've used HTML::LinkExtor
> to extract the links from a web page, then I'm processing each of those
> links.
> 
> What I'd like to know is if there's some easy way that I could get the
> original text that accompanied that link.  e.g., <a href =
> "http://thislink"> this text here I want </a>. 

You need to "Use Damian" 8-) !

His Text::Balanced module has a function called extract_tagged() that will
find and parse your anchor tags and return each part in a separate list
element.

For example:

$ cat extract
#! /usr/bin/perl -w
use Text::Balanced 'extract_tagged';

$_= '<a href = "http://thislink"> this text here I want </a> MORE STUFF';

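# undef here means "use the module's default prefix to skip"
$skip=undef;
# in list context, extract_tagged() returns these six parts in this order
($parts{whole_match},
 $parts{remnants},
 $parts{skipover},
 $parts{first_tag},
 $parts{enclosed},
 $parts{last_tag}) = extract_tagged($_,undef,undef,$skip);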

print  map "$_\t=>'$parts{$_}'\n",  sort keys %parts;

$ ./extract
enclosed	=>' this text here I want '
first_tag	=>'<a href = "http://thislink">'
last_tag	=>'</a>'
remnants	=>' MORE STUFF'
skipover	=>''
whole_match	=>'<a href = "http://thislink"> this text here I want </a>'
$

---
Check the documentation of the *latest version* for details on setting
the $skip parameter, which controls skipping over text on the way to
finding the tag; you might find its behavior counter-intuitive.
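
As a rough illustration of that point, here is a minimal sketch, assuming
that the default prefix skips only optional whitespace, so a tag preceded by
ordinary text is not found unless you supply your own prefix pattern. The
sample string and the '[^<]*' prefix are my own assumptions for the demo;
check them against the documentation of the version you have installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced 'extract_tagged';

# Hypothetical input: ordinary text comes before the anchor tag.
my $text = 'See <a href="http://thislink">this text here I want</a> MORE STUFF';

# Default prefix (assumed to be optional whitespace only): the text does
# not start with a tag, so the extraction is expected to fail.
my $copy = $text;
my @default = extract_tagged($copy);

# Assumed prefix pattern: skip any run of characters that are not '<'.
my $skip = '[^<]*';
my ($whole, $remnants, $skipover, $first_tag, $enclosed, $last_tag) =
    extract_tagged($text, undef, undef, $skip);

print "default prefix: ", (defined $default[0] ? "matched" : "no match"), "\n";
if (defined $whole) {
    print "skipover\t=>'$skipover'\n";
    print "enclosed\t=>'$enclosed'\n";
}
else {
    print "custom prefix: no match\n";
}
```

Note that the prefix is matched once, from the current position, and the
opening tag must then follow immediately; the function does not hunt through
the text for a tag on its own.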

-Tim
*========================================================================*
| Dr. Tim Maher, CEO, Consultix       (206) 781-UNIX/8649;  ask for FAX# | 
| Email: tim at consultix-inc.com        Web: http://www.consultix-inc.com  |
|Training- TIM MAHER: Unix, Perl  DAMIAN CONWAY: Adv. Perl, OOP, Parsing |
|CLASSES: 10/9: Adv OO-Perl/Parsing   10/16: Int. Perl  10/23 Perl Prog. |
*========================================================================*
