[Omaha.pm] One-liner file clean-up

Jay Hannah jay at jays.net
Tue Feb 13 19:08:54 PST 2007


PROBLEM:

Given a file like this:

--------
A<B>01100 Metabolism</B>$
B$
B  <B>01110 Carbohydrate Metabolism</B>$
C$
C    00010 Glycolysis / Gluconeogenesis [PATH:sac00010]$
D$
D      <a href="/dbget-bin/www_bget?sac:SACOL1604">SACOL1604</a> glk; glucokinase [EC:2.7.1.2]; <a href=/dbget-bin/www_bget?ko+K00845>K00845</a> glucokinase $
D      <a href="/dbget-bin/www_bget?sac:SACOL0966">SACOL0966</a> pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]; <a href=/dbget-bin/www_bget?ko+K01810>K01810</a> glucose-6-phosphate isomerase $
--------

Strip out all the HTML tags, along with the leading capital letter and 
the spaces that follow it, so that it ends up looking like this:

--------
01100 Metabolism

01110 Carbohydrate Metabolism

00010 Glycolysis / Gluconeogenesis [PATH:sac00010]

SACOL1604 glk; glucokinase [EC:2.7.1.2]; K00845 glucokinase
SACOL0966 pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]; K01810 glucose-6-phosphate isomerase
--------


SOLUTION:

$ perl -pe 's/<.*?>//g; s/^[A-Z] *//;' filename.txt


Grin,

j


