[Buffalo-pm] XML File Parsing And Manipulation...

Tue Dec 27 20:46:04 PST 2005

Mongers,

I apologize in advance for the long email...

I have an optimization/"is this the best way to do it" question.
Hopefully someone can offer a better solution.

Problem:

I need to take the following XML file and remove a few elements, along
with everything nested inside them (child tags and data). Here is a
link to the file:
http://www.acsu.buffalo.edu/~dkm/example.xml

The XML file is a dump of a Round Robin Database (RRD -
www.rrdtool.org). This RRD has two data sources (named "la" and "ds1").
I need to remove the "ds1" datasource, and the only way to remove it is
by dumping the RRD to XML, modifying the XML file, then restoring the
RRD from the XML file.

What I need to do is remove the datasource definition and the actual
numerical data for "ds1". The numerical data for the two datasources is
contained in the <v> tags, which are within the <row> tags: the first
<v> in each row belongs to "la" and the second to "ds1". I will need to
remove the second <v> tag and its data from every row.
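
For the <row> part, one possible approach is an XPath query through
XML::LibXML. This is only a sketch: it assumes the module is installed,
and the XML below is a tiny hand-written stand-in for the <database>
part of the dump, not the real file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hand-written stand-in for the <database> section of an RRD dump.
my $xml = <<'XML';
<rrd>
  <rra>
    <database>
      <row><v> 1.0000000000e+00 </v><v> 9.0000000000e+00 </v></row>
      <row><v> NaN </v><v> NaN </v></row>
    </database>
  </rra>
</rrd>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# v[2] selects the second <v> child of each <row>, i.e. the "ds1" column.
for my $v ($doc->findnodes('//row/v[2]')) {
    $v->parentNode->removeChild($v);
}

print $doc->toString;
```

Each <row> is left with only its first <v>; nothing else in the
document is touched.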

The following sections need to be removed from the file (in addition to
the second <v> tags and data):

<ds>
     <name> ds1 </name>
     <type> GAUGE </type>
     <minimal_heartbeat> 600 </minimal_heartbeat>
     <min> 0.0000000000e+00 </min>
     <max> 2.0000000000e+05 </max>
     <last_ds> UNKN </last_ds>
     <value> 9.0000000000e+00 </value>
     <unknown_sec> 0 </unknown_sec>
</ds>
...
<ds><value> NaN </value>  <unknown_datapoints> 0 </unknown_datapoints></ds>
...
<ds><value> 7.0166666667e+00 </value>  <unknown_datapoints> 0 </unknown_datapoints></ds>
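
These sections could also be picked out by name and position rather
than by counting lines. A minimal sketch with XML::LibXML follows. The
element nesting (/rrd/ds and /rrd/rra/cdp_prep/ds) is my assumption
about the dump layout, and the sample XML is a cut-down stand-in for
the real file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for the dump; the nesting is an assumption.
my $xml = <<'XML';
<rrd>
  <ds>
    <name> la </name>
    <type> GAUGE </type>
  </ds>
  <ds>
    <name> ds1 </name>
    <type> GAUGE </type>
  </ds>
  <rra>
    <cdp_prep>
      <ds><value> NaN </value>  <unknown_datapoints> 0 </unknown_datapoints></ds>
      <ds><value> 7.0166666667e+00 </value>  <unknown_datapoints> 0 </unknown_datapoints></ds>
    </cdp_prep>
  </rra>
</rrd>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

my @doomed = (
    # The dump pads names with spaces ("<name> ds1 </name>"),
    # hence normalize-space().
    $doc->findnodes('/rrd/ds[normalize-space(name) = "ds1"]'),
    # Assumption: "ds1" is the second data source, so its entry in
    # every <cdp_prep> is the second <ds> there.
    $doc->findnodes('/rrd/rra/cdp_prep/ds[2]'),
);
$_->unbindNode for @doomed;

print $doc->toString;
```

Because the nodes are removed as whole subtrees, it does not matter how
the tags are split across lines in the file.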

Solution:

Here's a solution that I came up with, though I'm assuming that there's
some XML module that will make this easier. The script takes one or
more XML filenames, reads each file line by line, drops the data that
is no longer needed, and prints the remaining XML into a new file named
"<filename>.new".

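#!/usr/bin/perl
use strict;
use warnings;

foreach my $file (@ARGV)
{
  open my $in,  '<', $file       or die "Can't open $file: $!\n";
  open my $out, '>', "$file.new" or die "Can't open new file $file.new: $!\n";
  my $ds  = 0;   # state for the <ds> definition blocks (the 2nd is "ds1")
  my $cdp = 0;   # toggles on each <ds><value> line inside <cdp_prep>
  while (<$in>)
  {
     # Data rows: keep only the first <v> (the "la" value).
     if (/^(.+)<row><v> (.+) <\/v><v>/)
     {
        print $out "$1<row><v> $2 </v></row>\n";
        next;
     }
     # cdp_prep lines alternate la, ds1: keep the first, skip the second.
     elsif (/<ds><value>/ && $cdp == 1) { $cdp = 0; next; }
     elsif (/<ds><value>/ && $cdp == 0) { $cdp = 1; }
     # Definitions: keep the first <ds> block, drop the second.
     elsif (/<ds>/ && $ds == 0)         { $ds = 1; }
     elsif (/<ds>/ && $ds == 1)         { $ds = 2; next; }
     # Leave state 2 at the closing tag so later lines are not dropped.
     elsif (/<\/ds>/ && $ds == 2)       { $ds = 3; next; }
     elsif (/(<name>
               |<type>
               |minimal_heartbeat
               |min>
               |max>
               |last_ds
               |value>
               |unknown_sec>)
               /x && $ds == 2)
     {
        next;
     }
     print $out $_;
  }
  close $in;
  close $out or die "Can't close $file.new: $!\n";
}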

There has to be a better way to do this - rather than just counting tags
and keeping counters that decide whether to print the current line or
not.
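
One possible "better way", as a rough sketch only: parse the dump with
XML::LibXML, look up the position of the data source by name, and
remove whatever sits at that position in the definitions, in each
<cdp_prep>, and in each <row>. The helper name remove_ds and the
element paths are assumptions of mine, and I have not checked that
rrdtool restore accepts the file this writes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: remove the data source called $name from a parsed
# RRD dump. Returns its 1-based position, or 0 if it was not found.
sub remove_ds {
    my ($doc, $name) = @_;

    # Work out which column the data source occupies (1-based, as in XPath).
    my @ds  = $doc->findnodes('/rrd/ds');
    my $pos = 0;
    for my $i (0 .. $#ds) {
        my $n = $ds[$i]->findvalue('normalize-space(name)');
        if ($n eq $name) { $pos = $i + 1; last; }
    }
    return 0 unless $pos;

    # Assumption: the definition, the per-RRA cdp_prep entry and the row
    # values all sit at that same position among their siblings.
    my @doomed = (
        $ds[$pos - 1],
        $doc->findnodes("/rrd/rra/cdp_prep/ds[$pos]"),
        $doc->findnodes("/rrd/rra/database/row/v[$pos]"),
    );
    $_->unbindNode for @doomed;
    return $pos;
}

# Same interface as the script above: each FILE is rewritten to FILE.new.
foreach my $file (@ARGV) {
    my $doc = XML::LibXML->load_xml(location => $file);
    remove_ds($doc, 'ds1') or warn "No data source 'ds1' in $file\n";
    $doc->toFile("$file.new");
}
```

This trades the line-by-line counters for a full parse, so the whole
dump is held in memory at once; that may matter for very large RRDs.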

Thoughts?

#!/Dan

---------
Daniel Magnuszewski, CCNA
Systems Analyst
Operating Tools
M & T Bank Corporation
716.639.6834
dmagnuszewski { at } mandtbank.com 
http://www.mandtbank.com 