[Za-pm] Script

Wed Jun 4 13:16:03 CDT 2003

On Wednesday 04 June 2003 08:33, Bartho Saaiman wrote:
> I want to filter a log file that I currently have to manipulate
> manually. So I was thinking to myself that this would be nice if I could
> do this with Perl. If it would be easier in bash, suggestions would also
> be welcome. So here is the scenario:
>
> Output of log file (sms.log):
> <snip>
> [Fri May 23 11:02:12 SAST 2003] SMS  user at domain.co.za sent 43
> characters to 271234567891
> [Fri May 23 18:16:02 SAST 2003] SMS  "Some User" <user at domain.co.za>
> sent 150 characters to 271234567891
> [Sat May 24 12:51:37 SAST 2003] SMS  "Some User" <user at domain.co.za>
> sent 151 characters to 271234567891
> [Mon May 26 15:16:00 SAST 2003] SMS  Some User <user at domain.co.za> sent
> 142 characters to 271234567891
> </snip>
>
> So the first problem is that the user (Some User) detail is logged in
> three different ways. I am also only interested in the email address as I
> can use this to do accounting with. I am currently using bash like this:
>

Just some notes of my own - it was probably mentioned in some of the other 
mails, but I might have missed it.

We assume you have various forms of the name, but only one particular e-mail 
address per user. Whenever you compare values with regular expressions or 
other means, remember to first convert all text to the same case (or at 
least ignore case). What we can also do is just use the first instance of 
the real name and ignore the rest (if we want a name at all). Also remember 
to strip all punctuation.
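
A minimal sketch of that clean-up, for illustration only. The helper names 
normalize_email and clean_name are hypothetical, and the address is written 
with " at " only because that is how it appears in the samples in this mail:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# Lower-case an e-mail address and trim surrounding whitespace, so that
# differently cased spellings of one address end up under the same key.
sub normalize_email {
	my ($email) = @_;
	$email =~ s/^\s+|\s+$//g;
	return lc $email;
}

# Strip punctuation from a display name and tidy up the whitespace.
sub clean_name {
	my ($name) = @_;
	$name =~ s/[[:punct:]]//g;	# drop quotes, commas, dots, ...
	$name =~ s/\s+/ /g;		# collapse runs of whitespace
	$name =~ s/^ | $//g;		# trim the ends
	return $name;
}

print normalize_email(' User at Domain.CO.ZA '), "\n";
print clean_name('"Some  User"'), "\n";
```

Using the normalised e-mail as the hash key means the case of the address no 
longer matters when the totals are added up.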

Given the above example, here is a simple script:

<script "test.pl">
#!/usr/bin/perl

%usernames = ();
%usercounter = ();

# read from STDIN (or from files named on the command line)
while (<>) {

	chomp;	# remove newlines
	# skip lines that are not SMS entries, otherwise $1 and $2 are not set
	next unless /SMS\s+(.+)\s+sent\s+(\d+)\s+/;
	$userraw = $1; # holds name and e-mail or just an e-mail
	$counter = $2; # holds the counter value

	# now work some more on the user - split name and e-mail
	if ( $userraw =~ /\s+<\w+/ ) {

		$userraw =~ /(.+)<(.+)>/;
		$user = $1; # the user name
		$email = $2; # the e-mail

		# strip leading and trailing non-word characters (quotes, spaces)
		$user =~ s/^\W+//;
		$user =~ s/\W+$//;

		# you could now build more code to "ignore" case - I skipped it

		# add values to the hashes
		$usercounter{ $email } = $usercounter{ $email } + $counter;
		if ( not $usernames{ $email } ) { $usernames{ $email } = $user; }

	} else {

		# only an e-mail - no name to work with; the address may or
		# may not be wrapped in <...>
		if ( $userraw =~ /<(.+)>/ ) { $email = $1; }
		else { $email = $userraw; }
		$usercounter{ $email } = $usercounter{ $email } + $counter;

	}

}

# print the results...
foreach $key ( keys %usercounter ) {

	$count = $usercounter{ $key };
	$name = $usernames{ $key };
	if ( $name ) { print "$name = $count\n"; }
	else { print "unknown = $count\n"; }

}
</script "test.pl">

Sample Data:

<data "test.txt">
[Fri May 23 11:02:12 SAST 2003] SMS  user at domain.co.za sent 43 characters to 271234567891
[Fri May 23 18:16:02 SAST 2003] SMS  "Some User" <user at domain.co.za> sent 150 characters to 271234567891
[Sat May 24 12:51:37 SAST 2003] SMS  "Some User" <user at domain.co.za> sent 151 characters to 271234567891
[Mon May 26 15:16:00 SAST 2003] SMS  Soe User <user at domain.co.za> sent 142 characters to 271234567891
[Sat May 24 12:51:37 SAST 2003] SMS  "Another User" <user2 at domain.co.za> sent 151 characters to 271234567891
[Fri May 23 18:16:02 SAST 2003] SMS  <user3 at domain.co.za> sent 150 characters to 271234567891
</data "test.txt">

Sample Run:

<output>
$ cat test.txt | perl test.pl
unknown = 150
Another User = 151
Some User = 486
</output>

NOTE: You should be aware that the data in a hash is not sorted - there is a 
way to sort it, so just shout if you need it. Also note that although the 
name is misspelled in the last instance of "Some User", the script keeps 
only the first name it finds for that e-mail. E-mails that have NO known 
name are printed as 'unknown', but they are still grouped uniquely, as we 
match per e-mail.
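
For what it is worth, a minimal sketch of sorting the hash keys before 
printing. The hash is filled by hand here with the totals from the sample 
run above, and the array names @by_email and @by_count are only 
illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Totals from the sample run, keyed by e-mail (filled in by hand here).
my %usercounter = (
	'user at domain.co.za'  => 486,
	'user2 at domain.co.za' => 151,
	'user3 at domain.co.za' => 150,
);

# Keys sorted alphabetically by e-mail.
my @by_email = sort keys %usercounter;

# Keys sorted by total, largest first; equal totals fall back to the e-mail.
my @by_count = sort {
	$usercounter{$b} <=> $usercounter{$a} or $a cmp $b
} keys %usercounter;

print "$_ = $usercounter{$_}\n" for @by_count;
```

The hash itself stays unsorted; only the list of keys used for printing is 
put in order.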

Cheers

> [bartho at hercules bartho]$ cat  smslog |grep "May"| grep "2003" |awk \
> 	'{print $8, $9, $10}'
> user at domain.co.za sent 47
> "Some User" <user at domain.co.za>
> Some User <user at domain.co.za>
>
> Now this is where my problem starts. I probably need to use regular
> expressions to feed it the month and the domain. The year I could
> probably use in a regex too, but this doesn't change too often. Then I
> need to send this to a clean file only containing the emails that this
> originated from. I do not need to sort them as unique since I have to
> add them up, similar to 'wc -l'
>
> I know this is a lot of questions, but any pointers would be appreciated.
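
On feeding in the month and the domain: a rough sketch, assuming the 
timestamp always looks like the samples above. The helper name wanted_line 
is hypothetical, and the June line in the test data is made up purely to 
show a line being rejected:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $line is a log entry for the given month
# and year whose sender address ends in the given domain.
sub wanted_line {
	my ($line, $month, $year, $domain) = @_;
	# timestamp such as [Fri May 23 11:02:12 SAST 2003]
	return 0 unless $line =~ /^\[\w{3} \Q$month\E\s+\d+ [\d:]+ \w+ \Q$year\E\]/;
	# the domain must sit right before "sent <count>", with or without ">"
	return 0 unless $line =~ /\Q$domain\E>?\s+sent\s+\d+/;
	return 1;
}

my @lines = (
	'[Fri May 23 11:02:12 SAST 2003] SMS  user at domain.co.za sent 43 characters to 271234567891',
	# made-up line, only here to show a non-matching month
	'[Mon Jun  2 09:00:00 SAST 2003] SMS  user at domain.co.za sent 10 characters to 271234567891',
);

my @may = grep { wanted_line($_, 'May', 2003, 'domain.co.za') } @lines;
print scalar(@may), " matching line(s)\n";
```

The \Q...\E pairs keep the dots in the domain from being treated as regex 
wildcards.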

-- 
Nico Coetzee

http://www.itfirms.co.za/
http://za.pm.org/
http://forums.databasejournal.com/

To the systems programmer, users and applications serve only to provide a
test load.