Matching two lists of users

Scott Penrose scottp at dd.com.au
Tue Jun 17 18:37:06 CDT 2003


I am still looking into these modules, and thanks heaps for all the 
replies I have had.

But it appears that none of them thinks about users in terms of 
unique identifiers and names.

Here is a more complex example of the matching I want to do. I think 
the right approach is a set of specific name-based rules.

What I know about some users:

	Student ID: 	123
	Full Name:	Scott Dustin Penrose
	Login:		scottp

	Student ID:	666
	Full Name:	Brooke Penrose
	Login:		brookep

	Student ID:	333
	Full Name:	Steven Trevor Platypus
	Login:		steve

	Student ID:	444
	Full Name:	John Smithson
	Login:		123

Here is a list and what should match...

123                     Fails, there are two 123 unique identifiers above
sp                      Fails, there are two SP above
sdp                     Found
Scott Penrose           Found
Penrose, Scott          Found
Penrose, SD             Found
Penrose                 Fails, there are two Penroses above
Penrose, S              Found
John Smithson           Found
Steven Platypus         Found
Steven T Platypus       Found
Steven Trevor Platypus  Found

The trick is in how to split up the words and rejoin them. For 
example, a person can have a two-word surname, in which case the 
following would be wrong.

Smith, David Ashton

it should be

Ashton Smith, David

There is no way for me to know that, so I would likely accept any of 
the following...

	David Ashton Smith
	Smith, David Ashton
	Ashton Smith, David
	David Smith
	Smith, David
	Smith, D
	D Smith
	D A Smith
	Smith, D A
	David A Smith
	DS
	DAS

and there are probably more.
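
A minimal sketch of how such variants could be generated from one full 
name. The helper names normalise and name_keys are hypothetical, and the 
rules are assumptions: names are stored given-name first, the surname is 
either the last word or the last two words, and keys are compared in 
lower case.

```perl
use strict;
use warnings;

# Normalise an incoming string: lower-case, trim, collapse whitespace,
# and make the spacing around commas consistent.
sub normalise {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/^\s+|\s+$//g;
    $s =~ s/\s+/ /g;
    $s =~ s/\s*,\s*/, /g;
    return $s;
}

# Hypothetical helper: every key we would accept for one full name.
sub name_keys {
    my ($full_name) = @_;
    my @w = split ' ', normalise($full_name);
    return () if @w < 2;

    my %keys;
    my @i = map { substr($_, 0, 1) } @w;
    $keys{ join '', @i }    = 1;    # all initials, e.g. "das"
    $keys{ $i[0] . $i[-1] } = 1;    # first and last initials, e.g. "ds"

    # Try a one-word surname, and a two-word surname when there are 3+ words.
    for my $n (1 .. (@w > 2 ? 2 : 1)) {
        my @given   = @w[0 .. $#w - $n];
        my $surname = join ' ', @w[$#w - $n + 1 .. $#w];
        my @gi      = map { substr($_, 0, 1) } @given;
        $keys{$surname} = 1;        # surname on its own

        my %given_forms = map { $_ => 1 } (
            "@given",                               # david ashton
            $given[0],                              # david
            "@gi",                                  # d a
            join('', @gi),                          # da
            $gi[0],                                 # d
            join(' ', $given[0], @gi[1 .. $#gi]),   # david a
        );
        for my $g (keys %given_forms) {
            $keys{"$g $surname"}  = 1;    # given names first
            $keys{"$surname, $g"} = 1;    # surname first
        }
    }
    return sort keys %keys;
}

print "$_\n" for name_keys('David Ashton Smith');
```

An incoming string would be passed through normalise before being looked 
up, so "Smith,  D A" and "smith, d a" end up as the same key.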

Of course if we have a David Smith as well as a David Ashton Smith then 
many of the above would fail.

My current code has two hashes along the lines of...

	if (exists($match{"DAS"}) && ($count{"DAS"} == 1)) {
		# MATCH !!! (the key exists and only one user produced it)
	}

That way, when building the match hash, we increment the counter for 
each key. Any key produced by more than one user ends up with a count 
above 1, which marks it as ambiguous and therefore unusable for matching.
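
As a rough sketch of the two-hash idea, using the four users listed 
earlier. The %keys_for table is a hand-picked subset of keys for 
illustration only, and lookup is a hypothetical helper name; users are 
identified here by Student ID.

```perl
use strict;
use warnings;

# Keys each user can be found by, indexed by Student ID
# (a hand-picked subset, for illustration only).
my %keys_for = (
    123 => [ '123', 'scottp',  'sp', 'sdp', 'penrose',  'scott penrose' ],
    666 => [ '666', 'brookep', 'bp',        'penrose',  'brooke penrose' ],
    333 => [ '333', 'steve',   'sp', 'stp', 'platypus', 'steven platypus' ],
    444 => [ '444', '123',     'js',        'smithson', 'john smithson' ],
);

# %match maps a key to a user, %count records how many users share the key.
my (%match, %count);
for my $id (sort keys %keys_for) {
    my %seen;    # do not count the same key twice for one user
    for my $key (grep { !$seen{$_}++ } @{ $keys_for{$id} }) {
        $match{$key} = $id;
        $count{$key}++;
    }
}

# Return the Student ID for a key, or nothing if it is unknown or ambiguous.
sub lookup {
    my ($key) = @_;
    $key = lc $key;
    return unless exists $match{$key} && $count{$key} == 1;
    return $match{$key};
}

for my $try ('123', 'sp', 'sdp', 'Penrose', 'John Smithson') {
    my $id = lookup($try);
    printf "%-15s %s\n", $try, defined $id ? "Found ($id)" : 'Fails';
}
```

Keeping the count separate from the match means an ambiguous key can be 
reported as "more than one user" rather than simply "not found".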

I guess to make it complete, I would have to provide a config to say 
how to treat each incoming field...

For example

	'Full_Name'		=>	'words',
	'Student_ID'		=> 	'unique',
	'Login'			=>	'unique',

The code would then know whether to split a field into words and try 
the possible reorderings, or just to match the string as it stands.
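
One possible shape for this, sketched with a dispatch table. The names 
%field_type, %keys_by_type and keys_for_user are hypothetical, and the 
'words' rule here is deliberately tiny (the full name plus its initials) 
to keep the example short.

```perl
use strict;
use warnings;

# How to treat each incoming field (the config described above).
my %field_type = (
    Full_Name  => 'words',
    Student_ID => 'unique',
    Login      => 'unique',
);

# One key generator per field type.
my %keys_by_type = (
    # 'unique': match the string as it stands (lower-cased).
    unique => sub {
        my ($value) = @_;
        return (lc $value);
    },
    # 'words': split into words; here only the full name and the initials.
    words => sub {
        my ($value) = @_;
        my @w = split ' ', lc $value;
        return ("@w", join('', map { substr($_, 0, 1) } @w));
    },
);

# Collect every key for one user record, according to the config.
sub keys_for_user {
    my ($user) = @_;
    my %keys;
    for my $field (sort keys %$user) {
        my $type = $field_type{$field} or next;    # skip unconfigured fields
        $keys{$_} = 1 for $keys_by_type{$type}->($user->{$field});
    }
    return sort keys %keys;
}

print "$_\n" for keys_for_user({
    Student_ID => 123,
    Full_Name  => 'Scott Dustin Penrose',
    Login      => 'scottp',
});
```

Adding a new kind of field then only means adding one entry to each 
table, without touching the matching code.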

I think that approximate matches would not be beneficial in this case, 
and could add quite a bit of confusion about which of the listed 
students was actually matched.

Scott




More information about the Melbourne-pm mailing list