Matching two lists of users
Scott Penrose
scottp at dd.com.au
Tue Jun 17 18:37:06 CDT 2003
I am still looking into these modules, and thanks heaps for all the
replies I have had.
But it appears there is nothing that thinks about users in terms of
unique identifiers and names.
Here is a more complex example of some matching I want to do, and I
think the right approach is a set of specific name-based rules.
What I know about some users:

    Student ID: 123
    Full Name:  Scott Dustin Penrose
    Login:      scottp

    Student ID: 666
    Full Name:  Brooke Penrose
    Login:      brookep

    Student ID: 333
    Full Name:  Steven Trevor Platypus
    Login:      steve

    Student ID: 444
    Full Name:  John Smithson
    Login:      123

Here is a list of inputs and what should match...

    123                      Fails, there are two 123 unique identifiers above
    sp                       Fails, there are two SP above
    sdp                      Found
    Scott Penrose            Found
    Penrose, Scott           Found
    Penrose, SD              Found
    Penrose                  Fails, there are two
    Penrose, S               Found
    John Smithson            Found
    Steven Platypus          Found
    Steven T Platypus        Found
    Steven Trevor Platypus   Found

The trick is how to split up the words and rejoin them. For example,
a person can have a two-word surname, in which case the following
would be wrong.
Smith, David Ashton
it should be
Ashton Smith, David
There is no way for me to know that, so likely I would accept any of
the following...
David Ashton Smith
Smith, David Ashton
Ashton Smith, David
David Smith
Smith, David
Smith, D
D Smith
D A Smith
Smith, D A
David A Smith
DS
DAS
and there are probably more.
Of course, if we have a David Smith as well as a David Ashton Smith,
then many of the above would fail.
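
To make the splitting and rejoining concrete, here is a rough sketch of
how the accepted forms could be generated from one full name. The names
normalise and name_keys are made up for this example, and it assumes the
last word is the surname and that a bare surname is also a valid key
(as in the "Penrose" case above). Treat it as an illustration, not as
finished code.

```perl
use strict;
use warnings;

# Normalise an incoming string: lower-case it, turn "Surname, Given"
# into "given surname", and collapse runs of whitespace.
sub normalise {
    my ($text) = @_;
    $text = lc $text;
    if ($text =~ /^([^,]*),(.*)$/) {
        $text = "$2 $1";
    }
    $text =~ s/^\s+|\s+$//g;
    $text =~ s/\s+/ /g;
    return $text;
}

# Return every key accepted for one full name (hypothetical helper).
# Assumes the last word is the surname. A two-word surname written as
# "Ashton Smith, David" still normalises to the full-name key.
sub name_keys {
    my ($full_name) = @_;
    my @words = split ' ', normalise($full_name);
    return @words if @words < 2;

    my $surname  = pop @words;
    my @given    = @words;
    my @initials = map { substr($_, 0, 1) } @given;
    my $s_init   = substr($surname, 0, 1);

    my %seen;
    $seen{$_} = 1 for (
        join(' ', @given, $surname),                    # david ashton smith
        join(' ', $given[0], $surname),                 # david smith
        join(' ', @initials, $surname),                 # d a smith
        join('', @initials) . " $surname",              # da smith
        join(' ', $initials[0], $surname),              # d smith
        join(' ', $given[0], @initials[1 .. $#initials], $surname), # david a smith
        join('', @initials, $s_init),                   # all initials
        $initials[0] . $s_init,                         # first + surname initial
        $surname,                                       # surname alone
    );
    return sort keys %seen;
}
```

An incoming string is then run through normalise() and compared against
these keys, so "Smith, D A" and "D A Smith" end up as the same key.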
My current code has two hashes along the lines of...
    if (exists($match{"DAS"}) && ($count{"DAS"} == 1)) {
        # MATCH !!!
    }
That way, when building the match hash, we increment the counter for
each key, so any key seen more than once is known to be ambiguous and
is never treated as a match.
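
Roughly, building and checking the two hashes could look like the
sketch below. The %match and %count names are the ones from my code;
add_key, lookup and the idea of identifying each user by student ID are
only assumptions for the example, and the keys are typed in by hand
rather than generated.

```perl
use strict;
use warnings;

my (%match, %count);

# Record one key for a user (hypothetical helper). The counter goes up
# each time a different user claims a key that is already present.
sub add_key {
    my ($key, $user_id) = @_;
    $key = lc $key;
    # Ignore the same user registering the same key twice in a row.
    return if exists $match{$key} && $match{$key} eq $user_id;
    $count{$key}++;
    $match{$key} = $user_id;
    return;
}

# Return the user for a key, or undef if it is unknown or ambiguous.
sub lookup {
    my ($key) = @_;
    $key = lc $key;
    return undef unless exists $match{$key} && $count{$key} == 1;
    return $match{$key};
}

# A few hand-written keys for the four users listed earlier.
add_key($_, '123') for qw(123 scottp sdp sp penrose);
add_key($_, '666') for qw(666 brookep bp penrose);
add_key($_, '333') for qw(333 steve stp sp);
add_key($_, '444') for qw(444 123 js);
```

With that data, "sdp" finds student 123, while "sp", "penrose" and
"123" all come back undefined because two users share each of them.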
I guess to make it complete, I would have to provide a config to say
how to treat each incoming field...
For example
'Full_Name' => 'words',
'Student_ID' => 'unique',
'Login' => 'unique',
That way the code knows whether to split a field into words (and
potentially reorder them), or just to match the string as it is.
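
One way such a config might drive the key building is a small dispatch
table, sketched below. The field names and the 'words' / 'unique' types
are from the example above; %handler and record_keys are invented for
the sketch, and the 'words' handler is deliberately cut down to the
full name plus initials.

```perl
use strict;
use warnings;

my %config = (
    'Full_Name'  => 'words',
    'Student_ID' => 'unique',
    'Login'      => 'unique',
);

# One routine per field type (hypothetical). 'unique' matches the string
# as is; 'words' splits it up, here only into full name and initials.
my %handler = (
    unique => sub { return (lc $_[0]) },
    words  => sub {
        my @w = split ' ', lc $_[0];
        return (join(' ', @w), join('', map { substr($_, 0, 1) } @w));
    },
);

# Return all keys for one record, field by field in sorted order.
sub record_keys {
    my ($record) = @_;
    my @keys;
    for my $field (sort keys %$record) {
        my $type = $config{$field} or next;   # skip fields with no rule
        push @keys, $handler{$type}->($record->{$field});
    }
    return @keys;
}

my @keys = record_keys({
    'Student_ID' => '123',
    'Full_Name'  => 'Scott Dustin Penrose',
    'Login'      => 'scottp',
});
```

Adding a new kind of field would then only mean adding one more entry
to the handler table.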
I think that approximate matches in this case would not be beneficial
and potentially add quite a bit of confusion around matching the
students listed.
Scott