Name parsing with PL/Perl
I have been working on a database cleanup task which required normalizing names of people into first and last names. Unfortunately, this is legacy data from many different databases where a single field is used for the entire name. The names appear in every conceivable order and variation, with lots of extraneous words and characters thrown in for extra fun.
Certainly I was not the first person faced with this task, so I searched for existing tools rather than start from scratch. I found commercial software to do this "starting at" $10,000. I hunted around a bit more and came across this Perl module for parsing names.
I have never written a single line of Perl in my life, but I thought I would give it a try. I was able to get the module installed via CPAN on OS X without too much trouble.
Next, I wanted to get it running in PostgreSQL so I could call the name parser in the database. I recompiled PostgreSQL using the --with-perl option and then installed PL/Perl in my database using the CREATE LANGUAGE command. I created a PL/Perl function based on examples from the module and was pleasantly surprised when everything worked on the first try. Here is the function I ended up with:
create type person_name as ( first_name text, last_name text, suffix text ); create or replace function parse_name(pname text) returns person_name as $perl$ my $name = $_SHARED{name_parser}; my $first = ''; my $last = ''; if(!$name) { use Lingua::EN::NameParse qw(clean case_surname); # optional configuration arguments my %args = ( # salutation => 'Dear', # sal_default => 'Friend', auto_clean => 1, force_case => 1, lc_prefix => 1, initials => 2, allow_reversed => 0, joint_names => 0, extended_titles => 0 ); $name = new Lingua::EN::NameParse(%args); $_SHARED{name_parser} = $name; } $error = $name->parse($_[0]); if ($name->{properties}{type} eq 'unknown') { return undef; } else { %comps = $name->case_components; $first = $comps{given_name_1}; if ($first && $comps{initials_1}) {$first = $first . ' ';} $first = $first . $comps{initials_1}; if ($first && $comps{initials_2}) {$first = $first . ' ';} $first = $first . $comps{initials_2}; $last = $comps{surname_1}; if ($last && $comps{surname_2}) {$last = $last . ' ';} $last = $last . $comps{surname_2}; return { first_name => $first, last_name => $last, suffix => $comps{suffix} }; } end; $perl$ language plperlu;
Unfortunately, the parser could not handle a lot of different cases in my data set, so in the end I gave up and wrote my own. But this was a good exercise for learning a bit of Perl and testing other languages inside of PostgreSQL. PL/Perl support for pgEdit may be in the works.