Name parsing with PL/Perl

I have been working on a database cleanup task which required normalizing names of people into first and last names. Unfortunately, this is legacy data from many different databases where a single field is used for the entire name. The names appear in every conceivable order and variation, with lots of extraneous words and characters thrown in for extra fun.

Certainly I was not the first person faced with this task, so I searched for existing tools rather than start from scratch. I found commercial software to do this "starting at" $10,000. I hunted around a bit more and came across this Perl module for parsing names.

I have never written a single line of Perl in my life, but I thought I would give it a try. I was able to get the module installed via CPAN on OS X without too much trouble.

Next, I wanted to get it running in PostgreSQL so I could call the name parser in the database. I recompiled PostgreSQL using the --with-perl option and then installed PL/Perl in my database using the CREATE LANGUAGE command. I created a PL/Perl function based on examples from the module and was pleasantly surprised when everything worked on the first try. Here is the function I ended up with:

create type person_name as (
    first_name text,
    last_name text,
    suffix text
);


create or replace function parse_name(pname text)
returns person_name as $perl$
    my $name = $_SHARED{name_parser};
    my $first = '';
    my $last = '';

    if(!$name) {
        use Lingua::EN::NameParse qw(clean case_surname);
    
        # optional configuration arguments
        my %args =
        (
            # salutation      => 'Dear',
            # sal_default     => 'Friend',
            auto_clean      => 1,
            force_case      => 1,
            lc_prefix       => 1,
            initials        => 2,
            allow_reversed  => 0,
            joint_names     => 0,
            extended_titles => 0
        );

        $name = new Lingua::EN::NameParse(%args);
        $_SHARED{name_parser} = $name;
    }

    $error = $name->parse($_[0]);
    if ($name->{properties}{type} eq 'unknown') {
        return undef;
    } else {
        %comps = $name->case_components;
        
        $first = $comps{given_name_1};
        if ($first && $comps{initials_1}) {$first = $first . ' ';}
        $first = $first . $comps{initials_1};
        if ($first && $comps{initials_2}) {$first = $first . ' ';}
        $first = $first . $comps{initials_2};

        $last = $comps{surname_1};
        if ($last && $comps{surname_2}) {$last = $last . ' ';}
        $last = $last . $comps{surname_2};
        
        return {
            first_name => $first,
            last_name => $last,
            suffix => $comps{suffix}
        };
    }
end;
$perl$ language plperlu;

Unfortunately, the parser could not handle a lot of different cases in my data set, so in the end I gave up and wrote my own. But this was a good exercise for learning a bit of Perl and testing other languages inside of PostgreSQL. PL/Perl support for pgEdit may be in the works.