[Chicago-talk] Parsing APA-format citations

Thu May 3 14:00:35 PDT 2018

Is this list still active? I have a task that I think might be
well-suited to Perl and I'd appreciate the list's advice.

I read and review documents written in APA format and it would sometimes
be handy to parse the citations out of the text. Here are a few lines of
input:

Personality constructs have been found to be predictive of job
performance across many studies and occupations (Ones, Viswesveran, &
Dilchert, 2005). Recent...
Alpha, Beta and Gamma (2007) collected personality data and performance
ratings on 142 incumbent sales employees...
... has been used in sales (Alpha et al., 2007)
Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
Seed, 2013b)...
... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
... conform to currently accepted best practices (see Society for
Industrial and Organizational Psychology, 2019)

And I'd like this output:

Ones, Viswesveran, & Dilchert, 2005
Alpha, Beta, & Gamma, 2007
Mead & Reed, 2012
Mead & Seed, 2013a
Mead & Seed, 2013b
Mead, 2012
Mead, 2013
Drwho & Layla, 2011
Society for Industrial and Organizational Psychology, 2019

The number of authors can vary, there are at least a couple of ways to
cite (in parentheses, or with the authors in the sentence and only the
year in parentheses), and there is a rule where "et al." should replace
the names of second and later authors in the second and subsequent
citations. Sometimes the Oxford comma is used and other times it's
omitted. There are some special rules for other things, like direct
quotations and articles with more than eight authors.

So, this is messy, but not as messy as a lot of NLP problems. I wrote a
script that used regexes to find common patterns but it was fallible. I
know publishers have such software because they'll scan your manuscript,
parse both the citations and the references (the full "citation" with
the author initials, work title, etc.), and highlight errors. For
example, if you cited Mead & Reed, 2013 but the reference list had
Mead & Seed, 2011, the software would highlight both the citation and
the reference.
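
For concreteness, here is a minimal sketch of the regex approach run on
the sample input above. It is written in Python purely for illustration
and is not the script mentioned above. All of the names in it
(extract_citations, PAREN, PIECE, NARRATIVE, and so on) are made up for
this example. It assumes surnames are single capitalized words, that
lead-ins are limited to "see" and "e.g.,", and that an "et al." citation
can be dropped when a fuller citation with the same first author and
year appeared earlier. It does not handle direct quotations, page
numbers, or narrative "et al." citations.

```python
import re

YEAR = r"\d{4}[a-z]?"      # 2013, 2013a, 2013b
NAME = r"[A-Z][\w'-]+"     # assumption: one capitalized word per surname

# Anything inside a single pair of parentheses.
PAREN = re.compile(r"\(([^()]+)\)")
# One citation inside the parentheses: authors, then one or more years.
PIECE = re.compile(rf"^(?P<authors>.+?),\s*(?P<years>{YEAR}(?:,\s*{YEAR})*)$")
# Narrative form, e.g. "Alpha, Beta and Gamma (2007)".
NARRATIVE = re.compile(
    rf"((?:{NAME}(?:,?\s+(?:and|&)\s+|,\s+))*{NAME})\s+\(({YEAR})\)"
)
# Assumption: only these lead-ins are stripped.
LEAD_IN = re.compile(r"^(?:see|e\.g\.,)\s+", re.IGNORECASE)


def join_authors(names):
    """Join surnames in the parenthetical style, with the Oxford comma."""
    if len(names) <= 2:
        return " & ".join(names)
    return ", ".join(names[:-1]) + ", & " + names[-1]


def extract_citations(text):
    found = []  # (position, index within parentheses, authors, year)
    for m in NARRATIVE.finditer(text):
        names = re.findall(NAME, m.group(1))
        found.append((m.start(), 0, join_authors(names), m.group(2)))
    for m in PAREN.finditer(text):
        for i, piece in enumerate(m.group(1).split(";")):
            # Collapse line wraps, then drop a leading "see" or "e.g.,".
            piece = LEAD_IN.sub("", " ".join(piece.split()))
            pm = PIECE.match(piece)
            if pm:
                for year in re.findall(YEAR, pm.group("years")):
                    found.append((m.start(), i, pm.group("authors"), year))
    found.sort(key=lambda f: f[:2])  # restore order of appearance

    results = []  # unique (authors, year) pairs, in order
    for _, _, authors, year in found:
        if authors.endswith("et al."):
            first = authors[: -len("et al.")].strip()
            # Skip "et al." if a fuller form was already seen.
            if any(a.startswith((first + ",", first + " &")) and y == year
                   for a, y in results):
                continue
        if (authors, year) not in results:
            results.append((authors, year))
    return [f"{a}, {y}" for a, y in results]


SAMPLE = """\
Personality constructs have been found to be predictive of job
performance across many studies and occupations (Ones, Viswesveran, &
Dilchert, 2005). Recent...
Alpha, Beta and Gamma (2007) collected personality data and performance
ratings on 142 incumbent sales employees...
... has been used in sales (Alpha et al., 2007)
Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
Seed, 2013b)...
... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
... conform to currently accepted best practices (see Society for
Industrial and Organizational Psychology, 2019)
"""

if __name__ == "__main__":
    for citation in extract_citations(SAMPLE):
        print(citation)
```

On the sample this should reproduce the desired output listed above, but
it is easy to construct text that fools it (any capitalized word before
a parenthesized year, multi-word surnames in the narrative form, group
authors cited in the sentence), which is the fallibility I mean.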

Any ideas about how best to approach this problem? Work harder on the
regexes? Hand-code a bunch of documents and use a machine learning
algorithm (which would start, I guess, with named entity recognition)?
Pay MTurkers to find these? Does anyone know of a good named-entity
recognition library that's Perl-friendly? The solution doesn't need to
be particularly fast. I wouldn't be running it on millions of pages.

I suppose someone will suggest I use Python. I really prefer Perl, but
maybe Python has more ML tools at this point.

-Alan

-- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

http://www.alanmead.org

I've... seen things you people wouldn't believe...
functions on fire in a copy of Orion.
I watched C-Sharp glitter in the dark near a programmable gate.
All those moments will be lost in time, like Ruby... on... Rails... Time for Pi.

          --"The Register" user Alister, applying the famous 
            "Blade Runner" speech to software development