[Chicago-talk] Parsing APA-format citations

Sun May 6 17:29:35 PDT 2018

On 05/03/2018 05:00 PM, Alan Mead wrote:
> Is this list still active? I have a task that I think might be
> well-suited to Perl and I'd appreciate the list's advice.
> 
> I read and review documents written in APA format and it would sometimes
> be handy to parse out the citations from the text. Here are a few
> lines of input:
> 
> Personality constructs have been found to be predictive of job
> performance across many studies and occupations (Ones, Viswesveran, &
> Dilchert, 2005). Recent...
> Alpha, Beta and Gamma (2007) collected personality data and performance
> ratings on 142 incumbent sales employees...
> ... has been used in sales (Alpha et al., 2007)
> Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
> Seed, 2013b)...
> ... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
> ... conform to currently accepted best practices (see Society for
> Industrial and Organizational Psychology, 2019)
> 
> And I'd like this output:
> 
> Ones, Viswesveran, & Dilchert, 2005
> Alpha, Beta, & Gamma, 2007
> Mead & Reed, 2012
> Mead & Seed, 2013a
> Mead & Seed, 2013b
> Mead, 2012
> Mead, 2013
> Drwho & Layla, 2011
> Society for Industrial and Organizational Psychology, 2019
> 
> The number of authors can vary, there are at least a couple of ways to
> cite, and there is a rule where "et al." should replace the names of
> second and later authors in the second and subsequent citations.
> Sometimes the Oxford comma is used and other times it's omitted. There
> are some special rules for other things, like direct quotations and
> articles with more than eight authors.
> 

Are there commercial software solutions for this problem, but as yet
no open-source ones?

Is this the standard you are trying to meet?
http://www.apastyle.org/manual/index.aspx

> So, this is messy, but not as messy as a lot of NLP problems. I wrote a
> script that used regexes to find common patterns but it was fallible. I
> know publishers have such software because they'll scan your manuscript
> and parse both the citations and the references (the full "citation"
> with the author initials, work title, etc.) and highlight errors (e.g.,
> if you cited Mead & Reed, 2013 and then referenced Mead & Seed, 2011,
> the software would highlight both the citation and the reference).
> 
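
The cross-check itself looks like the easy half once both lists exist. Here
is a rough sketch in Python, assuming the in-text citations and the reference
entries have both already been reduced to the short "Authors, year" form
shown above. The names key() and cross_check() are made up for illustration,
and parsing full reference entries is not attempted here.

```python
import re

def key(citation):
    """Reduce 'Mead & Reed, 2013' to ('mead', 'reed', '2013')."""
    authors, _, year = citation.rpartition(",")
    # Words of two or more letters; this drops '&' and stray initials.
    surnames = re.findall(r"[A-Za-z][A-Za-z'\-]+", authors)
    return tuple(s.lower() for s in surnames) + (year.strip(),)

def cross_check(cited, referenced):
    """Return (citations with no reference, references never cited)."""
    cited_keys = {key(c): c for c in cited}
    ref_keys = {key(r): r for r in referenced}
    missing = sorted(c for k, c in cited_keys.items() if k not in ref_keys)
    uncited = sorted(r for k, r in ref_keys.items() if k not in cited_keys)
    return missing, uncited

missing, uncited = cross_check(["Mead & Reed, 2013"], ["Mead & Seed, 2011"])
print("cited but not referenced:", missing)
print("referenced but not cited:", uncited)
```

With the Mead & Reed / Mead & Seed example, both entries are flagged. An
"et al." citation would not match its full reference under this scheme, so
those would have to be resolved first.
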
> Any ideas about how best to approach this problem? Work harder on the
> regexes? Hand-code a bunch of documents and use a machine learning
> algorithm (which would start, I guess, with named entity recognition)?
> Pay MTurkers to find these? Does anyone know of a good named-entity
> recognition library that's Perl-friendly?  The solution doesn't need to
> be particularly fast. I wouldn't be running it on millions of pages.
> 
> I suppose someone will suggest I use Python. I really prefer Perl but
> maybe Python has more ML tools at this point.
> 
> -Alan
>
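
On the "work harder on the regexes" option: the parenthetical citations in
your sample can probably be handled in a few small steps rather than one big
pattern. Below is a minimal sketch, in Python only because you mentioned it;
the patterns themselves should carry over to Perl with little change. The
function name and the list of lead-in words ("see", "e.g.,", "cf.") are my
own assumptions. It does not handle narrative citations such as
"Alpha, Beta and Gamma (2007)", does not expand "et al." back to the full
author list, and skips citations that end in a page number.

```python
import re

# One or more years at the end of a chunk: "2012", "2013a", "2012, 2013".
YEARS = re.compile(r",\s*(\d{4}[a-z]?(?:\s*,\s*\d{4}[a-z]?)*)$")
# Words that may precede the authors inside the parentheses (assumed list).
LEAD_IN = re.compile(r"^(?:see also|see|e\.g\.,|cf\.)\s+", re.IGNORECASE)

def parenthetical_citations(text):
    """Return 'Authors, year' strings found inside parentheses."""
    text = " ".join(text.split())  # undo hard line wraps
    found = []
    # Only parenthesized groups that end in a year are considered.
    for group in re.findall(r"\(([^()]*\d{4}[a-z]?)\)", text):
        for chunk in group.split(";"):
            chunk = LEAD_IN.sub("", chunk.strip())
            m = YEARS.search(chunk)
            if not m:
                continue  # e.g. a bare "(2007)" from a narrative citation
            authors = chunk[:m.start()].strip()
            if not authors:
                continue
            # "Mead, 2012, 2013" becomes two entries for the same authors.
            for year in re.split(r"\s*,\s*", m.group(1)):
                found.append(f"{authors}, {year}")
    return found

sample = """... across many studies and occupations (Ones, Viswesveran, &
Dilchert, 2005). Alpha, Beta and Gamma (2007) collected ...
... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
... best practices (see Society for
Industrial and Organizational Psychology, 2019)"""

for citation in parenthetical_citations(sample):
    print(citation)
```

Splitting on ";" first and peeling the years off the end keeps each regex
short, which may make the failures easier to diagnose than with a single
large pattern.
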