[Chicago-talk] Parsing APA-format citations

Sun May 6 15:18:10 PDT 2018

Of course it's well suited for Perl. Isn't everything?
Are you looking at OCR'd documents, or are they electronic in a format
like PDF, PostScript, Word, etc.?

Doing a quick search, I see that "APA format" is a print style formalized
in 1929. It's a rudimentary typesetting style, much like what is used for
legal appellate briefs, so it doesn't mean much if you are just dealing
with text (that's why the OCR question).

If these are PDFs, there are various ways to extract data using the style
header in the PDF document.
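
If it turns out to be plain text, here is a rough sketch of the kind of regex pass Alan describes below, limited to parenthetical citations. It is in Python for illustration; the patterns should carry over to Perl nearly unchanged. The function name and the patterns are my own assumptions, not any publisher's tool. It ignores narrative citations such as "Alpha, Beta and Gamma (2007)", page-number citations for direct quotations, and resolving "et al." back to the full author list.

```python
import re

# Parenthesized span that ends in a year, e.g. "(Mead & Reed, 2012; ...)".
PAREN = re.compile(r"\(([^()]*\b(?:19|20)\d{2}[a-z]?)\)")
# Trailing ", 2012" or ", 2013a" at the end of one citation.
YEAR_TAIL = re.compile(r",\s*((?:19|20)\d{2}[a-z]?)$")
# Lead-in words such as "see" that are not part of the author list.
PREFIX = re.compile(r"^(?:see|cf\.|e\.g\.,?)\s+", re.IGNORECASE)


def parenthetical_citations(text):
    """Return 'Authors, year' strings for parenthetical citations only.

    Hypothetical helper: a sketch, not a complete APA parser.
    """
    text = " ".join(text.split())  # undo hard line wraps
    found = []
    for group in PAREN.findall(text):
        for part in group.split(";"):
            part = PREFIX.sub("", part.strip())
            years = []
            # Peel years off the right: "Mead, 2012, 2013" -> ["2012", "2013"].
            while True:
                m = YEAR_TAIL.search(part)
                if not m:
                    break
                years.insert(0, m.group(1))
                part = part[:m.start()]
            for year in years:
                entry = f"{part}, {year}"
                if entry not in found:  # keep first occurrence only
                    found.append(entry)
    return found
```

For example, `parenthetical_citations("... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)")` returns `['Mead, 2012', 'Mead, 2013', 'Drwho & Layla, 2011']`. A bare "(2007)" after author names is skipped on purpose, since pairing it with the preceding names is where the regex approach starts to get fragile.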

At 04:00 PM 5/3/2018, Alan Mead wrote:

>Is this list still active? I have a task that I think might be
>well-suited to Perl and I'd appreciate the list's advice.
>
>I read and review documents written in APA format and it would sometimes
>be handy to parse out the citations from the text. Here are a few
>lines of input:
>
>Personality constructs have been found to be predictive of job
>performance across many studies and occupations (Ones, Viswesveran, &
>Dilchert, 2005). Recent...
>Alpha, Beta and Gamma (2007) collected personality data and performance
>ratings on 142 incumbent sales employees...
>... has been used in sales (Alpha et al., 2007)
>Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
>Seed, 2013b)...
>... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
>... conform to currently accepted best practices (see Society for
>Industrial and Organizational Psychology, 2019)
>
>And I'd like this output:
>
>Ones, Viswesveran, & Dilchert, 2005
>Alpha, Beta, & Gamma, 2007
>Mead & Reed, 2012
>Mead & Seed, 2013a
>Mead & Seed, 2013b
>Mead, 2012
>Mead, 2013
>Drwho & Layla, 2011
>Society for Industrial and Organizational Psychology, 2019
>
>The number of authors can vary, there are at least a couple ways to
>cite, and there is a rule where "et al." should replace the names of
>second and later authors in the second and subsequent citations.
>Sometimes the "Oxford comma" is used and other times it's omitted. There
>are some special rules for other things, like direct quotations and
>articles with more than eight authors.
>
>So, this is messy, but not as messy as a lot of NLP problems. I wrote a
>script that used regexes to find common patterns but it was fallible. I
>know publishers have such software because they'll scan your manuscript
>and parse both the citations and the references (the full "citation"
>with the author initials, work title, etc.) and highlight errors (e.g.,
>if you cited Mead & Reed, 2013 and then referenced Mead & Seed, 2011, the
>software would highlight the citation and the reference).
>
>Any ideas about how best to approach this problem? Work harder on the
>regexes? Hand code a bunch of documents and use a machine learning
>algorithm (which would start, I guess, with named entity recognition)?
>Pay MTurkers to find these? Does anyone know of a good named-entity
>recognition library that's Perl-friendly? The solution doesn't need to
>be particularly fast. I wouldn't be running it on millions of pages.
>
>I suppose someone will suggest I use Python. I really prefer Perl but
>maybe Python has more ML tools at this point.
>
>-Alan
>
>--
>
>Alan D. Mead, Ph.D.
>President, Talent Algorithms Inc.
>
>science + technology = better workers
>
>http://www.alanmead.org
>
>I've... seen things you people wouldn't believe...
>functions on fire in a copy of Orion.
>I watched C-Sharp glitter in the dark near a programmable gate.
>All those moments will be lost in time, like Ruby... on... Rails...
>Time for Pi.
>
>           --"The Register" user Alister, applying the famous
>             "Blade Runner" speech to software development