[Chicago-talk] Parsing APA-format citations

Alan Mead amead at alanmead.org
Sun May 6 21:00:07 PDT 2018


Jim,

I was mainly thinking about using this on documents that I or a student
created, to check that the citations (in the text) and the references
(in the reference section) are consistent and correct. I developed the
earlier, fallible script when I was a faculty member, to check
references in student theses. At this point I'm mainly concerned with
catching errors in my own writing. For example, I'm writing a
book-length manual and I'd like to check that every citation has a
matching reference.
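
To make the check concrete, here is a minimal sketch in Python (not the
earlier script). It assumes the citations and the references have already
been reduced to comparable "Authors, year" strings; the function name
cross_check and the key format are hypothetical, chosen only for
illustration.

```python
def cross_check(cited, referenced):
    """Compare citation keys from the text with keys from the reference list.

    Both arguments are iterables of strings such as "Mead & Reed, 2013".
    Normalizing real references (initials, titles, etc.) down to such
    keys is assumed to have happened already and is not shown here.
    """
    cited, referenced = set(cited), set(referenced)
    return {
        # Cited in the text but missing from the reference section.
        "cited_but_not_referenced": sorted(cited - referenced),
        # Listed in the reference section but never cited.
        "referenced_but_not_cited": sorted(referenced - cited),
    }


report = cross_check(
    ["Mead & Reed, 2013", "Mead, 2012"],
    ["Mead & Seed, 2011", "Mead, 2012"],
)
for problem, keys in report.items():
    print(problem, keys)
```

Exact string matching is the simplest possible comparison; a near-miss
such as a wrong year shows up as one entry in each list rather than as a
paired mismatch.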

I guess I would also like to use it on the text of PDFs, some of which
(the "searchable PDFs") are based on OCR'ing scanned documents. But this
isn't my focus currently.

I would describe APA format as a convention for manuscripts so that they
can be reviewed and then typeset in a standard way. It's perhaps an
anachronism now that we use software that can recreate a lot of typeset
effects. The main issue is that in APA format, you cite a work using a
parenthetical string with names and dates, rather than numbers. So "...
been found (Mead, 2013)." rather than "... been found [3]." There are
specific rules for citations and then a long list of templates for
references for different kinds of works (articles with different numbers
of authors, book chapters, books, software, databases, websites, etc.).
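
As a rough illustration of the parenthetical form, here is a minimal
Python sketch that pulls "Authors, year" strings out of parentheses. The
regular expressions and the function name parenthetical_citations are my
own assumptions, not APA rules: it only looks at parenthesised groups
that contain a four-digit year, splits them on semicolons, and repeats
the authors for each extra year. Narrative citations such as "Alpha,
Beta and Gamma (2007)" and "et al." expansion are not handled.

```python
import re

# A parenthesised group containing at least one year such as 2013 or 2013a.
PAREN = re.compile(r"\(([^()]*\b(?:19|20)\d{2}[a-z]?[^()]*)\)")
YEAR = re.compile(r"\b(?:19|20)\d{2}[a-z]?\b")
# Lead-in words to drop; this short list is an assumption, not exhaustive.
PREFIX = re.compile(r"^(?:see|e\.g\.,|cf\.)\s+")


def parenthetical_citations(text):
    """Return 'Authors, year' strings found in parenthetical citations."""
    found = []
    for group in PAREN.findall(text):
        for part in group.split(";"):
            part = " ".join(part.split())  # collapse line breaks and spaces
            part = PREFIX.sub("", part)
            first = YEAR.search(part)
            if first is None:
                continue
            authors = part[:first.start()].rstrip(" ,")
            if not authors:  # e.g. "(2007)" after a narrative citation
                continue
            for year in YEAR.findall(part):
                found.append(f"{authors}, {year}")
    return found


sample = (
    "... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011) and\n"
    "Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &\n"
    "Seed, 2013b)..."
)
for citation in parenthetical_citations(sample):
    print(citation)
```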

-Alan

On 5/6/2018 5:18 PM, Jim Jacobus wrote:
>
> Of course it's well suited for Perl. Isn't everything?
> Are you looking at OCR'd documents or are they electronic in a format
> like PDF, Postscript, Word, etc?
> Doing a quick search I see the "APA format" is a print style
> formalized in 1929. It's a rudimentary typesetting style much like
> what is used for legal appellate briefs so that doesn't mean much if
> you are just dealing with text (that's why the OCR question).
> If these are PDFs there are various ways to extract data using the
> style header in the PDF document.
>
>
> At 04:00 PM 5/3/2018, Alan Mead wrote:
>
>> Is this list still active? I have a task that I think might be
>> well-suited to Perl and I'd appreciate the list's advice.
>>
>> I read and review documents written in APA format and it would sometimes
>> be handy to parse out the citations from the text. Here are a few
>> lines of input:
>>
>> Personality constructs have been found to be predictive of job
>> performance across many studies and occupations (Ones, Viswesveran, &
>> Dilchert, 2005). Recent...
>> Alpha, Beta and Gamma (2007) collected personality data and performance
>> ratings on 142 incumbent sales employees...
>> ... has been used in sales (Alpha et al., 2007)
>> Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
>> Seed, 2013b)...
>> ... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
>> ... conform to currently accepted best practices (see Society for
>> Industrial and Organizational Psychology, 2019)
>>
>> And I'd like this output:
>>
>> Ones, Viswesveran, & Dilchert, 2005
>> Alpha, Beta, & Gamma, 2007
>> Mead & Reed, 2012
>> Mead & Seed, 2013a
>> Mead & Seed, 2013b
>> Mead, 2012
>> Mead, 2013
>> Drwho & Layla, 2011
>> Society for Industrial and Organizational Psychology, 2019
>>
>> The number of authors can vary, there are at least a couple ways to
>> cite, and there is a rule where "et al." should replace the names of
>> second and later authors in the second and subsequent citations.
>> Sometimes the "Oxford comma" is used and other times it's omitted. There
>> are some special rules for other things, like direct quotations and
>> articles with more than eight authors.
>>
>> So, this is messy, but not as messy as a lot of NLP problems. I wrote a
>> script that used regexes to find common patterns but it was fallible. I
>> know publishers have such software because they'll scan your manuscript
>> and parse both the citations and the references (the full "citation"
>> with the author initials, work title, etc.) and highlight errors (e.g.,
>> cited Mead & Reed, 2013 and then referenced Mead & Seed, 2011, the
>> software would highlight the citation and the reference).
>>
>> Any ideas about how best to approach this problem? Work harder on the
>> regexes? Hand-code a bunch of documents and use a machine learning
>> algorithm (which would start, I guess, with named entity recognition)?
>> Pay MTurkers to find these? Does anyone know of a good named-entity
>> recognition library that's Perl-friendly?  The solution doesn't need to
>> be particularly fast. I wouldn't be running it on millions of pages.
>>
>> I suppose someone will suggest I use Python. I really prefer Perl but
>> maybe Python has more ML tools at this point.
>>
>> -Alan
>>
>> -- 
>>
>> Alan D. Mead, Ph.D.
>> President, Talent Algorithms Inc.
>>
>> science + technology = better workers
>>
>> http://www.alanmead.org
>>
>> I've... seen things you people wouldn't believe...
>> functions on fire in a copy of Orion.
>> I watched C-Sharp glitter in the dark near a programmable gate.
>> All those moments will be lost in time, like Ruby... on... Rails...
>> Time for Pi.
>>
>>           --"The Register" user Alister, applying the famous
>>             "Blade Runner" speech to software development
>> _______________________________________________
>> Chicago-talk mailing list
>> Chicago-talk at pm.org
>> http://mail.pm.org/mailman/listinfo/chicago-talk