[Chicago-talk] Parsing APA-format citations

Mike Fragassi mikefrag at gmail.com
Sun May 6 19:28:43 PDT 2018


Biblio::Citation::Parser may help you.  First, it seems like it should be
able to parse the references listed in an article's bibliography:

use strict;
use warnings;
use Data::Dumper;   # provides Dumper(), used below
use Biblio::Citation::Parser::Standard;

# example from http://www.citationmachine.net/apa/cite-a-book
# (one reference string; split with '.' only to keep the lines short)
my $ref = 'Gleditsch, N. P., Pinker, S., Thayer, B. A., Levy, J. S., & '
        . 'Thompson, W. R. (2013). The forum: The decline of war. '
        . 'International Studies Review, 15(3), 396-419.';
my $cit_parser = Biblio::Citation::Parser::Standard->new;
my $metadata = $cit_parser->parse($ref);
print Dumper($metadata);

$metadata is a hashref of the values parsed from the string.  It also
includes the rule that the parser used to identify the parts, which looks
like this:

   'match' => '_AUTHORS_ (_YEAR_). _TITLE_. _PUBLICATION_, _VOLUME_(_ISSUE_), _PAGES_',

The 'match' seems to come from Biblio::Citation::Parser::Templates, and you
should be able to modify the templates there to look for things like
'(_AUTHORS_, _YEAR_)', etc.  (Note: I've never used this module myself
before now.)
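
To make the template idea concrete, here is a rough, standalone sketch of
how a template like '(_AUTHORS_, _YEAR_)' could be turned into a regex and
run over body text.  This is NOT how Biblio::Citation::Parser works
internally (I haven't read its source); template_to_regex() and the field
patterns are made up for illustration, and the sample text is adapted from
Alan's examples.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Biblio::Citation::Parser: convert a
# template with _NAME_ placeholders into a regex with named captures.
sub template_to_regex {
    my ($template) = @_;

    # Assumed patterns for the two placeholders used here.
    my %field = (
        AUTHORS => q{[^()\d;]+?},    # names, '&', 'et al.', commas
        YEAR    => q{\d{4}[a-z]?},   # 2013 or 2013a
    );

    my $re = '';
    # The capturing group in split keeps the placeholders in the list.
    for my $part (split /(_[A-Z]+_)/, $template) {
        if ($part =~ /^_([A-Z]+)_$/ && exists $field{$1}) {
            my $name = lc $1;
            $re .= "(?<$name>$field{$1})";
        }
        else {
            # Literal punctuation, and any placeholder we don't know.
            $re .= quotemeta $part;
        }
    }
    return qr/$re/;
}

my $text = 'Personality constructs have been found to be predictive of job '
         . 'performance (Ones, Viswesveran, & Dilchert, 2005). It has been '
         . 'used in sales (Alpha et al., 2007).';

my $re = template_to_regex('(_AUTHORS_, _YEAR_)');

my @found;
while ($text =~ /$re/g) {
    push @found, "$+{authors}, $+{year}";
}
print "$_\n" for @found;
```

On that sample it should print "Ones, Viswesveran, & Dilchert, 2005" and
"Alpha et al., 2007".  It only handles a single citation per set of
parentheses; lists separated by semicolons, "Mead, 2012, 2013", narrative
citations like "Alpha, Beta and Gamma (2007)", and a leading "see" would
each need their own template or some pre-processing.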


On Sun, May 6, 2018 at 7:29 PM, James E Keenan <jkeenan at pobox.com> wrote:

> On 05/03/2018 05:00 PM, Alan Mead wrote:
>
>> Is this list still active? I have a task that I think might be
>> well-suited to Perl and I'd appreciate the list's advice.
>>
>> I read and review documents written in APA format and it would sometimes
>> be handy to parse out the citations from the text. Here are a few
>> lines of input:
>>
>> Personality constructs have been found to be predictive of job
>> performance across many studies and occupations (Ones, Viswesveran, &
>> Dilchert, 2005). Recent...
>> Alpha, Beta and Gamma (2007) collected personality data and performance
>> ratings on 142 incumbent sales employees...
>> ... has been used in sales (Alpha et al., 2007)
>> Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
>> Seed, 2013b)...
>> ... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
>> ... conform to currently accepted best practices (see Society for
>> Industrial and Organizational Psychology, 2019)
>>
>> And I'd like this output:
>>
>> Ones, Viswesveran, & Dilchert, 2005
>> Alpha, Beta, & Gamma, 2007
>> Mead & Reed, 2012
>> Mead & Seed, 2013a
>> Mead & Seed, 2013b
>> Mead, 2012
>> Mead, 2013
>> Drwho & Layla, 2011
>> Society for Industrial and Organizational Psychology, 2019
>>
>> The number of authors can vary, there are at least a couple ways to
>> cite, and there is a rule where "et al." should replace the names of
>> second and later authors in the second and subsequent citations.
>> Sometimes the "Oxford comma" is used and other times it's omitted. There
>> are some special rules for other things, like direct quotations and
>> articles with more than eight authors.
>>
>>
> Is it the case that there are commercial software solutions for this
> problem, but as yet no open-source solutions?
>
> Is this the standard you are trying to meet?
> http://www.apastyle.org/manual/index.aspx
>
>> So, this is messy, but not as messy as a lot of NLP problems. I wrote a
>> script that used regexes to find common patterns but it was fallible. I
>> know publishers have such software because they'll scan your manuscript
>> and parse both the citations and the references (the full "citation"
>> with the author initials, work title, etc.) and highlight errors (e.g.,
>> cited Mead & Reed, 2013 and then referenced Mead & Seed, 2011, the
>> software would highlight the citation and the reference).
>>
>> Any ideas about how best to approach this problem? Work harder on the
>> regexes? Hand code a bunch of documents and use a machine learning
>> algorithm (which would start, I guess, with named entity recognition)?
>> Pay MTurkers to find these? Does anyone know of a good named-entity
>> recognition library that's Perl-friendly?  The solution doesn't need to
>> be particularly fast. I wouldn't be running it on millions of pages.
>>
>> I suppose someone will suggest I use Python. I really prefer Perl but
>> maybe Python has more ML tools at this point.
>>
>> -Alan
>>