<div dir="ltr"><div>Biblio::Citation::Parser may help you.  First, it seems like it should be able to parse the references listed in an article's bibliography:</div><div><div><br></div><div>use Biblio::Citation::Parser::Standard;</div><div>use Data::Dumper;</div></div><div>my $ref = 'Gleditsch, N. P., Pinker, S., Thayer, B. A., Levy, J. S., & Thompson, W. R. (2013). The forum: The decline of war. International Studies Review, 15(3), 396-419.'; <br></div><div># example from <a href="http://www.citationmachine.net/apa/cite-a-book">http://www.citationmachine.net/apa/cite-a-book</a></div><div><div>my $cit_parser = Biblio::Citation::Parser::Standard->new;</div><div>my $metadata = $cit_parser->parse($ref);</div><div>print Dumper($metadata);</div></div><div><br></div><div>The $metadata is a hashref of the values parsed from the string. It includes the rule that the parser used to identify the parts, which looks like this:</div><div><br></div><div>   'match' => '_AUTHORS_ (_YEAR_). _TITLE_. _PUBLICATION_, _VOLUME_(_ISSUE_), _PAGES_',<br></div><div><br></div><div>The 'match' value seems to come from Biblio::Citation::Parser::Templates, and you should be able to modify this to look for things like '(_AUTHORS_, _YEAR_)' etc.  (Note: I've never used this module myself before now.) </div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, May 6, 2018 at 7:29 PM, James E Keenan <span dir="ltr"><<a href="mailto:jkeenan@pobox.com" target="_blank">jkeenan@pobox.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 05/03/2018 05:00 PM, Alan Mead wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Is this list still active? I have a task that I think might be<br>

well-suited to Perl and I'd appreciate the list's advice.<br>

<br>

I read and review documents written in APA format and it would sometimes<br>

be handy to parse out the citations from the text. Here are a few<br>

lines of input:<br>

<br>

Personality constructs have been found to be predictive of job<br>

performance across many studies and occupations (Ones, Viswesveran, &<br>

Dilchert, 2005). Recent...<br>

Alpha, Beta and Gamma (2007) collected personality data and performance<br>

ratings on 142 incumbent sales employees...<br>

... has been used in sales (Alpha et al., 2007)<br>

Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &<br>

Seed, 2013b)...<br>

... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)<br>

... conform to currently accepted best practices (see Society for<br>

Industrial and Organizational Psychology, 2019)<br>

<br>

And I'd like this output:<br>

<br>

Ones, Viswesveran, & Dilchert, 2005<br>

Alpha, Beta, & Gamma, 2007<br>

Mead & Reed, 2012<br>

Mead & Seed, 2013a<br>

Mead & Seed, 2013b<br>

Mead, 2012<br>

Mead, 2013<br>

Drwho & Layla, 2011<br>

Society for Industrial and Organizational Psychology, 2019<br>

<br>

The number of authors can vary, there are at least a couple ways to<br>

cite, and there is a rule where "et al." should replace the names of<br>

second and later authors in the second and subsequent citations.<br>

Sometimes the "Oxford comma" is used and other times it's omitted. There<br>

are some special rules for other things, like direct quotations and<br>

articles with more than eight authors.<br>

<br>

</blockquote>

<br></div></div>

Is it the case that there are commercial software solutions for this problem, but as yet no open-source solutions?<br>

<br>

Is this the standard you are trying to meet?<br>

<a href="http://www.apastyle.org/manual/index.aspx" rel="noreferrer" target="_blank">http://www.apastyle.org/manual/index.aspx</a><span class="im HOEnZb"><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

So, this is messy, but not as messy as a lot of NLP problems. I wrote a<br>

script that used regexes to find common patterns but it was fallible. I<br>

know publishers have such software because they'll scan your manuscript<br>

and parse both the citations and the references (the full "citation"<br>

with the author initials, work title, etc.) and highlight errors (e.g.,<br>

cited Mead & Reed, 2013 and then referenced Mead & Seed, 2011, the<br>

software would highlight the citation and the reference).<br>

<br>

Any ideas about how best to approach this problem? Work harder on the<br>

regexes? Hand-code a bunch of documents and use a machine learning<br>

algorithm (which would start, I guess, with named entity recognition)?<br>

Pay MTurkers to find these? Does anyone know of a good named-entity<br>

recognition library that's Perl-friendly?  The solution doesn't need to<br>

be particularly fast. I wouldn't be running it on millions of pages.<br>

<br>

I suppose someone will suggest I use Python. I really prefer Perl but<br>

maybe Python has more ML tools at this point.<br>

<br>

-Alan<br>

<br>

</blockquote></span><div class="HOEnZb"><div class="h5">

______________________________<wbr>_________________<br>

Chicago-talk mailing list<br>

<a href="mailto:Chicago-talk@pm.org" target="_blank">Chicago-talk@pm.org</a><br>

<a href="http://mail.pm.org/mailman/listinfo/chicago-talk" rel="noreferrer" target="_blank">http://mail.pm.org/mailman/lis<wbr>tinfo/chicago-talk</a><br>

</div></div></blockquote></div><br></div>
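<div>For the in-text citations in Alan's sample, here is a minimal regex sketch in plain Perl. It is only an illustration and does not use Biblio::Citation::Parser: the helper name extract_citations is hypothetical, and it assumes four-digit years starting with 19 or 20. It handles parenthetical citations only, including semicolon-separated groups, repeated years, and a leading "see"; narrative citations such as "Alpha, Beta and Gamma (2007)" are not covered.</div>

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "Authors, Year" string for each
# parenthetical APA-style citation found in the text.
sub extract_citations {
    my ($text) = @_;
    my $year = qr/(?:19|20)\d\d[a-z]?/;    # e.g. 2013 or 2013a (assumed range)
    (my $flat = $text) =~ s/\s+/ /g;       # join citations wrapped across lines
    my @found;

    # Look at every parenthesized group that ends in a year.
    while ($flat =~ /\(([^()]*\b$year)\)/g) {
        # Semicolons separate distinct works inside one pair of parentheses.
        for my $part (split /\s*;\s*/, $1) {
            $part =~ s/^\s*(?:see|cf\.|e\.g\.,)\s+//i;    # drop lead-in words
            # Authors, then one or more comma-separated years.
            next unless $part =~ /^(.+?),\s*($year(?:\s*,\s*$year)*)$/;
            my ($authors, $years) = ($1, $2);
            push @found, "$authors, $_" for split /\s*,\s*/, $years;
        }
    }
    return @found;
}

my $sample = '... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011) '
           . '... best practices (see Society for Industrial and '
           . 'Organizational Psychology, 2019)';
print "$_\n" for extract_citations($sample);
```

<div>A parenthesized group that does not end in a year, such as a sample size, is skipped, and so is a bare "(2007)". Catching narrative citations would need some way to decide where the author names begin, which is where named-entity recognition or hand-tuned rules would come in.</div>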