From amead at alanmead.org  Thu May  3 14:00:35 2018
From: amead at alanmead.org (Alan Mead)
Date: Thu, 3 May 2018 16:00:35 -0500
Subject: [Chicago-talk] Parsing APA-format citations
Message-ID: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>

Is this list still active? I have a task that I think might be
well-suited to Perl and I'd appreciate the list's advice.

I read and review documents written in APA format and it would sometimes
be handy to parse out the citations from the text. Here are a few
lines of input:

Personality constructs have been found to be predictive of job
performance across many studies and occupations (Ones, Viswesveran, &
Dilchert, 2005). Recent...
Alpha, Beta and Gamma (2007) collected personality data and performance
ratings on 142 incumbent sales employees...
... has been used in sales (Alpha et al., 2007)
Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
Seed, 2013b)...
... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
... conform to currently accepted best practices (see Society for
Industrial and Organizational Psychology, 2019)

And I'd like this output:

Ones, Viswesveran, & Dilchert, 2005
Alpha, Beta, & Gamma, 2007
Mead & Reed, 2012
Mead & Seed, 2013a
Mead & Seed, 2013b
Mead, 2012
Mead, 2013
Drwho & Layla, 2011
Society for Industrial and Organizational Psychology, 2019

The number of authors can vary, there are at least a couple of ways to
cite, and there is a rule that "et al." should replace the names of the
second and later authors in the second and subsequent citations.
Sometimes the "Oxford comma" is used and other times it's omitted. There
are some special rules for other things, like direct quotations and
articles with more than eight authors.
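
As a rough illustration of the parenthetical case described above, here is a minimal Perl sketch. The helper name extract_parenthetical is hypothetical, and the sketch assumes that a citation group is any parenthesised text ending in a four-digit year (19xx or 20xx, optionally followed by a letter). It does not expand "et al." back to the full author list and it skips narrative citations such as "Alpha, Beta and Gamma (2007)".

```perl
use strict;
use warnings;

# Hypothetical helper: return "Authors, Year" strings for every
# parenthetical citation found in $text.
sub extract_parenthetical {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # undo hard line wraps inside citations
    my @found;

    # A candidate group is parenthesised text that ends in a year.
    while ($text =~ /\(([^()]*?\b(?:19|20)\d\d[a-z]?)\)/g) {
        my $group = $1;
        $group =~ s/^(?:see|e\.g\.,|cf\.)\s+//i;    # drop lead-in words

        # Several works can share one pair of parentheses, separated by ";".
        for my $cite (split /\s*;\s*/, $group) {
            # "Mead, 2012, 2013" -> authors "Mead", years "2012, 2013".
            # A bare "(2007)" has no comma, fails here, and is skipped.
            my ($authors, $years) =
                $cite =~ /^(.*?),\s*((?:(?:19|20)\d\d[a-z]?(?:,\s*)?)+)$/
                or next;
            push @found, "$authors, $_" for split /\s*,\s*/, $years;
        }
    }
    return @found;
}

my $sample = '... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)';
print "$_\n" for extract_parenthetical($sample);
```

Anything in parentheses that merely happens to end in a year would also be picked up, so the result is a list of candidates to review rather than a verified list.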

So, this is messy, but not as messy as a lot of NLP problems. I wrote a
script that used regexes to find common patterns but it was fallible. I
know publishers have such software because they'll scan your manuscript
and parse both the citations and the references (the full "citation"
with the author initials, work title, etc.) and highlight errors (e.g.,
if you cited Mead & Reed, 2013 but referenced Mead & Seed, 2011, the
software would highlight both the citation and the reference).

Any ideas about how best to approach this problem? Work harder on the
regexes? Hand-code a bunch of documents and use a machine learning
algorithm (which would start, I guess, with named entity recognition)?
Pay MTurkers to find these? Does anyone know of a good named-entity
recognition library that's Perl-friendly? The solution doesn't need to
be particularly fast. I wouldn't be running it on millions of pages.

I suppose someone will suggest I use Python. I really prefer Perl but
maybe Python has more ML tools at this point.

-Alan

-- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

http://www.alanmead.org

I've... seen things you people wouldn't believe...
functions on fire in a copy of Orion.
I watched C-Sharp glitter in the dark near a programmable gate.
All those moments will be lost in time, like Ruby... on... Rails... Time for Pi.

          --"The Register" user Alister, applying the famous 
            "Blade Runner" speech to software development

From JJacobus at PonyX.com  Sun May  6 15:18:10 2018
From: JJacobus at PonyX.com (Jim Jacobus)
Date: Sun, 06 May 2018 17:18:10 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
Message-ID: <20180506224728.7515311D87E@xx1.develooper.com>


Of course it's well suited for Perl. Isn't everything?
Are you looking at OCR'd documents, or are they electronic in a format
like PDF, PostScript, Word, etc.?
Doing a quick search, I see that "APA format" is a print style
formalized in 1929. It's a rudimentary typesetting style, much like
what is used for legal appellate briefs, so it doesn't mean much if
you are just dealing with text (hence the OCR question).
If these are PDFs, there are various ways to extract data using the
style header in the PDF document.



From jkeenan at pobox.com  Sun May  6 17:29:35 2018
From: jkeenan at pobox.com (James E Keenan)
Date: Sun, 6 May 2018 20:29:35 -0400
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
Message-ID: <ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>

Is it the case that there are commercial software solutions for this 
problem, but as yet no open-source solutions?

Is this the standard you are trying to meet?
http://www.apastyle.org/manual/index.aspx


From mikefrag at gmail.com  Sun May  6 19:28:43 2018
From: mikefrag at gmail.com (Mike Fragassi)
Date: Sun, 6 May 2018 21:28:43 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
Message-ID: <CAAKpS=54wde8wGPKib8eb7Cb6m6BYNSqsiqbTW1E6xBeSt2KPg@mail.gmail.com>

Biblio::Citation::Parser may help you. First, it seems like it should be
able to parse the references listed in an article's bibliography:

use strict;
use warnings;
use Data::Dumper;
use Biblio::Citation::Parser::Standard;

# example from http://www.citationmachine.net/apa/cite-a-book
# (concatenated so the string carries no embedded newlines)
my $ref = 'Gleditsch, N. P., Pinker, S., Thayer, B. A., Levy, J. S., & '
        . 'Thompson, W. R. (2013). The forum: The decline of war. '
        . 'International Studies Review, 15(3), 396-419.';
my $cit_parser = Biblio::Citation::Parser::Standard->new;
my $metadata   = $cit_parser->parse($ref);   # hashref of parsed fields
print Dumper($metadata);

The $metadata is a hashref of the values parsed from the string,
including the rule that the parser used to identify the parts, which looks
like this:

   'match' => '_AUTHORS_ (_YEAR_). _TITLE_. _PUBLICATION_, _VOLUME_(_ISSUE_), _PAGES_',

The 'match' seems to come from Biblio::Citation::Parser::Templates, and you
should be able to modify this to look for things like '(_AUTHORS_, _YEAR_)'
etc.  (Note: I've never used this module myself before now.)
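
To make the template idea concrete without relying on the module's internals, here is a small stand-alone sketch in plain Perl. It is not how Biblio::Citation::Parser works; the %pattern_for table and the template_to_regex name are assumptions made up for this illustration. It turns a template string such as '(_AUTHORS_, _YEAR_)' into a regex with one named capture per placeholder.

```perl
use strict;
use warnings;

# Hypothetical placeholder patterns; these are assumptions for the
# sketch, not taken from Biblio::Citation::Parser::Templates.
my %pattern_for = (
    AUTHORS => qr/[^()0-9;]+?/,
    YEAR    => qr/(?:19|20)\d\d[a-z]?/,
);

# Turn a template such as '(_AUTHORS_, _YEAR_)' into a compiled regex
# with one named capture per known placeholder.
sub template_to_regex {
    my ($template) = @_;
    my $re = '';
    for my $piece (split /(_[A-Z]+_)/, $template) {
        if ($piece =~ /^_([A-Z]+)_$/ && $pattern_for{$1}) {
            $re .= "(?<$1>$pattern_for{$1})";    # known placeholder
        }
        else {
            $re .= quotemeta $piece;             # literal template text
        }
    }
    return qr/$re/;
}

my $re = template_to_regex('(_AUTHORS_, _YEAR_)');
if ('... has been used in sales (Alpha et al., 2007)' =~ $re) {
    print "$+{AUTHORS} | $+{YEAR}\n";
}
```

Named captures need Perl 5.10 or later. A template-driven approach like this keeps each citation form readable as a template rather than as a raw regex.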



From amead at alanmead.org  Sun May  6 21:00:07 2018
From: amead at alanmead.org (Alan Mead)
Date: Sun, 6 May 2018 23:00:07 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <20180506224728.7515311D87E@xx1.develooper.com>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<20180506224728.7515311D87E@xx1.develooper.com>
Message-ID: <60957c19-b245-32a3-6244-11cae477c44b@alanmead.org>

Jim,

I was mainly thinking about using this on documents that I or a student
created, to check that the citations (in the text) and references (in
the reference section) are consistent and correct. I developed the
fallible prior script when I was a faculty member, to check references
in student theses. At this point I'm mainly worried about catching
errors in my own writing. For example, I'm writing a book-length manual
and I'd like to check that all the citations are referenced.

I guess I would also like to use it on the text of PDFs, some of which
(the "searchable PDFs") are based on OCR'ing scanned documents. But this
isn't my focus currently.

I would describe APA format as a convention for manuscripts so that they
can be reviewed and then typeset in a standard way. It's perhaps an
anachronism now that we use software that can recreate a lot of typeset
effects. The main issue is that in APA format, you cite a work using a
parenthetical string with names and dates, rather than numbers. So "...
been found (Mead, 2013)." rather than "... been found [3]." There are
specific rules for citations and then a long list of templates for
references for different kinds of works (articles with different numbers
of authors, book chapters, books, software, databases, websites, etc.).

-Alan


From amead at alanmead.org  Sun May  6 21:16:11 2018
From: amead at alanmead.org (Alan Mead)
Date: Sun, 6 May 2018 23:16:11 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
Message-ID: <831035e0-1fe6-4817-0e34-6e131fac74c0@alanmead.org>

On 5/6/2018 7:29 PM, James E Keenan wrote:
> Is it the case that there are commercial software solutions for this
> problem, but as yet no open-source solutions?

I'm not aware of any solution to which I have access, commercial or
otherwise. Most people either perform this task manually or solve it
in a completely different way, using what is sometimes called a
"reference manager". That involves creating a database of works and
using software embedded within the word processor to insert codes that
the software expands into citations and references. I don't know
of a reference manager that works in both LibreOffice and Word, which I
would need. I also don't like this approach.

> Is this the standard you are trying to meet?
> http://www.apastyle.org/manual/index.aspx

Well, yes, but I'm trying to write code that will extract in-text
citations, which are covered by a tiny portion of Chapter 6 of this
document (mainly 6.11 to 6.21; there are 32 rules). In fact, just
covering 6.11 and 6.12 would be sufficient, because rules 6.13 - 6.21
are rarely used.
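
For the narrative form from the sample input, where the authors sit in the sentence and only the year is in parentheses, a sketch along these lines might be a starting point. The extract_narrative name is hypothetical, and the sketch assumes every surname is a single capitalised word. That means it misses group authors, and a capitalised sentence opener such as "However, Mead (2012)" would be captured as if it were an author.

```perl
use strict;
use warnings;

# Hypothetical helper: find narrative citations such as
# "Alpha, Beta and Gamma (2007)" and return them as "Authors, Year".
sub extract_narrative {
    my ($text) = @_;
    $text =~ s/\s+/ /g;                  # undo hard line wraps
    my $name = qr/\p{Lu}[\p{L}'-]+/;     # one capitalised surname (assumption)
    my $year = qr/(?:19|20)\d\d[a-z]?/;
    my @found;

    while ($text =~ /\b((?:$name, )*$name(?:,? (?:and|&) $name| et al\.)?) \(($year)\)/g) {
        my ($authors, $y) = ($1, $2);
        # Normalise the joiner: ", & " for three or more authors,
        # " & " for exactly two.
        my $sep = $authors =~ /,/ ? ', & ' : ' & ';
        $authors =~ s/,? (?:and|&) /$sep/;
        push @found, "$authors, $y";
    }
    return @found;
}

print "$_\n" for extract_narrative(
    'Alpha, Beta and Gamma (2007) collected personality data');
```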

-Alan


From amead at alanmead.org  Sun May  6 21:21:44 2018
From: amead at alanmead.org (Alan Mead)
Date: Sun, 6 May 2018 23:21:44 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <CAAKpS=54wde8wGPKib8eb7Cb6m6BYNSqsiqbTW1E6xBeSt2KPg@mail.gmail.com>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
	<CAAKpS=54wde8wGPKib8eb7Cb6m6BYNSqsiqbTW1E6xBeSt2KPg@mail.gmail.com>
Message-ID: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>

Mike,

Thanks! This is extremely interesting. I didn't know about this. But
this is the second half of the issue (parsing the reference list). It
would greatly enhance my hypothetical script to add this capability and
check that each in-text citation matches one (and only one) of the
references. And, I agree, using this mechanism, I could generate a list
of "targets" that would make it easier to find those citations.

But I was hoping to be able to parse text for which the reference list
hasn't been created yet, to facilitate creating that list.
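
Once both halves exist, the consistency check itself could be sketched roughly as below. Everything here is an assumption for illustration: citation_key and cross_check are hypothetical names, and the key is deliberately crude (first author's surname plus year), so it cannot tell Mead (2012) from Mead & Reed (2012).

```perl
use strict;
use warnings;

# Hypothetical key: first author's surname (lower-cased) plus year.
sub citation_key {
    my ($authors, $year) = @_;
    my ($first) =
        $authors =~ /^\s*([^,&]+?)\s*(?:,|&|\bet al\b|\band\b|$)/;
    return lc($first) . ':' . $year;
}

# Takes two array refs of [authors, year] pairs. Returns two array refs:
# keys cited but never referenced, and keys referenced but never cited.
sub cross_check {
    my ($cited, $referenced) = @_;
    my %cited = map { ( citation_key(@$_) => 1 ) } @$cited;
    my %refd  = map { ( citation_key(@$_) => 1 ) } @$referenced;
    my @missing = sort grep { !$refd{$_} }  keys %cited;
    my @unused  = sort grep { !$cited{$_} } keys %refd;
    return (\@missing, \@unused);
}

my ($missing, $unused) = cross_check(
    [ [ 'Mead & Reed', '2013' ], [ 'Mead', '2012' ] ],
    [ [ 'Mead, A. D., & Seed, B.', '2011' ], [ 'Mead, A. D.', '2012' ] ],
);
print "cited but not referenced: @$missing\n";
print "referenced but not cited: @$unused\n";
```

A stricter key that includes every surname would catch more errors, at the cost of having to resolve "et al." citations first.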

-Alan


From mikefrag at gmail.com  Mon May  7 07:23:24 2018
From: mikefrag at gmail.com (Mike Fragassi)
Date: Mon, 7 May 2018 09:23:24 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
	<CAAKpS=54wde8wGPKib8eb7Cb6m6BYNSqsiqbTW1E6xBeSt2KPg@mail.gmail.com>
	<4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
Message-ID: <CAAKpS=6YHh4DiQKUTJYjDgBKROX9C9d2tF+dfNop2nQQKPW0TA@mail.gmail.com>

Well, if you can create targets for the in-text citations, and feed the
text body into the parser to look for these, you could then take the
hashrefs and use them to generate a skeleton of a bibliography that you can
fill out later. E.g., for a text reference of "see Foo and Bar (2001)"
you'll only have the author(s) and year from the text, but you
could take that, feed it into a template system like Template::Toolkit, and
spit out a bibliography with 'TODO' or 'XXX' in the missing fields:

   Foo, XXX., & Bar, XXX. (2001). XXX_TITLE. XXX_JOURNAL, XXX_VOL, XXX_PAGES.

Then go back and fill in the missing fields.

And when done with writing both the text and the bibliography, you can
rescan both to check that there are no mismatches. Of course, that won't help
you if in one place you cite Foo & Bar (2001) but you meant to cite Foo &
Bar (2002), and you also correctly cite both of these elsewhere.
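
A minimal sketch of the skeleton step, using plain string building instead of Template::Toolkit to keep it dependency-free. The skeleton_entry name and the exact placeholder layout are assumptions for illustration, and "et al." citations are not expanded.

```perl
use strict;
use warnings;

# Build one placeholder reference line from an in-text citation.
# $authors is a string such as "Foo and Bar"; $year is e.g. "2001".
sub skeleton_entry {
    my ($authors, $year) = @_;

    # Split on commas, "&" or "and"; drop the empty fields that
    # ", &" leaves behind.
    my @surnames = grep { length } split /\s*(?:,|&|\band\b)\s*/, $authors;
    my @names    = map { "$_, XXX." } @surnames;    # XXX = missing initials

    my $list = @names > 1
        ? join(', ', @names[0 .. $#names - 1]) . ", & $names[-1]"
        : $names[0];
    return "$list ($year). XXX_TITLE. XXX_JOURNAL, XXX_VOL, XXX_PAGES.";
}

print skeleton_entry('Foo and Bar', '2001'), "\n";
print skeleton_entry('Alpha, Beta, & Gamma', '2007'), "\n";
```

Searching the generated file for XXX afterwards gives a to-do list of the fields still to be filled in.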


From amead at alanmead.org  Mon May  7 07:28:15 2018
From: amead at alanmead.org (Alan Mead)
Date: Mon, 7 May 2018 09:28:15 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <CAAKpS=6YHh4DiQKUTJYjDgBKROX9C9d2tF+dfNop2nQQKPW0TA@mail.gmail.com>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
	<CAAKpS=54wde8wGPKib8eb7Cb6m6BYNSqsiqbTW1E6xBeSt2KPg@mail.gmail.com>
	<4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
	<CAAKpS=6YHh4DiQKUTJYjDgBKROX9C9d2tF+dfNop2nQQKPW0TA@mail.gmail.com>
Message-ID: <be2c1e0a-d379-3665-4405-6dee9f543882@alanmead.org>

This is precisely what I want to do. The first step is to create the
skeleton and your suggestion to use Biblio::Citation::Parser will make
the second step much easier.
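
A minimal sketch of the skeleton and rescan steps described above, written
in Python with plain string formatting rather than Template::Toolkit. The
function names (skeleton_entry, build_skeleton, find_mismatches) and the
exact placeholder layout are assumptions for illustration only; the input
is assumed to be (authors, year) pairs already pulled out of the text.

```python
def skeleton_entry(authors, year):
    """Render one placeholder reference-list entry for an in-text citation."""
    # Only surnames are known from the text, so initials become 'XXX.'.
    names = [f"{surname}, XXX." for surname in authors]
    if len(names) > 1:
        # Assumed APA-like layout: comma-separated, '&' before the last author.
        author_part = ", ".join(names[:-1]) + ", & " + names[-1]
    else:
        author_part = names[0]
    return f"{author_part} ({year}). XXX_TITLE. XXX_JOURNAL, XXX_VOL, XXX_PAGES."


def build_skeleton(citations):
    """De-duplicate (authors, year) pairs and render them in sorted order."""
    unique = sorted({(tuple(authors), year) for authors, year in citations})
    return [skeleton_entry(authors, year) for authors, year in unique]


def find_mismatches(cited, listed):
    """Return (cited but not listed, listed but never cited)."""
    cited_set = {(tuple(a), y) for a, y in cited}
    listed_set = {(tuple(a), y) for a, y in listed}
    return sorted(cited_set - listed_set), sorted(listed_set - cited_set)


if __name__ == "__main__":
    found = [(["Foo", "Bar"], "2001"), (["Mead", "Seed"], "2013a"),
             (["Foo", "Bar"], "2001")]
    for line in build_skeleton(found):
        print(line)
```

Years are kept as strings so that suffixed years such as 2013a survive
unchanged. As noted above, the mismatch check cannot catch a citation
that points at the wrong year when both years are legitimately cited.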

-Alan


-- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

http://www.alanmead.org

I've... seen things you people wouldn't believe...
functions on fire in a copy of Orion.
I watched C-Sharp glitter in the dark near a programmable gate.
All those moments will be lost in time, like Ruby... on... Rails... Time for Pi.

          --"The Register" user Alister, applying the famous 
            "Blade Runner" speech to software development

From joel.a.berger at gmail.com  Tue May  8 14:25:52 2018
From: joel.a.berger at gmail.com (Joel Berger)
Date: Tue, 08 May 2018 21:25:52 +0000
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <be2c1e0a-d379-3665-4405-6dee9f543882@alanmead.org>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
	<CAAKpS=54wde8wGPKib8eb7Cb6m6BYNSqsiqbTW1E6xBeSt2KPg@mail.gmail.com>
	<4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
	<CAAKpS=6YHh4DiQKUTJYjDgBKROX9C9d2tF+dfNop2nQQKPW0TA@mail.gmail.com>
	<be2c1e0a-d379-3665-4405-6dee9f543882@alanmead.org>
Message-ID: <CAAMA-9Mmw4ea0mbZJ_ow+QsyxFNtiL0+mYbGq_PtF7uqHc3d_A@mail.gmail.com>

While I'm all for supporting Perl and it seems like you have found a Perl
way to do it, I thought I'd just offer one (possible) alternative,
depending on what your actual end goal is.

During my Ph.D. research I found the program Zotero for bibliography
management, and I'm not sure what I would have done without it. I kept all
my citations in there and was able to export them to BibTeX for use in my
thesis. I don't know what its other export formats are, but I presume it has
something that can output simple formatted text. Anyway, it's worth taking a
look if you are doing any kind of project with a bibliography. I highly
recommend it!

https://www.zotero.org/

Cheers,
Joel Berger



From amead at alanmead.org  Tue May  8 14:46:36 2018
From: amead at alanmead.org (Alan Mead)
Date: Tue, 8 May 2018 16:46:36 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <CAAMA-9Mmw4ea0mbZJ_ow+QsyxFNtiL0+mYbGq_PtF7uqHc3d_A@mail.gmail.com>
References: <c0649bd8-3eb3-99c5-a1bf-701f0a8f5219@alanmead.org>
	<ed61d285-e626-1d3f-a4c6-91f5c84765c4@pobox.com>
	<CAAKpS=54wde8wGPKib8eb7Cb6m6BYNSqsiqbTW1E6xBeSt2KPg@mail.gmail.com>
	<4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
	<CAAKpS=6YHh4DiQKUTJYjDgBKROX9C9d2tF+dfNop2nQQKPW0TA@mail.gmail.com>
	<be2c1e0a-d379-3665-4405-6dee9f543882@alanmead.org>
	<CAAMA-9Mmw4ea0mbZJ_ow+QsyxFNtiL0+mYbGq_PtF7uqHc3d_A@mail.gmail.com>
Message-ID: <d7f0cf65-f6f7-b6f0-6afe-5e032ffeaa42@alanmead.org>

Joel,

Thanks for the suggestion. A citation manager is especially worth
considering for this project, which is book-length. And I see that Zotero
has plugins for both Word and LibreOffice (contradicting something I said
earlier).

I've used Zotero a bit, and I think a citation manager makes a lot of
sense in many use cases (like a thesis). But in this case I have
collaborators, and they would have to agree to use it with Word. While I've
shared Zotero databases with students, it would be a big process to get
my client to use it; they cannot even receive a ZIP file (because those
are insecure). And I've been on the receiving end of having to edit
manuscripts that used an unknown citation manager, and it makes a fairly
closed format even more closed (a lot like using equation plugins).

But I agree that my solution is clumsy.

-Alan

