From amead at alanmead.org Thu May 3 14:00:35 2018
From: amead at alanmead.org (Alan Mead)
Date: Thu, 3 May 2018 16:00:35 -0500
Subject: [Chicago-talk] Parsing APA-format citations
Message-ID:

Is this list still active? I have a task that I think might be
well-suited to Perl and I'd appreciate the list's advice.

I read and review documents written in APA format and it would sometimes
be handy to parse out the citations from the text. Here are a few
lines of input:

Personality constructs have been found to be predictive of job
performance across many studies and occupations (Ones, Viswesveran, &
Dilchert, 2005). Recent...
Alpha, Beta and Gamma (2007) collected personality data and performance
ratings on 142 incumbent sales employees...
... has been used in sales (Alpha et al., 2007)
Mead and his colleagues (Mead & Reed, 2012; Mead & Seed, 2013a; Mead &
Seed, 2013b)...
... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
... conform to currently accepted best practices (see Society for
Industrial and Organizational Psychology, 2019)

And I'd like this output:

Ones, Viswesveran, & Dilchert, 2005
Alpha, Beta, & Gamma, 2007
Mead & Reed, 2012
Mead & Seed, 2013a
Mead & Seed, 2013b
Mead, 2012
Mead, 2013
Drwho & Layla, 2011
Society for Industrial and Organizational Psychology, 2019

The number of authors can vary, there are at least a couple of ways to
cite, and there is a rule where "et al." should replace the names of
second and later authors in the second and subsequent citations.
Sometimes the "Oxford comma" is used and other times it's omitted. There
are some special rules for other things, like direct quotations and
articles with more than eight authors.

So, this is messy, but not as messy as a lot of NLP problems. I wrote a
script that used regexes to find common patterns but it was fallible.
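For illustration, the regex approach can be sketched roughly as below.
This is a minimal sketch and not the script described in this message:
the sub name parenthetical_citations and its patterns are assumptions,
it handles only parenthetical citations, and it skips narrative ones
such as "Alpha, Beta and Gamma (2007)" as well as citations that carry
page numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: return "Authors, Year" strings found in parenthetical
# citations. Narrative citations such as "Gamma (2007)" are not handled.
sub parenthetical_citations {
    my ($text) = @_;
    my @found;

    # Visit each parenthesised group that contains a plausible year.
    while ($text =~ /\(([^()]*\b(?:19|20)\d\d[a-z]?\b[^()]*)\)/g) {
        for my $part (split /;/, $1) {
            $part =~ s/\s+/ /g;                      # undo line wrapping
            $part =~ s/^ | $//g;                     # trim
            $part =~ s/^(?:see|cf\.|e\.g\.,) //i;    # drop lead-in words

            # Authors, then one or more comma-separated years at the end.
            next unless $part =~
                /^(.+?), ((?:19|20)\d\d[a-z]?(?:, (?:19|20)\d\d[a-z]?)*)$/;
            my ($authors, $years) = ($1, $2);
            push @found, map { "$authors, $_" } split /, /, $years;
        }
    }
    return @found;
}

# Two of the sample lines from the message, including a wrapped citation.
my $sample = <<'END';
... has been validated (Mead, 2012, 2013; Drwho & Layla, 2011)
... conform to currently accepted best practices (see Society for
Industrial and Organizational Psychology, 2019)
END

print "$_\n" for parenthetical_citations($sample);
```

Splitting each group on semicolons first, and only then peeling the
years off the end, is what lets "Mead, 2012, 2013" expand into two
entries while "Ones, Viswesveran, & Dilchert, 2005" stays whole.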
I know publishers have such software because they'll scan your manuscript
and parse both the citations and the references (the full "citation"
with the author initials, work title, etc.) and highlight errors (e.g.,
if you cited Mead & Reed, 2013 and then referenced Mead & Seed, 2011, the
software would highlight the citation and the reference).

Any ideas about how best to approach this problem? Work harder on the
regexes? Hand code a bunch of documents and use a machine learning
algorithm (which would start, I guess, with named entity recognition)?
Pay MTurkers to find these? Does anyone know of a good named-entity
recognition library that's Perl-friendly? The solution doesn't need to
be particularly fast. I wouldn't be running it on millions of pages.

I suppose someone will suggest I use Python. I really prefer Perl but
maybe Python has more ML tools at this point.

-Alan

-- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

http://www.alanmead.org

I've... seen things you people wouldn't believe...
functions on fire in a copy of Orion.
I watched C-Sharp glitter in the dark near a programmable gate.
All those moments will be lost in time, like
Ruby... on... Rails... Time for Pi.

          --"The Register" user Alister, applying the famous
            "Blade Runner" speech to software development

From JJacobus at PonyX.com Sun May 6 15:18:10 2018
From: JJacobus at PonyX.com (Jim Jacobus)
Date: Sun, 06 May 2018 17:18:10 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References:
Message-ID: <20180506224728.7515311D87E@xx1.develooper.com>

Of course it's well suited for Perl. Isn't everything?
Are you looking at OCR'd documents or are they electronic in a format
like PDF, Postscript, Word, etc?
Doing a quick search I see the "APA format" is a print style
formalized in 1929.
It's a rudimentary typesetting style much like
what is used for legal appellate briefs so that doesn't mean much if
you are just dealing with text (that's why the OCR question).
If these are PDFs there are various ways to extract data using the
style header in the PDF document.

_______________________________________________
Chicago-talk mailing list
Chicago-talk at pm.org
http://mail.pm.org/mailman/listinfo/chicago-talk

From jkeenan at pobox.com Sun May 6 17:29:35 2018
From: jkeenan at pobox.com (James E Keenan)
Date: Sun, 6 May 2018 20:29:35 -0400
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References:
Message-ID:

On 05/03/2018 05:00 PM, Alan Mead wrote:
> The number of authors can vary, there are at least a couple ways to
> cite, and there is a rule where "et al." should replace the names of
> second and later authors in the second and subsequent citations.
> Sometimes the "oxford comma" is used and other times it's omitted. There
> are some special rules for other things, like direct quotations and
> articles with more than eight authors.
>

Is it the case that there are commercial software solutions for this
problem, but as yet no open-source solutions?

Is this the standard you are trying to meet?
http://www.apastyle.org/manual/index.aspx

From mikefrag at gmail.com Sun May 6 19:28:43 2018
From: mikefrag at gmail.com (Mike Fragassi)
Date: Sun, 6 May 2018 21:28:43 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References:
Message-ID:

Biblio::Citation::Parser may help you. First, it seems like it should be
able to parse the references listed in an article's bibliography:

    use strict;
    use warnings;
    use Data::Dumper;    # needed for Dumper() below
    use Biblio::Citation::Parser::Standard;

    # example from http://www.citationmachine.net/apa/cite-a-book
    my $ref = 'Gleditsch, N. P., Pinker, S., Thayer, B. A., Levy, J. S., & '
            . 'Thompson, W. R. (2013). The forum: The decline of war. '
            . 'International Studies Review, 15(3), 396-419.';

    my $cit_parser = Biblio::Citation::Parser::Standard->new;
    my $metadata   = $cit_parser->parse($ref);
    print Dumper($metadata);

The $metadata is a hashref of valid values parsed from the string,
including the rule that the parser used to identify the parts, which
looks like this:

    'match' => '_AUTHORS_ (_YEAR_). _TITLE_. _PUBLICATION_, _VOLUME_(_ISSUE_), _PAGES_',

The 'match' seems to come from Biblio::Citation::Parser::Templates, and
you should be able to modify this to look for things like '(_AUTHORS_,
_YEAR_)' etc. (Note: I've never used this module myself before now.)
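To make the template idea concrete, here is a rough sketch of how a
placeholder template such as '(_AUTHORS_, _YEAR_)' could be turned into
a regex with named captures. It does not use or reproduce the internals
of Biblio::Citation::Parser::Templates; the %PATTERN table and the sub
name template_to_regex are assumptions made up for this example.

```perl
use strict;
use warnings;

# Assumed sub-patterns for each placeholder. These are illustrative only
# and are not taken from Biblio::Citation::Parser::Templates.
my %PATTERN = (
    AUTHORS => q{[A-Z][^()\d;]*?},
    YEAR    => q{(?:19|20)\d\d[a-z]?},
);

# Hypothetical helper: compile a template into a regex with one named
# capture per known placeholder; everything else is matched literally.
sub template_to_regex {
    my ($template) = @_;
    my $re = '';

    # The capturing group in the split pattern keeps the placeholders.
    for my $piece (split /(_[A-Z]+_)/, $template) {
        if ($piece =~ /^_([A-Z]+)_$/ && exists $PATTERN{$1}) {
            $re .= "(?<$1>$PATTERN{$1})";
        }
        else {
            $re .= quotemeta $piece;    # literal punctuation and spaces
        }
    }
    return qr/$re/;
}

my $re = template_to_regex('(_AUTHORS_, _YEAR_)');
if ('... has been used in sales (Alpha et al., 2007)' =~ $re) {
    print "authors: $+{AUTHORS}\n";
    print "year:    $+{YEAR}\n";
}
```

A single template like this covers only the one-work, one-year case;
lists such as "(Mead, 2012, 2013; Drwho & Layla, 2011)" would need
either more templates or a pre-split on semicolons.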
From amead at alanmead.org Sun May 6 21:00:07 2018
From: amead at alanmead.org (Alan Mead)
Date: Sun, 6 May 2018 23:00:07 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <20180506224728.7515311D87E@xx1.develooper.com>
References: <20180506224728.7515311D87E@xx1.develooper.com>
Message-ID: <60957c19-b245-32a3-6244-11cae477c44b@alanmead.org>

Jim,

I was mainly thinking about using this on documents that I've created or
a student created to check that the citations (in the text) and
references (in the reference section) are consistent and correct. I
developed the fallible prior script when I was a faculty member to check
references for student theses.
Now, I'm mainly worried about my own writing at this point, to catch
errors. For example, I'm writing a book-length manual and I'd like to
check that all the citations are referenced.

I guess I would like to use it on the text of PDFs, some of which (the
"searchable PDFs") are based on OCR'ing scanned documents. But this
isn't my focus currently.

I would describe APA format as a convention for manuscripts so that they
can be reviewed and then typeset in a standard way. It's perhaps an
anachronism now that we use software that can recreate a lot of typeset
effects. The main issue is that in APA format, you cite a work using a
parenthetical string with names and dates, rather than numbers. So "...
been found (Mead, 2013)." rather than "... been found [3]." There are
specific rules for citations and then a long list of templates for
references for different kinds of works (articles with different numbers
of authors, book chapters, books, software, databases, websites, etc.).

-Alan

From amead at alanmead.org Sun May 6 21:16:11 2018
From: amead at alanmead.org (Alan Mead)
Date: Sun, 6 May 2018 23:16:11 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References:
Message-ID: <831035e0-1fe6-4817-0e34-6e131fac74c0@alanmead.org>

On 5/6/2018 7:29 PM, James E Keenan wrote:
> Is it the case that there are commercial software solutions for this
> problem, but as yet no open-source solutions?

I'm not aware of any solution to which I have access, commercial or
otherwise. Most people either perform this task manually or else they
solve it in a completely different way using what is sometimes called a
"reference manager", which involves creating a database of works and
using software embedded within the word processor to insert codes that
are expanded by the software into citations and references. I don't know
of a reference manager that works in both LibreOffice and Word, which I
would need. I also don't like this approach.

> Is this the standard you are trying to meet?
> http://www.apastyle.org/manual/index.aspx

Well, yes, but I'm trying to write code that will extract in-text
citations, which is a tiny portion of Chapter 6 of this document (mainly
6.11 to 6.21; there are 32 rules -- in fact, just covering 6.11 and 6.12
would be sufficient because using rules 6.13 - 6.21 is rare).

-Alan

From amead at alanmead.org Sun May 6 21:21:44 2018
From: amead at alanmead.org (Alan Mead)
Date: Sun, 6 May 2018 23:21:44 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References:
Message-ID: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>

Mike,

Thanks! This is extremely interesting. I didn't know about this. But
this is the second half of the issue (parsing the reference list). It
would greatly enhance my hypothetical script to add this capability and
check that the in-text citations match one (and only one) of the
references. And, I agree, using this mechanism, I could generate a list
of "targets" that would make it easier to find those citations.

But I was hoping to be able to parse text where the reference list
hadn't been created to facilitate the creation of the reference list.

-Alan
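The "one (and only one)" check could look something like the sketch
below. It is only an illustration under a strong assumption: citations
and references are compared by first author's surname plus year, so it
cannot tell Mead & Reed (2012) from Mead & Seed (2012). The sub names
key_for and cross_check are hypothetical, and the reference strings use
XXX placeholders rather than real bibliographic data.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a citation or reference string to a
# "surname|year" key. Assumes the first author comes first and the
# first year-like token is the publication year.
sub key_for {
    my ($string) = @_;
    my ($surname) = $string =~ /^\s*([^,.&(]+?)\s*(?:[,.&(]| et al)/
        or return;
    my ($year) = $string =~ /\b((?:19|20)\d\d[a-z]?)\b/
        or return;
    return lc "$surname|$year";
}

# Report citations with zero or several matching references, and
# references that are never cited. Takes two array refs of strings.
sub cross_check {
    my ($citations, $references) = @_;

    my %ref_count;
    for my $ref (@$references) {
        my $key = key_for($ref) // next;
        $ref_count{$key}++;
    }

    my (%cited, @problems);
    for my $cite (@$citations) {
        my $key = key_for($cite) // next;
        $cited{$key} = 1;
        my $n = $ref_count{$key} // 0;
        push @problems, "no reference for: $cite"       if $n == 0;
        push @problems, "ambiguous ($n matches): $cite" if $n > 1;
    }

    for my $ref (@$references) {
        my $key = key_for($ref) // next;
        push @problems, "never cited: $ref" unless $cited{$key};
    }
    return @problems;
}

# The mismatch from earlier in the thread: cited 2013, referenced 2011.
my @cites = ('Mead & Reed, 2013', 'Drwho & Layla, 2011');
my @refs  = (
    'Mead, XXX., & Seed, XXX. (2011). XXX_TITLE.',
    'Drwho, XXX., & Layla, XXX. (2011). XXX_TITLE.',
);
print "$_\n" for cross_check(\@cites, \@refs);
```

Keying on the full author list instead of the first surname would be
stricter, but "et al." citations would then need to be expanded before
they could be compared.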
From mikefrag at gmail.com Mon May 7 07:23:24 2018
From: mikefrag at gmail.com (Mike Fragassi)
Date: Mon, 7 May 2018 09:23:24 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
References: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
Message-ID:

Well, if you can create targets for the in-text citations, and feed the
text body into the parser to look for these, you could then take the
hashrefs and use them to generate a skeleton of a bibliography that you
can fill out later. I.e. for a text reference of "see Foo and Bar
(2001)" you'll only have the author(s) and year for what's in the text,
but you could take that, feed it into a template system like
Template::Toolkit, and spit out a bibliography with 'TODO' or 'XXX' in
the missing fields:

   Foo, XXX. & Bar XXX. (2001) XXX_TITLE. XXX_JOURNAL, XXX_VOL, XXX_PAGES.

Then go back and fill in the missing fields.
And when done with writing both the text and the bibliography, you can
rescan both to check that there are no mismatches. Of course, that won't
help you if in one place you cite Foo & Bar (2001) but you meant to cite
Foo & Bar (2002), and you do also correctly cite both of these
elsewhere.

From amead at alanmead.org Mon May 7 07:28:15 2018
From: amead at alanmead.org (Alan Mead)
Date: Mon, 7 May 2018 09:28:15 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
Message-ID:

On 5/7/2018 9:23 AM, Mike Fragassi wrote:
> Well, if you can create targets for the in-text citations, and feed
> the text body into the parser to look for these, you could then take
> the hashrefs and use them to generate a skeleton of a bibliography
> that you can fill out later.

This is precisely what I want to do. The first step is to create the
skeleton and your suggestion to use Biblio::Citation::Parser will make
the second step much easier.

-Alan

From joel.a.berger at gmail.com Tue May 8 14:25:52 2018
From: joel.a.berger at gmail.com (Joel Berger)
Date: Tue, 08 May 2018 21:25:52 +0000
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
Message-ID:

While I'm all for supporting Perl and it seems like you have found a
Perl way to do it, I thought I'd just offer one (possible)
alternative, depending on what your actual end goal is.

During my Ph.D. research I found the program zotero to do bibliography
management and I'm not sure what I would have done without it. I kept
all my citations in there and I was able to export them to BibTeX for
use in my thesis. I don't know what its export formats are, but I
presume they have something that can output simple formatted text.
Anyway, it's worth taking a look if you are doing any kind of project
with a bibliography. I highly recommend it!

https://www.zotero.org/

Cheers,
Joel Berger

From amead at alanmead.org Tue May 8 14:46:36 2018
From: amead at alanmead.org (Alan Mead)
Date: Tue, 8 May 2018 16:46:36 -0500
Subject: [Chicago-talk] Parsing APA-format citations
In-Reply-To:
References: <4188b914-acd2-5e54-e2dc-ba68522b0224@alanmead.org>
Message-ID:

Joel,

Thanks for the suggestion. That's especially true for this project,
which is book-length. And I see that zotero has a plugin for both Word
and LibreOffice (contradicting something I said earlier).

I've used zotero a bit and I think a citation manager makes a lot of
sense in many use cases (like a thesis) but in this case where I have
collaborators, they would have to agree to use it with Word.
While I've shared zotero databases with students, it would be a big
process to get my client to use it; they cannot even receive a ZIP file
(because those are insecure). And I've been on the receiving end of
having to edit manuscripts that used an unknown citation manager, and it
makes a fairly closed format even more closed (a lot like using
equation plugins).

But I agree that my solution is clumsy.

-Alan
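As a closing illustration of the skeleton-bibliography step discussed
earlier in the thread, the sketch below turns extracted "Authors, Year"
strings into placeholder references. It swaps the suggested template
system for plain string building to stay dependency-free; the sub names
skeleton_entry and skeleton_bibliography and the XXX_OTHER_AUTHORS
marker are assumptions, and the output punctuation is a guess at
APA-like layout rather than a checked rendering of the manual.

```perl
use strict;
use warnings;

# Hypothetical helper: build one placeholder reference from a citation
# string such as "Foo & Bar, 2001". Returns nothing if it cannot parse.
sub skeleton_entry {
    my ($cite) = @_;
    my ($authors, $year) = $cite =~ /^(.+),\s*((?:19|20)\d\d[a-z]?)$/
        or return;

    # "et al." hides authors we do not know yet; mark them explicitly.
    my $et_al = $authors =~ s/\s+et al\.$//;

    # Split on commas and ampersands, dropping the empty pieces that
    # ", &" produces.
    my @names = grep { length } split /\s*(?:,|&)\s*/, $authors;
    my @parts = map { "$_, XXX." } @names;
    push @parts, 'XXX_OTHER_AUTHORS' if $et_al;

    my $last = pop @parts;
    my $list = @parts ? join(', ', @parts) . ", & $last" : $last;

    return "$list ($year). XXX_TITLE. XXX_JOURNAL, XXX_VOL, XXX_PAGES.";
}

# De-duplicate and alphabetise the entries, as a reference list would be.
sub skeleton_bibliography {
    my %seen;
    my @entries = grep { !$seen{$_}++ } map { skeleton_entry($_) } @_;
    return sort @entries;
}

print "$_\n" for skeleton_bibliography(
    'Mead & Seed, 2013a',
    'Foo & Bar, 2001',
    'Mead & Seed, 2013a',     # repeated citation, listed once
    'Alpha et al., 2007',
);
```

Searching the finished draft for "XXX" then gives a quick list of the
fields that still need to be filled in by hand.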