From pm at hutnick.com Wed May 5 13:39:37 2004
From: pm at hutnick.com (Peter Hutnick)
Date: Mon Aug 2 21:25:48 2004
Subject: [Boulder.pm] Trickey (for a newbie) String Replacement
Message-ID: <409934E9.6040707@hutnick.com>

Hello,

I'm writing a little script to do some of the grunt-work of converting a LaTeX document to XHTML.

Consider the following block of code:

@rule=( 's/\\emph{(.*?)}/\<em\>$1\<\/em\>/g',
        's/\\chapter{(.*)}/\<h1\>$1\<\/h1\>/g');

foreach (@rule) {
    print $_, ";\n";
    $text =~ $_;
}

As, I'm sure you can see, this does nothing. I think you can also see what I am driving at.

The obvious fix is to break each element of @rule down into two strings. (I.e. a hash or an array of arrays.) For my application that leaves two problems.

1. This significantly increases the complexity of what should be a simple script. (It makes loading the rules harder. It makes calling the subs more complex.)

2. I imagine that some of my rules are /not/ going to be in the form s/$search/$replace/g. I'd like to have the flexibility of doing a tr for example.

Is there a better way to approach this?

Thanks,

Peter

From luke at luqui.org Wed May 5 15:56:32 2004
From: luke at luqui.org (Luke Palmer)
Date: Mon Aug 2 21:25:48 2004
Subject: [Boulder.pm] Trickey (for a newbie) String Replacement
In-Reply-To: <409934E9.6040707@hutnick.com>
References: <409934E9.6040707@hutnick.com>
Message-ID: <20040505205632.GA16202%luke@luqui.org>

Peter Hutnick writes:
> Hello,
>
> I'm writing a little script to do some of the grunt-work of converting a
> LaTeX document to XHTML.
>
> Consider the following block of code:
>
> @rule=( 's/\\emph{(.*?)}/\<em\>$1\<\/em\>/g',
> 's/\\chapter{(.*)}/\<h1\>$1\<\/h1\>/g');
>
> foreach (@rule) {
> print $_, ";\n";
> $text =~ $_;
> }
>
> As, I'm sure you can see, this does nothing. I think you can also see
> what I am driving at.

You had me staring at this for five minutes wondering why it didn't work. I completely missed the fact that you didn't have an eval.
Here's what you want (note that I fixed your regexes a little, too):

    @rule = (
        's[\\\\emph{(.*?)}]    [<em>$1</em>]g',
        's[\\\\chapter{(.*?)}] [<h1>$1</h1>]g',
    );

    # this only works if the text is in $_
    # you might have to do a little BS to get it there
    foreach my $rule (@rule) {
        eval $rule;
    }

But I'm sure someone's written a LaTeX to XHTML converter before. Do a little searching around (or maybe you have).

I'd also suggest using Parse::RecDescent if you want this to scale, or if you plan on using this on multiple documents in the future. LaTeX has a hierarchical structure, and Perl 5's regexes don't work very well with that.

Luke

From pm at hutnick.com Thu May 6 08:30:00 2004
From: pm at hutnick.com (Peter Hutnick)
Date: Mon Aug 2 21:25:48 2004
Subject: [Boulder.pm] Trickey (for a newbie) String Replacement
In-Reply-To: <20040505205632.GA16202%luke@luqui.org>
References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org>
Message-ID: <409A3DD8.7030903@hutnick.com>

Luke Palmer wrote:

> You had me staring at this for five minutes wondering why it didn't
> work.

I almost said "$text =~ $_; #does nothing." Guess I should have :-(

> I completely missed the fact that you didn't have an eval.

Ah, eval. That's exactly the incantation I was looking for!

> Here's what you want (note that I fixed your regexes a little, too):
>
> @rule = (
> 's[\\\\emph{(.*?)}] [<em>$1</em>]g',
> 's[\\\\chapter{(.*?)}] [<h1>$1</h1>]g',
> );
>
> # this only works if the text is in $_
> # you might have to do a little BS to get it there
> foreach my $rule (@rule) {
> eval $rule;
> }

One of my philosophical clashes with perl is that I like explicit, descriptive variable names and expressions. Perl allows (encourages) stuff like that, where you have an expression whose results are all implicit. My answer is to just write things more explicitly than I have to. I'm okay with that ;-)

Oh, and what is with the extra backslash? (I.e. \\emph -> \\\\emph)

> But I'm sure someone's written a LaTeX to XHTML converter before. Do a
> little searching around (or maybe you have).

Eh. There's one called ltoh that would work, except that it has an unacceptable license term. (It sticks a little ad for itself in the output, and the license disallows deleting it.) This script will be used in support of a copyleft project so I don't really have any latitude in the matter.

The other thing is that I will be converting specific documents with this script. Because of functional differences between HTML and LaTeX a universal translator is really impossible. I figure one that works really well, but only for me, is the best solution.

> I'd also suggest using Parse::RecDescent if you want this to scale, or
> if you plan on using this on multiple documents in the future. LaTeX
> has a hierarchical structure, and Perl 5's regexes don't work very well
> with that.

I don't think that is necessary. I am counting on the LaTeX file being well formed. Since I am only working with my own document I think that is okay. Again, if this was a general purpose app parsing the whole thing semantically would be the way to go. For pure simplicity you can't beat "``=" and "''=" etc.

As an update, I went through last night and made a bunch of rules and it worked pretty well. I actually got everything working except internal links, the s I need to do some font work (i.e. \Large), and I am getting a few extra paragraphs.
I think that the paragraphs are unavoidable due to the way LaTeX uses whitespace. (Well, unavoidable without parsing the input semantically . . .)

Thanks a million for the advice.

-Peter

From nagler at bivio.biz Thu May 6 09:33:08 2004
From: nagler at bivio.biz (Rob Nagler)
Date: Mon Aug 2 21:25:48 2004
Subject: [Boulder.pm] Trickey (for a newbie) String Replacement
In-Reply-To: <409A3DD8.7030903@hutnick.com>
References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A3DD8.7030903@hutnick.com>
Message-ID: <16538.19620.177174.72658@jump.bivio.biz>

Peter Hutnick writes:
> One of my philosophical clashes with perl is that I like explicit,
> descriptive variable names and expressions. Perl allows (encourages)
> stuff like that, where you have an expression whose results are all
> implicit.

You have no philosophical clash with perl, whose philosophy is TIMTOWTDI. However, you will bash up against the Perl Intelligentsia if you do this. Implicit variables with dynamic scoping are simply bad programming. It's why I also write:

    use Bla::Module ();

If you don't do this, you have no idea where any of the names are coming from.

> The other thing is that I will be converting specific documents with
> this script. Because of functional differences between HTML and LaTeX a
> universal translator is really impossible. I figure one that works
> really well, but only for me, is the best solution.

You may want to look at doclifter, which converts troff to html.

One of the dangers of this approach is that you miss something. Write your code with assertions so that you don't accidentally miss a latex value, e.g. if you use Luke's alg, add a rule of the form:

    q{/\\\\(\w+)/ && die($1, ': unhandled latex command')}

And, for a shameless plug, read the "It's a SMOP" chapter in my book which contains a DocBook/XML to HTML translator:

    http://www.extremeperl.org/bk/its-a-smop

It may help you simplify Luke's algorithm so that the rules are clearer and easier to maintain.
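A minimal sketch of the catch-all idea, assuming the rules are Perl snippets that act on $_ as in Luke's version. The helper name apply_rules and the sample input are hypothetical, not from the thread. The rules are written in single quotes, so each backslash meant for the regex is doubled once more in the source.

```perl
use strict;
use warnings;

# Hypothetical helper: run each rule (a Perl snippet acting on $_) in order.
sub apply_rules {
    my ($text, @rules) = @_;
    local $_ = $text;
    for my $rule (@rules) {
        eval $rule;
        # Re-raise anything a rule died with, including the catch-all below.
        die "rule failed: $rule\n$@" if $@;
    }
    return $_;
}

my @rules = (
    # In single quotes, '\\\\' reaches eval as '\\', i.e. one literal backslash.
    's[\\\\emph\\{(.*?)\\}][<em>$1</em>]g',
    # Catch-all, placed last: any command still present was never handled.
    '/\\\\(\\w+)/ and die "$1: unhandled latex command\n"',
);

print apply_rules('An \emph{important} point.', @rules), "\n";
```

Keeping the catch-all as the last rule means a command such as \chapter, which no earlier rule converts, stops the run instead of leaking into the output.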
Rob From pm at hutnick.com Thu May 6 10:54:39 2004 From: pm at hutnick.com (Peter Hutnick) Date: Mon Aug 2 21:25:48 2004 Subject: [Boulder.pm] Trickey (for a newbie) String Replacement In-Reply-To: <16538.19620.177174.72658@jump.bivio.biz> References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A3DD8.7030903@hutnick.com> <16538.19620.177174.72658@jump.bivio.biz> Message-ID: <409A5FBF.9090104@hutnick.com> Rob Nagler wrote: > You ahve no philosophical clash with perl, whose philosophy is > TIMTOWTDI. However, you will bash up against the Perl Intelligentsia > if you do this. Like it or not, culture is part of the language . . . > Implicit variables with dynamic scoping are simply bad programming. > It's why I also write: > > use Bla::Module (); > > If you don't do this, you have no idea where any of the names are > coming from. I'm not following you here. I've used getopts that way. Maybe I just don't know the "bad" way? > add a rule of the form: > > q{/\\(w+)/ && die($1, ': unhandled latex command')} I plan to do exactly this as the default, and have a rule like s/\\.*?}//g for "pristine" mode. > And, for a shameless plug, read the "It's a SMOP" chapter in my book > which contains a DocBook/XML to HTML translator: > > http://www.extremeperl.org/bk/its-a-smop You might consider revising "subject matter oriented program evolves" in the first paragraph to "subject matter oriented program (SMOP) evolves." Took me a while to figure out what the hell a SMOP is. > It may help you simplify Luke's algorithm so that the rules are > clearer and easier to maintain. The rules are going to end up in a separate file that lives with the LaTeX file it works with. As I said before, generality seems unachievable for this application. 
Laziness, in this case, is the better part of valor ;-) -Peter From pm at hutnick.com Thu May 6 11:54:11 2004 From: pm at hutnick.com (Peter Hutnick) Date: Mon Aug 2 21:25:48 2004 Subject: [Boulder.pm] Trickey (for a newbie) String Replacement In-Reply-To: <20040505205632.GA16202%luke@luqui.org> References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> Message-ID: <409A6DB3.2090607@hutnick.com> Luke Palmer wrote: > eval $rule; Well, I've read up on eval a bit, but I remain unclear on how it is expanded. I'd like to do something like foreach $rule (@rules) { $text =~ (eval $rule) } But the eval doesn't expand the way I'd like. I've blindly tried various quotes and brackets with no luck. Is it just impossible to have an eval as a part of a statement? -Peter From nagler at bivio.biz Thu May 6 12:34:50 2004 From: nagler at bivio.biz (Rob Nagler) Date: Mon Aug 2 21:25:48 2004 Subject: [Boulder.pm] Trickey (for a newbie) String Replacement In-Reply-To: <409A6DB3.2090607@hutnick.com> References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A6DB3.2090607@hutnick.com> Message-ID: <16538.30522.784380.373400@jump.bivio.biz> Peter Hutnick writes: > foreach $rule (@rules) { > $text =~ (eval $rule) You could: eval("$text =~ $rule"); This assumes a certain structure to $rule. Alternatively, you could use the $_ approach, that is, have the $rule assume a specific variable name that contains the text, e.g., write the rules like: '$text =~ s/\\\\bla//', The advantage of using $_ is that the rules are shorter and easier to read. > But the eval doesn't expand the way I'd like. I've blindly tried > various quotes and brackets with no luck. Is it just impossible to have > an eval as a part of a statement? eval is perl so it can do anything perl can do. On the flip side, going with a more structured language for the rules would allow you to avoid eval, which makes your code easier to debug. 
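One possible reading of "a more structured language for the rules" is to keep each rule as data: a compiled pattern plus a callback that builds the replacement. The sketch below works under that assumption and is not something posted in the thread; the rule table layout and the name convert are hypothetical.

```perl
use strict;
use warnings;

# Each rule is a pair: a compiled pattern and a callback that builds the
# replacement from the first capture. No string eval is involved.
my @rules = (
    [ qr/\\emph\{(.*?)\}/,    sub { "<em>$_[0]</em>" } ],
    [ qr/\\chapter\{(.*?)\}/, sub { "<h1>$_[0]</h1>" } ],
);

sub convert {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($pattern, $replace) = @$rule;
        # /e evaluates the replacement as code, so the callback runs per match.
        $text =~ s/$pattern/$replace->($1)/ge;
    }
    return $text;
}

print convert('\chapter{Intro} Some \emph{text}.'), "\n";
```

The trade-off is that every rule must fit the pattern-plus-callback shape; a rule that wants tr or several statements would need a different entry type, for example a plain sub that takes the text and returns it.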
Generally, code which generates code and executes it is more difficult to understand. Rob From pm at hutnick.com Thu May 6 12:59:45 2004 From: pm at hutnick.com (Peter Hutnick) Date: Mon Aug 2 21:25:48 2004 Subject: [Boulder.pm] Trickey (for a newbie) String Replacement In-Reply-To: <16538.30522.784380.373400@jump.bivio.biz> References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A6DB3.2090607@hutnick.com> <16538.30522.784380.373400@jump.bivio.biz> Message-ID: <409A7D11.80208@hutnick.com> Rob Nagler wrote: > Peter Hutnick writes: > >> foreach $rule (@rules) { >> $text =~ (eval $rule) > > > You could: > > eval("$text =~ $rule"); This gives me: Backslash found where operator expected at (eval 1) line 1, near "a \" (Do you need to predeclare a?) The first rule is "s/\\emph{(.*?)}/\$1\<\/em\>/g;" which works if I just paste it in. > This assumes a certain structure to $rule. Alternatively, you could > use the $_ approach, that is, have the $rule assume a specific > variable name that contains the text, e.g., write the rules like: > > '$text =~ s/\\\\bla//', Since TeX commands are in the form \bla I am confused about why you put \\\\bla, not \\bla. > eval is perl so it can do anything perl can do. On the flip side, > going with a more structured language for the rules would allow you to > avoid eval, which makes your code easier to debug. Generally, code > which generates code and executes it is more difficult to understand. I really need to isolate as much of the complexity as possible into the code. 
-Peter

From nagler at bivio.biz Thu May 6 13:02:34 2004
From: nagler at bivio.biz (Rob Nagler)
Date: Mon Aug 2 21:25:48 2004
Subject: [Boulder.pm] Trickey (for a newbie) String Replacement
In-Reply-To: <409A5FBF.9090104@hutnick.com>
References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A3DD8.7030903@hutnick.com> <16538.19620.177174.72658@jump.bivio.biz> <409A5FBF.9090104@hutnick.com>
Message-ID: <16538.32186.685418.710119@jump.bivio.biz>

Peter Hutnick writes:
> Like it or not, culture is part of the language . . .

Absolutely, but is language descriptive or prescriptive? The creator of perl says that language is descriptive, that is, defined by the culture. The Intelligentsia fights the descriptive nature of Perl by using phrases like "That's not Perlish". A phrase like this is not XPish, which is why I wrote my book. :-) It's fear of change that drives people to define Perlish. It gives them a home base to run and hide.

> I'm not following you here. I've used getopts that way. Maybe I just
> don't know the "bad" way?

If you say:

    use Foo::Bar;

You allow Foo::Bar to pollute your name space with whatever it likes for all time. You relinquish control of your naming. That's why @EXPORT_OK is not ok imho. I can see people wanting to bring in names with:

    use Foo::Bar qw(foo bar);

That's a laziness thing, and in certain cases it makes sense. However, blanket import of arbitrary symbols is a disaster waiting to happen. One misspelling on the importer's part, and you have a whole new set of semantics occurring. This is why:

    eval($anything);

is probably a bad practice. You probably want:

    eval($anything) || die($@);

If $anything contains something that isn't defined and then all of a sudden it becomes defined, well, there you go, you've got new semantics and you have to figure out which of your 100 rules is causing them.

> I plan to do exactly this as the default, and have a rule like
> s/\\.*?}//g for "pristine" mode.
Cool > You might consider revising "subject matter oriented program evolves" in > the first paragraph to "subject matter oriented program (SMOP) evolves." Done. > Took me a while to figure out what the hell a SMOP is. It's a play on words, actually. SMOP in the Hacker's Dictionary is: Simple (or Small) Matter of Programming http://info.astrian.net/jargon/terms/s/SMOP.html It's a derogatory term, but I believe that if you go back to the subject matter instead of the program, you end up with a subject matter oriented program which truly is a simple matter of programming. Twisted, but that's part of the Perl culture. ;-) > The rules are going to end up in a separate file that lives with the > LaTeX file it works with. Cool. Could you not create a latex style sheet that would do the work of converting your latex to html? > As I said before, generality seems unachievable for this > application. Never. ;-) Check out doclifter. It is simply amazing. However, your customer (even that person is yourself) probably doesn't want to pay for any more generalization than is absolutely necessary. > Laziness, in this case, is the better part of valor ;-) Laziness to me is doing the simplest thing that could possibly work (XPism) that makes me happy. I refactor when the code doesn't feel right, and only if it the refactoring doesn't cost "too much". The SMOP example in my book was refactored way too much, but then it is a book example. 
:-) Rob From pm at hutnick.com Thu May 6 14:22:10 2004 From: pm at hutnick.com (Peter Hutnick) Date: Mon Aug 2 21:25:48 2004 Subject: [Boulder.pm] Trickey (for a newbie) String Replacement In-Reply-To: <16538.32186.685418.710119@jump.bivio.biz> References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A3DD8.7030903@hutnick.com> <16538.19620.177174.72658@jump.bivio.biz> <409A5FBF.9090104@hutnick.com> <16538.32186.685418.710119@jump.bivio.biz> Message-ID: <409A9062.1090706@hutnick.com> Rob Nagler wrote: > Peter Hutnick writes: > >>I'm not following you here. I've used getopts that way. Maybe I just >>don't know the "bad" way? > > > If you say: > > use Foo::Bar; Oh. I lied. But it's fixed now ;-) > You allow Foo::Bar to pollute your name space with whatever it likes > for all time. You relinquish control of your naming. That's why > @EXPORT_OK is not ok imho. I can see people wanting to bring in names > with: > > use Foo::Bar qw(foo bar); > > That's a laziness thing, and in certain cases it makes sense. What's the non-lazy way to use the functions in a package? Any idea how to fix the opposite problem of typo warnings when you "reach down into" the module? (E.g. $Getopt::Std::opt_h) > If $anything contains something that isn't defined and then all of a > sudden it becomes defined, well, there you go, you've got new > semantics and you have to figure out which of your 100 rules is > causing them. >>You might consider revising "subject matter oriented program evolves" in >>the first paragraph to "subject matter oriented program (SMOP) evolves." > > > Done. Cool! >>The rules are going to end up in a separate file that lives with the >>LaTeX file it works with. > > > Cool. Could you not create a latex style sheet that would do the work > of converting your latex to html? Yes, I could not. I actually don't know the first thing about LaTeX style sheets. Do they really generate non TeX text output?! 
>>As I said before, generality seems unachievable for this >>application. > > > Never. ;-) Check out doclifter. It is simply amazing. However, > your customer (even that person is yourself) probably doesn't want to > pay for any more generalization than is absolutely necessary. I don't think I was clear. The two do not share a 1:1 correspondence. Any mapping that I devise could fail with a different input. The biggest killer is the fact that TeX lets you make new commands. Guaranteed failure right there. So I have settled on a sort of meta-language for describing how /my/ document is best represented in HTML, and s script to implement those rules. A significant portion will be relevant to some other arbitrary LaTeX file, so a few simple changes to the rules file will allow applicability to any other file. -Peter PS: I am really enjoying and learning from this discussion. From nagler at bivio.biz Thu May 6 15:37:32 2004 From: nagler at bivio.biz (Rob Nagler) Date: Mon Aug 2 21:25:48 2004 Subject: [Boulder.pm] Trickey (for a newbie) String Replacement In-Reply-To: <409A9062.1090706@hutnick.com> References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A3DD8.7030903@hutnick.com> <16538.19620.177174.72658@jump.bivio.biz> <409A5FBF.9090104@hutnick.com> <16538.32186.685418.710119@jump.bivio.biz> <409A9062.1090706@hutnick.com> Message-ID: <16538.41484.287908.336872@jump.bivio.biz> Peter Hutnick writes: > What's the non-lazy way to use the functions in a package? I recommend: Foo::Bar->do_it(); Many more modern CPAN packages are written this way. Older packages require you to do: Foo::Bar::do_it(); It's better than just: do_it(); However, many packages (which shall remain nameless) don't make it easy to find Foo::Bar, because they mix everything up. To me, all Foo::Bar functions should be defined in Foo/Bar.pm. That's the way bOP is organized, and it makes it very easy to navigate. 
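A small sketch of the import styles being contrasted, using the core module List::Util. The module is chosen here only for illustration and is not mentioned in the thread.

```perl
use strict;
use warnings;

# Empty import list: nothing is added to this package's namespace.
use List::Util ();

# Fully qualified call: the origin of the function is visible at the call site.
my $total = List::Util::sum(1, 2, 3);

# Explicit import list: only the named function is imported.
use List::Util qw(max);
my $largest = max(1, 2, 3);

print "$total $largest\n";
```

The class-method form, Foo::Bar->do_it(), only applies to modules written to be called that way, so it is not shown with List::Util.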
> Any idea how to fix the opposite problem of typo warnings when you > "reach down into" the module? (E.g. $Getopt::Std::opt_h) I'm not sure if you mean this: use vars qw($Getopt::Std::opt_h); > Yes, I could not. I actually don't know the first thing about LaTeX > style sheets. Do they really generate non TeX text output?! I think you can get them to generate anything in "aux" files. That's how bib entries and such work. > I don't think I was clear. The two do not share a 1:1 correspondence. > Any mapping that I devise could fail with a different input. Understand. I thought the dataset was constrained. > The biggest killer is the fact that TeX lets you make new commands. > Guaranteed failure right there. Well, not really. TeX commands are macros and easy to interpret, but we won't got there. ;-) > So I have settled on a sort of meta-language for describing how /my/ > document is best represented in HTML, and s script to implement those rules. That's great. I don't think there is a big market in LaTeX to HTML translators: http://tex.loria.fr/english/outils.html#latex2html > A significant portion will be relevant to some other arbitrary LaTeX > file, so a few simple changes to the rules file will allow applicability > to any other file. And you'll solve that problem when you come to it. That's Extreme Perl at its laziest! > PS: I am really enjoying and learning from this discussion. Ditto. 
Rob

From luke at luqui.org Thu May 6 15:42:18 2004
From: luke at luqui.org (Luke Palmer)
Date: Mon Aug 2 21:25:48 2004
Subject: [Boulder.pm] Trickey (for a newbie) String Replacement
In-Reply-To: <409A7D11.80208@hutnick.com>
References: <409934E9.6040707@hutnick.com> <20040505205632.GA16202%luke@luqui.org> <409A6DB3.2090607@hutnick.com> <16538.30522.784380.373400@jump.bivio.biz> <409A7D11.80208@hutnick.com>
Message-ID: <20040506204218.GA32033%luke@luqui.org>

Peter Hutnick writes:
> Rob Nagler wrote:
>
> >Peter Hutnick writes:
> >
> >> foreach $rule (@rules) {
> >> $text =~ (eval $rule)
> >
> >
> >You could:
> >
> > eval("$text =~ $rule");
>
> This gives me:
>
> Backslash found where operator expected at (eval 1) line 1, near "a \"
> (Do you need to predeclare a?)
>
> The first rule is "s/\\emph{(.*?)}/\<em\>$1\<\/em\>/g;" which works if I
> just paste it in.

He meant:

    eval("\$text =~ $rule");

You don't want your text expanded and evaluated as Perl code.

I suggest operating on $_, however. That way you can have multiple statements in a rule if you like, without being horribly hackly.

> >This assumes a certain structure to $rule. Alternatively, you could
> >use the $_ approach, that is, have the $rule assume a specific
> >variable name that contains the text, e.g., write the rules like:
> >
> > '$text =~ s/\\\\bla//',
>
> Since TeX commands are in the form \bla I am confused about why you put
> \\\\bla, not \\bla.

Well, look at the expansion. When you have:

    $rx = 's/\\bla//';

Then $rx contains the string:

    s/\bla//

Since a double-backslash in a single quoted string turns into a single backslash. This is something that I dislike and am trying to change for Perl 6, but no luck so far.

So now you're matching against a word boundary \b and then "la", which isn't what you want.

If you read the rules in from a file or use a single-quoted heredoc, you don't run into that problem:

    my @rules = split /^/, <<'EORULES';
    s/\\bla//g; # works as expected
    EORULES

Luke
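The pieces discussed in this thread can be put together roughly as follows. This is a sketch under stated assumptions rather than anyone's actual script: the rule set, the name convert, and the sample input are hypothetical. Rules come from a single-quoted heredoc so backslashes are written as the regex needs them, each rule runs against $_, and $@ is checked on its own because s/// returns false when nothing matched, which would make a test on eval's return value fire on a harmless no-match.

```perl
use strict;
use warnings;

# Single-quoted heredoc: backslashes reach each rule exactly as typed.
my @rules = grep { /\S/ } split /^/, <<'EORULES';
s/\\emph\{(.*?)\}/<em>$1<\/em>/g;
s/\\chapter\{(.*?)\}/<h1>$1<\/h1>/g;
tr/~/ /;
EORULES

sub convert {
    my ($text) = @_;
    local $_ = $text;    # rules operate on $_
    for my $rule (@rules) {
        eval $rule;
        # Check $@ rather than eval's return value (see the note above).
        die "bad rule '$rule': $@" if $@;
    }
    return $_;
}

print convert('\chapter{Intro} A~\emph{small} test.'), "\n";
```

Because each rule is a full statement, a tr or a multi-statement rule sits alongside the substitutions without any special handling.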