From Peter at PSDT.com Mon Feb 2 14:20:34 2009
From: Peter at PSDT.com (Peter Scott)
Date: Mon, 02 Feb 2009 14:20:34 -0800
Subject: [VPM] Fwd: [reccompsci] Re: Impending mtg. Call for talks. (fwd)
Message-ID: <6.2.3.4.2.20090202141433.038ef920@mail.webquarry.com>

Although my schedule doesn't permit me to attend tomorrow evening, this
looks like a very interesting talk and one I am happy to forward to
Victoria.pm.

>My name is Cy Schubert. I'm a member of RCSS (Recreational Computer
>Science). Darren Duncan will be giving a talk about a project he has been
>working on called Set::Relation to implement a standalone portion of a new
>programming language called Muldis D. RCSS would like to invite the folks
>at Victoria PM to the RCSS meeting. The announcement is attached. The RCSS
>meeting is on Tuesday (tomorrow), Feb 3 at UVic in the ECS building, room
>TBD.
>
>Would you mind posting this on your mailing list? Thanks.
>
>
>--
>Cheers,
>Cy Schubert
>FreeBSD UNIX: Web: http://www.FreeBSD.org
>
> e**(i*pi)+1=0
>
>
>------- Forwarded Message
>
>Date: Mon, 02 Feb 2009 00:09:02 -0800
>From: Darren Duncan
>To: reccompsci at googlegroups.com
>Subject: [reccompsci] Re: Impending mtg. Call for talks.
>
>
>Since I'm not actually sure who is compiling the official RCSS meeting
>announcements this month, I'm going to post my revised details of my talk
>to the list. The details appear below the dashed line.
>
>I still hope an official announcement will be sent to the list asap, which
>also contains the other details like location and what else may happen
>that night etc, as typically happens, and then I'll forward that to
>several lists I'm on.
>
>And I did hear agreement from both RCSS and Victoria.pm people that this
>can be a joint meeting ... similar fare to RCSS+VLUG in January, but RCSS
>is the host timeslot this time.
>
>Thank you.
-- Darren Duncan
>
>- ----------------------------------------
>
>Darren Duncan will give a talk on a personal project, the new Perl module
>Set::Relation, which implements a standalone portion of the new programming
>language Muldis D that Darren created. See
>http://search.cpan.org/dist/Set-Relation/ to see said module with
>documentation or to download it. It is functional and can be used right
>now; though some features are missing, the most important ones are
>present, and the rest should be implemented within a few days of the talk,
>if not beforehand.
>
>Set::Relation provides a simple Perl-native facility for an application to
>organize and process information using the relational model of data,
>without having to employ a separate DBMS, and without having to employ a
>whole separate sub-language (such as Muldis Rosetta does). Rather, it is
>integrated a lot more into the Perl way of doing things, and you use it
>much like a Perl array or hash, or like some other third-party Set::
>modules available for Perl. This is a standalone Perl 5 object class that
>represents a Muldis D quasi-relation value, and its methods implement all
>the Muldis D relational operators.
>
>A simple working example:
>
>    use Set::Relation;
>
>    my $r1 = Set::Relation->new( members => [ [ 'x', 'y' ], [
>        [ 4, 7 ],
>        [ 3, 2 ],
>    ] ] );
>
>    my $r2 = Set::Relation->new( members => [
>        { 'y' => 5, 'z' => 6 },
>        { 'y' => 2, 'z' => 1 },
>        { 'y' => 2, 'z' => 4 },
>    ] );
>
>    my $r3 = $r1->join( $r2 );
>
>    my $r3_as_nfmt_perl = $r3->members();
>    my $r3_as_ofmt_perl = $r3->members( 1 );
>
>    # Then $r3_as_nfmt_perl contains:
>    #     [
>    #         { 'x' => 3, 'y' => 2, 'z' => 1 },
>    #         { 'x' => 3, 'y' => 2, 'z' => 4 },
>    #     ]
>    # And $r3_as_ofmt_perl contains:
>    #     [ [ 'x', 'y', 'z' ], [
>    #         [ 3, 2, 1 ],
>    #         [ 3, 2, 4 ],
>    #     ] ]
>
>This talk will focus on describing the features of Set::Relation,
>discussing what the module does, how you could use it, such as to
>accomplish tasks directly in your program that you might otherwise offload
>to a SQL DBMS or other tool, and will go into detail on how the module is
>designed and works, including a source code walk through. The talk will
>also shed some light on other larger projects of Darren and others. This
>talk has no prepared slide show or handouts, and any visuals will be the
>module source as well as white/chalk-board diagrams. Questions are
>encouraged at any time, so that time can be focused on the areas you are
>most interested in.
>
>
>- --~--~---------~--~----~------------~-------~--~----~
>You received this message because you are subscribed to the Google Groups
>"Recreational Computer Science Society" group.
>To post to this group, send email to reccompsci at googlegroups.com
>To unsubscribe from this group, send email to
>reccompsci+unsubscribe at googlegroups.com
>For more options, visit this group at
>http://groups.google.com/group/reccompsci?hl=en
>- -~----------~----~----~----~------~----~------~--~---
>
>------- End of Forwarded Message

--
Peter Scott
Pacific Systems Design Technologies
http://www.perldebugged.com/
http://www.perlmedic.com/

From darren at darrenduncan.net Mon Feb 2 15:22:19 2009
From: darren at darrenduncan.net (Darren Duncan)
Date: Mon, 02 Feb 2009 15:22:19 -0800
Subject: [VPM] Tue, 2009 Feb 3rd, 7pm - February RCSS+VPM meeting
Message-ID: <4987802B.2040902@darrenduncan.net>

Greetings,

This message is to announce the February meeting of the Recreational
Computer Science Society, which this month is being held jointly with the
Victoria Perl Mongers. It is being held Tuesday, February 3, 7pm, at UVIC
in the ECS (Engineering Computer Science) building, exact room TBD but it
should be on the ground floor. Darren Duncan is giving the talk, a summary
of which follows the dashed line. After the meeting, we will quite likely
adjourn to Maude Hunter's for the usual drinks and late dinner/snacks.

Hope to see you there. -- Darren Duncan

----------------------------------------

Darren Duncan will give a talk on a personal project, the new Perl module
Set::Relation, which implements a standalone portion of the new programming
language Muldis D that Darren created. See
http://search.cpan.org/dist/Set-Relation/ to see said module with
documentation or to download it. It is functional and can be used right
now; though some features are missing, the most important ones are present,
and the rest should be implemented within a few days of the talk, if not
beforehand.
Set::Relation provides a simple Perl-native facility for an application to
organize and process information using the relational model of data,
without having to employ a separate DBMS, and without having to employ a
whole separate sub-language (such as Muldis Rosetta does). Rather, it is
integrated a lot more into the Perl way of doing things, and you use it
much like a Perl array or hash, or like some other third-party Set::
modules available for Perl. This is a standalone Perl 5 object class that
represents a Muldis D quasi-relation value, and its methods implement all
the Muldis D relational operators.

A simple working example:

    use Set::Relation;

    my $r1 = Set::Relation->new( members => [ [ 'x', 'y' ], [
        [ 4, 7 ],
        [ 3, 2 ],
    ] ] );

    my $r2 = Set::Relation->new( members => [
        { 'y' => 5, 'z' => 6 },
        { 'y' => 2, 'z' => 1 },
        { 'y' => 2, 'z' => 4 },
    ] );

    my $r3 = $r1->join( $r2 );

    my $r3_as_nfmt_perl = $r3->members();
    my $r3_as_ofmt_perl = $r3->members( 1 );

    # Then $r3_as_nfmt_perl contains:
    #     [
    #         { 'x' => 3, 'y' => 2, 'z' => 1 },
    #         { 'x' => 3, 'y' => 2, 'z' => 4 },
    #     ]
    # And $r3_as_ofmt_perl contains:
    #     [ [ 'x', 'y', 'z' ], [
    #         [ 3, 2, 1 ],
    #         [ 3, 2, 4 ],
    #     ] ]

This talk will focus on describing the features of Set::Relation,
discussing what the module does, how you could use it, such as to
accomplish tasks directly in your program that you might otherwise offload
to a SQL DBMS or other tool, and will go into detail on how the module is
designed and works, including a source code walk through. The talk will
also shed some light on other larger projects of Darren and others. This
talk has no prepared slide show or handouts, and any visuals will be the
module source as well as white/chalk-board diagrams. Questions are
encouraged at any time, so that time can be focused on the areas you are
most interested in.
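As a companion to the example above: the `join` method shown is an ordinary relational natural join, in which tuples are matched on their common attribute names ('y' here) and merged. The following is a minimal sketch of that semantics, written in Python purely so it is self-contained and runnable (the `natural_join` helper is made up for illustration and is not part of Set::Relation's API):

```python
# Model a "relation" as a list of attribute-name => value dicts, as in the
# example's named format.  Assumes every tuple in a relation shares one
# heading, as Set::Relation requires.

def natural_join(r1, r2):
    """Match tuples on their common attribute names, then merge them."""
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    out = []
    for t1 in r1:
        for t2 in r2:
            if all(t1[a] == t2[a] for a in common):
                merged = {**t1, **t2}
                if merged not in out:  # a relation is a set: no duplicates
                    out.append(merged)
    return out

r1 = [{'x': 4, 'y': 7}, {'x': 3, 'y': 2}]
r2 = [{'y': 5, 'z': 6}, {'y': 2, 'z': 1}, {'y': 2, 'z': 4}]

print(natural_join(r1, r2))
# [{'x': 3, 'y': 2, 'z': 1}, {'x': 3, 'y': 2, 'z': 4}]
```

Note how the {x=4, y=7} tuple matches nothing in r2 and so drops out, while {x=3, y=2} matches two tuples, reproducing the two-row result shown in the example's comments.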
From darren at darrenduncan.net Thu Feb 5 18:08:40 2009
From: darren at darrenduncan.net (Darren Duncan)
Date: Thu, 05 Feb 2009 18:08:40 -0800
Subject: [VPM] follow-up to my Feb 3 talk
Message-ID: <498B9BA8.4080600@darrenduncan.net>

All,

As a follow-up to my talk on Tuesday, and inspired by some feedback I got
from those of you who were there, I have just written a new documentation
section for Set::Relation (in POD), a copy of which appears below. It
should be read as having been inserted at
http://search.cpan.org/dist/Set-Relation/lib/Set/Relation.pm below
"DESCRIPTION" and above "Matters of Value Identity".

If you have any further feedback along the lines of trying to understand
what the purpose of Set::Relation is, let me know what changes you think I
should make, or whether I mentioned a good use example at the talk that I
forgot to include below.

Thank you. -- Darren Duncan

-----------

=head2 Appropriate Uses For Set::Relation

Set::Relation is intended to be used in production environments. It has
been developed according to a rigorously thought out API and behaviour
specification, and it should be easy to learn, to install and use, and to
extend. It is expected to be maintained and supported by the original
author over the long term, either standalone or by the author providing an
effective migration path to a suitable replacement.

At the same time, the primary design goal of Set::Relation is to be simple.
Set::Relation focuses on providing all the operators of the relational
model of data to Perl developers in as concise a manner as is possible,
while focusing on correctness of behaviour above all else, and also
focusing on ease of understanding and maintainability, since generally a
developer's time is the most valuable resource for us to conserve.

Despite initial design efforts to help Set::Relation's execution (CPU, RAM,
etc) performance, this module is still assumed to be very un-optimized for
its conceptually low level task of data crunching.
It generally applies the same internal representation and algorithms regardless of the actual structure or meaning of the data, and regardless of the amount of data. It generically applies certain up-front costs in the form of data hashing that should both speed up later operations and simplify the implementation code of most operations, but any actual performance benefit depends a lot on actual use, and it may even have a net loss of execution performance. This module is still assumed to be considerably, perhaps an order or three of magnitude, slower than a hand-rolled task-specific solution. If your primary concern is execution performance, you will most likely not want to use Set::Relation but rather hand-code what it does specifically for your task with your specific data, or alternately employ some other dependency to do the work for you (or even, if necessary, write the task in C). Set::Relation is best used in situations where you either want to just get some correct solution up and working quickly (conserving developer time), such as because it is a prototype or proof of concept, or where your data set is relatively small (so Set::Relation's overhead cost is less noticeable), or where your task is one that is less time sensitive like a batch process where a longer execution time isn't harmful. Some specific suggested uses for Set::Relation are: =over =item Flat File Processing Use it to simplify some kinds of batch processing data from flat files such as CSV text files; a Set::Relation object can be used to store the content of one source file, and then the relational operators can be used to easily join or filter the file contents against each other, and eventually reports or other results be produced. =item SQL Generation Use it when gathering and pre-processing data that needs to end up in a SQL database for longer term use. 
If you generate your INSERT SQL from Set::Relation objects then those objects can help you do it all in a batch up front, and Set::Relation can help you test for duplicates or various kinds of dirty data that might violate database constraints. So it is less likely that you would need to connect to your SQL database interactively to test your data against it before insertion, and it is more likely you can just talk to it once when your single batch of SQL INSERTs is ready to go. =item Database APIs Various DBMS wrappers, ORMs, persistence tools, etc can use Set::Relation objects internally or as part of their API to represent database row sets. Wrappers that like to do some database-like work internally, such as associating parent and child row sets, or testing key constraints, or various other tasks can use Set::Relation to do some of their work for them, making development and maintenance of said tools easier. Note that in general this would fall under the "small data set" use category since a large number of applications, particularly web apps, just access or display from one to a hundred rows at a time. =item Testing Since it represents row-sets and provides all the relational operators, with a focus on correctness, Set::Relation should be useful in helping to test all sorts of other code intended to work with databases, particularly code that is a wrapper for a database, as a basis for comparison to whether the other code is having correct behaviour or not. For example, it could help test that code which generates and runs SQL is producing the correct results with various inputs and scenarios. =item Teaching Set::Relation should be helpful in teaching the relational model to people, helping them to know what is really going on conceptually with different operations, without being distracted by a lot of ancillary matters, and without being distracted by limitations of various DBMSs that may not expose the whole relational model or may do it incorrectly. 
It provides something students can experiment with right now.

=item General List or Set Operations

Set::Relation is complementary to the things you do with Perl's built-in
Array and Hash types, including 'map' and 'grep' operations. It is useful
when you want to do miscellaneous combining or filtering of lists of data
against other lists, particularly multi-dimensional ones, or for helping to
summarize lists of data for reports. Maybe it helps with some tasks that
are easier in Perl 6 than in Perl 5, when you're using Perl 5.

=back

Of course, like any generic tool, Set::Relation should be widely applicable
in many different situations.

Now, another situation where you may not want to use Set::Relation is when
its sibling project Muldis::Rosetta would serve you better. In contrast to
Set::Relation, which is standalone and intended to integrate closely with
Perl, Muldis Rosetta implements a whole programming language distinct from
Perl, Muldis D, and presents a superior environment at large for working
with the relational model of data; you use it sort of like how you use DBI
to talk to a SQL DBMS, as a separate thing walled off from Perl. The
benefits of using Muldis Rosetta over Set::Relation are multiple, including
much better performance and scalability, and that it can directly persist
data as you'd expect a DBMS to do, as well as provide easy access to many
other relational model features like stronger typing, arbitrary database
constraints, and nested transactions, as well as access to full powered
DBMS engines like PostgreSQL and Oracle and SQLite (though you don't have
to use those, and Muldis Rosetta can be used purely implemented in Perl).

That brings out another important reason why Set::Relation exists now; it
also serves as a proof of concept for a main part of Muldis D and Muldis
Rosetta, or for so-called "truly relational DBMSs" in general.
It demonstrates ideal features and behaviour for relational operators, in a
functioning form that users can experiment with right now. Set::Relation
is also meant to serve as inspiration for similar projects, and to better
illustrate features that would be nice for modern programming languages to
have built-in, the same as they have collection types like ordered and
associative arrays and one-dimensional sets and bags built-in. It is
reasonable for standard equipment to include not just plain set operators
but the other relational model operators too, such as relational join.

From darren at darrenduncan.net Tue Feb 10 00:41:46 2009
From: darren at darrenduncan.net (Darren Duncan)
Date: Tue, 10 Feb 2009 00:41:46 -0800
Subject: [VPM] ANNOUNCE - Set::Relation version 0.6.0 for Perl 5
Message-ID: <49913DCA.70509@darrenduncan.net>

P.S. This is the same project I gave a talk on last week to RCSS/VPM; at
that time some short term planned features were missing, but now they are
present. Also, following feedback from said meeting, the current release
is better explained as to its purpose etc.

----------

All,

I am pleased to announce the first (widely announced, and the 9th actual)
release of Set::Relation, the official/unembraced version 0.6.0 for Perl 5,
on CPAN. You can see it now, with nicely HTMLized documentation, at:

http://search.cpan.org/dist/Set-Relation/

A short summary description with synopsis code is further below in this
message.

While new, Set::Relation is effectively done (enough for a first major
version), with a full feature set and with everything fully documented in
POD, and you can start actually using it now. That said, this module is
officially in alpha release status, so you should take caution with it.
While its API is unlikely to change much, and the code appears correct, a
lot of it has not yet actually been executed, and the current test suite is
almost empty. The module will probably work now, but it might break.
See further below if you'd like to help out with this module's future development. Also expected in the near future, though not today, is a corresponding version for Perl 6, which was intended from day one. The official discussion forums for Set::Relation currently are just the email based ones listed at http://mm.darrenduncan.net/mailman/listinfo and labeled 'muldis-db'; the FORUMS pod section in Relation.pm itself also lists these. Any protracted discussion following this announcement would ideally take place there, so it is easy to find aggregate information resulting from said discussions. As for replying in other forums, use your discretion as usual. No official IRC forums for Set::Relation or other Muldis database-related things exist yet, though in the near future I expect I would get one setup on perl.org or freenode.org, preferably I would want a logged channel. -------- Set::Relation provides a simple Perl-native facility for an application to organize and process information using the relational model of data, without having to employ a separate DBMS, and without having to employ a whole separate sub-language (such as Muldis Rosetta does). Rather, it is integrated a lot more into the Perl way of doing things, and you use it much like a Perl array or hash, or like some other third-party Set:: modules available for Perl. This is a standalone Perl 5 object class that represents a Muldis D quasi-relation value, and its methods implement all the Muldis D relational operators. 
A simple working example: use Set::Relation; my $r1 = Set::Relation->new( [ [ 'x', 'y' ], [ [ 4, 7 ], [ 3, 2 ], ] ] ); my $r2 = Set::Relation->new( [ { 'y' => 5, 'z' => 6 }, { 'y' => 2, 'z' => 1 }, { 'y' => 2, 'z' => 4 }, ] ); my $r3 = $r1->join( $r2 ); my $r3_as_nfmt_perl = $r3->members(); my $r3_as_ofmt_perl = $r3->members( 1 ); # Then $r3_as_nfmt_perl contains: # [ # { 'x' => 3, 'y' => 2, 'z' => 1 }, # { 'x' => 3, 'y' => 2, 'z' => 4 }, # ] # And $r3_as_ofmt_perl contains: # [ [ 'x', 'y', 'z' ], [ # [ 3, 2, 1 ], # [ 3, 2, 4 ], # ] ] This is the initial complement of public routines; besides the "new" constructor submethod, there are these 68 object methods: "clone", "export_for_new", "has_frozen_identity", "freeze_identity", "which", "members", "heading", "body", "slice", "attr", "evacuate", "insert", "delete", "degree", "is_nullary", "has_attrs", "attr_names", "cardinality", "is_empty", "is_member", "empty", "insertion", "deletion", "rename", "projection", "cmpl_projection", "wrap", "cmpl_wrap", "unwrap", "group", "cmpl_group", "ungroup", "transitive_closure", "restriction", "restriction_and_cmpl", "cmpl_restriction", "extension", "static_extension", "map", "summary", "is_identical", "is_subset", "is_proper_subset", "is_disjoint", "union", "exclusion", "intersection", "difference", "semidifference", "semijoin_and_diff", "semijoin", "join", "product", "quotient", "composition", "join_with_group", "rank", "limit", "substitution", "static_substitution", "subst_in_restr", "static_subst_in_restr", "subst_in_semijoin", "static_subst_in_semijoin", "outer_join_with_group", "outer_join_with_undefs", "outer_join_with_static_exten", "outer_join_with_exten". 
It is important to note that practically anything you can do in a SQL
SELECT (and in various other kinds of SQL), for any vendor of DBMS, you can
do with the Set::Relation routines (and ordinary Perl); in the short term a
"how do I" kind of FAQ or tutorial will be made, but it doesn't exist yet;
meanwhile you should be able to figure it out using the routines' reference
documentation. For examples:

1. The "SELECT ... FROM $foo" query portion is handled by any of
[projection, extension, rename, map, substitution, etc].
2. The "WHERE" and "HAVING" clauses are handled by [restriction, semijoin,
semidifference, etc], which includes "IN" and "NOT IN".
3. The "GROUP BY" is handled by [group, cmpl_group, etc].
4. Aggregation operators combined with "GROUP BY" are handled by
[summary, etc].
5. Ranking, sorting and quota queries like "RANK", "ORDER BY" and "LIMIT"
are handled by [rank, limit, etc].
6. Inner joins are handled by [join, product, intersection, etc].
7. Outer joins are handled by the various [outer_join_*, etc].
8. Union, intersection, difference, etc are handled by the same.
9. "COUNT(*)" is handled by [cardinality].
10. Recursive queries are handled by [transitive_closure, etc].
11. Sub-queries are supported everywhere simply as the normal way of doing
things.
12. Other features like relational divide, composition, etc are given by
[quotient, composition, etc].

Set::Relation is a generic tool and can be widely applied. It has been
developed according to a rigorously thought out API and behaviour
specification, and it should be easy to learn, to install and use, and to
extend. But in the short term at least, this module is still assumed to be
very un-optimized for its conceptually low level task of data crunching,
and you may want to avoid it if your top concern is execution (CPU, RAM,
etc) performance.
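To make points 1 and 2 of the SQL correspondence above concrete, here is a rough sketch in Python (used only so the sketch runs standalone; the function names and the `suppliers` data are made up for illustration — in Set::Relation itself these are the `projection` and `restriction` methods operating on relation objects):

```python
# Projection plays the role of SQL's SELECT column list; restriction plays
# the role of the WHERE clause.  Relations are modeled as lists of dicts.

def projection(rel, attrs):
    """Keep only the named attributes, dropping duplicate tuples that
    result (a relation is a set, unlike a SQL result table)."""
    out = []
    for t in rel:
        p = {a: t[a] for a in attrs}
        if p not in out:
            out.append(p)
    return out

def restriction(rel, pred):
    """Keep only the tuples matching the predicate."""
    return [t for t in rel if pred(t)]

# Hypothetical example data, not taken from the module's docs.
suppliers = [
    {'sno': 'S1', 'city': 'London'},
    {'sno': 'S2', 'city': 'Paris'},
    {'sno': 'S3', 'city': 'Paris'},
]

# Roughly: SELECT city FROM suppliers WHERE city <> 'London'
result = projection(
    restriction(suppliers, lambda t: t['city'] != 'London'), ['city'])
print(result)  # [{'city': 'Paris'}]
```

The duplicate-elimination in `projection` is the notable difference from a plain SQL SELECT: two Paris suppliers collapse into one tuple, which is exactly the set semantics the relational model (and Set::Relation) prescribes.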
Set::Relation is best used in situations where you either want to just get
some correct solution up and working quickly (conserving developer time),
such as because it is a prototype or proof of concept, or where your data
set is relatively small, or where your task is one that is less time
sensitive, like a batch process.

Some suggested uses for Set::Relation include applying it to help with:
flat file processing, SQL generation, database APIs, testing database
related code, teaching databases, and general list or set operations. See
http://search.cpan.org/dist/Set-Relation/lib/Set/Relation.pm#Appropriate_Uses_For_Set::Relation
for more details.

Set::Relation's performance will be improved over time, so some of these
issues should go away later; alternately, the sibling project
Muldis::Rosetta (still under construction) will have much better
performance anyway, thanks to its greater complexity, which addresses such
matters.

Set::Relation requires Perl 5.8.1+, Moose 0.68+, version.pm 0.74+,
namespace::clean 0.09+, and List::MoreUtils 0.22+; it has no other direct
external dependencies. This module is pure Perl and a single file. It is
now maintained in a Git repository; see
http://utsl.gen.nz/gitweb/?p=Set-Relation or the distribution's README
file.

If you like Set::Relation, either as it is now or as you see it becoming,
and you would like to help improve it, I welcome any and all kinds of
assistance you would like to offer. Probably the greatest help people can
give is to supply test files that confirm correct behaviour and expose
current or regression bugs; other Set:: modules or database-related modules
may be an inspiration for copying/adapting tests. I would also like to
build up a set of usage examples and basic tutorials, meant to answer the
sort of question "how do I do this?".
For example, within the context of a relational database represented as a
Perl Hash whose elements are Set::Relation objects representing SQL/etc
tables/relvars, I would like a number of brief problem descriptions that
provide example database schemas and data (multiple questions/examples can
share the same schema/data), saying first in a sentence what a query is
trying to find out, then giving example SQL/etc to do it; for each example,
I/we would then supply Perl code for how to do the same thing with
Set::Relation, so that we have a side-by-side comparison.

Otherwise, I invite feedback on all aspects of the module's design,
implementation, and documentation. For example: What sorts of changes do
you suggest to the criteria Set::Relation uses to determine whether 2
arbitrary Perl values are to be considered identical or not (that's a big
one)? What sorts of typical module serialization hooks should I or should
I not be using as object identifiers? Is the documentation structured the
best way it could be? Is the module making as much use of Moose's features
as it can be, or making as much use of the lesser known power features of
Perl 5 itself as it should be? Do you think details of the module's API or
semantics should change, such as to better integrate it into typical or
best practice ways of using Perl? What additional prior art, such as other
Perl modules, should I be looking at, either that Set::Relation should use
as a dependency, or that it should copy/adapt functionality or techniques
from? How are you applying, or would you consider applying, Set::Relation
to your work, and what changes, if any, might help you adopt it more
easily? Do you propose different internal syntax for the module's code, or
propose a different factoring of the code? Can you suggest a better way to
package the module; e.g., would you propose an alternative to the simple
Makefile.PL? Do you propose a particular structure for the test suite?
What about examples and tutorials; how might those best be organized, and
what sorts of things should they contain? What can you suggest for helping
performance? And then there's Perl 6; do you have suggestions for
particular Perl 6 features that should be exploited for Set::Relation's
Perl 6 native version? Or do you have ideas for the Perl 6 language
itself, to adapt distinct Set::Relation features into Perl 6 itself as if a
relation were just another generic collection type (which it is)?

Note that the work done on Set::Relation, and on improving and testing it,
will later feed back into implementing Muldis::Rosetta, whose design
overlaps. It is very helpful to me if Set::Relation can be made the best
it can be, as soon as possible, so as to make said feedback more timely.

Thank you and have a good day. -- Darren Duncan

From jeremygwa at hotmail.com Mon Feb 23 15:16:55 2009
From: jeremygwa at hotmail.com (Jer A)
Date: Mon, 23 Feb 2009 15:16:55 -0800
Subject: [VPM] link 'bot' protection
Message-ID:

hi all,

I am designing a website service.

How do I prevent automated bots, link scrapers, and cross-site scripts from
accessing the site, without hindering the user experience, and without
hindering the performance of the host/server/site?

My site is not graphic intensive, and I do not think anyone would be
interested in grabbing anything that is graphical, only Information/Data.

I have thought of banning IPs by parsing log files, but what should I look
for that is 'fishy'?

Thanks in advance for all advice/help.

Regards,
Jeremy

_________________________________________________________________
Windows Live Messenger. Multitasking at its finest.
http://www.microsoft.com/windows/windowslive/products/messenger.aspx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From matt at elrod.ca Mon Feb 23 16:13:03 2009 From: matt at elrod.ca (Matt Elrod) Date: Mon, 23 Feb 2009 16:13:03 -0800 Subject: [VPM] link 'bot' protection In-Reply-To: References: Message-ID: <49A33B8F.3020600@elrod.ca> I should think you would want to "throttle" bots by timing their requests and temporarily banning IPs that exceed a speed limit. You can specify a preferred delay in your robots.txt file to give fair warning. Granted, giving bots a chance to exceed your speed limit gives them a chance to slurp some of your data, but if your code blocks them after a dozen or so rapid requests, they won't get far. The user-agent variable is easily forged, so speed of requests is the only reliable way of spotting bots that I am aware of. Matt Elrod Jer A wrote: > hi all, > > I am designing a website service. > > how do i prevent automated bots and link scrapers and cross-site scripts > from access to the site, without hindering the user experience, as well > as hindering the performance of the host/server/site? > > My site is not graphic intensive, and I do not think anyone would be > interest at grabbing anything that is graphical, only Information/Data. > > I have thought of banning ip's by parsing log files, but what should I > look for that is 'fishy'? > > Thanks in advance for all advice/help. > > Regards, > Jeremy From jeremygwa at hotmail.com Mon Feb 23 16:57:39 2009 From: jeremygwa at hotmail.com (Jer A) Date: Mon, 23 Feb 2009 16:57:39 -0800 Subject: [VPM] link 'bot' protection In-Reply-To: <49A33B8F.3020600@elrod.ca> References: <49A33B8F.3020600@elrod.ca> Message-ID: If I were to ban by ip, what if it were only one bad machine in a large network behind a router.....will it block the entire network? 
Thanks again,
Jeremy

> Date: Mon, 23 Feb 2009 16:13:03 -0800
> From: matt at elrod.ca
> To: jeremygwa at hotmail.com
> CC: victoria-pm at pm.org
> Subject: Re: [VPM] link 'bot' protection
>
>
> I should think you would want to "throttle" bots by timing
> their requests and temporarily banning IPs that exceed a
> speed limit. You can specify a preferred delay in your
> robots.txt file to give fair warning.
>
> Granted, giving bots a chance to exceed your speed limit
> gives them a chance to slurp some of your data, but if
> your code blocks them after a dozen or so rapid requests,
> they won't get far.
>
> The user-agent variable is easily forged, so speed of
> requests is the only reliable way of spotting bots that
> I am aware of.
>
> Matt Elrod
>
> Jer A wrote:
> > hi all,
> >
> > I am designing a website service.
> >
> > how do i prevent automated bots and link scrapers and cross-site scripts
> > from access to the site, without hindering the user experience, as well
> > as hindering the performance of the host/server/site?
> >
> > My site is not graphic intensive, and I do not think anyone would be
> > interest at grabbing anything that is graphical, only Information/Data.
> >
> > I have thought of banning ip's by parsing log files, but what should I
> > look for that is 'fishy'?
> >
> > Thanks in advance for all advice/help.
> >
> > Regards,
> > Jeremy
>

_________________________________________________________________
The new Windows Live Messenger. You don't want to miss this.
http://www.microsoft.com/windows/windowslive/products/messenger.aspx
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jeremygwa at hotmail.com Mon Feb 23 17:05:43 2009
From: jeremygwa at hotmail.com (Jer A)
Date: Mon, 23 Feb 2009 17:05:43 -0800
Subject: [VPM] link 'bot' protection
In-Reply-To: <513860.22151.qm@web36204.mail.mud.yahoo.com>
References: <513860.22151.qm@web36204.mail.mud.yahoo.com>
Message-ID:

Thank you for your response.
what can i also do to prevent cross site scripting....eg, if some one finds the html code that references to the form cgi script.....and calls it from their own site for example....is there anything in perl that would allow client computers access (eg. surfers), but block other domains (websites)? > Date: Mon, 23 Feb 2009 16:31:55 -0800 > From: semaphore_2000 at yahoo.com > Subject: Re: [VPM] link 'bot' protection > To: jeremygwa at hotmail.com > > > I think ultimately, that's fighting a rear-guard type of action. There are ways of blocking clients that grab too much too fast (many bots grab lots of pages in a short time so can be detected like that). There are other tricks like that too. But if the scraper or bot is written correctly, and is polite, taking pages slowly, ignores robots.txt and uses a user-agent string that looks like an existing browser, then you'd have a hard time telling. Maybe use javascript to present the info so that scrapers that don't use javascript can't see it. > > Anyway, I write web scrapers (er, in perl - nice, well-behaved bots that do not suck a server's resources) and if you'd like I can help you test. You might try yourself by playing with the CPAN mech-shell perhaps... > > Doug > > > --- On Mon, 2/23/09, Jer A wrote: > > > From: Jer A > > Subject: [VPM] link 'bot' protection > > To: victoria-pm at pm.org > > Date: Monday, February 23, 2009, 6:16 PM > > hi all, > > > > I am designing a website service. > > > > how do i prevent automated bots and link scrapers and > > cross-site scripts from access to the site, without > > hindering the user experience, as well as hindering the > > performance of the host/server/site? > > > > My site is not graphic intensive, and I do not think anyone > > would be interest at grabbing anything that is graphical, > > only Information/Data. > > > > I have thought of banning ip's by parsing log files, > > but what should I look for that is 'fishy'? > > > > Thanks in advance for all advice/help. 
> > > > Regards, > > Jeremy > > > > > > _________________________________________________________________ > > Windows Live Messenger. Multitasking at its finest. > > http://www.microsoft.com/windows/windowslive/products/messenger.aspx_______________________________________________ > > Victoria-pm mailing list > > Victoria-pm at pm.org > > http://mail.pm.org/mailman/listinfo/victoria-pm From matt at elrod.ca Mon Feb 23 17:24:36 2009 From: matt at elrod.ca (Matt Elrod) Date: Mon, 23 Feb 2009 17:24:36 -0800 Subject: [VPM] link 'bot' protection In-Reply-To: References: <513860.22151.qm@web36204.mail.mud.yahoo.com> Message-ID: <49A34C54.3010308@elrod.ca> I sometimes obscure the "action" in my forms with javascript. Obviously you can look at the referrer, but it too can be easily forged. You can have your form send a token to your cgi. I sometimes have the form insert a unix timestamp in a hidden field and then reject post data if it comes in too fast or too slow. That is, each generation of the form is only good for X minutes. Granted, this technique could be reverse engineered, and the parasite might insert a valid timestamp in their post data, but I expect it thwarts most unwelcome accesses. HTH, Matt Jer A wrote: > > Thank you for your response. > > what can i also do to prevent cross site scripting....eg, if some one > finds the html code that references to the form cgi script.....and calls > it from their own site for example....is there anything in perl that > would allow client computers access (eg. surfers), but block other > domains (websites)? 
> > > > Date: Mon, 23 Feb 2009 16:31:55 -0800 > > From: semaphore_2000 at yahoo.com > > Subject: Re: [VPM] link 'bot' protection > > To: jeremygwa at hotmail.com > > > > > > I think ultimately, that's fighting a rear-guard type of action. > There are ways of blocking clients that grab too much too fast (many > bots grab lots of pages in a short time so can be detected like that). > There are other tricks like that too. But if the scraper or bot is > written correctly, and is polite, taking pages slowly, ignores > robots.txt and uses a user-agent string that looks like an existing > browser, then you'd have a hard time telling. Maybe use javascript to > present the info so that scrapers that don't use javascript can't see it. > > > > Anyway, I write web scrapers (er, in perl - nice, well-behaved bots > that do not suck a server's resources) and if you'd like I can help you > test. You might try yourself by playing with the CPAN mech-shell perhaps... > > > > Doug > > > > > > --- On Mon, 2/23/09, Jer A wrote: > > > > > From: Jer A > > > Subject: [VPM] link 'bot' protection > > > To: victoria-pm at pm.org > > > Date: Monday, February 23, 2009, 6:16 PM > > > hi all, > > > > > > I am designing a website service. > > > > > > how do i prevent automated bots and link scrapers and > > > cross-site scripts from access to the site, without > > > hindering the user experience, as well as hindering the > > > performance of the host/server/site? > > > > > > My site is not graphic intensive, and I do not think anyone > > > would be interest at grabbing anything that is graphical, > > > only Information/Data. > > > > > > I have thought of banning ip's by parsing log files, > > > but what should I look for that is 'fishy'? > > > > > > Thanks in advance for all advice/help. > > > > > > Regards, > > > Jeremy > > > > > > > > > _________________________________________________________________ > > > Windows Live Messenger. Multitasking at its finest. 
> > > > http://www.microsoft.com/windows/windowslive/products/messenger.aspx_______________________________________________ > > > Victoria-pm mailing list > > > Victoria-pm at pm.org > > > http://mail.pm.org/mailman/listinfo/victoria-pm > > > > > > > > ------------------------------------------------------------------------ > So many new options, so little time. Windows Live Messenger. > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Victoria-pm mailing list > Victoria-pm at pm.org > http://mail.pm.org/mailman/listinfo/victoria-pm From matt at elrod.ca Mon Feb 23 17:30:35 2009 From: matt at elrod.ca (Matt Elrod) Date: Mon, 23 Feb 2009 17:30:35 -0800 Subject: [VPM] link 'bot' protection In-Reply-To: <49A34C54.3010308@elrod.ca> References: <513860.22151.qm@web36204.mail.mud.yahoo.com> <49A34C54.3010308@elrod.ca> Message-ID: <49A34DBB.4010504@elrod.ca> You could do it with a session cookie as well. Matt Matt Elrod wrote: > You can have your form send a token to your cgi. I sometimes have > the form insert a unix timestamp in a hidden field and then reject > post data if it comes in too fast or too slow. From mock at obscurity.org Mon Feb 23 19:52:41 2009 From: mock at obscurity.org (mock) Date: Tue, 24 Feb 2009 03:52:41 +0000 Subject: [VPM] link 'bot' protection In-Reply-To: References: <49A33B8F.3020600@elrod.ca> Message-ID: <20090224035241.GF7144@obscurity.org> On Mon, Feb 23, 2009 at 04:57:39PM -0800, Jer A wrote: > > > If I were to ban by ip, what if it were only one bad machine in a large network behind a router.....will it block the entire network? > If you do this by IP you'll end up banning AOL users and anyone else who's part of a large number of users behind NAT, plus it's relatively trivial to get around. The short answer is, you can't accomplish what you're trying to accomplish. The longer more complex answer is, it depends on the value of your data. 
There are a number of techniques you can use to make it more annoying to write bots, but there is no silver bullet that will prevent automation. If someone is motivated enough to want to scrape your data, they can, and nothing you can do will stop them. Also, it only takes one motivated person to release a library and all the other less motivated people will be able to do it too. If your business model relies on this being impossible, you probably should rethink things. Now, all that said, here's some ways you can put up stumbling blocks and their various flaws. Obfuscation - You can use javascript to deobfuscate the contents of the page on the fly. This relies on the fact that most bots don't understand Javascript, and thus the page will be unreadable to them. The flaws are it will totally screw over your google ranking (google relies on bots) and CPAN has at least a couple of modules that will allow you to build a bot that either automates a real browser (and thus understands javascript) or adds Javascript functionality to WWW::Mechanize. Plus it's trivial to just use firebug to figure out what's actually going on. Captcha - You can make a "human only understandable" test and require users to fill one out before entering the protected part of your site. The flaws are that it screws the Google bot again, it will annoy your users, and as far as I know, every captcha has been broken by OCR software right now. If your data is valuable enough, people will use either porn or mechanical turk to incent real people to solve your captcha and build a library which they can use as a lookup table. The only time this actually works is if your test is "enter a credit card to be charged". Having a valid credit card and being willing to part with money is (almost) always a good sign that something isn't a bot. 
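The rate-limit-then-ban approach Matt sketched at the start of the thread ("a dozen or so rapid requests" earns a temporary ban) can be roughed out in a few lines of Perl. This is only an illustrative sketch: the thresholds, the in-memory hashes, and the `allow_request` helper are all invented here, and a real server would need to keep this state somewhere shared between processes (a database, a cache) rather than in per-process hashes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative limits: a dozen requests in a short window is tolerated,
# one more earns a temporary ban.
my $MAX_HITS = 12;     # requests allowed per window
my $WINDOW   = 10;     # window length, seconds
my $BAN_SECS = 300;    # length of a temporary ban, seconds

my %hits;              # ip => [ timestamps of recent requests ]
my %banned_until;      # ip => epoch time when the ban lifts

# Returns true if the request should be served, false if refused.
sub allow_request {
    my ($ip, $now) = @_;
    $now = time unless defined $now;

    if (exists $banned_until{$ip}) {
        return 0 if $now < $banned_until{$ip};
        delete $banned_until{$ip};      # ban has expired
    }

    # Keep only the timestamps that fall inside the current window.
    my @recent = grep { $_ > $now - $WINDOW } @{ $hits{$ip} || [] };
    push @recent, $now;
    $hits{$ip} = \@recent;

    if (@recent > $MAX_HITS) {
        $banned_until{$ip} = $now + $BAN_SECS;
        return 0;
    }
    return 1;
}
```

As the thread also points out, per-IP limits punish everyone behind a shared NAT and are easy to sidestep with proxies, so this is a speed bump rather than a wall.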
Throttling - You can require that requests from a given IP, session cookie, subnet, or anything else that you can think you can use to differentiate browsers only come in at a given rate. The flaws are that anyone who cares will either use Tor or proxies to get around your IP restrictions, large chunks of the net (AOL as an example) are behind NAT and you'll have a ton of false positives, and most robot code understands cookies anyway. Trying to figure out unique users is basically only statistically possible. Plus anyone who cares will just trickle by your rate limit and slowly leach your data anyway. Watermarking - You can seed your data with unique fingerprints that can be identified by you. Bonus points if you make the watermark findable using the Google bot so you can use Google to find your leeches and then sue them for copyright violation. The flaws are that once people find out about your watermarks, they're always trivial to remove. Depending on who is leeching your data, copyright law might be useless anyway (good luck suing in China). Tarpits - You can make an invisible neverending link generator which spiders will descend into infinitely. This can make a nice way of spotting bots and then banning their IP. However it will only work once. A suffiently motivated attacker will just code around your tarpit. AUP - You can publish an acceptable use policy and threaten to sue anyone who breaks it. This might work, assuming they're in a jurisdiction where you can enforce it, and you have deep enough pockets to make it stick. And assuming you can identify who's doing it. This works well for data you don't want republished (copyright law is fairly universal and well understood by providers) but sucks for things like white papers, where you're attempting to control exclusivity. The real answer is to figure out a way so that bots aren't a problem for you, but a benefit. Google, for example, publishes an API, which is access controlled by a user key. 
Requests to the API cost money, and so it doesn't matter if bots hit the API assuming the cost of serving is less than the price to the user. Hope that helps. mock (who has written a fair number of bots) From mock at obscurity.org Mon Feb 23 20:09:01 2009 From: mock at obscurity.org (mock) Date: Tue, 24 Feb 2009 04:09:01 +0000 Subject: [VPM] link 'bot' protection In-Reply-To: References: <513860.22151.qm@web36204.mail.mud.yahoo.com> Message-ID: <20090224040901.GG7144@obscurity.org> On Mon, Feb 23, 2009 at 05:05:43PM -0800, Jer A wrote: > > > Thank you for your response. > > what can i also do to prevent cross site scripting....eg, if some one finds the html code that references to the form cgi script.....and calls it from their own site for example....is there anything in perl that would allow client computers access (eg. surfers), but block other domains (websites)? > REST is your friend (don't use GET in stateful contexts) and use a token passing scheme to prevent credential replay attacks. Both these things will make your life significantly more annoying when designing your app, but will save your ass in the long run. From darren at darrenduncan.net Mon Feb 23 21:09:39 2009 From: darren at darrenduncan.net (Darren Duncan) Date: Mon, 23 Feb 2009 21:09:39 -0800 Subject: [VPM] link 'bot' protection In-Reply-To: <20090224035241.GF7144@obscurity.org> References: <49A33B8F.3020600@elrod.ca> <20090224035241.GF7144@obscurity.org> Message-ID: <49A38113.2050009@darrenduncan.net> mock wrote: > On Mon, Feb 23, 2009 at 04:57:39PM -0800, Jer A wrote: > Watermarking - You can seed your data with unique fingerprints that can be > identified by you. Bonus points if you make the watermark findable using the > Google bot so you can use Google to find your leeches and then sue them for > copyright violation. The flaws are that once people find out about your > watermarks, they're always trivial to remove. 
Depending on who is leeching > your data, copyright law might be useless anyway (good luck suing in China). > > AUP - You can publish an acceptable use policy and threaten to sue anyone who > breaks it. This might work, assuming they're in a jurisdiction where you can > enforce it, and you have deep enough pockets to make it stick. And assuming > you can identify who's doing it. This works well for data you don't want > republished (copyright law is fairly universal and well understood by > providers) but sucks for things like white papers, where you're attempting to > control exclusivity. Jeremy, I don't know what the content of your website is, but copyright law may not necessarily protect it, depending on the content and how the robots' masters make use of what they pull. Copyright law only protects creative expressions, such as the exact sentences you write in a description paragraph, or the visual arrangement of the information (if the latter isn't simple and obvious). By contrast, if the content is simply data without creative expression, for example a phone book with names, addresses, and phone numbers, then anyone can take that and make their own phone book, and this is perfectly legal by copyright law (and by rationality). You can't copyright facts, only creative expressions based on said facts. So, say if you have a website with a catalog and written reviews of products or services, someone can copy the names and prices etc of the products, but you can only go after them on copyright if, say, they copy the written review paragraphs as well. Also, while you can post a terms of services document, all those can do is communicate your desires but it doesn't give you a legal leg to stand on against people using facts from your site. Simply using a website that one doesn't have to login to is not a contractual matter, and visitors are not bound by any contact as they haven't signed such. A TOS is not a contract. 
I think a TOS is only enforceable if people have to login to an account to get access to something, and that something isn't otherwise available to the public. And even then, a TOS may be voided by other laws depending what it demands. -- Darren Duncan From matt at elrod.ca Wed Feb 25 10:30:36 2009 From: matt at elrod.ca (Matt Elrod) Date: Wed, 25 Feb 2009 10:30:36 -0800 Subject: [VPM] link 'bot' protection In-Reply-To: <20090224035241.GF7144@obscurity.org> References: <49A33B8F.3020600@elrod.ca> <20090224035241.GF7144@obscurity.org> Message-ID: <49A58E4C.4050400@elrod.ca> Honeypots - One could also provide an invisible link, anchored by an invisible gif for example, expressly forbid access to what it links to in robots.txt, and then temporarily ban any IP that trespasses. That would separate the robot.txt ignoring spiders from the rest. Matt http://en.wikipedia.org/wiki/Web_scraping#Technical_measures_to_stop_bots mock wrote: > Tarpits - You can make an invisible neverending link generator which spiders > will descend into infinitely. This can make a nice way of spotting bots and > then banning their IP. However it will only work once. A suffiently > motivated attacker will just code around your tarpit. From jeremygwa at hotmail.com Fri Feb 27 17:11:20 2009 From: jeremygwa at hotmail.com (Jer A) Date: Fri, 27 Feb 2009 17:11:20 -0800 Subject: [VPM] regex found matches array Message-ID: How do i retrieve the "groups" array, from the regex query that contains the groups in $1 $2 $3 ...etc . the script has no hardcoded knowledge of the regex pattern, as it is read from a string into a $var, which is evaluated like this /$var/eg. Thanks in advance for all help....I hope you can understand, as I find it hard to explain. -Jeremy. 
From darren at darrenduncan.net Fri Feb 27 18:45:17 2009 From: darren at darrenduncan.net (Darren Duncan) Date: Fri, 27 Feb 2009 18:45:17 -0800 Subject: [VPM] regex found matches array In-Reply-To: References: Message-ID: <49A8A53D.7040106@darrenduncan.net> Jer A wrote: > How do i retrieve the "groups" array, from the regex query that contains > the groups in $1 $2 $3 ...etc . the script has no hardcoded knowledge of > the regex pattern, as it is read from a string into a $var, which is > evaluated like this /$var/eg. > > Thanks in advance for all help....I hope you can understand, as I find > it hard to explain. Say something like this: my $results = [$source =~ m/$pattern/g]; Then the $results array has one element per capture by the pattern. Make sure the whole pattern-match expression is in list context (such as my array value constructor provides) or otherwise it will just (in scalar context) return the count of matches. -- Darren Duncan
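Darren's list-context approach can be shown end to end with a small self-contained example; the pattern and input strings below are invented purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern arrives as a string, as in Jeremy's situation; the
# script needs no hardcoded knowledge of how many groups it captures.
my $pattern = '(\w+)=(\d+)';
my $source  = 'width=640 height=480';

# In list context, a /g match returns every capture from every match.
my @results = $source =~ m/$pattern/g;
# @results now holds ('width', '640', 'height', '480')
```

In scalar context the same expression would only report whether the match succeeded, which is why Darren's arrayref constructor (or an assignment to an array, as here) matters.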