[sf-perl] Help with naming modules

David Muir Sharnoff sfpug at dave.sharnoff.org
Sat Jul 18 01:18:07 PDT 2009

I hope to soon open source a bunch of modules and commands.   I need
help naming them.  I developed them for SearchMe, Inc., and they are all
currently in the SearchMe:: namespace, but they should all be moved
out of the SearchMe namespace.

Please propose new names for my modules.  I need to choose the
new names in the next few days.  I also need to choose how to break
this up into distributions.



Here's the list:

program: process_logs

	A distributed log processing system.  Working from a configuration
	that describes the steps and their dependencies, it runs the
	steps using a cluster of systems.  Each step can consist of
	the following parts:
	The sources of the data
	A filter (perl snippet) to choose which data to process.
	A grouping function to choose how much data to process at once.
	A transformation step to massage the data.
	The output format.
	How to bucketize the output data and how many buckets to use.
	How to sort the output data.
	How to name the output data.
	How often to run this step (on daily, weekly, etc. data)
	The start time (date) for which the step is valid (eg: 'last week')
	The end time (date) for which the step is valid (eg: 'yesterday')
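To make the step parts above concrete, here is a hypothetical sketch of
what one step definition might look like.  Every field name below is
invented for illustration; the real configuration format may differ.

```yaml
# Hypothetical process_logs step -- field names are invented for
# illustration, one per part listed above.
steps:
  daily_url_counts:
    sources:    [ raw_access_logs ]
    filter:     '$_->{status} == 200'   # perl snippet choosing rows
    group_by:   day                     # how much data at once
    transform:  'lc $_->{url}'          # massage the data
    output:
      format:   ltsv
      buckets:  16                      # bucketize into N buckets
      sort_by:  url
      name:     'url_counts.%Y-%m-%d'
    frequency:  daily
    start:      'last week'
    end:        'yesterday'
```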

program: test_log_configs

	A program to run sample data through particular steps defined
	by a process_logs configuration file.  Outputs Test Anything Protocol.

program: grab_data

	Re-assemble bucketized data files generated by process_logs

program: ltsv_to_tsv
	Convert bucketized multi-format data files from process_logs into
	standard TSVs

program: ltsv_samples

	Pull sample data from the multi-format data files produced by
	process_logs.

program: ltsv_column_unique_counts

	Count the number of unique values per column in the multi-format
	data files from process_logs.


	The main driver for the process_logs program.  It figures out the
	time and job dependencies and queues all the jobs into a 


	Generic queue of "jobs".   Most likely to be subclassed
	for different situations.  Jobs are registered.  Hosts are
	registered.  Jobs may or may not be tied to particular hosts.
	Jobs are started on hosts when the hosts have less than their
	maximum number of jobs running.

	Maintain a dependency graph of objects.  This does not depend
	on SearchMe::JobQueue though it is used in combination with it
	in SearchMe::DependencyQueue.


	Combines SearchMe::JobQueue, SearchMe::JobQueue::DependencyGraph,
	and SearchMe::JobQueue::DependencyTask to make a queue of jobs
	that is run as their dependencies are met.  Jobs are started by
	calling perl callbacks.


	Base class of the jobs managed by SearchMe::JobQueue.

	Subclass of SearchMe::JobQueue::Job for use with 


	Lighter than a job, a task is a callback.  It doesn't get
	scheduled; it simply runs when it can.  Used with


	A queue of jobs that are run using Proc::Background.


	A command line job for SearchMe::JobQueue::BackgroundQueue.  Runs
	any unix command.


	A command line job that sorts files.
	A command line job that moves files, possibly remotely.


	A compound job that is actually a sequence of jobs.

	A remote perl callback job, using SearchMe::Misc::RemoteJob


	Make an RPC call to a remote system.  It will start perl on the
	remote system.  Asynchronous.  Requires IO::Event.  For use when
	farming work out from a central point.  The remote job communicates
	synchronously, but the master is asynchronous.


	Make a callback to the master system from a remote job started
	with SearchMe::Misc::RemoteJob.

	The ltsv (log TSV) format used for most log processing jobs.
	The set of columns is discovered as the logs are written.


	The writer and parser for raw data files.

	Base class for data writing modules.


	Translate path specifications into filenames.  Generates
	regular expressions to understand names of existing data files.

	Force return values to be integer or float.


	Streaming generic data aggregation.  Based on a configuration,
	it generates perl code to aggregate a stream of input data.  It
	can do nested aggregations (eg: urls, hosts, & domains) and 
	can do cross-product aggregations (domains x time-of-day).  For
	nested aggregations, the input must be sorted.

	Has built-in function support for min, max, mean, median, 
	standard deviation, etc.  Also supports custom code.  Can limit
	memory use.
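As a rough illustration of the streaming idea (not the module's actual
generated code): because the input is sorted, each group can be finalized
and emitted as soon as the key changes, so memory use stays bounded.  The
function below is a minimal hand-written sketch.

```perl
use strict;
use warnings;

# Aggregate a key-sorted stream of [key, value] pairs, emitting
# count/mean/min/max per key.  A sketch only -- the module described
# above generates this kind of code from a configuration.
sub aggregate_sorted {
    my ($rows) = @_;
    my @out;
    my ($key, $n, $sum, $min, $max);
    my $flush = sub {
        push @out, { key => $key, count => $n, mean => $sum / $n,
                     min => $min, max => $max } if defined $key;
    };
    for my $row (@$rows) {
        my ($k, $v) = @$row;
        if (!defined $key or $k ne $key) {
            $flush->();    # key changed: emit the finished group
            ($key, $n, $sum, $min, $max) = ($k, 0, 0, $v, $v);
        }
        $n++; $sum += $v;
        $min = $v if $v < $min;
        $max = $v if $v > $max;
    }
    $flush->();            # emit the final group
    return \@out;
}
```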

	A few statistics functions: standard_deviation, percentile,
	dominant, etc.  All operate on 
	$SearchMe::Aggregation::Stats::ps->{keep}{$fieldname} so that
	SearchMe::Log::Aggregate can have rules like:
		p90_foo: percentile(foo => 90)


	For SearchMe::Log::Jobs metadata, compresses the data 
	structures by joining duplicates together.


	Base class for data reading modules.


	Country code to name and region table.

	Trim strings to length and make sure they're properly encoded
	for inserting into a fixed-length database column.  Will limit by
	bytes even though there might be multi-byte characters in the
	string.
	Parse durations and frequencies like: "daily", "last week",
	"4th wednesday of each month"

	Validate SearchMe::Log::Jobs configuration files.


	Run SearchMe::Log::Jobs steps, usually on a remote system.

	Misc support funcs for SearchMe::Log::Jobs.


	Insert data streams into a database.  Will do safe interpolation 
	into the query and can apply updates conditionally.


	Translate IP addresses to country codes using a dump of data
	from http://software77.net/geo-ip/

	Light wrapper around YAML::Syck to improve error reporting.

	A simple iterator function, from David Wheeler, but
	now looks like a filehandle too.  
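The core of that iterator style is just a closure; this sketch shows the
closure half only (the filehandle behavior would be layered on via tie,
and is not reproduced here).

```perl
use strict;
use warnings;

# A minimal closure-based iterator in the style described: each call
# returns the next item, then undef at exhaustion.  Sketch only.
sub make_iterator {
    my @items = @_;
    return sub { @items ? shift @items : undef };
}

my $it = make_iterator(1, 2, 3);
# $it->() returns 1, then 2, then 3, then undef.
```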


	Translate POD to twiki format and upload it into a twiki.
	Flawed since Pod::Simple::Wiki doesn't produce a very good
	translation.  Uses WWW::TWikiClient.


	Callback API for IO::Event.  This can be IO::Event::Callback
	and be added to the IO::Event distribution.


	Open files locally or remotely, with or without compression
	based on the name of the file.

	Like File::Slurp, but reads & writes remote files using ssh/scp
	via SearchMe::Misc::SmartOpen.


	Remember local and remote process ids.  On receipt of a
	control-C, try to kill them all.


	Merge multiple input streams (SearchMe::Iterators) using a
	sort function to produce a single output stream.  Can handle
	large numbers of streams with only log N performance degradation.
	Still not fast enough.
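For flavor, here is a deliberately simplified k-way merge over iterator
closures.  This version scans all stream heads on every emit, so it is
O(k) per item; the module described above presumably keeps the heads in a
heap to get the log N behavior.

```perl
use strict;
use warnings;

# Merge pre-sorted streams (closures returning undef at end) into one
# sorted list using a comparison callback.  Illustrative sketch only.
sub merge_streams {
    my ($cmp, @streams) = @_;
    my @heads = map { $_->() } @streams;   # current head of each stream
    my @out;
    while (grep { defined } @heads) {
        my $best;
        for my $i (0 .. $#heads) {
            next unless defined $heads[$i];
            $best = $i if !defined $best
                or $cmp->($heads[$i], $heads[$best]) < 0;
        }
        push @out, $heads[$best];
        $heads[$best] = $streams[$best]->();  # advance the winning stream
    }
    return \@out;
}
```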


	Call srand() 


	Run multiple commands and process their output in parallel.


	Simple API for tie'ing functions to hashes.  A whole bunch 
	of useful examples like 

		%q_shell for quoting things safely for /bin/sh
		%round for rounding numbers
		%q_perl for quoting things safely for perl
		%thoucomma for adding commas to large numbers
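The appeal of the tie'd-hash trick is that the function call interpolates
directly into strings.  Below is a generic sketch of tying a function to
a hash, with a %thoucomma lookalike; the real module's interface is
assumed, not reproduced.

```perl
package Tie::FuncHash;
use strict;
use warnings;

# Minimal tie class: hash lookups invoke the stored function on the key.
sub TIEHASH { my ($class, $func) = @_; bless { func => $func }, $class }
sub FETCH   { my ($self, $key) = @_;   $self->{func}->($key) }

package main;
use strict;
use warnings;

# A %thoucomma lookalike: adds thousands separators on lookup, so it
# can be used inside interpolated strings, e.g. "got $thoucomma{$n} rows".
tie my %thoucomma, 'Tie::FuncHash', sub {
    my $n = reverse shift;
    $n =~ s/(\d{3})(?=\d)/$1,/g;
    return scalar reverse $n;
};
```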

	Figure out hostnames the hard way: `ssh $host hostname` for
	when DNS doesn't work right and you can't have aliases.


	Things that should have been in List::Util like:
		do_sublist( &selector, &callback, @list )

	Misc support functions: 
		Add line numbers to eval blocks.
		Detailed list comparison
		Find differences in ordering

	Enhanced YAML-format configuration files.  Allows includes
	and macros.

	Configuration file validator.  Describe the allowed configuration
	in perl struct (embedded YAML) and compare to the 
	configuration supplied (another perl struct).

program: tsv

	A Tab Separated Value slicer/dicer.  All simple stuff, but handy
	when working with TSVs.  Can search them, cut them (by column name),
	rotate them, etc.

program: do.hosts

	Run commands across a cluster of systems.  Can run N per host, 
	can run locally (rsync) or remotely.  Can limit the number running
	simultaneously.  Can prefix output with the hostname, or not.

program: rkill

	Given output of the form "hostname: PID junk", as produced by
	running do.hosts ps x | grep, go kill all the processes on their
	respective hosts.

program: cronjob_wrapper

	Run a command.  Save the output.  Send email only if the command 
	exits non-zero.

program: pods2htmls
	Generate HTML documentation for all the pod documentation that
	can be found in a source tree.   Uses Pod::Html to do the work.


	Stream inputs through multiple slave processes, resending
	items that get dropped and recovering from hung or crashed
	slave processes.  Not connected to the log processing system,
	but dependent on some of the same modules.

More information about the SanFrancisco-pm mailing list