Post on 14-Jul-2015

Our Friends the Utils: A highway traveled by wheels we didn't re-invent.

Steven Lembark

Workhorse Computing

[email protected]

Meet the Utils

● Scalar::Util & List::Util were first written by the ancient Prophet of Barr (c. 1997).

● The modules provide often-requested features that were not worth modifying Perl itself to offer.

● Later, List::MoreUtils added features that List::Util does not include.

● If the Sound of Perl is an un-bloodied wall, the Utils are a superhighway traveled by truly lazy wheels.

Mixing old and new

● Several features in v5.10+ overlap Util features.

– Smart matches are the most obvious, and are usually compared with List::Util::first.

– New features are not replacements, but work well with the modules.

– Examples here show how to use the modules with smart matching, switches.

● What's important to notice is that these modules remain relevant.

Scalar::Util

Provides introspection for scalars:

– Is a filehandle [still] open?

– The address, type, and class of a variable.

– Is a value “numeric” according to Perl?

– Does the variable contain readonly or tainted data?

– Tools for managing weak references or modifying prototypes.

● Handling these in Pure Perl is messy, slow, or error-prone.

Dealing with ref's & objects

● Collectively these replace “ref” or stringified references with a simpler, cleaner interface.

● The problem with ref and stringified objects is that they return different data for objects or “plain” refs.

– Stringified refs are “Foobar=ARRAY(0x29eba90)”, unless overloading gets in the way.

– ref returns the base type, unless the reference is blessed, in which case it returns the class.

● blessed, refaddr, & reftype are consistent.
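A minimal sketch of that difference, assuming a made-up class name 'Foobar': ref answers with the base type for a plain reference but with the class for a blessed one, while blessed, reftype, and refaddr each answer one question the same way for both.

```perl
use strict;
use warnings;
use Scalar::Util qw( blessed reftype refaddr );

# 'Foobar' is a made-up class name for illustration.
my $plain  = [ 1, 2, 3 ];
my $object = bless [ 1, 2, 3 ], 'Foobar';

for my $ref ( $plain, $object )
{
    printf "ref=%-7s blessed=%-7s reftype=%-6s refaddr=0x%x\n",
        ref( $ref ),
        blessed( $ref ) // 'undef',
        reftype( $ref ),
        refaddr( $ref );
}
```

The printed addresses vary from run to run; the point is that the reftype column reads ARRAY on both lines.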

Blessed is the Object

● blessed returns a class or undef.

● This simplifies sanity checks:

blessed $_[0] or die 'Non-object...';

● Construction with objects for types:

bless $x, blessed $proto || $proto;

avoids classes like “ARRAY(0xab1234)”.

● Check for blessed before “can” to avoid errors:

blessed $x && $x->can( $method ) or die ...
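The three idioms on this slide can be put together in a small runnable sketch. The Counter class and its bump method are hypothetical names invented here, not part of the talk.

```perl
use strict;
use warnings;

package Counter;    # hypothetical class, for illustration only

use Scalar::Util qw( blessed );

sub new
{
    my $proto = shift;

    # Class name whether called as Counter->new or $object->new.
    my $class = blessed( $proto ) || $proto;

    return bless { count => 0 }, $class;
}

sub bump
{
    my $self = shift;

    blessed( $self ) or die 'Non-object passed to bump';

    return ++$self->{ count };
}

package main;

use Scalar::Util qw( blessed );

my $first  = Counter->new;
my $second = $first->new;       # construct from an existing object

# Only blessed values are asked whether they can do something.
my @usable
= grep { blessed( $_ ) && $_->can( 'bump' ) }
( $first, $second, [], 'Counter', undef );

print 'Class of copy: ', ref( $second ), "\n";
print 'Usable items:  ', scalar( @usable ), "\n";
```

Checking blessed first keeps the unblessed array ref and undef from reaching a method call, which would otherwise be a fatal error.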

Blessed Structures

● ref does not return the base type of a blessed ref.

● reftype returns the data type, regardless of blessing.

● Works nicely with switches:

given( reftype $thing ) # blessed or not, same reftype
{
    when( undef )    { die "Not a reference: '$thing'" }

    when( 'ARRAY' )  { ... }
    when( 'HASH' )   { ... }
    when( 'SCALAR' ) { ... }

    die "Un-usable data type: '$_'";
}

Blessed Matches

● Smart-matching an object requires an overloading.

● Developers would like to QA their modules to validate the overload is available.

● A generic test is simple: blessed scalars that overload '~~' are usable (overloads are not ordinary methods, so can( '~~' ) will not find them).

● Writing this test with only ref is a pain.

● With Scalar::Util it is blessedly simple:

blessed $var && overload::Method( $var, '~~' ) or die ...

The guts of “inside out” classes

● Virtual addresses are unique during execution.

● Make useful keys for associating external data.

● Problem is that stringified refs include too much data:

– Plain : ARRAY(0xeaa750)

– Blessed: Foo=ARRAY(0xeaa750)

– Re-blessed: Bletch=ARRAY(0xeaa750)

● The extra data makes them unusable as keys.

● Parsing the ref's to extract the address is too slow.

The key to your guts: refaddr

● refaddr returns only the address portion of a ref, as a plain integer:

– Previous values all yield 15378256 (0xeaa750).

● Note the lack of package or type.

● This is not affected by [re]blessing the variable.

● This leaves $data{ refaddr $ref } a stable key over the life cycle of a ref or object.

use Scalar::Util qw( refaddr );
use NEXT;

my %obj2data = (); # private cache for object data.

sub set
{
    my ( $obj, $data ) = @_;
    $obj2data{ refaddr $obj } = $data;
    return
}

sub get
{
    $obj2data{ refaddr $_[0] }
}

# have to manually clear out the cache.

sub DESTROY
{
    my $obj = shift;
    delete $obj2data{ refaddr $obj };
    $obj->NEXT::DESTROY;
}
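To watch the life cycle end to end, here is a self-contained sketch under assumed names: the Tagged class, its Tagged::Loud subclass, and the tag and live methods are all invented for this example. It shows the cache key surviving a re-bless and the entry disappearing when an object is destroyed.

```perl
use strict;
use warnings;

package Tagged;     # hypothetical inside-out class

use Scalar::Util qw( refaddr );

my %tag_for = ();   # object data lives here, keyed by address

sub new
{
    my ( $class, $tag ) = @_;
    my $self = bless \( my $anon ), $class;   # the object itself is empty
    $tag_for{ refaddr( $self ) } = $tag;
    return $self;
}

sub tag     { $tag_for{ refaddr( $_[0] ) } }
sub live    { scalar keys %tag_for }
sub DESTROY { delete $tag_for{ refaddr( $_[0] ) } }

package Tagged::Loud;   # subclass used only to re-bless
our @ISA = ( 'Tagged' );

package main;

my $obj    = Tagged->new( 'alpha' );
my $before = "$obj";

bless $obj, 'Tagged::Loud';     # the string form changes, the address does not
my $after  = "$obj";

my $live_inner;
{
    my $temp    = Tagged->new( 'beta' );
    $live_inner = Tagged->live;         # two objects cached
}
my $live_outer = Tagged->live;          # DESTROY removed $temp's entry

print "before: $before\nafter:  $after\ntag:    ", $obj->tag, "\n";
print "live:   $live_inner then $live_outer\n";
```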

Circular references are not garbage

● In fact, with Perl's reference counting they are normally memory leaks.

● These are any case where a variable keeps alive some extra reference to itself:

– Self reference: $a = \$a

– Linked list: $a->[0] = [ [], \$a, @data ]

● The first is probably a mistake, the second is a properly formed doubly-linked list.

● Both of them prevent $a from ever being released.

Fix: Weak References

● Weak ref's do not increment the var's reference count.

● In this case the weakened backlink does not prevent cleaning $a:

@$a = ( [], $a, @data );

weaken $a->[1]; # weaken the stored copy: copying a weak ref yields a strong one

● A weak ref becomes undef once its referent is released.

● isweak returns true for weak ref's.
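A small sketch of the same idea with a parent/child pair; the hash layout and key names are assumptions made for illustration. The stored back-link is weakened in place, and it turns to undef once the last strong reference to the parent goes away.

```perl
use strict;
use warnings;
use Scalar::Util qw( weaken isweak );

my $parent = { name => 'parent', kids => [] };
my $child  = { name => 'child', parent => $parent };

push @{ $parent->{ kids } }, $child;    # parent -> child -> parent: a cycle

# Weaken the stored back-link itself; weakening a temporary copy would not help.
weaken( $child->{ parent } );

my $is_weak = isweak( $child->{ parent } ) ? 1 : 0;

undef $parent;      # drop the only strong reference to the parent

my $cleared = defined $child->{ parent } ? 0 : 1;

print "weak: $is_weak, cleared: $cleared\n";
```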

Aside: Accidentally getting strong

● Copies are strong references unless they are explicitly weakened.

● This can leave you accidentally keeping items alive with things like:

@a = grep { defined } @a;

This leaves @a with strong references that have to be explicitly weakened again.

● See Scalar::Util's POD for dealing with this.

Knowing Your Numbers

● We've all seen code that checks for numeric values with a regex like /^\d+$/.

● Aside from being slow, this simply does not work.

Exercise: Come up with a working regex that gracefully handles all of Perl's numeric types including int, float, exponents, hex, and octal along with optional whitespace.

● Better yet, let Perl figure it out for you:

if( looks_like_number $x ) { … }
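A quick sketch of what looks_like_number accepts, using a handful of sample strings chosen for this example. It follows Perl's own string-to-number rules, so signs, decimals, and exponents pass, while hex literals and underscore separators, which are source-code syntax rather than numeric strings, do not.

```perl
use strict;
use warnings;
use Scalar::Util qw( looks_like_number );

my @inputs = ( '42', '-3.14', '1e5', '0x1F', '1_000', 'abc', '' );

my %is_number;
$is_number{ $_ } = looks_like_number( $_ ) ? 1 : 0 for @inputs;

printf "%-8s %s\n", "'$_'", $is_number{ $_ } ? 'number' : 'not a number'
    for @inputs;
```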

Switching on numerics

● Switches with looks_like_number help parsing and make the logic more readable:

if( looks_like_number $_ )
{
    …
}
elsif( /$regex/ )
{
    # deal with text...
}

Sorting and Sanity Checks

sub generic_minimum
{
    looks_like_number $_[0]
    ? min    @_
    : minstr @_
}

sub numeric_input
{
    my $numstr = get_user_input;

    looks_like_number $numstr
    or die "Not a number: '$numstr'";

    $numstr
}

Anonymous Prototyping

● set_prototype adjusts the prototype on a subref.

– Including anonymous subroutines.

– Allows installation of subs that handle block inputs or multiple arrays – think of import subs.

● Another use is removing or modifying misguided prototypes in wrappers that call them.

– Example is a prototype of “$$” that prevents calling a wrapped sub with “@_”.

Bi-polar Variables

● dualvar is a fast handler for dealing with multimode string+numeric data.

● Returns stringy or numeric portion depending on context:

$a = dualvar ( 90, '/var/tmp' );

print $a if $a > 80; # prints “/var/tmp”

or

sort { $a <=> $b or $a cmp $b } @list;

● dualvars are faster than blessed ref's with overloads and offer better encapsulation.
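Both uses can be tried in one short sketch. The percent-used figures and mount points are made-up sample data: numeric comparisons see the percentage, while printing or joining sees the path.

```perl
use strict;
use warnings;
use Scalar::Util qw( dualvar );

# Percent-used figures and mount points are made-up sample data.
my @mounts =
(
    dualvar( 90, '/var/tmp' ),
    dualvar( 35, '/home' ),
    dualvar( 90, '/opt' ),
);

# Numeric context sees the percentage, string context sees the path.
my $nearly_full = join ' ', grep { $_ > 80 } @mounts;

# Fullest first, ties broken by path.
my $by_usage = join ' ', sort { $b <=> $a or $a cmp $b } @mounts;

print "nearly full: $nearly_full\n";
print "by usage:    $by_usage\n";
```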

But wait, there's more!!!

● Obvious sanity checks:

● openhandle returns true for an open filehandle.

– validate stdin for interactive sessions.

– check for [still] live sockets.

● isvstring returns true for vstrings (e.g., “v5.16.0”).

● tainted returns true for tainted values.

● readonly checks for readonly values or variables.
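A brief sketch of two of these checks. It assumes an in-memory filehandle as a stand-in for a real file or socket; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Scalar::Util qw( openhandle isvstring );

my $text = "line one\n";

open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my $open_before = defined openhandle( $fh ) ? 1 : 0;
close $fh;
my $open_after  = defined openhandle( $fh ) ? 1 : 0;

# A string that does not name a handle is not an open handle either.
my $not_handle  = defined openhandle( 'just a string' ) ? 1 : 0;

my $version     = v5.16.0;
my $is_vstring  = isvstring( $version ) ? 1 : 0;
my $is_plain    = isvstring( '5.16.0' ) ? 1 : 0;

print "open before close: $open_before, after: $open_after\n";
print "vstring: $is_vstring, plain string: $is_plain\n";
```

openhandle returns the handle itself when it is open and undef otherwise, which is why the sketch tests the result with defined.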

Managing lists

● List::Util provides mostly-obvious functions: sum, max, min, maxstr, minstr, shuffle, first, and reduce.

● max and min compare numbers, maxstr and minstr handle strings.

● shuffle randomizes the order of a list – useful for security or simulations.

● first & reduce take a bit more explanation...

First Thing: Why Bother?

● These can all be written in Pure Perl.

● Why bother with Yet Another Module and XS?

– Most people think of speed, which is true.

– These all have simple, clean interfaces that Just Work.

– XS encapsulates the in-work data.

– Module provides them in one place, once, with POD.

● So, speed is not the only issue, but it doesn't hurt that these are fast.

Second Thing's first()

● first looks a lot like grep, with a block and list.

● Unlike grep, first stops after finding the first match.

● It returns the first scalar that leaves the block true, not the block's output!

● Lists don't have to be data: they can be anything.

my $odd   = first { $_ % 2 } @itemz;
my $valid = first { /$rx/ } @regexen;
my $found = first { foo $_ } @inputz;

my $obj   = first { $_->valid($data) } @objz
or die "Invalid data...";
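The early exit and the return value are easy to confirm with a counter. This is a minimal sketch over a made-up word list, not code from the talk.

```perl
use strict;
use warnings;
use List::Util qw( first );

my @words  = qw( apple kiwi banana fig cherry );
my $checks = 0;

# Returns the matching element itself, not the block's true/false result.
my $long = first { ++$checks; length( $_ ) > 5 } @words;

# No match leaves undef, so "or" can supply a fallback or die.
my $huge = first { length( $_ ) > 20 } @words;

print "found '$long' after $checks checks\n";
print 'no match gives ', defined $huge ? $huge : 'undef', "\n";
```

Only three of the five words are examined, because the search stops at 'banana'.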

first with ~~ for validation

● Ever get sick of running through if-blocks for mutually exclusive switches?

● first with smart matching offers a declarative alternative:

● Hash-slicing the arguments array allows comparing invalid values with the same structure.

my @bogus = ( [ qw( fork debug ) ], … );
...
if( my $botched = first { $_ ~~ %argz } @bogus )
{
    local $" = ' ';
    die "Mutually exclusive: @$botched";
}
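One caveat worth knowing: smart-matching an array against a hash is true when any element of the array exists as a hash key, so a group is flagged as soon as one of its switches is present. The sketch below is an alternative that avoids smart matching and flags a group only when every switch in it was given; the %argz contents and the switch names are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw( first );

# Hypothetical parsed command-line switches.
my %argz  = ( fork => 1, debug => 1, verbose => 1 );

# Groups of switches that may not be combined.
my @bogus = ( [ qw( quiet verbose ) ], [ qw( fork debug ) ] );

my $botched = first
{
    my $group   = $_;
    my @present = grep { exists $argz{ $_ } } @$group;

    @present == @$group     # every switch in the group was given
}
@bogus;

my $message = $botched
    ? "Mutually exclusive: @$botched"
    : 'Arguments are consistent';

print "$message\n";
```

The table of forbidden combinations stays declarative; only the test inside the block changes.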

Working smarter

● first saves overhead by stopping early.

● Returning a scalar simplifies the syntax for assigning a result.

● Depending on your data, first on an array may be faster than exists on a hash key.

● Useful for more than iterating data:

– Use a list of regexes to determine what type of data is being processed.

– Lists of objects can be iterated to find the correct parser for general input.

Smart Match ~~ first

● Unlike most Perly boolean operators, smart match returns true or false, not the argument value that left it true.

● first returns the value that matched:

my $found = first { $record ~~ $_ } @filterz;

● $found is the first entry from @filterz that matches the record.

● Filters can be regexen, arrays, hashes, or objects with overloaded ~~ matching valid or unusable data.

– Use to check edge-cases in testing data handlers.

Inside-out data for a regex

● Use an inside-out structure to associate arbitrary data or state with the regex.

● Smart matching handles blessed regexen properly: works equally well with std regex or object.

my $regex1 = qr{ ... };
my $regex2 = qr{ ... };

$inside{ refaddr $regex1 } = [];

my @filtrz = ( $regex1, $regex2 );
my $found = first { $input ~~ $_ } @filtrz;

push @{ $inside{ refaddr $found } }, $input;

Use first to pick handlers

● Say you have records with a variety of fields.

● A set of arrays with the required fields for handlers makes it easy to pick the right one:

● Add a bit of inside-out data and you can dispatch the record and its handler in a few lines of code.

my @keyz = ( [ qw( ... ) ], [ qw( ... ) ] );

my $found = first { $record ~~ $_ } @keyz
or die 'Record fails minimum key test';

Reducing your workload

● All of the min, max, and sum functions are canned versions of reduce.

● reduce looks like sort, with $a and $b.

● Empty returns undef, singletons return themselves.

● Otherwise:

– $a, $b are aliased to the first two list values.

– The block's result is assigned to $a.

– $b is cycled through the remaining list values.
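A short sketch of those rules in action, using a made-up word list: the block keeps whichever of $a and $b is longer, and the empty and single-element cases never call the block at all.

```perl
use strict;
use warnings;
use List::Util qw( reduce );

my @words = qw( fig banana kiwi cherry );

# $a carries the running winner, $b is the next candidate.
my $longest = reduce { length( $a ) >= length( $b ) ? $a : $b } @words;

# Edge cases: the block is never called for these.
my $empty   = reduce { $a + $b } ();
my $single  = reduce { $a + $b } 42;

print "longest: $longest\n";
print 'empty:   ', defined $empty ? $empty : 'undef', "\n";
print "single:  $single\n";
```

Using >= keeps the earlier word on a tie, so 'banana' wins over 'cherry'.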

Example: min, max, sum, prod

my @list = ( 1 .. 100 );

my $min = reduce { $a < $b ? $a : $b } @list;
my $max = reduce { $a > $b ? $a : $b } @list;

# sum, product roll the value forward:

my $sum = reduce { $a += $b } @list;
my $prd = reduce { $a *= $b } @list;

# sum of x-squared uses a placeholder:

my $sumx2 = reduce { $a += $b**2 } ( 0, @list );

But wait, there's more more!!!

● List::Util lacks a number of operations that are easy to implement in Pure Perl:

– unique

– interleave, every nth record, groups of N records.

● Using XS does have advantages, not the least having none of us re-write the same Pure Perl.

● So... we have List::MoreUtils, written by Tassilo von Parseval and later maintained by Adam Kennedy.

Taking lazyness to XS

● This module is a kitchen sink of things you've done at least once:

any all none notall true false firstidx first_index lastidx last_index insert_after insert_after_string apply indexes after after_incl before before_incl firstval first_value lastval last_value each_array each_arrayref pairwise natatime mesh zip uniq distinct minmax part

Indexes and last items

● first is nice, but to find the last item you need to reverse a list, which is expensive.

● Looking up using indexes with first requires $ary[$_], which also gets expensive.

● last_value, last_index, first_index do what you'd expect [novel idea, what?].

● before and after are more compact versions of slices using the results of first_index.

If first is false, use any

● first returns a list value, which might be false.

● any() returns true the first time its block is true.

● Solves tests using first failing on a false list value:

@list = ( 0, 1, 2 );

$x = first { defined $_ } @list; # $x is 0

$y = any { defined $_ } @list; # $y is 1

Unique lists

● MoreUtils' uniq returns a list in its original order (list) or the count of unique values (scalar):

● Using hash keys gives a random order.

● Any Pure Perl approach requires sort or lots of index operations.

# 1 2 3 5 4
my @x = uniq 1, 1, 2, 2, 3, 5, 3, 4;

# 5
my $x = uniq 1, 1, 2, 2, 3, 5, 3, 4;

Relative locations

● insert_after places an item after the first item for which its block passes.

● insert_after_string uses a string compare, avoiding the need for a block.

● Example: post-insert sentinel values into processed lists.

apply: map Without Side-effects

● One downside to map, sort, & grep is that they alias their block variables.

– Updating $_ or $a/$b will alter the inputs.

● apply works like map: extracting the result of a block applied to each element in a list.

– The difference is that $_ is copied, not aliased.

– The inputs are safe from modification.

Merging Lists

● Pairwise processing of lists uses prototypes to keep the syntax saner:

@sum_xy = pairwise { $a + $b } @x, @y;

@x = pairwise { $a->($b) } @subz, @valz;

● Nice for merging key/value pairs, which is what mesh does without a block:

%y = pairwise{ ($a,$b) } @keyz, @valz;

%y = mesh @keyz, @valz;

● Prototypes require arrays; arrayrefs have to use “@$arrayref” syntax.

Iterating Separate Lists

● each_array generates an iterator that cycles through successive values in multiple lists:

my $each = each_array @a, @b, @c;

while( my( $a, $b, $c ) = $each->() ) { … }

● This avoids having to destroy the lists with shift or the overhead of many index accesses.

● each_arrayref takes arrayref (vs. array) args.

● Limitation of prototypes: can't mix arrays & refs.

Breaking up is easy to do

● Partitioning a list is quite doable in Pure Perl but gets messy when handling arbitrary lists.

● part uses a block to select index entries, returning an array[ref] segregated by the block output:

# [ 1, 3, 5, 7 ], [ 2, 4, 6, 8 ]

my $i = 0;
my @partz = part { $i ++ % 2 } ( 1 .. 8 );

● Using %3 generates three lists.

● Block can use regexen (including parsing results), looks_like_number, error levels, whatever.

POD is your friend

● Actually, the module authors are: All of these modules are well documented, with good examples.

● Especially for MoreUtils: Take the time to run the POD code in a debugger to see what it does.

CPAN & the Power of Perl

● Code on CPAN isn't mouldy just because it's old.– The modules are kept up to date.

– The guts of Perl have remained stable enough to keep the XS working.

● This is due to a lot of effort from module owners and Perl hackers.

Summary

● Smart matches did not obviate “first”, they work together.

● Utils work with newer features like smart matching and switches.

● Any time you find yourself hacking indexes, it's probably time to think about these modules.

● POD is your friend – check the modules for examples (and good examples of writing XS).

● Truly lazy wheels are not re-invented.