Auto-loading of Drupal CCK Nodes
![Page 1: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/1.jpg)
David Naughton | December 3, 2008
Automatic Scheduled
Loading of CCK Nodes
ETL with drupal_execute, OO, drush, & cron
![Page 2: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/2.jpg)
Who am I?
David Naughton
● Web Applications Developer
● University of Minnesota Libraries
● 11+ years development experience
● New to Drupal & PHP
![Page 3: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/3.jpg)
What's EthicShare?
ethicshare.org
• Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE
• What: A sustainable aggregation of bioethics research and a forum for scholarship
• When: Pilot Phase January 2008 – June 2009
• How: Funded by Andrew W. Mellon Foundation
![Page 4: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/4.jpg)
Sustainable Aggregation of Bioethics Research
• My part of the project
• Extract citations from multiple sources
• Transform into Drupal-compatible format
• Load into Drupal
• On a regular, ongoing basis
![Page 5: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/5.jpg)
ETL...
• Extract, Transform, and Load = ETL
• Very common IT problem
• ETL is the most common term for it
• Librarians like to say...
• “Harvesting” instead of Extracting
• “Crosswalking” instead of Transforming
• ...but they're peculiar
![Page 6: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/6.jpg)
...ETL
• Complex problem
• Lots of packaged solutions
• Mostly Java, for data warehouses
• Not a good fit for EthicShare
• Using Drupal 5 and CCK
• No Batch API
• When we move to Drupal 6...
• Batch API? http://bit.ly/BatchAPI
• content.crud.inc? http://bit.ly/content-crud-inc
![Page 7: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/7.jpg)
Without Automation
• First PubMed load alone was > 100,000 citations
• Without automation, I could have been doing lots of this:
![Page 8: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/8.jpg)
One Solution
If money were no object, we could have hired lots of these:
![Page 9: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/9.jpg)
Really want...
![Page 10: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/10.jpg)
...but don't want:
![Page 11: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/11.jpg)
Architecture
[Diagram: drush runs CiteETL. Data flows from the sources (PubMed, WorldCat, New York Times, BBC) as XML through per-source Extractors and Transformers, which produce a PHP array that the Loader writes into EthicShare (MySQL).]
![Page 12: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/12.jpg)
drush
A portmanteau of “Drupal shell”.
“…a command line shell and Unix scripting interface for Drupal, a veritable Swiss Army knife designed to make life easier for those of us who spend most of our working hours hacking away at the command prompt.”
-- http://drupal.org/project/drush
![Page 13: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/13.jpg)
Why drush?
● Very flexible scheduling via cron
● Uses php-cli, so no web timeouts
● Experimental support for running drush without a
running Drupal web instance
● Run tests from the cli with Drush simpletest runner
![Page 14: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/14.jpg)
Why not hook_cron?
● If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work
● Subject to web timeouts
● Runs within a Drupal web instance, so large loads
may affect user experience
![Page 15: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/15.jpg)
drush help

    $ cd $drush_dir
    $ ./drush.php help
    Usage: drush.php [options] <command> <command> ...

    Options:
      -r <path>, --root=<path>   Drupal root directory to use
                                 (default: current directory)
      -l <uri>, --uri=<uri>      URI of the drupal site to use (only
                                 needed in multisite environments)
      ...

    Commands:
      cite load     Load data to create new citations.
      help          View help. Run "drush help [command]" to view
                    command-specific help.
      pm install    Install one or more modules
![Page 16: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/16.jpg)
drush command help

    $ ./drush.php help cite load
    Usage: drush.php cite load [options]

    Options:
      --E=<extractor class>
          Base name of an extractor class, excluding the
          CiteETL/E/ parent path & '.php'. Required.
      --T=<transformer class>
          Base name of a transformer class, excluding the
          CiteETL/T/ parent path & '.php'. Required.
      --L=<loader class>
          Base name of a loader class, excluding the
          CiteETL/L/ parent path & '.php'.
          Optional: default is 'Loader'.
      --dbuser=<db username>
          Optional: 'cite load' will authenticate the user
          only if both dbuser & dbpass are present.
      --dbpass=<db password>
          Optional: 'cite load' will authenticate the user
          only if both dbuser & dbpass are present.
      --memory_limit=<memory limit>
          Optional: default is 512M.
![Page 17: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/17.jpg)
drush cite load
Example specifying the New York Times – Health extractor & transformer classes on the cli:

    $ ./drush.php cite load --E=NYTHealth \
        --T=NYTHealth --dbuser=$dbuser \
        --dbpass=$dbpass

Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.
![Page 18: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/18.jpg)
php-cli Problems
• PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: “Memory Leaks With Objects in PHP 5”, http://bit.ly/php5-memory-leak
• Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
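One common workaround for the circular-reference problem on PHP < 5.3 is to break the cycles by hand before dropping an object. The sketch below assumes a hypothetical `SketchNode` class with parent and child links; it is not part of CiteETL. On PHP 5.3 and later the built-in cycle collector (which can be triggered with `gc_collect_cycles()`) usually makes this unnecessary.

```php
<?php
// Hypothetical tree node, for illustration only; not part of CiteETL.
class SketchNode {
    public $parent = NULL;
    public $children = array();

    function add_child(SketchNode $child) {
        $child->parent = $this;      // child -> parent reference...
        $this->children[] = $child;  // ...plus parent -> child: a cycle
    }

    // Break the cycles explicitly so that plain reference counting
    // can reclaim the memory.
    function free() {
        foreach ($this->children as $child) {
            $child->free();
            $child->parent = NULL;
        }
        $this->children = array();
    }
}

$root = new SketchNode();
$child = new SketchNode();
$root->add_child($child);

// Call free() before discarding each parsed document in a long loop.
$root->free();
```

After `free()` neither object refers to the other, so both can be released as soon as the last variable pointing at them goes away.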
![Page 19: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/19.jpg)
drush API
Undocumented, but simple, and http://drupal.org/project/drush links to some modules that use it. To create a drush command…
● Implement hook_drush_command, mapping cli text to a
callback function name
● Implement the callback function
…and optionally…
● Implement a hook_help case for your command
![Page 20: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/20.jpg)
drush getopt emulation…
Supports:
● --opt=value
● -opt or --opt (boolean based on presence or
absence)
Contrary to README.txt, does not support:
● -opt value
● -opt=value
![Page 21: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/21.jpg)
…drush getopt emulation
● Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options']
● Puts commands (“words” not starting with a dash) in an array:
$GLOBALS['args']['commands']
Quirks:
● in cases of repetition (e.g. -opt --opt=value ), last one wins
● commands & options can be interspersed, as long as order of
commands is maintained
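The behaviour described on these two slides can be approximated in a few lines. This is a minimal sketch, not drush's actual parser: the function name `sketch_parse_args` is hypothetical, and skipping unsupported forms such as `-opt=value` is an assumption, since the slides do not say what drush does with them.

```php
<?php
// Hypothetical re-creation of the option handling described above;
// this is not drush's own code.
function sketch_parse_args(array $words) {
    $args = array('options' => array(), 'commands' => array());
    foreach ($words as $word) {
        if (preg_match('/^--([^=]+)=(.*)$/', $word, $m)) {
            // --opt=value; a later repetition overwrites an earlier one.
            $args['options'][$m[1]] = $m[2];
        }
        elseif (preg_match('/^--?([^=]+)$/', $word, $m)) {
            // -opt or --opt: boolean, true when present.
            $args['options'][$m[1]] = TRUE;
        }
        elseif (strncmp($word, '-', 1) === 0) {
            // Unsupported form such as -opt=value: skipped (assumption).
            continue;
        }
        else {
            // "Words" not starting with a dash are commands, kept in order.
            $args['commands'][] = $word;
        }
    }
    return $args;
}

// Commands and options interspersed; the repeated option 'v' ends up as '2'.
$parsed = sketch_parse_args(array('cite', '--E=NYTHealth', 'load', '-v', '--v=2'));
```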
![Page 22: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/22.jpg)
cite.module example…

    function cite_drush_command() {
      $items['cite load'] = array(
        'callback' => 'cite_load_cmd',
        'description' => t('Load data to create new citations.')
      );
      return $items;
    }

![Page 23: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/23.jpg)
…cite.module example…

    function cite_load_cmd($url) {

      global $args;
      $options = $args['options'];

      // Batch loading will often require more
      // than the default memory.
      $memory_limit = (
        array_key_exists('memory_limit', $options)
        ? $options['memory_limit']
        : '512M'
      );
      ini_set('memory_limit', $memory_limit);

      // continued on next slide…

![Page 24: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/24.jpg)
…cite.module example

      // …continued from previous slide
      if (array_key_exists('dbuser', $options)
          && array_key_exists('dbpass', $options)) {
        user_authenticate($options['dbuser'], $options['dbpass']);
      }
      set_include_path(
        './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
        . './' . drupal_get_path('module', 'cite') . '/contrib' . PATH_SEPARATOR
        . get_include_path()
      );

      require_once 'CiteETL.php';
      $etl = new CiteETL( $options );
      $etl->run();

    } // end function cite_load_cmd

![Page 25: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/25.jpg)
CiteETL.php…

    class CiteETL {

      private $option_property_map = array(
        'E' => 'extractor',
        'T' => 'transformer',
        'L' => 'loader'
      );

      // Not shown: identically-named accessors for these properties
      private $extractor;
      private $transformer;
      private $loader;

![Page 26: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/26.jpg)
…CiteETL.php…

      function __construct($params) {
        // The loading process is almost always the same...
        if (!array_key_exists('L', $params)) {
          $params['L'] = 'Loader';
        }

        foreach ($params as $option => $class) {
          if (!preg_match('/^(E|T|L)$/', $option)) {
            continue;
          }
          // Naming-convention-based, factory-ish, dynamic
          // loading of classes, e.g. CiteETL/E/NYTHealth.php:
          require_once 'CiteETL/' . $option . '/' . $class . '.php';
          $instantiable_class = 'CiteETL_' . $option . '_' . $class;
          $property = $this->option_property_map[$option];
          $this->$property = new $instantiable_class;
        }
      }

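The naming convention used by the constructor can be shown in isolation. The helper below is hypothetical (`sketch_resolve_etl_class` does not appear in the slides) and only computes the file path and class name that the constructor would use; it does not load or instantiate anything.

```php
<?php
// Hypothetical helper mirroring the constructor's naming convention.
function sketch_resolve_etl_class($option, $class) {
    if (!preg_match('/^(E|T|L)$/', $option)) {
        return NULL; // not an ETL option, e.g. 'dbuser'
    }
    return array(
        'file'  => 'CiteETL/' . $option . '/' . $class . '.php',
        'class' => 'CiteETL_' . $option . '_' . $class,
    );
}

// --E=NYTHealth on the command line maps to this file and class:
$resolved = sketch_resolve_etl_class('E', 'NYTHealth');
$ignored  = sketch_resolve_etl_class('dbuser', 'someone');
```

Because the class base name arrives from the command line, a real implementation might also want to check it against an allowed pattern before building the path.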
![Page 27: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/27.jpg)
…CiteETL.php

      function run() {
        // Extractors must all implement the Iterator interface.
        $extractor = $this->extractor();
        $extractor->rewind();
        while ($extractor->valid()) {
          $original_citation = $extractor->current();
          try {
            $transformed_citation = $this->transformer->transform(
              $original_citation
            );
          } catch (Exception $e) {
            fwrite(STDERR, $e->getMessage() . "\n");
            // Skip the loader for a rejected citation; without the
            // continue, next() would be called twice.
            $extractor->next();
            continue;
          }
          try {
            $this->loader->load( $transformed_citation );
          } catch (Exception $e) {
            fwrite(STDERR, $e->getMessage() . "\n");
          }
          $extractor->next();
        }
      }

    } // end class CiteETL

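Because extractors implement Iterator, the same extract-transform-load loop can also be written with `foreach`. The following is a self-contained sketch with stand-in classes (`SketchTransformer`, `SketchLoader`, and `sketch_run` are hypothetical names, and PHP's built-in `ArrayIterator` plays the extractor). It collects error messages in an array rather than writing to STDERR, purely to keep the example easy to check.

```php
<?php
// Stand-in transformer: rejects empty items by throwing, as the
// relevancy filter does in the slides.
class SketchTransformer {
    function transform($item) {
        if ($item === '') {
            throw new Exception('empty item rejected');
        }
        return array('title' => strtoupper($item));
    }
}

// Stand-in loader: records titles instead of creating Drupal nodes.
class SketchLoader {
    public $loaded = array();
    function load($citation) {
        $this->loaded[] = $citation['title'];
    }
}

function sketch_run(Iterator $extractor, $transformer, $loader) {
    $errors = array();
    foreach ($extractor as $item) {
        try {
            // One try block: a failed transform never reaches the loader.
            $citation = $transformer->transform($item);
            $loader->load($citation);
        } catch (Exception $e) {
            // Record the failure and move on to the next item.
            $errors[] = $e->getMessage();
        }
    }
    return $errors;
}

$loader = new SketchLoader();
$errors = sketch_run(
    new ArrayIterator(array('a', '', 'b')),
    new SketchTransformer(),
    $loader
);
```

Here the empty second item is rejected by the transformer, so only the first and third items reach the loader.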
![Page 28: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/28.jpg)
Example E. Base Class…

    require_once 'simplepie.inc';

    class CiteETL_E_SimplePie implements Iterator {

      private $items = array();
      private $valid = FALSE;

      function __construct($params) {
        $feed = new SimplePie();
        $feed->set_feed_url( $params['feed_url'] );
        $feed->init();
        if ($feed->error()) {
          throw new Exception( $feed->error() );
        }
        $feed->strip_htmltags( $params['strip_html_tags'] );
        $this->items = $feed->get_items();
      }

      // continued on next slide…

![Page 29: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/29.jpg)
…Example E. Base Class

      // …continued from previous slide
      function rewind() {
        $this->valid = (FALSE !== reset($this->items));
      }

      function current() {
        return current($this->items);
      }

      function key() {
        return key($this->items);
      }

      function next() {
        $this->valid = (FALSE !== next($this->items));
      }

      function valid() {
        return $this->valid;
      }

    } # end class CiteETL_E_SimplePie

![Page 30: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/30.jpg)
Example Extractor

    require_once 'CiteETL/E/SimplePie.php';

    class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {

      function __construct() {
        parent::__construct(array(
          'feed_url' => 'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
          'strip_html_tags' => array('br','span','a','img')
        ));
      }

    } // end class CiteETL_E_NYTHealth

![Page 31: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/31.jpg)
Example Transformer…

    class CiteETL_T_NYTHealth {

      private $filter_pattern;

      function __construct() {

        $simple_keywords = array(
          'abortion',
          'advance directives',
          // whole bunch of keywords omitted…
          'world health',
        );
        $this->filter_pattern = '/(' . join('|', $simple_keywords) . ')/i';
      }

      // continued on next slide…

![Page 32: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/32.jpg)
…Example Transformer…

      // …continued from previous slide

      function transform( $simplepie_item ) {
        // create an array matching the cite CCK content type structure:
        $citation = array();

        $citation['title'] = $simplepie_item->get_title();
        $citation['field_abstract'][0]['value'] = $simplepie_item->get_content();
        $this->filter( $citation );

        // lots of transformation ops omitted…

        $categories = $simplepie_item->get_categories();
        $category_labels = array();
        foreach ($categories as $category) {
          array_push($category_labels, $category->get_label());
        }
        $citation['field_subject'][0]['value'] = join('; ', $category_labels);

        $this->filter( $citation );
        return $citation;
      }

![Page 33: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/33.jpg)
…Example Transformer

      // …continued from previous slide

      function filter( $citation ) {

        $combined_content =
          $citation['title'] .
          $citation['field_abstract'][0]['value'] .
          $citation['field_subject'][0]['value'];

        if (!preg_match($this->filter_pattern, $combined_content)) {
          throw new Exception(
            "The article '" . $citation['title'] .
            "', id: " . $citation['source_id'] .
            " was rejected by the relevancy filter"
          );
        }
      }

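The keyword-based relevancy filter can be tried on its own. In this sketch the function names and the flat `title`/`abstract` array keys are hypothetical, only two of the keywords from the slide are used, and the `preg_quote` call is an addition not present in the slides, included in case a keyword ever contains a regex metacharacter.

```php
<?php
// Build one case-insensitive alternation pattern from a keyword list.
function sketch_build_filter_pattern(array $keywords) {
    $quoted = array();
    foreach ($keywords as $keyword) {
        // Escape regex metacharacters (an addition, not in the slides).
        $quoted[] = preg_quote($keyword, '/');
    }
    return '/(' . join('|', $quoted) . ')/i';
}

// Return TRUE when any keyword occurs in the combined text.
function sketch_is_relevant($pattern, array $citation) {
    $combined = $citation['title'] . ' ' . $citation['abstract'];
    return preg_match($pattern, $combined) === 1;
}

$pattern = sketch_build_filter_pattern(array('abortion', 'world health'));

$kept = sketch_is_relevant(
    $pattern,
    array('title' => 'World Health report', 'abstract' => '')
);
$rejected = sketch_is_relevant(
    $pattern,
    array('title' => 'Sports scores', 'abstract' => '')
);
```

Returning a boolean instead of throwing, as the slide's `filter()` does, is a choice made here only to keep the sketch short.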
![Page 34: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/34.jpg)
Why not FeedAPI?
• Supports only simple one-feed-field to one-CCK-field mappings
• Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
![Page 35: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/35.jpg)
Loader

    class CiteETL_L_Loader {

      function load( $citation ) {
        // de-duplication code omitted…
        $node = array('type' => 'cite');
        $citation['status'] = 1;
        $node_path = drupal_execute(
          'cite_node_form', $citation, $node
        );
        $errors = form_get_errors();
        if (count($errors)) {
          $message = join('; ', $errors);
          throw new Exception( $message );
        }
        // de-duplication code omitted…
      }

![Page 36: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/36.jpg)
CCK Auto-loading Resources
• Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports
• Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update
• What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck
![Page 37: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/37.jpg)
CCK Auto-loading Problems
• Column names may change from one database instance to another if other CCK content types with identical field names already exist.
• drupal_execute bug in Drupal 5 Form API:
  • cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug
  • Fixed in Drupal versions > 5
![Page 38: Auto-loading of Drupal CCK Nodes](https://reader030.vdocuments.site/reader030/viewer/2022020207/556a6969d8b42ab0468b4be8/html5/thumbnails/38.jpg)
Questions?