

    [email protected]

    www.knowerce.sk

    Datacamp ETL Documentation

    November 2009

    knowerce|consulting


    Document information

    Creator Knowerce, s.r.o.

    Vavilovova 16

    851 01 Bratislava

    [email protected]

    www.knowerce.sk

    Author Štefan Urbánek, [email protected]

    Date of creation 12.11.2009

    Document revision 1

    Document Restrictions

    Copyright (C) 2009 Knowerce, s.r.o., Stefan Urbanek

    Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".



    Contents

    Introduction
    Overview
        System Context
        Objects and classes
    Installation
        Software Requirements
        Preparation
        Database initialisation
        Configuration
    Running ETL Jobs
        Launching
        Manual Launching
        Scheduled using cron
        Running Programmatically
        What jobs will be run
        Job Status
    Job Management
        Scheduling
        Forced run
    Creating a Job Bundle
        Example: Public Procurement Extraction ETL job
        Job Utility Methods
        Errors and Failing a Job
    Defaults
        ETL System Defaults
        Using defaults in jobs
    Appendix: ETL Tables
        etl_jobs
        etl_job_status
        etl_defaults
        etl_batch
    Cron Example



    Introduction

    This document describes the architecture, structures and processes of the Datacamp Extraction, Transformation and Loading (ETL) framework. The purpose of the framework is to perform automated, scheduled data processing, usually in the background. Main features:

    - scheduled or manual launching of ETL jobs
    - job management and configuration through database
    - logging
    - ETL job plug-in API

    ETL tools provided:

    - parallel URL downloader
    - record transformation functions
    - table comparisons
    - table mappings



    Project Page and Sources

    The project page with sources can be found at:

    http://github.com/Stiivi/Datacamp-ETL

    Wiki Documentation:

    http://wiki.github.com/Stiivi/Datacamp-ETL/

    Related project Datacamp:

    http://github.com/Stiivi/datacamp

    Support

    General Discussion Mailing List

    http://groups.google.com/group/datacamp

    Development Mailing List (recommended for Datacamp-ETL project):

    http://groups.google.com/group/datacamp-dev


    Overview

    System Context

    The Datacamp ETL framework has a plug-in based architecture and runs on top of a database server.

    Objects and classes

    The core of the ETL framework consists of the Job Manager and Job objects. There are two categories of classes: job management classes and utility classes that are not necessary for data processing.

    Class            Description and provided functionality

    Batch            Information about data processed by the ETL

    Download Batch   List of files and additional information for automated parallel downloading and processing

    [Figure: system context diagram. The ETL process runs against a DB server with an ETL staging database and a directory for extracted and temporary files, loading job module bundles from one or more ETL modules directories. Job management classes (Job Manager, Job Info, Job Status, Job with its Extraction, Transformation and Loading subclasses) are shown alongside utilities (Download Manager, Download Batch, ETL Defaults, Batch).]



    Class            Description and provided functionality

    Download Manager Performs parallel download of a large number of URLs

    ETL Defaults     Stores configuration variables in a key-value dictionary

    Job              Abstract class for ETL jobs; provides utilities for running, logging and error handling

    Job Info         Information about a job: name, type, scheduling

    Job Manager      Configures and launches jobs, handles errors

    Job Status       Information about a job run: when it was run, what the result was and the reason for failure



    Installation

    Software Requirements

    - database server (see note 1)
    - ruby
    - rails
    - gems: sequel

    Preparation

    I. Create a directory where working files, such as dumps and ETL files, will be stored, for example: /var/lib/datacamp

    II. Create a database. For use with the Datacamp web application, create two schemas:

    - data schema, for example: datacamp_data
    - staging schema (for ETL), for example: datacamp_staging

    III. Create a database user that has full access (SELECT, INSERT, UPDATE, CREATE TABLE, ...) to the Datacamp ETL schemas.

    Check: at this point you should have:

    - sources
    - a working directory
    - one or two database schemas
    - a database user with appropriate permissions

    Database initialisation

    To initialise the ETL database schema, run the appropriate SQL script from the install directory, for example:

    mysql -u root -p datacamp_staging < install/etl_tables.mysql.sql


    1. Currently works only with a MySQL server, as there are a couple of MySQL-specific code residues. This will change in the future.


    Running ETL Jobs

    Launching

    Manual Launching

    Jobs are run by simply launching the etl.rb script:

    ruby etl.rb

    The script looks for config.yml in the current directory. You can pass another configuration file:

    ruby etl.rb --config another_config.yml

    Scheduled using cron

    You will mostly want to run the ETL automatically and periodically. To do so, configure a cron job for the Datacamp ETL by creating a cron script. There is an example in install/etl_cron_job, where you have to change the ETL_PATH, CONFIG and possibly RUBY variables. See the appendix, where the example file is listed.

    Running Programmatically

    Alternatively, configure a JobManager manually and run all scheduled jobs:

    job_manager = JobManager.new
    # configure job_manager here
    job_manager.run_scheduled_jobs

    The log is written to a preconfigured file or to the standard error output. See the installation instructions for how to configure the log file.

    What jobs will be run

    By default, only jobs that are enabled, are scheduled for the current day and were not already run successfully will be run. If all jobs succeed, then any subsequent launch of the ETL should not run any jobs. All unsuccessful jobs are retried. Disabled jobs are never run. For more information, see Job Management.
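    The selection rule above can be sketched in Ruby. This is a hypothetical illustration, not the framework's actual JobManager code; the job records here are plain hashes mirroring the etl_jobs columns, and the exact "already run successfully" check may differ in the real implementation.

    ```ruby
    require 'date'

    # Sketch of the job selection rule: enabled jobs that are scheduled for
    # today and did not already succeed, plus any job with force_run set.
    # Jobs are ordered by run_order, lowest first.
    def runnable_jobs(jobs, today = Date.today)
      weekday = today.strftime('%A').downcase # e.g. "thursday"
      jobs.select { |job|
        next false unless job[:is_enabled] == 1
        next true  if job[:force_run] == 1
        scheduled = job[:schedule] == 'daily' || job[:schedule] == weekday
        scheduled && job[:last_run_status] != 'ok'
      }.sort_by { |job| job[:run_order] }
    end
    ```
    
    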



    Job Status

    Each job leaves a footprint of its run in the etl_job_status table. The table contains the following information:

    Column Description

    job_name task which was run

    job_id identifier of the job

    status current status of the job: ok, running, failed

    phase if job has more phases, this column identifies which phase the job is in

    message error message on job fail

    start_date when the job started

    end_date when the job finished, or NULL if job is still running

    Possible job statuses are:

    running: the job is still running (or the ETL crashed and did not reset the job status)

    ok: the job finished correctly

    failed: the job did not finish correctly; see phase and message for more information

    [Screenshot: etl_job_status rows showing successful runs; this is what you want to achieve.]

    [Screenshot: etl_job_status rows with mixed statuses, including failed ones.]



    Job Management

    Jobs are managed through the etl_jobs table, where you specify:

    Column Description

    job_name name of a job (see below)

    job_type type of a job: extraction, transformation, loading, ...

    is_enabled set to 1 when the task is enabled

    run_order number specifying the order in which jobs are run. Jobs are run from the lowest number to the highest. If the number is the same for several jobs, the behaviour is undefined

    schedule when the job is being run

    force_run run despite scheduling rule

    Example: [screenshot of etl_jobs rows]

    To add a new job, insert a line into the table and set the job information. To remove a job, just delete its line.

    Scheduling

    Jobs can currently be scheduled on a daily basis:

    daily: run each day

    monday, tuesday, wednesday, thursday, friday, saturday, sunday: run on the given weekday

    Once a job has been successfully run by the scheduler, the job manager does not run it again unless explicitly requested by the force_run flag.

    Forced run

    There is a way to run jobs out of schedule by setting the force_run flag. This allows data managers to re-run an ETL job remotely, without requiring access to the system where the ETL processes are hosted. The job will be run the next time the scheduler runs. For example: if the ETL is scheduled in cron to run hourly, the job is re-run within the next hour; if it is scheduled for daily runs, it will be run the next day.

    The flag is reset to 0 after each run to prevent the job from running again. The reason for this behaviour is to avoid unintentionally running lengthy, time- and CPU-consuming jobs, and to protect already processed data from possible inconsistencies introduced by running jobs at unexpected times.
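    The flag-reset rule can be sketched as follows. This is an illustrative assumption, not the actual framework code; the reset_force_run_flag default is read here from a plain hash standing in for the ETL system defaults.

    ```ruby
    # After a run, clear the job's force_run flag unless the ETL default
    # reset_force_run_flag is turned off (which keeps forced jobs re-running).
    def reset_force_run(job, etl_defaults)
      reset = etl_defaults.fetch(:reset_force_run_flag, true) # defaults to TRUE
      job[:force_run] = 0 if reset
      job
    end
    ```
    
    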



    This behaviour can be modified using ETL system defaults:

    force_run_all: run all enabled jobs, regardless of their scheduled time

    reset_force_run_flag: when set to 0, the force_run flag is not cleared after a run, so forced jobs are re-run each time the ETL script is launched. Set this to 0 for development and testing.



    Creating a Job Bundle

    Jobs are implemented by bundles, in other words directories containing all necessary code and information for the job. The only requirement for a bundle is that it follows a certain naming convention and contains a Ruby script with the job class:

    - the bundle directory should be named: job_name.job_type

    - the bundle should contain a Ruby file: job_name_job_type.rb

    - the Ruby file should contain a class with camelised job name and job type, JobNameJobType, which should be a subclass of the appropriate job class (Extraction, Transformation, Loading)

    The class should implement a run method with the main job code.
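    The naming convention can be illustrated with a small Ruby sketch. This is not the framework's own code; it only shows how a bundle directory name of the form job_name.job_type maps to the expected file and class names.

    ```ruby
    # Derive the expected Ruby file name and camelised class name from a
    # bundle directory name of the form "job_name.job_type".
    def bundle_names(dir_name)
      job_name, job_type = dir_name.split('.')
      base  = "#{job_name}_#{job_type}"
      file  = "#{base}.rb"
      klass = base.split('_').map(&:capitalize).join
      [file, klass]
    end
    ```

    For example, the bundle directory public_procurement.extraction is expected to contain public_procurement_extraction.rb defining the class PublicProcurementExtraction.
    
    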

    Example: Public Procurement Extraction ETL job

    I. Create a job bundle directory: mkdir public_procurement.extraction

    II. Create a Ruby file: public_procurement.extraction/public_procurement_extraction.rb

    III. Implement a class named PublicProcurementExtraction:

    class PublicProcurementExtraction < Extraction
      def run
        # job code goes here
      end
    end

    Job Utility Methods

    There are several utility methods for job writers:

    files_directory: directory where working, extracted, downloaded and temporary files are stored. This directory is job specific; each job has its own directory by default

    logger: object for writing into the ETL manager log

    message, phase: set job status information

    Also, each job has access to a defaults dictionary. See the chapter about Defaults for more information.

    Errors and Failing a Job

    It is recommended to raise an exception on error. The exception will be handled by the job manager and the job will be closed properly, with an appropriate status and message set. For example:

    raise "unable to connect to data source"

    will result in a failed job with the same message as the exception.
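    How the job manager might turn such an exception into a failed status can be sketched like this. The method name run_job and the status hash are illustrative assumptions; the real JobManager records the outcome in the etl_job_status table instead.

    ```ruby
    # Run a job body (given as a block) and record its outcome: 'ok' on
    # success, 'failed' with the exception message otherwise.
    def run_job(job_status)
      job_status[:status] = 'running'
      yield
      job_status[:status] = 'ok'
    rescue => e
      job_status[:status]  = 'failed'
      job_status[:message] = e.message
    ensure
      job_status[:end_date] = Time.now
    end
    ```
    
    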



    Defaults

    Defaults is a configurable key-value dictionary used by ETL jobs as well as by the ETL system itself. The key-value pairs are stored by domains. A domain usually corresponds to a job name; for example, the invoices loading job and the invoices transformation job share the common domain invoices. The domain etl is reserved for ETL system configuration. The purpose of defaults is to be able to configure ETL jobs remotely and in a more convenient way.

    Defaults are stored in the etl_defaults table, which contains: domain, default_key and value.

    ETL System Defaults

    Key                  Description (default value if the key does not exist)

    force_run_all        On the next ETL run, all enabled jobs are launched regardless of their scheduling (see Running ETL Jobs). Default: FALSE

    reset_force_run_flag After running a forced job (see Running ETL Jobs), clear its flag so it will not be run again. Default: TRUE

    Using defaults in jobs

    A job has access to the defaults domain based on the job name. To retrieve a value from the defaults:

    url = defaults[:download_url]
    count = defaults[:count].to_i

    To retrieve a value, or set it to a default value if not found:

    batch_size = defaults.value(:batch_size, 200).to_i

    This will look for the batch_size key; if it does not exist, the key will be created and assigned the value 200.

    To store a default value:

    defaults[:count] = count

    Values are committed when the job finishes.
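    The semantics described above can be sketched with a minimal in-memory stand-in. The real defaults dictionary is backed by the etl_defaults table and scoped to a domain; this class only mimics the [], []= and value behaviour.

    ```ruby
    # In-memory sketch of the defaults dictionary. Values are kept as
    # strings, mirroring the varchar value column of etl_defaults.
    class Defaults
      def initialize(store = {})
        @store = store
      end

      def [](key)
        @store[key]
      end

      def []=(key, value)
        @store[key] = value.to_s
      end

      # Return the value for key; if the key does not exist, create it
      # with the given default and return that default.
      def value(key, default)
        @store[key] = default.to_s unless @store.key?(key)
        @store[key]
      end
    end
    ```
    
    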

    Example:



    @batch_size = defaults.value(:batch_size, 200).to_i
    @download_threads = defaults.value(:download_threads, 10).to_i
    @download_fail_threshold = defaults.value(:download_fail_threshold, 10).to_i



    Appendix: ETL Tables

    etl_jobs

    Column Type Description

    id int object identifier

    name varchar job name

    job_type varchar job type

    is_enabled int flag whether the job is run or not

    run_order int order in which the jobs are run. If several jobs have the same order number, the behaviour is undefined.

    last_run_date datetime date and time when the job was last run

    last_run_status varchar status of last run

    schedule varchar how the job is scheduled

    force_run int force job to be run next time ETL runs

    etl_job_status

    Column Type Description

    id int object identifier

    job_name varchar job name

    job_id int job identifier

    status varchar current or last run status

    phase varchar phase the job is currently in while running, or was in when it finished

    message varchar status message provided by job object or exception message

    start_date datetime when the job was run

    end_date datetime when the job finished



    etl_defaults

    Column Type Description

    id int association id

    domain varchar domain name (usually corresponds to job name)

    default_key varchar key

    value varchar value for key

    etl_batch

    Column Type Description

    id int

    batch_type varchar

    batch_source varchar

    data_source_name varchar

    data_source_url varchar

    valid_due_date date

    batch_date date

    username varchar

    created_at datetime

    updated_at datetime



    Cron Example

    #!/bin/bash
    #
    # ETL cron job script
    #
    # Ubuntu/Debian: put this script in /etc/cron.daily
    # Other unices: schedule appropriately in /etc/crontab

    #####################################################################
    # ETL Configuration

    # Path to your ETL installation
    ETL_PATH=/usr/lib/datacamp-etl

    # Configuration file (database connection and other paths)
    CONFIG=$ETL_PATH/config.yml

    # Ruby interpreter path
    RUBY=/usr/bin/ruby

    #####################################################################

    ETL_TOOL=etl.rb

    $RUBY -I $ETL_PATH $ETL_PATH/$ETL_TOOL --config $CONFIG
