luigi presentation oa summit
OA NYC Summit
Data Workflows at Foursquare using Luigi
Foursquare
• 35 million users
• Nearly 4 billion check-ins
• More than 5 million check-ins per day
• 50 million point-of-interest database
• 100's of GB of log data per day
Tools We Use
• Hive
  o Ad hoc analytics, data dumping ground
• Raw MapReduce
  o 100's of MapReduce jobs in our codebase
• Pig
  o Fits between structured Hive and free-form MapReduce
• Vertica
  o Low-latency analytics
Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
Cron - Problems
• Brittle
• Hard to reason about / visualize
• Spend a lot of time waiting
• Difficult to tell what succeeded or failed
• No one likes writing Bash scripts
Oozie
XML-based Workflow Engine, with support for Hadoop, Hive, and Pig
Workflows specify computations as a DAG, e.g. "Run this Hive query, then run these two MapReduce jobs in parallel"
Coordinators launch recurring workflows at a given frequency, when dependent data is available
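To make the DAG idea concrete, here is a small Python sketch (not Oozie's XML, and not any Oozie API) of the quoted workflow. The step names are hypothetical, and it assumes Python 3.9+ for the standard-library graphlib module. Steps that land in the same wave have no unmet dependencies and could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Group steps into waves; every step in a wave is ready at the same time."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


waves = execution_waves(workflow)
```

Here the Hive query forms the first wave and the two MapReduce jobs form the second.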
Oozie - Example
Oozie - Problems
• Workflows are all-or-nothing
  o Cannot just run the step that failed
  o Very little code reuse
• Little to no extensibility
• Limited control flow
• Extremely verbose
• Difficult to test
• No one likes writing XML
Luigi
• Python framework for batch processing jobs
• Created by Spotify, open-sourced Sept. 2012
• Tasks are units of work that produce Targets
• Tasks can depend on one or more other Tasks
• A Task is only run once all of the Tasks it depends on are done
• Tasks are idempotent
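The points above can be sketched with a toy model in plain Python. This is not Luigi itself: the method names (requires, output, run, complete, exists) mirror Luigi's, but FileTarget, build, WriteWords and CountWords are hypothetical names invented for this illustration, and the real framework does far more. The sketch assumes a Task counts as done when its output file exists.

```python
import os
import tempfile
from collections import Counter


class FileTarget:
    """Toy Target: a file path that counts as done once the file exists."""
    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Toy Task: a unit of work that produces one Target."""
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def complete(self):
        return self.output().exists()


class WriteWords(Task):
    def output(self):
        return FileTarget(os.path.join(self.workdir, "words.txt"))

    def run(self):
        with open(self.output().path, "w") as f:
            f.write("luigi hive pig luigi\n")


class CountWords(Task):
    def requires(self):
        return [WriteWords(self.workdir)]

    def output(self):
        return FileTarget(os.path.join(self.workdir, "counts.txt"))

    def run(self):
        with open(self.requires()[0].output().path) as f:
            counts = Counter(f.read().split())
        with open(self.output().path, "w") as f:
            for word, n in sorted(counts.items()):
                f.write(f"{word}\t{n}\n")


def build(task, log):
    """Run dependencies first; skip any Task whose Target already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()
log = []
build(CountWords(workdir), log)  # runs WriteWords, then CountWords
build(CountWords(workdir), log)  # does nothing: the output already exists
```

The second build call is where idempotency shows up: because completion is judged by the Target, asking for the same Task again repeats no work.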
Luigi - Example Task
Luigi - Running the Task
$ python word-count.py WordCount --date 2013-06-01
Luigi - Scheduler
Central scheduler ensures each Task is only run by a single worker.
A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
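A toy sketch of the scheduling rules above, in plain Python. This is not Luigi's scheduler or its API: ToyScheduler, claim, report and retry_delay are hypothetical names, time is passed in as a plain number, and failure emails are left out. It only illustrates identifying a task by class name plus parameters, letting a single worker claim it, and allowing a retry once the configured delay has passed.

```python
class ToyScheduler:
    """Hypothetical in-memory scheduler; times are plain numbers (seconds)."""

    def __init__(self, retry_delay):
        self.retry_delay = retry_delay
        # task_id -> (status, extra); extra is the worker name while RUNNING
        # or the earliest retry time once FAILED
        self.state = {}

    @staticmethod
    def task_id(class_name, **params):
        """Identify a task by its class name and its parameters."""
        args = ", ".join(f"{k}={v}" for k, v in sorted(params.items()))
        return f"{class_name}({args})"

    def claim(self, task_id, worker, now):
        """Return True if this worker may run the task, False otherwise."""
        status, extra = self.state.get(task_id, ("PENDING", None))
        if status in ("RUNNING", "DONE"):
            return False  # another worker has it, or it is already finished
        if status == "FAILED" and now < extra:
            return False  # failed recently; retry delay has not elapsed yet
        self.state[task_id] = ("RUNNING", worker)
        return True

    def report(self, task_id, ok, now):
        """Record the outcome of a run."""
        if ok:
            self.state[task_id] = ("DONE", None)
        else:
            self.state[task_id] = ("FAILED", now + self.retry_delay)


sched = ToyScheduler(retry_delay=60)
tid = ToyScheduler.task_id("WordCount", date="2013-06-01")
first = sched.claim(tid, "worker-1", now=0)       # worker-1 gets the task
second = sched.claim(tid, "worker-2", now=1)      # refused: already running
sched.report(tid, ok=False, now=5)                # the run fails at t=5
too_early = sched.claim(tid, "worker-2", now=30)  # refused: retry allowed at t=65
retried = sched.claim(tid, "worker-2", now=65)    # allowed again
```

Because the identifier is derived only from the class name and the parameters, two workers asking for WordCount with the same date are talking about the same task.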
Luigi - Visualizer
Luigi - Advantages over Cron
• Explicit dependencies
• No wasted time waiting
• Easy to tell what has failed
• Avoid duplicate work / partial failures
Luigi - Advantages over Oozie
• Explicit dependencies between workflows
• Easier to write
• Vastly more extensible
• Code reuse
• Can easily re-run individual steps
Thank you!
Check out Luigi: https://github.com/spotify/luigi
Drop me a line: Joe Ennever