Intro to Airflow: goodbye cron, welcome scheduled workflow management
Marketing Tools Group
Value-driven “marketing as a service” agency for small business
Best-in-class marketing and productivity tools for small business
IMAGINE YOU WORK FOR A DATA-DRIVEN COMPANY
▸ deals with long-running processes
▸ nightly data loads into the data warehouse
▸ uses a workflow scheduler to coordinate them
C R O N
10 1 * * * echo "hello world" >> hello.log
execute commands or scripts (groups of commands) automatically at a specified time/date
Every 1 minute:   * * * * *
Every 15 minutes: */15 * * * *
Every 30 minutes: */30 * * * *
Every 1 hour:     0 * * * *
Every 6 hours:    0 */6 * * *
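The five fields read minute, hour, day-of-month, month, day-of-week. A sketch of a crontab combining the schedules above with the earlier example (the script paths are hypothetical):

```shell
# ┌ minute  ┌ hour  ┌ day of month  ┌ month  ┌ day of week
# │         │       │               │        │
*/15  *    * * *  /opt/scripts/poll_feed.sh      # every 15 minutes
0     */6  * * *  /opt/scripts/rotate_logs.sh    # every 6 hours
10    1    * * *  echo "hello world" >> hello.log # daily at 01:10
```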
The good old cron scheduler is fine to get started.
However, we found it hard to manage and monitor the status of the jobs.
USING CRON BECAME A HEADACHE
▸ It’s very difficult to add new jobs to a complex crontab.
▸ Hard to debug and maintain; the crontab is just a text file.
▸ No built-in failure handling.
▸ The developer needs to write a program for cron to call.
▸ No scalability.
https://danidelvalle.me/2016/09/12/im-sorry-cron-ive-met-airbnbs-airflow/
“IF I HAD TO BUILD A NEW ETL SYSTEM TODAY FROM SCRATCH, I WOULD USE AIRFLOW.”
- MARTON TRENCSENI
HTTP://BYTEPAWN.COM/LUIGI-AIRFLOW-PINBALL.HTML
- started by Maxime Beauchemin at Airbnb in 2014
- joined the Apache Software Foundation’s incubation program in 2016
A I R F L O W ?
Airflow is a platform to programmatically author, schedule and monitor workflows.
- It’s been built to scale
- Python script (configuration as code)
- active development
- Rich web UI
- In Airflow, a DAG (Directed Acyclic Graph) is a collection of tasks and their dependencies
https://en.wikipedia.org/wiki/Directed_acyclic_graph
- defining a DAG = defining the workflow (Yes! Python code)
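A minimal sketch of such a DAG file, assuming Airflow 1.x import paths; the DAG id, task ids, and schedule are made up for illustration:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2017, 6, 1),
    "retries": 2,                        # retry failed tasks automatically
    "retry_delay": timedelta(minutes=5),
}

# The DAG object ties tasks together; schedule_interval accepts cron syntax.
dag = DAG("nightly_load", default_args=default_args,
          schedule_interval="10 1 * * *")

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

extract >> load  # 'load' runs only after 'extract' succeeds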
[diagram: a DAG composed of tasks]
While DAGs describe how to run a workflow, an operator describes a single task in a workflow.
Airflow is not a data streaming solution: tasks do not move data from one to the other.
O P E R A T O R S
BashOperator - executes a bash command
PythonOperator - calls an arbitrary Python function
EmailOperator - sends an email
HTTPOperator - sends an HTTP request
SqlOperator - executes a SQL command
Sensor - waits for a certain time, file, database row, S3 key, etc…
and more in ….airflow/contrib/ directory
more specific operators: DockerOperator, HiveOperator, S3FileTransformOperator, PrestoToMysqlOperator, SlackOperator
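A hedged sketch of two of these operators in use, again assuming Airflow 1.x import paths (the callable, DAG id, and task ids are invented):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def greet():
    # arbitrary Python callable run by the PythonOperator
    print("hello from airflow")


dag = DAG("operator_demo", start_date=datetime(2017, 6, 1),
          schedule_interval=None)  # None = only triggered manually

say_hi = PythonOperator(task_id="say_hi", python_callable=greet, dag=dag)
list_tmp = BashOperator(task_id="list_tmp", bash_command="ls /tmp", dag=dag)

say_hi >> list_tmp
```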
P R O S
‣ Email notifications of task retries or failures.
‣ Specifying task dependencies is straightforward.
‣ Automatic retry of failed jobs.
‣ A cool DAG visualization, from which you can perform some maintenance.
‣ A powerful CLI, useful to test new tasks or DAGs.
‣ Logging! See the output of each task execution.
‣ Scaling! Integration with Apache Mesos and Celery.
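The CLI point above can be sketched; this assumes Airflow 1.x is installed and uses a hypothetical DAG named `nightly_load` with a task `extract`:

```shell
# Run a single task instance in isolation, without recording state -
# handy when developing a new task.
airflow test nightly_load extract 2017-06-05

# Inspect what the scheduler sees
airflow list_dags
airflow list_tasks nightly_load
```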
E X E C U T I O N  vs  S T A R T  D A T E
ex. Today is 06-05 (June 05, 2017): the actual run starts today, but the execution date refers to the data we want for that day, i.e. the start of the schedule interval the run covers.
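A small pure-Python sketch of this convention, using the dates from the example above: for a daily schedule, the run that starts on June 5 is stamped with execution date June 4, the start of the interval whose data it processes.

```python
from datetime import date, timedelta

schedule_interval = timedelta(days=1)   # a daily DAG
run_starts = date(2017, 6, 5)           # "today" in the example

# Airflow stamps the run with the *start* of the interval it covers:
execution_date = run_starts - schedule_interval
print(execution_date)  # 2017-06-04
```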
HTTP://SITE.CLAIRVOYANTSOFT.COM/SETTING-APACHE-AIRFLOW-CLUSTER/
Single Node
WEBSERVER + SCHEDULER + WORKER, all on one machine
Multi-Node (Cluster)
WEBSERVER + SCHEDULER on a master node, WORKERs on separate nodes
E X E C U T O R
▸ Sequential executor: runs only one task instance at a time.
▸ Local executor: executes tasks locally in parallel.
▸ Celery executor: distributes the execution of task instances to multiple worker nodes.
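The executor is chosen in `airflow.cfg`; a sketch, where the `[core]` section and `executor` key match Airflow 1.x settings but the broker URL is a made-up example:

```ini
[core]
executor = CeleryExecutor

[celery]
; hypothetical Celery broker backing the distributed workers
broker_url = redis://localhost:6379/0
```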
▸ Documentation: https://airflow.incubator.apache.org/
▸ Install Documentation: https://airflow.incubator.apache.org/installation.html
▸ GitHub Repo: https://github.com/apache/incubator-airflow