Data Processing with Celery and RabbitMQ
TRANSCRIPT

Data Processing with Python / Celery and RabbitMQ
for the New England Regional Developers (NERD) Summit
Jeff Peck, 9/11/2015
Introduction
Jeff Peck
Senior Software Engineer / Code Ninja
www.esperdyne.com
Esperdyne Technologies, LLC
245 Russell Street, Suite 23
Hadley, MA 01035-9558
The Goal of This Presentation
● Understand the challenges of real-life data processing scenarios
● Consider the possible solutions
● Describe an approach using Python / Celery and RabbitMQ
● Discover how you can process data with Celery, from scratch, by walking through a real example
Agenda
● Background
● The Challenge
● Approaches Considered
● About Celery / Task Queues
● Practical Example: Processing Emails
● Questions
Background
● We process data for ~5 million industrial parts each week
● Data comes from different sources
● Some structured / some unstructured
● Multiple deploy targets: MySQL / FAST ESP
● Database deploy of non-item-specific data (i.e. catalog data, taxonomy data, etc.)
● Metadata processing
● Various dependencies before processing and pushing to production
Background
[Diagram: structured catalog data, unstructured PDF data, and metadata flow into the database and the search index]
The Challenge
● Efficiently process data from multiple sources
● Consider all dependencies
● Deploy to multiple targets in parallel
● Capture the success/failure of each item to be able to generate a report
● Build a process that can be easily triggered to handle all aspects of data processing on a weekly basis
Approaches
● Process everything in separate batches
  – Fine for a small amount of data
  – Lots of manual steps
  – Almost no parallel processing
  – Would take approximately one week to process all data
● Pypes
  – Flow-based programming paradigm
  – “Components” and “Packets”
  – Lacked flexibility to spawn multiple jobs from a single component
“This Calls for Some Celery!”
● Celery: Distributed Task Queue
● Written in Python
● Integrates with RabbitMQ and Redis
● Supports task chaining
● Extremely flexible
● Distributed
  – Can manage multiple queues
● Very active community
  – (over 10k downloads per day)
Celery
● “Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.”
● http://www.celeryproject.org/
● pip install -U Celery
● Supports callbacks and task chaining
● Ideal for processing data from different sources and deploying to multiple targets, while collecting the status of individual items
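The chaining idea can be pictured without any Celery machinery at all: the return value of each step becomes the input of the next, which is exactly what Celery's chain() arranges asynchronously with task signatures. A minimal plain-Python sketch (the step names here are made up for illustration, not part of the talk's code):

```python
# Plain-Python sketch of task chaining: each step's output feeds
# the next step's input. Celery's chain() does the same thing
# asynchronously across workers.
def parse_step(raw):
    # Pretend "parsing": normalize the raw input
    return raw.strip().lower()

def deploy_step(record):
    # Pretend "deploying": wrap the parsed record
    return {"deployed": record}

def run_chain(value, *steps):
    for step in steps:
        value = step(value)
    return value

result = run_chain("  Hello World  ", parse_step, deploy_step)
print(result)  # {'deployed': 'hello world'}
```

With Celery, the same shape is written as signatures, e.g. chain(parse.s(filename), deploy.s())(), and each step runs on whichever worker picks it up.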
What is a Distributed Task Queue?
● A message queue passes, holds, and delivers messages across a system or application
● A task queue is a type of message queue that deals with tasks, such as processing some data
● A distributed task queue combines multiple task queues across systems
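The definitions above can be made concrete with the standard library alone: a queue holds task messages and workers pull them off, decoupling producers from consumers. This is only an in-process illustration of the concept, not how Celery works internally:

```python
import queue
import threading

# In-process illustration of a task queue: tasks are put on a
# queue and consumed by worker threads; a distributed task queue
# spreads the same pattern across machines via a message broker.
task_queue = queue.Queue()
results = []

def worker():
    while True:
        item = task_queue.get()
        if item is None:          # sentinel: shut this worker down
            task_queue.task_done()
            break
        results.append(item * 2)  # "process" the task
        task_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

for n in [1, 2, 3]:
    task_queue.put(n)           # enqueue tasks
for _ in threads:
    task_queue.put(None)        # one sentinel per worker

task_queue.join()
for t in threads:
    t.join()

print(sorted(results))  # [2, 4, 6]
```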
Workers, Brokers, and Backends
● In Celery, a worker executes tasks that are passed to it from the message broker
● The message broker is the service that sends and receives the messages (i.e. the message queue). Celery is compatible with many different brokers such as Redis, MongoDB, IronMQ, etc. We use RabbitMQ.
● A backend is necessary if you want to store the results of tasks or send the states somewhere (i.e. when executing a “group” of tasks)
Practical Example: Processing Emails
● 500k emails recovered from Enron
● Goal is to parse each email and load them into ElasticSearch and MySQL
● We could do this manually in stages, but we want to take full advantage of our resources and minimize our interaction with the process
● We will use Celery, RabbitMQ, and Redis
● All of the source code for this example is available here: https://github.com/esperdyne
Email Processing
[Diagram: Emails → Parse → ElasticSearch and MySQL]
Email Processing: Setup
● Install:
  – RabbitMQ
  – Redis
  – Celery
  – Fabric
  – MySQL
  – ElasticSearch

Install RabbitMQ:
$ sudo apt-get install rabbitmq-server

Install Redis:
$ sudo apt-get install redis-server
$ sudo pip install redis

Install Celery:
$ sudo pip install celery

Install Fabric:
$ sudo pip install fabric

Install ElasticSearch:
$ sudo apt-get install openjdk-7-jre
$ wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list
$ sudo apt-get update && sudo apt-get install elasticsearch
$ sudo update-rc.d elasticsearch defaults 95 10
$ sudo pip install elasticsearch
$ sudo service elasticsearch start

Install MySQL:
$ sudo apt-get install mysql-server
$ sudo apt-get build-dep python-mysqldb
$ sudo pip install MySQL_python
$ sudo pip install sqlalchemy

Make “messages” database:
$ mysql -u root -e "CREATE DATABASE messages"
Email Processing: Setup
● Create a new directory for the project
● Create the proj directory and put an empty __init__.py file in it
● Download the raw Enron emails

$ mkdir celery-message-processing
$ cd celery-message-processing
$ mkdir proj
$ touch proj/__init__.py
$ wget http://www.cs.cmu.edu/~enron/enron_mail_20150507.tgz
$ tar -xvf enron_mail_20150507.tgz
Email Processing: The Celery File
● Inside the proj dir, create a file called celery.py and open it with your favorite text editor (i.e. emacs proj/celery.py)

from __future__ import absolute_import

from celery import Celery

app = Celery('proj',
             broker='amqp://',
             backend='redis://localhost',
             include=['proj.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
Email Processing: The Tasks File
● Now, create another file inside the proj directory called tasks.py and open it for editing.
● Write the following imports:

from __future__ import absolute_import

import email
from sqlalchemy import *
from elasticsearch import Elasticsearch

from celery import Task
from proj.celery import app
Email Processing: Tasks File (cont)

class MessagesTask(Task):
    """This is a celery abstract base class that contains all of
    the logic for parsing and deploying content."""
    abstract = True
    _messages_table = None
    _elasticsearch = None

    def _init_database(self):
        """Set up the MySQL database"""
        db = create_engine('mysql://root@localhost/messages')
        metadata = MetaData(db)
        messages_table = Table('messages', metadata,
            Column('message_id', String(255), primary_key=True),
            Column('subject', String(255)),
            Column('to', String(255)),
            Column('x_to', String(255)),
            Column('from', String(255)),
            Column('x_from', String(255)),
            Column('cc', String(255)),
            Column('x_cc', String(255)),
            Column('bcc', String(255)),
            Column('x_bcc', String(255)),
            Column('payload', Text()))
        messages_table.create(checkfirst=True)
        self._messages_table = messages_table

    def _init_elasticsearch(self):
        """Set up the ElasticSearch instance"""
        self._elasticsearch = Elasticsearch()

...
Email Processing: Tasks File (cont)

...

    def parse_message_file(self, filename):
        """Parse an email file. Return as dictionary"""
        with open(filename) as f:
            message = email.message_from_file(f)
        return {'subject': message.get("Subject"),
                'to': message.get("To"),
                'x_to': message.get("X-To"),
                'from': message.get("From"),
                'x_from': message.get("X-From"),
                'cc': message.get("Cc"),
                'x_cc': message.get("X-cc"),
                'bcc': message.get("Bcc"),
                'x_bcc': message.get("X-bcc"),
                'message_id': message.get("Message-ID"),
                'payload': message.get_payload()}

    def database_insert(self, message_dict):
        """Insert a message into the MySQL database"""
        if self._messages_table is None:
            self._init_database()
        ins = self._messages_table.insert(values=message_dict)
        ins.execute()

    def elasticsearch_index(self, id, message_dict):
        """Insert a message into the ElasticSearch index"""
        if self._elasticsearch is None:
            self._init_elasticsearch()
        self._elasticsearch.index(index="messages",
                                  doc_type="message",
                                  id=id,
                                  body=message_dict)
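The parsing logic above leans entirely on Python's stdlib email package, so it can be exercised in isolation on an in-memory message. The sample headers below are made up for illustration:

```python
import email

# Parse a raw RFC 2822 message from a string instead of a file;
# header access works exactly as in parse_message_file above.
raw = """Message-ID: <123@example.com>
Subject: Quarterly numbers
From: alice@example.com
To: bob@example.com

Please find the numbers attached.
"""

message = email.message_from_string(raw)
print(message.get("Subject"))  # Quarterly numbers
print(message.get_payload())
```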
Email Processing: Tasks File (cont)

@app.task(base=MessagesTask, queue="parse")
def parse(filename):
    """Parse an email file. Return as dictionary"""
    # Call the method in the base task and return the result
    return parse.parse_message_file(filename)

@app.task(base=MessagesTask, queue="db_deploy", ignore_result=True)
def deploy_db(message_dict):
    """Deploys the message dictionary to the MySQL database table"""
    # Call the method in the base task
    deploy_db.database_insert(message_dict)

@app.task(base=MessagesTask, queue="es_deploy", ignore_result=True)
def deploy_es(message_dict):
    """Deploys the message dictionary to the Elastic Search instance"""
    # Call the method in the base task
    deploy_es.elasticsearch_index(message_dict['message_id'], message_dict)
Email Processing: Fabric Script
● I use Fabric to start/stop the Celery workers and to pass the raw emails to be processed
● Make a fabfile.py in the base directory and open it for editing

import os

from fabric.api import local

from celery import chain, group
from celery.task.control import inspect
from proj.tasks import parse, deploy_db, deploy_es
Email Processing: Fabric (cont)

def workers(action):
    """Issue command to start, restart, or stop celery workers"""
    # Prepare the directories for pids and logs
    local("mkdir -p celery-pids celery-logs")
    # Launch 4 celery workers for 4 queues (parse, db_deploy, es_deploy, and default)
    # Each has a concurrency of 2 except the default which has a concurrency of 1
    # More info on the format of this command can be found here:
    # http://docs.celeryproject.org/en/latest/reference/celery.bin.multi.html
    local("celery multi {} parse db_deploy es_deploy celery "
          "-Q:parse parse -Q:db_deploy db_deploy -Q:es_deploy es_deploy -Q:celery celery "
          "-c 2 -c:celery 1 "
          "-l info -A proj "
          "--pidfile=celery-pids/%n.pid --logfile=celery-logs/%n.log".format(action))

● Start/stop the workers with Fabric

Usage example:
$ fab workers:start
$ fab workers:stop
$ fab workers:restart
Email Processing: Fabric (cont)
● Task Chaining

def process_one(filename=None):
    """Enqueues a mail file for processing"""
    res = chain(parse.s(filename), group(deploy_db.s(), deploy_es.s()))()
    print "Enqueued mail file for processing: {} ({})".format(filename, res)

def process(path=None):
    """Enqueues a mail file for processing. Optionally, submitting a
    directory will enqueue all files in that directory"""
    if os.path.isfile(path):
        process_one(path)
    elif os.path.isdir(path):
        for subpath, subdirs, files in os.walk(path):
            for name in files:
                process_one(os.path.join(subpath, name))
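The directory traversal in process() can be checked independently of Celery: build a small tree, walk it the same way, and confirm every file would be enqueued. A stdlib-only sketch (the file names are throwaway examples):

```python
import os
import tempfile

def collect_files(path):
    """Mirror process()'s traversal: yield every file under path,
    or path itself if it is a single file."""
    if os.path.isfile(path):
        yield path
    elif os.path.isdir(path):
        for subpath, subdirs, files in os.walk(path):
            for name in files:
                yield os.path.join(subpath, name)

# Build a tiny tree: root/a.txt and root/inbox/b.txt
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "inbox"))
for rel in ("a.txt", os.path.join("inbox", "b.txt")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("x")

found = sorted(os.path.relpath(p, root) for p in collect_files(root))
print(found)
```

In the real fabfile each yielded path would be handed to process_one(), which enqueues the chain for that file.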
Email Processing: Usage
● To start a build cycle, this is all that you need to do:

$ fab workers:start
$ fab process:maildir
Email Processing: What Next?
● Implement a “chord”:
  – Trigger a task to update an email's status after it has been successfully processed and deployed to MySQL and ElasticSearch
● Handle errors:
  – Write to a special log file every time an error occurs, using a custom error handler
● Reporting:
  – Detect the completion of processing with a scheduled task that confirms that all tasks are complete, and automatically email a report with the number of successful / failed messages
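The reporting idea in the last bullet can be sketched without any Celery machinery: given per-message outcomes (however they were collected), tally successes and failures and format the summary the scheduled task would email. All names and sample data below are hypothetical:

```python
from collections import Counter

def build_report(statuses):
    """Summarize per-message outcomes into the report body that a
    scheduled task might email once processing is complete."""
    counts = Counter(statuses.values())
    return "Processed {} messages: {} succeeded, {} failed".format(
        len(statuses), counts["success"], counts["failure"])

# Hypothetical per-message statuses keyed by Message-ID
statuses = {"<1@enron>": "success",
            "<2@enron>": "failure",
            "<3@enron>": "success"}
print(build_report(statuses))  # Processed 3 messages: 2 succeeded, 1 failed
```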
Email Processing: Try It Yourself
● All of the source code and instructions for this demo are available here:
  https://github.com/esperdyne/celery-message-processing
● Can be used as a boilerplate for an unrelated Celery project
● Fork, experiment, ask questions, etc.
One More Thing: Celery Flower
● There is a tool that provides real-time monitoring for your Celery instance, called “Flower”:
  https://github.com/mher/flower