Data Processing with Celery and RabbitMQ
TRANSCRIPT

Data Processing with Python / Celery and RabbitMQ
for the New England Regional Developers (NERD) Summit
Jeff Peck, 9/11/2015
Introduction
Jeff Peck
Senior Software Engineer / Code Ninja
www.esperdyne.com
Esperdyne Technologies, LLC
245 Russell Street, Suite 23
Hadley, MA 01035-9558
The Goal of This Presentation
● Understand the challenges of real-life data processing scenarios
● Consider the possible solutions
● Describe an approach using Python / Celery and RabbitMQ
● Discover how you can process data with Celery, from scratch, by walking through a real example
Agenda
● Background
● The Challenge
● Approaches Considered
● About Celery / Task Queues
● Practical Example: Processing Emails
● Questions
Background
● We process data for ~5 million industrial parts each week
● Data comes from different sources
● Some structured / some unstructured
● Multiple deploy targets: MySQL / FAST ESP
● Database deploy of non-item-specific data (i.e. catalog data, taxonomy data, etc.)
● Metadata processing
● Various dependencies before processing and pushing to production
Background
[Diagram: structured catalog data, unstructured PDF data, and metadata flow into the database and the search index]
The Challenge
● Efficiently process data from multiple sources
● Consider all dependencies
● Deploy to multiple targets in parallel
● Capture the success/failure of each item to be able to generate a report
● Build a process that can be easily triggered to handle all aspects of data processing on a weekly basis
Approaches
● Process everything in separate batches
  – Fine for a small amount of data
  – Lots of manual steps
  – Almost no parallel processing
  – Would take approximately one week to process all data
● Pypes
  – Flow-based programming paradigm
  – “Components” and “Packets”
  – Lacked flexibility to spawn multiple jobs from a single component
“This Calls for Some Celery!”
● Celery: Distributed Task Queue
● Written in Python
● Integrates with RabbitMQ and Redis
● Supports task chaining
● Extremely flexible
● Distributed
  – Can manage multiple queues
● Very active community
  – (over 10k downloads per day)
Celery
● “Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.”
● http://www.celeryproject.org/
● pip install -U Celery
● Supports callbacks and task chaining
● Ideal for processing data from different sources and deploying to multiple targets, while collecting the status of individual items
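The chaining idea can be pictured without any Celery machinery at all: the return value of each step becomes the input of the next, which is exactly what Celery's chain() arranges asynchronously with task signatures. A minimal plain-Python sketch (the step names here are made up for illustration, not part of the talk's code):

```python
# Plain-Python sketch of task chaining: each step's output feeds
# the next step's input. Celery's chain() does the same thing
# asynchronously across workers.
def parse_step(raw):
    # Pretend "parsing": normalize the raw input
    return raw.strip().lower()

def deploy_step(record):
    # Pretend "deploying": wrap the parsed record
    return {"deployed": record}

def run_chain(value, *steps):
    for step in steps:
        value = step(value)
    return value

result = run_chain("  Hello World  ", parse_step, deploy_step)
print(result)  # {'deployed': 'hello world'}
```

With Celery, the same shape is written as signatures, e.g. chain(parse.s(filename), deploy.s())(), and each step runs on whichever worker picks it up.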
What is a Distributed Task Queue?
● A message queue passes, holds, and delivers messages across a system or application
● A task queue is a type of message queue that deals with tasks, such as processing some data
● A distributed task queue combines multiple task queues across systems
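The definitions above can be made concrete with the standard library alone: a queue holds task messages and workers pull them off, decoupling producers from consumers. This is only an in-process illustration of the concept, not how Celery works internally:

```python
import queue
import threading

# In-process illustration of a task queue: tasks are put on a
# queue and consumed by worker threads; a distributed task queue
# spreads the same pattern across machines via a message broker.
task_queue = queue.Queue()
results = []

def worker():
    while True:
        item = task_queue.get()
        if item is None:          # sentinel: shut this worker down
            task_queue.task_done()
            break
        results.append(item * 2)  # "process" the task
        task_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

for n in [1, 2, 3]:
    task_queue.put(n)           # enqueue tasks
for _ in threads:
    task_queue.put(None)        # one sentinel per worker

task_queue.join()
for t in threads:
    t.join()

print(sorted(results))  # [2, 4, 6]
```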
Workers, Brokers, and Backends
● In Celery, a worker executes tasks that are passed to it from the message broker
● The message broker is the service that sends and receives the messages (i.e. the message queue). Celery is compatible with many different brokers such as Redis, MongoDB, IronMQ, etc. We use RabbitMQ.
● A backend is necessary if you want to store the results of tasks or send the states somewhere (i.e. when executing a “group” of tasks)
Practical Example: Processing Emails
● 500k emails recovered from Enron
● Goal is to parse each email and load them into ElasticSearch and MySQL
● We could do this manually in stages, but we want to take full advantage of our resources and minimize our interaction with the process
● We will use Celery, RabbitMQ, and Redis
● All of the source code for this example is available here: https://github.com/esperdyne
Email Processing
[Diagram: Emails → Parse → ElasticSearch and MySQL]
Email Processing: Setup
● Install:
  – RabbitMQ
  – Redis
  – Celery
  – Fabric
  – MySQL
  – ElasticSearch

Install RabbitMQ:
$ sudo apt-get install rabbitmq-server

Install Redis:
$ sudo apt-get install redis-server
$ sudo pip install redis

Install Celery:
$ sudo pip install celery

Install Fabric:
$ sudo pip install fabric

Install ElasticSearch:
$ sudo apt-get install openjdk-7-jre
$ wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list
$ sudo apt-get update && sudo apt-get install elasticsearch
$ sudo update-rc.d elasticsearch defaults 95 10
$ sudo pip install elasticsearch
$ sudo service elasticsearch start

Install MySQL:
$ sudo apt-get install mysql-server
$ sudo apt-get build-dep python-mysqldb
$ sudo pip install MySQL_python
$ sudo pip install sqlalchemy

Make “messages” database:
$ mysql -u root -e "CREATE DATABASE messages"
Email Processing: Setup
● Create a new directory for the project
● Create the proj directory and put an empty __init__.py file in it
● Download the raw Enron emails

$ mkdir celery-message-processing
$ cd celery-message-processing
$ mkdir proj
$ touch proj/__init__.py
$ wget http://www.cs.cmu.edu/~enron/enron_mail_20150507.tgz
$ tar -xvf enron_mail_20150507.tgz
Email Processing: The Celery File
● Inside the proj dir, create a file called celery.py and open it with your favorite text editor (i.e. emacs proj/celery.py)

from __future__ import absolute_import

from celery import Celery

app = Celery('proj',
             broker='amqp://',
             backend='redis://localhost',
             include=['proj.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
Email Processing: The Tasks File
● Now, create another file inside the proj directory called tasks.py and open it for editing.
● Write the following imports:

from __future__ import absolute_import

import email
from sqlalchemy import *
from elasticsearch import Elasticsearch

from celery import Task
from proj.celery import app
Email Processing: Tasks File (cont)

class MessagesTask(Task):
    """This is a celery abstract base class that contains all of
    the logic for parsing and deploying content."""
    abstract = True
    _messages_table = None
    _elasticsearch = None

    def _init_database(self):
        """Set up the MySQL database"""
        db = create_engine('mysql://root@localhost/messages')
        metadata = MetaData(db)
        messages_table = Table('messages', metadata,
            Column('message_id', String(255), primary_key=True),
            Column('subject', String(255)),
            Column('to', String(255)),
            Column('x_to', String(255)),
            Column('from', String(255)),
            Column('x_from', String(255)),
            Column('cc', String(255)),
            Column('x_cc', String(255)),
            Column('bcc', String(255)),
            Column('x_bcc', String(255)),
            Column('payload', Text()))
        messages_table.create(checkfirst=True)
        self._messages_table = messages_table

    def _init_elasticsearch(self):
        """Set up the ElasticSearch instance"""
        self._elasticsearch = Elasticsearch()

...
Email Processing: Tasks File (cont)

...

    def parse_message_file(self, filename):
        """Parse an email file. Return as dictionary"""
        with open(filename) as f:
            message = email.message_from_file(f)
        return {'subject': message.get("Subject"),
                'to': message.get("To"),
                'x_to': message.get("X-To"),
                'from': message.get("From"),
                'x_from': message.get("X-From"),
                'cc': message.get("Cc"),
                'x_cc': message.get("X-cc"),
                'bcc': message.get("Bcc"),
                'x_bcc': message.get("X-bcc"),
                'message_id': message.get("Message-ID"),
                'payload': message.get_payload()}

    def database_insert(self, message_dict):
        """Insert a message into the MySQL database"""
        if self._messages_table is None:
            self._init_database()
        ins = self._messages_table.insert(values=message_dict)
        ins.execute()

    def elasticsearch_index(self, id, message_dict):
        """Insert a message into the ElasticSearch index"""
        if self._elasticsearch is None:
            self._init_elasticsearch()
        self._elasticsearch.index(index="messages",
                                  doc_type="message",
                                  id=id,
                                  body=message_dict)
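The parsing logic above leans entirely on Python's stdlib email package, so it can be exercised in isolation on an in-memory message. The sample headers below are made up for illustration:

```python
import email

# Parse a raw RFC 2822 message from a string instead of a file;
# header access works exactly as in parse_message_file above.
raw = """Message-ID: <123@example.com>
Subject: Quarterly numbers
From: alice@example.com
To: bob@example.com

Please find the numbers attached.
"""

message = email.message_from_string(raw)
print(message.get("Subject"))  # Quarterly numbers
print(message.get_payload())
```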
Email Processing: Tasks File (cont)

@app.task(base=MessagesTask, queue="parse")
def parse(filename):
    """Parse an email file. Return as dictionary"""
    # Call the method in the base task and return the result
    return parse.parse_message_file(filename)

@app.task(base=MessagesTask, queue="db_deploy", ignore_result=True)
def deploy_db(message_dict):
    """Deploys the message dictionary to the MySQL database table"""
    # Call the method in the base task
    deploy_db.database_insert(message_dict)

@app.task(base=MessagesTask, queue="es_deploy", ignore_result=True)
def deploy_es(message_dict):
    """Deploys the message dictionary to the Elastic Search instance"""
    # Call the method in the base task
    deploy_es.elasticsearch_index(message_dict['message_id'], message_dict)
Email Processing: Fabric Script
● I use Fabric to start/stop the Celery workers and to pass the raw emails to be processed
● Make a fabfile.py in the base directory and open it for editing

import os

from fabric.api import local

from celery import chain, group
from celery.task.control import inspect
from proj.tasks import parse, deploy_db, deploy_es
Email Processing: Fabric (cont)

def workers(action):
    """Issue command to start, restart, or stop celery workers"""
    # Prepare the directories for pids and logs
    local("mkdir -p celery-pids celery-logs")
    # Launch 4 celery workers for 4 queues (parse, db_deploy, es_deploy, and default)
    # Each has a concurrency of 2 except the default which has a concurrency of 1
    # More info on the format of this command can be found here:
    # http://docs.celeryproject.org/en/latest/reference/celery.bin.multi.html
    local("celery multi {} parse db_deploy es_deploy celery "
          "-Q:parse parse -Q:db_deploy db_deploy -Q:es_deploy es_deploy -Q:celery celery "
          "-c 2 -c:celery 1 "
          "-l info -A proj "
          "--pidfile=celery-pids/%n.pid --logfile=celery-logs/%n.log".format(action))

● Start/stop the workers with Fabric

Usage example:
$ fab workers:start
$ fab workers:stop
$ fab workers:restart
Email Processing: Fabric (cont)
● Task Chaining

def process_one(filename=None):
    """Enqueues a mail file for processing"""
    res = chain(parse.s(filename), group(deploy_db.s(), deploy_es.s()))()
    print "Enqueued mail file for processing: {} ({})".format(filename, res)

def process(path=None):
    """Enqueues a mail file for processing. Optionally, submitting a
    directory will enqueue all files in that directory"""
    if os.path.isfile(path):
        process_one(path)
    elif os.path.isdir(path):
        for subpath, subdirs, files in os.walk(path):
            for name in files:
                process_one(os.path.join(subpath, name))
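The directory traversal in process() can be checked independently of Celery: build a small tree, walk it the same way, and confirm every file would be enqueued. A stdlib-only sketch (the file names are throwaway examples):

```python
import os
import tempfile

def collect_files(path):
    """Mirror process()'s traversal: yield every file under path,
    or path itself if it is a single file."""
    if os.path.isfile(path):
        yield path
    elif os.path.isdir(path):
        for subpath, subdirs, files in os.walk(path):
            for name in files:
                yield os.path.join(subpath, name)

# Build a tiny tree: root/a.txt and root/inbox/b.txt
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "inbox"))
for rel in ("a.txt", os.path.join("inbox", "b.txt")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("x")

found = sorted(os.path.relpath(p, root) for p in collect_files(root))
print(found)
```

In the real fabfile each yielded path would be handed to process_one(), which enqueues the chain for that file.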
Email Processing: Usage
● To start a build cycle, this is all that you need to do:

$ fab workers:start
$ fab process:maildir
Email Processing: What Next?
● Implement a “chord”:
  – Trigger a task to update an email's status after it has been successfully processed and deployed to MySQL and ElasticSearch
● Handle errors:
  – Write to a special log file every time an error occurs, using a custom error handler
● Reporting:
  – Detect the completion of processing with a scheduled task that confirms that all tasks are complete, and automatically email a report with the number of successful / failed messages
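The reporting idea in the last bullet can be sketched without any Celery machinery: given per-message outcomes (however they were collected), tally successes and failures and format the summary the scheduled task would email. All names and sample data below are hypothetical:

```python
from collections import Counter

def build_report(statuses):
    """Summarize per-message outcomes into the report body that a
    scheduled task might email once processing is complete."""
    counts = Counter(statuses.values())
    return "Processed {} messages: {} succeeded, {} failed".format(
        len(statuses), counts["success"], counts["failure"])

# Hypothetical per-message statuses keyed by Message-ID
statuses = {"<1@enron>": "success",
            "<2@enron>": "failure",
            "<3@enron>": "success"}
print(build_report(statuses))  # Processed 3 messages: 2 succeeded, 1 failed
```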
Email Processing: Try It Yourself
● All of the source code and instructions for this demo are available here:
  https://github.com/esperdyne/celery-message-processing
● Can be used as a boilerplate for an unrelated Celery project
● Fork, experiment, ask questions, etc.
One More Thing: Celery Flower
● There is a tool that provides real-time monitoring for your Celery instance, called “Flower”:
  https://github.com/mher/flower