TRANSCRIPT
Create a serverless architecture for data collection with Python and AWS
9 Apr 2017
David Santucci
About me
David Santucci, Data scientist @ CloudAcademy.com
@davidsantucci
linkedin.com/in/davidsantucci/
Agenda
• Introduction
• Architecture
• Amazon Kinesis Stream
• Amazon Lambda
• Dead Letter Queue (DLQ)
• Conclusions
• Q&A
Introduction
Challenges:
• Collect events from different sources
  • Backend applications
  • Frontend applications
  • Mobile apps
• Store events to different destinations
  • Data Warehouse
  • Third-party services (e.g., HubSpot, Mixpanel, GTM, …)
• Avoid data loss
A serverless architecture
AWS services:
• Kinesis Stream
• Lambda Functions
• SQS
• S3
• Amazon API Gateway
Manage events from multiple sources
Amazon Kinesis Stream
What is Amazon Kinesis Stream?
• Collect and process large streams of data records in real time.
• Typical scenarios for using Streams:
  • Manage multiple producers that push their data feed directly into a stream;
  • Collect real-time analytics and metrics;
  • Process application logs;
  • Create pipelines with other AWS services (the consumers).
from time import gmtime, strftime

import boto3

client = boto3.client(
    service_name="kinesis",
    region_name="us-east-1",
)

for i in range(300):
    print("sending event {}".format(i + 1))
    response = client.put_record(
        StreamName="data-collection-stream",
        Data='{"name":"event-%d","data":{"payload":%d}}' % (i, i),
        PartitionKey=strftime("PK-%Y%m%d-%H%M%S", gmtime()),
    )
    print("response for event {}: {}".format(i + 1, response))
Amazon Kinesis Stream
Amazon Kinesis Stream - Tips
• Use API Gateway as the entry point for front-end and mobile clients.
• Start with a single shard and increase only when needed.
• Output events one by one to avoid data loss.
• Generate the PartitionKey using uuid (e.g., for testing purposes).
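The uuid tip above can be sketched as follows. A timestamp-based key (as in the producer snippet earlier) makes all records sent in the same second hash to the same shard; a random uuid-based key spreads them evenly. The `PK-` prefix and helper name are illustrative, not from the talk:

```python
import uuid

def make_partition_key():
    """Return a random partition key so records spread evenly across shards."""
    return "PK-{}".format(uuid.uuid4().hex)

# Every call yields a distinct key, so two records produced in the
# same second can still land on different shards.
key_a = make_partition_key()
key_b = make_partition_key()
```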
Amazon Lambda
What is AWS Lambda?
• It processes a single event in real time without managing servers.
• Highly scalable.
• Fallback strategy in case of errors.
Amazon Lambda - Events routing
It works as a router and is directly triggered by Kinesis Streams.
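The Kinesis trigger delivers records base64-encoded in a standard event shape; a minimal sketch of how a router handler might decode them before applying the routing rules (the handler body is an assumed reconstruction, not code from the talk):

```python
import base64
import json

def lambda_handler(event, context=None):
    """Decode each Kinesis record into the original JSON event payload."""
    decoded = []
    for record in event.get("Records", []):
        # Kinesis delivers the producer's payload base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(raw))
    return decoded

# Example with a synthetic Kinesis trigger event:
sample = {"Records": [{"kinesis": {
    "data": base64.b64encode(b'{"name":"page_view","data":{"payload":1}}').decode()
}}]}
events = lambda_handler(sample)
```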
[ { "destination_name": "mixpanel", "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod", "enabled_events": [ "page_view", "search", "button_click", "page_scroll", ] }, { "destination_name": "hubspotcrm", "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod", "enabled_events": [ "login", "logout", "registration", "page_view", "search", "email_sent", "email_open", ] }, { "destination_name": "datawarehouse", "destination_arn": "arn:aws:lambda:region:account-id:function:function-name:prod", "enabled_events": [ "login", "logout", "registration", "page_view", "search", "button_click", "page_scroll", "email_sent", "email_open", ] }]
Amazon Lambda - Events routing
It provides the logic to connect to the destination services (e.g., HubSpot, Mixpanel, etc.) and implements a custom retry strategy (with exponential backoff).
Amazon Lambda - Retry strategy

import os
import urllib.parse
import urllib.request

def lambda_handler(event, context=None):
    try:
        hub_id = os.environ['HUBSPOT_HUB_ID']
    except KeyError:
        # A missing configuration cannot be fixed by retrying.
        raise DoNotRetryException('HUBSPOT_HUB_ID')
    event = format_event_data(event, hub_id)
    process_event(event['data'])
    return "ok"

def format_event_data(event, hub_id):
    event_id = event["name"].split(".")[-1].replace("_", " ").title()
    event['data'].update({
        '_a': hub_id,
        '_n': event_id,
        'email': event['data']['_email'],
    })
    return event

@retry
def process_event(params):
    url = 'http://track.hubspot.com/v1/event?{}'.format(urllib.parse.urlencode(params))
    urllib.request.urlopen(url)
Amazon Lambda - Retry strategy
import time

class DoNotRetryException(Exception):
    pass

def retry(func, max_retries=3, backoff_rate=2, scale_factor=.1):
    def func_wrapper(*args, **kwargs):
        attempts = 0
        while True:
            attempts += 1
            try:
                return func(*args, **kwargs)
            except DoNotRetryException:
                raise
            except Exception:
                # Give up after max_retries attempts, re-raising the last error.
                if attempts >= max_retries:
                    raise
                # Exponential backoff between attempts.
                time.sleep(backoff_rate ** attempts * scale_factor)
    return func_wrapper
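Exercised on a flaky call, the backoff behaves as follows. This block re-includes a compact copy of the decorator (with a smaller `scale_factor` so it finishes quickly) to stay self-contained; the `flaky` function is an illustrative stand-in for a destination-service call:

```python
import time

class DoNotRetryException(Exception):
    pass

def retry(func, max_retries=3, backoff_rate=2, scale_factor=.01):
    def func_wrapper(*args, **kwargs):
        attempts = 0
        while True:
            attempts += 1
            try:
                return func(*args, **kwargs)
            except DoNotRetryException:
                raise
            except Exception:
                if attempts >= max_retries:
                    raise
                time.sleep(backoff_rate ** attempts * scale_factor)
    return func_wrapper

calls = {"n": 0}

@retry
def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = flaky()  # succeeds on the third attempt
```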
Amazon Lambda - Our tips
• Enable Kinesis Stream as a trigger for other AWS services.
  • To preserve event ordering, configure the trigger with Batch size: 1 and Starting position: Trim Horizon.
• An S3 file can be used to define the routing rules.
• Invoke Lambda Functions that work as connectors asynchronously.
• Always create aliases and versions for each Function.
• Use environment variables for configurations.
• Create a custom IAM role for each Function.
• Detect delays in stream processing by monitoring the IteratorAge metric in the Lambda console's monitoring tab.
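The "invoke connectors asynchronously" tip maps to Lambda's `Event` invocation type. A hedged sketch (not the talk's code) with the client passed in so the snippet runs offline against a stub; with real credentials the client would be `boto3.client("lambda")`, and the ARN is a placeholder:

```python
import json

def invoke_connector(lambda_client, function_arn, payload):
    """Fire-and-forget invocation: InvocationType='Event' returns immediately
    and lets Lambda retry (and eventually dead-letter) failures on its own."""
    return lambda_client.invoke(
        FunctionName=function_arn,
        InvocationType="Event",
        Payload=json.dumps(payload).encode(),
    )

class _StubLambda:
    """Records invoke calls so the example is runnable without AWS."""
    def __init__(self):
        self.calls = []
    def invoke(self, **kwargs):
        self.calls.append(kwargs)
        return {"StatusCode": 202}  # what Lambda returns for async invokes

stub = _StubLambda()
resp = invoke_connector(stub, "arn:aws:lambda:region:account-id:function:function-name:prod",
                        {"name": "login"})
```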
Dead Letter Queues (DLQ) - Avoid event loss
DLQ - Simple Queue Service (SQS)
What is AWS SQS?
• Lambda automatically retries failed executions for asynchronous invocations.
• Configure Lambda (advanced settings) to forward payloads that were not processed to a dead-letter queue (an SQS queue or an SNS topic).
• We used an SQS queue.
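Besides the console's advanced settings, the DLQ can be attached programmatically via `update_function_configuration` and its `DeadLetterConfig` parameter. A sketch with the client injected so it runs without AWS credentials; the function name and queue ARN are placeholders:

```python
def attach_dlq(lambda_client, function_name, queue_arn):
    """Point the function's dead-letter config at an SQS queue (or SNS topic)."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": queue_arn},
    )

class _StubClient:
    """Echoes the request back so the example is runnable offline;
    a real call would use boto3.client('lambda')."""
    def update_function_configuration(self, **kwargs):
        return kwargs

cfg = attach_dlq(_StubClient(), "events-router",
                 "arn:aws:sqs:us-east-1:123456789012:events-dlq")
```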
import json

import boto3

def get_events_from_sqs(
        sqs_queue_name,
        region_name='us-west-2',
        purge_messages=False,
        backup_filename='backup.jsonl',
        visibility_timeout=60):
    """
    Create a json backup file of all events in the SQS queue
    with the given 'sqs_queue_name'.

    :sqs_queue_name: the name of the AWS SQS queue to be read via boto3
    :region_name: the region name of the AWS SQS queue to be read via boto3
    :purge_messages: True if messages must be deleted after reading, False otherwise
    :backup_filename: the name of the file where to store all SQS messages
    :visibility_timeout: period of time in seconds (unique consumer window)
    :return: the number of processed batches of events
    """
    forwarded = 0
    counter = 0
    sqs = boto3.resource('sqs', region_name=region_name)
    dlq = sqs.get_queue_by_name(QueueName=sqs_queue_name)
    # continues to next slide ..
    # continues from previous slide ..
    with open(backup_filename, 'a') as filep:
        while True:
            batch_messages = dlq.receive_messages(
                MessageAttributeNames=['All'],
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,
                VisibilityTimeout=visibility_timeout,
            )
            if not batch_messages:
                break  # the queue is drained
            for msg in batch_messages:
                try:
                    line = "{}\n".format(json.dumps({
                        'attributes': msg.message_attributes,
                        'body': msg.body,
                    }))
                    print("Line: ", line)
                    filep.write(line)
                    if purge_messages:
                        print('Deleting message from the queue.')
                        msg.delete()
                    forwarded += 1
                except Exception as ex:
                    print("Error in processing message %s: %r" % (msg, ex))
            counter += 1
            print('Batch %d processed' % counter)
    return counter
DLQ - Our tips
• Set a DLQ on each Lambda Function that can fail.
• Re-process events sent to the DLQ with a custom script.
• Tune the DLQ config directly from the Lambda Function panel.
Conclusions
Why a serverless architecture?
• scalability
• prevent data loss
• full control on each step
• costs
Open points:
• Integrate a custom CloudWatch dashboard.
• Configure Firehose for a backup.
• Write a script that manages events sent to DLQs.
• Create a listener for anomaly detection with Kinesis Analytics.
• AWS Step Functions.
Useful links
These slides: Create a serverless architecture for data collection with Python and AWS -> http://clda.co/pycon8-serverless-data-collection

Blog post with code snippets: Building a serverless architecture for data collection with AWS Lambda -> http://clda.co/pycon8-data-collection-blogpost

Serverless Learning Path: Getting Started with Serverless Computing -> http://clda.co/pycon8-serverless-LP