Migrating the plot2txt processing pipeline to AWS
bill brouwer / plot2txt.com
http://www.plot2txt.com 01/16
Overview
● The 'backend' processing flow of plot2txt is now a cloud service available via AWS
● AWS offers a rich technology stack; the following is used in this work:
– Processing algorithms/services → lambda functions
– NoSQL DB for saving meta-data etc. → dynamoDB
– Logs → CloudWatch
– Storage → S3
– Access control(s) → IAM
Architecture
● Three lambda functions are driven by uploads to three different S3 buckets (0 → 2), i.e., when an object is placed in a bucket, the appropriate lambda function is triggered
[Architecture diagram: S3 buckets S3-0 through S3-3, an Up/Down component, and dynamoDB input and output tables]
Setup
● Local development environment:
>uname -a
Linux bill-ThinkPad-W530 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
● Create an AWS account, launch web console:
Setup
● Assuming python (e.g., 2.7) is available, install the AWS command line interface and configure it after producing admin creds at the web console (click on the IAM link and follow the directions for adding a new user with ADMIN privileges)
● The folder ~/.aws must exist after this step, containing the desired config and credentials files
>sudo pip install awscli
>aws configure
AWS Access Key ID [****************KHAA]:
AWS Secret Access Key [****************dTr0]:
Default region name [us-east-1]:
Default output format [json]:
>ls ~/.aws
config  credentials
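For reference, `aws configure` writes these two files under ~/.aws; a sketch of their layout (values are placeholders, not real keys):

```ini
# ~/.aws/config
[default]
region = us-east-1
output = json

# ~/.aws/credentials
[default]
aws_access_key_id = xxxxxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxx
```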
Setup
● Also need to establish:
– A PHP dev environment (for serving upload/browse web pages)
● Create a composer.json file in the PHP project dir:
{
    "require": {
        "aws/aws-sdk-php": "2.*"
    }
}
● Install composer & create the environment (a vendor directory should appear):
>curl -sS https://getcomposer.org/installer | php
>php composer.phar install
● Run Apache/PHP locally for testing (sudo cp the test pages to /var/www/html etc.)
● All PHP files include app/start.php with creds:
Setup
use Aws\S3\S3Client;
require 'vendor/autoload.php';

$s3 = S3Client::factory(array(
    'region' => 'us-east-1',
    'version' => 'latest',
    'credentials' => array(
        'key' => 'xxxxxx',
        'secret' => 'xxxxxx',
    )
));
dynamoDB
● Using the CLI or web console, create the dynamoDB tables
dynamoDB
● Five key tables for p2t processing flow:
– dailyQuota → track upload size on a daily basis
– userQuota → for use with table above
– outputFiles → output meta-data from the processing flow (last lambda function)
● User key, time, input file key, size, output filename, URL for download
– processingJobs → meta-data from the input side (first lambda function)
● User key, time, input file key, size, processing files
– uploadDetails → meta-data from the point of upload
● User key, original filename, new random string key
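The outputFiles table is later queried by email (hash key) and time (range key), so its definition presumably looks something like the following; this is a sketch inferred from that query, with illustrative throughput values, not the actual table definition:

```json
{
  "TableName": "outputFiles",
  "AttributeDefinitions": [
    { "AttributeName": "email", "AttributeType": "S" },
    { "AttributeName": "time",  "AttributeType": "N" }
  ],
  "KeySchema": [
    { "AttributeName": "email", "KeyType": "HASH" },
    { "AttributeName": "time",  "KeyType": "RANGE" }
  ],
  "ProvisionedThroughput": { "ReadCapacityUnits": 1, "WriteCapacityUnits": 1 }
}
```

A file like this can be fed to the CLI via `aws dynamodb create-table --cli-input-json file://outputFiles.json`.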
dynamoDB
● Example: dailyQuota logging (PHP upload method)
$client = $sdk->createDynamoDb();
$result = $client->putItem(array(
    'TableName' => 'dailyQuota',
    'Item' => array(
        'user' => array('S' => $user),
        'time' => array('N' => (string) $t),
        'size' => array('N' => (string) $cumulative_size)
    )
));
dynamoDB
● Example: output details browsing (PHP browse method)
$client = $sdk->createDynamoDb();
// milliseconds
$t = strtotime("-2 days") * 1000;
$iterator = $client->getIterator('Query', array(
    'TableName' => 'outputFiles',
    'KeyConditions' => array(
        'email' => array(
            'AttributeValueList' => array(array('S' => 'user_handle')),
            'ComparisonOperator' => 'EQ'
        ),
        'time' => array(
            'AttributeValueList' => array(array('N' => (string) $t)),
            'ComparisonOperator' => 'GT'
        )
    )
));
Lambda functions
● Lambdas can be created from the command line, e.g.,
> aws lambda create-function --region us-east-1 \
    --function-name CreateThumbnail \
    --zip-file fileb://textTN.zip \
    --role arn:aws:iam::4856xxxxxxxx:role/lambda_s3_exec_role \
    --handler CreateThumbnail.handler \
    --runtime nodejs --timeout 10 --memory-size 1024
● Billing is a function of the memory-size used and the execution time (<= timeout)
● The region must be consistent with the S3 buckets and any other resources used, e.g., dynamoDB
● Obviously the S3 buckets and other resources must be created first
– Pay particular attention to access controls for S3; e.g., it is easy to make buckets publicly available via a simple URL, which may not be what you want :)
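The bucket-to-function wiring itself is a notification configuration set on the bucket; a sketch, with the function ARN built from the role ARN above (account id masked as in the slide) and an illustrative bucket name:

```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:4856xxxxxxxx:function:CreateThumbnail",
      "Events": ["s3:ObjectCreated:*"]
    }
  ]
}
```

Applied with `aws s3api put-bucket-notification-configuration --bucket <input-bucket> --notification-configuration file://notify.json`; the function also needs an `aws lambda add-permission` grant so that S3 may invoke it.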
Lambda functions
● For initial effort(s), web console is more helpful:
– Upload zip file for function, or point to S3 location for large zip files
– Configure test event
– Debug from logged output
– Quickly change timeout length/memory consumed
Lambda functions
● eg., make a new test event
Lambda functions
● eg., edit the event details
Lambda functions
● Check CloudWatch logs for problems; common issues:
– Permissions
– Timeout
– Missing dependencies
Lambda functions
● Pay attention to roles & policies; the simple S3 access role will need updating if, e.g., the lambda function accesses dynamoDB
– Use the IAM console to edit an existing role/policy, or create a new one
Lambda functions
● Lambdas can timeout/fail to complete for a variety of reasons, e.g.,
– A node.js module or [something else] is unavailable
– Premature termination
● The body of the (node.js) function must set the state of the context for successful termination, e.g.,
exports.handler = function(event, context) {
    context.done();
};
● The async nature of node.js program control/flow is liable to cause some consternation; two npm modules make development easier and help avoid timeouts/premature termination:
– callback-count
– async-waterfall
Lambda functions
● callback-count → track your callbacks, only proceeding when a set number have completed (works like a thread barrier)
– https://www.npmjs.com/package/callback-count
// from the webpage
// initialize
var counter = callbackCount(3, done);
// use throughout callbacks
counter.next(); counter.next(); counter.next();
// once the limit specified in callbackCount is reached, execute the following
function done() { callback(null, arg1); }
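The barrier pattern itself is small; a hand-rolled sketch of the same idea (makeCounter is a hypothetical stand-in, not the module's API):

```javascript
// Minimal stand-in for the callback-count barrier pattern:
// fire `done` only after `limit` calls to next().
function makeCounter(limit, done) {
  var count = 0;
  return {
    next: function () {
      count += 1;
      if (count === limit) done();
    }
  };
}

// Three simulated async completions; done() runs only after the third next()
var finished = false;
var counter = makeCounter(3, function () { finished = true; });
counter.next();
counter.next();
console.log('finished after two calls? ' + finished);   // still false
counter.next();
console.log('finished after three calls? ' + finished); // now true
```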
Lambda functions
● async-waterfall → Run asynchronous tasks, cascaded together
– https://www.npmjs.com/package/async-waterfall
//from the webpage
waterfall([
function(callback){
callback(null, 'one', 'two');
},
function(arg1, arg2, callback){
callback(null, 'three');
},
function(arg1, callback){
// arg1 now equals 'three'
callback(null, 'done');
}
], function (err, result) {
// result now equals 'done'
});
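For intuition, the cascade can be hand-rolled in a few lines; this is a sketch of the same control flow, not the async-waterfall source:

```javascript
// Minimal waterfall: run tasks in order, passing each task's results
// (minus the error) as arguments to the next; stop on the first error.
function waterfall(tasks, finalCallback) {
  var i = 0;
  function next(err) {
    var args = Array.prototype.slice.call(arguments, 1);
    if (err || i === tasks.length) {
      return finalCallback.apply(null, [err].concat(args));
    }
    var task = tasks[i++];
    task.apply(null, args.concat([next]));
  }
  next(null);
}

// Same shape as the async-waterfall example above
waterfall([
  function (callback) { callback(null, 'one', 'two'); },
  function (arg1, arg2, callback) { callback(null, 'three'); },
  function (arg1, callback) { callback(null, 'done'); }
], function (err, result) {
  console.log(result); // 'done'
});
```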
Lambda functions
● You can work out the particular lambda instance details by (e.g.,) running some basic linux/unix utilities in a bash script called from index.js, uploaded with the lambda function:
var exec = require('child_process').exec;
var cmd = './my_bash_script.sh';
exec(cmd, function (error, stdout, stderr) {
    console.log(stdout);
    console.log(stderr);
    if (error !== null) {
        console.log(error);
    } else {
        callback(null, 'bash script complete');
    }
});
Lambda functions
● Typical image details:
>cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>cat /proc/meminfo | grep "MemTotal"
MemTotal: 3858728 kB
>uname -a
Linux ip-10-0-89-9 3.14.48-33.39.amzn1.x86_64 #1 SMP Tue Jul 14 23:43:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>pwd
/var/task
Lambda functions
● Points to note:
– Lambda compute resources appear to be sparse; however, they are very cost effective and perfect for a large number of short-running tasks (<= 300s)
– Some unix/linux tools are curiously absent (e.g., bc, zip); however, the kernel is similar to stock EC2 instances, so missing utilities can be obtained from (e.g.,) a test EC2 instance
– Utilities, bash scripts, required node modules, and any executables wrapped in bash/nodejs etc. must all be zipped up and supplied together
– Assume very little about the instances used for lambda functions
Lambda functions
– Consider using the node-lambda-template for rapid development and testing : https://github.com/motdotla/node-lambda-template
– The lambda function working directory appears to be /var/task; the only disk with write permission is /tmp
– If a lambda function fails, there are multiple subsequent attempts, and thus costs incurred
– There is no state on the instance, i.e., files are not persistent; use dynamoDB or S3, for example ...
Lambda functions
● Example: loop over output files, upload to final bucket
fs.readdir("/tmp", function(err, files) {
    if (err) {
        console.log(err);
        return;
    }
    files.forEach(function(f) {
        if (pth.extname(f) == '.zip') {
            AWS.config.region = 'us-east-1';
            var table = "outputFiles";
            var newLabel = f; // assumed: the upload key is the output filename
            console.log('uploading: ' + newLabel);
            var body = fs.createReadStream("/tmp/" + newLabel);
            var s3_out = new AWS.S3({params: {Bucket: 'output', Key: newLabel}});
            s3_out.upload({Body: body}).
                on('httpUploadProgress', function(evt) { console.log(evt); }).
                send(function(err, data) {
                    console.log(err, data);
                    counter.next();
                });
        }
    });
});
Lambda functions
● Example: output meta-data produced by the last lambda function, put into the outputFiles table:
var params = {
    TableName: 'outputFiles',
    Item: {
        "time": time,
        "email": putEmail,
        "Info": {
            "id": globalLabel,
            "file": newLabel,
            "url": url,
            "base": realFile
        }
    }
};
// assumed: written with the DocumentClient, matching the plain
// (untyped) attribute values above
var docClient = new AWS.DynamoDB.DocumentClient();
docClient.put(params, function(err, data) {
    if (err) console.log(err);
});
Lambda functions
● Example: download link created using a pre-signed URL, generated in the lambda function / nodejs:
// expire in 2 days
var exp = 3600 * 24 * 2;
var url_params = {Bucket: 'my_bucket', Key: 'object_key', Expires: exp};
var url = s3.getSignedUrl('getObject', url_params);
Summary
● AWS provides many tools, objects, and APIs for complete cloud-based solutions
● Relatively intuitive and easy to develop with
● Event-driven lambda functions coupled with storage (S3) and a NoSQL database (dynamoDB) provide a powerful backend
● Development time and costs are miniscule compared to alternatives ...
Billing
● Cost for this development work thus far :