Deploying deep learning models: a platform-agnostic approach for production with Docker + Kubernetes

Upload: Petteri Teikari, PhD

Posted on 15-Apr-2017


TRANSCRIPT

Page 3: Deploying deep learning models with Docker and Kubernetes

DOCKER

https://www.docker.com/what-docker

[Slide diagram: e.g. an ASUS ESC8000 G3 local server and a "lock-in"-less cloud service, for inference, i.e. processing customer queries via an API. EXACTLY THE SAME MODEL runs both locally at the office and in the cloud.]

https://www.docker.com/survey-2016

Page 4: Deploying deep learning models with Docker and Kubernetes

DOCKER Deep learning?

Unfortunately that is wrong for deep learning applications. For any serious deep learning application, you need NVIDIA graphics cards; otherwise it could take months to train your models. NVIDIA requires both the host driver and the docker image's driver to be exactly the same. If the version is off by a minor number, you will not be able to use the NVIDIA card; it will refuse to run. I don't know how much of the binary code changes between minor versions, but I would rather have the card try to run instructions and get a segmentation fault than die because of a version mismatch.

We build our docker images based off the NVIDIA card and driver along with the software needed. We essentially have the same docker image for each driver version. To help manage this, we have a test platform that makes sure all of our code runs on all the different docker images.

This issue is mostly in NVIDIA's court; they can modify their drivers to work across different versions. I'm not sure if there is anything that Docker can do on their side. I think it's something they should figure out, though; the combination of docker and deep learning could help a lot more people get started faster, but right now it's an empty promise.

http://www.somatic.io/blog/docker-and-deep-learning-a-bad-match
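The passage above describes keeping one image per driver version; a minimal sketch of what such a launch guard might look like, assuming the Docker SDK for Python. The `nvidia_driver` image label and the tag below are our own hypothetical conventions, not an NVIDIA or Docker standard:

```python
# Hypothetical guard: refuse to launch a CUDA container unless the host
# driver matches the driver version the image was built against.
import subprocess

import docker  # Docker SDK for Python


def host_driver_version():
    # nvidia-smi ships with the host driver and reports its version.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    return out.decode().strip().splitlines()[0]


def image_driver_version(client, tag):
    # Assumes images were labeled with their driver version at build time.
    return client.images.get(tag).labels.get("nvidia_driver")


client = docker.from_env()
tag = "ourcompany/dl-model:driver-367.57"  # hypothetical per-driver image tag
if host_driver_version() != image_driver_version(client, tag):
    raise RuntimeError("host/image NVIDIA driver mismatch; refusing to run")
```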

The biggest impact on data science right now is not coming from a new algorithm or statistical method. It’s coming from Docker containers. Containers solve a bunch of tough problems simultaneously: they make it easy to use libraries with complicated setups; they make your output reproducible; they make it easier to share your work; and they can take the pain out of the Python data science stack.

The wonderful triad of Docker: “Isolation! Portability! Repeatability!” There are numerous use cases where Docker might just be what you need, be it Data Analytics, Machine Learning or AI.

Page 5: Deploying deep learning models with Docker and Kubernetes

DOCKERize everything as microservices

.pwc.com/us/en/technology-forecast/2014

http://www.slideshare.net/RichardHarvey7/micro-services-and-containers

(ARC401) Cloud First: New Architecture for New Infrastructure, Amazon Web Services, slideshare.net/AmazonWebServices

Page 6: Deploying deep learning models with Docker and Kubernetes

Why Microservices?
Why run microservices using Docker and Kubernetes?
Posted by: Seth Lakowske, Published: 2016-04-25
http://sethlakowske.com/articles/why-run-docker-containers-and-kubernetes/

Benefits of microservices
1) Code can be broken out into smaller microservices that are easier to learn, release and update.
2) Individual microservices can be written using the best tools for the job.
3) Releasing a new service doesn't require synchronization across a whole company.
4) New technology stacks have lower risk since the service is relatively small.
5) Developers can run containers locally, rebuilding and verifying after each commit on a system that mirrors production.
6) Both Docker and Kubernetes are open source and free to use.
7) Access to Docker Hub leverages the work of the open source community.
8) Service isolation without the heavyweight VM. Adding a service to a server does not affect other services on the server.
9) Services can be more easily run on a large cluster of nodes, making them more reliable.
10) Some clients will only host in private and not on public clouds.
11) Lends itself to immutable infrastructure, so services are reloadable without missing state when a server goes down.
12) Immutable containers improve security since data can only be mutated in specified volumes; rootkits often can't be installed even if the system is penetrated.
13) Increasing support for new hardware, like the GPU in a container, means even GPGPU tasks like deep learning can be containerized.
14) There is a cost to running microservices: the build and runtime become more complex. This is part of the price to pay, and if you've made the right decision in your context, the benefits will exceed the costs.

Costs of microservices
• Managing multiple services tends to be more costly.
• New ways for network and servers to fail.

Conclusion
In the right circumstances, the benefits of microservices outweigh the extra cost of management.

events.linuxfoundation.org, Frank Zhao

https://www.nextplatform.com/2016/09/13/will-containers-total-package-hpc/

Page 7: Deploying deep learning models with Docker and Kubernetes

Docker vs AWS Lambda “In General”

AWS Lambda will win - sort of..... From a programming model and a cost model, AWS Lambda is the future - despite some of the tooling limitations. Docker in my opinion is an evolutionary step of "virtualization" that we've been seeing for the last 10 years. AWS Lambda is a step-function. In fact, I personally think it is innovations like Amazon Elastic Beanstalk and CloudFormation that have pushed the demand for solutions like Docker. In the near future, I predict that open source will catch up and provide an AWS Lambda experience on top of Docker containers. Iron.io is open source and appears to be going down this path.

Florian Walker, Product Manager at Fujitsu:

The future is now :) Funktion, part of Fabric8, aims to provide a Lambda experience on top of Kubernetes -> https://github.com/fabric8io/funktion

Jason Daniels, CTO, Fujitsu Hybrid Cloud EMEIA:

Project Kratos .. https://www.iron.io/introducing-aws-lambda-support/

https://www.quora.com/Are-there-any-alternatives-to-Amazon-Lambda

Funktion is an open source, event-driven, lambda-style programming model on top of Kubernetes. A funktion is a regular function in any programming language bound to a trigger deployed into Kubernetes. Then Kubernetes takes care of the rest (scaling, high availability, load balancing, logging and metrics, etc.).

Funktion supports hundreds of different trigger endpoint URLs including most network protocols, transports, databases, messaging systems, social networks, cloud services and SaaS offerings. In a sense, funktion is a serverless approach to event-driven microservices, as you focus on just writing funktions and Kubernetes takes care of the rest. It's not that there are no servers; it's more that you as the funktion developer don't have to worry about managing them.
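To illustrate how small a funktion author's own code can be, here is a hedged sketch of a lambda-style handler in Python. Flask stands in for the HTTP trigger that the platform would normally bind for you, so the routing below is illustrative only:

```python
# A "funktion" is just an ordinary function; the platform binds the trigger.
from flask import Flask, jsonify, request

app = Flask(__name__)


def handle(payload):
    # The only code you would actually write as a funktion author.
    return {"echo_bytes": len(payload)}


@app.route("/", methods=["POST"])
def trigger():
    # Stand-in for the trigger binding the platform provides.
    return jsonify(handle(request.get_data()))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```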

Announcing Project Kratos

I’m happy to announce that Project Kratos is now available in beta. Iron.io is rolling out a set of tools that allow you to convert AWS Lambda functions into Docker images. Now, you can import existing Lambda functions and run them via any container orchestration system. You can also create new Lambda functions and quickly package them up in a container to run on other platforms. All three of the AWS runtimes are supported – Node.js, Python and Java.

Page 8: Deploying deep learning models with Docker and Kubernetes

Docker Issues: Size

Docker containers quickly grow in size as they need to contain everything required for deployment.

http://blog.xebia.com/create-the-smallest-possible-docker-container/

https://www.ctl.io/developers/blog/post/optimizing-docker-images/

“Docker images can get really big. Many are over 1G in size. How do they get so big? Do they really need to be this big? Can we make them smaller without sacrificing functionality?

“Here at CenturyLink we've spent a lot of time recently building different docker images. As we began experimenting with image creation, one of the things we discovered was that our custom images were ballooning in size pretty quickly (it wasn't uncommon to end up with images that weighed in at 1GB or more). Now, it's not too big a deal to have a couple gigs worth of images sitting on your local system, but it becomes a bit of a pain as soon as you start pushing/pulling these images across the network on a regular basis.”

https://blog.replicated.com/2016/02/05/refactoring-a-dockerfile-for-image-size/

“There’s been a welcome focus in the Docker community recently around image size. Smaller image sizes are being championed by Docker and by the community. When many images clock in at multi-100 MB and ship with a large ubuntu base, it’s greatly needed.”

https://ypereirareis.github.io/blog/2016/02/15/docker-image-size-optimization/

https://github.com/microscaling/imagelayers-graph

ImageLayers.io is a project maintained by Microscaling Systems since September 2016. The project was developed by the team at CenturyLink Labs. This utility provides a browser-based visualization of user-specified Docker Images and their layers. This visualization provides key information on the composition of a Docker Image and any commonalities between them. ImageLayers.io allows Docker users to easily discover best practices for image construction, and aid in determining which images are most appropriate for their specific use cases.
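In the same spirit as ImageLayers.io, a quick local sketch that walks an image's layer history to show where the megabytes come from; it assumes the Docker SDK for Python and a locally available image tag:

```python
# Print per-layer sizes of a local Docker image, newest layer first.
import docker

client = docker.from_env()
image = client.images.get("ubuntu:16.04")  # any locally available tag

total = 0
for layer in image.history():
    size = layer.get("Size", 0)
    total += size
    created_by = (layer.get("CreatedBy") or "")[:60]  # the Dockerfile step
    print("%8.1f MB  %s" % (size / 1e6, created_by))
print("%8.1f MB total" % (total / 1e6))
```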


Page 9: Deploying deep learning models with Docker and Kubernetes

What is lambda architecture anyway?

https://www.oreilly.com/ideas/questioning-the-lambda-architecture

The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and Storm or similar systems. This has proven to be a surprisingly popular idea, with a dedicated website and an upcoming book.

The way this works is that an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer. There are a lot of variations on this.

The Lambda Architecture is aimed at applications built around complex asynchronous transformations that need to run with low latency (say, a few seconds to a few hours). A good example would be a news recommendation system that needs to crawl various news sources, process and normalize all the input, and then index, rank, and store it for serving.

I like that the Lambda Architecture emphasizes retaining the input data unchanged. I think the discipline of modeling data transformation as a series of materialized stages from an original input has a lot of merit. I also like that this architecture highlights the problem of reprocessing data (processing input data over again to re-derive output).

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable. Ultimately, even if you can avoid coding your application twice, the operational burden of running and debugging two systems is going to be very high. And any new abstraction can only provide the features supported by the intersection of the two systems. Worse, committing to this new uber-framework walls off the rich ecosystem of tools and languages that makes Hadoop so powerful (Hive, Pig, Crunch, Cascading, Oozie, etc).

Kappa Architecture is a simplification of Lambda Architecture. A Kappa Architecture system is like a Lambda Architecture system with the batch processing system removed. To replace batch processing, data is simply fed through the streaming system quickly.

Kappa Architecture revolutionizes database migrations and reorganizations: just delete your serving layer database and populate a new copy from the canonical store! Since there is no batch processing layer, only one set of code needs to be maintained.

kappa-architecture.com
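A toy sketch of the Kappa idea in plain Python: there is a single processing code path, and a "migration" is just replaying the immutable log into a fresh serving table:

```python
# One processing function; reprocessing = replaying the canonical log.
from collections import defaultdict

log = [("page1", 1), ("page2", 1), ("page1", 1)]  # canonical, append-only

def process(events):
    # The one implementation; no separate batch layer to keep in sync.
    counts = defaultdict(int)
    for key, n in events:
        counts[key] += n
    return dict(counts)

serving_v1 = process(log)   # current serving table
# Processing logic changed? Drop the table and replay the whole log:
serving_v2 = process(log)   # repopulated from the canonical store
print(serving_v2)           # {'page1': 2, 'page2': 1}
```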

CHALLENGING THE LAMBDA ARCHITECTURE: BUILDING APPS FOR FAST DATA WITH VOLTDB V5.0
dataconomy.com

VoltDB is an ideal alternative to the Lambda Architecture’s speed layer. It offers horizontal scaling and high per-machine throughput. It can easily ingest and process millions of tuples per second with redundancy, while using fewer resources than alternative solutions. VoltDB requires an order of magnitude fewer nodes to achieve the scale and speed of the Lambda speed layer. As a benefit, substantially smaller clusters are cheaper to build and run, and easier to manage.

Page 10: Deploying deep learning models with Docker and Kubernetes

DOCKER Management: enter Kubernetes

https://www.youtube.com/watch?v=PivpCKEiQOQ

www.computerweekly.com/feature/Demystifying-Kubernete

Once every five years, the IT industry witnesses a major technology shift. In the past two decades, we have seen the server paradigm evolve into web-based architecture that matured to service orientation before finally moving to the cloud. Today it is containers.

Docker is much more than just the tools and API. It created a vibrant ecosystem that started to contribute to a variety of tools to manage the lifecycle of containers. 

One of the first tools that Google decided to make open source is called Kubernetes, which means “pilot” or “helmsman” in Greek.

Kubernetes works in conjunction with Docker. While Docker provides the lifecycle management of containers, Kubernetes takes it to the next level by providing orchestration and managing clusters of containers.

Traditionally, platform as a service (PaaS) offerings such as Azure, App Engine, Cloud Foundry, OpenShift, Heroku and Engine Yard exposed the capability of running the code by abstracting the infrastructure.

Kubernetes and Docker deliver the promise of PaaS through a simplified mechanism. Once the system administrators configure and deploy Kubernetes on a specific infrastructure, developers can start pushing the code into the clusters. This hides the complexity of dealing with the command line tools, APIs and dashboards of specific IaaS providers. 

Page 11: Deploying deep learning models with Docker and Kubernetes

Containers at scale

As has been demonstrated, it is relatively easy to launch tens of thousands of containers on a single host. But how do you deploy thousands of containers? How do you manage and keep track of them? How do you manage and recover from failure? While these things sometimes might look easy, there are some hard problems to tackle. Let us walk through what makes it so difficult.

With a single command the Docker environment is set up and you can docker run until you drop. But what if you have to run Docker containers across two hosts? How about 50 hosts? Or how about 10,000 hosts? Now, you may ask why one would want to do this. There are some good reasons why:

nextplatform.com/2016/03/22

https://www.nextplatform.com/2015/09/29/why-containers-at-scale-is-hard/

nextplatform.com/2016/03/03

Two founders of the Kubernetes project at Google, Craig McLuckie and Joe Beda, today announced their new company, Heptio. The company has raised $8.5 million in a series A investment round led by Accel, with participation from Madrona Venture Group.

Open source Kubernetes is a widely deployed technology for container orchestration. Now, Heptio will bring a commercial version of the software to enterprises.

www.sdxcentral.com

Page 12: Deploying deep learning models with Docker and Kubernetes

Kubernetes

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

It groups containers that make up an application into logical units for easy management and discovery. Kubernetes builds upon 15 years of experience of running production workloads at Google, combined with best-of-breed ideas and practices from the community.

KubeWeekly — aggregating all interesting weekly news about Kubernetes in the form of a newsletter. Manage a cluster of Linux containers as a single system to accelerate Dev and simplify Ops.
https://kubeweekly.com/

http://nshani.blogspot.co.uk/2016/02/getting-started-with-kubernetes.html

https://www.youtube.com/watch?v=21hXNReWsUU

http://cloud9.nebula.fi/app.html

Page 13: Deploying deep learning models with Docker and Kubernetes

Kubernetes concepts

http://www.slideshare.net/arungupta1/package-your-java-ee-application-using-docker-and-kubernetes

linkedin.com/pulse

http://www.slideshare.net/jawnsy/kubernetes-my-bff

Inference can be very resource intensive. Our server executes a TensorFlow graph to process every classification request it receives. The Inception-v3 model has over 27 million parameters and runs 5.7 billion floating point operations per inference.

Fortunately, this is where Kubernetes can help us. Kubernetes distributes inference request processing across a cluster using its External Load Balancer. Each pod in the cluster contains a TensorFlow Serving Docker image with the TensorFlow Serving-based gRPC server and a trained Inception-v3 model. The model is represented as a set of files describing the shape of the TensorFlow graph, model weights, assets, and so on.

Since everything is neatly packaged together, we can dynamically scale the number of replicated pods using the Kubernetes Replication Controller to keep up with the service demands.

blog.kubernetes.io/2016/03
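A hedged sketch of the scaling step with the official `kubernetes` Python client; the Replication Controller name `inception-controller` is hypothetical, and the calls below are assumed to match the client's generated API for the replicationcontrollers/scale subresource:

```python
# Scale the pool of TensorFlow Serving pods to track service demand.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

scale = v1.read_namespaced_replication_controller_scale(
    name="inception-controller", namespace="default")
scale.spec.replicas = 5    # desired number of serving pods
v1.patch_namespaced_replication_controller_scale(
    name="inception-controller", namespace="default", body=scale)
```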

Page 14: Deploying deep learning models with Docker and Kubernetes

Alternatives

medium.com/@mustwin

Bare Metal

Most schedulers, with the notable exception of Cloud Foundry, can be installed on “bare metal” or physical machines inside your datacenter. This can save you big on hypervisor licensing fees.

Volume Mounts

Volume mounts allow you to persist data across container deployments. This is a key differentiator depending on your applications’ needs. Mesos is the leader here, and Kubernetes is slowly catching up.

https://news.ycombinator.com/item?id=10438273

https://www.oreilly.com/ideas/swarm-v-fleet-v-kubernetes-v-mesos

Conclusion

There are clearly a lot of choices for orchestrating, clustering, and managing containers. That being said, the choices are generally well differentiated. In terms of orchestration, we can say the following:

Swarm has the advantage (and disadvantage) of using the standard Docker interface. Whilst this makes it very simple to use Swarm and to integrate it into existing workflows, it may also make it more difficult to support the more complex scheduling that may be defined in custom interfaces.

Fleet is a low-level and fairly simple orchestration layer that can be used as a base for running higher level orchestration tools, such as Kubernetes or custom systems.

Kubernetes is an opinionated orchestration tool that comes with service discovery and replication baked-in. It may require some re-designing of existing applications, but used correctly will result in a fault-tolerant and scalable system.

Mesos is a low-level, battle-hardened scheduler that supports several frameworks for container orchestration including Marathon, Kubernetes, and Swarm. At the time of writing, Kubernetes and Mesos are more developed and stable than Swarm. In terms of scale, only Mesos has been proven to support large-scale systems of hundreds or thousands of nodes. However, when looking at small clusters of, say, less than a dozen nodes, Mesos may be an overly complex solution.

Page 15: Deploying deep learning models with Docker and Kubernetes

Kubernetes Still on top?

https://news.ycombinator.com/item?id=12462261

After all, Kubernetes is a mere two years old (as a public open source project), whereas Apache Mesos has clocked seven years in market. Docker Swarm is younger than Kubernetes, and it comes with the backing of the center of the container universe, Docker Inc. Yet the orchestration rivals pale in comparison to Kubernetes' community, which -- now under management by the Cloud Native Computing Foundation -- is exceptionally large and diverse.

• Kubernetes is one of the top projects on GitHub: in the top 0.01 percent in stars and No. 1 in terms of activity.

• While documentation is subpar, Kubernetes has a significant Slack and Stack Overflow community that steps in to answer questions and foster collaboration, with growth that dwarfs that of its rivals.

• More professionals list Kubernetes in their LinkedIn profile than any other comparable offering by a wide margin.

• Perhaps most glaring, data from OpenHub shows Apache Mesos dwindling since its initial release and Docker Swarm starting to slow. In terms of raw community contributions, Kubernetes is exploding, with 1,000-plus contributors and 34,000 commits -- more than four times those of nearest rival Mesos.

http://www.infoworld.com/article/3118345/cloud-computing/why-kubernetes-is-winning-the-container-war.html

https://github.com/kubernetes/kubernetes

I would argue that general-purpose clusters like those managed by Google Kubernetes are better for hosting Internet businesses depending on artificial intelligence technologies than special-purpose clusters like NVIDIA DGX-1.

Consider the case where an experimental model-training job is using all 100 GPUs in the cluster. A production job gets started and asks for 50 GPUs. If we use MPI, we'd have to kill the experiment job to release enough resources to run the production job. This tends to give the owner of the experiment job the impression that he is doing "second-class" work.

Kubernetes is smarter than MPI, as it can kill, or preempt, only 50 workers of the experiment job, allowing both jobs to run at the same time. With Kubernetes, people have to build their programs into Docker images that run as Docker containers. Each container has its own filesystem and network port space. When a program runs as a container, it removes only files in its own directory. This is to some extent like defining C++ classes in namespaces, which helps us avoid class name conflicts.
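For illustration, a sketch of a pod that asks the scheduler for one GPU, using the Kubernetes Python client. The image name is hypothetical, and `alpha.kubernetes.io/nvidia-gpu` is the alpha-era resource name from around the time of this deck (later clusters use `nvidia.com/gpu`):

```python
# Request one GPU for a training worker; the scheduler finds a node with
# a free GPU, and can preempt lower-priority workers to make room.
from kubernetes import client, config

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="experiment-worker"),
    spec=client.V1PodSpec(containers=[
        client.V1Container(
            name="trainer",
            image="ourcompany/asr-trainer:latest",  # hypothetical image
            resources=client.V1ResourceRequirements(
                limits={"alpha.kubernetes.io/nvidia-gpu": "1"}),
        ),
    ]),
)

config.load_kube_config()
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```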

An example: a typical Kubernetes cluster running an automatic speech recognition (ASR) business might be running the following jobs:

1) The speech service, with as many instances as needed to serve many simultaneous user requests.

2) The Kafka system, where each channel collects a particular log stream of the speech service.

3) Kafka channels are followed by Storm jobs for online data processing. For example, a Storm job joins the utterance log stream and the transcription stream.

4) The joined result, namely the session log stream, is fed to an ASR model trainer that updates the model.

5) This trainer notifies the ASR server when it writes updated models into Ceph.

6) Researchers might change the training algorithm and run some experimental training jobs, which serve test ASR service jobs.

Page 16: Deploying deep learning models with Docker and Kubernetes

The famous 'classical big data' on Spark

Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area.

In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.
http://dx.doi.org/10.1007/s41060-016-0027-9

In addition to the research highlights we presented in the previous sections, there are other research works which have been done using Apache Spark as a core engine for solving data problems in machine learning and data mining [5,36], graph processing [16], genomic analysis [60,65], time series data [71], smart grid data [73], spatial data processing [87], scientific computations of satellite data [67], large-scale biological sequence alignment [97] and data discretization [68]. There are also some recent works on using Apache Spark for deep learning [46,64]. CaffeOnSpark is an open source project [60] from Yahoo [61] for distributed deep learning on big data with Apache Spark.

Page 17: Deploying deep learning models with Docker and Kubernetes

TensorFlow + Apache Spark

https://www.youtube.com/watch?v=PFK6gsnlV5E

https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/

You might be wondering: what’s Apache Spark’s use here when most high-performance deep learning implementations are single-node only? To answer this question, we walk through two use cases and explain how you can use Spark and a cluster of machines to improve deep learning pipelines with TensorFlow:

Hyperparameter Tuning: use Spark to find the best set of hyperparameters for neural network training, leading to 10X reduction in training time and 34% lower error rate.

Deploying models at scale: use Spark to apply a trained neural network model on a large amount of data.

How does using Spark improve the accuracy? The accuracy with the default set of hyperparameters is 99.2%. Our best result with hyperparameter tuning has a 99.47% accuracy on the test set, which is a 34% reduction of the test error. Distributing the computations scaled linearly with the number of nodes added to the cluster: using a 13-node cluster, we were able to train 13 models in parallel, which translates into a 7x  speedup compared to training the models one at a time on one machine.
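A minimal sketch of that hyperparameter-tuning pattern with PySpark; `train_and_eval` is a hypothetical stand-in for a single-node TensorFlow training run that returns a test error for one hyperparameter combination:

```python
# Spark farms out independent training runs, one per hyperparameter combo.
from itertools import product

from pyspark import SparkContext

sc = SparkContext(appName="hyperparameter-search")

grid = list(product([0.001, 0.01, 0.1],   # learning rates
                    [64, 128, 256]))      # batch sizes

def train_and_eval(params):
    learning_rate, batch_size = params
    error = 1.0  # placeholder: run the real training and evaluation here
    return (error, params)

# One partition per combination, so each model trains in parallel.
best_error, best_params = (sc.parallelize(grid, len(grid))
                             .map(train_and_eval)
                             .min())
print("best params:", best_params, "error:", best_error)
```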

The goal of this workshop is to build an end-to-end, streaming data analytics and recommendations pipeline on your local machine using Docker and the latest streaming analytics tools. First, we create a data pipeline to interactively analyze, approximate, and visualize streaming data using modern tools such as Apache Spark, Kafka, Zeppelin, iPython, and ElasticSearch.

http://advancedspark.com/

Page 18: Deploying deep learning models with Docker and Kubernetes

Dask as an alternative to Apache Spark #1

https://youtu.be/1kkFZ4P-XHg

continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster

Matthew Rocklin's Blog

dask, the original project

dask.distributed, the distributed memory scheduler powering cluster computing

dask.bag, the user API we’ve used in this post.

Amazon EC2 with Dask configured with Jupyter Notebooks, and Anaconda. https://github.com/dask/dask-ec2
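For a quick taste of dask.bag, the user API mentioned in the list above, a sketch over hypothetical log files (the S3 path and record fields are assumptions):

```python
# Lazy, parallel operations over semi-structured records with dask.bag.
import json

import dask.bag as db

records = (db.read_text("s3://our-bucket/logs/2016-*.json")
             .map(json.loads)
             .filter(lambda r: r.get("status") == 200))

top_paths = records.pluck("path").frequencies().topk(10, key=1)
print(top_paths.compute())  # nothing executes until .compute()
```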

Page 19: Deploying deep learning models with Docker and Kubernetes

Dask as an alternative to Apache Spark #2

http://dask.pydata.org/en/latest/spark.html

Spark is mature and all-inclusive. If you want a single project that does everything and you’re already on Big Data hardware then Spark is a safe bet, especially if your use cases are typical ETL + SQL and you’re already using Scala.

Dask is lighter weight and is easier to integrate into existing code and hardware. If your problems vary beyond typical ETL + SQL and you want to add flexible parallelism to existing solutions then dask may be a good fit, especially if you are already using Python and associated libraries like NumPy and Pandas.

If you are looking to manage a terabyte or less of tabular CSV or JSON data then you should forget both Spark and Dask and use Postgres or MongoDB.

https://news.ycombinator.com/item?id=10062076

Dask seems to be aimed at parallelism of only certain operations (some parts of NumPy and Pandas) on larger than memory data on a single machine. Spark is a general purpose computing engine that can work across a cluster of machines and has many libraries optimized for distributed computing (machine learning, graph, etc.).

The advantages of Dask seem to be that it is a drop in replacement for NumPy and Pandas. Granted, given the prevalence of those two libraries that isn't a small advantage.

https://www.quora.com/Is-https-github-com-blaze-dask-an-alternative-to-Spark
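The "drop-in replacement" point in practice, as a small dask.array sketch: the expression reads exactly like NumPy, but the array is chunked and evaluated lazily:

```python
# NumPy-style syntax on a chunked, lazily evaluated array.
import dask.array as da

x = da.random.random((100000, 1000), chunks=(10000, 1000))
normalized_std = (x - x.mean(axis=0)).std(axis=0)  # reads like NumPy
print(normalized_std[:5].compute())  # work happens only at .compute()
```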

GPU Computing with Apache Spark and Python, by Continuum Analytics, slideshare.net

TensorFlow Basics
Weights Persistence. Save and restore a model.
Fine-Tuning. Fine-tune a pre-trained model on a new task.
Using HDF5. Use HDF5 to handle large datasets.
Using DASK. Use DASK to handle large datasets.

Page 20: Deploying deep learning models with Docker and Kubernetes

Kubernetes + Dask

Running on Kubernetes on Google Container Engine

This small repo gives an example Kubernetes configuration for running dask.distributed on Google Container Engine.

Dask Cluster Deployments
http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project.

All code in this post is experimental. It should not be relied upon. For people looking to deploy dask.distributed on a cluster, please refer instead to the documentation.

Dask is deployed today on the following systems in the wild:
• SGE
• SLURM
• Torque
• Condor
• LSF
• Mesos
• Marathon
• Kubernetes
• SSH and custom scripts
… there may be more. This is what I know of first-hand.

These systems provide users access to cluster resources and ensure that many distributed services / users play nicely together. They’re essential for any modern cluster deployment.

For example, both Olivier Grisel (INRIA, scikit-learn) and Tim O’Donnell (Mount Sinai, Hammer lab) publish instructions on how to deploy Dask.distributed on Kubernetes.

• Olivier’s repository
• Tim’s repository
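Once dask.distributed is running on Kubernetes (for instance, per the repositories above), client code only needs the scheduler's address; the Kubernetes Service hostname below is hypothetical:

```python
# Connect to a dask.distributed scheduler exposed as a Kubernetes Service.
from dask.distributed import Client

client = Client("dask-scheduler.default.svc.cluster.local:8786")

futures = client.map(lambda n: n ** 2, range(100))  # fan out to workers
print(sum(client.gather(futures)))                  # 328350
```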

SciPy Tutorial Setup On Kubernetes
written by Benjamin Zaitlen on 2016-09-30
http://quasiben.github.io/blog/2016/9/30/scipy-setup/

Our goal was to give students access to a preconfigured cluster with zero entry requirements: push a button get a cluster with all tools installed. To accomplish this we need a handful of docker images:

• Web application: button and info

• Jupyter notebook

• proxy app (more on this later)

• cluster technologies: Spark, Dask, IPython Parallel

And a handful of Kubernetes concepts:

• Pods: collection of containers (similar to docker-compose)

• namespaces: named and isolated clusters

• replication controller: a scalable Pod.

Page 21: Deploying deep learning models with Docker and Kubernetes

That is code. What about the data, then?

Using the different software above, an application can be deployed, scaled easily and accessed from the outside world in a few seconds. But what about the data? Structured content would probably be stored in a distributed database, like MongoDB, for example. Unstructured content is traditionally stored in either a local file system, a NAS share or in Object Storage. A local file system doesn't work, as a container can be deployed on any node in the cluster.

On the other hand, Object Storage can be used by any application from any container, is highly available due to the use of load balancers, doesn't require any provisioning and accelerates the development cycle of the applications. Why? Because a developer doesn't have to think about the way data should be stored, manage a directory structure, and so on.

The Amazon S3 endpoint used to upload and download pictures is displayed on the bottom left corner and shows that ViPR is used to store the data.

The fact that the picture is uploaded directly to the Object Storage platform means that the web application is not in the data path. This allows the application to scale without deploying hundreds of instances. This web application can also be used to display all the pictures stored in the corresponding Amazon S3 bucket.

The url displayed below each picture shows that the picture is downloaded directly from the Object Storage platform, which again means that the web application is not in the data path. This is another reason why Object Storage is the de facto standard for web scale applications.

recorditblog.com
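A sketch of the "web app out of the data path" pattern described above, using boto3 presigned URLs against an S3-compatible endpoint (ViPR exposes an S3-compatible API, which is why this works); the endpoint, bucket and key names are hypothetical:

```python
# Hand the browser a short-lived presigned URL so the upload bytes go
# straight to the object store, never through the web application.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objectstore.example.com")

url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "pictures", "Key": "uploads/cat.jpg"},
    ExpiresIn=300,  # seconds
)
print(url)  # the client PUTs the file here; the app never sees the bytes
```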

http://www.slideshare.net/kubecon/kubecon-eu-2016-kubernetes-storage-101

Persistent Volumes Walkthrough

The purpose of this guide is to help you become familiar with Kubernetes Persistent Volumes. By the end of the guide, we’ll have nginx serving content from your persistent volume.

You can view all the files for this example in the docs repo here.

This guide assumes knowledge of Kubernetes fundamentals and that you have a cluster up and running.

See Persistent Storage design document for more information.

http://kubernetes.io/docs/user-guide/persistent-volumes/walkthrough/

Page 22: Deploying deep learning models with Docker and Kubernetes

Data Lakes vs data warehouses #1

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

The comparison below helps flesh out this definition. It also highlights a few of the key differences between a data warehouse and a data lake. This is, by no means, an exhaustive list, but it does get us past this “been there, done that” mentality:

Data. A data warehouse only stores data that has been modeled/structured, while a data lake is no respecter of data. It stores it all—structured, semi-structured, and unstructured. [See my big data is not new graphic. The data warehouse can only store the orange data, while the data lake can store all the orange and blue data.]

Processing. Before we can load data into a data warehouse, we first need to give it some shape and structure—i.e., we need to model it. That’s called schema-on-write. With a data lake, you just load in the raw data, as-is, and then when you’re ready to use the data, that’s when you give it shape and structure. That’s called schema-on-read. Two very different approaches (see the PySpark sketch at the end of this section).

Storage. One of the primary features of big data technologies like Hadoop is that the cost of storing data is relatively low as compared to the data warehouse. There are two key reasons for this: First, Hadoop is open source software, so the licensing and community support is free. And second, Hadoop is designed to be installed on low-cost commodity hardware.

Agility. A data warehouse is a highly-structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied to it. A data lake, on the other hand, lacks the structure of a data warehouse—which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on-the-fly.

Security. Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a question of if, but when.

Users. For a long time, the rally cry has been BI and analytics for everyone! We’ve built the data warehouse and invited “everyone” to come, but have they come? On average, 20-25% of them have. Is it the same cry for the data lake? Will we build the data lake and invite everyone to come? Not if you’re smart. Trust me, a data lake, at this point in its maturity, is best suited for the data scientists.

www.kdnuggets.com/2015/09

http://www.smartdatacollective.com/all/13556
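A minimal sketch of the schema-on-read side from the Processing point above, assuming PySpark and a hypothetical lake path: the structure is derived when you read, not modeled before you write:

```python
# Schema-on-read: load raw JSON as-is; the schema is inferred at read time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-demo").getOrCreate()

events = spark.read.json("s3://lake/raw/events/")  # schema inferred here
events.printSchema()                               # structure appears now
events.groupBy("event_type").count().show()        # hypothetical field
```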

Page 23: Deploying deep learning models with Docker and Kubernetes

Data Lakes Medical Examples

Setting Up the Data Lake
http://www.slideshare.net/CasertaConcepts/setting-up-the-data-lake-55319460

searchhealthit.techtarget.com

Unlike most relational databases' linear representation and analysis of data, Franz's semantic graph database technology employs visual tools with which users can graphically see data elements and their relationships.

Montefiore also recently started another program using the data lake to do cardio-genetic predictive analytics to determine the degrees of possibility of patients having sudden cardiac death based on their genetic background.

Page 24: Deploying deep learning models with Docker and Kubernetes

USE CASES
to further illustrate the idea

Page 25: Deploying deep learning models with Docker and Kubernetes

Kubernetes in deep learning #1

https://openai.com/blog/infrastructure-for-deep-learning/

Deep learning is an empirical science, and the quality of a group's infrastructure is a multiplier on progress. Fortunately, today's open-source ecosystem makes it possible for anyone to build great deep learning infrastructure.

In this post, we'll share how deep learning research usually proceeds, describe the infrastructure choices we've made to support it, and open-source kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. We hope you find this post useful in building your own deep learning infrastructure.

Once the model shows sufficient promise, you'll scale it up to larger datasets and more GPUs. This requires long jobs that consume many cycles and last for multiple days. You'll need careful experiment management, and to be extremely thoughtful about your chosen range of hyperparameters.

Like much of the deep learning community, we use Python 2.7. We generally use Anaconda, which has convenient packaging for otherwise difficult packages such as OpenCV and performance optimizations for some scientific libraries. We also run our own physical servers, primarily running Titan X GPUs. We expect to have a hybrid cloud for the long haul: it's valuable to experiment with different GPUs, interconnects, and other techniques which may become important for the future of deep learning.

Scalable infrastructure often ends up making the simple cases harder. We put equal effort into our infrastructure for small- and large-scale jobs, and we're actively solidifying our toolkit for making distributed use-cases as accessible as local ones.

Kubernetes requires each job to be a Docker container, which gives us dependency isolation and code snapshotting. However, building a new Docker container can add precious extra seconds to a researcher's iteration cycle, so we also provide tooling to transparently ship code from a researcher's laptop into a standard image. We expose Kubernetes's flannel network directly to researchers' laptops, allowing users seamless network access to their running jobs. This is especially useful for accessing monitoring services such as TensorBoard. (Our initial approach — which is cleaner from a strict isolation perspective — required people to create a Kubernetes Service for each port they wanted to expose, but we found that it added too much friction.)

We're releasing kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. It runs as a normal Pod on Kubernetes and requires only that your worker nodes are in Auto Scaling groups.

Our infrastructure aims to maximize the productivity of deep learning researchers, allowing them to focus on the science. We're building tools to further improve our infrastructure and workflow, and will share these in upcoming weeks and months. We welcome help to make this go even faster!

Page 26: Deploying deep learning models with Docker and Kubernetes

Kubernetes in deep learning #2

https://news.ycombinator.com/item?id=12391505

May 9 INFRA · DATA · RESEARCH · NEWS FEED · PYTHON

Introducing FBLearner Flow: Facebook's AI backbone

Jeffrey Dunn, https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/


Page 28: Deploying deep learning models with Docker and Kubernetes

Docker customers

Published on Aug 11, 2016

In this video Ajay Dankar, Senior Director Product Management at PayPal discusses why they selected Docker and Docker Trusted Registry to help them containerize their legacy apps to more efficiently utilize their infrastructure and secure workloads.

--

Docker is an open platform for developers and system administrators to build, ship and run distributed applications. With Docker, IT organizations shrink application delivery from months to minutes, frictionlessly move workloads between data centers and the cloud and can achieve up to 20X greater efficiency in their use of computing resources. Inspired by an active community and by transparent, open source innovation, Docker containers have been downloaded more than 700 million times and Docker is used by millions of developers across thousands of the world’s most innovative organizations, including eBay, Baidu, the BBC, Goldman Sachs, Groupon, ING, Yelp, and Spotify. Docker’s rapid adoption has catalyzed an active ecosystem, resulting in more than 180,000 “Dockerized” applications, over 40 Docker-related startups and integration partnerships with AWS, Cloud Foundry, Google, IBM, Microsoft, OpenStack, Rackspace, Red Hat and VMware.

https://www.youtube.com/watch?v=wf4Jg-9gv9Q

Page 29: Deploying deep learning models with Docker and Kubernetes

Business data into value

8 ways to turn data into value with Apache Spark machine learning
OCTOBER 18, 2016, by Alex Liu, Chief Data Scientist, Analytics Services, IBM

http://www.ibmbigdatahub.com/blog/8-ways-turn-data-value-apache-spark-machine-learning

1. Obtain a holistic view of business

In today's competitive world, many corporations work hard to gain a holistic view, or 360-degree view, of customers, for many of the key benefits outlined by data analytics expert Mr. Abhishek Joshi. In many cases, a holistic view was not obtained, partially due to the lack of capabilities to organize huge amounts of data and then analyze them. But Apache Spark's ability to compute quickly while using data frames to organize huge amounts of data can help researchers quickly develop analytical models that provide a holistic view of the business, adding value to related business operations. To realize this value, however, an analytical process, from data cleaning to modeling, must still be completed.

4. Avoid customer churn by rethinking churn modeling

Losing customers means losing revenue. Not surprisingly, then, companies strive to detect potential customer churn through predictive modeling, allowing them to implement interventions aimed at retaining customers. This might sound easy, but it can actually be very complicated: Customers leave for reasons that are as divergent as the customers themselves are, and products and services can play an important, but hidden, role in all this. What’s more, merely building models to predict churn for different customer segments—and with regard to different products and services—isn’t enough; we must also design interventions, then select the intervention judged most likely to prevent a particular customer from departing. Yet even doing this requires the use of analytics to evaluate the results achieved—and, eventually, to select interventions from an analytical standpoint. Amid this morass of choices, Apache Spark’s distributed computing capabilities can help solve previously baffling problems.

5. Develop meaningful purchase recommendations

Recommendations for purchases of products and services can be very powerful when made appropriately, and they have become expected features of e-commerce platforms, with many customers relying on recommendations to guide their purchases. Yet developing recommendations at all means developing recommendations for each customer—or, at the very least, for small segments of customers. Apache Spark can make this possible by offering the distributed computing and streaming analytics capabilities that have become invaluable tools for this purpose.

ebaytechblog: Spark is helping eBay create value from its data, and so the future is bright for Spark at eBay. In the meantime, we will continue to see adoption of Spark increase at eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product announcements, industry chatter, and Spark’s own strengths and capabilities.

http://dx.doi.org/10.1186/s40165-015-0014-6

https://thinkbiganalytics.com/big_data_solutions/data-science/

Page 30: Deploying deep learning models with Docker and Kubernetes

Open data and Apache Spark

----- Jump to Topic -----
00:00:06 - Workshop Intro & Environment Setup
00:13:06 - Brief Intro to Spark
00:17:32 - Analysis Overview: SF Fire Department Calls for Service
00:23:22 - Analysis with PySpark DataFrames API
00:29:32 - Doing Date/Time Analysis
00:47:53 - Memory, Caching and Writing to Parquet
01:00:40 - SQL Queries
01:21:11 - Convert a Spark DataFrame to a Pandas DataFrame
----- Q & A -----
01:24:43 - Spark DataFrames vs. SQL: Pros and Cons?
01:26:57 - Workflow for Chaining Databricks notebooks into Pipeline?
01:30:27 - Is Spark 2.0 ready to use in production?

https://www.youtube.com/watch?v=iiJq8fvSMPg

Page 31: Deploying deep learning models with Docker and Kubernetes

Internet of things (IoT)

To prove that our IoT platform is really independent of the application environment, we took one IoT gateway (Raspberry Pi 2) from the city project and put it into the Austin Convention Center during the OpenStack Summit, together with an IQRF-based mesh network connecting sensors that measure humidity, temperature and CO2 levels. This demonstrates that the IoT gateway can manage or collect data from any technology like IQRF, Bluetooth, GPIO, and any other communication standard supported on Linux-based platforms.

We deployed 20 sensors and 20 routers on 3 conference floors, with a single active IoT gateway receiving data from the entire IQRF mesh network and relaying it to a dedicated time-series database, in this case Graphite. The collector is an MQTT-Java bridge running inside a Docker container managed by Kubernetes.
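A sketch of the collector's last hop: pushing one reading into Graphite over its plaintext protocol (a single "metric value timestamp" line to port 2003). The host name and metric path below are hypothetical:

```python
# Send one metric line to Graphite's plaintext listener on port 2003.
import socket
import time

def send_to_graphite(path, value, host="graphite.local", port=2003):
    line = "%s %s %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode())

send_to_graphite("summit.floor2.room204.co2_ppm", 612)
```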

The following screenshot shows real-time CO2 values from different rooms on 2 floors. The historical graph shows values from Monday. You can easily recognize when the main keynote session started and when the lunch period was.

Page 32: Deploying deep learning models with Docker and Kubernetes

Healthcare

https://www.youtube.com/watch?v=ePp54ofRqRs

https://www.healthdirect.gov.au/

How open source container tech can impact healthcare

At Red Hat, we believe that creating open source platforms allows the tech community to develop the best software possible. We recently launched a series of films highlighting the open source movement’s impact on healthcare, including initiatives that promote open patient data and provide 3D-printed prosthetics.

Health is a great context to start exploring OpenShift's open source capabilities. We designed OpenShift to allow developers to take full advantage of containers (Docker) and orchestration (Kubernetes), without having to learn the internals of how to build containers from scratch or understand system administration well enough to deploy production-quality apps that can scale on demand.

OpenShift makes using containers and orchestration accessible by letting you focus on code instead of writing Dockerfiles and running Docker builds all day. With the integrated Source-to-Image open source project, the platform automatically creates containers while requiring only the URL for your source code repository.  

openshift.devpost.com

Improving Container Security: Docker and More

After 6 months and 15 successful beta deployments, Twistlock is announcing the general availability of our container security suite. Twistlock came out of stealth in May 2015. Since then, we have been working diligently with a select group of beta customers to validate the value of our offerings. This diverse group of 15 beta testers, including Wix, AppsFlyer, and HolidayCheck, spans financial services, hospitality, healthcare, Internet services, and government. These customers confirmed that we are hitting the sweet spot of their most pressing container security needs -- a majority of them already deployed our product into their production environments, protecting live services and customer data.

The logical resource boundaries established in Docker containers are almost as secure as those established by the Linux operating system or by a virtual machine, according to a report by Gartner analyst Joerg Fritsch. However, Docker and Linux containers in general fall short when it comes to container management and administration, Fritsch said in his report, "Security properties of containers managed by Docker."

Page 33: Deploying deep learning models with Docker and Kubernetes

Neuroscience & bioinformatics

http://dx.doi.org/10.1016/j.conb.2015.04.002

Most large-scale analytics, whether in industry or neuroscience, involve common patterns. Raw data are massive in size. Often, they are processed so as to extract signals of interest, which are then used for statistical analysis, exploration, and visualization. But raw data can be analyzed or visualized directly (top arrow). And the results of each successive step inform how to perform the earlier ones (feedback loops). Icons below highlight some of the technologies, discussed in this essay, that are core to the modern large-scale analysis workflow.

“Cloud deployment also makes it easier to build tools that run identically for all users, especially with virtual machine platforms like Docker. However, cloud deployment for neuroscience does require transferring data to cloud storage, which may become a bottleneck. Deploying on academic clusters requires at least some support from cluster administrators but keeps the data closer to the computation. … There is also rapidly growing interest in the ‘‘data analysis notebook’’. These notebooks – the Jupyter notebook being a particularly popular example – combine executable code blocks, notes, and graphics in an interactive document that runs in a web browser, and provides a seamless front-end to a computer, or a large cluster of computers if running against a framework like Spark. Notebooks are a particularly appealing way to disseminate information; a recent neuroimaging paper, for example, provided all of its analyses in a version-controlled repository hosted on GitHub with Jupyter notebooks that generate all the figures in the paper [45]—a clear model for the future of reproducible science.”

https://www.docker.com/customers/docker-helps-varian-medical-systems-battle-cancer

https://dx.doi.org/10.12688/f1000research.7536.1

http://dx.doi.org/10.1371/journal.pone.0152686

http://dx.doi.org/10.1186/s13742-015-0087-0

http://homolog.us/blogs/blog/2015/09/22/is-docker-for-suckers/

Page 34: Deploying deep learning models with Docker and Kubernetes

Reproducible SCIENCE

http://dx.doi.org/10.1038/nj7622-703a

http://dx.doi.org/10.1038/533452a

http://t-redactyl.io/blog/2016/10/a-crash-course-in-reproducible-research-in-python.html

http://conference.scipy.org/proceedings/scipy2016/pdfs/christian_oxvig.pdf

The use of 'custom MATLAB scripts'

Page 35: Deploying deep learning models with Docker and Kubernetes

Reproducible SCIENCE with docker

ANACONDA AND DOCKER: BETTER TOGETHER FOR REPRODUCIBLE DATA SCIENCE
Monday, June 20, 2016, continuum.io/blog

Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love on the services you use, including Amazon Web Services, Microsoft Azure, and Cloudera CDH. Today we’re excited to announce our new partnership with Docker.

As part of the announcements at DockerCon this week, Anaconda images will be featured in the new Docker Store, including Anaconda and Miniconda images based on Python 2 and Python 3. These freely available Anaconda images for Docker are now verified, will be featured in the Docker Store when it launches, are being regularly scanned for security vulnerabilities and are available from the ContinuumIO organization on Docker Hub.

Anaconda and Docker are a great combination to empower your development, testing and deployment workflows with Open Data Science tools, including Python and R. Our users often ask whether they should be using Anaconda or Docker for data science development and deployment workflows. We suggest using both - they’re better together!
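A one-call sketch of that combination, assuming the Docker SDK for Python and the public continuumio/miniconda3 image mentioned above:

```python
# Run the conda-provisioned Python interpreter inside a container.
import docker

client = docker.from_env()
output = client.containers.run(
    "continuumio/miniconda3",  # conda environment baked into the image
    "python --version",
    remove=True,               # clean up the container afterwards
)
print(output.decode().strip())
```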

Page 36: Deploying deep learning models with Docker and Kubernetes

Reproducible SCIENCE Between Jupyter and Docker

Jupyter / JupyterLab does not really come as 'plug'n'play'; you still have to have all the dependencies resolved.

Build your own conda packages, and deploy
continuum.io/blog/developer-blog/whats-old-and-new-conda-build

Anaconda Enterprise

Notebooks
continuum.io