DockerCon EU 2015: Using Docker with NoSQL


TRANSCRIPT

Salamander: Using Docker with NoSQL
Manuel E. de Paz, Software Architect at BEEVA (a BBVA Company)

Hello, my name is Manuel de Paz. I work as a software architect at BEEVA, and today I am here to present Salamander as a use case of Docker, Amazon Web Services and non-relational databases.


I would like to begin this talk with a proposal, something different from traditional slide decks.

Use #SalamanderAtDockerCon during the talk for Q&A; the Salamander team is waiting to answer you.

Not all of our team could be here today, and Salamander is a broad project, so the whole Salamander team will be on Twitter answering your questions. You can use #SalamanderAtDockerCon to ask about the slides, the technologies, the project or anything related, although I hope to answer many of these questions during the session.

Some Ideas
Salamander, an adventure in AWS
How does Docker help us?
Our next challenges

Today we will briefly talk about three things to introduce the Salamander project. I will focus on how we are using Docker in production and, more interestingly, on how we will keep using it. We will then finish the session with a 3-minute video showing the tool, as a functional vision of the project.

What is Salamander?

Salamander is a dashboard, a visualization tool developed to analyze the logs of batch processes and dynamically generate reports with different kinds of charts based on user requests. As shown in this collage, we use Gantt diagrams, time series, area charts... Perhaps the most interesting are the interactive Dagre graphs, which allow us to browse deeper into the relationships between entities.

Frontend, Backend & API
Data ETL

Batch Import

native-drivers

Java
What is Salamander?

One of the most interesting features of the project is the use of JavaScript-based technologies to support all the business logic. Every part of the application is developed following the MEAN philosophy (MongoDB, Express, AngularJS, Node.js).
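As an illustration of what a MEAN-style endpoint in this stack could look like, here is a minimal sketch; the route, database name and collection name are hypothetical, not taken from the project:

```javascript
// Minimal MEAN-style sketch: an Express route reading from MongoDB.
// Route, database and collection names are illustrative only.
var express = require('express');
var MongoClient = require('mongodb').MongoClient;

var app = express();

MongoClient.connect('mongodb://localhost:27017/salamander', function (err, db) {
  if (err) throw err;

  // Hypothetical endpoint: executions logged for a given day.
  app.get('/api/executions/:day', function (req, res) {
    db.collection('executions')
      .find({ day: req.params.day })
      .toArray(function (err, docs) {
        if (err) return res.status(500).json({ error: err.message });
        res.json(docs);
      });
  });

  app.listen(3000);
});
```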


Data Processing

Salamander is an analytical tool that lets you mix three views of the executions of the bank's batch processes.

This tool brings together different perspectives of time: the future, in the form of planned process lists, and the past, in the form of log files, giving the present a broader perspective in terms of quantity and capacity of analysis.

ETL & Data Typologies

Relationships

1 day ≈ 250,000 nodes

The bank's batch processes are executed sequentially in a complex network of dependencies, predecessors and successors. Such a hierarchical relationship can only really be captured in a graph database. The database that comes closest to this approach is Neo4j, which, complemented by the Graphaware framework and tools, allows us to look into long chains of relations. Put simply, we treat the bank's network of processes as a social network in order to analyze it.
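To make the "social network" analogy concrete, here is a hedged sketch of how such a dependency chain could be queried through Neo4j 2.x's transactional Cypher endpoint from Node.js; the Process label, the PRECEDES relationship and the job name are assumptions for illustration, not Salamander's actual model:

```javascript
// Sketch: asking Neo4j 2.x for the chain of predecessors of a batch job
// through its transactional Cypher endpoint. Label, relationship type and
// job name are hypothetical.
var request = require('request');

var cypher = 'MATCH (p:Process {name: {name}})<-[:PRECEDES*1..5]-(pred) ' +
             'RETURN DISTINCT pred.name AS predecessor';

request.post({
  url: 'http://neo4j-host:7474/db/data/transaction/commit',
  json: { statements: [{ statement: cypher, parameters: { name: 'JOB_0042' } }] }
}, function (err, res, body) {
  if (err) return console.error(err);
  body.results[0].data.forEach(function (row) {
    console.log('predecessor:', row.row[0]);
  });
});
```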

ETL & Data Typologies

Data Properties

1 day ≈ 300,000 executions

Moreover, the raw data from the logs, with a more or less constant format, is stored in a document database. We chose MongoDB for its proximity to the technology stack used, and following Neo4j's recommendations regarding massive node storage. This additional separation allows us to reduce the size of the nodes and improve the overall performance of Neo4j.
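A minimal sketch of the kind of document this separation implies, written with the native Node.js MongoDB driver; the collection name and document fields are hypothetical:

```javascript
// Sketch: one raw log execution stored as a MongoDB document, keeping the
// heavy properties out of Neo4j. Collection name and fields are illustrative.
var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://mongo-host:27017/salamander', function (err, db) {
  if (err) throw err;

  db.collection('executions').insert({
    jobName: 'JOB_0042',                 // hypothetical identifiers
    day: '2015-11-16',
    startTime: new Date('2015-11-16T02:10:00Z'),
    endTime: new Date('2015-11-16T02:14:30Z'),
    returnCode: 0,
    rawLog: '...'                        // the raw log text for this execution
  }, function (err) {
    if (err) console.error(err);
    db.close();
  });
});
```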

Workflows and pipelines

Like any other IT project, we aim to automate every possible process, with regard to the code but also to the loading of data.

The way we solved the code-related needs was to design a "pipeline" of deployments, linking the stages of the app to different branches of the code repositories.

The solution adopted for the data has allowed us to create a "flow" of repetitive, stable and safe loading tasks that is the basis for generating new containers. We'll come back to this in later slides.

Deploy & Continuous Integration

VPC

EC2

DevOps Flow

Here we have a number of well-known DevOps and Continuous Integration tools.

Above all, the process is governed by Jenkins, which is responsible for preprocessing, deployments, downloading code and dependencies, testing and so on.

In addition, we include some technologies that allow us to monitor the health of the platform and investigate its performance.

Data Pipeline

Process

Java

AWS S3

The other data flow of the project is driven by Rundeck. Through this platform, the files are collected from a secure file server, saved to AWS S3, and then either processed and loaded into MongoDB or baked into Docker containers that are stored in a private Docker Registry.
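As a hedged sketch of the S3 step of this pipeline using the AWS SDK for Node.js (bucket, key and local path are illustrative, not the project's real names):

```javascript
// Sketch: pushing a collected log file to AWS S3, one step of the Rundeck
// pipeline. Bucket, key and local path are illustrative.
var fs = require('fs');
var AWS = require('aws-sdk');

var s3 = new AWS.S3({ region: 'eu-west-1' });

s3.putObject({
  Bucket: 'salamander-raw-logs',
  Key: 'batch/2015-11-16/planner.log',
  Body: fs.readFileSync('/incoming/planner.log')
}, function (err, data) {
  if (err) return console.error('upload failed:', err);
  console.log('stored in S3, ETag:', data.ETag);
});
```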

Cloud Architecture

The idea that closes this introduction is the infrastructure. We have talked about the purpose of Salamander, its composition, its tools and its processes.

We therefore need a map that puts the situation on the table for all stakeholders, completing the picture, showing the relationships and, above all, helping us to place the use of Docker.

Proof of Concept

Elastic Beanstalk
EMR
EC2

S3

Salamander started as a proof of concept, with four machines and a couple of AWS services. One might think that infrastructure needs led us to increase the number of machines and the parts involved, chasing labels such as high availability, fault tolerance, redundancy... nothing could be further from the truth.

In fact, for nearly two months we had the backend deployed on a small EC2 instance instead of a medium one. In the Salamander project, it was the data that drove the improvements to the infrastructure.


MongoDB

Data Growth
DB on Instance

ReplicaSet

Sharding

ReplicaSet

High Availability

Horizontal Scaling
The data requirements forced us to grow...

+

For MongoDB, the road was clear. We went from 1 to 3 machines in a replica set, reading from all nodes. After that, we moved to a sharded cluster, reaching a total of 9 machines (3 config servers, 2 query routers and 4 shards). The next step is to replace the shards with replica sets, going from 9 to 16 machines. Is everything clear so far? The data is roughly logs, so more size means more machines. We are simply using the tools offered by the database.
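A short sketch of how the two topologies are addressed from the Node.js driver, assuming hypothetical host and database names; the application code stays the same while the connection string evolves with the infrastructure:

```javascript
// Sketch: the same driver code works against both topologies; only the
// connection string changes. Host and database names are illustrative.
var MongoClient = require('mongodb').MongoClient;

// Stage 1: a 3-node replica set, reading from all members.
var replicaSetUri =
  'mongodb://mongo1:27017,mongo2:27017,mongo3:27017/salamander' +
  '?replicaSet=rs0&readPreference=secondaryPreferred';

// Stage 2: a sharded cluster; the client only talks to the query routers (mongos).
var shardedUri = 'mongodb://mongos1:27017,mongos2:27017/salamander';

MongoClient.connect(shardedUri, function (err, db) {
  if (err) throw err;
  // mongos routes each query to the shard that owns the data.
  db.collection('executions').count(function (err, total) {
    console.log('stored executions:', total);
    db.close();
  });
});
```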

But what about Neo4j?

MicroData
DB on Instance

Vertical Scaling
DB on Instance

...

DATA
Concept of MicroData = Isolation + Scaling
Neo4j

For Neo4j the road is also well defined: arbiter, cluster instances, shared cache. That's all very well if you have a complex network with different kinds of nodes, strongly associated and queried with some frequency.

However, what happens if our data does not need to be queried directly? What if we only want to ask for a snapshot? A day in Salamander is a piece of information of about 300 megabytes with 250,000 nodes, each of them with similar characteristics, yet hardly different from the 250,000 nodes of the next day.

Thus arose the concept of microData: the simple and clear segmentation of immutable, persistent information, designed to be easy to access on demand.

Production Situation
EC2 Container Service

Identity & Access Management

Glacier

ElastiCache

EC2

S3

VPC

Here we can see a map of the current infrastructure of the project. This environment is replicated for test and staging, with slight differences in the capabilities of the machines and the nature of the data.

The lower path shows the initial application with three layers: frontend, backend and proxy. It also shows an M2M API, in the top zone, that provides a search engine for Salamander.

In the green zone we can see two groups of MongoDB databases: a replica set for high availability and the sharded cluster for massive storage of processed logs.

The blue area, with three clusters of instances, is for running Docker containers. With this in view, we can better explain the three challenges we have faced with Docker.


Docker Challenge #1

The first challenge we solved was about storing the containers. A container acts like a book: it is useful by itself, but it still needs to be stored somewhere. Our new data access unit had been decided, and we could not keep growing exponentially, forcing resources to scale vertically.

Docker Challenge #1

1 Machine & 3x Containers
AWS S3 Persistence, virtually infinite
Docker Compose
Docker Registry

S3

So, after analyzing several alternative network storage systems, we decided to use the tools provided by Docker together with Amazon Web Services. We prepared a machine with three containers: the frontend, a Redis database and a Docker Registry configured to store the images in AWS S3.

We designed a system of branches derived from our base image, with tags for easy identification. We use Docker Compose to define and deploy that infrastructure.

Docker Challenge #1
EC2 instance contents

Amazon S3 bucket

cluster #2
cluster #3

This map shows the relationship between our instances. The EC2 clusters access the registry to download the new images and run them, making them available to the application. Each cluster has its own purpose depending on the requirements of the business logic.

We are currently waiting for Amazon's new private registry service, to analyze whether it brings high availability to our infrastructure.

Docker Challenge #2

Our next challenge is related to the safety, stability and solidity of the stored information.

In the same way that we use a MongoDB replica set for redundancy and better performance, we use the resources of AWS EC2 Container Service to achieve fault tolerance, high availability and load balancing.


Cluster used to support the business logic
High Availability
Fail Redundancy
AWS EC2 Container Service: Containers as a Service

Docker Challenge #2

+

EC2 Container Service

AWS EC2 Container Service classifies containers into two kinds of entities: tasks and services. Tasks are containers that run once, while services run permanently; even if they fail, they are automatically restarted.

To avoid having to discover new ports and containers, all the machines related to the business logic point to an Elastic Load Balancer, giving the app a common entry point. This service provides round-robin load balancing between containers, fail redundancy and high availability.
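As an illustrative sketch (not the project's actual definitions), this is roughly how such a long-running container could be registered as an ECS service behind a classic Elastic Load Balancer with the AWS SDK for Node.js; cluster, service, task definition and ELB names are assumptions:

```javascript
// Sketch: registering a permanently running container as an ECS service
// behind a classic ELB. Cluster, service, task definition and ELB names
// are assumptions, not the project's real ones.
var AWS = require('aws-sdk');
var ecs = new AWS.ECS({ region: 'eu-west-1' });

ecs.createService({
  cluster: 'salamander-cluster-2',
  serviceName: 'neo4j-data',
  taskDefinition: 'neo4j-microdata:1',   // registered beforehand
  desiredCount: 2,                        // two containers behind the balancer
  role: 'ecsServiceRole',
  loadBalancers: [{
    loadBalancerName: 'salamander-elb',   // the app's common entry point
    containerName: 'neo4j',
    containerPort: 7474
  }]
}, function (err, data) {
  if (err) return console.error(err);
  console.log('service created:', data.service.serviceArn);
});
```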

Cluster #2
EC2 Container Service

EC2 instance contents

DATA
Elastic Load Balancing

EC2 instance contents

DATA

EC2 instance
web app server
AWS ECS Services
Docker Challenge #2

The proposed solution is based on a cluster of several machines running the containers with the most vital information. AWS EC2 Container Service, together with AWS Elastic Load Balancing, is in charge of distributing the requests between two containers. The RESTful Neo4j driver allows us to obtain the information regardless of the node and removes the need for fixed channels or sessions. The distribution of containers is performed by the default scheduler offered by AWS.

Docker Challenge #3

The last of the challenges we have solved in Salamander at the production level came as a request from our client. It was an unusual request, but then the whole project is unusual: the client wanted to run their own queries against the database.

Security & Stability vs Flexibility
Neo4j Web GUI Browser
Stress Isolation
AWS EC2 Container Service: Containers as a Service

Docker Challenge #3

EC2 Container Service

In order to solve this problem we had to compromise between security and the stability of the systems. The design had to satisfy the client, the security department and the IT team. The web application uses a tunnel formed by two secured proxies, allowing valid users to connect to a container that holds an exact copy of the same database used by the other screens of the application, all without harming the overall performance.

Docker Challenge #3

Cluster #3
EC2 Container Service

EC2 instance contents

EC2 instance
web app server

EC2 instance
REST server

DATA

????????????
Frontend: Web Proxy
Backend: Reverse Proxy

user

The design was as follows: a cluster with one machine running a container that acts as an essential service, using the names provided by AWS EC2 Container Service. This service is responsible for keeping the container alive. The requests are proxied from the other parts of the application directly to the container. Neo4j is configured as read-only, so it does not transfer its own bottlenecks to the app.

Next Challenges

Well, these resolved situations are currently working in production, but we have new challenges that we are working on, and we would like to share them here.

Docker Challenge #4

AWS ECS Container: tasks
DB as a Service
Containers
Auto-organized proxy nodes

Seaport: Cluster Discovery & Management

Node.js

EC2 Container Service

We are currently working on a system to provide the possibility of choosing which microData container to load and view. We are using Seaport, a Node.js library for discovering and managing services in a cluster, together with tasks from EC2 Container Service.
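A minimal sketch of how Seaport's registry could glue these pieces together; the service names, hosts and ports are hypothetical:

```javascript
// Sketch: Seaport as the glue between the orchestrator, the microData
// containers and the backend. Names, hosts and ports are hypothetical.
var seaport = require('seaport');

// Orchestrator side: run the Seaport registry.
seaport.createServer().listen(9090);

// Container side: register the container's Neo4j port under a versioned name.
var ports = seaport.connect('orchestrator-host', 9090);
var port = ports.register('microdata-2015-11-16@1.0.0');
console.log('microData container registered on port', port);

// Backend side: resolve the service before proxying a request to it.
ports.get('microdata-2015-11-16@1.0.x', function (services) {
  console.log('proxy to', services[0].host + ':' + services[0].port);
});
```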

New Challenges
Private Registry

Amazon S3 bucket

cluster

EC2 instance
web app server

EC2 instance
REST server
Frontend: Web App
Backend: REST Server

user

EC2 instance
REST server
Orchestrator: Proxy & Mgmt

DATA

DATA

DATA...

DATA

Node.js

seaport

This is the final snapshot. A user can choose from several microData containers, even ones that are initially switched off. The request travels to the backend as usual. These requests finally arrive at an orchestrator that raises a thread for each database. Each thread is responsible for managing a container as a task in EC2 Container Service and forwarding the requests to that container. Redis is used as the database to save timestamps and check the last use of a container. Each thread evaluates this information to switch itself and the container off if they have not been used for a long time.
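A hedged sketch of the idle-container check described above, using the Node.js Redis client and the ECS API; the key prefix, cluster name and the 30-minute threshold are assumptions for illustration:

```javascript
// Sketch: each orchestrator thread records the last use of its container in
// Redis and stops the ECS task after a period of inactivity. Key prefix,
// cluster name and the 30-minute threshold are assumptions.
var redis = require('redis');
var AWS = require('aws-sdk');

var client = redis.createClient(6379, 'redis-host');
var ecs = new AWS.ECS({ region: 'eu-west-1' });
var IDLE_LIMIT_MS = 30 * 60 * 1000;

// Called every time the thread proxies a request to its container.
function touch(containerId) {
  client.set('lastUse:' + containerId, String(Date.now()));
}

// Scheduled periodically per thread: stop the task if it has been idle.
function checkIdle(containerId, taskArn) {
  client.get('lastUse:' + containerId, function (err, ts) {
    if (err || !ts) return;
    if (Date.now() - Number(ts) > IDLE_LIMIT_MS) {
      ecs.stopTask({ cluster: 'salamander-microdata', task: taskArn },
        function (err) {
          if (!err) console.log('stopped idle container:', containerId);
        });
    }
  });
}
```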

The private Docker Registry is used to know which microData sets are available to deploy; they are read from an S3 bucket. The orchestrator offers this information to the backend, and from there to the frontend. The container is built by downloading its image from our private registry.

Next Challenges

Reduce the size of containers
AWS Private Registry Service
Replace Rundeck with AWS Lambda

Lambda

Do you remember the third challenge solved with Docker? There was an unlabeled container. One of our next challenges is to provide an Apache Zeppelin container so that our users can access and query the raw data with tools such as Spark.

Another point is reducing the size of the images; we are running tests to replace containers based on debian:java with others based on alpine:java.

Another interesting avenue to investigate is the use of AWS Lambda for the ETL processes.

Conclusions

Looking back over our steps, I think the entire slide deck can be summarized as follows. There are two ideas that are the most important of this talk; if you remember them, the time has been well spent.

Conclusions

Docker helps us avoid a highly complex database infrastructure.

Data Segmentation
Isolation
Fine-Tuning Performance

Docker has been the engine that has allowed us to divide the information and propose a robust, highly scalable data model with a segmentation appropriate to our needs.

Conclusions

The AWS EC2 Container Service API saves us from a more complex architecture.

High Availability
Fail Redundancy
Management
Service Discovery
Reducing Costs

AWS EC2 Container Service has been the key element in simplifying the container infrastructure. Without it, we would have had to allocate more resources, specific software and machines.

Requirements

Some criteria that helped us:

Immutable Data
Segmentable Info
NoSQL with RESTful Driver

Albert Uderzo

Salamander is a unique project with a lot of features that have allowed us to experiment with different technologies. For example, Neo4j's RESTful interface made it easy for us to use horizontally scalable resources. The nature of the data is another point in favor of using it with Docker containers.

DEMO

Link to video

After this endless series of slides, we will watch a video with a tour of the application's functionality. This video shows the challenges alongside the screens and interactions of the app.

Our Team

Salamander is a project that has lasted about two years and could not have happened without the participation of many people, to whom I am grateful. This slide is dedicated to them, in an effort to thank them for their work. Thank you very much.

Name              Technology        Version
REDIS             Database NoSQL    N/A / 2:3.0.5-3
Neo4j             Database NoSQL    2.1.6 / 2.2
MongoDB           Database NoSQL    2.6.11
Docker Registry   Service           1.6
NodeJS            App Development   v4.0.0 / v0.12.2 / v0.10.29
AWS SDK           App Development   2.0.31
Express           App Development   4.0.0
AngularJS         App Development   1.4.7

References & Versions

Image @ slide   Link
7               https://www.flickr.com/photos/niznoz/3116766661
10              https://www.flickr.com/photos/fordeu/5709826282
13              https://unsplash.com/photos/ni9mKm62QnA
18              http://mrg.bz/xBONJ8
21              https://www.flickr.com/photos/cote/14692025435/
24              https://flic.kr/p/imxgqh
27              https://unsplash.com/photos/ZXVk-NMWtgg
31              https://unsplash.com/photos/_r19nfvS3wY

Image Credits & Attributions

All product names, logos, brands and other trademarks featured or referred to in this presentation are the property of their respective trademark holders.

Feel free to contact us:[email protected]@[email protected]

www.beeva.com [email protected]

Thank you!
Manuel E. de Paz
[email protected]@beeva.com