European Commission – DIGIT D1
Big Data Test Infrastructure Updates for ESTAT
October 2018
Big Data Test Infrastructure - Objectives
Table of Contents
• State of the art of the BDTI initiative
1. Where are we?
2. The business service "PaaS for implementing Big Data Use Cases"
• Next steps
1. Roadmap
2. Potential collaboration with ESTAT
• Alignment for the potential collaboration with ESTAT
Big Data Test Infrastructure - The state-of-the-art Where are we?
1. Core business service: PaaS – Sandbox environment
2. The Community Building and Innovation Portal
Big Data Test Infrastructure - The state-of-the-art Focus on PaaS- Sandbox Business Service
Amazon Virtual Machines Example (Hardware Stack)
To estimate the resources needed for the virtual machines, we used the technical requirements of the ESTAT pilots. Please note that these example templates were prepared to cover any kind of requirement; for this reason they are larger than what a standard pilot will need.
In addition to the technical requirements, we also made the following usage estimates:
• Number of nodes/servers: 6 (2 master and 4 slave nodes);
• Average monthly usage of the BDTI: 132 hours/month (6 hours a day, 22 working days a month);
• Average storage needed per pilot: 10 TB (including multiple backups of the same data on HDFS);
• Server location: Ireland (due to its lower cost compared to the cloud providers' other data centres).
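As a rough cross-check of the usage figures above, the following minimal Python sketch derives the monthly hours and the monthly node-hours from the quoted inputs. The function names are illustrative only, and the HDFS replication factor of 3 is an assumption: it is the HDFS default, but the slides do not state which value the pilots would use.

```python
# Usage assumptions quoted from the estimation above.
NODES = 6                    # 2 master + 4 slave nodes
HOURS_PER_DAY = 6
WORKING_DAYS_PER_MONTH = 22
STORAGE_TB_PER_PILOT = 10    # total, including replicas on HDFS

# Assumption (not stated in the slides): HDFS default replication factor.
ASSUMED_HDFS_REPLICATION = 3


def monthly_hours(hours_per_day, working_days):
    """Hours per month during which the cluster is expected to run."""
    return hours_per_day * working_days


def monthly_node_hours(nodes, hours_per_day, working_days):
    """Instance-hours per month across all nodes of the cluster."""
    return nodes * monthly_hours(hours_per_day, working_days)


def unique_data_tb(total_tb, replication):
    """Approximate volume of distinct data once replicas are discounted."""
    return total_tb / replication


if __name__ == "__main__":
    print(monthly_hours(HOURS_PER_DAY, WORKING_DAYS_PER_MONTH))              # 132
    print(monthly_node_hours(NODES, HOURS_PER_DAY, WORKING_DAYS_PER_MONTH))  # 792
    print(f"{unique_data_tb(STORAGE_TB_PER_PILOT, ASSUMED_HDFS_REPLICATION):.2f}")  # 3.33
```

Under these inputs, 10 TB of HDFS storage would hold roughly 3.3 TB of distinct data per pilot; a different replication factor changes that figure proportionally.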
For the hardware stack of the PaaS Business Service, AWS instances will be used.
Technical Specifications of Single Templates

| Template | Model: family name | Model: vendor name | Model: subtype | Storage: size | Storage: type | RAM | Computing capacity: name of CPU | Computing capacity: frequency | Computing capacity: number of CPU | Network |
| Amazon 1 | Compute optimized | C5 | c5.9xlarge | - | EBS only | 72 GB | Intel Xeon Platinum | 3.0 GHz | 36 vCPU | 10 Gbps |
| Amazon 2 | Compute optimized | C5 | c5.18xlarge | - | EBS only | 144 GB | Intel Xeon Platinum | 3.0 GHz | 72 vCPU | 25 Gbps |
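To put the two templates in perspective, the sketch below totals vCPUs and RAM for a six-node cluster built from each template. It assumes, purely for illustration, that all six nodes use the same template; the slides do not say whether master and slave nodes would share an instance type. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name: str
    instance_type: str
    vcpus: int
    ram_gb: int


# Per-instance figures taken from the template table above.
TEMPLATES = [
    Template("Amazon 1", "c5.9xlarge", vcpus=36, ram_gb=72),
    Template("Amazon 2", "c5.18xlarge", vcpus=72, ram_gb=144),
]

NODES = 6  # 2 master + 4 slave nodes, as in the usage estimate


def cluster_totals(template, nodes):
    """Aggregate capacity, assuming every node uses the same template."""
    return {
        "vcpus": template.vcpus * nodes,
        "ram_gb": template.ram_gb * nodes,
    }


if __name__ == "__main__":
    for t in TEMPLATES:
        totals = cluster_totals(t, NODES)
        print(f"{t.name} ({t.instance_type}): "
              f"{totals['vcpus']} vCPU, {totals['ram_gb']} GB RAM")
```

With six identical nodes this gives 216 vCPU and 432 GB of RAM for the Amazon 1 template, and 432 vCPU and 864 GB for Amazon 2.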
Big Data Test Infrastructure - The state-of-the-art Focus on PaaS Business Service – Software Stack
Amazon EMR Features
Elastic
It enables quick and easy provisioning of as much capacity as one needs, and capacity can be added or removed automatically or manually. There are two possible options:
• Deploying multiple clusters
• Resizing a running cluster
Low cost
It is designed to reduce the cost of processing large amounts of data. Features that make it low cost include per-second pricing, as well as:
• Amazon EC2 Spot integration;
• Amazon EC2 Reserved Instance integration;
• Amazon S3 integration.
Flexible data stores
One can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
The PaaS will be implemented with Amazon Cloud services and Amazon EMR.
Big Data Test Infrastructure - The state-of-the-art Focus on PaaS Business Service – Software Stack
Amazon EMR Features
Use the tool you want
One can easily select and use the latest open source projects on the EMR cluster, including applications in the Apache Spark and Hadoop ecosystems. Software is installed and configured by Amazon EMR.
Hadoop tools
Amazon EMR supports powerful and proven Hadoop tools such as:
• Spark • Hive • Pig • HBase
• Impala • Phoenix • Hue • Presto
• Zeppelin • Oozie • Tez • Flink
Additionally, using bootstrap actions, it can run distributed computing frameworks other than Hadoop MapReduce, such as Spark or Presto. One can also use Hue and Zeppelin as GUIs for interacting with applications on the cluster.
Third Party tools
Amazon EMR can be used with a wide variety of third-party software tools.
For the software stack of the PaaS Business Service, Amazon EMR will be used.
Open source software preinstalled
Big Data Test Infrastructure - Proposed next steps Roadmap of the initiative
[Roadmap chart: activities per quarter, 2018 (Q3, Q4) and 2019 (Q1 to Q4). Legend: Done / To be done.]
Activities:
• Big Data Platform
• Target operating & governance model
• BDTI Community building and innovation portal
• Additional BDTI business services
• CEF collaboration
Tasks shown on the chart:
• Design of the Big Data PaaS
• BDTI implementation
• Big Data platform activation (contract, config, test)
• Implementation and running of the PaaS service
• First set of pilots
• Second set of pilots
• TOM and governance model definition
• Refinement and implementation of governance model
• Governance & Operational model execution (on-going updating)
• Design of the Community building service
• Configuration of the Atlassian suite
• Implementation and running of the Community building service (portal management, workshops and know-how sharing sessions)
• Design of the other BDTI business services
• Implementation of the other BDTI services
• Definition of the BDTI technical specification
Milestones:
• 15/12/2018: Big Data platform activation
• End of Q2 2019: Results of the first set of pilots
• End of 2019: Results of the second set of pilots and technical specification release
Big Data Test Infrastructure - Proposed next steps Potential collaboration with ESTAT
1. TEST
Proposed duration: 15-20 days
Objective: fine tuning and training of the platform
ESTAT could support this phase in order to have bilateral benefits.
2. FIRST ROUND FOR PILOTING
Proposed duration: 4-6 months
Objective: implementation of 3-5 pilots in total
ESTAT could implement 2-3 pilot projects.
3. SECOND ROUND FOR PILOTING
Proposed duration: 4-6 months
Objective: implementation of 3-5 pilots in total
ESTAT could implement 2-3 pilot projects.
In parallel, the platform will be used by Member States (MSs) to implement 3-5 pilot projects during the year.
Big Data Test Infrastructure - Questions Alignment for the potential collaboration with ESTAT
Q1. Do you agree with the general idea of PaaS?
Q2. Do you agree with the estimated Infrastructure resources per pilot?
Q3. Do you have expertise using Amazon AWS and, in particular, Amazon EMR?
Q4. What do you think about storing data in local file systems?