European Commission – DIGIT D1

Big Data Test Infrastructure Updates for ESTAT

October 2018

Big Data Test Infrastructure - Objectives

• State of the art of the BDTI initiative

1. Where are we?

2. The business service "PaaS for implementing Big Data Use Cases"

• Next steps

1. Roadmap

2. Potential collaboration with ESTAT

• Alignment for the potential collaboration with ESTAT

Table of Contents

Big Data Test Infrastructure - The state-of-the-art Where are we?

1. Core business service: PaaS – Sandbox environment

2. The Community Building and Innovation Portal

Big Data Test Infrastructure - The state-of-the-art Focus on PaaS- Sandbox Business Service

Amazon Virtual Machines Example (Hardware Stack)

To estimate the resources needed for the virtual machines, we used the technical requirements of the ESTAT pilots. Please note that these example templates were prepared to cover any kind of requirement; for this reason, they are larger than what a standard pilot will need.

In addition to the technical requirements, we also made the following usage estimates:

• Number of nodes/servers: 6 (2 master and 4 slaves);
• Average monthly usage of the BDTI: 132 hours/month (6 hours a day, 22 working days a month);
• Average storage needed per pilot: 10 TB (including multiple back-ups of the same data on HDFS);
• Server location: Ireland (due to its lower cost compared to the other data centres of the cloud providers).
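The usage estimates above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the replication factor of 3 is an assumption (the HDFS default), since the slides only mention "multiple" back-up copies.

```python
# Figures taken from the usage estimates in the slides.
NODES = 6                     # 2 master + 4 slave nodes
HOURS_PER_DAY = 6
WORKING_DAYS_PER_MONTH = 22
STORAGE_TB = 10               # per pilot, including back-up copies on HDFS

# Assumption (not stated in the slides): the back-up copies correspond to
# the HDFS default replication factor of 3.
ASSUMED_HDFS_REPLICATION = 3


def monthly_usage_hours(hours_per_day: int, working_days: int) -> int:
    """Hours per month during which the cluster is running."""
    return hours_per_day * working_days


def monthly_node_hours(nodes: int, usage_hours: int) -> int:
    """Total instance-hours per month across all nodes."""
    return nodes * usage_hours


def unique_data_tb(raw_tb: float, replication: int) -> float:
    """Data volume left once replicated copies are discounted."""
    return raw_tb / replication


usage = monthly_usage_hours(HOURS_PER_DAY, WORKING_DAYS_PER_MONTH)
node_hours = monthly_node_hours(NODES, usage)
unique_tb = unique_data_tb(STORAGE_TB, ASSUMED_HDFS_REPLICATION)

print(f"Cluster usage: {usage} hours/month")
print(f"Instance-hours: {node_hours} per month")
print(f"Unique data under the assumed replication: {unique_tb:.2f} TB")
```

The first figure reproduces the 132 hours/month quoted in the slides; the other two are derived values that depend on the stated assumptions.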

For the hardware stack of the PaaS Business Service, AWS instances will be used.

Technical Specifications of Single Templates

Specification                        Amazon 1              Amazon 2
Model: family name                   Compute optimized     Compute optimized
Model: vendor name                   C5                    C5
Model: subtype                       c5.9xlarge            c5.18xlarge
Storage: size                        -                     -
Storage: type                        EBS only              EBS only
RAM                                  72 GB                 144 GB
Computing capacity: name of CPU      Intel Xeon Platinum   Intel Xeon Platinum
Computing capacity: frequency        3.0 GHz               3.0 GHz
Computing capacity: number of CPU    36 vCPU               72 vCPU
Network                              10 Gbps               25 Gbps
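To get a feel for the size of a full cluster built from either template, the per-instance figures can be multiplied by the estimated node count. This is a rough sketch that assumes all six nodes use the same template, which the slides do not state; the `Template` class and `cluster_totals` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name: str
    instance_type: str
    vcpus: int
    ram_gb: int


# Values copied from the template specifications in the slides.
AMAZON_1 = Template("Amazon 1", "c5.9xlarge", vcpus=36, ram_gb=72)
AMAZON_2 = Template("Amazon 2", "c5.18xlarge", vcpus=72, ram_gb=144)

NODES = 6  # 2 master + 4 slave nodes, from the usage estimate


def cluster_totals(template: Template, nodes: int) -> dict:
    """Aggregate vCPU and RAM, assuming every node uses the same template."""
    return {"vcpus": template.vcpus * nodes, "ram_gb": template.ram_gb * nodes}


for template in (AMAZON_1, AMAZON_2):
    totals = cluster_totals(template, NODES)
    print(
        f"{template.name} ({template.instance_type}): "
        f"{totals['vcpus']} vCPU, {totals['ram_gb']} GB RAM"
    )
```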

Big Data Test Infrastructure - The state-of-the-art Focus on PaaS Business Service – Software Stack

Amazon EMR Features

Elastic

It enables quick and easy provisioning of as much capacity as one needs, and capacity can be added or removed automatically or manually. There are two possible options:

• Deploying multiple clusters
• Resizing a running cluster
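As an illustration of the second option, resizing a running cluster, the sketch below builds the arguments one would typically pass to the `modify_instance_groups` call of the boto3 EMR client. It only constructs the request; the helper name and the cluster and instance group identifiers are hypothetical placeholders, not values from the BDTI setup.

```python
def build_resize_request(cluster_id: str, instance_group_id: str, new_count: int) -> dict:
    """Build keyword arguments for the EMR ModifyInstanceGroups operation."""
    if new_count < 0:
        raise ValueError("new_count must not be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,
                "InstanceCount": new_count,
            }
        ],
    }


# Placeholder identifiers, for illustration only.
request = build_resize_request("j-EXAMPLE", "ig-EXAMPLE", new_count=6)
print(request)
```

With boto3 installed and credentials configured, the request would be sent with something like `boto3.client("emr", region_name="eu-west-1").modify_instance_groups(**request)`, where `eu-west-1` is the Ireland region.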

Low cost

It is designed to reduce the cost of processing large amounts of data. Features that make it low cost include low per-second pricing, together with:

• Amazon EC2 Spot integration;
• Amazon EC2 Reserved Instance integration;
• Amazon S3 integration.

Flexible data stores

One can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.

The PaaS will be implemented with Amazon Cloud services and Amazon EMR.

Big Data Test Infrastructure - The state-of-the-art Focus on PaaS Business Service – Software Stack

Amazon EMR Features

Use the tool you want

One can easily select and use the latest open source projects on the EMR cluster, including applications in the Apache Spark and Hadoop ecosystems. Software is installed and configured by Amazon EMR.

Hadoop tools

Amazon EMR supports powerful and proven Hadoop tools such as:

• Spark • Hive • Pig • HBase

• Impala • Phoenix • Hue • Presto

• Zeppelin • Oozie • Tez • Flink

Additionally, it can run distributed computing frameworks besides Hadoop MapReduce such as Spark or Presto using bootstrap actions. You can also use Hue and Zeppelin as GUIs for interacting with applications on your cluster.
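A cluster with a chosen set of these tools preinstalled could be requested roughly as follows. The sketch only builds the arguments for the `run_job_flow` call of the boto3 EMR client. Several values are assumptions rather than BDTI settings: the cluster name, the release label, the default EMR role names, and the layout of one master plus four core nodes (the slides' own estimate is 2 master and 4 slave nodes).

```python
def build_cluster_request(
    name: str,
    instance_type: str,
    core_nodes: int,
    applications: list,
    release_label: str,
) -> dict:
    """Build keyword arguments for the EMR RunJobFlow operation."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Each application is installed and configured by EMR itself.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,  # assumption: single master node
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_nodes,
                },
            ],
            # Keep the cluster running so it can be used interactively.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


cluster_request = build_cluster_request(
    name="bdti-sandbox-example",      # hypothetical name
    instance_type="c5.9xlarge",       # "Amazon 1" template
    core_nodes=4,
    applications=["Spark", "Hive", "Hue", "Zeppelin"],
    release_label="emr-5.17.0",       # placeholder release label
)
print(cluster_request["Applications"])
```

With boto3 installed and credentials configured, the request would be submitted with something like `boto3.client("emr", region_name="eu-west-1").run_job_flow(**cluster_request)`.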

Third Party tools

Amazon EMR can be used with a wide variety of third-party software tools.

For the software stack of the PaaS Business Service, Amazon EMR will be used.

Big Data Test Infrastructure - Proposed next steps Roadmap of the initiative

[Roadmap chart: activities planned per quarter from Q3 2018 to Q4 2019. Legend: Done / To be done]

Activities

• Big Data Platform
  • Design of the Big Data PaaS
  • Definition of the BDTI technical specification
  • BDTI implementation
  • Implementation and running of the PaaS service (first set of pilots, second set of pilots)
• Target operating & governance model
  • TOM and governance model definition
  • Refinement and implementation of governance model
  • Governance & Operational model execution (on-going updating)
• BDTI Community building and innovation portal
  • Design of the Community building service
  • Configuration of the Atlassian suite
  • Implementation and running of the Community building service (portal management, workshops and know-how sharing sessions)
• Additional BDTI business services
  • Design of the other BDTI business services
  • Implementation of the other BDTI services
• CEF collaboration

Milestones

• 15/12/2018: Big Data platform activation (contract, config, test)
• End of Q2 2019: Results of the first set of pilots
• End of 2019: Results of the second set of pilots and technical specification release

Big Data Test Infrastructure - Proposed next steps Potential collaboration with ESTAT

1. TEST
• Proposed duration: 15-20 days
• Objective: fine tuning and training of the platform
• ESTAT could support this phase in order to have bilateral benefits

2. FIRST ROUND FOR PILOTING
• Proposed duration: 4-6 months
• Objective: implementation of 3-5 pilots in total
• ESTAT could implement 2-3 pilot projects

3. SECOND ROUND FOR PILOTING
• Proposed duration: 4-6 months
• Objective: implementation of 3-5 pilots in total
• ESTAT could implement 2-3 pilot projects

The platform will be used at the same time by the Member States (MSs) to implement 3-5 pilot projects during the year.

Big Data Test Infrastructure - Questions Alignment for the potential collaboration with ESTAT

Q1. Do you agree with the general idea of PaaS?

Q2. Do you agree with the estimated Infrastructure resources per pilot?

Q3. Do you have expertise using Amazon AWS and, in particular, Amazon EMR?

Q4. What do you think about storing data in local file systems?