The Next-Generation Enterprise-Class Architecture

Massimo Brignoli
Principal Solution [email protected]
@massimobrignoli


Agenda
• The birth of Data Lakes
• Overview of MongoDB
• Proposal for an EDM architecture
• Case Study & Scenarios
• Data Lake Lessons Learned

How much data?
• One thing companies do not lack: data
– Sensor streams
– Social media sentiment
– Server logs
– Mobile apps
• Analysts estimate that data volume grows 40% per year, 90% of it unstructured.
• Traditional technologies (some designed 40 years ago) are not sufficient

The Promise of “Big Data”
• Discovering information by collecting and analyzing data holds the promise of
– A competitive advantage
– Cost savings
• A common example of Big Data technology in use is the “Single View”: aggregating everything known about a customer to improve engagement and revenue
• The traditional EDW creaks under the load, overwhelmed by the volume and variety of data (and by its high cost).

The Birth of Data Lakes
• Many companies have started looking toward an architecture called the Data Lake:
– A platform for managing data flexibly
– For aggregating cross-silo data in a single place
– Enables exploration of all the data
• The most popular platform at the moment is Hadoop:
– Enables horizontal scalability on commodity hardware
– Supports a varied data schema optimized for reads
– Includes data processing layers in SQL and common languages
– Major reference customers (Yahoo and Google first among them)

Why Hadoop?
• The Hadoop Distributed File System is designed to scale on large batch operations
• Provides a write-once, read-many, append-only model
• Optimized for long scans over TB or PB of data
• This ability to handle multi-structured data is used for:
– Customer segmentation for marketing campaigns and recommendations
– Predictive analytics
– Risk models

But is it right for everything?
• Data Lakes are designed to deliver Hadoop's output to online applications. These applications have requirements including:
– Response latency in milliseconds
– Random access to an indexed subset of the data
– Support for expressive queries and data aggregations
– Real-time updates of data whose values change frequently

Is Hadoop the answer to everything?
• In our now data-driven world, milliseconds matter.
– IBM researchers state that 60% of data loses value within milliseconds of being generated
– For example, identifying a fraudulent stock trade is useless after a few minutes
• Gartner predicts that 70% of Hadoop installations will fail because they do not meet their cost and revenue-growth objectives.

Enterprise Data Management Pipeline
[Diagram: siloed source databases, external feeds (batch) and streams enter through pub-sub, ETL, file imports and stream processing; pipeline stages are Store raw data → Transform → Aggregate → Analyze; output goes to users and other systems]
Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png

In Detail
• Unnecessary joins cause poor performance
• Scaling vertically is expensive
• The rigid schema makes it hard to consolidate variable or unstructured data
• There are differences between records that must be eliminated during the aggregation phase
• Processes often run for hours overnight
• Data is too stale for intraday decisions

Quick Overview of MongoDB

Documents Enable Dynamic Schema & Optimal Performance

Relational

Customer ID  First Name  Last Name  City
0            John        Doe        New York
1            Mark        Smith      San Francisco
2            Jay         Black      Newark
3            Meagan      White      London
4            Edward      Daniels    Boston

Phone Number    Type  DNC     Customer ID
1-212-555-1212  home  T       0
1-212-555-1213  home  T       0
1-212-555-1214  cell  F       0
1-212-777-1212  home  T       1
1-212-777-1213  cell  (null)  1
1-212-888-1212  home  F       2

MongoDB

{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones: [
    { number : "1-212-777-1212", dnc : true, type : "home" },
    { number : "1-212-777-1213", type : "cell" }
  ]
}
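To make the contrast on this slide concrete, here is a minimal sketch in plain JavaScript (no database involved) that folds rows from the two relational tables into one document per customer, with the phone numbers embedded as an array. The helper name `embedPhones` and the row field names are assumptions made for this example; they are not part of MongoDB.

```javascript
// Rows as they appear in the two relational tables on the slide (subset).
const customers = [
  { customer_id: 0, first_name: "John", last_name: "Doe", city: "New York" },
  { customer_id: 1, first_name: "Mark", last_name: "Smith", city: "San Francisco" },
];
const phones = [
  { number: "1-212-555-1212", type: "home", dnc: true, customer_id: 0 },
  { number: "1-212-777-1212", type: "home", dnc: true, customer_id: 1 },
  { number: "1-212-777-1213", type: "cell", dnc: null, customer_id: 1 },
];

// Hypothetical helper: embed each customer's phone rows into the customer
// document, so reading a customer no longer needs a join.
function embedPhones(customerRows, phoneRows) {
  return customerRows.map((c) => ({
    ...c,
    phones: phoneRows
      .filter((p) => p.customer_id === c.customer_id)
      .map(({ customer_id, dnc, ...rest }) =>
        // A null DNC column is simply omitted from the document.
        dnc === null ? rest : { ...rest, dnc }
      ),
  }));
}

const docs = embedPhones(customers, phones);
console.log(JSON.stringify(docs[1], null, 2));
```

The second phone of customer 1 ends up without a `dnc` field at all, which is the dynamic-schema point the slide makes: a missing value does not need a null placeholder.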

Document Model Benefits

Agility and flexibility
• Data model supports business change
• Rapidly iterate to meet new requirements

Intuitive, natural data representation
• Eliminates ORM layer
• Developers are more productive

Reduces the need for joins, disk seeks
• Programming is simpler
• Performance delivered at scale

{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones: [
    { number : "1-212-777-1212", dnc : true, type : "home" },
    { number : "1-212-777-1213", type : "cell" }
  ]
}

MongoDB Technical Capabilities
[Diagram: Application → Driver → Mongos → Shard 1, Shard 2, … Shard N; each shard has one Primary and two Secondaries]

db.customer.insert({…})
db.customer.find({ name: "John Smith" })

1. Dynamic Document Schema
   { name: "John Smith", date: "2013-08-01", address: "10 3rd St.", phone: { home: 1234567890, mobile: 1234568138 } }
2. Native language drivers
3. High availability
4. Workload Isolation
5. High performance
   - Data locality
   - Indexes
   - RAM
6. Horizontal scalability
   - Sharding

Drivers & Ecosystem
• Java, Python, Perl, Ruby
• Morphia
• MEAN Stack

3.2 Features Relevant for EDM
• WiredTiger as default storage engine
• In-memory storage engine
• Encryption at rest
• Document Validation Rules
• Compass (data viewer & query builder)
• Connector for BI (Visualization)
• Connector for Hadoop
• Connector for Spark
• $lookup (left outer join)

Data Governance with Document Validation

Implement data governance without sacrificing agility that comes from dynamic schema

• Enforce data quality across multiple teams and applications

• Use familiar MongoDB expressions to control document structure

• Validation is optional and can be as simple as a single field, all the way to every field, including existence, data types, and regular expressions
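As a sketch of what such rules can look like, the snippet below builds a validator document from ordinary MongoDB query operators, covering a data type check, a regular expression and an existence check. The collection name `contacts` and the field names are hypothetical. The script only constructs and prints the options object; passing it to `db.createCollection` in the mongo shell (3.2 or later) is shown in a comment because it needs a running server.

```javascript
// Validator written with ordinary MongoDB query operators.
// Field names (name, email, phone) are hypothetical examples.
const validator = {
  $and: [
    { name: { $type: "string" } },            // data type check
    { email: { $regex: "@example\\.com$" } }, // regular expression check
    { phone: { $exists: true } },             // existence check
  ],
};

// Options document for collection creation. In the mongo shell:
//   db.createCollection("contacts", collectionOptions)
const collectionOptions = {
  validator,
  validationLevel: "strict",  // apply the rules to all inserts and updates
  validationAction: "error",  // reject documents that fail validation
};

console.log(JSON.stringify(collectionOptions, null, 2));
```

Setting `validationAction` to `"warn"` instead would log violations without rejecting the write, which can help when introducing rules on an existing collection.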

MongoDB Compass

For fast schema discovery and visual construction of ad-hoc queries

• Visualize schema
– Frequency of fields
– Frequency of types
– Determine validator rules
• View Documents
• Graphically build queries
• Authenticated access

MongoDB Connector for BI
Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following:

• Provides the BI tool with the schema of the MongoDB collection to be visualized

• Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing

• Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements

Dynamic Lookup
Combine data from multiple collections with left outer joins for richer analytics & more flexibility in data modeling

• Blend data from multiple sources for analysis

• Higher performance analytics with less application-side code and less effort from your developers

• Executed via the new $lookup operator, a stage in the MongoDB Aggregation Framework pipeline
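The left outer join semantics can be illustrated without a database. The sketch below is a hypothetical in-memory stand-in, not MongoDB's implementation: every document from the left collection is kept, and the matching documents from the right collection are attached as an array, which is empty when nothing matches. The collection contents and the function name `lookup` are invented for the example, and details of the real `$lookup` stage (such as how missing fields are treated) are ignored.

```javascript
// Hypothetical in-memory stand-in for a left outer join: for every document
// in `left`, attach an array of the `right` documents whose `foreignField`
// equals the left document's `localField`.
function lookup(left, right, { localField, foreignField, as }) {
  return left.map((doc) => ({
    ...doc,
    [as]: right.filter((r) => r[foreignField] === doc[localField]),
  }));
}

// Invented sample data.
const orders = [
  { _id: 1, sku: "abc", qty: 2 },
  { _id: 2, sku: "xyz", qty: 1 },
];
const inventory = [{ sku: "abc", in_stock: 120 }];

const joined = lookup(orders, inventory, {
  localField: "sku",
  foreignField: "sku",
  as: "stock",
});
console.log(JSON.stringify(joined));
```

Order 2 has no matching inventory document but still appears in the result with an empty `stock` array; an inner join would have dropped it.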

Aggregation Framework – Pipelined Analysis
Start with the original collection; each record (document) contains a number of shapes (keys), each with a particular color (value)

• $match filters out documents that don’t contain a red diamond

• $project adds a new “square” attribute with a value computed from the value (color) of the snowflake and triangle attributes

• $lookup performs a left outer join with another collection, with the star being the comparison key

• Finally, the $group stage groups the data by the color of the square and produces statistics for each group
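The four stages described above could be written as a pipeline like the following sketch. The collections (`sales`, `stores`) and all field names are hypothetical; only the stage operators and the `$lookup` option names (`from`, `localField`, `foreignField`, `as`) come from MongoDB. The script defines the pipeline as data and prints the stage order. Running it against a server would be done with `db.sales.aggregate(pipeline)` in the mongo shell.

```javascript
// Hypothetical collections: "sales" documents with status, price, qty and
// storeId fields, and a "stores" collection keyed by _id.
const pipeline = [
  // $match: keep only the documents of interest
  { $match: { status: "completed" } },
  // $project: compute a new attribute from two existing ones
  { $project: { storeId: 1, total: { $multiply: ["$price", "$qty"] } } },
  // $lookup: left outer join with another collection
  { $lookup: { from: "stores", localField: "storeId", foreignField: "_id", as: "store" } },
  // $group: produce statistics for each group
  { $group: { _id: "$storeId", revenue: { $sum: "$total" }, orders: { $sum: 1 } } },
];

// In the mongo shell this would run as: db.sales.aggregate(pipeline)
const stageNames = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageNames.join(" -> ")); // $match -> $project -> $lookup -> $group
```

Placing `$match` first is a common choice because it reduces the number of documents the later stages have to process.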

Partner Ecosystem (500+)

MongoDB Architecture Patterns

1. Operational Data Store (ODS)
2. Enterprise Data Service
3. Datamart/Cache
4. Master Data Distribution
5. Single Operational View
6. Operationalizing Hadoop

System of Record

System of Engagement

Enterprise Data Management Pipeline
[Diagram: Streams → Stream Processing; pipeline stages Store raw data → Transform → Aggregate → Analyze; output to Users and Other Systems]

How do you choose the data management layer for each stage?

Processing Layer
?

When you want:
1. Secondary indexes
2. Sub-second latency
3. Aggregations in DB
4. Updates of data

For:
1. Scanning files
2. When indexes not needed

Wide column store (e.g. HBase)
For:
1. Primary key queries
2. If multiple indexes & slices not needed
3. Optimized for writing, not reading
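The rules of thumb on this slide can be summarized as a small decision helper. This is only a sketch: the function name `suggestStore` and the requirement flags are invented, and because the first two lists carry no label in the extracted text, the labels "MongoDB" and "HDFS files" are an assumption based on the comparison slide that follows.

```javascript
// Hypothetical decision helper encoding the slide's rules of thumb.
// ASSUMPTION: the first list on the slide refers to MongoDB and the second
// to plain files on HDFS; the extracted slide text does not label them.
function suggestStore(req) {
  if (req.secondaryIndexes || req.subSecondLatency || req.inDbAggregations || req.updates) {
    return "MongoDB";
  }
  if (req.primaryKeyQueriesOnly) {
    return "Wide column store (e.g. HBase)";
  }
  // Full scans with no index requirement.
  return "HDFS files";
}

console.log(suggestStore({ subSecondLatency: true }));
console.log(suggestStore({ primaryKeyQueriesOnly: true }));
console.log(suggestStore({}));
```

A real selection would weigh more factors, such as data volume, cost and team skills; the helper only mirrors the criteria listed on the slide.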

MongoDB Hadoop/Spark Connector
Distributed processing/analytics

• Sub-second latency
• Expressive querying
• Flexible indexing
• Aggregations in database
• Great for any subset of data

• Longer jobs
• Batch analytics
• Append only files
• Great for scanning all data or large subsets in files

- MongoDB Hadoop Connector
- Spark-mongodb

Both provide:
• Schema-on-read
• Low TCO
• Horizontal scale

Data Store for Raw Dataset
[Diagram: EDM pipeline, stages in focus: Store raw data, Transform]

Store raw data
- Typically just writing record-by-record from source data
- Usually just need high write volumes
- All 3 options handle that

Transform read requirements
- Benefits to reading multiple datasets sorted [by index], e.g. to do a merge
- Might want to look up across tables with indexes (and join functionality in MDB v3.2)
- Want high read performance while writes are happening

Interactive querying on the raw data could use indexes with MongoDB

Data Store for Transformed Dataset
[Diagram: EDM pipeline, stages in focus: Transform, Aggregate]

Often benefits to updating data as merging multiple datasets

Dashboards & reports can have sub-second latency with indexes

Aggregate read requirements
- Benefits to using indexes for grouping
- Aggregations natively in the DB would help
- With indexes, can do aggregations on slices of data
- Might want to look up across tables with indexes to aggregate

Data Store for Aggregated Dataset
[Diagram: EDM pipeline, stages in focus: Aggregate, Analyze]

Analytics read requirements
- For scanning all of data, could be in any data store
- Often want to analyze a slice of data (using indexes)
- Querying on slices is best in MongoDB

Data Store for Last Dataset
[Diagram: EDM pipeline, stages in focus: Analyze, Users]

- At the last step, there are many consuming systems and users
- Need expressive querying with secondary indexes
- MongoDB is best option for the publication or distribution of analytical results and operationalization of data

Other Systems
Often digital applications
- High scale
- Expressive querying
- JSON preferred
Often RESTful services, APIs

Complete EDM Architecture
[Diagram labels: Streams; Data processing pipeline; Stream Processing; Downstream Systems; Single CSR Application; Unified Digital Apps; Operational Reporting; Analytic Reporting; Drivers & Stacks; Customer Clustering; Churn Analysis; Predictive Analytics; Distributed Processing; Data Lake]

- Governance to choose where to load and process data
- Optimal location for providing operational response times & slices
- Can run processing on all data or slices

Example scenarios
1. Single Customer View
   a. Operational
   b. Analytics on customer segments
   c. Analytics on all customers
2. Customer profiles & clustering
3. Presenting churn analytics on high value customers

Single View of Customer
Spanish bank replaces Teradata and Microstrategy to increase business and avoid significant cost

Problem
• Took days to implement new functionality and business policies, inhibiting revenue growth
• Branches needed an app providing single view of the customer and real time recommendations for new products and services
• Multi-minute latency for accessing customer data stored in Teradata and Microstrategy

Solution
• Built single view of customer on MongoDB – flexible and scalable app easy to adapt to new business needs
• Super fast, ad hoc query capabilities (milliseconds), and real-time analytics thanks to MongoDB’s Aggregation Framework
• Can now leverage distributed infrastructure and commodity hardware for lower total cost of ownership and greater availability

Results
• Cost avoidance of $10M+
• Application developed and deployed in less than 6 months. New business policies easily deployed and executed, bringing new revenue to the company
• Current capacity allows branches to load all customer info in milliseconds, providing a great customer experience

Large Spanish Bank

Case Study
Insurance leader generates coveted single view of customers in 90 days – “The Wall”

Problem
• No single view of customer, leading to poor customer experience and churn
• 145 years of policy data, 70+ systems, 15+ apps that are not integrated
• Spent 2 years, $25M trying to build single view with Oracle – failed

Solution
• Built “The Wall” pulling in disparate data and serving single view to customer service reps in real time
• Flexible data model to aggregate disparate data into single data store
• Churn analysis done with Hadoop with relevant results output to MongoDB

Results
• Prototyped in 2 weeks
• Deployed to production in 90 days
• Decreased churn and improved ability to upsell/cross-sell

Top 15 Global Bank

Kicking Out Oracle
Global bank with 48M customers in 50 countries terminates Oracle ULA & makes MongoDB database of choice

Problem
• Slow development cycles due to RDBMS’ rigid data model hindering ability to meet business demands
• High TCO for hardware, licenses, development, and support (>$50M Oracle ULA)
• Poor overall performance of customer-facing and internal applications

Solution
• Building dozens of apps on MongoDB, both net new and migrations from Oracle (e.g., significant portion of retail banking, including customer-facing and backoffice apps, fraud detection, card activation, equity research content mgt.)
• Flexible data model to develop apps quickly and accommodate diverse data
• Ability to scale infrastructure and costs elastically

Results
• Able to cancel Oracle ULA. Evaluating what apps can be migrated to MongoDB. For new apps, MongoDB is default choice
• Apps built in weeks instead of months or years, e.g., ebanking app prototyped in 2 weeks and in production in 4 weeks
• 70% TCO reduction


Data & Analytics