Big Data, Hadoop and NoSQL: (almost) everything you have ever wanted to know
Yoann Pitarch
Agenda
1. Big Data
   1. Definition
   2. Applications
2. Hadoop / MapReduce Computing Paradigm
   1. Overview
   2. How it works
3. NoSQL databases
Introduction
J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011
Some Definitions
• A buzzword
• « Data whose size is too large and complex to be treated (harvested, stored, analysed, disseminated) by usual systems »
• « Data can be considered big when traditional analysis methods fail in dealing with it »
• « A collection of data from traditional and digital sources inside and outside a company that represents a source for ongoing discovery and analysis »
The 3, 4 or 5 “Vs”
• Volume, Velocity and Variety; possibly Veracity and/or Variability as well. Source: http://www.datasciencecentral.com/
Digital Information and usage: what has changed?

Digital Data (chart): the world's stored information grew from 3 exabytes in 1986 to 16 (1993), 54 (2000) and 295 exabytes (2007). Over the same period the digital share rose from 1% to 3%, then 25%, and finally 94%, while the analog share fell from 99% to 6%.

NOTE: Numbers may not sum due to rounding. SOURCE: Hilbert and Lopez, « The world’s technological capacity to store, communicate, and compute information », Science, 2011; J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.
Storage and Computing

(Chart): the world's general-purpose computing capacity grew from less than 0.001 in 1986 to 0.004 (1993), 0.289 (2000) and 6.379 (2007), measured in 10^12 million instructions per second.

NOTE: Numbers may not sum due to rounding. SOURCE: Hilbert and Lopez, « The world’s technological capacity to store, communicate, and compute information », Science, 2011; J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.
Data penetration by industry (chart): for each data type (video, image, audio, text/numbers), penetration is rated low, medium or high across sectors such as insurance, banking, communication and media, construction, education, government and health care.

SOURCE: McKinsey Global Institute analysis; J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.
Big Data from Internet/Web 2.0
R. Kalakota, 2012
Internet Users
World: 2,405,518,376 internet users in June 2012 (population: 7,017,846,922). France: 52,228,905 internet users in June 2012. Indexed Web: more than 8.15 billion pages (Sept. 2012).
Internet World Stats – www.internetworldstats.com/stats.htm
Internet users vs. population by region (chart): Africa, Asia, Europe, Middle East, North America, Latin America, Oceania.
Social Network Penetration
Source: http://fr.slideshare.net/stevenvanbelleghem/social-media-around-the-world-2011

                             North   West   South   East   Europe
Know at least one SN          98%    97%     99%    99%     98%
Member of at least one SN     75%    66%     77%    79%     73%
Avg number of SNs             1.8    1.5     2.2    1.9     1.9
Social Networks
• Facebook users
  – 835,525,280 (March 2012) – 50% via mobile – 25,000,000 in France (penetration rate 38%)
• Twitter users
  – 140,000,000 active users – 340,000,000 tweets per day
• Some stats (http://fr.slideshare.net/stevenvanbelleghem)
  – 400 million users daily (Facebook) – Session: Facebook 37 min; Twitter 23 min – 50% follow a brand – 36% write about a company or a brand – 19% write about the company they work for
Big Data from Science
• Astronomy
  – THéMIS solar telescope: 20 TB of raw reduced data
  – European Solar Telescope: ½ PB/day (2018)
  – CDS: database of all astronomical objects, related papers, …
• Fluid analysis
  – Particle Image Velocimetry: 1 min; 1,000 images/second; 1 TB (16-bit encoding)
  – Navier–Stokes equations: grid of 10^8 nodes; 10^5 time steps; 40 TB of data per variable
• Medicine
  – Radiology, videos of surgeries, reports
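The Navier–Stokes data volume above is consistent with storing one 4-byte value per grid node per time step; a quick sanity check (the 4-byte single-precision assumption is ours, not stated on the slide):

```python
# Sanity check for the Navier-Stokes data-volume estimate above.
nodes = 10**8        # grid nodes
steps = 10**5        # time steps
bytes_per_value = 4  # single-precision float (assumed)

total_bytes = nodes * steps * bytes_per_value
terabytes = total_bytes / 10**12  # using decimal terabytes (10^12 bytes)
print(terabytes)  # 40.0 TB per variable, matching the slide
```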
Harvested by Smart Sensors
• Car industry • Meters: water, power, … • Networks: communications (emails, telephone, …)

Submarine Cable Map
• Prism • Who?
Data Centers
• « Within four years, two-thirds of all data center traffic across the world — as well as workloads — will be cloud based » Cisco – 2012
• Global data center traffic will grow fourfold between 2011 and 2016, reaching a total of 6.6 zettabytes annually
• Global cloud traffic, the fastest-growing component of data center traffic, will grow sixfold – a 44% compound annual growth rate (CAGR) – from 683 exabytes of annual traffic in 2011 to 4.3 zettabytes by 2016

Data Centers
http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns1175/Cloud_Index_White_Paper.html
Global Data Center IP Traffic Growth (chart)

Data Centers
http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns1175/Cloud_Index_White_Paper.html
Workload Distribution: 2011-2016 (chart)
Applications
documentation.tice.ac-orleans-tours.fr

Applications
• Science monitoring & technology watch
  – Analysis of competitors – Trend detection
• Market segmentation / customer micro-segmentation
• Social network analysis
  – Opinion mining – Digital identity
• Reaction to failures
• Automatic and smart regulation
  – Water, electricity, …
• Automatic monitoring of domestic equipment

Technologies and Methods
– CS: Collect, Store, Treat, Spread
– Aggregate, Analyse, Visualize
– Multidisciplinary: CS, mathematics, economics, sociology
– Methods
  • Multidimensional analysis and visualization
  • Machine learning
  • Data mining
Methods
– Not necessarily fully new methods, but
– Volume implies new treatment principles
  • Large processing and storage capacity => cloud computing
  • High-performance and parallel computing => Apache Mahout: scalable machine learning and data mining
  • Data reduction / dimensionality reduction
  • Distributed data & treatments => NoSQL; MapReduce
Apache™ Hadoop®
Cloud Computing
4+ billion phones by 2010 [Source: Nokia]
Web 2.0-enabled PCs, TVs, etc.
Businesses, from startups to enterprises
An emerging computing paradigm where data and services reside in massively scalable data centers and can be ubiquitously accessed from any connected device over the internet.
Credit: IBM Corp. [F. Desprez, LIP ENS Lyon / INRIA]

Cloud Computing
• Two key concepts
  – Processing 1000x more data does not have to be 1000x harder
  – Cycles and bytes, not hardware, are the new commodity
• Cloud computing is
  – Providing services on virtual machines allocated on top of a large physical machine pool
  – A method to address scalability and availability concerns for large-scale applications
  – Democratized distributed computing
[F. Desprez, LIP ENS Lyon / INRIA]
Amazon Elastic Compute Cloud
A set of APIs and business models which give developer-level access to Amazon’s infrastructure and content:
• Data As A Service: Amazon E-Commerce Service, Amazon Historical Pricing
• Search As A Service: Alexa Web Information Service, Alexa Top Sites, Alexa Site Thumbnail, Alexa Web Search Platform
• Infrastructure As A Service: Amazon Simple Queue Service, Amazon Simple Storage Service, Amazon Elastic Compute Cloud
• People As A Service: Amazon Mechanical Turk
Compute => Elastic Compute Cloud; Store => Simple Storage Service; Message => Simple Queue Service
Credits: Jeff Barr, Amazon
Before Analysis: HARVESTING
World Digital Library
http://www.wdl.org/en/

Harvesting
• Technical issues
  – Different APIs to query – Different formats
• Other issues
  – Selecting sources – Defining queries – Need for homogenization

Issues and Challenges
• Data heterogeneity
  – Formats (pages vs tweets vs videos) – Reliability: quality, validity, privacy – Accessibility: open data /
• Techniques and technologies
  – Hardware (capacity, security) – Software
• Organizational
  – Various skills and competences (computer science, mathematics, economics, social science, …)
Agenda
1. Big Data
   1. Definition
   2. Applications
2. Hadoop / MapReduce Computing Paradigm
   1. Overview
   2. How it works
3. NoSQL databases
Large-Scale Data Analytics
• MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems
• Many enterprises are turning to Hadoop
  – Especially applications generating big data
  – Web applications, social networks, scientific applications

Why Is Hadoop Able to Compete?
Hadoop:
– Scalability (petabytes of data, thousands of machines)
– Flexibility in accepting all data formats (no schema)
– Commodity, inexpensive hardware
– Efficient and simple fault-tolerance mechanism
Database:
– Performance (tons of indexing, tuning and data organization techniques)
– Features: provenance tracking, annotation management, …
What is Hadoop (1)
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  – Large datasets → terabytes or petabytes of data
  – Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model; any data will fit

What is Hadoop (2)
• The Hadoop framework consists of two main layers
  – Distributed file system (HDFS)
  – Execution engine (MapReduce)

Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave, shared-nothing architecture
  – Master node (single node)
  – Many slave nodes
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
  – Large number of low-end, cheap machines working in parallel to solve a computing problem
• This is in contrast to parallel DBs
  – Small number of high-end, expensive machines

Design Principles of Hadoop
• Automatic parallelization & distribution
  – Hidden from the end-user
• Fault tolerance and automatic recovery
  – Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
  – Users only provide two functions, “map” and “reduce”
Who Uses MapReduce/Hadoop?
• Google: inventors of the MapReduce computing paradigm
• Yahoo: developing Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others + universities and research labs
Hadoop: How it Works

Hadoop Architecture
• Master node (single node); many slave nodes
• Distributed file system (HDFS)
• Execution engine (MapReduce)

Hadoop Distributed File System (HDFS)
• Centralized namenode: maintains metadata info about files
• Many datanodes (1000s): store the actual data; files are divided into blocks; each block is replicated N times (default N = 3)
• Example: a file F is split into blocks 1-5 of 64 MB each
Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
  – The namenode constantly checks the datanodes
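The splitting-and-replication scheme above can be sketched in a few lines. This is a toy model, with hypothetical datanode names and naive round-robin placement; real HDFS uses a rack-aware placement policy:

```python
# Toy sketch of HDFS-style block splitting and replication.
# Datanode names and round-robin placement are illustrative only.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size cited above
REPLICATION = 3                # default replication factor

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_id: [datanodes holding a replica]} for one file."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(n_blocks):
        # Choose `replication` distinct nodes, round-robin over the cluster.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
layout = place_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file -> 5 blocks
print(len(layout))   # 5 blocks
print(layout[0])     # ['dn1', 'dn2', 'dn3']
```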
Map-Reduce Execution Engine (Example: Color Count)
• Input blocks on HDFS are read by Map tasks, which parse each record and produce (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting based on k groups the pairs by key
• Reduce tasks consume (k, [v]) pairs, e.g. (color, [1, 1, 1, 1, 1, 1, …]), and produce (k’, v’) pairs, e.g. (color, 100)
• Users only provide the “Map” and “Reduce” functions
Properties of MapReduce Engine
• The Job Tracker is the master node (runs with the namenode)
  – Receives the user’s job
  – Decides how many tasks will run (number of mappers)
  – Decides where to run each mapper (concept of locality)
• Example: a file with 5 blocks → run 5 map tasks
  – Where to run the task reading block “1”? Try to run it on Node 1 or Node 3

Properties of MapReduce Engine (Cont’d)
• The Task Tracker is the slave daemon (runs on each datanode)
  – Receives tasks from the Job Tracker
  – Runs each task until completion (either map or reduce task)
  – Always in communication with the Job Tracker, reporting progress
• In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks
Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• They just need to obey the key-value pair interface
• Mappers:
  – Consume <key, value> pairs
  – Produce <key, value> pairs
• Reducers:
  – Consume <key, <list of values>>
  – Produce <key, value>
• Shuffling and Sorting:
  – Hidden phase between mappers and reducers
  – Groups all similar keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
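The shuffle-and-sort phase described above can be sketched in a few lines: a minimal in-memory stand-in for what the framework does across the network:

```python
# Minimal in-memory sketch of the shuffle & sort phase:
# group all (key, value) pairs emitted by the mappers by key,
# sorted, so each reducer receives <key, [values]>.
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_outputs):
    """mapper_outputs: list of (key, value) pairs from all mappers."""
    pairs = sorted(mapper_outputs, key=itemgetter(0))
    return [(k, [v for _, v in group]) for k, group in groupby(pairs, key=itemgetter(0))]

emitted = [("blue", 1), ("green", 1), ("blue", 1), ("red", 1), ("blue", 1)]
print(shuffle_and_sort(emitted))
# [('blue', [1, 1, 1]), ('green', [1]), ('red', [1])]
```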
MapReduce Phases
• Deciding what will be the key and what will be the value is the developer’s responsibility
Example 1: Word Count
• Job: count the occurrences of each word in a data set
• Map tasks emit (word, 1) pairs; reduce tasks sum the counts per word
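A self-contained sketch of the word-count job: the two user-provided functions plus a tiny local stand-in for the framework's map/shuffle/reduce machinery:

```python
# Word count expressed as the two user-provided MapReduce functions,
# driven by a small local simulation of the framework.
from collections import defaultdict

def mapper(line):
    """Consume one input record, produce (word, 1) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Consume (word, [1, 1, ...]), produce (word, total)."""
    return (word, sum(counts))

def run_job(lines):
    # Map phase, with the shuffle folded in: group values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase; keys visited in sorted order, as after shuffle & sort.
    return [reducer(k, groups[k]) for k in sorted(groups)]

data = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_job(data))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```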
Example 2: Color Count
• Job: count the number of occurrences of each color in a data set
• Input blocks on HDFS → Map tasks produce (k, v) pairs, e.g. (color, 1) → shuffle & sorting based on k → Reduce tasks consume (k, [v]), e.g. (color, [1, 1, 1, 1, 1, 1, …]), and produce (k’, v’), e.g. (color, 100)
• The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines (one per reduce task)
Example 3: Color Filter
• Job: select only the blue and the green colors
• Input blocks on HDFS → Map tasks produce (k, v) pairs, e.g. (color, 1), and write them directly to HDFS
• Each map task selects only the blue or green colors
• No need for a reduce phase
• The output file has 4 parts (Part0001-Part0004), probably on 4 different machines (one per map task)
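The map-only filter job above can be sketched the same way: with no reduce phase, each map task writes its own output part (the part names here mirror the slide and are illustrative):

```python
# Map-only job: filter records, one output part per map task.
# No shuffle and no reduce phase are needed.

WANTED = {"blue", "green"}

def filter_mapper(block):
    """Keep only the wanted colors from one input block."""
    return [color for color in block if color in WANTED]

# Four input blocks -> four map tasks -> four output parts.
blocks = [
    ["red", "blue", "green"],
    ["green", "yellow"],
    ["blue", "blue", "red"],
    ["yellow"],
]
parts = {f"Part{i:04d}": filter_mapper(block) for i, block in enumerate(blocks, start=1)}
print(parts)
# {'Part0001': ['blue', 'green'], 'Part0002': ['green'],
#  'Part0003': ['blue', 'blue'], 'Part0004': []}
```

Note that a part can be empty (Part0004 here): each map task writes whatever survives its filter, even nothing.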
Bigger Picture: Hadoop vs. Other Systems

                     Distributed Databases                      Hadoop
Computing model      Notion of transactions; the transaction    Notion of jobs; the job is the unit
                     is the unit of work; ACID properties,      of work; no concurrency control
                     concurrency control
Data model           Structured data with known schema;         Any data will fit in any format,
                     read/write mode                            (un)(semi)structured; read-only mode
Cost model           Expensive servers                          Cheap commodity machines
Fault tolerance      Failures are rare; recovery mechanisms     Failures are common over thousands of
                                                                machines; simple yet efficient fault
                                                                tolerance
Key characteristics  Efficiency, optimizations, fine-tuning     Scalability, flexibility, fault tolerance
Agenda
1. Big Data
   1. Definition
   2. Applications
2. Hadoop / MapReduce Computing Paradigm
   1. Overview
   2. How it works
3. NoSQL databases
Introduction
• Relaxing ACID properties
• NoSQL databases
  – Categories of NoSQL databases
  – Typical NoSQL API
  – Representatives of NoSQL databases
• Conclusions
Relaxing ACID Properties
• Atomicity, Consistency, Isolation, Durability
• Cloud computing: ACID is hard to achieve; moreover, it is not always required, e.g. for blogs, status updates, product listings, etc.
• Availability
  – Traditionally thought of as the server/process being available 99.999% of the time
  – For a large-scale node system, there is a high probability that a node is either down or that there is a network partitioning
• Partition tolerance
  – Ensures that write and read operations are redirected to available replicas when segments of the network become disconnected
Eventual Consistency
• Eventual consistency
  – When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
  – For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
• BASE (Basically Available, Soft state, Eventual consistency) properties, as opposed to ACID
  – Soft state: copies of a data item may be inconsistent
  – Eventually consistent: copies become consistent at some later time if there are no more updates to that data item
  – Basically available: possibility of faults, but not a fault of the whole system
CAP Theorem
• Suppose three properties of a system
  – Consistency (all copies have the same value)
  – Availability (system can run even if parts have failed)
  – Partition tolerance (network can break into two or more parts, each with active systems that cannot influence the other parts)
• Brewer’s CAP “Theorem”: for any system sharing data, it is impossible to guarantee all three of these properties simultaneously
• Very large systems will partition at some point
  – It is necessary to decide between C and A
  – Traditional DBMSs prefer C over A and P
  – Most Web applications choose A (except in specific applications such as order processing)

CAP Theorem
• Drop A or C of ACID
  – Relaxing C makes replication easy and facilitates fault tolerance
  – Relaxing A reduces (or eliminates) the need for distributed concurrency control
NoSQL Databases
• The name stands for Not Only SQL
• Common features:
  – Non-relational
  – Usually do not require a fixed table schema
  – Horizontally scalable
  – Mostly open source
• More characteristics
  – Relax one or more of the ACID properties (see CAP theorem)
  – Replication support
  – Easy API (if SQL, then only a very restricted variant of it)
• Do not fully support relational features
  – No join operations (except within partitions)
  – No referential integrity constraints across partitions
Categories of NoSQL Databases
• Key-value stores
• Column NoSQL databases
• Document-based
• XML databases (myXMLDB, Tamino, Sedna)
• Graph databases (neo4j, InfoGrid)
Key-Value Data Stores
• Example: SimpleDB
  – Based on Amazon’s Simple Storage Service (S3)
  – Items (representing objects) have one or more (name, value) pairs, where name denotes an attribute
  – An attribute can have multiple values
  – Items are combined into domains
Column-Oriented*
• Store data in column order
• Allow key-value pairs to be stored (and retrieved by key) in a massively parallel system
  – Data model: families of attributes defined in a schema; new attributes can be added
  – Storing principle: big hashed distributed tables
  – Properties: partitioning (horizontal and/or vertical), high availability, etc., completely transparent to the application
* Better: extensible records
Column-Oriented
• Example: BigTable
  – Indexed by row key, column key and timestamp, i.e. (row: string, column: string, time: int64) → string
  – Rows are ordered in lexicographic order by row key
  – The row range for a table is dynamically partitioned; each row range is called a tablet
  – Columns: syntax is family:qualifier
  – Example row “mff.ksi.www”: column “contents:” holds <html> versions at t3, t5 and t6; column “anchor:cnnsi.com” holds “MFF” at t9; column “anchor:my.look.ca” holds “MFF.cz” at t8 (“anchor” is a column family)

A table representation of a row in BigTable

Row key       Time stamp   Column name   Column family “Grandchildren”
http://ksi....  t1          "Jack"        "Claire": 7
                t2          "Jack"        "Claire": 7, "Barbara": 6
                t3          "Jack"        "Claire": 7, "Barbara": 6, "Magda": 3
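The versioned-cell model described above can be captured by a toy in-memory class. This is a minimal sketch of the (row, column, timestamp) → value idea, not the real BigTable API:

```python
# Toy model of BigTable's data model: (row, column, timestamp) -> value,
# with reads returning the most recent version by default.
from collections import defaultdict

class ToyBigTable:
    def __init__(self):
        # row -> column ("family:qualifier") -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        cell = self._rows[row][column]
        cell.append((timestamp, value))
        cell.sort()  # keep versions ordered by timestamp

    def get(self, row, column, timestamp=None):
        """Return the latest value at or before `timestamp` (latest overall if None)."""
        versions = self._rows[row][column]
        if timestamp is not None:
            versions = [(t, v) for t, v in versions if t <= timestamp]
        return versions[-1][1] if versions else None

t = ToyBigTable()
t.put("mff.ksi.www", "anchor:cnnsi.com", 9, "MFF")
t.put("mff.ksi.www", "contents:", 3, "<html>v1")
t.put("mff.ksi.www", "contents:", 6, "<html>v3")
print(t.get("mff.ksi.www", "contents:"))     # '<html>v3'
print(t.get("mff.ksi.www", "contents:", 4))  # '<html>v1'
```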
Column-Oriented
• Example: Cassandra
  – Keyspace: usually the name of the application, e.g. 'Twitter', 'Wordpress'
  – Column family: structure containing an unlimited number of rows
  – Column: a tuple with name, value and timestamp
  – Key: name of a record
  – Super column: contains more columns
Document-Based
• Based on the JSON format: a data model which supports lists, maps, dates and Booleans, with nesting
• Really: indexed semistructured documents
• Example: MongoDB
  – { Name: "Jaroslav",
      Address: "Malostranske nám. 25, 118 00 Praha 1",
      Grandchildren: { Claire: 7, Barbara: 6, Magda: 3, Kirsten: 1, Otis: 3, Richard: 1 } }
Typical NoSQL API
• Basic API access:
  – get(key): extract the value given a key
  – put(key, value): create or update the value given its key
  – delete(key): remove the key and its associated value
  – execute(key, operation, parameters): invoke an operation on the value (given its key), which is a special data structure (e.g. list, set, map, etc.)
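The four calls above can be modelled by a minimal in-memory store. The operation names mirror the slide; nothing here is a real product's client library:

```python
# Minimal in-memory key-value store exposing the four calls above.
class ToyKVStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        """Extract the value given a key (None if absent)."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or update the value given its key."""
        self._data[key] = value

    def delete(self, key):
        """Remove the key and its associated value."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        """Invoke an operation on a structured value (e.g. a list, set or map)."""
        return getattr(self._data[key], operation)(*parameters)

store = ToyKVStore()
store.put("followers:alice", [])
store.execute("followers:alice", "append", "bob")  # list operation on the value
print(store.get("followers:alice"))                # ['bob']
store.delete("followers:alice")
print(store.get("followers:alice"))                # None
```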
Representatives of NoSQL Databases: Key-Value

Name       Producer              Data model                                    Querying
SimpleDB   Amazon                Set of (key, {attribute}) couples, where an   Restricted SQL: select, delete,
                                 attribute is a (name, value) couple           GetAttributes and PutAttributes
                                                                               operations
Redis      Salvatore Sanfilippo  Set of (key, value) couples, where value is   Primitive operations for each
                                 a simple typed value, a list, an ordered      value type
                                 (by ranking) or unordered set, or a hash
Dynamo     Amazon                Like SimpleDB                                 Simple get operation and put in
                                                                               a context
Voldemort  LinkedIn              Like SimpleDB                                 Similar to Dynamo
Representatives of NoSQL Databases: Column-Oriented

Name        Producer                      Data model                              Querying
BigTable    Google                        Set of (key, {value}) couples           Selection (by combination of row,
                                                                                  column and timestamp ranges)
HBase       Apache                        Groups of columns (a BigTable clone)    JRuby IRB-based shell (similar to SQL)
Hypertable  Hypertable                    Like BigTable                           HQL (Hypertext Query Language)
Cassandra   Apache (originally Facebook)  Columns, groups of columns              Simple selections on key, range
                                          corresponding to a key (supercolumns)   queries, column or column ranges
PNUTS       Yahoo                         (Hashed or ordered) tables, typed       Selection and projection from a single
                                          arrays, flexible schema                 table (retrieve an arbitrary single
                                                                                  record by primary key, range queries,
                                                                                  complex predicates, ordering, top-k)
Representatives of NoSQL Databases: Document-Based

Name       Producer    Data model                                  Querying
MongoDB    10gen       Object-structured documents stored in       Manipulations with objects in
                       collections; each object has a primary      collections (find objects via simple
                       key called ObjectId                         selections and logical expressions,
                                                                   delete, update)
Couchbase  Couchbase¹  Document as a list of named (structured)    By key and key range, views via
                       items (a JSON document)                     JavaScript and MapReduce
¹ After merging Membase and CouchOne

DATAKON 2011, J. Pokorný
Conclusions of the NoSQL Section
• NoSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
• Problems with cloud computing:
  – SaaS applications require enterprise-level functionality, including ACID transactions, security, and other features associated with commercial RDBMS technology, i.e. NoSQL should not be the only option in the cloud
  – Hybrid solutions:
    • Voldemort with MySQL as one of its storage backends
    • Deal with NoSQL data as semistructured data ⇒ integrating RDBMS and NoSQL via SQL/XML
Conclusions of the NoSQL Section
• The next generation of highly scalable and elastic RDBMSs: NewSQL databases (from April 2011)
  – They are designed to scale out horizontally on shared-nothing machines
  – They still provide ACID guarantees
  – Applications interact with the database primarily using SQL
  – The system employs a lock-free concurrency control scheme to avoid user shutdown
  – The system provides higher performance than available from traditional systems
• Examples: MySQL Cluster (most mature solution), VoltDB, Clustrix, ScalArc, …
Conclusions of the NoSQL Section
• New buzzword: SPRAIN – 6 key factors for alternative data management:
  – Scalability – Performance – Relaxed consistency – Agility – Intricacy – Necessity
• Conclusion in conclusions: these are exciting times for Data Management research and development
References
• J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.
• S. Déjean, J. Mothe, Visual clustering for data analysis and graphical user interfaces, Handbook of Cluster Analysis, 2012.
• D. Oelke et al., Visual opinion analysis of customer feedback data, IEEE Symposium on Visual Analytics in Science and Technology, 2009.
• D. Oelke et al., Visual readability analysis: how to make your writings easier to read, IEEE Trans. on Visualization and Computer Graphics, 18(5):662-674, 2012.
• A. Ynnerman, Visualizing the medical data explosion, TED, 2011.
• N. Kejzar, S. Korenjak-Cerne, and V. Batagelj, Network analysis of works on clustering and classification from Web of Science, in Proceedings of IFCS’09, pages 525-536, Springer, 2010.
• S. Déjean, P. Martin, A. Baccini, and P. Besse, Clustering time series gene expression data using smoothing spline derivatives, EURASIP Journal on Bioinformatics and Systems Biology, 2007, article ID 70561.
• B. Dousset, Tetralogie: interactivity for competitive intelligence, 2012. http://atlas.irit.fr/PIE/Outils/Tetralogie.html
• A. Baccini, S. Déjean, L. Lafage, and J. Mothe, How many performance measures to evaluate information retrieval systems?, Knowledge and Information Systems, pages 693-713, 2011.