
Page 1:

Nadia Rauch & Mariano di Claudio

Taxonomy and Review of Big Data Solutions

Slides from the course: Sistemi Collaborativi e di Protezione (Prof. Paolo Nesi), Corso di Laurea Magistrale in Ingegneria, 2012/2013

Page 2:

Who I am

• Nadia Rauch, Eng.
• PhD Student and Researcher at DISIT Lab
• Email: [email protected]

Page 3:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 4:

Index

6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
   • Data Management
   • Data Access and Visual Rendering
   • Mining/Ingestion
   • Data Analysis
   • Architecture

Page 5:

The Big Data Problem.

Part 1 – Nadia Rauch

Page 6:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 7:

What is Big Data?

4V’s of Big Data

[Diagram: the 4 V's of Big Data — Volume, Velocity, Variety, Variability]

Page 8:

4V’s of Big Data

• Volume: companies are amassing terabytes/petabytes of information, and they always look for faster, more efficient and lower-cost solutions for data management.

• Velocity: how quickly the data comes at you. For time-sensitive processes, big data must be processed as streams of data in order to maximize its value.

• Variety: Big data is any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. 

• Variability: Refers to variance in meaning and in lexicon, but also to the variability in data structure.

Page 9:

Major problems of Big Data

How is it possible to discover their "value"?
• Using complex modeling processes: hypotheses are formulated, then statistical, visual and semantic models are implemented to validate them;
• Applying several, quite different techniques and analyses to the collected data.

Problems:
• Identifying a single architecture adaptable to all possible areas;
• The large number of application areas, very different from each other;
• The different channels through which data are collected daily.

Page 10:

Brewer's Theorem (CAP Principle)

• Consistency: if the system is in a consistent state after an operation, all clients see the same data.

• Availability: every request received by a non-failing node in the system must result in a response.

• Partition tolerance: the system continues to function even when split into disconnected subsets (by a network disruption).

You can satisfy at most two of these three requirements!

Page 11:

Possible Combinations with C, A and P

• CA: the database can provide distributed transactional semantics only in the absence of a network partition separating server nodes;

• CP: in the presence of partitions, transactions to an ACID database may be blocked until the partition heals, in order to avoid the risk of inconsistency;

• AP: the system is still available under partitioning, but some of the data returned may be inaccurate.

Page 12:

Big Data Analysis Pipeline 

Page 13:

Pipeline: Data Acquisition and Recording

• Huge amounts of data can be filtered and compressed by orders of magnitude.
– Challenge: define these filters in such a way that they do not discard useful information.

• Details regarding experimental conditions and procedures may be required to interpret the results correctly.
– Challenge: automatically generate the right metadata.

• It must be possible to search both metadata and data systems.
– Challenge: create optimized data structures that allow searching in acceptable times.

Page 14:

Pipeline: Information Extraction and Cleaning

• The information collected will not be in a format ready for analysis (e.g., a surveillance photo vs. a picture of the stars).
– Challenge: realize an information extraction process that pulls out the required information and expresses it in a form suitable for analysis.

• Big Data are often incomplete, and errors may have been committed during the Data Acquisition phase.
– Challenge: define constraints and error models for the many emerging Big Data domains.

Page 15:

Pipeline: Data Integration, Aggregation, Representation

• Data is heterogeneous; it is not enough to throw it into a repository.
– Challenge: create a data record structure suited to the differences in experimental details.

• There are many ways to store the same information, and some designs have advantages over others for certain purposes.
– Challenge: create tools to assist in the database design process and develop supporting techniques.

Page 16:

Pipeline: Query Processing, Data Modeling, and Analysis

• Methods for querying and mining Big Data are different from traditional statistical analysis.
– Challenge: scale complex query processing techniques to terabytes while enabling interactive response times.

• Interconnected Big Data forms large heterogeneous networks, in which information redundancy can be exploited to compensate for missing data, to cross-check conflicting cases, and to uncover hidden relationships and models.
– Challenge: coordinate database systems that provide SQL querying with analytics packages that perform various forms of non-SQL processing (data mining, statistical analyses).

Page 17:


Big Data Application Fields

Part 2 – Mariano Di Claudio

Page 18:

Application Fields

The term "Big Data" problem refers to the combination of a large volume of data to be treated in a short time.

There are many areas where big data are currently used with remarkable results and excellent future prospects to fully deal with the main challenges like data analysis, modeling, organization and retrieval.

[Diagram: Big Data characterized by the four V's — Volume, Velocity, Variety, Variability]

Page 19:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 20:

Healthcare and Medicine

• In the Healthcare/Medical field, a large amount of information is collected about:
– Electronic Patient Record (EPR)
– Symptomatology
– Diagnoses
– Therapies
– Responses to treatments

• In just 12 days, approximately 5000 patients entered the emergency department.

• In Medical Research, two main applications are:
– Genomic sequence collections (a single sequencing experiment yields 100 million short sequences)
– Analysis of neuroimaging data (intermediate data stored ~ 1.8 PetaBytes)

[V's involved: Volume, Velocity, Variability]

Page 21:

There is a need, in the medical field, to use run-time data to support the analysis of existing processes.

Healthcare and Medicine

• Healthcare processes are characterized by the fact that several organizational units can be involved in the treatment process of patients and that these organizational units often have their own specific IT applications.

• At the same time – especially in the case of public hospitals – there is growing pressure from governmental bodies to refactor clinical processes in order to improve efficiency and reduce costs.

Page 22:

Healthcare and Medicine

• Data mining techniques might be implemented to derive knowledge from this data, in order either to identify new interesting patterns in infection control data or to examine reporting practices.

• Process Mining is accomplished through techniques of analysis and evaluation of events stored in log files. In hospitals with the Electronic Patient Record (EPR), techniques have been investigated for fast access and extraction of information from event logs, to produce easily interpretable models, using partitioning, clustering and preprocessing techniques.

• By building a predictive model, it could be possible either to provide decision support for specific triage and diagnosis or to produce effective plans for chronic disease management, enhancing the quality of healthcare and lowering its cost.

Page 23:

Healthcare and Medicine

• At the base of genomics there are techniques of gene cloning and DNA sequencing, with the objective of knowing the entire genome of organisms.

• Knowledge of the entire genome makes it easier to identify the genes involved and to observe how they interact, particularly in the case of complex diseases such as tumors.

• Huge amounts of data, genetic knowledge, clinical practice and appropriate DBs make it possible to perform predictive studies on the incidence of certain diseases.

http://www.atgc‐montpellier.fr/gkarrays

A new opportunity is the use of GkArrays (a combination of three arrays), an intelligent solution for indexing huge collections of short reads.

Page 24:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 25:

Data Analysis – Scientific Research

• There are several areas of science and research in which the aim of Big Data analysis is to extract meaning from data and determine what actions to take.
– Astronomy (automated sky surveys)
• Today, 200 GB of new high-resolution optical data are being captured each evening by charge-coupled devices (CCDs) attached to telescopes.
– Biology (sequencing and encoding genes)
– Sociology (web log analysis of behavioral data)
• With up to 15 million people worldwide accessing the internet each day (nearly 10 hours per week online).
– Neuroscience (genetic and neuro-imaging data analysis)

• Scientific research is highly collaborative and involves scientists from different disciplines and from different countries.

[V's involved: Volume, Variety, Variability]

Page 26:

Data Analysis – Scientific Research

• To cope with the large amount of experimental data produced by modern disciplines, the University of Montpellier started the ZENITH project.

• Zenith adopts a hybrid p2p/cloud architecture.

http://www.sop.inria.fr/teams/zenith


The idea of Zenith is to exploit p2p because of the collaborative nature of scientific data, avoiding centralized control, and to use the potential of the computing, storage and network resources of the Cloud model to manage and analyze this large amount of data.

Page 27:

Data Analysis – Scientific Research

• Europeana is a platform for the management and sharing of multimedia content.

• Millions of content items are indexed and retrieved in real time.

• Each of them was initially modeled with a simple metadata model (ESE).

• A new, more complex model with a set of semantic relationships (EDM) is going to be adopted shortly.

www.europeana.eu

Page 28:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 29:

A new approach to teaching can be defined by exploiting big data management.

Educational

• Big Data has the potential to revolutionize not just research, but also education.

• The main educational data are:
– students' performance (Project KDD Cup 2010 *)
– mechanics of learning
– answers to different pedagogical strategies

[V's involved: Volume, Variability]

* https://pslcdatashop.web.cmu.edu/KDDCup/


• These data can be used to define models to understand what students actually know, their progress, and how to enrich this knowledge.

Page 30:

Educational

• The proposed new models of teaching exploit the potential of computer science and technology, combined with techniques for data analysis, with the aim of clarifying issues taking into account pedagogical, psychological and learning mechanisms, in order to define personalized instruction that meets the different needs of individual students or groups.

• Another sector of interest in this field is the e-learning domain, where two main kinds of users are defined: the learners and the learning providers.
– All personal details of learners and the online learning providers' information are stored in specific DBs, so applying data mining to e-learning makes it possible to realize teaching programs targeted to particular interests and needs, through efficient decision making.

Page 31:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 32:

Energy and Transportation

• The movement of people, public transport and private vehicles within and between metropolitan cities is critical to economic strength and quality of life, and involves a great variety of data:
– Train and bus schedules
• About 33,000 tourists visit Rome every day
• About 1,600,000 workers move to Rome each day
– GPS information
– Traffic interruptions
– Weather

• A data-centric approach can also help to enhance the efficiency and the dependability of a transportation system.

• The optimization of multimodal transportation infrastructures and their intelligent use can improve traveller experience and operational efficiencies.

[V's involved: Volume, Variety, Velocity, Variability]

Page 33:

Energy and Transportation

• Through the analysis and visualization of detailed road network data and the use of a predictive model, it is possible to achieve an intelligent transportation environment.

• Furthermore, by merging high-fidelity geographical stored data with real-time data from scattered sensor networks, an efficient urban planning system can be built that mixes public and private transportation, offering people more flexible solutions (Smart Mobility).
– Amsterdam – Copenhagen – Berlin – Venice – Florence – Bologna …
– http://www.smartcityexhibition.it/

• This new way of travelling has interesting implications for energy and environment.

Page 34:

Energy and Transportation

• In the field of energy resource optimization and environmental monitoring, data related to the consumption of electricity are very important.

• The analysis of a set of load profiles and geo-referenced information with appropriate data mining techniques, and the construction of predictive models from that data, could define intelligent distribution strategies in order to lower costs and improve the quality of life.

Page 35:

Energy and Transportation

• Researchers have shown that by instrumenting a home with just three sensors – for electricity, power, and water – it is possible to determine the resource usage of individual appliances.

• There is an opportunity to transform homes and larger residential areas into rich sensor networks, and to integrate information about personal usage patterns with information about energy availability and energy consumption, in support of evidence-based power management.

Page 36:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 37:

Social Network – Internet Service – Web Data

• The volume of data generated by internet services, websites, mobile applications and social networks is great, but the speed of production is variable, due to the human factor.
– Facebook uploads 3 billion photos per month, approximately 3600 TB/year.
– 50 million tweets are exchanged daily on average.
– Instagram has reached 7.3 million unique users per day in the USA.

• From these large amounts of data collected through social networks, researchers try to predict collective behavior; for example, through the trend of Twitter hash tags they are able to identify models of influence.

• In a broader sense, from all this information it is possible to extract knowledge and data relationships, improving the activity of query answering.

[V's involved: Volume, Variety, Velocity, Variability]

Page 38:

Social Network – Internet Service – Web Data

• Some researchers have proposed an alternative use of these data to create a novel form of urban living, in an initiative called ConnectiCity. Key aspects are:

• To create a set of tools to capture in real time various forms of city-relevant user-generated content from a variety of sources:
– social networks – websites – mobile applications…
• To interrelate it to the territory using Geo-referencing, Geo-parsing and Geo-coding techniques, and to analyze and classify it using Natural Language Processing techniques to identify:
– users' emotional approaches
– topics of interest
– networks of attention and of influence, and trending issues.

Page 39:

Social Network – Internet Service – Web Data

• Some researchers have proposed an alternative use of these data to create a novel form of urban living, in an initiative called ConnectiCity. Key aspects are:

• To make this information available and accessible both centrally and peripherally, to enable the creation of novel forms of decision-making processes, as well as to experiment with innovative models for participated, peer-to-peer, user-generated initiatives.

Project link – http://www.connecticity.net/
http://www.opendata.comunefi.it

Page 40:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 41:

Financial and Business

• The task of finding patterns in business data is not new. Traditionally, business analysts use statistical techniques.

• Nowadays, the widespread use of PCs and networking technologies has created large electronic repositories that store numerous business transactions.
– the magnitude of data is ~ 50–200 PB per day
– the Internet audience in Europe is about 381.5 million unique visitors
– 40% of European citizens shop online

• These data can be analyzed in order to:
– make predictions about the behavior of users
– identify buying patterns of individual/group customers
– provide new custom services,
using data warehousing technology and mature machine learning techniques.

[V's involved: Volume, Velocity, Variability]

Page 42:

Financial and Business

• In the financial field, investment and business plans may be created thanks to predictive models, derived using techniques of reasoning and used to discover meaningful and interesting patterns in business data.

1. Selection of data for analysis (from a network of DBs).
2. A cleaning operation removes discrepancies and inconsistencies.
3. The data set is analyzed to identify patterns (models that show the relationships among data).
4. It should be possible to translate the model into an actionable business plan that helps the organization achieve its goals.
5. A model/pattern that satisfies these conditions becomes business knowledge.

Page 43:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 44:

Security

• Intelligence, Surveillance, and Reconnaissance (ISR) define topics that are well suited for data-centric computational analyses:
– close to one zettabyte (10^21 bytes, or one billion terabytes) of digital data are generated each year.

• Important data sources for intelligence gathering include:
– Satellite and UAV images
– Intercepted communications: civilian and military, including voice, email, documents, transaction logs and other electronic data – 5 billion mobile phones in use worldwide.

[V's involved: Volume, Variety, Velocity, Variability]

Page 45:

Security

– Radar tracking data.
– Public domain sources (web sites, blogs, tweets and other Internet data, print media, television, and radio).
– Sensor data (meteorological, oceanographic, security camera feeds).
– Biometric data (facial images, DNA, iris, fingerprint, gait recordings).
– Structured and semi-structured information supplied by companies and organizations:
• airline flight logs, credit card and bank transactions, phone call records, employee personnel records, electronic health records, police and investigative records.

Page 46:

Security

• The challenge for intelligence services is to find, combine, and detect patterns and trends in the traces of important information.

• The challenge becomes one of finding meaningful evolving patterns in a timely manner among diverse, potentially obfuscated information across multiple sources. These requirements greatly increase the need for very sophisticated methods to detect subtle patterns within data, without generating large numbers of false positives, so that we do not find conspiracies where none exist.

• As models of how to effectively exploit large-scale data sources, we can look to Internet companies such as Google, Yahoo, and Facebook.

Page 47:

Security

• Within the world of intelligence, we should consider computer technology and a consolidated machine learning technology as a way to augment the power of human analysts, rather than as a way to replace them.

• The key idea of machine learning is:
– First, apply structured statistical analysis to a data set to generate a predictive model.
– Then, apply this model to different data streams to support different forms of analysis.
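This two-step flow can be sketched with any off-the-shelf learning library. A minimal sketch using scikit-learn, where the features, labels and the incoming stream are invented purely for illustration:

    from sklearn.linear_model import LogisticRegression

    # Step 1: structured statistical analysis of a historical data set
    # yields a predictive model (features/labels are illustrative).
    X_train = [[0.1, 3], [0.9, 1], [0.2, 4], [0.8, 0]]
    y_train = [0, 1, 0, 1]
    model = LogisticRegression().fit(X_train, y_train)

    # Step 2: the model is applied to new data streams, flagging items
    # for a human analyst to review rather than replacing the analyst.
    for item in [[0.85, 1], [0.15, 5]]:
        print(item, "->", model.predict([item])[0])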

Page 48:

Overview of Big Data solutions

Part 3 – Nadia Rauch

Page 49:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 50:

4 Classes of Main Requirements

• Architectural aspects: when processing a very huge dataset it is important to optimize the workload, for example with a parallel architecture or by providing dynamic allocation of computation resources.

• Data Management aspects: the main solutions available are moving in the direction of handling increasing amounts of data.

• Access/Data Rendering aspects: it is important to be able to analyze Big Data with scalable display tools.

• Data Analysis|Mining/Ingestion aspects: statistical analysis, facet queries, answering queries quickly, and data quality in terms of update, coherence and consistency are important operations that an ideal architecture must support or ensure.

Page 51:


Data Management Aspects

Part 4 ‐ Nadia Rauch

Page 52:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 53:

Data Management Aspects (1/2)

• Scalability: each storage system for big data must be scalable with common and cheap hardware (increasing the number of storage discs).

• Tiered Storage: to optimize the time in which the required data is delivered.

• High Availability: a key requirement in a Big Data architecture. The design must be distributed, optionally leaning on a cloud solution.

• Support for Analytical and Content Applications: analyses can take days and involve several machines working in parallel; these data may be required in part by other applications.

• Workflow automation: full support for the creation, organization and transfer of workflows.

Page 54:

Data Management Aspects (2/2)

• Integration with existing Public and Private Cloud systems: a critical issue is transferring the entire data set; support for existing cloud environments would facilitate a possible data migration.

• Self Healing: the architecture must be able to accommodate component failures and heal itself without customer intervention, using techniques that automatically redirect to other resources the work that was carried out by the failed machine, which is automatically taken offline.

• Security: most big data installations are built upon a web services model, with few facilities for countering web threats, while it is essential that data are protected from theft and unauthorized access.

Page 55:

NoSQL Definition

• "Not Only SQL".
• Definition: name for a subset of structured storage software, designed for increased optimization for high-performance operations on large datasets.

Why NoSQL?
• ACID (Atomicity, Consistency, Isolation, Durability) doesn't scale well.
• Web apps have different needs: high availability, low-cost scalability and elasticity, low latency, flexible schemas, geographic distribution.
• Next-generation databases mostly address some of these needs by being non-relational, distributed, open-source and horizontally scalable.

Page 56:

NoSQL Characteristics

PROs:
• schema-free;
• high availability;
• scalability;
• easy replication support;
• simple API;
• eventually consistent / BASE (not ACID).

CONs:
• limited query capabilities;
• hard to move data out from one NoSQL system to another (but 50% are JSON-oriented);
• no standard way to access a NoSQL data store.

Page 57:

Eventually consistent

• The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value.

• BASE (Basically Available, Soft state, Eventual consistency) approach.
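As a minimal sketch of this guarantee (not tied to any particular product), the following Python fragment simulates three replicas converging under a last-write-wins rule; the replica structure and the gossip round are illustrative assumptions:

    import time

    # Each replica holds (value, timestamp) for a single object.
    replicas = [{"value": None, "ts": 0.0} for _ in range(3)]

    def write(replica, value):
        # A write lands on one replica first; the others stay stale
        # for a while (the inconsistency window).
        replica["value"], replica["ts"] = value, time.time()

    def anti_entropy():
        # Gossip round: every replica adopts the newest version seen.
        newest = max(replicas, key=lambda r: r["ts"])
        for r in replicas:
            r["value"], r["ts"] = newest["value"], newest["ts"]

    write(replicas[0], "v1")      # window opens
    print(replicas[1]["value"])   # a stale read (None) is possible here
    anti_entropy()                # no new updates: window closes
    assert all(r["value"] == "v1" for r in replicas)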

Page 58:

Types of NoSQL Database

• Key-Value DB
• Column-Family/Big Table DB
• Document DB
• Graph DB
• XML DB
• Object DB
• Multivalue DB

Page 59:

Key‐Value Database (1/2)

• Highly scalable database, but not suitable for large data sets.

• A solution that achieves good speed and excellent scalability.

• Different permutations of key-value pairs generate different databases.

• This type of NoSQL database is used for large lists of elements, such as stock quotes.

Page 60:

Key‐Value Database (2/2)

• Based on Amazon's Dynamo paper
• Examples: Dynomite, Voldemort, Tokyo
• Data Model: collection of key-value pairs
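To make the data model concrete, here is a minimal sketch of a key-value store with hash-based partitioning across nodes; the node list and the MD5 choice are illustrative assumptions, not details of Dynamo or of the systems above:

    import hashlib

    NODES = ["node0", "node1", "node2"]
    store = {n: {} for n in NODES}  # one opaque key-value dict per node

    def node_for(key: str) -> str:
        # Hash partitioning: each key maps to exactly one node,
        # which is what lets this model scale horizontally.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    def put(key, value):
        store[node_for(key)][key] = value

    def get(key):
        return store[node_for(key)].get(key)

    put("stock:AAPL", 523.1)   # e.g. one entry of a stock-quote list
    print(get("stock:AAPL"))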

Page 61:

Column Family/Big Table Database (1/2)

• Key-value stores can be column-family databases (grouping columns).

• The most advanced type, where keys and values can be composite.

• Very useful with time series or with data coming at high speed from multiple sources: sensors, devices and websites.

• They require good performance in read and write operations.

• They are not suitable for datasets where the relationships are as important as the data itself.

Page 62:

Column Family/Big Table Database (2/2)

• Based on Google's BigTable paper
• Data Model: Big Table, column families
• Examples: HBase, Hypertable
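The column-family model can be pictured as a nested map: row key, then column family, then column. A sketch with hypothetical family and column names, echoing the time-series/sensor use case above:

    # row_key -> column_family -> column -> value
    table = {}

    def put(row, family, column, value):
        table.setdefault(row, {}).setdefault(family, {})[column] = value

    # One row per sensor; one column per timestamp inside a family.
    put("sensor42", "readings", "2012-11-12T10:00", 21.5)
    put("sensor42", "readings", "2012-11-12T10:01", 21.7)
    put("sensor42", "meta", "location", "Firenze")

    print(table["sensor42"]["readings"])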

Page 63:

Document Database (1/2)

• Fits perfectly with OO programming.
• Behaves like a key-value store, but the content of the document is understood and taken into account.

• Useful when data are hardly representable with a relational model due to high complexity.

• Used with medical records or with data coming from social networks.

Page 64:

Document Database (2/2)

• Inspired by Lotus Notes
• Data Model: collections of key-value collections
• Examples: CouchDB, MongoDB
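A document groups all related data in one self-describing record, with no fixed schema; the patient-record fields below are invented for illustration, echoing the medical-records use case above:

    patient_record = {
        "_id": "patient:300",
        "name": "Jane Doe",
        "diagnoses": ["diabetes"],
        "therapies": [{"drug": "metformin", "dose_mg": 500}],
    }
    # Another document in the same collection may carry entirely
    # different fields; the store still indexes and queries both.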

Page 65:

Graph Database (1/2)

• They take up less space than the volume of data from which they are built, and store a lot of information on the relationships between data.

• Used when the dataset is strongly interconnected and not tabular.

• The access model is transactional, suitable for all applications that need transactions.

• Used in fields like geospatial data, bioinformatics, network analysis and recommendation engines.

• It is not easy to execute queries on this database type.

Page 66:

Graph Database (2/2)

• Inspired by graph theory
• Data Model: nodes and relationships, with key-value pairs on both
• Examples: AllegroGraph, Objectivity

Page 67:

Object Database 

• Allow the creation of very reliable storage systems.

• They directly integrate the object model of the programming language in which they were written (not much popularity).

• They cannot provide persistent access to data, due to the strong binding with specific platforms.

• Examples: Objectivity, Versant.

Page 68:

XML Database

• Large XML documents optimized to contain large quantities of semi-structured data.

• The data model is flexible.
• Queries are simple to perform.
• They are not very efficient.
• Examples: eXist, BaseX

Page 69:

Multivalue Database (1/2)

• Synonymous with the Pick operating system.
• Support the use of attributes that can be a list of values, rather than a single value.

• The data model is well suited to XML.
• Sometimes they are classified as NoSQL, but it is possible to access the data using SQL.

• They have not been standardized, but are simply classified as pre-relational, post-relational, relational and embedded.

Page 70:

Multivalue Database (2/2)

• Data Model: similar to relational
• Examples: OpenQM, Jbase.

Page 71:

CouchBase (www.couchbase.com)

• Open source NoSQL database for mission-critical systems;
• no multi-operation transaction support;
• easily scales apps to an "infinite" number of users;
• schema-less document database (flexibility, with indexing and querying similar to RDBMS capabilities);

• high performance with low latency (sub-millisecond reads and writes, no drop in performance as the app scales);

• no fixed schema (NoSQL);
• automatic replication.

Page 72:

CouchBase Features

• API: Memcached API + protocol (binary and ASCII);
• Protocol: Memcached, plus a REST interface for cluster configuration and management;

• Written in: C/C++ + Erlang (clustering);
• Replication: peer-to-peer, fully consistent;
• Misc: transparent topology changes during operation, provides memcached-compatible caching buckets, commercially supported version available.
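Since the data API is memcached-compatible, a plain memcached client can talk to a bucket. A sketch using the python-memcached package, assuming a local node exposing the standard memcached port 11211:

    import memcache  # pip install python-memcached

    # Connect to a Couchbase node as if it were a memcached server.
    mc = memcache.Client(["127.0.0.1:11211"])

    mc.set("employee:300", '{"name": "Jane", "dept": "Tech"}')
    print(mc.get("employee:300"))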

Page 73:

CouchBase

• MemCached compatible: enables applications to read and write data with sub‐millisecond latency and sustained high throughput. Users can allocate cache resources per database depending on application needs.

• Zero‐downtime maintenance: can perform any maintenance tasks on a live cluster. Not only do cluster topology changes like adding or removing servers require no application downtime, but even upgrading software can be done on a running system.

Page 74:

CouchBase

• Clone to grow with auto‐sharding: you can add (or remove) servers in your cluster with the click of a button allowing you to easily keep your resources in step with the changing needs of your application.  By simply adding more servers, you can horizontally scale your application with additional RAM as well as I/O capacity. Auto‐sharding distributes data uniformly across servers and enables direct routing of requests to the appropriate server without the need for any application changes.
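Direct routing works because any client can compute which server owns a key. Couchbase does this through vBuckets; the sketch below shows the hash-then-lookup idea with an illustrative vBucket count and cluster map:

    import zlib

    NUM_VBUCKETS = 1024
    servers = ["serverA", "serverB", "serverC"]
    # Cluster map: vbucket -> server, maintained by the cluster and
    # updated on rebalance (filled round-robin here for illustration).
    vbucket_map = {v: servers[v % len(servers)] for v in range(NUM_VBUCKETS)}

    def server_for(key: bytes) -> str:
        vbucket = zlib.crc32(key) % NUM_VBUCKETS
        return vbucket_map[vbucket]

    print(server_for(b"employee:300"))
    # Adding a server only changes the map; clients hash the same way.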

Page 75:

CouchBase Features

• Production‐ready management and monitoring: provides advanced monitoring with a rich administration web interface. Users can monitor the entire cluster using real‐time system monitoring charts that reflect statistics for the entire cluster.

• Reliable storage architecture: asynchronously persists all data to disk and enables you to store datasets larger than the physical RAM size. Couchbase automatically moves data between RAM and disk and keeps the working set in the object-level cache.

Page 76:

CouchBase Features

• Data replication with auto-failover: to ensure high availability of your application, Couchbase Server's data replication capability allows you to maintain multiple copies of your data within a cluster. Replica data is distributed across all nodes, reducing the impact of failure of individual nodes.

• Professional SDKs for a wide variety of languages: fully supported SDKs for Java, C#, PHP, C, Python and Ruby.

Page 77:

CouchBase Advantages

• Performance. The document data model keeps related data in a single physical location in memory and on disk (a document). This allows consistently low-latency access to the data – reads and writes happen with very little delay.

• Dynamic elasticity. Because the document approach keeps records “in one place” (a single document in a contiguous physical location), it is much easier to move the data from one server to another while maintaining consistency. 

Page 78:

CouchBase Advantages

• Schema flexibility. While all NoSQL databases provide schema flexibility, key-value and document-oriented databases enjoy the most. A key-value or document-oriented database requires no database maintenance to change the database schema (to add and remove "fields" or data elements from a given record).

• Query flexibility. Query expressiveness is the ability to ask the database questions, like "return me a list of all the farms in which a player purchased X last month". Document databases provide the best balance of schema flexibility without giving up the ability to do sophisticated queries.

Page 79:

CouchBase Basic Operations

• Docs are distributed evenly across the servers in the cluster.
• Each server stores both active and replica docs; only one copy is active at a time.
• The app reads, writes and updates docs. Multiple app servers can access the same document at the same time.

Page 80:

CouchBase – Add nodes

• 2 servers added to the cluster (one-click operation).
• Docs automatically rebalanced across the cluster, with minimum doc movement.
• Cluster map updated.
• App database calls now distributed over a larger number of servers.

Page 81:

CouchBase – Remove a node

• The cluster detects that a server has failed: it promotes replicas of its docs to active and updates the cluster map.
• App server requests for those docs now go to the appropriate server.
• Typically a rebalance would follow.

Page 82:

CouchBase Architecture

[Diagram: Couchbase architecture — Cluster Management and Data Management components]

Page 83:

CouchBase Query Example

Access a document directly by ID using an HTTP request:
• http://127.0.0.1:8092/default/300
• 8092 is the Couch API REST port used to access data (8091 is the port for the Admin console)
• default is the bucket in which the document is stored
• 300 is the ID of the document

Search your data with queries:
• "Give me all the employees of the Tech department"
• A View is a Map function, written in JavaScript, that generates a key-value list.
• Admin Console: Views tab -> "Create Development View"
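Both accesses can be scripted over HTTP. A sketch with the Python requests library, assuming a local node with the ports above; the dev_employees/by_dept view name is hypothetical:

    import requests

    # Fetch a document directly by its ID via the Couch API REST port.
    doc = requests.get("http://127.0.0.1:8092/default/300").json()
    print(doc)

    # Query a view whose map function emits (dept, name) pairs:
    # filtering by key returns all employees of one department.
    resp = requests.get(
        "http://127.0.0.1:8092/default/_design/dev_employees/_view/by_dept",
        params={"key": '"Tech"'},  # view keys are JSON-encoded
    )
    for row in resp.json().get("rows", []):
        print(row)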

Page 84:

CouchBase Examples 

Page 85:


Architectural Aspects

Part 5 ‐ Nadia Rauch

Page 86:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 87:

Architectural aspects (1/2)

• Cluster size: the number of nodes in each cluster affects the completion time of each job: a greater number of nodes corresponds to a lower completion time for each job.

• Input data set: increasing the size of the initial data set increases the time needed to process the data and produce the results.

• Data node: greater computational power and more memory are associated with a shorter time to completion of a job.

• Data locality: it is not always possible to ensure that the data is available locally on the node; when the data blocks to be processed must be retrieved from elsewhere, the time to completion is significantly higher.

Page 88:

Architectural aspects (2/2)

• Network: the network affects the final performance of a Big Data management system; connections between clusters are used extensively during read and write operations. A highly available and resilient network, able to provide redundancy, is needed.

• CPU: the more the processes to be carried out are CPU-intensive, the greater the influence of CPU power on the final performance of the system.

• Memory: for memory-intensive applications it is good to have an amount of memory on each server able to cover the needs of the cluster (e.g. 2-4 GB per server); if memory is insufficient, performance suffers a lot.

Page 89:

eXist (exist‐db.org)

• eXist is an open source database management system entirely built on XML technology, also called a native XML database. 

• Unlike most relational database management systems, eXist uses XQuery to manipulate its data.
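For instance, an XQuery can be sent to eXist over the HTTP/REST interface listed on the next slide. A sketch using Python requests, assuming a default local installation and a hypothetical /db/employees collection:

    import requests

    # The REST interface accepts an XQuery expression in _query.
    xquery = (
        "for $e in collection('/db/employees')//employee "
        "where $e/dept = 'Tech' return $e/name"
    )
    resp = requests.get(
        "http://localhost:8080/exist/rest/db",
        params={"_query": xquery},
    )
    print(resp.text)  # results wrapped in an XML result element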

Page 90:

eXist Main Features

• API: XQuery;
• XML: DB API, DOM, SAX;
• Protocols: HTTP/REST, WebDAV, SOAP, XML-RPC, Atom;
• Query Method: XQuery;
• Written in: Java (open source);
• Concurrency: concurrent reads, lock on write;
• Misc: entire web applications can be written in XQuery, using XSLT, XHTML, CSS, and JavaScript (for AJAX functionality). Version 1.4 adds a new full-text search index based on Apache Lucene, a lightweight URL rewriting and MVC framework, and support for XProc.

Page 91:

eXist Database Features

• Data Storage: native XML data store based on B+-trees and paged files. Document nodes are stored in a persistent DOM.

• Collections: documents are managed in hierarchical collections, similar to storing files in a file system.

• Indexing: based on a numeric indexing scheme which supports quick identification of relationships between nodes, such as parent/child, ancestor/descendant or previous-/next-sibling.

• Index Creation: automatic index creation by default. Uses structural indexes for element and attribute nodes, a full-text index for text and attribute values, and range indexes for typed values. Full-text indexing can be turned on/off for selected parts of a document.

Page 92:

eXist Database Features

• Query Engine: eXist's query engine avoids top-down and bottom-up tree traversals. Instead, it relies on fast path join algorithms to compute node relationships; path joins can be applied to the entire node sequence in one single step.

• Updates: document-level and node-level updates with XUpdate and XQuery update extensions.

• Authorization Mechanism: Unix‐like access permissions for users/groups at collection and document level.

• Security: eXtensible Access Control Markup Language (XACML) for XQuery access control.

Page 93:

eXist Database Features

• Transactions/Crash Recovery: The database supports full crash recovery based on a journal in which all transactions are logged. In case of a database crash, all committed transactions will be redone while incomplete transactions will be rolled back. 

• Deployment: eXist may be deployed as a stand‐alone database server, as an embedded Java library or as part of a web application (running in the servlet engine).

• Backup/Restore functionality is provided via Java admin client or Ant scripts. The backup contains resources as readable XML that allows full restore of a database including user/group permissions.

Page 94:

eXist Limits

• Max. Number of Documents: the maximum number of documents that can be stored in the database is 2^31.

• Max. Size per Document: theoretically, documents can be arbitrarily large. In practice, some properties like the child count of a node are limited to a 4-byte int. There are also operating system limits, e.g. the maximum size of a file in the filesystem.

Page 95:

eXist ‐ Indexing Scheme Features

• Indexing scheme: allows quick identification of structural relationships between nodes.

• Nodes do not need to be loaded into main memory to determine the relationship between them.

• Every node in the document tree is labeled with a unique identifier (node ID).

• No further information is needed to determine whether a node can be the ancestor, descendant or sibling of another.

• The original numbering scheme used in eXist was based on level-order numbering.


eXist ‐ Old Indexing Scheme

• Level-order numbering models the document tree as a complete k-ary tree: it assumes that every node in the tree has k child nodes. If a node has fewer than k children, the remaining child IDs are left unassigned before continuing with the next sibling node.

• As a result, many IDs are never assigned to actual nodes. Because the tree is treated as complete and k-ary, the scheme runs out of available IDs quite fast, limiting the maximum size of a document that can be indexed (see the arithmetic sketch below).
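To make the ID arithmetic concrete, here is a toy Java sketch of level-order numbering in a complete k-ary tree; it is an illustration of the scheme, not eXist's actual implementation.

```java
// Level-order numbering in a complete k-ary tree with 1-based IDs:
// the children of node i are k*(i-1)+2 .. k*(i-1)+k+1, so relationships
// can be decided by arithmetic alone, without touching the document.
public class LevelOrderIds {
    static long firstChild(long id, int k) { return k * (id - 1) + 2; }
    static long parent(long id, int k)     { return (id - 2) / k + 1; }

    static boolean isAncestor(long a, long d, int k) {
        // Walk up from the candidate descendant towards the root.
        while (d > a) d = parent(d, k);
        return d == a;
    }

    public static void main(String[] args) {
        int k = 3; // assume at most 3 children per node
        System.out.println(firstChild(1, k));     // 2: first child of the root
        System.out.println(parent(7, k));         // 2: (7-2)/3 + 1 = 2
        System.out.println(isAncestor(1, 13, k)); // true: root is everyone's ancestor
    }
}
```

Note how a node with fewer than k children still reserves all k child IDs; that reserved-but-unused range is exactly why the scheme exhausts the ID space quickly.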


eXist ‐ Old Indexing Scheme Modified

• Instead of assuming a fixed number of child nodes, eXist recomputes this number for every level of the tree.

• Indexing a regularly structured document of a few hundred MB was no longer an issue.

• The numbering scheme still doesn't work well if the document tree is rather imbalanced. 

• Level-order numbering has a further disadvantage: inserting, updating or removing nodes in a stored document is very expensive, since it requires at least a partial renumbering and re-indexing of the nodes in the document.


eXist ‐ New Indexing Solution

• 2006 – "dynamic level numbering" (DLN): a numbering scheme based on variable-length IDs that avoids putting a conceptual limit on the size of the document to be indexed.

• It has a number of positive side effects, including fast node updates without re-indexing.

• Dynamic level numbers (DLN) are hierarchical IDs, inspired by Dewey decimal classification.

• IDs are sequences of numeric values separated by a delimiter. The root node has ID 1; every node below it takes the ID of its parent node as prefix, followed by a level value.


eXist ‐ New Indexing Solution

• 1 is the root node, 1.1 is the first node on the second level, 1.2 the second, and so on.

• Using this scheme, determining the relation between any two given IDs is a trivial operation; the ancestor-descendant case, for example, reduces to a simple prefix comparison (see the sketch below).
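A small Java sketch of these ID comparisons, using dotted strings for readability (eXist encodes DLN IDs in a compact binary form, so this is purely illustrative):

```java
// Deciding structural relationships from DLN-style IDs alone.
public class DlnIds {
    // "1.2" is an ancestor of "1.2.3.1" iff it is a proper dotted prefix.
    static boolean isAncestor(String ancestor, String descendant) {
        return descendant.startsWith(ancestor + ".");
    }

    // Siblings share the same parent prefix.
    static boolean isSibling(String a, String b) {
        int i = a.lastIndexOf('.');
        int j = b.lastIndexOf('.');
        return i > 0 && j > 0 && a.substring(0, i).equals(b.substring(0, j));
    }

    public static void main(String[] args) {
        System.out.println(isAncestor("1.2", "1.2.3.1")); // true
        System.out.println(isAncestor("1.2", "1.20.3"));  // false: "1.20" is not "1.2"
        System.out.println(isSibling("1.2.3", "1.2.7"));  // true: same parent "1.2"
    }
}
```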


XML Data Store Organization

(Figure: the eXist data store consists of four files — dom.dbx, collections.dbx, elements.dbx, words.dbx)

Page 101: Nadia Rauch& Mariano di Claudio · 2012. 11. 12. · Pipeline: Data Integration, Aggregation, Representation • Data is heterogeneous, it is not enough throw it into a repository

XML Data Store Organization

eXist uses 4 index files (based on B+-trees):

• collections.dbx manages the collection hierarchy.

• elements.dbx indexes elements and attributes.

• dom.dbx collects DOM nodes in a paged file and associates node identifiers to the actual nodes.

• words.dbx keeps track of word occurrences and is used by the full-text search extensions.


XML Data Store Organization

• This helps to keep the number of inner B+-tree pages small and yields better performance for queries on entire collections.

• However, creating an index entry for every single document in a collection leads to decreasing performance for collections containing a large number (>5000) of rather small (<50KB) documents.


Query example

• doc() is restricted to a single document-URI argument.

• xmldb:document() accepts multiple document paths to be included in the input node set.

• xmldb:document() without an argument includes EVERY document node in the current database instance.

• doc("/db/shakespeare/plays/hamlet.xml")//SPEAKER

• xmldb:document('/db/test/abc.xml', '/db/test/def.xml')//title



Access/Data Rendering Aspects

Part 6 ‐Mariano Di Claudio


Index


6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
• Data Management
• Data Access and Visual Rendering
• Mining/Ingestion
• Data Analysis
• Architecture


Access and Data Rendering aspects

– User Access: thanks to an Access Management System, it is possible to define different types of users (normal users, administrators, etc.) and to assign each of them access rights to different parts of the Big Data stored in the system.

– Separation of duties: using a combination of authorization, authentication and encryption, the duties of different types of users can be separated.
• This separation strongly contributes to safeguarding the privacy of data, which is a fundamental requirement in areas such as health/medicine or government.


Access and Data Rendering aspects

– Efficient Access: by defining standard interfaces (especially in business and educational applications), managing concurrency issues, and supporting multi-platform and multi-device access, it is possible to preserve data availability and improve the user experience.

– Scalable Visualization: because a query can return a small or an enormous result set, scalable display tools are important, allowing a clear view in both cases (e.g. a 3D adjacency matrix for RDF).


RDF‐HDT Library

• RDF is a standard for describing resources on the web, using statements consisting of subject – predicate – object.

• An RDF model can be represented by a directed graph.

• An RDF graph is represented physically by a serialization: RDF/XML, N-Triples, Notation3.

(Figure: a Subject node linked to an Object node by a Predicate edge)


RDF‐HDT Library

Main scalability drawbacks of the Web of Data are:

– Publication
• No recommendations/methodology for publishing at large scale
• Use of Vocabulary of Interlinked Data and semantic maps

– Exchange
• The main RDF formats (RDF/XML, Notation3, Turtle...) have a document-centric view
• Universal compressors (gzip) are used to reduce their size

– Consumption (query)
• Costly post-processing due to decompression and indexing (RDF store) issues


RDF‐HDT Library

• HDT (Header, Dictionary, Triples) is a compact data representation for RDF datasets that reduces their verbosity.

• The RDF graph is represented with 3 logical components:
– Header
– Dictionary
– Triples

• This makes it an ideal format for storing and sharing RDF datasets on the Web.



RDF‐HDT Library ‐ Header

• HDT improves the value of metadata.
• The Header is itself an RDF graph containing information about:
– provenance (provider, publication dates, version),
– statistics (size, quality, vocabularies),
– physical organization (subparts, location of files),
– other types of information (intellectual property, signatures).


RDF‐HDT Library – Dictionary + Triples

• Mapping between each term used in a dataset and unique IDs.
• Long/redundant terms are replaced by their corresponding IDs.
• Graph structures can be indexed and managed as integer streams.
• Compactness and consumption performance are achieved with an advanced dictionary serialization (a toy encoding sketch follows).
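The following toy Java sketch illustrates the dictionary-encoding idea; it is a self-contained illustration, not the actual RDF-HDT library API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// HDT-style dictionary encoding in miniature: terms are mapped to integer
// IDs so that each RDF statement becomes a compact ID-triple.
public class DictionaryEncoding {
    private final Map<String, Integer> term2id = new HashMap<>();
    private final List<String> id2term = new ArrayList<>();

    int encode(String term) {
        return term2id.computeIfAbsent(term, t -> {
            id2term.add(t);
            return id2term.size(); // IDs start at 1
        });
    }

    String decode(int id) { return id2term.get(id - 1); }

    public static void main(String[] args) {
        DictionaryEncoding dict = new DictionaryEncoding();
        // One RDF statement: subject - predicate - object.
        int[] triple = {
            dict.encode("http://example.org/alice"),
            dict.encode("http://xmlns.com/foaf/0.1/knows"),
            dict.encode("http://example.org/bob")
        };
        System.out.printf("ID-triple: (%d, %d, %d)%n",
                triple[0], triple[1], triple[2]);
        System.out.println(dict.decode(triple[0])); // http://example.org/alice
    }
}
```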


RDF‐HDT Library – Triples

• The ID-triples component compactly represents the RDF graph.

• The ID-triple is the key component for accessing and querying the RDF graph.

• The Triples component can provide a succinct index for some basic queries.

This optimizes space and provides efficient performance in primitive operations.


RDF‐HDT Library ‐ Pro

• HDT for exchanging


RDF‐HDT Library

HDT‐it!



Data Analysis and Mining/Ingestion Aspects

Part 7 ‐Mariano Di Claudio


Index


6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
• Data Management
• Data Access and Visual Rendering
• Mining/Ingestion
• Data Analysis
• Architecture


Mining and Ingestion

• Mining and Ingestion are two key features in the field of big data; in fact, there is a tradeoff between:
– the speed of data ingestion,
– the ability to answer queries quickly,
– the quality of the data in terms of updates, coherence and consistency.

• This compromise impacts the design of any storage system (e.g. OLTP vs OLAP).

• For instance, some file systems are optimized for reads and others for writes, but workloads generally involve a mix of both operations.


Mining and Ingestion

• Type of indexing – to speed up data ingest, records can be written to a cache or advanced data-compression techniques can be applied; conversely, a different indexing method can improve the speed of data-retrieval operations, at the sole cost of increased storage space.

• Management of data relationships – in some contexts, data and the relationships between data are equally important. Moreover, new types of data, in the form of highly interrelated content, require managing multi-dimensional relationships in real time. A possible solution is to store relationships in dedicated data structures that ensure good access and extraction capabilities, in order to adequately support predictive analytics tools.

• Temporary data – some kinds of data analysis are themselves big: computations and analyses can create enormous amounts of temporary data that must be managed appropriately to avoid memory problems. In other cases, by collecting statistics on the information that is accessed most frequently, it is possible to build a well-defined cache system or temporary files and thus optimize the query process (see the sketch below).
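As one possible realization of that caching idea, here is a minimal Java sketch of a bounded least-recently-used query cache built on LinkedHashMap's access-order mode; the class name and capacity are illustrative assumptions, not a specific product's design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded cache that keeps the most recently accessed query results
// in memory so repeated queries avoid hitting the full data store.
public class QueryCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public QueryCache(int capacity) {
        super(16, 0.75f, true); // access-order iteration -> LRU behavior
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least-recently-used entry
    }

    public static void main(String[] args) {
        QueryCache<String, String> cache = new QueryCache<>(2);
        cache.put("q1", "result-1");
        cache.put("q2", "result-2");
        cache.get("q1");             // touch q1, so q2 becomes eldest
        cache.put("q3", "result-3"); // evicts q2
        System.out.println(cache.keySet()); // [q1, q3]
    }
}
```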


Hadoop

• Hadoop is an open source framework for processing, storing and analyzing massive amounts of distributed, unstructured data.

• It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.


Hadoop

• Hadoop was inspired by MapReduce, a programming model developed by Google in the early 2000s for indexing the Web.

• Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology.

Key Idea
Rather than banging away at one huge block of data with a single machine, Hadoop breaks up Big Data into multiple parts so that each part can be processed and analyzed at the same time.


Hadoop

Hadoop changes the economics and the dynamics of large-scale computing, defining a processing solution that is:

– Scalable – New nodes can be added as needed, without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.

– Cost effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.

– Flexible – Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.

– Fault tolerant – When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.


Hadoop main features

• Hadoop's goal is to scan large data sets and produce results through a distributed, highly scalable batch-processing system.

• Apache Hadoop has two main components:
– HDFS
– MapReduce


Hadoop main features

– HDFS is a file system that spans all the nodes in a Hadoop cluster for data storage; the default block size in HDFS is 64MB.

– It links together the file systems on many local nodes to make them into one big file system.

– HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes (a minimal usage sketch follows).
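A minimal sketch of loading a local file into HDFS through Hadoop's Java FileSystem API; the NameNode address and both paths are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; address assumed for a local setup.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/tmp/local-data.log");   // hypothetical local file
        Path dst = new Path("/data/ingest/data.log"); // hypothetical HDFS path
        fs.copyFromLocalFile(src, dst);

        // The file is split into blocks (64MB by default in early versions),
        // each replicated across nodes for reliability.
        System.out.println("Replication factor: "
                + fs.getFileStatus(dst).getReplication());
        fs.close();
    }
}
```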


Hadoop main features

• MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

• The Map function in the master node takes the input, partitions it into smaller sub-problems, and distributes them to operational nodes.
– Each operational node can do this again, creating a multi-level tree structure.
– Each operational node processes its smaller problem and returns the response to its root node.

• In the Reduce function, the root node, once it has collected the answers to all the sub-problems, combines them to obtain the answer to the problem it is trying to solve (see the word-count sketch below).
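To make the paradigm concrete, here is the canonical word-count example sketched against Hadoop's Java MapReduce API (driver/job setup omitted): the map step emits (word, 1) pairs and the reduce step sums the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1).
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            // Sum all partial counts for this word and emit the total.
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
```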


Hadoop main features

Map‐Reduce Paradigm


How Hadoop Works

1. A client accesses unstructured and semi-structured data from different sources, including log files, social media feeds and internal data stores.

2. It breaks the data up into "parts", which are then loaded into a file system made up of multiple nodes running on commodity hardware.

3. File systems such as HDFS are adept at storing large volumes of unstructured and semi-structured data, as they do not require data to be organized into relational rows and columns.


How Hadoop Works

4. Each "part" is replicated multiple times and loaded into the file system, so that if a node fails, another node has a copy of the data contained on the failed node.

5. A Name Node acts as facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.


How Hadoop Works

6. Once the data is loaded into the cluster, it is ready to be analyzed via the MapReduce framework.

7. The client submits a "Map" job – usually a query written in Java – to one of the nodes in the cluster known as the Job Tracker.

8. The Job Tracker refers to the Name Node to determine which data it needs to access to complete the job and where in the cluster that data is located.


How Hadoop Works

9. Once determined, the Job Tracker submits the query to the relevant nodes.

10. Rather than bringing all the data back to a central location for processing, processing occurs at each node simultaneously, in parallel. This is an essential characteristic of Hadoop.

11. When each node has finished processing its given job, it stores the results.


How Hadoop Works

• The client initiates a "Reduce" job through the Job Tracker, in which the results of the map phase stored locally on individual nodes are aggregated to determine the "answer" to the original query, and then loaded onto another node in the cluster.

• The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis.

• The MapReduce job has now been completed.


Hadoop components

• A Hadoop "stack" is made up of a number of components. They include:

– Hadoop Distributed File System (HDFS): the default storage layer in any given Hadoop cluster.

– Name Node: the node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and whether any nodes have failed.

– Secondary Node: a backup to the Name Node; it periodically replicates and stores data from the Name Node in case it fails.

– Job Tracker: the node in a Hadoop cluster that initiates and coordinates MapReduce jobs, i.e. the processing of the data.

– Slave Nodes: the grunts of any Hadoop cluster; slave nodes store data and take direction from the Job Tracker to process it.


Hadoop complementary subprojects


Hadoop Pros & Cons

• The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of unstructured and semi-structured data, heretofore inaccessible to them, in a cost- and time-effective manner.

• Because Hadoop clusters can scale to petabytes and even exabytes of data, enterprises no longer must rely on sample data sets but can process and analyze ALL relevant data.

• Data scientists can apply an iterative approach to analysis, continually refining and testing queries to uncover previously unknown insights. It is also inexpensive to get started with Hadoop.

• Developers can download the Apache Hadoop distribution for free and begin experimenting with Hadoop in less than a day.


Hadoop Pros & Cons

• The downside to Hadoop and its myriad components is that they are immature and still developing.

• As with any young, raw technology, implementing and managing Hadoop clusters and performing advanced analytics on large volumes of unstructured data requires significant expertise, skill and training.


Comparison and Analysis of Architectural Features

Part 8 – Nadia Rauch


Index


6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
• Data Management
• Data Access and Visual Rendering
• Mining/Ingestion
• Data Analysis
• Architecture


Examined Products

• ArrayDBMS
• CouchBase
• Db4o
• eXist
• Hadoop
• HBase
• MapReduce
• MonetDB
• Objectivity
• OpenQM
• RdfHdt
• RDF3X


Data Management (1/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Data Dimension: Huge (100PB) | PB | 254GB | 2^31 doc. | 10PB | PB | 10PB | TB | TB | 16TB | 100mil triples (TB) | 50mil triples

Traditional / Not Traditional: NoSQL | NoSQL | NoSQL | NoSQL | NoSQL | NoSQL | NoSQL | SQL | XML/SQL++ | NoSQL | NoSQL | NoSQL

Partitioning: blob | blob 20MB | blob | N | chunk | A | blob | blob | blob | chunk 32KB | (N) | blob

Data Storage: multidim. array | 1 document/concept | object database + B-tree | XML document + tree | Big Table | Big Table | N | BAT | classic tables in which to define models | 1 file/table (data + dictionary) | 3 structures, RDF graph for Header | 1 table + permutations


Data Management (2/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Data Model: multidimensional | document store | object DB | XML DB | column family | column family | CF, KV, OBJ, DOC | ext. SQL | graph DB | multivalue DB | RDF | RDF

Cloud: A | Y | A | Y/A | Y | Y | Y | A | Y | N | A | A

Amount of Memory: reduced | documents + bidimensional index | objects + index | documents + index | optimized use | N | efficient use | like RDBMS | no compression | – | -50% of dataset | less than dataset

Metadata: Y | Y | Y | Y | Y | Y | Y/A | opt | Y/A | N | Y | Y


Data Management

• Managing of PB
• NoSQL
• Blob, 20MB
• Data Store: 1 document per concept
• Data Model: document store
• Support for Cloud solutions
• Memory requirements: documents + bidimensional index
• Metadata management


Data Access and visual Rendering| Mining/Ingestion

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Scalable visual rendering: N | N | N | N | A | N | N | N | N | Y | Y (3D adjacency matrix) | N

Visual rendering and visualization: N | N | Y | N | A | N | P | N | N | Y | Y | P

Access Type: web interface | multiple points | Y (interfaces) | Y (extended XPath syntax) | sequential | cache for frequent access | sequential (no random) | full SQL interfaces | Y (OID) | Y (concurrency) | access on demand | SPARQL

Type of indexing: multidimensional index | Y (incremental) | Y | B+-tree (XISS) | HDFS | h-files | distributed multilevel-tree indexing | hash index | Y (function and Objectivity/SQL++ interfaces) | B-tree based | RDF graph | Y (efficient triple indexes)

Data relationships: Y | N | Y | Y | A | N | Y | Y | Y | Y | Y | Y


Data Access and visual Rendering| Mining/Ingestion

• External tools for rendering and scalable visualization
• Sequential access
• Smart indexing with HDFS
• External tools for data-relationship management


Data Analysis

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Statistical Support: Y | Y/A | N | N | Y | Y | N | A (SciQL) | Y | A | N | N

Log Analysis: N | N | N | N | Y | P | Y | N | N | N | N | Y

Semantic Query: P | N | P | A | N | N | N | N | N | N | Y | Y

Statistical Indicator: Y | N | N | N | Y | Y | N | N | Y | A | Y | Y (performance)

CEP (Active Query): N | Y | N | N | N | N | N | N | N | N | N | N

Faceted Query: N | N | N | Y | N | A (filter) | Y | N | Y (Objectivity PQE) | N | Y | N


Data Analysis

• No statistical support
• No log analysis
• Support for semantic & facet queries
• Support for statistical indicator definition


Architecture (1/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Distributed: Y | Y | Y | P (hard) | Y | Y | Y | Y | Y | (-) | A | Y

High Availability: Y | Y | N | Y | Y | Y | N | Y | Y

Fault Tolerance: A | Y | Y | Y | Y | Y | Y | N | Y | N | A | A

Workload: computation-intensive | auto distribution | very high for updates of entire files | configurable | write-intensive | configurable | read-dominated (or rapidly changing) | more possibilities | divided among more processes | optimized

Indexing Speed: more than RDBMS | non-optimal performance | 5-10 times more than SQL | high speed with B+-tree | A | Y | Y/A | more than RDBMS | high speed | increased speed with alternate key | 15 times RDF | streamlined


Architecture (2/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Parallelism: Y | Y | transactional | Y | Y | Y | Y | N | Y | N | N | Y

N. of tasks configurable: N | Y | N | Y | Y | A | Y | N | Y | N | N | N

Data Replication: N | Y | Y | Y | Y | Y | A | Y | Y | Y | A | A


Architecture

• Complex implementation in a distributed system
• Not very high availability
• Fault tolerant
• High workload for updates of entire files
• High-speed indexing thanks to B+-trees
• Support for parallel computing
• Configurable number of tasks
• Complete support for data replication


Data Management

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Data Dimension: Huge (PB) | 300PB | TB | 1000EB | PB | 1Bil | TB-PB

Traditional / Not Traditional: NoSQL | SQL | Both | Both | Both | Both | Both

Partitioning: blob | Y/cluster | cluster | chunks

Cloud: Y | P | Y | Y | P | Y | Y

Metadata: Y | Y | Y | Y | N | Y | Y


Data Management


• Exchange of EB
• Involves both traditional and NoSQL DBs
• Support for Cloud systems
• High use of metadata


Data Access and visual Rendering| Mining/Ingestion

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Scalable visual rendering: P | Y | N | Y | N | Y | Y

Visual rendering and visualization: Y | N | A | Y | A | N | Y

Access Type: Y (concurrency) | standard interfaces | Get API | open ad hoc SQL access | A | N | AQL (SQL array version)


Data Access and visual Rendering| Mining/Ingestion

• Possible integration of scalable rendering tools
• Availability of visual reports
• Concurrent access type recommended


Data Analysis

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Type of Indexing: N | ontologies | distributed key index | Y (inverted index, MinHash) | N | N | multidimensional array

Data relationships: Y | Y | N | Y | Y | N | Y

Statistical Analysis Tools: Y | Y | A | Y | P | N | Y

Log Analysis: Y | Y | A | Y | Y | Y | A

Semantic Query: N | N | N | Y | N | N | N

Statistical Indicator: Y | Y | P | N | P | P | Y

CEP (Active Query): N | N | N | N | N | Y | N

Facet Query: N | Y | N | Y | Y | P | Y


Data Analysis

• Semantic indexing
• Data-relationship management
• Benefit from statistical analysis tools
• Benefit from log analysis
• High utilization of facet queries
• Statistical indicator relevance


Architecture

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Distributed: Y | P | Y | Y | Y | Y | Y

Highly reliable: Y | Y | N | N | Y | N | Y

Fault Tolerance: N | Y | Y | N | Y | Y | N

Indexing Speed: N | N | N | P | N | N | Y

Parallelism: Y | N | Y | Y | N | N | Y


Architecture

• Distributed architecture preferred
• Requires high reliability
• High-speed indexing
• Parallel computing required


Comparison Table

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Healthcare: Y | Y | Y | Y | Y | Y | Y | Y

Education: Y | Y

Energy/Transportation: Y | Y | Y | Y | Y

Social Network: P | Y | Y | Y | Y | Y | Y | Y

Financial/Business: Y | Y | Y | Y | Y | Y

Security: Y | Y | Y | Y

Data Analysis (Astronomy, Biology, Sociology): Y | Y | Y | Y | Y | Y | Y | Y | Y


Contact us!

Nadia Rauch: [email protected]
Mariano Di Claudio: [email protected]
