
Page 1:

Nadia Rauch & Mariano di Claudio

Taxonomy and Review of Big Data Solutions

Slides from the course: Sistemi Collaborativi e di Protezione (Prof. Paolo Nesi), Corso di Laurea Magistrale in Ingegneria, 2012/2013

Page 2:

Who I am

• Nadia Rauch, Eng.
• PhD Student and Researcher at DISIT Lab
• Email: [email protected]

Page 3:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 4:

Index

6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
   • Data Management
   • Data Access and Visual Rendering
   • Mining/Ingestion
   • Data Analysis
   • Architecture

Page 5:

The Big Data Problem.

Part 1 – Nadia Rauch

Page 6:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 7:

What is Big Data?

4V’s of Big Data

[Diagram: the 4 V's of Big Data — Volume, Velocity, Variety, Variability]

Page 8:

4V’s of Big Data

• Volume: companies are amassing terabytes/petabytes of information, and they always look for faster, more efficient and lower-cost solutions for data management.

• Velocity: how quickly the data comes at you. For time-sensitive processes, big data must be processed as streams of data in order to maximize its value.

• Variety: Big data is any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. 

• Variability: Refers to variance in meaning and in lexicon, but also to the variability in data structure.

Page 9:

Major problems of Big Data

How is it possible to discover their "value"?
• Using complex modeling processes: hypotheses are formulated, then statistical, visual and semantic models are implemented to validate them;
• Applying several, quite different techniques and analyses to the collected data.

Problems:
• Identifying a single architecture adaptable to all possible areas;
• The large number of application areas, very different from each other;
• The different channels through which data are collected daily.

Page 10:

Brewer's Theorem (CAP Principle)

• Consistency: if the system is in a consistent state after an operation, all clients see the same data.

• Availability: every request received by a non-failing node in the system must result in a response.

• Partition tolerance: the system continues to function even when split into disconnected subsets (by a network disruption).

You can satisfy at most two of these three requirements!

Page 11:

Possible Combinations with C, A and P

• CA: the database can provide distributed transactional semantics only in the absence of a network partition separating server nodes;

• CP: in the presence of partitions, transactions to an ACID database may be blocked until the partition heals, in order to avoid the risk of inconsistency;

• AP: the system is still available under partitioning, but some of the data returned may be inaccurate.

Page 12:

Big Data Analysis Pipeline 

Page 13:

Pipeline: Data Acquisition and Recording

• Huge amounts of data can be filtered and compressed by orders of magnitude.
– Challenge: define these filters in such a way that they do not discard useful information.

• Details regarding experimental conditions and procedures may be required to interpret the results correctly.
– Challenge: automatically generate the right metadata.

• It must be possible to search both metadata and data systems.
– Challenge: create optimized data structures that allow searching in acceptable times.

Page 14:

Pipeline: Information Extraction and Cleaning

• The information collected will not be in a format ready for analysis (e.g., a surveillance photo vs. a picture of the stars).
– Challenge: realize an information extraction process that pulls out the required information and expresses it in a form suitable for analysis.

• Big Data are often incomplete, and errors may have been committed during the Data Acquisition phase.
– Challenge: define constraints and error models for the many emerging Big Data domains.

Page 15:

Pipeline: Data Integration, Aggregation, Representation

• Data is heterogeneous; it is not enough to throw it into a repository.
– Challenge: create a data record structure suited to the differences in experimental details.

• There are many ways to store the same information, and some designs have advantages over others for certain purposes.
– Challenge: create tools to assist in the database design process and develop supporting techniques.

Page 16:

Pipeline: Query Processing, Data Modeling, and Analysis

• Methods for querying and mining Big Data are different from traditional statistical analysis.
– Challenge: scale complex query processing techniques to terabytes while enabling interactive response times.

• Interconnected Big Data forms large heterogeneous networks, in which information redundancy can be exploited to compensate for missing data, to cross-check conflicting cases, and to uncover hidden relationships and models.
– Challenge: coordinate database systems that provide SQL querying with analytics packages that perform various forms of non-SQL processing (data mining, statistical analyses).

Page 17:


Big Data Application Fields

Part 2 – Mariano Di Claudio

Page 18:

Application Fields

The term "Big Data" problem refers to the combination of a large volume of data to be treated in a short time.

There are many areas where big data are currently used with remarkable results and excellent future prospects to fully deal with the main challenges like data analysis, modeling, organization and retrieval.

[Diagram: Big Data characterized by the four V's — Volume, Velocity, Variety, Variability]

Page 19:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 20:

Healthcare and Medicine

• In the Healthcare/Medical field, a large amount of information is collected about:
– Electronic Patient Record (EPR)
– Symptomatology
– Diagnoses
– Therapies
– Responses to treatments

• In just 12 days, approximately 5000 patients entered the emergency department.

• In Medical Research, two main applications are:
– Genomic sequence collections (a single sequencing experiment yields 100 million short sequences)
– Analysis of neuroimaging data (intermediate data stored ~ 1.8 PetaBytes)

[V's involved: Volume, Velocity, Variability]

Page 21:

There is a need, in the medical field, to use run-time data to support the analysis of existing processes.

Healthcare and Medicine

• Healthcare processes are characterized by the fact that several organizational units can be involved in the treatment process of patients and that these organizational units often have their own specific IT applications.

• At the same time – especially in the case of public hospitals – there is growing pressure from governmental bodies to refactor clinical processes in order to improve efficiency and reduce costs.

Page 22:

Healthcare and Medicine

• Data mining techniques might be implemented to derive knowledge from this data, in order either to identify new interesting patterns in infection control data or to examine reporting practices.

• Process Mining is accomplished through techniques of analysis and evaluation of events stored in log files. In hospitals with the Electronic Patient Record (EPR), techniques have been investigated for fast access and extraction of information from event logs, to produce easily interpretable models, using partitioning, clustering and preprocessing techniques.

• By building a predictive model, it could be possible either to provide decision support for specific triage and diagnosis or to produce effective plans for chronic disease management, enhancing the quality of healthcare and lowering its cost.

Page 23:

Healthcare and Medicine

• At the base of genomics there are techniques of gene cloning and DNA sequencing, with the objective of knowing the entire genome of organisms.

• Knowledge of the entire genome makes it easier to identify the genes involved and to observe how they interact, particularly in the case of complex diseases such as tumors.

• Huge amounts of data, genetic knowledge, clinical practice and appropriate DBs make it possible to perform predictive studies on the incidence of certain diseases.

http://www.atgc‐montpellier.fr/gkarrays

A new opportunity is the use of GkArrays (a combination of three arrays), an intelligent solution for indexing huge collections of short reads.

Page 24:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 25:

Data Analysis – Scientific Research

• There are several areas of science and research in which the aim of Big Data analysis is to extract meaning from data and determine what actions to take.
– Astronomy (automated sky surveys)
• Today, 200 GB of new high-resolution optical data are being captured each evening by charge-coupled devices (CCDs) attached to telescopes.
– Biology (sequencing and encoding genes)
– Sociology (web log analysis of behavioral data)
• With up to 15 million people worldwide accessing the internet each day (nearly 10 hours per week online).
– Neuroscience (genetic and neuro-imaging data analysis)

• Scientific research is highly collaborative and involves scientists from different disciplines and from different countries.

[V's involved: Volume, Variety, Variability]

Page 26:

Data Analysis – Scientific Research

• To cope with the large amount of experimental data produced by modern disciplines, the University of Montpellier started the ZENITH project.

• Zenith adopts a hybrid p2p/cloud architecture.

http://www.sop.inria.fr/teams/zenith


The idea of Zenith is to exploit p2p because of the collaborative nature of scientific data, avoiding centralized control, and to use the potential of the computing, storage and network resources of the Cloud model to manage and analyze this large amount of data.

Page 27:

Data Analysis – Scientific Research

• Europeana is a platform for the management and sharing of multimedia content.

• Millions of content items are indexed and retrieved in real time.

• Each of them was initially modeled with a simple metadata model (ESE).

• A new, more complex model with a set of semantic relationships (EDM) is going to be adopted shortly.

www.europeana.eu

Page 28:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 29:

A new approach to teaching can be defined by exploiting big data management.

Educational

• Big Data has the potential to revolutionize not just research, but also education.

• The main educational data are:
– students' performance (Project KDD Cup 2010 *)
– mechanics of learning
– answers to different pedagogical strategies

[V's involved: Volume, Variability]

* https://pslcdatashop.web.cmu.edu/KDDCup/


• These data can be used to define models to understand what students actually know, their progress, and how to enrich this knowledge.

Page 30:

Educational

• The proposed new models of teaching exploit the potential of computer science and technology, combined with techniques for data analysis, with the aim of clarifying issues taking into account pedagogical, psychological and learning mechanisms, in order to define personalized instruction that meets the different needs of individual students or groups.

• Another sector of interest in this field is the e-learning domain, where two main kinds of users are defined: the learners and the learning providers.
– All personal details of learners and the online learning providers' information are stored in specific DBs, so applying data mining to e-learning makes it possible to realize teaching programs targeted to particular interests and needs, through efficient decision making.

Page 31:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 32:

Energy and Transportation

• The movement of people, public transport and private vehicles within and between metropolitan cities is critical to economic strength and quality of life, and involves a great variety of data:
– Train and bus schedules
• About 33,000 tourists visit Rome every day
• About 1,600,000 workers move to Rome each day
– GPS information
– Traffic interruptions
– Weather

• A data-centric approach can also help to enhance the efficiency and the dependability of a transportation system.

• The optimization of multimodal transportation infrastructures and their intelligent use can improve traveller experience and operational efficiencies.

[V's involved: Volume, Variety, Velocity, Variability]

Page 33:

Energy and Transportation

• Through the analysis and visualization of detailed road network data and the use of a predictive model, it is possible to achieve an intelligent transportation environment.

• Furthermore, by merging high-fidelity geographical stored data with real-time data from scattered sensor networks, an efficient urban planning system can be built that mixes public and private transportation, offering people more flexible solutions (Smart Mobility).
– Amsterdam – Copenhagen – Berlin – Venice – Florence – Bologna …
– http://www.smartcityexhibition.it/

• This new way of travelling has interesting implications for energy and environment.

Page 34:

Energy and Transportation

• In the field of energy resource optimization and environmental monitoring, data related to the consumption of electricity are very important.

• The analysis of a set of load profiles and geo-referenced information with appropriate data mining techniques, and the construction of predictive models from that data, could define intelligent distribution strategies in order to lower costs and improve the quality of life.

Page 35:

Energy and Transportation

• Researchers have shown that by instrumenting a home with just three sensors – for electricity, power, and water – it is possible to determine the resource usage of individual appliances.

• There is an opportunity to transform homes and larger residential areas into rich sensor networks, and to integrate information about personal usage patterns with information about energy availability and energy consumption, in support of evidence-based power management.

Page 36:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 37:

Social Network – Internet Service – Web Data

• The volume of data generated by internet services, websites, mobile applications and social networks is great, but the speed of production is variable, due to the human factor.
– Facebook uploads 3 billion photos per month, approximately 3600 TB/year.
– 50 million tweets are exchanged daily on average.
– Instagram has reached 7.3 million unique users per day in the USA.

• From these large amounts of data collected through social networks, researchers try to predict collective behavior; for example, through the trend of Twitter hash tags they are able to identify models of influence.

• In a broader sense, from all this information it is possible to extract knowledge and data relationships, improving the activity of query answering.

[V's involved: Volume, Variety, Velocity, Variability]

Page 38:

Social Network – Internet Service – Web Data

• Some researchers have proposed an alternative use of these data to create a novel form of urban living, in an initiative called ConnectiCity. Key aspects are:

• To create a set of tools to capture in real time various forms of city-relevant user-generated content from a variety of sources:
– social networks – websites – mobile applications…
• To interrelate it to the territory using Geo-referencing, Geo-parsing and Geo-coding techniques, and to analyze and classify it using Natural Language Processing techniques to identify:
– users' emotional approaches
– topics of interest
– networks of attention and of influence, and trending issues.

Page 39:

Social Network – Internet Service – Web Data

• Some researchers have proposed an alternative use of these data to create a novel form of urban living, in an initiative called ConnectiCity. Key aspects are:

• To make this information available and accessible both centrally and peripherally, to enable the creation of novel forms of decision-making processes, as well as to experiment with innovative models for participated, peer-to-peer, user-generated initiatives.

Project link – http://www.connecticity.net/
http://www.opendata.comunefi.it

Page 40:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 41:

Financial and Business

• The task of finding patterns in business data is not new. Traditionally, business analysts use statistical techniques.

• Nowadays, the widespread use of PCs and networking technologies has created large electronic repositories that store numerous business transactions.
– the magnitude of data is ~ 50–200 PB per day
– the Internet audience in Europe is about 381.5 million unique visitors
– 40% of European citizens shop online

• These data can be analyzed in order to:
– make predictions about the behavior of users
– identify buying patterns of individual/group customers
– provide new custom services,
using data warehousing technology and mature machine learning techniques.

[V's involved: Volume, Velocity, Variability]

Page 42:

Financial and Business

• In the financial field, investment and business plans may be created thanks to predictive models, derived using techniques of reasoning and used to discover meaningful and interesting patterns in business data.

1. Selection of data for analysis (from a network of DBs).
2. A cleaning operation removes discrepancies and inconsistencies.
3. The data set is analyzed to identify patterns (models that show the relationships among data).
4. It should be possible to translate the model into an actionable business plan that helps the organization achieve its goals.
5. A model/pattern that satisfies these conditions becomes business knowledge.

Page 43:

Application Fields

• A major investment in Big data can lay the foundations for the next generation of advances in medicine, science, business and information technology…

• Healthcare and Medicine
• Data Analysis – Scientific Research
• Educational
• Energy and Transportation
• Social Network – Internet Service – Web Data
• Financial/Business
• Security

Page 44:

Security

• Intelligence, Surveillance, and Reconnaissance (ISR) define topics that are well suited for data-centric computational analyses:
– close to one zettabyte (10^21 bytes, or one billion terabytes) of digital data are generated each year.

• Important data sources for intelligence gathering include:
– Satellite and UAV images
– Intercepted communications: civilian and military, including voice, email, documents, transaction logs and other electronic data – 5 billion mobile phones in use worldwide.

[V's involved: Volume, Variety, Velocity, Variability]

Page 45:

Security

– Radar tracking data.
– Public domain sources (web sites, blogs, tweets and other Internet data, print media, television, and radio).
– Sensor data (meteorological, oceanographic, security camera feeds).
– Biometric data (facial images, DNA, iris, fingerprint, gait recordings).
– Structured and semi-structured information supplied by companies and organizations:
• airline flight logs, credit card and bank transactions, phone call records, employee personnel records, electronic health records, police and investigative records.

Page 46:

Security

• The challenge for intelligence services is to find, combine, and detect patterns and trends in the traces of important information.

• The challenge becomes one of finding meaningful evolving patterns in a timely manner among diverse, potentially obfuscated information across multiple sources. These requirements greatly increase the need for very sophisticated methods to detect subtle patterns within data, without generating large numbers of false positives, so that we do not find conspiracies where none exist.

• As models of how to effectively exploit large-scale data sources, we can look to Internet companies such as Google, Yahoo, and Facebook.

Page 47:

Security

• Within the world of intelligence, we should consider computer technology and a consolidated machine learning technology as a way to augment the power of human analysts, rather than as a way to replace them.

• The key idea of machine learning is:
– First, apply structured statistical analysis to a data set to generate a predictive model.
– Then, apply this model to different data streams to support different forms of analysis.
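This two-step flow can be sketched with any off-the-shelf learning library. A minimal sketch using scikit-learn, where the features, labels and the incoming stream are invented purely for illustration:

    from sklearn.linear_model import LogisticRegression

    # Step 1: structured statistical analysis of a historical data set
    # yields a predictive model (features/labels are illustrative).
    X_train = [[0.1, 3], [0.9, 1], [0.2, 4], [0.8, 0]]
    y_train = [0, 1, 0, 1]
    model = LogisticRegression().fit(X_train, y_train)

    # Step 2: the model is applied to new data streams, flagging items
    # for a human analyst to review rather than replacing the analyst.
    for item in [[0.85, 1], [0.15, 5]]:
        print(item, "->", model.predict([item])[0])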

Page 48:

Overview of Big Data solutions

Part 3 – Nadia Rauch

Page 49:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 50:

4 Classes of Main Requirements

• Architectural aspects: when processing a very huge dataset it is important to optimize the workload, for example with a parallel architecture or by providing dynamic allocation of computation resources.

• Data Management aspects: the main solutions available are moving in the direction of handling increasing amounts of data.

• Access/Data Rendering aspects: it is important to be able to analyze Big Data with scalable display tools.

• Data Analysis|Mining/Ingestion aspects: statistical analysis, facet queries, answering queries quickly, and data quality in terms of update, coherence and consistency are important operations that an ideal architecture must support or ensure.

Page 51:


Data Management Aspects

Part 4 ‐ Nadia Rauch

Page 52:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 53:

Data Management Aspects (1/2)

• Scalability: each storage system for big data must be scalable with common and cheap hardware (increasing the number of storage discs).

• Tiered Storage: to optimize the time in which the required data is delivered.

• High Availability: a key requirement in a Big Data architecture. The design must be distributed, optionally leaning on a cloud solution.

• Support for Analytical and Content Applications: analyses can take days and involve several machines working in parallel; these data may be required in part by other applications.

• Workflow automation: full support for the creation, organization and transfer of workflows.

Page 54:

Data Management Aspects (2/2)

• Integration with existing Public and Private Cloud systems: a critical issue is transferring the entire data set; support for existing cloud environments would facilitate a possible data migration.

• Self Healing: the architecture must be able to accommodate component failures and heal itself without customer intervention, using techniques that automatically redirect to other resources the work that was carried out by the failed machine, which is automatically taken offline.

• Security: most big data installations are built upon a web services model, with few facilities for countering web threats, while it is essential that data are protected from theft and unauthorized access.

Page 55:

NoSQL Definition

• "Not Only SQL".
• Definition: name for a subset of structured storage software, designed for increased optimization for high-performance operations on large datasets.

Why NoSQL?
• ACID (Atomicity, Consistency, Isolation, Durability) doesn't scale well.
• Web apps have different needs: high availability, low-cost scalability and elasticity, low latency, flexible schemas, geographic distribution.
• Next-generation databases mostly address some of these needs by being non-relational, distributed, open-source and horizontally scalable.

Page 56:

NoSQL Characteristics

PROs:
• schema-free;
• high availability;
• scalability;
• easy replication support;
• simple API;
• eventually consistent / BASE (not ACID).

CONs:
• limited query capabilities;
• hard to move data out from one NoSQL system to another (but 50% are JSON-oriented);
• no standard way to access a NoSQL data store.

Page 57:

Eventually consistent

• The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value.

• BASE (Basically Available, Soft state, Eventual consistency) approach.
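As a minimal sketch of this guarantee (not tied to any particular product), the following Python fragment simulates three replicas converging under a last-write-wins rule; the replica structure and the gossip round are illustrative assumptions:

    import time

    # Each replica holds (value, timestamp) for a single object.
    replicas = [{"value": None, "ts": 0.0} for _ in range(3)]

    def write(replica, value):
        # A write lands on one replica first; the others stay stale
        # for a while (the inconsistency window).
        replica["value"], replica["ts"] = value, time.time()

    def anti_entropy():
        # Gossip round: every replica adopts the newest version seen.
        newest = max(replicas, key=lambda r: r["ts"])
        for r in replicas:
            r["value"], r["ts"] = newest["value"], newest["ts"]

    write(replicas[0], "v1")      # window opens
    print(replicas[1]["value"])   # a stale read (None) is possible here
    anti_entropy()                # no new updates: window closes
    assert all(r["value"] == "v1" for r in replicas)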

Page 58:

Types of NoSQL Database

• Key-Value DB
• Column-Family/Big Table DB
• Document DB
• Graph DB
• XML DB
• Object DB
• Multivalue DB

Page 59:

Key‐Value Database (1/2)

• Highly scalable database, but not suitable for large data sets.

• A solution that achieves good speed and excellent scalability.

• Different permutations of key-value pairs generate different databases.

• This type of NoSQL database is used for large lists of elements, such as stock quotes.

Page 60:

Key‐Value Database (2/2)

• Based on Amazon's Dynamo paper
• Examples: Dynomite, Voldemort, Tokyo
• Data Model: collection of key-value pairs
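To make the data model concrete, here is a minimal sketch of a key-value store with hash-based partitioning across nodes; the node list and the MD5 choice are illustrative assumptions, not details of Dynamo or of the systems above:

    import hashlib

    NODES = ["node0", "node1", "node2"]
    store = {n: {} for n in NODES}  # one opaque key-value dict per node

    def node_for(key: str) -> str:
        # Hash partitioning: each key maps to exactly one node,
        # which is what lets this model scale horizontally.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    def put(key, value):
        store[node_for(key)][key] = value

    def get(key):
        return store[node_for(key)].get(key)

    put("stock:AAPL", 523.1)   # e.g. one entry of a stock-quote list
    print(get("stock:AAPL"))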

Page 61:

Column Family/Big Table Database (1/2)

• Key-value stores can be column-family databases (grouping columns).

• The most advanced type, where keys and values can be composite.

• Very useful with time series or with data coming at high speed from multiple sources: sensors, devices and websites.

• They require good performance in read and write operations.

• They are not suitable for datasets where the relationships are as important as the data itself.

Page 62:

Column Family/Big Table Database (2/2)

• Based on Google's BigTable paper
• Data Model: Big Table, column families
• Examples: HBase, Hypertable
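The column-family model can be pictured as a nested map: row key, then column family, then column. A sketch with hypothetical family and column names, echoing the time-series/sensor use case above:

    # row_key -> column_family -> column -> value
    table = {}

    def put(row, family, column, value):
        table.setdefault(row, {}).setdefault(family, {})[column] = value

    # One row per sensor; one column per timestamp inside a family.
    put("sensor42", "readings", "2012-11-12T10:00", 21.5)
    put("sensor42", "readings", "2012-11-12T10:01", 21.7)
    put("sensor42", "meta", "location", "Firenze")

    print(table["sensor42"]["readings"])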

Page 63:

Document Database (1/2)

• Fits perfectly with OO programming.
• Behaves like a key-value store, but the content of the document is understood and taken into account.

• Useful when data are hardly representable with a relational model due to high complexity.

• Used with medical records or with data coming from social networks.

Page 64:

Document Database (2/2)

• Inspired by Lotus Notes
• Data Model: collections of key-value collections
• Examples: CouchDB, MongoDB
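A document groups all related data in one self-describing record, with no fixed schema; the patient-record fields below are invented for illustration, echoing the medical-records use case above:

    patient_record = {
        "_id": "patient:300",
        "name": "Jane Doe",
        "diagnoses": ["diabetes"],
        "therapies": [{"drug": "metformin", "dose_mg": 500}],
    }
    # Another document in the same collection may carry entirely
    # different fields; the store still indexes and queries both.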

Page 65:

Graph Database (1/2)

• They take up less space than the volume of data from which they are built, and store a lot of information on the relationships between data.

• Used when the dataset is strongly interconnected and not tabular.

• The access model is transactional, suitable for all applications that need transactions.

• Used in fields like geospatial data, bioinformatics, network analysis and recommendation engines.

• It is not easy to execute queries on this database type.

Page 66:

Graph Database (2/2)

• Inspired by graph theory
• Data Model: nodes and relationships, with key-value pairs on both
• Examples: AllegroGraph, Objectivity

Page 67:

Object Database 

• Allow the creation of very reliable storage systems.

• They directly integrate the object model of the programming language in which they were written (not much popularity).

• They cannot provide persistent access to data, due to the strong binding with specific platforms.

• Examples: Objectivity, Versant.

Page 68:

XML Database

• Large XML documents optimized to contain large quantities of semi-structured data.

• The data model is flexible.
• Queries are simple to perform.
• They are not very efficient.
• Examples: eXist, BaseX

Page 69:

Multivalue Database (1/2)

• Synonymous with the Pick operating system.
• Support the use of attributes that can be a list of values, rather than a single value.

• The data model is well suited to XML.
• Sometimes they are classified as NoSQL, but it is possible to access the data using SQL.

• They have not been standardized, but are simply classified as pre-relational, post-relational, relational and embedded.

Page 70:

Multivalue Database (2/2)

• Data Model: similar to relational
• Examples: OpenQM, Jbase.

Page 71:

CouchBase (www.couchbase.com)

• Open source NoSQL database for mission-critical systems;
• no multi-operation transaction support;
• easily scales apps to an "infinite" number of users;
• schema-less document database (flexibility, with indexing and querying similar to RDBMS capabilities);

• high performance with low latency (sub-millisecond reads and writes, no drop in performance as the app scales);

• no fixed schema (NoSQL);
• automatic replication.

Page 72:

CouchBase Features

• API: Memcached API + protocol (binary and ASCII);
• Protocol: Memcached, plus a REST interface for cluster configuration and management;

• Written in: C/C++ + Erlang (clustering);
• Replication: peer-to-peer, fully consistent;
• Misc: transparent topology changes during operation, provides memcached-compatible caching buckets, commercially supported version available.
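Since the data API is memcached-compatible, a plain memcached client can talk to a bucket. A sketch using the python-memcached package, assuming a local node exposing the standard memcached port 11211:

    import memcache  # pip install python-memcached

    # Connect to a Couchbase node as if it were a memcached server.
    mc = memcache.Client(["127.0.0.1:11211"])

    mc.set("employee:300", '{"name": "Jane", "dept": "Tech"}')
    print(mc.get("employee:300"))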

Page 73:

CouchBase

• MemCached compatible: enables applications to read and write data with sub‐millisecond latency and sustained high throughput. Users can allocate cache resources per database depending on application needs.

• Zero‐downtime maintenance: can perform any maintenance tasks on a live cluster. Not only do cluster topology changes like adding or removing servers require no application downtime, but even upgrading software can be done on a running system.

Page 74:

CouchBase

• Clone to grow with auto‐sharding: you can add (or remove) servers in your cluster with the click of a button allowing you to easily keep your resources in step with the changing needs of your application.  By simply adding more servers, you can horizontally scale your application with additional RAM as well as I/O capacity. Auto‐sharding distributes data uniformly across servers and enables direct routing of requests to the appropriate server without the need for any application changes.
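Direct routing works because any client can compute which server owns a key. Couchbase does this through vBuckets; the sketch below shows the hash-then-lookup idea with an illustrative vBucket count and cluster map:

    import zlib

    NUM_VBUCKETS = 1024
    servers = ["serverA", "serverB", "serverC"]
    # Cluster map: vbucket -> server, maintained by the cluster and
    # updated on rebalance (filled round-robin here for illustration).
    vbucket_map = {v: servers[v % len(servers)] for v in range(NUM_VBUCKETS)}

    def server_for(key: bytes) -> str:
        vbucket = zlib.crc32(key) % NUM_VBUCKETS
        return vbucket_map[vbucket]

    print(server_for(b"employee:300"))
    # Adding a server only changes the map; clients hash the same way.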

Page 75:

CouchBase Features

• Production‐ready management and monitoring: provides advanced monitoring with a rich administration web interface. Users can monitor the entire cluster using real‐time system monitoring charts that reflect statistics for the entire cluster.

• Reliable storage architecture: asynchronously persists all data to disk and enables you to store datasets larger than the physical RAM size. Couchbase automatically moves data between RAM and disk and keeps the working set in the object-level cache.

Page 76:

CouchBase Features

• Data replication with auto-failover: to ensure high availability of your application, Couchbase Server's data replication capability allows you to maintain multiple copies of your data within a cluster. Replica data is distributed across all nodes, reducing the impact of failure of individual nodes.

• Professional SDKs for a wide variety of languages: fully supported SDKs for Java, C#, PHP, C, Python and Ruby.

Page 77:

CouchBase Advantages

• Performance. The document data model keeps related data in a single physical location in memory and on disk (a document). This allows consistently low-latency access to the data – reads and writes happen with very little delay.

• Dynamic elasticity. Because the document approach keeps records “in one place” (a single document in a contiguous physical location), it is much easier to move the data from one server to another while maintaining consistency. 

Page 78:

CouchBase Advantages

• Schema flexibility. While all NoSQL databases provide schema flexibility, key-value and document-oriented databases enjoy the most. A key-value or document-oriented database requires no database maintenance to change the database schema (to add and remove "fields" or data elements from a given record).

• Query flexibility. Query expressiveness is the ability to ask the database questions, like "return me a list of all the farms in which a player purchased X last month". Document databases provide the best balance of schema flexibility without giving up the ability to do sophisticated queries.

Page 79:

CouchBase Basic Operations

• Docs are distributed evenly across the servers in the cluster.
• Each server stores both active and replica docs; only one copy is active at a time.
• The app reads, writes and updates docs. Multiple app servers can access the same document at the same time.

Page 80:

CouchBase – Add nodes

• 2 servers added to the cluster (one-click operation).
• Docs automatically rebalanced across the cluster, with minimum doc movement.
• Cluster map updated.
• App database calls now distributed over a larger number of servers.

Page 81:

CouchBase – Remove a node

• The cluster detects that a server has failed: it promotes replicas of its docs to active and updates the cluster map.
• App server requests for those docs now go to the appropriate server.
• Typically a rebalance would follow.

Page 82:

CouchBase Architecture

[Diagram: Couchbase architecture — Cluster Management and Data Management components]

Page 83:

CouchBase Query Example

Access a document directly by ID using an HTTP request:
• http://127.0.0.1:8092/default/300
• 8092 is the Couch API REST port used to access data (8091 is the port for the Admin console)
• default is the bucket in which the document is stored
• 300 is the ID of the document

Search your data with queries:
• "Give me all the employees of the Tech department"
• A View is a Map function, written in JavaScript, that generates a key-value list.
• Admin Console: Views tab -> "Create Development View"
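Both accesses can be scripted over HTTP. A sketch with the Python requests library, assuming a local node with the ports above; the dev_employees/by_dept view name is hypothetical:

    import requests

    # Fetch a document directly by its ID via the Couch API REST port.
    doc = requests.get("http://127.0.0.1:8092/default/300").json()
    print(doc)

    # Query a view whose map function emits (dept, name) pairs:
    # filtering by key returns all employees of one department.
    resp = requests.get(
        "http://127.0.0.1:8092/default/_design/dev_employees/_view/by_dept",
        params={"key": '"Tech"'},  # view keys are JSON-encoded
    )
    for row in resp.json().get("rows", []):
        print(row)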

Page 84:

CouchBase Examples 

Page 85:


Architectural Aspects

Part 5 ‐ Nadia Rauch

Page 86:

Index

1. The Big Data Problem
   • 4V's of Big Data
   • Big Data Problem
   • CAP Principle
   • Big Data Analysis Pipeline
2. Big Data Application Fields
3. Overview of Big Data Solutions
4. Data Management Aspects
   • NoSQL Databases
   • CouchBase
5. Architectural Aspects
   • eXist

Page 87:

Architectural aspects (1/2)

• Cluster size: the number of nodes in each cluster affects the completion time of each job: a greater number of nodes corresponds to a lower completion time for each job.

• Input data set: increasing the size of the initial data set increases the time needed to process the data and produce the results.

• Data node: greater computational power and more memory are associated with a shorter time to completion of a job.

• Data locality: it is not always possible to ensure that the data is available locally on the node; when the data blocks to be processed must be retrieved from elsewhere, the time to completion is significantly higher.

Page 88:

Architectural aspects (2/2)

• Network: the network affects the final performance of a Big Data management system; connections between clusters are used extensively during read and write operations. A highly available and resilient network, able to provide redundancy, is needed.

• CPU: the more the processes to be carried out are CPU-intensive, the greater the influence of CPU power on the final performance of the system.

• Memory: for memory-intensive applications it is good to have an amount of memory on each server able to cover the needs of the cluster (e.g. 2-4 GB per server); if memory is insufficient, performance suffers a lot.

Page 89:

eXist (exist‐db.org)

• eXist is an open source database management system entirely built on XML technology, also called a native XML database. 

• Unlike most relational database management systems, eXist uses XQuery to manipulate its data.
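For instance, an XQuery can be sent to eXist over the HTTP/REST interface listed on the next slide. A sketch using Python requests, assuming a default local installation and a hypothetical /db/employees collection:

    import requests

    # The REST interface accepts an XQuery expression in _query.
    xquery = (
        "for $e in collection('/db/employees')//employee "
        "where $e/dept = 'Tech' return $e/name"
    )
    resp = requests.get(
        "http://localhost:8080/exist/rest/db",
        params={"_query": xquery},
    )
    print(resp.text)  # results wrapped in an XML result element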

Page 90:

eXist Main Features

• API: XQuery;
• XML: DB API, DOM, SAX;
• Protocols: HTTP/REST, WebDAV, SOAP, XML-RPC, Atom;
• Query Method: XQuery;
• Written in: Java (open source);
• Concurrency: concurrent reads, lock on write;
• Misc: entire web applications can be written in XQuery, using XSLT, XHTML, CSS, and JavaScript (for AJAX functionality). Version 1.4 adds a new full-text search index based on Apache Lucene, a lightweight URL rewriting and MVC framework, and support for XProc.

Page 91:

eXist Database Features

• Data Storage: native XML data store based on B+-trees and paged files. Document nodes are stored in a persistent DOM.

• Collections: documents are managed in hierarchical collections, similar to storing files in a file system.

• Indexing: based on a numeric indexing scheme which supports quick identification of relationships between nodes, such as parent/child, ancestor/descendant or previous-/next-sibling.

• Index Creation: automatic index creation by default. Uses structural indexes for element and attribute nodes, a full-text index for text and attribute values, and range indexes for typed values. Full-text indexing can be turned on/off for selected parts of a document.

Page 92:

eXist Database Features

• Query Engine: eXist's query engine avoids top-down and bottom-up tree traversals. Instead, it relies on fast path join algorithms to compute node relationships; path joins can be applied to the entire node sequence in one single step.

• Updates: document-level and node-level updates with XUpdate and XQuery update extensions.

• Authorization Mechanism: Unix‐like access permissions for users/groups at collection and document level.

• Security: eXtensible Access Control Markup Language (XACML) for XQuery access control.

Page 93:

eXist Database Features

• Transactions/Crash Recovery: The database supports full crash recovery based on a journal in which all transactions are logged. In case of a database crash, all committed transactions will be redone while incomplete transactions will be rolled back. 

• Deployment: eXist may be deployed as a stand‐alone database server, as an embedded Java library or as part of a web application (running in the servlet engine).

• Backup/Restore functionality is provided via Java admin client or Ant scripts. The backup contains resources as readable XML that allows full restore of a database including user/group permissions.

Page 94:

eXist Limits

• Max. Number of Documents: the maximum number of documents that can be stored in the database is 2^31.

• Max. Size per Document: theoretically, documents can be arbitrarily large. In practice, some properties like the child count of a node are limited to a 4-byte int. There are also operating system limits, e.g. the maximum size of a file in the filesystem.

Page 95:

eXist ‐ Indexing Scheme Features

• Indexing scheme: allows quick identification of structural relationships between nodes.

• Nodes do not need to be loaded into main memory to determine the relationship between them.

• Every node in the document tree is labeled with a unique identifier (node ID).

• No further information is needed to determine whether a node can be the ancestor, descendant or sibling of another.

• The original numbering scheme used in eXist was based on level-order numbering.


eXist ‐ Old Indexing Scheme

• Level-order numbering models the document tree as a complete k-ary tree: it assumes that every node in the tree has k child nodes. If a node has fewer than k children, the remaining child IDs are left unassigned before continuing with the next sibling node.

• As a result, many IDs are never assigned to actual nodes. Because the tree is treated as complete and k-ary, the scheme runs out of available IDs quite fast, limiting the maximum size of a document that can be indexed (see the arithmetic sketch below).
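To make the ID arithmetic concrete, here is a toy Java sketch of level-order numbering in a complete k-ary tree; it is an illustration of the scheme, not eXist's actual implementation.

```java
// Level-order numbering in a complete k-ary tree with 1-based IDs:
// the children of node i are k*(i-1)+2 .. k*(i-1)+k+1, so relationships
// can be decided by arithmetic alone, without touching the document.
public class LevelOrderIds {
    static long firstChild(long id, int k) { return k * (id - 1) + 2; }
    static long parent(long id, int k)     { return (id - 2) / k + 1; }

    static boolean isAncestor(long a, long d, int k) {
        // Walk up from the candidate descendant towards the root.
        while (d > a) d = parent(d, k);
        return d == a;
    }

    public static void main(String[] args) {
        int k = 3; // assume at most 3 children per node
        System.out.println(firstChild(1, k));     // 2: first child of the root
        System.out.println(parent(7, k));         // 2: (7-2)/3 + 1 = 2
        System.out.println(isAncestor(1, 13, k)); // true: root is everyone's ancestor
    }
}
```

Note how a node with fewer than k children still reserves all k child IDs; that reserved-but-unused range is exactly why the scheme exhausts the ID space quickly.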


eXist ‐ Old Indexing Scheme Modified

• Instead of assuming a fixed number of child nodes, eXist recomputes this number for every level of the tree.

• Indexing a regularly structured document of a few hundred MB was no longer an issue.

• The numbering scheme still doesn't work well if the document tree is rather imbalanced. 

• Level-order numbering has a further disadvantage: inserting, updating or removing nodes in a stored document is very expensive, since it requires at least a partial renumbering and re-indexing of the nodes in the document.


eXist ‐ New Indexing Solution

• 2006 – "dynamic level numbering" (DLN): a numbering scheme based on variable-length IDs that avoids putting a conceptual limit on the size of the document to be indexed.

• It has a number of positive side effects, including fast node updates without re-indexing.

• Dynamic level numbers (DLN) are hierarchical IDs, inspired by Dewey decimal classification.

• IDs are sequences of numeric values separated by a delimiter. The root node has ID 1; every node below it takes the ID of its parent node as prefix, followed by a level value.


eXist ‐ New Indexing Solution

• 1 is the root node, 1.1 is the first node on the second level, 1.2 the second, and so on.

• Using this scheme, determining the relation between any two given IDs is a trivial operation; the ancestor-descendant case, for example, reduces to a simple prefix comparison (see the sketch below).
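A small Java sketch of these ID comparisons, using dotted strings for readability (eXist encodes DLN IDs in a compact binary form, so this is purely illustrative):

```java
// Deciding structural relationships from DLN-style IDs alone.
public class DlnIds {
    // "1.2" is an ancestor of "1.2.3.1" iff it is a proper dotted prefix.
    static boolean isAncestor(String ancestor, String descendant) {
        return descendant.startsWith(ancestor + ".");
    }

    // Siblings share the same parent prefix.
    static boolean isSibling(String a, String b) {
        int i = a.lastIndexOf('.');
        int j = b.lastIndexOf('.');
        return i > 0 && j > 0 && a.substring(0, i).equals(b.substring(0, j));
    }

    public static void main(String[] args) {
        System.out.println(isAncestor("1.2", "1.2.3.1")); // true
        System.out.println(isAncestor("1.2", "1.20.3"));  // false: "1.20" is not "1.2"
        System.out.println(isSibling("1.2.3", "1.2.7"));  // true: same parent "1.2"
    }
}
```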


XML Data Store Organization

(Figure: the eXist data store consists of four files — dom.dbx, collections.dbx, elements.dbx, words.dbx)

Page 101: Nadia Rauch& Mariano di Claudio · 2012. 11. 12. · Pipeline: Data Integration, Aggregation, Representation • Data is heterogeneous, it is not enough throw it into a repository

XML Data Store Organization

eXist uses 4 index files (based on B+-trees):

• collections.dbx manages the collection hierarchy.

• elements.dbx indexes elements and attributes.

• dom.dbx collects DOM nodes in a paged file and associates node identifiers to the actual nodes.

• words.dbx keeps track of word occurrences and is used by the full-text search extensions.


XML Data Store Organization

• This helps to keep the number of inner B+-tree pages small and yields better performance for queries on entire collections.

• However, creating an index entry for every single document in a collection leads to decreasing performance for collections containing a large number (>5000) of rather small (<50KB) documents.


Query example

• doc() is restricted to a single document-URI argument.

• xmldb:document() accepts multiple document paths to be included in the input node set.

• xmldb:document() without an argument includes EVERY document node in the current database instance.

• doc("/db/shakespeare/plays/hamlet.xml")//SPEAKER

• xmldb:document('/db/test/abc.xml', '/db/test/def.xml')//title



Access/Data Rendering Aspects

Part 6 ‐Mariano Di Claudio


Index


6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
• Data Management
• Data Access and Visual Rendering
• Mining/Ingestion
• Data Analysis
• Architecture


Access and Data Rendering aspects

– User Access: thanks to an Access Management System, it is possible to define different types of users (normal users, administrators, etc.) and to assign each of them access rights to different parts of the Big Data stored in the system.

– Separation of duties: using a combination of authorization, authentication and encryption, the duties of different types of users can be separated.
• This separation strongly contributes to safeguarding the privacy of data, which is a fundamental requirement in areas such as health/medicine or government.


Access and Data Rendering aspects

– Efficient Access: by defining standard interfaces (especially in business and educational applications), managing concurrency issues, and supporting multi-platform and multi-device access, it is possible to preserve data availability and improve the user experience.

– Scalable Visualization: because a query can return a small or an enormous result set, scalable display tools are important, allowing a clear view in both cases (e.g. a 3D adjacency matrix for RDF).


RDF‐HDT Library

• RDF is a standard for describing resources on the web, using statements consisting of subject – predicate – object.

• An RDF model can be represented by a directed graph.

• An RDF graph is represented physically by a serialization: RDF/XML, N-Triples, Notation3.

(Figure: a Subject node linked to an Object node by a Predicate edge)


RDF‐HDT Library

Main scalability drawbacks of the Web of Data are:

– Publication
• No recommendations/methodology for publishing at large scale
• Use of Vocabulary of Interlinked Data and semantic maps

– Exchange
• The main RDF formats (RDF/XML, Notation3, Turtle...) have a document-centric view
• Universal compressors (gzip) are used to reduce their size

– Consumption (query)
• Costly post-processing due to decompression and indexing (RDF store) issues


RDF‐HDT Library

• HDT (Header, Dictionary, Triples) is a compact data representation for RDF datasets that reduces their verbosity.

• The RDF graph is represented with 3 logical components:
– Header
– Dictionary
– Triples

• This makes it an ideal format for storing and sharing RDF datasets on the Web.



RDF‐HDT Library ‐ Header

• HDT improves the value of metadata.
• The Header is itself an RDF graph containing information about:
– provenance (provider, publication dates, version),
– statistics (size, quality, vocabularies),
– physical organization (subparts, location of files),
– other types of information (intellectual property, signatures).


RDF‐HDT Library – Dictionary + Triples

• Mapping between each term used in a dataset and unique IDs.
• Long/redundant terms are replaced by their corresponding IDs.
• Graph structures can be indexed and managed as integer streams.
• Compactness and consumption performance are achieved with an advanced dictionary serialization (a toy encoding sketch follows).
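The following toy Java sketch illustrates the dictionary-encoding idea; it is a self-contained illustration, not the actual RDF-HDT library API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// HDT-style dictionary encoding in miniature: terms are mapped to integer
// IDs so that each RDF statement becomes a compact ID-triple.
public class DictionaryEncoding {
    private final Map<String, Integer> term2id = new HashMap<>();
    private final List<String> id2term = new ArrayList<>();

    int encode(String term) {
        return term2id.computeIfAbsent(term, t -> {
            id2term.add(t);
            return id2term.size(); // IDs start at 1
        });
    }

    String decode(int id) { return id2term.get(id - 1); }

    public static void main(String[] args) {
        DictionaryEncoding dict = new DictionaryEncoding();
        // One RDF statement: subject - predicate - object.
        int[] triple = {
            dict.encode("http://example.org/alice"),
            dict.encode("http://xmlns.com/foaf/0.1/knows"),
            dict.encode("http://example.org/bob")
        };
        System.out.printf("ID-triple: (%d, %d, %d)%n",
                triple[0], triple[1], triple[2]);
        System.out.println(dict.decode(triple[0])); // http://example.org/alice
    }
}
```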


RDF‐HDT Library – Triples

• The ID-triples component compactly represents the RDF graph.

• The ID-triple is the key component for accessing and querying the RDF graph.

• The Triples component can provide a succinct index for some basic queries.

This optimizes space and provides efficient performance in primitive operations.


RDF‐HDT Library ‐ Pro

• HDT for exchanging


RDF‐HDT Library

HDT‐it!



Data Analysis and Mining/Ingestion Aspects

Part 7 ‐Mariano Di Claudio


Index


6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
• Data Management
• Data Access and Visual Rendering
• Mining/Ingestion
• Data Analysis
• Architecture


Mining and Ingestion

• Mining and Ingestion are two key features in the field of big data; in fact, there is a tradeoff between:
– the speed of data ingestion,
– the ability to answer queries quickly,
– the quality of the data in terms of updates, coherence and consistency.

• This compromise impacts the design of any storage system (e.g. OLTP vs OLAP).

• For instance, some file systems are optimized for reads and others for writes, but workloads generally involve a mix of both operations.


Mining and Ingestion

• Type of indexing – to speed up data ingest, records can be written to a cache or advanced data-compression techniques can be applied; conversely, a different indexing method can improve the speed of data-retrieval operations, at the sole cost of increased storage space.

• Management of data relationships – in some contexts, data and the relationships between data are equally important. Moreover, new types of data, in the form of highly interrelated content, require managing multi-dimensional relationships in real time. A possible solution is to store relationships in dedicated data structures that ensure good access and extraction capabilities, in order to adequately support predictive analytics tools.

• Temporary data – some kinds of data analysis are themselves big: computations and analyses can create enormous amounts of temporary data that must be managed appropriately to avoid memory problems. In other cases, by collecting statistics on the information that is accessed most frequently, it is possible to build a well-defined cache system or temporary files and thus optimize the query process (see the sketch below).
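As one possible realization of that caching idea, here is a minimal Java sketch of a bounded least-recently-used query cache built on LinkedHashMap's access-order mode; the class name and capacity are illustrative assumptions, not a specific product's design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded cache that keeps the most recently accessed query results
// in memory so repeated queries avoid hitting the full data store.
public class QueryCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public QueryCache(int capacity) {
        super(16, 0.75f, true); // access-order iteration -> LRU behavior
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least-recently-used entry
    }

    public static void main(String[] args) {
        QueryCache<String, String> cache = new QueryCache<>(2);
        cache.put("q1", "result-1");
        cache.put("q2", "result-2");
        cache.get("q1");             // touch q1, so q2 becomes eldest
        cache.put("q3", "result-3"); // evicts q2
        System.out.println(cache.keySet()); // [q1, q3]
    }
}
```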


Hadoop

• Hadoop is an open source framework for processing, storing and analyzing massive amounts of distributed, unstructured data.

• It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.


Hadoop

• Hadoop was inspired by MapReduce, a programming model developed by Google in the early 2000s for indexing the Web.

• Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology.

Key Idea
Rather than banging away at one huge block of data with a single machine, Hadoop breaks up Big Data into multiple parts so that each part can be processed and analyzed at the same time.


Hadoop

Hadoop changes the economics and the dynamics of large-scale computing, defining a processing solution that is:

– Scalable – New nodes can be added as needed, without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.

– Cost effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.

– Flexible – Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.

– Fault tolerant – When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.


Hadoop main features

• Hadoop's goal is to scan large data sets and produce results through a distributed, highly scalable batch-processing system.

• Apache Hadoop has two main components:
– HDFS
– MapReduce


Hadoop main features

– HDFS is a file system that spans all the nodes in a Hadoop cluster for data storage; the default block size in HDFS is 64MB.

– It links together the file systems on many local nodes to make them into one big file system.

– HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes (a minimal usage sketch follows).
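A minimal sketch of loading a local file into HDFS through Hadoop's Java FileSystem API; the NameNode address and both paths are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; address assumed for a local setup.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/tmp/local-data.log");   // hypothetical local file
        Path dst = new Path("/data/ingest/data.log"); // hypothetical HDFS path
        fs.copyFromLocalFile(src, dst);

        // The file is split into blocks (64MB by default in early versions),
        // each replicated across nodes for reliability.
        System.out.println("Replication factor: "
                + fs.getFileStatus(dst).getReplication());
        fs.close();
    }
}
```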


Hadoop main features

• MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

• The Map function in the master node takes the input, partitions it into smaller sub-problems, and distributes them to operational nodes.
– Each operational node can do this again, creating a multi-level tree structure.
– Each operational node processes its smaller problem and returns the response to its root node.

• In the Reduce function, the root node, once it has collected the answers to all the sub-problems, combines them to obtain the answer to the problem it is trying to solve (see the word-count sketch below).
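To make the paradigm concrete, here is the canonical word-count example sketched against Hadoop's Java MapReduce API (driver/job setup omitted): the map step emits (word, 1) pairs and the reduce step sums the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1).
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            // Sum all partial counts for this word and emit the total.
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
```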


Hadoop main features

Map‐Reduce Paradigm


How Hadoop Works

1. A client accesses unstructured and semi-structured data from different sources, including log files, social media feeds and internal data stores.

2. It breaks the data up into "parts", which are then loaded into a file system made up of multiple nodes running on commodity hardware.

3. File systems such as HDFS are adept at storing large volumes of unstructured and semi-structured data, as they do not require data to be organized into relational rows and columns.


How Hadoop Works

4. Each "part" is replicated multiple times and loaded into the file system, so that if a node fails, another node has a copy of the data contained on the failed node.

5. A Name Node acts as facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.


How Hadoop Works

6. Once the data is loaded into the cluster, it is ready to be analyzed via the MapReduce framework.

7. The client submits a "Map" job – usually a query written in Java – to one of the nodes in the cluster known as the Job Tracker.

8. The Job Tracker refers to the Name Node to determine which data it needs to access to complete the job and where in the cluster that data is located.


How Hadoop Works

9. Once determined, the Job Tracker submits the query to the relevant nodes.

10. Rather than bringing all the data back to a central location for processing, processing occurs at each node simultaneously, in parallel. This is an essential characteristic of Hadoop.

11. When each node has finished processing its given job, it stores the results.


How Hadoop Works

• The client initiates a "Reduce" job through the Job Tracker, in which the results of the map phase stored locally on individual nodes are aggregated to determine the "answer" to the original query, and then loaded onto another node in the cluster.

• The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis.

• The MapReduce job has now been completed.


Hadoop components

• A Hadoop "stack" is made up of a number of components. They include:

– Hadoop Distributed File System (HDFS): the default storage layer in any given Hadoop cluster.

– Name Node: the node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and whether any nodes have failed.

– Secondary Node: a backup to the Name Node; it periodically replicates and stores data from the Name Node in case it fails.

– Job Tracker: the node in a Hadoop cluster that initiates and coordinates MapReduce jobs, i.e. the processing of the data.

– Slave Nodes: the grunts of any Hadoop cluster; slave nodes store data and take direction from the Job Tracker to process it.


Hadoop complementary subprojects


Hadoop Pros & Cons

• The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of unstructured and semi-structured data, heretofore inaccessible to them, in a cost- and time-effective manner.

• Because Hadoop clusters can scale to petabytes and even exabytes of data, enterprises no longer must rely on sample data sets but can process and analyze ALL relevant data.

• Data scientists can apply an iterative approach to analysis, continually refining and testing queries to uncover previously unknown insights. It is also inexpensive to get started with Hadoop.

• Developers can download the Apache Hadoop distribution for free and begin experimenting with Hadoop in less than a day.


Hadoop Pros & Cons

• The downside to Hadoop and its myriad components is that they are immature and still developing.

• As with any young, raw technology, implementing and managing Hadoop clusters and performing advanced analytics on large volumes of unstructured data requires significant expertise, skill and training.


Comparison and Analysis of Architectural Features

Part 8 – Nadia Rauch


Index


6. Access/Data Rendering Aspects
7. Data Analysis and Mining/Ingestion Aspects
8. Comparison and Analysis of Architectural Features
• Data Management
• Data Access and Visual Rendering
• Mining/Ingestion
• Data Analysis
• Architecture


Examined Products

• ArrayDBMS
• CouchBase
• Db4o
• eXist
• Hadoop
• HBase
• MapReduce
• MonetDB
• Objectivity
• OpenQM
• RdfHdt
• RDF3X


Data Management (1/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Data Dimension: Huge (100PB) | PB | 254GB | 2^31 doc. | 10PB | PB | 10PB | TB | TB | 16TB | 100mil triples (TB) | 50mil triples

Traditional / Not Traditional: NoSQL | NoSQL | NoSQL | NoSQL | NoSQL | NoSQL | NoSQL | SQL | XML/SQL++ | NoSQL | NoSQL | NoSQL

Partitioning: blob | blob 20MB | blob | N | chunk | A | blob | blob | blob | chunk 32KB | (N) | blob

Data Storage: multidim. array | 1 document/concept | object database + B-tree | XML document + tree | Big Table | Big Table | N | BAT | classic tables in which to define models | 1 file/table (data + dictionary) | 3 structures, RDF graph for Header | 1 table + permutations


Data Management (2/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Data Model: multidimensional | document store | object DB | XML DB | column family | column family | CF, KV, OBJ, DOC | ext. SQL | graph DB | multivalue DB | RDF | RDF

Cloud: A | Y | A | Y/A | Y | Y | Y | A | Y | N | A | A

Amount of Memory: reduced | documents + bidimensional index | objects + index | documents + index | optimized use | N | efficient use | like RDBMS | no compression | – | -50% of dataset | less than dataset

Metadata: Y | Y | Y | Y | Y | Y | Y/A | opt | Y/A | N | Y | Y


Data Management

• Managing of PB
• NoSQL
• Blob, 20MB
• Data Store: 1 document per concept
• Data Model: document store
• Support for Cloud solutions
• Memory requirements: documents + bidimensional index
• Metadata management


Data Access and visual Rendering| Mining/Ingestion

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Scalable visual rendering: N | N | N | N | A | N | N | N | N | Y | Y (3D adjacency matrix) | N

Visual rendering and visualization: N | N | Y | N | A | N | P | N | N | Y | Y | P

Access Type: web interface | multiple points | Y (interfaces) | Y (extended XPath syntax) | sequential | cache for frequent access | sequential (no random) | full SQL interfaces | Y (OID) | Y (concurrency) | access on demand | SPARQL

Type of indexing: multidimensional index | Y (incremental) | Y | B+-tree (XISS) | HDFS | h-files | distributed multilevel-tree indexing | hash index | Y (function and Objectivity/SQL++ interfaces) | B-tree based | RDF graph | Y (efficient triple indexes)

Data relationships: Y | N | Y | Y | A | N | Y | Y | Y | Y | Y | Y


Data Access and visual Rendering| Mining/Ingestion

• External tools for rendering and scalable visualization
• Sequential access
• Smart indexing with HDFS
• External tools for data-relationship management


Data Analysis

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Statistical Support: Y | Y/A | N | N | Y | Y | N | A (SciQL) | Y | A | N | N

Log Analysis: N | N | N | N | Y | P | Y | N | N | N | N | Y

Semantic Query: P | N | P | A | N | N | N | N | N | N | Y | Y

Statistical Indicator: Y | N | N | N | Y | Y | N | N | Y | A | Y | Y (performance)

CEP (Active Query): N | Y | N | N | N | N | N | N | N | N | N | N

Faceted Query: N | N | N | Y | N | A (filter) | Y | N | Y (Objectivity PQE) | N | Y | N


Data Analysis

• No statistical support
• No log analysis
• Support for semantic & facet queries
• Support for statistical indicator definition


Architecture (1/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Distributed: Y | Y | Y | P (hard) | Y | Y | Y | Y | Y | (-) | A | Y

High Availability: Y | Y | N | Y | Y | Y | N | Y | Y

Fault Tolerance: A | Y | Y | Y | Y | Y | Y | N | Y | N | A | A

Workload: computation-intensive | auto distribution | very high for updates of entire files | configurable | write-intensive | configurable | read-dominated (or rapidly changing) | more possibilities | divided among more processes | optimized

Indexing Speed: more than RDBMS | non-optimal performance | 5-10 times more than SQL | high speed with B+-tree | A | Y | Y/A | more than RDBMS | high speed | increased speed with alternate key | 15 times RDF | streamlined


Architecture (2/2)

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Parallelism: Y | Y | transactional | Y | Y | Y | Y | N | Y | N | N | Y

N. of tasks configurable: N | Y | N | Y | Y | A | Y | N | Y | N | N | N

Data Replication: N | Y | Y | Y | Y | Y | A | Y | Y | Y | A | A


Architecture

• Complex implementation in a distributed system
• Not very high availability
• Fault tolerant
• High workload for updates of entire files
• High-speed indexing thanks to B+-trees
• Support for parallel computing
• Configurable number of tasks
• Complete support for data replication


Data Management

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Data Dimension: Huge (PB) | 300PB | TB | 1000EB | PB | 1Bil | TB-PB

Traditional / Not Traditional: NoSQL | SQL | Both | Both | Both | Both | Both

Partitioning: blob | Y/cluster | cluster | chunks

Cloud: Y | P | Y | Y | P | Y | Y

Metadata: Y | Y | Y | Y | N | Y | Y


Data Management


• Exchange of EB
• Involves both traditional and NoSQL DBs
• Support for Cloud systems
• High use of metadata


Data Access and visual Rendering| Mining/Ingestion

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Scalable visual rendering: P | Y | N | Y | N | Y | Y

Visual rendering and visualization: Y | N | A | Y | A | N | Y

Access Type: Y (concurrency) | standard interfaces | Get API | open ad hoc SQL access | A | N | AQL (SQL array version)


Data Access and visual Rendering| Mining/Ingestion

• Possible integration of scalable rendering tools
• Availability of visual reports
• Concurrent access type recommended


Data Analysis

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Type of Indexing: N | ontologies | distributed key index | Y (inverted index, MinHash) | N | N | multidimensional array

Data relationships: Y | Y | N | Y | Y | N | Y

Statistical Analysis Tools: Y | Y | A | Y | P | N | Y

Log Analysis: Y | Y | A | Y | Y | Y | A

Semantic Query: N | N | N | Y | N | N | N

Statistical Indicator: Y | Y | P | N | P | P | Y

CEP (Active Query): N | N | N | N | N | Y | N

Facet Query: N | Y | N | Y | Y | P | Y


Data Analysis

• Semantic indexing
• Data-relationship management
• Benefit from statistical analysis tools
• Benefit from log analysis
• High utilization of facet queries
• Statistical indicator relevance


Architecture

(Columns: Healthcare | Educational | Energy/Transportation | Social Network, Internet Services, Web Data | Financial/Business | Security | Data Analysis, Scientific Research (biomedical...))

Distributed: Y | P | Y | Y | Y | Y | Y

Highly reliable: Y | Y | N | N | Y | N | Y

Fault Tolerance: N | Y | Y | N | Y | Y | N

Indexing Speed: N | N | N | P | N | N | Y

Parallelism: Y | N | Y | Y | N | N | Y


Architecture

• Distributed architecture preferred
• Requires high reliability
• High-speed indexing
• Parallel computing required


Comparison Table

(Columns: ArrayDBMS | CouchBase | Db4o | eXist | Hadoop | HBase | MapReduce | MonetDB | Objectivity | OpenQM | RdfHdt Library | RDF3X)

Healthcare: Y | Y | Y | Y | Y | Y | Y | Y

Education: Y | Y

Energy/Transportation: Y | Y | Y | Y | Y

Social Network: P | Y | Y | Y | Y | Y | Y | Y

Financial/Business: Y | Y | Y | Y | Y | Y

Security: Y | Y | Y | Y

Data Analysis (Astronomy, Biology, Sociology): Y | Y | Y | Y | Y | Y | Y | Y | Y


Contact us!

Nadia Rauch: [email protected]
Mariano Di Claudio: [email protected]
