THE NEW POWER OF DATA: Collection, Integration, and Analytics Wenny Rahayu Professor in Computer Science Head, School of Engineering and Mathematical Sciences La Trobe University, Melbourne Australia


Post on 17-Jan-2016


Page 1:

THE NEW POWER OF DATA: Collection, Integration, and Analytics

Wenny Rahayu
Professor in Computer Science
Head, School of Engineering and Mathematical Sciences

La Trobe University, Melbourne Australia

Page 2:

Where is La Trobe?

35,000 students
3,200 staff

Page 3:

Moving from Databases to Data Container

“Every day, 2.5 quintillion bytes of data are created and 90% of the data in the world today was created within the past two years.”

IBM Corporation

…
10^15 = quadrillion (petabytes)
10^18 = quintillion (exabytes)
10^21 = sextillion (zettabytes)

“Worldwide information is more than doubling every two years, with 1.8 zettabytes or 1.8 trillion gigabytes projected to be created and replicated this year alone.”

ZDNet news

VOLUME
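The unit arithmetic behind the two quotes can be checked with a short sketch. Decimal (SI) prefixes are assumed, and the helper and variable names are illustrative only.

```python
# Decimal (SI) prefixes are assumed; binary prefixes (2**50 etc.) would give other figures.
GIGABYTE = 10**9
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21


def convert(n_bytes: float, unit: int) -> float:
    """Express a byte count in the given unit."""
    return n_bytes / unit


daily_bytes = 2.5 * EXABYTE       # "2.5 quintillion bytes" created every day
yearly_bytes = 1.8 * ZETTABYTE    # "1.8 zettabytes" created and replicated in a year

daily_pb = convert(daily_bytes, PETABYTE)    # about 2,500 petabytes per day
yearly_gb = convert(yearly_bytes, GIGABYTE)  # about 1.8 trillion gigabytes per year

print(f"{daily_pb:,.0f} PB per day")
print(f"{yearly_gb:,.0f} GB per year")
```

One zettabyte is 10^12 gigabytes, which is why 1.8 zettabytes and 1.8 trillion gigabytes are the same quantity.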

Page 4:

http://archive.tiecon.org/content/big-data-landscape-why-should-you-care

Means for Data Collection

Page 5:

We are not quite sure what the exact definition of a Data Scientist is, but if you deal with something generally related to converting data into useful insight then you will hopefully benefit from joining the group.

Whether you’re in business, academia, or government, and whether you’re an analyst, data miner, programmer, student, electrical engineer, computer scientist, physicist, etc, and you work with data to generate insights, build predictive models, build optimisation models, build reports/dashboards/visualisations, automate analyses, etc, using Python, R, SQL, C/C++, Java, Tableau, Excel, Hadoop, etc, and you care about doing it right, efficiently, repetitively, optimally, visually, etc, then join us!

Source: http://www.meetup.com/Data-Science-Melbourne/

Multi-Disciplinary

Page 6:

New ways of developing drugs – Novartis New Drug Research

Novartis Institute for Biomedical Research (NIBR) in Cambridge, Mass.

• A new breed of “data scientist” is working to re-invent the traditional drug research team. Instead of biologists, chemists and clinicians working in silos, pharmaceutical companies such as Novartis are assembling collaborative, cross-disciplinary teams.

• These teams include data scientists, drawing on their expertise in computer science and statistics to sift through information and attempt to extract answers to pressing questions. They collaborate with biologists and clinicians to develop a clear hypothesis and then put it to the test.

• https://www.novartis.com/stories/discovery/surfing-wave-big-data-analytics

Data Inspires New Scientific Innovation

Page 7:

Smart Sensor Solution and Real Time Data Analytics

[Figure: system diagram. Recoverable labels:]

Pasture: sensor on lambs, sensor on ewes; recording behaviour, activity and relationship of animals (proximity approach)

Options for sensor data download: base-station, handheld reader, computer

Database: sensor data will be saved securely in a database system for post-analysis

Smart Sensor: RF communication; activity sensors (accelerometers, gyroscope, magnetometer, temperature); low power processing and storage; battery powered, with power management unit

User Interface: administration/configuration, data visualization, reporting system, alert system; analysed sensor data reports will be accessible through a web-based user interface

Multidisciplinary work between IT, Engineering, Centre of Technology Infusion, and the Agricultural Department.

The project will produce low-cost, long-life sensors for use with farm animals to monitor motion, proximity and true location.

Sensor data and real-time data analytics will provide actionable information to farmers on parentage, health, oestrus, grazing, etc.

Page 8:

Big Data - the bottleneck issue

Homogeneous, standard enterprise data:
Gathering & preparing data (70~80%)
Analyzing data (20~30%)

Heterogeneous, Big Data:
Gathering & preparing data (95%)
Analyzing data (5%)

* Reference from Prof. Timos Sellis – Data Ecosystem - From Very Large Databases to Big Data Infrastructure, La Trobe, November 2015

Page 9:

Also known as data fusion, data blending, data mapping, data acquisition, etc…

Informal description by Roderick et al. (http://www.odbms.org/2015/11/what-is-data-blending/):

“… the answer is not always written on the same book as the question. Thus, we must learn to decipher it from multiple books. Some of them are in a foreign language, some are hundreds of times thicker than others, and most of them are by different authors who have never agreed on a literary style. And there is no catalogue.”

Data Integration

Page 10:

Data Integration

The need to deal with large data size and different complexity of data formats/structures

Integration can be achieved through:
• Standardization of data representations
• Global semantic representation: ontology or schema mediator
• “Loose coupling” integration: data virtualization, data container

Page 11:

Standardization of data representations

Page 12:

XML as the common ‘language’ of representation
• XBRL (eXtensible Business Reporting Language)
• BSML (Bioinformatics Sequence Markup Language)
• HL7 (Health Level Seven)
• FIX (Financial Information eXchange)
• AIXM (Aeronautical Information eXchange Model)
• GML (Geography Markup Language)
• MathML (Mathematical Markup Language)
• gbXML (Green Building eXtensible Markup Language)
• And so on…

Example of Standardization

Page 13:

Snapshot of a standard XML representation in Aviation - AIXM

<AirportHeliport ..
  <timeSlice>
    <AirportHeliportTimeSlice gml:id="AHa1">
      <gml:validTime>
        <gml:TimePeriod gml:id="AHb1">
          <gml:beginPosition>7/8/2004 0:0:0</gml:beginPosition>
          <gml:endPosition>12/31/8888 0:0:0</gml:endPosition>
        </gml:TimePeriod>
      </gml:validTime>
      <interpretation>BASELINE</interpretation>
      <designator>NFFN</designator>
      <name>NADI</name>
      <type>AD</type>
      <magneticVariation>12.24</magneticVariation>
      <ARP>
        <ElevatedPoint gml:id="AHc1">
          <gml:coordinates decimal="." cs="," ts=" ">177.443333333333,-17.7563888888889</gml:coordinates>
          <elevation uom="FT">59</elevation>
        </ElevatedPoint>
      </ARP>
      ………
    </AirportHeliportTimeSlice>
  </timeSlice>
</AirportHeliport>
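Because the representation is standard XML, a consumer can read it with any XML parser. The following is a minimal sketch using Python's standard library on a well-formed reduction of the AIXM snippet. The namespace URI bound to the `gml:` prefix, the function name `read_airport` and the returned field names are assumptions, since the slide elides the root element's attributes.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the gml: prefix; the slide does not show the declaration.
GML_NS = "http://www.opengis.net/gml"

# Well-formed reduction of the slide's snippet (elided parts are left out).
AIXM_SNIPPET = f"""
<AirportHeliport xmlns:gml="{GML_NS}">
  <timeSlice>
    <AirportHeliportTimeSlice gml:id="AHa1">
      <interpretation>BASELINE</interpretation>
      <designator>NFFN</designator>
      <name>NADI</name>
      <ARP>
        <ElevatedPoint gml:id="AHc1">
          <gml:coordinates decimal="." cs="," ts=" ">177.443333333333,-17.7563888888889</gml:coordinates>
          <elevation uom="FT">59</elevation>
        </ElevatedPoint>
      </ARP>
    </AirportHeliportTimeSlice>
  </timeSlice>
</AirportHeliport>
"""


def read_airport(xml_text: str) -> dict:
    """Extract a few fields from one AirportHeliport time slice."""
    root = ET.fromstring(xml_text.strip())
    ns = {"gml": GML_NS}
    time_slice = root.find("timeSlice/AirportHeliportTimeSlice")
    point = time_slice.find("ARP/ElevatedPoint")
    coords = point.findtext("gml:coordinates", namespaces=ns).strip()
    # The two values are kept in document order; which one is longitude is not stated.
    first, second = (float(v) for v in coords.split(","))
    elevation = point.find("elevation")
    return {
        "designator": time_slice.findtext("designator"),
        "name": time_slice.findtext("name"),
        "coordinates": (first, second),
        "elevation": float(elevation.text),
        "elevation_uom": elevation.get("uom"),
    }


airport = read_airport(AIXM_SNIPPET)
print(airport)
```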

Page 14:

Integration of standard XML representation in Aviation: AIXM, WXXM, FIXM, etc.

[Figure: layered integration architecture. Recoverable labels, in source order: ADMS (Oracle); AIXM 5.0 (Oracle); Automated Mapping Specification; LAYER 1: ADMS AIXM-based database; ADMS Mapping and Migration to new AIXM5 Database; EFB Charting Publication … Visualisation Tool; Transformation to produce flat XML documents; LAYER 3: External Service Providers; AIXM document; WXXM Weather data; FIXM, NOTAM XML data; ??? Future XML Data]

Page 15:

The International Standard Body OGC - XML Standard in Aviation Domain

AIXM = Aeronautical Information Exchange Model

WXXM = Weather Information Exchange Model

FIXM = Flight Information Exchange Model

Source: OGC – www.opengeospatial.org

Page 16:

The layering design approach enables the integration of AIXM data with other Aeronautical XML based data

Aeronautical Reference Data NOTAM Airport Spatial Data Dynamic-Temporal Messaging

WEATHER Data

A0312/08 NOTAMN
Q) LKAA/QFAXX/IV/BO/A/000/999/5006N01415E005
A) LKPR B) 0803230000 C) PERM
E) NEW POSTAL ADDRESS OF LKPR AD: KE KRALOVSKEMU LETISTI 6/1019 160 08 PRAHA 6 RUYZNE.

yyyy  mm  tmax  tmin  af    rain  sun
          degC  degC  days  mm    hours
2008   1   5.0  -1.4   21   ---    29.7
2008   2   7.3   1.9    8   ---    71.9
2008   3   6.2   0.3   13   ---   101.4
2008   4   8.6   2.1    5   ---   128.6
2008   5  15.8   7.7    0   ---   180.4

NBA5683
GG YSCBNOCX YUZZNCLX
012322 YBBBZEZX
C0120/10 NOTAMR C0119/10
Q) YBBB/QXXXX/IV/NBO/A/000/999/1653S14545E
A) YBCS B) 1003012322 C) 1003050930 EST
D) DAILY 0800/0930 1800/2100
E) INCREASED FLYING FOX ACTIVITY

Data Integration
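Before a NOTAM can be integrated with XML-based data, its lettered items have to be separated. A minimal sketch follows, assuming the usual layout of a header line followed by items `Q)`, `A)`, `B)`, `C)` and `E)`. The function name, the dictionary keys and the qualifier field names are illustrative, and the regular expression is a simplification that would be confused by free text containing a token such as `B)`.

```python
import re

NOTAM_TEXT = """A0312/08 NOTAMN
Q) LKAA/QFAXX/IV/BO/A/000/999/5006N01415E005
A) LKPR B) 0803230000 C) PERM
E) NEW POSTAL ADDRESS OF LKPR AD"""

# An item starts with a letter and ")" at the start of the text or after whitespace.
ITEM_RE = re.compile(r"(?:^|\s)([A-GQ])\)\s*")

# Illustrative names for the slash-separated parts of the Q) line.
Q_FIELDS = ("fir", "code", "traffic", "purpose", "scope", "lower", "upper", "position")


def parse_notam(text: str) -> dict:
    """Split a NOTAM into its header, lettered items and Q-line qualifiers."""
    parts = ITEM_RE.split(text.strip())
    number, kind = parts[0].split()
    # re.split with one capture group yields [header, letter, value, letter, value, ...]
    items = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))
    qualifiers = dict(zip(Q_FIELDS, items["Q"].split("/")))
    return {"number": number, "type": kind, "items": items, "qualifiers": qualifiers}


notam = parse_notam(NOTAM_TEXT)
print(notam["items"]["A"], notam["items"]["B"], notam["items"]["C"])
```

Item A) gives the location and items B) and C) the validity period, which are the fields a later integration step can match against reference and weather data.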

Page 17:

The layering design approach enables the integration of AIXM data with other Aeronautical XML based data

Data Integration

Data types | SPATIAL | TEMPORAL
NOTAM message | location coordinates, area coordinates | valid start and end dates, duration
AVIATION REFERENCE data | location coordinates, area coordinates, shape | valid start and end dates, permanent or temporary
WEATHER data | location coordinates, area coordinates, temperature, pressure | valid start and end dates, duration

Table 1: Aviation data to be integrated, with temporal and spatial integration points

X1234/09 NOTAM
Q) YMML/QMRXX/IV/NBO/A/00/999/3767S14484E002/
A) YMML B) 07068:0:0 C) 070610:0:0 EST
E) RWY 16/34 CONDITIONAL DUE TO RESURFACING

<AIRPORT_HELIPORT num="2">
  <AH_UUID>16468</AH_UUID>
  <NAME>MELBOURNE</NAME>
  <DESIGNATOR>YMML</DESIGNATOR>
  <RUNWAY_FULL_CODE>16/34</RUNWAY_FULL_CODE>
  <RUN_DIR_VID>11781</RUN_DIR_VID>
  <AH_USG_LIM_CODE>CONDITIONAL</AH_USG_LIM_CODE>
  <AH_WARN_DESCR>Resurfacing</AH_WARN_DESCR>
</AIRPORT_HELIPORT>
</ALL_AIRPORT_HELIPORT>
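The spatial and temporal integration points can be sketched as a simple join: a NOTAM applies to an airport when its location indicator matches the airport designator (spatial point) and the query time falls within its validity period (temporal point). The class names, the function name and all timestamps below are hypothetical. The dates on the slide are not in a form that can be read reliably, so they are not reproduced.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Notam:
    number: str
    location: str          # item A): location indicator, the spatial integration point
    valid_from: datetime   # item B): start of validity
    valid_to: datetime     # item C): end of validity
    text: str              # item E): free text


@dataclass
class Airport:
    designator: str
    name: str


def notams_in_force(airport: Airport, notams: list, at: datetime) -> list:
    """NOTAMs for this airport whose validity period covers the instant `at`."""
    return [
        n for n in notams
        if n.location == airport.designator and n.valid_from <= at <= n.valid_to
    ]


# Hypothetical timestamps and a hypothetical second NOTAM, for illustration only.
melbourne = Airport("YMML", "MELBOURNE")
notams = [
    Notam("X1234/09", "YMML", datetime(2009, 6, 7, 8), datetime(2009, 6, 7, 10),
          "RWY 16/34 CONDITIONAL DUE TO RESURFACING"),
    Notam("X9999/09", "YBCS", datetime(2009, 6, 7, 8), datetime(2009, 6, 7, 10),
          "HYPOTHETICAL NOTAM FOR ANOTHER AIRPORT"),
]

in_force = notams_in_force(melbourne, notams, datetime(2009, 6, 7, 9))
print([n.number for n in in_force])
```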

Page 18:

• Global semantic representation: ontology or schema mediator

Page 19:

The Ontology

Ontology Definition
• O = (C, H, R, P, I, A), where
• C = a set of entities in the ontology (classes and instances)
• H = a set of taxonomic relationships between concepts
• R = a set of non-taxonomic relationships between ontology concepts
• P = a set of properties of ontology concept entities, each connecting a class property to a datatype
• I = a set of ontology instance declarations (the relationships of an instance with its class, its properties and values, and other instances)
• A = a set of axioms and rules that allow consistency checking of the ontology and inference of new knowledge through an inferencing mechanism
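One way to picture the six components is as plain collections. The sketch below is illustrative only and is not taken from the cited work: the encodings chosen for each component (pairs, triples, callables) and the example axiom are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """O = (C, H, R, P, I, A) held in plain Python collections (illustrative only)."""
    C: set = field(default_factory=set)    # entities (classes and instances)
    H: set = field(default_factory=set)    # taxonomic pairs (child, parent)
    R: set = field(default_factory=set)    # non-taxonomic triples (subject, relation, object)
    P: dict = field(default_factory=dict)  # (class, property) -> datatype
    I: set = field(default_factory=set)    # instance declarations (instance, predicate, value)
    A: list = field(default_factory=list)  # axioms: callables taking the ontology, returning bool

    def is_consistent(self) -> bool:
        """Run every axiom as a consistency check."""
        return all(axiom(self) for axiom in self.A)


def taxonomy_uses_known_entities(o: Ontology) -> bool:
    """Example axiom: both ends of every taxonomic relationship must be in C."""
    return all(child in o.C and parent in o.C for child, parent in o.H)


aviation = Ontology(
    C={"Facility", "Airport", "Runway", "NADI"},
    H={("Airport", "Facility")},
    R={("Airport", "hasRunway", "Runway")},
    P={("Airport", "designator"): "string"},
    I={("NADI", "instanceOf", "Airport"), ("NADI", "designator", "NFFN")},
    A=[taxonomy_uses_known_entities],
)

print(aviation.is_consistent())
```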

Page 20:

The Ontology

Ontology Development
• via Domain Expert
    • From scratch
    • Mostly manual and time consuming
    • Valid and rich knowledge within ontology
• via Data Transformation
    • Existing data required
    • Based on specific data format transformation (e.g. RDB and XML)
    • Automatic
    • Knowledge richness limited to database content

Page 21:

What we need…

• Global common knowledge

• Local ontology development may not be shared globally

• The value of local knowledge for global development

• Rich and valid knowledge

• Automatic development process

• Does not rely on the availability of domain expert

• Domain experts are not always present

• Immediate development

• Maintainable knowledge

Page 22:

A Data-Driven Dynamic Common Ontology

The Concept
• (i) create common ontologies automatically from community knowledge representations, and
• (ii) maintain their content by capturing dynamic knowledge changes and updates specific to the community, and by capturing recent world updates (e.g. through social media and news).
• Content updates are done through propagation and enrichment.


Page 24:

A Data-Driven Dynamic Common Ontology

The Creation
• CO = (C, H, R, P, I, A, S, CV), where
• S = a set of similarity values between ontology knowledge components (classes, instances, non-taxonomic relationships, and properties) and their respective similar external ontology knowledge components.
• CV = a set of confidence values Cv attached to ontology instance knowledge; each is the ratio of the number of knowledge sources that mention a piece of knowledge to the total number of knowledge sources.
• Why Confidence Value (CV)?
• Knowledge stability assurance. Newly extracted knowledge is not always the best knowledge, and a particular piece of knowledge from one community may not necessarily become part of the global community knowledge representation.
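The confidence value is a plain ratio, so it can be sketched in a few lines. The representation of a knowledge source as a set of triples, the function name and the example facts are assumptions made for illustration.

```python
def confidence_value(knowledge, sources) -> float:
    """Ratio of sources that mention `knowledge` to the total number of sources."""
    if not sources:
        raise ValueError("at least one knowledge source is required")
    mentions = sum(1 for source in sources if knowledge in source)
    return mentions / len(sources)


# Hypothetical knowledge sources, each a set of (subject, predicate, object) triples.
fact = ("NADI", "instanceOf", "Airport")
sources = [
    {fact, ("NADI", "locatedIn", "Fiji")},
    {fact},
    {("MELBOURNE", "instanceOf", "Airport")},
]

cv = confidence_value(fact, sources)  # two of the three sources mention the fact
print(round(cv, 3))
```

A fact mentioned by only one community scores low, which is how the confidence value guards against promoting unstable local knowledge to the common ontology.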

Page 25:

A Data-Driven Dynamic Common Ontology

The Propagation
• Why?
    • Frequent change in community
    • Validity assurance from the knowledge source
• How?
    • Using a delta script
    • A delta script is very useful when the original file is located elsewhere or in a distributed environment, since sending the whole updated file consumes resources and increases the chance of information loss
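The idea of propagating only the changes can be sketched as follows. The slides do not specify the delta script format, so this sketch assumes a simple set difference over triples; the function names and the example triples are hypothetical.

```python
def make_delta(old: set, new: set) -> dict:
    """Describe the change from `old` to `new` as additions and removals."""
    return {"add": new - old, "remove": old - new}


def apply_delta(base: set, delta: dict) -> set:
    """Apply a delta to a copy of the knowledge held at the receiving side."""
    return (base - delta["remove"]) | delta["add"]


# Hypothetical local ontology content before and after a community update.
local_v1 = {
    ("Airport", "subClassOf", "Facility"),
    ("NADI", "status", "open"),
}
local_v2 = {
    ("Airport", "subClassOf", "Facility"),
    ("NADI", "status", "conditional"),
    ("NADI", "designator", "NFFN"),
}

delta = make_delta(local_v1, local_v2)          # only three triples travel over the network
common = apply_delta(set(local_v1), delta)      # the remote copy catches up
print(common == local_v2)
```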

Page 26:

A Data-Driven Dynamic Common Ontology

The Enrichment
• Why?
    • Global knowledge update
    • Validity assurance from the global understanding
• How?
    • Take a RECENT related document (e.g. a recent news article)
    • Ontology + linguistic pattern-based extraction
    • Self-enrichment: find related recent documents by exploiting keywords from the common ontology.
• Domain independent
• Confidence value is considered

Page 27:

References

Fudholi, D.H., Rahayu, W., Pardede, E. (2015). A data-driven dynamic ontology. J. Information Science 41(3): 383-398.

Fudholi, D.H., Rahayu, W., Pardede, E. (2014). CODE (Common Ontology DEvelopment): A Knowledge Integration Approach from Multiple Ontologies. IEEE AINA 2014: 751-758, Victoria, Canada.

Fudholi, D.H., Rahayu, W., Pardede, E., Hendrik. (2013). A Data-Driven Approach toward Building Dynamic Ontology. ICT-EurAsia 2013: 223-232, Indonesia.

Page 28:

• Global semantic representation: ontology or schema mediator

Page 29:

o Data can arrive from various heterogeneous data sources.

o Data from different sources have different structures.

o In most cases the underlying data is quite similar. But as the structures are different, conflicts arise.

Consolidating Data Sources
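A structural conflict of this kind, and its resolution through a mediator, can be sketched as follows. This is a simplified illustration and not the double-layered schema integration method of the cited papers. The two source layouts, the global schema and the mapping table are hypothetical.

```python
# Two sources describe the same kind of entity with different structures.
SOURCE_A = [{"airport": {"code": "YMML", "name": "MELBOURNE"}}]   # nested layout
SOURCE_B = [{"DESIGNATOR": "NFFN", "NAME": "NADI"}]               # flat layout

# Mediator mappings: global attribute -> path to the value in each source.
MAPPINGS = {
    "A": {"designator": ("airport", "code"), "name": ("airport", "name")},
    "B": {"designator": ("DESIGNATOR",), "name": ("NAME",)},
}


def get_path(record: dict, path: tuple):
    """Follow a sequence of keys into a nested record."""
    for key in path:
        record = record[key]
    return record


def mediate(source_id: str, records: list) -> list:
    """Rewrite source records into the global schema."""
    mapping = MAPPINGS[source_id]
    return [{attr: get_path(r, path) for attr, path in mapping.items()} for r in records]


consolidated = mediate("A", SOURCE_A) + mediate("B", SOURCE_B)
print(consolidated)
```

The underlying data is similar in both sources; only the mapping table has to know about the structural differences.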

Page 30:

Consolidating Data Sources

Page 31:

References

Nguyen, H.Q., Taniar, D., Rahayu, J.W., Nguyen, K. (2011). Double-layered schema integration of heterogeneous XML sources. Journal of Systems and Software 84(1): 63-76.

Nguyen, H., Rahayu, J.W., Taniar, D., Nguyen, K. (2008). Mediation-based XML query answerability. Proceedings of the OTM 2008 Confederated International Conferences: On the Move to Meaningful Internet Systems (OTM 2008), 9-14 November 2008, Springer, Berlin, Germany, pp. 1550-1558.

Page 32:

• “Loose coupling” integration: data virtualization, data container

Page 33:

Data Container *

• The era of large heterogeneous data collection – moving from Databases to Data container

• Data container – contains a collection of resources, each of which has a unique reference/identifier

• The resources in a Data container can be: databases, database relations, database tuples, files, records in files, data streams, social media documents, parts of texts, maps, trajectories, etc.

* Reference from Prof. Timos Sellis – Data Ecosystem - From Very Large Databases to Big Data Infrastructure, La Trobe, November 2015
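A data container in this sense can be pictured as a registry in which every resource, whatever its type, receives a unique reference. The sketch below is an assumption-laden illustration: the class, its methods and the use of random UUIDs as identifiers are not taken from the referenced talk.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DataContainer:
    """A collection of heterogeneous resources, each with a unique reference."""
    resources: dict = field(default_factory=dict)  # reference -> (kind, payload)

    def add(self, kind: str, payload) -> str:
        """Store a resource and return its unique reference."""
        ref = uuid.uuid4().hex
        self.resources[ref] = (kind, payload)
        return ref

    def get(self, ref: str):
        """Return the payload stored under a reference."""
        return self.resources[ref][1]

    def of_kind(self, kind: str) -> list:
        """References of all resources of one kind."""
        return [ref for ref, (k, _) in self.resources.items() if k == kind]


container = DataContainer()
tuple_ref = container.add("database tuple", ("YMML", "MELBOURNE"))
text_ref = container.add("part of text", "RWY 16/34 CONDITIONAL DUE TO RESURFACING")
stream_ref = container.add("data stream", iter([5.0, 7.3, 6.2]))

print(container.get(tuple_ref))
```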

Page 34:

Data Container *

User Query

Result

Page 35:

o The data may arrive from various data sources from different locations.

o Data from multiple data sources are integrated and aggregated on the fly.

o The user experiences what appears to be a real data warehouse: the user does not need to know where the data comes from, only that it is available.
o Further benefits:
    o Real-time availability of information for decision support.
    o Data is less stale.
    o Data can be accessed instantly.

Data Warehouse Virtualization
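The on-the-fly integration can be sketched with a small virtual layer that holds no data of its own and pulls from the registered sources at query time. The class, its methods and the sales figures are hypothetical; a real data virtualization product would push queries down to remote systems rather than iterate over Python lists.

```python
class VirtualWarehouse:
    """Integrates and aggregates registered sources at query time; stores nothing."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch):
        """`fetch` is a zero-argument callable returning the source's current rows."""
        self._sources[name] = fetch

    def rows(self):
        for fetch in self._sources.values():
            yield from fetch()

    def total(self, key, measure):
        """Aggregate a measure per key across all sources."""
        totals = {}
        for row in self.rows():
            totals[row[key]] = totals.get(row[key], 0) + row[measure]
        return totals


# Hypothetical operational sources in two locations.
melbourne_sales = [{"region": "VIC", "amount": 100}]
sydney_sales = [{"region": "NSW", "amount": 70}, {"region": "VIC", "amount": 5}]

warehouse = VirtualWarehouse()
warehouse.register("melbourne", lambda: melbourne_sales)
warehouse.register("sydney", lambda: sydney_sales)

before = warehouse.total("region", "amount")
melbourne_sales.append({"region": "VIC", "amount": 20})  # a new sale at the source
after = warehouse.total("region", "amount")              # visible without any reload
print(before, after)
```

Because nothing is copied ahead of time, the second query reflects the new sale immediately, which is the "less stale" property listed on the slide.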

Page 36:

Current Data Warehousing Trends in Industry

According to the latest Gartner Study (2015), Data Warehouses can be classified into four main categories:

1. Traditional Data Warehouse

o Consolidates and stores historical data that arrive from various data sources

2. Operational Data Warehouse

o Data is structured and continuously loaded to support operational queries

3. Logical Data Warehouse

o Structured data and other content data types
o Utilizes Data Virtualization

4. Data Lake

o Uses flat architecture to store data in its original format
o Supports ‘Schema on Read’ capabilities
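‘Schema on Read’ means the data is stored as it arrived and a schema is applied only when a consumer reads it. A minimal sketch, assuming raw JSON records and a schema expressed as a mapping from field name to a casting function; the records, field names and function name are hypothetical.

```python
import json

# Raw records kept in their original format; no schema is enforced when storing.
RAW_LAKE = [
    '{"designator": "NFFN", "elevation": "59", "uom": "FT"}',
    '{"designator": "YMML", "name": "MELBOURNE"}',
]


def read_with_schema(raw_records, schema):
    """Apply a schema (field -> cast) while reading; missing fields become None."""
    for raw in raw_records:
        doc = json.loads(raw)
        yield {
            name: (cast(doc[name]) if name in doc else None)
            for name, cast in schema.items()
        }


# Two consumers read the same raw data through different schemas.
elevation_view = list(read_with_schema(RAW_LAKE, {"designator": str, "elevation": float}))
name_view = list(read_with_schema(RAW_LAKE, {"designator": str, "name": str}))
print(elevation_view)
print(name_view)
```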

Page 37:

Traditional Data Warehouse

Page 38:

Dynamic Data Warehouse

E Chang, W. Rahayu, M. Diallo, M. Machizaud: Dynamic Data Mart for Business Intelligence. IFIP AI 2015: 50-63.

Page 39:

o 3M: Data Mining, Data Marshalling, and Data Meshing

o 3R: Recommendation, Reconciliation, Representation

Dynamic Data Warehouse

E Chang, W. Rahayu, M. Diallo, M. Machizaud: Dynamic Data Mart for Business Intelligence. IFIP AI 2015: 50-63.

Page 40:

New Trends in Data Warehousing

Microsoft | IBM | Oracle | Cisco | SAP

• Real-time Data Warehousing
• Support new data types
• Support for cloud data
• Data Lake
• Real-time Data Warehousing
• Logical Big Data Warehousing
• Support for complex, structured and unstructured data
• Support for Big Data
• Logical Data Warehousing
• Data Lake

Page 41:

Finally…

Integration can be achieved through:
• Standardization – suitable for domain-specific data sources, since it relies on the availability of a standard
• Global semantic representation – suitable for data sources with an inherent common knowledge
• “Loose coupling” integration – suitable for large heterogeneous data sources/data containers with a dynamic nature (frequent changes)

Page 42:

Thank you

CRICOS Provider 00115M