CURARE: CURATING AND MANAGING BIG DATA COLLECTIONS ON THE CLOUD
(original French title: CURARE : CURATION ET GESTION DE GRANDES COLLECTIONS DE DONNÉES SUR LE NUAGE)
Gavin Robert KEMP, LIRIS, [email protected]
SUPERVISORS
PARISA GHODOUS, PR. U. LYON I, LIRIS
CATARINA FERREIRA DA SILVA, MCF. U. LYON I, LIRIS
GENOVEVA VARGAS-SOLAR, CR. CNRS, LIG-LAFMIA
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
– Data Lakes
– Data Curation: Workflow and Approaches
– Curation Architectures and Platforms
§ CURARE: Service Oriented Architecture for Curating Data Collections
– Approach for Curating Data Collections
– Data Collection and View Model
– Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
28/09/2018
CONTEXT & MOTIVATION
“Data is everything and everything is data”, Pythian
Turning real-world phenomena into data thanks to the Big Data trend
BIG DATA DEFINITION
§ Data collections with characteristics that make them difficult to process on single machines or with traditional databases
§ A new generation of tools, methods and technologies to collect, process and analyse massive data collections
→ Tools imposing the use of parallel processing and distributed storage
Context & Problem Statement | State of the Art | CURARE | Implementation & Experimentation | Conclusion & Perspectives
BIG DATA PROPERTIES
- Volume (size)
- Velocity (production rate)
- Variety (data types & format)
- Variability (inconsistencies caused by constantly changing meaning)
- Veracity (truth and consistency)
- Value (how much information)
V's models: 3V, 4V, 5V, ..., 10V [Jagadish 2014]
“Big Data can really be very small and not all large datasets are big!”, Mike 2.0 [Hillard 2012]
BIG DATA LIFE CYCLE
Phases: Harvesting & Cleaning, Preserving, Processing & Analysis, Exploiting
Data collections storage: NoSQL, NewSQL (Hive), data sharding (MongoDB, CouchDB)
Data processing: data analytics [Feldman et al. 2013], data aggregation [Jayram et al. 2007]
CLOUD COMPUTING AND BIG DATA
Ready-to-use environments for storing Big Data & running greedy processing tasks
§ Software as a Service (SaaS): fully functional software
§ Infrastructure as a Service (IaaS): CPU, RAM, disk
§ Platform as a Service (PaaS): database systems, frameworks
THE RIGHT DATA FOR THE RIGHT ANALYTICS
Different sizes, evolution in structure, completeness, production conditions & content, modification of access policies …
What information is available in these collections?
Is the data clean? If not, what has to be done to make it clean?
Are there relations between data collections which could be exploited for better prediction?
What are the update rates and in what way do they affect the collection?
DATA CURATION
Data management through its lifecycle of interest & usefulness
§ Enable data discovery & retrieval
§ Maintain data quality
§ Add value
§ Provide for re-use over time
PROBLEM STATEMENT
Define a meta-data extraction model for data collections that supports decision making related to their exploitation
RESEARCH QUESTIONS
§ Model in an integrated manner structural, semantic & quantitative meta-data?
§ Design a cloud service oriented architecture for enabling data curation considering variety and variability?
§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
– Data Lakes
– Data Curation: Workflow and Approaches
– Curation Architectures and Platforms
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
DATA LAKE
Centralized repository containing virtually inexhaustible amounts of raw data to be analysed
Sources (figure): CRM, sensor logs
Data wrangling [Terrizzano et al. 2015]; Constance [Hai et al. 2016]
DATA SWAMP
Repositories grow ever bigger and more complex, to the point that a lake becomes a swamp
Sources (figure): CRM, sensor logs
DATA CURATION WORKFLOW
Step 1: Harvesting
Step 2: Extracting meta-data
Step 3: Exploitation
Activities (figure labels): Selecting, Vetting, Collecting, Preserving, Describing, Grooming, Provisioning
Data wrangling [Terrizzano et al. 2015]
EXTRACTING META-DATA
Workflow (figure labels): Harvesting; Extracting meta-data (Preserving, Describing); Exploring
Levels of meta-data information [Stonebraker et al. 2015]
Level 1, no meta-data: sensing & collecting raw data
Level 2, partial meta-data: size, frequency, freshness, type
Level 3, complete meta-data: quantitative & qualitative data + semantics
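The Level 2 quantities named on this slide (size, frequency, freshness, type) can be illustrated with a minimal Python sketch. The function name `partial_metadata`, the timestamp format and the sample records are assumptions made for illustration only; they are not taken from the CURARE prototype.

```python
from datetime import datetime
from typing import Any


def partial_metadata(records: list[dict[str, Any]], time_key: str,
                     now: datetime) -> dict[str, Any]:
    """Compute Level 2 style meta-data for a list of records (illustrative only)."""
    # Assumption: timestamps are strings such as "2016-06-07 19:40:00".
    stamps = sorted(datetime.strptime(r[time_key], "%Y-%m-%d %H:%M:%S")
                    for r in records)
    span_seconds = (stamps[-1] - stamps[0]).total_seconds()
    # Frequency: records per hour over the observed span (None if the span is empty).
    frequency = len(records) / (span_seconds / 3600) if span_seconds > 0 else None
    # Freshness: seconds elapsed since the most recent record.
    freshness = (now - stamps[-1]).total_seconds()
    # Type: names of the Python types observed for each attribute.
    types: dict[str, set[str]] = {}
    for record in records:
        for key, value in record.items():
            types.setdefault(key, set()).add(type(value).__name__)
    return {
        "size": len(records),
        "frequency_per_hour": frequency,
        "freshness_seconds": freshness,
        "types": {key: sorted(names) for key, names in types.items()},
    }


# Hypothetical sample records, loosely inspired by traffic events.
records = [
    {"id": "a", "starttime": "2016-06-07 19:40:00", "lanes": 0},
    {"id": "b", "starttime": "2016-06-07 20:40:00", "lanes": 1},
    {"id": "c", "starttime": "2016-06-07 21:40:00", "lanes": "2"},
]
meta = partial_metadata(records, "starttime", now=datetime(2016, 6, 7, 22, 40, 0))
print(meta)
```

An attribute with more than one observed type (here `lanes`) is the kind of signal that could flag a collection as needing cleaning before analysis.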
APPROACHES FOR EXTRACTING META-DATA

| Approach | Principle | Pros | Cons |
|---|---|---|---|
| Curation at Source [Curry et al. 2010] | Simple tagging & information extraction directly from the sensor | Pre-processes data according to samples | No awareness of the whole data collection |
| Master Data Management [Weatherspoon et al. 2013] | Provide a standard vocabulary at the company scale | Standardizes the language used in data collections | Does not improve data structures |
| Semantic linking, Constance [Hai et al. 2016] | Identify attributes referring to similar topics in different data sets | Explores data collections | Does not improve data structures |
| Data set clustering, Goods [Halevy 2016] | Discover similar data sets using clustering | Creates a synthesized representation of several data collections | Difficult to define a similarity criterion to cluster data with low quality (missing & null values, types) |
| Crowdsourcing and collaboration spaces [Doan et al. 2005] | Communities produce, maintain & tag data (crowdsourcing) | Improves the quality of raw data | Manual & requires a lot of human resources |

Level 1: simple tags as data is harvested
Level 2: pivot vocabulary & semantics
Level 3: manual and collaborative
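To make the semantic linking principle ("identify attributes referring to similar topics in different data sets") more concrete, here is a deliberately naive Python sketch that links attributes by name similarity. It is not the method used by Constance; the function `link_attributes`, the 0.7 threshold and the attribute lists are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def link_attributes(left: list[str], right: list[str],
                    threshold: float = 0.7) -> list[tuple[str, str, float]]:
    """Pair attribute names from two data sets whose spellings are similar."""
    links = []
    for a in left:
        for b in right:
            # Similarity ratio in [0, 1] computed on lower-cased names.
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                links.append((a, b, score))
    return links


# Hypothetical attribute names from two collections.
traffic_attributes = ["starttime", "endtime", "townname"]
tweet_attributes = ["start_time", "user", "town"]
links = link_attributes(traffic_attributes, tweet_attributes)
for a, b, score in links:
    print(f"{a} <-> {b} ({score:.2f})")
```

Name similarity alone misses synonyms and produces false matches, which is one reason real systems combine it with vocabularies or value-level evidence.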
DATA CURATION PLATFORMS

| Technique | Idea | Pros | Cons |
|---|---|---|---|
| Big Data Stacks (Hadoop, Berkeley, Asterix, Spark) | Provide data processing tools using parallel programming models; support query languages & visualisation tools | Versatile, general purpose | Low level of adaptability; centred on analysis, not on data collections exploration |
| Service based architectures [CoreKG Beheshti 2018, Lord 2004] | Provide independent services implementing the steps of the data curation workflow | Services make it possible to adapt the tools to the data collections; good maintainability | Services assembly is up to the programmer |
| Data curation network [Alfred P. Sloan Foundation 2017] | Provide a Web portal that integrates tools from libraries willing to curate data collections | General view of tools and data collections; runs on the Web; collaborative curation | No integration of tools; the curation process is not explicit; no automatic control on who does what |
LIMITATIONS OF THE STATE OF THE ART (1/2)
§ Model in an integrated manner structural, semantic & quantitative meta-data?
– No integrated data curation model to explore data collections in search of well-adapted analytics input
§ Design a cloud service oriented architecture for enabling data curation considering variety and variability?
– No elastic architecture
– No target solution for data curation
LIMITATIONS OF THE STATE OF THE ART (2/2)
§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?
– Not enough quantitative meta-data maintenance to make technical decisions
– Domain dependent curation
• no explicit data curation workflow
• exploration based on visualization
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
– Approach for Curating Data Collections
– Data Collection and View Model
– Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
TAILORING BIG DATA CURATION
Propose a service-based data curation system providing data analysts with
– Abstract information about the content of data collections and their associated releases
– Tools for curating data collections
– Strategies for managing meta-data related to data collections
CONTRIBUTIONS AND RESULTS
§ Data collection & View model; data curation formalisation
§ CURARE: Cloud Service Oriented Architecture for storing & curating data collections
§ Prototype on MongoDB & OpenStack; experiments on Grand Lyon and Twitter urban transport data collections
DATA CURATION APPROACH
Extracting as much meta-data as possible to
§ Make technical decisions
§ Choose data exploitation & analysis strategies
Figure labels: DATA COLLECTION MODEL, VIEW MODEL; Structural meta-data, Quantitative meta-data; Harvesting, Data Collections, Extraction; Release & View: curated data collection; Views exploration assists decision making
Services supporting the approach (figure labels)
– Data harvesting services
– Data processing services (meta-data extraction)
– Storage services
– Curated data collections
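The chaining of harvesting, processing and storage services can be sketched in a few lines of Python. The functions below are plain in-process stand-ins for services, and the names `harvest`, `extract_metadata`, `store` and `curate` are hypothetical; the slides do not specify the service interfaces.

```python
from typing import Any, Callable

Record = dict[str, Any]


def harvest(source: Callable[[], list[Record]]) -> list[Record]:
    """Harvesting service stand-in: pull raw records from a source."""
    return list(source())


def extract_metadata(records: list[Record]) -> dict[str, Any]:
    """Processing service stand-in: derive simple meta-data from the records."""
    attributes = sorted({key for record in records for key in record})
    return {"size": len(records), "attributes": attributes}


def store(repository: dict[str, Any], name: str,
          records: list[Record], metadata: dict[str, Any]) -> None:
    """Storage service stand-in: keep the records together with their meta-data."""
    repository[name] = {"items": records, "metadata": metadata}


def curate(name: str, source: Callable[[], list[Record]],
           repository: dict[str, Any]) -> dict[str, Any]:
    """Chain the three stand-ins and return the curated entry."""
    records = harvest(source)
    store(repository, name, records, extract_metadata(records))
    return repository[name]


# Hypothetical source returning two heterogeneous records.
repository: dict[str, Any] = {}
entry = curate("demo", lambda: [{"bus": 2, "loc": "univ"}, {"temp": 21}], repository)
print(entry["metadata"])
```

Keeping each step behind its own function mirrors the idea that services can be replaced or scaled independently of one another.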
CURARE: Cloud Service Oriented Architecture for Curating Data Collections
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
– Approach for Curating Data Collections
– Data Collection and View Model
– Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
DATA COLLECTION: STRUCTURAL META DATA
DataCollection
- id: URL
- name: String
- provider: String
- licence: [public, restricted]
- size: Num
- author: String
- description: String
Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num
DataItem
- id: URL
- name: String
- attributs: List()
Associations: DataCollection 1 to 1..n Release (releases); Release 1 to 1..n DataItem (items)
Origin of the meta-data (figure legend): retrieved from metadata & the provider; manual; extracted
Example (figure): data produced at t+1, size: 2; data produced at t+2, size: 3
Example items (figure): Id: doc1, Bus: 2, Loc: univ | Id: doc2, N°car: 5, Red: true | Id: doc2, temp: 21, humidi: 61
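The three classes and their 1 to 1..n associations can be written down as Python dataclasses. This is a sketch under assumptions: URLs are represented as strings, `size` is computed as a count of items (the slide does not state its unit), and the example URLs are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Literal


@dataclass
class DataItem:
    id: str  # URL identifying the item
    name: str
    attributs: list[str] = field(default_factory=list)  # spelling kept from the model


@dataclass
class Release:
    id: str  # URL identifying the release
    release: int  # release number
    publicationDate: date
    items: list[DataItem] = field(default_factory=list)

    @property
    def size(self) -> int:
        # Assumption: size is the number of items in the release.
        return len(self.items)


@dataclass
class DataCollection:
    id: str  # URL identifying the collection
    name: str
    provider: str
    licence: Literal["public", "restricted"]
    author: str
    description: str
    releases: list[Release] = field(default_factory=list)

    @property
    def size(self) -> int:
        # Assumption: collection size is the sum of its release sizes.
        return sum(release.size for release in self.releases)


r1 = Release(
    id="http://example.org/demo/releases/1",
    release=1,
    publicationDate=date(2016, 6, 7),
    items=[
        DataItem("http://example.org/demo/items/doc1", "doc1", ["Bus", "Loc"]),
        DataItem("http://example.org/demo/items/doc2", "doc2", ["temp", "humidi"]),
    ],
)
collection = DataCollection(
    id="http://example.org/demo",
    name="demo",
    provider="example provider",
    licence="public",
    author="unknown",
    description="Illustrative collection with a single release",
    releases=[r1],
)
print(r1.size, collection.size)
```

Deriving `size` from the contained items avoids storing a value that can drift out of date; a real implementation might store it instead to avoid loading every item.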
THE GRAND LYON TRAFFIC DATA COLLECTION
Data collection: one release per day (R1 ... Rn)
Items of a release: JSON documents
Each item holds: geographic coordinates, road logic name, observation data, traffic event description
Item in R1:
{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
 "full_location": "Autoroute du Soleil, 69005 Lyon",
 "_id": "Criter11185353",
 "properties": {"confidentiality": "noRestriction", "probability": "certain", "mobility": "",
  "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
  "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
  "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
  "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
  "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
  "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "",
  "last_update_fme": "2016-06-07 19:44:29", "endtime": "", "creationreference": "",
  "informationstatus": "real", "townname": "Voie Rapide Urbaine de Lyon",
  "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon", "roadmaintenancetype": "",
  "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258",
  "abnormaltraffictype": ""}
}
DATA RELEASE DATA ITEM: TRAFFIC EVENT
JSON document: traffic event item "Criter11185353" (the item in R1 of the Grand Lyon traffic data collection)
Structure description: geographic coordinates, road logic name, observation data, traffic event description
DataItem
- id: URL
- name: String
- attributs: List()
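One way to derive a DataItem structure description from a JSON document is to list its attribute paths. The sketch below is an assumption about how such a description could be built, not the prototype's code: `attribute_paths`, `describe_item` and the base URL are hypothetical, and the sample document is a shortened version of the traffic event.

```python
import json
from typing import Any


def attribute_paths(document: dict[str, Any], prefix: str = "") -> list[str]:
    """List dotted paths of all leaf attributes, descending into nested objects."""
    paths: list[str] = []
    for key, value in document.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths.extend(attribute_paths(value, prefix=f"{path}."))
        else:
            paths.append(path)
    return paths


def describe_item(document: dict[str, Any], base_url: str) -> dict[str, Any]:
    """Build a DataItem-like description: id (URL), name and attribute list."""
    return {
        "id": f"{base_url}/{document['_id']}",
        "name": document["_id"],
        "attributs": attribute_paths(document),
    }


# Shortened traffic event; only a few of the original properties are kept.
raw = (
    '{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]}, '
    '"full_location": "Autoroute du Soleil, 69005 Lyon", '
    '"_id": "Criter11185353", '
    '"properties": {"type": "VehicleObstruction", "status": "active"}}'
)
item = describe_item(json.loads(raw), base_url="http://example.org/items")
print(item["attributs"])
```

Arrays such as `coordinates` are treated as leaf values here; a fuller description might also record element types.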
DATA COLLECTION RELEASE: TRAFFIC EVENTS IN A PERIOD OF TIME
28/09/2018
{"geometry": {"type": "Point", "coordinates": [
4.821773, 45.7513
]}, "full_location": "Autoroute du Soleil,
69005 Lyon", "_id": "Criter11185353", "properties": {"confidentiality": "noRestriction", "probability": "certain", "mobility": "", "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "", "observationtime": "2016-06-07
19:40:00", "last_update": "2016-06-07 19:43:30", "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
"firstsupplierversiontime": "2016-06-07 19:40:00",
"version": "1", "linkname": "", "type": "VehicleObstruction", "status": "active", "direction": "bothWays", "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "", "last_update_fme": "2016-06-07
19:44:29", "endtime": "", "creationreference": "", "informationstatus": "real", "townname": "Voie Rapide Urbaine de
Lyon", "publiccomment": "Bouchon, km 455|Voie
Rapide Urbaine de Lyon", "roadmaintenancetype": "", "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258", "abnormaltraffictype": ""}
}
{"geometry": {"type": "Point", "coordinates": [
4.821773, 45.7513
]}, "full_location": "Autoroute du Soleil,
69005 Lyon", "_id": "Criter11185353", "properties": {"confidentiality": "noRestriction", "probability": "certain", "mobility": "", "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "", "observationtime": "2016-06-07
19:40:00", "last_update": "2016-06-07 19:43:30", "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
"firstsupplierversiontime": "2016-06-07 19:40:00",
"version": "1", "linkname": "", "type": "VehicleObstruction", "status": "active", "direction": "bothWays", "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "", "last_update_fme": "2016-06-07
19:44:29", "endtime": "", "creationreference": "", "informationstatus": "real", "townname": "Voie Rapide Urbaine de
Lyon", "publiccomment": "Bouchon, km 455|Voie
Rapide Urbaine de Lyon", "roadmaintenancetype": "", "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258", "abnormaltraffictype": ""}
}
DataItem
- id: URL
- name: String
- attributs: List()
Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num
Association: Release (1) to DataItem (1..n), role "items"
Release R1
"_id" : "tweets_736241_collection_24_17021",
"id" : 17021,
"name" : "tweets_736241_collection_24_17021",
"publicationDate" : "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)",
"size" : 4386,
"url" : "localhosttweets_736241_collection_2417021",
"dataItems": [ List(dataItem) ]
DATA COLLECTION: GRAND LYON TRAFFIC EVENTS
DataCollection
- id: URL
- name: String
- provider: String
- licence: [public, restricted]
- size: Num
- author: String
- description: String
Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num
DataItem
- id: URL
- name: String
- attributs: List()
Associations: DataCollection (1) to Release (1..n), role "releases"; Release (1) to DataItem (1..n), role "items"
DataCollection<instance>
"_id" : "CollectionInfo",
"author" : "twitter",
"description" : "this is a Collection",
"id" : "...",
"licence" : "public",
"name" : "tweets",
"provider" : "twitter",
"releases" : [ List(Release) ],
"size" : 736241
Release<instance> (R1 ... Rn)
"_id" : "tweets_736241_collection_24_17021",
"id" : 17021,
"name" : "tweets_736241_collection_24_17021",
"publicationDate" : "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)",
"size" : 4386,
"url" : "localhosttweets_736241_collection_2417021",
"dataItems": [ List(dataItem) ]
DataItem<instance>
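The collection, release and item levels can be sketched as plain Python data classes. This is only an illustration using the field names shown on the slide. The `add_release` helper and the rule that sizes are derived by counting items are assumptions made for the sketch, not part of the presented model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataItem:
    id: str
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Release:
    id: str
    release: int
    publication_date: str
    items: List[DataItem] = field(default_factory=list)

    @property
    def size(self) -> int:
        # Assumption: the size of a release is its number of data items.
        return len(self.items)


@dataclass
class DataCollection:
    id: str
    name: str
    provider: str
    licence: str  # "public" or "restricted"
    author: str
    description: str
    releases: List[Release] = field(default_factory=list)

    def add_release(self, release: Release) -> None:
        # Hypothetical helper: releases are appended as they are published.
        self.releases.append(release)

    @property
    def size(self) -> int:
        # Assumption: the size of a collection is the sum of its release sizes.
        return sum(release.size for release in self.releases)
```

A collection then grows one release at a time, and each release keeps its own list of items.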
VIEW MODEL
Data structure providing an aggregated perspective of the content of a data collection and its associated releases
§ Describes the content of a data collection as a series of tuples <attribute, type>
§ For every attribute, the view describes
– basic statistics over the values of the attribute: min, max, mean, median, histogram, standard deviation
– null and missing values for every couple <attribute, type>, for every attribute in a document
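A minimal sketch of how such per-attribute statistics could be computed over a list of documents. The function name `describe_attribute` and the output keys are illustrative assumptions. The sketch assumes flat documents with scalar (hashable) values and uses the population standard deviation.

```python
from collections import Counter
from statistics import mean, median, pstdev

_MISSING = object()  # sentinel: the attribute is absent from a document


def describe_attribute(documents, attribute):
    """Summarise one attribute over a list of dict documents."""
    raw = [doc.get(attribute, _MISSING) for doc in documents]
    missing = sum(1 for value in raw if value is _MISSING)
    nulls = sum(1 for value in raw if value is None)
    values = [v for v in raw if v is not _MISSING and v is not None]

    descriptor = {
        "name": attribute,
        "count": len(values),
        "missing": missing,
        "nulls": nulls,
        "valueDistribution": dict(Counter(values)),
    }
    numeric = [
        v for v in values
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Numeric statistics only make sense when every value is a number.
    if numeric and len(numeric) == len(values):
        descriptor.update(
            type="number",
            min=min(numeric),
            max=max(numeric),
            mean=mean(numeric),
            median=median(numeric),
            std=pstdev(numeric),
        )
    return descriptor
```

Keeping null and missing counts separate matters here: a null is a value the producer explicitly left empty, while a missing attribute never appears in the document at all.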
VIEW: QUANTITATIVE META DATA
Legend: Extracted, Computed
View
- id: URL
- name: String
- provider: String
- author: String
- description: String
- code: Code
- releaseSelectionRules: Function
Function
- Source: String
- default: version.id
ReleaseView
- id: URL
- version: num
- publicationDate: date
- size: num
AttributeDescriptor
- id: URL
- name: String
- type: String
- valueDistribution: Hist
- nullValue: num
- absentValue: num
- minValue, maxValue: type
- mean, median, mode: type
- std: type
- count: num
Associations: View (1) to ReleaseView (0..n), role "releaseViews"; ReleaseView (1) to AttributeDescriptor (0..n), role "attributDescriptors"
Extension links (three on the diagram) relate these classes to Data Collection, Release and DataItem
Example releases: data produced at t+1 (size: 2); data produced at t+2 (size: 3)
Example data items: Id::doc1 (Bus: 2, Loc: univ); Id::doc2 (N°car: 5, Red: true); Id::doc2 (temp: 21, humidi: 61)
TRAFFIC EVENTS ATTRIBUTES STATISTICS: ATTRIBUTE DESCRIPTOR
Release R1
For each attribute across all items in a release Ri
Quantitative meta-data: statistical measures of each attribute
DataItem<instance> (one per item): at11, v11...; at12, v12...; at1n, v1n...
AttributDescriptor
{"_id" : "tweets_736241_viewMapReduce_24_17021.entities.hashtags.0.text.number",
 "value" : {
  "data": [9,9,9,9,9,9],
  "median": 9,
  "mean": 9,
  "count": 6,
  "valueDistribution": {"9" : 6},
  "type": "number",
  "mode": {"value": "9", "count": 6},
  "missing": 4380,
  "nulls": null }}
TRAFFIC EVENTS STATISTICS: RELEASE VIEW
Release Ri
Quantitative meta-data: aggregated measures of attribute statistics in all its data items
ReleaseView
"_id": "tweets_736241_viewMapReduce_24_17021",
"version": "17021",
"publicationDate": ISODate("2016-08-08T00:00:00Z"),
"size": 765,
"attributDescriptors": [List(AttributDescriptor)]
Diagram: ReleaseView (1) aggregates AttributDescriptor (0..n), role "attributDescriptors", holding the statistics of attributes a1 ... an
GRAND LYON TRAFFIC EVENT STATISTICS: VIEW
Data collection
Quantitative meta-data: aggregated measures of releases statistics
View
...
"code" : {
 "model-framework": "MongoMapReduceFinalize",
 "mapper": Mapper(),
 "reducer": Reducer(),
 "finalize": Finalizer() },
"rules": "function rule(version){\r\n\treturn version\r\n}"
Diagram: View (1) aggregates ReleaseView (0..n) over releases R1 ... Rn; each ReleaseView (1) aggregates AttributDescriptor (0..n), role "attributDescriptors", holding the statistics of attributes a1 ... an
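The mapper / reducer / finalize triple named in the view's "code" entry can be imitated in plain Python to show the idea. This is a sketch under assumptions, not the prototype's MongoDB code: the function bodies, the `type_name` helper and `compute_release_view` are hypothetical. The key format (dotted path plus a type suffix) and the relation missing = release size - count follow the AttributDescriptor example, where count 6 and missing 4380 add up to the release size 4386.

```python
from collections import defaultdict


def type_name(value):
    """Map a leaf value to the type suffix used in descriptor ids."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def mapper(node, prefix=""):
    """Emit (dotted.path.type, [value]) for every leaf of a nested document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from mapper(child, f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from mapper(child, f"{prefix}{index}.")
    else:
        yield f"{prefix}{type_name(node)}", [node]


def reducer(key, partial_lists):
    """Merge the value lists emitted for one key."""
    merged = []
    for part in partial_lists:
        merged.extend(part)
    return merged


def finalizer(key, data, release_size):
    """Turn the merged values of one key into a small descriptor."""
    return {
        "data": data,
        "count": len(data),
        "type": key.rsplit(".", 1)[-1],
        "missing": release_size - len(data),
    }


def compute_release_view(documents):
    """Run map, reduce and finalize over the documents of one release."""
    emitted = defaultdict(list)
    for document in documents:
        for key, values in mapper(document):
            emitted[key].append(values)
    size = len(documents)
    return {
        key: finalizer(key, reducer(key, parts), size)
        for key, parts in emitted.items()
    }
```

Because each attribute path is reduced independently, the same three functions could run in parallel over shards, which is the point of expressing the view as map-reduce.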
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
– Approach for Curating Data Collections
– Data Collection and View Model
– Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
SERVICES FOR CURATING DATA COLLECTIONS
Integrated Data Curation Platform
§ Data Harvesting Services: REST, FLUME
§ Data Cleaning & Processing Services: PIG, HADOOP
§ Data Storage Services: MongoDB, Neo4J, CouchDB (tagged data collections)
§ Data Analytics Services
§ Decision Making Support Services
Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira da Silva, P. Ghodous, C. Collet, P. Lopez. Concurrent Engineering 2015, Delft, Netherlands
CLOUD DEPLOYMENT ENVIRONMENT
Image modified from Windows data science virtual machine: https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/
CURARE CLOUD SERVICES
Integrated Data Curation Platform (layers shown: SaaS, PaaS)
§ CURARE data harvesting service
§ CURARE data cleaning service
§ CURARE data storage services
§ CURARE data exploration services
§ Data analytics services
§ Decision making services
Managed content: tagged data collections; meta-data (dataCollection, release); views (quantitative & structural description)
DATA HARVESTING: BATCH
CURARE data harvesting service (PaaS)
§ Data source (on demand producer)
§ Method calls: connect(), get(), close()
§ Service instance (continuous) and method call instances, each with an iterator
§ Buffer & caching service
Curation steps: Harvesting, Preserving, Describing (extracting meta-data), Exploring
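The connect() / get() / close() cycle with an iterator over buffered batches could look like the following sketch. Only the three method names come from the slide. The `InMemorySource` class, the `harvest` generator and the `batch_size` parameter are hypothetical stand-ins for a real on-demand producer and the buffer & caching service.

```python
from collections import deque
from typing import Any, Deque, Iterable, Iterator, List


class InMemorySource:
    """Hypothetical on-demand producer exposing connect/get/close."""

    def __init__(self, records: Iterable[Any]):
        self._records = list(records)
        self._position = 0
        self.connected = False

    def connect(self) -> None:
        self.connected = True

    def get(self, batch_size: int) -> List[Any]:
        if not self.connected:
            raise RuntimeError("source is not connected")
        start = self._position
        batch = self._records[start:start + batch_size]
        self._position += len(batch)
        return batch

    def close(self) -> None:
        self.connected = False


def harvest(source, batch_size: int = 2) -> Iterator[Any]:
    """Iterate over a source batch by batch, buffering each batch locally."""
    buffer: Deque[Any] = deque()
    source.connect()
    try:
        while True:
            buffer.extend(source.get(batch_size))
            if not buffer:
                break  # the producer has nothing left to deliver
            while buffer:
                yield buffer.popleft()
    finally:
        # Always release the connection, even if the consumer stops early.
        source.close()
```

The generator hides the batching from the consumer, who simply iterates over records.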
DATA HARVESTING: STREAM
CURARE data harvesting service (PaaS)
§ Method calls: subscribe(), receive(), stop()
§ Service instance (continuous) and method call instance
§ Message queue service (push/pull)
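A small sketch of the subscribe() / receive() / stop() cycle on top of a message queue. The method names come from the slide. The `StreamHarvester` class, the callback-style producer and the use of Python's in-process `queue.Queue` in place of a real message queue service are assumptions.

```python
import queue


class StreamHarvester:
    """Hypothetical stream harvester backed by an in-process queue."""

    def __init__(self):
        self._queue = queue.Queue()
        self._stopped = False
        self.received = []

    def subscribe(self, producer):
        """Hand the producer a push callback that enqueues messages."""
        producer(self._queue.put)

    def receive(self, timeout=0.1):
        """Pull messages until the queue is drained or stop() was called."""
        while not self._stopped:
            try:
                message = self._queue.get(timeout=timeout)
            except queue.Empty:
                break
            self.received.append(message)
        return self.received

    def stop(self):
        """Stop consuming; messages still queued are left untouched."""
        self._stopped = True
```

The producer pushes and the harvester pulls, so the queue decouples the production rate from the harvesting rate.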
DATA CLEANING & STORAGE
§ Services involved: CURARE data harvesting service, CURARE data cleaning service, CURARE distributed data storage services (PaaS)
§ Method calls between services: post(), post()
DATA PROCESSING & EXPLORATION
§ CURARE distributed data storage services: raw data, curated data, analysed data
§ CURARE data exploration services
§ Data analytics services
§ Decision making services
Interactions (four numbered steps on the slide): get(analysedData); sendInstructions(makecollectionViews); View dataCollections; sendInstructions(analyseData)
DATA VARIETY & VARIABILITY
§ Data Collection and View Models
– Data collections track structural meta-data over time through releases
– Views track quantitative meta-data over time through release views
– Attribute descriptors track every existing value
– Release views track the evolution over time of an attribute's values and their distribution
§ Data Curation Service-based Architecture
– SOA allows services to be swapped and added, so that data processing tools can explore new data formats & meanings
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
– Current Prototype
– Experiment 1: Evaluating the Cost of Computing Views
– Experiment 2: Data Sharding
– Comparison and Lessons Learned
§ Conclusion and Perspectives
CURRENT PROTOTYPE IMPLEMENTATION
§ Deployed on OpenStack¹
§ Distributed Data and Access Service: MongoDB
– 4 routers, 3 config servers, 3 shards with 3 replica sets
§ Scripts and statistical operators
– process data collections
– generate objects instantiating our model
§ Machines used
– 4 vCPUs at 2.5 GHz, 8 GB RAM, 80 GB disk
1 https://www.openstack.org
EXPERIMENTS SETTING
§ Profile the cost of extracting meta-data
– Compute statistical measures to produce quantitative meta-data
– Measure execution time in centralized & parallel execution environments
§ Use case: test the interest of using views in a data sharding decision making process
§ Input data collections
– Grand Lyon data set: 2095 documents, 86 releases, 2.5 Mb
– Twitter data set: 736242 documents, 125 releases, 2.5 Gb
– Samples ranging from 4000 to 20000 documents
How much does it cost to extract meta-data & compute Views?
EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (1/2)
The cost varies with the size of the collection and the number of attributes
By far the biggest time consumer is the attribute collection, accounting for 90 to 95% of the time
– 2 min for the Grand Lyon event data set of 2000 documents
– 36 hours for the Twitter data set of 700000 documents
[Chart: Grand Lyon]
EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (2/2)
Speed gain relative to the size of the releases
– 6-7 sec for the Grand Lyon event data set of 2000 documents
– 50 min for the Twitter data set of 700000 documents
[Chart: Grand Lyon]
SHARDING DATA ACROSS DIFFERENT STORES
Traffic status
{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
 "full_location": "Autoroute du Soleil, 69005 Lyon",
 "_id": "Criter11185353",
 "properties": {
  "confidentiality": "noRestriction", "probability": "certain", "mobility": "",
  "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
  "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
  "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
  "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
  "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
  "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "", "last_update_fme": "2016-06-07 19:44:29",
  "endtime": "", "creationreference": "", "informationstatus": "real",
  "townname": "Voie Rapide Urbaine de Lyon", "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon",
  "roadmaintenancetype": "", "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00",
  "gid": "39258", "abnormaltraffictype": ""}
}
Balanced and smooth fragmentation (size, location, availability)
Optimum distribution across shards providing storage spaces (chunks)
§ Shard 0: chunks, key range 0 ... 20
§ Shard 1: chunks, key range 21 ... 40
§ Shard 2: chunks, key range 41 ... 60
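The key ranges above amount to a lookup from a key to a shard. A minimal sketch, assuming integer keys and the three inclusive ranges shown on the slide; the names `SHARD_UPPER_BOUNDS`, `shard_for` and `documents_per_shard` are illustrative.

```python
from bisect import bisect_left

# Inclusive upper bound of each shard's key range, as on the slide:
# shard 0 holds 0..20, shard 1 holds 21..40, shard 2 holds 41..60.
SHARD_UPPER_BOUNDS = [20, 40, 60]


def shard_for(key: int) -> int:
    """Return the index of the shard whose key range contains the key."""
    if key < 0 or key > SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"key {key} is outside every shard range")
    return bisect_left(SHARD_UPPER_BOUNDS, key)


def documents_per_shard(keys):
    """Count how many keys fall into each shard."""
    counts = [0] * len(SHARD_UPPER_BOUNDS)
    for key in keys:
        counts[shard_for(key)] += 1
    return counts
```

Counting documents per shard in this way gives a first check of whether a fragmentation is balanced.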
EXPERIMENT 2: SHARDING WITHOUT VIEWS
Steps performed by the data scientist on the raw data collection:
– Extract a sample
– Explore the data collection manually
– Identify attributes manually
– Choose sharding key candidates
– Associate key candidates with sharding strategies (choose sharding strategy)
– Shard collection with strategies
– Evaluate data distribution across fragments
– If query evaluation time is poor, repeat
EXPERIMENT 2: EVALUATION OF SHARDING KEYS DECISIONS WITHOUT VIEWS
The number of documents & the amount of data vary
§ greatly for ranged location
§ to a lesser extent for hashed location
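The difference between ranged and hashed sharding can be illustrated by counting documents per shard under each strategy. This is a sketch under assumptions: the split points, the sample values and the use of SHA-256 are illustrative (MongoDB uses its own hash function for hashed shard keys), and the function names are hypothetical.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def ranged_shard(value, split_points):
    """Ranged strategy: the shard is given by the sorted split points."""
    return bisect_right(split_points, value)


def hashed_shard(value, n_shards):
    """Hashed strategy: the shard is derived from a hash of the value."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def distribution(values, assign, n_shards):
    """Number of documents that each shard receives."""
    counts = Counter(assign(value) for value in values)
    return [counts.get(shard, 0) for shard in range(n_shards)]


# Skewed sample: one location value dominates the collection.
locations = ["Lyon"] * 8 + ["Paris", "Athens"]
ranged = distribution(locations, lambda v: ranged_shard(v, ["F", "N"]), 3)
hashed = distribution(locations, lambda v: hashed_shard(v, 3), 3)
```

Hashing spreads distinct values more evenly than ranges do, but identical values always land on the same shard, so a dominant value still produces one large fragment under either strategy.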
EXPERIMENT 2: DECISION MAKING QUESTIONS
§ Which attribute can be used to shard the collection?
§ What is the value distribution of each attribute?
§ Is there critical data with particular availability requirements?
§ Should some fragments be collocated?
This information must be extracted and computed
EXPERIMENT 2: SHARDING USING VIEWS
Data provider: raw data collection
§ Get meta-data
– Extract meta-data: create data collection meta-data (Data collection, Release, Item)
– Compute meta-data: compute quantitative measures; discover domain value properties (null, absent values)
§ Create Views (View, ReleaseView, AttributeDescriptor)
Data scientist
§ Identify attributes
§ Choose sharding key candidates
§ Associate key candidates with sharding strategies
§ Shard collection with strategies
§ Evaluate data distribution across fragments
Identify the best candidate key attributes
EXPERIMENT 2: IDENTIFYING ATTRIBUTES VALUES DISTRIBUTION
ReleaseView: attributes related to location

| attribute | nb values | missing | max count | min count |
|---|---|---|---|---|
| tweets.quoted_status.user.location.string | 8436 | 3008392 | 16820 | 4 |
| tweets.user.location.string | 11041 | 571962 | 269506 | 4 |
| tweets.user.time_zone.string | 149 | 860539 | 870263 | 4 |
| tweets.place.bounding_box.coordinates.0.0.0.number | 3 | 0 | 1983882 | 19950 |
| tweets.place.bounding_box.coordinates.0.0.1.number | 3 | 0 | 1968043 | 35784 |
| tweets.place.bounding_box.coordinates.0.1.0.number | 3 | 0 | 1983882 | 19950 |
| tweets.place.bounding_box.coordinates.0.1.1.number | 3 | 0 | 1969602 | 34231 |
| tweets.place.bounding_box.coordinates.0.2.0.number | 5 | 0 | 1898817 | 19950 |
| tweets.place.bounding_box.coordinates.0.2.1.number | 3 | 0 | 1969602 | 34231 |
| tweets.place.bounding_box.coordinates.0.3.0.number | 3 | 0 | 1898817 | 19950 |
| tweets.place.bounding_box.coordinates.0.3.1.number | 3 | 0 | 1968043 | 25784 |
| tweets.quoted_status.user.geo_enabled.string | 2 | 2938908 | 155678 | 95665 |
| tweets.user.geo_enabled.string | 1 | 0 | 3190337 | 3190337 |

Annotations on the slide:
– Too many missing values
– Good distribution of values, few missing values → candidate sharding key
– Lead to unbalanced chunks
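The three annotations suggest a simple screening rule over the columns of the table. The sketch below is one possible reading, not the procedure used in the experiment: the thresholds `max_missing_ratio` and `max_dominance` are illustrative assumptions, and the total of 3190337 documents is taken from the row where a single value with no missing entries covers the whole release view. With other thresholds the classification would differ.

```python
TOTAL_DOCS = 3190337  # assumption: count of the single-valued, never-missing attribute


def assess_sharding_key(nb_values, missing, max_count, total_docs=TOTAL_DOCS,
                        max_missing_ratio=0.3, max_dominance=0.5):
    """Classify an attribute using the columns of the table above.

    Both thresholds are illustrative assumptions, not values from the slides.
    """
    if missing / total_docs > max_missing_ratio:
        return "too many missing values"
    # A single value, or one value holding most documents, cannot be split evenly.
    if nb_values < 2 or max_count / total_docs > max_dominance:
        return "unbalanced chunks"
    return "candidate"


rows = {
    "tweets.quoted_status.user.location.string": (8436, 3008392, 16820),
    "tweets.user.location.string": (11041, 571962, 269506),
    "tweets.user.time_zone.string": (149, 860539, 870263),
    "tweets.place.bounding_box.coordinates.0.0.0.number": (3, 0, 1983882),
    "tweets.user.geo_enabled.string": (1, 0, 3190337),
}
verdicts = {name: assess_sharding_key(*columns) for name, columns in rows.items()}
```

Under these assumed thresholds the user location and time zone attributes come out as candidates, which matches the two attributes analysed on the following slides.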
With which sharding strategies (range, hash) can candidate keys cope?
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (1/2)
[Bar chart from the Attribute Descriptor of the time zone attribute: number of documents (axis from 0 to 1000000) per time zone value, from Abu Dhabi to Wellington. Labelled values: Paris, US & Canada, Ljubljana, Greenland, Athens, Amsterdam, Missing]
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (2/2)
[Bar chart from the Attribute Descriptor of the time zone attribute: number of documents (axis from 0 to 1000000) per time zone value, from Abu Dhabi to Wellington, with the values split across Shard1, Shard2 and Shard3. Labelled values: Paris, US & Canada, Ljubljana, Greenland, Athens, Amsterdam, Missing]
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (1/3)
[Bar chart from the Attribute Descriptor of the location attribute: number of documents (axis from 0 to 700000) per location value. The values are free-text user locations, for example "Arcachon, France", "Gilbert, AZ", "Rillieux-la-Pape (69)"]
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (2/3)
[Bar chart from the Attribute Descriptor of the location attribute: number of documents (axis from 0 to 3500000) per free-text location value. Labels: location, France, Lyon, missing]
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (3/3)
[Figure: bar chart for the candidate sharding key location. Y-axis ticks: 0 to 3500000 in steps of 500000. X-axis (Attribute Descriptor): free-text location values. Legend: location, France, Lyon, missing. Shard markers: Shard1, Shard2, Shard3]
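The kind of analysis behind these charts can be sketched in a few lines of Python: count how many documents carry each value of the candidate key (missing values included), then pack the values into a fixed number of shards and compare shard sizes. The sample documents, the `MISSING` label and the greedy packing rule are illustrative assumptions, not the procedure used in the experiment.

```python
from collections import Counter
import heapq

MISSING = "<missing>"  # hypothetical label for absent or empty values


def value_histogram(docs, key):
    """Count documents per value of `key`; absent and empty values share one bucket."""
    return Counter((doc.get(key) or MISSING) for doc in docs)


def pack_into_shards(hist, n_shards):
    """Greedy packing: most frequent values first, each onto the currently lightest shard."""
    heap = [(0, shard) for shard in range(n_shards)]  # (shard size, shard index)
    heapq.heapify(heap)
    assignment = {}
    for value, count in sorted(hist.items(), key=lambda kv: (-kv[1], kv[0])):
        size, shard = heapq.heappop(heap)
        assignment[value] = shard
        heapq.heappush(heap, (size + count, shard))
    sizes = [0] * n_shards
    for value, shard in assignment.items():
        sizes[shard] += hist[value]
    return assignment, sizes


# Toy stand-in for a collection with a free-text location attribute
docs = (
    [{"location": "France"}] * 5
    + [{"location": "Lyon"}] * 4
    + [{"location": "Paris"}] * 2
    + [{"location": ""}, {}]  # one empty and one absent value
    + [{"location": "Metz"}]
)
hist = value_histogram(docs, "location")
assignment, sizes = pack_into_shards(hist, 3)
print(hist.most_common())
print(sizes)
```

A value that dominates the histogram (or a large share of missing values) cannot be split by this kind of value-based placement, which is one reason to inspect the distribution before committing to a key.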
![Page 68: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/68.jpg)
68
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
– Current Prototype
– Experiment 1: Evaluating the Cost of Computing Views
– Experiment 2: Data Sharding
– Comparison and Lessons Learned
§ Conclusion and Perspectives
![Page 69: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/69.jpg)
69
Is decision making for sharding better done with or without views?
![Page 70: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/70.jpg)
70
COMPARISON
Without views (data scientist, starting from the raw data collection): iterative process
– Extract a sample
– Explore the data collection manually
– Identify attributes manually
– Choose a sharding strategy
– If query evaluation time is poor, start again
- Guess sharding keys according to the data scientist's knowledge
- Trial-and-error approach based on samples

With views (data provider and data scientist, starting from the raw data collection): one-shot process
– Get, extract and compute meta-data
– Create the data collection meta-data
– Compute quantitative measures
– Discover domain value properties (null, absent values)
– Create views
– Identify attribute distributions
– Choose sharding key candidates
– Associate key candidates with sharding strategies
– Shard the collection with the strategies
– Evaluate data distribution across fragments
- Full structural & quantitative knowledge of the data collection
- Decision based on the whole data collection
Model entities: Data collection, Release, Item, View, ReleaseView, AttributeDescriptor

| | Without Views | With Views |
|---|---|---|
| Query attributes | No | Yes |
| Explorable attributes | Sample | All |
| Decision support | Educated guess | Measurable graphs |

Evaluation based on
- Query execution / communication time: determines the degree of data fragmentation and colocation
- Balance of shard sizes
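The "with views" steps that compute quantitative measures and discover null or absent values can be illustrated with a small sketch. It builds one descriptor per attribute over the whole collection and then shortlists sharding key candidates. The descriptor fields, the thresholds and the sample documents are assumptions made for illustration; they are not the CURARE view model itself.

```python
def describe_attributes(docs):
    """Build one descriptor per attribute: missing ratio, distinct values, value types.

    Assumes scalar (hashable) attribute values; None and "" count as missing.
    """
    total = len(docs)
    names = sorted({name for doc in docs for name in doc})
    descriptors = {}
    for name in names:
        values = [doc.get(name) for doc in docs]
        present = [v for v in values if v not in (None, "")]
        descriptors[name] = {
            "missing_ratio": (total - len(present)) / total if total else 0.0,
            "distinct": len(set(present)),
            "types": sorted({type(v).__name__ for v in present}),
        }
    return descriptors


def candidate_keys(descriptors, max_missing=0.2, min_distinct=2):
    """Keep attributes that are rarely missing and take several distinct values."""
    return [
        name
        for name, d in descriptors.items()
        if d["missing_ratio"] <= max_missing and d["distinct"] >= min_distinct
    ]


docs = [
    {"id": 1, "type": "VehicleObstruction", "townname": "Lyon", "endtime": ""},
    {"id": 2, "type": "Accident", "townname": "Lyon", "endtime": ""},
    {"id": 3, "type": "Accident", "townname": None},
    {"id": 4, "type": "RoadWorks", "townname": "Villeurbanne", "endtime": "2016-06-08"},
]
descriptors = describe_attributes(docs)
for name, d in descriptors.items():
    print(name, d)
print(candidate_keys(descriptors))
```

Because the descriptors are computed once over every document, the choice of candidates rests on the whole collection rather than on a sample.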
![Page 71: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/71.jpg)
71
LESSONS LEARNED
| | Sharding without Views | Sharding with Views |
|---|---|---|
| Exploration | Several days | A few hours |
| Methods | Using scripts to find divisions in the data | Querying the meta-data to find related attributes, analysing the histogram for sharding |
| Quality | Mediocre, distribution remains fairly poor | Almost perfect distribution of documents across all the shards |
| Processing | Fairly cheap and quick scripts | Expensive data processing |

→ Sharding with views is quicker overall, more intuitive and more effective
![Page 72: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/72.jpg)
72
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
![Page 73: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/73.jpg)
73
CONCLUSION
§ How to model structural, semantic & quantitative meta-data in an integrated manner?
– Data collection & View models: structural and quantitative meta-data
§ How to design a cloud service-oriented architecture that enables data curation while considering variety and variability?
– CURARE: cloud service-oriented architecture for storing & curating data collections
§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?
– Use case: choosing the best criterion for sharding data using Views
![Page 74: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/74.jpg)
74
SHORT & MID TERM PERSPECTIVES
§ Compare releases & data collections
– visualisation
– query languages enhancing current ad-hoc and imperative proposals
§ Evaluate our data curation approach with user feedback
– explore newspaper collections & political campaigns
– investigate the urban data from the pole CARA
§ Extend the view model & CURARE
– discover semantic metadata and relationships among attributes
– discover functional, temporal and causal dependencies
![Page 75: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/75.jpg)
75
LONG TERM PERSPECTIVES
§ Compose big data curation services according to QoS criteria
– data properties and target applications
§ Data curation traceability
– distributed and collaborative data curation based on blockchain
§ Big data collections market
– curation guided by cost and business models
– QoS criteria like privacy, economic cost, provenance, reputation, trust
![Page 76: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/76.jpg)
76
CURARE: CURATING AND MANAGING BIG DATA COLLECTIONS ON THE CLOUD
Gavin Robert KEMP, LIRIS
![Page 77: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/77.jpg)
77
PUBLICATIONS
§ Journals
1. Cloud big data application for transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet, P. López Amaya; International Journal of Agile Systems and Management, Inderscience, 9 (3), 2016
2. Big data collections and services for building intelligent transport applications; G. Kemp, P. López Amaya, C. Ferreira Da Silva, G. Vargas-Solar, P. Ghodous, C. Collet; International Journal of Electronic Business Management, Vol. 14 No. 1, pp. 1-10, 2016
3. CURARE: Maintaining and Managing Data Collections Using Views; G. Kemp, C. Ferreira Da Silva, G. Vargas-Solar, P. Ghodous; IEEE Transactions on Big Data (submitted)
§ Conferences
4. Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet, P. Lopez; Concurrent Engineering, Jul. 2015, Delft, Netherlands
5. Aggregating and Managing Big rEaltime Data in the Cloud: Application to intelligent transport for Smart Cities; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous; Proceedings of the 1st International Conference on Vehicle Technology and Intelligent Transport Systems, May 2015, Lisbon, Portugal, pp. 107-112
§ Book Chapter
6. Service Oriented Big Data Management for Transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet; Smart Cities, Green Technologies, and Intelligent Transport Systems, Communications in Computer and Information Science series, Springer, 579, pp. 267-281, 2016
![Page 78: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/78.jpg)
78
![Page 79: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/79.jpg)
79
PARALLEL DBMS & PROCESSING ENVIRONMENTS FOR CURATING DATA
Curation tasks shown in the figure: Preserving, Describing, Extracting meta-data, Exploring, Harvesting
ETL
– Parallel data collection (Flink, Stream, Flume)
Parallel Data Processing Platforms
– Spark (RDD – Tables/Graphs)
– Hadoop ecosystem tools (e.g., Pig)
– Spark (descriptive statistics functions)
– Hadoop ecosystem tools (e.g., Hive)
NoSQL & NewSQL (Parallel)
– Structured data provision
Parallel Data Querying & Analytics
– Parallel RDBMS, Big Data Analytics Stacks (Asterix, BDAS)
– Parallel analytics (Matlab, R)
![Page 80: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/80.jpg)
80
EXPLORING TRANSPORT EVENTS IN LYON USING VIEWS
Views exploration assists decision making
Views: View<Events>, View<Bicycles>
Model shown for each view (UML): Releases R1 … Rn; ReleaseView; AttributeDescriptor (role attributeDescriptors, multiplicities 0..n and 1); Statistics a1 … an
Example queries
– Search datasets about traffic events smaller than 30 MB that report events during vacations
– Which releases about bicycle distribution have less than 20% missing values?
– Search releases reporting traffic events close to bicycle stations
Query annotations in the figure: s, %, knn
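Two of these queries depend only on quantitative meta-data (release size and share of missing values), so they can be answered from the views without reading the items. The sketch below shows the idea with plain Python objects. The class fields, the `topic` tag and the rule that averages missing ratios over attributes are assumptions for illustration; the temporal ("during vacations") and spatial ("close to") conditions are not covered.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeDescriptor:
    name: str
    missing_ratio: float  # share of items where the attribute is null or absent


@dataclass
class ReleaseView:
    release_id: str
    topic: str  # hypothetical tag, e.g. "events" or "bicycles"
    size_bytes: int
    attribute_descriptors: list = field(default_factory=list)

    def missing_ratio(self):
        """Mean missing ratio over attributes (an assumed aggregation rule)."""
        if not self.attribute_descriptors:
            return 0.0
        ratios = [d.missing_ratio for d in self.attribute_descriptors]
        return sum(ratios) / len(ratios)


def select_releases(views, topic, max_size_bytes=None, max_missing=None):
    """Return ids of releases on `topic` strictly below the given size and missing-ratio limits."""
    selected = []
    for view in views:
        if view.topic != topic:
            continue
        if max_size_bytes is not None and view.size_bytes >= max_size_bytes:
            continue
        if max_missing is not None and view.missing_ratio() >= max_missing:
            continue
        selected.append(view.release_id)
    return selected


MB = 1024 * 1024
views = [
    ReleaseView("R1", "bicycles", 12 * MB,
                [AttributeDescriptor("available_bikes", 0.05), AttributeDescriptor("station", 0.15)]),
    ReleaseView("R2", "bicycles", 9 * MB,
                [AttributeDescriptor("available_bikes", 0.40), AttributeDescriptor("station", 0.10)]),
    ReleaseView("R3", "events", 25 * MB, [AttributeDescriptor("type", 0.0)]),
    ReleaseView("R4", "events", 80 * MB, [AttributeDescriptor("type", 0.0)]),
]

print(select_releases(views, "bicycles", max_missing=0.20))
print(select_releases(views, "events", max_size_bytes=30 * MB))
```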
![Page 81: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/81.jpg)
81
DATA CURATION ON THE LAKES
| Approach | Principle | Pros | Cons |
|---|---|---|---|
| Meta-data: Constance [Hai et al 2016], [Stonebraker et al 2013]; data wrangling [Terrizzano et al 2015] | Extraction of semantic meta-data or mapping items with concepts. Query using SPARQL, regular expressions | Possibility to express declarative queries; difficult to automate completely | No statistics, iterative data querying for exploring data |
| Data content description: descriptive statistics, processing according to data types, schema extraction | Explore data structure for extracting the schema. Compute descriptive statistics functions for every element | Aggregated view of the content despite data types. Simple to visualize and scales to large volumes | Adapted for (semi)structured data, manual tagging for multimedia content |
| Curation: [CoreKG, Curry 2016, QoSMOS 2018, Tacit knowledge management 2017] | API with methods for preserving data collections. Querying and data fusion operations | Exploit the Compass tool from MongoDB for providing a quantitative vision of data content | Data transformation from CSV to JSON. No semantic knowledge (terms, functional dependencies) |
![Page 82: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/82.jpg)
82
![Page 83: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/83.jpg)
83
PROCESSING DATA COLLECTIONS
Description
– Clustering (Bayesian, hierarchical, k-means, CLARA, PAM)
– Association
– Modelling (LU, QR, PCA=SVD, PARAFAC)
Prediction
– Classification (Neural network, PLSDA, KNN, decision trees)
– Trend prediction
– Regression (PLSR, PCR)
Example questions
– How many accidents are reported per day?
– What percentage of the available bicycles is used downtown?
– Which are the traffic bottleneck regions in the city?
– Is the number of car accidents related to seasons?
– How will the use of bicycles evolve downtown during the summers of the next 5 years?
– Which types of cars have the most accidents on high-speed roads?
– Will increasing the parking cost reduce car traffic in the city and increase the number of people using public transport?
![Page 84: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/84.jpg)
84
PROCESSING DATA COLLECTIONS
Spatial analysis of dynamic movements of Vélo’v, Lyon’s shared bicycle program [ECCS 2009]
Contribution: PCA and k-means for predicting the usage trend of Vélo’v in Lyon
Description › Clustering: PCA & k-means
PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that
- the greatest variance by some projection of the data comes to lie on the first coordinate (first principal component),
- the second greatest variance lies on the second coordinate, and so on.
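The definition above can be made concrete with a minimal sketch that finds the first principal component only. It centres the data, builds the sample covariance matrix and uses power iteration, which is a swapped-in technique chosen to keep the example dependency-free; it is not necessarily how the cited work computes PCA. The sample points are made up.

```python
import math


def first_principal_component(points, iterations=200):
    """Return (unit direction, variance along it) for the first principal component."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centred data
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the start vector is not orthogonal to it (assumed here).
    vec = [1.0 / (j + 1) for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:  # no variance reachable from this start vector
            break
        vec = [x / norm for x in nxt]
    variance = sum(vec[a] * cov[a][b] * vec[b] for a in range(dim) for b in range(dim))
    return vec, variance


# Made-up observations that lie roughly along a diagonal
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

Projecting each centred point onto `direction` gives its first coordinate in the new system; repeating the procedure on the residual would give the second component.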
![Page 85: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/85.jpg)
85
PROCESSING DATA COLLECTIONS
Inferring the Root Cause in Road Traffic Anomalies [IEEE Data Mining 2012]
Contribution: PCA for identifying traffic anomalies and thereby detecting problems in roads
Description › Modelling: Principal Component Analysis (PCA)
![Page 86: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/86.jpg)
86
DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS
CAP properties
– Consistency (C): all clients always have the same view of the data
– Availability (A): each client can always read & write
– Partition tolerance (P): the system works well despite physical network partitions
Property pairs: C - A, A - P, C - P
Data models
- Relational
- Key-Value
- Column oriented
- Document oriented
Systems (grouped as in the figure)
- Dynamo, Voldemort, Tokyo Cabinet, KAI
- Cassandra, SimpleDB, CouchDB, Riak
- BigTable, HyperTable, HBase
- MongoDB, TerraStore, Scalaris
- BerkeleyDB, MemcacheDB, Redis
- RDBMSs (MySQL, Postgres, etc.)
- Aster Data, GreenPlum, Vertica, Neo4j
Figure annotation: Can be sharded
![Page 87: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/87.jpg)
87
DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS
Raw data collections: tabular (CSV, Excel), media (XML, JSON, BLOB), graph
Sample raw document:
{
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "full_location": "Autoroute du Soleil, 69005 Lyon",
  "_id": "Criter11185353",
  "properties": {
    "confidentiality": "noRestriction", "probability": "certain", "mobility": "",
    "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
    "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
    "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
    "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
    "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
    "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "", "last_update_fme": "2016-06-07 19:44:29",
    "endtime": "", "creationreference": "", "informationstatus": "real",
    "townname": "Voie Rapide Urbaine de Lyon", "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon",
    "roadmaintenancetype": "", "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00",
    "gid": "39258", "abnormaltraffictype": ""
  }
}
Target data models
- Relational
- Key-Value
- Column oriented (tabular)
- Document oriented
Questions
- How to transform data collections?
- Which is the best-adapted model?
→ Polyglot persistence
Approaches dealing with transformation rules inspired by the relational case
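One possible transformation rule, from a nested JSON document to a flat row suitable for a tabular or relational model, can be sketched as follows. The dot-separated column names and the indexed columns for arrays are assumptions chosen for illustration, and the input is an abridged copy of the sample document on this slide.

```python
import json


def flatten(document, prefix=""):
    """Flatten nested objects into one row with dot-separated column names."""
    row = {}
    for key, value in document.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=column + "."))
        elif isinstance(value, list):
            # Assumption: arrays become indexed columns (coordinates.0, coordinates.1)
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    row.update(flatten(item, prefix=f"{column}.{index}."))
                else:
                    row[f"{column}.{index}"] = item
        else:
            row[column] = value
    return row


# Abridged version of the traffic event document shown above
raw = """
{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
 "_id": "Criter11185353",
 "properties": {"type": "VehicleObstruction", "status": "active", "endtime": ""}}
"""
row = flatten(json.loads(raw))
for column, value in row.items():
    print(column, "=", repr(value))
```

The flat row loses the grouping that the document model keeps for free, which is part of why the choice of target model depends on how the collection will be queried.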
![Page 88: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/88.jpg)
88
DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS
Persistence
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / hard disk
Fragmentation/Sharding & replication
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged
- Distribution & location
Availability & Fault tolerance
- Replication & distribution
[Figure: the sample traffic event document with "_id": "Criter11185353" shown on the previous slide; labels: Memory/Cache, Raw data collections]
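The range and hash strategies named on this slide can be contrasted with a short sketch of horizontal fragmentation. The function names, the month boundaries and the sample timestamps are hypothetical; the tagged strategy is not covered.

```python
import bisect
import hashlib
from collections import Counter


def hash_shard(key, n_shards):
    """Hash strategy: a stable digest of the key, modulo the number of shards."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def range_shard(key, boundaries):
    """Range strategy: `boundaries` is a sorted list of split points.

    Shard 0 holds keys below boundaries[0]; shard i holds keys in
    [boundaries[i-1], boundaries[i]); the last shard holds the rest.
    """
    return bisect.bisect_right(boundaries, key)


# ISO-like timestamps sort correctly as plain strings
creation_times = [
    "2016-05-20 08:15:00",
    "2016-06-07 19:40:00",
    "2016-06-21 07:05:00",
    "2016-07-15 12:00:00",
]
boundaries = ["2016-06-01", "2016-07-01"]  # hypothetical split points: three shards

by_range = Counter(range_shard(t, boundaries) for t in creation_times)
by_hash = Counter(hash_shard(t, 3) for t in creation_times)
print("range:", dict(by_range))
print("hash: ", dict(by_hash))
```

Range sharding keeps neighbouring keys together, which favours range queries but can overload one shard when values cluster; hash sharding spreads keys more evenly at the cost of scattering range queries across shards.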
![Page 89: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/89.jpg)
89
EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (1/2)
§ Varies with the size of the collection
– 2 min for the Grand Lyon event data set of 2,000 documents
– 36 hours for the Twitter data set of 700,000 documents
§ By far the biggest time consumer is the attribute collection, which accounts for 90 to 95% of the time
[Figure label: Grand Lyon]
![Page 90: CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data](https://reader034.vdocuments.site/reader034/viewer/2022050520/5fa3bdad271d3a2ee10c1ae4/html5/thumbnails/90.jpg)
90
EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (2/2)
§ Varies with the size of the collection
– 6-7 sec for the Grand Lyon event data set of 2,000 documents
– 50 min for the Twitter data set of 700,000 documents
[Figure label: Grand Lyon]