Piecing together large puzzles, efficiently
Scalable bulk loading into graph databases
Work in progress paper
Gabriel Campero Durand, Jingyi Ma, Marcus Pinnecke, Gunter Saake
Databases and Software Engineering Workgroup, OvGU University of Magdeburg
Agenda
● Motivation
● Background & The Graph Loading Process
● Experiments
● Conclusion & Future Work
Motivation
How can we better understand the networks that we belong to?
● An example of a practical application: Recommendations@Pinterest
● An example of a practical application: Dependency-driven analytics@Microsoft
● Large graphs are ubiquitous
  ○ ⅕ of participants use graphs with >100 M edges
● Scalability is the main challenge
● Graph DBMSs are the most popular tool, at the moment
● User experience starts with data loading
● This can still be improved
  ○ Currently no standard scale-out solution for the process (our focus)
  ○ Limited handling of variable input data characteristics
bin/neo4j-import --into retail.db --id-type string \
  --nodes:Customer customers.csv --nodes products.csv \
  --nodes orders_header.csv,orders1.csv,orders2.csv \
  --relationships:CONTAINS order_details.csv \
  --relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv
Background
● Input data characteristics
Edge Lists, from SNAP Astro-Physics Collaboration Dataset
Implicit Entities, from SNAP Amazon Movie Reviews Dataset
Also property encodings, others…
Working today with large and diverse graph datasets is cumbersome
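As a rough illustration of the edge-list input format, the sketch below parses SNAP-style lines (whitespace-separated vertex pairs, with `#` comment lines) into a vertex set and an edge list. The function name `parse_edge_list` and the sample text are assumptions made for illustration; they are not taken from the talk.

```python
from typing import Iterable, List, Set, Tuple


def parse_edge_list(lines: Iterable[str]) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Parse whitespace-separated 'src dst' lines, skipping blanks and '#' comments."""
    vertices: Set[str] = set()
    edges: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header/comment lines carry no edges
        src, dst = line.split()[:2]
        vertices.update((src, dst))  # vertices are implicit in an edge list
        edges.append((src, dst))
    return vertices, edges


# Hypothetical sample input in the same shape as a SNAP edge-list file.
sample = "# FromNodeId\tToNodeId\n0\t1\n0\t2\n2\t1\n"
vertices, edges = parse_edge_list(sample.splitlines())
print(len(vertices), "vertices,", len(edges), "edges")
```

Vertices have to be derived from the edges here, which is one reason loaders need a separate pass (or a lookup structure) to create each vertex only once.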
● But before going any further, the single unavoidable slide :)
● Property graphs (the underlying logical model we’re assuming)
● Directed
● Labeled
● Attributed
● Multi-graph
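A minimal sketch of the property graph model may help here. The class and field names below (`Vertex`, `Edge`, `PropertyGraph`) are hypothetical and only illustrate the four characteristics listed above: edges have a direction, vertices and edges carry labels and key-value attributes, and parallel edges between the same pair of vertices are allowed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: int
    label: str                                                   # labeled
    properties: Dict[str, Any] = field(default_factory=dict)    # attributed


@dataclass
class Edge:
    src: int                                                     # directed: src -> dst
    dst: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)  # a list permits parallel edges

    def add_vertex(self, vertex: Vertex) -> None:
        self.vertices[vertex.id] = vertex

    def add_edge(self, edge: Edge) -> None:
        # Constraint: both endpoints must already exist.
        if edge.src not in self.vertices or edge.dst not in self.vertices:
            raise KeyError("edge endpoint is not in the graph")
        self.edges.append(edge)


g = PropertyGraph()
g.add_vertex(Vertex(1, "Customer", {"name": "Ada"}))
g.add_vertex(Vertex(2, "Product", {"title": "Puzzle"}))
g.add_edge(Edge(1, 2, "ORDERED", {"quantity": 1}))
g.add_edge(Edge(1, 2, "REVIEWED", {"stars": 5}))  # second edge between the same pair
print(len(g.vertices), "vertices,", len(g.edges), "edges")
```

The endpoint check in `add_edge` is one example of a constraint that a loader has to respect while moving data into storage.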
The Graph Loading Process
Topology-only representations
Complete representations
● Moving data from input files to physical storage, while complying with constraints
Experiments
● The basic question we address today:
○ How much can a developer today scale out and tune the loading process, without changing database internals?
● Setup
JanusGraph (formerly Titan)
Datasets: Wiki-RfA (10,835 V, 159,388 E) and Google-Web (875,713 V, 5,105,039 E)
● Setup
  ○ JanusGraph version 0.1.1 (May 11, 2017)
  ○ Apache Cassandra 2.1.1
  ○ Commodity multi-core machine with 2 Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz processors (8 cores in total) and 251 GB of memory
Gains from batching
● Fit more data inside a single transaction
● In general, the bigger the batch size, the faster the loading process
  ○ Batching works!
● But larger batch sizes don't guarantee better performance
  ○ Poor use of transaction caches
  ○ Higher costs for failed transactions
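The batching idea can be sketched as follows. This is only an illustration under stated assumptions: `commit` is a hypothetical stand-in for whatever opens a transaction, writes the batch, and commits it in the target database; it is not a JanusGraph API.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Edge = Tuple[int, int]


def batches(items: Iterable[Edge], batch_size: int) -> Iterator[List[Edge]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Edge] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


def load_in_batches(items: Iterable[Edge], batch_size: int,
                    commit: Callable[[List[Edge]], None]) -> int:
    """Call commit once per batch (one transaction each); return the commit count."""
    n_commits = 0
    for batch in batches(items, batch_size):
        commit(batch)  # hypothetical: write the batch inside a single transaction
        n_commits += 1
    return n_commits


committed: List[List[Edge]] = []
edges = [(i, i + 1) for i in range(25)]
n_commits = load_in_batches(edges, 10, committed.append)
print(n_commits, "commits for", len(edges), "edges")
```

Fewer commits is where the gain comes from; the trade-off is that a failed transaction now discards a whole batch, which is the cost noted in the last bullet.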
Adding some parallelism
● Partition the data into chunks and load them in parallel
  ○ Here we report the average over the partitioning strategies.
● This consistently reduces the loading time
● Less impact than batching
  ○ Multiple users writing to the same data add transaction commit overhead.
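A minimal sketch of chunked parallel loading, assuming a thread pool on the client side. The names `split_into_chunks`, `load_parallel` and `load_chunk` are hypothetical; `load_chunk` stands in for a worker that writes one chunk to the database and returns how many edges it loaded.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Edge = Tuple[int, int]


def split_into_chunks(edges: Sequence[Edge], n_chunks: int) -> List[List[Edge]]:
    """Deal edges round-robin so chunk sizes differ by at most one."""
    return [list(edges[i::n_chunks]) for i in range(n_chunks)]


def load_parallel(edges: Sequence[Edge], n_workers: int,
                  load_chunk: Callable[[List[Edge]], int]) -> int:
    """Load each chunk on its own worker thread; return the total edges loaded."""
    chunks = split_into_chunks(edges, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(load_chunk, chunks))


edges = [(i, i + 1) for i in range(1000)]
# `len` is used as a trivial stand-in loader that only counts the chunk.
loaded = load_parallel(edges, 4, len)
print(loaded, "edges loaded by 4 workers")
```

With a real loader, workers that touch the same vertices would contend at commit time, which is the overhead described in the last bullet.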
A closer look at the partitioning strategies
● EE: Partition Edges (PE), Balance Edges (BE)
● VV: Partition Vertices (PV), Balance Vertices (BV)
● BE: PV, BE
● DS: Extension to BE, deals with skew
All achieve good balancing in these datasets
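One possible reading of the first three strategies is sketched below. The slides only give the abbreviations, so the concrete rules are assumptions: round-robin dealing for EE, edges following their source vertex for VV and BE, and a greedy heaviest-first assignment for BE. DS is not sketched because the slides do not describe how it handles skew.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def partition_ee(edges: Sequence[Edge], k: int) -> List[List[Edge]]:
    """EE (assumed): partition edges, balancing edge counts via round-robin."""
    parts: List[List[Edge]] = [[] for _ in range(k)]
    for i, edge in enumerate(edges):
        parts[i % k].append(edge)
    return parts


def _edges_by_source(edges: Sequence[Edge]) -> Dict[int, List[Edge]]:
    by_src: Dict[int, List[Edge]] = defaultdict(list)
    for edge in edges:
        by_src[edge[0]].append(edge)
    return by_src


def partition_vv(edges: Sequence[Edge], k: int) -> List[List[Edge]]:
    """VV (assumed): partition source vertices, balancing vertex counts."""
    by_src = _edges_by_source(edges)
    parts: List[List[Edge]] = [[] for _ in range(k)]
    for i, src in enumerate(sorted(by_src)):
        parts[i % k].extend(by_src[src])  # edges follow their source vertex
    return parts


def partition_be(edges: Sequence[Edge], k: int) -> List[List[Edge]]:
    """BE (assumed): partition source vertices, balancing edge counts greedily."""
    by_src = _edges_by_source(edges)
    parts: List[List[Edge]] = [[] for _ in range(k)]
    # Place high-degree vertices first, always into the currently lightest part.
    for src in sorted(by_src, key=lambda s: len(by_src[s]), reverse=True):
        min(parts, key=len).extend(by_src[src])
    return parts


# Hypothetical skewed input: vertex 0 has six out-edges, vertices 1-3 have one each.
edges = [(0, d) for d in range(1, 7)] + [(1, 0), (2, 0), (3, 0)]
sizes = {name: [len(p) for p in fn(edges, 2)]
         for name, fn in [("EE", partition_ee), ("VV", partition_vv), ("BE", partition_be)]}
print(sizes)
```

On this toy input VV produces the least even split of edges, because it balances vertices and ignores how many edges each vertex brings with it.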
No big differences between them for these datasets
The only imbalance appears in Wiki-RfA with VV and 2 partitions.
Figure: Distribution across partitions in Google Web
Figure: Distribution across partitions in Wiki-RfA
Putting it all together
Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Wiki - RfA)
● Combining batching and partitioning leads to degraded performance
  ○ In a multi-user environment, transaction commit time increases with batch size if users select the same data.
  ○ It also increases with more users.
No improvements over batching
Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Google Web)
Conclusion
● Batching is the best first strategy. We've seen load times drop from 100 minutes to 1.5 minutes.
  ○ Small disclaimer: gains do not grow in proportion to batch sizes.
● But the combination of batching and partitioning is not straightforward and can bring deterioration.
  ○ How can we make them work well together?
● EE, BE/DS could be the default partitioning strategy
  ○ But load imbalance is not the only factor affecting performance
More studies are next: we will extend our questions to physical storage alternatives, in line with our broader interest in supporting adaptive HTAP designs.
Thanks :)
Questions?