final presentation irt - jingxuan wei v1.2
Post on 15-Apr-2017
1
PARALLEL DATA ACQUISITION AND RETRIEVAL APPROACH FOR HEAVY HAUL RAILWAYS (FINAL PRESENTATION)
Faculty of Information Technology
Supervisor: Assoc Prof David Taniar
BY: JINGXUAN WEI (Tom) 25025031
2
PRESENTATION OUTLINE
• Research Background
  • Instrumented Ore Car Program issue
  • Problem of the existing database
• Research Question
• Related works
• Research Aim
• Data Acquisition
  • MongoDB Import and Export Tools
  • Spark-MongoDB Application
  • Result analysis
• Data Retrieval
  • Data retrieval by Spark SQL
  • Data retrieval by Spark filter operation
  • How to improve searching efficiency?
• Conclusion and Future work
3
RESEARCH BACKGROUND (INSTRUMENTED ORE CAR (IOC) PROGRAM)
• Railway in mining, Pilbara region, WA
• Loaded with iron ore
• Equipped with sensors to collect data as the train runs
• Trained professionals maintain the sensors
Aim of the program:
• Monitor track and wagon performance
• Detect track abnormalities
4
INSTRUMENTED ORE CAR PROGRAM ISSUE
What are the issues?
• Sensor selection
  • Smart sensors are expensive.
  • Less expensive sensors are inaccurate (semi-structured data).
• Database issue
  • Low data ingestion speed
  • Too much time spent on searching
Expected outcome: Equip the wagons with many cheap sensors to collect data and still obtain the desired outcome. (Reduce cost)
5
PROBLEM ANALYSIS
Low data ingestion speed in the current database
• High-velocity data input:
  • Each wagon is fitted with 16 sensors
  • Each sensor produces 25 records per second
  • Approximately 200 wagons in one train
  • At least 30 trains running at the same time
• Transaction management of the relational database
Too much time spent on searching
• Large volume of unstructured data
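As a rough worked example of the ingestion load implied by the figures on this slide, the sketch below simply multiplies them out. It assumes that every sensor on every wagon reports at the full 25 records per second at the same time, which the slides do not state explicitly; the constant names are illustrative only.

```python
# Back-of-the-envelope ingestion rate from the figures on this slide.
# Assumption: all sensors on all wagons report at the full rate simultaneously.
SENSORS_PER_WAGON = 16
RECORDS_PER_SENSOR_PER_SECOND = 25
WAGONS_PER_TRAIN = 200   # the slide says "approximately 200"
TRAINS_RUNNING = 30      # the slide says "at least 30", so this is a lower bound

records_per_wagon = SENSORS_PER_WAGON * RECORDS_PER_SENSOR_PER_SECOND
records_per_train = records_per_wagon * WAGONS_PER_TRAIN
records_all_trains = records_per_train * TRAINS_RUNNING

print(f"per wagon: {records_per_wagon:,} records/s")
print(f"per train: {records_per_train:,} records/s")
print(f"30 trains: {records_all_trains:,} records/s")
```

Under these assumptions a single train produces 80,000 records per second, the same figure quoted later in the deck as the industry requirement, and 30 trains together produce about 2.4 million records per second.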
6
THE CURRENT SYSTEM: WHAT DOES THE DATA LOOK LIKE?
Data Information:
• Twenty-one attributes, including train acceleration and geographic information (latitude and longitude)
• Missing track information
Solution:
• Append track information
Concept Used
• Geo Hashing Algorithm (Wolfson & Rigoutsos, 1997)
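The slides do not show how track information is derived from latitude and longitude. One possible reading is sketched below, assuming "geo hashing" means the standard public geohash encoding of coordinates into grid cells (the cited paper covers geometric hashing, so this interpretation is an assumption). The lookup table `track_by_cell`, the segment label and the sample coordinates are hypothetical.

```python
# Minimal geohash encoder (standard public scheme, not the project's code).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid   # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

# Hypothetical lookup table: geohash cell -> track segment label.
track_by_cell = {geohash_encode(-22.30, 118.70, 5): "segment-A"}

# Hypothetical sensor record that lacks track information.
reading = {"lat": -22.3001, "lon": 118.7001, "accR3": 4.7}
cell = geohash_encode(reading["lat"], reading["lon"], 5)
reading["track"] = track_by_cell.get(cell, "unknown")
print(cell, reading["track"])
```

Nearby points share a geohash prefix, so a record can be tagged with a track segment by a dictionary lookup on its cell instead of a distance search.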
7
RESEARCH QUESTION
“How to improve the performance of data ingestion into the database?”
“How to perform fast data retrieval in the IRT project?”
8
RELATED WORKS
• Using MongoDB to enhance the management of unstructured data (Stevic, Milosavljevic, & Perisic, 2015)
• Improvement of MongoDB Auto-Sharding (Liu, Wang, & Jin, 2012)
• Spark SQL (Armbrust et al., 2015)
Previous work (Benchmark Model)
• Given the infrastructure we have for processing, we have successfully processed 40,000 records per second.
• With the same infrastructure, based on the file system (CSV files provided by IRT), we have successfully retrieved results for 40 GB of data in less than 85 seconds.
9
RESEARCH AIM
Scalable Techniques for Parallel Data Acquisition and Retrieval of High-Velocity Data
10
WHY WE USE MONGODB?
• NoSQL document database
• Handles unstructured data well
• Improves storage capacity
11
HOW TO IMPLEMENT FAST DATA ACQUISITION?
Approaches taken:
• MongoDB Default Import Tool
  • Regular MongoDB
  • MongoDB Sharded Cluster
• Spark-MongoDB Application
12
MONGODB DEFAULT IMPORT TOOL
Command:
mongoimport --db RegularDB --collection railwayDataCollection --type csv --headerline --file /mnt/data/IRTRailwayData80K.csv
13
MONGODB SHARDING TECHNOLOGY
Sharded MongoDB Cluster
• Divides the data set and distributes the data over multiple shards.
• Each shard is an independent database.
14
PARALLEL DATA ACQUISITION IN SHARDED DATABASE
• Reads / Writes
• Storage Capacity
• High Availability
15
CHANGE DATA DISTRIBUTION IN SHARDED DATABASE
Hashed Sharding
sh.shardCollection( "<database>.<collection>", { <shard key> : "hashed" } )
Ranged Sharding
sh.shardCollection( "<database>.<collection>", { <shard key> : <direction> } )
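To illustrate why the choice between hashed and ranged sharding changes how an incoming batch is spread, here is a small simulation. It is not MongoDB's implementation: the MD5-based hash is only a stand-in for a hashed shard key, and the shard count, split points and key values are hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3  # hypothetical cluster size

def hashed_shard(key: int) -> int:
    # Stand-in for a hashed shard key: hash the key, then map it to a shard.
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def ranged_shard(key: int, split_points) -> int:
    # split_points are upper-exclusive chunk boundaries, in ascending order.
    for shard, upper in enumerate(split_points):
        if key < upper:
            return shard
    return len(split_points)

# A new batch whose keys keep increasing (for example a timestamp or counter).
new_batch = range(200_000, 201_000)
split_points = [100_000, 200_000]  # hypothetical existing chunk boundaries

ranged_counts = Counter(ranged_shard(k, split_points) for k in new_batch)
hashed_counts = Counter(hashed_shard(k) for k in new_batch)

print("ranged:", dict(ranged_counts))  # every record lands on the last shard
print("hashed:", dict(hashed_counts))  # records are spread over all shards
```

With a monotonically increasing key, ranged sharding sends the whole batch to the shard that owns the top range, while hashing spreads the writes. Range queries on the key benefit from the opposite property, since neighbouring keys stay together under ranged sharding.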
16
MONGOIMPORT RESULT ANALYSIS (1)
[Chart: Hashed Sharding vs Ranged Sharding. X-axis: number of records (40K, 80K, 160K, 320K). Y-axis: seconds (0.0 to 30.0). Series: Ranged Sharding, Hashed Sharding.]
17
MONGOIMPORT RESULT ANALYSIS (2)
[Chart: Sharded Database vs Regular Database. X-axis: number of records (40K, 80K, 160K, 320K). Y-axis: seconds (0.0 to 25.0). Series: Sharded MongoDB, MongoDB (Regular).]
18
MONGOIMPORT RESULT ANALYSIS (3)
• The bottleneck occurs in the first section.
• Compared with the sharding-enabled database, the regular database performs better in data acquisition.
• The acquisition result cannot meet the industry requirement (80,000 records per second).
19
SPARK-MONGODB APPLICATION FOR DATA ACQUISITION
1. Set up the Spark environment first.
2. Create 80,000 records as the input batch.
3. Store the batch into MongoDB.
20
SPARK-MONGODB APPLICATION RESULT
MongoImport:
• Sharded database: 4.3 s
• Regular database: 4.0 s
Spark program: 1.4 s
[Chart: Data inserting - Router (Master) (4 CPUs). X-axis: number of records. Y-axis: milliseconds (0 to 1600).]
Number of records    Milliseconds
40000                822
50000                1007
60000                1167
70000                1302
80000                1444
21
SPARK-MONGODB APPLICATION RESULT
New Record!!! 140000 records per second
[Chart: Data inserting - Server 16 CPUs. X-axis: number of records. Y-axis: milliseconds (0 to 1200).]
Number of records    Milliseconds
120000               860
130000               931
140000               1031
150000               1053
160000               1134
22
DATA RETRIEVAL APPROACH
• Database level
• Application level
23
PARALLEL DATA RETRIEVAL BY MONGODB
[Chart: Searching performance between sharded MongoDB and regular MongoDB. X-axis: number of records (labels: 2970000, 5940000, 8910000, 11880000; 196814, 393628, 590442, 787256). Y-axis: milliseconds (0 to 16000).]
Conclusion:
1. Sharded MongoDB performs faster searching than regular MongoDB.
2. It is hard to measure query execution time when the dataset is too big.
db.getCollection('Test').find({'accR3': { $gt: 4, $lt: 6 }}).explain('executionStats')
24
DATA RETRIEVAL BY SPARK SQL QUERY
• Create the Spark SQL object
• Register a temp table and run the search query
• Sample result
25
DATA RETRIEVAL BY SPARK FILTER OPERATION
• Perform searching by using the filter operation
26
SPARK-MONGODB APPLICATION RESULT
Data searching
Data size       Local Machine 2CPU/I5               Server 16CPUs - Regular Database
                Filter Search    Spark SQL query    Filter Search    Spark SQL query
4G (3.92G)      16881            877                5736             557/602
8G (7.85G)      52229            2012               13281            1753
12G (11.78G)    N/A              N/A                19556            3323
16G (15.70G)    N/A              N/A                31179            4518
40G (39.93G)    N/A              N/A                79893            8783
45G (44.83G)    N/A              N/A                86399            10883
Query
1. SELECT * FROM railwayData WHERE accR3 > 4 AND accR3 < 6
2. val result = readData.filter(readData("acc.r3") >= 4 && readData("acc.r3") <= 6)
27
HOW TO IMPROVE SEARCHING EFFICIENCY?
Approach 1
Approach 2 (Key-Value)
28
SPARK-MONGODB APPLICATION 2 RESULT
• Adopt a hash partitioner to partition the data and use mapPartitionsWithIndex to get the target partition.
• Perform searching in the target partition only.
• This narrows the search scope.
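The idea of searching only the target partition can be sketched without Spark. The plain-Python example below mimics a hash partitioner (hash of the key modulo the partition count) and then scans a single partition. It is an illustration of the technique, not the project's Spark code; the slides do not say which field serves as the key, so the wagon id key, the partition count and the sample readings are all hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_of(key: int) -> int:
    # Same idea as a hash partitioner: hash of the key modulo partition count.
    return hash(key) % NUM_PARTITIONS

# Hypothetical (key, value) records: key = wagon id, value = accR3 reading.
records = [(wagon, reading)
           for wagon in range(1, 201)
           for reading in (3.5, 4.5, 5.5, 6.5)]

# Partition once, up front, so later searches can skip unrelated partitions.
partitions = defaultdict(list)
for key, value in records:
    partitions[partition_of(key)].append((key, value))

def matches(record, key, low, high):
    k, v = record
    return k == key and low < v < high

def search_all(key, low, high):
    # Baseline: scan every record.
    return [r for r in records if matches(r, key, low, high)]

def search_target_partition(key, low, high):
    # Only the partition that can contain the key is scanned.
    target = partitions[partition_of(key)]
    return [r for r in target if matches(r, key, low, high)]

full = search_all(42, 4, 6)
narrowed = search_target_partition(42, 4, 6)
scanned = len(partitions[partition_of(42)])
print(narrowed, f"scanned {scanned} of {len(records)} records")
```

Both searches return the same rows, but the partitioned search touches only a quarter of the data here. The saving depends on the query including the partition key; a query that filters only on a non-key field would still have to visit every partition.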
29
SPARK-MONGODB APPLICATION 2 RESULT
[Chart: Compare the performance between two searching approaches. X-axis: number of testing (1, 2, 3). Y-axis: milliseconds (0 to 50000).]
Test    Search by Hash Partition    Search for all data
1       30724                       44737
2       29058                       42773
3       27440                       43200
30
CONCLUSION
• We have successfully created a system that is able to accept data as batches or streams.
• We solved the low data ingestion speed problem by writing a Spark program.
• We have successfully imported 140,000 records in one second into the MongoDB server.
• We performed searching with Spark SQL and executed an SQL query on 40 GB of data within 11 seconds.
31
FUTURE WORK
• How to measure MongoDB query execution time in a very large database.
• An efficient searching mechanism in sharded MongoDB using Spark.
32
REFERENCE
Wolfson, H. J., & Rigoutsos, I. (1997). Geometric hashing: An overview. IEEE Computational Science and Engineering, 4(4), 10-21.
Stevic, M. P., Milosavljevic, B., & Perisic, B. R. (2015). Enhancing the management of unstructured data in e-learning systems using MongoDB. Program, 49(1), 91-114.
Liu, Y., Wang, Y., & Jin, Y. (2012). Research on the improvement of MongoDB Auto-Sharding in cloud environment. Paper presented at the Computer Science & Education (ICCSE), 2012 7th International Conference on.
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., . . . Ghodsi, A. (2015). Spark SQL: Relational data processing in Spark. Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.
33
THE TEAM
Dr. Maria Indrawan-Santiago, Senior Lecturer, Faculty of IT
Prajwol Sangat, Research Assistant, Faculty of IT
Assoc Prof David Taniar, Associate Professor, Faculty of IT
Jingxuan Wei, Student, Faculty of IT
Subudh Sali, Student, Faculty of IT
34
Thank You
Questions?