Driving Behaviour as a Telematic Fingerprint

NASDAG.org Data Science in the Automotive Industry

Upload: philippe-n-dagher

Post on 12-Aug-2015

TRANSCRIPT

Page 1: Driving Behaviour as a Telematic Fingerprint

NASDAG.org

Data Science in the Automotive Industry

Page 2: Driving Behaviour as a Telematic Fingerprint

I am an Automotive Management Professional and a Computer Science Engineer from France, with extensive experience in managing complex projects in Supply Chain and IT, as well as starting, developing and acquiring businesses in France, Russia, the USA and the Middle East.

I came to Metis to understand, learn and practice how data science is transforming the Automotive Business. During my projects, I focused on:
● Sentiment Analysis / Topic Modeling
● Predictive Behavior Modeling
● Driver Telematics

Philippe Dagher

Page 3: Driving Behaviour as a Telematic Fingerprint

Objective:

Categorize drivers based on their behaviour on the roads - their driving style and the type of roads that they follow.

Challenge:

Uniquely identify a driver (and hence his or her own “driving behaviour”) from the GPS log of a mobile phone located inside the car.

Idea:

Experiment with Topic Modeling techniques, especially Latent Semantic Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), to explain the observed trips by the unobserved behaviour of drivers.

Final Project @ Metis

Page 4: Driving Behaviour as a Telematic Fingerprint

Raw data for one trip

Page 5: Driving Behaviour as a Telematic Fingerprint

Machine learning approach (1/2)

❖ Preprocess the data using statistical smoothing and compression algorithms
➢ Kalman Filtering
➢ Ramer–Douglas–Peucker

❖ Extract road and driving style features
➢ per Segment: Length, Slip Angle, Convexity, Radius
➢ per Meter: Speed, Accelerations (tangential and normal), Jerk, Yaw, Pauses

❖ Bin the output and generate the Driving Alphabet
➢ ex: d0, d1, d2… v0, v1, v2… a0, a1, a2… etc.

❖ Build the Driving Vocabulary - “Driving Slides” per meter
➢ ex: d3L4v2n3y1
➢ for various preprocessing sensitivities or feature combinations (languages)

❖ Translate trips from GPS log into documents
➢ Tokenize, filter, … data is ready!
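The binning and vocabulary steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the bin edges, the one-second sampling interval, the restriction to two features (speed `v` and tangential acceleration `a`) and the function names are all hypothetical, and the sketch works per GPS sample rather than per meter. The slides do not give the actual edges or the full feature set.

```python
import math

def bin_index(value, edges):
    """Return the index of the first edge that value falls below (len(edges) if none)."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def trip_to_words(points, dt=1.0,
                  speed_edges=(2, 5, 10, 15, 20, 30),   # assumed bin edges, m/s
                  accel_edges=(0.5, 1.0, 2.0)):         # assumed bin edges, m/s^2
    """Translate (x, y) positions in metres, sampled every dt seconds, into words like 'v2a1'."""
    # Speed between consecutive fixes (m/s).
    speeds = [math.hypot(x1 - x0, y1 - y0) / dt
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    words = []
    for s0, s1 in zip(speeds, speeds[1:]):
        accel = abs(s1 - s0) / dt  # magnitude of the tangential acceleration
        # One letter per binned feature, concatenated into a single word.
        words.append(f"v{bin_index(s1, speed_edges)}a{bin_index(accel, accel_edges)}")
    return words

points = [(0, 0), (1, 0), (4, 0), (10, 0), (16, 0)]
print(trip_to_words(points))  # ['v1a3', 'v2a3', 'v2a0']
```

Each returned list plays the role of one document; adding further binned features (normal acceleration, jerk, yaw, pauses) would simply append more letter-and-index pairs to each word.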

Page 6: Driving Behaviour as a Telematic Fingerprint

d1L6Br1 d1L8Sr1 d1L5Sr2 d1L6Ur2 d2L8Ur2 d3L4Sr3 d2L5Ur3 d3L4Ur4 d3L6Sr4 d3L7Sr3 d4L4Ur5 d4L3Ur5 d4L2Ur7 d5L4Sr6 d3L3Ur5 d4L3Sr6 d5L4Ur6 d4L3Ur7 d5L9Sr5 d2L5Ur4 d3L2Ur7 d6L1Sr9 d5L0Sr9 d5L1Sr9 d5L7Ur5 d2L6Ur2 d2L3Ur5 d4L1Ur8 d5L2Ur7 d6L10Sr5 d6L8Sr5 d2L4Ur3 d3L3Ur6 d5L4Srp1 v2a6n0j0y0p1 v1a6n0j3y0p1 v1a1n0j6y0p1 v1a11n0j6y0p1 v1a7n0j11y0p1 v1a16n0j7y0p1 v2a7n0j1y0p1 v2a6n0j2y0p1 v2a10n0j2y0p1 v3a6n1j3y0p1 v3a2n2j3y0p1 v3a5n2j3y0p1 v4a2n2j3y1p1 v4a5n2j5y1p1 v4a5n3j5y1p1 v4a4n3j1y1p1 v4a6n3j6y1p1 v4a5n4j5y1p1 v4a4n3j6y1p1 v4a5n4j0y1p1 v4a5n3j6y1p1 v4a5n2j9y1p1 v4a11n3j7y1p1 v3a2n2j7y0p1 v3a12n2j7y0p1 v2a1n1j3y0p1 v2a5n1j9y0p1 v2a11n1j9y0p1 v3a6n1j7y0p1 v3a5n1j7y0p1 v3a6n2j6y0p1 v3a6n1j34y0p1 v3a62n2j71y0p1 v8a56n11j38y2p1 v4a13n3j7y1p1 v4a4n3j4y1p1 v4a5n3j6y1p1 v4a4n2j6y1p1 v4a6n3j1y1p1 v3a5n2j2y0p1 v3a3n2j6y0p1 v3a11n1j4y0p1 v2a8n1j0y0p1 v2a7n1j7y0p1 v2a17n1j1y0p1 v2a10p1 v6a0n3j4y0p1 v6a6n3j7y0p1 v6a6n3j3y0p1 v6a1n3j3y0p1 v6a6n3j3y0p1 v6a5n2j1y0p1 v5a6n2j4y0p1 v5a6n2j3y0p1 v5a12n1j2y0p1 v4a9n1j0y0p1 v3a9n1j2y0p1 v3a5n0j3y0p1 v3a1n0j6y0p1 v3a11n0j6y0p1 v3a0n1j3y0p1 v3a6n1j0y0p1 v3a5n1j3y0p1 v3a11n0j6y0p1 v4a1n0j4y0p1 v4a6n0j3y0p1 v4a2n0j7y0p1 v4a13n0j11y0p1 v5a7n0j4y0p1 v5a1n0j0y0p1 v5a1n0j3y0p1 v5a6n0j6y0p1 v5a6n0j2y0p1 v5a2n0j7y0p1 v6a11n0j10y0p1 v6a6n0j3y0p1 v6a0n0j3y0p1 v6a5n0j6y0p1 v6a5n0j2y0p1 v6a1n0j1y0p1 v6a0n0j3y0p1 v6a6n0j7y0p1 v6a6n0j7y0p1 v6a6n0j7y0p1 v6a6n0j3y0p1 v6a0n0j2y0p1 v6a5n0j6y0p1 v6a5n0j7y0p1 v6a6n0j4y0p1 v6a0n1j3y1j3y0p1 v6a6n1j6y0p1 v6a5n1j2y0p1 v7a1n1j4y0p1 v5a3n1j1y0p1 v5a6n1j3y0p1 v5a10n1j3y0p1 v4a8n0j0y0p1 v3a8n0j0y0p1 v3a8n0j3y0p1 v2a10n0j1y0p1 v2a7n0j3y0p1 v2a6n0j7y0p1 v3a7n0j3y0p1 v2a7n0j6y0p1 v3a14n0j7y0p1 v3a4n0j4y0p1 v3a2n0j6y0p1 v3a12n0j3y0p1 v3a8n0j2y0p1 v3a5n0j0y0p1 v3a6n0j4y0p1 v4a1n0j3y0p1 v4a5n0j2y0p1 v4a1n0j0y0p1 v4a0n0j0y0p1 v4a0n0j0y0p1 v4a0n0j0y0p2 v4a1n0j3y0p1 v4a6n0j7y0p1 v4a6n0j10y0p1 v4a11n0j6y0p1 v3a2n0j0y0p1 v3a1n0j3y0p1 v3a6n0j0y0p1 v3a6n0j0y0p1 v2a5n0j2y0p1 v2a3n0j5y0p1 v2a10n0j5y0p1 v1a2n0j0y0p1 v1a1n0j3y0p1 
v1a5n0j10y0p1 v1a11n0j7y0p1 v1a3n0j7y0p1 v1a12n0j7y0p1 v2a3n0j1y0p1 v2a1n0j6y0p1 v2a11n0j10y0p1 v3a6n0j10y0p1 v3a12n0j7y0p1 v4a1n0j3y0p1 v4a5n0j10y0p1 v3a11n0j6y0p1 v4a2n0j3y0p1 v4a6n0j3y0p1 v5a0n0j7y0p1 v5a12n0j8y0p1 v5a4n0j4y0p1 v5a2n3j3y0p1 v5a3n3j4y0p1 v5a6n3j7y0p1 v5a6n3j5y0p1 v5a4n3j2y0p1 v5a1n3j3y0p1 v5a6n3j2y0p1 v5a1n2j4y0p1 v5a6n2j3y0p1 v5a2n3j4y0p1 v5a6n3j2y0p1 v5a6n2j3y0p1 v4a0n2j1y0p1 v4a2n2j1y0p1 v4a0n2j4y0p1 v4a6n2j7y0p1 v5a6n2j4y0p1 v4a5n2j0y0p1 v4a5n2j2y0p1 v4a9n2j2y0p1 v5a5n2j3y0p1 v5a9n3j1y0p1 v5a9n3j1y0p1 v5a7n1j2y0p1 d6L1v5n0y0 d6L1v4n0y0 d6L1v4n0y0 d6L1v5n0y0 d6L1v4n0y0 d6L1v4n0y0 d5L0v4n0y0 d5L0v4n0y0 d5L0v5n0y0 d5L0v4n0y0 d5L0v4n0y0 d5L0v4n0y0 d5L0v3n0y0 d5L0v3n0y0 d5L0v2n0y0 d5L0v2n0y0 d5L0v2n0y0 d5L0v2n0y0 d5L0v3n0y0 d5L0v2n0y0 d5L0v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1vy1 d5L7v4n4y1 d5L7v4n3y1 d5L7v0n0y0 d5L7v0n0y0 d5L7v0n0y0 d5L7v1n0y0 d2L6v1n6y5 d2L6v2n8y6 d2L3v2n0y0 d2L3v2n0y0 d4L1v3n0y0 d4L1v3n0y0 d4L1v3n0y0 d4L1v4n0y0 d4L1v4n0y0 d4L1v4n0y0 d4L1v4n0y0 d4L1v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v5n0y0 d5L2v5n0y0 d5L2v5n0y0 d5L2v5n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v5n0y0 d5L2v4n0y0 d5L2v5n0y0 d5L2v4n0y0 d d6L10v3n2y0 d6L10v4n2y0 d6L10v3n1y0 d6L10v3n1y0 d6L10v2n1y0 d6L10v2n1y0 d6L10v1n0y0 d6L10v2n0y0 d6L10v1n0y0 d6L10v1n0y0 d6L10v2n0y0 d6L10v1n0y0 d6L10v1n0y0 d6L10v1n0y0 d6L8v1n0y0 d6L8v1n0y0 d6L8v2n0y0 d6L8v2n0y0 d6L8v2n0y0 d6L8v3n1y0 d6L8v3n2y0 d6L8v3n2y0 d6L8v4n2y1 d6L8v4n2y1 d6L8v4n3y1 d6L8v4n3y1 d6L8v4n3y1 d6L8v4n4y1 d6L8v4n3y1 d6L8v4n4y1 d6L8v4n3y1 d6L8v4n2y1 d6L8v4n3y1 d6L8v3n2y0 d6L8v3n2y0 d6L8v2n1y0 d6L8v2n1y0 d6L8v2n1y0 d6L8v3n1y0 d2L5v1n3y2 d2L5v1n2y2 d3L5v1n2y1 d3L5v2n3y2 d3L5v2n4y2 d3L5v2n6y3 d3L5v2n2y1 d3L5v2n2y1 d3L5v3n4y2 d4L6v2n5y3 d4L6v2n6y3 d4L6v3n8y3 d4L6v3n7y3 d4L6v3n7y3 d4L6v2n6y3 d4L6v2n4y2 d4L6v2n3y2 d2L6v1n12y11 d2L6v1n10y10 d1L1v1n0y0 d3L3v1n1y1 d3L3v1n1y0 d3L3v1n0y0 d3L3v1n0y0 
d3L3v1n0y0 d2L8v0n3y6

Example of a translated trip

Page 7: Driving Behaviour as a Telematic Fingerprint

LDA: Bayesian Topic Model

Per-trip “Driving Behaviour” proportions: for each trip, select a distribution of “Driving Behaviours”

Dirichlet parameter (corpus): possible “Driving Behaviour” distributions for trips

Per-“Driving Slide” “Driving Behaviour” assignment: for each “Driving Slide”, select a “Driving Behaviour”

Observed “Driving Slide”: select the actual “Driving Slide” from the selected “Driving Behaviour”

“Driving Behaviours”: each “Driving Behaviour” is a distribution of “Driving Slides”

“Driving Behaviour” hyperparameter: possible “Driving Slide” distributions for “Driving Behaviours”
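The generative story described by these labels can be simulated in a few lines of NumPy. This is a toy sketch, not the author's code: the vocabulary, the number of behaviours and the values of `alpha` and `eta` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["v1a0", "v2a1", "v4a5", "d5L0", "d6L8"]  # hypothetical "Driving Slides"
n_behaviours = 3                                  # assumed number of "Driving Behaviours"
alpha = np.full(n_behaviours, 0.5)  # Dirichlet parameter over per-trip proportions
eta = np.full(len(vocab), 0.5)      # hyperparameter over slide distributions

# Each "Driving Behaviour" is a distribution over "Driving Slides".
beta = rng.dirichlet(eta, size=n_behaviours)

def generate_trip(n_slides):
    """Sample one trip: behaviour proportions, per-slide assignments, observed slides."""
    theta = rng.dirichlet(alpha)                          # per-trip behaviour proportions
    z = rng.choice(n_behaviours, size=n_slides, p=theta)  # per-slide behaviour assignment
    words = [vocab[rng.choice(len(vocab), p=beta[k])] for k in z]  # observed slides
    return theta, z, words

theta, z, words = generate_trip(10)
print(theta.round(2), z, words)
```

Fitting LDA runs this story in reverse: only `words` is observed, and `theta`, `z` and `beta` have to be inferred.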


Page 8: Driving Behaviour as a Telematic Fingerprint

Posterior Inference in LDA

❖ Goal is to obtain this posterior:
➢ How much of “Driving Behaviour” k a trip contains (θ), and
➢ the “Driving Behaviour” assignments z of its “Driving Slides”

❖ Which means that I need to calculate:

❖ GENSIM Library
➢ a Python+NumPy implementation of online LDA for inputs larger than the available RAM
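The formulas on this slide are not present in the transcript. As an assumption, using the standard LDA notation (θ for per-trip “Driving Behaviour” proportions, z for per-slide assignments, β for the “Driving Behaviours”, w for the observed “Driving Slides”, α and η for the two Dirichlet hyperparameters), the posterior would read:

```latex
p(\theta, z, \beta \mid w, \alpha, \eta)
  = \frac{p(\theta, z, \beta, w \mid \alpha, \eta)}{p(w \mid \alpha, \eta)}
```

The denominator, the marginal likelihood of the observed slides, cannot be computed exactly because it sums over every possible assignment z, which is why an approximate method such as the online variational inference implemented in Gensim is used.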

Page 9: Driving Behaviour as a Telematic Fingerprint

Example trip in the new LDA space

Page 10: Driving Behaviour as a Telematic Fingerprint

❖ 2736 drivers
❖ 200 trips/driver
Total: 547200 csv files (5.92 GB)

Challenge:

To come up with a "telematic fingerprint" capable of telling whether a trip was driven by a given driver, knowing that among the 200 trips provided for each driver, a small number were not driven by him/her.

Submissions are judged on area under the ROC curve calculated in a global manner (all predictions together).
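To make the "global" scoring concrete, here is a small pure-Python sketch of the area under the ROC curve computed over pooled predictions, using the rank-sum (Mann-Whitney) identity. The function name and the toy labels and scores are illustrative assumptions, not competition data.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum identity; tied scores share their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Predictions for two hypothetical drivers, pooled before scoring.
driver_a = ([1, 1, 0], [0.9, 0.8, 0.7])
driver_b = ([1, 0, 0], [0.6, 0.4, 0.2])
labels = driver_a[0] + driver_b[0]
scores = driver_a[1] + driver_b[1]
print(roc_auc(labels, scores))  # 8/9, about 0.889
```

Because all predictions are ranked together, the scores have to be comparable across drivers: each driver here is perfectly separated on their own, yet the pooled AUC is below 1 because one of driver A's false trips outscores driver B's true trip.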

Validation on a Kaggle Competition

Page 11: Driving Behaviour as a Telematic Fingerprint

❖ Transform all trips into the new Driving Behaviours Space
❖ Take the trips of a selected Driver one by one
❖ Build a prediction model trained on all other trips in the dataset, labelled:
➢ True if they belong to the selected Driver
➢ False if they do not belong to this Driver

❖ Use the trained model to predict whether the selected Trip belongs to the Driver, then ensemble several predictions made with various sensitivities to improve the score...

For performance reasons, I proceed in batches of 10 or 20 selected trips and compare each batch with a limited number of randomly selected False trips.

Other outlier detection / clustering techniques appear to perform less well.
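A minimal sketch of how such batches could be assembled with NumPy is given below. The function name, the default sizes and the random feature matrix are assumptions for illustration; the slides do not say which classifier is then trained on each batch.

```python
import numpy as np

def make_batches(X, driver_ids, driver, batch_size=20, n_false=200, seed=0):
    """Yield (held_out_idx, X_train, y_train) for each batch of one driver's trips.

    Held-out trips are the ones to be scored; the training set holds the
    driver's remaining trips (label 1) plus a random sample of other
    drivers' trips (label 0).
    """
    rng = np.random.default_rng(seed)
    own = np.flatnonzero(driver_ids == driver)
    others = np.flatnonzero(driver_ids != driver)
    for start in range(0, len(own), batch_size):
        held_out = own[start:start + batch_size]
        train_true = np.setdiff1d(own, held_out)
        train_false = rng.choice(others, size=min(n_false, len(others)), replace=False)
        idx = np.concatenate([train_true, train_false])
        y = np.concatenate([np.ones(len(train_true)), np.zeros(len(train_false))])
        yield held_out, X[idx], y

# Toy data: 3 drivers with 20 trips each, 4 "Driving Behaviour" proportions per trip.
X = np.random.default_rng(1).random((60, 4))
driver_ids = np.repeat([0, 1, 2], 20)
batches = list(make_batches(X, driver_ids, driver=0, batch_size=10, n_false=15))
print(len(batches), batches[0][1].shape)  # 2 (25, 4)
```

Any binary classifier can then be fitted on `X_train, y_train` and used to score `X[held_out]`; repeating this over all batches gives one prediction per trip of the selected driver.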

Machine learning approach (2/2)

Page 12: Driving Behaviour as a Telematic Fingerprint

MongoDB to hold the 3.3 million documents generated
Parallel processing setup on 4 DigitalOcean Droplets with 8 CPUs each

Gensim Library, which implements three methods:
❖ latent semantic indexing (LSI, or LSA - A for Analysis)
❖ latent Dirichlet allocation (LDA)
❖ random projections (RP)
Also, it implements online versions of each technique.

Setting the infrastructure

Page 13: Driving Behaviour as a Telematic Fingerprint

Predicting

❖ Achieving an AUC of 0.9 on Kaggle without any ensembling technique, which confirms the robustness of my approach...

Page 14: Driving Behaviour as a Telematic Fingerprint

Thank you

http://nasdag.org