Data Science @ The Search Party (Dr. Jan Luts)

1. Data Science @ The Search Party Jan Luts

2. Outline
  About myself
  The Search Party
  What is data science @ The Search Party?
  Deduplication of candidates
  Visualization of career paths
  Technology - Software
  Conclusion

4. About myself
  Master in Information Sciences, Universiteit Hasselt, Belgium
  Master in Bioinformatics, Katholieke Universiteit Leuven, Belgium
  Master in Statistics, Katholieke Universiteit Leuven, Belgium
  PhD and Postdoc in Engineering, Department of Electrical Engineering, Katholieke Universiteit Leuven (Sabine Van Huffel, Johan Suykens): predictive computer models, machine learning, decision support systems
  Postdoc, School of Mathematical Sciences, University of Technology Sydney, Australia (Matt Wand): mean field variational Bayes, semiparametric regression, streaming data, real-time analysis
  October 2013: Data Scientist, The Search Party, Sydney

6. The Search Party
  There are major forces acting on recruitment as an industry:
  Traditional recruitment model under pressure from technology
  Pressure on pricing damaging agency profitability
  Bulk of agency costs are people who drive revenue
  Global economic uncertainty
  Corporate investment in internal talent sourcing teams

7. We allow potential employers to search a vast ocean of the world's best candidates.
  We connect employers with the agencies who represent them to agree a fee and arrange an introduction.
  Supporting this evolution is the world's first marketplace for talent.

8. http://thesearchparty.com/

9-10. Employer

11-12. Recruiter

13. Employer

15-21. Data
  2 million candidates
  46 million skills
  14 million employment history records, for example:
    Concrete Formworker, Doran Contractors, 1999-2012
    Site Supervisor, Allied Gold, 1997-2000
    Java Developer, IBM, 2010-2011
  40000 vacancies
  29 industries, 384 subsectors, for example: Engineering; Accounting; Administration & Office Support; Advertising, Arts & Media; Banking & Financial Services; Call Centre & Customer Services; Community Services & Development; Construction; Consulting & Strategy; Design & Architecture; Education & Training
  75 GB marketplace logs, with events such as: Create Candidate, Publish Candidate, Forgot Password, Submit Candidate, Vote Up, Vote Down, Request Candidate, Appeared In Search Results, Account Login, Upload CV
  100 recruitment agencies

22. Data science @ The Search Party!
  Disciplines: statistics, machine learning, data mining, computer science, information retrieval, natural language processing
  Topics: testing hypotheses; design of experiments; cross-validation; training data vs. test data; performance measure; building a prediction model; regression; support vector machines; variable selection; sensitivity, specificity; cost and benefit; clustering; topic modeling; distributed computing; programming; software engineering; data structures; term frequency - inverse document frequency; entity resolution; sentence detection; tokenization; sentiment analysis; part-of-speech tagging

24. Deduplication of candidates
  Diagram labels: Recruiter 1, Recruiter 2, Recruiter 3, The Search Party Database

25. Employer

26-28. Deduplication of candidates (Figure from Lise Getoor)

29. Clustering
  Entity resolution does not happen independently for each pair of candidates
  Number of clusters is unknown
  Many, many small (possibly singleton) clusters

30. Correlation clustering
  Take a pairwise similarity graph as input.
  Each edge has a binary variable in {0,1} that equals 1 if candidates i and j are assigned to the same cluster.
  Each edge also carries the belief that candidates i and j are the same.
  Optimize / Define: [formulas lost in slide extraction]

31. Correlation clustering
  Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing (ILP '09). Association for Computational Linguistics, Stroudsburg, PA, USA, 19-27.

32. Pairwise similarity matrix
  We need a measure that quantifies the similarity between candidates:
  Candidate 1: Jan Luts, [email protected], KULeuven, UTS
  Candidate 2: Jan Luts, [email protected], KULeuven, UTS
  Candidate 3: Jam Lutf, [email protected]
  Candidate 4: J Luts, KULeuven
  Candidate 5: Ian Luts, [email protected], KULeuven, UTS, TSP
  Candidate 6: Jan Luts, [email protected], UTS, TSP

33. Term frequency - inverse document frequency
  Terms        jan.  an.m  n.m.  luts  uts@  mail  gmai  .com  @hot  jan_
  Candidate 1  1     1     1     1     1     1     1     1     0     0
  Candidate 2  1     1     1     1     1     1     1     1     0     0
  Candidate 3  1     1     1     1     1     1     1     1     0     0
  Candidate 4  0     0     0     0     0     0     0     0     0     0
  Candidate 5  1     1     1     1     1     0     1     1     0     0
  Candidate 6  0     0     0     1     1     1     0     1     1     1
  These are called term frequencies.
  Inverse document frequency for .com: log(6/5)
  TF-IDF for .com for candidate 6: 1 * log(6/5) = 0.18
  TF-IDF for jan_ for candidate 6: 1 * log(6/1) = 1.79

34. Pairwise similarity matrix
  Combine cosine similarity values for name, email address, phone number, mobile number, skills, employment history, ...
          Cand 1  Cand 2  Cand 3  Cand 4  Cand 5  Cand 6
  Cand 1  1       1       0.8     0.9     0.95    0.75
  Cand 2          1       0.8     0.9     0.95    0.75
  Cand 3                  1       0.6     0.87    0.7
  Cand 4                          1       0.75    0.7
  Cand 5                                  1       0.8
  Cand 6                                          1
  Next step: correlation clustering

35. Correlation clustering
  Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for correlation clustering beyond ILP (full reference on slide 31).
  O(n^2): does not scale with an increasing number of candidates!

36. Big Data
  Big Data criticism:
    "You May Not Need Big Data After All", HBR, December 2013
    "Google Flu Trends: The Limits of Big Data", NYT, March 2014
    "Big data: are we making a big mistake?", FT Magazine, March 2014
    "The backlash against big data", The Economist, April 2014
  @ The Search Party:
    Sampling can help sometimes, but not always
    We have a lot of data; this creates new problems and we just have to deal with them
    We need the right tools and algorithms to process millions of data points

37. Deduplication of candidates
  So how can we do correlation clustering on millions of candidates?
  o Blocking: e.g. split the data set into separate blocks based on gender, geographical location, ...
  o Canopy clustering: a pre-clustering algorithm used as a preprocessing step:
      Use a cheap distance measure to partition the data into overlapping subsets (i.e. canopies)
      Run expensive clustering on each canopy

38. Canopy clustering
  Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). ACM, New York, NY, USA, 169-178.
  Start with a list of the candidates in any order, and with two distance thresholds, T1 and T2, where T1 > T2.
  Pick a candidate from the list, make it a canopy center and approximately measure its distance to all other candidates.
  Put all candidates that are within distance threshold T1 into a canopy.
  Remove from the list all candidates that are within distance threshold T2.
  Repeat until the list is empty.

39. Canopy clustering
  Five canopies found
  Do correlation clustering on each canopy

40. Deduplication of candidates
  Strategy outline:
    Do canopy clustering using TF-IDFs
    Do expensive correlation clustering for each canopy using a similarity matrix based on all available candidate information (e.g. name, email, phone, mobile, employment history, publications, certificates, ...)
    We need to do < 0.005 of all possible pairwise comparisons
  Optimization:
    Parallelization of TF-IDF computation and canopy clustering
    Run correlation clustering in parallel for each canopy

41. Large-scale data processing:
  Open-source software framework for distributed computing
  MapReduce programming model
  Resilient to failure

42. How to do canopy clustering on Hadoop?
  Two steps:
    Canopy generation: identify the canopy centers
    Canopy filling: assign candidates to canopies

43. Canopy generation on Hadoop
  Input: Candidates Batch 1, Candidates Batch 2, Candidates Batch 3, Candidates Batch 4
  Map:
    Initialize: centers1 = {}, centers2 = {}, centers3 = {}, centers4 = {}
    For each batch in parallel:
      if, for all i, distance(candidate x, center i) > T2:
        output the pair (intermediateCenter, candidate x)
  Intermediate result: Intermediate Centers
  Reduce:
    Initialize: finalCenters = {}
    If, for all i, distance(intermediateCenter x, finalCenter i) > T2:
      output the pair (finalCenter, intermediateCenter x)

44. Canopy filling on Hadoop
  Input: Candidates Batch 1, Candidates Batch 2, Candidates Batch 3, Candidates Batch 4
  Map:
    Retrieve canopyCenters from canopy generation job
    For each batch in parallel:
      for all i, if distance(candidate x, center i) < T1:
        output the pair (center i, candidate x)
  Intermediate result: Center-Candidate Batch 1, Center-Candidate Batch 2, Center-Candidate Batch 3
  Reduce:
    For each batch: output the list of all candidates belonging to the same canopy with center i

45. Deduplication of candidates - Summary
  Our dedupe pipeline is a blend of concepts from information retrieval (TF-IDF), statistics and machine learning (correlation clustering)
  Applying it to large data sets causes new problems and requires redesigning/adjusting the algorithms (canopy clustering, distributed computing, Hadoop)
  Integration in the existing platform:
  o How do data get in and out of the dedupe pipeline?
  o Making it work in a production environment: fail-safe code (in case of failure, handle it in a safe way)

47. Visualization of career paths
  14 million employment history records:
    Longitudinal data: transitions between different jobs
    Available data: job titles, employer, full description, skills, start dates, end dates, different versions of CV

48. Visualization of career paths
  Visualize transitions between jobs based on job title.
  Graph nodes: network consultant, senior network consultant, technical project manager, senior network engineer, technical consultant, network analyst, network manager, consultant, network engineer, network architect, project manager, IT manager
  Edge weights shown: .05, .04, .04, .11, .10, .12, .10, .09, .06, .08, .18

49. Visualization of career paths
  Demo

51. Technology - Software

53. Conclusion
  Innovative work in a challenging environment
  Variety: understanding business problems, literature review, algorithm design, prototyping, evaluation, implementation, optimization
  Data science: statistics has a very important role to play
  Software engineering skills
  Big data: large data sets cause new problems
  Team work
  Passion!

54. Thanks!
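
The TF-IDF weighting on slide 33 and the cosine similarity mentioned on slide 34 can be sketched roughly as follows. This is a minimal illustration, not The Search Party's implementation. It assumes character 4-grams as terms (the slide's terms such as "jan." and ".com" are four characters long) and the natural logarithm (which reproduces the slide's 0.18 and 1.79). The function names and the example email strings are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Split a string into overlapping character n-grams (assumed term definition)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def idf(n_docs, doc_freq):
    """Inverse document frequency as on slide 33: log(N / df), natural log."""
    return math.log(n_docs / doc_freq)


def tfidf_vectors(docs, n=4):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    term_counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each document counts once per term
    return [
        {term: tf * idf(len(docs), doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical email strings, invented for this sketch.
    emails = [
        "jan.luts@example.com",
        "jan.luts@example.com",
        "jan_luts@example.org",
        "someone.else@example.net",
    ]
    vectors = tfidf_vectors(emails)
    for j in range(1, len(emails)):
        print(emails[0], "vs", emails[j], "->", round(cosine(vectors[0], vectors[j]), 3))
```

In a pipeline like the one described, one such similarity would be computed per field (name, email, phone, and so on) and the values combined into the pairwise similarity matrix; how they are combined is not stated on the slides.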

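The canopy clustering steps on slide 38 can be sketched on a single machine as below. This is a hedged illustration under stated assumptions: the function name `canopy_clustering`, the use of "less than or equal" for "within", and the one-dimensional toy data are choices made for this sketch, and distances are measured against all candidates so that canopies may overlap (as in the canopy filling step on slide 44). It is not the Hadoop implementation described in the talk.

```python
def canopy_clustering(points, distance, t1, t2):
    """Group points into overlapping canopies using a cheap distance measure.

    points   -- list of items to cluster
    distance -- cheap distance function taking two items
    t1, t2   -- loose and tight thresholds, with t1 > t2
    Returns a list of (center_index, member_indices) tuples.
    """
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")

    remaining = list(range(len(points)))  # candidates still eligible as centers
    canopies = []
    while remaining:
        center = remaining[0]  # pick a candidate from the list
        dists = [distance(points[center], p) for p in points]
        # Everything within T1 of the center joins this canopy.
        members = [i for i, d in enumerate(dists) if d <= t1]
        canopies.append((center, members))
        # Everything within T2 can no longer become a center.
        # The center itself has distance 0, so the list always shrinks.
        remaining = [i for i in remaining if dists[i] > t2]
    return canopies


if __name__ == "__main__":
    # Toy 1-D data, invented for this sketch.
    values = [0.0, 0.5, 1.5, 3.0, 8.0, 8.4]
    for center, members in canopy_clustering(values, lambda a, b: abs(a - b), 2.0, 1.0):
        print("center", values[center], "members", [values[i] for i in members])
```

The expensive step (correlation clustering in the talk) would then run separately on the members of each canopy, which is what keeps the number of pairwise comparisons small.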