gdg devfest central italy 2013 1. 2 joint work with j. feldman, s. lattanzi, v. mirrokni (google...



Upload: chester-johnson

Post on 26-Dec-2015


TRANSCRIPT

  • Slide 1
  • GDG DevFest Central Italy 2013
  • Slide 2
  • Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.
  • Slide 3
  • The AdWords Problem
  • Slide 4
  • ?
  • Slide 5
  • ?
  • Slide 6
  • Soccer Shoes
  • Slide 7
  • The AdWords Problem Soccer Shoes
  • Slide 8
  • Google Advertisement in Numbers: Over a billion queries a day. A lot of advertisers. Source: www.google.com/competition/howgooglesearchworks.html
  • Slide 9
  • Challenges: Several scientific and technological challenges. How to find the best ads in real time? How to price each ad? How to suggest new queries to advertisers? The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).
  • Slide 10
  • Google Advertisement in Numbers: 2012 revenues: 46 billion USD. Advertisement accounts for 95%: 43 billion USD. Source: http://investor.google.com/financial/tables.html
  • Slide 11
  • Goals of the Project: Mine AdWords data to automatically identify, for each advertiser, its main competitors, and to suggest relevant queries to each advertiser. Goals: useful business information; improved advertisement; more relevant performance benchmarks.
  • Slide 12
  • Information Deluge: Large advertisers (e.g. Amazon, Ask.com, etc.) compete in several market segments with very different advertisers. Query information: "Nike store New York" (Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks); "Soccer shoes" (Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks); "Soccer ball" (Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks); ... millions of other queries ...
  • Slide 13
  • Representing the data: How to represent the salient features of the data? Relationships between advertisers and queries. Statistics: clicks, costs, etc. Take into account the categories. Efficient algorithms.
  • Slide 14
  • Graphs: the lingua franca of Big Data. Mathematical objects studied well before the history of computers: the Königsberg bridges problem (Euler, 1735).
  • Slide 15
  • Graphs: the lingua franca of Big Data. Graphs are everywhere! Social networks, technological networks, natural networks.
  • Slide 16
  • Graphs: the lingua franca of Big Data. Formal definition: a set of nodes (A, B, C, D).
  • Slide 17
  • Graphs: the lingua franca of Big Data. Formal definition: a set of nodes (A, B, C, D) and a set of edges.
  • Slide 18
  • Graphs: the lingua franca of Big Data. Formal definition: the edges might have a weight (1, 4, 2, 3 in the figure).
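As a minimal sketch of the formal definition, a weighted graph can be stored as a set of nodes plus a dictionary of edge weights. The node names A to D and the weights 1 to 4 come from the slide; which weight sits on which edge cannot be recovered from the transcript, so the pairing below is an assumption.

```python
# Nodes from the slide.
nodes = {"A", "B", "C", "D"}

# Undirected weighted edges. The edge/weight pairing is assumed for illustration.
edges = {
    ("A", "B"): 1,
    ("B", "C"): 4,
    ("C", "D"): 2,
    ("A", "D"): 3,
}


def neighbors(node):
    """Return {neighbor: weight} for `node` in the undirected graph."""
    result = {}
    for (u, v), weight in edges.items():
        if u == node:
            result[v] = weight
        elif v == node:
            result[u] = weight
    return result


print(neighbors("A"))
```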
  • Slide 19
  • AdWords data as a (Bipartite) Graph: a lot of advertisers, billions of queries, hundreds of labels.
  • Slide 20
  • Semi-Formal Problem Definition Advertisers Queries
  • Slide 21
  • Semi-Formal Problem Definition A Advertisers Queries
  • Slide 22
  • Semi-Formal Problem Definition A Advertisers Queries Labels:
  • Slide 23
  • Semi-Formal Problem Definition A Advertisers Queries Labels:
  • Slide 24
  • Semi-Formal Problem Definition: Advertisers, queries and labels. Goal: find the nodes most similar to A.
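The problem setup can be sketched as a list of advertiser-query edges in which every query carries a label. This is only an illustration of the data shape: the advertiser names B and C, the click counts attached to them, and the helper names `restrict` and `co_bidders` are hypothetical, and "sharing a query" is a much cruder notion of similarity than the one the talk goes on to use.

```python
from collections import defaultdict

# (advertiser, query, label, clicks). Advertisers B and C are hypothetical.
EDGES = [
    ("A", "soccer shoes", "Apparel", 4),
    ("A", "soccer ball", "Equipment", 5),
    ("B", "soccer shoes", "Apparel", 7),
    ("C", "soccer ball", "Equipment", 2),
    ("C", "nike store new york", "Retailer", 10),
]


def restrict(edges, labels):
    """Keep only the edges whose query label is in the chosen subset."""
    return [edge for edge in edges if edge[2] in labels]


def co_bidders(edges, advertiser):
    """Advertisers that share at least one query with `advertiser`."""
    by_query = defaultdict(set)
    for adv, query, _label, _clicks in edges:
        by_query[query].add(adv)
    shared = set()
    for advertisers in by_query.values():
        if advertiser in advertisers:
            shared |= advertisers
    shared.discard(advertiser)
    return shared


print(co_bidders(EDGES, "A"))                         # B and C
print(co_bidders(restrict(EDGES, {"Apparel"}), "A"))  # only B
```

Restricting the edge list to a subset of labels before searching mirrors the "Labels" part of the problem statement: the answer for A depends on which categories are selected.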
  • Slide 25
  • How to Define Similarity? There are several node similarity measures in the literature, based on the graph structure, random walks, etc. What is the accuracy? Can it scale to graphs with billions of nodes? Can it be computed in real time?
  • Slide 26
  • The three ingredients of Big Data: A lot of data. A sophisticated infrastructure: MapReduce. Efficient algorithms: graph mining.
  • Slide 27
  • MapReduce
  • Slide 28
  • The work is spread across several machines in parallel connected with fast links.
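The MapReduce programming model can be sketched in a single process: a mapper emits key-value pairs, the pairs are grouped by key, and a reducer combines each group. On a real cluster the map and reduce calls run on different machines; here everything runs locally. The function names are illustrative, and the "Soccer cleats" record is a hypothetical addition to the queries shown earlier in the talk.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


# (query, market segment, clicks)
RECORDS = [
    ("Nike store New York", "Retailer", 10),
    ("Soccer shoes", "Apparel", 4),
    ("Soccer ball", "Equipment", 5),
    ("Soccer cleats", "Apparel", 3),       # hypothetical extra record
]


def clicks_by_segment(record):
    _query, segment, clicks = record
    yield segment, clicks


def total(_key, values):
    return sum(values)


result = map_reduce(RECORDS, clicks_by_segment, total)
print(result)
```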
  • Slide 29
  • Algorithms: Personalized PageRank. Random walks on the graph. Closely related to the celebrated Google PageRank.
  • Slide 30
  • Personalized PageRank
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Idea: perform a very long random walk (starting from v). Ranking the nodes by their probability of being visited assigns a similarity score to each node w.r.t. node v. Strong community bias (this can be formalized).
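A minimal sketch of this idea uses the standard formulation of Personalized PageRank, in which the walk jumps back to v with a fixed restart probability at every step. The restart probability (0.15), the iteration count and the toy graph of two triangles joined by one edge are assumptions made for illustration; none of them comes from the slides.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=100):
    """Power iteration for Personalized PageRank.

    graph: {node: {neighbor: weight}}; alpha is the restart probability.
    Returns {node: score}, with scores summing to 1.
    """
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            out = graph[node]
            total = sum(out.values())
            if total == 0:
                nxt[source] += mass        # dangling node: restart
                continue
            nxt[source] += alpha * mass    # restart at the source
            for nb, weight in out.items():
                nxt[nb] += (1 - alpha) * mass * weight / total
        scores = nxt
    return scores


# Two triangles (v, a, b) and (c, d, e) joined by the edge b-c.
GRAPH = {
    "v": {"a": 1.0, "b": 1.0},
    "a": {"v": 1.0, "b": 1.0},
    "b": {"v": 1.0, "a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0, "e": 1.0},
    "d": {"c": 1.0, "e": 1.0},
    "e": {"c": 1.0, "d": 1.0},
}

scores = personalized_pagerank(GRAPH, "v")
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

On this toy graph every node in v's own triangle scores higher than every node in the other triangle, which is a small instance of the community bias mentioned on the slide.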
  • Slide 44
  • Personalized PageRank: Exact computation is infeasible, O(n^3), but it can be approximated very well. A very efficient MapReduce algorithm scales to large graphs (hundreds of millions of nodes). However...
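The slide does not say which approximation is used. One common option, shown here only as a sketch and not as the method of the talk, is Monte Carlo sampling: run many walks from v that stop with probability alpha at each step, and use the fraction of walks ending at each node as its estimated score. The graph, alpha, the number of walks and the seed are all assumptions.

```python
import random

# Two triangles (v, a, b) and (c, d, e) joined by the edge b-c (toy data).
GRAPH = {
    "v": {"a": 1.0, "b": 1.0},
    "a": {"v": 1.0, "b": 1.0},
    "b": {"v": 1.0, "a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0, "e": 1.0},
    "d": {"c": 1.0, "e": 1.0},
    "e": {"c": 1.0, "d": 1.0},
}


def monte_carlo_ppr(graph, source, alpha=0.15, walks=20000, seed=0):
    """Estimate Personalized PageRank from the end points of short walks."""
    rng = random.Random(seed)
    counts = {node: 0 for node in graph}
    for _ in range(walks):
        node = source
        # At every step the walk stops with probability alpha.
        while rng.random() >= alpha:
            nbrs = list(graph[node])
            weights = [graph[node][nb] for nb in nbrs]
            node = rng.choices(nbrs, weights=weights)[0]
        counts[node] += 1
    return {node: count / walks for node, count in counts.items()}


estimate = monte_carlo_ppr(GRAPH, "v")
for node in sorted(estimate, key=estimate.get, reverse=True):
    print(node, round(estimate[node], 3))
```

Each walk is independent of the others, which is one reason sampling schemes of this kind are easy to spread across many machines.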
  • Slide 45
  • Algorithmic Bottleneck Our graphs are simply too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).
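The "exponential" remark can be made concrete with a short count. With k categories, the number of non-empty subsets for which a ranking would have to be precomputed is given below; k = 100 is an assumed value, chosen only because an earlier slide mentions hundreds of labels.

```latex
\[
  \#\{\text{non-empty subsets of } k \text{ categories}\}
  = \sum_{i=1}^{k} \binom{k}{i} = 2^{k} - 1,
  \qquad
  k = 100 \;\Rightarrow\; 2^{100} - 1 \approx 1.27 \times 10^{30}.
\]
```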
  • Slide 46
  • 1st idea: Tackling Real Graph Structure. Data size is the main bottleneck. Compressing the graph would speed up the computation.
  • Slide 47
  • 1st idea: Tackling Real Graph Structure. [Figure: step 1 goes from the graph of advertisers and queries to a graph of only advertisers.]
  • Slide 48
  • 1st idea: Tackling Real Graph Structure. [Figure: step 1 goes from the graph of advertisers and queries to a graph of only advertisers; step 2 produces the ranking of the entire graph.]
  • Slide 49
  • 1st idea: Tackling Real Graph Structure. Theorem: the ranking computed is the correct Personalized PageRank on the entire graph. Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, 1961; Meyer, 1989, etc.).
  • Slide 50
  • Algorithmic Bottleneck Our graphs are too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).
  • Slide 51
  • Two-stage Approach First stage: Large-scale (but feasible) MapReduce pre-computation. Second Stage: Fast iterative algorithm.
  • Slide 52
  • First Stage: Individual Category Rankings Advertisers Queries
  • Slide 53
  • First Stage: Individual Category Rankings Advertisers Queries Precomputed Rankings
  • Slide 54
  • First Stage: Individual Category Rankings Advertisers Queries Precomputed Rankings Precomputed Rankings
  • Slide 55
  • First Stage: Individual Category Rankings Advertisers Queries Precomputed Rankings Precomputed Rankings Precomputed Rankings
  • Slide 56
  • Second Stage: Rank aggregation. A real-time iterative algorithm aggregates the precomputed rankings of a given node for a subset of the categories (in the figure, the rankings for Red and for Yellow are combined into a ranking for Red + Yellow).
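The two-stage pattern (look up per-category results, then combine them at query time) can be sketched as follows. The slides do not describe the iterative aggregation algorithm itself, so this example swaps in a plain weighted average of the per-category scores as a naive stand-in; the data layout, the scores and the function name `aggregate` are all hypothetical.

```python
# Assumed shape of the first-stage output for one node: per-category
# similarity scores. All numbers are made up for illustration.
PRECOMPUTED = {
    "red":    {"B": 0.6, "C": 0.3, "D": 0.1},
    "yellow": {"B": 0.1, "C": 0.5, "E": 0.4},
}


def aggregate(precomputed, categories, weights=None):
    """Naive stand-in: weighted average of per-category score vectors.

    This is NOT the iterative algorithm of the talk; it only shows how
    precomputed rankings can be combined without touching the full graph.
    """
    if weights is None:
        weights = {category: 1.0 for category in categories}
    total = sum(weights[category] for category in categories)
    combined = {}
    for category in categories:
        for node, score in precomputed[category].items():
            share = weights[category] * score / total
            combined[node] = combined.get(node, 0.0) + share
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranking = aggregate(PRECOMPUTED, ["red", "yellow"])
print(ranking)
```

The second stage only reads small precomputed vectors, which is what makes a real-time answer plausible for any subset of categories.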
  • Slide 57
  • Algorithmic Bottleneck Our graphs are too big (billions of nodes) even for large-scale systems. MapReduce is not real-time. We cannot precompute the results for all subsets of categories (exponential time!).
  • Slide 58
  • Conclusions: Experimental evaluation shows the accuracy of the results. Fully implemented and currently under evaluation for integration in production systems. Ongoing research project for future scientific publications.
  • Slide 59