
Page 1: Challenges in Scalable Data Mining: Support for DB-Backed Web Sites Raghu Ramakrishnan Professor, UW-Madison

Challenges in Scalable Data Mining: Support for DB-Backed Web Sites

Raghu Ramakrishnan

Professor, UW-Madison

Talk Outline

Introduction
  – Personalization
  – User interactivity
  – DB implications

Background
  – Tracking users
  – Content management and delivery

Challenges

Introduction

Evolution of Websites

Standard Content → Personalized Content

Passive Users → Active Users

Fundamental Shift

User-centric design of websites.
  – The Web, unlike print, phones, TV, and other media, offers a unique opportunity to present each customer with a customized experience.
  – Exploiting this potential is becoming a key differentiator across sites.

See http://www.personalization.com/resources/vendors/

Personalization

Adapt the site to each user, even each visit.
  – Bugs Bunny is different from Michael Jordan.
  – Last time, Bugs was shopping for himself; this time he’s looking for a gift for Michael.

Personalization

Technical Implications:
  – Need to know something about the user and the current visit
  – Need to dynamically alter the requested page

Privacy concerns:
  – Will an individual user’s profile be disclosed or sold to others?
  – Will the profile information be used in ways other than to improve that user’s site experience in ways that the user approves?

User Interactions

Traditionally: Searches, purchases.
  – Doesn’t leverage a site’s biggest asset: its users. The site itself is not changed by users in this model.

Richer interactions: Web communities.
  – Put up for auction, bid
  – Comment, rate
  – Form groups and work together
  – Ask, answer
  – User-generated content

Web Communities

Site content driven by users, and changes rapidly.

Viral growth patterns lead to high volumes of traffic.

Need to validate, review for quality.
  – Must track user activity

Greater need for personalization, push technologies.
  – Again, need to track users, dynamic pages

DBs, Mining, and Websites

Personalization and increased user interactivity both lead to websites that deliver dynamically constructed pages, based on data in a DB.

Ergo, we have a vast new application domain for database management systems.

Ergo, we have a challenge: How best to adapt each page to the current user and context.

Background

Tracking Users: Cookies

GET
  – Browser issues this command to retrieve a doc; includes all cookies visible to the target server
  – Server responds with header info, including doc size, server location, cookie directives, etc., plus the document

Browser to server: GET … Cookie: visits=10 …

Server to browser: Set-Cookie: visits=11
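The exchange above can be sketched in a few lines of Python using the standard library's `http.cookies` module. This is only an illustration of the idea, not how any particular server does it: the function name `bump_visit_count` is hypothetical, and a real server would also handle the rest of the request.

```python
from http.cookies import SimpleCookie


def bump_visit_count(cookie_header: str) -> str:
    """Read the request's Cookie header and return a Set-Cookie value.

    Hypothetical helper: it mirrors the slide's example, where the browser
    sends visits=10 and the server answers with visits=11.
    """
    incoming = SimpleCookie()
    incoming.load(cookie_header)  # parse "name=value; name2=value2"
    morsel = incoming.get("visits")
    try:
        visits = int(morsel.value) if morsel is not None else 0
    except ValueError:
        visits = 0  # treat a malformed counter as a first visit

    outgoing = SimpleCookie()
    outgoing["visits"] = str(visits + 1)
    return outgoing["visits"].OutputString()
```

Calling `bump_visit_count("visits=10")` yields the value for the `Set-Cookie` response header shown on the slide; a request with no cookie starts the counter at 1.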

Cookies

Server can set the following parameters for cookies:
  – Name and value
  – When the cookie expires
  – Which pages on the server “see” the cookie
  – Which servers can “see” the cookie
      E.g., DoubleClick servers can see cookies set at many sites
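As a rough sketch, the four parameters listed above map onto cookie attributes as follows. The helper name `build_set_cookie` and the values (`/shop`, `example.com`, one day) are assumptions made for illustration.

```python
from http.cookies import SimpleCookie


def build_set_cookie(name: str, value: str, max_age_seconds: int,
                     path: str, domain: str) -> str:
    """Return a Set-Cookie header value carrying the four parameters."""
    cookie = SimpleCookie()
    cookie[name] = value                          # name and value
    cookie[name]["max-age"] = max_age_seconds     # when the cookie expires
    cookie[name]["path"] = path                   # which pages see the cookie
    cookie[name]["domain"] = domain               # which servers see the cookie
    return cookie[name].OutputString()


# Illustrative values only.
header = build_set_cookie("visits", "11", 86400, "/shop", "example.com")
```

The resulting string starts with `visits=11` and is followed by the `Max-Age`, `Path`, and `Domain` attributes.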

Alternative to cookies:
  – Carry request history along: modify each requested page to “attach” history to every link on the page!
  – Allows session tracking, but not across sessions.
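A minimal sketch of the link-rewriting alternative, assuming the history is a short list of page names carried in a query parameter. The function name `attach_history`, the parameter name `hist`, and the regex-based rewriting are all assumptions; a production system would use a real HTML parser and escape the rewritten URLs.

```python
import re
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def attach_history(html: str, history: List[str], param: str = "hist") -> str:
    """Append the request history to every double-quoted href in the page."""
    token = ".".join(history)  # e.g. "home.pc"

    def rewrite(match: "re.Match[str]") -> str:
        parts = urlsplit(match.group(2))
        query = parse_qsl(parts.query, keep_blank_values=True)
        query.append((param, token))  # keep existing parameters, add ours
        new_url = urlunsplit(parts._replace(query=urlencode(query)))
        return match.group(1) + new_url + match.group(3)

    return re.sub(r'(href=")([^"]*)(")', rewrite, html)
```

Because the history lives only in the links of the pages served, it disappears when the user closes the browser or types a URL directly, which is why this approach tracks a session but not a user across sessions.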

Vignette StoryServer

A platform for developing dynamic web sites:
  – Content personalization and delivery
  – Content management

An elaborate gateway that sits between web servers and DBMSs (and file systems).

Spin-off from CNET’s efforts to develop their own site.

Vignette StoryServer

A page is assembled dynamically from components:
  – Adaptive navigation bars
  – Summary components (e.g., top-ten lists)
  – Personalized elements (e.g., selected news); integration with recommendation engines such as Net Perceptions’ GroupLens is supported

Caching support for components provides the ability to trade off the degree of dynamism (and customization).
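The component-caching idea can be illustrated with a small sketch. This is not StoryServer's API: `ComponentCache`, `assemble_page`, and the five-minute time-to-live are hypothetical names and values chosen for the example. Each component is cached for its own time-to-live, and a time-to-live of zero means the component is rebuilt on every request.

```python
import time
from typing import Callable, Dict, Tuple


class ComponentCache:
    """Caches rendered components, each for its own time-to-live (TTL)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def render(self, name: str, key: str, ttl: float,
               build: Callable[[], str]) -> str:
        now = self._clock()
        cached = self._store.get((name, key))
        if cached is not None and now < cached[0]:
            return cached[1]          # still fresh: reuse the rendered HTML
        html = build()                # stale or missing: rebuild
        if ttl > 0:                   # ttl == 0 means fully dynamic
            self._store[(name, key)] = (now + ttl, html)
        return html


def assemble_page(cache: ComponentCache, user_id: str,
                  build_top_ten: Callable[[], str],
                  build_news: Callable[[str], str]) -> str:
    """Assemble a page from a shared summary and a personalized element."""
    parts = [
        # Shared by all users; tolerates being up to five minutes old.
        cache.render("top_ten", "all-users", 300.0, build_top_ten),
        # Personalized; rebuilt on every request.
        cache.render("news", user_id, 0.0, lambda: build_news(user_id)),
    ]
    return "\n".join(parts)
```

Raising a component's time-to-live reduces load on the DBMS at the cost of staler, less customized output; lowering it does the opposite.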

Data Mining Challenges

A List of Challenges

Similarity (real-time)
Matching (real-time)
Trends (off-line)
Correlation (off-line)

The Similarity Problem

Find users with similar tastes, in context.
  – Joe’s looking at an Athlon processor; which users are similar to Joe in their PC tastes? Whose recommendations is Joe likely to follow?

Find similar content, in context.
  – Which processors are similar in that they appeal to the same groups of people?
  – Which processors are similar in that they have similar performance characteristics?
  – Which articles appeal to the same people?
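One simple way to make "similar users, in context" concrete is cosine similarity over ratings restricted to the current category. The talk does not prescribe a measure, so this is a sketch under that assumption; the function names, the ratings, and the category labels are all made up for illustration.

```python
from math import sqrt
from typing import Dict, List, Tuple

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[item] * b[item] for item in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_users(ratings: Ratings, categories: Dict[str, str],
                  target: str, context: str) -> List[Tuple[str, float]]:
    """Rank other users by similarity to `target` within one category."""
    def in_context(user: str) -> Dict[str, float]:
        return {item: r for item, r in ratings[user].items()
                if categories.get(item) == context}

    base = in_context(target)
    scores = [(user, cosine(base, in_context(user)))
              for user in ratings if user != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Illustrative data: Ann shares Joe's PC tastes, Bob does not.
ratings = {
    "joe": {"athlon": 5, "pentium": 2, "novel": 1},
    "ann": {"athlon": 5, "pentium": 1, "novel": 5},
    "bob": {"athlon": 1, "pentium": 5, "novel": 1},
}
categories = {"athlon": "pc", "pentium": "pc", "novel": "books"}
ranked = similar_users(ratings, categories, "joe", "pc")
```

Restricting the vectors to the `pc` category is what makes the comparison contextual: Ann's very different taste in books does not count against her here. Scanning every user on each request would not scale to the dataset sizes the talk has in mind, which is part of what makes the real-time version of this problem hard.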

The Matching Problem

Match user to data, in context.
  – What related information should you recommend to Joe when he is looking at the Athlon PC product?
      Related products: graphics cards, monitors
      Related reviews, discussions
      If Joe’s been looking only at AMD products, other AMD chips; if not, show alternatives from Intel

Match data to user, in context.
  – Which expert is best qualified to answer Joe’s question?
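The AMD/Intel rule above is the kind of contextual decision a matching engine has to make on every page view. A hand-written version might look like the sketch below; the function name `recommend_chips` and the small catalog are illustrative assumptions, and the interesting problem is learning such rules from data rather than coding them by hand.

```python
from typing import Dict, List


def recommend_chips(viewed_brands: List[str], current: str,
                    catalog: Dict[str, str]) -> List[str]:
    """Pick chips to show next to `current`, based on this visit's history.

    If the user has looked only at AMD products, suggest other AMD chips;
    otherwise show alternatives from Intel.
    """
    only_amd = bool(viewed_brands) and all(b == "AMD" for b in viewed_brands)
    wanted = "AMD" if only_amd else "Intel"
    return [name for name, brand in catalog.items()
            if brand == wanted and name != current]


# Illustrative catalog: product name -> brand.
catalog = {"Athlon": "AMD", "Duron": "AMD",
           "Pentium III": "Intel", "Celeron": "Intel"}
```

With a history of only AMD views, `recommend_chips(["AMD", "AMD"], "Athlon", catalog)` suggests the other AMD chip; with a mixed history it returns the Intel alternatives.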

The Trends Problem

Identify trends in sales.
Identify trends in overall user preferences, user segmentation.
Identify trends for individual users.
Identify trends in overall product popularity, product segmentation.
Identify trends for specific products.
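As one simple reading of "trend", the slope of a least-squares line through equally spaced observations (say, weekly sales of one product) tells whether the series is rising or falling and how fast. This is a sketch under that assumption; the talk does not commit to a particular definition, and `trend_slope` is a hypothetical name.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values observed at times 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0  # a single observation has no trend
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

The same function applies at every level the slide lists: feed it total sales for an overall trend, or one user's or one product's series for an individual trend.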

The Correlations Problem

Given a set of trends (e.g., in pricing), track the impact on other trends.
  – Are there correlated trends?
  – Are there causal relationships?

Note that correlating a given trend to an overall trend is hard enough, but trying to find all other individual or product-specific trends that happen to be correlated is much harder!
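To illustrate the easier case, the sketch below checks one given series against a set of candidate series using the Pearson correlation coefficient. The choice of Pearson, the 0.9 threshold, and the sample numbers are assumptions for illustration. Correlation alone says nothing about causality.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if undefined)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return num / sqrt(var_x * var_y)


def correlated_with(target: Sequence[float],
                    series: Dict[str, Sequence[float]],
                    threshold: float = 0.9) -> List[str]:
    """Names of series strongly (positively or negatively) correlated."""
    return sorted(name for name, ys in series.items()
                  if abs(pearson(target, ys)) >= threshold)


# Illustrative data: price falls while unit sales rise.
price = [10, 9, 8, 7]
others = {"units_sold": [100, 110, 120, 130], "returns": [5, 3, 6, 4]}
matches = correlated_with(price, others)
```

Checking one trend against n candidates takes n comparisons; finding all correlated pairs among n individual or product-specific trends takes on the order of n squared, which is why the unrestricted version is so much harder.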

Problem Characteristics

Large datasets: Many users, huge activity levels, lots of products, lots of documents, …

Real-time recommendations: “In context”

Constantly evolving data: Data mining models can get outdated, want to find trends.

Variations:
  – Attach recommendation engine to a user’s browser, rather than to the web server. (Purple Swami)
  – Look for similar documents across sites and extract relevant metadata. (Whizbang)

Summary

Lots of challenges.

Lots of players.
  – Companies that provide applications and integrate data mining into the application logic.
      E.g., ATG, BroadVision, QUIQ, Vignette
  – Companies that provide data mining tools.
      E.g., Blaze, Broadbase, DataSage, Engage, E.piphany, Net Perceptions, Manna