
Challenges in Scalable Data Mining: Support for DB-Backed Web Sites

Raghu Ramakrishnan

Professor, UW-Madison

Talk Outline

Introduction
– Personalization
– User interactivity
– DB implications

Background
– Tracking users
– Content management and delivery

Challenges

Introduction

Evolution of Websites

Standard Content → Personalized Content

Passive Users → Active Users

Fundamental Shift

User-centric design of websites.
– The Web, unlike print, phones, TV, and other media, offers a unique opportunity to present each customer with a customized experience.
– Exploiting this potential is becoming a key differentiator across sites.

See http://www.personalization.com/resources/vendors/

Personalization

Adapt the site to each user, even each visit.
– Bugs Bunny is different from Michael Jordan.
– Last time, Bugs was shopping for himself; this time he’s looking for a gift for Michael.

Personalization

Technical implications:
– Need to know something about the user and the current visit
– Need to dynamically alter the requested page

Privacy concerns:
– Will an individual user’s profile be disclosed or sold to others?
– Will the profile information be used for anything other than improving that user’s site experience in ways that the user approves?

User Interactions

Traditionally: searches, purchases.
– Doesn’t leverage a site’s biggest asset: its users.
  The site itself is not changed by users in this model.

Richer interactions: Web communities.
– Put up for auction, bid
– Comment, rate
– Form groups and work together
– Ask, answer
– User-generated content

Web Communities

Site content driven by users, and changes rapidly.

Viral growth patterns lead to high volumes of traffic.

Need to validate, review for quality.
– Must track user activity

Greater need for personalization, push technologies.
– Again, need to track users, dynamic pages

DBs, Mining, and Websites

Personalization and increased user interactivity both lead to websites that deliver dynamically constructed pages, based on data in a DB.

Ergo, we have a vast new application domain for database management systems.

Ergo, we have a challenge: How best to adapt each page to the current user and context.

Background

Tracking Users: Cookies

GET
– Browser issues this command to retrieve a document; the request includes all cookies visible to the target server.
– Server responds with header info (document size, server location, cookie directives, etc.) plus the document.

Request: GET … Cookie: visits=10 …
Response: Set-Cookie: visits=11
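The visit counter exchange can be sketched with Python's standard `http.cookies` module. This is only an illustration: the function name `handle_request` is an assumption made for the example, and a real server would do this inside its request handler.

```python
from http.cookies import SimpleCookie


def handle_request(cookie_header: str) -> str:
    """Read the 'visits' cookie from a request and build the reply cookie.

    `cookie_header` is the value of the request's Cookie header,
    for example "visits=10". Returns the Set-Cookie header value.
    """
    incoming = SimpleCookie()
    incoming.load(cookie_header)  # parse name=value pairs sent by the browser

    # A first-time visitor sends no 'visits' cookie, so start from zero.
    visits = int(incoming["visits"].value) if "visits" in incoming else 0

    reply = SimpleCookie()
    reply["visits"] = str(visits + 1)
    return reply["visits"].OutputString()


if __name__ == "__main__":
    print("Set-Cookie:", handle_request("visits=10"))
```

The browser stores the new value and sends it back on its next GET, which is how the server recognizes a returning user.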

Cookies

Server can set the following parameters for cookies:
– Name and value
– When the cookie expires
– Which pages on the server “see” the cookie
– Which servers can “see” the cookie
  E.g., Doubleclick servers can see cookies set at many sites
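A minimal sketch of the four parameters, again using Python's `http.cookies`. The expiry date, path, and domain values below are made up for illustration (`.example.com` is a placeholder domain).

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()

# Name and value
cookie["visits"] = "11"

# When the cookie expires (illustrative date)
cookie["visits"]["expires"] = "Fri, 31 Dec 1999 23:59:59 GMT"

# Which pages on the server "see" the cookie: only URLs under this path
cookie["visits"]["path"] = "/catalog"

# Which servers can "see" the cookie: any host within this domain
cookie["visits"]["domain"] = ".example.com"

# Render the full response header line that carries these directives.
header = cookie.output()
print(header)
```

A broad domain setting is what lets one server family read cookies in requests arriving from pages on many different sites.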

Alternative to cookies:
– Carry request history along: modify each requested page to “attach” history to every link on the page!
– Allows session tracking, but not across sessions.
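The link-rewriting alternative might look like the sketch below. The query parameter name `hist` and the function `attach_history` are assumptions for this example, and the regular expression only handles simple double-quoted `href` attributes; a production system would use a real HTML parser.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def attach_history(html: str, history: str) -> str:
    """Append the request history to the query string of every link."""

    def rewrite(match: "re.Match[str]") -> str:
        parts = urlsplit(match.group(2))
        query = parse_qsl(parts.query, keep_blank_values=True)
        query.append(("hist", history))  # 'hist' is a hypothetical name
        new_url = urlunsplit(parts._replace(query=urlencode(query)))
        return match.group(1) + new_url + match.group(3)

    # Match href="..." and rewrite only the URL between the quotes.
    return re.sub(r'(href=")([^"]*)(")', rewrite, html)


if __name__ == "__main__":
    page = '<a href="/cpu?id=7">Athlon</a>'
    print(attach_history(page, "home-cpu"))
```

Because the history lives only in the links of pages served during the visit, it disappears once the user leaves, which is why this approach cannot track users across sessions.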

Vignette StoryServer

A platform for developing dynamic web sites:
– Content personalization and delivery
– Content management

An elaborate gateway that sits between web servers and DBMSs (and file systems).

Spin-off from CNET’s efforts to develop their own site.

Vignette StoryServer

A page is assembled dynamically from components:
– Adaptive navigation bars
– Summary components (e.g., top-ten lists)
– Personalized elements (e.g., selected news); integration with recommendation engines such as Net Perceptions’ GroupLens is supported

Caching support for components provides the ability to trade off the degree of dynamism (and customization).
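The idea of per-component caching can be sketched generically as follows. This is not StoryServer's API: `ComponentCache`, `assemble_page`, and the time-to-live values are all hypothetical names and numbers chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class ComponentCache:
    """Cache rendered components, each with its own time-to-live (seconds)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, ttl: float, render: Callable[[], str]) -> str:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < ttl:
            return hit[1]  # still fresh: reuse the cached copy
        html = render()  # stale or missing: rebuild the component
        self._store[key] = (now, html)
        return html


def assemble_page(cache: ComponentCache, user: str,
                  render_top_ten: Callable[[], str],
                  render_news: Callable[[str], str]) -> str:
    # Shared summary component: same for everyone, cached for five minutes.
    top_ten = cache.get("top-ten", ttl=300.0, render=render_top_ten)
    # Personalized component: ttl of 0 means it is rebuilt on every request.
    news = cache.get(f"news:{user}", ttl=0.0,
                     render=lambda: render_news(user))
    return top_ten + "\n" + news
```

A longer time-to-live means fewer trips to the database but less dynamic content; a time-to-live of zero gives a fully dynamic component.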

Data Mining Challenges

A List of Challenges

Similarity (real-time)
Matching (real-time)
Trends (off-line)
Correlation (off-line)

The Similarity Problem

Find users with similar tastes, in context.
– Joe’s looking at an Athlon processor; which users are similar to Joe in their PC tastes? Whose recommendations is Joe likely to follow?

Find similar content, in context.
– Which processors are similar in that they appeal to the same groups of people?
– Which processors are similar in that they have similar performance characteristics?
– Which articles appeal to the same people?
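One common way to make "similar tastes, in context" concrete is cosine similarity over user ratings, restricted to the items in the current context. The talk does not prescribe a measure, so this is just one possible sketch; the function names and the sample ratings are invented.

```python
from math import sqrt
from typing import Dict, List, Set, Tuple

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(target: str, ratings: Ratings,
                 context: Set[str]) -> List[Tuple[str, float]]:
    """Rank other users by similarity to `target`, using context items only."""

    def in_context(user: str) -> Dict[str, float]:
        return {i: r for i, r in ratings[user].items() if i in context}

    mine = in_context(target)
    scores = [(u, cosine(mine, in_context(u)))
              for u in ratings if u != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    ratings = {
        "joe": {"athlon": 5, "pentium": 2, "novel": 4},
        "ann": {"athlon": 5, "pentium": 1, "novel": 1},
        "bob": {"athlon": 1, "pentium": 5, "novel": 4},
    }
    # Context: Joe is browsing processors, so ignore book ratings.
    print(most_similar("joe", ratings, {"athlon", "pentium"}))
```

Restricting the vectors to the context is what separates "similar in PC tastes" from "similar overall". Doing this in real time over many users is where the scalability challenge lies.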

The Matching Problem

Match user to data, in context.
– What related information should you recommend to Joe when he is looking at the Athlon PC product?
  Related products: graphics cards, monitors
  Related reviews, discussions
  If Joe’s been looking only at AMD products, other AMD chips; if not, show alternatives from Intel

Match data to user, in context.
– Which expert is best qualified to answer Joe’s question?
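The AMD/Intel rule above can be written down directly as a toy example. The catalog entries and the function name `recommend_chips` are illustrative assumptions; a real site would derive such rules from mined browsing data rather than hard-code them.

```python
from typing import List, Tuple

Chip = Tuple[str, str]  # (product name, brand)

# Illustrative catalog, not taken from the talk.
CATALOG: List[Chip] = [
    ("Athlon", "AMD"),
    ("Duron", "AMD"),
    ("Pentium III", "Intel"),
    ("Celeron", "Intel"),
]


def recommend_chips(viewed_brands: List[str], current: str,
                    catalog: List[Chip] = CATALOG) -> List[str]:
    """Pick related chips based on the brands seen in this user's history."""
    only_amd = bool(viewed_brands) and set(viewed_brands) == {"AMD"}
    # Only-AMD history: suggest other AMD chips. Otherwise: Intel alternatives.
    wanted = "AMD" if only_amd else "Intel"
    return [name for name, brand in catalog
            if brand == wanted and name != current]


if __name__ == "__main__":
    print(recommend_chips(["AMD", "AMD"], current="Athlon"))
    print(recommend_chips(["AMD", "Intel"], current="Athlon"))
```

The "in context" part is that the same product page yields different recommendations depending on what this particular user has been viewing.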

The Trends Problem

Identify trends in sales.
Identify trends in overall user preferences, user segmentation.
Identify trends for individual users.
Identify trends in overall product popularity, product segmentation.
Identify trends for specific products.
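As a minimal sketch of what "identify a trend" can mean for one product, the code below fits a least-squares line to a sales series and reports its slope. This is one simple choice among many; the function names and sample figures are assumptions for illustration.

```python
from typing import Dict, List


def trend_slope(values: List[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0  # a single point has no trend
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def product_trends(sales: Dict[str, List[float]]) -> Dict[str, float]:
    """Slope per product: positive means rising sales, negative falling."""
    return {product: trend_slope(series) for product, series in sales.items()}


if __name__ == "__main__":
    weekly_sales = {"athlon": [10, 12, 14, 16], "pentium": [20, 18, 17, 15]}
    print(product_trends(weekly_sales))
```

Running this per product or per user is cheap for one series; the difficulty is doing it across every product, user, and segment as the data keeps changing.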

The Correlations Problem

Given a set of trends (e.g., in pricing), track the impact on other trends.
– Are there correlated trends?
– Are there causal relationships?

Note that correlating a given trend to an overall trend is hard enough, but trying to find all other individual or product-specific trends that happen to be correlated is much harder!
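A small sketch shows why the all-pairs version is harder. Correlating one series with another is a single Pearson computation, but searching for every correlated pair among n series needs n(n-1)/2 of them. The series names, data, and threshold below are invented for illustration, and a high correlation found this way says nothing about causality.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / den if den else 0.0


def correlated_pairs(series: Dict[str, List[float]],
                     threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Check every pair of series; keep those with |r| >= threshold."""
    found = []
    for a, b in combinations(sorted(series), 2):  # n*(n-1)/2 pairs
        r = pearson(series[a], series[b])
        if abs(r) >= threshold:
            found.append((a, b, r))
    return found


if __name__ == "__main__":
    data = {
        "price": [1, 2, 3, 4],
        "sales": [8, 6, 4, 2],
        "visits": [1, 3, 2, 4],
    }
    print(correlated_pairs(data))
```

With thousands of product-specific and user-specific trends, the quadratic number of pairs quickly becomes the bottleneck.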

Problem Characteristics

Large datasets: Many users, huge activity levels, lots of products, lots of documents, …
Real-time recommendations: “In context”
Constantly evolving data: Data mining models can get outdated, want to find trends.
Variations:
– Attach recommendation engine to a user’s browser, rather than to the web server. (Purple Swami)
– Look for similar documents across sites and extract relevant metadata. (Whizbang)

Summary

Lots of challenges.
Lots of players.
– Companies that provide applications and integrate data mining into the application logic.
  E.g., ATG, BroadVision, QUIQ, Vignette
– Companies that provide data mining tools.
  E.g., Blaze, Broadbase, DataSage, Engage, E.piphany, Net Perceptions, Manna
