CS6913: Web Search Engines Lecture on Computational Advertising and Spam (A Short Overview) Torsten Suel Computer Science and Engineering NYU Tandon School of Engineering

Post on 03-May-2022

Page 1: CS6913: Web Search Engines Lecture on Computational

CS6913: Web Search Engines
Lecture on Computational Advertising and Spam (A Short Overview)

Torsten Suel

Computer Science and Engineering
NYU Tandon School of Engineering

Page 2: CS6913: Web Search Engines Lecture on Computational

Computational Advertising

• Informally: advertising on the internet, or in any electronic media (e.g., future TV)

• More formally: automatically finding the best advertisement for a particular user in a particular context, using data mining, ML, and algorithmic techniques

• Particular user: a particular person, a person with a certain history, a member of a certain group, in a certain region

• Computational Advertising pays the bills on the web

Top-k Document Retrieval Using Block-Max Indexes. With S. Ding. 34th Annual ACM SIGIR Conference, July 2011.

Page 3: CS6913: Web Search Engines Lecture on Computational

Traditional Advertising:

• Huge industry ($180B per year in the US)
• Developed over the last centuries, but particularly in the last few decades (compare “Mad Men”)
• Influenced by psychology, how to manipulate ...
• TV, radio, newspapers, magazines, direct mail, leaflets, billboards, yellow pages, flyers
• Creative side and deal-making
• Shaped by physical constraints (printing, spectrum)
• Not very personalized
• Effects hard to measure

Page 4: CS6913: Web Search Engines Lecture on Computational

From Traditional to Computational Advertising

• Everything is moving to the internet (also ads)
• Computational advertising takes up a larger share
• NYC: center of traditional ad industry
• Now lots of CompAdv startups, and the old ones follow
• Internet: no physical constraints (but geo matters)
• And more user context → personalization
• And more knowledge of outcome → did it work?
• Attack of the nerds ...

Page 5: CS6913: Web Search Engines Lecture on Computational

Basic Forms of Computational Advertising

• Textual Ads: Search match

• Textual Ads: Content match

• Display Ads: Banner Ads

• Others: video ads, mobile, social networks


Page 6: CS6913: Web Search Engines Lecture on Computational

Roles in Computational Advertising

• Users
• Advertisers
• Publishers
• Matchmakers

... plus measurement, data brokers, resellers

Page 7: CS6913: Web Search Engines Lecture on Computational

Textual Ads: Search match

• Matching search queries with ads
• Search means intent (John Battelle)
• Can often show highly relevant ads
• Also great for niche companies!
• Often pay per click
• Companies bid on keywords
• Soft or hard match
• Auction process to place ads
• Publisher == matchmaker (often)

Page 8: CS6913: Web Search Engines Lecture on Computational

Search Ads:

Page 9: CS6913: Web Search Engines Lecture on Computational

Search Ads:

Page 10: CS6913: Web Search Engines Lecture on Computational

Textual Ads: Content match

• Ads placed on 3rd party web sites
• Placed based on site and based on user
• Mostly textual, but often also images
• Often based on profiles of users (age, past web site visits) → privacy issues
• Often real-time auctions with intermediaries
• Many ad networks
• Issues of trust and manipulation

Page 11: CS6913: Web Search Engines Lecture on Computational

Content Match:

Page 12: CS6913: Web Search Engines Lecture on Computational

Display Ads: Banner Ads

• Often brand advertising, on certain big sites
• E.g., cars, movies on CNN or Yahoo! Sports
• Users not expected to buy right then
• Pay per (million) impressions (ppi)
• Often still large, negotiated campaigns
• Creative part (design)
• E.g., new baseball movie hits theaters
• “we want to show the ad to 10 million males between the ages of 30 and 50 during the next week”
• Nontrivial optimization problems

Page 13: CS6913: Web Search Engines Lecture on Computational

Display Ads:

Page 14: CS6913: Web Search Engines Lecture on Computational

Display Ads:

Page 15: CS6913: Web Search Engines Lecture on Computational
Page 16: CS6913: Web Search Engines Lecture on Computational

Who pays and how to pay

• Pay per impression (ppi)
• Pay per click (ppc)
• Pay per action (ppa)
• Also called “charging event”
• Advertiser pays
• Publisher and matchmaker get paid
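The three charging events above can be compared on a common scale by converting each into expected revenue per 1000 impressions. The following is only a sketch: the function name, the parameter names, and the assumption that click and action probabilities are known in advance are illustrative and not from the lecture.

```python
def expected_revenue_per_mille(model, price, click_prob=0.0, action_prob=0.0):
    """Expected revenue per 1000 impressions for one ad (illustrative only).

    model       -- "ppi", "ppc", or "ppa" (the charging event)
    price       -- amount charged per charging event
    click_prob  -- assumed probability that an impression leads to a click
    action_prob -- assumed probability that a click leads to an action
    """
    if model == "ppi":
        per_impression = price                      # charged on every impression
    elif model == "ppc":
        per_impression = price * click_prob         # charged only on clicks
    elif model == "ppa":
        per_impression = price * click_prob * action_prob  # charged only on actions
    else:
        raise ValueError("unknown charging model: " + model)
    return 1000 * per_impression


if __name__ == "__main__":
    print(expected_revenue_per_mille("ppi", 0.005))
    print(expected_revenue_per_mille("ppc", 0.10, click_prob=0.02))
    print(expected_revenue_per_mille("ppa", 5.00, click_prob=0.02, action_prob=0.05))
```

Under these assumed numbers, a 10c click bid with a 2% click probability is worth less per impression than a flat half-cent impression price, which is why the matchmaker needs click estimates to compare ads.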

Page 17: CS6913: Web Search Engines Lecture on Computational

Search ad auctions:

• Advertisers bid for keywords: e.g., 10c per click
• “mens sneakers”, plus maybe geographic constraints
• Second-price auction: “when a user clicks on an ad, the advertiser pays the price of the next highest bidder”
• Ranking of ads based on potential payoff for the engine
• To swing or not to swing (Broder et al., CIKM 2008)
• Pay per click, or sometimes pay per action
• Options: hard/soft match

Page 18: CS6913: Web Search Engines Lecture on Computational

Search Ads:

Page 19: CS6913: Web Search Engines Lecture on Computational

Auctions:

• Different parties submit bids for an item
• Open or closed, single or multiple rounds
• First- and second-price auctions, Dutch auction
• Second-Price (Vickrey) Auction is truthful, i.e., it motivates people to bid the true value of the item
• Proof by case analysis.

Page 22: CS6913: Web Search Engines Lecture on Computational

Second-price auctions are truthful

• You are bidding for an item
• Suppose the item is worth $x to you
• I.e., you want to buy at $x or lower, but not >$x
• Claim: your best strategy is to offer $x
• Case 1: Suppose you bid more than $x
  - If you win and the next highest bid is above $x, you pay more than $x and regret it
  - In all other cases (you lose, or you would also have won with a bid of $x), your higher bid made no difference

Page 24: CS6913: Web Search Engines Lecture on Computational

Second-price auctions are truthful

• You are bidding for an item
• Suppose the item is worth $x to you
• I.e., you want to buy at $x or lower, but not >$x
• Claim: your best strategy is to offer $x
• Case 2: Suppose you bid less than $x
  - If someone else wins with a bid <$x, you regret not bidding more
  - If you win, or someone else wins with a bid >$x, then bidding <$x made no difference
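The case analysis can also be checked by brute force. The sketch below assumes a sealed-bid, single-item second-price auction in which ties go against the bidder; the function names are made up for illustration. It compares the utility of bidding the true value against every other bid on a small grid.

```python
def second_price_utility(my_bid, my_value, other_bids):
    """Utility of one bidder in a sealed-bid second-price auction.

    The bidder wins only if my_bid is strictly above every other bid, and
    then pays the highest competing bid. Ties go against the bidder (an
    assumption made here to keep the sketch short).
    """
    highest_other = max(other_bids)
    if my_bid > highest_other:
        return my_value - highest_other
    return 0.0


def truthful_is_weakly_dominant(my_value, candidate_bids, competitor_bids):
    """Return True if bidding my_value is never worse than any candidate bid."""
    for highest_other in competitor_bids:
        truthful = second_price_utility(my_value, my_value, [highest_other])
        for bid in candidate_bids:
            if second_price_utility(bid, my_value, [highest_other]) > truthful:
                return False
    return True


if __name__ == "__main__":
    # Case 1: overbidding ($15 for an item worth $10) wins but loses $2.
    print(second_price_utility(15, 10, [12]))
    # Case 2: underbidding ($5) loses an item that a truthful bid would win.
    print(second_price_utility(5, 10, [7]), second_price_utility(10, 10, [7]))
    print(truthful_is_weakly_dominant(10, range(0, 21), range(0, 21)))
```

The grid check is not a proof, but it exercises both cases of the argument: a deviation from the true value either changes nothing or makes the bidder worse off.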

Page 26: CS6913: Web Search Engines Lecture on Computational

Second-Price Auctions:

• Best strategy is to bid true value
• Search uses Generalized Second-Price Auction
• Generalized: pay the (k+1)-th highest price for the k-th slot
• PPC: take probability of click into account
• Note: optimality may not extend to this case
• Also, optimality is for one auction
• Advertisers make many bids over days, with budget constraints
• E.g., strategies for depleting competitor budgets
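One common way to combine the generalized second-price rule with click probabilities is sketched below. It is an assumption for illustration, not necessarily what any particular engine does: ads are ranked by bid times estimated click probability, and the advertiser in slot k pays, per click, the smallest bid that would still beat the ad ranked just below. The function and variable names are hypothetical.

```python
def run_gsp(bids, click_probs, num_slots):
    """Generalized second-price auction with click-probability weighting.

    bids        -- dict: advertiser -> bid per click
    click_probs -- dict: advertiser -> estimated click probability (> 0)
    num_slots   -- number of ad slots to fill
    Returns a list of (advertiser, price_per_click), best slot first.
    """
    # Rank by expected payoff per impression: bid * click probability.
    ranked = sorted(bids, key=lambda a: bids[a] * click_probs[a], reverse=True)
    results = []
    for position, advertiser in enumerate(ranked[:num_slots]):
        if position + 1 < len(ranked):
            runner_up = ranked[position + 1]
            next_score = bids[runner_up] * click_probs[runner_up]
            # Smallest bid that keeps this advertiser ahead of the runner-up.
            price = next_score / click_probs[advertiser]
        else:
            price = 0.0  # nobody below; a real system would use a reserve price
        results.append((advertiser, price))
    return results


if __name__ == "__main__":
    bids = {"a": 0.10, "b": 0.20, "c": 0.05}
    click_probs = {"a": 0.5, "b": 0.1, "c": 0.2}
    for advertiser, price in run_gsp(bids, click_probs, num_slots=2):
        print(advertiser, round(price, 4))
```

In this example advertiser "a" wins the top slot despite bidding half as much as "b", because its ad is assumed to be five times as likely to be clicked.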

Page 27: CS6913: Web Search Engines Lecture on Computational

Ad Exchanges:

• Intermediaries between publishers and ad networks
• Exchange matches ads with advertising opportunities
• Based on auction mechanism
• Works for both search and content matches
• But how to do 2nd-price auctions in this environment?

From: S. Muthukrishnan: Ad Exchanges - Research Problems

Page 28: CS6913: Web Search Engines Lecture on Computational

Ad Search Engines:

• Systems for returning best ads given features
• For search ads and content match
• Features: search and ad keywords, landing pages, user browsing/search/purchase history, etc.
• Very high-dimensional search problem
• i.e., keywords not (always) as dominant
• Ad engines receive tens of billions of queries/day
• For each search query, and for many page views, ...

Page 29: CS6913: Web Search Engines Lecture on Computational

Click Fraud:

• To get advertising money (content match)
• To deplete competition’s budget
• To manipulate auctions and prices
• (Or to increase rank of your site by clicking on it)

Page 30: CS6913: Web Search Engines Lecture on Computational

Related: Adversarial Information Retrieval

• search engine position is $$$
• make lots of junk pages, try to make them rank high
• use ads to make $ (or $$$)

Page 31: CS6913: Web Search Engines Lecture on Computational

Optimizing display ads:

• Contracts: 10 million impressions in next week
• Only certain types of customers (age, interests, gender, etc.)
• ... or you pay a fine
• Unknown inventory:
  - how many and what types of users will come?
• Overlapping categories
  - e.g., science fiction fans versus 20-30 year old males

Page 35: CS6913: Web Search Engines Lecture on Computational

Optimizing display ads: Example

• A new Star Wars movie is released on June 15

• Studio wants to advertise (raise awareness)

• Wants to target people 15-30 years old, and any sci-fi fans

• Deal: show our ad to 10m such people between June 1-20

• If you do, we pay $1m, but 20c less for each person less

• Another campaign may target people 25-40 (partial overlap)

• Also, legal to target by age, gender, race? (often not)
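The payment rule in this example is simple arithmetic, sketched below with the numbers from the deal as defaults. The function name is hypothetical, and the choice not to cap the penalty (so that a large shortfall turns the payment into a fine) is an assumption made for illustration.

```python
def campaign_payout(delivered, target=10_000_000, full_payment=1_000_000.0,
                    penalty_per_missing=0.20):
    """Payment to the publisher under the example contract.

    delivered           -- number of targeted people who were shown the ad
    target              -- contracted number of people (10m in the example)
    full_payment        -- payment if the target is met ($1m in the example)
    penalty_per_missing -- deduction per missing person (20c in the example)
    The result is not capped at zero: a negative value means the publisher
    owes money (an assumption of this sketch).
    """
    missing = max(0, target - delivered)
    return full_payment - penalty_per_missing * missing


if __name__ == "__main__":
    for delivered in (10_000_000, 9_000_000, 5_000_000):
        print(delivered, campaign_payout(delivered))
```

With a 20c deduction, missing half of the target already wipes out the whole $1m payment, which is why the publisher has to plan inventory carefully when campaigns overlap.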

Page 36: CS6913: Web Search Engines Lecture on Computational

Keyword selection strategies

• In search ads
• Car or automobile
• Or dictionary (walmart, ebay)
• Advertiser tools provided by Google, Bing etc.

Page 37: CS6913: Web Search Engines Lecture on Computational

Computational Advertising Research

• Three main types of approaches/directions:
  - IR approaches: ad retrieval as a search problem
  - Mechanism design, auctions, and game theory
  - Optimization problems in display advertising
• Others:
  - web/user mining
  - Privacy and privacy-preserving advertising
  - Scalability issues in ad engines

Page 38: CS6913: Web Search Engines Lecture on Computational

Comp. Advertising Marketplace

• Very complicated zoo of companies and roles
  - ad networks
  - ad campaign coordination/optimization
  - companies providing user data
  - arbitrage & manipulation
  - thousands of companies
  - real-time auctions in tens of millisecs
• Additional online advertising scenarios
  - monetizing social networks (Facebook, LinkedIn etc.)
  - mobile ads and ads in app-space
  - ITV and internet radio: ads not a broadcast (Hulu)
• Privacy: looking bad at the moment
  - (almost) everything is for sale …

Page 39: CS6913: Web Search Engines Lecture on Computational

Ad company marketplace: (huge # of companies!!)

Page 40: CS6913: Web Search Engines Lecture on Computational

European ad company marketplace:

Page 41: CS6913: Web Search Engines Lecture on Computational

Local ad company marketplace:

Page 42: CS6913: Web Search Engines Lecture on Computational

Other Issues:

• Intermediaries, ad constraints, brand safety
• Local and mobile ads
• Interactive TV (Hulu etc.)
• Privacy!
• Fairness!
• Fa…

Page 43: CS6913: Web Search Engines Lecture on Computational

Fairness Issues:

• Recall: targeting certain demographics illegal

• Such targeting might be indirect or unintentional

• Different users might be offered different prices

• Fairness for buyers, or for sellers?

• Real estate, jobs

• Course: Stoyanovich/Wood: Responsible Data Science, DS-GA 1017, Spring 2021. https://dataresponsibly.github.io/rds/

Page 44: CS6913: Web Search Engines Lecture on Computational

NEXT TOPIC: Web Spam and Manipulation

• introduction
• what is spam?
• link spam: how to and countermeasures
• content based spam detection

Page 45: CS6913: Web Search Engines Lecture on Computational

Introduction: What is “spam”?

• a type of “fake meat”
• email spam: unrequested, undesired messages sent out to large numbers of people, most of whom are not interested and consider it a nuisance
• web spam: large amounts of web pages generated or manipulated to achieve high rankings in search engines, or to harvest ad dollars, or to trick users into ...
• search engine manipulation
• related: blog spam and blog link spam
• related: bot accounts on Twitter, Facebook, unsafe sites

Page 46: CS6913: Web Search Engines Lecture on Computational

What is Web Spam?

• search engine ranking extremely important for online businesses
• online book store: must be ranked high on “books”
• Search Engine Optimization (SEO): optimize ranking
• overly aggressive SEO: “web spam”
• also in connection with ad networks: make artificial junk pages that host ads to earn money

Example of Web Spam Techniques:

• keyword stuffing: add irrelevant keywords to page
  - in body, title, meta tags, anchor text, URL, and hostname
  - generation of random content, or copying of good content
  - hidden text: micro fonts, white font on white background
• cloaking: give different page to crawlers vs. browsers
• reuse of domains and redirection
• link spam: link farms and link exchange agreements
  - links from blog comments and other online communities

Page 47: CS6913: Web Search Engines Lecture on Computational

Web Spam Example 1:

Page 48: CS6913: Web Search Engines Lecture on Computational

Web Spam Example 2:

• not clear what spam is
• lots of pseudo-useful or replicated useful content
• big grey area
• beneficiaries may not be completely aware (e.g. BMW ban from Google)

Page 49: CS6913: Web Search Engines Lecture on Computational

Why spam?

• search engines have huge economic power: search engine ranking can determine whether an online business will fail or prosper
• online book store: very important to be ranked high on queries such as “book”, “buy book”, or “textbooks” etc.
• a few years ago, domain names (e.g., books.com) were considered most important and sold for millions of $
• now ranking is king: chance to make $ ($$)
• by selling stuff, or by hosting ads
• search engine optimization (SEO) industry: “hire us and we will improve ranking (visibility) of your web site”
• white hat versus black hat search engine optimization

Page 50: CS6913: Web Search Engines Lecture on Computational

Why spam?

• search engine companies: “design a good web site, and we will try to fairly rank your site. Do not focus on tricking or defeating our ranking function!”
• teacher: “learn the topic, and leave grading up to me. Do not try to find tricks to score high in the exam!”
• white hat (good): consult web sites on how to avoid basic mistakes, such as bad site structure, missing keywords, bad HTML or other constructs that SE cannot parse
• black hat (bad): will use any trick to get ranked higher
• boundary not clear, big grey zone
• SEO is multi-billion dollar industry, thousands of firms

Page 51: CS6913: Web Search Engines Lecture on Computational

SEO Consulting:

Page 52: CS6913: Web Search Engines Lecture on Computational

How to spam?

• term-based spamming: change content of page so that the page appears highly relevant to certain queries
• link-based spamming: create artificial incoming links so that a page has very high PageRank
• or often a combination of both
• click-spam ...
• term-based spamming: put the word “book” many times into a page, or put popular unrelated words into a page
• link-based spamming: create fake pages that all link to the page that is promoted
• special case: blog comment spam
• or sell or exchange links to highest bidder

Page 53: CS6913: Web Search Engines Lecture on Computational

Who is spamming?

• spammer may not be the same as the beneficiary
• e.g., reputable company (e.g., bmw.de) is approached by an SEO company promising to re-engineer their web site to achieve higher visibility (ranking), and does this using illegitimate means
• or a site is approached with an offer to sell PageRank
• search engine may detect this and punish the site
• or a spammer might build his/her own network of sites to attract visitors and make money off them via ads
• common pattern: affiliate sites: stores may pay a percentage of sales to sites sending customers via links
• eBay, Amazon, AdSense, hotel/airline sites

Page 54: CS6913: Web Search Engines Lecture on Computational

How to deal with spam?

• detect manually: using own employees or based on user complaints
• automatic or semi-automatic detection
  - use simple statistics: what does spam usually look like?
  - learning-based approach: manually provide some examples
  - then blacklist any detected sites or pages
  - or demote them in the ranking
• collaborative approaches
• design of ranking functions that cannot be spammed
  - Impossible? (consider term-based ranking functions)
• design of economic frameworks that discourage spam
• compare: economic solutions to email spam: postage
  - Only the rich can spam

Page 55: CS6913: Web Search Engines Lecture on Computational

From: Modern Mechanix, 1934

Page 56: CS6913: Web Search Engines Lecture on Computational

Part of the Seedy Underbelly of the Net

• web spam
• pornography filtering and detection
• censorship: automatic filtering and blocking, or detection of such filtering or blocking in search results
• click fraud: prevention and detection
• phishing site detection
• web surveillance and user monitoring …
• filter bubbles, conspiracy theories, and extremism
• manipulation and how to detect and prevent (Facebook, elections)
• harassment on social networks

Page 57: CS6913: Web Search Engines Lecture on Computational

How to Deal with Web Spam

• detecting and demoting spam – the basics
• judging spam: precision and recall
• link-based methods
• content-based methods
• spam fighting: an interactive framework

Page 58: CS6913: Web Search Engines Lecture on Computational

Recall from Before:

• web spam: web pages generated or manipulated for the purpose of achieving high rankings in search engines
• types of spam: text spamming vs. link spamming
• how to deal with spamming:
  - detect (delete spam from engine)
  - demote (decrease ranking of pages likely to be spam)
  - discourage (build mechanisms that are hard to spam)

Page 59: CS6913: Web Search Engines Lecture on Computational

Dealing with Spam:

• types of spam:
  - text spam (try to rank spam itself higher)
  - link spam (link farms: try to use spam to rank other page or site higher)
  - or combinations of both
  - also, comment spam, ad/news spam, many other forms
• how to detect/demote spam:
  - text-based techniques: “find text that looks odd”
  - link-based techniques: “find unusual link structures” (based on either page graph or site graph)
  - or more common: use links as a measure of reputation
  - based on other statistics: “find sites with off characteristics” (e.g., long URLs or titles, large or small pages, domains, ..)
  - site structure, click-through data
  - detecting hiding/cloaking/copying of content
• or use human input
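To make “find sites with off characteristics” concrete, the sketch below extracts a few simple statistics from a URL and page text. The particular features and the function name are illustrative assumptions; a real system would feed many such features into a classifier rather than rely on fixed thresholds.

```python
from collections import Counter
from urllib.parse import urlparse


def simple_spam_features(url, text):
    """Compute a few simple statistics that may hint at spam (illustrative)."""
    words = text.lower().split()
    counts = Counter(words)
    # Fraction of the text taken up by the single most frequent word;
    # keyword stuffing ("hotel hotel hotel ...") pushes this toward 1.
    top_word_fraction = max(counts.values()) / len(words) if words else 0.0
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "url_hyphens": parsed.netloc.count("-") + parsed.path.count("-"),
        "top_word_fraction": top_word_fraction,
        "num_words": len(words),
    }


if __name__ == "__main__":
    features = simple_spam_features(
        "http://www.cheap-hotels-nyc.xyz.com/casino/pills.html",
        "hotel hotel hotel hotel cheap",
    )
    print(features)
```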

Page 60: CS6913: Web Search Engines Lecture on Computational

Evaluating Spam:

• remember: precision/recall
• example:
  - domain with 1,000,000 pages, including 250,000 spam pages
  - spam detection algorithm reports 200,000 pages as spam
  - of those 200,000, only 150,000 are really spam
  - precision: 150,000/200,000 = 75%
  - recall: 150,000/250,000 = 60%
• but how do we know this? How do we check how good the spam detection algorithm is?
• spam is in the eye of the beholder
  - use human judgment to evaluate
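The precision and recall figures in the example follow directly from the three counts. A minimal sketch (the function and parameter names are chosen here for illustration):

```python
def precision_recall(reported, correctly_reported, actual_spam):
    """Precision and recall of a spam detector from raw page counts.

    reported           -- pages the algorithm flagged as spam
    correctly_reported -- flagged pages that really are spam
    actual_spam        -- all spam pages in the collection
    """
    precision = correctly_reported / reported
    recall = correctly_reported / actual_spam
    return precision, recall


if __name__ == "__main__":
    precision, recall = precision_recall(200_000, 150_000, 250_000)
    print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```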

Page 61: CS6913: Web Search Engines Lecture on Computational

Evaluating Spam: (ctd.)

• use sampling to decrease data size for evaluation
• an example:
  - domain with 1,000,000 pages, including 250,000 spam pages
  - select 1,000 pages at random and have a human decide if spam
  - if 258 of 1000 pages are spam, we expect 258,000 spam pages total
  - of these 1000 pages, maybe 196 were rated spam by the algorithm, including 152 of the 258 spam pages in the sample
  - estimated precision 152/196 = ?
  - estimated recall 152/258 = ?
• good approximation if sample unbiased and large enough
• note: we also need a set of known spam to get started
• machine learning/data mining/statistics problem
  - remember text classification
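The sample-based estimates can be worked out as follows. This is a sketch with hypothetical names; the standard-error line uses the usual binomial formula sqrt(p(1-p)/n) as a rough indication of what “large enough” means, which goes slightly beyond what the slide states.

```python
import math


def estimate_from_sample(population, sample_size, human_spam, flagged,
                         flagged_and_spam):
    """Estimate spam statistics for a whole collection from a labeled sample.

    population       -- total number of pages in the collection
    sample_size      -- number of randomly sampled pages judged by a human
    human_spam       -- sampled pages the human judged to be spam
    flagged          -- sampled pages the algorithm rated as spam
    flagged_and_spam -- sampled pages rated spam by both algorithm and human
    """
    spam_fraction = human_spam / sample_size
    return {
        "estimated_total_spam": round(spam_fraction * population),
        "estimated_precision": flagged_and_spam / flagged,
        "estimated_recall": flagged_and_spam / human_spam,
        # Rough binomial standard error of the sampled spam fraction.
        "spam_fraction_std_error": math.sqrt(
            spam_fraction * (1 - spam_fraction) / sample_size
        ),
    }


if __name__ == "__main__":
    result = estimate_from_sample(1_000_000, 1000, 258, 196, 152)
    for name, value in result.items():
        print(name, round(value, 4))
```

For the numbers on the slide this gives an estimated precision of about 77.6% and an estimated recall of about 58.9%, close to the true 75% and 60% from the previous example.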

Page 62: CS6913: Web Search Engines Lecture on Computational

Textual Methods for Dealing with Spam:

• basic ideas:
  - suspicious word patterns: hotel hotel hotel hotel ..
  - suspicious words: casino, cheap pills, lawyers (might be wrong)
  - many unrelated topics on same page/site (dictionary dumping, weaving)
  - parts or most of text copied from other sites (duplicate detection)
  - cloaking or hidden text
  - suspicious URL, title, or meta data patterns (e.g., long URLs with hyphen: www.cheap-hotels-nyc.xyz.com/casino/pills.html)
• general approach based on machine learning/data mining:
  - start with small set of known spam and non-spam
  - often pages inspected or reported by human user
  - now a spam classifier can learn what spam looks like
  - simplest approach: naïve Bayes, Bayesian (as in text categorization!)
  - “find other pages that look like these examples”
  - training set vs. evaluation set (should be disjoint sites?)
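The learning-based approach can be sketched with a tiny multinomial naive Bayes classifier using add-one smoothing. The class name, the whitespace tokenization, and the toy training pages are assumptions for illustration; both classes need at least one training example before classifying.

```python
import math
from collections import Counter


class NaiveBayesSpamClassifier:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocabulary = set()

    def train(self, text, label):
        """Add one labeled page ("spam" or "ham") to the training set."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def log_score(self, text, label):
        """Log of prior times smoothed word likelihoods for one class."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for word in text.lower().split():
            score += math.log((self.word_counts[label][word] + 1) / denominator)
        return score

    def classify(self, text):
        """Return the class with the higher score."""
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


if __name__ == "__main__":
    classifier = NaiveBayesSpamClassifier()
    classifier.train("cheap hotel hotel hotel casino pills", "spam")
    classifier.train("casino cheap pills hotel", "spam")
    classifier.train("lecture notes on web search engines", "ham")
    classifier.train("course schedule and lecture slides", "ham")
    print(classifier.classify("cheap casino hotel"))
    print(classifier.classify("lecture slides"))
```

As the slide notes, the evaluation pages should be kept separate from the training pages, ideally from different sites, or the measured accuracy will be too optimistic.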

Page 63: CS6913: Web Search Engines Lecture on Computational

Link-Based Methods for Dealing with Spam:

• basic ideas:
  - good pages point to good pages, while spam may point to anything
  - if a good page links to you, you are good (positive)
  - if you link to a bad page, you are bad (negative)
• basic algorithmic approaches:
  - trustrank: starting from good pages, identify other good pages (sort of like PageRank)
  - badrank: starting from bad pages, identify other bad pages (sort of like inverse PageRank)
  - plus variations: combinations, topic sensitive, site graph, etc.
• approaches again need start set of examples
  - through human inspection or reporting
  - through text-based or other methods
  - through other link-based methods that detect “odd” link structures
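The trustrank idea can be sketched as a PageRank-style iteration in which the random jump goes only to a set of trusted seed pages, so trust flows forward along links from the seeds. This is a simplified illustration rather than the exact published algorithm; the function name, the damping factor of 0.85, the fixed iteration count, and the toy graph are assumptions.

```python
def trustrank(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links.

    out_links     -- dict: page -> list of pages it links to
    trusted_seeds -- set of pages known to be good (must occur in the graph)
    Returns a dict: page -> trust score; scores sum to 1.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    # The jump distribution is concentrated on the trusted seeds.
    seed_weight = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0) for p in pages
    }
    trust = dict(seed_weight)
    for _ in range(iterations):
        new_trust = {p: (1.0 - damping) * seed_weight[p] for p in pages}
        for page in pages:
            targets = out_links.get(page, [])
            if targets:
                share = damping * trust[page] / len(targets)
                for target in targets:
                    new_trust[target] += share
            else:
                # Pages without out-links hand their trust back to the seeds.
                for p in pages:
                    new_trust[p] += damping * trust[page] * seed_weight[p]
        trust = new_trust
    return trust


if __name__ == "__main__":
    graph = {
        "good": ["friend"],
        "friend": ["good"],
        "spam1": ["spam2", "good"],  # spam may link to anything
        "spam2": ["spam1"],
    }
    scores = trustrank(graph, trusted_seeds={"good"})
    for page in sorted(scores):
        print(page, round(scores[page], 4))
```

In the toy graph the spam pages link to a good page but receive no links from trusted pages, so they end up with zero trust. Reversing the link direction and seeding with known bad pages gives the badrank variant.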

Page 64: CS6913: Web Search Engines Lecture on Computational

Example: Search Engine Results

• major search engines should limit spam in results

• no fraud/attack sites in top-10 of common queries

• no fake bank sites, no sites trying to install trojans

• search engines make major investments to protect

• detecting fake sites trying to get passwords

• detecting if sites try to install software

Page 65: CS6913: Web Search Engines Lecture on Computational

General Perspective and Approach:

• the spam problem can probably not be completely solved
• but it can be kept under control by investing resources (?)
• compare: computer security, police (there will always be crime)
• obfuscation does help! (keep ranking function secret)
• battle of resources:
  - search engine vs. spammers
  - how many resources (people) needed to keep ranking clean enough?
  - maybe just a few smart people with powerful techniques?
• semi-automatic approach?
• cooperative spam fighting? (community feedback)
• economic approaches? (make spam less lucrative and more expensive to make)