1006_the demographics of web search
TRANSCRIPT
-
8/7/2019 1006_The Demographics of Web Search
1/22
SIGIR 2010
-
Query Disambiguation
The query "wagner":
For women, the most clicked URL was about the composer.
For men, the most clicked URL was about paint sprayers.
The incomplete term "hal":
In general, the most likely completion is "hal lindsey" (evangelist).
For people living in areas with an above-average education level, it is "hal higdon" (writer and runner).
-
How to solve ambiguities
Grouping users by demographic features, such as age or income.
Models are learned for individual users. Given enough data about individual users, such models can indeed prove very powerful.
-
Contribution
We demonstrate how public information for ZIP codes can be used to annotate both web queries and URLs with demographic features.
We show that the user population of a large, commercial search engine is representative of the whole US population.
We uncover differences in search behavior across demographic segments.
We show that using demographic information has the potential to improve state-of-the-art web search results.
-
Query Log Preprocessing
Fourth, only web searches originating from the US version of the search engine and pertaining to users with a valid ZIP code were used.
Fifth, queries without clicks on URLs were discarded. When multiple URLs were clicked for the same query, multiple (query, URL) pairs were generated.
Sixth, queries were cast to lowercase, but no stemming was applied and all special characters (such as apostrophes) were kept.
Seventh, immediate, repeated duplicates of (query, URL) pairs by a single user were conflated to a single instance.
"Immediate" means that we still kept repeated (query, URL) pairs for a single user as long as there were other pairs in between.
(q,u1)
(q,u1)
(q,u1)
(q,u1)
(q,u2)
(q,u1)
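The fifth to seventh steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the input layout (a query string plus a list of clicked URLs, in time order for one user) are hypothetical and not taken from the slides.

```python
from itertools import groupby


def normalize_query(query):
    # Sixth step: lowercase only; no stemming, special characters are kept.
    return query.lower()


def to_pairs(query, clicked_urls):
    # Fifth step: one (query, URL) pair per clicked URL.
    # A query without clicks yields no pairs and is thereby discarded.
    q = normalize_query(query)
    return [(q, url) for url in clicked_urls]


def conflate_immediate_repeats(pairs):
    # Seventh step: collapse runs of identical consecutive pairs of one user.
    # A pair that reappears after a different pair is kept.
    return [pair for pair, _run in groupby(pairs)]


# The sequence from the slide, for a single user.
sequence = [("q", "u1")] * 4 + [("q", "u2"), ("q", "u1")]
print(conflate_immediate_repeats(sequence))
```

`itertools.groupby` only groups adjacent equal items, which is exactly the "immediate" behavior described above.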
-
Basic statistics
-
Demographic Feature Extraction
per-capita income [P-c income k$]
bachelor's degree or higher [BA degree %]
individuals below poverty level [below poverty %]
race: white [white %], African American [African American %], Asian [Asian %]
speaks a language other than English at home [non-English %]
-
Demographic Feature Extraction
Pairs (input, target) were then labeled with demographic information:
directly from the user's profile (birth year and gender)
from demographic information pertaining to ZIP codes
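A minimal sketch of this labeling step, assuming the profile is a dictionary and the ZIP-level features sit in a lookup table. The table `CENSUS_BY_ZIP`, its values and the function `label_pair` are hypothetical and made up for illustration.

```python
# Hypothetical per-ZIP census features; the numbers are invented.
CENSUS_BY_ZIP = {
    "94301": {"BA degree %": 70.0, "P-c income k$": 60.0},
}


def label_pair(pair, profile, census_by_zip=CENSUS_BY_ZIP):
    """Attach demographic labels to one (input, target) pair."""
    # Labels taken directly from the user's profile.
    labels = {
        "birth year": profile.get("birth year"),
        "gender": profile.get("gender"),
    }
    # Labels looked up through the user's ZIP code, if the ZIP is known.
    labels.update(census_by_zip.get(profile.get("zip"), {}))
    return {"pair": pair, "labels": labels}


example = label_pair(
    ("wagner", "u1"),
    {"birth year": 1980, "gender": "f", "zip": "94301"},
)
```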
-
Demographic Feature Extraction
The labels applied to each (input, target) pair were discretized.
For all demographic features we used quintiles: the percentile intervals [0%; 20%], (20%; 40%], ..., (80%; 100%].
E.g., a ZIP code with no more than 12.8% of its population 25 years and over holding a bachelor's degree would be placed in the lowest quintile for the corresponding feature.
For the ZIP code itself, we used only the two leading digits, giving a total of 99 buckets.
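The discretization can be sketched as follows. The helper names are hypothetical, and the cut points in the usage example are invented except for 12.8, which is the boundary quoted above.

```python
from bisect import bisect_left
from statistics import quantiles


def quintile_cuts(values):
    # Four cut points that split the observed values into five groups.
    return quantiles(values, n=5)


def quintile_bucket(value, cuts):
    # bisect_left places a value equal to a cut point in the lower bucket,
    # which matches the intervals [0%; 20%], (20%; 40%], ..., (80%; 100%].
    return bisect_left(cuts, value)


def zip_bucket(zip_code):
    # ZIP codes are bucketed by their two leading digits.
    return zip_code[:2]


# Assumed cut points for "BA degree %"; only 12.8 comes from the slide.
ba_cuts = [12.8, 20.0, 30.0, 45.0]
lowest = quintile_bucket(12.8, ba_cuts)   # "no more than 12.8%" -> bucket 0
second = quintile_bucket(12.9, ba_cuts)   # just above the cut -> bucket 1
```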
-
Data Quality
Users provided false profile information, sometimes deliberately.
Solution: the ZIP code was derived by mapping the user's IP
-
METHODOLOGY
Let X, Y and D be random variables corresponding to the input, target and demographic information, respectively.
Similarly, let x, y and d be actual instances of values of these random variables.
With demographic information: $\arg\max_y P(y \mid x, d)$
Without demographic information: $\arg\max_y P(y \mid x)$
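One simple way to realize both predictors is to estimate the probabilities by relative frequencies and take the most frequent target. This is a sketch under that assumption; the slides do not say how the probabilities were estimated, and the toy observations below are invented.

```python
from collections import Counter, defaultdict


def fit(observations):
    """Count targets per (x, d) and per x from (x, y, d) triples."""
    with_d = defaultdict(Counter)
    without_d = defaultdict(Counter)
    for x, y, d in observations:
        with_d[(x, d)][y] += 1
        without_d[x][y] += 1
    return with_d, without_d


def probability(counts, key, y):
    # Relative-frequency estimate of P(y | key); 0.0 for unseen keys.
    seen = counts.get(key)
    if not seen:
        return 0.0
    return seen[y] / sum(seen.values())


def predict(counts, key):
    # argmax over y; normalizing by the total does not change the argmax.
    seen = counts.get(key)
    if not seen:
        return None
    return seen.most_common(1)[0][0]


# Invented toy data loosely following the "wagner" example.
toy = (
    [("wagner", "composer", "female")] * 3
    + [("wagner", "paint sprayers", "female")] * 1
    + [("wagner", "paint sprayers", "male")] * 4
    + [("wagner", "composer", "male")] * 1
)
with_d, without_d = fit(toy)
```

Here `predict(without_d, "wagner")` corresponds to the predictor without demographic information and `predict(with_d, ("wagner", "female"))` to the one with it.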
-
The table lists the four most discriminating queries for different demographic groups.
Queries with maximal P(D|Q)
Queries are ranked by the average feature value.
Older people tend to be more likely to use URLs as web queries.
-
Findings
Queries predominantly issued by young users tend to be related to chat rooms, music and social networking sites.
Queries which are issued exclusively by male users in our sample are related to sports, or to computer hardware and software.
Queries from areas where a language other than English is often spoken at home turn out to be written in Spanish.
-
Behavior of people with a university degree
Numbers are computed over roughly 95.8M
These differences, though small, are statistically significant at a significance level well below 0.001, using a t-test.
-
Conditional entropy
Queries q where a demographic group d has an unusually high or low conditional click entropy H(U|q, d).
A high click entropy can be due to a number of reasons. It can be that the presented web results for that query are poor and people have to try many pages,
but it can also be seen as an expression of high interest in a potentially multi-faceted topic.
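For one fixed query q and group d, the click entropy can be computed from the clicked URLs as the Shannon entropy of their empirical distribution. A minimal sketch, assuming base-2 logarithms (the slides do not state the base) and a hypothetical function name:

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Entropy in bits of the clicked-URL distribution for one (q, d)."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    # H(U | q, d) = - sum over u of p(u | q, d) * log2 p(u | q, d)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Clicks spread evenly over four URLs: high entropy.
spread = click_entropy(["a", "b", "c", "d"])
# All clicks on one URL: zero entropy.
focused = click_entropy(["a"] * 5)
```

The value alone does not tell the two explanations above apart: poor results and a multi-faceted topic both spread clicks over many URLs.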
-
Application Results
The input has a support of at least 100 users for some combination (x, d), as well as at least another 400 users for other values of d.
The baseline system ranks targets according to P(y|x).
Our system ranks them by P(y|x, d).
The last column shows the relative gain.
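The support condition can be read as a filter on inputs x. The sketch below assumes that "another 400 users" means the summed user count over all other values of d; that reading, the function name and the example counts are assumptions, not something the slides spell out.

```python
def has_enough_support(users_by_d, min_group=100, min_other=400):
    """users_by_d maps each value of d to the number of users who issued x."""
    total = sum(users_by_d.values())
    # Some group d needs at least min_group users, and the remaining
    # groups together need at least min_other users.
    return any(
        n >= min_group and total - n >= min_other
        for n in users_by_d.values()
    )


# Invented counts for one input x, split by a two-valued feature.
kept = has_enough_support({"female": 120, "male": 450})
dropped = has_enough_support({"female": 120, "male": 300})
```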
-
Conclusion
This is the first study that analyzes the web search behavior of different demographic groups for millions of US web users.
The simple but important observation that made this possible was the linkage of census information for ZIP codes.
For the most part, the population of search engine users appears to be a very good approximation of the US population.