1006_the demographics of web search

Upload: karthiga-nesamani

Post on 09-Apr-2018


  • 8/7/2019 1006_The Demographics of Web Search

    1/22

    SIGIR 2010

    Query Disambiguation

    The query "wagner":

    For women, the most clicked URL was about the composer.

    For men, the most clicked URL was about paint sprayers.

    An incomplete term, "hal":

    In general, it is completed to Hal Lindsey (evangelist).

    For people living in areas with an above-average education level, it is completed to Hal Higdon (writer and runner).

    How to solve ambiguities

    Grouping users by demographic features, such as age or income.

    Models are learned for individual users. Given enough data about individual users, such models have indeed proved very powerful.

    Contribution

    We demonstrate how public information for ZIP codes can be used to annotate both web queries and URLs with demographic features.

    We show that the user population of a large, commercial search engine is representative of the whole US population.

    We uncover differences in search behavior across demographic segments.

    We show that using demographic information has the potential to improve state-of-the-art web search results.

    Query Log Preprocessing

    Fourth, only web searches originating from the US version of the search engine and pertaining to users with a valid ZIP code were used.

    Fifth, queries without clicks on URLs were discarded. When multiple URLs were clicked for the same query, multiple (query, URL) pairs were generated.

    Sixth, queries were cast to lowercase, but no stemming was applied and all special characters (such as apostrophes) were kept.

    Seventh, immediate, repeated duplicates of (query, URL) pairs by a single user were conflated to a single instance.

    We still kept repeated (query, URL) pairs for a single user as long as there were other pairs in between.

    (q,u1)

    (q,u1)

    (q,u1)

    (q,u1)

    (q,u2)

    (q,u1)
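
The fifth to seventh steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `preprocess_user_clicks` and the event format (one chronological `(query, clicked_url)` entry per click for a single user, with `None` for a query without a click) are hypothetical and not taken from the slides.

```python
from itertools import groupby

def preprocess_user_clicks(events):
    """events: chronological (query, clicked_url or None) entries for ONE user (assumed format)."""
    pairs = [
        (query.lower(), url)   # lowercase only; no stemming, special characters are kept
        for query, url in events
        if url is not None     # queries without clicks are discarded
    ]
    # Collapse each run of immediately repeated identical pairs into a single instance.
    # A pair that reappears after a different pair is kept.
    return [pair for pair, _run in groupby(pairs)]

events = [("Q", "u1")] * 4 + [("q", "u2"), ("q", "u1"), ("no click", None)]
print(preprocess_user_clicks(events))
# [('q', 'u1'), ('q', 'u2'), ('q', 'u1')]
```

Applied to the sequence listed above, the four leading (q,u1) pairs collapse into one, while the final (q,u1) survives because (q,u2) lies in between.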

    Basic statistics

    Demographic Feature Extraction

    Per-capita income [P-c income k$]

    Bachelor's degree or higher [BA degree %]

    Individuals below poverty level [below poverty %]

    Race: white [white %], African American [African American %], Asian [Asian %]

    Speaks a language other than English at home [non-English %]

    Demographic Feature Extraction

    Pairs (input, target) were then labeled with demographic information:

    directly from the user's profile (birth year and gender)

    from demographic information pertaining to ZIP codes
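
A minimal sketch of this labeling step, assuming the census data is available as a lookup table keyed by ZIP code. The table `CENSUS_BY_ZIP`, its field names, its numbers, and the profile format are all hypothetical placeholders, not data from the study.

```python
# Hypothetical per-ZIP census features; the values are placeholders, not real census data.
CENSUS_BY_ZIP = {
    "11111": {"per_capita_income_k": 30.0, "ba_degree_pct": 20.0},
    "22222": {"per_capita_income_k": 45.0, "ba_degree_pct": 40.0},
}

def annotate(pair, profile):
    """Label one (input, target) pair with profile and ZIP-level features."""
    x, y = pair
    # Features taken directly from the user's profile.
    labels = {"birth_year": profile["birth_year"], "gender": profile["gender"]}
    # Features inherited from the user's ZIP code (none if the ZIP is unknown).
    labels.update(CENSUS_BY_ZIP.get(profile["zip"], {}))
    return {"input": x, "target": y, "labels": labels}

profile = {"birth_year": 1980, "gender": "f", "zip": "22222"}
print(annotate(("wagner", "u1"), profile))
```

Every user in a ZIP code receives the same ZIP-level labels, so these features describe the area a user lives in rather than the individual.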

    Demographic Feature Extraction

    The labels applied to each (input, target) pair were discretized.

    For all demographic features we used quintiles: the percentile intervals [0%; 20%], (20%; 40%], ..., (80%; 100%].

    E.g., a ZIP code with no more than 12.8% of its population 25 years and over holding a bachelor's degree would be placed in the lowest quintile for the corresponding feature.

    The exception is the ZIP code itself, where we used only the two leading digits, giving a total of 99 buckets.
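
The discretization can be sketched as follows. Only the 12.8% boundary comes from the slide; the other three cut points in `ba_cuts` are invented for illustration, and the helper names are hypothetical. `bisect_left` is used so that each interval is closed on the right, matching (20%; 40%] and so on.

```python
import statistics
from bisect import bisect_left

def quintile_cuts(values):
    """Four cut points (20th, 40th, 60th, 80th percentiles) of the observed values."""
    return statistics.quantiles(values, n=5, method="inclusive")

def quintile_bucket(value, cuts):
    """Index 0..4 of the quintile; a value equal to a cut point falls in the lower bucket."""
    return bisect_left(cuts, value)

def zip_bucket(zip_code):
    """ZIP codes are bucketed by their two leading digits."""
    return zip_code[:2]

# 12.8 is the boundary quoted on the slide; the remaining cut points are assumptions.
ba_cuts = [12.8, 18.0, 25.0, 36.0]
print(quintile_bucket(12.8, ba_cuts))  # 0 (lowest quintile)
print(quintile_bucket(12.9, ba_cuts))  # 1
print(zip_bucket("94301"))             # 94
```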

    Data Quality

    Users provided false profile information,

    sometimes deliberately.

    Solution: derive the ZIP code by mapping the user's IP address.

    METHODOLOGY

    Let X, Y and D be random variables corresponding to the input, target and demographic information, respectively.

    Similarly, let x, y and d be actual instances of values of these random variables.

    $\arg\max_y P(y \mid x, d)$ versus $\arg\max_y P(y \mid x)$
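
To make the two predictors concrete, here is a small count-based sketch. It assumes the probabilities are estimated by simple relative frequencies, which the slides do not state; the function names and the toy log are hypothetical, and the target labels stand in for URLs.

```python
from collections import Counter, defaultdict

def fit_click_counts(records):
    """records: iterable of (x, d, y) = (input, demographic bucket, target)."""
    by_x = defaultdict(Counter)    # counts for estimating P(y | x)
    by_xd = defaultdict(Counter)   # counts for estimating P(y | x, d)
    for x, d, y in records:
        by_x[x][y] += 1
        by_xd[(x, d)][y] += 1
    return by_x, by_xd

def best_target(counts):
    """argmax over y; dividing the counts by their total would not change the winner."""
    return max(counts, key=counts.get) if counts else None

# Hypothetical toy log for the "wagner" example.
records = (
    [("wagner", "female", "composer")] * 3
    + [("wagner", "male", "paint sprayers")] * 5
    + [("wagner", "male", "composer")]
)
by_x, by_xd = fit_click_counts(records)
print(best_target(by_x["wagner"]))               # paint sprayers
print(best_target(by_xd[("wagner", "female")]))  # composer
```

In this toy log, conditioning on d changes the prediction for the female group, while the demographic-blind model returns the same target for everyone.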

    The table lists the four most discriminating queries for different demographic groups.

    Queries with maximal P(D|Q)

    Queries are ranked by the average feature value.

    Older people tend to be more likely to use URLs as web queries.

    Finding

    Queries predominantly issued by young users tend to be related to chat rooms, music and social networking sites.

    Queries which are issued exclusively by male users in our sample are related to sports, or to computer hardware and software.

    Queries from areas where a language other than English is often spoken at home turn out to be written in Spanish.

    People with a university degree

    behavior

    Numbers are computed over roughly 95.8M

    These differences, though small, are statistically significant, with p-values well below 0.001 using a t-test.

    Conditional entropy

    Queries q where a demographic group d has an unusually high or low conditional click entropy H(U|q, d).

    A high click entropy can be due to a number of reasons. It can be that the presented web results for that query are poor and people have to try many pages, but it can also be seen as an expression of high interest in a potentially multi-faceted topic.
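
For reference, the conditional click entropy of one (q, d) combination can be computed from its clicked URLs as the Shannon entropy of the empirical click distribution. The sketch below assumes base-2 logarithms (bits), which the slides do not specify, and the function name is hypothetical.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """H(U | q, d) in bits for the URLs clicked by group d on query q."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(click_entropy(["u1", "u1", "u1", "u1"]))  # all clicks on one URL: zero entropy
print(click_entropy(["u1", "u2", "u3", "u4"]))  # clicks spread evenly over four URLs: 2.0
```

Low entropy means the group agrees on one result, while high entropy means its clicks are spread over many results.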

    Application Results

    The input has a support of at least 100 users for some combination (x, d), as well as at least another 400 users for other values of d.

    The baseline system ranks targets according to P(y|x).

    Our system ranks them by P(y|x, d).

    The last column shows the relative gain.
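
The support condition can be sketched as a simple filter. This reads "another 400 users for other values of d" as the combined user count over all other demographic values, which is an assumption; the function name and the input format are hypothetical.

```python
def has_enough_support(users_by_d, min_group=100, min_rest=400):
    """users_by_d: {d: number of distinct users who issued input x with demographic value d}."""
    total = sum(users_by_d.values())
    # Some value d needs at least min_group users, with at least min_rest users
    # spread over the remaining values of d (assumed interpretation).
    return any(n >= min_group and total - n >= min_rest for n in users_by_d.values())

print(has_enough_support({"q1": 120, "q2": 300, "q3": 150}))  # True
print(has_enough_support({"q1": 120, "q2": 200}))             # False
```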

    Conclusion

    This is the first study that analyzes the web search behavior of different demographic groups for millions of US web users.

    The simple but important observation that made this possible was the linkage of census information for ZIP codes.

    For the most part, the population of search engine users appears to be a very good approximation of the US population.