Engineering Web Search Applications


DESCRIPTION

This tutorial, offered at the 10th International Conference on Web Engineering, presents the peculiarities of advanced Web search applications, describes some tools and techniques that can be exploited, and offers a methodological approach to development. The approach proposed in this tutorial is based on the paradigm of Model Driven Development (MDD), where models are the core artifacts of the application life-cycle and model transformations progressively refine models to achieve an executable version of the system. To cope with the process-intensive nature of the main interactions (e.g., content analysis, query management), we describe the use of process models (e.g., BPMN models). Indeed, search-based applications can be considered process- and content-intensive applications, given the trend toward the "exploratory search" and "search as a process" visions.

TRANSCRIPT

  • 1. Engineering Web Search Applications Alessandro Bozzon, Marco Brambilla Vienna, July 5, 2010

2.

  • Alessandro Bozzon
  • Post-doc @Politecnico di Milano
  • http://home.dei.polimi.it/bozzon
  • Marco Brambilla
  • Assistant Professor @Politecnico di Milano
  • http://home.dei.polimi.it/mbrambil

About the speakers 2010 Alessandro Bozzon, Marco Brambilla

  • Research background and interests
    • Web engineering and model-driven development
      • WebML and WebRatio
      • Complex enterprise application design
    • BPM, SOA and integration with Web application devel.
    • Search engine and complex search application development
      • Search Computing: multidomain search
      • Pharos: multimedia search framework

July 5, 2010 ABOUT // 3. About the tutorial

  • Information Retrieval is a more than 40-year-old discipline, tackled from a myriad of viewpoints
  • This tutorial is:
    • Breadth-oriented
    • Development process driven
    • using real-world case studies as examples
  • The tutorial is necessarily shallow
    • But we provide references and links

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 ABOUT// 4. Agenda 2010 Alessandro Bozzon, Marco Brambilla 5. AGENDA

  • Introduction
    • What are Web search applications?
  • Requirements
    • What are their requirements?
  • Design
    • How to design them?
  • Implementation
    • How to implement them?
  • Validation
    • How to measure their success?

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 AGENDA// 6. Introduction 2010 Alessandro Bozzon, Marco Brambilla 7. Search prevails

  • Search is an integral part of people's online life
  • Web search has become a standard (and often preferred) source for information finding
    • ... 92% of Internet users say the Internet is a good place to go for getting everyday information... - 2004 Pew Internet Survey
  • Web search engines are now the second most frequently used online computer application, after email
  • Search is fully integrated into operating systems and is viewed as an essential part of most information systems

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 INTRODUCTION // 8. Some numbers

  • Web
    • Estimated size: ~60 billion pages (22/06/2010)
      • http://www.worldwidewebsize.com/
    • > 9.3 billion queries just in the U.S. in May 2010
      • http://blog.nielsen.com/nielsenwire/online_mobile/top-u-s-search-sites-for-may-2010/
      • and growing
  • Twitter
    • # of new tweets per day: 55 million
    • # of search queries per day: 600 million
  • Facebook
    • 400 million global users (and growing)
    • The average Facebook user spends 55 minutes per day

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 INTRODUCTION // 9. more numbers

  • IDC Digital Universe report estimates:
    • digital data grew by 62% between 2008 and 2009
      • ~800,000 petabytes (PB)
    • > 1.2 million PB in 2010
    • expected to reach 35 ZB (zettabytes) by 2020

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 INTRODUCTION // [Ramakrishnan and Tomkins 2007] 10. Information Retrieval

  • Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.
    • Old discipline
  • As an academic field of study:
    • Information retrieval (IR) is devoted to finding relevant documents, not finding simple matches to patterns.
    • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
      • [Manning et al., 2007]

2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // July 5, 2010 11. Information Retrieval Applications

  • Search (ad-hoc retrieval)
    • Static document collection
    • Dynamic queries

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION //

  • Filtering
    • Queries are static
    • Document collection constantly changing
      • Example: corporate mails routed by predefined queries to different parts of the organization
[Figures: ad-hoc retrieval — static document collection, ad-hoc query, ranked result; document routing system — incoming documents matched against predetermined queries or user profiles] 12. The nature of information retrieval

  • retrieving all objects which might be useful or relevant to the user information need
    • Usually unstructured queries (no formal semantics)
      • The IR system interprets the contents of the information items
      • Examples: keyword-based queries, context queries, proximity, phrases, natural language queries
      • Also structural queries and, in recent systems, structured query languages are supported (but with a different semantics)
    • Errors in the results are tolerated
    • Core concept: relevance
      • Relevance ranking (according to the user need)
      • It is not clear what degree of relevance the user is happy with
      • The user starts from the top of the ranked list and explores downwards until satisfied

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // 13. Information Retrieval is NOT Data Retrieval

  • Data Retrieval (RDBMS, XML DB)
    • retrieving all objects which satisfy clearly defined conditions expressed through a query language.
    • Data has a well defined structure and semantics
    • Formal query languages
      • Regular expressions, relational algebra expressions, etc.
    • Results are EXACT matches; errors are not tolerated
    • No ranking w.r.t. the user information need
      • Binary retrieval: does not allow the user to control the magnitude of the output
      • For a given query, the system may return:
        • Under-dimensioned output
        • Over-dimensioned output

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // 14. The Information Retrieval Process July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // [Figure: a generic search-oriented application, split into a back end (content management) and a front end (query analysis, query interaction, search, result composition, result manipulation)] 15. Search Engine vs. Search Application

  • Search Engine
    • data management system which uses information retrieval algorithms to retrieve information items from one or more sources upon the submission of a query
  • Web Search Application
    • data management system where search engines are a piece of a more complex puzzle, which includes:
      • data source integration (e.g., databases, legacy systems, the Web)
      • content analysis technologies orchestration
      • user interfaces
      • Web-mediated social interactions, etc.

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // 16. Characterization of the user information need

  • It is not a simple problem:
    • Blurred goals
    • Sensory Gap
      • Gap between an object in the world and the information in a (computational) description of it
    • Semantic Gap
      • Lack of coincidence between the (computational) description of the information and its interpretation

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // 17. Evaluating an IR System

  • Precision: fraction of retrieved docs that are relevant
      • P(relevant|retrieved)
        • degree of soundness of the system
        • does not consider the total number of documents
  • Recall: fraction of relevant docs that are retrieved
      • P(retrieved|relevant)
        • degree of completeness of the system
  • (Both measures are sketched in code below)

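A minimal computation sketch of the two measures for a single query (illustrative Python; the document IDs and judgments are made up):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    hits = len(retrieved & relevant)            # relevant docs actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved docs are relevant; 5 docs are relevant overall.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 9})
print(p, r)  # 0.75 0.6
```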
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // 18. Enterprise search

  • Public Web search engines are the ones known to the general public
  • But there is also a huge need (and market share!) for professional search over enterprise repositories
  • Enterprise search is covered by
    • Packaged suites
      • Microsoft FAST
      • Autonomy IDOL
      • IBM OmniFind
      • Exalead
    • Frameworks
      • Apache UIMA (ex IBM)
      • Smila
      • Solr

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla INTRODUCTION // 19. Case Studies

  • Textual Search
    • YaGoBi
  • Multi-media Search
    • The PHAROS Project
  • Multi-domain Search
    • The Search Computing project
  • Example of Web Search Application
    • Chansonnier

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010CASE STUDIES // 20. YaGoBi

  • THE Web search engines
    • 92% of market share in the U.S.
  • Searching on
    • Web pages, blogs, news, books, scientific publications, emails
    • Images and videos (but only through textual descriptions)
    • Tweets

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla CASE STUDIES // 21. The PHAROS Project

  • FP6 IP, 3 years, 12 partners, ~15M budget
  • Mission : develop a SOA-compliant, open and distributed technology platform for the development of information access solutions for audiovisual content
  • www.pharos-audiovisual-search.eu

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 CASE STUDIES // 22. The Search Computing Project

  • European Research Council (ERC), 2008 Call for "IDEAS" Advanced Grants, 5y (started in 2009)
  • Mission : provide the abstractions, foundations, methods, and tools required to answer multi-domain queries by interacting with a constellation of cooperating search services, using ranking and joining of results as the dominant factors for service composition
  • www.search-computing.org

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 CASE STUDIES // 23. Chansonnier

  • BSc thesis project
    • Mission : graduate
  • Open-source video analysis application based on open frameworks (SMILA / Solr)
    • Crawling of Web video
    • Download of song lyrics
    • Analysis on lyrics text
      • Language, emotion
    • Keyframe extraction for video snippets
  • http://github.com/giorgiosironi/Chansonnier

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010CASE STUDIES // 24. Requirements 2010 Alessandro Bozzon, Marco Brambilla 25. Key Requirements and Design Dimensions for Web Search 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //

  • Data Source
  • User Behavior
  • Query Format
  • User Interface
  • Security
  • Data Analysis
  • Performance
  • Data Format
  • Social Interactions
  • Search Engine

26. Data Sources

  • Web
  • Databases
  • File systems
  • Intranet / Extranets
  • Legacy systems
  • Users
  • Sensors (in a wide sense) and streams

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 27. Data Type

  • Unstructured data
    • Textual Documents
    • Blog Posts
  • (Semi) Structured data
    • Software Code
    • Models
    • XML Files
  • Media
    • Pictures
    • Video
    • Music

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 28.

  • Textual Analysis
    • Deals with basic language units (morphemes, roots, stems, words, phrases, sentences, etc.)
  • Media Analysis
    • Deals with media contents
      • Transcoding
      • Classification
      • Feature Extraction

Data Analysis July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //

  • An activity performed with the purpose of providing a representation of a content item suited for the application

29. Search Engine _1

  • Textual
    • Textual contents represented as a collection of unstructured text terms
  • Fielded
    • Textual contents structured in fields (e.g., metadata)
  • Semi-structured
    • Textual contents organized in a complex (possibly heterogeneous) structure (e.g., XML, HTML)

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 30. Search Engine _2

  • Content-based
    • Media contents described by low-level features
  • Geographic and other special dimensions
    • Content featuring geo-spatial features
    • Streaming content searched by temporal features (e.g., recency)

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 31. Query Format

  • Representation of the user information need
    • Natural Language
      • For instance, through vocal interfaces
    • Keyword
      • Set of text items, plus Boolean (AND/OR/NOT), proximity (lexical nearness) and/or wildcard conditions
    • Fielded Keyword
      • Text items defined on one or more fields
      • Queries to semi-structured search engines and faceted queries
    • Content-based
      • Query by example (text, image, video, audio, etc.)
    • Geographic and other special dimensions
      • Geographic coordinates plus spatial operator terms (near, north of, within X kilometers from, etc.)
      • Timestamps plus temporal operator terms (recent, near, interval, etc.)

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 32. YaGoBi

  • Data Sources
    • Web : crawling of Web resources
    • Users : comments, preferences, relationships
  • Data Types
    • Unstructured data :Web pages
    • Documents : PDF, PPT, DOC, etc.
  • Data Analysis
    • Textual : for content, document, and user generated comments
    • Media : some basic image analysis for color, faces, size
  • Search Engine
    • Fielded: filetype, page title, site, page content
    • Content-based: image similarity in Google
  • Query Format:
    • Fielded keyword
    • Geographic

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 33. PHAROS

  • Data Sources
    • Web : crawling of audio/video files
    • File System : NAS and content provider media archives
    • Users : comments, preferences, relationships
  • Data Types
    • Structured data : content provider description metadata
    • Media : high-quality video and audio files
    • Semi-structured data : MPEG-7 description of processed media files and user annotations
  • Data Analysis
    • Textual : for content metadata and user generated comments
    • Media : for audio and video
      • Audio/Video Mood classification, Image concept classification, Music Genre, Danceability classification, face recognition and identification, speech to text

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 34. PHAROS

  • Search Engine
    • Semi-structured : XML search engine for MPEG-7 content description
      • Plus geographic annotations and geo-based ranking
    • 3 content-based engines :
      • one for music,
      • one for images (shots of the video),
      • one for face similarity
  • Query Format
    • Fielded-keyword : XQuery for XML search engine
    • Query by example : for image, music and faces
    • MPQF: high level query language
      • AND/OR/AND THEN for fielded keyword and by-example queries

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 35. Query Federation in PHAROS July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // [Figure: query analysis federates a query (keywords "amsterdam", a JPG example image, long/lat 52.37N 4.89E, XPath here[contains(amsterdam)] and topic[contains(building)]) across a geo search engine (R-tree index), a text search engine (inverted index), an XML search engine (semantic index), and an image search engine (similarity index)] 36. User Behavior

  • Search is evolving
    • Content Vs. Intent
      • People dont want to search
      • People want to get task done and get answers
    • Moving towardsidentifying a users task
    • Enabling means fortask completion
  • Search as a Process

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //

  • Search applications must
    • Support the user in the search process
    • (try to) Infer the user intent to help them accomplish their task

Ricardo Baeza-Yates, Next Generation Search, 2nd SeCo Workshop, Milan, 24/06/2010 [Figure: from start to end of a task — "I am craving a good Wiener Schnitzel and a Sachertorte in Vienna": search, menu, reviews, map] 37. Information Seeking [Bates, 2002] July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // Bates, Marcia J. 2002. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts. 38. Information Foraging

  • Information foraging applies ideas from optimal foraging theory to understand how human users search for information.
  • Assumption: humans use "built-in" foraging mechanisms that evolved to help our animal ancestors find food.

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //

  • Some References
    • Fu, Wai-Tat; Pirolli, Peter (2007), "SNIF-ACT: a cognitive model of user navigation on the world wide web", Human-Computer Interaction: 335-412
    • Jason Withrow, "Do your links stink?", American Society for Information Science Bulletin, June 1, 2002
    • Pirolli, Peter (2009), "An elementary social information foraging model", Proceedings of the 27th international conference on Human factors in computing systems: 605-614

39. Moving between patches

  • Patches of information = websites
  • Problem: should I continue foraging in the current patch or look for another patch?
  • Expected gain from continuing in the current patch vs. moving to another

2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // July 5, 2010 40. Information seeking funnel [D. Rose, 2008]

  • Wandering: the user does not have an information-seeking goal in mind.
  • Exploring: the user has a general goal but not a plan for how to achieve it.
  • Seeking: the user has started to identify information needs that must be satisfied, but the needs are open-ended.
  • Asking: the user has a very specific information need that corresponds to a closed-class question

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 41. Berrypicking vs. Orienteering vs. Teleporting ...

  • Information needs change during interactions
      • M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407-431, 1989.
  • Orienteering [Teevan et al., CHI 2004]: the searcher issues a quick, imprecise query to get to approximately the right information-space region, then follows known paths that require small steps moving them closer to their goal. Easy! (perfect query not needed)
  • Teleporting: expert searchers issue longer queries to jump directly to the target. Requires more effort and experience.

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 42. vs. exploratory search

  • Exploratory Search: the user's intent is primarily to learn more on a topic of interest, by exploring various directions and sources
    • exploratory search blends querying and browsing strategies and is different from retrieval, which is best served by analytical strategies
          • Marchionini, G. Exploratory search: from finding to understanding. Communications of the ACM 49(4): 41-46 (2006)

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS //

  • Some references
    • Definition and analysis of the problem
      • White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007)
    • Complex Search and Exploratory Search
      • Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)

43. Multi-domain Exploratory Search

  • search for upcoming concerts close to an attractive location (like a beach, lake, mountain, natural park, and so on), also considering the availability of good, close-by hotels
  • Current approach the user can adopt:
    • Independently explore search services
    • Manually combine findings

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 44. Multi-domain Exploratory Search

  • expand the search to get information about available restaurants near the candidate concert locations, news associated with the event, and options to combine further events scheduled on the same days and located close to the first one

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 45. Existing Approaches _1

  • Topic based search : instance of exploratory search centered on the goal of collecting information on a subject matter of interest from multiple sources

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //

        • Kosmix : topic discovery engine; keyword search; a topic page summarizes the most relevant information on the subject
        • Hakia : resume pages for topics associated with users' queries; natural language processing techniques

46. Existing Approaches _2

  • Structured Object Search : process queries and present results that address entities or real world objects described in Web pages

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //

        • Google Squared: keyword search; results collected in a table (called a square) featuring all the attributes relevant to the result items as column headers
        • Google Fusion Tables: upload data tables (e.g., spreadsheet files) and join (or fuse) the data in some column with other tables

47. The note-taking limit

  • There is a limit after which the options found need to be written down.

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // [Aula and Russell, 2008] 48. Liquid Queries

  • A new paradigm allowing users to formulate and get responses to multi-domain queries through an exploratory information seeking approach, based upon structured information sources exposed as software services
  • Composite answers obtained by aggregating search results from various domains
  • Highlight the contribution of each search service
  • Join of results based on the structural information afforded by the search service interfaces
  • Refine the user query
  • Re-shape the result list

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //

  • Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA

49. Liquid Queries Definition _1

  • Template-based approach
  • It consists of subsetting and parametrizing the resource graph...

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // [Figure: resource graph with nodes Concert, Artist, Exhibition, Restaurant, Hotel, Movie, Metro Station, Theatre, Photo, Landmark, News; a subset (Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel) is selected and annotated with inputs, outputs, and GR = global ranking] 50. Liquid Queries Definition _2

  • And then characterizing the user interaction
  • Plus:
    • Parametrization of global ranking
    • Data visualization options
    • ... and so on

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // [Figure: the selected sub-graph (Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel) with an Expand interaction] 51. Result Exploration Support

  • If the current set of combinations is not satisfactory, the user may ask for more values for one service (more one) or for all services (more all)
    • More concerts, more hotels, or more combinations
  • Add new information about further domains for selected combinations (expand)
    • Find close-by restaurants or co-located events
  • Aggregate information to ease analysis and readability (clustering, grouping)
    • Group events by venue
  • Reduce the number of shown items through filtering
    • Total walked distance for the night
  • Re-order (ranking or sorting)
    • Calculate derived values from existing ones
    • Total walked distance for the night
  • Alternative data visualizations
    • Map, parallel coordinates, ...

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS //

  • DEMO :
  • http://demo.search-computing.org

52. User Intent

  • Understand the user information need
    • User intent taxonomy (Broder 2002)
      • Informational: want to learn about something (~40% / 65%)
      • Navigational: want to go to a given page (~25% / 15%)
      • Transactional: want to do something (web-mediated) (~35% / 20%)
      • Grey areas
        • Find a good hub
        • Exploratory search

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // [from SIGIR 2008 tutorial, Baeza-Yates and Jones; example queries: History, nyonya food, Singapore Airlines, Jakarta Weather, Nikon Finepix, Car Rental Kuala Lumpur] 53. Contextual Content Delivery

  • Context Vs. Personalization
  • Trigger the right search depending on the context
    • Task
    • Location
    • User Engagement
  • Not interested in your personal profile
    • Your favorite restaurant?
      • It depends on where you are!

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // from Ricardo Baeza-Yates, Next Generation Search, 2nd Search Computing Workshop, Milan, 24/06/2010. Demo: http://sandbox.yahoo.com/Motif 54. Relevance: the Top-k problem

  • Relevance of the results with respect to the request is the main expectation of search engine users
  • Top-k relevant items : quickly retrieve a number (k) of highest-ranking tuples in the presence of monotone ranking functions defined on the attributes of the underlying relations
  • Some References
    • R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83-99, 1999.
    • I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203-214, 2004
    • D. Martinenghi and M. Tagliasacchi: Proximity Rank Join, to appear in PVLDB

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 55. Result Diversification

  • Relevance is not the only success factor for a result set
  • User satisfaction is increased if the first items cover a good spectrum of options
    • If user intent is ambiguous, diversification tries to cover the most likely intents
    • If several top-k items are very similar, they can be clustered together
  • Thus: an optimization problem
  • Objective: find the set of k elements that contains the most relevant and diverse items
  • Maximal Marginal Relevance [Carbonell and Goldstein 1998] (see the sketch below)

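Maximal Marginal Relevance can be realized as a greedy loop that trades query relevance against redundancy with the items already selected. A minimal Python sketch, not the authors' implementation; the similarity functions and the lam = 0.7 trade-off are assumptions:

```python
def mmr(candidates, rel, sim, k, lam=0.7):
    """Greedy MMR re-ranking.

    candidates: list of doc ids
    rel:        dict doc -> relevance score w.r.t. the query
    sim:        function (doc_a, doc_b) -> similarity in [0, 1]
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def marginal(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * rel[d] - (1 - lam) * redundancy
        best = max(pool, key=marginal)
        selected.append(best)
        pool.remove(best)
    return selected
```

With lam = 1 the re-ranking degenerates to pure relevance ordering; lowering lam buys diversity at the cost of relevance.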
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // [Figure: trade-off between relevance and diversity] 56. User Interface

  • More complete information in one search

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // [Figure: shortcuts, deep links, enhanced results] 57. User Interface July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 58. User Interface

  • Optimization of the result set layout (and of page space)

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 59. User Interface

  • Optimization of the result set layout (and of page space)

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 60. User Interface

  • Optimization of the result set layout (and of page space)

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 61. User Interface

  • Optimization of the result set layout (and of page space)

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla REQUIREMENTS // 62. Performance

  • Users don't want to lose time waiting for a search result
      • User satisfaction
  • Performance is a leading factor in the evaluation of Web search applications
    • Queries per second (QPS)
    • Time to index
  • Scalability
    • Content
    • Queries
  • Distribution
    • Service-oriented computing
    • Content Delivery Networks
      • But intellectual property may be a concern
    • More in the ARCHITECTURE section

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 63. Other Requirements

  • Social Interaction
    • Content evaluation
    • User relationships and actions as additional content description
  • Security & Privacy
    • Access policies
      • Collection Vs. Item level
    • Anonymity
      • Who I am = What I like + What I do + Where I am ?
      • A search process tells a lot about who is doing it
  • Alessandro Bozzon, Tereza Iofciu, Wolfgang Nejdl, Antonio V. Taddeo, Sascha Tönnies. Role Based Access Control for the interaction with Search Engines. COOPER Workshop 2007, Crete, Greece.

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REQUIREMENTS // 64. Design 2010 Alessandro Bozzon, Marco Brambilla 65. Designing Web Search Applications

  • Reference architecture
  • Reference execution processes
  • Set of design dimensions
  • Development methodology
  • Tools supporting the methodology

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla DESIGN // 66. Search Applications from 1000 feet 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 DESIGN // 67. Bird's eye view on Search Applications 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 DESIGN // 68. Search Application Processes July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla DESIGN // 69. An example of Indexing Process July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla DESIGN // 70. Pharos: the architecture July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla DESIGN // 71. Search Computing: the architecture July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla DESIGN // [Figure: main query flow relations] 72. Search Computing: the architecture July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla DESIGN // [Figure: high-level query "Where can I attend a DB scientific conference close to a beautiful beach reachable with cheap flights?" decomposed into sub-query 1 "Where can I attend a DB scientific conference?", sub-query 2 "place close to a beautiful beach?", sub-query 3 "place reachable with cheap flights?"] 73. Search Computing: the architecture July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla DESIGN // [Figure: low-level queries ConfSearch(DB, placeX, dateY), TourSearch(Beach, placeX), Flight(cost < ...)] [...] 126. Tokenization

  • Issues in tokenization
      • Apostrophes: Finland's -> Finland? Finlands? Finland's?

        • Hewlett-Packard -> Hewlett and Packard as two tokens?
        • San Francisco: one token or two? How do you decide it is one token?
      • Language issues (normalization)
        • Accents: résumé vs. resume.
        • L'ensemble -> one token or two?
          • L ? L' ? Le ?
      • How are your users likely to write their queries for these words? Use locale?
        • Punctuation (e.g., U.S.A. vs. USA)
        • Numbers (100.45 vs. 100,45 vs. 1.0045 E+2)
        • Dates (e.g., March 1st 2009 vs. 03/01/09 vs. 1/03/2009)
        • Case folding.
  • It depends on the addressed language (a toy tokenizer is sketched below)
    • E.g., in Chinese spaces do not separate words
      • (tokenization based on vocabulary)

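A deliberately naive tokenizer that makes two of the policy choices above explicit (illustrative Python; real analyzers such as Lucene's make these choices configurable per language and field):

```python
import re

def tokenize(text):
    text = text.lower()
    text = re.sub(r"'s\b", "", text)       # policy: drop possessive suffixes
    return re.findall(r"[a-z0-9]+", text)  # policy: split on any non-alphanumeric

print(tokenize("Finland's capital"))  # ['finland', 'capital']
print(tokenize("Hewlett-Packard"))    # ['hewlett', 'packard'] -- two tokens
```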
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 127. Stopword Removal

    • Removal of high-frequency words, which carry less information
    • Strategies
      • Statistical analysis of the indexed collection
      • Functional terms (articles, conjunctions, auxiliary verbs)
      • A-priori knowledge, based on the IR system domain
        • Creation of a stop list with all the terms to remove
        • An English stop list is about 200-300 terms (e.g., been, a, about, otherwise, the, etc.)
          • http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
  • Removes roughly 30%-50% of tokens (smaller dictionary)
  • It can decrease recall (e.g., "to be or not to be", "let it be"; see the sketch below)
  • Most Web search engines do not remove stopwords [ManningIR]

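The filter itself is a one-liner; the tiny stop list below is a made-up subset of a real English one, chosen to show how the classic counterexamples vanish entirely:

```python
STOP_WORDS = {"a", "about", "been", "the", "to", "or", "not", "be", "it", "let"}

def remove_stopwords(tokens):
    # Drop high-frequency function words from the token stream.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stopwords("to be or not to be".split()))  # [] -- query destroyed
print(remove_stopwords("let it be".split()))           # [] -- likewise
```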
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 128. Phrases (noun groups)

    • Phrases capture the meaning behind the bag of words and result in multi-term phrases
    • Uses of phrases:
      • Added to the query: a query New York should be modified to search for "New York" (> 10% gain in precision and recall)
      • Replace terms in the index: empirically considered not as good as query rewriting

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 129. Phrases (noun groups) - Strategies

    • Simple Phrases
      • Many systems identify phrases as any pair of terms not separated by:
        • stop term
        • punctuation mark
        • special character
      • Phrases occurring fewer than 25 times are removed (decrease in memory requirements)
    • NLP
    • Part Of Speech and Word Sense tagging
      • statistical or rule-based methods to identify the part of speech (noun, verb, adjective) of each token
    • Syntactic parsing
      • Identify the key syntactic components of a sentence usually by tagging according to POS and then applying a grammar (FSA and NFSA)
    • Thesauri

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 130. Thesauri

  • A thesaurus is a classification scheme composed of words and phrases whose organization aims at facilitating the expression of ideas in written text
    • E.g.: synonyms and homonyms
      • Example entry from Roget's thesaurus: cowardly, adjective
        • Ignobly lacking in courage: cowardly turncoats.
        • Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered
  • A thesaurus can be
    • Thematic: specific to the IR system's domain of application (most frequent case)
      • E.g.: Thesaurus of Engineering and Scientific Terms
    • Generic
  • A thesaurus can be used to
    • Help users formulate queries
    • Modify queries (done by the system)
    • Select index terms

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 131. Thesauri

  • Many kinds of thesauri have been developed for IR systems
    • Hierarchical: synonyms (RT related term, UF use for), generalization (BT broader term), specialization (NT narrower term)
      • ISO and ANSI standards, almost always thematic
      • Manually built and updated by domain experts
    • Clustered: clusters (or synsets) of words
      • Non-typed semantic relationships among clusters
        • Each cluster is a set of words having a strong semantic relationship (usually UF)
      • WordNet
      • Clustered thesauri can be automatically generated if no distinction is made among semantic relationships
    • Associative: graph of words, where nodes represent words and edges represent semantic similarity among words
      • Edges can be oriented or not, according to the symmetry of the similarity relationship
      • Edges can be weighted (fuzzy pseudo-thesauri)
      • Can be automatically generated from a collection of documents using co-occurrence relationships (see the sketch below)

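An associative pseudo-thesaurus of this last kind can be approximated in a few lines: count document-level co-occurrences and keep the frequent pairs as weighted edges. An illustrative sketch (the threshold and the toy corpus are made up):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(documents, min_count=2):
    """Weighted word graph: edge weight = number of docs where both terms occur."""
    pairs = Counter()
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        pairs.update(combinations(terms, 2))
    return {pair: n for pair, n in pairs.items() if n >= min_count}

docs = ["jaguar speed car", "jaguar car dealer", "jaguar habitat jungle"]
print(cooccurrence_edges(docs))  # {('car', 'jaguar'): 2}
```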
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 132. Stemming and Lemmatization

  • Goals
    • Reduce terms to their roots before indexing
    • Reduce inflectional/variant forms to a base form
      • language dependent
      • E.g.,
        • am, are, is -> be
        • car, cars, car's, cars' -> car
        • the boy's cars are different colors -> the boy car be different color
  • Stemming : heuristic process that chops off the ends of words in the hope of achieving the goal correctly most of the time
    • Stemming collapses derivationally related words
  • Lemmatization : NLP tool. It uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word (the lemma)
    • Lemmatization collapses the different inflectional forms of a lemma
    • Not widely used because it harms performance

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 133. Stemming

  • Many different algorithms :
    • Porter's algorithm
      • Commonest algorithm for stemming English
        • Porter, Martin F. 1980. An algorithm for suffix stripping. Program 14:130-137.
        • http://www.tartarus.org/martin/PorterStemmer/
    • One-pass Lovins stemmer
      • Lovins, Julie Beth. 1968. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11:22-31.
    • Lancaster
      • http://www.comp.lancs.ac.uk/computing/research/stemming/
      • Paice, Chris D. 1990. Another stemmer. SIGIR Forum 24:56-61
      • http://snowball.tartarus.org/demo.php
  • Stemming increases recall while harming precision (a usage sketch follows)

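A usage sketch of two of the stemmers above, assuming the NLTK package (which ships Porter and Lancaster implementations) is installed:

```python
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for word in ["caresses", "ponies", "relational", "operating"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
# Porter, e.g.: caresses -> caress, ponies -> poni;
# Lancaster is more aggressive and usually chops further.
```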
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 134. Stemming Example July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 135. Tools for text analysis _1

  • Lucene and Solr contain many text analyzers working on several languages
    • http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
      • CharFilters, Tokenizer, Token Analyzers
  • Apache Tika
    • http://tika.apache.org/
    • toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries
  • GATE (General Architecture for Text Engineering)
    • http://gate.ac.uk/
    • ANNIE (A Nearly-New Information Extraction System)
      • tokenizer, gazetteer, sentence splitter, part-of-speech tagger, named-entity transducer, coreference tagger
      • Support for English, Spanish, Chinese, Arabic, French, German, Hindi, Italian, Cebuano, Romanian, Russian
  • MALLET(Machine Learning for Language Toolkit)
    • http://mallet.cs.umass.edu/index.php
    • Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 136. Tools for text analysis _2

  • OpenNLP
    • http://opennlp.sourceforge.net/projects.html
    • open-source projects related to natural language processing
  • Cognitive Computation Group University of Illinois
    • http://l2r.cs.uiuc.edu/~cogcomp/software.php
      • Chunker, Part of Speech tagger, String similarity, Semantic Role Labeler Named Entity Extractor, etc.
  • Supersense Tagger
    • http://medialab.di.unipi.it/wiki/SuperSense_Tagger
    • tool for assigning to each noun, verb, adjective and adverb of a sentence one of the 45 standard WordNet supersenses
  • Wordnet Domains
    • http://wndomains.fbk.eu/hierarchy.html
  • Synesketch
    • http://www.synesketch.krcadinac.com/
    • Open source textual emotion recognition

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 137. Multimedia Content Analysis

  • Computers are not able to catch the underlying meaning of multimedia content. Annotation is needed.
  • Manual annotation
    • Expensive
      • It can take up to 10x the duration of the video
      • Problems in scaling to millions of content items
    • Incomplete or inaccurate
      • People might not be able to holistically catch all the meanings associated with a multimedia object
    • Difficult
      • Some contents are tedious to describe with words
        • E.g., a melody without lyrics
  • Automatic annotation
    • Reasonably good quality
      • Some technologies reach ~90% precision
    • Low cost

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010IMPLEMENTATION // 138. Audio Segmentation

  • GOAL: split an audio track according to contained information
    • Music
    • Speech
    • Noise
  • Additional usage
    • Identification and removal of ads

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010IMPLEMENTATION // 139. Video Segmentation

  • Keyframe segmentation:
    • segment a video track according to its keyframes
      • fixed-length temporal segments
  • Shot detection:
    • automated detection of transitions between shots
      • a shot is a series of consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space.

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // CREDITS:Thorsten Hermes@SSMT2006 140. Speech Analysis

  • Speaker Identification : identify people participating in a discussion
  • Additional usage:
    • Vocal command execution
  • Speech To Text : automatically recognize spoken words belonging to an open dictionary

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // [Figure: speaker identification — ERIC, DAVID, JOHN] 141. Classification of Music Genre

  • GOAL: automatically classify the genre and mood of a song
    • Rock, pop, jazz, blues, etc.
    • Happy, aggressive, sad, melancholic, ...
  • Additional usage:
    • Automatic selection of songs for playlist composition
  • Tutorial from PHAROS Summer School
    • http://www.pharos-audiovisual-search.eu/res/files/SummerSchool/Programme_Summer_School_file.zip

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // [Figure: example classifications "Rock", "Dance!"] 142. Images: Low-level features

  • GOAL: extract implicit characteristics of a picture (a histogram sketch follows)
    • luminosity
    • orientations
    • textures
    • color distribution

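As a concrete instance of such a feature, a global gray-level histogram computed with OpenCV's Python bindings (a sketch; the file name is hypothetical):

```python
import cv2  # assumes the opencv-python bindings are installed

img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)     # hypothetical input file
hist = cv2.calcHist([img], [0], None, [256], [0, 256])  # 256-bin gray histogram
hist = cv2.normalize(hist, hist).flatten()              # normalized feature vector

# Two images can then be compared by histogram similarity, e.g.:
# cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)
```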
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 143. Face Identification and Recognition

  • GOAL: recognize and identify faces in an image
  • Usage examples:
    • People counting
    • Security applications

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // CREDITS:Thorsten Hermes@SSMT2006 144. Image Concept Detection

  • GOAL: recognize the context/concepts of an image
    • E.g., playground, seaside, road, ...
  • Extraction of low-level features from raw data
    • color histograms, color correlograms, color moments, co-occurrence texture matrices, edge direction histograms, etc.
  • Features can be used to build discrete classifiers, which may associate semantic concepts to images or regions thereof
    • The MediaMill semantic search engine defines 491 semantic concepts
      • http://www.science.uva.nl/research/mediamill/demo
  • Concepts can also be detected from text (e.g., from manual or automatic metadata) using NLP techniques

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 145. Image Object Identification

  • GOAL: identify objects appearing in a picture
    • Basket ball, cars, planes, players, etc.

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 146. Tools for media analysis _1

  • OpenCV
    • http://opencv.willowgarage.com/wiki/
    • Framework for image analysis
  • Octave
    • http://www.gnu.org/software/octave/
    • high-level language, primarily intended for numerical computations; largely compatible with Matlab
  • Marsyas(Music Analysis, Retrieval and Synthesis for Audio Signals)
    • http://marsyas.sness.net/
    • Framework for music analysis and retrieval

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 147. Tools for media analysis _2

  • TINA (TINA Is No Acronym)
    • http://www.tina-vision.net/
    • an open source environment developed to accelerate the process of image analysis research
  • Sphinx
    • http://cmusphinx.sourceforge.net/sphinx4/
    • speech recognition system written entirely in Java
  • WEKA
      • http://www.cs.waikato.ac.nz/ml/weka/
      • A collection of machine learning algorithms for data mining

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla IMPLEMENTATION // 148. Validation 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 149. Disclaimer

  • This section is inspired by the WWW2010 tutorial by Dasdan, Tsioutsiouliklis, Velipasaoglu:
  • Web Search Engine Metrics for Measuring User Satisfaction
  • http://analytics.ncsu.edu/reports/wsmt.pdf

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 150. Measures for IR Systems

  • Measurable properties
    • How fast does it process (index) documents?
      • Number of documents/hour
      • Average document size
    • How fast does it search?
      • Latency as a function of index size
      • Expressiveness of query language
      • Speed on complex queries
  • The key measure: user happiness
    • What is this?
      • Speed of response/size of index are factors
        • But blindingly fast, useless answers won't make a user happy
    • How do we quantify user happiness?

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 151. Measuring User Happiness

  • Who is the user we are trying to make happy?
    • Depends on the setting
      • Web engine: users find what they want and return to the engine
        • Can measure the rate of returning users
      • eCommerce site: users find what they want and make a purchase
        • Is it the end-user, or the eCommerce site, whose happiness we measure?
        • Measure time to purchase, or the fraction of searchers who become buyers?
      • Enterprise (company/govt/academic): care about user productivity
        • How much time do my users save when looking for information?
        • Many other criteria having to do with breadth of access, secure access

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 152. Evaluation measures

  • Relevance
    • Of search results
  • Coverage
    • Presence of content of interest in a catalog
  • Diversity
    • Of the result set
  • Discovery and Latency
    • How many new resources (in the collection) are in the catalog?
    • How long did it take to get the new resources into the catalog?
      • Time to first click
  • Freshness

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 153. Relevance as a measure of user happiness

  • How do you measure relevance?
  • In order to assess the performance of an IR system you need a test collection composed of:
    • A benchmark document collection
    • A benchmark suite of queries
    • A binary assessment of either Relevant or Irrelevant for each query-doc pair (gold standard, or ground truth)
  • The test collection must be of a reasonable size
    • Need to average performance, since results are very variable over different documents and information needs

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 154. Evaluating Relevance

  • Set-based evaluation
  • Rank-based evaluation with explicit judgment
    • Absolute judgment
    • Preference judgment
  • Rank-based evaluation with implicit judgment
    • Direct and indirect evaluation by clicks
  • Model-based evaluation
    • Browsing models
    • User satisfaction

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // NOT COVERED HERE 155. Information Need Translation

  • Relevance is assessed relative to the need, not to the query
  • E.g., Information need:
    • I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
    • Query: wine red white heart attack effective
  • A document is relevant if it addresses the stated information need, not just because it contains all the words in the query

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 156. Set-based evaluation

  • The two most frequent and basic measures of IR effectiveness are precision and recall
    • Precision: fraction of retrieved docs that are relevant
      • P(relevant|retrieved)
        • Provides a measure of the degree of soundness of the system
        • Does not consider the total number of documents
    • Recall: fraction of relevant docs that are retrieved
      • P(retrieved|relevant)
        • Provides a measure of the degree of completeness of the system

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 157. Precision / Recall

  • Can get high recall (but low precision) by retrieving all docs for all queries!
    • Recall is a non-decreasing function of the number of docs retrieved
    • Precision usually decreases (in a good system)
  • Precision can be computed at different levels of recall
    • Perhaps most appropriate for web search: all people want are good matches on the first one or two results pages
  • Precision-oriented users
    • Web surfers
  • Recall-oriented users
    • Professional searchers, paralegals, intelligence analysts

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 158. F-Measure

  • Combined measure that assesses the tradeoff between precision and recall (weighted harmonic mean); see the formula below
    • Values of β > 1 emphasize recall
  • People usually use the balanced F1 measure
    • i.e., with β = 1 (equivalently, α = ½)
  • The harmonic mean is a conservative average
        • [C.J. van Rijsbergen, Information Retrieval]

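In van Rijsbergen's notation, with P precision, R recall, and β² = (1 - α)/α:

```latex
F_\beta = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}
        = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R},
\qquad
F_1 = \frac{2\,P\,R}{P + R} \quad (\beta = 1,\ \alpha = \tfrac{1}{2}).
```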
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 159. Difficulties in using precision/recall

  • Averaging over a large corpus/query set
    • Need human relevance assessments
      • People aren't reliable assessors
    • Assessments have to be binary
      • Nuanced assessments?
    • Heavily skewed by corpus/authorship
      • Results may not translate from one domain to another
  • The relevance of one document is treated as independent of the relevance of other documents
    • This is also an assumption in most retrieval systems

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 160. Ranked Based evaluation

  • In ranked retrieval systems, P and R are values relative to a rank position
  • Evaluation performed by computing precision as a function of recall
  • The function is computed at each rank position at which a relevant document has been retrieved
  • The resulting values are interpolated, yielding a precision/recall plot

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 161. Measures for Ranked Based evaluation

  • Mean average precision (MAP)
    • Measure of quality at all recall levels
  • Precision@K (P@K)
    • Not all queries will have more than K relevant results
    • Even a perfect system may have a score less than 1.0 for some queries
  • R-Precision [Allan 2005]
    • Use a variable result-set cut-off for each query based on the number of its relevant results
  • Mean Reciprocal Rank (MRR) [Voorhees 1999]
    • Reciprocal of the rank of the first relevant result, averaged over a population of queries
  • (MAP and MRR are sketched in code below)

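A sketch of the per-query quantities behind MAP and MRR (illustrative Python; MAP and MRR are then just means over a query population, and the rankings below are made up):

```python
def average_precision(ranking, relevant):
    """Mean of precision@i over the rank positions i that hold a relevant doc."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant result; 0.0 if none is retrieved."""
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0

print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))  # (1 + 2/3) / 2 = 0.833...
print(reciprocal_rank(["d1", "d2", "d3"], {"d2"}))          # 0.5
```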
July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 162. Discounted Cumulative Gain (DCG)

  • [Järvelin and Kekäläinen 2002]
  • Gain adjustable for the importance of different relevance grades for user satisfaction
  • Discounting desirable for web ranking
    • Most users don't browse deep
    • Search engines truncate the list of results returned.
  • DCG yields unbounded scores
    • For each query, divide the DCG by the best attainable DCG for that query
    • Normalized Discounted Cumulative Gain (nDCG)

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION //

  • Example gain grades (a computation sketch follows):
  • Very useful: 3
  • Somehow useful: 1
  • Not useful: 0

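A computation sketch using these grades, following the original log2-discounted formulation of [Järvelin and Kekäläinen 2002] (the three-result ranking is made up):

```python
from math import log2

def dcg(gains):
    # Rank 1 is undiscounted; the gain at rank i >= 2 is divided by log2(i).
    return gains[0] + sum(g / log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))  # best attainable DCG for the query
    return dcg(gains) / ideal if ideal else 0.0

# A "very useful" (3) result buried at rank 3, "somehow useful" (1) at rank 2:
print(round(ndcg([0, 1, 3]), 3))  # 0.723 -- penalized vs. the ideal order [3, 1, 0]
```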
163. Preference Judgment

  • Kendall tau coefficient
    • Based on counts of preferences: tau = (A - D) / (A + D), where A = preferences in agreement and D = preferences in disagreement
    • Range in [-1, 1]
    • Robust for incomplete judgments
  • Binary Preference (bpref)
    • Buckley and Voorhees (2004)
    • Designed for incomplete judgments: bpref = (1/R) * Σ over relevant docs r of (1 - N_r / R), where R = number of relevant results for the query and N_r = number of non-relevant docs (among the first R judged non-relevant) ranked above relevant doc r
    • Generalized to graded judgments
      • De Beer and Moens (2006)

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 164. Presentation Metrics

  • How to present information?
    • Which information?
    • Where should it be displayed?
    • Which presentation elements should be used?
      • Font, colors, design elements, interaction design
    • Generalization
  • How to measure success?
    • User studies
      • On-line, on-home, usability, eye tracking, focus group, surveys
    • Log analysis
    • Editorial
      • Comparative, Perceived vs. actual

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 165. Not all results are likely to be reviewed July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // (Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf) 166. Clicks and views depend on rank July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // [Joachims et al., 2005] 167. Eye Tracking Studies July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 168. Heat Maps

  • Golden Triangle
    • The first result is always considered more trusted and more relevant by default
    • The user spends less time reading the lower part of the page
    • [Marti A. Hearst, Search User Interfaces, Cambridge University Press, 2009]

July 5, 2010 2010 Alessandro Bozzon, Marco Brambilla VALIDATION // 169. Thank you for your attention!

  • Questions?

2010 Alessandro Bozzon, Marco Brambilla. Alessandro Bozzon, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, [email_address], http://home.dei.polimi.it/bozzon. Marco Brambilla, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, [email_address], http://home.dei.polimi.it/mbrambil. http://www.search-computing.org/book July 5, 2010 REFERENCES // 170. References - Books

  • Modern Information Retrieval
    • Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Addison Wesley Longman Publishing Co. Inc., 2010
  • [ManningIR] Introduction to Information Retrieval
    • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008
  • Information Retrieval: Algorithms and Heuristics
    • D.A. Grossman, O. Frieder. Springer, 2004
  • Managing Gigabytes
    • I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999
  • Mining the Web: Analysis of Hypertext and Semi Structured Data
    • S. Chakrabarti. Morgan Kaufmann, 2002
  • Search User Interfaces
    • Marti A. Hearst. Cambridge University Press, 2009
  • Search Computing Challenges and Directions
    • Stefano Ceri, Marco Brambilla (eds.). Springer LNCS, vol. 5950, 2010

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010REFERENCES // 171. References - Tutorial

  • Web Search Engine Metrics: Direct Metrics to Measure User Satisfaction
    • Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu (Yahoo!)
    • WWW 2010
  • Recent Progress on Inferring Web Searcher Intent
    • Eugene Agichtein (Emory University)
    • WWW 2010
  • Applications of Open Search Tools
    • Rosie Jones, Ted Drake (Yahoo!)
    • WWW 2010
  • [BAEZASeco2010] New Frontiers for Search
    • Ricardo Baeza-Yates
    • WWW 2010
  • Web Mining for Search
    • Ricardo Baeza-Yates and Rosie Jones (Yahoo!)
    • SIGIR 2008

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010REFERENCES // 172. References - Papers

  • [Ramakrishnan and Tomkins 2007] Raghu Ramakrishnan, Andrew Tomkins: Toward a PeopleWeb
    • IEEE Computer 40(8): 63-72 (2007)
  • [Broder2002] A. Broder. A taxonomy of web search
    • SIGIR Forum, 36(2):3-10, 2002.
  • [BATES2002] Bates, Marcia J. Toward an integrated model for information seeking and searching
    • In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts, 2002
  • [FU2007] Fu, Wai-Tat; Pirolli, Peter. SNIF-ACT: a cognitive model of user navigation on the world wide web
    • Human-Computer Interaction: 335-412, 2007
  • [Withrow2002] Jason Withrow. Do your links stink?
    • American Society for Information Science Bulletin, June 1, 2002
  • [Pirolli2009] Pirolli, Peter. An elementary social information foraging model
    • Proceedings of the 27th international conference on Human factors in computing systems: 605-614, 2009
  • [D. Rose, 2008]
  • [BATES1989] M.J. Bates. The design of browsing and berrypicking techniques for the online search interface
    • Online Review, 13(5):407-431, 1989.
  • [Teevan et al., CHI 2004] Teevan, J., Alvarado, C., Ackerman, M. and Karger, D. The Perfect Search Engine Is Not Enough: A Study of Orienteering Behavior in Directed Search
    • Proceedings of ACM CHI 2004, pp. 415-422.
  • [MARCHIONINI2006] Marchionini, G. Exploratory search: from finding to understanding
    • Communications of the ACM 49(4): 41-46 (2006)
  • [WHITE2007] White, R. W., and Drucker, S. M. Investigating behavioral variability in web search
    • 16th WWW Conf. (Banff, Canada, 2007)
  • [AULA2008] Aula, A., and Russell, D.M. Complex and Exploratory Web Search
    • ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010REFERENCES // 173. References - Papers

  • [BozzonEtAl2010] Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web
    • WWW 2010, Raleigh, USA
  • [FAGIN1999] R. Fagin. Combining fuzzy information from multiple systems
    • J. Comput. Syst. Sci., 58(1):83-99, 1999.
  • [ILYAS2004] I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization
    • In SIGMOD Conference, pages 203-214, 2004.
  • [MARTINENGHI2010] D. Martinenghi and M. Tagliasacchi: Proximity Rank Join
    • to appear in PVLDB
  • [Carbonell and Goldstein 1998] J. Goldstein and J. Carbonell (1998). Summarization: Using MMR for Diversity-based Reranking
    • SIGIR '98
  • [BozzonEtAl2007] Alessandro Bozzon et al. Role Based Access Control for the interaction with Search Engines
    • International Workshop on Collaborative Open Environments for Project-Centered Learning (COOPER) 2007, Crete, Greece.
  • [BozzonEtAl2009] Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models
    • ICWE 2009, June 24-26, 2009, San Sebastian, Spain
  • [BozzonThesis2009] Alessandro Bozzon. Model-driven development of Search Based Web Applications
    • Ph.D. Thesis, Politecnico di Milano, April 2009.
  • [BragaEtAl2010] D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources
    • http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010REFERENCES // 174. References - Papers

  • [Allan 2005] J. Allan (2005). HARD track overview in TREC 2005: High accuracy retrieval from documents.
  • [Voorhees 1999] E.M. Voorhees (1999). TREC-8 question answering track report
  • [Järvelin and Kekäläinen 2002] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques
    • ACM Trans. IS, 20(4): 422-446, 2002
  • [Buckley and Voorhees (2004)] C. Buckley and E.M. Voorhees. Retrieval evaluation with incomplete information
    • SIGIR '04.
  • [De Beer and Moens (2006)] De Beer, Jan; Moens, Marie-Francine. Rpref: a generalization of Bpref towards graded relevance judgments
    • SIGIR 2006, Seattle, USA, 6-11 August 2006, pages 637-638, ACM

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010REFERENCES // 175. References - Links

  • Search Computing Course Lecture Notes
    • http://www.search-computing.it/course
  • Fabio Aiolli, Università di Padova, http://www.math.unipd.it/~aiolli/corsi/0809/IR/IR.html
  • http://www.ir.disco.unimib.it/

2010 Alessandro Bozzon, Marco Brambilla July 5, 2010REFERENCES //