indexing and searching cross media content in a social network

Indexing and Searching Cross Media Content in a Social Network Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi University of Florence Department of Systems and Informatics Distributed Systems and Internet Technology Laboratory ECLAP Conference, May 7-9, 2012


Page 1: Indexing and Searching Cross Media Content in a Social Network

Indexing and Searching Cross Media Content in a Social Network

Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi

University of Florence, Department of Systems and Informatics

Distributed Systems and Internet Technology Laboratory

ECLAP Conference, May 7-9, 2012

Page 2: Indexing and Searching Cross Media Content in a Social Network

ECLAP Social Network

ECLAP is a Digital Library on Performing Arts connected with Europeana

ECLAP is a Social Network (blogs, forums, comments, tagging, voting, …)

Page 3: Indexing and Searching Cross Media Content in a Social Network

Goals/Requirements

Develop an Indexing/Searching solution for the ECLAP Social Network allowing:
  • Indexing multilingual cross-media content metadata and data (e.g. documents)
  • Indexing portal blogs, forums, events, group pages, comments, etc.
  • Efficient multilingual search (keyword search and advanced search) supporting:
      • misspelled words (e.g. shespeare)
      • partial word search
  • Sorting and filtering search results
  • Re-indexing the whole data without blocking the system
  • Logging and monitoring user activity
  • …

Evaluate the Indexing/Searching service

Page 4: Indexing and Searching Cross Media Content in a Social Network

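ECLAP Data Model

[Diagram: ECLAP data model. Entities named in the diagram: Object, Video, Audio, Document, Image, AVObject, Annotation, Group/Channel, Collection, Playlist, Forum, WebPage, Comment, Content, TaxonomyTerm, Blog. Metadata sets: Performing Arts, Dublin Core, Technical. The associations carry multiplicities (0..n, 1..n, 1..2, 1) whose endpoints are not recoverable from the transcript.]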

Page 5: Indexing and Searching Cross Media Content in a Social Network

Indexing

Indexing & Search system
  • Based on Apache Solr

Multilingual aspects
  • Translate the metadata or translate the query?
  • We use metadata translation

Indexing schema
  • Dublin Core + DCTerms (multi language)
  • Performing Arts
  • Technical (provider, content type, GPS, IPR, duration, quality, …)
  • Groups associations (multi language)
  • Taxonomy associations (multi language)
  • Comments & multi language tags
  • Full text of the textual digital resources

Page 6: Indexing and Searching Cross Media Content in a Social Network

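Indexing

Indexed information per media type (ML = multi language). Columns, in order: DC (ML), Tech, Perf. Arts, Full Text, Taxonomy/Group (ML), Comment/Tags (ML), Votes. Rows with fewer than seven marks had blank cells whose column position is not preserved in the transcript.

Audio/Video/Image: Y Y Y Y Y Y
Document (pdf, doc, …): Y Y Y Y Y Y Y
CrossMedia (html, MPEG21, …): Y Y Y Y Y Y Y
Aggregations (playlist, collection, …): Y Y Y Y Y Y
Info text (blog, web pages, forum, events, …): (Y) Y Y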

Page 7: Indexing and Searching Cross Media Content in a Social Network

Indexing

Multilingual fields
  • title_en, title_it, title_de, title_fr, title_ca, …

Catch-all fields

Field            Component fields                 Boost Weight
text             pdf_*, doc_*, ppt_*, htm_*, …    1.0
body             body_*                           0.5
title            title_*                          3.1
description      description_*                    2.0
contributor      contributor_*                    0.8
subject          subject_*                        1.5
taxonomy         taxonomy_*                       0.8
PerformingArts   PerformingArtsMetadata.#         1.0
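The boost table can be read as a per-field weighting of the catch-all fields. The sketch below is only an illustration: it assumes the weights are passed at query time through the `qf` (query fields) parameter accepted by Solr's DisMax/eDisMax parsers, which the slides do not state. `CATCH_ALL_BOOSTS`, `build_qf` and `localized_field` are hypothetical names.

```python
# Boost weights copied from the catch-all table. Whether ECLAP applies them
# at index time or at query time is not stated in the slides (assumption).
CATCH_ALL_BOOSTS = {
    "text": 1.0,
    "body": 0.5,
    "title": 3.1,
    "description": 2.0,
    "contributor": 0.8,
    "subject": 1.5,
    "taxonomy": 0.8,
    "PerformingArts": 1.0,
}


def build_qf(boosts):
    """Render a 'field^boost' list, the syntax used by Solr's qf parameter."""
    return " ".join(f"{field}^{weight}" for field, weight in boosts.items())


def localized_field(base, lang):
    """Name of a language-specific field, e.g. ('title', 'it') -> 'title_it'."""
    return f"{base}_{lang}"


qf = build_qf(CATCH_ALL_BOOSTS)
print(qf)
print(localized_field("title", "it"))
```

With these weights a keyword found in a title counts roughly six times as much as the same keyword found in a body field (3.1 against 0.5).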

Page 8: Indexing and Searching Cross Media Content in a Social Network

Indexing

Re-indexing
  • In case of a new indexing schema or of index corruption, the search system should not be blocked
  • Re-indexing is done on a separate indexing machine while the production system keeps using the current index
  • During re-indexing, newly uploaded or modified content is marked so that it is re-indexed when the new index is put in production
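The re-indexing procedure can be sketched as a small in-memory model. This is a toy illustration under stated assumptions, not the ECLAP implementation: `SearchService`, its methods and the two "schema" functions are hypothetical, and a dictionary stands in for the Solr index.

```python
class SearchService:
    """Toy model: production keeps serving the current index during a rebuild."""

    def __init__(self, content, index_fn):
        self.content = dict(content)      # raw content, keyed by id
        self.index_fn = index_fn          # current indexing schema
        self.live_index = {k: index_fn(v) for k, v in content.items()}
        self.rebuilding = False
        self.pending = set()              # ids changed while a rebuild runs

    def upsert(self, doc_id, text):
        """Apply an upload/modification to production and mark it if rebuilding."""
        self.content[doc_id] = text
        self.live_index[doc_id] = self.index_fn(text)
        if self.rebuilding:
            self.pending.add(doc_id)

    def start_rebuild(self):
        """Hand a snapshot of the content to the separate indexing machine."""
        self.rebuilding = True
        self.pending.clear()
        return dict(self.content)

    def finish_rebuild(self, new_index, new_index_fn):
        """Re-index the marked items with the new schema, then swap indexes."""
        for doc_id in self.pending:
            new_index[doc_id] = new_index_fn(self.content[doc_id])
        self.live_index = new_index
        self.index_fn = new_index_fn
        self.rebuilding = False
        self.pending.clear()


def old_schema(text):
    return text.lower()


def new_schema(text):
    return text.lower().split()


service = SearchService({"a": "Hamlet Act One"}, old_schema)
snapshot = service.start_rebuild()
rebuilt = {k: new_schema(v) for k, v in snapshot.items()}  # done off-line
service.upsert("b", "Uploaded During Rebuild")             # arrives meanwhile
served_during_rebuild = dict(service.live_index)
service.finish_rebuild(rebuilt, new_schema)
print(service.live_index)
```

Queries issued between `start_rebuild` and `finish_rebuild` are answered from the old index, and the item uploaded in the meantime is not lost when the indexes are swapped.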

Page 9: Indexing and Searching Cross Media Content in a Social Network

Searching

Full text search
  • Uses the catch-all fields to search for keywords in the most important fields in all languages (title, description, text, body, subject, …)

Fuzzy search
  • Allows matching mistyped words

Deep search
  • Allows searching for partial words

Relevance & boosting of terms
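Fuzzy and partial-word matching can be illustrated without a search server. The sketch below assumes an edit-distance threshold of 2 for fuzzy matching and plain substring containment for partial words; how ECLAP configures Solr for these features is not stated in the slides (Lucene's query syntax offers a `~` fuzzy operator and `*` wildcards). All names and the sample terms are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(query, terms, max_edits=2):
    """Indexed terms within max_edits edits of the (possibly mistyped) query."""
    return [term for term in terms if edit_distance(query, term) <= max_edits]


def partial_match(fragment, terms):
    """Indexed terms that contain the fragment anywhere."""
    return [term for term in terms if fragment in term]


TERMS = ["shakespeare", "theatre", "dance", "shakers"]
print(fuzzy_match("shespeare", TERMS))   # the misspelling used in the requirements
print(partial_match("shake", TERMS))
```

The misspelled "shespeare" is two insertions away from "shakespeare", so it is matched with the threshold assumed here.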

Page 10: Indexing and Searching Cross Media Content in a Social Network

Searching Faceted search

Page 11: Indexing and Searching Cross Media Content in a Social Network

Searching Advanced search

Page 12: Indexing and Searching Cross Media Content in a Social Network

Search Facility Assessment

Analysis performed over 3 months
  • 11294 visits (6032 unique visits)
  • 62768 page views (avg 5.76 pages per visit)
  • 7.29 minutes of time spent on the portal
  • 30502 content accesses (view, play and download)

Page 13: Indexing and Searching Cross Media Content in a Social Network

Search Facility Assessment

Users                     # Full Text Query   # Faceted Query   # Last Posted List   # Featured List   # Popular List
simple registered         323                 24                4                    22                17
partners                  1094                21                27                   19                9
anonymous                 2634                147               234                  302               213
Total                     4051                192               265                  343               239
Clicks after query/list   1564                200               318                  2799              231
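The figures in this table can be cross-checked and turned into a clicks-per-request ratio. The sketch below only re-uses the numbers reported on the slide; the ratio itself (clicks divided by total queries or list views) is an illustrative derived metric, not one reported by the authors, and the variable names are hypothetical.

```python
COLUMNS = ["full_text_query", "faceted_query", "last_posted_list",
           "featured_list", "popular_list"]

# Rows as reported on the slide.
BY_USER = {
    "simple registered": [323, 24, 4, 22, 17],
    "partners": [1094, 21, 27, 19, 9],
    "anonymous": [2634, 147, 234, 302, 213],
}
TOTALS = [4051, 192, 265, 343, 239]
CLICKS = [1564, 200, 318, 2799, 231]

# Check that the reported totals are the column sums of the user rows.
column_sums = [sum(column) for column in zip(*BY_USER.values())]

# Derived metric (assumption): clicks per query or list view.
clicks_per_request = {
    name: round(clicks / total, 2)
    for name, clicks, total in zip(COLUMNS, CLICKS, TOTALS)
}

print(column_sums == TOTALS)
for name, ratio in clicks_per_request.items():
    print(f"{name}: {ratio}")
```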

Page 14: Indexing and Searching Cross Media Content in a Social Network

Search Facility Assessment Click order distribution

First page

Page 15: Indexing and Searching Cross Media Content in a Social Network

Conclusions
  • The solution allows indexing multilingual metadata and texts
  • Searching & filtering results
  • The search facility assessment shows that search is a used feature

Page 16: Indexing and Searching Cross Media Content in a Social Network

Context & Assessment

Context
  • Social Network
      • User and content items
  • Content distribution portal
      • Video on demand portal
      • Archive, digital library, Performing Arts
  • http://www.eclap.eu

Assessment
  • User behavior
      • Log user actions on the Web portal
  • User happiness
      • Measure the level of user satisfaction about the exposed services

Page 17: Indexing and Searching Cross Media Content in a Social Network

Logging User Profile

User Profile
  • Registered or anonymous, uid (user id)
  • Timestamp YY-mm-dd hh:mm:ss
  • IP address, Proxy type etc.
  • Platform (OS, Browser)
  • GeoIP data (Country, Region, City)
  • Friends, connections
      • Betweenness, Eccentricity
  • Joined groups
  • User preferred contents

Page 18: Indexing and Searching Cross Media Content in a Social Network

Understanding User behavior

Online survey
  • A simple module, on the right side of the portal
  • Presenting 3-4 questions per topic (depending on the current portal section visited)

Stat Drupal Modules
  • Custom implemented modules
  • Log User Activity
  • Keep track of and depict main figures about portal activity
  • Can be filtered by date, user, type of content, group, type of activity (content enrichment, social promotion, networking etc.)

Google Analytics

Page 19: Indexing and Searching Cross Media Content in a Social Network

Understanding User behavior

Top Metrics
  • Avg # Visits/User
  • Avg # Queries/User
  • Avg # Clicks/User
  • Avg Visit duration
  • Avg Query length
  • Query refinement rate
  • Next Page Click Rate
  • Back Page Click Rate
  • Frequency of searching (once/day, week etc.)
  • Success of searching (assessment...)
  • …
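A few of these metrics can be computed directly from an activity log. The sketch below assumes a hypothetical log of `(user, action, value)` tuples and measures query length in words; the slides do not define the log format or the exact metric definitions, so both are assumptions.

```python
from collections import Counter

# Hypothetical activity log: (user id, action, query text or clicked item).
LOG = [
    ("u1", "query", "shakespeare hamlet"),
    ("u1", "click", "doc-12"),
    ("u2", "query", "dance"),
    ("u2", "query", "modern dance video"),
    ("u2", "click", "doc-7"),
    ("u2", "click", "doc-9"),
]


def top_metrics(log):
    """Average queries and clicks per user, and average query length in words."""
    users = {user for user, _, _ in log}
    counts = Counter(action for _, action, _ in log)
    queries = [value for _, action, value in log if action == "query"]
    return {
        "avg_queries_per_user": counts["query"] / len(users),
        "avg_clicks_per_user": counts["click"] / len(users),
        "avg_query_length_words": sum(len(q.split()) for q in queries) / len(queries),
    }


metrics = top_metrics(LOG)
print(metrics)
```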

Page 20: Indexing and Searching Cross Media Content in a Social Network

Logging User Behavior

Logging user activities on the portal
  • Downloads/Views
  • Queries
  • Anonymous/Registered portal accesses (login/logout)
  • Adding/Updating/Deleting digital contents
  • Menu clicks
  • Content Upload
  • Content Management
  • Social Promotion & Networking

Page 21: Indexing and Searching Cross Media Content in a Social Network

Logging User Behavior

Content Accesses (Download/View)
  • Axmedis Content
      • Pdf, Document, Video, Playlist, Slide, Flash, Image, Excel, Archive, Audio, Tool, Collection
  • Drupal Content
      • Page, Blog, Event, Forum, Group, Comment
  • Distribution of Content Access per
      • Access Type, Portal, Platform, Section, Locale, Country, Region, City, Axoid, Nid, Content Type, Partner, User, Timestamp

Page 22: Indexing and Searching Cross Media Content in a Social Network

Logging User Behavior

Queries (Simple, Faceted, Advanced)
  • Distribution of Queries per
      • User, Content type, Device, IP, User Agent, Query Type, Country, Region, City, Locale, Filter (faceted)
  • Query Cloud
  • Keyword Cloud

IPR Wizard
  • Definition and usage of IPR Models

Metadata Editor
  • Access and usage
  • Add, Edit metadata

Video Annotations
  • Personal content
  • Other users' content

Page 23: Indexing and Searching Cross Media Content in a Social Network

Logging User Behavior

Social Promotion & Networking
  • Analysis of
      • Eccentricity
      • Betweenness
      • Connections
  • Creation, Access of Public/Private Web Pages
  • Activity on Forums, Blogs, Groups or between users
      • New Contents
      • Comments to Objects/Web Pages
      • Invited People
      • Featured Objects
      • Recommendations, suggested content
      • Export/Import of links to/from other SN
      • Private Messages
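Eccentricity, one of the network measures named on this slide, is the largest shortest-path distance from a user to any other reachable user. A minimal sketch using breadth-first search follows; the friendship graph and all names are hypothetical, and the slides do not say how ECLAP computes the measure.

```python
from collections import deque


def eccentricity(graph, start):
    """Largest shortest-path distance (in hops) from start to any reachable user."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())


# Hypothetical undirected friendship graph as adjacency lists.
FRIENDS = {
    "anna": ["bruno"],
    "bruno": ["anna", "carla", "dario"],
    "carla": ["bruno"],
    "dario": ["bruno", "elena"],
    "elena": ["dario"],
}

for user in FRIENDS:
    print(user, eccentricity(FRIENDS, user))
```

A low eccentricity marks a user who is close to everyone else in the network; here "bruno" reaches every other user within two hops.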

Page 24: Indexing and Searching Cross Media Content in a Social Network

Logging User Behavior

Menu Clicks
  • Distribution of clicks per User, IP, Locale, Timestamp etc.
  • LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH, UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS, MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW CONTENT, etc.

Ranking/Voting
  • # of ranked items
  • Distribution per User, IP, Locale, Timestamp etc.

QR Code
  • Access from Mobile Devices

Workflow
  • Distribution of Workflow Type

Content Upload
  • Distribution of uploads per User, Partner, Timestamp

Page 25: Indexing and Searching Cross Media Content in a Social Network

Content Access

Affiliation                        # View/Play   # Download
DSI                                46            0
Not partners/Affiliated            1292          14
Partners/Affiliated (except DSI)   6712          119
Public Users                       21418         947

Affiliation                        # View/Play   # Download
DSI                                3             0
Not partners/Affiliated            100           4
Partners/Affiliated (except DSI)   218           11
Public Users                       2225          869

September 1st – November 30th 2011

Page 26: Indexing and Searching Cross Media Content in a Social Network

Menu Clicks

Menu                        # Clicks
ABOUT->ECLAP DESCRIPTION    671
EVENTS->PAST AND FUTURE     536
SEARCH->GROUPS              524
ABOUT->ECLAP NEWS BLOG      463
CONTENT->LAST POSTED        265
CONTENT->FEATURED           343
HOWTO->UPLOAD AND INGEST    330
SEARCH->ADVANCED SEARCH     314
EVENTS->CALENDAR            298
ABOUT->ECLAP PARTNERS       269
ABOUT->MAIN CONTACT         249
CONTENT->POPULAR            239

September 1st – November 30th 2011

Page 27: Indexing and Searching Cross Media Content in a Social Network

Search

Affiliation                        # Simple Queries   # Faceted Queries
DSI                                13                 0
Not partners/Affiliated            323                24
Partners/Affiliated (except DSI)   1094               21
Public Users                       2634               147

Affiliation                        # Advanced Queries
DSI                                0
Not partners/Affiliated            18
Partners/Affiliated (except DSI)   4

September 1st – November 30th 2011

Page 28: Indexing and Searching Cross Media Content in a Social Network

Drupal Stat Metrics Content Access per nid

September 1st – November 30th 2011

Page 29: Indexing and Searching Cross Media Content in a Social Network

Drupal Stat Metrics Views by Query

September 1st – November 30th 2011

Page 30: Indexing and Searching Cross Media Content in a Social Network

Drupal Stat Metrics Content Access per Platform

September 1st – November 30th 2011

Page 31: Indexing and Searching Cross Media Content in a Social Network

Understanding User behavior Drupal Stats (collapsible menus on the right)

Page 32: Indexing and Searching Cross Media Content in a Social Network

Google Analytics vs Drupal Stats

Google Analytics
  Pros
    • Traffic source data
    • Bounce rate
    • Recency (since when)
    • Loyalty (how often)
    • Session times
  Cons
    • IP approach: each IP is considered a unique visitor
    • Can't deal with specific actions on the portal (e.g. downloads, queries)

Drupal Stats
  Pros
    • Identity approach
    • Actions
        • Download
        • User Access
        • Queries
    • Content type filtering
  Cons
    • Can't deal with traffic source data and bounce rate
    • Session time is only a raw approximation

Page 33: Indexing and Searching Cross Media Content in a Social Network

Sorting Results

Sorting by
  • Upload Time (date the doc was first uploaded)
  • Update Time (date the doc was last updated)
  • Score (doc relevance to the search query)

Combined with faceting and paging
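Sorting combined with paging can be sketched over an in-memory result list. The records, field names and `sort_and_page` helper below are hypothetical; in the portal the sorting is presumably delegated to the search engine rather than done in application code.

```python
from datetime import date

# Hypothetical search results with the three sortable attributes.
RESULTS = [
    {"id": "a", "uploaded": date(2011, 9, 3), "updated": date(2011, 11, 1), "score": 0.4},
    {"id": "b", "uploaded": date(2011, 10, 12), "updated": date(2011, 10, 20), "score": 0.9},
    {"id": "c", "uploaded": date(2011, 9, 20), "updated": date(2011, 11, 15), "score": 0.7},
]


def sort_and_page(results, key, page=1, page_size=2, descending=True):
    """Order results by the chosen attribute and return one page of them."""
    ordered = sorted(results, key=lambda record: record[key], reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]


first_by_score = [r["id"] for r in sort_and_page(RESULTS, "score")]
second_by_score = [r["id"] for r in sort_and_page(RESULTS, "score", page=2)]
first_by_update = [r["id"] for r in sort_and_page(RESULTS, "updated")]
print(first_by_score, second_by_score, first_by_update)
```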

Page 34: Indexing and Searching Cross Media Content in a Social Network

Suggestions

REALTIME: while the user types a query, similar searches are suggested

ecl…
  • eclap
  • eclap-de-2-1-1-user
  • eclap-de-2-2-1-usergroup
  • …
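Type-ahead suggestions of this kind can be sketched as a prefix lookup over a sorted list of known queries. This is an illustrative assumption about the mechanism, not the portal's implementation; `PAST_QUERIES` and `suggest` are hypothetical names, and the sample entries beyond those shown on the slide are invented.

```python
from bisect import bisect_left

# Hypothetical vocabulary of earlier searches, kept sorted for binary search.
PAST_QUERIES = sorted([
    "dance",
    "eclap",
    "eclap-de-2-1-1-user",
    "eclap-de-2-2-1-usergroup",
    "europeana",
])


def suggest(prefix, sorted_queries, limit=5):
    """Return up to `limit` known queries that start with the typed prefix."""
    start = bisect_left(sorted_queries, prefix)
    matches = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(query)
        if len(matches) == limit:
            break
    return matches


print(suggest("ecl", PAST_QUERIES))
```

Because the list is sorted, all entries sharing a prefix are contiguous, so the lookup costs one binary search plus the number of suggestions returned.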

Page 35: Indexing and Searching Cross Media Content in a Social Network

ECLAP Survey

Page 36: Indexing and Searching Cross Media Content in a Social Network

Indexing/Searching Reqs

Enriching search experience
  • Results Sorting
  • Suggestions

Large # of contents (~10^4-10^6)
  • External Indexing Service

Hidden/Private contents management

Monitoring Exceptions
  • Email notifications

Search Engine Friendly (Google, Bing, Yahoo etc.)
  • Content site crawling
  • HTML dumping

Page 37: Indexing and Searching Cross Media Content in a Social Network

External Indexing Service 1/3

Set up an external service to avoid server overloading when building the index
  • Taxonomization
  • Indexing (with exceptions monitoring)
  • Index Synchronization
  • Old Index replacement with new one
  • Index updating
  • Old contents cleaning (optional)

Page 38: Indexing and Searching Cross Media Content in a Social Network

External Indexing Service 2/3

Taxonomization
  • Has a cost
  • Pre-computing
  • Digital content
  • Execution Rule (JS)
  • Indexed with object records

Taxonomy tree
  Performing Arts
    Cinema
      Documentary
      Historical
    Music
      Classical
      Pop

Object taxonomy (terms indexed with the object): Performing Arts, Cinema, Music, Documentary, Classical

Taxonomy          Parent
Performing Arts   -
Cinema            Performing Arts
Music             Performing Arts
Documentary       Cinema
Historical        Cinema
Classical         Music
Pop               Music
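The pre-computation behind taxonomization can be sketched as an expansion of an object's terms with all their ancestors, using the taxonomy/parent table from this slide. The function name and the in-memory representation are hypothetical; on the portal this step is run by a JS execution rule.

```python
# Taxonomy/parent table as shown on the slide (None marks the root).
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def expand_terms(terms, parent):
    """Return the given terms plus every ancestor up to the root."""
    expanded = set()
    for term in terms:
        # Walk up the tree; stop early once an already-seen branch is reached.
        while term is not None and term not in expanded:
            expanded.add(term)
            term = parent[term]
    return expanded


object_terms = expand_terms(["Documentary", "Classical"], PARENT)
print(sorted(object_terms))
```

Storing the expanded set with the object record lets a search for "Cinema" find an object tagged only as "Documentary" without walking the tree at query time, at the cost of computing it beforehand.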

Page 39: Indexing and Searching Cross Media Content in a Social Network

External Indexing Service 3/3

• Indexing with exceptions monitoring
    • Real-time notifying system
    • Event time and type (add, update)
    • Full stacktrace info
    • Customizable recipients
    • Object Indexing Recovery
        • Resource Parse Error
        • Metadata Indexing

• Index synchronization
    • During external indexing, contents may be updated/added/deleted on the original index
    • Need to update these contents on the index (state flag)

Indexed   External Indexed
1         1
0         1
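The exception monitoring listed on this slide (event time and type, full stack trace, customizable recipients) can be sketched as a wrapper around the indexing call. Everything here is a hypothetical illustration: the function names, the report fields and the `send` callback stand in for whatever notification mechanism the service actually uses.

```python
import traceback
from datetime import datetime, timezone


def index_with_monitoring(object_id, event_type, index_fn, recipients, send):
    """Run index_fn(object_id); on failure send a report and return False."""
    try:
        index_fn(object_id)
        return True
    except Exception as error:
        report = {
            "object_id": object_id,
            "event_type": event_type,  # "add" or "update"
            "event_time": datetime.now(timezone.utc).isoformat(),
            "error": repr(error),
            "stacktrace": traceback.format_exc(),
            "recipients": list(recipients),
        }
        send(report)
        return False


def failing_parser(object_id):
    """Stand-in for an indexing step that hits a resource parse error."""
    raise ValueError(f"cannot parse resource of {object_id}")


outbox = []  # collects reports instead of sending e-mails
ok = index_with_monitoring("obj-1", "add", failing_parser,
                           ["admin@example.org"], outbox.append)
print(ok, outbox[0]["event_type"], outbox[0]["error"])
```

Returning `False` instead of re-raising lets the caller continue with the next object, which matches the idea of recovering from a single failed item without stopping the whole indexing run.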

Page 40: Indexing and Searching Cross Media Content in a Social Network

Search Engine Friendly

HTML dump service
  • Java external service
  • Periodically invoked by an AXCP rule
  • Full metadata exporting
  • Thumbnail
  • Resource link
  • Multilanguage
  • Paginated results
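The idea of the dump service, exporting content metadata as paginated static HTML that crawlers can follow, can be sketched as below. The real service is described as a Java external service; this Python version, its record fields and its markup are assumptions made only for illustration.

```python
from html import escape


def dump_pages(records, page_size=2):
    """Render records as a list of HTML pages, page_size records per page."""
    pages = []
    for start in range(0, len(records), page_size):
        items = []
        for record in records[start:start + page_size]:
            items.append(
                '<li><a href="{link}"><img src="{thumb}" alt=""> {title}</a></li>'.format(
                    link=escape(record["link"], quote=True),
                    thumb=escape(record["thumbnail"], quote=True),
                    title=escape(record["title"]),  # metadata is escaped, never trusted
                )
            )
        pages.append("<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>")
    return pages


# Hypothetical records: title metadata, resource link and thumbnail.
RECORDS = [
    {"title": "Hamlet <1948>", "link": "/content/1", "thumbnail": "/thumb/1.png"},
    {"title": "Tosca", "link": "/content/2", "thumbnail": "/thumb/2.png"},
    {"title": "Giselle", "link": "/content/3", "thumbnail": "/thumb/3.png"},
]

pages = dump_pages(RECORDS)
print(len(pages))
print(pages[0])
```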

Page 41: Indexing and Searching Cross Media Content in a Social Network

Conclusions

• Drupal integrated solution for user behavior tracking and analysis
    • Logging
    • Stat Data Graph
    • Online Survey

• External Indexing Service
    • Avoids server overloading
    • HA of query service
    • Error recovering
    • Detailed event notifying system
    • Index Optimization

• Dumping tool for portal contents (SEO)
    • Full metadata HTML exporting
    • Scheduled Service

Page 42: Indexing and Searching Cross Media Content in a Social Network

Future Work

• Keep collecting data
• Deeper Data Analysis
    • User Sessions
    • 1st, 2nd, ..., nth click average user behavior
• Depict a modular view of the system usage
    • Popularity/Usability for each feature & functionality
• Social Network Analysis (SNA)
    • Huge Population
    • User relationships, connections, friendships

Page 44: Indexing and Searching Cross Media Content in a Social Network

Q & A

Page 45: Indexing and Searching Cross Media Content in a Social Network

APPENDIX

Page 46: Indexing and Searching Cross Media Content in a Social Network

Architecture (former)

[Diagram: former architecture. Components named in the diagram: Drupal, Apache HTTP, Searching Module, Indexing Module, Apache Tomcat, Apache Solr, Indexing Service, XML/HTTP, JSP, SolrCell, AXCP Grid Node, Rule Scheduler, SolrJ Client, Index Rebuilder Rule (JS), Indexing Rule (JS).]

Page 47: Indexing and Searching Cross Media Content in a Social Network

Drupal

What is it?
  • Open source content management platform
  • Developed by Dries Buytaert in 2001
  • Written in PHP
  • Users: The Economist, Examiner.com, The White House, data.gov.uk
  • Runs on a Web server (e.g. Apache, IIS) and a database (e.g. MySQL, PostgreSQL)

Page 48: Indexing and Searching Cross Media Content in a Social Network

Apache Lucene

What is it?
  • High-performance, full-featured text search engine library (indexing and searching documents)
  • Developed by Doug Cutting (2000, SourceForge); joined the Apache Software Foundation in 2001
  • Written entirely in Java
  • Users: Wikipedia, Technorati, Nabble, TheServerSide, Akamai, SourceForge

Page 49: Indexing and Searching Cross Media Content in a Social Network

Apache Lucene

Features
  • Ranked searching (best results returned first)
  • Powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  • Fielded searching (e.g., title, author, contents)
  • Date-range searching
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching

Page 50: Indexing and Searching Cross Media Content in a Social Network

Apache Lucene

Features
  • Documents added via IndexWriter
  • Document = a collection of fields
  • No config files, dynamic field typing
  • Flexible text analysis: tokenizers, filters
  • Search for documents via IndexSearcher

Hits = search(Query,Filter,Sort,topN)

Scoring: tf * idf * lengthNorm
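The scoring line can be made concrete with a simplified single-term computation. The sketch below assumes the component definitions commonly associated with Lucene's classic TF-IDF similarity (square-root term frequency, logarithmic inverse document frequency, inverse square-root length normalization); the full Lucene formula has further factors, such as boosts and query normalization, that are left out here, and the function names are hypothetical.

```python
import math


def tf(term_freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(term_freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Length normalization: matches in shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def score(term_freq, num_docs, doc_freq, num_terms):
    """Simplified single-term score: tf * idf * lengthNorm."""
    return tf(term_freq) * idf(num_docs, doc_freq) * length_norm(num_terms)


# A term occurring 4 times in a 16-term field, found in 9 of 100 documents.
example = score(term_freq=4, num_docs=100, doc_freq=9, num_terms=16)
print(round(example, 4))
```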

Page 51: Indexing and Searching Cross Media Content in a Social Network

Apache Solr

What is it?
  • A full text search server based on Lucene (Lucene sub-project)
  • Developed by Yonik Seeley at CNET Networks (2004), donated to the Apache Software Foundation (2006)
  • Written in Java, deployable as a WAR
  • Users: CNET Reviews, CNET Channel, shopper.com, news.com, nines.org, krugle.com, oodle.com, booklooker.de

Page 52: Indexing and Searching Cross Media Content in a Social Network

Apache Solr

Features
  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces (XML, JSON, HTTP)
  • Web Administration Interface
  • Server statistics exposed over JMX for monitoring
  • Scalability, efficient Replication to other Solr Search Servers
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture