The MediaBase
DESCRIPTION
The MediaBase: a webinar for the TELMAP project, December 16, 2010. Ralf Klamma, RWTH Aachen University, Information Systems & Database Technology
TRANSCRIPT
Lehrstuhl Informatik 5 (Informationssysteme)
Prof. Dr. M. Jarke
I5-KL-111010-1
TeLLNet
GALA
The MediaBase
Ralf Klamma
Informatik 5 (DBIS), RWTH Aachen University
Webinar, December 16, 2010
The Overall Approach
What is unique about the MediaBase?
Interdisciplinary multidimensional model of digital networks
– Social network analysis (SNA) defines measures for social relations
– Actor network theory (ANT) connects human and media agents
– The i* framework defines strategic goals and dependencies
– The theory of media transcriptions studies cross-media knowledge
Diagram labels
– Community
– Communities of practice
– Media networks
– Social software: wiki, blog, podcast, IM, chat, email, newsgroup, …
– Network of artifacts: microcontent, blog entry, message, burst, thread, comment, conversation, feedback (rating)
– Network of members
– Members (social network analysis: centrality, efficiency)
– i* dependencies (structural, cross-media)
Modeling Dependencies Using the i* Framework
Eric S. K. Yu, Towards Modeling and Reasoning Support for Early-Phase Requirements Engineering, RE 1997
[Figure: i* dependency model. Labels: Network; Member; Coordinator; Gatekeeper; Hub; Itinerant Broker; URL; Artifact; Coordination; Communication; four isA links. Legend: Agent, Goal, Resource, Task]
What can you do with the MediaBase?
Community interface (Firefox plug-in)
– Adding media for crawling, searching & viewing
– Observing social networks over time
– Retrieving structural patterns of media
– Applying Web 2.0 operations (tagging, etc.) on media
Writing your own crawlers
Applying all kinds of social network measures
– Centrality measures: finding influential & powerful persons
– Network statistics: understanding networks at large
Advanced queries in the RDF store on concepts and relations
– Who is the owner of company x?
– Structured input for conceptual mapping tools
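As an illustration of the centrality measures mentioned above, the following minimal sketch computes normalized degree centrality on a small reply network. It is not the MediaBase implementation; the function name, the member names and the edge list are hypothetical.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return normalized degree centrality for an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    # Divide by the largest possible number of neighbours (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Hypothetical reply network: who replied to whom on a mailing list.
replies = [("ann", "bob"), ("ann", "cid"), ("ann", "dee"), ("bob", "cid")]
centrality = degree_centrality(replies)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

A member connected to every other member gets the value 1; isolated members do not appear in an edge list and would need to be added separately.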
What is the MediaBase?
Collection of social software artifacts
– Mailing lists (>200 k)
– Blogs (>300 k)
– Websites
– Newsletters
– Wikipedias
– RSS feeds
– Forums
– …
The MediaBase
• IBM DB2 data store
• 24/7 Perl crawlers for media artifacts
• Community-oriented Commander interface
• Social network analysis & visualization tools
• PALADIN: a pattern language for automatic behavior detection
• Automatic extraction of concepts and relations in RDF
The Data Model
MediaBase Model
A MediaBase is a six-tuple graph $M = (A, R, \mu, \nu, \eta, L)$ with
$R \subseteq A \times A$, $\mu : A \to L$, $\nu : R \to L$, $\eta : R \to \{0, 1\}$
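The six-tuple can be sketched as a small data structure. This is only an illustration under assumptions: the slide does not spell out the meaning of the symbols, so A is read here as the set of actors, R as relations between actors, L as a set of labels, and η as a 0/1 flag whose semantics are left open. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MediaBaseGraph:
    """M = (A, R, mu, nu, eta, L) with R a subset of A x A."""
    actors: set = field(default_factory=set)     # A
    relations: set = field(default_factory=set)  # R, pairs of actors
    labels: set = field(default_factory=set)     # L (assumed: label set)
    mu: dict = field(default_factory=dict)       # mu: A -> L
    nu: dict = field(default_factory=dict)       # nu: R -> L
    eta: dict = field(default_factory=dict)      # eta: R -> {0, 1}

    def add_actor(self, actor, label):
        self.actors.add(actor)
        self.labels.add(label)
        self.mu[actor] = label

    def add_relation(self, source, target, label, flag):
        if source not in self.actors or target not in self.actors:
            raise ValueError("R must be a subset of A x A")
        if flag not in (0, 1):
            raise ValueError("eta maps relations to 0 or 1")
        self.relations.add((source, target))
        self.labels.add(label)
        self.nu[(source, target)] = label
        self.eta[(source, target)] = flag

    def is_well_formed(self):
        """Check the domain and range conditions of the definition."""
        return (all(s in self.actors and t in self.actors
                    for s, t in self.relations)
                and set(self.mu) == self.actors
                and set(self.nu) == self.relations
                and set(self.eta) == self.relations
                and set(self.mu.values()) | set(self.nu.values()) <= self.labels
                and set(self.eta.values()) <= {0, 1})

# Hypothetical example: an agent creates an artifact.
m = MediaBaseGraph()
m.add_actor("alice", "Agent")
m.add_actor("post-1", "Artifact")
m.add_relation("alice", "post-1", "creates", 1)
print(m.is_well_formed())
```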
Simplified Meta Model
[Figure: simplified meta model. Entities: Actor; Agent; Community; Process; Medium; Artifact; Attribute. Relations: has; stores; creates; is affected by; belongs to; represents; consumes; performs; ranks; isA. Process types: Browse; Transcribe; Localize; Address; …]
Latour: On Recalling ANT, 1999
Actors in the Mediabase
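$A \subseteq \{\text{Medium}, \text{Artifact}, \text{Process}, \text{Agent}, \text{Network}\}$
$\text{Medium} \subseteq \{\text{Mailing lists}, \text{Newsletter}, \text{Newsgroup}, \text{Feed}, \text{Web-site}, \text{Blog}, \text{Podcast}, \text{Chat room}, \text{Wiki}, \text{Forum}, \text{Social bookmarking site}, \text{Folksonomy}\}$
$\text{Artifact} \subseteq \{\text{Message}, \text{E-mail}, \text{Index}, \text{Comment}, \text{RSS Entry}, \text{Transaction}, \text{Host}, \text{Feedback}, \text{Conversation}, \text{Burst}, \text{Blog entry}, \text{Thread}, \text{Executions}, \text{Tag}, \text{Trackback}, \text{Review}, \text{URL}, \text{Rating}, \text{Multimedia}, \text{Ranking}, \text{Reference}\}$
$\text{Process} \subseteq \{\text{Acquisition}, \text{Search}, \text{Monitoring}, \text{Retrieval}, \text{Transcription}, \text{Addressing}\}$
$\text{Agent} \subseteq \{\text{Administrator}, \text{Member}, \text{Lurker}, \text{Reviewer}, \text{Dead}, \text{Answering person}, \text{Questioner}, \text{Troll}, \text{Spammer}, \text{Conversationalist}, \text{Expert}\}$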
Medium – Artifact Compatibility
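Artifact     | Email | Mailing List | Blog | Transaction-based Website | Wiki | Chat Room | URL | Forum
Message      |   +   |      +       |  -   |             -             |  -   |     -     |  -  |   +
Thread       |   -   |      +       |  -   |             -             |  +   |     -     |  -  |   +
Burst        |   +   |      +       |  +   |             +             |  +   |     -     |  -  |   +
Conversation |   -   |      -       |  -   |             -             |  -   |     +     |  -  |   +
Blog Entry   |   -   |      -       |  +   |             -             |  -   |     -     |  -  |   -
Comment      |   -   |      -       |  +   |             +             |  +   |     -     |  -  |   +
Web Page     |   -   |      -       |  -   |             -             |  +   |     -     |  +  |   -
Transaction  |   -   |      -       |  -   |             +             |  -   |     -     |  -  |   -
Feedback     |   -   |      -       |  -   |             +             |  -   |     -     |  -  |   +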
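The compatibility matrix can be encoded as a lookup, for example so that a crawler can check which artifact types to expect in a given medium. The sketch below copies the plus and minus marks from the slide; the variable and function names are hypothetical and not part of the MediaBase code.

```python
MEDIA = ["Email", "Mailing List", "Blog", "Transaction-based Website",
         "Wiki", "Chat Room", "URL", "Forum"]

# One string of marks per artifact, in the column order of MEDIA.
ROWS = {
    "Message":      "++-----+",
    "Thread":       "-+--+--+",
    "Burst":        "+++++--+",
    "Conversation": "-----+-+",
    "Blog Entry":   "--+-----",
    "Comment":      "--+++--+",
    "Web Page":     "----+-+-",
    "Transaction":  "---+----",
    "Feedback":     "---+---+",
}

# Set of all (artifact, medium) pairs marked with "+".
COMPATIBLE = {(artifact, medium)
              for artifact, marks in ROWS.items()
              for medium, mark in zip(MEDIA, marks) if mark == "+"}

def is_compatible(artifact, medium):
    """True if the artifact type can occur in the medium."""
    return (artifact, medium) in COMPATIBLE

def media_for(artifact):
    """List the media an artifact type can occur in, in column order."""
    return [m for m in MEDIA if is_compatible(artifact, m)]

print(media_for("Thread"))
```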
The Crawlers
Crawling Technologies
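Mix of dumps (Wikis) and special purpose crawlers:
$W = \text{Media} \cup \text{Artifact}$
$MW = \text{Mailing list} \cup \text{Message} \cup \text{Thread} \cup \text{Index}$
$BW = \text{Blog} \cup \text{Blogroll} \cup \text{Blogentry} \cup \text{Index} \cup \text{Comment}$
$I = \text{Media} \cup \text{Artifact} \cup \text{Process} \cup \text{Agent}$
$G = \text{Media} \cup \text{Artifact} \cup \text{Process} \cup \text{Agent} \cup \text{Network}$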
Crawler Overview
Website Crawler
Feed Crawler
Mailinglist Crawler
News Crawler
Podcast Crawler
The MediaBase Commander
MediaBase Web 2.0 Commander
– Personalization (users annotate resources with tags and have their own page)
– Community awareness (resources and annotations of others are open)
– User-friendly interface (Firefox plug-in, easy insertion of resources and tags, tracking of recent changes)
Application Programmer Interfaces
Under development
– GraphService: visualization and PALADIN
  http://dbis.rwth-aachen.de/~atlas/module_build/JavaDoc//atlas_las_services_graph-service/HEAD/javadoc/index.html
– TargETLy Service: RDF data generator
  http://dbis.rwth-aachen.de/~atlas/module_build/JavaDoc/atlas_theses_da_krenge_TargETLy2/HEAD/javadoc/index.html
GraphService
– AbstractDigitalNetwork: representation of the MetaModel
– Classes for networks (blogs, mailing lists, etc.)
– Classes for basic SNA
– Classes for pattern analysis
– Classes for GraphLayout
TargETLy Service
– Connection to the RDF store
– OpenCalais service: RDF generator
– Pattern analysis
– Intent analysis
– Collection of predefined RDF queries
  – e.g. companyCompetitor, companyEmployeeNumber
  – e.g. patentFiling, patentIssuance
  – e.g. personEmailAddress, creditRating
PALADIN – Pattern Analysis
PALADIN: Disturbances in Cross-media Social Networks
What is a disturbance?
– Sensing an incompatibility between espoused theories and theories-in-use
Disturbances are starting points of learning processes
– Disturbances disturb, prevent … but they also create reflection
Disturbances are hard to detect or to forecast
Pattern Language for PALADIN: Example Troll
Troll Pattern: This pattern tries to discover the cases when a troll exists in a digital social network. A troll in the network is considered a disturbance.
Disturbance:
(EXISTS [medium | medium.affordance = threadArtefact]) &
(EXISTS [troll |
  (EXISTS [thread | (thread.author = troll) &
    (COUNT [message | (message.author = troll) & (message.posted = thread)]) > minPosts]) &
  (~EXISTS [thread1, message1 | (thread1.author != troll) &
    (message1.author = troll) & (message1.posted = thread1)])])
Forces: medium; troll; network; member; thread; message; url
Force Relations: neighbour(troll, member); own thread(troll, thread)
Solution: No attention must be paid to the discussions started by the troll.
Rationale: The troll needs attention to continue its activities. If no attention is paid, he/she will stop participating in the discussions.
Pattern Relations: Associated with the Spammer pattern.
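The disturbance expression of the Troll pattern can be read as a plain predicate over threads and messages. The following sketch is one possible reading in Python, not the PALADIN evaluator: it assumes the medium affords threads, that every message belongs to a known thread, and it uses hypothetical names (`Message`, `find_trolls`, the sample data).

```python
from collections import Counter, namedtuple

Message = namedtuple("Message", "author thread")

def find_trolls(thread_authors, messages, min_posts):
    """Return authors matching the Troll disturbance.

    thread_authors maps a thread id to the author who started it.
    """
    # COUNT [message | message.author = a & message.posted = thread]
    counts = Counter((m.author, m.thread) for m in messages)
    trolls = set()
    for author in set(thread_authors.values()):
        # EXISTS a thread started by the author with more than min_posts
        # messages written by that same author.
        starts_busy_thread = any(
            owner == author and counts[(author, thread)] > min_posts
            for thread, owner in thread_authors.items())
        # ~EXISTS a message by the author in a thread started by someone else.
        posts_elsewhere = any(
            m.author == author and thread_authors[m.thread] != author
            for m in messages)
        if starts_busy_thread and not posts_elsewhere:
            trolls.add(author)
    return trolls

# Hypothetical data: tom floods his own thread and ignores all others.
threads = {"t1": "tom", "t2": "ann"}
posts = ([Message("tom", "t1")] * 4
         + [Message("ann", "t1"), Message("ann", "t2"), Message("bob", "t2")])
print(find_trolls(threads, posts, min_posts=3))
```

Raising `min_posts` makes the pattern stricter; with `min_posts=4` the same data yields no troll.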
Pattern Discovery Process
[Figure: pattern discovery process. Labels: Pattern; Pattern Template; Pattern Template Instance; Pattern Instance; Digital Social Network; Description; Disturbance; Disturbance Instances; Variables; Pattern Parameters; Forces; Force Relations; Solution; Rationale; Dependencies; Pattern Relations]
1. Set pattern parameters
2. Instantiate disturbances
3. Evaluate disturbances
4a. Change Pattern Parameters
4b. Apply Pattern Solution
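The numbered steps can be sketched as a simple loop. This is an assumption-laden illustration: the slide does not say when step 4a is chosen over step 4b, so the sketch changes the parameters whenever no disturbance instance is found and stops otherwise. All names (`discover`, `evaluate`, `adjust`, the sample network) are hypothetical.

```python
def discover(network, evaluate, parameters, adjust, max_rounds=5):
    """Run steps 1 to 4a until disturbance instances are found.

    evaluate(network, parameters) instantiates and evaluates the
    disturbance (steps 2 and 3); adjust(parameters) returns changed
    pattern parameters (step 4a).
    """
    for _ in range(max_rounds):
        instances = evaluate(network, parameters)
        if instances:
            # Instances found: the caller can apply the solution (step 4b).
            return parameters, instances
        parameters = adjust(parameters)
    return parameters, []

# Hypothetical pattern: members with more than min_posts postings.
post_counts = {"ann": 5, "bob": 2}

def busy_members(network, params):
    return sorted(m for m, c in network.items() if c > params["min_posts"])

def halve_threshold(params):
    return {"min_posts": params["min_posts"] // 2}

result = discover(post_counts, busy_members, {"min_posts": 10}, halve_threshold)
print(result)
```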
PALADIN Case Study
10 patterns of disturbance over 119 social network instances, 17,359 individuals, 215,345 mails
Pattern | Occurrences | Remarks
Burst | 22 | The pattern finds topics which were very important for a certain period of time. Scalability is necessary.
No Conversationalist | 76 | The existence implies little communication in the network.
No Questioner | 67 | The existence implies that the network is not popular.
No Answering Person | 61 | Occurs in small networks. The effects of the lack of an answering person must be further checked with content analysis.
Troll | 2 | Trolls occur very rarely in cultural communities. True negatives exist.
Spammer | 86 | Spammers can be found often in discussion groups. False positives exist.
Leader | 37 | The pattern occurs in networks centered around a member.
No Leader | 40 | Occurs in big networks where the members are distributed in different clusters.
Structural Hole | 67 | Occurs for members having neighbors with only one contact.
Independent Discussions | 13 | Occurs in large networks where disconnected subnetworks exist. Scalability is necessary.
Visualization & Analysis
Social Network Analysis of Open Source Communities
Eclipse components network based on analysis of source code repository (Software Architecture)
Eclipse components network based on analysis of mailing list communication (Social Structure)
Community Reflection about Development Process
– Social platform: Eclipse forum eclipsezone
– Forum: Eclipse communication framework (ECF)
– Measure: degree centrality
– Statistics: 225 nodes, 283 edges
Conversationalist Pattern
– Social platform: Eclipse mailing list
– Forum: Device debugging developer discussion
Questioner Pattern
– Social platform: Eclipse mailing list
– Forum: Device debugging developer discussion
Identification of End-Users and Developers in OSS Communities
Community Clustering
Textual Analysis of Postings from Community Experts
Postings from experts of one of the identified communities
Computer Science Knowledge Network: The Visualization
Computer Science Knowledge Network: Clustering
Interdisciplinary Venues: Top Betweenness Centrality
High Prestige Series: Top PageRank
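For readers unfamiliar with the measure behind this ranking, the sketch below shows a basic PageRank power iteration on a citation graph. It is a generic textbook variant, not the code used for the analysis; the damping factor, the iteration count and the three-venue graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; links maps each node to the list of nodes it cites."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n
                    for v in nodes}
        for v, targets in links.items():
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical venues: A and B both cite C, C cites nothing.
citations = {"A": ["C"], "B": ["C"], "C": []}
ranks = pagerank(citations)
top_venue = max(ranks, key=ranks.get)
print(top_venue)
```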
Data Sets
DBLP (http://www.informatik.uni-trier.de/~ley/db/)
– 788,259 author names
– 1,226,412 publications
– 3,490 venues (conferences, workshops, journals)
CiteSeerX (http://citeseerx.ist.psu.edu/)
– 7,385,652 publications (including publications in reference lists)
– 22,735,240 citations
– Over 4 million author names
Combination
– Canopy clustering [McCallum 2000]
– Result: 864,097 matched pairs
– On average: venues cite 2306 and are cited 2037 times
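The idea of canopy clustering is to group records with a cheap distance before any expensive matching. The sketch below is a minimal version under assumptions: it uses a token-based Jaccard distance and two hypothetical thresholds, and the venue strings are invented. It does not reproduce the matching pipeline used for DBLP and CiteSeerX.

```python
def jaccard_distance(a, b):
    """Cheap distance between two strings based on shared lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def canopy_clusters(items, loose, tight, distance=jaccard_distance):
    """Group items into possibly overlapping canopies.

    Items within `loose` of a center join its canopy; items within
    `tight` of a center can no longer become centers themselves.
    """
    if not 0 <= tight <= loose:
        raise ValueError("thresholds must satisfy 0 <= tight <= loose")
    remaining = list(items)
    canopies = []
    while remaining:
        center = remaining[0]
        canopies.append([x for x in items if distance(center, x) <= loose])
        # The center has distance 0 to itself, so the list always shrinks.
        remaining = [x for x in remaining if distance(center, x) > tight]
    return canopies

# Hypothetical venue names from two sources.
venues = ["intl conf on data engineering", "conf on data engineering",
          "journal of web semantics", "web semantics journal"]
groups = canopy_clusters(venues, loose=0.5, tight=0.3)
print(groups)
```

Only records that share a canopy would then be compared with a more expensive similarity measure.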
WikiWatcher – System Design
Stage 1: SAX-based parser in Perl
Stage 2: Dynamic analysis and visualization
Processing steps: generating XML dump/export files; parsing wiki data and database transfer; generating networks; measurement; network analysis; visualization
[Figure: system design. Labels: Wiki network data; article pages, URLs, revisions; metadata; authors (Tim, Liz, Joe, 123.45.67.89); RDB. Link examples: [[Article]]; [[Article2]]; [[requested]]; [[never exists]]; [http://…]]
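The link forms shown in the figure, `[[Article]]` for internal links and `[http://…]` for external ones, are what an article network is built from. The sketch below extracts them with regular expressions in Python. This is a stand-in for illustration only: the original stage 1 is a SAX-based Perl parser, and the function names and sample text here are hypothetical.

```python
import re

# [[Target]], [[Target|label]] or [[Target#section]]: capture the target only.
INTERNAL_LINK = re.compile(
    r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# [http://host/path optional label]: capture the URL only.
EXTERNAL_LINK = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)[^\]]*\]")

def extract_links(wikitext):
    """Return (internal link targets, external URLs) found in wikitext."""
    internal = [t.strip() for t in INTERNAL_LINK.findall(wikitext)]
    external = EXTERNAL_LINK.findall(wikitext)
    return internal, external

def article_edges(pages):
    """Directed (source, target) edges for a dict of title -> wikitext."""
    return [(title, target)
            for title, text in pages.items()
            for target in extract_links(text)[0]]

sample = "See [[Article2]] and [[Article|the article]] or [http://example.org docs]."
print(extract_links(sample))
```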
Network Heterogeneity
Author networks
– Author nodes (anonymous/registered users)
– Edges represent collaboration between authors during a period t
Article networks
– Article nodes (incl. wiki namespaces)
– Directed edges (links) between articles
As expected, both kinds of networks stay heterogeneous
Importance of Network Actors
Articles: high betweenness centrality controls the flow of information within a wiki
Betweenness values grow or stay nearly constant during the evolution process
Determines
– Important actors
– Important articles
– Vandalism
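Betweenness centrality counts how many shortest paths between other nodes run through a node. The sketch below uses Brandes' algorithm for unweighted, undirected graphs; it is a generic illustration rather than the WikiWatcher code, and the small star-shaped graph is hypothetical.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph.

    adjacency maps every node to the set of its neighbours (each
    neighbour must also be a key). Scores are not normalized.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from source.
        order = []
        predecessors = {v: [] for v in adjacency}
        paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, starting from the farthest nodes.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical article network: one hub page linked with three others.
graph = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = betweenness_centrality(graph)
print(scores["hub"])
```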
Evolution of Shortest Paths
Densification power law: complex networks may become denser during their growth
In general, this could not be verified for wiki author networks!
The average distances stagnate at nearly 2 for all considered author networks
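The average distance reported here is the mean shortest-path length between connected pairs of nodes. A minimal sketch using breadth-first search is shown below; the function name and the sample network are hypothetical, and pairs without a connecting path are simply ignored, which is an assumption about how the measure was computed.

```python
from collections import deque

def average_distance(adjacency):
    """Mean shortest-path length over all connected ordered pairs."""
    total, pairs = 0, 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0

# Hypothetical author network: one central author and three collaborators.
authors = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(average_distance(authors))
```

Computing this value for successive snapshots of a network shows whether distances shrink, grow or stagnate over time.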
Evolution of Author Networks
Strongly connected components merged by collaboration of two wiki authors
[Figures: author network of German Wikia in July 2007; author network of German Wikia in August 2007]
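How a single collaboration can merge two components is easy to show with a union-find structure. The sketch below is a simplification: it treats collaboration edges as undirected and counts connected components, whereas the slide speaks of strongly connected components. The author names beyond those in the system design figure and both edge lists are hypothetical.

```python
def count_components(nodes, edges):
    """Count connected components of an undirected graph with union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
    return len({find(n) for n in nodes})

wiki_authors = ["tim", "liz", "joe", "kim"]
july = [("tim", "liz"), ("joe", "kim")]
august = july + [("liz", "joe")]  # one new collaboration between two authors
print(count_components(wiki_authors, july), count_components(wiki_authors, august))
```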
Visualization & Analysis
What you cannot do with the MediaBase (at the moment)
Creating a new MediaBase in a new environment
– Maintenance of databases, scripts and interfaces is tedious
– Interfaces are integrated into Zope/Plone
Not all media are equally supported
– Very good support for mailing lists, forums, web sites and blogs
– Less support for wikis, podcasts, social bookmarks
Lacking support for
– Conceptual navigation interface (Conzilla!)
– Discourse management tools
– Weak signal analysis tools
– Topic & sentiment & opinion mining tools
– Automatic generation of recommendations
The Future of the Mediabase: CommunityBase
[Figure: CommunityBase cycle. Labels: Self-modeling; Self-reflection; disturbance (three times, with the marks +/-, - and +); Community experience repository; Activity Theory [Enge87]; Actor Network Theory [Lato05]; Community of Practice [Weng98]; [PeKl08]]
Self-modeling phase contributes to self-reflection phase and vice versa