What Did They Do? Deriving High-Level Edit Histories in Wikis
Peter Kin-Fong Fong and Robert P. Biuk-Aghai
Data Analytics and Collaborative Computing Group, Department of Computer and Information Science, Faculty of Science and Technology, University of Macau

Upload: robert-biuk-aghai

Post on 07-Jul-2015


DESCRIPTION

Conference presentation at WikiSym 2010

TRANSCRIPT

Page 1: What Did They Do? Deriving High-Level Edit Histories in Wikis

What Did They Do? Deriving High-Level Edit Histories in Wikis

Peter Kin-Fong Fong and Robert P. Biuk-Aghai

Data Analytics and Collaborative Computing Group

Department of Computer and Information Science

Faculty of Science and Technology

University of Macau

Page 2: What Did They Do? Deriving High-Level Edit Histories in Wikis

Motivation

1000s of edits & editors a day

[Plot: Wikipedia database edits by language group. Groups and scales: 1: en - de (..450,000); 2: es - it (..70,000); 3: ja - pt (..50,000); 4: vo - nl (..35,000); 5: zh - sv (..22,500); 6: hu - tr (..20,000); 7: ca - no (..15,000). Source: http://stats.wikimedia.org/EN/PlotsPngDatabaseEdits.htm]

But:
  Which edits are significant?
  Who provides significant edits?

Look at wiki article history? Tedious!

Automatically analyze the nature of the edits
Try to answer: What Did They Do?

Page 3: What Did They Do? Deriving High-Level Edit Histories in Wikis

Revision History

Page 4: What Did They Do? Deriving High-Level Edit Histories in Wikis

Looking at Byte Count?

2,070 bytes (new) – 2,055 bytes (old) = 15 bytes changed

But:

56 bytes were cut
71 bytes were added
56 + 71 = 127 bytes changed
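The gap between the two figures can be checked mechanically. The following is a minimal Python sketch, not part of the presentation: it uses the standard library's `difflib.SequenceMatcher` (an assumption; any diff routine would do) to contrast the net size change with the bytes removed and added. The function name `byte_changes` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def byte_changes(old: str, new: str) -> dict:
    """Compare the net size change with the bytes actually removed and added."""
    a, b = old.encode("utf-8"), new.encode("utf-8")
    removed = added = 0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed += i2 - i1  # bytes that disappear from the old version
        if tag in ("insert", "replace"):
            added += j2 - j1    # bytes that appear in the new version
    return {
        "net": len(b) - len(a),      # what a byte-count column would show
        "removed": removed,
        "added": added,
        "gross": removed + added,    # total bytes touched by the edit
    }


# Hypothetical example: one word is swapped for another of the same length.
print(byte_changes("the cat sat", "the dog sat"))
```

In this sketch the net change always equals `added - removed`, so an edit that rewrites a passage at roughly the same length shows a net change near zero even though many bytes were touched.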

Page 5: What Did They Do? Deriving High-Level Edit Histories in Wikis

Looking at Minor vs. Non-Minor Edit?

→ Minor edit flag: may not be available; subjective

53 bytes changed: Minor

1 byte changed: Non-Minor

Page 6: What Did They Do? Deriving High-Level Edit Histories in Wikis

Related Work

Page 7: What Did They Do? Deriving High-Level Edit Histories in Wikis

Related Work: Text Differencing

An O(ND) Difference Algorithm and Its Variations (Myers, 1986)
  Longest common subsequence method
  Does not take movement into account
  Basis of Wikipedia “Diff” function

Example:
  Alpha Bravo Charlie Delta Echo
  Alpha Charlie Delta Golf Bravo
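To illustrate why a longest-common-subsequence diff cannot express movement, here is a small Python sketch. It uses a plain dynamic-programming LCS rather than Myers' O(ND) algorithm (a deliberate simplification; both produce LCS-based edit scripts), and the function name `lcs_diff` is hypothetical.

```python
def lcs_diff(old, new):
    """Return a keep/delete/insert script derived from a longest common subsequence."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    script, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            script.append(("keep", old[i]))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            script.append(("delete", old[i]))
            i += 1
        else:
            script.append(("insert", new[j]))
            j += 1
    # Whatever is left over on either side is pure deletion or insertion.
    script += [("delete", tok) for tok in old[i:]]
    script += [("insert", tok) for tok in new[j:]]
    return script


old = "Alpha Bravo Charlie Delta Echo".split()
new = "Alpha Charlie Delta Golf Bravo".split()
for action, token in lcs_diff(old, new):
    print(action, token)
```

For the two sequences on this slide the script keeps Alpha, Charlie and Delta, and reports Bravo once as deleted and once as inserted, with no indication that it is the same word in a new position.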

Page 8: What Did They Do? Deriving High-Level Edit Histories in Wikis

Related Work: Text Differencing

The String-to-String Correction Problem with Block Moves (Tichy, 1984)
  Diff by matching blocks of text, regardless of position
  Movement can be detected
  Basis of WikiTrust

Example:
  Alpha Bravo Charlie Delta Echo
  Alpha Charlie Delta Golf Bravo

Page 9: What Did They Do? Deriving High-Level Edit Histories in Wikis

Text Differencing Granularity

Flexible Diff-ing in a collaborative writing system (Neuwirth et al. 1992)
  Differencing at word level only is not sufficient
  Hierarchical decomposition: Paragraph, Sentence, Phrase, Word, Character
  Further decompose only when similarity is high

Wikipedia “Diff” function works at two levels
  Paragraph & Word
  Always does a word-level diff when a paragraph is changed
  Often produces hard-to-read difference statements

Page 10: What Did They Do? Deriving High-Level Edit Histories in Wikis

A hard-to-read Wikipedia Diff

Page 11: What Did They Do? Deriving High-Level Edit Histories in Wikis

Related Work: Edit Categorization

X. de Pedro Puente, 2007
  Wiki e-learning environment
  Requires its student editors to categorize their own edits
  Markup improvement, New information, …

Gorgeon and Swanson, 2009
  Studying evolution of a concept
  Classified each of 3,665 edits by eye: “simple but tedious”
  Categories
    Commonly used: Vandalism, Spam
    Self-defined: Challenged, Unchallenged

Page 12: What Did They Do? Deriving High-Level Edit Histories in Wikis

Problems

Produce an intuitive difference statement between two versions of text

Rate the significance of an edit
  A metric that can be compared across different articles

Page 13: What Did They Do? Deriving High-Level Edit Histories in Wikis

Proposed Design

Page 14: What Did They Do? Deriving High-Level Edit Histories in Wikis

Wiki Edit History Analyzer

[Diagram: processing pipeline]

Input from MediaWiki: Revisions of Article
1. Lexical Analyzer → Sequence of Tokens
2. Text Differencing Engine → List of Basic Edit Actions
3. Action Categorizer → High Level Edit Actions
4. History Summarizer → Summary of changes

Page 15: What Did They Do? Deriving High-Level Edit Histories in Wikis

Step 1: Lexical Analyzer

Break the raw text into tokens of words and symbols

Divide the whole article into sentences
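As a rough illustration of this step, the Python sketch below splits wiki text into word, markup and punctuation tokens and then groups the tokens into sentences. The token pattern and the sentence-ending rule are assumptions made for the example; the slides do not say how the prototype's lexical analyzer handles them.

```python
import re

# Assumed token classes: a few wiki markup symbols, words, and single punctuation marks.
TOKEN_RE = re.compile(r"\[\[|\]\]|\{\{|\}\}|'{2,}|={2,}|\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}  # naive assumption about sentence boundaries


def tokenize(text):
    """Break raw wiki text into a flat list of word and symbol tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing text without a final punctuation mark
        sentences.append(current)
    return sentences


tokens = tokenize("A [[wiki]] link. Next one!")
print(tokens)
print(split_sentences(tokens))
```

A real analyzer would need more care with abbreviations, templates and tables; the point here is only that each sentence becomes a token sequence the later steps can compare.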

Page 16: What Did They Do? Deriving High-Level Edit Histories in Wikis

Step 2: Text Differencing Engine

Two-level differencing
  Sentence level
  Token (word & markup) level

Approximate sentence matching
  Matching rate:

  $$m_{i,j} = \frac{2\,cc_{i,j}}{co_i + cn_j}$$

Movement detection
  Target: minimize number of movement actions
  Only consider segments with 4 or more tokens, to avoid false tagging of common words
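A minimal Python sketch of approximate sentence matching follows. It assumes the matching rate is twice the number of tokens two sentences share, divided by their combined token count, and it counts shared tokens as a multiset overlap; both the reading of the formula and the counting method are assumptions, and `matching_rate` is a hypothetical name.

```python
from collections import Counter


def matching_rate(old_tokens, new_tokens):
    """Twice the shared token count over the combined length of both sentences."""
    total = len(old_tokens) + len(new_tokens)
    if total == 0:
        return 1.0  # two empty sentences are treated as identical
    # Multiset intersection: a token counts as shared as often as it occurs in both.
    common = sum((Counter(old_tokens) & Counter(new_tokens)).values())
    return 2 * common / total


old = "electrons quickly go around the nucleus".split()
new = "electrons move around the nucleus very quickly".split()
print(round(matching_rate(old, new), 3))
```

The rate is 1.0 for identical sentences and 0.0 for sentences with no token in common, so a fixed threshold can decide whether two sentences are versions of each other or unrelated.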

Page 17: What Did They Do? Deriving High-Level Edit Histories in Wikis

Step 3: Action Categorizer

Rule-based categorization

Example: Spelling correction
  If the matching rate of two sentences > 80%
  Calculate the character-level edit distance
  If edit distance ≤ 3 ⇒ Spelling correction

Rosaleen's story to the hunstman/wolf: A she-wolf who arrives at a village.

Rosaleen's story to the huntsman/wolf: A she-wolf who arrives at a village.

93% matching rate, edit distance = 2
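The rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the prototype's code: the tokenizer is a simple regular expression, shared tokens are counted as a multiset overlap, and the edit distance is plain Levenshtein distance over the whole sentence. Because the matching rate depends on how the sentence is tokenized, this sketch will not necessarily reproduce the 93% shown on the slide.

```python
import re
from collections import Counter


def tokens(sentence):
    """Assumed tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def matching_rate(a, b):
    """Twice the shared token count over the combined token count."""
    common = sum((Counter(a) & Counter(b)).values())
    return 2 * common / (len(a) + len(b)) if (a or b) else 1.0


def edit_distance(a, b):
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def is_spelling_correction(old, new):
    """Rule from the slide: matching rate > 80% and edit distance <= 3."""
    if old == new:
        return False
    return matching_rate(tokens(old), tokens(new)) > 0.8 and edit_distance(old, new) <= 3


old = "Rosaleen's story to the hunstman/wolf: A she-wolf who arrives at a village."
new = "Rosaleen's story to the huntsman/wolf: A she-wolf who arrives at a village."
print(edit_distance(old, new), is_spelling_correction(old, new))
```

Swapping two adjacent letters costs two operations under Levenshtein distance, which agrees with the edit distance of 2 reported for this example.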

Page 18: What Did They Do? Deriving High-Level Edit Histories in Wikis

Step 3: Action Categorizer

Wikipedia
  Wikify
  Inter-language links
  Spelling correction
  Add / Modify category
  Add references
  Content re-organization
  Content re-writing
  ……

Page 19: What Did They Do? Deriving High-Level Edit Histories in Wikis

Step 4: History Summarizer

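Summarize edits
  Generate edit summary
  Calculate edit significance

Edit significance by weighted sum

$$s = s_{\text{high}} + s_{\text{basic}}$$

$$s_{\text{high}} = \sum_{x=1}^{m} \sum_{i=1}^{n} w_{x,i}\, c_{x,i}$$

$$s_{\text{basic}} = \sum_{i=1}^{n} \left( w_{\text{ins},i}\, c_{\text{ins},i} + w_{\text{del},i}\, c_{\text{del},i} + w_{\text{repl},i}\, c_{\text{repl},i} + w_{\text{mov},i}\, c_{\text{mov},i} \right)$$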
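The weighted sum can be made concrete with a short Python sketch. It simplifies the model by collapsing the index i into one count per action type, which is an assumption, and every weight below is a made-up placeholder: the presentation states that the weights have not yet been decided.

```python
def edit_significance(basic_counts, high_counts, basic_weights, high_weights):
    """Significance as the sum of weighted basic and high-level action counts."""
    s_basic = sum(basic_weights[action] * count
                  for action, count in basic_counts.items())
    s_high = sum(high_weights[action] * count
                 for action, count in high_counts.items())
    return s_high + s_basic


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
BASIC_WEIGHTS = {"ins": 1.0, "del": 0.5, "repl": 0.75, "mov": 0.25}
HIGH_WEIGHTS = {"spelling_correction": 0.5, "add_references": 2.0}

basic = {"ins": 3, "del": 1, "repl": 0, "mov": 2}
high = {"spelling_correction": 2, "add_references": 1}
print(edit_significance(basic, high, BASIC_WEIGHTS, HIGH_WEIGHTS))
```

Keeping the weights in separate tables means they can be replaced once suitable values are determined, without touching the summation.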

Page 20: What Did They Do? Deriving High-Level Edit Histories in Wikis

Prototype

Page 21: What Did They Do? Deriving High-Level Edit Histories in Wikis

Prototype

Currently performs first 3 steps
  History summarizer under development

Implemented in Java, PHP front-end to MediaWiki

Produces categorized edit statements

At early alpha stage
  Source code available at http://sourceforge.net/projects/weha/

Page 22: What Did They Do? Deriving High-Level Edit Histories in Wikis

Prototype Screenshot

Page 23: What Did They Do? Deriving High-Level Edit Histories in Wikis

Prototype Screenshot

Page 24: What Did They Do? Deriving High-Level Edit Histories in Wikis

Compare with Wikipedia diff

Page 25: What Did They Do? Deriving High-Level Edit Histories in Wikis

Preliminary Evaluation: Overview

10 articles
  Length from 2,000 to 41,000 characters
  Consecutive revisions with the most edit actions were chosen (2 to 18 edit actions)

10 student volunteers, about equally distributed in terms of:
  Gender (6 male, 4 female)
  Technical background (4 with, 6 without)
  Education level (5 undergraduates, 5 postgraduates)

Each student evaluated 2 articles

Page 26: What Did They Do? Deriving High-Level Edit Histories in Wikis

Preliminary Evaluation: Process

Printouts of both versions were presented
  End-user view, not source wiki text
  Identical paragraphs removed to reduce evaluators' workload

Evaluators mark and categorize changes manually
Present the edit list generated by our prototype
For each item in the list:
  Ask the evaluator whether they agree or not
  State the reason for disagreement, if any

Page 27: What Did They Do? Deriving High-Level Edit Histories in Wikis

Preliminary Evaluation: Results

Average agreement rate: 84.1%
  11 out of 20 evaluations agreed 100% with our edit list
  Remaining 9 evaluations ranged from 33.3% to 88.9%
  No significant differences between the different evaluator groups
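As a quick consistency check on these figures, and assuming the 84.1% is an unweighted mean over the 20 evaluations (the slides do not say), the average agreement among the nine evaluations below 100% works out to roughly:

```latex
\[
\bar{a}_{\text{rest}} \approx \frac{20 \times 84.1 - 11 \times 100}{9}
  = \frac{1682 - 1100}{9}
  = \frac{582}{9}
  \approx 64.7\%
\]
```

This falls inside the reported range of 33.3% to 88.9%.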

Page 28: What Did They Do? Deriving High-Level Edit Histories in Wikis

Preliminary Evaluation: Results

Low agreement rate sample examined
  Old: electrons quickly go around the nucleus
  New: electrons move around the nucleus very quickly

“quickly” not considered moved because a single word is below the movement recognition threshold (4 tokens)

Possible improvement
  Consider intra-sentence moves regardless of length of token sequence
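The effect of the threshold can be reproduced with a small Python sketch. It is a simplified stand-in for the prototype's movement detection, built on assumptions: common token runs are paired greedily, longest first, and a run counts as moved when it falls outside the heaviest chain of runs that keep their order in both versions. The names `common_blocks` and `moved_blocks` are hypothetical.

```python
def common_blocks(old, new, min_len=4):
    """Greedily pair the longest common token runs of at least min_len tokens.

    Returns (old_start, new_start, length) tuples sorted by old_start.
    """
    used_old, used_new = [False] * len(old), [False] * len(new)
    blocks = []
    while True:
        best = (0, 0, 0)  # (length, old_start, new_start)
        for i in range(len(old)):
            for j in range(len(new)):
                k = 0
                while (i + k < len(old) and j + k < len(new)
                       and not used_old[i + k] and not used_new[j + k]
                       and old[i + k] == new[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < max(min_len, 1):
            return sorted(blocks)
        for k in range(length):
            used_old[i + k] = used_new[j + k] = True
        blocks.append((i, j, length))


def moved_blocks(blocks):
    """Blocks outside the heaviest chain that keeps its order in both versions."""
    best = []  # best[k] = (token weight, chain of block indices ending at block k)
    for k, (_, j, length) in enumerate(blocks):
        weight, chain = length, [k]
        for p in range(k):
            if blocks[p][1] < j and best[p][0] + length > weight:
                weight, chain = best[p][0] + length, best[p][1] + [k]
        best.append((weight, chain))
    keep = set(max(best, key=lambda b: b[0])[1]) if best else set()
    return [b for k, b in enumerate(blocks) if k not in keep]


old = "electrons quickly go around the nucleus".split()
new = "electrons move around the nucleus very quickly".split()
print(common_blocks(old, new, min_len=4))                # threshold of 4 tokens
print(moved_blocks(common_blocks(old, new, min_len=1)))  # threshold lifted
```

With the 4-token threshold no common run qualifies, so nothing is reported as moved. With the threshold lifted for this sentence pair, the single-token run for "quickly" (old position 1, new position 6) is the only block flagged as moved.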

Page 29: What Did They Do? Deriving High-Level Edit Histories in Wikis

Conclusions

Page 30: What Did They Do? Deriving High-Level Edit Histories in Wikis

Conclusions

Our contributions:
  New difference model for wiki content
  Design of edit history analyzer
  Prototype implementation

Page 31: What Did They Do? Deriving High-Level Edit Histories in Wikis

Ongoing Work

Intuitiveness of edit statement
  Found some issues in preliminary evaluation
  More adjustments & experiments needed

Decide weights of the edit significance model
  Any good method to consult the Wikipedia community?
  Design of questionnaire and following experiments

Page 32: What Did They Do? Deriving High-Level Edit Histories in Wikis

Prototype source code available at:
http://sourceforge.net/projects/weha/

Data Analytics and Collaborative Computing Group
http://www.fst.umac.mo/se/dacc/