Copy or Not Dawei (David) Shi


Posted on 23-Feb-2016


Page 1: Copy or Not

Copy or Not
Dawei (David) Shi

Page 2: Copy or Not

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Page 4: Copy or Not

Introduction
A web-based document comparator
Calculates an accurate similarity between 2 documents

Page 6: Copy or Not

Algorithm
Preprocessing
Vector space
Similarity calculation

Page 7: Copy or Not

Preprocessing
Lowercase
Stop words filtering
Stemming
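The three preprocessing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the presentation's own code: `STOP_WORDS` is a tiny hypothetical sample list, and `crude_stem` is a naive suffix stripper standing in for a real Porter stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in"}

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real Porter stemmer."""
    for suffix in ("ness", "ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, stem=crude_stem):
    """Lowercase, drop stop words, then stem each remaining word."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

print(preprocess("The cats are meeting in the garden"))
# ['cat', 'meet', 'garden']
```

The `stem` parameter is there so that a proper Porter implementation can be swapped in without touching the rest of the pipeline.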

Page 8: Copy or Not

Preprocessing
Stemming
› Porter Stemming Algorithm
› E.g. cat – cats, meet – meeting, agree – agreed, correct – correctness

Page 9: Copy or Not

Vector Space
Build dictionary 1
› word -> frequency
Sort the keys of dictionary 1
Build dictionary 2
› key -> (index, count)
Build binary vectors
› index -> occurrence
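The steps above might look roughly like this in Python. The slide does not say whether dictionary 1 covers one document or both; this sketch assumes it covers both, so that the two vectors share the same indexes. The function name `build_vectors` is hypothetical.

```python
def build_vectors(words1, words2):
    """Turn two preprocessed word lists into binary occurrence vectors."""
    # Dictionary 1: word -> frequency (assumed to span both documents).
    freq = {}
    for word in words1 + words2:
        freq[word] = freq.get(word, 0) + 1

    # Dictionary 2: key -> (index, count), with indexes in sorted key order.
    index = {key: (i, freq[key]) for i, key in enumerate(sorted(freq))}

    # Binary vectors: position `index` holds 1 if the word occurs in the document.
    def to_vector(words):
        vector = [0] * len(index)
        for word in words:
            vector[index[word][0]] = 1
        return vector

    return index, to_vector(words1), to_vector(words2)

index, v1, v2 = build_vectors(["cat", "meet", "cat"], ["cat", "agre"])
print(index)   # {'agre': (0, 1), 'cat': (1, 3), 'meet': (2, 1)}
print(v1, v2)  # [0, 1, 1] [1, 1, 0]
```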

Page 10: Copy or Not

Similarity Calculation
Vectors v1 and v2
Similarity = (v1 · v2) / (norm(v1) * norm(v2))
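This formula is the cosine similarity of the two vectors. A plain-Python sketch is shown below; returning 0.0 when either vector is all zeros is an assumed convention, since the slide does not cover that case.

```python
import math

def cosine_similarity(v1, v2):
    """Return v1 . v2 / (norm(v1) * norm(v2)), or 0.0 if a vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # Assumed convention for an empty document.
    return dot / (norm1 * norm2)

# Two binary vectors sharing one of their two words: similarity is about 0.5.
print(cosine_similarity([0, 1, 1], [1, 1, 0]))
```

For binary vectors the result lies between 0 (no shared words) and 1 (identical word sets), up to floating-point rounding.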

Page 11: Copy or Not

Performance
Algorithms coded in Python
› Dynamic typing
› Not good at numerical operations
Solution: numpy

Page 12: Copy or Not

Numpy
A Python extension module
Written mostly in C
Defines numerical array and matrix types and basic operations on them

Page 13: Copy or Not

Numpy vs Python
Python code
› a = range(10000000)
› b = range(10000000)
› c = []
› for i in range(len(a)):
›     c.append(a[i] + b[i])
Takes up to 10 seconds on a several-GHz processor

Page 14: Copy or Not

Numpy vs Python
Numpy code
› import numpy as np
› a = np.arange(10000000)
› b = np.arange(10000000)
› c = a + b
Almost instant
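The comparison can be reproduced with a small timing sketch like the one below. It assumes Python 3 with numpy installed, and uses a smaller array size than the slides so that it finishes quickly; the function names are hypothetical, and the measured times will depend on the machine.

```python
import timeit
import numpy as np

N = 1_000_000  # Smaller than the slides' 10,000,000 so the sketch runs quickly.

def add_with_loop():
    """Element-wise addition with a plain Python loop."""
    a = list(range(N))
    b = list(range(N))
    c = []
    for i in range(len(a)):
        c.append(a[i] + b[i])
    return c

def add_with_numpy():
    """Element-wise addition done inside numpy's compiled code."""
    a = np.arange(N)
    b = np.arange(N)
    return a + b

# Both approaches produce the same values.
assert add_with_loop() == add_with_numpy().tolist()

print("loop :", timeit.timeit(add_with_loop, number=1), "seconds")
print("numpy:", timeit.timeit(add_with_numpy, number=1), "seconds")
```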

Page 15: Copy or Not

Numpy Usage
Vector dot product
Vector normalization
Vector zero filling
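A short sketch of how these three operations might map onto numpy calls. The slide does not name the functions used, so the choice of `np.dot`, `np.linalg.norm`, and `np.zeros` is an assumption, as is reading "normalization" as scaling to unit length.

```python
import numpy as np

v1 = np.array([0.0, 1.0, 1.0])
v2 = np.array([1.0, 1.0, 0.0])

# Vector dot product.
dot = np.dot(v1, v2)

# Vector normalization: scale each vector to unit length.
unit_v1 = v1 / np.linalg.norm(v1)
unit_v2 = v2 / np.linalg.norm(v2)

# The dot product of unit vectors is the similarity from the earlier slide.
similarity = np.dot(unit_v1, unit_v2)

# Zero filling: start from an all-zero vector, then mark the occurring indexes.
v3 = np.zeros(3)
v3[[1, 2]] = 1.0

print(dot, similarity, v3)
```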

Page 17: Copy or Not

Framework
Django
› The web framework for perfectionists with deadlines

Page 18: Copy or Not

Libraries
Python
› Numpy
› Porter Stemming
jQuery

Page 19: Copy or Not

Hosting
Alwaysdata
› Django 1.3
› Python 2.6

Page 21: Copy or Not

Future Work
Support file uploading and comparison
Add HTML5 features

Page 23: Copy or Not

Demo
http://imds.alwaysdata.net

Page 24: Copy or Not

Thank you!