Copy or Not

Post on 23-Feb-2016

Copy or Not
Dawei (David) Shi

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Introduction
A web-based document comparator
Calculate accurate similarity between 2 documents

Algorithm
Preprocessing
Vector space
Similarity calculation

Preprocessing
Lowercase
Stop words filtering
Stemming

Preprocessing
Stemming
› Porter Stemming Algorithm
› E.g. cat – cats, meet – meeting, agree – agreed, correct – correctness
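The three preprocessing steps above can be sketched roughly as follows. This is an illustration only: `STOP_WORDS` is a tiny made-up list, and `toy_stem` is a crude suffix stripper standing in for the Porter Stemming Algorithm, not an implementation of it.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

# Toy suffix rules standing in for the Porter stemmer (NOT the real algorithm).
SUFFIXES = ("ness", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem each remaining word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The cats are meeting"))
```

In a real deployment the toy stemmer would be replaced by an actual Porter stemmer implementation, which handles many more cases than these four suffixes.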

Vector Space
Build dictionary 1
› word -> frequency
Sort the keys of dictionary 1
Build dictionary 2
› key -> (index, count)
Build binary vectors
› index -> occurrence
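One possible reading of the steps above, as a short sketch. The slides do not say whether dictionary 1 is built per document or over both documents; this sketch assumes a single dictionary shared by the two documents, and the function name `build_vectors` is hypothetical.

```python
from collections import Counter


def build_vectors(words1, words2):
    """Return (dictionary 2, binary vector for doc 1, binary vector for doc 2)."""
    # Dictionary 1: word -> frequency, counted over both documents (assumption).
    freq = Counter(words1) + Counter(words2)

    # Dictionary 2: word -> (index, count), with indices in sorted key order.
    indexed = {w: (i, freq[w]) for i, w in enumerate(sorted(freq))}

    def binary_vector(words):
        vec = [0] * len(indexed)
        for w in set(words):
            vec[indexed[w][0]] = 1  # index -> occurrence
        return vec

    return indexed, binary_vector(words1), binary_vector(words2)


indexed, v1, v2 = build_vectors(["cat", "meet"], ["cat", "agree"])
print(indexed, v1, v2)
```

Sorting the keys gives every word a stable position, so both documents map to vectors of the same length that can be compared component by component.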

Similarity Calculation
Vectors v1 and v2
Similarity = (v1 · v2) / (norm(v1) * norm(v2))
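The formula above is the cosine of the angle between the two vectors. A minimal pure-Python sketch follows; returning 0.0 for an all-zero vector is an assumption made here, since the slides do not cover that case.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm1 * norm2)


print(cosine_similarity([0, 1, 1], [1, 1, 0]))
```

For binary vectors the dot product counts the words the two documents share, so the score runs from 0 (no shared words) to 1 (identical word sets).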

Performance
Algorithms coded in Python
› Dynamic typing
› Not good at numerical operations
Solution: numpy

Numpy
A Python extension module
Written mostly in C
Defines numerical array and matrix types and basic operations on them

Numpy vs Python
Python code
› a = range(10000000)
› b = range(10000000)
› c = []
› for i in range(len(a)):
›     c.append(a[i] + b[i])
Takes up to 10 seconds on a multi-GHz processor

Numpy vs Python
Numpy code
› import numpy as np
› a = np.arange(10000000)
› b = np.arange(10000000)
› c = a + b
Almost instant

Numpy Usage
Vector dot product
Vector normalization
Vector zero filling
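A rough sketch of how these three operations might fit together in numpy. The slides do not show the actual code, so the function name `similarity` and the interpretation of "zero filling" as padding the shorter vector with zeros are assumptions.

```python
import numpy as np


def similarity(v1, v2):
    """Cosine similarity using numpy's dot product and norm."""
    a = np.asarray(v1, dtype=float)
    b = np.asarray(v2, dtype=float)

    # Zero filling (assumed meaning): pad both vectors to a common length.
    size = max(a.size, b.size)
    a_padded = np.zeros(size)
    b_padded = np.zeros(size)
    a_padded[: a.size] = a
    b_padded[: b.size] = b

    # Vector norms and dot product.
    denom = np.linalg.norm(a_padded) * np.linalg.norm(b_padded)
    if denom == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return float(np.dot(a_padded, b_padded) / denom)


print(similarity([0, 1, 1], [1, 1]))
```

`np.dot`, `np.linalg.norm` and `np.zeros` all run in compiled code, which avoids the per-element overhead of a Python loop.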

Framework
Django
› The web framework for perfectionists with deadlines

Libraries
Python
› Numpy
› Porter Stemming
jQuery

Hosting
Alwaysdata
› Django 1.3
› Python 2.6

Future Work
Support file uploading and comparison
Add HTML5 features

Demo
http://imds.alwaysdata.net

Thank you!
