
Page 1: Low/High Findability Analysis

Shariq Bashir
Vienna University of Technology
Seminar on 2nd February, 2009

Page 2

Classifying Low/High Findable Documents

• Data used in the experiment:
– USPC Class 422 (chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing)
– USPC Class 423 (chemistry of inorganic compounds)
• Total documents: 54,353
• Queries: 3-term queries (753,682 in total), generated using the Frequent Terms Extraction concept (QG-FT)
• Retrieval system used: TFIDF
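The slides only name the query source (frequent terms, QG-FT) and the query length, so here is a minimal sketch of how such 3-term queries might be generated from one document. The function name and the frequency threshold are my own assumptions, not from the deck.

```python
from itertools import combinations
from collections import Counter

def generate_queries(doc_terms, num_terms=3, min_freq=2):
    # Hypothetical QG-FT sketch: keep terms that occur at least
    # min_freq times in the document, then combine them into
    # fixed-length queries.
    counts = Counter(doc_terms)
    frequent = sorted(t for t, c in counts.items() if c >= min_freq)
    return [" ".join(combo) for combo in combinations(frequent, num_terms)]

# Toy claim text: three frequent terms yield a single 3-term query.
doc = ["reactor", "gas", "sterilize", "reactor", "gas", "sterilize", "valve"]
print(generate_queries(doc))  # → ['gas reactor sterilize']
```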

Page 3

Patents Extracted for Analysis

• Next, I extracted the bottom 173 documents (low findable) and the top 157 (high findable) for analysis.
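A minimal sketch of that extraction step, assuming the findability measurement has already produced a mapping from document id to r(d) score (the function and variable names are mine):

```python
def split_by_findability(scores, n_low=173, n_high=157):
    # Sketch: given {doc_id: r(d) score}, return the n_low least
    # findable and the n_high most findable document ids.
    ranked = sorted(scores, key=scores.get)  # ascending by r(d)
    return ranked[:n_low], ranked[-n_high:]

# Toy example with six documents instead of 54,353.
scores = {"p1": 5, "p2": 800, "p3": 40, "p4": 650, "p5": 12, "p6": 300}
low, high = split_by_findability(scores, n_low=2, n_high=2)
print(low, high)  # → ['p1', 'p5'] ['p4', 'p2']
```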

[Figure: r(d) score (y-axis, 0 to 900) for documents ordered by r(d) (x-axis, 0 to 350)]

Page 4

Feature Extraction

• Next, I try to extract features from these patents, so that we can classify low or high findable documents with a classification model, without running the heavy findability measurement.
• Features I considered useful:
– Patent length (claim section only).
– Number of two-term pairs in the claim section with support greater than 2.
– Two-term pair frequencies in the individual patent.
– Two-term pair frequencies in the whole collection.
– Two-term pair frequencies in its 30 most similar patents.
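The deck does not define "support" precisely. Assuming it means the number of claim sentences in which both terms of a pair co-occur, the second feature could be computed roughly as below; the function name and the sentence-level reading of support are my assumptions.

```python
from itertools import combinations
from collections import Counter

def pair_supports(sentences, min_support=2):
    # Hypothetical reading of "support": the number of sentences of
    # the claim in which both terms of a pair occur together. Keep
    # only pairs with support greater than min_support.
    counts = Counter()
    for sent in sentences:
        for pair in combinations(sorted(set(sent)), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c > min_support}

# Toy claim as tokenized sentences: one pair clears the threshold.
claim = [["gas", "valve"], ["gas", "valve"], ["gas", "valve"], ["gas", "seal"]]
print(pair_supports(claim))  # → {('gas', 'valve'): 3}
```

The second feature (F2) would then simply be the number of pairs returned.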

Page 5

Features Analysis

• Patent length (claim section only). (First feature)
• Clearly, considering patent length alone, we cannot differentiate low and high findable documents.
• Some short patents are highly findable, and many longer patents are low findable.

[Figure: patent length and r(d) (log-scale y-axis, 1 to 10,000) for documents ordered by r(d) (x-axis, 0 to 300)]

Page 6

Features Analysis

– Number of two-term pairs in the claim section with support greater than 2. (Second feature)
• Again, this feature alone clearly cannot differentiate low and high findable documents.
• However, on high findable patents the support is slightly higher.

[Figure: two-term pair frequency and r(d) (log-scale y-axis, 1 to 10,000) for documents ordered by r(d) (x-axis, 0 to 300)]

Page 7

Features Analysis

– Two-term pair frequencies in individual patents, for pairs with support greater than 2 in the claim section. (Third feature)
– The main aim of checking this feature was to analyze whether patent writers try to hide their information from retrieval systems by lowering the frequencies of terms.
– Since there can be many pairs in each patent, I take the average of their support values in the analysis.
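The averaging step is straightforward; a sketch, assuming the pair supports of one patent are already collected in a dict (names are mine):

```python
from statistics import mean

def avg_pair_support(pair_counts):
    # F3 sketch: mean support over a patent's term pairs;
    # 0.0 for a patent with no qualifying pairs.
    return mean(pair_counts.values()) if pair_counts else 0.0

print(avg_pair_support({("gas", "valve"): 3, ("gas", "seal"): 6}))  # → 4.5
```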

Page 8

Features Analysis

– The frequency is slightly higher for high findable documents.
– However, some high findable patents still have low frequencies, and some low findable patents have high frequencies.

[Figure: average two-term pair frequency (y-axis, 1 to 61) for documents ordered by r(d) (x-axis, 0 to 300)]

Page 9

Features Analysis

– Two-term pair frequencies in the whole collection. (Fourth feature)
– The main aim of checking this feature was to analyze the presence of rare term pairs in individual patents.
– Since there can be many pairs in each patent, I take the average of their support values in the analysis.
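For the fourth feature, the same averaging can be done against pair counts taken over the whole collection. This sketch assumes those collection-wide counts are precomputed in a dict; the names are mine.

```python
from statistics import mean

def avg_collection_frequency(patent_pairs, collection_counts):
    # F4 sketch: average, over a patent's term pairs, of how often
    # each pair occurs anywhere in the collection. A low value
    # signals that the patent relies on rare pairs.
    return mean(collection_counts.get(p, 0) for p in patent_pairs)

collection = {("gas", "valve"): 1200, ("zeta", "flux"): 3}
print(avg_collection_frequency([("gas", "valve"), ("zeta", "flux")], collection))  # → 601.5
```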

Page 10

Features Analysis

– The frequency goes up for high findable documents.
– That means low findable patents frequently use rare terms.

[Figure: average two-term pair frequency in the collection (y-axis, 1 to 2,001) for documents ordered by r(d) (x-axis, 0 to 300)]

Page 11

Features Analysis

– Two-term pair frequencies in their 30 most similar patents. (Fifth feature)
– In the last rare-terms analysis, I used the whole collection, treating it as a single cluster.
– For this feature, I create a cluster for every patent using a K-NN approach.
– In K-NN, I consider only the 30 most similar patents.
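The slides do not say which similarity measure the K-NN step uses; assuming cosine similarity over term-frequency vectors, the neighbor selection could look like this (names and the toy data are mine):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def knn(target, corpus, k=30):
    # The k patents most similar to `target`, highest cosine first.
    return sorted(corpus, key=lambda d: cosine(target, corpus[d]), reverse=True)[:k]

# Toy collection of three patents as term-frequency vectors.
docs = {
    "p1": Counter({"gas": 3, "valve": 2}),
    "p2": Counter({"gas": 1, "seal": 4}),
    "p3": Counter({"acid": 5}),
}
target = Counter({"gas": 2, "valve": 1})
print(knn(target, docs, k=2))  # → ['p1', 'p2']
```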

Page 12

Features Analysis

– The frequency goes up for high findable documents.
– That means the term pairs used in low findable patents cannot be found in their most similar patents.

[Figure: average two-term pair frequency in the 30 most similar patents (y-axis, 1 to 19) for documents ordered by r(d) (x-axis, 0 to 300)]

Page 13

Putting It All Together

• Goal: classifying low/high findable documents without running the findability measurement.
• I used all these patent features for training classification models.
• For classifier training, I used the WEKA toolkit.
• As class labels I used L (low findable) and H (high findable).
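WEKA is a Java toolkit, so as a language-neutral illustration here is a deliberately simple stand-in classifier: nearest class centroid over rows shaped like the sample dataset. This is not one of the models actually trained (Multilayer Perceptron, J48, Naïve Bayes); it only shows the feature-vector-plus-label setup.

```python
from statistics import mean

def train_centroids(rows):
    # Minimal stand-in for a learned model (not WEKA): the centroid
    # of each class's feature vectors.
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*vecs)]
            for label, vecs in by_class.items()}

def classify(centroids, features):
    # Assign the class whose centroid is nearest (squared Euclidean).
    def dist(label):
        return sum((f - c) ** 2 for f, c in zip(features, centroids[label]))
    return min(centroids, key=dist)

# Toy rows (F1..F5 features, class), loosely shaped like the sample dataset.
rows = [
    ([434.5, 26, 13.3, 4.5, 215], "H"),
    ([488.3, 6, 13.3, 5.3, 97], "H"),
    ([1047.6, 88, 13.4, 2.6, 285], "L"),
    ([1033.4, 96, 13.4, 4.3, 266], "L"),
]
model = train_centroids(rows)
print(classify(model, [490.0, 20, 13.3, 5.0, 150]))  # → H
```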

Page 14

Sample Dataset

#  r(d)  F1        F2  F3        F4        F5   Class
1  64    434.4615  26  13.30769  4.538462  215  H
2  238   488.3333   6  13.33333  5.333333   97  H
3  17    1047.614  88  13.36364  2.613636  285  L
4  101   471.125   16  13.375    4         187  H
5  176   496.625   64  13.375    3.5       153  H
6  34    1033.396  96  13.4375   4.333333  266  L
7  19    405.625   16  13.5      6.625      72  L

F1: Patent length (claim section only).
F2: Number of two-term pairs in the claim section with support greater than 2.
F3: Two-term pair frequencies in the individual patent.
F4: Two-term pair frequencies in the whole collection.
F5: Two-term pair frequencies in its 30 most similar patents.
Class: L (low findable), H (high findable).

Page 15

Multilayer Perceptron (with Cross-Validation 100)

Correctly Classified Instances      245    74.2424 %
Incorrectly Classified Instances     85    25.7576 %
Kappa statistic                    0.4848
Mean absolute error                0.3238
Root mean squared error            0.4309
Relative absolute error           64.918 %
Root relative squared error       86.2466 %
Total Number of Instances           330

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
              0.756    0.27     0.715      0.756   0.735      0.794     L
              0.73     0.244    0.77       0.73    0.749      0.794     H
Weighted Avg. 0.742    0.256    0.744      0.742   0.743      0.794

Page 16

Accuracy with J48

Correctly Classified Instances      237    71.8182 %
Incorrectly Classified Instances     93    28.1818 %
Kappa statistic                    0.4364
Mean absolute error                0.3592
Root mean squared error            0.4722
Relative absolute error           72.0151 %
Root relative squared error       94.5234 %
Total Number of Instances           330

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
              0.731    0.293    0.691      0.731   0.71       0.663     L
              0.707    0.269    0.745      0.707   0.726      0.663     H
Weighted Avg. 0.718    0.281    0.72       0.718   0.718      0.663

Page 17

Naïve Bayes

Correctly Classified Instances      220    66.6667 %
Incorrectly Classified Instances    110    33.3333 %
Kappa statistic                    0.3251
Mean absolute error                0.4227
Root mean squared error            0.4841
Relative absolute error           84.7803 %
Root relative squared error       96.9639 %
Total Number of Instances           330

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
              0.558    0.236    0.68       0.558   0.613      0.701     L
              0.764    0.442    0.658      0.764   0.707      0.701     H
Weighted Avg. 0.667    0.345    0.668      0.667   0.663      0.701

Page 18

Some Other Possible Features

• Frequency of term pairs in referenced or cited patents.
• Frequency of term pairs in similar USPC classes.