Post on 21-Dec-2015
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Patent Search Query Log Analysis
Shariq Bashir
Department of Software Technology and Interactive Systems
Vienna University of Technology
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
General Theme
In automatic evaluation of IR systems, query generation is of great importance.
The space of possible queries is generally very large.
We need to understand how to generate reasonable queries.
In this work, we study this issue with the help of a patent search query log.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Automatic Query Generation for Analysis
Motivation/Problem
– Patents contain a large number of terms.
– Analyzing IR systems using all combinations of terms is a difficult task.
• It demands a large amount of processing time.
• It can give a wrong picture: a large share of term combinations are never used by users.
– Question: how can we generate reasonable queries?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Query Log of Patents Search
The patent search query log can help in generating queries for analysis.
Patent search users are experienced; we can use their experience to generate effective queries.
In query log analysis, we have the query patents on one side and their query logs on the other. This helps us to:
– Understand the types of terms that are mostly used for searching patents.
– Prune irrelevant terms.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Applications of Query Log Analysis
– Analyzing Bias of Retrieval Systems (Findability of Documents).
– Selecting Terms for Query Expansion.
– Learn to Rank for Prior-Art Search.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Experiments (Query Log Dataset)
The patent search query log can be downloaded from the USPTO portal (http://portal.uspto.gov/external/portal/pair).
It cannot be downloaded as a whole; logs must be downloaded manually, one patent at a time.
Logs are available as scanned images and need OCR to convert them to digital text.
Further cleansing is needed to remove noise from the queries.
– Some queries contain reference numbers of past queries.
– Many queries contain numbers.
• Patent application numbers
• IPC classes
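The cleansing step could be sketched roughly as follows. The slides do not show the exact formats, so all three patterns are assumptions: query references are taken to look like "L12" or "S3", application numbers like "10/123,456", and IPC classes like "G06F 17/30". The function name `clean_query` is hypothetical.

```python
import re

# All patterns below are illustrative assumptions, not taken from the slides.
IPC_CLASS = re.compile(r"\b[A-H]\d{2}[A-Z]\s*\d{1,4}/\d{2,6}\b")  # e.g. "G06F 17/30"
APPLICATION_NO = re.compile(r"\b\d{2}/\d{3},?\d{3}\b")            # e.g. "10/123,456"
QUERY_REF = re.compile(r"\b[LS]\d+\b")                            # e.g. "L12", "S3"
NUMBER = re.compile(r"\b\d+\b")                                   # any remaining bare number


def clean_query(raw: str) -> str:
    """Strip assumed noise patterns from one OCR'd query and normalize spaces."""
    text = raw
    # Remove the most specific patterns first so the generic number pattern
    # does not break them into fragments.
    for pattern in (IPC_CLASS, APPLICATION_NO, QUERY_REF, NUMBER):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

Queries that come back empty would then be treated as non-text queries and dropped.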
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Queries contain query references
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Queries contain patent application numbers
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Queries contain IPC classes
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Experiments (Query Log Dataset)
Query logs of 242 patents are used for analysis.
They contain 15013 queries.
We considered only the text queries for analysis.
[Figure: Queries Frequency. x-axis: Patents ordered by Queries Frequency (1 to 241); y-axis: Queries Frequency (0 to 250).]
[Figure: Percentage in Total Terms. Contain Text Terms: 0.74; Contain Numbers: 0.26.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Query Log Analysis
Given the query log, we analyze it on the basis of the following factors.
1. Term frequencies of query terms: does the frequency of a term in the patent matter for query formulation?
2. Proximity/closeness of query terms in the patent text.
3. Confidence of query terms in similar IPC classes.
4. Number of retrieved documents.
[Diagram: for a Query Patent (Y) and the Query Log of (Y), compare All Terms of Query Patent with All Terms of Query Log to understand the difference between them, as input to Automatic Queries Generation.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terms Frequencies in Patents (1)
[Figure: Percentage in All Terms, by Term Frequency]
Term Frequency     Percentage in All Terms
>0 & <=5           0.751587
>5 & <=10          0.134807
>10 & <=50         0.067483
>50 & <=100        0.021849
>100 & <=150       0.012805
>150 & <=200       0.00808
>200               0.00323
All Terms of Query Patents:
1. A large percentage of terms in patents (about 75%) have low frequency (<= 5).
2. Only a small percentage of terms (about 11%) have frequency > 10.
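A distribution of this kind could be computed along these lines. This is only a sketch: the bin edges follow the chart, while the input (an already tokenized patent text) and the function name `term_frequency_distribution` are assumptions, since the slides do not describe tokenization.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the closed bins from the chart; the last bin (>200) is open-ended.
BIN_EDGES = [5, 10, 50, 100, 150, 200]
BIN_LABELS = [">0 & <=5", ">5 & <=10", ">10 & <=50", ">50 & <=100",
              ">100 & <=150", ">150 & <=200", ">200"]


def term_frequency_distribution(tokens):
    """Return the share of distinct terms that falls into each frequency bin."""
    term_freq = Counter(tokens)
    if not term_freq:
        return {label: 0.0 for label in BIN_LABELS}
    # bisect_left maps a frequency f to the first bin whose upper bound is >= f.
    bin_counts = Counter(bisect_left(BIN_EDGES, f) for f in term_freq.values())
    return {label: bin_counts[i] / len(term_freq)
            for i, label in enumerate(BIN_LABELS)}
```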
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terms Frequencies in Patents (1)
[Figure: Percentage Selected in Queries, by Term Frequency]
Term Frequency     Percentage Selected in Queries
>0 & <=5           0.00388
>5 & <=10          0.16457
>10 & <=50         0.24329
>50 & <=100        0.43883
>100 & <=150       0.57207
>150 & <=200       0.84107
>200               0.81414
[Percentage out of Total Terms] Selected in Queries:
1. Higher-frequency terms have a high percentage of selection in queries.
2. Lower-frequency terms (<= 5) have a very low selection percentage. Note from the previous slide that almost 75% of terms in patents have frequency <= 5.
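The selection percentage per bin can be read as: of all distinct patent terms in a frequency bin, what fraction also appears in the query log. A minimal sketch under that reading follows; the function name `selection_rate_by_bin` and the tokenized inputs are assumptions.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the closed bins from the chart; the last bin (>200) is open-ended.
BIN_EDGES = [5, 10, 50, 100, 150, 200]


def selection_rate_by_bin(patent_tokens, query_log_terms):
    """For each frequency bin, return the fraction of patent terms used in queries.

    Bins with no patent terms yield None rather than a misleading 0.
    """
    term_freq = Counter(patent_tokens)
    query_terms = set(query_log_terms)
    total = [0] * (len(BIN_EDGES) + 1)
    selected = [0] * (len(BIN_EDGES) + 1)
    for term, freq in term_freq.items():
        b = bisect_left(BIN_EDGES, freq)
        total[b] += 1
        if term in query_terms:
            selected[b] += 1
    return [s / t if t else None for s, t in zip(selected, total)]
```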
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terms Frequencies in Patents (1)
[Percentage out of Query Terms] Appeared in Query Log:
1. Terms with higher frequency (> 5) account for most query log terms; low-frequency terms (<= 5) account for only about 4%.
[Figure: Percentage in Terms of Queries, by Term Frequency]
Term Frequency     Percentage in Terms of Queries
>0 & <=5           0.042932
>5 & <=10          0.32695
>10 & <=50         0.241959
>50 & <=100        0.141299
>100 & <=150       0.107956
>150 & <=200       0.100151
>200               0.038754
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terms Proximity/Closeness in Query Log (2)
Proximity refers to the closeness of two terms in the patent text. It helps in understanding whether term proximity matters in query formulation.
Proximity of terms is calculated with two approaches:
– Minimum distance between two terms.
– Co-occurrence frequency using a window size.
Term pairs are selected in two ways:
– All term pairs of the query patent.
– Only term pairs that appeared in the query log.
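The two proximity measures could be computed as sketched here. The slides do not define them precisely, so this assumes distance is measured in token positions and that co-occurrence frequency counts the position pairs of the two terms that lie within the window. All function names are hypothetical.

```python
from collections import defaultdict


def term_positions(tokens):
    """Map each term to the sorted list of token positions where it occurs."""
    index = defaultdict(list)
    for i, tok in enumerate(tokens):
        index[tok].append(i)
    return index


def min_distance(pos_a, pos_b):
    """Smallest |i - j| between two sorted position lists (two-pointer scan)."""
    i = j = 0
    best = float("inf")
    while i < len(pos_a) and j < len(pos_b):
        best = min(best, abs(pos_a[i] - pos_b[j]))
        # Advance the pointer that sits at the smaller position.
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return best


def cooccurrence_frequency(pos_a, pos_b, window=14):
    """Count position pairs (i, j) with |i - j| <= window (assumed definition)."""
    return sum(1 for i in pos_a for j in pos_b if abs(i - j) <= window)
```

If either term is absent from the text, `min_distance` returns infinity and `cooccurrence_frequency` returns 0.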
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terms Proximity/Closeness in Query Log
[Figure: Percentage in All Pairs, by Terms Pairs Proximity (minimum distance)]
Terms Pairs Proximity   Percentage in All Terms Pairs   Percentage in Queries Terms Pairs
<=7                     0.286545                        0.881944
>7 & <=15               0.12139                         0.043403
>15 & <=30              0.267703                        0.039931
>30 & <=50              0.124298                        0.020833
>50 & <=100             0.115226                        0.005208
>100 & <=150            0.05207                         0.006944
>150                    0.031945                        0.001736
With Minimum Distance:
– Pairs with a small minimum distance appear in a much larger percentage in the query log than pairs with a large minimum distance (88% of query log pairs have minimum distance <= 7).
– This indicates that users focus on terms that are close together in the text.
– Among all term pairs of patents, 71% of pairs have minimum distance > 7.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terms Proximity/Closeness in Query Log
With Co-Occurrence Frequency, Window Size = 14:
– Pairs with higher co-occurrence frequency (> 1) make up a larger percentage (90%) of query log pairs than pairs with lower co-occurrence frequency (10%).
– Almost 75% of all pairs of patents have co-occurrence frequency <= 1.
[Figure: Percentage in All Pairs, by Pairs Co-Occurrence Frequency]
Co-Occurrence Frequency   Percentage in All Pairs of Patents   Percentage in Query Log Pairs
<=1                       0.745737                             0.098125
>1                        0.254263                             0.901875
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Frequency in Similar IPC Classes
Query patents fall into many IPC classes. Patent users are usually experienced, so their terms are more targeted.
We need to check how frequently query log term pairs occur in similar IPC classes.
– Freq (IPC Classes) = Freq / |qd|
• Freq = frequency in similar IPC classes
• |qd| = total number of retrieved documents
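One possible reading of this ratio is sketched below. It assumes that "frequency in similar IPC classes" means the number of retrieved documents that share at least one IPC class with the query patent; the slides do not spell this out, and the function name `ipc_support` is hypothetical.

```python
def ipc_support(retrieved_ipc_classes, query_patent_ipc_classes):
    """Freq / |qd| for one query.

    retrieved_ipc_classes: one collection of IPC classes per retrieved document.
    query_patent_ipc_classes: IPC classes of the query patent.
    """
    if not retrieved_ipc_classes:
        return 0.0  # no retrieved documents, so no support
    target = set(query_patent_ipc_classes)
    # Freq: retrieved documents sharing at least one IPC class with the query patent.
    freq = sum(1 for classes in retrieved_ipc_classes if target & set(classes))
    return freq / len(retrieved_ipc_classes)
```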
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Support in IPC Classes
[Figure: Percentage in All Terms Pairs, by Frequency in Similar IPC Classes]
Frequency in Similar IPC Classes   Percentage in All Pairs of Patents   Percentage in Pairs of Query Log
<=7%                               50%                                  12%
>7% & <=18%                        35%                                  21%
>18% & <=22%                       8%                                   13%
>22% & <=33%                       5%                                   33%
>33%                               2%                                   22%
The analysis indicates higher support for query log term pairs in similar IPC classes than for all term pairs of patents.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Number of Retrieved Documents
The number of retrieved documents denotes how many patents contain the query terms.
The more common the query terms are, the larger the number of retrieved documents.
This factor is analyzed with:
– All term pairs of the patent
– All term pairs of the query log
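Counting retrieved documents for a term pair could look like the following sketch. It assumes Boolean AND semantics (a patent is retrieved only if it contains every query term) over a simple in-memory inverted index; the retrieval model actually used is not stated in the slides, and both function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: mapping of patent id -> list of tokens. Returns term -> set of ids."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in set(tokens):
            index[tok].add(doc_id)
    return index


def retrieved_count(index, terms):
    """Number of patents containing all of the given terms (assumed AND query)."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return 0
    return len(set.intersection(*postings))
```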
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Number of Retrieved Documents
The analysis indicates that term pairs of the query log retrieve a smaller number of patents than all term pairs of patents.
[Figure: Percentage in Total Terms Pairs, by Number of Retrieved Patents/Query]
Retrieved Patents/Query   All Terms Pairs of Patents   All Terms Pairs of Query Log
<=5000                    0.242968                     0.657385
>5000 & <=15000           0.174361                     0.229487
>15000 & <=50000          0.193426                     0.072564
>50000                    0.385505                     0.039744
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Conclusion
For automatic IR system evaluation, query generation is an important factor.
We believe this problem can be understood on the basis of past query logs.
Across different statistical factors, there is a large difference between random queries and user queries.
We can consider these factors when generating queries automatically.
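As a rough illustration of how such factors might be applied, the sketch below keeps only term pairs whose terms are frequent in the patent and occur close together. The thresholds (term frequency of at least 6, minimum distance of at most 7) are assumptions borrowed from the chart bins, not values proposed in the slides, and `candidate_pairs` is a hypothetical name.

```python
from collections import Counter, defaultdict
from itertools import combinations


def candidate_pairs(tokens, min_tf=6, max_distance=7):
    """Return term pairs that pass a term-frequency and a proximity filter."""
    term_freq = Counter(tokens)
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    # Factor 1: drop low-frequency terms, which users rarely select.
    frequent = sorted(t for t, f in term_freq.items() if f >= min_tf)
    pairs = []
    for a, b in combinations(frequent, 2):
        # Factor 2: keep pairs that occur close together somewhere in the text.
        dist = min(abs(i - j) for i in positions[a] for j in positions[b])
        if dist <= max_distance:
            pairs.append((a, b))
    return pairs
```

The IPC support and retrieved-document counts could be added as further filters in the same way.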