Meeting Presentation, Sept. 12

Things to do since last meeting:

(1) Find out the number of drug names on the FDA website (done: the number is 6244, which is manageable for a search crawl on Twitter).

(2) Read papers to find new ideas about query cost estimation.

**Predicting query performance

**What makes a query difficult, by David Carmel

**Learning to estimate query difficulty, SIGIR 2005 best paper.

**Publications of Junghoo "John" Cho

Paper Review

Predicting query performance: This is a great paper because it introduces a new concept, the clarity score, which measures how far the query model diverges from the collection model. It lets us view query difficulty from a new perspective: query terms that discriminate poorly between documents may make a query difficult.
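To make the clarity score concrete, here is a minimal sketch that computes it as the KL divergence (in bits) between a query model and a collection model. It assumes both models are already available as term-to-probability dictionaries; how the query model is estimated is left out, and the function name and toy numbers are illustrative only.

```python
import math


def clarity_score(query_model, collection_model):
    """KL divergence (in bits) of the query model from the collection model.

    Both arguments map term -> probability. Terms missing from the
    collection model are skipped, which is a simplification; a real
    system would smooth the collection model instead.
    """
    score = 0.0
    for term, p_q in query_model.items():
        p_c = collection_model.get(term, 0.0)
        if p_q > 0.0 and p_c > 0.0:
            score += p_q * math.log2(p_q / p_c)
    return score


# Toy collection model (made-up probabilities).
collection = {"drug": 0.5, "aspirin": 0.25, "dose": 0.25}

# A query model identical to the collection has zero clarity.
vague = clarity_score(collection, collection)

# A query model concentrated on one term has high clarity.
focused = clarity_score({"aspirin": 1.0}, collection)
```

With these toy numbers the vague query scores 0.0 and the focused one scores 2.0 bits, which matches the intuition that a clear query looks unlike the collection as a whole.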

What makes a query difficult, by David Carmel: This paper builds on the previous one. It generalizes the clarity score to a higher-level concept, the "distance model". Distance applies not only to query and collection, but also to query and relevant documents, relevant documents and collection, etc. Moreover, the paper adopts a more reasonable distance function: Jensen-Shannon divergence (JSD).
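As a rough illustration of the distance function, the sketch below computes JSD between two term distributions. Unlike plain KL divergence, JSD is symmetric and, with base-2 logarithms, bounded between 0 and 1, so it stays finite even when the two distributions share no terms. The dictionaries and names are assumptions for the example, not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL divergence (in bits) of p from q; q must be nonzero wherever p is."""
    return sum(p_t * math.log2(p_t / q[t]) for t, p_t in p.items() if p_t > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence (in bits) between two term distributions."""
    terms = set(p) | set(q)
    # Midpoint distribution: nonzero wherever p or q is nonzero.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy distributions (made-up probabilities).
query = {"aspirin": 0.7, "dose": 0.3}
collection = {"aspirin": 0.1, "dose": 0.2, "drug": 0.7}

d_qc = jsd(query, collection)
d_cq = jsd(collection, query)               # same value: JSD is symmetric
d_disjoint = jsd({"a": 1.0}, {"b": 1.0})    # maximum distance, 1 bit
```

The same function can be reused for any pair in the distance model (query and relevant documents, relevant documents and collection, and so on), as long as each side is expressed as a term distribution.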

Paper Review

Learning to estimate query difficulty: The paper offers a new view that sub-query coverage may also strongly affect query difficulty. To support this view, the authors provide two machine learning methods: a histogram-based estimator and a modified decision tree. The results show that a difficult query is likely to be dominated by a single sub-query.

Some Ideas

A straightforward idea from David Carmel's paper is that we can delete query terms to maximize the distance between query and collection. The idea is not hard to implement, but I am wondering how much improvement we can get this way.
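One possible shape for the query deletion idea is a single greedy step: try removing each term in turn and keep the reduced query that lies farthest from the collection. The sketch below assumes a toy distance (KL divergence of a uniform model over the query terms from the collection model); the function names and probabilities are hypothetical, and any other distance could be substituted.

```python
import math


def uniform_query_distance(terms, collection_model):
    """Toy distance: KL divergence (bits) of a uniform model over `terms`
    from the collection model. Assumes every term is in the collection model."""
    p_q = 1.0 / len(terms)
    return sum(p_q * math.log2(p_q / collection_model[t]) for t in terms)


def best_single_deletion(terms, collection_model):
    """Try deleting each term once; return (query, distance) with the largest
    distance. The original query is returned if no deletion improves it."""
    best_terms = list(terms)
    best_dist = uniform_query_distance(best_terms, collection_model)
    if len(terms) < 2:
        return best_terms, best_dist
    for i in range(len(terms)):
        reduced = list(terms[:i]) + list(terms[i + 1:])
        dist = uniform_query_distance(reduced, collection_model)
        if dist > best_dist:
            best_terms, best_dist = reduced, dist
    return best_terms, best_dist


# Made-up collection probabilities for three query terms.
collection = {"the": 0.5, "aspirin": 0.01, "dose": 0.02}
query = ["the", "aspirin", "dose"]

original_dist = uniform_query_distance(query, collection)
reduced_query, reduced_dist = best_single_deletion(query, collection)
```

With these numbers the step drops "the", the term that looks most like the collection. Whether the larger distance translates into better retrieval is exactly the open question above and would need to be measured.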

Some Ideas

A more advanced idea is to connect this with retrieval cost. The traditional cost of retrieval is as follows:

n*(complexity of function*DF(i))

Thus the computing cost is easy to precompute, because the document frequency DF(i) of each term is already known from the index.

It is also interesting to consider deleting low-IDF, low-clarity terms. Low-IDF terms have the highest document frequency, so removing them would greatly reduce the computing cost while only slightly decreasing, or even increasing, retrieval performance.
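A small sketch of how the cost saving could be estimated. It reads the cost formula as a sum over the n query terms of (complexity of function * DF(i)), which is an interpretation; the IDF definition (natural log of collection size over DF), the threshold, and all numbers are assumptions for illustration.

```python
import math


def retrieval_cost(terms, df, complexity=1.0):
    """Sum over query terms of (complexity of scoring function * DF of term)."""
    return sum(complexity * df[t] for t in terms)


def drop_low_idf_terms(terms, df, num_docs, min_idf):
    """Keep only terms whose IDF = ln(num_docs / DF) reaches min_idf."""
    return [t for t in terms if math.log(num_docs / df[t]) >= min_idf]


# Made-up document frequencies in a collection of one million documents.
df = {"the": 900_000, "aspirin": 1_200, "dose": 5_000}
num_docs = 1_000_000
query = ["the", "aspirin", "dose"]

full_cost = retrieval_cost(query, df)
kept = drop_low_idf_terms(query, df, num_docs, min_idf=1.0)
reduced_cost = retrieval_cost(kept, df)
```

Here dropping the single low-IDF term cuts the estimated cost from 906200 to 6200 posting entries. A clarity check could be added to the filter in the same way; the effect on retrieval performance is not modeled here.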

Some Ideas

It is also interesting to discuss term proximity and query expansion here. In my opinion, term proximity and external query term expansion may help to improve query clarity.

Term proximity adds a cost of about:

n*(n-1)/2*(DF1+DF2+averageTF1*averageTF2*comDoc)
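The proximity estimate can be sketched as a sum over all n*(n-1)/2 term pairs. The code below assumes that comDoc means the number of documents containing both terms of a pair and that the statistics are available as dictionaries; these readings, the names, and the numbers are illustrative.

```python
from itertools import combinations


def proximity_extra_cost(terms, df, avg_tf, co_docs):
    """Additional cost of proximity scoring, summed over all term pairs.

    For each pair: read both posting lists (DF1 + DF2), then compare term
    positions in every shared document (averageTF1 * averageTF2 * comDoc).
    """
    total = 0.0
    for a, b in combinations(terms, 2):
        shared = co_docs.get(frozenset((a, b)), 0)
        total += df[a] + df[b] + avg_tf[a] * avg_tf[b] * shared
    return total


# Made-up statistics for a two-term query.
df = {"aspirin": 1_200, "dose": 5_000}
avg_tf = {"aspirin": 2.0, "dose": 1.5}
co_docs = {frozenset(("aspirin", "dose")): 300}

extra = proximity_extra_cost(["aspirin", "dose"], df, avg_tf, co_docs)
```

For this single pair the estimate is 1200 + 5000 + 2.0 * 1.5 * 300 = 7100. Since the number of pairs grows quadratically with n, long queries make proximity scoring expensive quickly.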

External query term expansion adds a cost of about:

n*(complexity of function*DF(i))+k*averageDoclength+N*(complexity of function*DF(i))

where n is the number of query terms, k is the number of top documents used for expansion, and N is the number of expansion terms.
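The three parts of the expansion estimate can be sketched as follows: a first retrieval pass over the original terms, a scan of the top k documents, and a pass over the N expansion terms. As before, each retrieval pass is read as a sum of (complexity of function * DF(i)) over its terms; the function name and numbers are assumptions.

```python
def expansion_extra_cost(query_terms, expansion_terms, df, k, avg_doc_length,
                         complexity=1.0):
    """Additional cost of query expansion from the top k documents."""
    # Pass over the n original query terms to find the top documents.
    first_pass = sum(complexity * df[t] for t in query_terms)
    # Read the top k documents to pick expansion terms.
    feedback_scan = k * avg_doc_length
    # Pass over the N expansion terms.
    expanded_pass = sum(complexity * df[t] for t in expansion_terms)
    return first_pass + feedback_scan + expanded_pass


# Made-up statistics; short documents such as tweets keep the scan cheap.
df = {"aspirin": 1_200, "dose": 5_000, "tablet": 3_000, "pain": 8_000}

extra = expansion_extra_cost(
    query_terms=["aspirin", "dose"],
    expansion_terms=["tablet", "pain"],
    df=df,
    k=10,
    avg_doc_length=20,
)
```

With these numbers the estimate is 6200 + 200 + 11000 = 17400, so the cost is dominated by the document frequencies of the expansion terms rather than by the scan of the top documents.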

It will be interesting to discuss how much clarity term proximity and external query term expansion can add.
