Web/Database Search by Topics

Project Report

Case Western Reserve University
ECES 433
Ling Yang, Kun Si
[email protected], [email protected]
Dec. 11, 2000


Today, a large amount of information is available online or on CDs/DVDs. These information resources are typically extremely large and usually contain a hyperlinked table of contents, indexed keyword search facilities, and multimedia data such as images and audio/video streams. When the resources are used in the limited ways they were designed for, they can be very effective. For other types of information extraction, however, it is not easy to get meaningful results, and using these facilities can be a frustrating experience for the user. A new technology is needed to search this information effectively.

In this project, we explore the information at the website http://researcher.sirs.com and try to model its contents using topics and topic metalinks. There are two goals in doing this: the first is to find new, reasonable metalinks; the second is to raise issues about relationships that cannot be modeled with the current metalink model, discuss them, and try to improve the model. A website has been developed to demonstrate our model.

Section 1 gives a brief introduction to the website we have been studying. Section 2 explains our database model. Section 3 is devoted to the metalinks we have defined and the issues about topic maps. Section 4 describes our implementation of web searching on the database. Section 5 gives a conclusion of our work and suggests further work.

1. Introduction to the study website: http://researcher.sirs.com

It is a child website of http://www.sirs.com, hosted by SIRS Mandarin, Inc. SIRS Mandarin, Inc. is an established information and technology provider to more than 50,000 libraries and institutions worldwide. Among its products, the company provides a powerful Web search interface that allows integrated access to full-text articles, Internet sites, documents and graphics from SIRS reference databases to SIRS Researcher, SIRS Government Reporter, SIRS Renaissance and SIRS NetSelect. The website we are studying, http://researcher.sirs.com, is such a search website based on the SIRS Researcher database, a general reference database with thousands of full-text articles exploring social, scientific, health, historic, economic, business, political and global issues.

The website offers several ways to search for an article: Quick Search, Advanced Search, and Topic Browse. Quick Search searches by keywords or subject headings. Advanced Search narrows the search by author, title, text, or a combination of them. Topic Browse allows the user to browse through the topic categories; there are 8 topic categories and 46 subtopic categories in the Topic Browse section. Because the website covers so many topics that we could not possibly model all of them, we limited our study to Business: Consumerism. We took 100 articles from this category, with about 390 topics related to those articles.


We have reviewed the articles and the topics involved one by one and tried to abstract the relationships between topics. At this moment, we have defined 16 metalinks and 53 metalink instances. The next section discusses how these metalinks and their instances are implemented in the database.

2. Database Model

We save the articles, topics, metalinks and their instances in a relational database. There are six tables in total, defined as:

a. article (aid, title, path, published)
b. topic (topic)
c. metalink (metalink)
d. article_topic (aid, topic, degree)
   foreign key (aid) references article(aid), foreign key (topic) references topic(topic)
e. metalink_instance (miid, topic1, topic2, metalink)
   foreign key (topic1) references topic(topic), foreign key (topic2) references topic(topic), foreign key (metalink) references metalink(metalink)
f. article_metalink_instance (aid, miid)
   foreign key (aid) references article(aid), foreign key (miid) references metalink_instance(miid)

Figure 1 is the ER model, and Tables 1-6 show sample records. All six tables are in BCNF. The SQL commands to create the tables, insert records, and create indexes are in the attachment.

The first table, ‘article’, stores information about the articles we have reviewed. Field ‘aid’ is the ID of each article and the primary key; ‘title’ is the title of the article; ‘path’ tells us where to find the article and can be a URL or a local file path and name; ‘published’ is the date the article was published.

The second table, ‘topic’, stores all the distinct topics discussed in these 100 articles. Table ‘metalink’ defines the metalinks between these topics. The fourth table, ‘article_topic’, lists which article involves which topics. Fields ‘aid’ and ‘topic’ together form the primary key, and they are foreign keys referring to tables ‘article’ and ‘topic’, respectively. The third field, ‘degree’, indicates how deeply a topic is discussed in an article. An article may include several topics but focus on just one or two of them and mention the others only briefly. The value of this field is one of the following three, in order of increasing depth: ‘elementary’, ‘moderate’, ‘elaborate’.

Table ‘metalink_instance’, as its name suggests, holds the instances of the metalinks defined in table ‘metalink’. All the metalinks we have defined in this project are binary, so three fields are necessary in this table: one for the metalink and two for the topics it relates. Note that metalinks are not necessarily binary; they can be unary or ternary too. Each metalink instance has a unique ID ‘miid’, which is the primary key. Although the combination of topic1, topic2, and metalink is enough to uniquely identify a record, we still add a unique ID, since the records are referred to in other tables, e.g. table ‘article_metalink_instance’. Referring to a record by a single ID rather than by three fields saves some space.

Table ‘article_metalink_instance’ has only two ID fields: the article ID ‘aid’, referring to table ‘article’, and the metalink instance ID ‘miid’ just mentioned. It is very similar to table ‘article_topic’, but besides telling us which topics an article involves, it also tells us how the two topics (for a binary metalink) are related. This table is important to our model: it adds another dimension and changes the model from a list of (article, topic) pairs into one that captures the inner structure of each article. The table is built on the assumption that multiple relationships (metalinks) can exist between the same pair of topics. However, this is not the case in our project, at least so far. We will come back to it in the web implementation section.

3. Metalink and Issues

By scrutinizing the 100 articles and the 390 topics, we came up with 16 metalinks. They are:

1). BUILD_ON: topic A BUILD_ON topic B means that, in a knowledge structure, a reader who wants to master topic A first needs to know topic B. In other situations, both topics can be events, and event A is the result of event B.
2). COMPARE_WITH: topic A and topic B tend to be compared, e.g. the medicare systems in Canada and in the U.S.A. have been studied together in many articles.
3). COMPETE: two topics, e.g. products, compete with each other, e.g. generic drugs and their corresponding brand name products.
4). COMPLEMENT: the existence of one topic, e.g. a person, an occurrence, or a tool, makes the other more complete.
5). CONTROVERSIAL_TOPIC_IN: one topic is controversial in the field of the other topic, e.g. whether patients with terminal illness have the right to die is very controversial in the field of medical ethics.
6). HAVE_POWER_OVER: a certain person, group, organization, industry, government, etc. has influence in a certain field, e.g. the controlling power of the pharmaceutical industry over the drug market.
7). IMPROVE: the existence of certain people, policies, or events makes the other topic better, e.g. studies show that certified midwives often make childbirth safer.
8). IN: topic A is not physically in topic B; rather, this describes the situation of topic A in the area of topic B, e.g. the health care reform in California.
9). IS_A: topic A has all the characteristics of topic B, e.g. a monkey is a mammal.
10). IS_IN: both topics are geographical or physical terms, e.g. Oregon is in the U.S.
11). LEAD_TO: the happening of topic A, an event, leads to the happening of topic B, another event, e.g. brain injury may cause coma.
12). OF: topic A is one aspect of topic B, e.g. the side effects of drugs.
13). PREVENT: the existence of topic A, an event or policy, prevents the happening of topic B, e.g. vaccines may prevent some communicable diseases.
14). SPECIAL_GROUP_IN: a group of people or an organization is a special group in the field of another topic, e.g. aged people are a special group in health insurance.
15). SUBSPECIALTY_IN: topic A is under topic B, e.g. correctional medicine is a subspecialty in medicine.
16). TREAT: a medicine or medical operation treats a disease.

The difficulty in finding and determining a metalink is deciding how general it should be. It cannot be too detailed; otherwise its instances will not carry useful information and will degrade to mere records in a database table. It cannot be too general either: ‘RELATE_TO’ is a metalink, but since almost every topic can relate to every other topic in some way, it loses its meaning. The criteria for defining a metalink can hardly be stated explicitly in words; they come more from an instinctive sense of logic.

The instantiation of a metalink, in other words, finding a pair of topics that are related by the metalink, raises issues too. We need to take into consideration the sources (articles in our project) of these topics. For example, for metalink ‘IS_IN’, one instance is ‘an engine IS_IN a car’. Both topic engine and topic car have sources associated with them, and any source for topic engine should relate to any source for topic car by ‘IS_IN’. However, not every such pairing of two sources is meaningful or useful. A source for engine may be devoted to a specific engine model, e.g. model c01 or c02, and a source for car may be devoted to a specific car model, e.g. Honda Civic, and the Honda Civic uses only engine model c01, not c02. So pairing a source for engine c02 with a source for the Honda Civic does not fit metalink ‘IS_IN’. In the earlier report, we came up with a solution that adds an applicable level to certain metalink instances. In this example, "an engine IS_IN a car" applies only to sources about engines and cars in general. When both sources get more detailed, e.g. a specific engine model and the Honda Civic, the metalink instance needs to be more specific too, e.g. "a c01 engine is in a Honda Civic car", and all the sources about the c01 engine and the Honda Civic can be paired up. However, we found that this solution complicates things, and it is hard to decide which instances should have this ‘applicable level’ and which should not. So now, if an instance of a metalink has the above issue, we do not include it as a legitimate instance of the metalink.

4. Implementation of Web Searching

We have developed a website, http://vorlon.cwru.edu/~lxy21, to demonstrate how our data model and topic metalink model can be used in web search.

First, the user types in ONE topic of interest and clicks the search button. Our web search engine first searches table ‘metalink_instance’ and finds the metalink instances that contain this topic; the result is printed in a table as the suggested further search directions. The search engine also looks in table ‘article_topic’ to find all the articles that contain this topic and gets the detailed information about these articles from table ‘article’; this search result is also printed on the web page (see Figure 2).
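The lookup of suggested search directions can be sketched in Java roughly as follows. This is a minimal in-memory sketch, not the project's JSP code: the class names MetalinkInstance and SuggestionLookup are hypothetical, and a small hard-coded list (using instances quoted in this report) stands in for table ‘metalink_instance’. It collects every instance that has the typed topic on either side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for one row of table metalink_instance.
final class MetalinkInstance {
    final String topic1;
    final String metalink;
    final String topic2;

    MetalinkInstance(String topic1, String metalink, String topic2) {
        this.topic1 = topic1;
        this.metalink = metalink;
        this.topic2 = topic2;
    }

    @Override
    public String toString() {
        return topic1 + " " + metalink + " " + topic2;
    }
}

class SuggestionLookup {
    // Sample rows taken from the report's examples; a real run would query the table.
    static List<MetalinkInstance> sampleInstances() {
        return Arrays.asList(
            new MetalinkInstance("nursing home care", "COMPARE_WITH", "home care services"),
            new MetalinkInstance("generic drugs", "COMPETE", "brand name products"),
            new MetalinkInstance("prices", "OF", "generic drugs"),
            new MetalinkInstance("side effects", "OF", "generic drugs"));
    }

    // Returns the instances that mention the topic as topic1 or as topic2.
    static List<MetalinkInstance> suggestionsFor(String topic, List<MetalinkInstance> instances) {
        List<MetalinkInstance> hits = new ArrayList<>();
        for (MetalinkInstance mi : instances) {
            if (mi.topic1.equals(topic) || mi.topic2.equals(topic)) {
                hits.add(mi);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (MetalinkInstance mi : suggestionsFor("generic drugs", sampleInstances())) {
            System.out.println(mi);
        }
    }
}
```

With the sample rows, the three instances involving ‘generic drugs’ are returned and the COMPARE_WITH row is left out.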

Topic ‘generic drugs’ shows up in three metalink instances in our database: ‘generic drugs COMPETE brand name products’, ‘prices OF generic drugs’, and ‘side effects OF generic drugs’, and eight articles contain topic ‘generic drugs’. The three metalink instances are our suggested further search directions. If the user clicks on topic ‘brand name products’, our search engine finds the articles that contain topic ‘generic drugs’, topic ‘brand name products’, or both. Two articles, ‘Drug Makers Maneuver to Keep Generics Off Market’ and ‘ARE GENERIC DRUGS AS GOOD AS BRAND NAME?’, are at the top of the search results list. They both discuss the competition between ‘brand name products’ and their corresponding ‘generic drugs’. The publication dates of these articles are also displayed. Notice that ‘2 topics’ is shown next to the first two articles on the list and ‘1 topics’ next to the others. This is the number of user-chosen topics that the article contains.
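The ranking by number of matched topics can be illustrated with a short Java sketch. It is an in-memory approximation of the grouping query used by the search page, not the page itself: the class name TopicRanking is hypothetical, the article-to-topic map holds the sample rows of Table 4, and breaking ties by ascending article ID is an assumption (the report does not say how ties are ordered).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TopicRanking {
    // Sample rows of article_topic (Table 4), keyed by article ID.
    static Map<Integer, Set<String>> sampleArticleTopics() {
        Map<Integer, Set<String>> m = new TreeMap<>();
        m.put(1, new HashSet<>(Arrays.asList("cost", "generic drugs", "medical care",
                "prescription drugs", "prescription pricing")));
        m.put(2, new HashSet<>(Arrays.asList("brand name products", "drugs", "generic drugs",
                "law and legislation", "legal loopholes", "patents", "pharmaceutical industry")));
        m.put(3, new HashSet<>(Arrays.asList("generic drugs", "medical economics",
                "prescription drugs", "prescription pricing")));
        m.put(4, new HashSet<>(Arrays.asList("brand name products", "consumer education",
                "cost effectiveness", "generic drugs", "pharmaceutical industry",
                "pharmacists", "safety", "u.s. food and drug adm.")));
        return m;
    }

    // Maps article ID to the number of chosen topics it contains, best match first.
    static Map<Integer, Integer> rank(Map<Integer, Set<String>> articleTopics,
                                      Collection<String> chosenTopics) {
        Set<String> chosen = new HashSet<>(chosenTopics); // ignore duplicate choices
        List<int[]> counted = new ArrayList<>();
        for (Map.Entry<Integer, Set<String>> e : articleTopics.entrySet()) {
            int matches = 0;
            for (String topic : chosen) {
                if (e.getValue().contains(topic)) {
                    matches++;
                }
            }
            if (matches > 0) {
                counted.add(new int[] {e.getKey(), matches});
            }
        }
        // More matched topics first; ties by ascending article ID (an assumption).
        counted.sort((a, b) -> a[1] != b[1]
                ? Integer.compare(b[1], a[1])
                : Integer.compare(a[0], b[0]));
        Map<Integer, Integer> ranked = new LinkedHashMap<>();
        for (int[] row : counted) {
            ranked.put(row[0], row[1]);
        }
        return ranked;
    }

    public static void main(String[] args) {
        System.out.println(rank(sampleArticleTopics(),
                Arrays.asList("generic drugs", "brand name products")));
        // prints {2=2, 4=2, 1=1, 3=1}
    }
}
```

Articles 2 and 4 contain both chosen topics and therefore come first, which matches the ‘2 topics’ entries described above.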

The user can also start a new search by typing in a topic and clicking the search button.

The difference between our topic metalink search and the usual keyword search is not only that we know the relationship between topics ‘generic drugs’ and ‘brand name products’ is ‘COMPETE’, but also that the relationship between these two topics within our top two search results is ‘COMPETE’. This is assured by table ‘article_metalink_instance’, which should have records associating the metalink instance (‘generic drugs’, ‘COMPETE’, ‘brand name products’) with the IDs of these two articles. Our search engine should search table ‘article_metalink_instance’, find the articles that have the required metalink, and display them as the top suggested articles. However, due to the lack of multiple relationships between topics, as discussed in section 2, this table is not really used. In other words, since there is no other relationship between topics ‘generic drugs’ and ‘brand name products’ in our project (in the real world, there might be), if an article has both topics, the relationship can only be ‘COMPETE’, so there is no need to check table ‘article_metalink_instance’. In the future, as our database grows, the metalinks and their instances get more complicated, and multiple relationships exist between two topics, our search engine will visit table ‘article_metalink_instance’ first; for now this step is skipped.
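The skipped step could look roughly like the following Java sketch. Everything in it is an assumption for illustration: the class name MetalinkPriority is hypothetical, the map stands in for table ‘article_metalink_instance’, and the sample associations (articles 2 and 4 linked to instance 2, the COMPETE instance of Table 5) follow the description above rather than any stored data. Articles recorded with the required metalink instance are moved to the front, and the existing order is otherwise kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MetalinkPriority {
    // Stable partition: articles associated with requiredMiid first, the rest after,
    // each group keeping the order it had in rankedAids.
    static List<Integer> prioritize(List<Integer> rankedAids,
                                    Map<Integer, Set<Integer>> instancesByArticle,
                                    int requiredMiid) {
        List<Integer> withLink = new ArrayList<>();
        List<Integer> withoutLink = new ArrayList<>();
        for (Integer aid : rankedAids) {
            Set<Integer> miids = instancesByArticle.get(aid);
            if (miids != null && miids.contains(requiredMiid)) {
                withLink.add(aid);
            } else {
                withoutLink.add(aid);
            }
        }
        withLink.addAll(withoutLink);
        return withLink;
    }

    public static void main(String[] args) {
        // Assumed associations: articles 2 and 4 carry metalink instance 2.
        Map<Integer, Set<Integer>> instancesByArticle = new HashMap<>();
        instancesByArticle.put(2, new HashSet<>(Arrays.asList(2)));
        instancesByArticle.put(4, new HashSet<>(Arrays.asList(2)));

        List<Integer> ranked = Arrays.asList(1, 2, 3, 4); // some earlier ranking
        System.out.println(prioritize(ranked, instancesByArticle, 2));
        // prints [2, 4, 1, 3]
    }
}
```

Keeping the partition stable means any earlier ranking, such as the count of matched topics, still decides the order within each group.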

We chose Apache as the HTTP web server and Tomcat as the JSP engine, and we use JSP and JDBC to communicate with the Oracle database.

5. Conclusion and further work

In our project, we have carefully studied 100 articles under the topic Business: Consumerism, together with the topics they involve, and extracted 16 metalinks and 53 metalink instances from them. We have built a relational database to store all this information and developed a website to demonstrate how our model helps people search with topic metalinks.

A lot can still be done to improve this project. If we continue to work on it, we will rethink and organize the metalinks. Our website can also be made friendlier, e.g., with a list of all the possible topics and a text index on all the articles.


Figure 1. Entity_Relation Model of the Database

[Diagram not reproduced. Labels recoverable from it: entities article (aid, title, path, published), topic (topic), metalink (metalink) and Metalink_instance (miid); relationships article_topic (degree) and Article_metalink_instance; cardinalities marked m and n.]

Figure 2. Search result of topic ‘generic drugs’

Figure 3. Search result of topics ‘generic drugs’ and ‘brand name products’

Table 1. article

AID  TITLE                                                PATH             PUBLISHED
1    The Price We Pay                                     /article/1.txt   16-Oct-00
2    Drug Makers Maneuver to Keep Generics Off Market     /article/2.txt   17-Aug-00
3    MEDICAL ECONOMICS: SEVEN WAYS TO CUT YOUR PILL BILL  /article/3.txt   1-Feb-00
4    ARE GENERIC DRUGS AS GOOD AS BRAND NAME?             /article/4.txt   1-May-98
5    FDA SAYS GENERIC DRUGS ARE AS GOOD                   /article/5.txt   3-Feb-98
6    PRICE WARS OVER NAME-DROPPING                        /article/6.txt   1-May-94
7    Health-Care Reform: Battling the High Cost of Drugs  /article/7.txt   1-Jul-93
8    WHEN GENERIC ISNT GENUINE                            /article/8.txt   9-Jul-89
9    FORGOTTEN PATIENTS: THE MENTALLY ILL                 /article/9.txt   1-Apr-00
10   WHATS WRONG WITH MANAGED CARE AND HOW TO FIX IT      /article/10.txt  1-Feb-00
11   TOBACCO TRIUMPHS AS COURT SAYS NO TO UNION LAWSUITS  /article/11.txt  11-Jan-00
12   IT CANT HAPPEN HERE                                  /article/12.txt  1-Dec-00
13   CANADA RETHINKS ITS MEDICARE                         /article/13.txt  14-Dec-99
14   HOSPITAL RANKINGS SHOW SAVINGS                       /article/14.txt  14-Dec-99
15   PLANNING CAN HELP COVER COSTS OF AGING               /article/15.txt  3-Oct-99
16   JUSTICE DEPT. SUES TOBACCO COS.                      /article/16.txt  22-Sep-99
17   SHOULD HMOs PAY FOR MAYBE?                           /article/17.txt  6-Jun-99
18   BIOETHICS: A MORAL VACUUM?                           /article/18.txt  1-May-99
19   DRUG BENEFIT NEWEST TWIST IN DEBATE OVER MEDICARE    /article/19.txt  28-Apr-99
20   THE NEW CONSUMER PARADIGM                            /article/20.txt  1-Apr-99

Table 2. topic

TOPIC
abortion
abused women
accounting
actions and defenses
advertising
african american women
aged
aging
aids (disease)
aids (disease) and employment
aids (disease) education
alternative medicine
american civil liberties union
americans with disabilities act (1990)
amphetamines
antidepressants
antismoking movement
arrest
asylum


Table 3. metalink

METALINK
BUILD_ON
COMPARE_WITH
COMPETE
COMPLEMENT
CONTROVERSAL_TOPIC_IN
HAVE_POWER_OVER
IMPROVE
IN
IS_A
IS_IN
LEAD_TO
OF
PREVENT
SPECIAL_GROUP_IN
SUBSPECIALTY_IN
TREAT

Table 4. article_topic

AID  TOPIC                    DEGREE
1    cost                     mentions
1    generic drugs            defines
1    medical care             elaborates
1    prescription drugs       mentions
1    prescription pricing     defines
2    brand name products      elaborates
2    drugs                    mentions
2    generic drugs            mentions
2    law and legislation      defines
2    legal loopholes          elaborates
2    patents                  mentions
2    pharmaceutical industry  defines
3    generic drugs            elaborates
3    medical economics        mentions
3    prescription drugs       defines
3    prescription pricing     elaborates
4    brand name products      mentions
4    consumer education       defines
4    cost effectiveness       elaborates
4    generic drugs            mentions
4    pharmaceutical industry  defines
4    pharmacists              elaborates
4    safety                   elaborates
4    u.s. food and drug adm.  mentions


Table 5. metalink_instance

MIID  TOPIC1                        TOPIC2               METALINK
1     nursing home care             home care services   COMPARE_WITH
2     generic drugs                 brand name products  COMPETE
3     layoffs                       unemployment         COMPLEMENT
4     living wills                  medical ethics       CONTROVERSAL_TOPIC_IN
5     living wills                  medicaid             CONTROVERSAL_TOPIC_IN
6     living wills                  medicare             CONTROVERSAL_TOPIC_IN
7     right to die                  medical ethics       CONTROVERSAL_TOPIC_IN
8     right to die                  medicaid             CONTROVERSAL_TOPIC_IN
9     right to die                  medicare             CONTROVERSAL_TOPIC_IN
10    right to refuse treatment     medical ethics       CONTROVERSAL_TOPIC_IN
11    right to refuse treatment     medicaid             CONTROVERSAL_TOPIC_IN
12    right to refuse treatment     medicare             CONTROVERSAL_TOPIC_IN
13    substance abuse in pregnancy  medical ethics       CONTROVERSAL_TOPIC_IN
14    substance abuse in pregnancy  medicaid             CONTROVERSAL_TOPIC_IN
15    substance abuse in pregnancy  medicare             CONTROVERSAL_TOPIC_IN
16    terminal care                 medical ethics       CONTROVERSAL_TOPIC_IN
17    terminal care                 medicaid             CONTROVERSAL_TOPIC_IN
18    terminal care                 medicare             CONTROVERSAL_TOPIC_IN
19    insurance companies           medical policy       HAVE_POWER_OVER
20    family                        health               IMPROVE

Table 6. article_metalink_instance

AID  MIID
1    2
4    42
5    2
6    2
7    40
9    51
13   29
15   37
25   41
26   29
31   21
33   41
35   22
41   44
44   7
47   53
51   5
57   5
61   52
62   50


Attachment:

1. Default.html

<html>
<head>
<title>Project Page of ECES423 CWRU----Ling Yang &amp; Kun Si</title>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
</head>
<body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551a8b alink=#ff0000>
<div align="center"><font color="#CC3300" size="4"><b><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif">Project Page of ECES423</font></b></font> </div>
<p align="center"><font size="3" color="#003333" face="Arial, Helvetica, sans-serif">----<i>LingYang &amp; Kun Si</i></font> <br> </p>
<form action="topicsearch.jsp" method=get>
  <div align="center">
    <br><font face=arial,sans-serif size=-1>Type in one topic and start search </font>
    <br><input type=text name=topic value="generic drugs" framewidth=8 size=40 maxlength=256>
    <br><input type=submit value="Search"> <input type=hidden name=new value="yes">
  </div>
</form>
<div align="center">
  <br> <font size=-1><a href="/ling/report.doc/">Project Report</a> - <a href="/ling/source.html/">Source Code </a> - <a href="/ling/command.txt">Tables and SQL commands</a></font>
</div>
<p align="center"><font size="2">created date: Dec.09,2000</font></p>
</body>
</html>


2. TopicSearch.jsp

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <title>Topic Map Search Engine---Ling Yang &amp; Kun Si</title>
</head>
<body bgcolor="#FFFFFF">
<font size=+2><B>Topic Map Search Engine</B></font>
<form action="topicsearch.jsp" method=get>
  <font face=arial,sans-serif size=-1>Type in one topic and start a new search </font>
  <br><input name="topic" type=text value="" framewidth=8 size=40 maxlength=256>
  <br><input type=submit value="Search"><input type=hidden name=new value="yes">
</form>

<%@ page language="java" import="java.sql.*, java.util.*, java.net.URLEncoder" %>
<%
  Connection conn = null;
  PreparedStatement stmt_l = null;  // instances with the topic as topic1
  PreparedStatement stmt_r = null;  // instances with the topic as topic2
  PreparedStatement stmt = null;    // ranked article search
  PreparedStatement instmt = null;  // details of one article
  ResultSet rset_l = null;
  ResultSet rset_r = null;
  ResultSet rset = null;

  // Topics chosen so far are kept in the session between requests.
  Vector vTopic = (Vector) session.getAttribute("vTopic");
  if (vTopic == null) vTopic = new Vector();

  // A submission from the search form carries new=yes and starts over.
  if (request.getParameter("new") != null) vTopic.clear();

  // Check for a missing parameter before trimming it.
  String newtopic = request.getParameter("topic");
  newtopic = (newtopic == null) ? "" : newtopic.trim();
  if (newtopic.length() > 0 && !vTopic.contains(newtopic)) vTopic.add(newtopic);

  // Count, per article, how many of the chosen topics it contains.
  // One bind placeholder per topic instead of pasting user input into the SQL text.
  // rtrim is applied because the topic columns are blank-padded CHAR columns.
  StringBuffer query = new StringBuffer("select t.aid, count(t.aid) from article_topic t");
  for (int i = 0; i < vTopic.size(); i++) {
    query.append(i == 0 ? " where " : " or ").append("rtrim(t.topic) = ?");
  }
  query.append(" group by t.aid order by count(t.aid) desc");

  try {
    DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
    conn = DriverManager.getConnection("jdbc:oracle:thin:@ylnw:1521:oracledb", "system", "manager");
    stmt_l = conn.prepareStatement("select topic2, metalink from metalink_instance where rtrim(topic1) = ?");
    stmt_l.setString(1, newtopic);
    rset_l = stmt_l.executeQuery();
%>
<br><font size=+1>click on the following topic to narrow your search </font>
<br><table width="60%" border="1">
<tr bgcolor="#EEEEFF"><th>topic1</th> <th>metalink</th> <th>topic2</th></tr>
<%  while (rset_l.next()) {
      String other = rset_l.getString(1).trim();
      String link = rset_l.getString(2).trim().toUpperCase(); %>
<tr><td><%=newtopic%></td><td><%=link%></td>
    <td><a href="topicsearch.jsp?topic=<%=URLEncoder.encode(other, "ISO-8859-1")%>"><B><%=other%></B></a></td></tr>
<%  }
    stmt_r = conn.prepareStatement("select topic1, metalink from metalink_instance where rtrim(topic2) = ?");
    stmt_r.setString(1, newtopic);
    rset_r = stmt_r.executeQuery();
    while (rset_r.next()) {
      String other = rset_r.getString(1).trim();
      String link = rset_r.getString(2).trim().toUpperCase(); %>
<tr><td><a href="topicsearch.jsp?topic=<%=URLEncoder.encode(other, "ISO-8859-1")%>"><B><%=other%></B></a></td>
    <td><%=link%></td><td><%=newtopic%></td></tr>
<%  } %>
</table><P> <hr>
<%  stmt = conn.prepareStatement(query.toString());
    for (int i = 0; i < vTopic.size(); i++) {
      stmt.setString(i + 1, (String) vTopic.elementAt(i));
    }
    rset = stmt.executeQuery();
    instmt = conn.prepareStatement("select title, path, published from article where aid = ?");
%>
<br><font size=+1><B>Search Results:</B></font>
<%  while (rset.next()) {
      int matched = rset.getInt(2);  // number of chosen topics in this article
      instmt.setInt(1, rset.getInt(1));
      ResultSet inner = instmt.executeQuery();
      while (inner.next()) {
        String title = inner.getString(1);
        String path = inner.getString(2).trim();
        java.sql.Date published = inner.getDate(3); %>
<br><a href="<%=path%>"><%=title%></a>
&nbsp;&nbsp;&nbsp;&nbsp;last updated:<%=published%>&nbsp;&nbsp;(<%=matched%> topics)
<%    }
      inner.close();
    }
    session.setAttribute("vTopic", vTopic);
  } catch (Exception exx) {
    exx.printStackTrace();
  } finally {
    // Close every resource independently so one failure does not leak the others.
    try { if (instmt != null) instmt.close(); } catch (SQLException ignored) {}
    try { if (stmt != null) stmt.close(); } catch (SQLException ignored) {}
    try { if (stmt_r != null) stmt_r.close(); } catch (SQLException ignored) {}
    try { if (stmt_l != null) stmt_l.close(); } catch (SQLException ignored) {}
    try { if (conn != null) conn.close(); } catch (SQLException ignored) {}
  }
%>
</body></html>


3. SQL commands

create table topic (
  topic char(60),
  primary key (topic));

create table metalink (
  metalink char(40),
  primary key (metalink));

create table article (
  aid integer,
  title char(100),
  path char(60),
  published date,
  primary key (aid));

create table article_topic (
  aid integer,
  topic char(60),
  degree char(20),
  primary key (aid, topic),
  foreign key (aid) references article(aid),
  foreign key (topic) references topic(topic));

create index atopic_idx on article_topic(topic);

create table metalink_instance (
  miid integer,
  topic1 char(60),
  topic2 char(60),
  metalink char(40),
  primary key (miid),
  foreign key (topic1) references topic(topic),
  foreign key (topic2) references topic(topic),
  foreign key (metalink) references metalink(metalink));

create index t1_idx on metalink_instance(topic1);

create index t2_idx on metalink_instance(topic2);

create index tl_idx on metalink_instance(metalink);

create table article_metalink_instance (
  aid integer,
  miid integer,
  primary key (aid, miid),
  foreign key (aid) references article(aid),
  foreign key (miid) references metalink_instance(miid));