The Development of Sharing Publication Citation Information Website with Article Search System
Using OKAPI BM25
Author
Hartono (26405055)
Supervisors
Resmana Lim, M.Eng.
Adi Wibowo, M.T.
Background
• The need to obtain the necessary scientific journals.
• Limited access to scientific journals.
• The need to collect article information not only by harvesting but also by manual entry.
• The need for better search results.
Problem & Goal
Problem:
• How to get article information by harvesting from an external journal site?
• How to input articles formatted as BibTeX, XML, or PDF into the database?
• How to harvest articles automatically at set intervals?
• How to index the articles in the database?
• How to search the articles in the database using OKAPI BM25?
Goal:
• To develop an information-sharing site with more complete article information that helps users find the information they want.
Context Diagram
Harvesting Process
Start -> read the OAI request URL -> check the metadata -> valid? (Y: harvest the metadata into the article database and download the source metadata) -> end
ListMetadataFormats verb example: http://citeseerx.ist.psu.edu/oai2?verb=ListMetadataFormats
ListIdentifiers verb example: http://citeseerx.ist.psu.edu/oai2?verb=ListIdentifiers&from=2010-03-17&until=2010-03-18&metadataPrefix=oai_dc
GetRecord verb example: http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&identifier=oai:CiteSeerXPSU:10.1.1.1.2918&metadataPrefix=oai_dc
ListRecords verb example: http://citeseerx.ist.psu.edu/oai2?verb=ListRecords&from=2010-03-17&until=2010-03-18&metadataPrefix=oai_dc
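The request URLs above all follow the same pattern: a base URL, a verb, and a few arguments. A minimal sketch of how such a URL could be assembled is shown here. The helper name oai_request_url and the from_ spelling (needed because "from" is a reserved word in Python) are assumptions made for illustration and are not taken from the slides.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URLs in the slides.
BASE_URL = "http://citeseerx.ist.psu.edu/oai2"


def oai_request_url(verb, base_url=BASE_URL, **params):
    """Build an OAI-PMH request URL (hypothetical helper, for illustration)."""
    query = {"verb": verb}
    for key, value in params.items():
        # A trailing underscore is stripped so that from_ becomes "from".
        query[key.rstrip("_")] = value
    return f"{base_url}?{urlencode(query)}"


url = oai_request_url(
    "ListRecords",
    from_="2010-03-17",
    until="2010-03-18",
    metadataPrefix="oai_dc",
)
print(url)
```

With these arguments the printed URL matches the ListRecords example above.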
Article Management Process
Start -> read articles from the database and from the user -> approve? (Y/N); approved articles go on to indexing -> end
Indexing Process
Title Process and Description Process
Title process: read the title -> explode(title) -> stopword(title) -> stemming(title) -> title_term = title_term + 1 -> repeat while terms remain, then return
Description process: read the description -> explode(description) -> stopword(description) -> stemming(description) -> description_term = description_term + 1 -> repeat while terms remain, then return
Content Process and Creator Process
Content process: read the content -> explode(content) -> stopword(content) -> stemming(content) -> fullbody_term = fullbody_term + 1 -> repeat while terms remain, then return
Creator process: read the creator -> explode(creator) -> creator_term = creator_term + 1 -> repeat while terms remain, then return
Explode Process and Stop Word Process
Explode(input): read the input -> remove punctuation -> split the sentence into words -> return
Stopword(term): read the exploded input -> English stopword? (Y: remove the English term) -> Indonesian stopword? (Y: remove the Indonesian term) -> terms without stopwords -> return
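The explode and stopword steps above can be sketched as follows. The two small stopword sets are placeholders invented for this example (the system itself keeps its lists in the stop_word_eng and stop_word_indo tables), and lowercasing the input is an added assumption that the slides do not mention.

```python
import string

# Placeholder stopword sets for illustration only; the real lists live in
# the stop_word_eng and stop_word_indo tables.
STOPWORDS_EN = {"the", "of", "and", "a", "in"}
STOPWORDS_ID = {"yang", "dan", "di", "dari"}


def explode(text):
    """Remove punctuation, then split the sentence into words."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercasing is an assumption, not a step shown in the flowchart.
    return cleaned.lower().split()


def remove_stopwords(terms):
    """Drop every term found in the English or Indonesian stopword set."""
    return [t for t in terms if t not in STOPWORDS_EN and t not in STOPWORDS_ID]


terms = remove_stopwords(explode("The analysis of complex models, dan hasil yang detail."))
print(terms)  # ['analysis', 'complex', 'models', 'hasil', 'detail']
```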
Stemming Process and f(qi,D) Calculation Process
Stemming(term): take the term without stopwords -> irregular verb? -> term found in english_lib? -> apply English or Indonesian stemming -> stemmed term -> return
Compute f(qi,D): read the term weights -> TF = (title * title weight) + (description * description weight) + fullbody_term - (title + description) * fullbody weight -> update total_term with TF -> return
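The weighted term frequency f(qi,D) can be illustrated with a short sketch. The slide leaves the grouping of the last term ambiguous. This example assumes the intended meaning is (fullbody_term - (title + description)) * fullbody weight, that is, occurrences in the body text only. The weights 3, 2 and 1 are made-up values, not the ones used by the system.

```python
def weighted_tf(title, description, fullbody, w_title, w_description, w_fullbody):
    """Weighted term frequency of one term in one article.

    Assumed grouping: the full-body count minus the occurrences already
    counted in the title and description is multiplied by the body weight.
    """
    body_only = fullbody - (title + description)
    return title * w_title + description * w_description + body_only * w_fullbody


# Hypothetical weights: title 3, description 2, full body 1.
f_qi_d = weighted_tf(title=1, description=1, fullbody=5,
                     w_title=3, w_description=2, w_fullbody=1)
print(f_qi_d)  # 8
```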
Total Article Process and IDF Calculation Process
Compute the article's total terms: sum total_term in doc_term for the identifier -> update total_term in article with the sum -> return
Compute IDF: count the articles (N) -> take every master_term_id from master_term -> count the articles in doc_term that contain the master_term (n) -> IDF = log10(((N - n) + 0.5) / (n + 0.5)) + log10(0.5 / (N + 0.5)) * -1 -> update idf with IDF -> repeat while terms remain, then return
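The IDF formula above can be written as a small function. The second term, -log10(0.5 / (N + 0.5)), shifts the result so that it is never negative: a term that occurs in every article gets an IDF of 0. The function and argument names are chosen for this sketch only.

```python
import math


def idf(total_articles, articles_with_term):
    """IDF as given in the slides, using base-10 logarithms."""
    N, n = total_articles, articles_with_term
    return math.log10(((N - n) + 0.5) / (n + 0.5)) - math.log10(0.5 / (N + 0.5))


# A term found in 2 of 4 articles: log10(1) + log10(9), about 0.954.
print(round(idf(4, 2), 3))  # 0.954
# A term found in all 4 articles gets the minimum value, 0.
print(round(idf(4, 4), 3))  # 0.0
```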
Avgdl Process and Search Process
Compute avgdl: read every total_term in article -> compute the mean of total_term -> update average_article with the mean total term -> return
Search: start -> input keyword -> explode -> stemming -> stopword -> find all articles that contain the keyword -> found? (Y: compute OKAPI -> sort the results) -> end
OKAPI Process and User Management Process
Compute OKAPI: take idf where word = search keyword -> take total_term from doc_term (f(qi,D)) -> sum fullbody_term from doc_term (|D|) -> TF = (f(qi,D) * (k1 + 1)) / (f(qi,D) + k1 * (1 - b + b * (|D| / avgdl))) -> return
k1 = 2, b = 0.75
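A minimal sketch of the OKAPI calculation, using k1 = 2 and b = 0.75 as stated above. The slide shows only the TF part. Multiplying it by the IDF of each query term and summing over the terms follows the standard BM25 definition and is an assumption about how the system combines the two. All function names and the sample numbers are hypothetical.

```python
K1 = 2.0
B = 0.75


def bm25_tf(f, doc_len, avgdl, k1=K1, b=B):
    """Length-normalised term frequency part of BM25."""
    return (f * (k1 + 1)) / (f + k1 * (1 - b + b * (doc_len / avgdl)))


def bm25_score(query_terms, doc_tf, doc_len, avgdl, idf):
    """Sum IDF * TF over the query terms (standard BM25 combination)."""
    return sum(
        idf.get(term, 0.0) * bm25_tf(doc_tf.get(term, 0), doc_len, avgdl)
        for term in query_terms
    )


# Made-up numbers: the term occurs twice in a document of average length.
score = bm25_score(
    query_terms=["complex"],
    doc_tf={"complex": 2},
    doc_len=10,
    avgdl=10,
    idf={"complex": 1.0},
)
print(score)  # 1.5
```

Articles would then be sorted by this score in descending order, which corresponds to the sorting step of the search process.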
User management: start -> read member data -> valid? (Y: member file) -> end
Message Management
Start -> read the message input -> valid? (Y: message file) -> end
Entity Relationship Diagram (ERD)
Relationship labels in the diagram: has, writes, enters.
Entities and attributes:
article: oai_identifier, datestamp, dc_title, dc_description, journal, editor, series, dc_publisher, volume, number, month, address, book_title, pages, dc_format, dc_type, dc_identifier, dc_language, dc_coverage, dc_rights, oai_id, published, approval, total_terms, category
article_average: article_average
category: category_name
citation: oai_identifier, citate
contributor: contributor_id, oai_identifier, contributor
creator: creator_id, oai_identifier, creator_name
date_article: date_id, oai_identifier, dc_date
doc_term: oai_identifier, master_term_id, title_term, description_term, fullbody_term, creator_term, total_term
english_lib: id, kata
harvest_time: oai_id, date_from, date_until
indexing_time: oai_identifier, time
irreg_verb: id, kata_dasar, kata_bkn_dasar
message: from, email, subject, message, message_status
master_term: master_term_id, word, idf
oai_request: oai_id, oai_url, oai_status, referfolder
refrerence: reference_id, oai_identifier, relation
source: oai_identifier, download_status, source
stop_word_eng: id, kata
stop_word_indo: id, kata
subject: subject_id, oai_identifier, subject
term: title_term, description_term, fullbody_term
user: username, password, user_status, fullname, email, institution, profession, last_visit, join_date
OKAPI BM25
• Okapi BM25 is a ranking function used by search engines to rank documents according to their relevance to a given query.
OKAPI BM25 Formula
Inverse Document Frequency
Article Example
ID    Title                   Description          Content
Oai1  complex stockhast       Numer analysi        Model complex real detail analysi build
Oai2  Managed abstrach build  Manner detail        Join creation numer make possibl
Oai3  Structur detail possibl Real abstrach world  Make detail usual manner
Oai4  Build world explor      Analysi detail       Managed stockhast replicating complex explor
Manual & Program IDF Calculation
[Figure: IDF computed manually and by the program]
Manual & Program OKAPI Calculation
Keyword example: complex
[Figure: OKAPI score computed manually and by the program]
Recall Precision
Article: 500
Keyword: Network System
Search results = 198 articles
Possibly relevant results = 29 articles
Relevant articles in results = 12
Recall = 12/12 * 100% = 100%
Precision = 12/198 * 100% = 6%

OAI identifier                     Relevant  Search rank
oai:CiteSeerXPSU:10.1.1.1.3301     no        15
oai:CiteSeerXPSU:10.1.1.1.8714     no        12
oai:CiteSeerXPSU:10.1.1.11.3246    yes       8
oai:CiteSeerXPSU:10.1.1.131.2961   no        6
oai:CiteSeerXPSU:10.1.1.133.114    yes       3
oai:CiteSeerXPSU:10.1.1.133.5166   no        16
oai:CiteSeerXPSU:10.1.1.134.7415   no        25
oai:CiteSeerXPSU:10.1.1.135.7151   no        13
oai:CiteSeerXPSU:10.1.1.138.8592   yes       5
oai:CiteSeerXPSU:10.1.1.143.7835   yes       24
oai:CiteSeerXPSU:10.1.1.143.9199   no        28
oai:CiteSeerXPSU:10.1.1.147.3140   yes       9
oai:CiteSeerXPSU:10.1.1.148.6013   yes       10
oai:CiteSeerXPSU:10.1.1.149.7229   no        18
oai:CiteSeerXPSU:10.1.1.2.8672     no        29
oai:CiteSeerXPSU:10.1.1.2.876      yes       4
oai:CiteSeerXPSU:10.1.1.28.2069    no        21
oai:CiteSeerXPSU:10.1.1.28.3751    no        23
oai:CiteSeerXPSU:10.1.1.31.5233    yes       17
oai:CiteSeerXPSU:10.1.1.32.3394    no        19
oai:CiteSeerXPSU:10.1.1.34.422     yes       20
oai:CiteSeerXPSU:10.1.1.37.133     no        26
oai:CiteSeerXPSU:10.1.1.37.886     no        27
oai:CiteSeerXPSU:10.1.1.46.7941    yes       1
oai:CiteSeerXPSU:10.1.1.5.5436     yes       2
oai:CiteSeerXPSU:10.1.1.61.8860    no        22
oai:CiteSeerXPSU:10.1.1.62.5142    no        14
oai:CiteSeerXPSU:10.1.1.8.4971     no        11
oai:CiteSeerXPSU:10.1.1.94.3465    yes       7
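The recall and precision figures for the "Network System" query can be reproduced with a few lines of arithmetic. The function name and argument names are chosen for this sketch. The three counts are the ones reported in the slides: 12 relevant articles retrieved, 12 relevant articles in total, 198 articles returned.

```python
def recall_precision(relevant_retrieved, relevant_total, retrieved_total):
    """Return (recall, precision) as percentages."""
    recall = relevant_retrieved / relevant_total * 100
    precision = relevant_retrieved / retrieved_total * 100
    return recall, precision


recall, precision = recall_precision(
    relevant_retrieved=12, relevant_total=12, retrieved_total=198
)
print(f"Recall = {recall:.1f}%, Precision = {precision:.1f}%")
# Recall = 100.0%, Precision = 6.1%
```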
Recall Precision (continued)
Keyword: music model
Search results = 150 articles
Possibly relevant results = 30 articles
Relevant articles in results = 14
Recall = 14/14 * 100% = 100%
Precision = 14/150 * 100% = 9.3%

OAI identifier                     Relevant  Search rank
oai:CiteSeerXPSU:10.1.1.10.1860    yes       19
oai:CiteSeerXPSU:10.1.1.10.2860    no        29
oai:CiteSeerXPSU:10.1.1.111.3072   yes       18
oai:CiteSeerXPSU:10.1.1.127.8691   yes       21
oai:CiteSeerXPSU:10.1.1.130.1856   yes       6
oai:CiteSeerXPSU:10.1.1.133.7089   no        27
oai:CiteSeerXPSU:10.1.1.140.3374   no        10
oai:CiteSeerXPSU:10.1.1.140.8940   yes       25
oai:CiteSeerXPSU:10.1.1.142.7598   no        12
oai:CiteSeerXPSU:10.1.1.149.6567   yes       30
oai:CiteSeerXPSU:10.1.1.152.2688   yes       11
oai:CiteSeerXPSU:10.1.1.154.24     no        16
oai:CiteSeerXPSU:10.1.1.154.2529   yes       20
oai:CiteSeerXPSU:10.1.1.155.1750   no        33
oai:CiteSeerXPSU:10.1.1.16.7401    no        32
oai:CiteSeerXPSU:10.1.1.17.1013    yes       1
oai:CiteSeerXPSU:10.1.1.18.6229    no        13
oai:CiteSeerXPSU:10.1.1.2.6849     no        31
oai:CiteSeerXPSU:10.1.1.2.8672     no        8
oai:CiteSeerXPSU:10.1.1.20.3633    yes       15
oai:CiteSeerXPSU:10.1.1.31.5233    yes       7
oai:CiteSeerXPSU:10.1.1.32.5049    no        24
oai:CiteSeerXPSU:10.1.1.34.7828    yes       4
oai:CiteSeerXPSU:10.1.1.4.677      yes       5
oai:CiteSeerXPSU:10.1.1.4.7323     yes       3
oai:CiteSeerXPSU:10.1.1.5.1181     no        23
oai:CiteSeerXPSU:10.1.1.5.4681     no        17
oai:CiteSeerXPSU:10.1.1.52.4788    no        28
oai:CiteSeerXPSU:10.1.1.57.3576    no        14
oai:CiteSeerXPSU:10.1.1.59.9118    no        9
Recall Precision (continued)
Keyword: music analysis
Search results = 116 articles
Possibly relevant results = 23 articles
Relevant articles in results = 10
Recall = 10/10 * 100% = 100%
Precision = 10/116 * 100% = 8.6%

OAI identifier                     Relevant  Search rank
oai:CiteSeerXPSU:10.1.1.10.2860    yes       22
oai:CiteSeerXPSU:10.1.1.10.3132    yes       2
oai:CiteSeerXPSU:10.1.1.140.3374   no        3
oai:CiteSeerXPSU:10.1.1.140.8940   no        9
oai:CiteSeerXPSU:10.1.1.145.8953   yes       5
oai:CiteSeerXPSU:10.1.1.149.6567   no        23
oai:CiteSeerXPSU:10.1.1.154.2529   yes       19
oai:CiteSeerXPSU:10.1.1.155.1750   yes       17
oai:CiteSeerXPSU:10.1.1.155.4454   yes       10
oai:CiteSeerXPSU:10.1.1.156.2520   yes       20
oai:CiteSeerXPSU:10.1.1.18.6229    no        13
oai:CiteSeerXPSU:10.1.1.2.6849     no        21
oai:CiteSeerXPSU:10.1.1.2.8672     yes       1
oai:CiteSeerXPSU:10.1.1.25.747     no        18
oai:CiteSeerXPSU:10.1.1.29.4192    no        11
oai:CiteSeerXPSU:10.1.1.34.7828    yes       7
oai:CiteSeerXPSU:10.1.1.4.7323     no        4
oai:CiteSeerXPSU:10.1.1.5.1181     no        16
oai:CiteSeerXPSU:10.1.1.5.4681     yes       15
oai:CiteSeerXPSU:10.1.1.52.4788    no        12
oai:CiteSeerXPSU:10.1.1.59.9118    no        6
oai:CiteSeerXPSU:10.1.1.6.3984     no        14
oai:CiteSeerXPSU:10.1.1.6.757      no        8
Indexing Time
Article: 500

Number of articles  Time required (s)
100                 805.1392138
200                 1646.911684
300                 2509.824728
400                 3514.183314
500                 4744.517922
Search Time
Article: 500

Keyword               Search results  Time (s)
computer analysis     140 articles    0.549877882004
user applications     92 articles     0.547022104263
work scheme           92 articles     0.491093873978
high image transform  101 articles    0.498678922653
network               76 articles     0.270733833313
Conclusion
1. The system can only harvest metadata in the oai_dc metadata format.
2. The system can only update automatically from approved URLs.
3. The time the system needs to return keyword-related articles varies with the number of articles returned.
4. Recall on the search results is very good, averaging 100%, while precision is poor, averaging less than 10%. The results are still acceptable because all of the possibly relevant articles are ranked within about the first 30 results.
Suggestion
1. The system could be developed further so that it also acts as a data provider.
2. The system could be extended to harvest other metadata formats dynamically.
Thank You For Your Attention