presentation is r
TRANSCRIPT
-
7/31/2019 Presentation is r
1/32
Performance Enhancement and Customization
of Information Storage and Retrieval System
Synopsis
for the Degree of
Doctor of Philosophy
in
Department of Engineering
submitted to
MEWAR UNIVERSITY GANGRAR
CHITTORGARH (RAJASTHAN)
Research Supervisor: Dr. Suresh Jain
Research Scholar: Dharmendra Sharma
-
Agenda
Introduction
Literature survey
Objective of research work
Proposed Methodology
-
Introduction
Information can be organized in structured, semi-structured, and unstructured form
Information Storage and Retrieval
Google, LexisNexis, Dialog
Language dependency
Multiple spellings: color / colour
Word ambiguity: bat (cricket bat vs. the flying animal)
Context - I eat what I see and I see what I eat
-
Introduction
Web search analysis for AltaVista
Informational 48%
Transactional 30%
Navigational 22%
Balance of Recall and Precision
-
Literature survey
Historical development of ISR systems
Main problems of information storage and retrieval
Document or query Indexing and translation
Information storage and retrieval model
System evaluation
-
Document indexing
H.C. Yang and C.H. Lee proposed three different strategies
Dictionary based
Thesaurus based
Corpus based
Hull and Grefenstette
Dictionary-based context evaluation without word sense disambiguation
-
Word Sense Disambiguation
There are two ways to approach WSD
Supervised learning
Unsupervised learning
P. Bhattacharyya et al. evaluate the context
Context of the word in the sentence
Context evaluated using WordNet
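A minimal sketch of context-based disambiguation, assuming a simplified Lesk-style overlap count rather than the method used in the cited work. The SENSES inventory, its glosses, and the STOPWORDS list are hypothetical stand-ins for WordNet data.

```python
# Simplified Lesk-style disambiguation: choose the sense whose gloss shares
# the most words with the sentence. SENSES is a hypothetical, hand-made
# inventory standing in for WordNet glosses.
SENSES = {
    "bat": {
        "cricket": "wooden club used to hit the ball in cricket",
        "animal": "nocturnal flying mammal that lives in caves",
    }
}

# Hypothetical stopword list; function words are ignored in the context.
STOPWORDS = {"the", "a", "in", "to", "of", "that", "with", "used"}


def disambiguate(word, sentence):
    """Return the sense label of `word` with the largest gloss/context overlap."""
    context = {w for w in sentence.lower().split() if w not in STOPWORDS}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bat", "he hit the ball with a bat in the cricket match"))
print(disambiguate("bat", "the bat lives in caves and is a flying mammal"))
```

A real system would draw glosses and related synsets from WordNet and would need a tie-breaking rule; this sketch simply keeps the first best sense.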
-
ISR Models
The document set is converted into a suitable representation
There are three types of representation or model
Boolean model
Probabilistic model
Algebraic model
-
Boolean model
Standard Boolean model
Based on classical set theory
Let the original documents be O = {O1, O2, O3, ..., On}
Set of terms T = {t1, t2, t3, t4, ..., tn}
Document set D = {d1, d2, d3, ..., di, ..., dn}, where each di is a subset of T (an element of the power set of T)
Let d1 = {t1, t2}, d2 = {t2, t3}, d3 = {t2, t4}
Let Q = t2 AND t3. The documents retrieved for t2 are {d1, d2, d3} and for t3 are {d2}
Calculate {d1, d2, d3} ∩ {d2}
Result is {d2}
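The worked example on this slide can be sketched in a few lines of Python, assuming each document is stored as a set of terms. The names DOCS, postings, and boolean_and are illustrative only.

```python
# Standard Boolean retrieval over the slide's example:
# d1 = {t1, t2}, d2 = {t2, t3}, d3 = {t2, t4}, query = t2 AND t3.
DOCS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t2", "t4"},
}


def postings(term, docs):
    """Return the set of document names that contain `term`."""
    return {name for name, terms in docs.items() if term in terms}


def boolean_and(terms, docs):
    """Intersect the posting sets of all query terms."""
    result = set(docs)
    for term in terms:
        result &= postings(term, docs)
    return result


print(postings("t2", DOCS))             # documents containing t2
print(postings("t3", DOCS))             # documents containing t3
print(boolean_and(["t2", "t3"], DOCS))  # their intersection
```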
-
Standard Boolean model (Cont.)
Pros
Clear formalism
Easy to implement
Cons
Exact matching may retrieve too few or too many documents
All terms have equal weight
No way to rank the output
More like data retrieval than information retrieval
-
Boolean model
Extended Boolean model
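Documents are represented as vectors
The ith term represents the ith dimension
Term weights are calculated using tf-idf
$V_{d_j} = (w_{1j}, w_{2j}, \ldots, w_{nj})$
Let the weights of terms k1 and k2 be w1 and w2
Then $Q(k_1 \lor k_2) = \sqrt{\frac{w_1^2 + w_2^2}{2}}$
$Q(k_1 \land k_2) = 1 - \sqrt{\frac{(1 - w_1)^2 + (1 - w_2)^2}{2}}$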
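A small sketch of the two similarity formulas on this slide, generalized to any number of terms and an exponent p (p = 2 gives the formulas above). The function names and the example weights 0.6 and 0.8 are assumptions for illustration.

```python
# Extended Boolean (p-norm) similarity for OR and AND queries.
# With p = 2 and two terms these reduce to the slide's formulas.


def or_similarity(weights, p=2.0):
    """Similarity of a document to (k1 OR k2 OR ...), given its term weights."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def and_similarity(weights, p=2.0):
    """Similarity of a document to (k1 AND k2 AND ...), given its term weights."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Hypothetical tf-idf weights, normalized to [0, 1], for terms k1 and k2.
w1, w2 = 0.6, 0.8
print(or_similarity([w1, w2]))
print(and_similarity([w1, w2]))
```

The OR score rewards a document for any strongly weighted term, while the AND score penalizes it for every term that is far from weight 1.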
-
Boolean model
Fuzzy retrieval
Membership is defined by degree
Mixed Min and Max (MMM) model
Paice model
Neither provides a way to evaluate query weights
Query weights are handled by the P-norm algorithm
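As a rough illustration of the Mixed Min and Max idea, the sketch below scores a document as a linear combination of the largest and smallest membership degrees of the query terms. The coefficient value 0.7 and the function names are assumptions, not values taken from the slides.

```python
# Mixed Min and Max (MMM) similarity sketch.
# memberships: degrees (in [0, 1]) to which the document belongs to the
# fuzzy set of each query term. The coefficients are hypothetical.


def mmm_or(memberships, c_or=0.7):
    """OR query: weight the maximum more heavily than the minimum."""
    return c_or * max(memberships) + (1.0 - c_or) * min(memberships)


def mmm_and(memberships, c_and=0.7):
    """AND query: weight the minimum more heavily than the maximum."""
    return c_and * min(memberships) + (1.0 - c_and) * max(memberships)


print(mmm_or([0.2, 0.8]))
print(mmm_and([0.2, 0.8]))
```

Setting the coefficient to 1 recovers the pure fuzzy-set max (for OR) and min (for AND).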
-
Probabilistic model
Binary Independence model
Uncertain Inference
Language model
-
Probabilistic model
Binary Independence model
Introduced by Yu and Salton
Documents are assumed to be binary vectors: only the presence or absence of a term in a document is recorded, as 1 or 0
Terms are independently distributed in relevant and irrelevant documents
Each document is represented as an ordered set of Boolean variables
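One common way to turn these assumptions into a ranking score is to sum a log-odds weight for every query term present in the document. The sketch below assumes the per-term probabilities p (term occurs in a relevant document) and u (term occurs in a non-relevant document) are already estimated; the values used here are hypothetical.

```python
import math


def bim_score(query_terms, doc_terms, p, u):
    """Sum log-odds weights for query terms present in the document.

    p[t]: probability that term t occurs in a relevant document.
    u[t]: probability that term t occurs in a non-relevant document.
    Terms are treated as independent, so the weights simply add up.
    """
    score = 0.0
    for term in query_terms:
        if term in doc_terms:
            score += math.log(p[term] * (1.0 - u[term]) / (u[term] * (1.0 - p[term])))
    return score


# Hypothetical probability estimates for two query terms.
p = {"grid": 0.5, "bat": 0.5}
u = {"grid": 0.1, "bat": 0.1}
print(bim_score(["grid", "bat"], {"grid", "computing"}, p, u))
```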
-
Probabilistic model
Uncertain inference
Proposed by van Rijsbergen
The measure of uncertainty relating document d to query q is the probability of the logical implication P(d -> q)
The system's task is to infer, given a document, whether the query assertions are true
A knowledge base of facts and rules is used
-
Probabilistic model
Language Model
A language model is associated with each document
Documents are ranked by the probability that the document's language model would generate the terms of the query
Unigram model
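A minimal sketch of unigram query-likelihood ranking. It assumes linear interpolation with collection statistics (Jelinek-Mercer smoothing, with a hypothetical weight lam = 0.5) so that a query term missing from one document does not zero out the score; the slides do not specify a smoothing method.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log-probability that the document's unigram model generates the query.

    query, doc, collection: lists of tokens (collection = all documents joined).
    lam: interpolation weight between document and collection estimates
         (an assumed value, not taken from the slides).
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term never seen anywhere in the collection
        score += math.log(p)
    return score


d1 = "grid computing for semantic search".split()
d2 = "cricket bat and ball".split()
collection = d1 + d2
query = ["grid", "search"]
ranked = sorted(
    [("d1", d1), ("d2", d2)],
    key=lambda item: query_log_likelihood(query, item[1], collection),
    reverse=True,
)
print([name for name, _ in ranked])
```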
-
Algebraic model
Vector space model
Latent semantic model
-
Algebraic model
Vector space model
Documents and queries are represented as vectors of terms
Term weights are assigned by the tf-idf scheme
The ith term represents the ith dimension of the vector
Similarity is calculated as the correlation (e.g., the cosine of the angle) between the query and document vectors
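The vector space steps can be sketched as follows, assuming raw term frequency multiplied by log(N / df) as the tf-idf weight and cosine similarity as the correlation measure. The toy documents and helper names are illustrative.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency log(N / df) for every term in the collection."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms unknown to the collection are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["information", "retrieval", "system"],
    ["cricket", "bat", "ball"],
    ["information", "storage"],
]
idf = idf_table(docs)
query_vec = tfidf_vector(["information", "retrieval"], idf)
scores = [cosine(query_vec, tfidf_vector(doc, idf)) for doc in docs]
print(scores)
```

Documents are then ranked by descending score; a document sharing no terms with the query scores exactly zero.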
-
Algebraic model
Latent semantic indexing
Uses a mathematical technique (singular value decomposition)
Words used in the same contexts tend to have similar meanings
Identifies patterns and relationships between terms and concepts
-
System Evaluation
Precision
Recall
Fall out
F measure
Average precision
Mean average precision
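The measures listed on this slide can be computed as below. This is a sketch using the usual set-based definitions; the document identifiers and the collection size of 10 are made up for the example.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def fall_out(retrieved, relevant, collection_size):
    """Fraction of non-relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    non_relevant_total = collection_size - len(relevant)
    return len(retrieved - relevant) / non_relevant_total if non_relevant_total else 0.0


def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * p * r / (b2 * p + r)


def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


ranked = ["d1", "d2", "d3", "d4"]   # hypothetical system output, best first
relevant = {"d1", "d3", "d5"}       # hypothetical relevance judgments
p, r = precision(ranked, relevant), recall(ranked, relevant)
print(p, r, fall_out(ranked, relevant, 10), f_measure(p, r))
print(average_precision(ranked, relevant))
```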
-
Impact of research on other domains
Research on information storage and retrieval systems draws on achievements and techniques in several related areas.
Information access: document indexing, retrieval, filtering, clustering, presentation and summarization of information, cross-language information retrieval.
Machine translation: comparable and parallel text alignment, language generation.
Computational linguistics: morphological analysis, syntactic parsing, techniques for disambiguation, document segmentation, corpus analysis, term recognition and term expansion.
-
Objectives of proposed work
i. Investigation of various techniques useful for information storage and retrieval, and their comparison.
ii. Apply grid computing and domain knowledge as supervised learning to eliminate word sense ambiguity from the query.
iii. Analysis of models to store the data and improvement of document representation by using the following methods:
Clustering techniques
Concept mapping
iv. Evaluation and customization of information storage and retrieval models according to users' needs.
v. Experimental evaluation of the developed algorithms.
-
Proposed Methodology
Grid computing for semantic database
Domain creation by knowledge base
Clustering algorithms
Concept mapping for mapping between clusters
-
Methodology
Grid computing
Language limitations: color/colour, bat
I eat what I see / I see what I eat
Combining different domains to achieve a common goal
Difficulty in corpus analysis
Domain creation on the basis of a knowledge base
Parallel processing for different semantic networks
-
Methodology
Clustering
Research gap
Grouping of data into meaningful categories
Document descriptors and descriptor extraction
The following algorithms will be used:
Hierarchical clustering: merging (agglomerative) and dividing (divisive) the documents
Ontology-supported clustering
Graph-based clustering
Concept map
Mapping between clusters
How ideas are created from words
Decision making system
Words and concepts are related to each other and to the whole idea
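To make the hierarchical (merging) variant concrete, here is a small sketch of single-link agglomerative clustering over documents treated as term sets. Jaccard similarity and the stopping threshold of 0.2 are assumptions chosen for the example, not choices stated in the slides.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def agglomerative(docs, threshold=0.2):
    """Single-link agglomerative clustering.

    Starts with one cluster per document and repeatedly merges the two most
    similar clusters until the best similarity falls below `threshold`.
    Returns clusters as lists of document indices.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best_sim, best_x, best_y = -1.0, None, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(jaccard(docs[i], docs[j])
                          for i in clusters[x] for j in clusters[y])
                if sim > best_sim:
                    best_sim, best_x, best_y = sim, x, y
        if best_sim < threshold:
            break
        clusters[best_x] = clusters[best_x] + clusters[best_y]
        del clusters[best_y]
    return clusters


docs = [
    "grid computing cluster".split(),
    "grid computing network".split(),
    "cricket bat ball".split(),
    "cricket ball game".split(),
]
print(agglomerative(docs))
```

A divisive variant would work in the opposite direction, starting from one cluster and splitting it; the pairwise search here is quadratic per merge and is only meant for small examples.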
-
Methodology
The following criteria will be used to evaluate performance:
Turnaround time
Response time
Precision
-
Methodology
Recall
Fall out
F measure
-
Bibliography
[1] Wissam Tawileh, Chair of Business Informatics, "Exploring web search behavior of Arab internet users", IEEE International Conference on Innovation in Information Technology, Dresden, Germany, 2011.
[2] D. B. Cleveland and A. D. Cleveland, Introduction to Indexing and Abstracting, Englewood, CO: Libraries Unlimited, Inc., 1990.
[3] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation, 28, 11-21, 1972.
[4] G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, 24, 513-523, 1988.
[5] H.C. Yang and C.H. Lee, "Multi Lingual Information Retrieval", International Conference on Intelligent System Design and Application, Kaohsiung, Taiwan, 2008.
[6] D. A. Hull and G. Grefenstette, "Querying across languages: a dictionary-based approach to multilingual information retrieval", International Conference on Research and Development in Information Retrieval, 1996.
[7] Singhal, Amit, "Modern Information Retrieval: A Brief Overview", Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2001.
[8] Maron, Melvin E., "An Historical Note on the Origins of Probabilistic Indexing", Information Processing and Management, 2008.
[9] Pushpak Bhattacharyya, M. Sinha, M. K. Reddy, P. Pande, L. Kashyap, "Hindi Word Sense Disambiguation", Indian Institute of Technology, Bombay, India, 2011.
[10] Lashkari, Mahdavi, Ghomi, "A Boolean Model in Information Retrieval for Search Engines", 2009.
-
Bibliography
[11] Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Standard Boolean model)
[12] Turpin, Andrew; Scholer, Falk (2006), "User performance versus precision measures for simple search tasks", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), Seattle, Washington, USA, August 06-11, 2006, New York, NY: ACM, pp. 11-18. doi:10.1145/1148170.1148176. ISBN 1595933697.
[13] Salton, Gerard; Edward A. Fox; Harry Wu (1983), "Extended Boolean information retrieval", Communications of the ACM, Volume 26, Issue 11.
[14] Lee, W. C.; E. A. Fox (1988), "Experimental Comparison of Schemes for Interpreting Boolean Queries".
[15] Kang, Bo-Yeong; Dae-Won Kim; Hae-Jung Kim (2005), "Fuzzy Information Retrieval Indexed by Concept Identification", Springer Berlin / Heidelberg; Zadrożny, Sławomir; Nowacka, Katarzyna (2009), "Fuzzy information retrieval model revisited", Elsevier North-Holland, Inc., doi:10.1016/j.fss.2009.02.012.
[16] Fox, E. A.; S. Sharat (1986), "A Comparison of Two Methods for Soft Boolean Interpretation in Information Retrieval", Technical Report TR-86-1, Virginia Tech, Department of Computer Science.
[17] Ding, C., "A Similarity-based Probability Model for Latent Semantic Indexing", Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 59-65.
[18] Deerwester, S., et al., "Improving Information Retrieval with Latent Semantic Indexing", Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40.
[19] Bartell, B., Cottrell, G., and Belew, R., "Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling", Proceedings, ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 161-167.
[20] Dumais, S., and Nielsen, J., "Automating the Assignment of Submitted Manuscripts to Reviewers", Proceedings of the Fifteenth Annual International Conference on Research and Development in Information Retrieval, 1992, pp. 233-244.
-
Bibliography
[21] Berry, M. W., and Browne, M., Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics, Philadelphia, 2005.
[22] Sparc, Robertson, "Using Latent Semantic Analysis to Identify Similarities in Source Code to Support Program Understanding", Proceedings of 12th IEEE International Conference on Tools with Artificial Intelligence, Vancouver, British Columbia, November 13-15, 2000, pp. 46-53.
[23] G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing", Communications of the ACM, vol. 18, nr. 11, pages 613-620. (Article in which a vector space model was presented)
[24] David Dubin, "The Most Influential Paper Gerard Salton Never Wrote", 2004. (Explains the history of the Vector Space Model and the non-existence of a frequently cited publication)
[25] Yarowsky, D., and Florian, R., "Taking the Load off the Conference Chairs: Towards a Digital Paper-routing Assistant", Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very-Large Corpora, 1999, pp. 220-230.
[26] Soboroff, I., et al., "Visualizing Document Authorship Using N-grams and Latent Semantic Indexing", Workshop on New Paradigms in Information Visualization and Manipulation, 1997, pp. 43-48.
[27] Yuanhua Lv and ChengXiang Zhai, "Positional Language Models for Information Retrieval", in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009.
[28] Buyya, Rajkumar, "Grid Computing: Making the Global Cyberinfrastructure for eScience a Reality" (PDF), CSI Communications (Mumbai, India: Computer Society of India (CSI)), 29 (1); Francesco Lelli, Eric Frizziero, Michele Gulmini, Gaetano Maron, Salvatore Orlando, Andrea Petrucci and Silvano Squizzato, "The many faces of the integration of instruments and the grid", International Journal of Web and Grid Services, 2007, Vol. 3, No. 3, pp. 239-266.
[29] Goodrum, Abby A., "Image Information Retrieval: An Overview of Current Research", Informing Science, 2000.
[30] J. M. Ponte and W. B. Croft, "A Language Modeling Approach to Information Retrieval", Research and Development in Information Retrieval, pp. 275-281, 1998.
-
Bibliography
[31] Beel, Jöran; Gipp, Bela; Stiller, Jan-Olaf, "Information Retrieval On Mind Maps - What Could It Be Good For?", Proceedings of the 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2009.
[32] Benedict, Shajulin; Vasudevan, "A Niched Pareto GA approach for scheduling scientific workflows in wireless Grids", Journal of Computing and Information Technology 16: 101, 2008.
[33] R. Korra, P. Sujatha, Sidige, Chetana, N. Kumar, "Performance Evaluation of Multilingual Information Retrieval System over Information Retrieval System", IEEE International Conference on Recent Trends in Information Technology, ICRTIT 2011, MIT, Anna University, Chennai, June 3-5, 2011.
[34] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A., "On Exploring Complex Relationships of Correlation Clusters", 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), p. 7, 2007.
[35] Auffarth, B., "Clustering by a Genetic Algorithm with Biased Mutation Operator", WCCI CEC, IEEE, July 18-23, 2010.
[36] Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A., "Mining Hierarchies of Correlation Clusters", Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119-128, 2006.
[37] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2:283-304, 1998.
[38] Joseph D. Novak & Alberto J. Cañas, "The Theory Underlying Concept Maps and How To Construct and Use Them", Institute for Human and Machine Cognition. Accessed 24 Nov 2008.
[39] Moon, B.M., Hoffman, R.R., Novak, J.D., & Cañas, A.J., Applied Concept Mapping: Capturing, Analyzing and Organizing Knowledge, CRC Press: New York, 2011.
[40] Anderson, J. R., & Lebiere, C. (1998), The atomic components of thought, Mahwah, NJ: Erlbaum.
-
Thank You