presentation is r

7/31/2019 Presentation is r


Performance Enhancement and Customization

of Information Storage and Retrieval System

Synopsis

for the Degree of

Doctor of Philosophy

in

Department of Engineering

submitted to

MEWAR UNIVERSITY GANGRAR

CHITTORGARH (RAJASTHAN)

Research Supervisor: Dr. Suresh Jain

Research Scholar: Dharmendra Sharma



Agenda

Introduction

Literature survey

Objective of research work

Proposed methodology



Introduction

Information can be organized in structured, semi-structured, and unstructured form

Information storage and retrieval

Google, LexisNexis, Dialog

Language dependency

Multiple spellings: color / colour

Word ambiguity: bat (cricket bat or the animal)

Context: "I eat what I see" and "I see what I eat"



Introduction

Web search analysis for AltaVista

Informational 48%

Transactional 30%

Navigational 22%

Balance of Recall and Precision



Literature survey

Historical development of ISR systems

Main problems of information storage and retrieval

Document or query Indexing and translation

Information storage and retrieval model

System evaluation



Document indexing

H.C. Yang and C.H. Lee proposed three different strategies

Dictionary based

Thesaurus based

Corpus based

Hull and Grefenstette

Dictionary-based context evaluation without word sense disambiguation



Word Sense Disambiguation (WSD)

There are two ways to solve WSD

Supervised learning

Unsupervised learning

P. Bhattacharyya et al. evaluate the context

Context of the word in the sentence

Context evaluated using WordNet
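
A minimal sketch of context-based disambiguation, assuming a simplified Lesk-style overlap count. It is not the method of the cited work: the two glosses below are hypothetical stand-ins for WordNet definitions, and the sense labels are invented for illustration.

```python
def simplified_lesk(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context.

    Stop words are not filtered, so common words such as "the" also count.
    """
    context = {word.lower() for word in context_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses for the ambiguous word "bat" (not taken from WordNet).
glosses = {
    "bat_cricket": "wooden club used to hit the ball in cricket",
    "bat_animal": "nocturnal flying mammal with wings",
}

cricket_sentence = "he hit the ball with his bat".split()
animal_sentence = "a bat spread its wings with ease".split()

print(simplified_lesk(cricket_sentence, glosses))
print(simplified_lesk(animal_sentence, glosses))
```

With real data the glosses would come from a lexical resource such as WordNet, and the context would normally be filtered for stop words.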



ISR Models

The document set is converted into a suitable representation

There are three types of representation, or model

Boolean model

Probabilistic model

Algebraic model



Boolean model

Standard Boolean model

Based on classical set theory

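Let the original documents be O = {O1, O2, O3, ..., On}

Set of terms T = {t1, t2, t3, t4, ..., tn}

Document set D = {d1, d2, d3, ..., di, ..., dn}, where each di is a subset of T (an element of the power set of T)

Let d1 = {t1, t2}, d2 = {t2, t3}, d3 = {t2, t4}

Let Q = t2 AND t3. The documents retrieved for t2 are {d1, d2, d3} and for t3 are {d2}. Calculate {d1, d2, d3} intersection {d2}

The result is {d2}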
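
The worked example on this slide can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, and the query is assumed to be a conjunction (t2 AND t3), which is what the intersection step implies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def boolean_and(index, query_terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Documents from the slide: each document is a subset of the term set T.
docs = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t2", "t4"}}
index = build_inverted_index(docs)

result = boolean_and(index, ["t2", "t3"])
print(sorted(result))  # ['d2']
```

The posting set for t2 is {d1, d2, d3} and for t3 is {d2}, so the intersection reproduces the result on the slide.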



Standard Boolean model (cont.)

Pros

Clear formalism

Easy to implement

Cons

Exact matching may retrieve too few or too many documents

All terms have equal weight

No way to rank the output

More like data retrieval than information retrieval



Boolean model

Extended Boolean model

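Documents are represented as vectors

The i-th term represents the i-th dimension

Term weights are calculated as tf-idf

$v_{d_j} = (w_{1j}, w_{2j}, \ldots, w_{ij})$

Let the weights of terms k1 and k2 be $w_1$ and $w_2$

Then $Q(k_1 \text{ or } k_2) = \sqrt{\dfrac{w_1^2 + w_2^2}{2}}$

$Q(k_1 \text{ and } k_2) = 1 - \sqrt{\dfrac{(1 - w_1)^2 + (1 - w_2)^2}{2}}$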
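
A small sketch of the two similarity formulas, assuming they are the usual two-term Euclidean (p = 2) forms of the extended Boolean model and that both weights are normalized to the range 0 to 1. The function names are hypothetical.

```python
import math


def sim_or(w1, w2):
    """Similarity for the query (k1 OR k2): distance from the point (0, 0)."""
    return math.sqrt((w1 ** 2 + w2 ** 2) / 2)


def sim_and(w1, w2):
    """Similarity for the query (k1 AND k2): one minus distance from (1, 1)."""
    return 1 - math.sqrt(((1 - w1) ** 2 + (1 - w2) ** 2) / 2)


# A document that contains k1 with full weight and lacks k2 entirely.
print(round(sim_or(1.0, 0.0), 4))   # 0.7071
print(round(sim_and(1.0, 0.0), 4))  # 0.2929
```

Unlike the standard Boolean model, a document matching only one term gets a partial score for both operators, which makes ranking possible.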



Boolean model

Fuzzy retrieval

Membership is defined by degree

Mixed Min and Max (MMM) model

Paice model

Neither provides a way to evaluate query weights

Query weights are handled by the P-norm algorithm
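
A hedged sketch of the Mixed Min and Max idea: the score of an OR query leans toward the largest term membership, and the score of an AND query leans toward the smallest. The coefficient value 0.7 is an illustrative assumption, not a value from this presentation, and the function names are hypothetical.

```python
def mmm_or(memberships, c_or1=0.7):
    """OR query: weighted mix that favours the maximum membership degree."""
    c_or2 = 1.0 - c_or1
    return c_or1 * max(memberships) + c_or2 * min(memberships)


def mmm_and(memberships, c_and1=0.7):
    """AND query: weighted mix that favours the minimum membership degree."""
    c_and2 = 1.0 - c_and1
    return c_and1 * min(memberships) + c_and2 * max(memberships)


# Membership degrees of one document in the fuzzy sets of two query terms.
degrees = [0.2, 0.8]
print(round(mmm_or(degrees), 2))   # 0.62
print(round(mmm_and(degrees), 2))  # 0.38
```

Setting either coefficient to 1.0 reduces the functions to the pure fuzzy-set max and min operators.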



Probabilistic model

Binary Independence model

Uncertain Inference

Language model



Probabilistic model

Binary Independence model

Introduced by Yu and Salton

Documents are treated as binary vectors: only the presence or absence of a term in a document is recorded, as 1 or 0

Terms are independently distributed in relevant and irrelevant documents

Documents are represented as ordered sets of Boolean variables



Probabilistic model

Uncertain inference

Proposed by van Rijsbergen

The measure of uncertainty relating document d to query q is the probability of the logical implication, P(d -> q)

The system's task is to infer, given a document, whether the query assertions are true

A knowledge base of facts and rules is used



Probabilistic model

Language Model

A language model is associated with each document

Documents are ranked by the probability that the document's language model would generate the terms of the query

Unigram model
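
A minimal sketch of unigram query-likelihood ranking. The smoothing step (linear interpolation with collection statistics, weight `lam`) is an assumption added so that a missing query term does not zero out a document; the slide itself only names the unigram model. The two sample documents are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log-probability that the document's unigram model generates the query."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy documents, already tokenized.
docs = {
    "d1": "information retrieval ranks documents".split(),
    "d2": "grid computing shares resources".split(),
}
collection = [term for terms in docs.values() for term in terms]
query = "information retrieval".split()

ranked = sorted(
    docs,
    key=lambda doc_id: query_log_likelihood(query, docs[doc_id], collection),
    reverse=True,
)
print(ranked)  # ['d1', 'd2']
```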



Algebraic model

Vector space model

Latent semantic model



Algebraic model

Vector space model

Documents and queries are represented as vectors of terms

Term weights are assigned by the tf-idf scheme

The i-th term represents the i-th dimension of the vector

Similarity is calculated as the correlation between the query and document vectors
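
A sketch of vector space ranking, assuming raw term frequency, idf = log(N / df), and cosine similarity as the correlation measure. Other tf-idf variants exist, and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return sparse tf-idf vectors per document plus the idf table."""
    n_docs = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, already tokenized.
docs = {
    "d1": "boolean model set theory".split(),
    "d2": "vector space model tf idf".split(),
    "d3": "grid computing".split(),
}
vectors, idf = tf_idf_vectors(docs)

# The query is weighted with the same idf table; unknown terms get weight 0.
query = "vector model".split()
query_vector = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d2', 'd1', 'd3']
```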



Algebraic model

Latent semantic indexing

Mathematical technique (singular value decomposition)

Words used in the same contexts tend to have similar meanings

Identifies patterns in the relationships between terms and concepts



System Evaluation

Precision

Recall

Fall-out

F-measure

Average precision

Mean average precision
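
The measures listed on this slide can be sketched with their standard set-based definitions. The function names and the sample collection are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def fall_out(retrieved, relevant, collection):
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = set(collection) - set(relevant)
    if not non_relevant:
        return 0.0
    return len(set(retrieved) & non_relevant) / len(non_relevant)


def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


# Hypothetical example: six documents, three relevant, four retrieved in rank order.
collection = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = ["d1", "d3", "d5"]
ranking = ["d1", "d2", "d3", "d4"]

p = precision(ranking, relevant)
r = recall(ranking, relevant)
print(p, round(r, 4), round(fall_out(ranking, relevant, collection), 4))
print(round(f_measure(p, r), 4), round(average_precision(ranking, relevant), 4))
```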



Impact of research on other domains

Research on information storage and retrieval systems draws on achievements and techniques in several related areas.

Information access: document indexing, retrieval, filtering, clustering, presentation and summarization of information, cross-language information retrieval.

Machine translation: comparable and parallel text alignment, language generation.

Computational linguistics: morphological analysis, syntactic parsing, techniques for disambiguation, document segmentation, corpus analysis, term recognition and term expansion.



Objectives of proposed work

i. Investigation of various techniques useful for information storage and retrieval, and their comparison.

ii. Apply grid computing and domain knowledge as supervised learning to eliminate word sense ambiguity from the query.

iii. Analysis of models to store the data and improvement of document representation using the following methods:

Clustering techniques

Concept mapping

iv. Evaluation and customization of information storage and retrieval models according to users' needs.

v. Experimental evaluation of the developed algorithms.



Proposed Methodology

Grid computing for semantic database

Domain creation by knowledge base

Clustering algorithms

Concept mapping for mapping between clusters



Methodology

Grid computing

Language limitations: color / colour, bat

I eat what I see / I see what I eat

Combining different domains to achieve a common goal

Difficulty in corpus analysis

Domain creation on the basis of a knowledge base

Parallel processing for different semantic networks



Methodology

Clustering

Research gap

Grouping of data into meaningful categories

Document descriptors and descriptor extraction

The following algorithms will be used:

Hierarchical algorithms: agglomerative (associating) and divisive (dividing) grouping of documents

Ontology-supported clustering

Graph-based clustering

Concept map

Mapping between clusters

How ideas are created from words

Decision-making system

Words and concepts are related to each other and to the whole idea



Methodology

The following criteria will be used to evaluate performance

Turnaround time

Response time

Precision



Methodology

Recall

Fall-out

F-measure



Bibliography

[1] Wissam Tawileh, Chair of Business Informatics, "Exploring web search behavior of Arab internet users", IEEE International Conference on Innovation in Information Technology, Dresden, Germany, 2011.

[2] D. B. Cleveland and A. D. Cleveland, Introduction to Indexing and Abstracting, Englewood, CO: Libraries Unlimited, Inc., 1990.

[3] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation 28, 11-21, 1972.

[4] G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, 24, 513-523, 1988.

[5] H.C. Yang and C.H. Lee, "Multi Lingual Information Retrieval", International Conference on Intelligent System Design and Application, Kaohsiung, Taiwan, 2008.

[6] D. A. Hull and G. Grefenstette, "Querying across languages: a dictionary-based approach to multilingual information retrieval", International Conference on Research and Development in Information Retrieval, 1996.

[7] Singhal, Amit, "Modern Information Retrieval: A Brief Overview", Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2001.

[8] Maron, Melvin E., "An Historical Note on the Origins of Probabilistic Indexing", Information Processing and Management, 2008.

[9] Pushpak Bhattacharyya, M. Sinha, M. K. Reddy, P. Pande, L. Kashyap, "Hindi Word Sense Disambiguation", Indian Institute of Technology, Bombay, India, 2011.

[10] Lashkari, Mahdavi, Ghomi, "A Boolean Model in Information Retrieval for Search Engines", 2009.



Bibliography

[11] Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Standard Boolean model)

[12] Turpin, Andrew; Scholer, Falk, "User performance versus precision measures for simple search tasks", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), Seattle, Washington, USA, August 06-11, 2006, New York, NY: ACM, pp. 11-18. doi:10.1145/1148170.1148176. ISBN 1595933697.

[13] Salton, Gerard; Edward A. Fox; Harry Wu, "Extended Boolean information retrieval", Communications of the ACM, Volume 26, Issue 11, 1983.

[14] Lee, W. C.; E. A. Fox, "Experimental Comparison of Schemes for Interpreting Boolean Queries", 1988.

[15] Kang, Bo-Yeong; Dae-Won Kim; Hae-Jung Kim, "Fuzzy Information Retrieval Indexed by Concept Identification", Springer Berlin / Heidelberg, 2005. Zadrozny, Sławomir; Nowacka, Katarzyna, "Fuzzy information retrieval model revisited", Elsevier North-Holland, Inc., 2009, doi:10.1016/j.fss.2009.02.012.

[16] Fox, E. A.; S. Sharat, "A Comparison of Two Methods for Soft Boolean Interpretation in Information Retrieval", Technical Report TR-86-1, Virginia Tech, Department of Computer Science, 1986.

[17] Ding, C., "A Similarity-based Probability Model for Latent Semantic Indexing", Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 59-65.

[18] Deerwester, S., et al., "Improving Information Retrieval with Latent Semantic Indexing", Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40.

[19] Bartell, B., Cottrell, G., and Belew, R., "Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling", Proceedings, ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 161-167.

[20] Dumais, S., and Nielsen, J., "Automating the Assignment of Submitted Manuscripts to Reviewers", Proceedings of the Fifteenth Annual International Conference on Research and Development in Information Retrieval, 1992, pp. 233-244.



Bibliography

[21] Berry, M. W., and Browne, M., Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics, Philadelphia, 2005.

[22] Maletic, J. I., and Marcus, A., "Using Latent Semantic Analysis to Identify Similarities in Source Code to Support Program Understanding", Proceedings of 12th IEEE International Conference on Tools with Artificial Intelligence, Vancouver, British Columbia, November 13-15, 2000, pp. 46-53.

[23] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing", Communications of the ACM, vol. 18, nr. 11, pages 613-620, 1975. (Article in which a vector space model was presented)

[24] David Dubin, "The Most Influential Paper Gerard Salton Never Wrote", 2004. (Explains the history of the Vector Space Model and the non-existence of a frequently cited publication)

[25] Yarowsky, D., and Florian, R., "Taking the Load off the Conference Chairs: Towards a Digital Paper-routing Assistant", Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very-Large Corpora, 1999, pp. 220-230.

[26] Soboroff, I., et al., "Visualizing Document Authorship Using N-grams and Latent Semantic Indexing", Workshop on New Paradigms in Information Visualization and Manipulation, 1997, pp. 43-48.

[27] Yuanhua Lv and ChengXiang Zhai, "Positional Language Models for Information Retrieval", Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009.

[28] Buyya, Rajkumar, "Grid Computing: Making the Global Cyberinfrastructure for eScience a Reality" (PDF), CSI Communications (Mumbai, India: Computer Society of India (CSI)) 29 (1). Francesco Lelli, Eric Frizziero, Michele Gulmini, Gaetano Maron, Salvatore Orlando, Andrea Petrucci and Silvano Squizzato, "The many faces of the integration of instruments and the grid", International Journal of Web and Grid Services, Vol. 3, No. 3, pp. 239-266, 2007.

[29] Goodrum, Abby A., "Image Information Retrieval: An Overview of Current Research", Informing Science, 2000.

[30] J. M. Ponte and W. B. Croft, "A Language Modeling Approach to Information Retrieval", Research and Development in Information Retrieval, pp. 275-281, 1998.



Bibliography

[31] Beel, Jöran; Gipp, Bela; Stiller, Jan-Olaf, "Information Retrieval On Mind Maps - What Could It Be Good For?", Proceedings of the 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2009.

[32] Benedict, Shajulin; Vasudevan, "A Niched Pareto GA approach for scheduling scientific workflows in wireless Grids", Journal of Computing and Information Technology 16: 101, 2008.

[33] R. Korra, P. Sujatha, Sidige, Chetana, N. Kumar, "Performance Evaluation of Multilingual Information Retrieval System over Information Retrieval System", IEEE International Conference on Recent Trends in Information Technology, ICRTIT 2011, MIT, Anna University, Chennai, June 3-5, 2011.

[34] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A., "On Exploring Complex Relationships of Correlation Clusters", 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), pp. 7, 2007.

[35] Auffarth, B., "Clustering by a Genetic Algorithm with Biased Mutation Operator", WCCI CEC, IEEE, July 18-23, 2010.

[36] Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A., "Mining Hierarchies of Correlation Clusters", Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119-128, 2006.

[37] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2:283-304, 1998.

[38] Joseph D. Novak & Alberto J. Cañas, "The Theory Underlying Concept Maps and How To Construct and Use Them", Institute for Human and Machine Cognition. Accessed 24 Nov 2008.

[39] Moon, B.M., Hoffman, R.R., Novak, J.D., & Cañas, A.J., Applied Concept Mapping: Capturing, Analyzing and Organizing Knowledge, CRC Press: New York, 2011.

[40] Anderson, J. R., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ: Erlbaum.



Thank You
