presentation is r

Upload: riteshshah433

Post on 05-Apr-2018


  • 7/31/2019 Presentation is r

    1/32

    Performance Enhancement and Customization

    of Information Storage and Retrieval System

    Synopsis

    for the Degree of

    Doctor of Philosophy

    in

    Department of Engineering

    submitted to

    MEWAR UNIVERSITY GANGRAR

    CHITTORGARH(RAJASTHAN)

    Research Supervisor: Dr. Suresh Jain

    Research Scholar: Dharmendra Sharma


    Agenda

    Introduction

    Literature survey

    Objective of research work

    Proposed Methodology


    Introduction

    Information can be organized in structured, semi-structured, and unstructured form

    Information Storage and Retrieval

    Google, LexisNexis, Dialog

    Language dependency

    Multiple spellings: color / colour

    Word ambiguity: bat (cricket bat or the animal)

    Context: "I eat what I see" vs. "I see what I eat"


    Introduction

    Web search analysis for AltaVista

    Informational 48%

    Transactional 30%

    Navigational 22%

    Balance of Recall and Precision


    Literature survey

    Historical development of ISR systems

    Main problems of information storage and retrieval

    Document or query indexing and translation

    Information storage and retrieval model

    System evaluation


    Document indexing

    H.C. Yang and C.H. Lee proposed three different strategies

    Dictionary based

    Thesaurus based

    Corpus based

    Hull and Grefenstette

    Dictionary-based context evaluation without word sense disambiguation


    Word Sense Disambiguation

    There are two ways to approach WSD

    Supervised learning

    Unsupervised learning

    P. Bhattacharyya et al. evaluate the context

    Context of the word in the sentence

    Context evaluated using WordNet
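
As a rough illustration of context-based disambiguation, the sketch below uses a simplified Lesk-style gloss overlap. This is a swapped-in textbook technique, not the method of the cited work. The two-sense inventory `SENSES`, the stopword list, and the function names are hypothetical; a real system would read glosses from WordNet.

```python
import re

# Hypothetical sense inventory for illustration only; real glosses would come from WordNet.
SENSES = {
    "bat": {
        "sports": "a wooden club used to hit the ball in cricket or baseball",
        "animal": "a nocturnal flying mammal that roosts in caves",
    }
}

# Small hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "the", "in", "to", "of", "or", "that", "used", "his", "with", "and", "at"}

def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = tokens(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bat", "He hit the ball with his bat in the cricket match"))
print(simplified_lesk("bat", "The bat flew out of the caves at night"))
```

The overlap count stands in for "context of the word in the sentence"; ties fall back to the first sense listed.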


    ISR Models

    The document set is converted into a suitable representation

    There are three types of representation, or models

    Boolean model

    Probabilistic model

    Algebraic model


    Boolean model

    Standard Boolean model

    Based on classical set theory

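    Let the original documents be $O = \{O_1, O_2, O_3, \ldots, O_n\}$

    Set of terms $T = \{t_1, t_2, t_3, t_4, \ldots, t_n\}$

    Document set $D = \{d_1, d_2, d_3, \ldots, d_i, \ldots, d_n\}$, where each $d_i$ is an element of the power set of $T$

    Let $d_1 = \{t_1, t_2\}$, $d_2 = \{t_2, t_3\}$, $d_3 = \{t_2, t_4\}$

    Let the query be $Q = t_2 \land t_3$; the sets retrieved for $t_2$ and $t_3$ are $\{d_1, d_2, d_3\}$ and $\{d_2\}$

    Calculate $\{d_1, d_2, d_3\} \cap \{d_2\}$

    The result is $\{d_2\}$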
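
The worked example on this slide can be reproduced with plain set operations. This is a minimal sketch; the names `docs`, `postings`, and `boolean_and` are illustrative and not from the presentation.

```python
# Documents as sets of terms, copied from the slide's example.
docs = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t2", "t4"},
}

def postings(term):
    """Return the ids of all documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

def boolean_and(*terms):
    """Intersect the posting sets of all query terms."""
    result = set(docs)
    for term in terms:
        result &= postings(term)
    return result

print(sorted(postings("t2")))            # ['d1', 'd2', 'd3']
print(sorted(postings("t3")))            # ['d2']
print(sorted(boolean_and("t2", "t3")))   # ['d2']
```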


    Standard Boolean model (cont.)

    Pros

    Clear formalism

    Easy to implement

    Cons

    Exact matching may retrieve too few or too many documents

    All terms have equal weight

    Hard to rank the output

    More like data retrieval than information retrieval


    Boolean model

    Extended Boolean model

    Documents are represented as vectors

    The ith term corresponds to the ith dimension

    Term weights are calculated by tf-idf

    $V_{d_j} = (w_{1j}, w_{2j}, \ldots, w_{ij})$

    Let terms $k_1$ and $k_2$ have weights $w_1$ and $w_2$

    Then $Q(k_1 \lor k_2) = \sqrt{\dfrac{w_1^2 + w_2^2}{2}}$

    $Q(k_1 \land k_2) = 1 - \sqrt{\dfrac{(1 - w_1)^2 + (1 - w_2)^2}{2}}$
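
The two similarity formulas can be sketched as below, assuming term weights are normalized to the range 0 to 1. The function names are illustrative, and the exponent `p` is a generalization not shown on the slide; `p = 2` gives the slide's formulas.

```python
def pnorm_or(weights, p=2.0):
    """OR similarity: normalized p-norm distance from the all-zero point."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p=2.0):
    """AND similarity: one minus the normalized p-norm distance from the all-one point."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

w1, w2 = 0.6, 0.8
print(pnorm_or([w1, w2]))    # sqrt((0.36 + 0.64) / 2)
print(pnorm_and([w1, w2]))   # 1 - sqrt((0.16 + 0.04) / 2)
```

A document containing both terms with full weight scores 1 under both operators, and a document containing neither scores 0, so the result can be used directly for ranking.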


    Boolean model

    Fuzzy retrieval

    Membership is defined by degree

    Mixed Min and Max (MMM) model

    Paice model

    Neither model provides a way to evaluate query weights

    Query weights are handled by the P-norm algorithm
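
A minimal sketch of the MMM similarity, assuming the usual formulation in which OR leans toward the maximum membership degree and AND toward the minimum. The coefficient value 0.7 and the function names are illustrative assumptions, not values from the presentation.

```python
def mmm_or(degrees, c_or1=0.7):
    """OR similarity: weighted mix of max and min, dominated by the max."""
    return c_or1 * max(degrees) + (1.0 - c_or1) * min(degrees)

def mmm_and(degrees, c_and1=0.7):
    """AND similarity: weighted mix of min and max, dominated by the min."""
    return c_and1 * min(degrees) + (1.0 - c_and1) * max(degrees)

# Membership degrees of one document in the fuzzy sets of two query terms.
degrees = [0.2, 0.8]
print(mmm_or(degrees))
print(mmm_and(degrees))
```

Setting either coefficient to 1 recovers the plain fuzzy-set max (OR) and min (AND).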


    Probabilistic model

    Binary Independence model

    Uncertain Inference

    Language model


    Probabilistic model

    Binary Independence model

    Introduced by Yu and Salton

    Each document is treated as a binary vector: only the presence or absence of a term in the document is recorded, as 1 or 0

    Terms are independently distributed in relevant and irrelevant documents

    A document is represented as an ordered set of Boolean variables
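
A minimal sketch of the binary representation and of the standard log-odds term weight that follows from the independence assumption. The vocabulary and the probability estimates `P_REL` and `P_NONREL` are hypothetical values chosen for illustration.

```python
import math

VOCAB = ["t1", "t2", "t3", "t4"]

# Hypothetical estimates:
# P_REL[t]    = P(term present | document relevant)
# P_NONREL[t] = P(term present | document not relevant)
P_REL = {"t1": 0.5, "t2": 0.8, "t3": 0.6, "t4": 0.3}
P_NONREL = {"t1": 0.5, "t2": 0.4, "t3": 0.2, "t4": 0.3}

def to_binary_vector(doc_terms):
    """Record only presence (1) or absence (0) of each vocabulary term."""
    return [1 if t in doc_terms else 0 for t in VOCAB]

def term_weight(term):
    """Log odds ratio of the term appearing in relevant vs. non-relevant documents."""
    p, u = P_REL[term], P_NONREL[term]
    return math.log((p * (1 - u)) / (u * (1 - p)))

def retrieval_status_value(query_terms, doc_terms):
    """Sum the weights of the query terms that are present in the document."""
    x = to_binary_vector(doc_terms)
    return sum(term_weight(t) for t, present in zip(VOCAB, x)
               if present and t in query_terms)

print(to_binary_vector({"t2", "t3"}))
print(retrieval_status_value({"t2", "t3"}, {"t2", "t3"}))
print(retrieval_status_value({"t2", "t3"}, {"t1", "t2"}))
```

Because terms are assumed independent, the per-term weights simply add up, which is what makes the model cheap to evaluate.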


    Probabilistic model

    Uncertain inference

    Proposed by van Rijsbergen

    The measure of uncertainty of document d with respect to query q is the probability of the logical implication P(d -> q)

    The system's task is to infer, given a document, whether the query assertions are true

    A knowledge base of facts and rules is used


    Probabilistic model

    Language model

    A language model is associated with each document

    Documents are ranked by the probability that the document's language model would generate the terms of the query

    Unigram model
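
Query-likelihood ranking with a unigram model can be sketched as follows. Add-one smoothing is an assumption made here so that unseen query terms do not zero out the score; the slide does not name a smoothing method. The two toy documents are illustrative.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, vocab_size):
    """Log probability that the document's unigram model generates the query."""
    counts = Counter(doc)
    total = len(doc)
    # Add-one (Laplace) smoothing: an assumption, not taken from the presentation.
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in query)

docs = {
    "d1": "information retrieval stores and retrieves information".split(),
    "d2": "grid computing shares computing resources".split(),
}
vocab = {t for d in docs.values() for t in d}
query = "information retrieval".split()

ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], len(vocab)),
    reverse=True,
)
print(ranking)  # ['d1', 'd2']
```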


    Algebraic model

    Vector space model

    Latent semantic model


    Algebraic model

    Vector space model

    Documents and queries are represented by their terms

    Term weights are assigned by the tf-idf scheme

    The ith term corresponds to the ith dimension of the vector

    Similarity is calculated as the correlation between the query and document vectors
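
A minimal sketch of tf-idf weighting with cosine similarity as the correlation measure. The raw-count tf, the unsmoothed `log(N / df)` idf, and the three toy documents are assumptions; several tf-idf variants exist.

```python
import math
from collections import Counter

def build_vectorizer(docs):
    """Return a function that maps a token list to a tf-idf vector over the corpus vocabulary."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))    # document frequency per term
    idf = {t: math.log(n / df[t]) for t in vocab}

    def vectorize(tokens):
        tf = Counter(tokens)
        return [tf[t] * idf[t] for t in vocab]       # the ith term is the ith dimension

    return vectorize

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

corpus = [
    "boolean model uses set theory".split(),
    "vector space model uses term weights".split(),
    "latent semantic indexing uses singular value decomposition".split(),
]
vectorize = build_vectorizer(corpus)
query_vec = vectorize("vector space model".split())
scores = [cosine(query_vec, vectorize(d)) for d in corpus]
print(scores)
```

The word "uses" occurs in every document, so its idf is zero and it contributes nothing to any score.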


    Algebraic model

    Latent semantic indexing

    Mathematical technique (singular value decomposition)

    Words used in the same contexts tend to have similar meanings

    Identifies patterns in the relationships between terms and concepts
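
A rough sketch of latent semantic indexing on a toy term-document matrix, assuming NumPy is available. The matrix, the term list, and the choice of `k = 2` concepts are illustrative assumptions.

```python
import numpy as np

# Rows are terms, columns are documents (toy counts chosen for illustration).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

k = 2  # number of latent concepts to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document in concept space

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 share only "engine" in term space but land together in concept space.
print(cos(A[:, 0], A[:, 1]))
print(cos(doc_vecs[0], doc_vecs[1]))
print(cos(doc_vecs[0], doc_vecs[3]))
```

Document 0 uses "car" and document 1 uses "automobile"; truncating the decomposition maps both onto the same vehicle concept, while the flower document stays orthogonal.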


    System Evaluation

    Precision

    Recall

    Fall-out

    F-measure

    Average precision

    Mean average precision
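
The measures listed on this slide can be sketched with their standard definitions. The function names, the sample ranking, and the collection size of 10 are assumptions for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(set(retrieved) & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0

def fall_out(retrieved, relevant, collection_size):
    """Fraction of non-relevant documents in the collection that were retrieved."""
    non_relevant = collection_size - len(relevant)
    return len(set(retrieved) - relevant) / non_relevant if non_relevant else 0.0

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average of average_precision over (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
p, r = precision(ranking, relevant), recall(ranking, relevant)
print(p, r, fall_out(ranking, relevant, 10), f_measure(p, r))
print(average_precision(ranking, relevant))
```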


    Impact of research on other domains

    Research on information storage and retrieval systems draws on achievements and techniques in several related areas.

    Information access: document indexing, retrieval, filtering, clustering, presentation and summarization of information, cross-language information retrieval.

    Machine translation: comparable and parallel text alignment, language generation.

    Computational linguistics: morphological analysis, syntactic parsing, techniques for disambiguation, document segmentation, corpus analysis, term recognition and term expansion.


    Objectives of proposed work

    i. Investigation of various techniques useful for information storage and retrieval, and their comparison.

    ii. Apply grid computing and domain knowledge as supervised learning to eliminate word sense ambiguity from the query.

    iii. Analysis of models to store the data and improvement of document representation by using the following methods:

    Clustering techniques

    Concept mapping

    iv. Evaluation and customization of information storage and retrieval models according to users' needs.

    v. Experimental evaluation of the developed algorithms.


    Proposed Methodology

    Grid computing for semantic database

    Domain creation by knowledge base

    Clustering algorithms

    Concept mapping for mapping between clusters


    Methodology

    Grid computing

    Language limitations: color / colour, bat

    "I eat what I see" / "I see what I eat"

    Combining different domains to achieve a common goal

    Difficulty in corpus analysis

    Domain creation on the basis of a knowledge base

    Parallel processing for different semantic networks


    Methodology

    Clustering

    Research gap

    Grouping of data into meaningful categories

    Document descriptors and descriptor extraction

    The following algorithms will be used:

    Hierarchical algorithm: associating (merging) and dividing the documents

    Ontology-supported clustering

    Graph-based clustering

    Concept map

    Mapping between clusters

    How ideas are created from words

    Decision-making system

    Words and concepts are related to each other and to the whole idea
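
The merging (agglomerative) side of the hierarchical algorithm named above can be sketched as below. Single-link merging, Jaccard distance on term sets, and the four toy documents are assumptions; the presentation does not specify a linkage or distance measure.

```python
def jaccard_distance(a, b):
    """One minus the share of terms the two sets have in common."""
    return 1 - len(a & b) / len(a | b)

def agglomerative(docs, num_clusters):
    """Merge the two closest clusters (single link) until num_clusters remain."""
    clusters = [[name] for name in docs]

    def linkage(c1, c2):
        # Single link: distance between the closest pair of members.
        return min(jaccard_distance(docs[x], docs[y]) for x in c1 for y in c2)

    while len(clusters) > num_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical document descriptors (term sets).
DOCS = {
    "d1": {"boolean", "query", "retrieval"},
    "d2": {"boolean", "retrieval", "model"},
    "d3": {"grid", "computing", "parallel"},
    "d4": {"grid", "computing", "network"},
}
print(agglomerative(DOCS, 2))  # [['d1', 'd2'], ['d3', 'd4']]
```

A divisive variant would run in the opposite direction, starting from one cluster and splitting it.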


    Methodology

    The following criteria will be used to evaluate performance:

    Turnaround time

    Response time

    Precision


    Methodology

    Recall

    Fall-out

    F-measure


    Bibliography

    [1] Wissam Tawileh, Chair of Business Informatics, "Exploring Web Search Behavior of Arab Internet Users," IEEE International Conference on Innovation in Information Technology, Dresden, Germany, 2011.

    [2] D. B. Cleveland and A. D. Cleveland, Introduction to Indexing and Abstracting, Englewood, CO: Libraries Unlimited, Inc., 1990.

    [3] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation 28, 11-21, 1972.

    [4] G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval," Information Processing and Management, 24, 513-523, 1988.

    [5] H.C. Yang and C.H. Lee, "Multi Lingual Information Retrieval," International Conference on Intelligent System Design and Application, Kaohsiung, Taiwan, 2008.

    [6] D.A. Hull and G. Grefenstette, "Querying across languages: a dictionary-based approach to multilingual information retrieval," International Conference on Research and Development in Information Retrieval, 1996.

    [7] Singhal, Amit, "Modern Information Retrieval: A Brief Overview," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2001.

    [8] Maron, Melvin E., "An Historical Note on the Origins of Probabilistic Indexing," Information Processing and Management, 2008.

    [9] Pushpak Bhattacharyya, M. Sinha, M. K. Reddy, P. Pande, L. Kashyap, "Hindi Word Sense Disambiguation," Indian Institute of Technology, Bombay, India, 2011.

    [10] Lashkari, Mahdavi, Ghomi, "A Boolean Model in Information Retrieval for Search Engines," 2009.


    Bibliography

    [11] Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (standard Boolean model).

    [12] Turpin, Andrew; Scholer, Falk (2006), "User performance versus precision measures for simple search tasks," Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), Seattle, Washington, USA, August 06-11, 2006, New York, NY: ACM, pp. 11-18. doi:10.1145/1148170.1148176. ISBN 1595933697.

    [13] Salton, Gerard; Edward A. Fox; Harry Wu (1983), "Extended Boolean information retrieval," Communications of the ACM, Volume 26, Issue 11.

    [14] Lee, W. C.; E. A. Fox (1988), "Experimental Comparison of Schemes for Interpreting Boolean Queries."

    [15] Kang, Bo-Yeong; Dae-Won Kim; Hae-Jung Kim (2005), "Fuzzy Information Retrieval Indexed by Concept Identification," Springer Berlin / Heidelberg. Zadrożny, Sławomir; Nowacka, Katarzyna (2009), "Fuzzy information retrieval model revisited," Elsevier North-Holland, Inc., doi:10.1016/j.fss.2009.02.012.

    [16] Fox, E. A.; S. Sharat (1986), "A Comparison of Two Methods for Soft Boolean Interpretation in Information Retrieval," Technical Report TR-86-1, Virginia Tech, Department of Computer Science.

    [17] Ding, C., "A Similarity-based Probability Model for Latent Semantic Indexing," Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 59-65.

    [18] Deerwester, S., et al., "Improving Information Retrieval with Latent Semantic Indexing," Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40.

    [19] Bartell, B., Cottrell, G., and Belew, R., "Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling," Proceedings, ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 161-167.

    [20] Dumais, S., and Nielsen, J., "Automating the Assignment of Submitted Manuscripts to Reviewers," Proceedings of the Fifteenth Annual International Conference on Research and Development in Information Retrieval, 1992, pp. 233-244.


    Bibliography

    [21] Berry, M. W., and Browne, M., Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics, Philadelphia, 2005.

    [22] Sparc, Robertson, "Using Latent Semantic Analysis to Identify Similarities in Source Code to Support Program Understanding," Proceedings of 12th IEEE International Conference on Tools with Artificial Intelligence, Vancouver, British Columbia, November 13-15, 2000, pp. 46-53.

    [23] G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620. (Article in which a vector space model was presented.)

    [24] David Dubin (2004), "The Most Influential Paper Gerard Salton Never Wrote." (Explains the history of the vector space model and the non-existence of a frequently cited publication.)

    [25] Yarowsky, D., and Florian, R., "Taking the Load off the Conference Chairs: Towards a Digital Paper-routing Assistant," Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very-Large Corpora, 1999, pp. 220-230.

    [26] Soboroff, I., et al., "Visualizing Document Authorship Using N-grams and Latent Semantic Indexing," Workshop on New Paradigms in Information Visualization and Manipulation, 1997, pp. 43-48.

    [27] Yuanhua Lv and ChengXiang Zhai, "Positional Language Models for Information Retrieval," Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009.

    [28] Buyya, Rajkumar, "Grid Computing: Making the Global Cyberinfrastructure for eScience a Reality" (PDF), CSI Communications (Mumbai, India: Computer Society of India (CSI)) 29 (1). Francesco Lelli, Eric Frizziero, Michele Gulmini, Gaetano Maron, Salvatore Orlando, Andrea Petrucci, and Silvano Squizzato, "The many faces of the integration of instruments and the grid," International Journal of Web and Grid Services, 2007, Vol. 3, No. 3, pp. 239-266.

    [29] Goodrum, Abby A., "Image Information Retrieval: An Overview of Current Research," Informing Science, 2000.

    [30] J. M. Ponte and W. B. Croft, "A Language Modeling Approach to Information Retrieval," Research and Development in Information Retrieval, pp. 275-281, 1998.


    Bibliography

    [31] Beel, Jöran; Gipp, Bela; Stiller, Jan-Olaf, "Information Retrieval On Mind Maps - What Could It Be Good For?", Proceedings of the 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2009.

    [32] Benedict, Shajulin; Vasudevan, "A Niched Pareto GA approach for scheduling scientific workflows in wireless Grids," Journal of Computing and Information Technology 16: 101, 2008.

    [33] R. Korra, P. Sujatha, Sidige, Chetana, N. Kumar, "Performance Evaluation of Multilingual Information Retrieval System over Information Retrieval System," IEEE International Conference on Recent Trends in Information Technology, ICRTIT 2011, MIT, Anna University, Chennai, June 3-5, 2011.

    [34] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A., "On Exploring Complex Relationships of Correlation Clusters," 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), pp. 7, 2007.

    [35] Auffarth, B., "Clustering by a Genetic Algorithm with Biased Mutation Operator," WCCI CEC, IEEE, July 18-23, 2010.

    [36] Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A., "Mining Hierarchies of Correlation Clusters," Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119-128, 2006.

    [37] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, 2:283-304, 1998.

    [38] Joseph D. Novak & Alberto J. Cañas, "The Theory Underlying Concept Maps and How To Construct and Use Them," Institute for Human and Machine Cognition. Accessed 24 Nov 2008.

    [39] Moon, B.M., Hoffman, R.R., Novak, J.D., & Cañas, A.J., Applied Concept Mapping: Capturing, Analyzing and Organizing Knowledge, CRC Press: New York, 2011.

    [40] Anderson, J. R., & Lebiere, C. (1998). The Atomic Components of Thought. Mahwah, NJ: Erlbaum.


    Thank You