Discovery and Reconciliation of Entity Type Taxonomies
Soumen Chakrabarti, IIT Bombay
www.cse.iitb.ac.in/~soumen


  • Discovery and Reconciliation of Entity Type Taxonomies
    Soumen Chakrabarti
    IIT Bombay
    www.cse.iitb.ac.in/~soumen

  • Searching with types and entities
    Answer types
      "How far is it from Rome to Paris?"
      type=distance#n#1 near words={Rome, Paris}
    Restrictions on match conditions
      "How many movies did 20th Century Fox release in 1923?"
      (should not match "by 1950, the forest had only 1923 foxes left")
      type={number#n#1, hasDigit} NEAR year=1923 organization=20th Century Fox
    Corpus = set of docs, doc = token sequence, tokens connected to lexical networks

  • Searching personal info networks
    No clean schema, data changes rapidly
    Lots of generic graph proximity information
    [Figure: graph over personal data with nodes such as files, emails (EmailTo, EmailDate), timestamps (TimeLastMod), people (A. U. Thor), and topics (ECML); a canonical node marks the file we want to quickly find]

  • Building blocks
    Structured seeds of type info
      WordNet, Wikipedia, OpenCYC, ...
    Semi-structured sources
      List of faculty members in a department
      Catalog of products at an e-commerce site
    Unstructured open-domain text
      Email bodies, text of papers, blog text, Web pages, ...
    Discovery and extension of type attachments
      Hearst patterns, list extraction, NE taggers
    Reconciling federated type systems
      Schema and data integration
    Query execution engines using type catalogs

  • Hearst patterns and enhancements
    Hearst, 1992; KnowItAll (Etzioni+ 2004)
      "T such as x", "x and other Ts", "x or other Ts", "T x", "x is a T", "x is the only T", ...
    C-PANKOW (Cimiano and Staab 2005)

    Suitable for unformatted natural language
    Generally high precision, low recall
    If few possible Ts, use a named entity tagger
    (a pattern-matching sketch follows below)
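    A minimal sketch of Hearst-style pattern matching, assuming plain tokenized sentences as input; the pattern set and the extract_is_a helper are illustrative, not taken from the slides.

```python
import re

# Hedged sketch of Hearst-pattern extraction over plain sentences.
HEARST_PATTERNS = [
    # "T such as x, y and z"  ->  x, y, z are instances of type T
    re.compile(r"(?P<type>\w+) such as (?P<insts>\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)"),
    # "x and other Ts" / "x or other Ts"
    re.compile(r"(?P<insts>\w+) (?:and|or) other (?P<type>\w+)"),
    # "x is a T"
    re.compile(r"(?P<insts>\w+) is a (?P<type>\w+)"),
]

def extract_is_a(sentence):
    """Return (instance, type) pairs found by the patterns above."""
    pairs = []
    for pat in HEARST_PATTERNS:
        for m in pat.finditer(sentence):
            t = m.group("type")
            for inst in re.split(r",| and | or ", m.group("insts")):
                inst = inst.strip()
                if inst:
                    pairs.append((inst, t))
    return pairs

print(extract_is_a("We visited cities such as Rome, Paris and Berlin."))
# e.g. [('Rome', 'cities'), ('Paris', 'cities'), ('Berlin', 'cities')]
```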

  • Set extraction
    Each node of a graph is a word
    Edge connects words wi and wj if these words occur together in more than k docs
    Use apriori-style searches to enumerate these edges
    Edge weight depends on #docs
    Given a set of words Q, set up Pagerank
      Random surfer on word graph
      W.p. d, jump to some element of Q
      W.p. 1−d, walk to a neighbor
    Present nodes (words) with largest Pagerank
    (a small personalized-Pagerank sketch follows below)
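    A minimal sketch of set extraction by personalized Pagerank, assuming the word co-occurrence graph is given as a dict of weighted adjacency lists; variable names and the toy graph are illustrative, not from the slides.

```python
import numpy as np

# Hedged sketch: random surfer that jumps to the query set Q w.p. d and
# otherwise walks to a weighted neighbor; words with largest Pagerank win.
def personalized_pagerank(graph, query, d=0.15, iters=50):
    words = list(graph)
    idx = {w: i for i, w in enumerate(words)}
    n = len(words)
    # Column-stochastic transition matrix from edge weights (#docs).
    P = np.zeros((n, n))
    for w, nbrs in graph.items():
        total = sum(nbrs.values())
        if total == 0:
            continue
        for v, weight in nbrs.items():
            P[idx[v], idx[w]] = weight / total
    # Teleport vector: jump only to members of the query set Q.
    r = np.zeros(n)
    for q in query:
        r[idx[q]] = 1.0 / len(query)
    pr = np.full(n, 1.0 / n)
    for _ in range(iters):
        pr = d * r + (1 - d) * P @ pr
    return sorted(zip(words, pr), key=lambda x: -x[1])

graph = {
    "rome":   {"paris": 5, "berlin": 3},
    "paris":  {"rome": 5, "berlin": 4, "cheese": 1},
    "berlin": {"rome": 3, "paris": 4},
    "cheese": {"paris": 1},
}
print(personalized_pagerank(graph, query={"rome", "paris"})[:3])
```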

  • Example

  • List extraction
    Given a current set of candidate Ts
    Limit to candidates having high confidence
    Select random subset of k=4 candidates
    Generate query from selected candidates
    Download response documents
    Look for lists containing candidate mentions
    Extract more instances from lists found

    Boosts extraction rate 25-fold
    (a bootstrapping sketch follows below)
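    A minimal sketch of the list-extraction bootstrapping loop described above. The search/fetch/find_lists callables stand in for a web search API, a page downloader, and an HTML list parser; they, the round count, and the acceptance test are illustrative assumptions, not the exact procedure from the slides.

```python
import random

# Hedged sketch of list-extraction bootstrapping over web search results.
def expand_by_lists(candidates, confidence, search, fetch, find_lists,
                    rounds=10, k=4, min_conf=0.8):
    instances = set(candidates)
    for _ in range(rounds):
        # Keep only high-confidence candidates, then sample k=4 of them.
        trusted = [c for c in instances if confidence.get(c, 0) >= min_conf]
        if len(trusted) < k:
            break
        seeds = random.sample(trusted, k)
        query = " ".join(f'"{s}"' for s in seeds)   # conjunctive web query
        for url in search(query):
            page = fetch(url)
            for items in find_lists(page):          # lists/tables on the page
                # A list mentioning several seeds likely enumerates the type.
                if sum(s in items for s in seeds) >= 2:
                    instances.update(items)
    return instances
```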

  • Wrapping, scraping, tagging
    HTML formatting clues help extract records and fields
    Extensive work in the DB, KDD, ML communities
    [Figure: extracted record, e.g. node P167 is-a Paper, has-author Gerhard Pass and Gregor Heinrich, has-title "Investigating word correlation"]

  • Reconciling type systems
    WordNet: small and precise
    Wikipedia: much larger, less controlled
    Collect into a common is-a database

  • Mapping between taxonomies
    Each type has a set of instances
      Assoc Prof: K. Burn, R. Cook
      Synset: lemmas from leaf instances
      Wikipedia concept: list of instances
      Yahoo topic: set of example Web pages
    Goal: establish connections between types
    Connections could be soft or probabilistic (a small instance-overlap sketch follows below)
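    One simple way to realize such soft connections is to score type pairs by overlap of their instance sets. The Jaccard scoring and the toy taxonomies below are assumptions offered only as a sketch, not the method prescribed by the slides.

```python
# Hedged sketch: soft connections between types in two taxonomies via the
# Jaccard overlap of their instance sets.
def soft_mapping(types_a, types_b, threshold=0.1):
    """types_a, types_b: dicts mapping type name -> set of instances."""
    links = []
    for a, inst_a in types_a.items():
        for b, inst_b in types_b.items():
            union = inst_a | inst_b
            if not union:
                continue
            score = len(inst_a & inst_b) / len(union)   # Jaccard similarity
            if score >= threshold:
                links.append((a, b, round(score, 3)))
    return sorted(links, key=lambda x: -x[2])

types_a = {"Assoc Prof": {"K. Burn", "R. Cook"}, "Professor": {"A. Smith"}}
types_b = {"Faculty": {"K. Burn", "R. Cook", "A. Smith"}, "Staff": {"J. Doe"}}
print(soft_mapping(types_a, types_b))
```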

  • Cross-training
    Set of labels or types B, partly related but not identical to type set A
      A=Dmoz topics, B=Yahoo topics
      A=Personal bookmark topics, B=Yahoo topics
    Training docs come in two flavors now
      Fully labeled with A and B labels (rare)
      Half-labeled with either an A or a B label
    Can B make classification for A more accurate (and vice versa)?
    Inductive transfer, multi-task learning (Sarawagi+ 2003)
    [Figure: document sets DA and DB, each labeled in one taxonomy]

  • Motivation
    Symmetric taxonomy mapping
      Ecommerce catalogs: A=distributor, B=retailer
      Web directories: A=Dmoz, B=Yahoo
    Incomplete taxonomies, small training sets
      Bookmark taxonomy vs. Yahoo
    Cartesian label spaces
    [Figure: Region axis (Top, Regional, UK, USA) crossed with Topic axis (Top, Sports, Baseball, Cricket); each cell carries a label-pair-conditioned term distribution]

  • Labels as features
    A-label known, estimate B-label
    Suppose we have an A+B labeled training set
    Discrete-valued label column

    Most text classifiers cannot balance importance of very heterogeneous features
    Do not have fully-labeled data
    Must guess (use soft scores instead of 0/1)
    [Figure: term feature values plus a label column form an augmented feature vector; the last column is the target label]
    (a small sketch of the augmented vector follows below)
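    A minimal sketch of the augmented representation, assuming term-count features and a soft A-label score vector from some auxiliary classifier; the scale factor is an illustrative knob for balancing the heterogeneous feature kinds, not a value from the slides.

```python
import numpy as np

# Hedged sketch: append (soft) A-label scores to a document's term vector
# so a B-classifier can use the other taxonomy's label as extra features.
def augment(term_vector, a_label_scores, scale=1.0):
    """term_vector: term weights; a_label_scores: soft probabilities over
    A labels (a 0/1 one-hot vector when the A-label is actually known)."""
    return np.concatenate([term_vector, scale * a_label_scores])

terms = np.array([0.0, 2.0, 1.0, 0.0, 3.0])   # toy term counts
soft_a = np.array([0.7, 0.2, 0.1])            # guessed Pr(A-label | doc)
x = augment(terms, soft_a, scale=2.0)
print(x)   # 8-dimensional augmented feature vector
```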

  • SVM-CT: Cross-trained SVM
    [Figure: staged classifiers S(A,1), S(B,2), S(A,2)]

  • SVM-CT anecdotes
    Discriminant reveals relations between A and B
      One-to-one, many-to-one, related, antagonistic
    However, accuracy gains are meager
    Positive / negative relations (examples tabulated below)

    Related label pairs (Dmoz → Yahoo):
    Topic        Dmoz                Yahoo
    Movies       Genres.Western      Titles.Western
                                     Titles.Horror
    Photography  Techniques+Styles   Pinhole_Photography
                                     3D_Photography
                                     Panoramic_Photography
                                     Organizations

    Related label pairs (Yahoo → Dmoz):
    Topic        Yahoo                Dmoz
    Photography  Pinhole_Photography  Techniques+Styles
                                      Photographers
    Software     OS.MS_Windows        OS.MS_Windows
                                      OS.UNIX

  • EM1D: Info from unlabeled docs
    Use training docs to induce an initial classifier for taxonomy B, say
    Repeat until classifier satisfactory
      Estimate Pr(β|d) for each unlabeled doc d and each β ∈ B
      Reweigh d by factor Pr(β|d) and add it to the training set for label β
      Retrain classifier
    EM1D: expectation maximization with one label set B (Nigam et al.)
    Ignores labels from the other taxonomy A
    (a small EM1D sketch follows below)
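    A minimal sketch of EM1D as semi-supervised multinomial naive Bayes in the style of Nigam et al., over term-count matrices. The array shapes, smoothing constant, and iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Hedged sketch: EM over one label set B with labeled and unlabeled docs.
def em1d(X_lab, y_lab, X_unlab, n_labels, iters=10, alpha=1.0):
    """X_lab: (n_l, V) term counts, y_lab: ints in 0..n_labels-1,
    X_unlab: (n_u, V) term counts. Returns (prior, term_prob)."""
    R_lab = np.eye(n_labels)[y_lab]          # hard labels as responsibilities
    X = np.vstack([X_lab, X_unlab])

    def m_step(R):
        # Retrain classifier: each doc weighted by its responsibility.
        prior = (R.sum(axis=0) + alpha) / (R.sum() + alpha * n_labels)
        counts = R.T @ X + alpha             # (n_labels, V) smoothed counts
        return prior, counts / counts.sum(axis=1, keepdims=True)

    def e_step(prior, term_prob):
        # Estimate Pr(beta | d) for every unlabeled doc d.
        logp = X_unlab @ np.log(term_prob).T + np.log(prior)
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        return R / R.sum(axis=1, keepdims=True)

    # Initial classifier from the labeled docs alone (unlabeled weight 0).
    R_unlab = np.zeros((X_unlab.shape[0], n_labels))
    prior, term_prob = m_step(np.vstack([R_lab, R_unlab]))
    for _ in range(iters):
        R_unlab = e_step(prior, term_prob)    # reweigh unlabeled docs
        prior, term_prob = m_step(np.vstack([R_lab, R_unlab]))
    return prior, term_prob
```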

  • Stratified EM1D
    Target labels = B
    B-labeled docs are labeled training instances
    Consider A-labeled docs with label α
      These are unlabeled for taxonomy B
    Run EM1D for each row α
    Test instance has known α
      Invoke the semi-supervised model for row α to classify
    [Figure: grid with A topics as rows and B topics as columns; docs in DA and DB are labeled along one axis each]

  • EM2D: Cartesian product EM
    Initialize with fully labeled docs, which go to a specific (α,β) cell
    Smear each half-labeled training doc across its label row or column
      Uniform smear could be bad
      Use a naïve Bayes classifier to seed
    Parameters extended from EM1D
      π(α,β): prior probability for label pair (α,β)
      θ(α,β),t: multinomial term probability for label pair (α,β)
    [Figure: grid of labels in A vs. labels in B; an A-labeled doc is smeared across a row, a B-labeled doc across a column]

  • EM2D updates
    E-step for an A-labeled document [equation not reproduced in the transcript]
    M-step [equation not reproduced in the transcript]
    (a hedged reconstruction of both updates follows below)
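    The update formulas on this slide were images and did not survive the transcript. Below is a plausible reconstruction, assuming the π/θ parameterization from the EM2D slide above and the standard semi-supervised naïve Bayes EM of Nigam et al.; it is a sketch, not the exact equations from the talk.

```latex
% E-step for a document d whose A-label alpha is fixed: responsibility is
% spread only across its row, i.e. over beta in B; n(d,t) = count of term t in d.
r_d(\alpha,\beta) \;=\; \Pr(\beta \mid d, \alpha) \;=\;
  \frac{\pi_{\alpha\beta} \prod_{t} \theta_{\alpha\beta,t}^{\,n(d,t)}}
       {\sum_{\beta'} \pi_{\alpha\beta'} \prod_{t} \theta_{\alpha\beta',t}^{\,n(d,t)}}

% M-step: re-estimate cell priors and term distributions from the soft
% responsibilities, with Laplace smoothing over |A||B| cells and |V| terms.
\pi_{\alpha\beta} \;=\;
  \frac{1 + \sum_{d} r_d(\alpha,\beta)}{|A|\,|B| + \sum_{d} 1}
\qquad
\theta_{\alpha\beta,t} \;=\;
  \frac{1 + \sum_{d} r_d(\alpha,\beta)\, n(d,t)}
       {|V| + \sum_{t'} \sum_{d} r_d(\alpha,\beta)\, n(d,t')}
```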

  • Applying EM2D to a test doc
    Mapping a B-labeled test doc d to an A label (e-commerce catalogs)
      Given β, find the α maximizing Pr(α,β|d)
    Classifying a document d with no labels to an A label
      Aggregation: for each α aggregate Pr(α,β|d) over β, pick the best α
      Guessing (EM2D-G): guess the best β* using a B-classifier, then find the α maximizing Pr(α,β*|d)
    EM pitfalls: damping factor, early stopping
    (a small sketch of these decision rules follows below)
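    A minimal sketch of the three decision rules above, assuming a posterior matrix post[α, β] = Pr(α, β | d) produced by EM2D; the helper names and the toy numbers are illustrative.

```python
import numpy as np

# Hedged sketch of the EM2D decision rules over a Pr(alpha, beta | d) matrix.
def map_b_to_a(post, beta):
    """B-labeled test doc: given its beta, pick the best A label."""
    return int(np.argmax(post[:, beta]))

def classify_aggregation(post):
    """No labels: aggregate over beta for each alpha, pick the best alpha."""
    return int(np.argmax(post.sum(axis=1)))

def classify_guessing(post, b_classifier_scores):
    """EM2D-G: guess beta* with a B-classifier, then pick alpha in that column."""
    beta_star = int(np.argmax(b_classifier_scores))
    return int(np.argmax(post[:, beta_star]))

post = np.array([[0.05, 0.30],     # toy Pr(alpha, beta | d); rows = A labels
                 [0.40, 0.25]])
print(map_b_to_a(post, beta=1), classify_aggregation(post),
      classify_guessing(post, b_classifier_scores=[0.6, 0.4]))
```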

  • Experiments
    Selected 5 Dmoz and Yahoo subtree pairs
    Compare EM2D against
      Naïve Bayes, best #features and smoothing
      EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
      Stratified EM1D
    Mapping a test doc with an A-label to a B-label, or vice versa
    Classifying a zero-labeled test doc
    Accuracy = fraction with correct labels

  • Accuracy benefits in mapping
    EM1D and NB are close, because training set sizes for each taxonomy are not too small
    EM2D > Stratified EM1D > NB
      2D transfer of model info seems important
    Improvement over NB: 30% best, 10% average
    (results tabulated below)

    [Chart: mapping accuracy of EM2D, NB, EM1D, Strat-EM per dataset; data in the table below]


    Mapping accuracy (%):
    Dataset    Data  EM2D  NB    EM1D  Strat-EM
    Autos      A     45    46.5  46.7  45
               B     63    65.6  65.7  64.6
    Movies     A     54.2  43    38.3  42
               B     71    41    40.9  62
    Outdoors   A     88    77.1  76.5  76
               B     86    76    70.9  72.6
    Photo      A     55.3  40.9  40.4  38
               B     41.5  35.5  28.7  37
    Software   A     56    47.8  48.2  51.7
               B     59    54.3  50.9  52.1

    Accuracy (%) with additional baselines:
    Dataset    Data  NB    SVM   SVM-CT  EM2D  A&S T90%  A&S AL:10
    Autos      A     46.5  57.1  57.6    45    50.2      49
               B     65.6  72.8  72.3    63    69        69
    Movies     A     43    54.7  56.5    54.2  58.5      50.8
               B     41    70.1  69.6    71    76.9      74.6
    Outdoors   A     77.1  83.2  82.1    88    92.1      92.4
               B     76    71    73.2    86    85.4      86.9
    Photo      A     40.9  35.1  35.1    55.3  34        38.4
               B     35.5  40.4  37.2    41.5  36.6      33.3
    Software   A     47.8  51.7  53.7    56    60.4      59.9
               B     54.3  62.9  64.9    59    56.7      57.5
