Text Mining and SEASR
![Page 1: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/1.jpg)
Introduction to SEASR and Text Mining
UIUC/NCSA, Feb 4, 2009
Loretta Auvil
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
![Page 2: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/2.jpg)
The SEASR Picture
![Page 3: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/3.jpg)
SEASR: Reach + Relevance + Reuse + Repeatability
SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems
– Semantic-driven environment for SOA interoperability
– Encourages sharing and participation for building communities
– Modular construction allows flows to be modified and configured to encourage reusability within and across domains
– Enables a mashup and integration of tools
– Data-intensive flows can be executed on a simple desktop or a large cluster(s) without modification
– Computation can be created for distributed execution on servers where the content lives
– User accessibility to control trust and compliance with required copyright license of content
– Relies on standardized Resource Description Framework (RDF) to define components and flow
![Page 4: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/4.jpg)
Knowledge Discovery in Data
![Page 5: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/5.jpg)
Workbench
• Web-based UI
• Components and flows are retrieved from server
• Additional locations of components and flows can be added to server
• Create flow using a graphical drag-and-drop interface
• Change property values
• Execute the flow
![Page 6: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/6.jpg)
Community Hub
![Page 7: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/7.jpg)
SEASR @ Work – Zotero
• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
– Zotero Export to Fedora through SEASR
– Saves results from SEASR Analytics to a Collection
• Launch MONK Processing
– MONK DB Ingestion Workflow
![Page 8: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/8.jpg)
SEASR @ Work – Fedora
Web Service
Interactive Web Application
![Page 9: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/9.jpg)
SEASR @ Work – Entity Mash-up
• Entity Extraction with OpenNLP
• Locations viewed on Google Map
• Dates viewed on Simile Timeline
![Page 10: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/10.jpg)
SEASR @ Work – Audio Analysis
• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10 sec moving window of audio
– Loads and applies the models
– Sends results back to the Web UI
• NESTER: Annotation of Audio via Spectral Analysis
![Page 11: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/11.jpg)
SEASR @ Work – MONK
Executes flows for each analysis requested
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVM)
![Page 12: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/12.jpg)
SEASR @ Work – DISCUS
• On-demand usage of analytics while surfing
– While navigating, request analytics to be performed on page
– Text extraction and cleaning
• Summarization and keyword extraction
– List the important terms on the page being analyzed
– Provide relevant short summaries
• Visual maps
– Provide a visual representation of the key concepts
– Show the graph of relations between concepts
![Page 13: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/13.jpg)
SEASR and UIMA: Emotion Tracking
Goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)
![Page 14: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/14.jpg)
SEASR Text Analytics Goals
Address the scholarly text analytics needs by:
• Efficiently managing distributed literary and historical textual assets
• Structuring extracted information to facilitate knowledge discovery
• Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich to support question answering
• Devising a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answer process
• Devising algorithms for question answering and inference
• Developing a UI for effective visual knowledge discovery, with query logic kept separate from application logic
• Leveraging existing approaches and devising algorithms for clustering, inference, and Q&A
• Developing an interaction UI for effective visual data exploration
• Enabling the text analytics through SEASR components
![Page 15: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/15.jpg)
The Zotero Picture
The Web
Zotero Store
![Page 16: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/16.jpg)
The Zotero + SEASR Picture
The Web
Zotero Store
The Web
![Page 17: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/17.jpg)
Your Zotero Collection
![Page 18: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/18.jpg)
The SEASR Analytics
![Page 19: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/19.jpg)
The Value Added
![Page 20: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/20.jpg)
Some Examples
• Authorship Analysis (JUNG network importance algorithms to rank the authors in the citation network)
• Author Centrality Analysis – Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
• Author Degree Analysis – Uses AuthorDegreeDistributionAnalysis, which ranks each author on the number of coauthors
• Author HITS Analysis – The *hubness* of a node is the degree to which a node links to other important authorities. The *authoritativeness* of a node is the degree to which a node is pointed to by important hubs.
• Readability – Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
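The degree-based ranking in the list above can be sketched in a few lines. This is a minimal illustration only, not the SEASR or JUNG implementation; the function names and the sample author lists are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_degrees(papers):
    """Map each author to the number of distinct coauthors across all papers."""
    neighbors = defaultdict(set)
    for authors in papers:
        for author in authors:
            neighbors[author]  # make sure solo authors appear with degree 0
        for a, b in combinations(set(authors), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(others) for author, others in neighbors.items()}


def rank_by_degree(papers):
    """Return (author, degree) pairs, highest degree first, ties broken by name."""
    degrees = coauthor_degrees(papers)
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical collection: each inner list is the author list of one paper.
papers = [
    ["Ada", "Ben", "Cy"],
    ["Ada", "Dee"],
    ["Eve"],
]
ranking = rank_by_degree(papers)
```

Here `ranking` starts with Ada, who has three distinct coauthors, and ends with Eve, who has none.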
![Page 21: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/21.jpg)
SEASR Flow
![Page 22: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/22.jpg)
Text Mining Definition
Many definitions in the literature
• "The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data"
• An exploration and analysis of textual (natural-language) data by automatic and semiautomatic means to discover new knowledge
• What is "previously unknown" information?
– Strict definition
  • Information that not even the writer knows
– Lenient definition
  • Rediscover the information that the author encoded in the text
![Page 23: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/23.jpg)
Text Mining Process
• Text Preprocessing
– Syntactic Text Analysis
– Semantic Text Analysis
• Features Generation
– Bag of Words
– N-grams
• Feature Selection
– Simple Counting
– Statistics
– Selection based on POS
• Text/Data Mining
– Classification: Supervised Learning
– Clustering: Unsupervised Learning
– Information Extraction
• Analyzing Results
– Visual Exploration, Discovery and Knowledge Extraction
– Query-based question answering
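The two feature-generation options named in the process above, bag of words and n-grams, can be sketched as follows. This is an illustrative sketch using only the Python standard library; the function names are made up for this example.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the lord of the rings".split()
bow = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)
```

The bag of words discards order entirely, while n-grams keep a small amount of local order at the cost of many more features.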
![Page 24: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/24.jpg)
Text Characteristics (1)
• Large textual database
– Enormous wealth of textual information on the Web
– Publications are electronic
• High dimensionality
– Consider each word/phrase as a dimension
• Noisy data
– Spelling mistakes
– Abbreviations
– Acronyms
• Text messages are very dynamic
– Web pages are constantly being generated (removed)
– Web pages are generated from database queries
• Not well-structured text
– Email/Chat rooms
  • "r u available?"
  • "Hey whazzzzzz up"
– Speech
![Page 25: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/25.jpg)
Text Characteristics (2)
• Dependency
– Relevant information is a complex conjunction of words/phrases
– Order of words in the query
  • hot dog stand in the amusement park
  • hot amusement stand in the dog park
• Ambiguity
– Word ambiguity
  • Pronouns (he, she, …)
  • Synonyms (buy, purchase)
  • Words with multiple meanings (bat: related to baseball or to the mammal)
– Semantic ambiguity
  • The king saw the rabbit with his glasses. (multiple meanings)
• Authority of the source
– IBM is more likely to be an authorized source than my distant second cousin
![Page 26: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/26.jpg)
Text Preprocessing
• Syntactic analysis
– Tokenization
– Lemmatization
– POS tagging
– Shallow parsing
– Custom literary tagging
• Semantic analysis
– Information Extraction
  • Named Entity tagging
– Semantic Category (unnamed entity) tagging
– Co-reference resolution
– Ontological association (WordNet, VerbNet)
– Semantic Role analysis
– Concept-Relation extraction
![Page 27: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/27.jpg)
Syntactic Analysis
• Tokenization
– Text document is represented by the words it contains (and their occurrences)
– e.g., "Lord of the rings" → {"the", "Lord", "rings", "of"}
– Highly efficient
– Makes learning far simpler and easier
– Order of words is not that important for certain applications
• Lemmatization/Stemming
– Involves the reduction of corpus words to their respective headwords (i.e., lemmas)
– Reduces dimensionality
– Identifies a word by its root
– e.g., flying, flew → fly
• Stop words
– Identifies the most common words that are unlikely to help with text mining
– e.g., "the", "a", "an", "you"
• Parsing / Part of Speech (POS) tagging
– Generates a parse tree (graph) for each sentence
– Each sentence is a standalone graph
– Find the corresponding POS for each word
– e.g., John (noun) gave (verb) the (det) ball (noun)
– Shallow Parsing
  • analysis of a sentence which identifies the constituents (noun groups, verbs, ...), but does not specify their internal structure, nor their role in the main sentence
– Deep Parsing
  • more sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
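The first three syntactic steps above (tokenization, stop-word removal, lemmatization) can be sketched with a toy pipeline. The stop-word set and the lemma table below reuse only the slide's own examples and are far too small for real use; a real system would rely on a trained lemmatizer or stemmer rather than a lookup table.

```python
import re

# Toy resources taken from the examples above; assumptions for this sketch only.
STOP_WORDS = {"the", "a", "an", "you"}
LEMMAS = {"flying": "fly", "flew": "fly"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize, drop stop words, and map each remaining word to its lemma."""
    kept = [tok for tok in tokenize(text) if tok not in STOP_WORDS]
    return [LEMMAS.get(tok, tok) for tok in kept]
```

For example, `preprocess("You flew the flying rings")` drops "you" and "the" and maps both verb forms to "fly".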
![Page 28: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/28.jpg)
Semantic Analysis: Information Extraction
• Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output in a predetermined format
![Page 29: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/29.jpg)
Information Extraction

| Information Type | Description | State of the art (Accuracy) |
| --- | --- | --- |
| Entities | An object of interest such as a person or organization. | 90-98% |
| Attributes | A property of an entity such as its name, alias, descriptor, or type. | 80% |
| Facts | A relationship held between two or more entities such as Position of a Person in a Company. | 60-70% |
| Events | An activity involving several entities such as a terrorist act, airline crash, management change, new product introduction. | 50-60% |

“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
![Page 30: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/30.jpg)
Information Extraction Approaches
• Terminology (name) lists
– This works very well if the list of names and name expressions is stable and available
• Tokenization and morphology
– This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
• Use of characteristic patterns
– This works fairly well for novel entities
– Rules can be created by hand or learned via machine learning or statistical algorithms
– Rules capture local patterns that characterize entities from instances of annotated training data
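The first two approaches above are easy to sketch: a lookup against a name list, and a pattern that recognizes dates by their internal format. The name list, the sample sentence, and the function names are hypothetical, and the date pattern checks only the DD/MM/YY shape, not whether the day and month are valid.

```python
import re

# Hypothetical name list; a real one would come from a curated, stable source.
KNOWN_NAMES = ["Rex Luthor", "Boynton Laboratory", "Alderwood"]

# Dates written as DD/MM/YY, recognized purely by their internal format.
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")


def find_known_names(text, names=KNOWN_NAMES):
    """Return (name, start offset) for each listed name found in the text."""
    hits = []
    for name in names:
        for match in re.finditer(re.escape(name), text):
            hits.append((name, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


def find_dates(text):
    """Return the DD/MM/YY strings that appear in the text."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


sample = "Rex Luthor visited Alderwood on 04/02/09."
```

The list lookup finds only names it already knows, which is why the slide reserves learned patterns for novel entities.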
![Page 31: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/31.jpg)
Information Extraction
Relation (Event) Extraction
• Identify (and tag) the relation between two entities:
– A person is_located_at a location (news)
– A gene codes_for a protein (biology)
• Relations require more information
– Identification of two entities & their relationship
– Predicted relation accuracy
  • Pr(E1) * Pr(E2) * Pr(R) ~= (.93) * (.93) * (.93) = .80
• Information in relations is less local
– Contextual information is a problem: right word may not be explicitly present in the sentence
– Events involve more relations and are even harder
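The accuracy estimate above is a plain product of three probabilities, which assumes the two entity identifications and the relation identification succeed independently. A small sketch (the function name is made up for this example) makes the arithmetic explicit:

```python
def relation_accuracy(p_entity1, p_entity2, p_relation):
    """Estimate relation accuracy as the product of the three step accuracies.

    Assumes the three steps succeed or fail independently of one another.
    """
    return p_entity1 * p_entity2 * p_relation


estimate = relation_accuracy(0.93, 0.93, 0.93)
```

With each step at 0.93 the product is about 0.804, which the slide rounds to .80. The same compounding explains why events, which involve more relations, are harder still.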
![Page 32: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/32.jpg)
Semantic Analytics
Named Entity (NE) Tagging
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
Tags shown: NE:Person, NE:Time, NE:Location, NE:Organization
![Page 33: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/33.jpg)
Semantic Analysis
Semantic Category (unnamed entity, UNE) Tagging
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
Tag shown: UNE:Organization
![Page 34: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/34.jpg)
Semantic Analysis
Co-reference Resolution for entities and unnamed entities
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
Tag shown: UNE:Organization
![Page 35: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/35.jpg)
Semantic Analysis
Semantic Role Analysis
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
Role labels shown: ACTOR, ACTION, WHEN, OBJECT, WHERE; ACTION, OBJECT, COMPL
![Page 36: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/36.jpg)
Semantic Analysis
Concept-Relation Extraction
Concepts: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)
Relations: actor (who), object (what), time (when), location (where)
![Page 37: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/37.jpg)
IE – Template Extraction – Steps
</VerbGroup> …
![Page 38: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/38.jpg)
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson
…….
The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.
The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. …
Template Extraction
<Facility>Finsbury Park Mosque</Facility>
<PersonPositionOrganization> <OFFLEN OFFSET="3576" LENGTH="33" /> <Person>Abu Hamza al-Masri</Person> <Position>chief cleric</Position> <Organization>Finsbury Park Mosque</Organization> </PersonPositionOrganization>
<Country>England</Country>
<PersonArrest> <OFFLEN OFFSET="3814" LENGTH="61" /> <Person>Abu Hamza al-Masri</Person> <Location>London</Location> <Date>1999</Date> <Reason>his alleged involvement in a Yemen bomb plot</Reason> </PersonArrest>
<Country>England</Country>
<Country>France</Country>
<Country>United States</Country>
<Country>Belgium</Country>
<Person>Abu Hamza al-Masri</Person>
<City>London</City>
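Because the templates above are emitted in a predetermined XML format, downstream code can read them with a standard parser. The sketch below uses Python's `xml.etree.ElementTree` on one template copied from the output above (the `OFFLEN` element is left out for brevity); the helper name `template_to_record` is hypothetical.

```python
import xml.etree.ElementTree as ET

# One template in the format shown above, with the OFFLEN element omitted.
TEMPLATE = """
<PersonArrest>
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>
"""


def template_to_record(xml_text):
    """Turn one extracted template into (template type, {slot name: slot text})."""
    root = ET.fromstring(xml_text.strip())
    slots = {child.tag: (child.text or "").strip() for child in root}
    return root.tag, slots


kind, slots = template_to_record(TEMPLATE)
```

Each template becomes a typed record whose slots (person, location, date, reason) can then be linked or queried.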
![Page 39: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/39.jpg)
Streaming Text: Knowledge Extraction
• Leveraging some earlier work on information extraction from text streams
Information extraction
• process of using advanced automated machine learning approaches
• to identify entities in text documents
• extract this information along with the relationships these entities may have in the text documents
The visualization above demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
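The co-occurrence rule in the last sentence can be sketched as follows. This is an illustration under simplifying assumptions, not the system behind the visualization: entities are treated as single tokens that have already been identified, and the window size is an arbitrary parameter.

```python
from collections import Counter


def cooccurrences(tokens, entities, window):
    """Count entity pairs whose mentions lie at most `window` tokens apart."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    pairs = Counter()
    for idx, (i, first) in enumerate(positions):
        for j, second in positions[idx + 1:]:
            if j - i > window:
                break  # positions are sorted, so later mentions are farther away
            if first != second:
                pairs[tuple(sorted((first, second)))] += 1
    return pairs


tokens = "Luthor announced a facility in Alderwood named Boynton".split()
entities = {"Luthor", "Alderwood", "Boynton"}
links = cooccurrences(tokens, entities, window=5)
```

With a window of 5, Luthor is linked to Alderwood and Alderwood to Boynton, but Luthor and Boynton are 7 tokens apart and stay unlinked.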
![Page 40: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/40.jpg)
Semantic Analysis
• Word Sense Disambiguation
– Context based or proximity based
– Very accurate
![Page 41: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/41.jpg)
Ontological Association (WordNet)
• WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
• Search for dog
– n dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
– n frump, dog (a dull unattractive unpleasant girl or woman)
– n dog (informal term for a man)
– n cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
– n frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
– n pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
– n andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
– v chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)
![Page 42: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/42.jpg)
Feature Selection
• Reduce Dimensionality
– Learners have difficulty addressing tasks with high dimensionality
• Irrelevant Features
– Not all features help!
– Remove features that occur in only a few documents
– Reduce features that occur in too many documents
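Both rules above can be expressed as a document-frequency filter. In this sketch the thresholds `min_docs` and `max_fraction` are illustrative assumptions, and overly common terms are simply dropped, which is one possible reading of "reduce".

```python
from collections import Counter


def select_features(documents, min_docs=2, max_fraction=0.8):
    """Keep terms found in at least `min_docs` documents and in at most
    `max_fraction` of all documents. Both thresholds are illustrative."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_fraction * len(documents)
    return {term for term, count in doc_freq.items() if min_docs <= count <= limit}


docs = [
    ["the", "king", "saw", "the", "rabbit"],
    ["the", "rabbit", "ran"],
    ["the", "king", "slept"],
    ["the", "bat", "flew"],
]
selected = select_features(docs)
```

Here "the" appears in every document and is dropped as too common, the words that appear once are dropped as too rare, and only "king" and "rabbit" survive.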
![Page 43: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/43.jpg)
Text Mining: General Application Areas
• Information Retrieval
– Indexing and retrieval of textual documents
– Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
– Extraction of partial knowledge in the text
• Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
– Predict a class for each text document
• Clustering
– Generating collections of similar text documents
![Page 44: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/44.jpg)
Text Mining: Supervised vs. Unsupervised
• Supervised learning (Classification)
– Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– Split into training data and test data for model building process
– New data is classified based on the model built with the training data
– Techniques
  • Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
• Unsupervised learning (Clustering)
– Class labels of training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
![Page 45: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/45.jpg)
Results: Social Network (Tom in Red)
![Page 46: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/46.jpg)
Results: Timeline
![Page 47: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/47.jpg)
Results: Maps
![Page 48: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/48.jpg)
Text Mining: T2K and ThemeWeaver
![Page 49: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/49.jpg)
Images from Pacific Northwest Laboratory
Text Mining: Themescape and ThemeRiver
• Visualizing Relationships Between Documents
![Page 50: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/50.jpg)
Gather – Analyze – Present
![Page 51: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/51.jpg)
Text Mining: Applications
• Email: Spam filtering
• News Feeds: Discover what is interesting
• Medical: Identify relationships and link information from different medical fields
• Homeland Security
• Marketing: Discover distinct groups of potential buyers and make suggestions for other products
• Industry: Identifying groups of competitors' web pages
• Job Seeking: Identify parameters in searching for jobs
![Page 52: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/52.jpg)
Text Mining: Classification Definition
• Given: Collection of labeled records
– Each record contains a set of features (attributes), and the true class (label)
– Create a training set to build the model
– Create a testing set to test the model
• Find: Model for the class as a function of the values of the features
• Goal: Assign a class (as accurately as possible) to previously unseen records
• Evaluation: What Is Good Classification?
– Correct classification
  • Known label of test example is identical to the predicted class from the model
– Accuracy ratio
  • Percent of test set examples that are correctly classified by the model
– Distance measure between classes can be used
  • e.g., classifying "football" document as a "basketball" document is not as bad as classifying it as "crime"
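The training/testing split and the accuracy ratio described above can be sketched as below. The split fraction, the fixed seed, and the function names are assumptions chosen for illustration; the model itself is left out, so the predicted labels would come from whatever classifier is being evaluated.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle labeled records and split them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of test examples whose predicted class equals the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot score an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

Note that this accuracy treats every error alike; the distance-based measure mentioned in the last bullet would instead penalize "football" labeled as "crime" more than "football" labeled as "basketball".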
![Page 53: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/53.jpg)
Text Mining: Clustering Definition
• Given: Set of documents and a similarity measure among documents
• Find: Clusters such that
– Documents in one cluster are more similar to one another
– Documents in separate clusters are less similar to one another
• Goal:
– Finding a correct set of documents
• Similarity Measures:
– Euclidean distance if attributes are continuous
– Other problem-specific measures
  • e.g., how many words are common in these documents
• Evaluation: What Is Good Clustering?
– Produce high quality clusters with
  • high intra-class similarity
  • low inter-class similarity
– Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
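The two similarity measures named above can be sketched directly. The function names are hypothetical, and the shared-word count is only one simple instance of a problem-specific measure.

```python
import math


def euclidean_distance(a, b):
    """Distance between two equal-length vectors of continuous attributes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_word_count(doc_a, doc_b):
    """Number of distinct words that two tokenized documents have in common."""
    return len(set(doc_a) & set(doc_b))


query_a = "hot dog stand in the amusement park".split()
query_b = "hot amusement stand in the dog park".split()
overlap = shared_word_count(query_a, query_b)
```

The two queries share all seven of their words even though they mean different things, which shows the limit of order-free measures noted earlier under Text Characteristics.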
![Page 54: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/54.jpg)
SEASR
Meandre Workbench
![Page 55: Text Mining and SEASR](https://reader035.vdocuments.site/reader035/viewer/2022062513/55503e94b4c90580748b48a5/html5/thumbnails/55.jpg)
Future Work
• Enhancements to Semantic Analysis
– Use of Ontological Association (WordNet, VerbNet)
– Improve co-referencing
– Improve fact extraction
• Visual exploration tools