mallet tutorial
Post on 24-Oct-2014
Machine Learning with MALLET
http://mallet.cs.umass.edu
David Mimno
Information Extraction and Synthesis Laboratory, Department of CS
UMass, Amherst
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
Who?
• Andrew McCallum (most of the work)
• Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia…
• Fernando Pereira, others at Penn…
Who am I?
• Chief maintainer of MALLET
• Primary author of MALLET topic modeling package
Why?
• Motivation: text classification and information extraction
• Commercial machine learning (Just Research, WhizBang)
• Analysis and indexing of academic publications: Cora, Rexa
What?
• Text focus: data is discrete rather than continuous, even when values could be continuous:
double value = 3.0
How?
• Command line scripts:
– bin/mallet [command] --[option] [value] …
– Text User Interface (“tui”) classes
• Direct Java API (most of this talk)
– http://mallet.cs.umass.edu/api
History
• Version 0.4: c. 2004
– Classes in edu.umass.cs.mallet.base.*
• Version 2.0: c. 2008
– Classes in cc.mallet.*
– Major changes to finite state transducer package
– bin/mallet vs. specialized scripts
– Java 1.5 generics
Learning More
• http://mallet.cs.umass.edu
– “Quick Start” guides, focused on command line processing
– Developers’ guides, with Java examples
• mallet-dev@cs.umass.edu mailing list
– Low volume, but can be bursty
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
Models for Text Data
• Generative models (Multinomials)
– Naïve Bayes
– Hidden Markov Models (HMMs)
– Latent Dirichlet Topic Models
• Discriminative Regression Models
– MaxEnt/Logistic regression
– Conditional Random Fields (CRFs)
Representations
• Transform text documents to vectors x1, x2, …
• Retain meaning of vector indices
• Ideally sparsely
[Figure: the document “Call me Ishmael. …” mapped to a vector xi = (1.0, 0.0, …, 0.0, 6.0, 0.0, …, 3.0, …)]
Representations
• Elements of vector are called feature values
• Example: Feature at row 345 is number of times “dog” appears in document
Documents to Vectors
Document: Call me Ishmael.
Tokens: Call me Ishmael
Tokens (lowercased): call me ishmael
Features (sequence): 473, 3591, 17
Features (bag): 17 1.0, 473 1.0, 3591 1.0
Alphabet: 17 ishmael … 473 call … 3591 me
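The document-to-vector steps above can be sketched in plain Java without MALLET. This is a minimal illustration, not MALLET's implementation: the class `DocumentToVectorSketch` and its methods are hypothetical names, the tokenizer simply splits on non-letters, and indices are assigned in order of first appearance, so they will not match the 473/3591/17 values on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of MALLET.
final class DocumentToVectorSketch {
    // Stand-in for an alphabet: token string -> integer index.
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    /** Splits on runs of non-letters, then lowercases each token. */
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String piece : document.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Feature sequence: one integer per token, order preserved. */
    int[] toFeatureSequence(List<String> tokens) {
        int[] features = new int[tokens.size()];
        for (int i = 0; i < features.length; i++) {
            String token = tokens.get(i);
            Integer index = indexOf.get(token);
            if (index == null) {
                index = indexOf.size();   // next unused index
                indexOf.put(token, index);
            }
            features[i] = index;
        }
        return features;
    }

    /** Feature bag: sparse index -> count map; order is lost, duplicates are counted. */
    static Map<Integer, Double> toFeatureBag(int[] featureSequence) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int feature : featureSequence) {
            bag.merge(feature, 1.0, Double::sum);
        }
        return bag;
    }
}
```

The sequence form keeps token order (needed for sequence tagging and topic models), while the bag form keeps only counts (enough for most classifiers).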
Instances
Email message, web page, sentence, journal abstract…
• Name: What is it called?
• Data: What is the input?
• Target/Label: What is the output?
• Source: What did it originally look like?
Instances
• Name
• Data
• Target
• Source
Possible types of the data field:
String
TokenSequence: ArrayList<Token>
FeatureSequence: int[]
FeatureVector: int -> double map
cc.mallet.types
Alphabets
TObjectIntHashMap map
ArrayList entries
int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)
void stopGrowth() // do not add entries for new Objects -- default is to allow growth
void startGrowth()
cc.mallet.types, gnu.trove
17 ishmael … 473 call … 3591 me
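A two-way mapping with growth control can be sketched as follows. This is a hedged, stdlib-only illustration of the idea on the Alphabets slide, not MALLET's `Alphabet` class: `AlphabetSketch` is a hypothetical name, a `HashMap` replaces the Trove map, and returning -1 for an entry that is absent and not added is this sketch's own convention.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of MALLET.
final class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>();  // object -> index
    private final List<Object> entries = new ArrayList<>();    // index -> object
    private boolean growthStopped = false;

    /** Returns the index of the entry, adding it if allowed; -1 if absent and not added. */
    int lookupIndex(Object entry, boolean shouldAdd) {
        Integer index = map.get(entry);
        if (index != null) {
            return index;
        }
        if (!shouldAdd || growthStopped) {
            return -1;
        }
        int newIndex = entries.size();
        map.put(entry, newIndex);
        entries.add(entry);
        return newIndex;
    }

    Object lookupObject(int index) {
        return entries.get(index);
    }

    /** After this call, unseen objects are no longer added. */
    void stopGrowth() {
        growthStopped = true;
    }

    void startGrowth() {
        growthStopped = false;
    }

    int size() {
        return entries.size();
    }
}
```

Stopping growth is useful at test time: words never seen in training have no learned parameters, so there is no point in assigning them new indices.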
Creating Instances
• Instance constructor method
new Instance(data, target, name, source)
• Iterators
Iterator<Instance>
FileIterator(File[], …)
CsvIterator(FileReader, Pattern…)
ArrayIterator(Object[])
…
cc.mallet.pipe.iterator
Creating Instances
• FileIterator
cc.mallet.pipe.iterator
Each instance in its own file
Label from dir name
/data/bad/
/data/good/
Creating Instances
• CsvIterator
cc.mallet.pipe.iterator
Each instance on its own line
Name, label, data from regular expression groups.
“CSV” is a lousy name. LineRegexIterator?
1001 Melville Call me Ishmael. Some years ago…
1002 Dickens It was the best of times, it was…
^([^\t]+)\t([^\t]+)\t(.*)
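The regular expression on the CsvIterator slide can be tried out with `java.util.regex` alone. The sketch below only shows how the three groups map to name, label, and data. `LineRegexSketch` and `parse` are hypothetical names, and the input is assumed to be tab-separated, as the expression implies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of MALLET.
final class LineRegexSketch {
    // Same expression as on the slide: three tab-separated fields.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    /** Returns {name, label, data}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael. Some years ago");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
    }
}
```

Because the last group is `(.*)`, any further tabs end up inside the data field rather than breaking the match.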
Instance Pipelines
• Sequential transformations of instance fields (usually Data)
• Pass an ArrayList<Pipe> to SerialPipes
cc.mallet.pipe
// “data” is a String
CharSequence2TokenSequence // tokenize with regexp
TokenSequenceLowercase // modify each token’s text
TokenSequenceRemoveStopwords // drop some tokens
TokenSequence2FeatureSequence // convert token Strings to ints
FeatureSequence2FeatureVector // lose order, count duplicates
Instance Pipelines
• A small number of pipes modify the “target” field
• There are now two alphabets: data and label
cc.mallet.pipe, cc.mallet.types
// “target” is a String
Target2Label // convert String to int
// “target” is now a Label
Alphabet > LabelAlphabet (LabelAlphabet is a subclass of Alphabet)
Label objects
• Weights on a fixed set of classes
• For training data, weight for correct label is 1.0, all others 0.0
cc.mallet.types
implements Labeling
int getBestIndex()
Label getBestLabel()
You cannot create a Label; they are only produced by LabelAlphabet
InstanceLists
• A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet
cc.mallet.types
InstanceList instances = new InstanceList(pipe);
instances.addThruPipe(iterator);
Putting it all together
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Target2Label());
pipeList.add(new CharSequence2TokenSequence());
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());
InstanceList instances =
new InstanceList(new SerialPipes(pipeList));
instances.addThruPipe(new FileIterator(. . .));
Persistent Storage
• Most MALLET classes use Java serialization to store models and data
java.io
ObjectOutputStream oos = new ObjectOutputStream(…);
oos.writeObject(instances);
oos.close();
Pipes, data objects, labelings, etc. all need to implement Serializable.
Be sure to include custom classes in classpath, or you get a StreamCorruptedException
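A write-then-read round trip with plain `java.io` shows what the Serializable requirement means in practice. This is a sketch under stated assumptions: `SerializationSketch` and its nested `WordCount` class are hypothetical stand-ins for a custom data object, and an in-memory byte array is used instead of a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration class; not part of MALLET.
final class SerializationSketch {
    /** Hypothetical custom data object; it must implement Serializable to be stored. */
    static final class WordCount implements Serializable {
        private static final long serialVersionUID = 1L;
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    /** Serializes the object into a byte array. */
    static byte[] write(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(object);
        }
        return bytes.toByteArray();
    }

    /** Reads one object back; its class must be on the classpath. */
    static Object read(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return ois.readObject();
        }
    }

    /** Writes and re-reads an object, wrapping checked exceptions for brevity. */
    static Object roundTrip(Serializable object) {
        try {
            return read(write(object));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Declaring an explicit `serialVersionUID` is a common precaution so that stored objects can still be read after small changes to the class.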
Review
• What are the four main fields in an Instance?
• What are two ways to generate Instances?
• How do we modify the value of Instance fields?
• Name some classes that appear in the “data” field.
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
Classifier objects
• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…
cc.mallet.classify
[Figure: given the data “watery”, which class is best: NN, JJ, PRP, VB, or CC? One class is marked “(this one!)”.]
Labeling labeling = classifier.classify(instance).getLabeling();
Label l = labeling.getBestLabel();
System.out.print(instance + "\t");
System.out.println(l);
Training Classifier objects
• Each type of classifier has one or more ClassifierTrainer classes
cc.mallet.classify
ClassifierTrainer trainer = new MaxEntTrainer();
Classifier classifier = trainer.train(instances);
Training Classifier objects
• Some classifiers require numerical optimization of an objective function.
cc.mallet.optimize
log P(Labels | Data) = log f(label1, data1, w) + log f(label2, data2, w) + log f(label3, data3, w) + …
Maximize w.r.t. w!
Parameters w
• Association between feature, class label
• How many parameters for K classes and N features?
Feature  Class  Weight
action  NN  0.13
action  VB  -0.1
action  JJ  -0.21
SUFF-tion  NN  1.3
SUFF-tion  VB  -2.1
SUFF-tion  JJ  -1.7
SUFF-on  NN  0.01
SUFF-on  VB  -0.02
…
Training Classifier objects
cc.mallet.optimize
interface Optimizer
boolean optimize()
Limited-memory BFGS, Conjugate gradient…
interface Optimizable
interface ByValue
interface ByValueGradient
Specific objective functions
Training Classifier objects
cc.mallet.classify
MaxEntOptimizableByLabelLikelihood
For Optimizable interface:
double[] getParameters()
void setParameters(double[] parameters)
…
Log likelihood and its first derivative:
double getValue()
void getValueGradient(double[] buffer)
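To make "log likelihood and its first derivative" concrete, here is a small stdlib-only sketch of a multi-class maximum-entropy objective. It is an assumption-laden illustration, not MALLET's `MaxEntOptimizableByLabelLikelihood`: the class name `LabelLikelihoodSketch`, the dense `double[][]` data layout, and the flat parameter indexing `k * numFeatures + j` are all choices made for this example, and there is no prior or regularization term.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of MALLET.
final class LabelLikelihoodSketch {
    private final double[][] data;      // data[i][j]: value of feature j in instance i
    private final int[] labels;         // labels[i]: index of the correct class
    private final int numClasses;
    private final int numFeatures;
    private final double[] parameters;  // parameters[k * numFeatures + j]: weight of feature j for class k

    LabelLikelihoodSketch(double[][] data, int[] labels, int numClasses) {
        this.data = data;
        this.labels = labels;
        this.numClasses = numClasses;
        this.numFeatures = data[0].length;
        this.parameters = new double[numClasses * numFeatures];
    }

    double[] getParameters() {
        return parameters.clone();
    }

    void setParameters(double[] newParameters) {
        System.arraycopy(newParameters, 0, parameters, 0, parameters.length);
    }

    /** P(class k | x) for every k, computed with a numerically stable softmax. */
    private double[] classProbabilities(double[] x) {
        double[] scores = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < numClasses; k++) {
            for (int j = 0; j < numFeatures; j++) {
                scores[k] += parameters[k * numFeatures + j] * x[j];
            }
            max = Math.max(max, scores[k]);
        }
        double sum = 0.0;
        for (int k = 0; k < numClasses; k++) {
            scores[k] = Math.exp(scores[k] - max);
            sum += scores[k];
        }
        for (int k = 0; k < numClasses; k++) {
            scores[k] /= sum;
        }
        return scores;
    }

    /** Log likelihood: sum over instances of log P(label_i | data_i, w). */
    double getValue() {
        double value = 0.0;
        for (int i = 0; i < data.length; i++) {
            value += Math.log(classProbabilities(data[i])[labels[i]]);
        }
        return value;
    }

    /** Gradient of the log likelihood with respect to each parameter. */
    void getValueGradient(double[] buffer) {
        Arrays.fill(buffer, 0.0);
        for (int i = 0; i < data.length; i++) {
            double[] p = classProbabilities(data[i]);
            for (int k = 0; k < numClasses; k++) {
                // Observed indicator minus model probability for class k.
                double residual = (labels[i] == k ? 1.0 : 0.0) - p[k];
                for (int j = 0; j < numFeatures; j++) {
                    buffer[k * numFeatures + j] += residual * data[i][j];
                }
            }
        }
    }
}
```

An optimizer only needs these two functions plus get/set for the parameters. A quick sanity check for any such class is to compare the gradient against finite differences of the value.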
Evaluation of Classifiers
• Create random test/train splits
cc.mallet.types
InstanceList[] instanceLists = instances.split(new Randoms(),
new double[] {0.9, 0.1, 0.0});
// 90% training, 10% testing, 0% validation
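The idea behind a proportional random split can be sketched with the standard library. This illustrates the technique only and is not how `InstanceList.split` is implemented: `SplitSketch` is a hypothetical name, `java.util.Random` stands in for MALLET's `Randoms`, and the rounding rule at the cut points is this sketch's own choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of MALLET.
final class SplitSketch {
    /** Shuffles a copy of the items and cuts it into consecutive parts by proportion. */
    static <T> List<List<T>> split(List<T> items, Random random, double[] proportions) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (double proportion : proportions) {
            cumulative += proportion;
            // Cut at the rounded cumulative fraction, clamped to a valid range.
            int end = (int) Math.round(cumulative * shuffled.size());
            end = Math.min(Math.max(end, start), shuffled.size());
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

Passing a seeded `Random` makes the split reproducible, which helps when comparing classifiers on the same partition.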
Evaluation of Classifiers
• The Trial class stores the results of classifications on an InstanceList (testing or training)
cc.mallet.classify
Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(…)
double getRecall(…)
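The metrics listed on the Trial slide have simple definitions, sketched below over arrays of true and predicted label indices. `TrialSketch` is a hypothetical class, not MALLET's `Trial`. Returning 0.0 when a ratio is undefined (for example, precision for a label that was never predicted) is an assumption of this sketch, and MALLET may handle that case differently.

```java
// Hypothetical illustration class; not part of MALLET.
final class TrialSketch {
    private final int[] truth;      // truth[i]: correct label index of instance i
    private final int[] predicted;  // predicted[i]: label index chosen by the classifier

    TrialSketch(int[] truth, int[] predicted) {
        if (truth.length != predicted.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        this.truth = truth;
        this.predicted = predicted;
    }

    /** Fraction of instances whose predicted label is correct. */
    double getAccuracy() {
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == predicted[i]) correct++;
        }
        return (double) correct / truth.length;
    }

    /** Of the instances predicted as this label, the fraction that truly have it. */
    double getPrecision(int label) {
        int predictedAsLabel = 0;
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (predicted[i] == label) {
                predictedAsLabel++;
                if (truth[i] == label) correct++;
            }
        }
        return predictedAsLabel == 0 ? 0.0 : (double) correct / predictedAsLabel;
    }

    /** Of the instances that truly have this label, the fraction predicted as it. */
    double getRecall(int label) {
        int trulyLabel = 0;
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == label) {
                trulyLabel++;
                if (predicted[i] == label) correct++;
            }
        }
        return trulyLabel == 0 ? 0.0 : (double) correct / trulyLabel;
    }

    /** Harmonic mean of precision and recall. */
    double getF1(int label) {
        double precision = getPrecision(label);
        double recall = getRecall(label);
        return precision + recall == 0.0 ? 0.0 : 2 * precision * recall / (precision + recall);
    }
}
```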
Review
• I have invented a new classifier: David regression.
– What class should I implement to classify instances?
– What class should I implement to train a David regression classifier?
– I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?
– How would I check whether my new classifier works better than Naïve Bayes?
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
Sequence Tagging
• Data occurs in sequences
• Categorical labels for each position
• Labels are correlated
DET NN VBS VBG
the dog likes running
(at prediction time the labels are unknown: ?? ?? ?? ??)
Sequence Tagging
• Classification: n-way
• Sequence Tagging: n^T-way
[Figure: lattice with one column of candidate tags (NN, JJ, PRP, VB, CC) per word of “or red dogs on blue trees”]
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
[Photo: Andrei Markov]
[Figure: tag sequence DET JJ NN VB, annotated “This one, given this one, is independent of these”]
[Figure: lattice of candidate tags (NN, JJ, PRP, VB, CC) over “or red dogs on blue trees”, processed one position at a time: “red dogs on blue trees”, then “dogs on blue trees”]
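The Markov property is what lets dynamic programming find the best of the n^T label sequences in time proportional to T·n². A minimal Viterbi sketch follows. It is a generic illustration rather than MALLET's transducer code: `ViterbiSketch` and its array-based inputs are hypothetical, and scores are assumed to be additive (for example, log probabilities).

```java
// Hypothetical illustration class; not part of MALLET.
final class ViterbiSketch {
    /**
     * @param start      start[s]: score of beginning in state s
     * @param transition transition[r][s]: score of moving from state r to state s
     * @param emission   emission[t][s]: score of state s at position t
     * @return the highest-scoring state sequence
     */
    static int[] bestPath(double[] start, double[][] transition, double[][] emission) {
        int length = emission.length;
        int numStates = start.length;
        if (length == 0) {
            return new int[0];
        }
        double[][] best = new double[length][numStates];  // best score of any path ending in s at t
        int[][] backPointer = new int[length][numStates];  // previous state on that best path

        for (int s = 0; s < numStates; s++) {
            best[0][s] = start[s] + emission[0][s];
        }
        for (int t = 1; t < length; t++) {
            for (int s = 0; s < numStates; s++) {
                best[t][s] = Double.NEGATIVE_INFINITY;
                for (int r = 0; r < numStates; r++) {
                    // Only the previous state matters (Markov property).
                    double score = best[t - 1][r] + transition[r][s] + emission[t][s];
                    if (score > best[t][s]) {
                        best[t][s] = score;
                        backPointer[t][s] = r;
                    }
                }
            }
        }

        // Pick the best final state, then follow back-pointers to the start.
        int[] path = new int[length];
        for (int s = 1; s < numStates; s++) {
            if (best[length - 1][s] > best[length - 1][path[length - 1]]) {
                path[length - 1] = s;
            }
        }
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = backPointer[t][path[t]];
        }
        return path;
    }
}
```

With strong penalties for switching states, the best path can differ from the per-position best labels, which is the sense in which labels are correlated.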
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: fully generative
P(Labels | Data) = P(Data, Labels) / P(Data)
• Conditional Random Field: conditional
P(Labels | Data)
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: simple (independent) output space
“NSF-funded”
• Conditional Random Field: arbitrarily complicated outputs
“NSF-funded” CAPITALIZED HYPHENATED ENDS-WITH-ed ENDS-WITH-d …
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: simple (independent) output space
FeatureSequence: int[]
• Conditional Random Field: arbitrarily complicated outputs
FeatureVectorSequence: FeatureVector[]
Importing Data
• SimpleTagger format: one word per line, with instances delimited by a blank line
Call VB
me PPN
Ishmael NNP
. .

Some JJ
years NNS
…
Importing Data
• SimpleTagger format: one word per line, with instances delimited by a blank line
Call SUFF-ll VB
me TWO_LETTERS PPN
Ishmael BIBLICAL_NAME NNP
. PUNCTUATION .

Some CAPITALIZED JJ
years TIME SUFF-s NNS
…
Importing Data
LineGroupIterator
SimpleTaggerSentence2TokenSequence() // String to Tokens, handles labels
[Pipes that modify tokens]
TokenSequence2FeatureVectorSequence() // Token objects to FeatureVectors
cc.mallet.pipe, cc.mallet.pipe.iterator
Importing Data
// Ishmael
TokenTextCharSuffix("C2=", 2)
// Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*")) // must match entire string
// Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false) // file has one name per line; third argument: ignore case?
// Ishmael C2=el CAP NAME
cc.mallet.pipe.tsf
Sliding window features
a red dog on a blue tree
Features added to “dog”: red@-1, a@-2, on@1, a@-2_&_red@-1
Importing Data
int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 }; // previous position
conjunctions[1] = new int[] { 1 }; // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 on@1
cc.mallet.pipe.tsf
Importing Data
TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 a@-2_&_C1=d@-1
cc.mallet.pipe.tsf
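How offset groups turn into feature names such as `a@-2_&_red@-1` can be sketched over a plain array of token strings. This is a simplified, hypothetical illustration (`OffsetConjunctionSketch` is not a MALLET class). It uses only the token text, whereas the slide shows that existing features such as `C1=d` are also carried along, and skipping groups that reach past the sequence boundary is this sketch's own choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of MALLET.
final class OffsetConjunctionSketch {
    /**
     * For the token at `position`, builds one feature per offset group:
     * the tokens at the given offsets, each written as token@offset and joined with "_&_".
     * A group is skipped if any offset falls outside the sequence.
     */
    static List<String> featuresAt(String[] tokens, int position, int[][] conjunctions) {
        List<String> features = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : offsets) {
                int index = position + offset;
                if (index < 0 || index >= tokens.length) {
                    inRange = false;
                    break;
                }
                if (name.length() > 0) {
                    name.append("_&_");
                }
                name.append(tokens[index]).append('@').append(offset);
            }
            if (inRange) {
                features.add(name.toString());
            }
        }
        return features;
    }
}
```

For example, calling `featuresAt` on the tokens of "a red dog on a blue tree" at position 2 ("dog") with the offset groups {-1}, {1}, {-2, -1} yields one feature for the previous word, one for the next word, and one conjunction of the previous two.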
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
[Figure, built up one step at a time: hidden states DET NN VBS over the observed words “the dog”]
P(DET)
P(the | DET)
P(NN | DET)
P(dog | NN)
P(VBS | NN)
How many parameters?
• Determines efficiency of training
• Too many leads to overfitting
Trick: Don’t allow certain transitions
P(VBS | DET) = 0
[Figure: “the dog runs” tagged DET NN VBS, shown three times]
Finite State Transducers
abstract class Transducer
CRF
HMM
abstract class TransducerTrainer
CRFTrainerByLabelLikelihood
HMMTrainerByLikelihood
cc.mallet.fst
Finite State Transducers
cc.mallet.fst
First order: one weight for every pair of labels and observations.
CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates();
// or
crf.addStatesForLabelsConnectedAsIn(instances);
[Figure: “the dog runs” tagged DET NN VBS]
Finite State Transducers
cc.mallet.fst
“Three-quarter” order: one weight for every pair of labels and observations.
crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);
[Figure: “the dog runs” tagged DET NN VBS]
Finite State Transducers
cc.mallet.fst
Second order: one weight for every triplet of labels and observations.
crf.addStatesForBiLabelsConnectedAsIn(instances);
[Figure: “the dog runs” tagged DET NN VBS]
Finite State Transducers
cc.mallet.fst
“Half” order: equivalent to independent classifiers, except some transitions may be illegal.
crf.addStatesForHalfLabelsConnectedAsIn(instances);
[Figure: “the dog runs” tagged DET NN VBS]
Training a transducer
CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
trainer.train(trainingInstances);
cc.mallet.fst
Evaluating a transducer
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(transducer);
TransducerEvaluator evaluator = new TokenAccuracyEvaluator(testing, "testing");
trainer.addEvaluator(evaluator);
trainer.train(trainingInstances); // trainingInstances: the training InstanceList, as on the previous slide
cc.mallet.fst
Applying a transducer
Sequence output = transducer.transduce(input);
for (int index = 0; index < input.size(); index++) {
    System.out.print(input.get(index) + "/");
    System.out.print(output.get(index) + " ");
}
cc.mallet.fst
Review
• How do you add new features to TokenSequences?
• What are three factors that affect the number of parameters in a model?
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
Topics: “Semantic Groups”
[Figure: a news article linked to two topics, “Sports” and “Negotiation”]
“Sports”: team, player, game
“Negotiation”: strike, deadline, union
Topic word lists:
Series Yankees Sox Red World League game Boston team games baseball Mets Game series won Clemens Braves Yankee teams
players League owners league baseball union commissioner Baseball Association labor Commissioner Football major teams Selig agreement strike team bargaining
Training a Topic Model
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
Evaluating a Topic Model
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
MarginalProbEstimator evaluator = lda.getProbEstimator();
double logLikelihood = evaluator.evaluateLeftToRight(testing, 10, false, null);
Inferring topics for new documents
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
TopicInferencer inferencer = lda.getInferencer();
double[] topicProbs = inferencer.getSampledDistribution(instance, 100, 10, 10);
More than words…
• Text collections mix free text and structured data
David Mimno
Andrew McCallum
UAI
2008
“Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. …”
Dirichlet-multinomial Regression (DMR)
The corpus specifies a vector of real-valued features (x) for each document, of length F.
Each topic has an F-length vector of parameters.
Topic parameters for feature “published in JMLR”
user, users, user interface, interactive, interface  -1.44
web, web pages, web page, world wide web, web sites  -1.36
retrieval, information retrieval, query, query expansion  -1.23
strategies, strategy, adaptation, adaptive, driven  -1.21
agent, agents, multiagent, autonomous agents  -1.12
nearest neighbor, boosting, nearest neighbors, adaboost  1.37
blind source separation, source separation, separation, channel  1.40
reinforcement learning, learning, reinforcement  1.41
bounds, vc dimension, bound, upper bound, lower bounds  1.74
kernel, kernels, rational kernels, string kernels, fisher kernel  2.27
Feature parameters for RL topic
<default>  -3.76
COLING  -1.64
IEEE Trans. PAMI  -1.54
CVPR  -1.47
ACL  -1.38
Machine Learning Journal  2.19
ECML  2.45
Kenji Doya  2.56
ICML  2.88
Sridhar Mahadevan  2.99
Topic parameters for feature “published in UAI”
nearest neighbor, boosting, nearest neighbors, adaboost  -1.50
descriptions, description, top, bottom, top bottom  -1.50
workshop report, invited talk, international conference, report  -1.37
digital libraries, digital library, digital, library  -1.36
shape, deformable, shapes, contour, active contour  -1.29
reasoning, logic, default reasoning, nonmonotonic reasoning  2.11
uncertainty, symbolic, sketch, primal sketch, uncertain, connectionist  2.25
probability, probabilities, probability distributions  2.25
qualitative, reasoning, qualitative reasoning, qualitative simulation  2.26
bayesian networks, bayesian network, belief networks  2.88
Feature parameters for Bayes nets topic
<default>  -3.36
ICRA  -2.24
Neural Networks  -1.50
COLING  -1.38
Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989)  -1.16
Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999)  2.04
Philippe Smets  2.15
Ashraf M. Abdelbar  2.23
Mary-Anne Williams  2.41
UAI  2.88
Dirichlet-multinomial Regression
• Arbitrary observed features of documents
• Target contains FeatureVector
DMRTopicModel dmr = new DMRTopicModel(numTopics);
dmr.addInstances(training);
dmr.estimate();
dmr.writeParameters(new File("dmr.parameters"));
Polylingual Topic Modeling
• Topics exist in more languages than you could possibly learn
• Topically comparable documents are much easier to get than translation sets
• Translation dictionaries
– cover pairs, not sets of languages
– miss technical vocabulary
– aren’t available for low-resource languages
Topics from European Parliament Proceedings
Topics from Wikipedia
Aligned instance lists
dog… chien… hund…
cat… chat…
pig… schwein…
Polylingual Topics
InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };
PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);
pltm.addInstances(training);
MALLET hands-on tutorial
http://mallet.cs.umass.edu/mallet-handson.tar.gz