bibm11 b223 slides(2)

Upload: blargh-d-bloseng

Post on 06-Apr-2018


TRANSCRIPT

  • 8/3/2019 Bibm11 b223 Slides(2)

    1/49

    Inferring Functional Groups from Microbial Gene Catalogue with Probabilistic Topic Models

    1 2 1 1 3

    , , , ,

    1 College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA

    2 Dept. of Computer Science at Central China Normal University, Wuhan, China

    3 Department of Computer Science, University of Vermont, Burlington, VT, USA

    1

  • 8/3/2019 Bibm11 b223 Slides(2)

    2/49

    .

    thought of as the complete set of DNA sequences that codes for the

    hereditary material that is passed on from generation to generation.

    These DNA sequences include all of the genes (the functional and physical unit of heredity passed from parent to offspring) and

    genetic information) included within the genome.

    Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.

    2

  • 8/3/2019 Bibm11 b223 Slides(2)

    3/49

    In recent years we see growth of GenBank and NCBI with the

    3

  • 8/3/2019 Bibm11 b223 Slides(2)

    4/49

    With the growth of GenBank and NCBI, a lot of annotating algorithms ... standard reference and attach meta-information to the sequences.

    4

  • 8/3/2019 Bibm11 b223 Slides(2)

    5/49

    Backgrounds: meta-information

    The annotated meta-information involves hierarchical data such as

    NCBI Taxonomy and Gene Ontology.

    5

  • 8/3/2019 Bibm11 b223 Slides(2)

    6/49

    Challenges: Metagenomics

    With the fast advancing sequencing techniques, large amounts of sequenced genomes and meta-genomes from uncultured microbial ...

    The goal of metagenomics is to study the genome-wide gene-expression, ... human body) and understand the underlying biological processes.

    6

  • 8/3/2019 Bibm11 b223 Slides(2)

    7/49

    What are the major research questions of our study?

    We use our data mining framework to investigate the following questions:

    1) ... samples, what genomes are there?

    Answering this question requires mapping the meta-genomic reads to taxonomic units (usually by a homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis).

    2) What are the major functions of these genomes?

    The answers to this question involve annotating the major functional units (such as signal transduction, metabolic capacity and gene regulation) on the genome level (a.k.a. functional analysis).

    Our research objective: We aim to develop a new method that is able to analyze the genome-level composition of DNA sequences, in order to ... same species, tell their functional roles.

    7

  • 8/3/2019 Bibm11 b223 Slides(2)

    8/49

    Structural annotation and protein encoding regions

    Homology-based functional analysis

    Topic Models

    8

  • 8/3/2019 Bibm11 b223 Slides(2)

    9/49

    Structural annotation and protein encoding regions

    Annotating the regions of known open reading frames (ORFs), non-coding genes (rRNA, tRNA, miRNA), promoters and UTRs in the DNA sequences

    9

  • 8/3/2019 Bibm11 b223 Slides(2)

    10/49

    Structural annotation and protein encoding regions (continue)

    Standard reference sequences have detailed structural annotations of both non-protein encoding regions (such as tRNA) and protein encoding regions (CDS), as well as the corresponding gene names (if applicable). The GenBank accession number of each reference sequence is available on each NCBI online query.

    10

  • 8/3/2019 Bibm11 b223 Slides(2)

    11/49

    Structural annotation and protein encoding regions

    Homology-based functional analysis

    Topic Models

    11

  • 8/3/2019 Bibm11 b223 Slides(2)

    12/49

    Functional analysis - overview

    Functional analysis

    Uncover the major gene functions related to the genomic ...

    Requires explaining the biochemical activity (a.k.a. molecular function) of the gene product, identifying the biological process to which the gene or gene product contributes (including information ... gene).

    12

  • 8/3/2019 Bibm11 b223 Slides(2)

    13/49

    Homology-based functional analysis (Richter and Huson, 2009)

    A homology-based approach has been recently introduced to achieve functional annotation for metagenomic reads (Richter and Huson, 2009).

    The framework begins with a homology-based BLASTX algorithm to ... in NCBI database.

    The BLASTX hits will associate fragments with related protein IDs and gene names. After that, with the help of the Gene Ontology ... GO terms, thus providing an overview of gene functions and products for metagenomic fragments.
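The two steps described above (BLASTX hits, then identifier mapping to GO terms) can be sketched roughly as follows. This is only an illustration under stated assumptions: the read IDs, protein IDs, E-values, the identifier mapping and the `annotate` helper are all hypothetical placeholders, and no real BLAST or Gene Ontology API is called.

```python
from collections import Counter

# Hypothetical BLASTX output: read ID -> list of (protein ID, E-value).
blastx_hits = {
    "read_1": [("P001", 1e-30), ("P002", 1e-4)],
    "read_2": [("P003", 5.0)],
    "read_3": [],
}

# Hypothetical identifier mapping: protein ID -> GO terms (placeholder values).
protein_to_go = {
    "P001": ["GO:0000155"],
    "P002": ["GO:0000155", "GO:0016757"],
    "P003": ["GO:0005524"],
}


def annotate(hits, mapping, max_evalue=1e-5):
    """Count GO terms reached through hits that pass the E-value cutoff."""
    counts = Counter()
    unannotated = []
    for read_id, read_hits in hits.items():
        kept = [pid for pid, evalue in read_hits if evalue <= max_evalue]
        if not kept:
            # No hit survives the threshold: the read gets no annotation.
            unannotated.append(read_id)
            continue
        terms = {term for pid in kept for term in mapping.get(pid, [])}
        counts.update(terms)
    return counts, unannotated


go_counts, no_annotation = annotate(blastx_hits, protein_to_go)
```

With a strict cutoff, reads without a surviving hit receive no annotation at all, while a loose cutoff lets many hits (and many GO terms) through for a single read.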

    13

  • 8/3/2019 Bibm11 b223 Slides(2)

    14/49

    Homology-based functional analysis (Richter and Huson, 2009)

    GO terms obtained from database identifier mapping (Richter and Huson, 2009)

    14

  • 8/3/2019 Bibm11 b223 Slides(2)

    15/49

    Limitations with Homology-based Functional Analysis Methods

    1. Homology-based approaches very much rely on the result of local sequence alignment (such as BLAST and BLASTX) to the known open reading frames (ORF). The BLAST-like local alignment may either return hundreds of hits, or return no hits, depending on the threshold of E-value used. In the latter case, the current methods are unable to provide any functional annotation. In the former case, it usually lacks a proper tie-breaker to further reduce the hits, which makes the functional annotation somewhat ambiguous (with hundreds of probable explanations).

    2. ... any insight about the major functional capabilities of genomes (like which gene functions are more commonly shared by strains from the same species), as there is no priority for the annotated terms.

    15

  • 8/3/2019 Bibm11 b223 Slides(2)

    16/49

    Structural annotation and protein encoding regions

    Homology-based functional analysis

    Topic Models

    16

  • 8/3/2019 Bibm11 b223 Slides(2)

    17/49

    Topic Modeling - Intuitive

    Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

    sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image

    Assume the data we ... some parameterized random process.

    Learn the parameters that best explain the data.

    Use the model to predict (infer) new data, based on data ...

    17

  • 8/3/2019 Bibm11 b223 Slides(2)

    18/49

    Word: Basic unit. Item from a vocabulary indexed by {1, . . . , V}.

    Document: $\mathbf{w} = (w_1, w_2, \ldots, w_N)$.

    Collection: A total of $D$ documents, denoted by $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_D\}$.

    Topic: Denoted by $z$; the total number is $K$. Each topic has its unique word distribution $p(w \mid z)$.

    18

  • 8/3/2019 Bibm11 b223 Slides(2)

    19/49

    Background & Existing Techniques of Generative Latent Topic Models

    The probabilistic latent semantic indexing (PLSI) model

    Assumption:

    Each document has a mixture of k topics.

    Fitting the model involves:

    Estimating the topic-specific word distributions $p(w \mid z)$ and document-specific topic distributions $p(z_k \mid d_j)$ from the corpus via maximum likelihood estimation (MLE).

    [Figure: PLSI Model (Hofmann, 2001). Labels: "Likelihood of ... topic z", "Word-Topic Prior Probability".]

    19
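As a small worked illustration of the mixture assumption, the word distribution of one document under PLSI is the topic-weighted average $p(w \mid d_j) = \sum_k p(w \mid z_k)\, p(z_k \mid d_j)$. The sketch below uses made-up numbers for three words and two topics; the variable and function names are illustrative only and do not come from the slides.

```python
# Toy PLSI parameters (made-up numbers): 3 vocabulary words, 2 topics.
p_w_given_z = [
    [0.7, 0.2, 0.1],  # p(w | z_0)
    [0.1, 0.3, 0.6],  # p(w | z_1)
]
p_z_given_d = [0.25, 0.75]  # p(z_k | d_j) for a single document d_j


def word_distribution(p_w_z, p_z_d):
    """Mix the topic-specific word distributions with the document's topic weights."""
    vocab_size = len(p_w_z[0])
    return [
        sum(p_z_d[k] * p_w_z[k][w] for k in range(len(p_z_d)))
        for w in range(vocab_size)
    ]


p_w_given_d = word_distribution(p_w_given_z, p_z_given_d)
```

Because each row of `p_w_given_z` and the vector `p_z_given_d` sum to one, the mixed distribution also sums to one.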

  • 8/3/2019 Bibm11 b223 Slides(2)

    20/49

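    Latent Dirichlet Allocation (LDA) Model (Blei, 2003)

    In the PLSI model, the topic mixtures $p(z_k \mid d_j)$ of documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated. Thus it is not scalable.

    The LDA model treats the probability of latent topics for each document $p(z \mid d)$ and the probability of words for each latent topic $p(w \mid z)$ as latent random variables, which are subject to change when a new document comes.

    $$\theta_d \sim \mathrm{Dir}(\alpha), \qquad p(z \mid d_j) \sim \mathrm{Multi}(\theta_d)$$

    $$\phi_j \sim \mathrm{Dir}(\beta), \qquad p(w_i \mid z_j) \sim \mathrm{Multi}(\phi_j)$$

    $$p(z_i = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$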
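The generative view behind LDA can be sketched as follows: draw a topic mixture for the document from a Dirichlet prior, then draw a topic and a word for each position. This is a minimal sketch under assumptions, not the authors' implementation: the helper names, the symmetric concentration values and the toy sizes (3 topics, 6 words) are all chosen here for illustration.

```python
import random


def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(rng, phi, alpha, length):
    """Generate one document: theta ~ Dir(alpha), z ~ Multi(theta), w ~ Multi(phi[z])."""
    num_topics = len(phi)
    theta = sample_dirichlet(rng, alpha, num_topics)
    words = []
    for _ in range(length):
        z = rng.choices(range(num_topics), weights=theta)[0]
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]
        words.append(w)
    return theta, words


rng = random.Random(0)
# Topic-specific word distributions phi_j ~ Dir(beta); sizes are illustrative.
phi = [sample_dirichlet(rng, 0.5, 6) for _ in range(3)]
theta, doc = generate_document(rng, phi, alpha=0.5, length=10)
```

Estimation runs this story in reverse: given only the words, it infers the topic assignments and, from them, the two families of distributions.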

    20

  • 8/3/2019 Bibm11 b223 Slides(2)

    21/49

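    LDA Model Estimation - Gibbs Sampling Monte Carlo process (Griffiths, 2004)

    $$p(z_i = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(w_i \mid z_i = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \, p(z_i = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})$$

    in which

    $$p(w_i \mid z_i = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(w_i \mid z_i = j, \phi_j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \, p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \, d\phi_j = \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}$$

    $$p(z_i = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(z_i = j \mid \theta_d) \, p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \, d\theta_d = \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$

    with $p(w_i \mid z_i = j, \phi_j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \phi_j^{(w_i)}$.

    Since

    $$p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi_j) \, p(\phi_j)$$

    and

    $$p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d) \, p(\theta_d)$$

    with $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi_j) \sim \mathrm{Multi}(\phi_j)$, $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d) \sim \mathrm{Multi}(\theta_d)$, $p(\phi_j) \sim \mathrm{Dir}(\beta)$ and $p(\theta_d) \sim \mathrm{Dir}(\alpha)$, it follows that

    $$p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(n^{(w_i)}_{-i,j} + \beta), \qquad p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(n^{(d_i)}_{-i,j} + \alpha)$$

    21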

  • 8/3/2019 Bibm11 b223 Slides(2)

    22/49

    Given the word-topic posterior probability, the Monte Carlo process becomes really straightforward, which ... each facet to appear) to determine the assignment of topics to each word for the next round.

    Given probability for each word: $p(z_i = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i})$, ...

    New topic assignment for each word.
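One round of this process can be sketched as a collapsed Gibbs sweep: remove a word's current assignment from the counts, compute the unnormalised posterior for every topic from the count ratios, and draw a new assignment. The sketch below is a minimal illustration assuming symmetric priors; the function name, default hyperparameters and toy corpus are assumptions made here, and the document-side denominator is dropped because it is the same for every topic.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word IDs."""
    rng = random.Random(seed)
    n_wt = [[0] * num_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_t = [0] * num_topics                                # words assigned per topic
    z = []
    # Random initial topic assignment for every word position.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Exclude the current assignment (the "-i" counts).
                t = z[d][i]
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalised posterior over topics for this word position.
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(num_topics)
                ]
                # Draw the new topic assignment and restore the counts.
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt


toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
assignments, word_topic, doc_topic = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
```

After the final sweep, the count tables can be normalised (with the priors added) to estimate the topic-word and document-topic distributions.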

    22

  • 8/3/2019 Bibm11 b223 Slides(2)

    23/49

    23

  • 8/3/2019 Bibm11 b223 Slides(2)

    24/49

    24

  • 8/3/2019 Bibm11 b223 Slides(2)

    25/49

    Experiments

    25

  • 8/3/2019 Bibm11 b223 Slides(2)

    26/49

    Experiment: Inferring Functional Groups from Microbial Gene Catalogue with Topic Models

    ... non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by ...

    Probabilistic topic modeling is a Bayesian method that is able to extract useful topical information from unlabeled data. When used to study microbial samples, the functional elements (including ... KEGG pathway mappings) bear an analogy with words.

    Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample, which may be further used to study the genotype-phenotype connection of human disease.

    26

  • 8/3/2019 Bibm11 b223 Slides(2)

    27/49

    Experimental Data Collection

    In our experiment, we conduct a probabilistic topic modeling experiment to identify functional groups from human gut microbial community data generated by [Qin, et al. 2010], which is openly accessible via http://gutmeta.genomics.org.cn/

    The human gut microbial samples from [Qin, et al. 2010] belong to both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). Specifically, the IBD patients are from ... Crohn's disease (CD), and the other group with ulcerative colitis (UC).

    In total, there are 85 healthy samples, 15 UC samples and 12 CD samples.

    27

  • 8/3/2019 Bibm11 b223 Slides(2)

    28/49

    Experimental Data Collection (continue)

    According to [Qin, et al. 2010], the Illumina GA reads from human gut microbial samples are first assembled into longer contigs. After that, the Glimmer program was used to predict protein-encoding ...

    The predicted CDs sequences were then aligned to each other to form a non-redundant CDs catalog (a.k.a. minimal gut genome). The non-redundant CDs catalog consists of 3,299,822 non-redundant CDs sequences with an average length of 704 bp.

    CDs_id: MH0001

    Name: GL0006996_MH0001_[Lack_3'-end]_[mRNA]_locus=scaffold96_9:1:1206:-

    Length: 1206

    COG/KO: COG4799 K01966

    Pathway mapping: map00280, map00640

    Taxonomic level: species - Eubacterium eligens

    28

  • 8/3/2019 Bibm11 b223 Slides(2)

    29/49

    Experimental Data Collection (continue)

    In our experiment, three types of functional elements are derived from the non-redundant CDs catalog, i.e. the NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators.

    ... obtained by carrying out BLASTP alignment against the NCBI NR database. The taxonomical level of each non-redundant CDs ... based algorithm. The taxonomic abundance data for each sample can be computed by counting the indicators of NCBI taxonomical levels.

    The assignments of gene orthologous indicators and KEGG pathway indicators are achieved by BLASTP alignment of the amino-acid sequence from predicted CDs to the eggNOG database and KEGG database.

    29

  • 8/3/2019 Bibm11 b223 Slides(2)

    30/49

    Experimental Data Collection (continue)

    NCBI Taxonomic Levels

    Genus Clostridium

    Genus Bacteroides

    Class Clostridia

    Genus Bacillus

    Gene Orthologous Group Indicators

    COG0642 : Signal transduction histidine kinase

    COG1132 : "ABC-type multidrug transport system, ATPase and permease components"

    COG0438 : Glycosyltransferase

    KEGG Pathway Indicators

    map00230 : Metabolism_Nucleotide Metabolism_Purine metabolism

    map00350 : Metabolism_Amino Acid Metabolism_Tyrosine metabolism

    The union of unique functional elements jointly defines a fixed word vocabulary. In total, there are ... taxonomic level indicators, with a vocabulary size of 748; there are a total of 1,293,764 gene orthologous group indicators, with a vocabulary size of 4667; and there are 953,493 KEGG pathway indicators, with a vocabulary size of 237.

    30
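The "samples as documents, functional elements as words" analogy can be made concrete with a small sketch that builds the fixed vocabulary and a per-sample count vector. The sample names and the element lists below are hypothetical; only the indicator IDs themselves are taken from the examples on this slide.

```python
from collections import Counter

# Hypothetical per-sample lists of functional element indicators.
samples = {
    "sample_A": ["COG0642", "COG0438", "map00230", "COG0438"],
    "sample_B": ["map00350", "COG1132", "map00230"],
}

# Fixed vocabulary: the sorted union of unique functional elements.
vocabulary = sorted({e for elements in samples.values() for e in elements})
index = {element: i for i, element in enumerate(vocabulary)}


def to_counts(elements):
    """Turn one sample's indicator list into a count vector over the vocabulary."""
    counts = [0] * len(vocabulary)
    for element, count in Counter(elements).items():
        counts[index[element]] = count
    return counts


count_matrix = {name: to_counts(elements) for name, elements in samples.items()}
```

Each count vector plays the role of a bag-of-words document, so any topic model that accepts word counts can be run on it unchanged.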

  • 8/3/2019 Bibm11 b223 Slides(2)

    31/49

    Groups of functional elements in microbial community

    Given non-redundant CDs catalog, and derived functional elements, ... functional elements (a.k.a. functional groups).

    31

  • 8/3/2019 Bibm11 b223 Slides(2)

    32/49

    Generative process of proposed model

    Commonly shared functional elements across samples may suggest functional similarity and biological relevance among samples. To cover such information, a genome-wide background distribution of functional elements needs to be estimated, which leads to the

    0 .

    32

  • 8/3/2019 Bibm11 b223 Slides(2)

    33/49

    Illustration of the background topic of gene

    Background Topic - Indicator of Gene OGs

    Gene OGs Indicator | Descriptions | Probability
    COG0463 | Glycosyltransferases involved in cell wall biogenesis | 0.00813
    COG0582 | Integrase | 0.00698
    COG1132 | "ABC-type multidrug transport system, ATPase and permease components" | 0.00689
    COG0438 | Glycosyltransferase | 0.00664
    COG0745 | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain | 0.00644
    COG1396 | Predicted transcriptional regulators | 0.00595
    COG0577 | ABC-type antimicrobial peptide transport system, permease component | 0.00594
    COG2207 | AraC-type DNA-binding domain-containing proteins | 0.00389
    COG3250 | Beta-galactosidase/beta-glucuronidase | 0.00344

    33

  • 8/3/2019 Bibm11 b223 Slides(2)

    34/49

    Illustration of the background topic of KEGG

    Background Topic - KEGG Pathway Indicator

    map00230 | Metabolism_Nucleotide Metabolism_Purine metabolism | 0.0333
    ... | Metabolism_Carbohydrate Metabolism_Fructose and mannose metabolism | ...
    map00500 | Metabolism_Carbohydrate Metabolism_Starch and sucrose metabolism | 0.0260
    map00240 | Metabolism_Nucleotide Metabolism_Pyrimidine metabolism | 0.0222
    map00350 | Metabolism_Amino Acid Metabolism_Tyrosine metabolism | 0.0221
    map00260 | "Metabolism_Amino Acid Metabolism_Glycine, serine and threonine metabolism" | 0.0220
    map00010 | Metabolism_Carbohydrate Metabolism_Glycolysis / Gluconeogenesis | 0.0190
    map00620 | Metabolism_Carbohydrate Metabolism_Pyruvate metabolism | 0.0176
    map00251 | Metabolism_Amino Acid Metabolism_Glutamate metabolism | 0.0169
    map00550 | Metabolism_Glycan Biosynthesis and Metabolism_Peptidoglycan biosynthesis | 0.0168

    34

  • 8/3/2019 Bibm11 b223 Slides(2)

    35/49

    Uncovered latent topics with respect to

    Illustration of the most relevant latent topics with respect to different taxa

    Taxon | Topic ID | MI Score | Topic ID | MI Score | Topic ID | MI Score
    family_Enterobacteriaceae | Topic 48 | 0.02476 | Topic 121 | 0.00915 | Topic 31 | 0.00279
    genus_Clostridium | Topic 50 | 0.01628 | Topic 153 | 0.01001 | Topic 95 | 0.00765
    genus_Bacteroides | Topic 156 | 0.03030 | Topic 77 | 0.02018 | Topic 52 | 0.01661
    phylum_Bacteroidetes | Topic 132 | 0.00476 | Topic 165 | 0.00260 | Topic 67 | 0.00257
    phylum_Firmicutes | Topic 0 | 0.01256 | Topic 99 | 0.00550 | Topic 193 | 0.00212

    ... mutual information score (MI score). The MI serves as a relevance measurement between taxa and latent topics. It shows that phylum Firmicutes is most relevant to Topic 0, 99, 193; genus Clostridium is most relevant to Topic 50, 153, 95; and genus Bacteroides is most relevant to Topic 156, 77, 52.

    35

  • 8/3/2019 Bibm11 b223 Slides(2)

    36/49

    Uncovered latent topics with respect to

    Illustration of top-ranked latent topics with respect to different microbial samples

    MH0001 | p(topic|sample) | O2.UC-1 | p(topic|sample) | V1.CD-1 | p(topic|sample)
    Topic 0 | 0.475 | Topic 0 | 0.363 | Topic 0 | 0.286
    Topic 124 | 0.116 | Topic 95 | 0.101 | Topic 61 | 0.124
    Topic 181 | 0.103 | Topic 143 | 0.062 | Topic 12 | 0.116
    Topic 159 | 0.040 | Topic 83 | 0.059 | Topic 115 | 0.050
    Topic 86 | 0.027 | Topic 65 | 0.056 | Topic 52 | 0.048
    Topic 72 | 0.018 | Topic 139 | 0.034 | Topic 32 | 0.037
    Topic 19 | 0.017 | Topic 59 | 0.033 | Topic 50 | 0.036

    Discoveries: the probability of Topic 0 in healthy and UC samples (0.475 in MH0001 and 0.363 in O2.UC-1) is much higher than that in CD samples (0.286 in V1.CD-1). This suggests that for CD samples, the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. The prevalence of Topic 95 and Topic 52 in sample O2.UC-1 and sample V1.CD-1 may indicate the existence and possibly high abundance of genus Clostridium and genus Bacteroides, respectively.

    36

  • 8/3/2019 Bibm11 b223 Slides(2)

    37/49

    Uncovered latent topics with respect to

    x

    37

  • 8/3/2019 Bibm11 b223 Slides(2)

    38/49

    Our discoveries from the results are supported by the recent discoveries in fecal microbiota studies of inflammatory bowel disease (IBD) patients [Gerber, 2007], [Harry S. et al. 2006], [Manichanh C. et al., 2006], [Walker A. et al. 2011].

    It has been reported that there is a significant reduction in the ... samples, which is consistent with our results.

    This can be explained by the fact that mucosal microbial diversity is reduced in IBDs, particularly in CD, which is associated with bacterial ... superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.

    38

  • 8/3/2019 Bibm11 b223 Slides(2)

    39/49

    Based on the functional elements derived from the non-redundant CDs catalogue, we have shown that the configuration of functional groups encoded in the gene- ... applying probabilistic topic modeling to functional elements derived from the non-redundant CDs catalogue.

    The latent topics estimated from human gut microbial samples ... study, which demonstrates the effectiveness of the proposed method.

    39

  • 8/3/2019 Bibm11 b223 Slides(2)

    40/49

    In the proposed model, the number of functional groups has to be specified in advance, or iteratively tuned by criteria such as log-likelihood and perplexity.

    ... Bayesian models (such as the HDP model) to handle the uncertainty in the number of functional groups, which provide the flexibility of modeling microbial sequences with unknown functional group numbers.

    40

  • 8/3/2019 Bibm11 b223 Slides(2)

    41/49

    Questions

    41

  • 8/3/2019 Bibm11 b223 Slides(2)

    42/49

    Backup Slides

    42


  • 8/3/2019 Bibm11 b223 Slides(2)

    43/49

    Mutual Information

    After estimating the topic model and assigning a latent topic to each ... functional element indicators (i.e. NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators) can be obtained by calculating the mutual information (MI) between functional element indicators and obtained latent topics, based on the final latent topic assignments to functional elements.

    $$MI(R_g, Z_t) = \sum_{R_g, Z_t \in \{0, 1\}} p(R_g, Z_t) \log \frac{p(R_g, Z_t)}{p(R_g) \, p(Z_t)}$$

    in which $R_g$ and $Z_t$ are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair $(R_g, Z_t)$ indicates whether a latent topic has been assigned to a specific functional element.
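A rough sketch of the MI score for one (functional element, topic) pair is given below. It assumes the final topic assignments are available as a list of `(element, topic)` pairs, estimates all probabilities by simple counting, and uses the natural logarithm; these choices and the helper name are assumptions made for illustration, since the slides do not specify them.

```python
import math


def mutual_information(assignments, element, topic):
    """MI between the indicators R_g (token is `element`) and Z_t (token has `topic`)."""
    n = len(assignments)
    if n == 0:
        return 0.0
    mi = 0.0
    for r in (True, False):
        for z in (True, False):
            joint = sum(
                1 for e, t in assignments if (e == element) == r and (t == topic) == z
            ) / n
            p_r = sum(1 for e, _ in assignments if (e == element) == r) / n
            p_z = sum(1 for _, t in assignments if (t == topic) == z) / n
            if joint > 0:
                mi += joint * math.log(joint / (p_r * p_z))
    return mi


# Toy assignments: element "g" always gets topic 0, element "h" always gets topic 1.
dependent = [("g", 0), ("g", 0), ("h", 1), ("h", 1)]
# Toy assignments in which element and topic are unrelated.
independent = [("g", 0), ("g", 1), ("h", 0), ("h", 1)]

mi_dependent = mutual_information(dependent, "g", 0)
mi_independent = mutual_information(independent, "g", 0)
```

The score is zero when the element and the topic are assigned independently and grows as the topic becomes more predictive of the element.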

    43


  • 8/3/2019 Bibm11 b223 Slides(2)

    44/49

    Likelihood Comparison

    [Equation not recoverable from the transcript: the likelihood $p(\mathbf{w} \mid \mathbf{z})$, written as an integral over the topic-specific distributions and evaluated in terms of the counts $n$, the vocabulary size $W$ and the number of topics $T$.]


  • 8/3/2019 Bibm11 b223 Slides(2)

    45/49

    Likelihood Comparison (continue)

    [Equation not recoverable from the transcript: the likelihood $p(\mathbf{w} \mid \mathbf{z})$, written as an integral over the topic-specific distributions and evaluated in terms of the counts $n$, the vocabulary size $W$ and the number of topics $T$.]


  • 8/3/2019 Bibm11 b223 Slides(2)

    46/49

    Perplexity Comparison

    The perplexity is calculated for held-out testing data. In our experiment we use a 50% subset of the functional elements as training data and the other 50% as testing data. ... from the same sample are equally split to both subsets. In practice, it is the inverse predicted model likelihood of data in held-out testing ... A smaller perplexity value indicates better model fitting.

    $$perplexity(D_{test}) = \exp\left\{ -\frac{\sum_{j=1}^{D_{test}} \log p(\mathbf{w}_j)}{\sum_{j=1}^{D_{test}} N_j} \right\}$$
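The perplexity computation itself is short once the held-out log-likelihoods are available. The sketch below assumes a list of per-sample log-likelihoods $\log p(\mathbf{w}_j)$ and a list of per-sample token counts $N_j$; the function name and the toy numbers are illustrative only.

```python
import math


def perplexity(log_likelihoods, lengths):
    """exp of the negative total held-out log-likelihood per token."""
    return math.exp(-sum(log_likelihoods) / sum(lengths))


# Sanity check: a model that assigns probability 1/4 to every token
# (uniform over 4 words) should have perplexity 4.
toy_lengths = [3, 5]
toy_log_likelihoods = [n * math.log(0.25) for n in toy_lengths]
toy_perplexity = perplexity(toy_log_likelihoods, toy_lengths)
```

A uniform model over the vocabulary has perplexity equal to the vocabulary size, which gives a convenient baseline when comparing models with different numbers of topics.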

    46


  • 8/3/2019 Bibm11 b223 Slides(2)

    47/49

    Perplexity Comparison (continue)

    47

    Dirichlet Process (DP) as a Non-Parametric Mixture Models

  • 8/3/2019 Bibm11 b223 Slides(2)

    48/49

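    Dirichlet Process (DP) as a Non-Parametric Mixture Model

    $G_0 \sim DP(\gamma, H)$, in which $\gamma$ is a concentration parameter and $H$ is a base measure defined on a sample space $\Theta$. By its definition, for any finite measurable partition of $\Theta$: $\{A_1, \ldots, A_r\}$,

    $$(G_0(A_1), \ldots, G_0(A_r)) \sim \mathrm{Dirichlet}(\gamma H(A_1), \ldots, \gamma H(A_r)).$$

    The Dirichlet process can also be constructed by stick-breaking construction as follows:

    $$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \qquad \beta_k = \beta'_k \prod_{i=1}^{k-1} (1 - \beta'_i),$$

    in which $\beta'_k \sim \mathrm{Beta}(1, \gamma)$ and $\phi_k \sim H$.

    [Figures: Dirichlet process by its definition; Dirichlet process constructed by stick-breaking construction. Data sample $x_i$ is drawn from a base distribution with associated parameters $\phi_k$.]

    48

    The weights of mixture components $\beta = \{\beta_k\}$ ($k = 1, \ldots, \infty$) are also referred to as $\beta \sim GEM(\gamma)$.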
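The stick-breaking construction can be sketched with a truncated sampler: repeatedly break off a Beta-distributed fraction of whatever stick length is left. The truncation level, the function name and the value chosen for the concentration parameter are assumptions for illustration; the true construction has infinitely many components.

```python
import random


def stick_breaking(concentration, num_sticks, seed=0):
    """Draw the first `num_sticks` GEM weights and the leftover stick length."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(num_sticks):
        # Fraction of the remaining stick, drawn from Beta(1, concentration).
        fraction = rng.betavariate(1.0, concentration)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights, remaining


weights, leftover = stick_breaking(concentration=2.0, num_sticks=20)
```

A smaller concentration value puts most of the mass on the first few components, while a larger one spreads it over many, which is how the model adapts the effective number of mixture components to the data.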

    Hierarchical Dirichlet Process (HDP)

  • 8/3/2019 Bibm11 b223 Slides(2)

    49/49

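    $G_0 \sim DP(\gamma, H)$, ... measure across the corpora and defines a set of child random probability measures $G_j \sim DP(\alpha_0, G_0)$ for each document $j$, which leads to different document-level distributions over semantic mixture components:

    $$(G_j(A_1), \ldots, G_j(A_r)) \sim \mathrm{Dirichlet}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r))$$

    Each $G_j$ can also be constructed by stick-breaking construction as:

    $$G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}$$

    in which $\pi_j = \{\pi_{jk}\}$ ($k = 1, \ldots, \infty$) specifies the weights of mixture component indicator $k$.

    Substituting the stick-breaking construction of $G_0$ and $G_j$,

    $$\left( \sum_{k \in K_1} \pi_{jk}, \ldots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dirichlet}\left( \alpha_0 \sum_{k \in K_1} \beta_k, \ldots, \alpha_0 \sum_{k \in K_r} \beta_k \right)$$

    Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, it shows that:

    $$\pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1} (1 - \pi'_{jl}), \qquad \pi'_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k, \; \alpha_0 \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right)$$

    It then follows that $\pi_j \sim DP(\alpha_0, \beta)$.

    [Figure: Stick-breaking construction of hierarchical Dirichlet process]

    49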