model for estimating diversity presentation

Model for Estimating Population Diversity as the Prediction of Sample needed for full Coverage with Applications in Bioinformatics Torres, David A., Pericchi, Luis R. Department of Mathematics University of Puerto Rico, Rio Piedras.

Upload: david-torres

Post on 04-Jul-2015


DESCRIPTION

A simple probability model for estimating population diversity: a hybrid model using the estimators of I. J. Good and G. H. Toulmin.

TRANSCRIPT

Page 1: Model For Estimating Diversity Presentation

Model for Estimating Population Diversity as the Prediction of Sample needed for full Coverage

with Applications in Bioinformatics

Torres, David A., Pericchi, Luis R.

Department of Mathematics

University of Puerto Rico, Rio Piedras.

Page 2: Model For Estimating Diversity Presentation

Abstract

There exist several methods for estimating community diversity using coverage (Bunge and Fitzpatrick 1993). Biologists and environmental scientists have challenged statisticians to solve this problem. Here we present an approach to the estimation that uses a coverage model (Good, I. J., 1953) and a population estimator (Good, I. J. and G. H. Toulmin, 1956). We apply the method to data on the microbial diversity present in the crop of the hoatzin, obtained by molecular analysis of cloned 16S rRNA genes.

Page 3: Model For Estimating Diversity Presentation

Introduction

• Estimating the number of species in a community is a classical problem in ecology, biogeography, and conservation biology, and parallel problems arise in many other disciplines. This research topic has been extensively discussed in the literature; see Bunge and Fitzpatrick (1993) and Seber (1982, 1986, 1992) for a review of the historical and theoretical development.

• Ecologists and other biologists have long recognized that there are undiscovered species in almost every survey or species inventory. A parallel problem is to answer how many words a particular author knew; see Efron, B. and Thisted, R. (1975).

Page 4: Model For Estimating Diversity Presentation

• A random sample is taken from a community. We will refer to this sample as the basic sample.

• Our intention is to calculate an estimator for the coverage of the community using the information provided in the basic sample, and then to estimate the number of species in the community.

• Moreover, we intend to describe a method that provides an estimator of the amount of additional data needed to reach total coverage of the community.

• An example will be presented to apply the theory.

Page 5: Model For Estimating Diversity Presentation

Methods

• A random sample of size N is drawn from a community. Let $n_r$ be the number of distinct species represented exactly r times in the sample; then

$$\sum_{r=1}^{\infty} r\,n_r = N \qquad (1)$$
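As a minimal sketch of how the frequency counts $n_r$ can be tabulated from a basic sample, the following Python snippet uses a made-up list of species labels (the labels and the helper name `frequency_counts` are illustrative assumptions, not part of the presentation) and checks that the counts satisfy the identity above.

```python
from collections import Counter


def frequency_counts(sample):
    """Return {r: n_r}, where n_r is the number of species seen exactly r times."""
    species_counts = Counter(sample)                # species -> times observed
    return dict(Counter(species_counts.values()))   # r -> n_r


# Hypothetical basic sample of N = 10 individuals (labels are made up).
sample = ["a", "a", "a", "b", "b", "c", "c", "d", "e", "f"]
n = frequency_counts(sample)

# The sum of r * n_r over all r recovers the sample size N.
N = sum(r * n_r for r, n_r in n.items())
print(n, N)
```

The later quantities in the method depend only on the table `{r: n_r}`, so the raw sample is not needed once the counts are known.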

Page 6: Model For Estimating Diversity Presentation

• We shall be concerned with $q_r$, the community frequency of an arbitrary species that is represented r times in the basic sample.

• Let $E(q_r)$ be the expected value of $q_r$. A main result used by Good (1953) is that

$$E(q_r) = \frac{r^*}{N} \qquad (2)$$

where $r^* = (r+1)\,\dfrac{n_{r+1}}{n_r}$.
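To make equation (2) concrete, here is a small Python sketch that evaluates the adjusted count $r^* = (r+1)\,n_{r+1}/n_r$ and the estimate $r^*/N$ on hypothetical frequency counts. The function name and the data are assumptions for illustration. The raw ratio is used as written on the slide; for large r, where $n_{r+1}$ is often zero, the unsmoothed estimate is not reliable.

```python
def good_turing_frequency(n, r, N):
    """Estimate E(q_r) = r*/N, with r* = (r + 1) * n_{r+1} / n_r (equation (2))."""
    if n.get(r, 0) == 0:
        raise ValueError(f"no species observed exactly {r} times")
    r_star = (r + 1) * n.get(r + 1, 0) / n[r]
    return r_star / N


# Hypothetical frequency counts {r: n_r}; here N = 1*3 + 2*2 + 3*1 = 10.
n = {1: 3, 2: 2, 3: 1}
N = sum(r * n_r for r, n_r in n.items())

for r in sorted(n):
    print(r, good_turing_frequency(n, r, N))
```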

Page 7: Model For Estimating Diversity Presentation

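• This can be generalized to give the higher moments of $q_r$. As a matter of fact,

$$E(q_r^m) = \frac{r_m}{N^m}\,\frac{n_{r+m}}{n_r} \qquad (3)$$

where $r = 1, 2, 3, \ldots$; $m = 1, 2, 3, \ldots$ and $t_m = \prod_{i=1}^{m}(t+i)$, so that $r_m = (r+1)(r+2)\cdots(r+m)$.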

Page 8: Model For Estimating Diversity Presentation

• Recursively, we can rewrite (3) as

$$E(q_r^m) \approx \prod_{i=r}^{r+m-1} E(q_i).$$

• Moreover, the variance of $q_r$ is approximately:

$$V(q_r) = \frac{(r+1)(r+2)\,n_{r+2}}{N^2\,n_r} - \left(\frac{(r+1)\,n_{r+1}}{N\,n_r}\right)^2$$

• Note that $n_r \le \sum_{i=1}^{\infty} i\,n_i = N$; then we have that

$$\frac{(r+1)\,n_{r+1}}{N} \le r^*$$
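The recursive form and the variance can be checked by a short worked derivation. The sketch below assumes only Good's result $E(q_i) = (i+1)\,n_{i+1}/(N\,n_i)$ from equation (2): the product telescopes because each $n_{i+1}$ in a numerator cancels the $n_{i+1}$ in the next denominator, and the variance is the second moment minus the squared first moment.

```latex
\prod_{i=r}^{r+m-1} E(q_i)
  = \prod_{i=r}^{r+m-1} \frac{(i+1)\,n_{i+1}}{N\,n_i}
  = \frac{(r+1)(r+2)\cdots(r+m)}{N^{m}}\,\frac{n_{r+m}}{n_r}

V(q_r) \approx E(q_r^{2}) - \bigl[E(q_r)\bigr]^{2}
  = \frac{(r+1)(r+2)\,n_{r+2}}{N^{2}\,n_r}
  - \left(\frac{(r+1)\,n_{r+1}}{N\,n_r}\right)^{2}
```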

Page 9: Model For Estimating Diversity Presentation

An estimator of the expected total chance of all species that are each represented $r$ $(r \ge 1)$ times in the basic sample is

$$\frac{(r+1)\,n_{r+1}}{N}$$

Also, the expected total chance of all species that are represented $r$ times or more in the sample is approximately

$$\frac{1}{N}\sum_{i=r+1}^{\infty} i\,n_i$$

In particular, note that the expected total chance of all species represented in the sample is approximately

$$\frac{1}{N}\sum_{k=2}^{\infty} k\,n_k = 1 - \frac{E(n_1)}{N} \qquad (4)$$

Page 10: Model For Estimating Diversity Presentation

• Hence, the total coverage of the sample (i.e. the proportion of the community represented in the sample, which is the sum of the population frequencies of the species represented) is approximately

$$1 - \frac{E(n_1)}{N} = 1 - \frac{n_1}{N} \qquad (5)$$
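A minimal Python sketch of the coverage estimate $1 - n_1/N$ follows. The frequency counts are hypothetical and the function name `good_coverage` is an assumption made for illustration.

```python
def good_coverage(n):
    """Good's coverage estimate 1 - n_1/N from frequency counts {r: n_r}."""
    N = sum(r * n_r for r, n_r in n.items())   # sample size
    return 1.0 - n.get(1, 0) / N               # n_1 = number of singletons


# Hypothetical counts: 3 singletons, 2 doubletons, 1 species seen three times.
n = {1: 3, 2: 2, 3: 1}
coverage = good_coverage(n)
print(coverage)
```

Only the number of singletons and the sample size enter the estimate, so a sample with no singletons is reported as fully covered.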

Page 11: Model For Estimating Diversity Presentation

The chance that the next member of the community sampled will belong to a new species is estimated as $n_1/N$.

Let us write the total number of distinct species in the sample as

$$d = \sum_{x=1}^{\infty} n_x$$

and suppose that the total number of distinct species in the community is a known finite number $s$. Then the number of species not represented in the sample is given by $n_0 = s - d$.

Page 12: Model For Estimating Diversity Presentation

• Then let $p_\mu$ $(\mu = 1, 2, 3, \ldots, s)$ be the population frequencies of the species. As in Good (1953), equation (10),

$$E(n_r) = \frac{N!}{r!\,(N-r)!}\sum_{\mu=1}^{s} p_\mu^{\,r}\,(1-p_\mu)^{N-r} \qquad (6)$$

Page 13: Model For Estimating Diversity Presentation

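For a sample of size $\lambda N$, we have similarly,

$$E(n_r(\lambda)) = \frac{(\lambda N)!}{r!\,(\lambda N-r)!}\sum_{\mu=1}^{s} p_\mu^{\,r}\,(1-p_\mu)^{\lambda N-r}$$

assuming $p_\mu \le \tfrac{1}{2}$ for all $\mu$, so that the expansion in powers of $p_\mu/(1-p_\mu)$ converges. Then

$$\begin{aligned}
E(n_r(\lambda))
&= \frac{(\lambda N)!}{r!\,(\lambda N-r)!}\sum_{\mu=1}^{s} p_\mu^{\,r}(1-p_\mu)^{N-r}\left(1+\frac{p_\mu}{1-p_\mu}\right)^{-(\lambda-1)N} \\
&= \sum_{\mu=1}^{s}\sum_{i=0}^{\infty} (-1)^i\,\frac{(\lambda N)!}{r!\,(\lambda N-r)!}\,\frac{((\lambda-1)N+i-1)!}{i!\,((\lambda-1)N-1)!}\;p_\mu^{\,r}(1-p_\mu)^{N-r}\,p_\mu^{\,i}(1-p_\mu)^{-i} \\
&= \sum_{\mu=1}^{s}\sum_{i=0}^{\infty} (-1)^i\,\frac{(\lambda N)!}{r!\,(\lambda N-r)!}\,\frac{((\lambda-1)N+i-1)!}{i!\,((\lambda-1)N-1)!}\;p_\mu^{\,r+i}(1-p_\mu)^{N-r-i} \\
&\approx \sum_{i=0}^{\infty} (-1)^i\,\frac{(r+i)!}{r!\,i!}\,\frac{(\lambda N)^{r}\,((\lambda-1)N)^{i}}{N^{r+i}}\;E(n_{r+i}) \\
&= \lambda^{r}\sum_{i=0}^{\infty} (-1)^i\,\frac{(r+i)!}{r!\,i!}\,(\lambda-1)^{i}\,E(n_{r+i}) \qquad (7)
\end{aligned}$$

The approximation step uses (6) with $r+i$ in place of $r$, that is $\sum_{\mu} p_\mu^{\,r+i}(1-p_\mu)^{N-r-i} = \frac{(r+i)!\,(N-r-i)!}{N!}\,E(n_{r+i})$, and replaces each ratio of factorials by its leading power for large $N$.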
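The Good and Toulmin series for the expected number of species seen exactly r times in a sample enlarged by a factor $\lambda$, namely $\lambda^r \sum_i (-1)^i \binom{r+i}{i} (\lambda-1)^i n_{r+i}$, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the observed $n_{r+i}$ stand in for the expectations $E(n_{r+i})$, the series is truncated at the largest observed r, and the function name and data are hypothetical.

```python
from math import comb


def good_toulmin_expected_nr(n, r, lam):
    """Approximate E(n_r(lambda)) by the truncated Good-Toulmin series.

    n   : observed frequency counts {r: n_r}, used in place of E(n_r)
    lam : enlargement factor (the new sample has size lam * N)
    """
    max_r = max(n, default=0)
    total = 0.0
    for i in range(0, max_r - r + 1):
        # comb(r + i, i) equals (r + i)! / (r! i!)
        total += (-1) ** i * comb(r + i, i) * (lam - 1) ** i * n.get(r + i, 0)
    return lam ** r * total


# Hypothetical frequency counts.
n = {1: 3, 2: 2, 3: 1}
print(good_toulmin_expected_nr(n, 1, 1.0))   # lam = 1 returns the observed n_1
print(good_toulmin_expected_nr(n, 1, 1.5))
```

With r = 0 and no entry for $n_0$ in the table, the function returns the predicted change in the number of unseen species, which is negative when new species are expected to appear.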

Page 14: Model For Estimating Diversity Presentation

• For the case r = 0, we do not need to assume the value of s, since this assumption is not required to write

$$\hat{d}(\lambda) = d - \sum_{i=1}^{\infty} (-1)^i\,(\lambda-1)^i\,n_i = s - n_0(\lambda) \qquad (8)$$

• We may be particularly interested in the coverage of the community. Using equations (5) and (7) with r = 1, the expected coverage is approximately

$$1 - \frac{n_1(\lambda)}{\lambda N} \approx 1 - \frac{1}{N}\left[\,n_1 - 2(\lambda-1)\,n_2 + 3(\lambda-1)^2\,n_3 - \cdots\right] \qquad (9)$$

At $\lambda = 1$ both sides reduce to Good's estimate $1 - n_1/N$.
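Equations (8) and (9) can be evaluated directly from the frequency counts. The following Python sketch assumes hypothetical counts and uses the observed $n_i$ in the series; the function names are illustrative only.

```python
def expected_distinct_species(n, lam):
    """Equation (8): d - sum_{i>=1} (-1)^i (lam - 1)^i n_i."""
    d = sum(n.values())
    correction = sum((-1) ** i * (lam - 1) ** i * n_i for i, n_i in n.items())
    return d - correction


def expected_coverage(n, lam):
    """Equation (9): 1 - (1/N) [n_1 - 2(lam-1) n_2 + 3(lam-1)^2 n_3 - ...]."""
    N = sum(r * n_r for r, n_r in n.items())
    series = sum(
        (-1) ** (r - 1) * r * (lam - 1) ** (r - 1) * n_r for r, n_r in n.items()
    )
    return 1.0 - series / N


# Hypothetical frequency counts {r: n_r} with r >= 1 (N = 10, d = 6).
n = {1: 3, 2: 2, 3: 1}
for lam in (1.0, 1.25, 1.5):
    print(lam, expected_distinct_species(n, lam), expected_coverage(n, lam))
```

At `lam = 1.0` the two functions return the observed number of species and Good's coverage, which gives a quick sanity check before extrapolating.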

Page 15: Model For Estimating Diversity Presentation

• The expected number of distinct species represented is approximately

$$d + (\lambda-1)\,n_1 - (\lambda-1)^2\,n_2 + \cdots$$

• We use the coverage to estimate the value of $\lambda$ and, from it, the sample size needed to reach 100% coverage. Equation (9) is what we call the Good-Toulmin model, because it merges the two models proposed by these authors.
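One way the value of $\lambda$ might be found is sketched below. It is an assumption about the procedure, not the authors' stated algorithm: take the species count implied by full coverage as $d$ divided by Good's coverage, then scan $\lambda$ upward until the predicted number of distinct species reaches that target. The data, function names, grid step, and the cap at $\lambda = 2$ (beyond which the alternating series is generally considered unreliable) are all illustrative choices.

```python
def expected_distinct_species(n, lam):
    """Predicted number of distinct species in a sample of size lam * N."""
    d = sum(n.values())
    return d - sum((-1) ** i * (lam - 1) ** i * n_i for i, n_i in n.items())


def smallest_lambda_reaching(n, target, step=0.01, lam_max=2.0):
    """Return the first grid value of lambda whose prediction reaches target, else None."""
    steps = int(round((lam_max - 1.0) / step))
    for k in range(steps + 1):
        lam = 1.0 + k * step
        if expected_distinct_species(n, lam) >= target:
            return lam
    return None


# Hypothetical frequency counts: N = 24 individuals, d = 9 species.
n = {1: 4, 2: 1, 3: 1, 5: 3}
N = sum(r * n_r for r, n_r in n.items())
d = sum(n.values())

coverage = 1.0 - n[1] / N          # Good's coverage estimate
target = d / coverage              # species count implied by 100% coverage
lam = smallest_lambda_reaching(n, target)

if lam is None:
    print("target not reached for lambda <= 2")
else:
    print("lambda:", lam, "additional individuals:", (lam - 1.0) * N)
```

A grid scan is used instead of a root finder because the truncated series is not guaranteed to be monotone in $\lambda$; returning `None` makes the failure case explicit.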

Page 16: Model For Estimating Diversity Presentation

Application

• The hoatzin is a South American leaf-eating bird. Its uniqueness lies in its particular foregut (crop), the only one known in the avian class.

• Forestomach compartmentalization allows mammal herbivores to be nourished on microbial fermentation products and microbial biomass. Bacteria are largely responsible for fermentation of dietary components, and bacterial cells are themselves subject to digestion by gastric lysozyme expressed in the abomasum of ruminants.

Page 17: Model For Estimating Diversity Presentation

• The evolutionary pressure towards foregut specialization in herbivores was presumably exerted by indigestible plant polymers (cellulose), so that production of microbial biomass at the expense of these indigestible materials has clear advantages.

• In the hoatzin, a preliminary characterization of the crop microflora was done by culture (Domínguez-Bello et al., 1993). In this study we aim to characterize the bacterial diversity in the crop of the hoatzin by a molecular analysis of cloned 16S rRNA genes.

Page 18: Model For Estimating Diversity Presentation

Results

• For the 69 O.T.U.s obtained, Good's method (left side of equation (9)) indicated a coverage of diversity of 77%.

• This means that 100% diversity will correspond to 90 O.T.U.s. Given that, applying the Good and Toulmin model (figure 2), we estimate λ = 1.5, which means that we need 98 (300 - 202) additional clones to obtain the 21 (90 - 69) O.T.U.s needed to cover 100% of the diversity.
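The arithmetic behind these figures can be retraced with the numbers reported on the slides (202 clones, 69 O.T.U.s, 77% coverage, λ = 1.5). The short Python check below is only a worked example; the variable names are illustrative. It shows that 69 / 0.77 rounds to 90, that 90 - 69 works out to 21 unseen O.T.U.s, and that 1.5 × 202 is 303, which the slides round to 300 clones.

```python
N = 202          # clones sequenced (from the slides)
d = 69           # O.T.U.s observed (from the slides)
coverage = 0.77  # Good's coverage estimate (from the slides)
lam = 1.5        # enlargement factor estimated on the slides

total_otus = round(d / coverage)   # O.T.U.s implied by 100% coverage
missing_otus = total_otus - d      # O.T.U.s still unseen
target_sample = lam * N            # enlarged sample size before rounding

print(total_otus, missing_otus, target_sample)
```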

Page 19: Model For Estimating Diversity Presentation

Conclusions (Application)

• The estimate indicates that 300 clones are needed to represent 100% of the sample diversity. 99% of the clones and 88% of the O.T.U.s analyzed are unidentified species.

• Based on 202 sequences yielding 69 O.T.U.s, the Good and Toulmin estimator indicates a coverage of 77% of the total diversity.

Page 20: Model For Estimating Diversity Presentation
Page 21: Model For Estimating Diversity Presentation
Page 22: Model For Estimating Diversity Presentation

Future Research

• There are many models and procedures for calculating coverage; instead of using Good's estimator of coverage, it would be interesting to try another approach. Perhaps a Poisson process or a multinomial approach would give better estimators. Another approach could be Bayesian inference, without assuming a known distribution, using a Metropolis-Hastings procedure.

• The importance of this type of problem lies in its role in experimental design.

• Good once stated: “I don’t believe it is usually possible to estimate the number of unseen species … but only an approximate lower bound to that number.” We will keep on the road.

Page 23: Model For Estimating Diversity Presentation

Literature cited

• Godoy, F. (1), Gao, Z. (2), Pei, Z. (2), Zhou, M. (2), Garcia-Amado, M. A. (3), Pericchi, L. R. (4), Torres, D. (4), Michelangeli, F. (3), Blaser, M. J. (2), Domínguez-Bello, M. G. (1). High bacterial diversity in the forestomach of the hoatzin is revealed by molecular analysis of 16S rRNA genes. (1) Department of Biology, University of Puerto Rico, Rio Piedras, San Juan, PR 00931. (2) Departments of Medicine, Pathology and Microbiology, New York University School of Medicine, New York, NY 10016. (3) Venezuelan Institute of Scientific Research, CBB, Caracas, Venezuela. (4) Department of Mathematics, University of Puerto Rico, Rio Piedras, San Juan, PR 00931.

• Chao, A. and Lee, S., 1992. Estimating the Number of Classes via Sample Coverage. Journal of the American Statistical Association 87: 210-217.

• Domínguez-Bello, M. G., Lovera, M., Suarez, P. and Michelangeli, F., 1993. Microbial inhabitants in the crop of the hoatzin (Opisthocomus hoazin): the only foregut fermented avian. Physiol. Zool. 66: 374-383.

• Good, I. J. and Toulmin, G. H., 1956. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43: 45-63.

• Good, I. J., 1953. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 40: 237-264.