Cleaning genotype data

Karl W Broman

biostat.wisc.edu/~kbroman/presentations/cleaning.pdf


Why talk about cleaning data?

• Gene hunters need high-quality data

• Genetic studies are expensive, so we should ensure that the data are as good as possible

• Cleaning data is tedious and requires care and experience

• Data cleaning is not always performed appropriately


Sources of errors

• Pedigree errors

– non-paternity

– unreported adoption/twinning

– errors in data entry

– sample mix-ups

• Genotyping errors

– errors in data entry

– misinterpretation of pattern on gel

– mutations may masquerade as errors


The cleaning process

• Verify the subjects’ sexes

• Verify relationships

– Families with many Mendelian inconsistencies

– Infer pairwise relationships using whole genome data [RelPair]

• Mendelian inheritance

– Find problem genotypes

– Identify responsible individuals [PedCheck]

• Beyond Mendel

– Unlikely multiple recombination events
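The sex check in the first step can be sketched as follows, assuming X- and Y-linked genotypes are available for each individual. Reported males should appear homozygous at X markers, and reported females should have no Y genotypes. The function name, the data layout, and the tolerance `max_het_x` are illustrative assumptions, not part of the talk.

```python
def check_sex(reported_sex, x_genotypes, y_genotypes, max_het_x=1):
    """Return a list of warnings for one individual.

    reported_sex : "M" or "F"
    x_genotypes  : list of (allele1, allele2) tuples at X markers, None if missing
    y_genotypes  : list of allele calls at Y markers, None if missing
    max_het_x    : assumed tolerance for genotyping errors in males
    """
    # Count heterozygous calls on X and successful calls on Y.
    het_x = sum(1 for g in x_genotypes if g is not None and g[0] != g[1])
    typed_y = sum(1 for a in y_genotypes if a is not None)

    warnings = []
    if reported_sex == "M":
        if het_x > max_het_x:
            warnings.append(f"male heterozygous at {het_x} X markers")
        if typed_y == 0:
            warnings.append("male with no Y genotypes")
    elif reported_sex == "F":
        if typed_y > 0:
            warnings.append(f"female with {typed_y} Y genotypes")
    return warnings
```

A warning is only a prompt to look at the sample; a mislabeled sex, a sample mix-up, and a few genotyping errors can all produce the same pattern.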


Example data

• 2141 individuals in 400 families (mostly sibships; a handful of smallish pedigrees)

• Weber screening set 9

– 366 autosomal markers

– 17 on X, 4 on Y

• 803,739 total genotypes (~3% dropped)


RelPair

For each pair of individuals, calculate

Pr(genetic data | R)

for R = MZ twins, parent/offspring, full sibs, half sibs, unrelated

Boehnke and Cox (1997) AJHG 61:423–429

Broman and Weber (1998) AJHG 63:1563–1564
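To make Pr(genetic data | R) concrete, here is a minimal sketch under strong simplifying assumptions: markers are treated as independent (no linkage), there are no genotyping errors, and allele frequencies are known. It is not RelPair's method, only an illustration of how a likelihood can be computed for each candidate relationship from the probabilities of sharing 0, 1, or 2 alleles identical by descent (IBD). All function and variable names are hypothetical.

```python
from itertools import product
from math import log10

# Probabilities of sharing 0, 1, 2 alleles IBD for each relationship R
IBD_PROBS = {
    "MZ twins":         (0.0, 0.0, 1.0),
    "parent/offspring": (0.0, 1.0, 0.0),
    "full sibs":        (0.25, 0.5, 0.25),
    "half sibs":        (0.5, 0.5, 0.0),
    "unrelated":        (1.0, 0.0, 0.0),
}

def _match(x, y, genotype):
    """True if alleles x, y form the given (sorted) genotype."""
    return tuple(sorted((x, y))) == genotype

def pair_prob_given_ibd(g1, g2, freq):
    """Pr(g1, g2 | IBD = 0, 1, 2) at one marker, by brute-force enumeration."""
    g1, g2 = tuple(sorted(g1)), tuple(sorted(g2))
    al = list(freq)
    # IBD 0: four independent alleles
    p0 = sum(freq[a] * freq[b] * freq[c] * freq[d]
             for a, b, c, d in product(al, repeat=4)
             if _match(a, b, g1) and _match(c, d, g2))
    # IBD 1: allele a is shared, b and c are independent
    p1 = sum(freq[a] * freq[b] * freq[c]
             for a, b, c in product(al, repeat=3)
             if _match(a, b, g1) and _match(a, c, g2))
    # IBD 2: both alleles shared, so the genotypes must be identical
    p2 = sum(freq[a] * freq[b]
             for a, b in product(al, repeat=2)
             if _match(a, b, g1) and _match(a, b, g2))
    return [p0, p1, p2]

def log10_likelihood(pairs, freqs, relationship):
    """Sum of log10 Pr(genotype pair | R) over markers assumed independent."""
    k = IBD_PROBS[relationship]
    total = 0.0
    for (g1, g2), freq in zip(pairs, freqs):
        p = sum(ki * pi for ki, pi in zip(k, pair_prob_given_ibd(g1, g2, freq)))
        if p == 0.0:
            return float("-inf")  # data impossible under R without errors
        total += log10(p)
    return total
```

Without an error model, a single mismatch makes MZ twins impossible and a single marker with no shared allele rules out parent/offspring, so a practical tool would need to allow for genotyping error.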


Family 239

             log10 likelihood                      IBS
rel. pair    full sibs   half sibs   unrelated      0     1     2
1342,1343       -11           0         -27        73   230    57
1342,1344       -12           0         -34        61   252    45
1343,1344         0         -32        -110        39   174   140

[Pedigree diagram: individuals 1342, 1343, 1344]


Family 403

             log10 likelihood                      IBS
rel. pair    full sibs   half sibs   unrelated      0     1     2
2190,2191         0         -26         -98        47   190   125
2190,2192         0         -28        -111        27   204   127
2191,2192         0         -48        -141        24   189   148

[Pedigree diagram: individuals 2190, 2191, 2192]
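The IBS columns in these tables count the markers at which a pair shares 0, 1, or 2 alleles identical by state. A minimal sketch of that tally, assuming each individual's genotypes are stored as a list of allele pairs in the same marker order, with `None` for missing data (the function names and data layout are hypothetical):

```python
def ibs_state(g1, g2):
    """Number of alleles (0, 1, or 2) shared identical by state at one marker."""
    a, b = g1
    c, d = g2
    # Try both ways of pairing up the alleles and keep the better match.
    return max((a == c) + (b == d), (a == d) + (b == c))

def ibs_counts(genos1, genos2):
    """Count markers with IBS 0, 1, 2 for a pair, skipping missing genotypes."""
    counts = [0, 0, 0]
    for g1, g2 in zip(genos1, genos2):
        if g1 is None or g2 is None:
            continue
        counts[ibs_state(g1, g2)] += 1
    return counts
```

Barring genotyping errors and mutations, MZ twins are IBS 2 at every marker and a parent/offspring pair is never IBS 0, which is why these counts are a useful quick check alongside the likelihoods.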


Pedigree errors

Non-paternity (10: no data on dad; 1: maybe dad’s brother)    17
MZ twins (including 1 set MZ triplets, 1 father/son)          12
Sample switch (mother/daughter)                                1
Full sibs reported to be half sibs                             1
Completely unrelated                                           2
Total                                                         33


Apparent error rate (via unreported twins)

                           IBS
Fam   Individuals       0     1     2
 25   126,131           0     1   348
 81   396,397           2    36   298
107   537,539           0     1   346
127   628,629           0     0   357
149   742,744,745       0     3   348
173   877,878           1    27   323
212   1127,1129         0     0   361
231   1258,1259         0     2   349
251   1411,1412         0     0   359
295   1628,1630         0     0   358
330   1820,1826         0     0   351
393   2139,2140         0     4   343

Overall error rate = 0.91%
Bad samples = 5.1%
Good samples = 0.15%
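The slide does not say how the error rates were computed. One simple reading, sketched below as an assumption, is that MZ twins should be IBS 2 everywhere, so every IBS 0 or IBS 1 marker reflects one erroneous genotype out of the two compared. Pooling the rows this way gives about 0.91%, in line with the overall figure; the split into bad and good samples is not attempted here. The triplet row is treated like any other row.

```python
# (family, IBS0, IBS1, IBS2) rows from the unreported-twins table
twins = [
    (25, 0, 1, 348), (81, 2, 36, 298), (107, 0, 1, 346), (127, 0, 0, 357),
    (149, 0, 3, 348), (173, 1, 27, 323), (212, 0, 0, 361), (231, 0, 2, 349),
    (251, 0, 0, 359), (295, 0, 0, 358), (330, 0, 0, 351), (393, 0, 4, 343),
]

def genotype_error_rate(rows):
    """Per-genotype error rate, assuming one bad genotype per mismatched marker."""
    mismatches = sum(ibs0 + ibs1 for _, ibs0, ibs1, _ in rows)
    markers = sum(ibs0 + ibs1 + ibs2 for _, ibs0, ibs1, ibs2 in rows)
    # Each compared marker involves two genotypes, one from each twin.
    return mismatches / (2 * markers)

overall = genotype_error_rate(twins)
print(f"overall error rate = {100 * overall:.2f}%")
```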


PedCheck

1. Check on nuclear family level
   – easy inconsistencies

2. Genotype elimination
   – find all inconsistencies

3. Determine critical genotypes
   – removing one individual’s genotype eliminates inconsistency

4. For critical genotypes, calculate odds ratio to determine the most likely error

O’Connell and Weeks (1998) AJHG 63:259–266
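Steps 1 and 3 can be illustrated for a single nuclear family at one marker. This is a brute-force sketch with hypothetical function names and made-up genotypes, not PedCheck's implementation; it does not cover genotype elimination on general pedigrees (step 2) or the odds-ratio calculation (step 4). A missing genotype is written as `None`.

```python
from itertools import combinations_with_replacement

def child_ok(child, father, mother):
    """True if the child can inherit one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def family_consistent(father, mother, children):
    """Mendelian check of one nuclear family; None means 'not genotyped'."""
    kids = [c for c in children if c is not None]
    if not kids:
        return True  # parents alone cannot be inconsistent
    typed = [g for g in (father, mother, *kids) if g is not None]
    alleles = sorted({a for g in typed for a in g})
    # An untyped parent may carry any genotype over the observed alleles.
    candidates = list(combinations_with_replacement(alleles, 2))
    fathers = [father] if father is not None else candidates
    mothers = [mother] if mother is not None else candidates
    return any(all(child_ok(c, f, m) for c in kids)
               for f in fathers for m in mothers)

def critical_individuals(father, mother, children):
    """Members whose genotype, when blanked alone, restores consistency."""
    members = {"father": father, "mother": mother}
    members.update({f"child{i + 1}": c for i, c in enumerate(children)})
    critical = []
    for name, genotype in members.items():
        if genotype is None:
            continue
        trial = dict(members)
        trial[name] = None
        kids = [trial[f"child{i + 1}"] for i in range(len(children))]
        if family_consistent(trial["father"], trial["mother"], kids):
            critical.append(name)
    return critical

# Made-up genotypes (allele sizes) for illustration only
father, mother = (267, 236), (259, 256)
children = [(267, 259), (236, 256), (255, 259)]
print(family_consistent(father, mother, children))
print(critical_individuals(father, mother, children))
```

When several individuals are critical, the data alone cannot say which genotype is wrong, which is where the odds ratio in step 4 comes in.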


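D1S2141, Family 403

[Figure: pedigree of Family 403 with genotypes at marker D1S2141. Genotype labels in extraction order, pedigree positions lost: 267/236, 267/256, 267/259, 267/256, 259/236, 259/259, 259/259, 267/255, 255/255, 255/244, 267/255, 259/—, 255/244, 259/256]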


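GATA49D12, Family 195

[Figure: pedigree of Family 195 with genotypes at marker GATA49D12. Genotype labels in extraction order, pedigree positions lost: 205/193, 205/193, 205/189, 209/193, 209/205, 209/193, 209/193, 217/209, 213/193, 209/193, 209/205, 209/205, 209/193]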


The result of the Mendel checks

• Removed 2,216/801,848 genotypes (~0.3%)

• Mismatches in unreported twins:

– 51/3,869 (~1.3%) prior to Mendel checks

– 50/3,868 (~1.3%) after Mendel checks


Beyond Mendel

Individual 195-1032, chromosome 9

CRI-MAP chrompic output:

ii--ii-ii-io-oo-ioiiiiooooooo-o-

(scale: 15 cM)
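Tight clusters of apparent recombination events like those in the chrompic string for individual 195-1032 can be flagged automatically. The sketch below assumes that `i` and `o` mark the two grandparental origins and that `-` marks an uninformative marker; the function names and the `max_len` threshold are hypothetical. It counts switches between `i` and `o` and lists short interior runs, which are candidates for genotyping errors because each implies a double recombinant over a small interval.

```python
from itertools import groupby

def phase_runs(chrompic):
    """Collapse a chrompic-style string into (symbol, length) runs, skipping '-'."""
    informative = [c for c in chrompic if c != "-"]
    return [(symbol, len(list(group))) for symbol, group in groupby(informative)]

def count_switches(chrompic):
    """Number of apparent recombination events (changes between i and o)."""
    return max(len(phase_runs(chrompic)) - 1, 0)

def short_interior_runs(chrompic, max_len=1):
    """Runs of at most max_len markers flanked on both sides by the other phase."""
    runs = phase_runs(chrompic)
    return [run for run in runs[1:-1] if run[1] <= max_len]

chromosome9 = "ii--ii-ii-io-oo-ioiiiiooooooo-o-"
print(count_switches(chromosome9), short_interior_runs(chromosome9))
```

For this string the sketch finds five apparent switches, two of which come from single-marker runs. Whether a flagged genotype is really an error still needs a human look at the marker spacing and the raw data.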


Conclusions

• Cleaning data requires care and experience

• Fix pedigree errors prior to removing genotypes

• The process could be more automated, but for now it requires a human brain

• Biologists should learn to program (especially Perl)

• Analysts should participate in data cleaning

• Data on diallelic markers will require new techniques