Final VIPER presentation at BioVis 2013

Jessie Kennedy, Martin Graham (Edinburgh Napier University)

Trevor Paterson, Andy Law (The Roslin Institute, University of Edinburgh)

Visual Cleaning of Genotype Data

• VIPER is a visualisation for spotting areas of error (impossible inheritance) in pedigree genotype datasets

Background

[Figure: pedigree structure with a genotype per individual for one marker (e.g. G | G, T | A, T | C); many more markers, with similar data per marker]

• The visualisation aggregated errors across markers and displayed them as offspring groups
  – Along with ancillary tables and bar charts

• For it to be a useful biological tool, it needed to be extended to become a data cleaning application

Background

• Data Wrangling
  – Fixing unreliable or useless data
  – General Purpose vs Specific Task

• General Purpose Tools
  – Wrangler / Google Refine
  – Tabular data

• Ours is a Specific Task
  – Remove the errors as they break further analyses
  – Fixing errors often creates new ones as our data is an inheritance graph of related data rather than a table

Background

• Error Visualisation Topics (in order of volume of work)
  – Uncertainty visualisation – show bounds of reliability
  – Missing data visualisation – is data present?
    • Usually the bane of visualisation rather than the aim
  – Correctness visualisation – is data right?

Background

• We cover missing data and correctness. For us...
  – Incorrect data – bad.
  – Missing (incomplete) data – manageable.

• Cleaning ≠ Correcting
  – Correction is preferable, but often impossible

• We clean by deleting erroneous data points and inferring data from ancestor individuals
  – We swap wrong data for missing data

Data Cleaning

• Four basic masking operations

Data Cleaning - Operations

1. Mask markers

2. Mask individuals

3. Mask single data points

4. Break relationships
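The four operations can be pictured as four independent mask sets layered over untouched genotype data. The Python sketch below is only an illustration: the class name `MaskedPedigree`, its fields and the sample identifiers are hypothetical and are not taken from VIPER's code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Genotype = Tuple[str, str]  # two alleles, e.g. ("A", "G")


@dataclass
class MaskedPedigree:
    """Genotype data plus four kinds of mask; the raw data is never modified."""
    genotypes: Dict[Tuple[str, str], Genotype]            # (individual, marker) -> genotype
    parents: Dict[str, Tuple[Optional[str], Optional[str]]]  # child -> (sire, dam)
    masked_markers: Set[str] = field(default_factory=set)
    masked_individuals: Set[str] = field(default_factory=set)
    masked_points: Set[Tuple[str, str]] = field(default_factory=set)
    broken_links: Set[Tuple[str, str]] = field(default_factory=set)  # (child, parent)

    def mask_marker(self, marker):            # operation 1
        self.masked_markers.add(marker)

    def mask_individual(self, individual):    # operation 2
        self.masked_individuals.add(individual)

    def mask_point(self, individual, marker):  # operation 3
        self.masked_points.add((individual, marker))

    def break_relationship(self, child, parent):  # operation 4
        self.broken_links.add((child, parent))

    def genotype(self, individual, marker):
        """Return the genotype as the checker should see it (None = missing)."""
        if (marker in self.masked_markers
                or individual in self.masked_individuals
                or (individual, marker) in self.masked_points):
            return None
        return self.genotypes.get((individual, marker))

    def effective_parents(self, child):
        """Return (sire, dam), with any broken link replaced by None."""
        sire, dam = self.parents.get(child, (None, None))

        def keep(parent):
            if parent is None or (child, parent) in self.broken_links:
                return None
            return parent

        return keep(sire), keep(dam)


ped = MaskedPedigree(
    genotypes={("kid", "m1"): ("A", "G"), ("kid", "m2"): ("C", "C")},
    parents={"kid": ("dad", "mum")},
)
ped.mask_point("kid", "m1")
ped.break_relationship("kid", "dad")
```

Keeping masks separate from the data means any mask can be lifted again, which matters once undo and history are needed.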

• Markers are independent of each other.
  – Masking one marker doesn’t change the errors in any other markers

• Thus markers with lots of errors can be quickly removed with no side-effect
  – Early version in VIPER hid errors (but didn’t do anything to the underlying data)

Data Cleaning - Markers
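Marker independence can be checked on a toy dataset. The Python sketch below uses a deliberately simplified inheritance test (a child must take one allele from each parent); the function names, the threshold and the data are assumptions made for illustration, not VIPER's checking algorithm.

```python
def consistent(child, sire, dam):
    """True if the child could take one allele from each parent (None = missing)."""
    if child is None:
        return True

    def can_give(parent, allele):
        return parent is None or allele in parent

    a, b = child
    return ((can_give(sire, a) and can_give(dam, b))
            or (can_give(sire, b) and can_give(dam, a)))


def errors_per_marker(data, trios, masked=frozenset()):
    """Count inheritance errors for every marker that is not masked."""
    counts = {}
    for marker, genos in data.items():
        if marker in masked:
            continue
        counts[marker] = sum(
            not consistent(genos.get(c), genos.get(s), genos.get(d))
            for c, s, d in trios
        )
    return counts


# Hypothetical data: one family, three markers.
TRIOS = [("kid1", "dad", "mum"), ("kid2", "dad", "mum")]
DATA = {
    "m1": {"dad": ("A", "A"), "mum": ("A", "A"), "kid1": ("G", "G"), "kid2": ("A", "G")},
    "m2": {"dad": ("C", "T"), "mum": ("C", "C"), "kid1": ("C", "C"), "kid2": ("T", "T")},
    "m3": {"dad": ("G", "G"), "mum": ("G", "T"), "kid1": ("G", "T"), "kid2": ("G", "G")},
}

before = errors_per_marker(DATA, TRIOS)
# Mask every marker with more than one error (the threshold is arbitrary).
to_mask = {m for m, n in before.items() if n > 1}
after = errors_per_marker(DATA, TRIOS, masked=to_mask)
```

Because each marker is checked on its own, the counts for the surviving markers in `after` are identical to their counts in `before`.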

• Wanted to adopt the same approach...
  – But something odd happened.
  – Removing individuals changes the error counts of other individuals
    • Because individuals inherit from each other
    • So, e.g., removing every individual with > 5 errors produced other individuals with > 5 errors.

Data Cleaning - Individuals
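A single-marker toy case shows how masking one individual can move an error onto a relative. The Python sketch below assumes a simplified rule: when an individual's genotype is missing, the alleles it could pass on are inferred as the union of what its own parents could pass on. All names and data are hypothetical.

```python
def transmittable(ind, genos, parents):
    """Alleles `ind` could pass on; None means unknown (any allele allowed)."""
    if ind is None:
        return None
    g = genos.get(ind)
    if g is not None:
        return set(g)
    # Genotype missing or masked: infer from the individual's own parents.
    sire, dam = parents.get(ind, (None, None))
    from_sire = transmittable(sire, genos, parents)
    from_dam = transmittable(dam, genos, parents)
    if from_sire is None or from_dam is None:
        return None
    return from_sire | from_dam


def has_error(ind, genos, parents):
    """True if `ind` cannot have taken one allele from each parent."""
    g = genos.get(ind)
    if g is None or ind not in parents:
        return False
    sire, dam = parents[ind]
    from_sire = transmittable(sire, genos, parents)
    from_dam = transmittable(dam, genos, parents)

    def ok(pool, allele):
        return pool is None or allele in pool

    a, b = g
    return not ((ok(from_sire, a) and ok(from_dam, b))
                or (ok(from_sire, b) and ok(from_dam, a)))


def individuals_in_error(genos, parents):
    return {ind for ind in genos if has_error(ind, genos, parents)}


# Grandparents GF and GM, parent P (with mate M), child C.
PARENTS = {"P": ("GF", "GM"), "C": ("P", "M")}
GENOS = {
    "GF": ("A", "A"), "GM": ("A", "A"),
    "P": ("G", "G"),   # impossible given A/A x A/A grandparents
    "M": ("G", "G"), "C": ("G", "G"),
}

before = individuals_in_error(GENOS, PARENTS)
masked = {ind: g for ind, g in GENOS.items() if ind != "P"}  # mask P
after = individuals_in_error(masked, PARENTS)
```

Here `before` flags only P. Once P is masked, its alleles are inferred from the grandparents, and the error reappears on C in `after`.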

• Some errors turned out to simply drop from one generation to the next
  – Literal “chase to the bottom”, lots of lost data

• In these situations it is often necessary to break a child/parent relationship across all markers in the pedigree
  – Which is where the fourth masking operation originates

Data Cleaning - Individuals

www.napier.ac.uk/iidi

Masking - 1

[Figure: example pedigree annotated with a genotype per individual (A/G, G/T, C/C, ...)]

Mask all errors, recheck for errors, repeat

Lose 50% of data


Masking - 2

[Figure: example pedigree annotated with a genotype per individual (A/G, G/T, C/C, ...)]

Mask errors top down, recheck for errors, repeat

Lose 25% of data
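The difference between masking every flagged genotype and masking only the topmost generation can be sketched on a toy pedigree. The Python below assumes a simplified single-marker check and hypothetical names throughout; its numbers come from the toy data and do not reproduce the percentages on the slides.

```python
def transmittable(ind, genos, parents):
    """Alleles `ind` could pass on; None means unknown (any allele allowed)."""
    if ind is None:
        return None
    g = genos.get(ind)
    if g is not None:
        return set(g)
    sire, dam = parents.get(ind, (None, None))
    from_sire = transmittable(sire, genos, parents)
    from_dam = transmittable(dam, genos, parents)
    if from_sire is None or from_dam is None:
        return None
    return from_sire | from_dam


def errors(genos, parents):
    """Individuals whose genotype cannot come from their parents."""
    def ok(pool, allele):
        return pool is None or allele in pool

    bad = set()
    for ind, (a, b) in genos.items():
        if ind not in parents:
            continue
        s = transmittable(parents[ind][0], genos, parents)
        d = transmittable(parents[ind][1], genos, parents)
        if not ((ok(s, a) and ok(d, b)) or (ok(s, b) and ok(d, a))):
            bad.add(ind)
    return bad


def generation(ind, parents):
    """0 for founders, otherwise one more than the deepest known parent."""
    known = [p for p in parents.get(ind, (None, None)) if p is not None]
    return 0 if not known else 1 + max(generation(p, parents) for p in known)


def clean(genos, parents, top_down):
    """Mask, recheck, repeat; return the set of masked individuals."""
    genos = dict(genos)  # work on a copy, leave the input untouched
    masked = set()
    while True:
        bad = errors(genos, parents)
        if not bad:
            return masked
        if top_down:
            top = min(generation(i, parents) for i in bad)
            bad = {i for i in bad if generation(i, parents) == top}
        for ind in bad:
            genos.pop(ind)  # masking = treating the genotype as missing
        masked |= bad


PARENTS = {"P": ("GF", "GM"), "C1": ("P", "M"), "C2": ("P", "M")}
GENOS = {
    "GF": ("A", "A"), "GM": ("A", "A"), "M": ("A", "A"),
    "P": ("G", "G"),                  # the one genuinely bad genotype
    "C1": ("A", "A"), "C2": ("A", "A"),
}

mask_all = clean(GENOS, PARENTS, top_down=False)
mask_top_down = clean(GENOS, PARENTS, top_down=True)
```

In this toy case masking all flagged individuals removes P, C1 and C2 (3 of 6 genotypes), while the top-down pass removes only P, because the children become consistent once P's alleles are inferred from the grandparents.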


Masking - 3

[Figure: example pedigree annotated with a genotype per individual (A/G, G/T, C/C, ...)]

Mask errors top down + cut links, recheck for errors, repeat

Lose <20% of data

• Masked and missing data are shown in a different colour to error data

Showing Missing

• Being careful not to use any other colours in the interface, we can see how cleaning is going (red vs blue)

• New masking interactions available through standard context menus (and through tables)

Representations

• With such a hypothetical / experimental method of cleaning errors, undo is a must
  – Part of Shneiderman’s mantra
  – Beyond single-step, branching history

Visual History
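A branching history differs from a plain undo stack in that undoing and then acting again keeps the abandoned branch. The Python sketch below shows one way to model this as a tree of mask states; the class names and labels are hypothetical and say nothing about how VIPER stores its history.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; the tree has parent/child cycles
class HistoryNode:
    label: str
    masked: frozenset
    parent: Optional["HistoryNode"] = None
    children: List["HistoryNode"] = field(default_factory=list)


class BranchingHistory:
    """Tree of cleaning states; undo never discards earlier branches."""

    def __init__(self):
        self.root = HistoryNode("start", frozenset())
        self.current = self.root

    def apply(self, label, newly_masked):
        """Record a masking step as a new child of the current state."""
        node = HistoryNode(label, self.current.masked | frozenset(newly_masked),
                           parent=self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def undo(self):
        """Step back to the parent state (no-op at the root)."""
        if self.current.parent is not None:
            self.current = self.current.parent
        return self.current

    def jump_to(self, node):
        """Return to any earlier state, on any branch."""
        self.current = node
        return node


history = BranchingHistory()
a = history.apply("mask marker m1", {"m1"})
b = history.apply("mask individual P", {"P"})
history.undo()                                   # back to state a
c = history.apply("break link C-P", {("C", "P")})  # new branch beside b
```

After these steps state `a` has two children, `b` and `c`, so the user can still jump back to the branch that masked the individual.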

Final Interface

• Genotype Checker vs VIPER+ interfaces

• Both run using the same underlying data checking algorithm

• Same dataset

• 11 Biologists/Geneticists/Bioinformaticians at The Roslin Institute

• Asked them to attempt a pair of representative tasks with both interfaces (split into 12 questions)

Experiment

Experiment - Objective

• Over the whole question set there was no objective difference, but one did emerge when we considered questions that involved pedigree exploration

[Chart: results per participant (participants 1 to 11) for Genotype Checker and VIPER; vertical axis 0 to 12]

Experiment - Objective

• Over the whole question set there was no objective difference, but one did emerge when we considered questions that involved pedigree exploration

[Chart: results per participant (participants 1 to 11) for Genotype Checker and VIPER; vertical axis 0 to 8]

Experiment - Subjective

Key: 1 = Strongly prefer VIPER (VP), 3 = No preference, 5 = Strongly prefer Genotype Checker (GC); Median = median response

Question                                                       1   2   3   4   5   Median
Finding structural information on a pedigree                   7   1   2   1   0   1
Finding descendants of an individual                           8   2   0   1   0   1
Finding ancestors of an individual                             7   3   1   0   0   1
Finding error information on a single individual               4   1   1   4   1   3
Finding error information on a single marker                   3   3   2   3   0   2
Distinguishing between different types of error                7   2   2   0   0   1
Tracing errors to a shared parent                              8   0   2   1   0   1
Finding error information on a single family                   7   1   2   1   0   1
Comparing errors between related families (one shared parent)  8   1   1   1   0   1
Masking errors                                                 1   2   4   3   1   3
Overall understanding of errors                                5   1   4   1   0   2
Overall ease of use                                            5   2   3   0   1   2
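Each row of counts sums to 11 responses, so the median is the 6th response when answers are sorted along the 1 to 5 scale. A small Python sketch of that calculation follows; the function name is an assumption, and for an even number of responses it would return the lower of the two middle values.

```python
def likert_median(counts):
    """Median scale point (1-based) from a list of per-point response counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses")
    middle = (total + 1) // 2  # rank of the median response
    running = 0
    for point, n in enumerate(counts, start=1):
        running += n
        if running >= middle:
            return point


# Counts copied from three rows of the table (scale points 1..5).
structural = likert_median([7, 1, 2, 1, 0])
single_individual = likert_median([4, 1, 1, 4, 1])
masking = likert_median([1, 2, 4, 3, 1])
```

For these rows the medians are 1 (structural information), 3 (error information on a single individual) and 3 (masking errors).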

• A lot of incorrect/skipped answers in both scenarios (out of 132 answers per interface: 11 participants × 12 questions)
  – GC 61/132 = 46%
  – VP 45/132 = 34%

• These users were occasional users of cleaning software but it does show that Pedigree Cleaning is hard

• Excelitis – Biologists love Excel. The first move of many was to investigate the tables of error info rather than the main pedigree visualisation

Experiment - Observations

• Thanks for listening

• Sponsored by BBSRC

• http://www.bioinformatics.roslin.ed.ac.uk/viper/

End
