argyle: an package for analysis of illumina genotyping arrayscsbio.unc.edu/muga/argyle.pdf · 2015....

argyle: an R package for analysis of Illuminagenotyping arraysAndrew P Morgan∗,1

∗Department of Genetics, University of North Carolina, Chapel Hill, NC

ABSTRACT Genotyping microarrays are an important and widely-used tool in genetics. I present argyle, anR package for analysis of genotyping array data tailored to Illumina arrays. The goal of the argyle package isto provide simple, expressive tools for non-expert users to perform quality checks and exploratory analyses ofgenotyping data. To these ends, the package consists of a suite of quality-control functions, normalizationprocedures, and utilities for visually and statistically summarizing such data. Format-conversion tools allowinteroperability with popular software packages for analysis of genetic data including PLINK, R/qtl and DOQTL.Detailed vignettes demonstrating common use cases are included as supporting information. argyle bridgesthe gap between the low-level tasks of quality control and high-level tasks of genetic analysis. It is freelyavailable at https://github.com/andrewparkermorgan/argyle.

KEYWORDS

SNP microarraysgenotypingsoftware

INTRODUCTION

High-throughput genotyping of tens of thousands of SNPs usingmicroarrays is common practice in both laboratory and populationgenetics. Genotypes at a dense panel of biallelic markers with a lowrate of missing data are a valuable resource for breeding, marker-assisted selection, genetic mapping and analyses of populationstructure. The Illumina Infinium system is one popular and cost-effective (∼ $100/sample) platform. Custom Illumina arrays areavailable for many organisms of research, agricultural or ecologicalinterest including mouse (Collaborative Cross Consortium 2012;Morgan and Welsh 2015), dog, cat (Willet and Haase 2014), chicken,cow, pig, horse, sheep (Kijas et al. 2009), salmon (Johnston et al.2013) and cotton (Hulse-Kemp et al. 2015).

Infinium arrays consist of many thousands of short invari-ant oligonucleotide probes conjugated to silica beads. SampleDNA is hybridized to the probes and a single-base, hybridization-dependent extension reaction is performed at the target SNP. Al-ternate alleles (herein denoted A and B) are labelled with differentfluorophores (Steemers et al. 2006). Raw fluorescence intensityfrom the two color channels is processed into a discrete genotypecall at each SNP, and both the total intensity from both channelsand the relative intensity in one channel versus the other are infor-mative for copy number.

Copyright © 2015 Andrew P Morgan et al.Manuscript compiled: Friday 16th October, 2015%1Please address correspondence to [email protected]

Many tools, both open-source and proprietary, already exist forpost-processing of raw hybridization intensity data. R packagesinclude beadarray (Dunning et al. 2007), lumi (Du et al. 2008) andcrlmm (Ritchie et al. 2009) among others. Illumina’s proprietaryBeadStudio software is widely-used by commercial laboratoriesand core facilities. BeadStudio applies a six-step “affine normaliza-tion” (Peiffer 2006) which pools data across many probes and manyarrays. Intensities from the two color channels (herein denoted xand y) are transformed to lie in the standard coordinate plane, withhomozygous genotypes near the x and y axes, heterozygous geno-types approximately on the x = y diagonal, and R = x + y ≈ 1.Biallelic genotypes are then called by clustering in this space.

Fewer tools exist for downstream quality control, exploratoryanalysis and interpretation of genotype calls jointly with underly-ing hybridization intensity data. To fill this gap I present argyle, apackage for the R statistical computing environment. The purposeof argyle is to provide simple and flexible tools for programmaticaccess to data from SNP arrays, with an emphasis on visualization.Although some functionality is tailored to Illumina arrays, manyof the features are general enough to accommodate any datasetwhich can be expressed as a matrix of genotypes at biallelic mark-ers. The main text of this paper outlines the key features of argyle;detailed code vignettes are provided as supplementary material.

DESIGN

The design of argyle is inspired by the PLINK software (https://www.cog-genomics.org/plink2/; Purcell et al. (2007)). A PLINK fileset

Volume X | October 2015 | 1

MULTIPARENTAL POPULATIONS

https://github.com/andrewparkermorgan/argyle

[email protected]

https://www.cog-genomics.org/plink2/

https://www.cog-genomics.org/plink2/

has three parts: a genotype matrix, a marker map and a “pedigree”(sample and family metadata) file. Likewise the central data struc-ture in argyle (the genotypes object) stores a matrix of genotypecalls, and hybridization-intensity data when available, in parallelwith a marker map and sample metadata. A genotypes object istherefore a self-contained and largely self-describing representa-tion of a genotyping dataset. The genotypes object is described infurther detail in File S1.

This package explicitly favors simplicity and readability of codeover raw efficiency. It is appropriate for the “medium-sized” data –tens of thousands of markers and hundreds of individuals – reg-ularly encountered in experimental contexts. Users with largerdatasets such as those routinely collected in human genetics – mil-lions of markers and thousands of individuals – which do notfit comfortably in memory should explore more sophisticated Rpackages (such as the GenABEL suite (http://www.genabel.org/)).

QUALITY CONTROL

Removal of poorly-performing markers and poor-quality samplesis an important precursor to genetic analysis. Failed arrays arecharacterized by aberrant intensity distributions, excess of miss-ing and heterozygous calls, or both. A summary plot (Figure 1)facilitates the identification of low-quality samples. Concordancebetween biological sex and sex inferred from calls on the sex chro-mosomes is also useful for identifying contaminated or swappedsamples. Failed arrays can be flagged and removed using global orsubgroup-specific thresholds. See File S2 for a worked example.

0.0

20.0

40.0

60.0

80.0

0.0

0.5

1.0

1.5

samples (sorted by median intensity)

# ca

lls (x

1000

)in

tens

ity q

uant

iles

genotypecall

A

B

H

N

quantile

20%

40%

60%

80%

Figure 1 Quality-control summary plot. Disribution of genotypecalls is shown in upper panel, and a contour plot of intensitydistributions across samples is shown in lower panel. Samplesfailing quality thresholds are marked with an open dot in theupper panel.

In addition to global summaries, argyle provides easy accessto hybridization intensity data from individual probes. Inspectionof “cluster plots” for individual probes is useful for confirming

the accuracy of genotype calls and diagnosing poorly-performingmarkers (Figure 2). A dotplot (Figure 3) permits direct inspectionof genotype calls at multiple markers over small genomic regions.

UNCHS043509 UNCJPD006519

0.0

0.5

1.0

1.5

0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5x

y

call

0

1

2

Figure 2 Cluster plots for individual markers. Each point repre-sents a single sample; points are colored according to genotypecall, expressed as number of copies of the non-reference allele.The marker on the left performs as expected: the three canonicalclusters are present in the expected locations. The marker on theright may be genotyped incorrectly: there is not homozygous ref-erence cluster, and the nominally heterozygous samples fall intotwo clusters. This marker merits further inspection. For example,one nominally heterozygous cluster may correspond to homozy-gosity for the reference allele or the marker may be detectingparalogous variation at other loci.

MWN1026MWN1030MWN1106MWN1194MWN1198MWN1214MWN1279MWN1287RDS12763

3.4 3.45 3.5 3.55 3.6

position (Mbp)

call

0

1

2

Figure 3 Dotplot representation of genotypes among 9 wild-caught mice on proximal chromosome 19 (from Yang et al.(2011)). Genotype calls are coded as counts of the reference al-lele, and points are colored according to genotype call. Blankspaces indicate missing calls. Markers are plotted with constantspacing in the main panel; grey lines indicate physical positionalong the chromosome in megabases (Mbp).

ARRAY NORMALIZATION

Illumina BeadStudio uses an “affine normalization” algorithmto perform within- and between-array adjustments to x- and y-hybridization intensities before calling genotypes. However, fur-ther normalization is helpful for analyses of sample contaminationand copy number. Two standard metrics are the log2 R/R0 ratio(LRR), which captures total hybridization intensity (R) relative to areference level (R0); and B-allele frequency (BAF), which capturesthe relative signal from the A and B alleles (Peiffer 2006). For anuncomtaminated euploid sample, the expected value of LRR is 0and the expected value of BAF is 0.5 at heterozygous markers.

2 | Andrew P Morgan et al.

http://www.genabel.org/

Figure 4 Joint plot of B-allele frequency (BAF, upper panel) and log2 intensity ratio (LRR, lower panel) for an outbred male sample.The autosomes are almost entirely heterozygous, while the X chromosome is hemizygous: no points appear near BAF = 0.5 on the Xchromosome, and its LRR is decreased relative to the autosomes. Red traces are a local smoothing of underlying points.

The argyle package implements the thresholded quantile nor-malization (tQN) approach described in Staaf et al. (2008) andDidion et al. (2014). Brielfy, tQN peforms within-array quantilenormalization of the x channel against the y channel to accountfor dye biases specific to the Infinium chemistry, but places anupper bound on the difference between normalized and unnor-malized intensity values. LRR and BAF are then computed usingknown cluster positions computed from a set of reference samples.The tQN procedure may optionally be preceded by preliminarybetween-array quantile normalization using routines implementedin the preprocessCore package (Bolstad et al. 2003). A joint plot ofBAF and LRR (Figure 4) is valuable for assessing heterozygosity,ploidy, sample purity and sex-chromosome karyotype.

Copy-number inference from Illumina arrays is a well-studiedproblem for which good solutions already exist – for instance, thestandalone software PennCNV (Wang et al. 2007) or the R packagegenoCN (Sun et al. 2009). Most of these packages take BAF and LRRvalues as input and so are easily integrated downstream of argyle.

Systematic batch effects on intensity distributions are possiblewhen analyzing samples processed which were not processedconcurrently. The reliability of discrete genotype calls may beunchanged between batches, but downstream analyses that makeuse of hybridization intensities (eg. copy-number analyses, orhidden Markov models for haplotype inference in multiparentalpopulations (Fu et al. 2012; Gatti et al. 2014)) may benefit from afurther batch correction. One possibility, given k non-overlappingbatches, is quantile normalization of batches 1, . . . , k − 1 againstthe kth batch. Although between-batch normalization is not yetimplemented in argyle, it is slated for inclusion in future releases.

GENETICS TOOLS

Utilities are provided for efficient calculation of allele frequencies,heterozygosity and missingness by sample and by marker. Whengenotypes of both parents and offspring are available, pedigreerelationships can be confirmed via checks for Mendelian inconsis-tencies. Separate datasets can be concatenated or merged using

functions that ensure consistency of allele encoding and detectstrand swaps (eg. an [A/G] versus a [T/C] SNP).

To facilitate analysis of genotypes from experimental crosses,argyle provides functions for recoding alleles with respect toparental lines. A general-purpose hidden Markov model (HMM)allows for reconstruction of haplotype mosaics, given a panel ofreference samples and a genetic map – although users are cau-tioned that more sophisticated implementations are available forsome special cases (Broman et al. 2003; Fu et al. 2012; Gatti et al.2014). Mature tools for genetic mapping in the R environment al-ready exist (eg. R/qtl (Broman et al. 2003)). Genotypes processedwith argyle can be readily converted to R/qtl format to create aunified pipeline for quantitative-trait locus (QTL) mapping. One-and two-locus “mosaic plots” allow joint visualization of allele fre-quencies and phenotype at candidate QTL (Figure 5). A workedexample is provided in File S3.

Genome-wide patterns of relatedness can be explored usingbuilt-in functions for efficient kinship estimation (Figure 6) andprincipal components analysis (Figure 7). See File S4 for moredetailed demonstration of functions useful for population-geneticanalysis.

DATA EXPORT

The argyle package provides functions to convert genotypes ob-jects to other formats either within the R session (for R/qtl andDOQTL) or on disk. Currently argyle supports export to eitherPLINK binary format (*.fam/*.bim/*.bed) or Stanford HGDP for-mat. PLINK provides command-line utilities to convert its fileformat to many others, including VCF, LINKAGE (*.map/*.ped),Haploview, STRUCTURE and fastPHASE. In addition, sincegenotypes objects are regular R matrices, users can adapt them tobespoke input formats required by other tools for genetic analysis.

PERFORMANCE

argyle and its dependencies are compatible with R (≥ 2.14)on Windows or Mac OS X. The performance of argyle benefits

Volume X October 2015 | textttargyle: an R package for analysis of Illumina genotyping arrays | 3

A

0%

25%

50%

75%

100%

AA AG GG

pheno

control

case

JAX00467071 @ chr18: 81.37 Mbp

B

CC

TC

TT

AA AG GG

JAX00467071

UN

CH

S046

870

pheno mean

1

2

3

4

Figure 5 (A) One-way mosaic plot. Width of each bar is propor-tional to the frequency of the corresponding genotype; fill colorsindicate phenotype, here case or control status. (B) Two-waymosaic plot. Area of each block is proportional to two-locusgenotype frequency, and fill colors indicate phenotype mean foreach two-locus genotype.

RDS12763MWN1194MWN1287MWN1279MWN1106MWN1026MWN1030MWN1198MWN1214Yu2116mYu2113mYu2115mYu2095mYu2097mYu2099fYu2120f

RDS10105RDS13554

IN59IN34IN13IN25IN38IN40IN17IN47IN54IN60

RDS12763

MWN1194

MWN1287

MWN1279

MWN1106

MWN1026

MWN1030

MWN1198

MWN1214

Yu2116m

Yu2113m

Yu2115m

Yu2095m

Yu2097m

Yu2099f

Yu2120f

RDS10105

RDS13554

IN59

IN34

IN13

IN25

IN38

IN40

IN17

IN47

IN54

IN60

P(IBS)

0.6

0.7

0.8

0.9

1.0

Figure 6 Heatmap representation of matrix of pairwise geneticdistances between 28 wild-caught mice from three different sub-species using data from the Mouse Diversity Array (Yang et al.2011). Genetic distance is defined here as the proportion of alle-les shared identical by state between two individuals. The matrixis hierarchically clustered to that more closely-related samplesare adjacent to each other. The heatmap is useful for visualizingpopulation structure; here it reveals obvious genetic differentia-tion between mouse subspecies.

−50

−25

0

25

−50 0 50 100

PC1 (32.7%)

PC2

(13.

6%) subspecies

domesticus

musculus

castaneus

Figure 7 Projection of the same 28 samples from the previousfigure onto the top two principal components (PCs) of the geno-types matrix. The block structure of the kinship matrix corre-sponds to the three clusters revealed by PCA, which in turncorrespond to three distinct subspecies.


from optimized code in several existing R packages includingdata.table and Rcpp (Eddelbuettel 2013). Reading a dataset of re-alistic size – 96 samples × 77,808 markers (164 Mb ZIP-compressedon disk) – from Illumina BeadStudio output into an R session takesabout 30 seconds. The full dataset, including hybridization intensi-ties and sample and marker metadata, occupies 202.4 Mb; withouthybridization intensities, the size drops to 77.9 Mb. Memory usagescales approximately linearly with either the number of samplesor the number of markers (Figure 8A). The most computationally-intensitive component of argyle is the tQN procedure and is im-plemented in C++. Its running time is compared to the quantilenormalization routine from the preprocessCore package in Fig-ure 8B. These resource requirements are well within the range of atypical laptop or desktop computer.

R’s internal limit of 231 − 1 entries for any matrix or vectorplaces an upper bound on the dimensions of a genotypes object.For arrays with between 10,000 and 150,000 markers, this translatesto a limit of between 14,000 and 21,000 samples.

Tests were performed in R 3.1.2 (64-bit) on a MacBook Air witha single 1.7Ghz Intel Core i7 processor and 8 Gb RAM.

A

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

32

128

12 24 48 96

sample size

mem

ory

usag

e (M

b)

genotypes only

genotypes + intensity

# markers

●

●

●

●

●

●

●

1000

5000

10000

25000

50000

100000

143259

B

●

●

●

●

●

●

●

●

●

●

● ●●

●●

0.0

0.3

0.6

0.9

25 50 75 100

number of markers (x1000)

runn

ing

time

(s)/

sam

ple

method

●

●

●

tQN + post−norm

tQN + pre−norm

quantile−norm only

Figure 8 (A) Memory requirements for genotypes objects (esti-mated via R’s object.size()) with varying numbers of markersand samples, with and without hybridization intensities. (B)Running time per sample of the thresholded quantile normaliza-tion (tQN) procedure, with either initial quantile normalizationor post-polishing, compared to quantile normalization alone.

ACKNOWLEDGMENTS

The author thanks John Didion for advice on design and quality-control methods; John Didion, Dan Gatti, Marty Ferris and SofiaGrize for valuable feedback on early versions of this software;and Robert Corty, Will Valdar and Fernando Pardo-Manuel deVillena for helpful comments on this manuscript. This work wassupported in part by F30MH103925 (APM), U19AI100625 (FPMV),

U42OD010924 (Terry Magnuson) and the UNC Bioinformatics andComputational Biology Training Grant (T32GM067553).

SUPPORTING INFORMATION

A series of detailed vignettes demonstrate the key features ofargyle.File S0. Installation.File S1. Data import.File S2. Quality control.File S3. Analysis of an experimental cross.FIle S4. Analysis of genotypes from natural populations.

LITERATURE CITED

Bolstad, B., R. Irizarry, M. Astrand, and T. Speed, 2003 A compari-son of normalization methods for high density oligonucleotidearray data based on variance and bias. Bioinformatics 19: 185–193.

Broman, K. W., H. Wu, S. Sen, and G. A. Churchill, 2003 R/qtl: QTLmapping in experimental crosses. Bioinformatics 19: 889–890.

Collaborative Cross Consortium, 2012 The genome architectureof the collaborative cross mouse genetic reference population.Genetics 190: 389–401.

Didion, J. P., R. J. Buus, Z. Naghashfar, D. W. Threadgill, H. C.Morse, and F. P.-M. de Villena, 2014 SNP array profiling ofmouse cell lines identifies their strains of origin and reveals cross-contamination and widespread aneuploidy. BMC Genomics 15:847.

Du, P., W. A. Kibbe, and S. M. Lin, 2008 lumi: a pipeline for pro-cessing illumina microarray. Bioinformatics 24: 1547–1548.

Dunning, M. J., M. L. Smith, M. E. Ritchie, and S. Tavare, 2007beadarray: R classes and methods for illumina bead-based data.Bioinformatics 23: 2183–2184.

Eddelbuettel, D., 2013 Seamless R and C++ Integration with Rcpp.Springer New York.

Fu, C.-P., C. E. Welsh, F. P.-M. de Villena, and L. McMillan, 2012 In-ferring ancestry in admixed populations using microarray probeintensities. In Proceedings of the ACM Conference on Bioinformatics,Computational Biology and Biomedicine - BCB '12, Association forComputing Machinery (ACM).

Gatti, D. M., K. L. Svenson, A. Shabalin, L.-Y. Wu, W. Valdar,P. Simecek, N. Goodwin, R. Cheng, D. Pomp, A. Palmer, E. J.Chesler, K. W. Broman, and G. A. Churchill, 2014 Quantitativetrait locus mapping methods for diversity outbred mice. G3 4:1623–1633.

Hulse-Kemp, A. M., J. Lemm, J. Plieske, H. Ashrafi, R. Buyyarapu,D. D. Fang, J. Frelichowski, M. Giband, S. Hague, L. L. Hinze,K. J. Kochan, P. K. Riggs, J. A. Scheffler, J. A. Udall, M. Ulloa,S. S. Wang, Q.-H. Zhu, S. K. Bag, A. Bhardwaj, J. J. Burke, R. L.Byers, M. Claverie, M. A. Gore, D. B. Harker, M. S. Islam, J. N.Jenkins, D. C. Jones, J.-M. Lacape, D. J. Llewellyn, R. G. Percy,A. E. Pepper, J. A. Poland, K. M. Rai, S. V. Sawant, S. K. Singh,A. Spriggs, J. M. Taylor, F. Wang, S. M. Yourstone, X. Zheng,C. T. Lawley, M. W. Ganal, A. V. Deynze, I. W. Wilson, and D. M.Stelly, 2015 Development of a 63k SNP array for cotton and high-density mapping of intraspecific and interspecific populationsof gossypium spp. G3 5: 1187–1209.

Johnston, S. E., M. Lindqvist, E. Niemelä, P. Orell, J. Erkinaro, M. P.Kent, S. Lien, J.-P. Vähä, A. Vasemägi, and C. R. Primmer, 2013Fish scales and SNP chips: SNP genotyping and allele frequencyestimation in individual and pooled DNA from historical sam-ples of atlantic salmon (salmo salar). BMC Genomics 14: 439.

Volume X October 2015 | textttargyle: an R package for analysis of Illumina genotyping arrays | 5

Kijas, J. W., D. Townley, B. P. Dalrymple, M. P. Heaton, J. F. Mad-dox, A. McGrath, P. Wilson, R. G. Ingersoll, R. McCulloch,S. McWilliam, D. Tang, J. McEwan, N. Cockett, V. H. Oddy,F. W. Nicholas, and H. Raadsma, 2009 A genome wide surveyof SNP variation reveals the genetic structure of sheep breeds.PLoS ONE 4: e4668.

Morgan, A. P. and C. E. Welsh, 2015 Informatics resources forthe collaborative cross and related mouse populations. MammGenome advance online publication: 2 July 2015.

Peiffer, D. A., 2006 High-resolution genomic profiling of chromo-somal aberrations using infinium whole-genome genotyping.Genome Res 16: 1136–1148.

Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M. A. Ferreira,D. Bender, J. Maller, P. Sklar, P. I. de Bakker, M. J. Daly, and P. C.Sham, 2007 PLINK: A tool set for whole-genome associationand population-based linkage analyses. Am J Hum Genet 81:559–575.

Ritchie, M. E., B. S. Carvalho, K. N. Hetrick, S. Tavare, and R. A.Irizarry, 2009 R/bioconductor software for illumina's infiniumwhole-genome genotyping BeadChips. Bioinformatics 25: 2621–2623.

Staaf, J., J. Vallon-Christersson, D. Lindgren, G. Juliusson, R. Rosen-quist, M. Höglund, Å. Borg, and M. Ringnér, 2008 Normalizationof illumina infinium whole-genome SNP data improves copynumber estimates and allelic intensity ratios. BMC Bioinformat-ics 9: 409.

Steemers, F. J., W. Chang, G. Lee, D. L. Barker, R. Shen, and K. L.Gunderson, 2006 Whole-genome genotyping with the single-base extension assay. Nat Meth 3: 31–33.

Sun, W., F. A. Wright, Z. Tang, S. H. Nordgard, P. V. Loo, T. Yu,V. N. Kristensen, and C. M. Perou, 2009 Integrated study of copynumber states and genotype calls using high-density SNP arrays.Nucleic Acids Res 37: 5365–5377.

Wang, K., M. Li, D. Hadley, R. Liu, J. Glessner, S. F. Grant,H. Hakonarson, and M. Bucan, 2007 PennCNV: An integratedhidden markov model designed for high-resolution copy num-ber variation detection in whole-genome SNP genotyping data.Genome Res 17: 1665–1674.

Willet, C. E. and B. Haase, 2014 An updated felCat5 SNP manifestfor the illumina feline 63k SNP genotyping array. Anim Genet45: 614–615.

Yang, H., J. R. Wang, J. P. Didion, R. J. Buus, T. A. Bell, C. E. Welsh,F. Bonhomme, A. H.-T. Yu, M. W. Nachman, J. Pialek, P. Tucker,P. Boursot, L. McMillan, G. A. Churchill, and F. P.-M. de Villena,2011 Subspecific origin and haplotype diversity in the laboratorymouse. Nat Genet 43: 648–655.


argyle: an package for analysis of illumina genotyping arrayscsbio.unc.edu/muga/argyle.pdf · 2015....

Documents