Actuarial Analytics in R

Actuarial Science as Data Science: Actuarial Modeling in R. Revolution Analytics Webinar, March 28, 2012. Jim Guszcza, FCAS, MAAA, Deloitte Consulting LLP and University of Wisconsin-Madison

Upload: revolution-analytics

Post on 27-Jan-2015


DESCRIPTION

With data analysis showing up in domains as varied as baseball, evidence-based medicine, predicting recidivism and child support lapses, judging wine quality, credit scoring, supermarket scanner data analysis, and “genius” recommendation engines, “business analytics” is part of the zeitgeist. This is a good moment for actuaries to remember that their discipline is arguably the first – and a quarter of a millennium old – example of business analytics at work. Today, the widespread availability of sophisticated open-source statistical computing and data visualization environments provides the actuarial profession with an unprecedented opportunity to deepen its expertise as well as broaden its horizons, living up to its potential as a profession of creative and flexible data scientists.

This session will include an overview of the R statistical computing environment as well as a sequence of brief case studies of actuarial analyses in R. Case studies will include examples from loss distribution analysis, ratemaking, loss reserving, and predictive modeling.

TRANSCRIPT

Page 1: Actuarial Analytics in R

Actuarial Science as Data Science

Actuarial Modeling in R

Revolution Analytics Webinar

March 28, 2012

Jim Guszcza, FCAS, MAAA

Deloitte Consulting LLP

University of Wisconsin-Madison

Page 2: Actuarial Analytics in R

2 Deloitte Analytics Institute © 2011 Deloitte LLP

About Your Presenter

• James Guszcza, PhD, FCAS, MAAA
• National Predictive Analytics Lead – Deloitte Consulting Actuarial, Risk, Analytics practice
• Assistant professor of actuarial science & risk management – U. Wisconsin-Madison
• PhD in Philosophy – The University of Chicago
• Fellow of the Casualty Actuarial Society
• Extensive experience building predictive models / analyzing data in and outside of insurance


Page 3: Actuarial Analytics in R

Agenda

Introduction

Actuarial Science and Data Science

R Background

Case Studies

• Fitting a complex size of loss model

• Loss Reserving

• Bayesian Hierarchical Modeling

• Revolution: Tweedie Regression on big data

Page 4: Actuarial Analytics in R

Actuarial Science and Data Science

Page 5: Actuarial Analytics in R


Not Just Hype

“Perhaps the most important cultural trend today: The explosion of data about every aspect of our world and the rise of applied math gurus who know how to use it.”

-- Chris Anderson, editor-in-chief of Wired

• So behavioral economics is important in insurance for two classes of reasons:

• Decision-makers at insurance companies are human
• People making insurance purchasing decisions are human

Page 6: Actuarial Analytics in R


Brave New World With Such Algorithms In IT

• The analysis of data affects:

• What we buy

• What we read

• What we watch

• How we network

• How we socialize

• The opinions we form

• Whom we date and marry!

Page 7: Actuarial Analytics in R


Clinical vs Actuarial Judgment – the Motion Picture

Page 8: Actuarial Analytics in R


Analytics Everywhere

• Neural net models are used to predict movie box-office returns based on features of their scripts

• Decision tree models are used to help ER doctors better triage patients complaining of chest pain.

• Predictive models are used to predict the price of different wine vintages based on variables about the growing season.

• Predictive models to help commercial insurance underwriters better select and price risks.

• Predict which non-custodial parents are at highest risk of falling into arrears on their child support.

• Predicting which job candidates will successfully make it through the interviewing / recruiting process… and which candidates will subsequently retain and perform well on the job.

• Predicting which doctors are at highest risk of being sued for malpractice.

• Predicting the ultimate severity of injury claims.

(Deloitte applications in green)

Page 9: Actuarial Analytics in R


At the Center of It All: Data Science

• Today the analytics world is different largely due to exponential growth in computing power.

• The skill set underlying business analytics is increasingly called data science.

• Data science goes beyond:
  • Traditional statistics
  • Business intelligence [BI]
  • Information technology

Image borrowed from Drew Conway’s blog:

http://www.dataists.com/2010/09/the-data-science-venn-diagram

Or: “The Collision between Statistics and Computation”

Page 12: Actuarial Analytics in R


On then, on to R

Page 13: Actuarial Analytics in R

R Background

Page 14: Actuarial Analytics in R


R Overview

R is an open-source, object-oriented statistical programming language. In the past decade, it has become the global lingua franca of statistics.

• History:
  • R is based on the S statistical programming language developed by John Chambers at Bell Labs in the 1980s
  • R is an open-source implementation of the S language
  • Developed by Robert Gentleman and Ross Ihaka at U Auckland
  • Revolution R is a commercially supported, scalable implementation of R, with parallel processing and big data capabilities

• Features:
  • R is an interactive, object-oriented programming environment
  • R has advanced graphical capabilities
  • Statisticians around the world contribute add-on packages

Page 15: Actuarial Analytics in R


On the Shoulders of Giants

• … therefore prominent people tend to say things like this:

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all

Page 16: Actuarial Analytics in R


Facets of R

• In a recent article John Chambers discussed 6 “Facets of R”:
  1. An interface to computational procedures of many kinds
  2. Interactive, hands-on in real time
  3. Functional in its model of programming
  4. Object-oriented, “everything is an object”
  5. Modular, built from standardized pieces
  6. Collaborative, a world-wide, open-source effort

• Interactive interface: Chambers was influenced by APL
  • In the days before spreadsheets, APL was very popular in the actuarial community
  • One of the rare interactive scientific computing environments
  • Gives user ability to express novel computations
  • Heavy emphasis on matrices and arrays
  • But: unlike R, APL had no interface to procedures

Page 17: Actuarial Analytics in R


A Network ExteRnality

• Hal Varian’s “giant” has grown at an exponential rate.

• The open-source nature of R has encouraged top researchers from around the world to contribute new, often highly advanced, packages.

• Result: a powerful “network effect”.

• The value of a product increases as more people use it.

• R has become something like the Wikipedia of the statistics world.

Page 18: Actuarial Analytics in R


Adoption in the Actuarial World

Page 19: Actuarial Analytics in R


Free from Frees

• Jed Frees at the University of Wisconsin-Madison has made R integral to his new book on regression and time series. He maintains a nice website containing R instructions, data, and code.

http://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/learnR.html

Page 20: Actuarial Analytics in R

Case Studies

Page 21: Actuarial Analytics in R


Some Everyday Uses of R

• Free-form Exploratory Data Analysis
  • ad hoc data munging, data visualizations, fitting simple models on the fly
  • Loss models (“exam 4/C”)

• Unsupervised Learning
  • Correlation analysis, principal component / factor analysis, variable clustering, k-means and hierarchical clustering, self-organizing maps, association rules (aka “market basket analysis”), Latent Dirichlet Allocation

• Supervised Learning
  • “statistics paradigm”: GLM, Multilevel/Hierarchical models, quantile regression
  • “machine learning paradigm”: CART, MARS, Random Forests, Neural Networks, Support Vector Machines
  • Bayesian data analysis (MCMC simulation), causal analysis

• Optimization

Page 22: Actuarial Analytics in R

Case Study #1 Loss Distribution Modeling

Page 23: Actuarial Analytics in R


Modeling a Non-Trivial Loss Distribution

• A typical actuarial problem: modeling a highly skewed and ambiguous loss distribution

• Traditional medium of analysis: spreadsheets.

• Why limit ourselves?

[Figure: plot of the loss distribution; x-axis: loss, 0e+00 to 5e+06; y-axis: 0e+00 to 8e-06]

Page 24: Actuarial Analytics in R

Case Study #2 Loss Reserving

Page 25: Actuarial Analytics in R


Three Approaches to Loss Reserving

• A garden-variety loss triangle:

Cumulative Losses in 1000's

AY            premium      12      24      36      48      60      72      84      96     108     120  CL Ult  CL LR  CL res
1988            2,609     404     986   1,342   1,582   1,736   1,833   1,907   1,967   2,006   2,036   2,036   0.78       0
1989            2,694     387     964   1,336   1,580   1,726   1,823   1,903   1,949   1,987           2,017   0.75      29
1990            2,594     421   1,037   1,401   1,604   1,729   1,821   1,878   1,919                   1,986   0.77      67
1991            2,609     338     753   1,029   1,195   1,326   1,395   1,446                           1,535   0.59      89
1992            2,077     257     569     754     892     958   1,007                                   1,110   0.53     103
1993            1,703     193     423     589     661     713                                             828   0.49     115
1994            1,438     142     361     463     533                                                     675   0.47     142
1995            1,093     160     312     408                                                             601   0.55     193
1996            1,012     131     352                                                                     702   0.69     350
1997              976     122                                                                             576   0.59     454

chain link              2.365   1.354   1.164   1.090   1.054   1.038   1.026   1.020   1.015   1.000  12,067          1,543
chain ldf               4.720   1.996   1.473   1.266   1.162   1.102   1.062   1.035   1.015   1.000
growth curve            21.2%   50.1%   67.9%   79.0%   86.1%   90.7%   94.2%   96.6%   98.5%  100.0%

• Let’s use R to forecast outstanding losses using three methods:
  • Replicate the above chain-ladder spreadsheet calculation – easy!
  • Use the Over-dispersed Poisson GLM model
  • Longitudinal data analysis using growth curves
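The chain-ladder rows of the triangle can be reproduced in a few lines. The talk does this in R; the sketch below uses plain Python purely as an illustration, and it assumes the link ratios are volume-weighted (column sum over column sum), which matches the printed factors to within rounding. The function and variable names are made up for this example.

```python
# Cumulative losses (in 1000's) by accident year, copied from the triangle.
# Each row lists development ages 12, 24, ... as far as they are observed.
triangle = {
    1988: [404, 986, 1342, 1582, 1736, 1833, 1907, 1967, 2006, 2036],
    1989: [387, 964, 1336, 1580, 1726, 1823, 1903, 1949, 1987],
    1990: [421, 1037, 1401, 1604, 1729, 1821, 1878, 1919],
    1991: [338, 753, 1029, 1195, 1326, 1395, 1446],
    1992: [257, 569, 754, 892, 958, 1007],
    1993: [193, 423, 589, 661, 713],
    1994: [142, 361, 463, 533],
    1995: [160, 312, 408],
    1996: [131, 352],
    1997: [122],
}


def link_ratios(tri):
    """Volume-weighted age-to-age factors (assumed weighting)."""
    n_ages = max(len(row) for row in tri.values())
    ratios = []
    for k in range(n_ages - 1):
        # Only accident years observed at both age k and age k+1 contribute.
        rows = [row for row in tri.values() if len(row) > k + 1]
        ratios.append(sum(r[k + 1] for r in rows) / sum(r[k] for r in rows))
    return ratios


def to_ultimate(ratios):
    """Age-to-ultimate factors: running product of link ratios, tail = 1.0."""
    ldfs = [1.0]
    for f in reversed(ratios):
        ldfs.insert(0, ldfs[0] * f)
    return ldfs


links = link_ratios(triangle)
ldfs = to_ultimate(links)

# Reserve = latest cumulative loss * (age-to-ultimate factor - 1).
reserves = {ay: row[-1] * (ldfs[len(row) - 1] - 1.0)
            for ay, row in triangle.items()}
total_reserve = sum(reserves.values())

# The "growth curve" row is the percent of ultimate at each age, 1 / LDF.
growth = [1.0 / f for f in ldfs]

print([round(f, 3) for f in links])
print([round(f, 3) for f in ldfs])
print(round(total_reserve))
```

Small differences from the slide in the third decimal place are expected, since the triangle shown is rounded to thousands.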

Page 26: Actuarial Analytics in R


What Do You See?

• Let’s look at the loss triangle with fresh eyes.

• We would like to do stochastic reserving the “right” way.

• What considerations come to mind?


Page 27: Actuarial Analytics in R


Some Essential Features of Loss Reserving

• Repeated measures
  • The dataset is inherently longitudinal in nature.

• A “Bundle” of time series
  • Loss triangle: a collection of time series that are “related” to one another…
  • … no guarantee that the same development pattern is appropriate to each one

• Non-linear
  • Each year’s loss development pattern is inherently non-linear
  • Ultimate loss (ratio) is an asymptote

• Incomplete information
  • Few loss triangles contain all of the information needed to make forecasts
  • Most reserving exercises must incorporate judgment and/or background information
  • So loss reserving is inherently Bayesian
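The “non-linear” and “asymptote” points can be written compactly. As a sketch only (the notation and the two example curves are assumptions and are not taken from the slides), let C_i(t) be the cumulative loss of accident year i at development age t:

```latex
E[C_i(t)] = \mathrm{ULT}_i \cdot G(t), \qquad G(t) \to 1 \ \text{as} \ t \to \infty

% Two growth curves often used for this purpose:
G(t) = 1 - \exp\!\left[-(t/\theta)^{\omega}\right] \quad \text{(Weibull)}
\qquad
G(t) = \frac{t^{\omega}}{t^{\omega} + \theta^{\omega}} \quad \text{(loglogistic)}
```

The “growth curve” row of the chain-ladder triangle is the empirical counterpart of G(t): the percent of ultimate at each age is 1/LDF, for example 1/4.720 ≈ 21.2% at 12 months.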


Page 28: Actuarial Analytics in R


Origin of the Approach: Dave’s Idea + Random Effects


Page 29: Actuarial Analytics in R


And Now it’s Bayesian

• Fully Bayesian model
  • Provides posterior credible intervals (“range of reasonable reserves”)

• Add further hierarchical structure to simultaneously model loss development for multiple companies. (Wayne’s idea!)

Page 30: Actuarial Analytics in R

Case Study #3 Hierarchical Bayes Ratemaking

Page 31: Actuarial Analytics in R


Workers Comp Ratemaking

• We have 7 years of Workers Comp data
  • Data from Klugman [1992 Bayes book]
  • 128 workers comp classes (types of business)
  • 7 years of summarized data
  • Given: total payroll, claim count by class
  • (payroll is a measure of “exposure” in this domain)

• Problem: use years 1-6 data to predict year 7

Page 32: Actuarial Analytics in R


Empirical Bayes “Credibility” Approach

• Naïve approach:
  • Calculate average year 1-6 claim frequency by class
  • Use these 128 averages as estimates for year 7.

• Better approach: build empirical Bayes hierarchical model.
  • “Bühlmann-Straub credibility model”
  • “Shrinks” low-credibility classes towards the grand mean
  • Use Douglas Bates’ lme4 package (UW-Madison again!)

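$$\text{clmcnt}_i \sim \text{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$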
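To make the shrinkage idea concrete, here is a minimal credibility-weighting sketch in Python (the talk itself fits the model with lme4 in R). It is not the empirical Bayes fit: the credibility constant k is simply assumed rather than estimated, the exposure-weighted grand mean is used as the complement, and the three classes are invented data.

```python
def credibility_estimates(claims, payroll, k):
    """Shrink each class's observed claim frequency toward the grand mean.

    Z = payroll / (payroll + k) is the credibility weight of a class;
    k is an assumed constant here, not estimated from the data.
    """
    grand_mean = sum(claims) / sum(payroll)
    estimates = []
    for c, p in zip(claims, payroll):
        z = p / (p + k)
        estimates.append(z * (c / p) + (1.0 - z) * grand_mean)
    return grand_mean, estimates


# Hypothetical classes: tiny, medium, and large exposure.
claims = [2, 40, 300]
payroll = [10.0, 1000.0, 10000.0]
raw = [c / p for c, p in zip(claims, payroll)]

grand_mean, est = credibility_estimates(claims, payroll, k=500.0)

for r, e in zip(raw, est):
    print(f"raw {r:.4f} -> credibility estimate {e:.4f}")
```

The tiny class, whose raw frequency is far above the grand mean, is pulled almost all the way back to it, while the large class barely moves. That is the pattern the shrinkage plot on the next slide illustrates.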

Page 33: Actuarial Analytics in R


Shrinkage Effect of Empirical Bayes Model

• Top row: estimated claim frequencies from un-pooled model.

• Separately calculate #claims/payroll by class

• Bottom row: estimated claim frequencies from Poisson hierarchical (credibility) model.

• Credibility estimates are “shrunk” towards the grand mean.

• Dotted line: shrinkage between 5-10%.

• Solid line: shrinkage > 10%

[Figure: “Modeled Claim Frequency by C…”, “Poisson Models: No Pooling and Simple…”; rows: no pool (top), hierarch (bottom); x-axis: Claim Frequency, 0.00 to 0.10, with the grand mean marked]

Page 34: Actuarial Analytics in R


Now Specify a Fully Bayesian Model

• Here we specify a fully Bayesian model.
  • Use the rjags package
  • JAGS: Just Another Gibbs Sampler

• We’re standing on the shoulders of giants named David Spiegelhalter, Martyn Plummer, …

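$$\text{clmcnt}_i \sim \text{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$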

Page 35: Actuarial Analytics in R


Now Specify a Fully Bayesian Model

• Here we specify a fully Bayesian model.
  • Poisson regression with an offset

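The “offset” remark can be unpacked in two lines. Assuming the usual log link (the slides do not state the link function explicitly), the Poisson model for clmcnt_i implies:

```latex
\begin{aligned}
E[\text{clmcnt}_i] &= \text{payroll}_i \, \lambda_{j[i]} \\
\log E[\text{clmcnt}_i] &= \log(\text{payroll}_i) + \log \lambda_{j[i]}
\end{aligned}
```

So log(payroll_i) enters the linear predictor with its coefficient fixed at 1 rather than estimated, which is what “offset” means, and the class-level term is a log claim frequency per unit of payroll.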

Page 36: Actuarial Analytics in R


Now Specify a Fully Bayesian Model

• Here we specify a fully Bayesian model.
  • Allow for overdispersion



Page 38: Actuarial Analytics in R


Now Specify a Fully Bayesian Model

• Here we specify a fully Bayesian model.
  • “Credibility weighting” (aka shrinkage) results from giving class-level intercepts a probability sub-model.


Page 39: Actuarial Analytics in R


Now Specify a Fully Bayesian Model

• Here we specify a fully Bayesian model.
  • Put a diffuse prior on all of the hyperparameters
  • Fully Bayesian model
  • Bayes or Bust!


Page 40: Actuarial Analytics in R


Now Specify a Fully Bayesian Model

• Here we specify a fully Bayesian model.
  • Replace year-7 actual values with missing values
  • We model the year-7 results … produce 128 posterior density estimates
  • Can compare actual claims with Bayesian posterior probabilities


Page 41: Actuarial Analytics in R


A Credible Result

• Let’s rank the top 30 WC classes by the median of the posterior predictive density of year-7 claim count.

• 87% of the top 30 classes have actual year-7 claim count falling within the 90% posterior credible interval.
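A coverage figure like this one (87% of 30 classes is presumably 26 classes) can be computed directly from posterior predictive draws. The Python sketch below assumes a list of simulated year-7 claim counts per class, for instance exported from an MCMC run; the function names and the toy inputs are hypothetical.

```python
def central_interval(draws, level=0.90):
    """Central credible interval from posterior draws via order statistics."""
    s = sorted(draws)
    lo_idx = int(round((1.0 - level) / 2.0 * (len(s) - 1)))
    hi_idx = int(round((1.0 + level) / 2.0 * (len(s) - 1)))
    return s[lo_idx], s[hi_idx]


def coverage(actuals, draws_by_class, level=0.90):
    """Share of classes whose actual value falls inside its interval."""
    hits = 0
    for actual, draws in zip(actuals, draws_by_class):
        lo, hi = central_interval(draws, level)
        hits += lo <= actual <= hi
    return hits / len(actuals)


# Toy stand-in for posterior predictive draws: the integers 0..100.
draws = list(range(101))
interval = central_interval(draws)
share = coverage([50, 99], [draws, draws])
print(interval, share)
```

With well-calibrated 90% intervals, roughly 90% of classes should be covered, so a result in the high 80s on 30 classes is in line with expectations.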

Page 42: Actuarial Analytics in R

Case Study #4 Big Data in Revolution R

Page 43: Actuarial Analytics in R


Big Data Headed Our Way

• Credibility concerns and a Bayesian outlook are part and parcel of actuarial science.

• But for many actuaries, working with “big data” is a much more pressing concern.

• Many millions of personal lines policy terms
• Premium, loss, credit, billing transactions
• Telematics data
• … much more to come

• Base R handles data in memory
  • This is beautiful for “small data” problems like doing loss reserving on summarized data
  • But breaks down for many industrial datasets

• So on to Revolution-R

Page 44: Actuarial Analytics in R


The kaggle Allstate Claim Prediction Challenge Data

Page 45: Actuarial Analytics in R


Loading the Data

• Data volume:
  • 13M rows
  • ~ 40 cols

• Took about 6-7 minutes to load

• Perform some variable transformations on the fly to minimize passes through the data.

• Data saved on disk in “xdf” file format for easy access and interactive modeling.
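The general idea of processing data that does not fit in memory, and of transforming variables during the same pass, can be sketched without any special tooling. The Python example below is only an illustration of chunked, single-pass processing; it is not the Revolution R or xdf API, and the column names and the inline sample data are invented.

```python
import csv
import io
from itertools import islice


def summarize_in_chunks(handle, chunk_size=100_000):
    """Stream a CSV in fixed-size chunks, keeping only running totals."""
    reader = csv.DictReader(handle)
    totals = {"rows": 0, "claim_sum": 0.0, "nonzero_claims": 0}
    while True:
        chunk = list(islice(reader, chunk_size))  # at most chunk_size rows
        if not chunk:
            break
        for row in chunk:
            amount = float(row["claim_amount"])
            has_claim = amount > 0  # derived variable, computed on the fly
            totals["rows"] += 1
            totals["claim_sum"] += amount
            totals["nonzero_claims"] += int(has_claim)
    return totals


# A tiny in-memory stand-in for a large file on disk.
sample = io.StringIO(
    "policy_year,claim_amount\n"
    "2005,0\n"
    "2005,1200.5\n"
    "2006,0\n"
    "2007,300\n"
)
summary = summarize_in_chunks(sample, chunk_size=2)
print(summary)
```

Only one chunk is held in memory at a time, so memory use depends on chunk_size rather than on the number of rows in the file.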

Page 46: Actuarial Analytics in R


Viewing the Data

• Data characteristics:
  • 13,184,290 rows
  • A few dozen predictive variables (mostly blinded)
  • Target variable: claim amount

• kaggle competition goal: build a model that segments well out-of-sample
  • Let’s use the 2005-6 data to predict the 2007 data
  • (Just a quick model to get a sense of Revolution R’s scalability)
  • Tweedie regression models fit in seconds

Page 47: Actuarial Analytics in R


Helpful Resources

• Edward (Jed) Frees – Regression Modeling with Actuarial and Financial Applications http://www.amazon.com/Regression-Actuarial-Financial-Applications-International/dp/0521135966

• Andrew Gelman / Jennifer Hill – Data Analysis Using Regression and Multilevel/Hierarchical Models http://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X/ref=sr_1_1?s=books&ie=UTF8&qid=1332961819&sr=1-1

• Venables and Ripley – Modern Applied Statistics with S http://www.amazon.com/Modern-Applied-Statistics-Computing/dp/1441930086/ref=sr_1_1?s=books&ie=UTF8&qid=1332961867&sr=1-1

• Hastie, Tibshirani, Friedman – The Elements of Statistical Learning http://www.amazon.com/The-Elements-Statistical-Learning-Prediction/dp/0387848576/ref=sr_1_1?s=books&ie=UTF8&qid=1332961913&sr=1-1

• Gelman, Carlin, Stern, Rubin – Bayesian Data Analysis http://www.amazon.com/Bayesian-Analysis-Edition-Chapman-Statistical/dp/158488388X/ref=tag_dpp_lp_edpp_ttl_in

• John Kruschke – Doing Bayesian Data Analysis http://www.amazon.com/Doing-Bayesian-Data-Analysis-Tutorial/dp/0123814855/ref=sr_1_3?s=books&ie=UTF8&qid=1332961975&sr=1-3