data mining for biosecurity regulation - meetupfiles.meetup.com/14535342/data mining for...

Data Mining for Biosecurity Regulation

Andrew Robinson

CEBRA, University of Melbourne

August 10, 2016

Centre of Excellence for Biosecurity Risk Analysis

Outline

Biosecurity

CEBRA

Data-Mining Examples

Failures & Lessons Learned

Biosecurity

Biosecurity is Important


- Tree snakes in Guam: 12 native bird species now extinct.
- Annual cost of invasives: $1.4 trillion; 5% of global GDP.
- 2001 FMD outbreak in the UK: cost 8 billion pounds; 6 M sheep & cattle were slaughtered (2030 tested positive!)
- Modelled impact in Australia: $7 B or $16 B; now $50 B.

Biosecurity is Expensive

Department of Agriculture and Water Resources, 2014–15 Annual Report.

 17 900 000   Air passengers
146 100 000   Mail articles
     18 000   Vessel first-port arrivals
    611 000   Air freight consignments (< $1000)
    450 000   Cargo units referred from Customs (in 2014)

Biosecurity is Difficult

Now, here, you see, it takes all the running you can do, to keep in the same place.

— The Red Queen, Through the Looking Glass.

CEBRA

Centre of Excellence for Biosecurity Risk Analysis

- CEBRA established in the University of Melbourne
- Four-year contract, started July 1 2013
- Jointly funded by
  - Department of Agriculture and Water Resources, and
  - New Zealand’s Ministry for Primary Industries.
- CEBRA curates proposal development inside departments.

http://www.cebra.unimelb.edu.au



- Border data case studies
- Geolocating dirty mail
- Text mining
- Pooling passenger data
- Hunting brokers
- Profiling international vessels
- Performance indicators for compliance monitoring
- Predicting hitch-hikers

ACERA 0806: 2001 IQI — rollback

ULD (External Inspection)

CEBRA provided a spreadsheet tool to the Department.

Table: Predicted 95% risk rate and tentative future sampling rate for 2007 for a risk cutoff of 1%.

Region       Insp.  Cont.  p (%)  f (%)       π     n
Brisbane     37743     58  0.154  0.190    1.86   701
Far North     2957     33  1.116  1.470  100.00  2957
NSW         207764    137  0.066  0.076    0.19   389
SA           17510     59  0.337  0.415    9.31  1630
VIC          91491     24  0.026  0.036    0.43   389
WA           14067      0  0.000  0.014    1.36   191
National    371532    311  0.084  0.092    0.15   552
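Two columns of the table can be checked by hand. The slides do not define the columns, so the reading used here is an assumption: p is taken as the observed contamination rate (Cont. / Insp.) and π as the proposed sampling rate (n / Insp.), both in percent. Under that reading the arithmetic below agrees with the tabulated values; column f and the rule that produces n are not reconstructed.

```python
# Rows copied from the table: region -> (inspected, contaminated, n).
rows = {
    "Brisbane": (37743, 58, 701),
    "Far North": (2957, 33, 2957),
    "NSW": (207764, 137, 389),
    "SA": (17510, 59, 1630),
    "VIC": (91491, 24, 389),
    "WA": (14067, 0, 191),
    "National": (371532, 311, 552),
}


def rates(inspected, contaminated, n):
    """Return (p, pi) in percent under the assumed column meanings."""
    p = 100.0 * contaminated / inspected   # assumed: observed contamination rate
    pi = 100.0 * n / inspected             # assumed: proposed sampling rate
    return round(p, 3), round(pi, 2)


table = {region: rates(*values) for region, values in rows.items()}

for region, (p, pi) in table.items():
    print(f"{region:<10} p = {p:.3f}%  pi = {pi:.2f}%")
```

Far North is the only region whose rate exceeds the 1% cutoff, which is consistent with it being sampled at 100%.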

The Benefits

- Monitoring ULDs: 370,000 in 2008; 14,000 in 2014.
- Monitoring reportable documents: 2.7 million in 2008; 16,000 in 2014.
- Sea containers: 2 million in 2008; expanded CAL, huge reduction in non-CAL inspection; 370,000 in 2014.

NB: imperfect inspection data.

CEBRA 1301A1: Spatial Analysis of Intercepted Mail

International mail is monitored by DDU, X-ray, and manual inspection in Gateway Facilities.

- Delivery address is recorded for all articles intercepted with biosecurity risk material (BRM).
- Addresses can be geolocated to ABS census region.

CEBRA used data-mining tools to identify patterns.

- Spatial analysis: spatial patterns in intercepted goods?
- Statistical analysis: any correlation with census-measured characteristics at the ABS statistical unit level 2 or 3?
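The statistical question above can be sketched as two steps: count interceptions per census region, scale by population, then correlate the rate with a census characteristic. The sketch below is illustrative only. The region labels, populations, and function names are hypothetical, and the slides do not say which correlation measure was used; Pearson's r is an assumption.

```python
from collections import Counter
from math import sqrt


def interceptions_per_1000(region_of_article, population):
    """Interceptions per 1000 residents for each region.

    region_of_article: one region label per intercepted article.
    population: region label -> resident count (hypothetical census field).
    """
    counts = Counter(region_of_article)
    return {r: 1000.0 * counts.get(r, 0) / pop for r, pop in population.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up inputs, for illustration only.
articles = ["R1", "R1", "R2", "R3", "R3", "R3", "R3"]
population = {"R1": 1000, "R2": 2000, "R3": 4000}
rate = interceptions_per_1000(articles, population)
```

Scaling by population first matters: raw counts would mostly reflect how many people live in each region.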

Greater Melbourne seizures — 2008

Potential Future Directions

- Postcode profiling (SGF mail counter, urban/rural)
- Case studies, e.g.,
  - Locale: Sydney area.
  - Infrastructure: Universities.
  - Interceptions: khat, tea, seeds, finfish.
- Other sources
  - Air cargo
  - Customs analysis.
- Address analysis: postboxes?

CEBRA 1401C/D: SAC Text Mining

SAC: self-assessed clearance, < $1000 declared value for a range of goods. Cf. FID.

Brief: to assess automated prediction of economic tariff codes from free-text goods descriptions in SAC.

In particular, is the desired accuracy of 80% or more feasible?

SAC comprises 1304 tariff codes.

Text Mining: Data & Analysis

Data:

- 3830 goods descriptions with tariff codes assigned by Department staff.
- 278 unique tariff codes.
- Dictionary of tariff codes and their descriptions.
- Highly uneven distribution: 75% of tariff codes have < 10 entries.

Strategy:

- Random forest using the RTextTools package in R
- 5-fold cross-validation
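The cross-validation step can be sketched without the R tooling. The block below does not reproduce the random forest or RTextTools. It only illustrates the 5-fold loop, with a deliberately simple token-vote classifier standing in for the forest. The descriptions, the placeholder codes CODE_A and CODE_B, and all function names are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train(descriptions, codes):
    """Count how often each token appears under each code."""
    token_counts = defaultdict(Counter)
    for text, code in zip(descriptions, codes):
        for token in text.lower().split():
            token_counts[token][code] += 1
    default = Counter(codes).most_common(1)[0][0]  # fallback: most common code
    return token_counts, default


def predict(model, text):
    """Each known token votes for the codes it was seen with."""
    token_counts, default = model
    votes = Counter()
    for token in text.lower().split():
        votes.update(token_counts.get(token, {}))
    return votes.most_common(1)[0][0] if votes else default


def cross_validated_accuracy(descriptions, codes, k=5, seed=0):
    """Train on k-1 folds, score the held-out fold, pool over all folds."""
    correct = 0
    for held_out in k_fold_indices(len(descriptions), k, seed):
        held = set(held_out)
        keep = [i for i in range(len(descriptions)) if i not in held]
        model = train([descriptions[i] for i in keep], [codes[i] for i in keep])
        correct += sum(predict(model, descriptions[i]) == codes[i] for i in held_out)
    return correct / len(descriptions)


# Made-up goods descriptions with placeholder codes, for illustration only.
descriptions = [
    "cotton shirt blue", "cotton shirt red", "cotton shirt green",
    "cotton shirt white", "cotton shirt black",
    "steel bolt m4", "steel bolt m5", "steel bolt m6",
    "steel bolt m8", "steel bolt m10",
]
codes = ["CODE_A"] * 5 + ["CODE_B"] * 5
accuracy = cross_validated_accuracy(descriptions, codes, k=5)
```

With a code distribution as uneven as the one described, plain random folds can leave a rare code out of the training data entirely, so every item with that code is misclassified in that fold.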

Text Mining: Results

Overall accuracy 53.0% (95% CI: 51.4%, 54.5%).

Specific tariffs: e.g. XXXX 88.9% (95% CI: 83.7%, 92.9%).

Conclusion: could be ok for triage.
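As a rough check on the overall figure, a normal-approximation (Wald) interval can be computed from the reported accuracy. Two things are assumed here and not stated in the slides: that the denominator is all 3830 descriptions, each scored once across the folds, and that a Wald interval is an acceptable stand-in for whatever method produced the reported interval. Under those assumptions the result lands within about 0.1 percentage point of the reported bounds.

```python
from math import sqrt


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% interval for a proportion."""
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Reported overall accuracy; n = 3830 is an assumed denominator.
low, high = wald_interval(0.530, 3830)

# The brief asked whether 80% or more is feasible.
meets_target = low >= 0.80

print(f"95% interval: ({100 * low:.1f}%, {100 * high:.1f}%)")
```

The whole interval sits far below 80%, which fits the slide's conclusion that the classifier suits triage, not automated assignment.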

Failures

- Tried too hard.
- Too many ideas, not enough structure.
- Ideas began outside, not inside.
- Great ideas, poor fit.

Key Lessons Learned 1/2

Bromides.

1. Operational utility is not the same as statistical significance.
   - Sensitivity and sometimes specificity trump p-values.
2. The outcome of data-mining might (should?) not be a statistical model.
   - Statistical models are half-way there.
3. Start small: solve case studies.
   - Individually: non-threatening low-bar concrete outcomes.
   - Swarm.
4. Analyze the data that you have now.
   - Delay doesn’t compensate for shortcomings; action does.
5. Failures, done right, aren’t failures.
   - Critique thoroughly, including when to try again.
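Lesson 1 contrasts p-values with sensitivity and specificity. As a reminder of what those two quantities measure for an inspection rule, here is a minimal sketch. The counts are invented for illustration and are not from the talk.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = true_pos / (true_pos + false_neg)  # share of risky items flagged
    specificity = true_neg / (true_neg + false_pos)  # share of clean items passed
    return sensitivity, specificity


# Hypothetical counts: 10 risky items, 990 clean items.
sens, spec = sensitivity_specificity(true_pos=8, false_neg=2, true_neg=900, false_pos=90)
```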

Key Lessons Learned 2/2

Facing the Organization

6. Visit & sustain engagement.
   - Be in the room.
7. Deliver useful, usable outcomes but operationalise lightly.
   - Statistical models are half-way there.
8. Build bridges inside and outside the organization.
   - Prepare for the new normal. Network.
9. Identify, cultivate, & reward champions.
   - How can you help them to think differently about what you can possibly do?
10. Manage expectations carefully.
    - Under-promise and over-deliver.

Be patient!

Grateful Thanks

Matt Chisholm
Sandy Clarke
Greg Hood
Richard Gao
Chris Woodland
Nyree Stenekes
Tarik Zaman
Wayne Atkinson

Outline

Biosecurity

CEBRA

Data-Mining Examples
  Risk–Return Case Studies
  Spatial Analysis of Intercepted Mail
  Text Mining for Profiling

