Data Mining for Biosecurity Regulation
Andrew Robinson
CEBRA, University of Melbourne
August 10, 2016
Centre of Excellence for Biosecurity Risk Analysis
Outline
Biosecurity
CEBRA
Data-Mining Examples
Failures & Lessons Learned
Biosecurity
Biosecurity is Important
- Tree snakes in Guam: 12 native bird species now extinct.
- Annual cost of invasives: $1.4 trillion; 5% GGDP.
Biosecurity is Important
- 2001 FMD in UK: cost 8 billion pounds; 6 M sheep & cattle were slaughtered (2030 tested positive!)
- Modelled impact in Australia: $7 or $16 B; now $50 B.
Biosecurity is Expensive
Department of Agriculture and Water Resources 2014–15 Annual Report.
17 900 000 Air passengers
146 100 000 Mail Articles
18 000 Vessel First-Port Arrivals
611 000 Air Freight Consignments (< $1000)
450 000 Cargo units referred from Customs (in 2014)
Biosecurity is Difficult
Now, here, you see, it takes all the running you can do, to keep in the same place.
— The Red Queen, Through the Looking Glass.
CEBRA
Centre of Excellence for Biosecurity Risk Analysis
- CEBRA established in the University of Melbourne
- Four year contract, started July 1 2013
- Jointly funded by
  - Department of Agriculture and Water Resources, and
  - New Zealand’s Ministry for Primary Industries.
- CEBRA curates proposal development inside departments.
http://www.cebra.unimelb.edu.au
Data-Mining Examples
- Border data case studies
- Geolocating dirty mail
- Text mining
- Pooling passenger data
- Hunting brokers
- Profiling international vessels
- Performance indicators for compliance monitoring
- Predicting hitch-hikers
ACERA 0806: 2001 IQI — rollback
ULD (External Inspection)
CEBRA provided a spreadsheet tool to the Department.
Table: Predicted 95% risk rate and tentative future sampling rate for 2007 for a risk cutoff of 1%.

Region      Insp.    Cont.  p (%)  f (%)  π       n
Brisbane    37743    58     0.154  0.190  1.86    701
Far North   2957     33     1.116  1.470  100.00  2957
NSW         207764   137    0.066  0.076  0.19    389
SA          17510    59     0.337  0.415  9.31    1630
VIC         91491    24     0.026  0.036  0.43    389
WA          14067    0      0.000  0.014  1.36    191
National    371532   311    0.084  0.092  0.15    552
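The p and π columns can be reproduced from the counts in the table. The sketch below assumes (an interpretation, not something the slide states) that p is Cont. divided by Insp. and that π is n divided by Insp., both expressed as percentages. The f column is left alone because the slide does not say how the 95% risk rate was derived. The function name `rates` is illustrative only.

```python
# Rows copied from the table: region -> (inspected, contaminated, sample size n).
ROWS = {
    "Brisbane": (37743, 58, 701),
    "Far North": (2957, 33, 2957),
    "NSW": (207764, 137, 389),
    "SA": (17510, 59, 1630),
    "VIC": (91491, 24, 389),
    "WA": (14067, 0, 191),
    "National": (371532, 311, 552),
}


def rates(inspected, contaminated, n):
    """Return (p, pi) in percent under the assumed column definitions.

    p  = observed contamination rate = contaminated / inspected
    pi = tentative sampling rate     = n / inspected
    """
    p = 100.0 * contaminated / inspected
    pi = 100.0 * n / inspected
    return p, pi


for region, (inspected, contaminated, n) in ROWS.items():
    p, pi = rates(inspected, contaminated, n)
    print(f"{region:<10} p = {p:.3f}%  pi = {pi:.2f}%")
```

Under these assumptions every recomputed p and π agrees with the tabulated value to the precision shown on the slide.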
The Benefits
- Monitoring ULDs: 370,000 in 2008; 14,000 in 2014.
- Monitoring reportable documents: 2.7 million in 2008; 16,000 in 2014.
- Sea containers: 2 million in 2008; expanded CAL, huge reduction in non-CAL inspection; 370,000 in 2014.
NB: imperfect inspection data.
CEBRA 1301A1: Spatial Analysis of Intercepted Mail
International mail is monitored by DDU, X-ray, and manual inspection in Gateway Facilities.
- Delivery address is recorded for all articles intercepted with biosecurity risk material (BRM).
- Addresses can be geolocated to ABS census region.
CEBRA used data-mining tools to identify patterns.
- Spatial analysis: spatial patterns in intercepted goods?
- Statistical analysis: any correlation with census-measured characteristics at the ABS statistical unit level 2 or 3?
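One simple way to pose the statistical question is to convert interceptions to a per-capita rate for each census region and correlate that rate with a census characteristic. The sketch below is illustrative only: the region records are invented placeholders rather than CEBRA data, the characteristic is unnamed, and Pearson correlation is just one possible measure, not necessarily the one used in the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL records, one per census region:
# (interceptions, resident population, some census-measured characteristic).
regions = [
    (12, 8000, 0.10),
    (30, 12000, 0.22),
    (5, 6000, 0.08),
    (44, 15000, 0.31),
    (9, 9000, 0.12),
]

# Interceptions per 1000 residents, so large regions do not dominate.
rate_per_1000 = [1000.0 * hits / pop for hits, pop, _ in regions]
characteristic = [value for _, _, value in regions]

r = pearson(rate_per_1000, characteristic)
print(f"correlation between interception rate and characteristic: {r:.2f}")
```

Normalising by population before correlating matters here, because raw interception counts mostly track how many people live in a region.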
[Figure: Greater Melbourne seizures, 2008]
Potential Future Directions
- Postcode profiling (SGF mail counter, urban/rural)
- Case studies, e.g.,
  - Locale: Sydney area.
  - Infrastructure: Universities.
  - Interceptions: khat, tea, seeds, finfish.
- Other Sources
  - Air Cargo
  - Customs analysis.
- Address analysis: postboxes?
CEBRA 1401C/D: SAC Text Mining
SAC: self-assessed clearance, < $1000 declared value for a range of goods. Cf. FID.
Brief: to assess automated prediction of economic tariff codes from free-text goods descriptions in SAC.
In particular, is the desired accuracy of 80% or more feasible?
SAC comprises 1304 tariff codes.
Text Mining: Data & Analysis
Data:
- 3830 goods descriptions with tariff codes assigned by Department staff.
- 278 unique tariff codes.
- Dictionary of tariff codes and their descriptions.
- Highly uneven distribution: 75% of tariff codes have < 10 entries
Strategy:
- Random forest using the RTextTools package in R
- 5-fold cross-validation
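The cross-validation step can be sketched independently of the classifier. The Python below is not the RTextTools workflow used in the project: it is a minimal stand-in with a pluggable `fit_predict` function, and the classifier shown is a majority-class baseline rather than a random forest. The descriptions and tariff labels are invented placeholders. A baseline like this is a useful reference point when the class distribution is as uneven as described above.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_baseline(train_texts, train_labels, test_texts):
    """Stand-in classifier: always predict the commonest training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common for _ in test_texts]


def cross_validated_accuracy(texts, labels, fit_predict, k=5, seed=0):
    """Hold out each fold in turn, train on the rest, pool the hits."""
    correct = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predictions = fit_predict(
            [texts[i] for i in train],
            [labels[i] for i in train],
            [texts[i] for i in held_out],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(labels)


# HYPOTHETICAL data: 20 descriptions across three made-up tariff labels.
labels = ["A"] * 12 + ["B"] * 5 + ["C"] * 3
texts = [f"goods description {i}" for i in range(len(labels))]

accuracy = cross_validated_accuracy(texts, labels, majority_baseline, k=5)
print(f"baseline 5-fold accuracy: {accuracy:.2f}")
```

Any real model can be dropped in by supplying a different `fit_predict` with the same three arguments.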
Text Mining: Results
Overall accuracy 53.0% (95% CI: 51.4%, 54.5%).
Specific tariffs: e.g. XXXX 88.9% (95% CI: 83.7%, 92.9%).
Conclusion: could be ok for triage.
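As a rough check on the overall interval, a normal-approximation (Wald) interval can be computed from the reported accuracy. This assumes the denominator is all 3830 descriptions; the slide states neither the denominator nor the interval method, so the sketch is only a plausibility check. The function name `wald_interval` is illustrative.

```python
from math import sqrt


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Reported overall accuracy, with an ASSUMED denominator of 3830.
low, high = wald_interval(0.530, 3830)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

The result lands within about a tenth of a percentage point of the reported (51.4%, 54.5%), which is consistent with an interval based on roughly 3830 scored descriptions.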
Failures & Lessons Learned
Failures
- Tried too hard.
- Too many ideas, not enough structure.
- Ideas began outside, not inside.
- Great ideas, poor fit.
Key Lessons Learned 1/2
Bromides.
1. Operational utility is not the same as statistical significance.
   - Sensitivity and sometimes specificity trump p-values.
2. The outcome of data-mining might (should?) not be a statistical model.
   - Statistical models are half-way there.
3. Start small: solve case studies.
   - Individually: non-threatening low-bar concrete outcomes.
   - Swarm.
4. Analyze the data that you have now.
   - Delay doesn’t compensate for shortcomings; action does.
5. Failures, done right, aren’t failures.
   - Critique thoroughly, including when to try again.
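Lesson 1 refers to sensitivity and specificity, which come straight from the counts of a screening rule's hits and misses. A minimal sketch, with invented counts for a hypothetical inspection profile:

```python
def sensitivity(true_pos, false_neg):
    """Share of genuinely risky items that the rule flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of clean items that the rule correctly leaves alone."""
    return true_neg / (true_neg + false_pos)


# HYPOTHETICAL outcome counts for a profiling rule, not real data.
tp, fn, tn, fp = 40, 10, 900, 50

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

Both quantities map directly onto operational questions (how much risk slips through, how much inspection effort is wasted), which a p-value alone does not answer.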
Key Lessons Learned 2/2
Facing the Organization
6. Visit & sustain engagement.
   - Be in the room.
7. Deliver useful, usable outcomes but operationalise lightly.
   - Statistical models are half-way there.
8. Build bridges inside and outside the organization.
   - Prepare for the new normal. Network.
9. Identify, cultivate, & reward champions.
   - How can you help them to think differently about what you can possibly do?
10. Manage expectations carefully.
   - Under-promise and over-deliver.
Be patient!
Grateful Thanks
Matt Chisholm
Sandy Clarke
Greg Hood
Richard Gao
Chris Woodland
Nyree Stenekes
Tarik Zaman
Wayne Atkinson
Outline
Biosecurity
CEBRA
Data-Mining Examples
  Risk–Return Case Studies
  Spatial Analysis of Intercepted Mail
  Text Mining for Profiling
Failures & Lessons Learned