the full stack

Department of Geography School of Social Science & Public Policy THE FULL STACK JON READES

Upload: jon-reades

Post on 07-Jan-2017


TRANSCRIPT

Page 1: The Full Stack

Department of Geography
School of Social Science & Public Policy

THE FULL STACK
JON READES

Page 2: The Full Stack

OBJECTIVE

To provide an overview of the tools and technologies that I have found – or seen – to enable good development practice & productive research.

Page 3: The Full Stack

MY BACKGROUND

• BA in Comparative Literature in 1997.
• Went to work for dot.com start-up.
• Learned to program, on the job.
• Learned SQL, on the job.
• Learned to back up more often, on the job.
• Managed sites, ETL systems & analytics over many years.
• Re-entered academia in 2006.
• PhD at CASA; collaboration with SENSEable City lab.
• Lecturer at King’s since 2013; helped set up Geocomputation pathway.

Net result: only Lit grad with an Erdős number?

Page 4: The Full Stack

MOTIVATION

Not this.

Page 5: The Full Stack

HOW DOES ‘BIG DATA WORK’ WORK?

• Idea
• Exploration
• Development
• Revision
• Writing Up

Start at random point & repeat many, many times.

Page 6: The Full Stack

BIG DATA WORK ON A PRACTICAL LEVEL

Page 7: The Full Stack

MY EXPECTATIONS FOR (GOOD) TOOLS

• They must be useful when I need them.
• They must get out of the way when I don’t.
• They must fail gracefully when they can’t help it.
• They must play well with other tools where feasible.
• They must make it easy for me to do the right thing.
• They should grow gracefully into operational systems.

Very few tools do all of these well.

Page 8: The Full Stack

WHERE DO WE GO FROM HERE?

In the remainder of this talk I will try to link my outputs – the pretty pictures – to the process by which they were created.

If you want to know more about something you see, just stop me.

Page 9: The Full Stack

PROGRAMMING LANGUAGES

Cellular Census (2007)

Considerations:
• Coherence of syntax
• Coherence of libraries
• Data-munging features
• Spatial analytic support
• Map-making & data viz
• Ability to get things done
• Availability of a good IDE

But it’s really the ‘value added’ features that matter.

Page 10: The Full Stack

DATA STORAGE & MANAGEMENT

The ‘Big Bubble’? (2014)

Considerations:
• Standards compliance
• (Spatial) Feature set (esp. indexing)
• Replay/Logging
• Replication & distribution
• Access controls & user management

A lot can be done without spatial queries. Learn about indexing, query & schema design, and partitioning.
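To make the point about indexing concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in (it is not one of the databases named in this talk), and the table, column and index names are hypothetical. The sketch compares the query plan for the same non-spatial query before and after an index is added.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the 'detail' text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)


# Hypothetical table of smart-card taps, held in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journeys ("
    "id INTEGER PRIMARY KEY, card_id INTEGER, station TEXT, tapped_at TEXT)"
)
conn.executemany(
    "INSERT INTO journeys (card_id, station, tapped_at) VALUES (?, ?, ?)",
    [
        (i % 1000, f"station_{i % 50}", f"2012-01-{(i % 28) + 1:02d}")
        for i in range(10000)
    ],
)

sql = "SELECT COUNT(*) FROM journeys WHERE card_id = 42"

plan_before = query_plan(conn, sql)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_journeys_card ON journeys (card_id)")
plan_after = query_plan(conn, sql)   # the planner can now use the index

count = conn.execute(sql).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("rows matched:", count)
```

The exact wording of the plan varies between SQLite versions, but the second plan should name the index, which is the habit worth building: check the plan, then decide what to index.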

Page 11: The Full Stack

GEODATA VISUALISATION

Global Health Partnerships (2016)

Considerations:
• Ease-of-use
• Scriptability
• Ability to layer
• Interoperability

Distinguish between mapping to communicate results with a spatial dimension and mapping to produce actual maps?

Page 12: The Full Stack

VERSION CONTROL & RECOVERY

Oyster Card Work (2012)

Considerations:
• Collaboration
• Scalability
• Ease of recovery
• Scale of use

Best if you never learn SVN/CVS; then your brain will not be done in by Git.

Git: commits on a plane!

Page 13: The Full Stack

WRITING

Thesis & ‘Space of Flows’ (2011, 2014)

Considerations:
• Getting out of the way
• Compatibility
• Collaboration
• Editing & comments
• Quality of output

What helps you to think? What helps you write first, but makes formatting later easy?

Page 14: The Full Stack

BACKUP & REPLICATION STRATEGIES

Pint of Science (2014)

Considerations:
• How easy to backup/share?
• How often?
• Where stored?
• How easy to recover?
• How selective is recovery?

Backup early & backup often. Never trust one solution or one location. Note: data protection issues.
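One way to act on "never trust one location" is to copy each file to more than one destination and check every copy. The sketch below is only an illustration: the function names and the file are hypothetical, and the two destinations are temporary directories standing in for what should really be separate disks or sites.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_to_all(source, destinations):
    """Copy `source` into every destination directory and verify each copy.

    Raises IOError if any copy's checksum differs from the original's.
    Returns the list of paths written.
    """
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demo: two temporary directories stand in for two independent locations.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    original = tmp / "results.csv"
    original.write_text("zone,count\nA,10\nB,7\n")
    copies = backup_to_all(original, [tmp / "backup_local", tmp / "backup_remote"])
    n_copies = len(copies)
    demo_ok = all(sha256_of(copy) == sha256_of(original) for copy in copies)
    print(n_copies, "verified copies:", demo_ok)
```

A script like this can be run from a cron job; it does not address the data protection issues noted above, which depend on where the copies are allowed to live.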

Page 15: The Full Stack

COMPLIANCE & DATA SECURITY

Considerations:
• Performance
• Encryption
• ACLs (users/groups/systems)
• Password Managers

Encrypt! Encrypt! Encrypt! Encourage use of password managers.

Page 16: The Full Stack

REPLICABLE RESEARCH

N/S Housing Divide (2017?)

Also worth watching:
• Travis CI: automated testing with GitHub integration.
• Docker/Vagrant: replication & virtualisation.

Full replication of someone else’s entire data analysis process is harder than you think!

Page 17: The Full Stack

WHAT’S MISSING?

• Better ways of specifying the full analytical ‘context’ – including versions of libraries, platform, etc. – as well as the input/output ‘pipeline’ – such as data and results (rctrack seems to want to do this, but only for R; YAML looks more promising).

• Ways of talking about data processing pipelines & steps (UML is not the answer).

• Valuing of good (open) code & good data by institutions and research councils.
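As a rough idea of what recording the analytical ‘context’ might look like in practice, the sketch below captures the interpreter version, the platform, the versions of named packages, and a checksum of the input data. It writes JSON rather than the YAML mentioned above, purely because JSON is in the Python standard library. The function name, the package list and the sample input are all hypothetical.

```python
import hashlib
import json
import platform
from importlib import metadata


def describe_context(packages, input_bytes=b""):
    """Collect a small, serialisable record of the analysis environment.

    `packages` is a list of installed distribution names to look up;
    names that are not installed are recorded as None rather than failing.
    `input_bytes` is the raw input data, recorded only as a SHA-256 digest.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


# Hypothetical usage: the second name is deliberately not installed.
context = describe_context(
    ["pip", "a-package-that-is-not-installed"],
    b"zone,count\nA,10\n",
)
print(json.dumps(context, indent=2, sort_keys=True))
```

Saving this record next to each set of results covers only part of the gap described above: it says what was used, not how the pipeline steps fit together.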

Page 18: The Full Stack

THE BIG PICTURE

Tools (ca. 2006):
• Eclipse
• Perl/Java
• Oracle 8i
• Cron jobs
• OLAP Tools
• CVS
• ArcMap

Tools (ca. 2016):
• R/RStudio
• Python
• Postgres + PostGIS
• Cron jobs
• Knitr, etc.
• Git
• QGIS

Page 19: The Full Stack

THE BIG PICTURE

Massive shift from expensive proprietary to cheap open (both software & hardware).

Underlying distinction between operational and development/research environments persists. The problem: one tends to evolve into the other.

Page 20: The Full Stack

FINAL THOUGHT

Document your code. And any sources it drew upon.

You will regret not doing it.
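As one possible shape for this, the sketch below documents a small function together with the sources it draws on. The function itself (a great-circle distance) is a hypothetical example chosen for illustration, not code from the projects shown in this talk.

```python
import math

# Mean Earth radius in kilometres (the IUGG mean radius).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points.

    Inputs are latitudes and longitudes in decimal degrees.

    Method: the haversine formula, which treats the Earth as a sphere.
    Sources: standard spherical trigonometry for the formula; the IUGG
    mean radius for EARTH_RADIUS_KM.
    Limitations: ignores the Earth's flattening, so expect small errors
    compared with an ellipsoidal calculation.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 so rounding error can never push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate coordinates for central London and central Paris.
london_paris_km = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(round(london_paris_km, 1), "km")
```

The docstring records what the function does, where the method and constant came from, and what it does not handle, which is the information a later reader (often you) will need.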

Page 21: The Full Stack

THANK YOU

Jon Reades
@jreades
reades.com
kingsgeocomputation.org