csv,conf 2014 - open data within organizations

Opening data within organisations #csvconf 2014 - Berlin - @stevenbeeckman

Upload: steven-beeckman

Post on 11-Aug-2014

DESCRIPTION

This talk describes how we are trying to open (sometimes sensitive) data within our organization.

TRANSCRIPT

Opening data within organisations

#csvconf 2014 - Berlin - @stevenbeeckman

hi

I’m @stevenbeeckman - a digital DJ

mixcloud.com/gehorschade.kollektiv

Conductor for StartupBus Europe!

www.startupbus.com

Vienna

Poland

Estonia

Germany

UK

France

Spain

Italy

Greece

Pre-apply now at startupbus.com

Follow @TheStartupBus

Who here knows what devops is about?

developers building apps vs operations running apps in production

There is a bigger picture

there are a bit more than 2 silos

Defence 101

Units on the battleground

Units in training

Majors, Colonels and Generals in the staff

Defence 101 (bis)

An army needs a very strong HR and logistics machine

Belgian government budget cuts usually hit the defence budget first

Need for integrated management

Calculating the cost of a training exercise took 4 people 4 weeks

to go bug 5 application owners

for data hidden in

relational databases

Excel sheets

Business Objects reports

Access databases

(not so) shared drives

some logistics guy deployed in Afghanistan

I can’t access the shared drive, I wish I had my data locally!

Stone Age

I’m tired of these Excel files and Access databases saying something contradictory.

Gimme the damn truth!

Requirements

1. Centralize data

2. But protect sensitive data (HR, medical privacy, …)

3. Make the data available offline

4. Nodes should be able to regain current state after loss of communication for 5 days

Stone Age 2009

First XML based prototypes

XML-based prototypes

• Able to extract at most 40 tables from the logistics application in one night

• Slow

• Problems with identical rows

New team & new approach

New team

Handover to Dept AD&M (“the pros”)

New approach

Systems engineering: holistic view of the problem

Take into account the protection of sensitive data

Make it more stable than the prototype

Explicitly not real-time

Check out NASA’s course: http://www.saylor.org/sse101/

Conceptually

• lots of data sources with data owners

• 1 central data “warehouse”

• lots of nodes downloading the data they have access rights to

HR app

Financial app

Logistics app

Planning app

Excel

Ops unit

data warehouse

another app

Inside the data warehouse

Extraction Engine (EE)

File Server

Access Control

Extraction Engine (EE)

Based on open-source software:

Linux

MySQL

Talend (Eclipse based ETL workflow tool)

What does the EE do every night?

• Detect the metadata (store it in XML format)

• Take a full dump of each data source in CSV format

• Calculate the delta (deleted rows and inserted rows, in CSV format)

• Create two zip files:

• One full copy

• One delta for this day
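The delta step above can be sketched roughly as follows. This is a minimal illustration, not the EE's actual implementation (which the slides say is built on Talend): the function names `read_rows` and `compute_delta` and the sample data are hypothetical. It compares two full CSV dumps as multisets, so identical rows, which the slides mention as a problem for the XML prototypes, are counted rather than collapsed.

```python
import csv
import io
from collections import Counter


def read_rows(csv_text):
    """Parse CSV text into a list of row tuples, skipping the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header line
    return [tuple(row) for row in reader]


def compute_delta(previous_rows, current_rows):
    """Return (deleted, inserted) rows between two full dumps.

    Rows are compared as a multiset: if a row appears twice yesterday
    and once today, exactly one copy is reported as deleted.
    """
    before = Counter(previous_rows)
    after = Counter(current_rows)
    deleted = list((before - after).elements())
    inserted = list((after - before).elements())
    return deleted, inserted


# Hypothetical sample dumps (not from the talk).
yesterday = "id,item,qty\n1,boots,10\n2,tents,4\n2,tents,4\n"
today = "id,item,qty\n1,boots,12\n2,tents,4\n"

deleted, inserted = compute_delta(read_rows(yesterday), read_rows(today))
```

In this sketch a changed row shows up as one deletion plus one insertion, which matches the slide's description of a delta as deleted rows and inserted rows only.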

File server

• Stores the zip files available for the nodes

• Full copy only for the current day (but we have a history for a month)

• Delta zip files for 14 days
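The retention rules above are what let a node catch up after a loss of communication. A minimal sketch of how a node might decide what to download, assuming (this is not stated in the slides) that the delta for a given day turns the previous day's state into that day's state; `files_to_fetch` and `DELTA_RETENTION_DAYS` are hypothetical names:

```python
from datetime import date, timedelta

DELTA_RETENTION_DAYS = 14  # from the slides: delta zip files are kept for 14 days


def files_to_fetch(last_sync, today):
    """Return the (kind, day) downloads a node needs, in the order to apply them."""
    days_behind = (today - last_sync).days
    if days_behind <= 0:
        return []  # already up to date
    if days_behind > DELTA_RETENTION_DAYS:
        # The needed deltas are no longer on the file server: restart from the full copy.
        return [("full", today)]
    # Otherwise replay one delta per missed day, oldest first.
    return [("delta", last_sync + timedelta(days=n)) for n in range(1, days_behind + 1)]


# A node that was offline for 5 days replays 5 deltas.
plan = files_to_fetch(date(2014, 7, 20), date(2014, 7, 25))
```

Under these assumptions, the 5-day recovery requirement fits comfortably inside the 14-day delta window.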

Access control

• Data providers decide for themselves whether their data is

• “public” within the organisation

• “restricted” to a set of nodes
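The two visibility levels could be modelled along these lines. This is only a sketch under assumptions: the `Dataset` class, its fields, and the node identifiers are hypothetical, and the slides do not describe how Access Control is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str
    visibility: str  # "public" or "restricted", as chosen by the data provider
    allowed_nodes: frozenset = field(default_factory=frozenset)


def can_download(dataset, node_id):
    """Decide whether a node may download a dataset."""
    if dataset.visibility == "public":
        return True  # public within the organisation
    if dataset.visibility == "restricted":
        return node_id in dataset.allowed_nodes
    return False  # unknown visibility: deny by default


# Hypothetical datasets and nodes for illustration.
stock = Dataset("logistics_stock", "public")
medical = Dataset("hr_medical", "restricted", frozenset({"node-hr-01"}))
```

Denying by default for an unrecognised visibility value is a design choice of this sketch, made because the data may be sensitive (HR, medical).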

The nodes

Custom XAMPP package for local development of reports, or JBoss for bigger nodes with validated reports

Custom loader contacting Access Control and filling the MySQL database

Custom “Local Reporting Framework” (XML + XSLT)
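As an illustration of what the loader's "filling the database" step might look like for a delta, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for the node's MySQL database, and the `apply_delta` function, table, and columns are hypothetical. Each deleted row removes exactly one matching row, so identical rows stay correctly counted.

```python
import sqlite3


def apply_delta(conn, table, columns, deleted, inserted):
    """Apply one day's delta (deleted rows, then inserted rows) to a local table.

    Table and column names are assumed to come from trusted metadata,
    since they are interpolated into the SQL text.
    """
    where = " AND ".join(f"{col} = ?" for col in columns)
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # one transaction: the delta applies fully or not at all
        for row in deleted:
            # Delete a single matching row, even if identical copies exist.
            conn.execute(
                f"DELETE FROM {table} WHERE rowid = "
                f"(SELECT rowid FROM {table} WHERE {where} LIMIT 1)",
                row,
            )
        conn.executemany(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
            inserted,
        )


# Hypothetical local table holding yesterday's state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (id TEXT, item TEXT, qty TEXT)")
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [("1", "boots", "10"), ("2", "tents", "4"), ("2", "tents", "4")],
)

apply_delta(
    conn,
    "stock",
    ["id", "item", "qty"],
    deleted=[("1", "boots", "10"), ("2", "tents", "4")],
    inserted=[("1", "boots", "12")],
)
rows = sorted(conn.execute("SELECT id, item, qty FROM stock").fetchall())
```

Wrapping the whole delta in one transaction means a node that loses its connection mid-load is left in the previous day's consistent state.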

Current status

2014

Growth

[Growth chart; extracted labels: 4090, 1000]

@SpaceCatPics

"A LARGE SYSTEM IS ONE WHERE YOU DO NOT KNOW THAT SOME OF ITS COMPONENTS EVEN EXIST."

Some statistics

• 400 users (nodes)

• > 1 billion rows processed each night

• ~ 75 gigabytes of data processed each night

• making the EE work requires > 2000 tables

32 source databases

[Bar chart of source types: FTP, LDAP, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, Sharepoint; axis from 0 to 18]

big data schema

What used to take my team 4 weeks now takes us one click on a button!

A major responsible for military training & exercises

Questions?

@stevenbeeckman #csvconf

Hackers, hipsters & hustlers should pre-apply at

www.startupbus.com

Image credits

http://www.photographersgallery.com/photo.asp?id=2411

Diagonal full of silos

http://www.pragmaticdevops.com/2014/04/management/hacking-management/devops-as-a-team-or-a-responsibility/

Two silos