escience in the netherlands - dagstuhlmaterials.dagstuhl.de/files/16/16252/16252.robvan... ·  ·...

81
eScience in the Netherlands Rob van Nieuwpoort [email protected]

Upload: lamlien

Post on 24-May-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

eScience in the Netherlands

Rob van Nieuwpoort

[email protected]

We work demand-driven

35

Career paths

ResearchTechnicalManagerial

Scale…

13

12

11

10 eScience research engineer

eScience research engineer

eScience

coordinator

eScience

specialist

eScience

researcher

eScience

manager

eScience top

researcher

eScience top

specialist

Lessons learnt• Demand-driven: start from the science

• Collaboration, not competition (connected projects, calls)

• Good is good enough

• Generalization (10% ring-fenced)

• Communication communication communication

– Hiring people

– Internal communication

– Project kickoffs

• IP, work place, generalization, co-authorships

– coordinators

– Web sites, demo’s

• Keep challenging the RSEs– Courses, hackathons, sprints, …

– Switch disciplines

eStepThe eScience technology platform

A coherent set of technologies to tackle the grand challenges in eScience

Cross-cutting basic skills

• Code quality and best practices

• Integration of software

• Scaling of software

• Analytics and statistics

• Visualization

NLeSC eScience competences applied in research

1. Optimized data handlingData integration, data base optimization, structured & unstructured data, real time data

2. Big data analyticsStatistics, machine learning, visualization, text mining

3. Efficient computingDistributed & acceleratedcomputing, efficient algorithms

Optimized data handling

Big Data analytics

Efficient computing

Distributed computing

eStep

Accelerated computing

Low power computing

Orchestrated computing

High-performance computing

Natural Language processing

Machine learning

Information visualization

Scientific visualization

Information retrieval

Computer vision

Handling sensor data

Linked data

Information integration

Databases

Data assimilation

• Key expertises

are used in many

projects

• Projects often

use quite a

number of

different

competences and

technologies

eStep Goals

• Prevent fragmentation and duplication

• Promote the exchange and re-use of best practices

• Represent NLeSC’s expertise and knowledge base

• Improve the eScience state of the art with a fundamental eScience research line

NLeSC projectseStep

Tailor

Generalize

DevelopAdopt

eStep

project-specific software

discipline-specific software

enhanced science

overarching software

e-infrastructure

generic libraries, tools, and algorithms

NLeS

C

pro

jects

• Main criteria for integrating

technology in eStep:

– State-of-the-art / best-of-breed?

– Generic and overarching?

– Match with our expertise areas?

– Includes externally developed

software

Open platform!

Our sustainability approach• Prevent duplication, fragmentation

• Build something that is worth sustaining!– Sufficiently generic

– Modular

– High quality

– Must be taken into account from the start

• Enforce software engineering guidelines and best practices

• Educate partners with software carpentry and data carpentry

• Open source / open access, open standards, unless…

• Community coding

• Standardization for software and data formats

• eStep is an open platform

GitHub

Travis CITest and Deploy with Confidence.

Easily sync your GitHub projects with Travis CI

and you’ll be testing your code in minutes!

We run a Jenkins CI instance locally.

Used for private repositories and

repositories requiring HPC middleware.

A Common Workflow @ NLeSC

Open platform for building, shipping

and running distributed applications.

deploy

software

eScience software

• technology.esciencecenter.nl

• Non-technical, targets

general audience

• estep.esciencecenter.nl

• All eScience software

and knowledge you

need, in one place

• Technical, targets

developers, PIs

Knowledge base

• knowledge.esciencecenter.nl

• training and education

• best practices

• tutorials

• white papers

• training resources

• Software development

Checklist available

More info on eStep

technology.esciencecenter.nl

estep.esciencecenter.nl

[email protected]

Logo Bingo

Osmium

Semanticizer

xtas

EDAL

NLTK

CommonSense

AHN2 viewer

Optimized data handling

Cities poorly protected against heat

Three persons in elderly house died due to heat

End of 16 day heat wave

Heat protection plan abandoned

Elderly use heat info call desk massively

Summer in the city example: human thermal comfort

Courtesy Bert Holtslag

Summer in the city example

• Kilometer scale: elevation (AHN2) and land-use data (Kadaster), imagery

for assessing the green vegetation coverage and the soil moisture

content

• Street scale: sky view factor, the building height to street width ratio, the

reflectivity and thermal characteristics of buildings and streets, the

abundance of vegetation

• Network of weather stations and crowd sourcing: wunderground.com

• Special measuring campaigns

• Social media?

• Combine with fine-grained models

Novel hourly forecasting

system for human thermal

comfort in urban areas on

street level

Courtesy Bert Holtslag

Via Appia

Optimized Data Handling Technology

• Distributed sensor networks, multi-model and multi-scale

simulations, data assimilation, data integration, multi-scale

pattern recognition, geographic information systems,

databases, …

• Xenon, NetCDF, HDF5, ROOT, XNAT, OpenDA, Hadoop,

MapReduce, Oracle, MySQL, Postgres, MonetDB,

ElasticSearch, DataVault, JSON, Spark, …

Big Data Analytics

eEcology example

Courtesy Willem Bouten

Accelerometer and Behaviour

standingsitting floating

Sta t i c acce le ra t ion

X-flappingglidingflapping flight

Dynamic acce le ra t ion

Heave vertical

Surge forward

Sway sideward

Courtesy Willem Bouten

Machine learning / annotation interface

Routes and Geology

Courtesy Willem Bouten

Detours and Climate

Modis satellite image of dust

Courtesy Willem Bouten

Embodied Emotion Project• Mapping bodily expression of emotions

– To be downhearted

– Clenching fists

– My heart fills with joy

– My blood is boiling

• Test case: Dutch theatre texts 1600-1830

– Shift in experienced emotions

– Shift in embodiment of emotions

• Approach

– Establishing corpus & standardizing text

– Establishing emotional and bodily vocabularies

– Emotion mining

– Visualizing resultsSource: Nummenmaa et al., 2013

eMetabolomics example

• Use reaction rules to identify compounds in

Mass-spectrometry datasets

• Online at http://www.emetabolomics.org/

• Public and private version, private allows

bigger/longer calculations

• Lars Ridder, Laboratory of Biochemistry,

Wageningen University

Courtesy Lars Ridder

Courtesy Lars Ridder

forecast.ewatercycle.org

Big Data Analytics Technology

• Natural language processing, machine learning, information and

scientific visualization, information retrieval, computer vision

• Matlab, R, NumPy, SciPy, scikit learn, Pandas, Weka, Xtas,

Twiqs.nl, D3, ExtJS, Cesium, Leaflet, OpenLayers, GeoExt,

X3Dom, X3DomExt, Mapnik, CommonSense, …

Efficient Computing

Efficient Computing

A r i t h m e t i c I n t e n s i t y

O( N )O( log(N) )

O( 1 )

SpMV, BLAS1,2

Stencils (PDEs)

Lattice Methods

FFTs

Dense Linear Algebra

(BLAS3)

Particle Methods

• Smart algorithms can improve performance dramatically

• Power consumption is becoming the bottleneck

• Legacy codes are inefficient on modern architectures

– Need completely different optimizations, algorithms

Efficient Computing Example

• Radio Frequency Interference (RFI) is a huge

problem for many radio astronomy observations

• Caused by

– Lightning, Vehicles, airplanes, satellites, electrical

equipment, GSM, FM Radio, fences, reflection of

wind turbines, …

• Best removed offline

– Complete dataset available

– Good overview / statistics / model

– Can spend compute cycles

• Partner: Astron

Real-time RFI mitigation

• Some pipelines need to run in real time today

– Image-based transient detection (LOFAR/AARTFAAC)

– Pulsar searching (WSRT/Apertif)

• SKA will be entirely real-time

– Data rates simply too high to store

• Novel algorithms with linear computational complexity– Only very little loss in quality

RFI mitigation on accelerators

• Accelerator-based computing

– GPUs, Xeon Phi, …

– Astronomy, ocean modeling, digital

forensics, radar systems, high-

energy physics

• Auto-tuning & runtime compilation

– Generate many codes at run-time,

select most efficient

Pulsar B1919+21 in the Vulpecula nebula.

Pulse profile created with folding and the LOFAR software telescope.Background picture courtesy European Southern Observatory.

0

10

20

30

40

50

60

70

80

Xeon Phi NVIDIA GTXTitan GPU

AMD HD7990GPU

0

2

4

6

8

10

Xeon Phi NVIDIA GTXTitan GPU

AMD HD7990GPU

Performance compared to CPU Power usage compared to CPU

Performance profile of Ocean simulation

POP has a very large codebase

written in Fortran 90

Callgraph obtained using gprof

3 kernels GPU optimized, 20%

improvement

Henk Dijkstra, Ben van

Werkhoven, et al.

Efficient Computing: Distributed

• Water management and climate modeling

• Simulations in astrophysics (AMUSE)

• Digital forensics (NFI)

• Astronomy (LOFAR)

• Text mining (xtas)

• Computational chemistry (Noodles)

• High-energy physics (ROOT & pandas)

Novel domain-specific algorithm for work distribution

Towards 2 km

resolution.

Better than

space filling

curves for

distributed

runs (36%

improvement).

Courtesy

Jason

Maassen

Efficient Computing Technology (1)

• Smart algorithms:

Efficient Computing Technology (2)

• Distributed computing, accelerators (GPUs), low-

power, orchestrated computing, HPC

• Ibis, Aether, Xenon, Magnesium, Osmium,

SmartSockets, MPI, UDT, Galaxy, Knime, Sockets,

Cuda, OpenCL, OpenMP, Thrift, Protobuffers,

ZeroMQ, SSH, GridFTP, Globus, SLURM, PBS,

GridEngine, P2P, OpenFlow, bandwidth on demand,

Docker, Celery, …

eStep: Example Libraries

Several NLeSC applications require access to

distributed compute and storage resources.

Xenon provides a simple API for this, allowing

rapid development of such applications

Middleware independent: portable & reusable

Xenon: current users• Osmium: webservice on top Xenon

– Magma: eMetabolomics mass spectrometry analysis

– SIMCity: decision support for urban social economic complexity

• eSalsa: distributed climate simulation deployment

• Amuse: multi-model distributed astrophysics

• Via Appia: Pointclouds in archeology

• Vbrowser: distributed file management

• Biomarker: medical data upload tool

• Noodles: workflow tool in Python (Computational chemistry)

Xtas: distributed text analysis • Natural language processing and text mining

– Named entity recognition, sentiment detection, document clustering, topic modeling

• Use as web service or Python library

• Integrates with Elasticsearch for document storage and retrieval

• Developed by Intelligent System Lab Amsterdam (ISLA, UvA) and NLeSC

• Tight integration of external and internal software

• Searching Public Discourse: text analytics pipeline for historical research

• Texcavator: supporting large-scale text mining in the field of digital humanities

• Beyond the book: how international is a work of fiction?

• Semanticizer: entity linking

More info on eStep

technology.esciencecenter.nl

estep.esciencecenter.nl

[email protected]

Backup slides

Generalized software

from NLeSC portfolio

eStep

Externally developed

software

you are

here

ASDI

Generalized software

from NLeSC portfolio

eStep

Externally developed

software

you are

here

DTEC

NLeSC TrainingEssential Skills in Data-Intensive Research: Enabling your Research

25-29 January 2016, SURF Academy, Utrecht (for Life Sciences)

NLeSC, SURFsara, SURFnet and DTL

5 day workshop, with both hands-on and taught components for 1st year PhD students without programming background

Days 1-2 = Dealing with Data (Data Carpentry)Days 3-4 = Software and Programming (Software Carpentry)Day 5 = Introducing the e-Infrastructure

Future courses requested with other domain focus• Humanities & Social Sciences• Environment & Sustainability• Life Sciences• Physics & Beyond

Based on SoftwareCarpenty.org model(NLeSC & SURFsara have trained course leaders)

Dealing with DataData Stewardship and FAIR best practiceFrom Excel to databasesSQL practicalR practical

Computation & AutomationUsing the ShellIntroduction to programming (Python)Github and version control.Unit testingDebuggingDocumentation

Introduction to the e-InfrastructureSURF to introduce the national e-infrastructureReal world examples of e-infrastructure enhanced research from NLeSC

• Make researchers more productive by teaching them basic lab skills for scientific computing

• All lessons are freely available

• Workshops, teacher trainings

• Example lessons

– Version Control and Unit Testing for Scientific Software

– Shell, Git, Scientific Python

– Testing and Continuous Integration with Python

– From Excel to a Database

– Data Management in the Ocean, Weather and Climate Sciences

– Visualizing Your Data on the Web Using D3

– Working With Data on the Web

– Intermediate/Advanced R Lessons

– Programming with GAP

• Develop and teach workshops on the fundamental data skills for research in all domains

• Covering the full lifecycle of data-driven research

• Introductory computational skills for data management and analysis

• Domain-specific lessons, from life and physical sciences to social sciences

• Build on existing knowledge, enabling quick application of new skills to own research

• Examples:

– Ecology

• Data Organization in Spreadsheets, Data Cleaning with OpenRefine, Data Management with SQL, Data

Analysis and Visualization in R, Data Analysis and Visualization in Python

– Genomics

• Introduction to cloud computing for genomics, Introduction to the command line, Data wrangling and

processing, Data analysis in R, Data visualization in R

– Social sciences

• Social sciences text mining

– Biology

– Geospatial data

Coding Style• Nicholas C. Zakas: Why Coding Style Matters• http://coding.smashingmagazine.com/2012/10/25/why-coding-style-matters

• Use is mandatory

• We provide editor configuration

• http://editorconfig.org/

EditorConfig

Conventions & Guidelines• Web development

– General frontend guidelines: https://github.com/bendc/frontend-guidelines

– AngularJS: https://github.com/johnpapa/angular-styleguide

– Airbnb JavaScript Style Guide: https://github.com/airbnb/javascript

• Python

– PEP8: https://www.python.org/dev/peps/pep-0008/

• Java

– Code Conventions for the JavaTM Programming Language (Oracle)

• Google Style Guides: https://github.com/google/styleguide

• Wikipedia: https://en.wikipedia.org/wiki/Coding_conventions

Quality Improvement Tools

• SonarQube: http://www.sonarqube.org

• Code climate: https://codeclimate.com

• Codacy: https://www.codacy.com

• Scrutinizer: https://scrutinizer-ci.com

• Landscape: https://landscape.io

• Coveralls: https://coveralls.io

• See also

– https://github.com/ripienaar/free-for-dev#code-quality

– http://shields.io/

Article about good development practices:

The Joel Test: 12 Steps to Better Code.

Unit & Integration Testing• Guide: Writing Testable Code

• 'Unit Testing Best Practices' and other presentations on http://artofunittesting.com/.

• Continuous integration testing with Travis-CI and Jenkins-CI

• We require at least 70% code coverage

• Java: junit

• Javascript

– Jasmine, a behavior-driven development framework for testing JavaScript code.

– Karma, Runs tests in web browser with code coverage.

– PhantomJS, headless web browser on CI-servers.

• Python

– Unittest, nose and pytest.

• R

– testthat

• Web development

– To interact with web-browsers use Selenium.

– Sauce Labs hosts a matrix of web-browsers and Operating Systems for testing.

– AngularJS applications can be tested with Protractor.

Documentation• Document at multiple levels

– Source code comments

– API documentation

– Installation and usage documentation

• Comments at each level should take into account the different target audiences

• Use Markdown, a readable lightweight markup language that can be converted

to many formats

Version Control• Git and GitHub

• A successful and simple Git branching model:

GitHub Flow• https://guides.github.com/introduction/flow/

• Commit messages are formatted and

formulated in a readable way

• http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html

• http://who-t.blogspot.nl/2009/12/on-commit-messages.html

Releases and packaging

• Tag versions, use github releases

• Semantic versioning

• Keep changelogs

• Packaging is important

– Use packaging that is well known and appropriate for user

community: pypi, npm, maven, docker

• Make your code and data citable: get a DOI (Zenodo)

Support levels• S0: generic software or hibernating software that is currently not used in the

NLeSC project portfolio– Fortran, Python, vBrowser, TwiNL, XNAT, …

– No support, not disseminated

• S1: software where NLeSC maintains expertise on, and that is used in projects,

as well as external software that NLeSC extends and improves– Potree, OpenDA, ElasticSearch

– Support for project partners only

– Contribute improvements back to community

• S2: software developed in-house, where NLeSC is the specialist– Xenon, Magnesium, Osmium, xtas, esiBayes, Aether, …

– Full support for project partners, limited support for Dutch scientific community, best effort for

international community

IP & Software Licenses• NLeSC does not develop IP portfolio

• Ownership of research results is shared property of partners,

including NLeSC

• IP protection is possible

• Software is open source / open access unless agreed otherwise

• Default software license: Apache 2.0

• Deviations possible, discuss with NLeSC MT