DIBBs Brown Dog: An Extensible and Distributed Data Transformation Service
Tabular Data Gap Filling
Climate Modeling Lidar
Flood Plain Analysis River Depth Distribution
River Maturity Stream Detection and Sinuosity
Satellite/Aerial Photos Land Cover/Usage
Water Detection (e.g. Lakes, Retaining Ponds)
Green Infrastructure
Hyperspectral
Radar
Photos
3D Reconstruction
3D Data
Human Preference Modeling
Video
People Detection/Tracking
Large Dynamic Group Behavior
Bee Detection/Tracking
Bee Colony Behavior
Underwater Photos
Color Correction
Image Stitching
Mapping
Event Detection
Species Detection/Counting Reef Changes
Food Supply
Structural Defects
Hazard Modeling
Microscopy Images
Pollen Detection/Classification
Paleoclimate
Evolution Root Tip Tracking
Phenomics
Materials Development
Cell Tracking
Tissue Classification
Renal Failure
Loss of Organ Function
Feedlot Tracking
Disease Detection
Historic Maps
River Meander
Coastline Changes
Documents
NLP
Sentiment Analysis
Regions in Conflict
Handwritten Documents Pre-Digital Datasets
Databases
Web Sites
Publications
Simulations
Ecosystems and Climate Change
M. Dietze, K. McHenry, A. Desai, “Model-data Synthesis and Forecasting Across the Upper Midwest: Partitioning Uncertainty and Environmental Heterogeneity in Ecosystem Carbon,” NSF DBI-1062547, 2011-2014
M. Dietze, K. McHenry, A. Desai, “ABI Development: The PEcAn Project - A Community Platform for Ecological Forecasting,” NSF DBI-1457890, 2015-2019
• Towards regional-scale high resolution estimates of plant life and carbon storage
• Scientific workflow and data assimilation system connecting a variety of models within the Ecology community to a variety of data sources
• Grown to 52 developers over the past 3 years
• NCSA / U. Illinois, BU, Brookhaven National Lab, University of Wisconsin, University of Notre Dame, Utah State, Columbia University, Pacific Northwest National Laboratory, DuPont Pioneer, Exeter College, UK, U. Arizona, Dartmouth College
Ecosystems and Climate Change
• Models: • Ecosystem Demography (ED) • SIPNET • DALEC • …
• Data: • Biofuel Ecophysiological Trait and Yield Database (BETY) • Forest Inventory and Analysis (FIA) • North American Regional Reanalysis (NARR) • North American Carbon Program (NACP) • Food and Agriculture Organization (FAO) • …
Ecosystems and Climate Change
• Data with Unstructured Aspects: • MODIS (Multi-spectral) • Lidar • PALSAR (Radar) • AVIRIS (Airborne Infrared Spectrometer) • Landsat (Images)
• Published results (e.g. tables, figures, plots) • Ingested into BETY manually
• Settlement Vegetation data • Born Physical
• Paper, Microfiche, Alphanumeric/Color coded on vellum sheets
• Born Digital • PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS
• Ad hoc formats: • Spreadsheets • Databases • Services • R Data • Matlab Data
Ecosystems and Climate Change
• Document • Image • Spatial • Tabular • Weather • 3D • Archive, Database, Filesystem, …
“Big Data” • Large quantities of data • Large varieties of data
• “Long-Tail”
[Figure: long-tail distribution; axes: number of grants, dollars. Source: http://www.slideshare.net/rheimann04/big-social-data-the-social-turn-in-big-data]
The “Long-Tail” of “Big Data”
…
The Data
• Diversity of data types • Diversity of file formats
• Ad hoc formats • Obsolete formats • Proprietary formats
• Un-curated data • No metadata • No consistent/useful naming of files/directories
• Unstructured data • Non-text contents
• Potentially large and/or made up of many small files
Processes Over the Data
• Diversity of analyses • Many forms (e.g. scripts, libraries, whole suites, services) • Many languages • Many dependencies
• Leverage towards dealing with unstructured/un-curated data • Analyses churn through data and generate new, often higher level, data • Metadata, data about data
The Problem
• A huge diversity in the data • Types • Formats • Analyses
• A huge diversity of software involved • Scripts • Applications • Libraries • Services
• Dealing with these issues has become part of the scientific workflow: it is time-consuming and redundant, it is difficult, it varies across labs/fields, and it makes reproducibility/reusability difficult!
A Science Driven Data Transformation Service
• Supporting Data Manipulation as a Service • File format conversions • Data set conversions • Database ingestion/dumping • Website scraping
• Supporting Data Analysis as a Service • Low level analyses • Tags and Metadata • Previews • Other derived products
• Relieve the scientific community from having to address this as a first step of their workflows.
Brown Dog
• Data transformations • Conversions and Extractions
• Extensibility • Easy to add new converters/extractors • Encapsulated software & dependencies
https://en.wikipedia.org/wiki/Mongrel
• API • Clients, Scalability, Provenance, Information Loss, Data Movement
• Data Access Proxy (DAP) • An extensible and distributed service for carrying out file format conversions • Move towards an internet/world that is agnostic to file formats • Aid in accessing a file’s contents independent of how it is represented on disk
• Data Tilling Service (DTS)
• An extensible and distributed service for the extraction of new data or metadata from a file’s contents
• Provide means to query and/or relate collections of data without metadata
• Data Conversion: A transformation on digital data that largely preserves the entirety of the data. Largely reversible.
• Data Extraction: A transformation on digital data which creates new, often higher level, data from the contents of the given data (e.g. tags, signatures). Not reversible.
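The difference between the two kinds of transformation can be illustrated with a small Python sketch. This is not Brown Dog code: the function names and the sample CSV rows are assumptions made up for illustration. The conversion round-trips back to the original text, while the extraction yields a summary from which the file cannot be rebuilt.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Conversion: re-encode the same rows as JSON (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)


def json_to_csv(json_text):
    """Inverse conversion: rebuild CSV text from the JSON rows."""
    rows = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def extract_summary(csv_text):
    """Extraction: derive new, higher level data (row count, column tags)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {"rows": len(rows), "tags": sorted(rows[0].keys())}


# Made-up sample data, used only to exercise the helpers.
original = "site,species\nUS-Dk3,oak\nUS-Dk3,pine\n"
round_trip = json_to_csv(csv_to_json(original))  # same text as `original`
summary = extract_summary(original)              # cannot be turned back into the file
```

The sketch assumes at least one data row; a real converter would also have to account for whatever information the target format cannot represent.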
Brown Dog
• The Data Access Proxy (DAP) • https://dap.ncsa.illinois.edu/polyglot/api/ • File in, File out
• The Data Tilling Service (DTS) • https://dts.ncsa.illinois.edu/clowder/api/ • File in, JSON out • JSON can contain metadata, tags, signatures, links to derived data products, etc…
Brown Dog
• Services!!! • Programmable interface • Client applications build on top of these services • Backed by computational and storage resources • Place to preserve/reuse software/tools
https://www.youtube.com/watch?v=MvaHQKT3BPQ
Clowder
• “Smart Drop Box” • Share, collaborate on datasets • Publishing data • Social curation • Extensible Auto-curation
Architecture
[Diagram: Clowder architecture, split into client and server. Client: Web Browser, Custom Clients, External Software. Server: Load balancer (nginx), three Webapp instances (Scala/Play), Data/Metadata (MongoDB), Event Bus (RabbitMQ), Extractor 1 (Java), Extractor 2 (Python), Text Search (Elasticsearch), Multimedia Search (Versus).]
[Diagram: extraction request flow through the Load Balancer, API Frontend, Job Queue, Extractor, and Database, with Log Analysis over a Distributed Log.]
1. File
2. Routing
3. File Stored
4. Job Submitted
5. Job Picked Up
6. Read
7. Extract
7.5 Status Updates
8. Write
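The numbered steps of the extraction flow can be sketched as a toy, in-memory model. Everything here is an assumption for illustration: the class name, the status strings, and the use of Python's `queue.Queue` and dictionaries in place of the actual job queue, file store, and database.

```python
import queue


def word_count_extractor(content):
    """Stand-in extractor: derives simple metadata from file content."""
    return {"lines": content.count("\n"), "words": len(content.split())}


class ToyExtractionService:
    """In-memory stand-in for the file store, job queue, and metadata database."""

    def __init__(self, extractor):
        self.files = {}            # stands in for file storage
        self.metadata = {}         # stands in for the metadata database
        self.status = {}           # per-file status updates (names are made up)
        self.jobs = queue.Queue()  # stands in for the job queue
        self.extractor = extractor

    def upload(self, file_id, content):
        self.files[file_id] = content          # 3. file stored
        self.status[file_id] = ["SUBMITTED"]
        self.jobs.put(file_id)                 # 4. job submitted

    def run_worker(self):
        while not self.jobs.empty():
            file_id = self.jobs.get()                   # 5. job picked up
            self.status[file_id].append("PROCESSING")   # 7.5 status update
            content = self.files[file_id]               # 6. read
            result = self.extractor(content)            # 7. extract
            self.metadata[file_id] = result             # 8. write
            self.status[file_id].append("DONE")


service = ToyExtractionService(word_count_extractor)
service.upload("f1", "brown dog\ntransforms data\n")
service.run_worker()
```

Because the upload only enqueues a job, the worker can run elsewhere and more workers can be added without changing the upload path, which is the property a queue-based design is usually chosen for.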
Extractions
wordcount.py

Connect (connecting to RabbitMQ):
    extractors.connect_message_bus(extractorName=extractorName,
                                   messageType=messageType,
                                   rabbitmqURL=rabbitmqURL,
                                   rabbitmqExchange=rabbitmqExchange,
                                   processFileFunction=process_file,
                                   checkMessageFunction=check_message)

Work on File:
    def process_file(parameters):
        global extractorName
        inputfile = parameters['inputfile']
        # call actual program
        result = subprocess.check_output(['wc', inputfile], stderr=subprocess.STDOUT)
        (lines, words, characters, filename) = result.split()

Return Metadata:
    extractors.upload_file_metadata(mdata=metadata, parameters=parameters)
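The slide does not show how the `wc` output becomes the metadata that is uploaded. A minimal sketch of that step might look like the following; the helper name `parse_wc_output` and the shape of the returned dictionary are assumptions, not part of pymedici. The key names follow the variable names on the slide.

```python
def parse_wc_output(result):
    """Turn `wc` output such as '2 4 27 notes.txt' into a metadata dict."""
    if isinstance(result, bytes):
        # subprocess.check_output returns bytes under Python 3
        result = result.decode()
    # Split into at most four fields so a filename containing spaces stays whole.
    lines, words, characters, filename = result.split(None, 3)
    return {
        "lines": int(lines),
        "words": int(words),
        "characters": int(characters),
        "filename": filename.strip(),
    }


metadata = parse_wc_output(b"  2  4 27 my notes.txt\n")
```

Limiting the split to four fields avoids the unpacking error that a plain `split()` would raise for filenames with spaces.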
face.py

    #!/usr/bin/env python
    import pika
    import sys
    import json
    import traceback
    import requests
    import tempfile
    import subprocess
    import os
    import itertools
    import numpy as np
    import cv2
    import time
    import logging
    from config import *
    import pymedici.extractors as extractors

    def main():
        global extractorName, messageType, rabbitmqExchange, rabbitmqURL

        #set logging
        logging.basicConfig(format='%(levelname)-7s : %(name)s - %(message)s', level=logging.WARN)
        …
Polyglot
• Wraps and automates I/O operations within arbitrary software
• Searches for conversion paths across software
• Estimates information loss
• Horizontally scalable
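Searching for conversion paths across software can be pictured as a graph search over formats. The sketch below uses breadth-first search, which is an assumption: the slides do not say which algorithm Polyglot uses, and information-loss estimation is not modelled. The OpenOffice entry follows the script header shown in this deck; "HypotheticalImager" is an invented converter added only to create a two-hop path.

```python
from collections import deque

# Each entry: (application, input formats, output formats).
CONVERTERS = [
    ("OpenOffice", {"doc", "odt", "rtf", "txt"},
     {"doc", "odt", "pdf", "rtf", "txt"}),
    ("HypotheticalImager", {"pdf"}, {"png"}),  # invented for illustration
]


def find_conversion_path(source, target, converters=CONVERTERS):
    """Breadth-first search for the shortest chain of (application, format) hops."""
    frontier = deque([(source, [])])
    seen = {source}
    while frontier:
        fmt, path = frontier.popleft()
        if fmt == target:
            return path
        for app, inputs, outputs in converters:
            if fmt not in inputs:
                continue
            for out in sorted(outputs):
                if out not in seen:
                    seen.add(out)
                    frontier.append((out, path + [(app, out)]))
    return None  # no chain of converters reaches the target format


path = find_conversion_path("doc", "png")
```

Here `path` lists the hops in order: OpenOffice to `pdf`, then the hypothetical imager to `png`. A search that also weighed information loss would need edge weights and a shortest-path algorithm instead of plain breadth-first search.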
Describe:
    #Application name (Version)
    #File types supported (e.g. document, depth, image, …)
    #Comma separated list of supported input formats
    #Comma separated list of supported output formats

Convert File:
    #Call external application and/or carry out conversion
    …
OpenOffice_convert.ahk

    ;OpenOffice
    ;document
    ;doc, odt, rtf, txt
    ;doc, odt, pdf, rtf, txt

    ;Run program
    Run, "C:\Program Files\OpenOffice.org 3\program\soffice.exe" -headless -norestore "-accept=socket`,host=local…"
    RunWait, "C:\Program Files\OpenOffice.org 3\program\python.exe" "C:\Converters\DocumentConverter.py" "%1%" "%2%"
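The four-line comment header that describes a conversion script can be read mechanically. The parser below is a sketch under assumptions: the function names are hypothetical, it is not how Polyglot itself reads scripts, and it expects the four-field header of a conversion script (application, file types, input formats, output formats).

```python
def split_formats(text):
    """Split a comma separated list into trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_converter_header(script_text, comment_prefix):
    """Read the first four comment lines of a wrapper script into a description."""
    fields = []
    for line in script_text.splitlines():
        line = line.strip()
        if line.startswith("#!"):
            continue  # skip a shebang line such as #!/usr/bin/Rscript
        if not line.startswith(comment_prefix):
            break     # the header ends at the first non-comment line
        fields.append(line[len(comment_prefix):].strip())
        if len(fields) == 4:
            break
    if len(fields) < 4:
        raise ValueError("expected four header comment lines")
    name, kinds, inputs, outputs = fields
    return {
        "application": name,
        "types": split_formats(kinds),
        "inputs": split_formats(inputs),
        "outputs": split_formats(outputs),
    }


openoffice_header = """;OpenOffice
;document
;doc, odt, rtf, txt
;doc, odt, pdf, rtf, txt
;Run program
"""
description = parse_converter_header(openoffice_header, ";")
```

Passing the comment prefix as a parameter lets the same function handle AutoHotkey scripts (`;`) and R or shell scripts (`#`).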
A3DReviewer_open.ahk

    ;Adobe 3D Reviewer (v9)
    ;model
    ;3ds, 3dxml, arc, asm, bdl, catdrawing, catpart, catproduct, catshape, cgr, dae, dlv, exp, hgl, hp, hpgl, hpl, iam, ifc, igs, iges, ipt, jt, kmz, mf1, model, neu, obj, _pd, par, pdf, pkg, plt, prc, prt, prw, psm, pwd, sab, sat, sda, sdac, sdp, sdpc, sds, sdsc, sdw, sdwc, ses, session, sldasm, sldlfp, sldprt, stl, step, stp, u3d, unv, wrl, vrml, x_b, x_t, xas, xpr, xmt, xmt_txt, xv0, xv3

    ;Run program if not already running
    IfWinNotExist, Adobe 3D Reviewer
    {
      Run, C:\Program Files\Adobe\Acrobat 9.0\Acrobat\plug_ins3d\prc\A3DReviewer.exe
      WinWait, Adobe 3D Reviewer
    }

    ;Activate the window
    WinActivate, Adobe 3D Reviewer
    WinWaitActive, Adobe 3D Reviewer

    ;Parse filename root
    arg1 = %1%
    …
PEcAn#ED_convert.R

    #!/usr/bin/Rscript
    #PEcAn
    #data
    #pecan.zip
    #ed.zip
    .libPaths("/home/polyglot/R/library")
    sink(stdout(),type="message")

    # global variables
    overwrite <- TRUE
    verbose <- TRUE

    # get command line arguments
    args <- commandArgs(trailingOnly = TRUE)

    usage <- function(msg) {
      print(msg)
      print(paste0("Usage: ", args[0], " cf-nc_Input_File edOutputDir "))
      print(paste0("Example1: ", args[0], " US-Dk3.pecan.nc US-Dk3.ed.zip [/tmp/watever] "))
      …
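API Gateway
[Diagram: a client sends a Request to the API GATEWAY and receives a Response. The gateway works with REDIS and CROWD and passes an augmented request (Request+) on to the backend services, DTS / CLOWDER, DAP / POLYGLOT, VERSUS, and DATAWOLF, each of which returns a Response.]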
API Gateway
[Diagram: token-based access through the gateway (FENCE).]
• Obtaining a token • GET /keys/8d4/token, Headers: Crowd credentials using Basic Auth • CROWD: check user credentials • REDIS: add token with TTL
• Conversions • GET /dap/outputs, Headers: access token • Passed to POLYGLOT (DAP) as GET /outputs, Headers: Polyglot credentials
• Extractions • GET /dts/api/extractions/extractors_names, Headers: access token • Passed to CLOWDER (DTS) as GET /api/extractions/extractors_names, Headers: Clowder credentials
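The token flow in the gateway diagram (check credentials, store a token with a time-to-live, then accept the token on later requests) can be sketched in memory. The class and method names are hypothetical, a dictionary stands in for REDIS and another for the CROWD user directory, and passwords are compared in plain text only to keep the example short.

```python
import secrets
import time


class TokenGateway:
    """Toy gateway: checks credentials, issues expiring tokens, validates them."""

    def __init__(self, users, ttl_seconds, clock=time.monotonic):
        self.users = users    # stands in for the CROWD user directory
        self.ttl = ttl_seconds
        self.clock = clock    # injectable so expiry can be demonstrated
        self.tokens = {}      # stands in for REDIS: token -> expiry time

    def issue_token(self, username, password):
        if username not in self.users or self.users[username] != password:
            raise PermissionError("invalid credentials")
        token = secrets.token_hex(16)
        self.tokens[token] = self.clock() + self.ttl  # add token with TTL
        return token

    def is_valid(self, token):
        expiry = self.tokens.get(token)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.tokens[token]  # drop expired entries, as a TTL would
            return False
        return True


# A fake clock makes the expiry visible without waiting.
now = [0.0]
gateway = TokenGateway({"alice": "s3cret"}, ttl_seconds=60, clock=lambda: now[0])
token = gateway.issue_token("alice", "s3cret")
valid_before = gateway.is_valid(token)  # within the TTL
now[0] = 61.0
valid_after = gateway.is_valid(token)   # past the TTL
```

With this arrangement the backend services never see the user's own credentials; the gateway swaps a valid access token for the service-level credentials listed in the diagram.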
Support within Data Management Plans
The data analysis/manipulation software developed here will be pushed into the NSF DIBBs: Brown Dog (ACI-1261582) project as data extractors/converters within the DTS and DAP, services providing automatic data annotations/analysis and format conversions as broadly usable internet resources. Brown Dog aims to both provide services and tools to aid in the curation, accessing, and indexing of data as well as to preserve scientific software that might be leveraged for that purpose. As Brown Dog extractors/converters, the capabilities of these tools will be preserved, will take part in an ecosystem of other extraction/conversion tools, and will be leverageable by others within the scientific community, perhaps in very different fields, as well as by the general public.
Milestones
• XSEDE Tutorial • July 18th, Miami • Walk through adding and deploying new tools (i.e. converters, extractors) • Walk through the API and creating a toy client application
• Beta Release • End of this year
Polyglot
Versus Daffodil
http://browndog.ncsa.illinois.edu