Presentation, afternoon session: leverage actively developed open source software and libraries...
TRANSCRIPT
1/25/2018
1
Lucy Rose
Department of Forest Resources
Opportunities and Challenges for High Temporal Resolution Hydrologic Monitoring in
Northern Minnesota
Lucy Rose and Diana Karwan
Department of Forest Resources, University of Minnesota
Legislative‐Citizen Commission on Minnesota Resources
Study Site : West Swan River
Watersheds topographically defined, don’t always line up in this area
Total Watershed Area: 85 mi²
Contributing Area between “Above” and “Below”: ~2 mi²
Marcell
West Swan
Hibbing, MN
West Swan River
Primary interests:
What are the changes in magnitude and timing of:
• Stream discharge
• Dissolved organic carbon (DOC)
• Total suspended sediment (TSS)
• Turbidity
Upstream site
Downstream site
Monitoring Equipment ‐ Opportunities
Campbell OBS‐3 Turbidity Probe
Decagon CTD‐10 Conductivity, Temperature, Depth Probe
Stroud Water Research Center, envirodiy.org
Stroud Water Research Center, envirodiy.org
Monitoring Equipment ‐ Opportunities
Programmable ISCO Automated Water Sampler
Sontek IQ‐Plus High‐Frequency Discharge Profile Sensor
Monitoring Equipment ‐ Challenges…
Hydrologic variability during the spring snowmelt in West Swan River
March 15 – April 6, 2017
Downstream study site
Opportunities and Challenges
• Expanding capabilities for high temporal resolution measurement of many water quality characteristics
• Supportive community of open source, DIY sensor and datalogger enthusiasts, willing to share knowledge (envirodiy.org)
• Still working out the “bugs” in many of these DIY systems
• Even commercial sensors and samplers can require regular attention, depending on the monitoring site
Thank you!
Amit Pradhananga
Center for Changing Landscapes
Comprehensive social science data collection and analysis
Amit K Pradhananga
Mae A Davenport
01/19/2017
WHAT drives conservation behavior?
Data collection
Sociodemographic and property characteristics
• Age, gender, education, income
• Farming experience
• Property size
• Tenure
Conservation behavior
• Practice adoption
• Civic engagement
• Support for conservation initiatives
Perceptions
• Awareness
• Attitude
• Values
• Norms
• Efficacy
Completed projects Ongoing projects
Wild Rice Watershed District
Capitol Region Watershed District
Mississippi Watershed Management Organization
Vermillion River Watershed
Ramsey‐Washington Metro Watershed District
Sand Creek Watershed
Cannon River Watershed
Middle Minnesota Watershed
Mississippi River ‐ La Crescent Watershed
Mississippi River ‐ Reno Watershed
Lower Minnesota Watershed
Middle Snake Tamarac Rivers Watershed District
Develop comprehensive, social science-based framework to track drivers of conservation behavior
Social data mapping
http://gis.joewheaton.org/topics/data
Opportunities/challenges
Expertise in geospatial analysis and social data mapping
Partnerships and collaborations
Leif Olmanson
Department of Forest Resources
Remote Sensing of Lake Water Quality
Opportunities and Challenges
UNIVERSITY OF MINNESOTA
Leif Olmanson
HIGHLIGHTS
7 statewide water clarity assessments since 1975 of >10,000 Minnesota lakes
Analysis of spatial and temporal trends and causative factors
Lake Browser: An on‐line resource for ~9,000 unique monthly visitors
Currently being updated to include 2010 and 2015 to maintain 5 year interval
Prior Accomplishments:
Remote sensing of lake water clarity in Minnesota
To improve water quality and fisheries management we need more comprehensive water quality data
water.rs.umn.edu
New satellite technology enables measurements of the three factors controlling water clarity – algae, suspended solids, and dissolved organic color – allowing us to assess their individual effects on water quality
More often
Better sensors
Finer resolution
2016 CDOM Map
Water Clarity Model Applied Globally
Cloud Based Image Processing
A planetary‐scale platform for Earth science data & analysis
Using to explore Image processing methods validated with in situ data
Garbage in, Garbage Out.
Your analysis is only as good as your data!
Path 29, 9/17/14
Path 28, 9/12/15 & 8/22/13
Challenge
Atmospheric correction for consistent results using automated methods
Path 28, 8/30/16
Path 26, 10/3/16
Secchi disk transparency data Within 1 day
Landsat OLI test image field data
Landsat OLI Remote Sensing Reflectance (Rrs) Lake Spectra
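A minimal sketch of the "within 1 day" pairing of field Secchi data with a satellite image, assuming each field visit carries a sampling date. The lake names and visit dates are hypothetical; only the August 30, 2016 image date comes from the slides.

```python
from datetime import date

def within_one_day(field_date, image_date):
    """True when the field visit is no more than one day from the image date."""
    return abs((field_date - image_date).days) <= 1

# Image date taken from the slides; field visits below are hypothetical.
image_date = date(2016, 8, 30)
field_visits = {
    "Lake A": date(2016, 8, 29),
    "Lake B": date(2016, 8, 30),
    "Lake C": date(2016, 9, 2),
}

# Keep only the lakes whose Secchi reading can be paired with the image.
matched = [lake for lake, visit in field_visits.items()
           if within_one_day(visit, image_date)]
print(matched)
```

Visits outside the window are simply left out of the calibration set rather than adjusted.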
August 30, 2016
Landsat 8 OLI image RGB (Blue, Thermal, Thermal) masked using EROS CFMask
Cloud, Shadow and Haze Masking
Works well for clouds and most shadows but not for haze
Challenge
Lakes in areas with haze will be mischaracterized
Opportunity: Near Real‐Time Water Quality Monitoring
Improved data
More often (~weekly)
Free
EROS Data Center
Normalize images
Remove land, clouds, shadows, haze…
Water clarity
CDOM
Suspended Solids
Chlorophyll
Automated satellite imagery pipeline
Recently launched satellites: Landsat 8, Sentinel‐2, Sentinel‐3
Maps, data, statistical summaries, time‐trend plots and animations
Minnesota Supercomputing Institute (MSI)
UMN high performance computing systems
Prepare images using new automated methods.
Apply water quality models
Provide customized information to agencies, researchers, and citizens.
Enhanced Lake Browser
New Lake and Fisheries management and Research Opportunities
Xun Tang
Spatial Computing Research Group
www.spatial.cs.umn.edu
Courses
• CSCI 5715 Spatial Computing (Fall 17)
• CSCI 8715 Spatial Data Science Research (Spring 18)
Related Grants
• Active
o USDA: Increasing Low‐Input Turfgrass Adoption through Breeding
o NSF: Collaborative Research: Mining Climate and Ecosystem Data Driven Approach
• Finished
o NSF: IGERT: Non‐equilibrium Dynamics Across Space and Time: A Common Approach for Engineers, Earth Scientists, and Ecologists
o NSF: CRI:IAD Infrastructure for Research in Spatio‐Temporal and Context‐Aware Systems and Applications
o USDOD: Modeling and Mining Spatio‐Temporal Co‐occurrence Patterns
o USDOD: Cascade Models for Multi‐Scale Spatio‐Temporal Pattern Discovery
Research: Spatial data mining
• The process of discovering interesting, useful, non-trivial patterns from large spatial datasets
• Example patterns
o Hotspots, Spatial clusters
o Spatial outlier, discontinuities
o Co-locations, co-occurrences
o Location prediction models
• Highly Inter-disciplinary
o Students always collaborate with scientists from Environmental Science, Public Health, Transportation.
GEO: Forensics: When and where do contaminants enter Shingle Creek?
CISE/IIS: Scalable detection of spatio‐temporal hot‐spots & co‐occurrences
Ex. Oil Spill
Flow anomaly
After consecutive heavy rain events
(HydroLab sensor)
Details: J. M. Kang, S. Shekhar, C. Wennen, and P. Novak, Discovering Flow Anomalies: A SWEET Approach, IEEE Intl. Conf. on Data Mining, 2008.
Ack: NSF IGERT, CISE/IIS/III, USDOD.
Dissolved Oxygen
Rainfall
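As a rough illustration of flagging a sudden change in a sensor series such as dissolved oxygen, the sketch below compares each reading with the median of the readings just before it. This is a simple trailing-window check, not the SWEET approach cited above; the readings, window length, and threshold are all hypothetical.

```python
from statistics import median

def flag_anomalies(values, window=4, threshold=2.0):
    """Return indices whose value differs from the median of the
    preceding `window` readings by more than `threshold`."""
    flagged = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if abs(values[i] - baseline) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical dissolved oxygen readings (mg/L) with a short dip.
dissolved_oxygen = [8.1, 8.0, 8.2, 8.1, 8.0, 4.9, 5.1, 8.0, 8.1]
print(flag_anomalies(dissolved_oxygen))
```

A median baseline is used so that one or two low readings do not immediately drag the reference level down with them.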
Goals:
• Design compelling visions
• Identify gaps
• Develop a research agenda
55 Participants (Data-driven FEW & Data Sciences)
Global Temperature
Global Population
StateNexus Dashboard
Locations
Potentially Transformative Research Agenda:
• National FEW Nexus Observatory & Dashboard for chokepoint monitoring, alerts, warnings (See Figure above)
• Novel Physics-aware Data Science for mining nexus patterns in multi-scale spatio-temporal-network data despite non-stationarity, auto-correlation, uncertainty, etc.
• Scalable tools for consensus Geo-design via participative planning with nexus observations and policy projections
• An INFEWS data science community to address crucial gaps, and shape next-generation Data Science
Next: (a) Workshop report in Jan. 2016. (b) Symposium at NCSE National Conf. on Science, Policy & Env. (2pm-330pm, Th. 1/21/16, Crystal City, Washington D.C.)
NSF INFEWS Data Science (DS) Workshop (@ USDA NIFA, Oct. 5th‐6th, 2015; Shekhar, Mulla, & Schmoldt; www.spatial.cs.umn.edu/few)
Finding 1: Data & Data Science are crucial!
• Understand problems, connections, impacts
• Monitor FEW resources, and trends to detect risks
• Support decision and policy making
• Communicate with public and stakeholders
Finding 2: However, there are show-stopper gaps.
1. Data Gaps: No global water & energy census, Heterogeneous data formats & collection protocols
2. Data Science (DS) Gaps: Current DS methods are inadequate for spatio-temporal-network FEW data. Strong assumptions in DS need examination for better coupling with mechanistic models (e.g., Physics)
Aral Sea Shrinkage (1978-2014) Due to Cotton Farms
Alerts
Global Population
Food   Energy   Water   Data Sc.
14     10       11      20
Gov.   Aca.   Industry
26     24     5
Sea-Surface Temperature Anomaly
Trends
Small Group Discussion
Regarding the interface of water and data at the University of Minnesota, identify our institutional strengths, opportunities, weaknesses, and threats.
DIGITAL WATER SURVEY SUMMARY
Jeffrey M. Peterson
January 19, 2018
THE SURVEY, BY THE NUMBERS
• Online survey with 13 questions
• Sent to 61 selected faculty and staff
• Complete data from N = 42 respondents
• Response rate = 69%
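The response rate follows directly from the two counts above. A quick check, with variable names chosen here for illustration:

```python
# Counts taken from the survey slide.
surveys_sent = 61
complete_responses = 42

# Response rate as a percentage of the surveys sent.
response_rate = 100 * complete_responses / surveys_sent
print(f"Response rate = {response_rate:.0f}%")
```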
Disciplines of respondents
Biological science
Aquatic science
Agricultural science
Engineering
Earth science
Computational or data science
Economics
Social science
Other
Number of respondents
Positions of respondents
Tenured or tenure‐track faculty
Non‐tenure track faculty
Research associate or postdoctoral researcher
Extension Educator
Other
Number of respondents
Data types used
Geospatial data
Time series data
Cross‐sectional data, not georeferenced
Field measurements
Laboratory measurements
Socioeconomic survey data
Qualitative data
Number of respondents
How much of your research depends on multidisciplinary collaborations?
None
None but would consider
None but am planning
All or most
Number of respondents
Importance of constraints in multidisciplinary research
Lack of common vocabulary to communicate ideas
Lack of a common framework to combine different types of data
Lack of tools/software to combine and analyze multidisciplinary data
Lack of ways to easily share data and analysis tools
Importance of constraints in data handling: Hardware and software
Availability of high performance data storage
Availability of high capacity data storage
Availability of many processors for computationally intensive work
Availability of tools/software for visualization
Availability of tools/software for data management and interoperability
Importance of constraints in data handling: People
Access to people capable of implementing algorithms and data analysis workflows
Access to people who are skilled at gathering, organizing, and curating data
Access to people who are skilled at analyzing and visualizing data
Removing constraints would….
Improve my ability to compete for extramural funding
Increase my efficiency in completing research projects
Improve my ability to recruit students
Increase the likelihood that I will stay at the University of Minnesota
I or my group members would benefit from training on…
Machine learning
Database management
Data visualization
Trend analysis
Anomaly detection
Pattern recognition
Geospatial analysis
Importance to future research: Hardware and software
High performance computing resources
High capacity data storage
Mapping software or tools to analyze geospatial data
Custom software
Importance to future research: Capabilities
A means to combine diverse datasets
A means to protect the privacy of some or all data
A way to share data with collaborators outside of UMN
Data use agreements to protect Intellectual Property (IP) data
A means to store, organize and access data
SOME KEY RESULTS
• Data interoperability is a pathway to multidisciplinary research
• Hardware is important but not currently a constraint
• Current constraints revolve around human resources, software/tools, and training
• Removing constraints is expected to have large benefits
Jim Wilgenbusch
(Phil Pardey & Kevin Silverstein)
University of Minnesota, Minnesota Supercomputing Institute
College of Food, Agricultural and Natural Resource Sciences
January 19, 2018
Water Resources Assembly and Research Symposium
University of Minnesota
G.E.M.S : Enabling Agricultural Innovation TM
Credit: Marcel Ritter, Jian Tao, Haihong Zhao, Louisiana State University Center for Computation and Technology
Visualizing Big Data: oil flow through water
Data Interoperability (and Scalability)
Genomics, Environment, Management, Socio‐Economics
Time, Space
Siloed Data
Institutions, individuals, and, most importantly, by subject‐matter disciplines
“Broken” Data
A lot of “data” is incomplete, some is messy and even incoherent
Due Diligence
G.E.M.S TM
GEMShareTM
Data sharing, Metadata management
GEMSToolsTM
Data cleaning, Data analytics
Governance
Partners
HPC Technology
Human Capital
IAA
Membership, Governance, Data use agreements/Data privacy, Federated resources
Membership, Governance, Data use agreements/Data privacy, Federated resources
Other Groups
Membership, Governance, Data use agreements/Data privacy, Federated resources
Relationship to Partnerships
Other Groups (Digital Water Initiative)
G.E.M.S TM
Genomes To Fields (G2F)
IAA
• Leverage actively developed open source software and libraries
• Contribute back to open source development
• Build new communities of developers and users when none exist
• Prepare to throw stuff away
Basic Development Principles
Postgres ‐ PostgreSQL License (MIT‐like)
Jupyter ‐ BSD 3‐Clause
Django ‐ BSD 3‐Clause
pyCSW ‐ MIT
Globus ‐ Apache 2.0
Apache Spark ‐ Apache 2.0
Geotrellis ‐ Apache 2.0
Docker ‐ Apache 2.0
PostGIS extensions ‐ GPL 2.0
Puppet ‐ Apache 2.0
Conda ‐ BSD
R ‐ GPL
Scala ‐ BSD
CentOS
Open Source Tools Supporting G.E.M.S
G.E.M.S TM – Core Features
GEMShareTM
A research‐enabling, federated data storage and sharing platform
• Security: Appropriate levels of security (data encryption at rest; authentication with home institution’s credentials; and secure infrastructure)
• Access Control: Data owners control access to their data, in recognition of the proprietary nature of much of the data
• Access Levels: Different levels of access [single organization; set of organizations; and publicly available (open) data]
• Discovery: Discoverability of data through metadata alone
• Transfer: Secure data transfer over both high speed networks between reliable endpoints and high latency networks to less reliable endpoints
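One way to picture the three access levels listed above is sketched below. The class and function names are illustrative assumptions, not the GEMShare API, and the rule that the owning organization can always read its own data is likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum

class AccessLevel(Enum):
    SINGLE_ORGANIZATION = "single organization"
    SET_OF_ORGANIZATIONS = "set of organizations"
    PUBLIC = "publicly available (open) data"

@dataclass
class Dataset:
    name: str
    owner_org: str
    level: AccessLevel
    shared_with: frozenset = frozenset()  # only used for SET_OF_ORGANIZATIONS

def can_read(dataset, user_org):
    """Decide read access from the dataset's level and the user's organization."""
    if dataset.level is AccessLevel.PUBLIC:
        return True
    if user_org == dataset.owner_org:  # assumed: owners always keep access
        return True
    if dataset.level is AccessLevel.SET_OF_ORGANIZATIONS:
        return user_org in dataset.shared_with
    return False

# Hypothetical dataset shared by its owner with one partner organization.
trial_data = Dataset("field trials", "org_a",
                     AccessLevel.SET_OF_ORGANIZATIONS, frozenset({"org_b"}))
print(can_read(trial_data, "org_b"), can_read(trial_data, "org_c"))
```

Keeping the sharing list on the dataset itself mirrors the stated principle that data owners control access to their data.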
G.E.M.S TM – Specialized Features
GEMSToolsTM
An ever‐expanding data documentation, cleaning, harmonizing and analysis toolkit
• provide access to best in class hardware and software libraries
• accommodate different programming languages
• offer a range of analysis styles (novice to sophisticated)
GEMSTools – Analysis Interface (Expert)
Mousing over a location displays selected aggregate stats for that location.
Filters:
Output:
Data set: CIMMYT maize, CIMMYT wheat, G2F maize
Location, Series, Trial, Mgmt Condition, Phenotype, Socioeconomic
Aggregate stats: Global, By Country, By Series, By Trial, By Investigator, By Seed Variety, By Seed Source, By Location ID
Germplasm, Genotype Matrix, Phenotype, Socioeconomic
GEMSTools – Analysis Interface (Point & Click)
Data interoperability issues and GEMSToolsTM solutions
• Nomenclature inconsistencies
• Measurement unit differences
• Erroneous and missing entries
• Outlier / physically impossible data values
• Domain‐specific problems
• Pedigree syntax
• Genotype / Pedigree inconsistencies
• Spatial concordance of census and mapped data
• Spatio‐temporal boundary standardization
Typical Data Impurities
Nomenclature inconsistencies
Total Phosphorus: 207 lb/A | 46lb per acre | 46 pound/A | 22 lbs | 68 kg/ha | 54lbs/acre | 55.5 lbs P per Acre | 80lb/acre | None | 40 pounds | none applied | 192 lbs;17‐Apr‐14 | ...
GEMS Tools—DataCleaner
Planter Type: Air planter | Fluted cone | Almaco TP2 | air | air planter | Fluted Cone | Fluted cone | Almaco fluted cone planter | jab planter | jab | jab (hand) planter | ...
Previous crop: Soybean | soybeans | soybean | corn | soy beans | Corn | ...
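A minimal sketch of how rule‐based cleaning could harmonize entries like those above. It is not the DataCleaner code: the synonym table, the function names, and the choice to read "none" as a zero rate are assumptions, and values with no area unit are left unresolved rather than guessed.

```python
import re

# Hypothetical synonym table mapping raw labels to one canonical crop name.
CROP_SYNONYMS = {"soybean": "soybean", "soybeans": "soybean",
                 "soy beans": "soybean", "corn": "corn"}

KG_PER_LB = 0.45359237
HA_PER_ACRE = 0.40468564224

# value, mass unit, optional "P", "/" or "per", area unit
RATE = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<mass>lbs?|pounds?|kg)"
    r"(?:\s+P)?\s*"
    r"(?:/|per)\s*"
    r"(?P<area>a|acre|ha)\s*$",
    re.IGNORECASE,
)

def clean_crop(raw):
    """Normalize case and spacing, then look up the canonical crop name."""
    key = " ".join(raw.lower().split())
    return CROP_SYNONYMS.get(key)  # None means "needs review"

def rate_to_kg_per_ha(raw):
    """Convert an application rate to kg/ha, or None if it cannot be parsed."""
    text = raw.strip()
    if text.lower().startswith("none"):
        return 0.0  # assumption: "None" / "none applied" means nothing applied
    match = RATE.match(text)
    if match is None:
        return None  # e.g. "22 lbs" has no area unit, so it is not guessed
    mass_kg = float(match["value"]) * (
        1.0 if match["mass"].lower() == "kg" else KG_PER_LB)
    area_ha = 1.0 if match["area"].lower() == "ha" else HA_PER_ACRE
    return mass_kg / area_ha

for entry in ["207 lb/A", "46lb per acre", "68 kg/ha", "22 lbs", "none applied"]:
    print(entry, "->", rate_to_kg_per_ha(entry))
```

Returning `None` for unparseable entries keeps ambiguous records visible for manual review instead of silently converting them.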
Dynamic Metadata Mashup Model—DM3
EML (Ecological Metadata Language): experiment, investigator, institution, organism specimens and taxonomy.
OBI: sequencing, library preparation, and sequence processing
ENVO & XEO/XEML: environmental features and habitats
Planteome.org (TO & CO): plant phenotypic traits (TO) across many individual crop ontologies (CO)
PATO: plant phenotypic qualities
AGRO: agronomic practices and techniques
OGC standard ISO19115‐2, FGDC and Dublin Core: geospatial
Broad Vocabularies
AGROVOC (FAO): including food, nutrition, agriculture, fisheries, forestry, environment etc. Translated in 27 languages.
ICASA (AgMIP)
GEMSTools — DataCleaner
Before After
Correcting errors
Imputing missing lat/long values
Geocoding Inference Engine
GEMSToolsTM—Machine Aided Data Cleaning
Modular code to address each cleaning issue
• Work on specific problem (e.g., maize field trial data)
• Write code to automate much of the cleaning
• Apply to new crops or new datasets
o G2F vs CIMMYT vs PepsiCo (nomenclature cleaning)
o Maize, wheat, soybean, apples (pedigree cleaning)
o Surface water measurements
Rule‐based techniques, Natural Language Processing, and some Deep Learning methods
Converge toward real‐time feedback on cleaning
Thanks
G.E.M.S URL – Under Construction!
Office of the Vice President for Research
Advanced Systems
Operations
- Common Services
- HPC Systems
- Storage Systems
- Hosted Services
Scientific Computing Solutions
-- Optimization
-- Benchmarking
-- HPC Research
Workflow & pipeline Development
Application Development Solutions
- Custom App Dev
- System Programming
Research Informatics Solutions
- Informatics education
- Informatics research
- Informatics services
- Life Science Computing
User Gateway Group
-- User Support Lead
-- User Training
-- On Boarding
- Communications
-- Outreach
Minnesota Supercomputing Institute
UM Informatics Institute
U-Spatial
Research Computing
MSI Computing and Data Storage Assets
Batch High Performance Computing
• Two Supercomputers
• 25,000 CPU Cores
• 230,400 GPU CUDA Cores
• 100 TB Memory
• Infiniband Network
Big Data Storage & Analysis
• 6 PB Primary High Performance
• 3 PB Second Tier
• 30 PB Archive Tape Library
• 1.2 PB Hadoop/Spark Cluster
Interactive & Cloud Computing
• Citrix VDI for Windows
• DCS Nice for Linux Desktops
• OpenStack for Secure Cloud
• 100 Gbps Campus Research Network
• Regional & National Optical Networks
Web Portals & Databases
• Galaxy for Multi‐omics
• Jupyter Hub
• Custom Interfaces & Applications
Data Storage
Core Services Node Analysis Node
Database
REST API
Container Server
Web Apps
Globus Auth & Transfer
Globus Client
Jupyter/Spark (User1)
Jupyter/Spark (UserX)
Apache
Container Server
G.E.M.S – Platform Architecture & Components
JupyterHub
Jupyter/Spark (User2)
Admin Workstations
Users
Bastion Host
Globus Auth & Transfer
G.E.M.S – Security Infrastructure
SSL Encrypted
Secure Web Browser Path (notebook)
Two factor authentication for admin access
SSH Encrypted
Logging
Monitoring
Encrypted Data Storage
User Isolated container Environments
Admin Workstations
Users
Bastion Host
Globus Auth & Transfer
G.E.M.S – Scale out of Compute and Data
Logging
Monitoring
G.E.M.S – Use Cases and Business Models
Presentation
Application
Data
Middleware
APIs
Integration
Hardware
Facilities
Connectivity
Case 1
Use everything managed by MSI or by Federated Partner
Business Model: SaaS
Infrastructure
Platform
Apps
Case 2
Use G.E.M.S platform, apps, & data, and another infrastructure
Business Model: Open Core
ACME Corp IT, AWS, etc.
Platform
Apps
Data
Case 3
Use only G.E.M.S data and other apps, platform, and infrastructure
Business Model: DaaS
Application
Presentation
Middleware
APIs
Integration
Hardware
Facilities
Connectivity
My Apps, My Laptop, ACME Corp IT Server, AWS, Supercomputing Institute at Univ., etc.
Data
Small Group Discussion
Identify resources and investments needed for a Digital Water Initiative at the University of Minnesota to best support collaboration.