TRANSCRIPT
Joshua D. Woodard
Assistant Professor and Zaitz Faculty Fellow in Agribusiness and Finance
Dyson School of Applied Economics and Management
Cornell University
NY State Precision Ag Workshop
Ag-Analytics Data Platform
The Data Integration Problem
Analysts typically source data from many different government and non-government sources, at different temporal and spatial resolutions
Relevant data are spread over a wide variety of operational/transaction-based databases, data marts, unstructured text files, etc.
Sources all have different data storage and formatting protocols, APIs, different levels of temporal and spatial aggregation, etc.
Existing infrastructures cannot be queried jointly, or in some cases at all
Not processed to scales appropriate for most uses
Typical approach is a “one-off” for every study, to do the following:
  At a point in time, download “slices” of data from several different sources
  Then format (often by hand or copy/paste) individual data sets and mash together (may take days or weeks; not automated/replicable/documented)
  Perform one-off analysis
  To expand or update the analysis, the entire process must be recreated by a human
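As an illustration of what an automated, replicable alternative to the hand-built "one-off" merge might look like, the sketch below aggregates monthly records to annual totals and joins them to annual records on a shared county/year key. All field names and values here are made up for the example; they are assumptions, not data or code from the platform.

```python
from collections import defaultdict

# Hypothetical monthly weather records keyed by county FIPS code (made-up values).
monthly_precip = [
    {"fips": "36109", "year": 2013, "month": 6, "precip_mm": 110.0},
    {"fips": "36109", "year": 2013, "month": 7, "precip_mm": 90.0},
    {"fips": "36011", "year": 2013, "month": 6, "precip_mm": 100.0},
    {"fips": "36011", "year": 2013, "month": 7, "precip_mm": 120.0},
]

# Hypothetical annual yield records from a different source (made-up values).
annual_yield = [
    {"fips": "36109", "year": 2013, "yield_bu_ac": 150.0},
    {"fips": "36011", "year": 2013, "yield_bu_ac": 140.0},
]


def aggregate_to_annual(records):
    """Sum monthly precipitation to one total per (fips, year)."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["fips"], r["year"])] += r["precip_mm"]
    return dict(totals)


def join_on_county_year(yields, precip_totals):
    """Attach the annual precipitation total to each yield record with a matching key."""
    rows = []
    for y in yields:
        key = (y["fips"], y["year"])
        if key in precip_totals:
            rows.append({**y, "precip_mm": precip_totals[key]})
    return rows


merged = join_on_county_year(annual_yield, aggregate_to_annual(monthly_precip))
for row in merged:
    print(row)
```

Because every step is a function, rerunning the script on refreshed inputs repeats the whole process without manual formatting.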
A Fairly Small Sampling…
AgDB Data Warehousing Overview
[Diagram: AgDB Data Warehouse. Labels: External Databases, Datastores, Datastreams (RMA, USGS, NRCS, AMS, ERS, PRISM, CME, NASA, NASS, FSA, FAS, etc.); Scheduled Jobs to Download and Extract from Source over Web; Data Chunks; Filter/Clean Data; Data Auditing & Validation; Preprocessing/Aggregation/Interpolation/Transformation; Load; Integration Services; OLTP; Web Data Services, OLAP, Data Marts; Web Decision Tools; External Clients]
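The stages named in the warehousing overview (extract, filter/clean, audit and validate, load) can be sketched as a small pipeline. This is a minimal illustration only: the CSV layout, the `obs` table, the validation rule, and the use of an in-memory SQLite database are all assumptions made for the example, not the platform's actual schema or tooling, and the stage order shown is one plausible reading of the diagram.

```python
import csv
import io
import sqlite3

# Stand-in for a file a scheduled job would download from an external source.
RAW_CSV = """fips,year,value
36109,2012,10.5
36109,2013,
36011,2013,12.0
bad_row,2013,7.0
36011,2013,12.0
"""


def extract(text):
    """Parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with missing values or malformed FIPS codes, and repeated (fips, year) keys."""
    seen, out = set(), []
    for r in rows:
        if not r["value"] or not (r["fips"].isdigit() and len(r["fips"]) == 5):
            continue
        key = (r["fips"], r["year"])
        if key in seen:
            continue
        seen.add(key)
        out.append({"fips": r["fips"], "year": int(r["year"]), "value": float(r["value"])})
    return out


def validate(rows):
    """Audit step: fail loudly if a year is outside a plausible range (hypothetical rule)."""
    for r in rows:
        if not 1800 <= r["year"] <= 2100:
            raise ValueError(f"implausible year in row: {r}")
    return rows


def load(rows, conn):
    """Write the validated rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS obs "
        "(fips TEXT, year INTEGER, value REAL, PRIMARY KEY (fips, year))"
    )
    conn.executemany("INSERT OR REPLACE INTO obs VALUES (:fips, :year, :value)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(validate(clean(extract(RAW_CSV))), conn)
loaded = conn.execute("SELECT fips, year, value FROM obs ORDER BY fips, year").fetchall()
print(loaded)
```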
Ag-Analytics.org
An open source, open-data portal for ag and enviro data and models
Open data: get and use data for free, direct from the platform
Open source: see exact code for how data are sourced, processed, transformed and stored; and contribute code
Abridged/Partial Summaries of Major Datasets/Sources Currently in AgDB
Data Source and Item Description
IPCC Climate Change Projections
Future temperature and precipitation projections across different emission scenarios and percentiles of the 16 General Circulation Models (GCMs).
National Climatic Data Center Drought Data
Monthly PDSI drought index data available at the climate district level aggregation. Data is available from 1895 to present, by NCDC District.
PRISM Climate Group
Monthly and daily historical temperature and precipitation data, as well as GDD/HDD processed data. Monthly data is available from 1895 to present. Daily weather data is available from 1981 to present. 800 meter resolution (raw) and processed by FIPS, Township, and in certain cases CLU (pre-2008) available.
Chicago Mercantile Exchange
Daily historical futures and options data for agricultural commodities from the Chicago Mercantile Exchange (CME), Chicago Board of Trade (CBOT), and Kansas City Board of Trade (KCBOT). Data is available from 1959 to present, updated daily.
Risk Management Agency (RMA)
Agricultural insurance price and participation data available at the county level aggregation. Data is available from 1989 to present from Summary of Business. Other data also loaded from various unstructured text files (including historical discovery prices, GRIP yields, etc.)
US Census Bureau
County-level and township-level geographical coordinates, land area size, water area size, and population data.
USDA Economic Research Service (ERS)
Annual farm structural and financial data available at state-level aggregation for the 15 Agricultural Resource Management Survey (ARMS) states. Data is available from 1996 to present. Other various datasets are also sourced from the ad hoc ERS tools and API’s.
USDA Agricultural Marketing Service (AMS)
Monthly data on the volume, pricing, and utilization of raw milk received by handlers regulated under Federal milk orders from dairy farmers. All tables in the Public MMO database.
USDA National Agricultural Statistics Service (NASS)
Census and survey data available at regional, state, and county level aggregation. The broad categories of data available are crops, animals and products, economics, demographics, and environmental. Data is available from 1926 to present. Obtained via FTP bulk download from QuickStats. CDL data processed against ready-to-map gSSURGO NRCS data by crop also available (raw and county processed).
USDA Foreign Agricultural Service
Data on production, supply and distribution of agricultural commodities for the U.S. and key producing and consuming countries.
USDA Natural Resources Conservation Service (NRCS)
Soil data for the continental US from gSSURGO, raw and processed available at various levels of aggregation.
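The PRISM entry above mentions GDD/HDD processed data. The source does not say how these are computed on the platform; the sketch below uses the common simple-average definition of growing and heating degree days. The base temperatures (10 °C and 18 °C) and the daily values are assumptions chosen for illustration.

```python
def growing_degree_days(tmin_c, tmax_c, base_c=10.0):
    """Daily GDD by simple averaging: max(0, (tmin + tmax) / 2 - base)."""
    return max(0.0, (tmin_c + tmax_c) / 2.0 - base_c)


def heating_degree_days(tmin_c, tmax_c, base_c=18.0):
    """Daily HDD by simple averaging: max(0, base - (tmin + tmax) / 2)."""
    return max(0.0, base_c - (tmin_c + tmax_c) / 2.0)


# Made-up daily (tmin, tmax) pairs in degrees Celsius.
daily = [(12.0, 24.0), (8.0, 10.0), (15.0, 29.0)]

season_gdd = sum(growing_degree_days(lo, hi) for lo, hi in daily)
season_hdd = sum(heating_degree_days(lo, hi) for lo, hi in daily)
print(season_gdd, season_hdd)
```

Other variants cap the daily maximum or floor the daily minimum before averaging; which one applies depends on the crop and the data provider.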
Applications & Accessing Data
Applications: virtually anything in the broader ag and environmental domain for policy, risk, economics and finance
  Insurance
  Conservation and Climate Change
  Policy Analysis and oversight
  Farm Bill Program Analysis
  Product Development
Tools
  Data access tools and APIs (industrial, for developers, data analysts)
    Facilitates end user tool development
    Automates processes for getting data
    Improves reliability of research
    Makes possible the previously not possible (or only possible for a few at high cost)
  End user and visualization tools
    RMA Premium calculator
    Yield and weather visualization tools
    Mapping applications
    Dairy margin protection tool
    S02 wine
CKAN Open Data Portal Software
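For readers unfamiliar with CKAN, the sketch below shows how a client might query a CKAN portal through its Action API (`package_search`) and read dataset names from the JSON reply. The host `ckan.example.org`, the search term, and the sample reply are placeholders; whether and where the Ag-Analytics portal exposes this endpoint is an assumption, not something stated in the slides.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url, query, rows=10):
    """Build a CKAN Action API URL that searches datasets by keyword."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_names(response_text):
    """Return dataset names from a package_search JSON reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN request did not succeed")
    return [pkg["name"] for pkg in payload["result"]["results"]]


# Hand-written sample reply in the shape CKAN returns (placeholder content).
SAMPLE_REPLY = json.dumps(
    {
        "success": True,
        "result": {
            "count": 1,
            "results": [{"name": "example-dataset", "title": "Example"}],
        },
    }
)

url = package_search_url("https://ckan.example.org/", "soil")
names = dataset_names(SAMPLE_REPLY)
print(url)
print(names)
```

In practice the URL would be fetched with an HTTP client and the reply body passed to `dataset_names`; the sample reply keeps the example runnable offline.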
Ongoing Efforts and Priorities
Currently building a GIS-based mapping platform for data exploration
Recently received a Microsoft Azure Research Grant for use of the Azure cloud platform, currently converting
Additional datasets, APIs, tools (ongoing)
Open Source launch
Upgrading data portal interface, more extensive metadata, flexible cataloging/access (CKAN and other)
Incorporation of NoSQL platforms
Identify various user needs, partners, and collaborators
Challenges, Policy Considerations, and Opportunities
Technical and training
  Some degree of learning curve, but frankly minimal (we teach this to undergrads)
  Technical limitations (networking, processing, etc.) are eroding quickly
Inherently a public good; without intervention it will be under-provisioned
  Marginal cost curse leads to lack of action
Coordination within the community
  How can we work together?
  Goal: open source ecosystems for data curation and systems development
  Government AND universities and others must be involved
Improving access to government data (incentives and bandwidth vs. reasonable delivery formats)
  Not only what but HOW data are made available is of utmost importance, otherwise not usable
  Not a “FOIA-able” solution, must work together!
Challenges, Policy Considerations, and Opportunities
Next Horizons: Secure Data Warehouses for Integrating Agency and other data (RMA, FSA, etc.)
  Some work simply can’t be done without linking these together
  Example: integrating soil information into insurance rates and programs, modifications to properly treat or incentivize conservation, soil health, etc.
Privacy concerns
  Many precedents, well within allowable law
The field is at an interesting vantage point compared to many others given its mix of market, business, environmental and other natural systems data = LOTS OF OPPORTUNITY!
Thank you
Questions?