Introduction to Ecoinformatics: Past, Present & Future
William Michener
LTER Network Office, University of New Mexico
January 2007
Outline
• Ecoinformatics: a definition
• A science vision
• Information challenges
• Ecoinformatics “solutions”
Ecoinformatics
A broad S&T discipline that incorporates both concepts and practical tools for the understanding, generation, processing, and propagation of ecological data, information and knowledge.
Many studies employ a restricted scale of observation, commonly 1 m².
The literature is biased toward single and small-scale results.
[Figure: axes labeled Space, Time, and Parameters]
Thinking Outside the “Box”
LTER
Biocomplexity
NEON, WATERS, OOI, …
Increase in breadth and depth of understanding...
Grand environmental challenges
More and more of the ecological questions that confront society are national, continental and global in scope
Source: CDC
Drought
Source: Drought Monitor
LTER
26 NSF LTER Sites in the U.S. and the Antarctic: > 1,600 Scientists; 6,000+ Data Sets—different themes, methods, units, structure, ….
NEON Climate Domains
[Figure: map of the 20 numbered NEON climate domains]
1. Northeast
2. Mid Atlantic
3. Southeast
4. Atlantic Neotropical
5. Great Lakes
6. Prairie Peninsula
7. Appalachians / Cumberland Plateau
8. Ozarks Complex
9. Northern Plains
10. Central Plains
11. Southern Plains
12. Northern Rockies
13. Southern Rockies / Colorado Plateau
14. Desert Southwest
15. Great Basin
16. Pacific Northwest
17. Pacific Southwest
18. Tundra
19. Taiga
20. Pacific Tropical
Aquatic Arrays
BioMesonet Tower and Sensor Arrays
Soil Sensor Arrays
Micron-scale nitrate ISE
Small-Organism Tracking: Mobile animals as bio-sentinels for environmental change, forecasting biological invasions, emerging disease spread
Characteristics of Ecological Data
[Figure: data types plotted by data volume per dataset (low to high) against complexity/metadata requirements (low to high). Data types shown: satellite images, weather stations, business data, gene sequences, GIS, primary productivity, population data, biodiversity surveys, soil cores. Two regions are labeled "Most Ecological Data" and "Most Software".]
Data Entropy
[Figure: information content plotted against time. Labels along the curve: time of publication; specific details; general details; accident; retirement or career change; death.]
(Michener et al. 1997)
Data set A
Date       Site  Species  Area  Count
10/1/1993  N654  PIRU     2     26
10/3/1994  N654  PIRU     2     29
10/1/1993  N654  BEPA     1     3

Data set B
Date       Site  picrub  betpap
31Oct1993  1     13.5    1.6
14Nov1994  1     8.4     1.8

From A and B to C:
• Schema transform
• Coding transform
• Taxon Lookup
• Semantic transform

Data set C
Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8

Imagine scaling!!
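The four transforms named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables and function names are hypothetical, density is assumed to be Count divided by Area (which reproduces the values shown for data set C), and the values in data set B are assumed to already be densities in the same units.

```python
# Hypothetical lookup tables; a real project would maintain these as metadata.
TAXON_LOOKUP = {
    "PIRU": "Picea rubens", "BEPA": "Betula papyrifera",      # codes in data set A
    "picrub": "Picea rubens", "betpap": "Betula papyrifera",  # column names in data set B
}
MONTHS = {"Oct": 10, "Nov": 11}  # only the months that occur in data set B

table_a = [  # Date, Site, Species code, Area, Count
    ("10/1/1993", "N654", "PIRU", 2, 26),
    ("10/3/1994", "N654", "PIRU", 2, 29),
    ("10/1/1993", "N654", "BEPA", 1, 3),
]
table_b = [  # one column per species
    {"Date": "31Oct1993", "Site": "1", "picrub": 13.5, "betpap": 1.6},
    {"Date": "14Nov1994", "Site": "1", "picrub": 8.4, "betpap": 1.8},
]


def date_from_b(text):
    """Coding transform: '31Oct1993' becomes '10/31/1993'."""
    day, month, year = int(text[:2]), MONTHS[text[2:5]], int(text[5:])
    return f"{month}/{day}/{year}"


def integrate(a_rows, b_rows):
    """Merge both data sets into (date, site, species, density) records."""
    merged = []
    for date, site, code, area, count in a_rows:
        # Taxon lookup plus semantic transform (assumed: density = count / area).
        merged.append((date, site, TAXON_LOOKUP[code], count / area))
    for row in b_rows:
        # Schema transform: one column per species becomes one row per species.
        for column in ("picrub", "betpap"):
            merged.append(
                (date_from_b(row["Date"]), row["Site"], TAXON_LOOKUP[column], row[column])
            )
    return merged


table_c = integrate(table_a, table_b)
for record in table_c:
    print(*record)
```

Even for two tiny data sets, every transform needs its own lookup table or rule, which is the point behind "Imagine scaling!!".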
Semantics
Semantics—Linking Taxonomic Semantics to Ecological Data
Rhynchospora plumosa s.l.
[Figure: taxon concepts for Rhynchospora plumosa s.l. according to Elliot 1816, Gray 1834, Chapman 1860, Kral 1998, and Peet 2002?. Names shown: R. plumosa; R. plumosa v. intermedia; R. plumosa v. plumosa; R. plumosa v. interrupta; R. intermedia; R. pineticola; R. plumosa v. pineticola; R. sp. 1]
• Taxon concepts change over time (and space)
• Multiple competing concepts coexist
• Names are re-used for multiple concepts
from R. Peet
Date  Species    #
1830  R.plumosa  39
1840  R.plumosa  49
1900  R.plumosa  42
1985  R.plumosa  48
1995  R.plumosa  22
2000  R.plumosa  19
[Figure: chart of series A, B, and C; value axis 0 to 60; date axis 1/1/00 to 1/6/00]
What Users Really Want…
[Figure: data-processing flow diagram. Elements: Research Program; Investigators; Studies; Experimental Design; Methods; Data Design; Data Forms; Data Entry; Field Computer Entry; Electronically Interfaced Field Equipment; Electronically Interfaced Lab Equipment; Raw Data File; Quality Assurance Checks; Data Contamination; Data verified? (yes / no); Data Validated; Archive Data File; Archival Mass Storage (Magnetic Tape / Optical Disk / Printouts); Off-site Storage; Access Interface; Metadata; Quality Control; Summary Analyses; Investigators; Secondary Users; Publication; Synthesis]
• Standard Operating Procedures
• Policies
  • Data sharing
  • Computer use
  • Archive storage
Ecoinformatics solutions
• Data design
• Data acquisition
• QA/QC
• Data documentation (metadata)
• Data archival
Data Design
Conceptualize and implement a logical structure within and among data sets that will facilitate data acquisition, entry, storage, retrieval and manipulation.
Database Types
• File-system based
• Hierarchical
• Relational
• Object-oriented
• Hybrid (e.g., combination of relational and object-oriented schema)
Porter 2000
Data Design: 7 Best Practices
1. Assign descriptive file names
2. Use consistent and stable file formats
3. Define the parameters
4. Use consistent data organization
5. Perform basic quality assurance
6. Assign descriptive data set titles
7. Provide documentation (metadata)
from Cook et al. 2000
1. Assign descriptive file names
• File names should be unique and reflect the file contents
• Bad file names
  • Mydata
  • 2001_data
• A better file name
  • Sevilleta_LTER_NM_2001_NPP.asc
    • Sevilleta_LTER is the project name
    • NM is the state abbreviation
    • 2001 is the calendar year
    • NPP represents Net Primary Productivity data
    • asc stands for the file type (ASCII)
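A naming convention like this one is easy to generate and check in code. The sketch below is one possible approach in Python; the function names are hypothetical, and it assumes the last three underscore-separated parts are always state, year, and parameter, as in the example above.

```python
def build_file_name(project, state, year, parameter, extension="asc"):
    """Assemble project_state_year_parameter.extension."""
    return f"{project}_{state}_{year}_{parameter}.{extension}"


def parse_file_name(name):
    """Split a file name back into its descriptive parts."""
    stem, extension = name.rsplit(".", 1)
    # The project name may itself contain underscores, so split from the right.
    project, state, year, parameter = stem.rsplit("_", 3)
    return {
        "project": project,
        "state": state,
        "year": int(year),
        "parameter": parameter,
        "extension": extension,
    }


name = build_file_name("Sevilleta_LTER", "NM", 2001, "NPP")
print(name)
print(parse_file_name(name))
```

A name that cannot be parsed this way (for example, Mydata) raises an error, which gives a simple automated check that files follow the convention.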
2. Use consistent and stable file formats
• Use ASCII file formats; avoid proprietary formats
• Be consistent in formatting
  • don’t change or re-arrange columns
  • include header rows (first row should contain file name, data set title, author, date, and companion file names)
  • column headings should describe content of each column, including one row for parameter names and one for parameter units
  • within the ASCII file, delimit fields using commas, pipes (|), tabs, or semicolons (in order of preference)
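The header layout described above can be produced with Python's standard csv module. This is a minimal sketch, not a prescribed implementation: the function name, file name, title, author, date, companion file name, and station values are all placeholders chosen for illustration.

```python
import csv
import io


def write_ascii_table(handle, file_info, names, units, rows):
    """Write a comma-delimited ASCII table with three header rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(file_info)  # file name, title, author, date, companion files
    writer.writerow(names)      # one row for parameter names
    writer.writerow(units)      # one row for parameter units
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for an open text file
write_ascii_table(
    buffer,
    # All five header values below are hypothetical placeholders.
    ["example_1996_weather.asc", "Example station weather 1996",
     "A. Author", "19970101", "example_1996_weather_metadata.txt"],
    ["Station", "Date", "Temp", "Precip"],
    ["", "YYYYMMDD", "C", "mm"],
    [["HOGI", "19961001", 12, 0], ["HOGI", "19961002", 14, 3]],
)
print(buffer.getvalue())
```

Commas are used because they come first in the slide's order of preference; passing delimiter="|" to csv.writer would switch to pipes.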
3. Define the parameters
• Use commonly accepted parameter names that describe the contents
  • e.g., precip for precipitation
• Use consistent capitalization
  • e.g., not temp, Temp, and TEMP in same file
• Explicitly state units of reported parameters in the data file and the metadata
  • SI units are recommended
• Choose a format for each parameter, explain the format in the metadata, and use that format throughout the file
  • e.g., use yyyymmdd; January 2, 1999 is 19990102
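As a small illustration of the date example, the yyyymmdd format maps directly onto Python's standard date formatting. The helper names here are hypothetical.

```python
from datetime import date, datetime


def to_yyyymmdd(value):
    """Format a date as an 8-digit yyyymmdd string."""
    return value.strftime("%Y%m%d")


def from_yyyymmdd(text):
    """Parse a yyyymmdd string; raises ValueError if the text is not a valid date."""
    return datetime.strptime(text, "%Y%m%d").date()


stamp = to_yyyymmdd(date(1999, 1, 2))
print(stamp)                 # January 2, 1999
print(from_yyyymmdd(stamp))  # round trip back to a date object
```

Parsing every date field with one function is also a cheap consistency check, because any value that does not follow the declared format fails loudly.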
4. Use consistent data organization (one good approach)
Station Date Temp Precip
Units YYYYMMDD C mm
HOGI 19961001 12 0
HOGI 19961002 14 3
HOGI 19961003 19 -9999
Note: -9999 is a missing value code
4. Use consistent data organization (a second good approach)
Station Date Parameter Value Unit
HOGI 19961001 Temp 12 C
HOGI 19961002 Temp 14 C
HOGI 19961001 Precip 0 mm
HOGI 19961002 Precip 3 mm
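The two layouts carry the same information, and converting between them is mechanical. The sketch below, in Python, turns the first (one column per parameter) into the second (one row per value). The function name is hypothetical, and keeping rows that hold the missing value code, rather than dropping them, is an assumption of this sketch.

```python
MISSING = -9999                       # missing value code from the first layout
UNITS = {"Temp": "C", "Precip": "mm"}

wide_rows = [
    {"Station": "HOGI", "Date": "19961001", "Temp": 12, "Precip": 0},
    {"Station": "HOGI", "Date": "19961002", "Temp": 14, "Precip": 3},
    {"Station": "HOGI", "Date": "19961003", "Temp": 19, "Precip": MISSING},
]


def wide_to_long(rows, units):
    """Return (station, date, parameter, value, unit) tuples, grouped by parameter."""
    long_rows = []
    for parameter, unit in units.items():
        for row in rows:
            # Missing values stay in the output with their code (an assumption).
            long_rows.append((row["Station"], row["Date"], parameter, row[parameter], unit))
    return long_rows


long_rows = wide_to_long(wide_rows, UNITS)
for record in long_rows:
    print(*record)
```

The long layout makes it easy to add a parameter without changing the column structure, at the cost of repeating the station and date on every row.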
5. Perform basic quality assurance
• Assure that data are delimited and line up in proper columns
• Check that there are no missing values for key parameters
• Scan for impossible and anomalous values
• Perform and review statistical summaries
• Map location data (lat/long) and assess errors
• Verify automated data transfers
  • e.g., check-sum techniques
• For manual data transfers, consider double keying data and comparing the 2 data sets
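Several of these checks are simple to automate. The following Python sketch covers four of them: missing key values, impossible values, check-sums, and double keying. The function names, the sample rows, and the temperature limits are hypothetical, and SHA-256 is just one of many possible check-sum choices.

```python
import hashlib

MISSING = -9999  # missing value code used in the example tables


def missing_key_values(rows, key_fields):
    """Return (row index, field) pairs where a key field is empty or missing."""
    return [(i, field) for i, row in enumerate(rows) for field in key_fields
            if row.get(field) in (None, "", MISSING)]


def out_of_range(rows, field, low, high):
    """Return indexes of rows whose value lies outside the plausible range."""
    return [i for i, row in enumerate(rows)
            if row[field] != MISSING and not (low <= row[field] <= high)]


def checksum(data):
    """Check-sum of a byte string; compare before and after a transfer."""
    return hashlib.sha256(data).hexdigest()


def double_key_differences(first, second):
    """Return indexes of rows that differ between two independently keyed copies."""
    if len(first) != len(second):
        raise ValueError("the two keyed data sets have different lengths")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


rows = [  # hypothetical entries; 190 is a deliberate keying error
    {"Station": "HOGI", "Date": "19961001", "Temp": 12},
    {"Station": "HOGI", "Date": "", "Temp": 14},
    {"Station": "HOGI", "Date": "19961003", "Temp": 190},
]
rekeyed = [dict(row) for row in rows]
rekeyed[2]["Temp"] = 19  # the second typist entered the value correctly

print(missing_key_values(rows, ["Station", "Date"]))
print(out_of_range(rows, "Temp", -60, 60))  # limits are illustrative only
print(double_key_differences(rows, rekeyed))
print(checksum(b"HOGI,19961001,12") == checksum(b"HOGI,19961001,12"))
```

Checks like these flag suspect records for review; deciding whether a flagged value is an error or a genuine anomaly still needs someone who knows the data.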
6. Assign descriptive data set titles
• Data set titles should ideally describe the type of data, time period, location, and instruments used (e.g., Landsat 7).
• Data set title should be similar to names of data files
  • Good: “Shrub Net Primary Productivity at the Sevilleta LTER, New Mexico, 2000-2001”
  • Bad: “Productivity Data”
7. Provide documentation (metadata)
High-quality data depend on:
• Proficiency of the data collector(s)
• Instrument precision and accuracy
• Consistency (e.g., standard methods and approaches)
• Design and ease of data entry
• Sound QA/QC
• Comprehensive metadata (e.g., documentation of anomalies, etc.)
[Data sheet: two column headings, "Plant" and "Life Stage", above blank ruled lines]
What’s wrong with this data sheet?
Important questions
How well does the data sheet reflect the data set design?
How well does the data entry screen (if available) reflect the data sheet?
PHENOLOGY DATA SHEET
Rio Salado - Transect 1
Collectors: _________________________________
Date: ___________________ Time: _________
Notes: ________________________________________________________________________________________

Plant  Life Stage
ardi   P/G  V  B  FL  FR  M  S  D  NP
arpu   P/G  V  B  FL  FR  M  S  D  NP
atca   P/G  V  B  FL  FR  M  S  D  NP
bamu   P/G  V  B  FL  FR  M  S  D  NP
zigr   P/G  V  B  FL  FR  M  S  D  NP
       P/G  V  B  FL  FR  M  S  D  NP
       P/G  V  B  FL  FR  M  S  D  NP

P/G = perennating or germinating   M = dispersing
V = vegetating                     S = senescing
B = budding                        D = dead
FL = flowering                     NP = not present
FR = fruiting
PHENOLOGY DATA SHEET
Rio Salado - Transect 1
Collectors: Troy Maddux
Date: 16 May 1991   Time: 13:12
Notes: Cloudy day, 3 gopher burrows on transect

ardi  P/G  V    B    FL   FR   M    S    D    NP
      Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N
arpu  P/G  V    B    FL   FR   M    S    D    NP
      Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N
asbr  P/G  V    B    FL   FR   M    S    D    NP
      Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N
deob  P/G  V    B    FL   FR   M    S    D    NP
      Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N  Y N
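A data entry screen that mirrors this sheet could validate each plant row before it is stored. The Python sketch below is one hypothetical way to do that: the function name is invented, the stage codes come from the sheet's legend, and the sample answers are made up because the transcript does not show which values were circled.

```python
STAGE_CODES = {
    "P/G": "perennating or germinating", "V": "vegetating", "B": "budding",
    "FL": "flowering", "FR": "fruiting", "M": "dispersing",
    "S": "senescing", "D": "dead", "NP": "not present",
}


def validate_row(plant, answers):
    """Return a list of problems for one plant row; an empty list means the row is valid."""
    problems = []
    for code in STAGE_CODES:
        if code not in answers:
            # Forcing an explicit Y or N separates "not observed" from "not recorded".
            problems.append(f"{plant}: no answer for {code}")
        elif answers[code] not in ("Y", "N"):
            problems.append(f"{plant}: {code} must be Y or N, got {answers[code]!r}")
    for code in answers:
        if code not in STAGE_CODES:
            problems.append(f"{plant}: unknown life stage code {code!r}")
    return problems


# Hypothetical entries for illustration only.
complete = {code: "N" for code in STAGE_CODES}
complete["FL"] = "Y"
incomplete = {code: "N" for code in STAGE_CODES if code != "D"}

print(validate_row("ardi", complete))
print(validate_row("arpu", incomplete))
```

Because the screen uses the same codes and the same order as the paper sheet, a transcription slip shows up as a validation message instead of a silent gap in the data.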
Generic Data Processing (Brunt 2000)
[Figure: data-processing flow diagram. Elements: Research Program; Investigators; Studies; Experimental Design; Methods; Data Design; Data Forms; Data Entry; Field Computer Entry; Electronically Interfaced Field Equipment; Electronically Interfaced Lab Equipment; Raw Data File; Quality Assurance Checks; Data Contamination; Data verified? (yes / no); Data Validated; Archive Data File; Archival Mass Storage (Magnetic Tape / Optical Disk / Printouts); Off-site Storage; Access Interface; Metadata; Quality Control; Summary Analyses; Investigators; Secondary Users; Publication; Synthesis]
Ecoinformatics solutions
• Project / experimental design
• Data design
• Data acquisition
• QA/QC
• Data documentation (metadata): to be addressed
• Data archival
Cycles of Research: “A Conventional View”
[Figure: cycle diagram with the elements Problem, Planning, Collection, Data, Analysis and modeling, and Publications]
Cycles of Research: “A New View”
[Figure: cycle diagram with the elements Problem Definition (Research Objectives), Planning, Collection, Original Observations, Archive of Data, Selection and extraction, Secondary Observations, Analysis and modeling, and Publications]
Data Archive
A collection of data sets, usually electronic, stored in such a way that a variety of users can locate, acquire, understand and use the data.
Examples:
• ESA’s Ecological Archive
• NASA’s DAACs (Distributed Active Archive Centers)
References
Brunt (2000) Ch. 2 in Michener and Brunt (2000)
Porter (2000) Ch. 3 in Michener and Brunt (2000)
Edwards (2000) Ch. 4 in Michener and Brunt (2000)
Michener (2000) Ch. 7 in Michener and Brunt (2000)
Cook, R.B., R.J. Olson, P. Kanciruk, and L.A. Hook. 2000. Best practices for preparing ecological and ground-based data sets to share and archive. (online at http://www.daac.ornl.gov/cgi-bin/MDE/S2K/bestprac.html)