Communicating Scientific Thoughts and Data
Bryan Lawrence
Oxford, March 2006
Communicating Scientific Thought and DataOxford March, 2006 2
Outline
• Intro to BADC
• Drivers for Change:
– Open Access
– Data as Evidence
• Handling Data – what is this metadata stuff?
• Improving our ability to find and utilise data:
– The NERC DataGrid
– Importance of Climate Forecast Conventions
– NumSim
• Are we changing our methods of communicating?
– Communication Timescales, blogging and preprints
– CLADDIER
BADC Role
The BADC role is to assist UK researchers to locate, access and interpret atmospheric data and to ensure the long-term integrity of atmospheric data produced by NERC projects.
– Facilitation and Curation/Preservation!
BADC Data Holdings
• A BADC dataset is an aggregation of data files, documents and metadata sharing common administrative policies. These policies could be file validation, access control or retention schemes.
• Datasets vary from TBs in millions of files to a few MBs in a single file.
• There are presently over 100 datasets.
User examples
• Atmospheric chemistry models.
• Pollution chemistry measurement campaigns.
User examples
• Bird feeding habits.
User examples
• Radio communication modelling.
• Wind power research.
• A & E influenza cases.
User examples
• Castle mortar decay.
• Discomfort indices.
Drivers for Change
Climate in 2010 – A Graphic Illustration
Figures from Gary Strand, NCAR, ESG website
March 2006, 2.5 PB
Typically, two-thirds of this data will never see the light of day: why?
No one can remember what it was, or, if they can remember that, where it is!
http://www.realclimate.org/index.php?p=121
Data as Evidence
http://www.uoguelph.ca/~rmckitri/research/trcback.html
What McIntyre got right:
RCUK Position Statement on Access to Research Outputs
http://www.rcuk.ac.uk/access/statement.pdf
Research Data
8. RCUK also notes that one of the benefits of digitisation and publication in digital formats is the ability to provide access to primary research data alongside the traditional article; and it shares the Select Committee’s and the Government’s view that the data underpinning the published results of publicly-funded research should be made available as widely and rapidly as possible. For a number of years, Research Councils including the AHRB, ESRC and NERC have funded data centres and services which are responsible for preserving, managing and providing access to research data; and these Councils have well-established policies and procedures for preservation and access. CCLRC is currently leading cross-Council consideration of how policy and practice need to be developed with regard to the curation of the data created through the research projects they support. Further work is needed to develop a common framework of policies and procedures for determining what sets of data are collected, whether in university or in Council-run repositories or elsewhere; and how and on what terms they are made accessible to the research community and others.
New methods of publication
9. The development of web and associated Internet technologies providing access to a range of distributed information resources has enabled new possibilities for the delivery of research publications. This has also led to a change in expectations as to how and when research publications are accessed. E-print repositories (see paragraphs 10-15 below) and open access journals (see paragraphs 25-27 below) have both developed as part of this change in technology and expectation. Indeed, the economic model for open access journals depends on the web to provide a low-cost delivery mechanism. RCUK considers that both e-print repositories and open access journals can help improve access to the results of publicly funded research.
Data Retention Policies
University of Cambridge Research Division: Data generated in the course of research should be kept securely in paper or electronic format, as appropriate. Back-up records should always be kept for data stored on a computer. The [AMRC] considers a minimum of ten years to be an appropriate period. However, research based on clinical samples or relating to public health may require longer storage to allow for long-term follow-up to occur. [AMRC: Association of Medical Research Charities]
University of Oxford Research Services Office: A successful laboratory notebook allows for ready verification of quality and integrity of research data and enables another investigator to reproduce the procedure which has been documented and get the same result. …
Natural Environment Research Council: … Scientists will frequently process the data they have collected selectively, or with specific application packages, in order to prepare material for publication in the scientific literature. But the full value of the data collected may only be realised if the entire dataset is subjected to generic processing (eg to ensure calibration and adequate quality control) and is sufficiently documented to allow others to re-use it at a later date. The original collector may be the only person in a position to undertake such work, and so to unlock the full potential of the data. Those holding data collected under NERC funding will be expected to cooperate in validating and publishing them in their entirety - when this can be justified in terms of their scientific value - rather than merely creaming off a subset for immediate publication in the literature. …
What is this stuff called metadata?
Preserving data is not just about backups!
One could argue that the writers of these documents did a brilliant job of preserving the bits-and-bytes of their time …
And yes they’ve both been translated … many times, it’s a shame the meanings are different …
Phaistos Disk, 1700BC
Metadata Origins
[Diagram: Research Groups with Shared Resources (Satellite, SuperComputer, DB), set within the Wider Internet]
Consider a hierarchy of data users beginning with an individual scientist, who may herself be part of a research group, itself part of a community sharing resources, lying in the wider internet …
To be well integrated the metadata should have a role at each level! (The data portal client and server interface may be different at each level.)
At each level “extra” metadata will be required, probably produced by dedicated staff at the research group or data centre.
NDG Metadata Taxonomy
… not one schema, not one solution!
CSML, NCML+CF
MOLES, THREDDS
DIF -> ISO19115
CLADDIER
The NERC DataGrid
http://ndg.nerc.ac.uk
British Atmospheric Data Centre
British Oceanographic Data Centre
Complexity + Volume + Remote Access = Grid Challenge
NCAR
[Diagram: the NERC Grid within the Wider Internet. Diagram labels: BADC NDG Wrapper, BODC NDG Wrapper and Group NDG Wrapper, each with Online Data and an XML database (plus a tape robot); Research Group Data Sources (Satellite, Supercomputer); Grid User and Software Agent; Internet User, Internet Links, ESG (& other) Applications; NDG Web Portal with its own XML database.]
NetCDF + WMS
e.g. ERA40 re-analysis surface air temperature, 2001-04-27
– deegree open-source WMS modified with netCDF connector
– Overlaid with rainfall from globe.digitalearth.gov WMS server
NB: Now using Mapserver for Interoperability experiments
Climate Science Modelling Language
• CSML feature types
– defined on basis of geometric and topologic structure

CSML feature type | Description | Examples
TrajectoryFeature | Discrete path in time and space of a platform or instrument. | ship’s cruise track, aircraft’s flight path
PointFeature | Single point measurement. | raingauge measurement
ProfileFeature | Single ‘profile’ of some parameter along a directed line in space. | wind sounding, XBT, CTD, radiosonde
GridFeature | Single time-snapshot of a gridded field. | gridded analysis field
PointSeriesFeature | Series of single datum measurements. | tidegauge, rainfall timeseries
ProfileSeriesFeature | Series of profile-type measurements. | vertical or scanning radar, shipborne ADCP, thermistor chain timeseries
GridSeriesFeature | Timeseries of gridded parameter fields. | numerical weather prediction model, ocean general circulation model
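The naming pattern in the table (a geometry, optionally a series of it) can be sketched in a few lines of Python. This is only an illustration: the helper `csml_feature_type` and its arguments are hypothetical and not part of CSML or any NDG tooling.

```python
# Hypothetical helper: builds a CSML feature type name from the table's pattern.
# The three geometries below each have a single and a "Series" variant in the table.
GEOMETRIES = ("Point", "Profile", "Grid")


def csml_feature_type(geometry, is_series=False):
    """Return the feature type name for a geometry, e.g. 'ProfileSeriesFeature'."""
    if geometry == "Trajectory":
        # The table lists only one trajectory type (no series variant).
        return "TrajectoryFeature"
    if geometry not in GEOMETRIES:
        raise ValueError(f"unknown geometry: {geometry!r}")
    return f"{geometry}{'Series' if is_series else ''}Feature"


# A radiosonde ascent is a single profile; a thermistor chain timeseries is a series of profiles.
radiosonde = csml_feature_type("Profile")
thermistor_chain = csml_feature_type("Profile", is_series=True)
```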
Climate Science Modelling Language
• CSML feature types– examples...
ProfileSeriesFeature
ProfileFeature
GridFeature
Climate Science Modelling Language
• Application schema
– logical structure and semantic content of NDG ‘Dataset’
– Based on Geography Markup Language 3.1
[UML diagram of types: GML::AbstractGMLType, Dataset, UnitDefinitions, ReferenceSystemDefinitions, PhenomenonDefinitions, AbstractArrayDescriptor, GML::FeatureCollection]
Climate Science Modelling Language
• Numerical array descriptors
– provides ‘wrapper’ architecture for legacy data files
– ‘Connected’ to data model numerical content through ‘xlink:href’
• Three subtypes:
– InlineArray
– ArrayGenerator
– FileExtract (NASAAmes, NetCDF, GRIB)
• Composite design pattern for aggregation
[UML diagram, types and attributes:]
«Type» GML::AbstractGMLType: +id, +metaDataProperty, +description, +name
«Type» AbstractArrayDescriptor: +arraySize[1], +uom[0..1], +numericType[0..1], +numericTransform[0..1], +regExpTransform[0..1]
«Type» AggregatedArray: +aggType[1], +aggIndex[1]; +component (1 to *)
«Type» InlineArray: +values[*]
«Type» ArrayGenerator: +expression[1]
«Type» AbstractFileExtract: +fileName[1]
«Type» NASAAmesExtract: +variableName[1], +index[0..1]
«Type» NetCDFExtract: +variableName[1]
«Type» GRIBExtract: +parameterCode[1], +recordNumber[0..1], +fileOffset[0..1]
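A minimal Python sketch of the composite pattern named on this slide. The class and attribute names follow the UML labels, but the classes themselves, the `leaves` and `count` methods, and the default values are assumptions for illustration and are not the NDG implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class AbstractArrayDescriptor:
    """Attributes shared by every descriptor (names taken from the UML)."""
    array_size: Tuple[int, ...]
    uom: Optional[str] = None
    numeric_type: Optional[str] = None

    def count(self) -> int:
        """Number of values implied by array_size."""
        n = 1
        for dim in self.array_size:
            n *= dim
        return n


@dataclass
class InlineArray(AbstractArrayDescriptor):
    values: List[float] = field(default_factory=list)


@dataclass
class NetCDFExtract(AbstractArrayDescriptor):
    file_name: str = ""
    variable_name: str = ""


@dataclass
class AggregatedArray(AbstractArrayDescriptor):
    """Composite: holds other descriptors, including other aggregates."""
    agg_type: str = "example"   # hypothetical value; the slide does not list the allowed types
    agg_index: int = 1
    components: List[AbstractArrayDescriptor] = field(default_factory=list)

    def leaves(self) -> Iterator[AbstractArrayDescriptor]:
        """Walk the tree and yield only the non-aggregate descriptors."""
        for component in self.components:
            if isinstance(component, AggregatedArray):
                yield from component.leaves()
            else:
                yield component


azimuth = NetCDFExtract((10000,), file_name="radar_data.nc", variable_name="az")
inline = InlineArray((5, 2), values=[float(v) for v in range(1, 11)])
inner = AggregatedArray((2,), components=[azimuth, inline])
outer = AggregatedArray((2,), components=[inner, inline])
leaf_names = [type(leaf).__name__ for leaf in outer.leaves()]
```

Because an aggregate is itself a descriptor, client code can treat a single file extract and a nested aggregation of many files in the same way.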
Climate Science Modelling Language
• Inline array
<NDGInlineArray>
  <arraySize>5 2</arraySize>
  <uom>udunits.xml#degreeC</uom>
  <numericType>float</numericType>
  <regExpTransform>s/10/9/ge</regExpTransform>
  <numericTransform>+5</numericTransform>
  <values>1 2 3 4 5 6 7 8 9 10</values>
</NDGInlineArray>
• Array generator
<NDGArrayGenerator>
  <arraySize>10001</arraySize>
  <uom>udunits.xml#minute</uom>
  <numericType>float</numericType>
  <expression>0:5:50000</expression>
</NDGArrayGenerator>
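A sketch of how a client might expand these two descriptors, using only the Python standard library. It assumes that the generator expression reads as start:step:stop with an inclusive stop (which is consistent with the stated arraySize of 10001) and that inline values are laid out row by row; both are assumptions, and the regExpTransform and numericTransform elements are deliberately left out.

```python
import xml.etree.ElementTree as ET

GENERATOR_XML = """
<NDGArrayGenerator>
  <arraySize>10001</arraySize>
  <uom>udunits.xml#minute</uom>
  <numericType>float</numericType>
  <expression>0:5:50000</expression>
</NDGArrayGenerator>
"""

INLINE_XML = """
<NDGInlineArray>
  <arraySize>5 2</arraySize>
  <numericType>float</numericType>
  <values>1 2 3 4 5 6 7 8 9 10</values>
</NDGInlineArray>
"""


def expand_generator(xml_text):
    """Expand 'start:step:stop' (assumed inclusive) and check it against arraySize."""
    elem = ET.fromstring(xml_text)
    start, step, stop = (int(part) for part in elem.findtext("expression").split(":"))
    values = [float(v) for v in range(start, stop + 1, step)]
    expected = int(elem.findtext("arraySize"))
    if len(values) != expected:
        raise ValueError(f"expression gives {len(values)} values, arraySize says {expected}")
    return values


def read_inline(xml_text):
    """Return the inline values as rows, assuming a two-dimensional, row-major layout."""
    elem = ET.fromstring(xml_text)
    rows, cols = (int(n) for n in elem.findtext("arraySize").split())
    flat = [float(v) for v in elem.findtext("values").split()]
    if rows * cols != len(flat):
        raise ValueError("arraySize does not match the number of values")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


minutes = expand_generator(GENERATOR_XML)
table = read_inline(INLINE_XML)
```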
Climate Science Modelling Language
• File extract
<NDGNASAAmesExtract>
  <arraySize>526</arraySize>
  <numericType>double</numericType>
  <fileName>/data/BADC/macehead/mh960606.cf1</fileName>
  <variableName>CFC-12</variableName>
</NDGNASAAmesExtract>
<NDGNetCDFExtract gml:id="feat04azimuth">
  <arraySize>10000</arraySize>
  <fileName>radar_data.nc</fileName>
  <variableName>az</variableName>
</NDGNetCDFExtract>
<NDGGRIBExtract>
  <arraySize>320 160</arraySize>
  <numericType>double</numericType>
  <fileName>/e40/ggas1992010100rsn.grb</fileName>
  <parameterCode>203</parameterCode>
  <recordNumber>5</recordNumber>
  <fileOffset>289412</fileOffset>
</NDGGRIBExtract>
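A sketch of how the file-extract descriptors might be turned into a uniform read request, again with the standard library only. The function `describe_extract` and the shape of the dictionary it returns are hypothetical; actually reading the NASA Ames, NetCDF or GRIB file is out of scope here.

```python
import xml.etree.ElementTree as ET

GRIB_XML = """
<NDGGRIBExtract>
  <arraySize>320 160</arraySize>
  <numericType>double</numericType>
  <fileName>/e40/ggas1992010100rsn.grb</fileName>
  <parameterCode>203</parameterCode>
  <recordNumber>5</recordNumber>
  <fileOffset>289412</fileOffset>
</NDGGRIBExtract>
"""

# Element name -> file format, taken from the three subtypes on the slide.
FORMATS = {
    "NDGNASAAmesExtract": "NASAAmes",
    "NDGNetCDFExtract": "NetCDF",
    "NDGGRIBExtract": "GRIB",
}

COMMON = ("arraySize", "numericType", "fileName")


def describe_extract(xml_text):
    """Summarise a file-extract element: format, file, shape, format-specific selectors."""
    elem = ET.fromstring(xml_text)
    if elem.tag not in FORMATS:
        raise ValueError(f"not a known file extract: {elem.tag}")
    shape = tuple(int(n) for n in elem.findtext("arraySize").split())
    n_values = 1
    for dim in shape:
        n_values *= dim
    return {
        "format": FORMATS[elem.tag],
        "file_name": elem.findtext("fileName"),
        "shape": shape,
        "n_values": n_values,
        # Everything else (variableName, parameterCode, ...) selects data inside the file.
        "selector": {child.tag: child.text for child in elem if child.tag not in COMMON},
    }


request = describe_extract(GRIB_XML)
```

The point of the wrapper is visible in the result: the caller sees the same fields whatever the legacy format, and only the selector differs.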
MarineXML Testbed
[Diagram: source XML (Biological Species, Chl-a from Satellite, Modelled Hydrodynamics, Measured Hydrodynamics), each with its XSD, is translated by XSLT and an XML Parser into (NDG) Feature Types / MarineGML, then by XSLT into S-57v3 GML for the SeeMyDENC ECDIS client (Data Dictionary, S52 Portrayal Library, SENC).]
• Data from different parts of the marine community conform to a variety of schema (XSD).
• For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FT) defined by CSML. The FTs and XSLT are maintained in a ‘MarineXML registry’.
• The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features.
• The FTs can then be translated to equivalent FTs for display in the ECDIS system.
• Features in the source XSD must be present in the data dictionary.
• Phenomena in the XSD must have an associated portrayal.
• Features described using the S-57v3.1 Application Schema can be imported and are equivalent to the same features in CSML.
• ECDIS acts as an example client for the data.
Slide adapted from Kieran Millard (AUKEGGS, 2005)
MarineXML Testbed
• Biological sampling station with attributes for the species sampled at each
• Grid of Chl-a from the MERIS instrument on ENVISAT
• Predicted and measured wave climate timeseries (height, direction and period)
• Vectors of currents from instruments
Slide adapted from Kieran Millard (AUKEGGS, 2005)
Re-using Features
Here structured XML is converted to plain ascii text in the form required for a numerical model
HTML warning service pages are generated ‘on the fly’
XML can also be converted to SVG to display data graphically
Here the same XML is converted to the SENC format used in a proprietary tool for viewing electronic navigation charts.
All this requires agreement on standards
Slide adapted from Kieran Millard (AUKEGGS, 2005)
Climate Science Modelling Language
• Status:
– Initial feature types defined
– First draft application schema complete
– Trial software tooling being coded (parser, netCDF instantiation)
– Initial deployment trial across BODC, BADC datasets
• Future:
– Separate out wrapper implementation (array descriptors)
– Disallow ‘internal’ dictionaries
– More strongly-typed features?
  • Complex features
  • Implicit Ensemble Support
  • Swathes
– Follow (and pursue!) GML evolution, enhance compliance
– Expand tooling
CSML Round Tripping
Managing semantics
[Diagram labels: conceptual model (UML: Class1, Class2, «datatype» DataType1); UGAS; GML app schema (XML); GML dataset instance, which conforms to the schema; parser (Under Development); Application, which produces a New Dataset (101010).]
GML dataset instance (fragment, as shown on the slide):
<gml:featureMember>
  <NDGPointFeature gml:id="ICES_100">
    <NDGPointDomain>
      <domainReference>
        <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree">
          <location>55.25 6.5</location>
        </NDGPosition>
      </domainReference>
    </NDGPointDomain>
    <gml:rangeSet>
      <gml:DataBlock>
        <gml:rangeParameters>
          <gml:CompositeValue>
            <gml:valueComponents>
              <gml:measure uom="#tn"/>
              <gml:measure uom="#amount"/>
              <gml:measure uom="#gsm"/>
            </gml:valueComponents>
          </gml:CompositeValue>
        </gml:rangeParameters>
        <gml:tupleList>
CSML Round Tripping
Managing data - 1
[Diagram labels: CF Dataset (101010); scanner (Under Development); GML dataset instance; GML app schema (XML); parser (Under Development); Application, which produces CF.]
GML dataset instance: the same NDGPointFeature fragment (gml:id="ICES_100") shown under ‘Managing semantics’.
Climate and Forecast (CF) Conventions
1. Data should be self-describing. No external tables are needed to interpret the file. For instance, CF encodings do not depend upon numeric codes (by contrast with GRIB).
2. The convention should be easy to use, both for data-writers and users of data.
3. The metadata and the semantic meaning encoded through the metadata should be readable by humans as well as easily utilized by programs.
4. Redundancy should be minimised as much as possible (because it reduces the chance of errors of inconsistency when writing data).
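A small sketch of what "self-describing" means in practice. Variables are modelled as plain dictionaries of attributes rather than read from a NetCDF file, and the check treats `standard_name` and `units` as mandatory purely for illustration; the CF document itself is more nuanced about which attributes are required. The variable names are hypothetical.

```python
# Attributes this sketch insists on; an assumption, not the full CF rule set.
REQUIRED = ("standard_name", "units")


def missing_cf_attributes(variables):
    """Return {variable name: [missing attribute names]} for incomplete variables."""
    problems = {}
    for name, attrs in variables.items():
        missing = [key for key in REQUIRED if key not in attrs]
        if missing:
            problems[name] = missing
    return problems


dataset = {
    # Self-describing: a reader needs no external table to interpret this variable.
    "tas": {"standard_name": "air_temperature", "units": "K"},
    # Not self-describing: the meaning hides behind a numeric code, GRIB-style.
    "param203": {"long_name": "parameter 203"},
}

problems = missing_cf_attributes(dataset)
```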
CF
CF consists of:
• Vocabulary management
• Semantic concepts (axes, cells etc),
• and format specific conventions (NetCDF now)
CF is at the heart of
• IPCC data comparison
• Academic earth system science data exploitation (and archival).
CF
CF:
• Exploits 100s of man-years of effort on NetCDF evolution and tools
• Is one of the means by which we can take NetCDF data and make meaningful feature types.
• Helps future-proof your data!
Managing Data 2
[Diagram labels: CF Dataset (101010); scanner; GML dataset; XSLT; ISO19115 XML; PUBLISH; DECISION PROCESSES; Define Dataset; Add Information.]
GML dataset: the same NDGPointFeature fragment (gml:id="ICES_100") shown under ‘Managing semantics’.
http://ndg.nerc.ac.uk/discovery
Choose to return either data or “B-”Metadata
Look at DIFs in either HTML or XML
Can order responses by Title, Data Centre or Temporal coverage (default random)
Background activity being parallelised with GODIVA/CCLRC e-science collaboration (spectral -> gridpoint + CDMS + visualisation tools)
Download either plot or the data that went into the plot.
ERA40:
• All driven from one CDML file, 9 TB online spherical harmonics, looking like 40 TB “virtual” gridded!
NumSim.xsd
http://proj.badc.rl.ac.uk/ndg/wiki/NumSim
See also: http://www.cgam.nerc.ac.uk/pmwiki/NMM/index.php/8
Changing Communications
Blogging, Trackback and CLADDIER
Blogging
Wednesday 15th of March:
• Google search on “climate blogs” yields 33,900,000 hits.
• www.technorati.com is following 30 million blogs
– 269,404 have climate posts
– 1,953 climate posts in “environmental” blogs
– 131 posts about potential vorticity (mainly in weather/hurricane blogs)
• Very few “professional” standard blogs in our field, but gazillions in others!
– Notwithstanding: http://www.realclimate.org and others
Traditional Scientific Publishing
Pluses:
• “Peer-Review”; the gold standard
• Copy-editing
• Reliable indexing (Web of Science etc)
• Paper is nice to read.
Minuses:
• “Peer-Review”; “support your mates”
• Often (very) slow to “print”
• Proprietary indexing (Roll on Google Scholar)
• Libraries can’t afford to buy copies! Limited Readership.

Self Publishing
Pluses:
• No Peer Review: say what you think, citation and annotation measure quality!
• Feedback: comments and trackback. Hyperlinks to publications AND data.
• Immediacy
• Reliable Accessible Indexing
• You can print things out to read …
• You can still publish in the traditional media (while it lasts).
Minuses:
• No Peer Review: plenty of garbage.
• Spam.
Conclusion: how can we do peer review without traditional journals? Because the days of traditional journals (apart from as formal records) are numbered!
What is trackback?
• Hyperlinks forward in time!
• If a web resource (paper, page, data) is configured correctly, software is able to accept trackback “pings” and update that web resource with annotations. One such annotation type is effectively a citation:
– “I (<blog_name>) have cited <yourURL> in something with this <title> found at this <url>”
– Real Time citation of a resource so that it shows what people have said *after* it has been published!
– (Some blogging providers do it automatically, using search engines to find all links and enter them appropriately)
• BNL just joined a working group to “standardise” trackback, and I’ll be working to make sure the format includes Academic Citation.
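A sketch of what a trackback ping looks like on the wire, following the widely used TrackBack convention: an HTTP POST of form-encoded `title`, `excerpt`, `url` and `blog_name` fields, answered with a small XML document whose `error` element is 0 on success. The URLs and text below are placeholders, the helper names are hypothetical, and the request is built but never sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_ping(trackback_url, *, url, title, excerpt, blog_name):
    """Build (but do not send) the POST that tells a resource it has been cited."""
    body = urlencode(
        {"title": title, "excerpt": excerpt, "url": url, "blog_name": blog_name}
    ).encode("utf-8")
    return Request(
        trackback_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
        method="POST",
    )


def ping_accepted(response_xml):
    """True if the reply reports error code 0."""
    root = ET.fromstring(response_xml)
    return root.findtext("error", default="1").strip() == "0"


# Placeholder URLs: the cited resource's trackback endpoint and the citing post.
ping = build_ping(
    "http://example.org/trackback/42",
    url="http://example.org/blog/my-post",
    title="A reply to your paper",
    excerpt="We reused the dataset and found ...",
    blog_name="Example Blog",
)
fields = parse_qs(ping.data.decode("utf-8"))
```

Sending it would be a single `urllib.request.urlopen(ping)` call; the academic-citation extension mentioned above would presumably add further fields, which are not sketched here.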
Trackback Example
Institutional Repositories
• E-print repositories (from the RCUK document cited earlier)
– For the purpose of this document, e-print repositories are always understood to be open access. RCUK believes that such institutional and subject-based repositories, where researchers deposit copies of the articles they publish (ie post-print), provide an opportunity significantly to enhance access to research publications … Importantly, there is a small but growing body of evidence demonstrating the increased impact and visibility of material made available in open access through e-print repositories.
(ignoring issues for the publishers for the moment)
• RCUK further recommends:
– Where research is funded by the Research Councils and undertaken by researchers with access to an open access e-print repository (institutional or subject-based), Councils will make it a condition for all grants awarded from 1 October 2005 that a copy of all resultant published journal articles or conference proceedings (but not necessarily the underlying data) should be deposited in and/or accessible through that repository, subject to copyright or licensing arrangements … Such deposit requires relatively little effort and, for each published paper, should not take more than 15-20 minutes of an author’s or repository manager’s time. There is no reason why this should be seen as an infringement of researchers’ freedom …
IR Examples
CLADDIER
CLADDIER Use Case
Sequence:
1. Joanna reads paper
2. Joanna acquires data
3. Joanna analyses data
4. Joanna deposits data
• Data Centre generates trackbacks to cited data and papers (in the metadata)
5. Joanna creates paper
6. Joanna deposits paper
• Institutional Repository generates trackbacks to cited data and papers
7. Fred reads Joanna’s new paper
8. Fred directly acquires EXACTLY the same data she used for his own project
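The sequence above can be sketched as a toy in-memory model. Everything here is hypothetical: the `Repository` class, the identifiers and the shared trackback index simply stand in for a data centre, an institutional repository and the pings exchanged between them.

```python
from collections import defaultdict


class Repository:
    """Hypothetical stand-in for a data centre or an institutional repository."""

    def __init__(self, name, trackbacks):
        self.name = name
        self.trackbacks = trackbacks  # shared index: cited id -> list of citing ids
        self.holdings = {}            # deposited id -> ids it cites

    def deposit(self, item_id, cites=()):
        """Store an item and generate a trackback to everything it cites."""
        self.holdings[item_id] = tuple(cites)
        for cited in cites:
            self.trackbacks[cited].append(item_id)


trackbacks = defaultdict(list)
data_centre = Repository("data centre", trackbacks)
institutional = Repository("institutional repository", trackbacks)

# Steps 4 and 6: Joanna deposits her data, then her paper.
data_centre.deposit("data:joanna-analysis", cites=["paper:original", "data:source"])
institutional.deposit("paper:joanna", cites=["paper:original", "data:joanna-analysis"])

# Steps 7 and 8: Fred follows the paper's citations to exactly the data Joanna used.
fred_data = [c for c in institutional.holdings["paper:joanna"] if c.startswith("data:")]

# The original paper now shows what has cited it since it was published.
citing_original = trackbacks["paper:original"]
```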
Summary
A bit of pot-pourri:
• Data Reuse depends on metadata, and eventual reuse depends on the originator doing it right!
– Use CF, get involved in NumSim etc …
– NDG will hopefully make it easier to exploit data!
• Timeliness of information is important and may become more relevant than quality (alone)!
• Boundary between “papers” and “data” is blurring!
– The next but one RAE (if it happens) may reflect this!
• Automated linking of resources will proliferate
– Use your IR and your data centre (BADC!)