citrination-mrs fall meeting 2015
TRANSCRIPT
Citrination: Open Infrastructure for Ingesting, Storing, & Mining Materials Data
Bryce Meredig & Greg Mulholland
Citrine Informatics
MRS Fall Meeting
2 December 2015
Introduction
About Citrine
Data platform for the physical world: our software aggregates and mines materials data to aid R&D, manufacturing, and sales
Business Model
We sell enterprise-scale industrial deployments of our platform
We don’t charge academic or government labs for public data storage or access
Bold Assertion #1
Our platform is a one-line data management plan for everyone
-Funding agencies ask you to make data accessible, but do not specify how
-Anyone can store public data on our platform for free, today
Bold Assertion #2
Public data should be free and universally available
Bold Assertion #3
Scientists should focus on science, not IT
-Funding agencies don’t want an infrastructure mortgage
-Proliferation of unconnected data islands doesn’t serve the community
We’re Nice, But Not a Charity
More data make our platform smarter and more valuable
Users help us, and each other, by curating and organizing data
Statistics on Citrination
Users from >1k institutions
3.1M materials data records
3.2k distinct datasets
150k documents
Ingesting Data: Extraction from Documents
Platform Overview
Data extraction pipeline turns docs & files into a structured database
Structured data are far more discoverable, and also amenable to machine learning
Data Structure Continuum
Documents (most materials-related arXiv papers indexed): completely unstructured
Numerical data: highly structured
Data Extraction: Text
cell parameter a = MATERIALS PROPERTY
5.82445(1) = NUMERICAL VALUE
angstrom = UNITS
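The property / value / units labeling above can be illustrated with a small regular-expression sketch. This is not Citrine's extraction pipeline; the pattern, the `Measurement` type, and the `extract` function are hypothetical names introduced only to show how a phrase of this shape might be split into its three parts, with the parenthesized digit read as an uncertainty on the last decimal place.

```python
import re
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    prop: str                     # materials property, e.g. "cell parameter a"
    value: float                  # numerical value
    uncertainty: Optional[float]  # from the parenthesized last digits, if present
    units: str


# Hypothetical pattern: "<property> = <number>(<uncertainty digits>) <units>"
PATTERN = re.compile(
    r"(?P<prop>[A-Za-z][\w ]*?)\s*=\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?:\((?P<unc>\d+)\))?\s*"
    r"(?P<units>[^\W\d_]+)"
)


def extract(text: str) -> Optional[Measurement]:
    """Return the first property/value/units triple found in text, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    value_str = m.group("value")
    unc = None
    if m.group("unc") is not None:
        # "5.82445(1)" means an uncertainty of 1 in the last decimal place.
        decimals = len(value_str.split(".")[1]) if "." in value_str else 0
        unc = int(m.group("unc")) * 10 ** (-decimals)
    return Measurement(m.group("prop").strip(), float(value_str), unc, m.group("units"))


result = extract("cell parameter a = 5.82445(1) angstrom")
```

A real pipeline would need to handle many more phrasings, unit spellings, and symbols; the sketch covers only the single form shown on the slide.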
Extraction: Images & Tables
Image containing data -> machine vision -> underlying x,y data (actual extraction shown)
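One step in recovering x,y data from a plot image is converting pixel positions into data coordinates. The sketch below covers only that calibration step, assuming linear axes and two known reference points per axis. The function names and the pixel values are hypothetical and are not taken from the slides, and the machine-vision step that locates the curve in the image is not shown.

```python
from typing import Callable, List, Tuple


def make_axis_map(pix0: float, val0: float, pix1: float, val1: float) -> Callable[[float], float]:
    """Build a pixel -> data-value mapping for one linear axis from two reference points."""
    scale = (val1 - val0) / (pix1 - pix0)
    return lambda pix: val0 + (pix - pix0) * scale


def pixels_to_data(
    points: List[Tuple[float, float]],
    x_map: Callable[[float], float],
    y_map: Callable[[float], float],
) -> List[Tuple[float, float]]:
    """Convert (column, row) pixel positions into (x, y) data coordinates."""
    return [(x_map(px), y_map(py)) for px, py in points]


# Hypothetical calibration: x axis runs 0..10 between pixel columns 100 and 500;
# y axis runs 0..700 between pixel rows 400 and 50 (image rows grow downward).
x_map = make_axis_map(100, 0.0, 500, 10.0)
y_map = make_axis_map(400, 0.0, 50, 700.0)

# Hypothetical pixel positions of three points detected on a curve.
curve_pixels = [(100, 400), (300, 225), (500, 50)]
curve_data = pixels_to_data(curve_pixels, x_map, y_map)
```

Logarithmic axes would need the same mapping applied to the logarithm of the reference values.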
Ingesting Data: Community Contributions
Contributors & Partners
Uploading Data
Ingestion is instant if you create JSON or .csv files (see citrination.com/contributing)
Otherwise, we figure it out!
Credit and Provenance
We acknowledge both the contributor (i.e., uploader) and the published work via the DOI
Incentives: Vanity Metrics
Weekly pageviews of the OQMD paper’s page (does not count individual datum views): comparable to high-impact journal metrics!
Incentives: Discoverability
Can scientists find your data via Google?
Why Upload?
Data management plan
Discoverability & impact
Persist your raw data
Ingesting Data: Case Studies
Computational Data: OQMD
J.E. Saal et al., JOM 65, 1501 (2013)
Computational Data: Materials Project
A. Jain et al., APL Materials 1, 011002 (2013)
B2O3 DOS
Experimental Data: JCAP
A. Shinde et al., J Mat Res 30, 442 (2014)
Data Partnerships
Implemented an API for Elsevier:
https://citrination.com/api/doi/banner/10.1016/j.jallcom.2014.11.091
Link to Citrination data will appear here
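The banner endpoint shown for the Elsevier integration appears to take the article DOI as the trailing part of the URL path. A minimal sketch of building such a URL follows. It assumes other DOIs use the same pattern as the one example on the slide, and the helper name `banner_url` and the percent-encoding of unusual characters are assumptions rather than documented behavior.

```python
from urllib.parse import quote

# Base path taken from the example URL on the slide.
BANNER_BASE = "https://citrination.com/api/doi/banner/"


def banner_url(doi: str) -> str:
    """Return the banner URL for a DOI (hypothetical helper, not an official client)."""
    # Keep "/" unescaped so the DOI prefix/suffix stay in the path, as in the slide's example.
    return BANNER_BASE + quote(doi.strip(), safe="/")


url = banner_url("10.1016/j.jallcom.2014.11.091")
```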
Storing Data
Data Standards
MIF (Materials Information File): general JSON schema for defining materials data
Open standard and open-source tools for working with it
MIF Sample
Schema available: http://citrineinformatics.github.io/mif-documentation
{ "sample": {
    "material": {
      "chemicalFormula": "LiF",
      "condition": [
        {
          "scalar": [
            { "value": "Single crystalline" }
          ],
          "name": "Crystallinity"
        }
      ]…
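The fragment above can be produced from ordinary Python dictionaries and the standard `json` module. The sketch below reproduces only the fields visible on the slide, since the rest of the record is elided there. It does not use mifkit, and the helper name `make_sample` is hypothetical.

```python
import json


def make_sample(formula: str, condition_name: str, condition_value: str) -> dict:
    """Build the part of a MIF-style record that is visible in the slide's sample."""
    return {
        "sample": {
            "material": {
                "chemicalFormula": formula,
                "condition": [
                    {
                        "scalar": [{"value": condition_value}],
                        "name": condition_name,
                    }
                ],
            }
        }
    }


record = make_sample("LiF", "Crystallinity", "Single crystalline")
text = json.dumps(record, indent=2)  # JSON text ready to write to a file
```

The published schema should be consulted for required fields before uploading; this covers only the nesting shown in the sample.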
Working with MIF
mifkit: open-source Python toolset for working with MIF
Create MIFs in your code, or import MIFs from Citrination into your code
Programmatic Data Access
API usage example:
# full documentation: http://citrineinformatics.github.io/api-documentation
# `client` is assumed to be an already-configured Citrination API client (setup not shown on the slide)
# search the entire database
client.search(formula='CrFeSn')
# filter on values
client.search(formula='GaN', property='band gap', max_measurement='3')
# search a single data set
client.search(formula='CrFeSn', data_set_id='100')
Data Quality and Curation
Our philosophy: We are not arbiters of data quality; instead, we give the community tools to assess and discuss quality
Mining Data
Machine Learning: TE Case Study
Sparks, T. D., Gaultois, M. W., Oliynyk, A., Brgoch, J., & Meredig, B. “Data mining our way to the next generation of thermoelectrics.” Scripta Materialia (2015)
ML-based web app that predicts key thermoelectric properties for any bulk polycrystalline material
Heat map of thermal conductivity predicted by ML in Ru-Dy-Ge system
TE Search: All Ternary Systems
Industrial Applications
In production at several Fortune 500 corporations and smaller companies:
-Coatings
-Alloys
-Crystallography
-Energy materials
Releases Every ~2 Weeks
WebGL Crystal Structure Visualizations
Autocomplete
Profile Pages
Get Involved
• We’ll store your data today (easy data management plan template)
• Join Citrination newsletter: bit.ly/1NGNgdb
• Access our public data
• Email me: [email protected]