kirk borne dept of computational & data sciences … › xldb10 › docs › xldb4_thu...• it...

19
Astroinformatics: at the intersection of Machine Learning, Automated Information Extraction, and Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University [email protected] , http://classweb.gmu.edu/kborne/

Upload: others

Post on 04-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

Astroinformatics:  at the intersection of Machine Learning, Automated 

Information Extraction, and AstronomyKirk Borne

Dept of Computational & Data SciencesGeorge Mason University

[email protected] , http://classweb.gmu.edu/kborne/

Page 2: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

Responding to the Data Flood• Big Data is a national challenge and a national

priority … see August 9, 2010 announcement from OMB and OSTP @ http://www.aip.org/fyi (#87)

• More data is not just more data… more is different !

• Several national study groups have issued reports on the urgency of establishing scientific and educational programs to face the data flood challenges.

• Each of these reports has issued a call to action in response to the data avalanche in science, engineering, and the global scholarly environment.

Page 3: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

Data Sciences: A National Imperative1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data, (1997) downloaded from

http://www.nap.edu/catalog.php?record_id=55042. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries, (2003) downloaded from

http://www.sis.pitt.edu/~dlwkshop/report.pdf3. NSF report: Cyberinfrastructure for Environmental Research and Education, (2003) downloaded from

http://www.ncar.ucar.edu/cyber/cyberreport.pdf4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century, (2005)

downloaded from http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research

Agenda, (2005) downloaded from http://www.cra.org/reports/cyberinfrastructure.pdf6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on

Cyberinfrastructure, (2005) downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf7. NSF report: The Role of Academic Libraries in the Digital Data Universe, (2006) downloaded from http://www.arl.org/bm~doc/digdatarpt.pdf8. National Research Council, National Academies Press report: Learning to Think Spatially, (2006) downloaded from

http://www.nap.edu/catalog.php?record_id=110199. NSF report: Cyberinfrastructure Vision for 21st Century Discovery, (2007) downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf10. JISC/NSF Workshop report on Data-Driven Science & Repositories, (2007) downloaded from

http://www.sis.pitt.edu/~repwkshop/NSF-JISC-report.pdf11. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at

Extreme Scale, (2007) downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf12. DOE report: Mathematics for Analysis of Petascale Data Workshop Report, (2008) downloaded from

http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf13. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society, (2009) downloaded

from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf14. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age, (2009) downloaded

from http://www.nap.edu/catalog.php?record_id=12615

Page 4: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

From Data-Driven to Data-Intensive• Astronomy has always been a data-driven science• It is now a data-intensive science: Astroinformatics

– Data-oriented Astronomical Research = “the 4th Paradigm”– Collaboration between Machine Learning (ML) and Astronomy

to accelerate discovery– The application of ML to Big Data = Data Mining = KDD

(Knowledge Discovery in Databases)– Data-Enabled Science: Data-driven, Evidence-based,

Objective, “Forensic”

• The goal : Scientific Knowledge !

Page 5: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

Data-Enabled Science:Scientific KDD (Knowledge Discovery from Data)

• Characterize the known (clustering, unsupervised learning)

• Assign the new (classification, supervised learning)

• Discover the unknown (outlier detection, semi-supervised learning)

• Benefits of very large datasets:• best statistical analysis of “typical” events• automated search for “rare” events

Graphic from S. G. Djorgovski

Page 6: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

3 Sample Machine Learning Problems in Big‐data Science

1. Unsupervised Learning :  Correlation Discovery

2. Supervised Learning :  Class Discovery

3. Semi‐supervised Learning :  Novelty Discovery

Page 7: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

• The dimension reduction problem:– Finding correlations and “fundamental planes” of parameters– Correlation Discovery !– Number of attributes can be

hundreds or thousands• The Curse of High

Dimensionality !– Are there combinations

(linear or non-linear functions) of observational parameters that correlate strongly with one another?

– Finding new patterns and dependencies, which reveal new physics or new scientific relations

Sample ML with Large Science Data – 1

Page 8: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

Sample ML with Large Science Data – 2• The classification problem:

– Classifying an object based upon observed attributes• Which attributes are optimal when there are hundreds to

thousands of attributes to choose from?

– Class and Subclass Discovery !– Are there new classes? Are there new subclasses?

• How do you discover them when Nitems≈1010, Ndim≈103?• Can you automate the discovery process?

Page 9: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

Sample ML with Large Science Data – 3• The outlier detection problem:

– Finding the objects and events that are outside the bounds of our expectations (outside known clusters) = the unknown unknowns

– Novelty Discovery! … Surprise Detection!– Finding new, rare, one-in-a-million(billion)(trillion) objects and

events– These may be real scientific discoveries or garbage– Outlier detection is therefore useful for:

• Novelty Discovery – is my Nobel prize waiting?• Anomaly Detection – is the detector system working?• Data Quality Assurance – is the data pipeline working?

� How does one optimally find outliers in 103-D parameter space? or in interesting subspaces (in lower dimensions)?

� How do we measure their “interestingness”?

Page 10: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

Scientific KDD (Machine Learning) combined with Information Extraction

Page 11: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

This is the Informatics Layer

Page 12: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

This is the Informatics LayerInformatics Layer:

Provides standardized representations of the “information extracted”– for use in the KDD (data mining) layer.Standardization is not required (nor feasible) at the “data source” layer.The informatics is discipline-specific.Informatics enables KDD across large distributed heterogeneous scientific data repositories.

Page 13: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

13

Heliophysics Space Weather Example

CME = Coronal Mass EjectionSEP = Solar Energetic Particle

Page 14: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

• Report discoveries back to the science database for community reuse

• Basic astronomical objects (informatics granules) are annotated ...– with follow-up observations of any kind – with new knowledge discovered– with common knowledge– with inter-relationships between objects and their properties– with concepts – with context– with assertions (e.g., classifications, concepts, quality flags,

relationships, references, observational parameters, common knowledge, inter-connectivity with other objects)

– with experimental parameters– with observer / observatory descriptors

• Enables knowledge-sharing and reuse

Semantics!

Provenance (for data curation)

Automated Information Extraction

Page 15: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

• Report discoveries back to the science database for community reuse

• Basic astronomical objects (informatics granules) are annotated ...– with follow-up observations of any kind – with new knowledge discovered– with common knowledge– with inter-relationships between objects and their properties– with concepts – with context– with assertions (e.g., classifications, concepts, quality flags,

relationships, references, observational parameters, common knowledge, inter-connectivity with other objects)

– with experimental parameters– with observer / observatory descriptors

• Enables knowledge-sharing and reuse

Semantics!

Provenance (for data curation)

Automated Information Extraction

Astroinformatics !

Page 16: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

• Each night for 10 years LSST will obtain the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey

• My grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data):

Sample LSST Data Challenge

Page 17: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

• Each night for 10 years LSST will obtain the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey

• My grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data): a sea of CDs

Sample LSST Data Challenge

Page 18: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

• Each night for 10 years LSST will obtain the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey

• My grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data): a sea of CDs

Image: The CD Seain Kilmington England(600,000 CDs)

Sample LSST Data Challenge

Page 19: Kirk Borne Dept of Computational & Data Sciences … › xldb10 › docs › xldb4_thu...• It is now a data-intensive science: Astroinformatics – Data-oriented Astronomical Research

• Each night for 10 years LSST will obtain the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey

• My grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data):– A sea of CDs each and every day for 10 yrs– Cumulatively, a football stadium full of 200 million CDs

after 10 yrs

• Yes, more is most definitely different !

Sample LSST Data Challenge