Machine-Learning Strategies for Variable Source Classification

Nina Hernitschek (Caltech / Vanderbilt University*)

collaborators: Judith G. Cohen (Caltech), Hans-Walter Rix (MPIA), Branimir Sesar (formerly MPIA)

*DSI/VIDA Postdoctoral Fellow at Vanderbilt University’s Data Science Institute (DSI) and the Vanderbilt Initiative for Data-Intensive Astrophysics (VIDA)

Hot-Wiring the Transient Universe – August 19 - 22, 2019

Post on 04-Jul-2020

All-Sky Surveys RR Lyrae Machine-Learning Variability

The Pan-STARRS1 3π Survey

PS1 3π in one sentence: An optical/near-IR survey of 3/4 of the sky in non-simultaneous grizy to r ∼ 21.8, based on ∼70 visits over a 5.5-year period.

map galactic halo to ∼120 kpc

single-visit depth of r ∼ 21.8

coadded depth of r ∼ 23.2

sky coverage of ∼31,000 deg² (3/4 of the sky)

δ > -30 deg

70 epochs over 5.5 years

grizy nonsimultaneous

[Figure: Milky Way illustration (image based on NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech) with survey distance limits: ∼10 kpc, limit of SDSS studies for kinematics & [Fe/H]; ∼120 kpc, PS1 3π; ∼400 kpc, LSST]

RR Lyrae from PS1 3π

RR Lyrae stars:

periodic pulsators, varying on 1/4 day timescales

distances from period-luminosity-metallicity (PLZ) relation

old: ∼10⁹ years

high-precision 3D mapping of the (old) Milky Way

⇒ easy to find and important tracers for old halo substructure

big challenge: characterize variability and identify RR Lyrae stars (and their periods) from 10⁹ sparse, noisy light curves
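The step from a PLZ-based absolute magnitude to a distance is not spelled out on the slide. A minimal sketch, assuming the absolute magnitude M has already been obtained from a PLZ relation (the value used below is purely illustrative, not a fitted one) and ignoring extinction, uses the standard distance modulus m − M = 5 log10(d / 10 pc). The function name `distance_kpc` is hypothetical.

```python
def distance_kpc(apparent_mag, absolute_mag):
    """Distance from the standard distance modulus m - M = 5 log10(d / 10 pc).

    Extinction is ignored here; in practice the apparent magnitude
    would be dereddened first.
    """
    distance_pc = 10.0 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)
    return distance_pc / 1000.0


# Illustrative numbers only: M = 0.6 mag is a placeholder for the
# absolute magnitude that a PLZ relation would supply.
d = distance_kpc(apparent_mag=20.6, absolute_mag=0.6)
print(f"{d:.0f} kpc")
```

With these placeholder values the distance modulus is 20 mag, which corresponds to 100 kpc, i.e. the outer-halo regime the survey is meant to reach.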

To model a survey...

tools are needed for

describing data quality → outliers (machine learning)

describing light curve characteristics → “features”

classifying sources → catalogs (machine learning)

finding substructure → clumps, overdensities, ...

⇒ generic, general approaches needed

⇒ depending strongly (!) on computational performance

challenging, but enables a new generation of population studies: huge and homogeneous (deep, all-sky) samples

Machine Learning

... is the sub-field of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959)

Supervised Machine Learning

for all following machine learning approaches, we use supervised learning: learning to infer a function from labeled training data, e.g. classification

[Diagram: the training set trains a classifier; applied to the target set, the classifier yields the target set’s probabilities]

training set:

set of sources inside/outside category we are looking for

same data quality as found in target set

What’s happening internally?

Random Forest Classifier

[Diagram: training set → subsamples 1 … N → trees 1 … N; each tree classifies the input x → majority vote → random forest’s decision]

divide-and-conquer approach improves the classification performance

less sensitive to training set variances

robust to outliers

training and classification can be parallelized
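The subsample-then-vote idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the classifier used for the survey: each "tree" is a depth-1 decision stump, and the random feature selection of a full random forest is omitted, so what remains is bootstrap aggregation with a majority vote. All function names and the toy features (an amplitude and a timescale) are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Best single-feature threshold split (a depth-1 decision tree)."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [1 if sign * (row[j] - threshold) > 0 else 0 for row in X]
                errors = sum(p != t for p, t in zip(pred, y))
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, sign)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, sign = stump
    return 1 if sign * (row[j] - threshold) > 0 else 0


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on its own bootstrap subsample of the training set."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    """Majority vote, plus the fraction of trees voting for class 1."""
    votes = [predict_stump(stump, row) for stump in forest]
    return Counter(votes).most_common(1)[0][0], sum(votes) / len(votes)


# Toy training set: [amplitude in mag, timescale in days]; label 1 = target class.
X = [[0.02, 300.0], [0.03, 500.0], [0.05, 100.0], [0.04, 50.0],
     [0.30, 1.5], [0.25, 2.0], [0.35, 1.0], [0.28, 0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
label_var, frac_var = predict_forest(forest, [0.32, 1.2])
label_const, frac_const = predict_forest(forest, [0.01, 400.0])
print(label_var, label_const)
```

The vote fraction returned alongside the label plays the role of the per-source probability mentioned on the supervised-learning slide. Because the trees are trained independently, the loop in `fit_forest` is the part that parallelizes.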

Application: Variability Characterization

Classification of variable sources relies fundamentally on algorithms quantifying different aspects of variability found in light curves.

feature extraction:

light curve → (signal processing) → numbers

features should be as discriminative and informative as possible

challenges:

non-simultaneous multi-band

noise & uncertainties

foreground effects

time-sampling acting as window function
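As a concrete illustration of "light curve → numbers", the sketch below computes three simple summary features for a single band. This is an assumption-laden simplification: it ignores the multi-band, non-simultaneous aspect listed among the challenges, and the feature names and the function `light_curve_features` are hypothetical rather than the features used in the talk.

```python
import math


def light_curve_features(mags, errs):
    """Turn one single-band light curve into a few summary numbers."""
    weights = [1.0 / e**2 for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    # Reduced chi^2 of a constant-brightness model: about 1 for a
    # non-variable source with well-estimated errors, much larger
    # for a variable one.
    chi2 = sum(((m - mean) / e) ** 2 for m, e in zip(mags, errs))
    chi2_red = chi2 / (len(mags) - 1)
    # Scatter beyond what the photometric errors explain (floored at zero).
    variance = sum((m - mean) ** 2 for m in mags) / (len(mags) - 1)
    mean_err2 = sum(e**2 for e in errs) / len(errs)
    excess_rms = math.sqrt(max(variance - mean_err2, 0.0))
    return {"mean_mag": mean, "chi2_red": chi2_red, "excess_rms": excess_rms}


errs = [0.02] * 5
quiet = light_curve_features([20.00, 20.02, 19.98, 20.01, 19.99], errs)
variable = light_curve_features([20.0, 20.4, 19.7, 20.3, 19.8], errs)
print(quiet["chi2_red"], variable["chi2_red"])
```

The quiet source scatters within its error bars, so its reduced chi² stays below 1 and its excess rms is zero, while the variable source stands out by orders of magnitude in reduced chi².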

Periodicity

found e.g. in light curves of eclipsing binaries, RR Lyrae and Cepheids

RR Lyrae period crucial for distance determination: Period-Luminosity-Metallicity (PLZ) relation

However:

might be masked due to cadence of survey

not all variables are periodic

⇒ apply methods to detect periodicity in sparse and unevenly sampled multi-band data
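One simple way to search for a period in unevenly sampled data is to fold the light curve at many trial periods and keep the one that gives the smoothest folded curve. The sketch below uses a string-length statistic for that purpose. It is a single-band illustration on synthetic data with randomly drawn epochs, not the multi-band method of the talk; the function names, the sinusoidal test signal and the trial grid are all assumptions.

```python
import math
import random


def string_length(times, mags, period):
    """Sum of squared magnitude jumps between phase-adjacent points.

    Small when folding at `period` produces a smooth curve.
    """
    folded = sorted(((t / period) % 1.0, m) for t, m in zip(times, mags))
    ordered = [m for _, m in folded]
    # Pair each point with its phase neighbour, wrapping last to first.
    return sum((b - a) ** 2 for a, b in zip(ordered, ordered[1:] + ordered[:1]))


def best_period(times, mags, p_min, p_max, step):
    """Brute-force scan over a regular grid of trial periods."""
    n_trials = int(round((p_max - p_min) / step)) + 1
    trials = [p_min + i * step for i in range(n_trials)]
    return min(trials, key=lambda p: string_length(times, mags, p))


# Synthetic test case: 60 randomly timed epochs over 200 days.
rng = random.Random(42)
true_period = 0.6
times = [rng.uniform(0.0, 200.0) for _ in range(60)]
mags = [20.0 + 0.3 * math.sin(2 * math.pi * t / true_period) + rng.gauss(0.0, 0.01)
        for t in times]

found = best_period(times, mags, p_min=0.2, p_max=1.0, step=0.0002)
print(f"recovered period: {found:.4f} d")
```

The fine grid step is needed because a small period error accumulates into a large phase error over a 200-day baseline, which is one reason period searches on survey data are costly.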

Template Fitting

create light-curve templates from better sampled data or mock data (simulation) & fit target data

example: RR Lyrae period fitting, using light curve templates from SDSS Stripe 82 (Sesar et al. 2010)

[Figure: phase-folded light curve with fitted templates, Magnitude (19.5 to 21.0) vs. Phase (0.0 to 1.0), bands g, r, i, z & y]

⇒ computationally expensive

⇒ should be 2nd step after more general pre-selection method
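To make the cost argument concrete, here is a minimal single-band sketch of template fitting at one trial period. The sawtooth shape is a hypothetical stand-in for a real template (it is not one of the SDSS Stripe 82 templates), and `fit_template` is an illustrative name. For each phase shift on a grid, the mean magnitude and amplitude follow from weighted linear least squares; a full period search repeats this for every trial period, which is where the expense comes from.

```python
import random


def sawtooth_template(phase):
    """Hypothetical RRab-like shape in magnitudes, scaled to [0, 1]:
    fast brightening over 15% of the cycle, then a slow fade."""
    rise = 0.15
    return 1.0 - phase / rise if phase < rise else (phase - rise) / (1.0 - rise)


def fit_template(times, mags, errs, period, template, n_shifts=50):
    """Best chi^2 of m = m0 + A * template(phase) at a fixed trial period.

    Returns (chi2, m0, A, shift) for the best phase shift on the grid.
    """
    w = [1.0 / e**2 for e in errs]
    best = None
    for k in range(n_shifts):
        shift = k / n_shifts
        x = [template((t / period + shift) % 1.0) for t in times]
        sw = sum(w)
        sx = sum(wi * xi for wi, xi in zip(w, x))
        sy = sum(wi * yi for wi, yi in zip(w, mags))
        sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, mags))
        det = sw * sxx - sx * sx
        if det <= 0.0:
            continue  # degenerate: all template values identical
        amp = (sw * sxy - sx * sy) / det
        m0 = (sy - amp * sx) / sw
        chi2 = sum(wi * (yi - m0 - amp * xi) ** 2 for wi, xi, yi in zip(w, x, mags))
        if best is None or chi2 < best[0]:
            best = (chi2, m0, amp, shift)
    return best


# Synthetic sparse light curve: 40 epochs over 100 days, 0.05 mag errors.
rng = random.Random(1)
true_period = 0.55
times = [rng.uniform(0.0, 100.0) for _ in range(40)]
errs = [0.05] * 40
mags = [20.0 + 0.8 * sawtooth_template((t / true_period + 0.3) % 1.0)
        + rng.gauss(0.0, 0.05) for t in times]

chi2_true, m0_fit, amp_fit, _ = fit_template(times, mags, errs, true_period,
                                             sawtooth_template)
chi2_wrong = fit_template(times, mags, errs, 0.5, sawtooth_template)[0]
print(chi2_true < chi2_wrong)
```

Each call costs n_shifts passes over the light curve, and a period scan multiplies that by thousands of trial periods per source, so running it on every light curve in a survey is impractical without pre-selection.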

Variable Sources in General

not all variables are periodic: QSOs, supernovae...

not all periodic variables look periodic: sampling

some period-estimation methods are computationally very expensive: need for pre-selection

⇒ non-periodic features are very important

⇒ non-periodic non-simultaneous features extracted from Pan-STARRS1 3π light curves, such as multiband structure functions

Characterize Light Curves

multi-band structure-function variability model L(grizy | ω_r, τ): how much should you expect a source to vary within ∆t?

⇒ fit (ω_λ, τ) & m̄_λ

⇒ characteristic variability timescale & amplitude

[Figure panels: RR Lyrae, ω_r = 0.3, τ = 1.5 days; QSO, ω_r = 0.13, τ = 560 days]
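The question "how much should you expect a source to vary within ∆t?" can be answered empirically by binning all pairs of epochs by their time lag. The sketch below does exactly that for a single band. It is a plain empirical estimator, not the likelihood-based multi-band model L(grizy | ω_r, τ) named on the slide, and the function name, bin edges and toy light curve are assumptions.

```python
import math


def structure_function(times, mags, bin_edges):
    """Empirical structure function: rms magnitude difference of all
    epoch pairs, grouped by their time lag dt."""
    sums = [0.0] * (len(bin_edges) - 1)
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(times)):
        for j in range(i + 1, len(times)):
            dt = abs(times[j] - times[i])
            for b in range(len(sums)):
                if bin_edges[b] <= dt < bin_edges[b + 1]:
                    sums[b] += (mags[j] - mags[i]) ** 2
                    counts[b] += 1
                    break
    # Bins without any pair are reported as None.
    return [math.sqrt(s / c) if c else None for s, c in zip(sums, counts)]


# Toy source with a slow drift: variability grows with time lag.
times = [10.0 * k for k in range(11)]     # epochs at 0, 10, ..., 100 days
mags = [20.0 + 0.001 * t for t in times]  # 0.001 mag per day
sf = structure_function(times, mags, bin_edges=[0.0, 25.0, 60.0, 101.0])
print(sf)
```

A source that varies on long timescales shows a structure function that keeps rising with lag, while a short-timescale variable saturates at small lags; the fitted amplitude and timescale on the slide summarize that behaviour in two numbers.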

Methodology

10⁹ light curves from PS1 3π

outlier cleaning using a machine-learning method

feature extraction: structure functions, mean magnitudes

first classification with random forest classifier (RFC)

RR Lyrae branch:

1.5 × 10⁵ RR Lyrae candidates (80% purity, 80% completeness)

feature extraction: template fitting

second classification with RFC

4.4 × 10⁴ RRab stars (90% purity, 80% completeness), distances up to ∼140 kpc, σ = 3%

QSO branch:

3.8 × 10⁶ QSO candidates (85% purity, 85% completeness)
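Purity and completeness are quoted throughout without a definition. A minimal sketch, assuming the usual definitions (purity is the fraction of selected sources that are genuine, completeness is the fraction of genuine sources that are selected) and a labeled validation set; the function name and the toy labels are illustrative.

```python
def purity_completeness(true_labels, predicted_labels):
    """Purity = TP / (TP + FP); completeness = TP / (TP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # genuine and selected
    fp = sum(1 for t, p in pairs if not t and p)    # contaminants selected
    fn = sum(1 for t, p in pairs if t and not p)    # genuine but missed
    purity = tp / (tp + fp) if tp + fp else float("nan")
    completeness = tp / (tp + fn) if tp + fn else float("nan")
    return purity, completeness


# Toy validation set: 10 genuine sources and 10 contaminants.
# The classifier selects 8 genuine sources and 2 contaminants.
true_labels = [1] * 10 + [0] * 10
predicted = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
purity, completeness = purity_completeness(true_labels, predicted)
print(purity, completeness)
```

Raising the probability threshold of the classifier generally trades completeness for purity, which is why both numbers are quoted together for each sample.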

RR Lyrae Candidates

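Sagittarius stream: an example of structure finding

[Figure: RR Lyrae candidates in Sagittarius stream coordinates, polar plot of longitude Λ̃⊙ (0° to 315°) vs. heliocentric distance D [kpc] (0 to 120), color-coded by latitude B̃⊙ [°] (−8 to 8); labeled features: Virgo, Sgr leading arm, Sgr trailing arm]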

What to get from Pan-STARRS1 3π

identification of RR Lyrae and QSO candidates works well:

RR Lyrae: in S82, 90% purity, 80% completeness, 4.4 × 10⁴ sources

QSO: in S82, 85% purity, 85% completeness, 3.8 × 10⁶ sources

fitting of complete (!) 3D geometry of Sagittarius stream

huge Keck & Magellan spectroscopic follow-up survey for RRab stars: Caltech/Carnegie Survey of the Outer Halo of the Milky Way

⇒ catalog of variable sources & paper: Hernitschek+2016, Sesar+2017

paper on the 3D geometry of Sagittarius stream: Hernitschek+2017, Sesar+2017

paper on the Milky Way’s halo profile to 130 kpc: Hernitschek+2018

paper on the Milky Way’s globular clusters and dwarf galaxies: Hernitschek+2019

... and some more are in preparation!

Take home message

With the right algorithms, even sparse data (light curves) can lead to surprisingly good classification
