
Machine-Learning Strategies for Variable Source Classification

Nina Hernitschek (Caltech / Vanderbilt University*)

collaborators: Judith G. Cohen (Caltech), Hans-Walter Rix (MPIA), Branimir Sesar (formerly MPIA)

*DSI/VIDA Postdoctoral Fellow at Vanderbilt University’s Data Science Institute (DSI) and the Vanderbilt Initiative for Data-Intensive Astrophysics (VIDA)

Hot-Wiring the Transient Universe – August 19 - 22, 2019

All-Sky Surveys RR Lyrae Machine-Learning Variability



The Pan-STARRS1 3π Survey

PS1 3π in one sentence: An optical/near-IR survey of 3/4 of the sky in non-simultaneous grizy to r ∼ 21.8, based on ∼70 visits over a 5.5-year period.

map galactic halo to ∼120 kpc

single-visit depth of r ∼ 21.8

coadded depth of r ∼ 23.2

sky coverage of ∼31,000 deg² (3/4 of the sky)

δ > -30 deg

70 epochs over 5.5 years

grizy nonsimultaneous


[Figure: survey distance limits marked on an image based on NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech: ∼10 kpc (limit of SDSS studies for kinematics & [Fe/H]), ∼120 kpc (PS1 3π), ∼400 kpc (LSST)]


RR Lyrae from PS1 3π

RR Lyrae stars:

periodic pulsators, varying on 1/4-day timescales

distances from PLZ relation

old: ∼10⁹ years

high-precision 3D mapping of the (old) Milky Way

⇒ easy to find and important tracers for old halo substructure

big challenge: characterize variability and identify RR Lyrae stars (and their periods) from 10⁹ sparse, noisy light curves
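The step from "distances from PLZ relation" to a 3D map rests on the standard distance modulus. The sketch below shows only that arithmetic; it is not the PLZ calibration used for PS1 3π. The absolute magnitude is a placeholder value (a PLZ relation would compute it from the star's period and metallicity), and dust extinction is ignored.

```python
def distance_kpc(apparent_mag, absolute_mag):
    """Distance from the distance modulus m - M = 5 log10(d / 10 pc)."""
    distance_pc = 10.0 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)
    return distance_pc / 1000.0

# Placeholder absolute magnitude (assumption for illustration only);
# a PLZ relation would supply this from period and metallicity.
ASSUMED_ABS_MAG = 0.6

# A hypothetical RR Lyrae star observed at mean magnitude 20.0.
d = distance_kpc(apparent_mag=20.0, absolute_mag=ASSUMED_ABS_MAG)
print(f"{d:.1f} kpc")
```

Because the distance depends exponentially on the magnitude difference, a small uncertainty in the absolute magnitude translates directly into a fractional distance error, which is why an accurate period matters.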



To model a survey...




tools are needed for

describing data quality → outliers (machine learning)

describing light curve characteristics → “features”

classifying sources → catalogs (machine learning)

finding substructure → clumps, overdensities, ...

⇒ generic, general approaches needed

⇒ depending strongly (!) on computational performance

challenging, but enables a new generation of population studies: huge and homogeneous (deep, all-sky) samples



Machine Learning

... is the sub-field of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959)



Supervised Machine Learning

for all following machine-learning approaches, we use supervised learning: learning to infer a function from labeled training data, e.g. classification

[Diagram: training set → classifier; target set → classifier → target set’s probabilities]

training set:

set of sources inside/outside category we are looking for

same data quality as found in target set

What’s happening internally?



Random Forest Classifier

[Diagram: training set → subsample 1 ... subsample N → tree 1 ... tree N; each tree classifies a source x → majority vote → random forest’s decision]

divide-and-conquer approach improves the classification performance

less sensitive to training set variances

robust to outliers

training and classification can be parallelized
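The subsample-and-vote idea in the diagram can be sketched in a few lines. This is a simplified stand-in, not the classifier used for PS1 3π: each "tree" is a one-split decision stump, there is no random feature selection, and the two features (a variability amplitude and a timescale in days) and all values are made up for illustration.

```python
import random

def train_stump(rows, labels):
    """Find the single-feature threshold split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for sign in (1, -1):
                # sign=1 predicts class 1 above the threshold, sign=-1 below it.
                preds = [int(sign * (r[f] - t) > 0) for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    return best[1:]

def stump_predict(stump, row):
    f, t, sign = stump
    return int(sign * (row[f] - t) > 0)

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on its own bootstrap subsample of the training set."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return forest

def forest_probability(forest, row):
    """Fraction of trees voting for class 1."""
    return sum(stump_predict(s, row) for s in forest) / len(forest)

def forest_predict(forest, row):
    """Majority vote over all trees."""
    return int(forest_probability(forest, row) > 0.5)

# Toy training set: (amplitude, timescale in days); 1 = target class.
rows = [(0.30, 1.2), (0.28, 1.6), (0.33, 1.4), (0.35, 1.1),
        (0.05, 300.0), (0.04, 500.0), (0.06, 250.0), (0.03, 420.0)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

forest = train_forest(rows, labels)
print(forest_probability(forest, (0.31, 1.3)))
```

The vote fraction plays the role of the per-source probability: individual stumps trained on unlucky subsamples can be wrong, but the majority is much less sensitive to any one subsample. Since each stump is trained independently, the loop in `train_forest` could be distributed across processes.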



Application: Variability Characterization

Classification of variable sources relies fundamentally on algorithms quantifying different aspects of variability found in light curves.

feature extraction:

light curve → (signal processing) → numbers

features should be as discriminative and informative as possible

challenges:

non-simultaneous multi-band

noise & uncertainties

foreground effects

time-sampling acting as window function
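To make "light curve → numbers" concrete, here is a minimal sketch that turns a single-band light curve into three generic statistics. These are illustrative choices, not the feature set used for PS1 3π, and the function name and sample values are hypothetical. Weighting by the quoted uncertainties is one simple way to account for noise.

```python
def light_curve_features(mags, errs):
    """Reduce one light curve (magnitudes and their errors) to a few numbers."""
    weights = [1.0 / e ** 2 for e in errs]
    # Inverse-variance weighted mean magnitude.
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    # Reduced chi-square about the mean: ~1 for a constant source,
    # much larger when the scatter exceeds the quoted errors.
    chi2 = sum(((m - mean) / e) ** 2 for m, e in zip(mags, errs))
    reduced_chi2 = chi2 / (len(mags) - 1)
    # Peak-to-peak amplitude (sensitive to outliers, hence outlier cleaning first).
    amplitude = max(mags) - min(mags)
    return {"mean_mag": mean, "reduced_chi2": reduced_chi2, "amplitude": amplitude}

# Hypothetical light curves with 0.05 mag uncertainties.
constant = light_curve_features([20.0] * 5, [0.05] * 5)
variable = light_curve_features([19.6, 20.4, 19.8, 20.5, 19.7], [0.05] * 5)
print(constant["reduced_chi2"], variable["reduced_chi2"])
```

None of these statistics uses the observation times, so they are unaffected by the window function but also blind to timescales; that is the gap that period and structure-function features fill.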



Periodicity

found e.g. in light curves of eclipsing binaries, RR Lyrae and Cepheids

RR Lyrae period crucial for distance determination: Period-Luminosity-Metallicity (PLZ) relation

However:

might be masked due to cadence of survey

not all variables are periodic

⇒ apply methods to detect periodicity in sparse and unevenly sampled multi-band data



Periodicity

Template Fitting

create light-curve templates from better-sampled data or mock data (simulation) & fit target data

example: RR Lyrae period fitting, using light-curve templates from SDSS Stripe 82 (Sesar et al. 2010)

[Figure: phase-folded light curve, magnitude (19.5 to 21.0) vs. phase (0.0 to 1.0), in g, r, i, and z & y]

⇒ computationally expensive

⇒ should be 2nd step after more general pre-selection method
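A brute-force version of template fitting shows both how it works and why it is expensive. This is a single-band sketch under stated assumptions, not the multi-band fit with SDSS Stripe 82 templates: the template shape, the simulated star, and all function names are hypothetical, and the fit is unweighted.

```python
import random

def template(phase):
    """Hypothetical RRab-like shape in magnitudes: fast brightening, slow fading."""
    if phase < 0.2:
        return 1.0 - phase / 0.2      # magnitude drops quickly (star brightens)
    return (phase - 0.2) / 0.8        # magnitude rises slowly (star fades)

def fit_chi2(times, mags, period, phase0):
    """Least-squares fit of mags = m0 + amp * template(phase); returns chi2."""
    shape = [template((t / period + phase0) % 1.0) for t in times]
    n = len(times)
    s_t, s_y = sum(shape), sum(mags)
    s_tt = sum(v * v for v in shape)
    s_ty = sum(v * y for v, y in zip(shape, mags))
    denom = n * s_tt - s_t * s_t
    if denom == 0.0:
        return float("inf")
    amp = (n * s_ty - s_t * s_y) / denom
    m0 = (s_y - amp * s_t) / n
    return sum((y - m0 - amp * v) ** 2 for v, y in zip(shape, mags))

def best_period(times, mags, periods, n_phase=20):
    """Try every (period, phase offset) pair and keep the lowest chi2."""
    trials = ((fit_chi2(times, mags, p, k / n_phase), p)
              for p in periods for k in range(n_phase))
    return min(trials)[1]

# Simulated star: 60 unevenly spaced epochs over 30 days, true period 0.55 d.
rng = random.Random(42)
times = sorted(rng.uniform(0.0, 30.0) for _ in range(60))
mags = [20.0 + 0.8 * template((t / 0.55) % 1.0) + rng.gauss(0.0, 0.02)
        for t in times]

periods = [0.40 + 0.001 * k for k in range(401)]
found = best_period(times, mags, periods)
print(f"best period: {found:.3f} d")
```

Even this toy search evaluates the template about half a million times for one star (401 periods × 20 phase offsets × 60 epochs). A longer time baseline needs a proportionally finer period grid, so running such a fit on every source in a survey is impractical without pre-selection.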



Variable Sources in General

not all variables are periodic: QSOs, supernovae...

not all periodic variables look periodic: sampling

some period-estimation methods are computationally very expensive: need for pre-selection

⇒ non-periodic features are very important

⇒ non-periodic, non-simultaneous features extracted from Pan-STARRS1 3π light curves, such as multi-band structure functions



Characterize Light Curves

multi-band structure-function variability model L(grizy | ω_r, τ): how much should you expect a source to vary within Δt?

⇒ fit (ω_λ, τ) & m̄_λ

⇒ characteristic variability timescale & amplitude

[Figure panels: RR Lyrae, ω_r = 0.3, τ = 1.5 days; QSO, ω_r = 0.13, τ = 560 days]
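The question "how much should you expect a source to vary within Δt?" can be illustrated with the two parameter pairs quoted above. The slides do not spell out the functional form, so the sketch below assumes a damped-random-walk structure function, SF(Δt) = ω · sqrt(1 − exp(−|Δt|/τ)); the normalization convention and the function name are assumptions.

```python
import math

def expected_variability(delta_t_days, omega, tau_days):
    """Assumed damped-random-walk structure function, in magnitudes."""
    return omega * math.sqrt(1.0 - math.exp(-abs(delta_t_days) / tau_days))

# Parameter values quoted on the slide.
rr_lyrae = dict(omega=0.3, tau_days=1.5)
qso = dict(omega=0.13, tau_days=560.0)

for dt in (1.0, 30.0, 1000.0):
    print(dt,
          round(expected_variability(dt, **rr_lyrae), 3),
          round(expected_variability(dt, **qso), 3))
```

Under this assumed form, the RR Lyrae curve saturates at its full amplitude within a few days, while the QSO curve keeps growing for years. That difference in (ω, τ) is what lets a classifier separate the two classes without measuring a period.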



Methodology

10⁹ light curves from PS1 3π

outlier cleaning using a machine-learning method

feature extraction: structure functions, mean magnitudes

first classification with random forest classifier (RFC)

RR Lyrae: 1.5 × 10⁵ RR Lyrae candidates (80% purity, 80% completeness)

feature extraction: template fitting

second classification with RFC

4.4 × 10⁴ RRab stars (90% purity, 80% completeness), distances up to ∼140 kpc, σ = 3%

QSO: 3.8 × 10⁶ QSO candidates (85% purity, 85% completeness)
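The purity and completeness figures quoted above are, in the usual definition, the fraction of selected sources that are genuine and the fraction of genuine sources that are selected. A minimal sketch with made-up labels (not survey data):

```python
def purity_and_completeness(true_labels, selected):
    """purity = TP / (TP + FP); completeness = TP / (TP + FN)."""
    tp = sum(1 for t, s in zip(true_labels, selected) if t and s)
    fp = sum(1 for t, s in zip(true_labels, selected) if not t and s)
    fn = sum(1 for t, s in zip(true_labels, selected) if t and not s)
    purity = tp / (tp + fp) if tp + fp else float("nan")
    completeness = tp / (tp + fn) if tp + fn else float("nan")
    return purity, completeness

# Toy validation set: 10 genuine RR Lyrae followed by 10 other sources.
true_labels = [True] * 10 + [False] * 10
# The classifier selects 8 genuine stars and 2 contaminants.
selected = [True] * 8 + [False] * 2 + [True] * 2 + [False] * 8

purity, completeness = purity_and_completeness(true_labels, selected)
print(purity, completeness)
```

Both numbers depend on the probability threshold applied to the classifier output: a stricter cut raises purity at the cost of completeness. This is consistent with the second classification step trading a smaller sample for higher purity.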



RR Lyrae Candidates



Sagittarius stream: an example of structure finding

[Figure: polar plot of distance D [kpc] (0 to 120) vs. Λ̃⊙ (0° to 315°), with a B̃⊙ [°] scale (ticks 0 to 8 on either side of 0); labeled features: Virgo, Sgr leading arm, Sgr trailing arm]



What to get from Pan-STARRS1 3π

identification of RR Lyrae and QSO candidates works well:

RR Lyrae: in S82, 90% purity, 80% completeness, 4.4 × 10⁴ sources

QSO: in S82, 85% purity, 85% completeness, 3.8 × 10⁶ sources

fitting of complete (!) 3D geometry of Sagittarius stream

huge Keck & Magellan spectroscopic follow-up survey for RRab stars: Caltech/Carnegie Survey of the Outer Halo of the Milky Way

⇒ catalog of variable sources & paper: Hernitschek+2016, Sesar+2017

paper on the 3D geometry of Sagittarius stream: Hernitschek+2017, Sesar+2017

paper on the Milky Way’s halo profile to 130 kpc: Hernitschek+2018

paper on the Milky Way’s globular clusters and dwarf galaxies: Hernitschek+2019

... and some more are in preparation!



Take home message

With the right algorithms, even sparse data (light curves) can lead to surprisingly good classification

