



LSST: Preparing for the Data Avalanche through Partitioning, Parallelization, and Provenance
Kirk Borne (Perot Systems Corporation / NASA GSFC and George Mason University / LSST)

The LSST research and development effort is funded in part by the National Science Foundation under Scientific Program Order No. 9 (AST-0551161) through Cooperative Agreement AST-0132798. Additional funding comes from private donations, in-kind support at Department of Energy laboratories and other LSSTC Institutional Members.

National Optical Astronomy Observatory
Research Corporation
The University of Arizona
University of Washington
Brookhaven National Laboratory
Harvard-Smithsonian Center for Astrophysics
Johns Hopkins University
Las Cumbres Observatory, Inc.
Lawrence Livermore National Laboratory
Stanford Linear Accelerator Center
Stanford University
The Pennsylvania State University
University of California, Davis
University of Illinois at Urbana-Champaign

ABSTRACT:  The Large Synoptic Survey Telescope (LSST) project will produce 30 terabytes of data daily for 10 years, resulting in a 65-petabyte final image data archive and a 70-petabyte final catalog (metadata) database. This large telescope will begin science operations in 2014 at Cerro Pachon in Chile. It will operate with the largest camera in use in astronomical research: 3 gigapixels, covering 10 square degrees, roughly 3000 times the coverage of one Hubble Space Telescope image. One co-located pair of 6-gigabyte sky images is acquired, processed, and ingested every 40 seconds.  Within 60 seconds, notification alerts for all objects that are dynamically changing (in time or location) are sent to astronomers around the world.  We expect roughly 100,000 such events every night.  Each spot on the available sky will be re-imaged in pairs approximately every 3 days, resulting in about 2000 images per sky location after 10 years of operations (2014-2024).  The processing, ingest, storage, replication, query, access, archival, and retrieval functions for this dynamic data avalanche are currently being designed and developed by the LSST Data Management (DM) team, under contract from the NSF. Key challenges to success include:  the processing of this enormous data volume, real-time database updates and alert generation, the dynamic nature of every entry in the object database, the complexity of the processing and schema, the requirements for high availability and fast access, spatial-plus-temporal indexing of all database entries, and the creation and maintenance of multiple versions and data releases.  To address these challenges, the LSST DM team is implementing solutions that include database partitioning, parallelization, and provenance (generation and tracking).  The prototype LSST database schema currently has over 100 tables, including catalogs for sources, objects, moving objects, image metadata, calibration and configuration metadata, and provenance. Techniques for managing this database must satisfy intensive scaling and performance requirements.  These techniques include data and index partitioning, query partitioning, parallel ingest, replication of hot-data, horizontal scaling, and automated fail-over.  In the area of provenance, the LSST database will capture all information that is needed to reproduce any result ever published.  Provenance-related data include: telescope/camera instrument configuration; software configuration (software versions, policies used); and hardware setup (configuration of nodes used to run LSST software).  Provenance is very dynamic, in the sense that the metadata to be captured change frequently.  The schema has to be flexible to allow that.  In our current design, over 30% of the schema is dedicated to provenance.  Our philosophy is this: (1) minimize the impact of reconfiguration by avoiding tight coupling between data and provenance: hardware and software configurations are correlated with science data via a single ProcessingHistory_id; and (2) minimize the volume of provenance information by grouping together objects with identical processing history.

LSST = Large Synoptic Survey Telescope

8.4-meter diameter primary mirror = 10 square degrees!

(design, development, construction, and operations of telescope & observatory funded by NSF)

(mirror funded by private donors)


LSST Camera = 201 CCDs @ 4096x4096 pixels each! = 3 gigapixels = 6 GB per image, covering 10 sq. degrees = ~3000 times the area of one Hubble Telescope image

(camera funded by DOE)

Focal Plane Array scale model
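A quick back-of-envelope check of the camera numbers above (the 16-bit-per-pixel assumption is mine, for illustration, not a stated specification):

```python
# Rough check of the quoted camera figures. Assumes 2 bytes (16 bits) per pixel,
# which is an assumption for illustration, not a stated specification.
ccds = 201
pixels_per_side = 4096
bytes_per_pixel = 2

total_pixels = ccds * pixels_per_side ** 2
print(f"total pixels  : {total_pixels / 1e9:.1f} gigapixels")            # ~3.4 ("3 gigapixels")
print(f"raw image size: {total_pixels * bytes_per_pixel / 1e9:.1f} GB")  # ~6.7 ("6 GB per image")
```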

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (2014-2024), with time-domain sampling in log(time) intervals (to capture the dynamic range of transients).
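The "log(time) sampling" idea can be illustrated with a short sketch: revisit intervals spaced logarithmically between a 40-second image pair and the 10-year survey span cover many decades of transient timescale. This is a conceptual illustration only, not the actual LSST cadence algorithm:

```python
import numpy as np

# Illustrative log-spaced revisit intervals, from one 40-second image pair up to the
# full 10-year survey. The real LSST scheduler is far more sophisticated.
t_min = 40.0                  # seconds (one image pair)
t_max = 10 * 365.25 * 86400   # seconds (10-year survey)

for dt in np.logspace(np.log10(t_min), np.log10(t_max), num=12):
    print(f"revisit interval ~ {dt / 86400:10.4f} days")
```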

Data Products
• Image Archive (65 Petabytes! after 10 years)
  – 2000 visits for each 10-sq.-degree patch of sky
  – 2000 patches in the viewable sky
  – 30 Terabytes per night (~2000 images)
• Object, Moving Object, & Source catalogs
• Alert Notifications:
  – VOEvent message protocol
  – 100,000 alerts per night
  – Anything that has changed (moving or optically variable)
• Full project database (70 Petabytes!)
• Uniformly processed data releases (annual)
• Deep co-added image of the sky:
  – Individual images reach 24th magnitude
  – Deep stacked image reaches 27th magnitude (a rough consistency check follows below)
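The quoted single-visit and stacked depths are roughly consistent with simple stacking arithmetic. The sketch below assumes background-limited coadds (depth improves as 2.5*log10(sqrt(N))) and that the ~2000 visits per patch are split roughly evenly among 6 filters; both assumptions are mine, for illustration only:

```python
import math

# Back-of-envelope check: coadding N equal-depth, background-limited images gains
# about 2.5*log10(sqrt(N)) magnitudes of depth. Assumes ~2000 visits split evenly
# over 6 filters (my assumption, not a poster value).
single_visit_depth = 24.0
visits_per_patch = 2000
filters = 6

n_per_filter = visits_per_patch / filters
gain = 2.5 * math.log10(math.sqrt(n_per_filter))
print(f"stacked depth per filter ~ {single_visit_depth + gain:.1f} mag")  # ~27.2 mag
```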

Database Contents
• >100 database tables:
  – Source catalog
  – Object catalog
  – Moving Object catalog
  – Variable Object catalog
  – Alerts catalog
  – Calibration metadata
  – Configuration metadata
  – Processing metadata
  – Provenance metadata
  – etc.

Source
  – 260 billion rows*
  – 2,000 partitions*
  – 306 bytes/row
  – 1 row = data for 1 filter

Object
  – 22 billion rows*
  – 2,000 partitions*
  – 1.8 KB/row
  – 1 row = data for 6 filters

Image Metadata
  – 675 million rows*
  – 1 row = metadata for 1 CCD amplifier

* as of Data Release 1 (DR1), 2014
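To make these row counts concrete, here is a small back-of-envelope estimate of the raw table sizes and per-partition row counts they imply (my arithmetic only; it ignores indexes, replication, and storage overheads):

```python
# Rough sizes implied by the DR1 row counts quoted above.
catalogs = {
    "Source":         (260e9, 306),   # rows, bytes per row
    "Object":         (22e9, 1800),
    "Image Metadata": (675e6, None),  # bytes/row not quoted on the poster
}
partitions = 2000

for name, (rows, bytes_per_row) in catalogs.items():
    if bytes_per_row is None:
        print(f"{name:15s}: {rows:.3g} rows (row size not quoted)")
        continue
    size_tb = rows * bytes_per_row / 1e12
    rows_per_partition = rows / partitions
    print(f"{name:15s}: ~{size_tb:,.0f} TB raw, ~{rows_per_partition:,.0f} rows per partition")
```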

Data Management System Layered Architecture: for scalability, reliability, and evolution

Application Layer (Scientific Layer; Object Oriented, C++; Custom Software)
• Data Products: Source Catalog, Object Catalog, Images, Alerts
• Pipelines: Image Processing, Detection, Association, Moving Object, Classification, Calibration
• Application Framework: Image, Astronomical Object, Catalog, Collection, Table, Meta-Data, Component, Processing Stage, Processing Slice

Middleware Layer (Portability, Standard Services; Open Source, Off-the-shelf Software; Custom Integration)
• Data Access: Data Access Framework, Distributed File System, Database Management System
• Distributed Processing: Pipeline Construction, Pipeline Execution, Management and Control
• System Administration, Operations, Security: System Resource Management, User Management, Certificate-based Security
• User Interface: Data Quality Visualization, VO Interfaces

Infrastructure Layer (Distributed Platform; Off-the-shelf, Commercial Hardware/Software)
• Computing: Clusters/Servers, Operating System
• Communications: Fiber, Switches, Routers, Firewalls, Communications Stacks, Network Management Software
• Physical Plant: Power, Cooling, Space
• Storage: Disk, Tape, Controllers, Storage Management Software

Database System Design
Meeting Massive Data Management Challenges: Parallelization, Partitioning, and Providing Virtual Data through Provenance
• Pipeline processing:
  – CPU and Data Parallelization
• Database access:
  – Sky Partitioning (see the partitioning sketch after this list):
    • Spatial clustering of the Source and Object Catalogs
    • Temporal partitioning of the DIASource catalog
• Database volume:
  – Provenance (>30% of the current DB schema, but <1% of the DB volume)
  – Without provenance unique identifiers (see below), the DB volume for tracking provenance (at 1-second time resolution for 10 years) would be at least 10x larger.
• Why Provenance?
  – to keep the data volume "manageable"
  – maintaining provenance allows us to discard intermediate data products and rebuild them at will on-the-fly: VIRTUAL DATA!
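A minimal sketch of the two partitioning keys described above: a spatial partition id that clusters nearby Objects/Sources together, and a temporal partition key for DIASource rows. The specific chunking scheme and column names here are illustrative assumptions, not the actual LSST partitioner:

```python
import math
from datetime import datetime

N_DEC_ZONES = 50   # ~50 declination zones x ~40 RA chunks gives on the order of 2,000 partitions

def sky_partition(ra_deg: float, dec_deg: float) -> int:
    """Map (RA, Dec) to a spatial partition id so that nearby objects share a partition."""
    zone = min(int((dec_deg + 90.0) / 180.0 * N_DEC_ZONES), N_DEC_ZONES - 1)
    # use fewer RA chunks near the poles so chunks stay roughly equal-area
    dec_mid = (zone + 0.5) / N_DEC_ZONES * 180.0 - 90.0
    n_ra_chunks = max(1, round(40 * math.cos(math.radians(dec_mid))))
    ra_chunk = int(ra_deg / 360.0 * n_ra_chunks) % n_ra_chunks
    return zone * 1000 + ra_chunk

def diasource_partition(obs_time: datetime) -> str:
    """Temporal partition key for the DIASource catalog (one partition per month here)."""
    return obs_time.strftime("%Y-%m")

print(sky_partition(ra_deg=210.8, dec_deg=-32.5))          # e.g. 15019
print(diasource_partition(datetime(2015, 3, 17, 4, 23)))   # '2015-03'
```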

• Implementing Provenance:
  – A unique Processing_ID is used throughout the DB. This 4-byte Processing_ID uniquely identifies the full hardware & software configuration (telescope, instrument, processing pipeline version, processing parameters, etc.) for each source measurement (roughly 100 million sources per image pair, with 1000 image pairs per night, every night).
  – Minimize the impact of reconfiguration by avoiding tight coupling between data and provenance: hardware and software configurations are correlated with data via a single Processing_ID, which is included in all major DB tables (Object, Source, ...).
  – Minimize the volume of provenance information by grouping together objects with identical processing history (see the sketch below).
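A minimal sketch of the grouping idea: every Source/Object row carries only a small Processing_ID, and the full configuration is stored once per distinct processing history rather than once per row. Table and field names here are illustrative, not the actual LSST schema:

```python
# Sources processed with an identical configuration share one Processing_ID, so the
# provenance tables stay tiny even though the Source table has billions of rows.
processing_history = {}   # configuration key -> Processing_ID
next_id = 1

def get_processing_id(telescope_cfg: str, pipeline_version: str, params: tuple) -> int:
    """Return the shared Processing_ID for this exact configuration, creating one if new."""
    global next_id
    key = (telescope_cfg, pipeline_version, params)
    if key not in processing_history:
        processing_history[key] = next_id
        next_id += 1
    return processing_history[key]

sources = []
for source_id in range(5):
    pid = get_processing_id("telescope-cfg-2014-03-17", "pipeline-2.6.1", ("psf=gauss", "thresh=5.0"))
    sources.append({"sourceId": source_id, "processingId": pid})

print(sources)             # every row shares processingId = 1
print(processing_history)  # a single provenance entry covers all of them
```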

Sample DB Table tracking provenance history
• The validity of a given set of parameters (hardware, software, processing, telescope, or instrument) for any database entry (source, object, etc.) is determined by the "validity" time for that unique provenance identifier (illustrated in the sketch below).
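A small illustration of the validity-time idea, assuming a configuration table whose rows carry validity windows (the column names and values are hypothetical):

```python
from datetime import datetime

# The configuration in force for a measurement is the row whose validity window
# contains the measurement time. Columns and values here are illustrative only.
prv_cnf_filter = [
    {"filterId": 3, "validityBegin": datetime(2014, 1, 1), "validityEnd": datetime(2015, 6, 1),
     "throughput": "r-band-v1"},
    {"filterId": 3, "validityBegin": datetime(2015, 6, 1), "validityEnd": datetime(2025, 1, 1),
     "throughput": "r-band-v2"},
]

def config_at(rows, filter_id, when):
    """Return the configuration row valid for this filter at time `when`."""
    for row in rows:
        if row["filterId"] == filter_id and row["validityBegin"] <= when < row["validityEnd"]:
            return row
    return None

print(config_at(prv_cnf_filter, 3, datetime(2015, 3, 17)))   # -> the "r-band-v1" row
```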

• Sample Table: Provenance of Pipeline Run Configuration

UML class diagram: Image Metadata (Package: Main Telescope, Version 2.6.1, Author: Jacek Becla). Classes include FPA_Exposure, CCD_Exposure, Amp_Exposure, Science_FPA_Exposure, Calibration_FPA_Exposure, Visit, TemplateImage, PSF, the WCS/PSF/PhotoCal tables at amplifier, CCD, and focal-plane level (Amp_WCS, Amp_PSF, Amp_PhotoCal, CCD_WCS, CCD_PSF, CCD_PhotoCal, FPA_WCS, FPA_PSF, FPA_PhotoCal), and the related prv_Filter, prv_Amplifier, and prv_cnf_MaskAmpImage provenance tables.

UML class diagram: Calibration (Package: Main Telescope, Version 2.6.1, Author: Jacek Becla). Classes include FPA_Exposure, Science_FPA_Exposure, the Bias, Dark, Flat, and Fringe calibration exposures (Bias_FPA_Exposure, Dark_FPA_Exposure, Flat_FPA_Exposure) and their CMExposure counterparts (Bias_FPA_CMExposure, Dark_FPA_CMExposure, Flat_FPA_CMExposure, Fringe_FPA_CMExposure), plus the association tables linking them (_FPA_Bias2CMExposure, _FPA_Dark2CMExposure, _FPA_Flat2CMExposure, _FPA_Fringe2CMExposure, _Science_FPA_Exposure_Group).

UML class diagram: Hardware Provenance (Package: Provenance, Version 2.6.1, Author: Jacek Becla). Classes include prv_Telescope, prv_FocalPlane, prv_Raft, prv_CCD, prv_Amplifier, and prv_Filter, each paired with a prv_cnf_* configuration-history table (prv_cnf_Telescope, prv_cnf_FocalPlane, prv_cnf_Raft, prv_cnf_CCD, prv_cnf_Amplifier, prv_cnf_Filter), plus prv_cnf_MaskAmpImage and its link to Amp_Exposure.

UML class diagram: Software Provenance (Package: Provenance, Version 2.6.1, Author: Jacek Becla). Classes include prv_Run, prv_Pipeline, prv_Stage, prv_Slice, prv_Node, prv_Policy, prv_UpdatableTable, and prv_UpdatableColumn, together with their association tables (prv_Pipeline2Run, prv_Stage2Pipeline, prv_Stage2Slice, prv_Stage2UpdatableColumn) and the corresponding prv_cnf_* configuration tables.

UML class diagram: Provenance (Package: Provenance, Version 2.6.1, Author: Jacek Becla). Classes include prv_ProcHistory, prv_Stage, prv_Stage2ProcHistory, and prv_Snapshot, with procHistoryId referenced from the Object, Source, DIASource, MovingObject, Amp_Exposure, CCD_Exposure, and FPA_Exposure tables.

Objects/sources/exposures processed using the same stages can share one instance of ProcessingHistory. Each instance of ProcessingHistory keeps track of the stages that were run as part of that ProcessingHistory. It also keeps track of the time window during which each stage ran. This time window can then be used to locate the appropriate configuration of every configurable piece of hardware and software (the prv_cnf_* tables). Note that multiple configurations of the same thing may exist for a given stage. This is allowed for cases where the reconfiguration is believed not to affect the data (for example, a slice can fail and be restarted on a different node). A sketch of this lookup path follows below.
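A minimal sketch of that lookup path: a ProcessingHistory instance records which stages ran and when, and each stage's time window selects the overlapping rows from the prv_cnf_* tables. The table names follow the schema above, but the columns and values are illustrative assumptions:

```python
from datetime import datetime

# Stage runs recorded for a shared ProcessingHistory (illustrative values).
prv_Stage2ProcHistory = [
    {"procHistoryId": 42, "stageId": 7,
     "stageStart": datetime(2014, 3, 17, 4, 0), "stageEnd": datetime(2014, 3, 17, 4, 5)},
]

# Node configuration history (illustrative values).
prv_cnf_Node = [
    {"nodeId": 1, "validityBegin": datetime(2014, 3, 1), "validityEnd": datetime(2014, 4, 1),
     "hostname": "archive-node-01"},
    {"nodeId": 1, "validityBegin": datetime(2014, 4, 1), "validityEnd": datetime(2014, 5, 1),
     "hostname": "archive-node-02"},
]

def configurations_for(proc_history_id: int):
    """For each stage run under this ProcessingHistory, return node configurations whose
    validity windows overlap the stage's time window."""
    hits = []
    for stage in prv_Stage2ProcHistory:
        if stage["procHistoryId"] != proc_history_id:
            continue
        for cnf in prv_cnf_Node:
            if cnf["validityBegin"] < stage["stageEnd"] and stage["stageStart"] < cnf["validityEnd"]:
                hits.append((stage["stageId"], cnf["hostname"]))
    return hits

print(configurations_for(42))   # -> [(7, 'archive-node-01')]
```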

Sample database schemas shown here:
- Image
- Calibration
- Provenance
Each of these illustrates the vast collection of system parameters that are used to track the state of the end-to-end data system: what was the state of the telescope & camera during an image exposure, what were the values of the numerous calibration parameters, and what was the state & version of all relevant hardware and software components in the pipeline during data processing and ingest.
Solution: Track all of these system parameters (provenance) with a single (unique) Processing_ID.

Overview: LSST Sky Survey and Data System

Data Management System Architecture:
• Multiple, specialized processing sites:
  – Mountain/Base for real-time data reduction
  – Archive Center for non-real-time data reduction, archival data storage, and data releases
  – Data Access Centers for external data access

Architecture diagram: Common Pipeline Components. Components shown include the Data Acquisition Infrastructure; the Image Processing, Detection, Association, Moving Object, Classification, Calibration, and Deep Detect Pipelines; Alert Processing; the Image Archive, Source Catalog, Object Catalog, Deep Object Catalog, Orbit Catalog, Alert Archive, Calibration Data, and Engineering/Facility Data Archive; VO-Compliant Interface Middleware; and User Tools (Query, Data Quality Analysis, Monitoring).
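One plausible chaining of the nightly pipelines named in the diagram, shown purely as a conceptual sketch; every function and data structure here is hypothetical and is not the LSST Data Management API:

```python
# Conceptual flow: Image Processing -> Detection -> Association -> Alert Processing.
def image_processing(raw_visit):
    """Calibrate the image pair and produce a difference image (placeholder)."""
    return {"visit": raw_visit, "difference_image": "diff"}

def detection(processed):
    """Detect sources on the difference image (placeholder values)."""
    return [{"diaSourceId": i, "flux": 100.0 + i} for i in range(3)]

def association(dia_sources):
    """Match detections against the existing Object catalog (placeholder matching)."""
    return [{"objectId": 9000 + s["diaSourceId"], **s} for s in dia_sources]

def alert_processing(associated):
    """Emit a VOEvent-style alert for everything that changed."""
    return [f"VOEvent: object {a['objectId']} changed (flux={a['flux']})" for a in associated]

alerts = alert_processing(association(detection(image_processing("visit-000123"))))
print("\n".join(alerts))
```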