



LSST: Preparing for the Data Avalanche through Partitioning, Parallelization, and Provenance
Kirk Borne (Perot Systems Corporation / NASA GSFC and George Mason University / LSST)

The LSST research and development effort is funded in part by the National Science Foundation under Scientific Program Order No. 9 (AST-0551161) through Cooperative Agreement AST-0132798. Additional funding comes from private donations, in-kind support at Department of Energy laboratories and other LSSTC Institutional Members.

National Optical Astronomy Observatory
Research Corporation
The University of Arizona
University of Washington
Brookhaven National Laboratory
Harvard-Smithsonian Center for Astrophysics
Johns Hopkins University
Las Cumbres Observatory, Inc.
Lawrence Livermore National Laboratory
Stanford Linear Accelerator Center
Stanford University
The Pennsylvania State University
University of California, Davis
University of Illinois at Urbana-Champaign

ABSTRACT:  The Large Synoptic Survey Telescope (LSST) project will produce 30 terabytes of data daily for 10 years, resulting in a 65-petabyte final image data archive and a 70-petabyte final catalog (metadata) database. This large telescope will begin science operations in 2014 at Cerro Pachon in Chile. It will operate with the largest camera in use in astronomical research: 3 gigapixels, covering 10 square degrees, roughly 3000 times the coverage of one Hubble Space Telescope image. One co-located pair of 6-gigabyte sky images is acquired, processed, and ingested every 40 seconds.  Within 60 seconds, notification alerts for all objects that are dynamically changing (in time or location) are sent to astronomers around the world.  We expect roughly 100,000 such events every night.  Each spot on the available sky will be re-imaged in pairs approximately every 3 days, resulting in about 2000 images per sky location after 10 years of operations (2014-2024).  The processing, ingest, storage, replication, query, access, archival, and retrieval functions for this dynamic data avalanche are currently being designed and developed by the LSST Data Management (DM) team, under contract from the NSF. Key challenges to success include:  the processing of this enormous data volume, real-time database updates and alert generation, the dynamic nature of every entry in the object database, the complexity of the processing and schema, the requirements for high availability and fast access, spatial-plus-temporal indexing of all database entries, and the creation and maintenance of multiple versions and data releases.  To address these challenges, the LSST DM team is implementing solutions that include database partitioning, parallelization, and provenance (generation and tracking).  The prototype LSST database schema currently has over 100 tables, including catalogs for sources, objects, moving objects, image metadata, calibration and configuration metadata, and provenance. Techniques for managing this database must satisfy intensive scaling and performance requirements.  These techniques include data and index partitioning, query partitioning, parallel ingest, replication of hot-data, horizontal scaling, and automated fail-over.  In the area of provenance, the LSST database will capture all information that is needed to reproduce any result ever published.  Provenance-related data include: telescope/camera instrument configuration; software configuration (software versions, policies used); and hardware setup (configuration of nodes used to run LSST software).  Provenance is very dynamic, in the sense that the metadata to be captured change frequently.  The schema has to be flexible to allow that.  In our current design, over 30% of the schema is dedicated to provenance.  Our philosophy is this: (1) minimize the impact of reconfiguration by avoiding tight coupling between data and provenance: hardware and software configurations are correlated with science data via a single ProcessingHistory_id; and (2) minimize the volume of provenance information by grouping together objects with identical processing history.

LSST = Large Synoptic Survey Telescope

8.4-meter diameter primary mirror = 10 square degrees!

(design, development, construction, and operations of telescope & observatory funded by NSF)

(mirror funded by private donors)


LSST Camera = 201 CCDs @ 4096x4096 pixels each! = 3 gigapixels = 6 GB per image, covering 10 sq. degrees = ~3000 times the area of one Hubble Telescope image

(camera funded by DOE)

Focal Plane Array scale model
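A quick back-of-envelope check of the camera numbers above (the 16-bit-per-pixel assumption is mine, for illustration, not a stated specification):

```python
# Rough check of the quoted camera figures. Assumes 2 bytes (16 bits) per pixel,
# which is an assumption for illustration, not a stated specification.
ccds = 201
pixels_per_side = 4096
bytes_per_pixel = 2

total_pixels = ccds * pixels_per_side ** 2
print(f"total pixels  : {total_pixels / 1e9:.1f} gigapixels")            # ~3.4 ("3 gigapixels")
print(f"raw image size: {total_pixels * bytes_per_pixel / 1e9:.1f} GB")  # ~6.7 ("6 GB per image")
```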

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (2014-2024), with time-domain sampling in log(time) intervals (to capture the dynamic range of transients).
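The "log(time) sampling" idea can be illustrated with a short sketch: revisit intervals spaced logarithmically between a 40-second image pair and the 10-year survey span cover many decades of transient timescale. This is a conceptual illustration only, not the actual LSST cadence algorithm:

```python
import numpy as np

# Illustrative log-spaced revisit intervals, from one 40-second image pair up to the
# full 10-year survey. The real LSST scheduler is far more sophisticated.
t_min = 40.0                  # seconds (one image pair)
t_max = 10 * 365.25 * 86400   # seconds (10-year survey)

for dt in np.logspace(np.log10(t_min), np.log10(t_max), num=12):
    print(f"revisit interval ~ {dt / 86400:10.4f} days")
```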

Data Products
• Image Archive (65 Petabytes! after 10 years)
  – 2000 visits for each 10-sq.-degree patch of sky
  – 2000 patches in the viewable sky
  – 30 Terabytes per night (~2000 images)
• Object, Moving Object, & Source catalogs
• Alert Notifications:
  – VOEvent message protocol
  – 100,000 alerts per night
  – Anything that has changed (moving or optically variable)
• Full project database (70 Petabytes!)
• Uniformly processed data releases (annual)
• Deep co-added image of the sky:
  – Individual images reach 24th magnitude
  – Deep stacked image reaches 27th magnitude (a rough consistency check follows below)
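The quoted single-visit and stacked depths are roughly consistent with simple stacking arithmetic. The sketch below assumes background-limited coadds (depth improves as 2.5*log10(sqrt(N))) and that the ~2000 visits per patch are split roughly evenly among 6 filters; both assumptions are mine, for illustration only:

```python
import math

# Back-of-envelope check: coadding N equal-depth, background-limited images gains
# about 2.5*log10(sqrt(N)) magnitudes of depth. Assumes ~2000 visits split evenly
# over 6 filters (my assumption, not a poster value).
single_visit_depth = 24.0
visits_per_patch = 2000
filters = 6

n_per_filter = visits_per_patch / filters
gain = 2.5 * math.log10(math.sqrt(n_per_filter))
print(f"stacked depth per filter ~ {single_visit_depth + gain:.1f} mag")  # ~27.2 mag
```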

Database Contents
• >100 database tables:
  – Source catalog
  – Object catalog
  – Moving Object catalog
  – Variable Object catalog
  – Alerts catalog
  – Calibration metadata
  – Configuration metadata
  – Processing metadata
  – Provenance metadata
  – etc.

Source
  – 260 billion rows*
  – 2,000 partitions*
  – 306 bytes/row
  – 1 row = data for 1 filter

Object
  – 22 billion rows*
  – 2,000 partitions*
  – 1.8 KB/row
  – 1 row = data for 6 filters

Image Metadata
  – 675 million rows*
  – 1 row = metadata for 1 CCD amplifier

* as of Data Release 1 (DR1), 2014
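To make these row counts concrete, here is a small back-of-envelope estimate of the raw table sizes and per-partition row counts they imply (my arithmetic only; it ignores indexes, replication, and storage overheads):

```python
# Rough sizes implied by the DR1 row counts quoted above.
catalogs = {
    "Source":         (260e9, 306),   # rows, bytes per row
    "Object":         (22e9, 1800),
    "Image Metadata": (675e6, None),  # bytes/row not quoted on the poster
}
partitions = 2000

for name, (rows, bytes_per_row) in catalogs.items():
    if bytes_per_row is None:
        print(f"{name:15s}: {rows:.3g} rows (row size not quoted)")
        continue
    size_tb = rows * bytes_per_row / 1e12
    rows_per_partition = rows / partitions
    print(f"{name:15s}: ~{size_tb:,.0f} TB raw, ~{rows_per_partition:,.0f} rows per partition")
```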

Data Management System Layered Architecture: for scalability, reliability, and evolution

Application Layer (Scientific Layer; Object Oriented, C++; Custom Software)
• Data Products: Source Catalog, Object Catalog, Images, Alerts
• Pipelines: Image Processing, Detection, Association, Moving Object, Classification, Calibration
• Application Framework: Image, Astronomical Object, Catalog, Collection, Table, Meta-Data, Component, Processing Stage, Processing Slice

Middleware Layer (Portability, Standard Services; Open Source, Off-the-shelf Software; Custom Integration)
• Data Access: Data Access Framework, Distributed File System, Database Management System
• Distributed Processing: Pipeline Construction, Pipeline Execution, Management and Control
• System Administration, Operations, Security: System Resource Management, User Management, Certificate-based Security
• User Interface: Data Quality Visualization, VO Interfaces

Infrastructure Layer (Distributed Platform; Off-the-shelf, Commercial Hardware/Software)
• Computing: Clusters/Servers, Operating System
• Communications: Fiber, Switches, Routers, Firewalls, Communications Stacks, Network Management Software
• Physical Plant: Power, Cooling, Space
• Storage: Disk, Tape, Controllers, Storage Management Software

Database System Design
Meeting Massive Data Management Challenges: Parallelization, Partitioning, and Providing Virtual Data through Provenance
• Pipeline processing:
  – CPU and Data Parallelization
• Database access:
  – Sky Partitioning (see the partitioning sketch after this list):
    • Spatial clustering of the Source and Object Catalogs
    • Temporal partitioning of the DIASource catalog
• Database volume:
  – Provenance (>30% of the current DB schema, but <1% of the DB volume)
  – Without provenance unique identifiers (see below), the DB volume for tracking provenance (at 1-second time resolution for 10 years) would be at least 10x larger.
• Why Provenance?
  – to keep the data volume "manageable"
  – maintaining provenance allows us to discard intermediate data products and rebuild them at will on-the-fly: VIRTUAL DATA!
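A minimal sketch of the two partitioning keys described above: a spatial partition id that clusters nearby Objects/Sources together, and a temporal partition key for DIASource rows. The specific chunking scheme and column names here are illustrative assumptions, not the actual LSST partitioner:

```python
import math
from datetime import datetime

N_DEC_ZONES = 50   # ~50 declination zones x ~40 RA chunks gives on the order of 2,000 partitions

def sky_partition(ra_deg: float, dec_deg: float) -> int:
    """Map (RA, Dec) to a spatial partition id so that nearby objects share a partition."""
    zone = min(int((dec_deg + 90.0) / 180.0 * N_DEC_ZONES), N_DEC_ZONES - 1)
    # use fewer RA chunks near the poles so chunks stay roughly equal-area
    dec_mid = (zone + 0.5) / N_DEC_ZONES * 180.0 - 90.0
    n_ra_chunks = max(1, round(40 * math.cos(math.radians(dec_mid))))
    ra_chunk = int(ra_deg / 360.0 * n_ra_chunks) % n_ra_chunks
    return zone * 1000 + ra_chunk

def diasource_partition(obs_time: datetime) -> str:
    """Temporal partition key for the DIASource catalog (one partition per month here)."""
    return obs_time.strftime("%Y-%m")

print(sky_partition(ra_deg=210.8, dec_deg=-32.5))          # e.g. 15019
print(diasource_partition(datetime(2015, 3, 17, 4, 23)))   # '2015-03'
```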

• Implementing Provenance:
  – A unique Processing_ID is used throughout the DB. This 4-byte Processing_ID uniquely identifies the full hardware & software configuration (telescope, instrument, processing pipeline version, processing parameters, etc.) for each source measurement (roughly 100 million sources per image pair, with 1000 image pairs per night, every night).
  – Minimize the impact of reconfiguration by avoiding tight coupling between data and provenance: hardware and software configurations are correlated with data via a single Processing_ID, which is included in all major DB tables (Object, Source, ...).
  – Minimize the volume of provenance information by grouping together objects with identical processing history (see the sketch below).
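A minimal sketch of the grouping idea: every Source/Object row carries only a small Processing_ID, and the full configuration is stored once per distinct processing history rather than once per row. Table and field names here are illustrative, not the actual LSST schema:

```python
# Sources processed with an identical configuration share one Processing_ID, so the
# provenance tables stay tiny even though the Source table has billions of rows.
processing_history = {}   # configuration key -> Processing_ID
next_id = 1

def get_processing_id(telescope_cfg: str, pipeline_version: str, params: tuple) -> int:
    """Return the shared Processing_ID for this exact configuration, creating one if new."""
    global next_id
    key = (telescope_cfg, pipeline_version, params)
    if key not in processing_history:
        processing_history[key] = next_id
        next_id += 1
    return processing_history[key]

sources = []
for source_id in range(5):
    pid = get_processing_id("telescope-cfg-2014-03-17", "pipeline-2.6.1", ("psf=gauss", "thresh=5.0"))
    sources.append({"sourceId": source_id, "processingId": pid})

print(sources)             # every row shares processingId = 1
print(processing_history)  # a single provenance entry covers all of them
```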

Sample DB Table tracking provenance history
• The validity of a given set of parameters (hardware, software, processing, telescope, or instrument) for any database entry (source, object, etc.) is determined by the "validity" time for that unique provenance identifier (illustrated in the sketch below).
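A small illustration of the validity-time idea, assuming a configuration table whose rows carry validity windows (the column names and values are hypothetical):

```python
from datetime import datetime

# The configuration in force for a measurement is the row whose validity window
# contains the measurement time. Columns and values here are illustrative only.
prv_cnf_filter = [
    {"filterId": 3, "validityBegin": datetime(2014, 1, 1), "validityEnd": datetime(2015, 6, 1),
     "throughput": "r-band-v1"},
    {"filterId": 3, "validityBegin": datetime(2015, 6, 1), "validityEnd": datetime(2025, 1, 1),
     "throughput": "r-band-v2"},
]

def config_at(rows, filter_id, when):
    """Return the configuration row valid for this filter at time `when`."""
    for row in rows:
        if row["filterId"] == filter_id and row["validityBegin"] <= when < row["validityEnd"]:
            return row
    return None

print(config_at(prv_cnf_filter, 3, datetime(2015, 3, 17)))   # -> the "r-band-v1" row
```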

• Sample Table: Provenance of Pipeline Run Configuration

UML class diagram: Image Metadata (Package: Main Telescope, Version 2.6.1, Author: Jacek Becla). Classes include FPA_Exposure, CCD_Exposure, Amp_Exposure, Science_FPA_Exposure, Calibration_FPA_Exposure, Visit, TemplateImage, PSF, the WCS/PSF/PhotoCal tables at amplifier, CCD, and focal-plane level (Amp_WCS, Amp_PSF, Amp_PhotoCal, CCD_WCS, CCD_PSF, CCD_PhotoCal, FPA_WCS, FPA_PSF, FPA_PhotoCal), and the related prv_Filter, prv_Amplifier, and prv_cnf_MaskAmpImage provenance tables.

UML class diagram: Calibration (Package: Main Telescope, Version 2.6.1, Author: Jacek Becla). Classes include FPA_Exposure, Science_FPA_Exposure, the Bias, Dark, Flat, and Fringe calibration exposures (Bias_FPA_Exposure, Dark_FPA_Exposure, Flat_FPA_Exposure) and their CMExposure counterparts (Bias_FPA_CMExposure, Dark_FPA_CMExposure, Flat_FPA_CMExposure, Fringe_FPA_CMExposure), plus the association tables linking them (_FPA_Bias2CMExposure, _FPA_Dark2CMExposure, _FPA_Flat2CMExposure, _FPA_Fringe2CMExposure, _Science_FPA_Exposure_Group).

UML class diagram: Hardware Provenance (Package: Provenance, Version 2.6.1, Author: Jacek Becla). Classes include prv_Telescope, prv_FocalPlane, prv_Raft, prv_CCD, prv_Amplifier, and prv_Filter, each paired with a prv_cnf_* configuration-history table (prv_cnf_Telescope, prv_cnf_FocalPlane, prv_cnf_Raft, prv_cnf_CCD, prv_cnf_Amplifier, prv_cnf_Filter), plus prv_cnf_MaskAmpImage and its link to Amp_Exposure.

UML class diagram: Software Provenance (Package: Provenance, Version 2.6.1, Author: Jacek Becla). Classes include prv_Run, prv_Pipeline, prv_Stage, prv_Slice, prv_Node, prv_Policy, prv_UpdatableTable, and prv_UpdatableColumn, together with their association tables (prv_Pipeline2Run, prv_Stage2Pipeline, prv_Stage2Slice, prv_Stage2UpdatableColumn) and the corresponding prv_cnf_* configuration tables.

UML class diagram: Provenance (Package: Provenance, Version 2.6.1, Author: Jacek Becla). Classes include prv_ProcHistory, prv_Stage, prv_Stage2ProcHistory, and prv_Snapshot, with procHistoryId referenced from the Object, Source, DIASource, MovingObject, Amp_Exposure, CCD_Exposure, and FPA_Exposure tables.

Objects/sources/exposures processed using the same stages can share one instance of ProcessingHistory. Each instance of ProcessingHistory keeps track of the stages that were run as part of that ProcessingHistory. It also keeps track of the time window during which each stage ran. This time window can then be used to locate the appropriate configuration of every configurable piece of hardware and software (the prv_cnf_* tables). Note that multiple configurations of the same thing may exist for a given stage. This is allowed for cases where the reconfiguration is believed not to affect the data (for example, a slice can fail and be restarted on a different node). A sketch of this lookup path follows below.
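A minimal sketch of that lookup path: a ProcessingHistory instance records which stages ran and when, and each stage's time window selects the overlapping rows from the prv_cnf_* tables. The table names follow the schema above, but the columns and values are illustrative assumptions:

```python
from datetime import datetime

# Stage runs recorded for a shared ProcessingHistory (illustrative values).
prv_Stage2ProcHistory = [
    {"procHistoryId": 42, "stageId": 7,
     "stageStart": datetime(2014, 3, 17, 4, 0), "stageEnd": datetime(2014, 3, 17, 4, 5)},
]

# Node configuration history (illustrative values).
prv_cnf_Node = [
    {"nodeId": 1, "validityBegin": datetime(2014, 3, 1), "validityEnd": datetime(2014, 4, 1),
     "hostname": "archive-node-01"},
    {"nodeId": 1, "validityBegin": datetime(2014, 4, 1), "validityEnd": datetime(2014, 5, 1),
     "hostname": "archive-node-02"},
]

def configurations_for(proc_history_id: int):
    """For each stage run under this ProcessingHistory, return node configurations whose
    validity windows overlap the stage's time window."""
    hits = []
    for stage in prv_Stage2ProcHistory:
        if stage["procHistoryId"] != proc_history_id:
            continue
        for cnf in prv_cnf_Node:
            if cnf["validityBegin"] < stage["stageEnd"] and stage["stageStart"] < cnf["validityEnd"]:
                hits.append((stage["stageId"], cnf["hostname"]))
    return hits

print(configurations_for(42))   # -> [(7, 'archive-node-01')]
```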

Sample database schemas shown here:
- Image
- Calibration
- Provenance
Each of these illustrates the vast collection of system parameters that are used to track the state of the end-to-end data system: what was the state of the telescope & camera during an image exposure, what were the values of the numerous calibration parameters, and what was the state & version of all relevant hardware and software components in the pipeline during data processing and ingest.
Solution: Track all of these system parameters (provenance) with a single (unique) Processing_ID.

Overview: LSST Sky Survey and Data System

Data Management System Architecture:
• Multiple, specialized processing sites:
  – Mountain/Base for real-time data reduction
  – Archive Center for non-real-time data reduction, archival data storage, and data releases
  – Data Access Centers for external data access

Architecture diagram: Common Pipeline Components. Components shown include the Data Acquisition Infrastructure; the Image Processing, Detection, Association, Moving Object, Classification, Calibration, and Deep Detect Pipelines; Alert Processing; the Image Archive, Source Catalog, Object Catalog, Deep Object Catalog, Orbit Catalog, Alert Archive, Calibration Data, and Engineering/Facility Data Archive; VO-Compliant Interface Middleware; and User Tools (Query, Data Quality Analysis, Monitoring).
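One plausible chaining of the nightly pipelines named in the diagram, shown purely as a conceptual sketch; every function and data structure here is hypothetical and is not the LSST Data Management API:

```python
# Conceptual flow: Image Processing -> Detection -> Association -> Alert Processing.
def image_processing(raw_visit):
    """Calibrate the image pair and produce a difference image (placeholder)."""
    return {"visit": raw_visit, "difference_image": "diff"}

def detection(processed):
    """Detect sources on the difference image (placeholder values)."""
    return [{"diaSourceId": i, "flux": 100.0 + i} for i in range(3)]

def association(dia_sources):
    """Match detections against the existing Object catalog (placeholder matching)."""
    return [{"objectId": 9000 + s["diaSourceId"], **s} for s in dia_sources]

def alert_processing(associated):
    """Emit a VOEvent-style alert for everything that changed."""
    return [f"VOEvent: object {a['objectId']} changed (flux={a['flux']})" for a in associated]

alerts = alert_processing(association(detection(image_processing("visit-000123"))))
print("\n".join(alerts))
```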