Components of a Data Analysis System
Scientific Drivers in the Design of an Analysis System
Data Import
• Format
  – Either widely used/accepted, or
  – Can be converted easily from something widely used
  – User need not know the details of the format
  – Well documented (e.g., which flavor of latitude)
• Fast Access
  – Disk I/O speeds do not follow Moore’s law
  – Read speed is more important than write speed
  – Caching
  – File size is only important to keep access times low
• Content must represent the details of the data
• E2E – the full intent of the observer must be embedded
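A minimal sketch of the caching and read-speed points, assuming a hypothetical `read_scan(path, scan_number)` accessor (the name and the whole-file read are illustrative, not part of any actual system):

```python
from functools import lru_cache

# Hypothetical cached read layer: since disk read speed dominates,
# recently used scans are kept in memory (maxsize bounds the cache,
# so file size matters only for access time, not memory).
@lru_cache(maxsize=32)
def read_scan(path, scan_number):
    # A real import layer would seek to and parse just this scan;
    # this sketch reads the whole file as a stand-in.
    with open(path, "rb") as f:
        return f.read()
```

Repeated access to the same scan then costs a dictionary lookup instead of a disk read.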
Data Export
• Format
  – Either widely used/accepted, or
  – Can be converted easily into something widely used
  – User need not know the details of the format
  – Well documented (e.g., which flavor of latitude)
• You can read what you write
  – Import format == Export format
• Fast Access
  – Disk I/O speeds do not follow Moore’s law
  – Read speed is more important than write speed
• Content must represent the details of the data
• E2E – the full intent of the observer must be embedded
• Includes user annotation/comments
Data Base System
• Ability to work with more than one data set
• Data base for both export and import files
• Large data volumes
  – Access using scan numbers is no longer sufficient
  – Require the ability to select subsets of data via sophisticated data-base queries
  – Moderate number of columns in data-base index
  – ‘Index’ to data kept in memory to speed data access
  – File summaries at various levels of detail
• Various levels of ‘granularity’
• Calibrated and raw data
• E2E – user can add annotation/comments
• Security – only the observer can access data
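One hedged way to picture "select subsets via queries, not scan numbers" is an in-memory index with a moderate number of columns, queried with SQL (the table layout and column names here are purely illustrative):

```python
import sqlite3

# Hypothetical in-memory index of scans: a few descriptive columns,
# selected with SQL rather than by raw scan number.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE scans (
    scan INTEGER, source TEXT, rest_freq REAL, tsys REAL, mjd REAL)""")
rows = [
    (1, "W3OH",   1665.4,  25.0, 59000.1),
    (2, "W3OH",   1667.4,  60.0, 59000.2),
    (3, "OrionA", 23694.5, 30.0, 59000.3),
]
conn.executemany("INSERT INTO scans VALUES (?,?,?,?,?)", rows)

# Select a subset by source and system temperature.
subset = conn.execute(
    "SELECT scan FROM scans WHERE source = ? AND tsys < ?",
    ("W3OH", 40.0)).fetchall()
print(subset)  # [(1,)]
```

Keeping the index small and separate from the bulk data is what lets it live in memory.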
Data Archive
• Write speed more important than read speed
• File size is very important
• Cannot anticipate types of user queries
  – Large number of columns in data-base index
  – Very sophisticated/fast RDBMS
• Storage need not be a widely used data format
  – Format can be very different from that used by the analysis system
• Export format should be a widely used data format
Interactive On-Line Data Analysis
• The ability to access data ASAP
  – Import file updates automatically as observations proceed (real-time “filler”)
  – Index to file updates automatically
  – Updates happen per ‘integration’ (spectral line) or per N seconds (continuum)
  – Minimum integration time ~ a few times the minimum cycle time of the real-time “filler”
  – Analysis system is automatically aware of the updated index
  – Read-protect online/filled data?
• User should be able to ‘see’ the data within an ‘integration’ of when it was taken (or N seconds)
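A hedged sketch of the filler/index interaction: a polling step that extends an in-memory index as the import file grows, one entry per integration (the `update_index` name and fixed-size record layout are assumptions for illustration):

```python
import os

# Hypothetical real-time "filler" index update: each integration is
# assumed to be a fixed-size record appended to the import file, and
# the in-memory index records the byte offset of each integration.
def update_index(path, index, record_size):
    nrec = os.path.getsize(path) // record_size
    for i in range(len(index), nrec):
        index.append(i * record_size)
    return index
```

Run once per integration (or per N seconds for continuum), this keeps the analysis system aware of new data without rescanning the whole file.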
User Interface
• Command line
  – Familiar syntax is better than a good syntax
  – Procedural, with byte-code compilation (performance)
  – History, minimum-match, or command completion
  – Useful error messages
  – Interruptible
  – Error trapping and exception handling
  – Ability to “Undo”
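Minimum-match means a command may be abbreviated to any unambiguous prefix. A minimal sketch (the resolver and command names are illustrative, not from any particular package):

```python
# Hypothetical minimum-match resolver: accept any unambiguous prefix
# of a known command, reject ambiguous or unknown abbreviations.
def min_match(abbrev, commands):
    hits = [c for c in commands if c.startswith(abbrev)]
    if len(hits) == 1:
        return hits[0]
    raise ValueError(
        f"{abbrev!r} is " + ("ambiguous" if hits else "unknown"))

commands = ["baseline", "baseunit", "gauss", "getscan"]
print(min_match("ga", commands))  # gauss
```

Here "ga" resolves to gauss, while "ba" would be rejected as ambiguous between baseline and baseunit.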
User Interface
• GUIs are best for:
  – Interacting with data visualizations
  – Filling in forms
    • Data-base queries
    • Options for data pipelines
  – Browsing for data files
  – Defining E2E data flow (à la LabVIEW)
Imaging Tools
• Visualization
  – Shouldn’t try to recreate things already available in another package – export instead
• Data Flagging
  – Pick a system that works
• Graphics
  – Traditional capabilities (zoom in/out, scroll, print, save, …)
  – Data volume requires great performance and smart libraries (screen resolution << # data pts)
  – Interactive feedback (e.g., defining baseline regions)
• Publishable plots, or export into something else?
  – Default plot style
  – Ability to tweak everything (label formats; character sizes; add, remove, move annotation; tick-mark size; major/minor ticks; full box; grid; multiple X and Y axes; …)
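Since screen resolution is far below the number of data points, one common trick a "smart" plotting library might use is min/max decimation per pixel bin, which preserves spikes that naive subsampling would drop (this `decimate` helper is a sketch, not any library's actual API):

```python
# Hypothetical min/max decimation: keep only the extremes of each
# pixel-sized bin, so narrow spectral features still reach the screen.
def decimate(y, nbins):
    step = max(1, len(y) // nbins)
    out = []
    for i in range(0, len(y), step):
        chunk = y[i:i + step]
        out.append(min(chunk))
        out.append(max(chunk))
    return out
```

A million-point spectrum reduces to a few thousand plotted values with no visible loss.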
Analysis Algorithms
• Algorithms well documented
• Study what exists in other packages
• Robustness is very important, but so is speed
  – Provide less robust but faster alternatives
• Developers should not force an algorithm on users
• Developers should provide ‘defaults’ only
• Building blocks are better than a do-all algorithm
• Ability to use and modify ‘header’ information as well as data
• E2E – do-alls are built out of the same building blocks
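The building-block idea can be sketched as small, independently usable steps that a "do-all" merely chains together (function names and the trivial algorithms inside them are illustrative):

```python
# Hypothetical building blocks composed into a reduction chain,
# rather than one monolithic do-all algorithm.
def remove_baseline(spec):
    # Simplest case: subtract a zeroth-order (mean) baseline.
    mean = sum(spec) / len(spec)
    return [v - mean for v in spec]

def smooth(spec, width=3):
    # Boxcar smooth, truncated at the spectrum edges.
    half = width // 2
    out = []
    for i in range(len(spec)):
        chunk = spec[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def reduce_spectrum(spec, steps):
    # A 'do-all' is just a sequence of building blocks, so users can
    # reorder, replace, or drop steps instead of fighting a monolith.
    for step in steps:
        spec = step(spec)
    return spec
```

The same blocks serve interactive use, user-written procedures, and the E2E pipeline.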
Documentation
• On-line and hardcopy
  – Tutorials/Quick Guides
  – Cookbooks
    • Based on observing types
  – Reference Manuals
    • Full, gory details
    • Data formats
    • Algorithms
  – Searchable by keywords
• Quick, interactive command help from within the system
• Never release until these are in place
User Support/Feedback
• A familiar system minimizes staff support
• Easily accessed, on-line “help desk” and “suggestion” box
• Automatic generation of “bug” reports
• Observers of observers
Marketing
• A familiar system already has a market
• Don’t be another cereal on the supermarket shelf
• Workshops are better than papers
• Create a user community
• Responsive feedback from developers
• Independent beta testers
• Reputation & first experiences are everything
User Community
• User forums
• Newsletters
• Accept user contributions/additions
  – SourceForge-like system
  – NRAO seal of approval
• NRAO moderator
Real-Time Data Display
• To guarantee data quality
  – Product is not stored (except for hardcopy)
  – Sequential processing – different from E2E/data pipeline
  – Fast is more important than accurate
  – Few bells and whistles – must avoid the RTD black hole
  – A simple display for all observation types is more important than sophisticated displays for a few data types
• Display happens within an ‘integration’ of when data were taken – tied to the real-time filler
• GUI based – the underlying language is unimportant
• Output understandable by an operator
Real-Time Data Analysis
• Pointing/Focus/Tipping/… are different from RTD
  – Results should be stored (data base)
  – Results are used by the control system (pointing/focus) or by subsequent analysis (tipping)
  – Accuracy is as important as speed
  – More bells, whistles, user options
  – Sequential processing (non-E2E/data pipeline)
  – Only a few observation types are handled
• Analysis happens within an ‘integration’ of when data were taken
• GUI based – the underlying language is unimportant
• Output understandable by an operator
IDL Work Package
• SDFITS
  – Interim solution for data import/export
  – CLASS/IDL specific; soon AIPS++/AIPS/UniPOPS?
  – MD/BDFITS next generation (keywords, incompleteness of contents, versatility, …)
• IDL – Tom Bania
  – Uses UniPOPS as a ‘model’ – familiar to many
  – Very good reproduction
  – Bania-centric – needs to be generalized
IDL Work Package
• Glen Langston
  – Assess whether IDL will meet performance, extensibility, usability, … goals
  – Generalization to other observing types
  – Real-time data access and display
  – Developed on top of and in parallel with Tom’s work (so, implementations have diverged)
  – Works well for Glen’s own experiments
IDL Work Package
• Institutionalize what Tom and Glen have done
  – Code management
  – Code review
  – Combine Tom’s and Glen’s branches
  – Generalize code
  – Provide ways for Tom and Glen to contribute within the same revision-control branch
• Develop ‘institutionalized’ code
  – Improve performance, usability, maintenance
  – Add/replace I/O components with better CS methods
Calibration Work Package
• User-tunable algorithms
  – Options for the ‘real-time filler’ – sequential
  – Options for the E2E pipeline – non-sequential
  – Options for interactive data reduction
• Default algorithms for all observing cases
• Extensible as new algorithms are developed
• User-defined/tweaked algorithms
• Robust and not-so-robust algorithms
Calibration Work Package
• Opacity/atmosphere model
• Output units
• Efficiencies
  – Source size
  – Telescope model
• Tsys(f) estimates
• Differencing schemes
• Non-linearities/template fitting/…
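As one hedged illustration of the differencing schemes and Tsys estimates above, a common single-dish scheme is position switching, T_A = T_sys * (ON - OFF) / OFF per channel, with T_sys estimated from a noise diode of known T_cal. The function names and the exact diode formula here are illustrative, not the adopted design:

```python
# Hypothetical calibration building blocks (values illustrative).
def tsys(cal_on, cal_off, t_cal):
    # Channel-averaged system temperature from diode on/off spectra:
    # T_sys ~ T_cal * sum(OFF) / sum(ON - OFF).
    return t_cal * sum(cal_off) / sum(
        a - b for a, b in zip(cal_on, cal_off))

def position_switch(on, off, t_sys):
    # Per-channel antenna temperature: T_sys * (ON - OFF) / OFF.
    return [t_sys * (a - b) / b for a, b in zip(on, off)]
```

Both the robust (full Tsys(f)) and faster (scalar Tsys) variants could share this structure, matching the robust/not-so-robust pairing above.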