Components of a Data Analysis System
Scientific Drivers in the Design of an Analysis System
Data Import
• Format
  – Either widely used/accepted, or
  – Can be converted easily from something widely used
  – User need not know the details of the format
  – Well documented (e.g., which flavor of latitude)
• Fast Access
  – Disk I/O speeds do not follow Moore’s law
  – Read speed is more important than write speed
  – Caching
  – File size is only important to keep access times low
• Content must represent the details of the data
• E2E – the full intent of the observer must be embedded
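A minimal sketch of the caching and read-speed points, assuming a hypothetical `read_scan(path, scan_number)` accessor (the name and the whole-file read are illustrative, not part of any actual system):

```python
from functools import lru_cache

# Hypothetical cached read layer: since disk read speed dominates,
# recently used scans are kept in memory (maxsize bounds the cache,
# so file size matters only for access time, not memory).
@lru_cache(maxsize=32)
def read_scan(path, scan_number):
    # A real import layer would seek to and parse just this scan;
    # this sketch reads the whole file as a stand-in.
    with open(path, "rb") as f:
        return f.read()
```

Repeated access to the same scan then costs a dictionary lookup instead of a disk read.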
Data Export
• Format
  – Either widely used/accepted, or
  – Can be converted easily into something widely used
  – User need not know the details of the format
  – Well documented (e.g., which flavor of latitude)
• You can read what you write
  – Import format == Export format
• Fast Access
  – Disk I/O speeds do not follow Moore’s law
  – Read speed is more important than write speed
• Content must represent the details of the data
• E2E – the full intent of the observer must be embedded
• Includes user annotation/comments
Data Base System
• Ability to work with more than one data set
• Data base for both export and import files
• Large data volumes
  – Access using scan numbers is no longer sufficient
  – Require the ability to select subsets of data via sophisticated data-base queries
  – Moderate number of columns in data-base index
  – ‘Index’ to data kept in memory to speed data access
  – File summaries at various levels of detail
• Various levels of ‘granularity’
• Calibrated and raw data
• E2E – user can add annotation/comments
• Security – only the observer can access data
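One hedged way to picture "select subsets via queries, not scan numbers" is an in-memory index with a moderate number of columns, queried with SQL (the table layout and column names here are purely illustrative):

```python
import sqlite3

# Hypothetical in-memory index of scans: a few descriptive columns,
# selected with SQL rather than by raw scan number.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE scans (
    scan INTEGER, source TEXT, rest_freq REAL, tsys REAL, mjd REAL)""")
rows = [
    (1, "W3OH",   1665.4,  25.0, 59000.1),
    (2, "W3OH",   1667.4,  60.0, 59000.2),
    (3, "OrionA", 23694.5, 30.0, 59000.3),
]
conn.executemany("INSERT INTO scans VALUES (?,?,?,?,?)", rows)

# Select a subset by source and system temperature.
subset = conn.execute(
    "SELECT scan FROM scans WHERE source = ? AND tsys < ?",
    ("W3OH", 40.0)).fetchall()
print(subset)  # [(1,)]
```

Keeping the index small and separate from the bulk data is what lets it live in memory.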
Data Archive
• Write speed more important than read speed
• File size is very important
• Cannot anticipate types of user queries
  – Large number of columns in data-base index
  – Very sophisticated/fast RDBMS
• Storage need not be a widely used data format
  – Format can be very different from that used by the analysis system
• Export format should be a widely used data format
Interactive On-Line Data Analysis
• The ability to access data ASAP
  – Import file updates automatically as observations proceed (real-time “filler”)
  – Index to file updates automatically
  – Updates happen per ‘integration’ (spectral line) or per N seconds (continuum)
  – Minimum integration time ~ a few times the minimum cycle time of the real-time “filler”
  – Analysis system is automatically aware of the updated index
  – Read-protect online/filled data?
• User should be able to ‘see’ the data within an ‘integration’ of when it was taken (or N seconds)
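A hedged sketch of the filler/index interaction: a polling step that extends an in-memory index as the import file grows, one entry per integration (the `update_index` name and fixed-size record layout are assumptions for illustration):

```python
import os

# Hypothetical real-time "filler" index update: each integration is
# assumed to be a fixed-size record appended to the import file, and
# the in-memory index records the byte offset of each integration.
def update_index(path, index, record_size):
    nrec = os.path.getsize(path) // record_size
    for i in range(len(index), nrec):
        index.append(i * record_size)
    return index
```

Run once per integration (or per N seconds for continuum), this keeps the analysis system aware of new data without rescanning the whole file.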
User Interface
• Command line
  – Familiar syntax is better than a good syntax
  – Procedural, with byte-code compilation (performance)
  – History, minimum-match, or command completion
  – Useful error messages
  – Interruptible
  – Error trapping and exception handling
  – Ability to “Undo”
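Minimum-match means a command may be abbreviated to any unambiguous prefix. A minimal sketch (the resolver and command names are illustrative, not from any particular package):

```python
# Hypothetical minimum-match resolver: accept any unambiguous prefix
# of a known command, reject ambiguous or unknown abbreviations.
def min_match(abbrev, commands):
    hits = [c for c in commands if c.startswith(abbrev)]
    if len(hits) == 1:
        return hits[0]
    raise ValueError(
        f"{abbrev!r} is " + ("ambiguous" if hits else "unknown"))

commands = ["baseline", "baseunit", "gauss", "getscan"]
print(min_match("ga", commands))  # gauss
```

Here "ga" resolves to gauss, while "ba" would be rejected as ambiguous between baseline and baseunit.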
User Interface
• GUIs are best for:
  – Interacting with data visualizations
  – Filling in forms
    • Data-base queries
    • Options for data pipelines
  – Browsing for data files
  – Defining E2E data flow (à la LabVIEW)
Imaging Tools
• Visualization
  – Shouldn’t try to recreate things already available in another package – export instead
• Data Flagging
  – Pick a system that works
• Graphics
  – Traditional capabilities (zoom in/out, scroll, print, save, …)
  – Data volume requires great performance and smart libraries (screen resolution << # data pts)
  – Interactive feedback (e.g., defining baseline regions)
• Publishable plots, or export into something else?
  – Default plot style
  – Ability to tweak everything (label formats; character sizes; add, remove, move annotation; tick-mark size; major/minor ticks; full box; grid; multiple X and Y axes; …)
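Since screen resolution is far below the number of data points, one common trick a "smart" plotting library might use is min/max decimation per pixel bin, which preserves spikes that naive subsampling would drop (this `decimate` helper is a sketch, not any library's actual API):

```python
# Hypothetical min/max decimation: keep only the extremes of each
# pixel-sized bin, so narrow spectral features still reach the screen.
def decimate(y, nbins):
    step = max(1, len(y) // nbins)
    out = []
    for i in range(0, len(y), step):
        chunk = y[i:i + step]
        out.append(min(chunk))
        out.append(max(chunk))
    return out
```

A million-point spectrum reduces to a few thousand plotted values with no visible loss.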
Analysis Algorithms
• Algorithms well documented
• Study what exists in other packages
• Robustness is very important, but so is speed
  – Provide less robust but faster alternatives
• Developers should not force an algorithm on users
• Developers should provide ‘defaults’ only
• Building blocks are better than a do-all algorithm
• Ability to use and modify ‘header’ information as well as data
• E2E – do-alls are built out of the same building blocks
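The building-block idea can be sketched as small, independently usable steps that a "do-all" merely chains together (function names and the trivial algorithms inside them are illustrative):

```python
# Hypothetical building blocks composed into a reduction chain,
# rather than one monolithic do-all algorithm.
def remove_baseline(spec):
    # Simplest case: subtract a zeroth-order (mean) baseline.
    mean = sum(spec) / len(spec)
    return [v - mean for v in spec]

def smooth(spec, width=3):
    # Boxcar smooth, truncated at the spectrum edges.
    half = width // 2
    out = []
    for i in range(len(spec)):
        chunk = spec[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def reduce_spectrum(spec, steps):
    # A 'do-all' is just a sequence of building blocks, so users can
    # reorder, replace, or drop steps instead of fighting a monolith.
    for step in steps:
        spec = step(spec)
    return spec
```

The same blocks serve interactive use, user-written procedures, and the E2E pipeline.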
Documentation
• On-line and hardcopy
  – Tutorials/Quick Guides
  – Cookbooks
    • Based on observing types
  – Reference Manuals
    • Full, gory details
    • Data formats
    • Algorithms
  – Searchable by keywords
• Quick, interactive command help from within the system
• Never release until these are in place
User Support/Feedback
• A familiar system minimizes staff support
• Easily accessed, on-line “help desk” and “suggestion” box
• Automatic generation of “bug” reports
• Observers of observers
Marketing
• A familiar system already has a market
• Don’t be another cereal on the supermarket shelf
• Workshops are better than papers
• Create a user community
• Responsive feedback from developers
• Independent beta testers
• Reputation & first experiences are everything
User Community
• User forums
• Newsletters
• Accept user contributions/additions
  – SourceForge-like system
  – NRAO seal of approval
• NRAO moderator
Real-Time Data Display
• To guarantee data quality
  – Product is not stored (except for hardcopy)
  – Sequential processing – different from E2E/data pipeline
  – Fast is more important than accurate
  – Few bells and whistles – must avoid the RTD black hole
  – A simple display for all observation types is more important than sophisticated displays for a few data types
• Display happens within an ‘integration’ of when data were taken – tied to the real-time filler
• GUI based – the underlying language is unimportant
• Output understandable by an operator
Real-Time Data Analysis
• Pointing/Focus/Tipping/… are different from RTD
  – Results should be stored (data base)
  – Results are used by the control system (pointing/focus) or by subsequent analysis (tipping)
  – Accuracy is as important as speed
  – More bells, whistles, user options
  – Sequential processing (non-E2E/data pipeline)
  – Only a few observation types are handled
• Analysis happens within an ‘integration’ of when data were taken
• GUI based – the underlying language is unimportant
• Output understandable by an operator
IDL Work Package
• SDFITS
  – Interim solution for data import/export
  – CLASS/IDL specific; soon AIPS++/AIPS/UniPOPS?
  – MD/BDFITS next generation (keywords, incompleteness of contents, versatility, …)
• IDL – Tom Bania
  – Uses UniPOPS as a ‘model’ – familiar to many
  – Very good reproduction
  – Bania-centric – needs to be generalized
IDL Work Package
• Glen Langston
  – Assess whether IDL will meet performance, extensibility, usability, … goals
  – Generalization to other observing types
  – Real-time data access and display
  – Developed on top of and in parallel with Tom’s work (so, implementations have diverged)
  – Works well for Glen’s own experiments
IDL Work Package
• Institutionalize what Tom and Glen have done
  – Code management
  – Code review
  – Combine Tom’s and Glen’s branches
  – Generalize code
  – Provide ways for Tom and Glen to contribute within the same revision-control branch
• Develop ‘institutionalized’ code
  – Improve performance, usability, maintenance
  – Add/replace I/O components with better CS methods
Calibration Work Package
• User-tunable algorithms
  – Options for the ‘real-time filler’ – sequential
  – Options for the E2E pipeline – non-sequential
  – Options for interactive data reduction
• Default algorithms for all observing cases
• Extensible as new algorithms are developed
• User-defined/tweaked algorithms
• Robust and not-so-robust algorithms
Calibration Work Package
• Opacity/atmosphere model
• Output units
• Efficiencies
  – Source size
  – Telescope model
• Tsys(f) estimates
• Differencing schemes
• Non-linearities/template fitting/…
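As one hedged illustration of the differencing schemes and Tsys estimates above, a common single-dish scheme is position switching, T_A = T_sys * (ON - OFF) / OFF per channel, with T_sys estimated from a noise diode of known T_cal. The function names and the exact diode formula here are illustrative, not the adopted design:

```python
# Hypothetical calibration building blocks (values illustrative).
def tsys(cal_on, cal_off, t_cal):
    # Channel-averaged system temperature from diode on/off spectra:
    # T_sys ~ T_cal * sum(OFF) / sum(ON - OFF).
    return t_cal * sum(cal_off) / sum(
        a - b for a, b in zip(cal_on, cal_off))

def position_switch(on, off, t_sys):
    # Per-channel antenna temperature: T_sys * (ON - OFF) / OFF.
    return [t_sys * (a - b) / b for a, b in zip(on, off)]
```

Both the robust (full Tsys(f)) and faster (scalar Tsys) variants could share this structure, matching the robust/not-so-robust pairing above.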