Computing In Research
Dr. S.N. Pradhan, Professor, CSE Department
Agenda
• Introduction
• Data analysis and visualization
• Interactive Data Language (IDL)
• Scilab & Scicos
• Symbolic computation
• Mathematica / Maxima
A Data Analysis Pipeline
[Diagram: Raw data → Processed data → Hypothesis or Model → Results. Stage A (cleaning, filtering, transforming) produces the processed data; stage B (statistical analysis, pattern recognition, knowledge discovery) produces the hypothesis or model; stage C (validation) produces the results; stage D presents them.]
Where can visualization come in?
• All stages can benefit from visualization
• A: identify bad data, select subsets, help choose transforms (exploratory)
• B: help choose computational techniques, set parameters, use vision to recognize, isolate, classify patterns (exploratory)
• C: superimpose derived models on data (confirmatory)
• D: present results (presentation)
What decides how to visualize?
• Characteristics of data
  – Types, size, structure
  – Semantics, completeness, accuracy
• Characteristics of user
  – Perceptual and cognitive abilities
  – Knowledge of domain, data, tasks, tools
• Characteristics of graphical mappings
  – What are the possibilities
  – Which convey data effectively and efficiently
• Characteristics of interactions
  – Which support the tasks best
  – Which are easy to learn, use, remember
Issues Regarding Data
• Type may indicate which graphical mappings are appropriate
  – Nominal vs. ordinal
  – Discrete vs. continuous
  – Ordered vs. unordered
  – Univariate vs. multivariate
  – Scalar vs. vector vs. tensor
  – Static vs. dynamic
  – Values vs. relations
• Trade-offs between size and accuracy needs
• Different orders/structures can reveal different features/patterns
User perceptions
• What graphical attributes do we perceive accurately?
• What graphical attributes do we perceive quickly?
• Which combinations of attributes are separable?
• Coping with change blindness
• How can visuals support the development of accurate mental models of the data?
• Relative vs. absolute judgements – impact on tasks
Issues regarding mappings
• Variables include shape, size, orientation, color, texture, opacity, position, motion, …
• Some of these have an order, others don't
• Some use up significant screen space
• Sensitivity to occlusion
• Domain customs/expectations
Issues regarding Interactions
• Interaction is a critical component
• Many categories of techniques
  – Navigation, selection, filtering, reconfiguring, encoding, connecting, and combinations of the above
• Many "spaces" in which interactions can be applied
  – Screen/pixels, data, data structures, graphical objects, graphical attributes, visualization structures
Importance of Evaluation
• Easy to design bad visualizations
• Many design rules exist – many conflict, many routinely violated
• 5 E's of evaluation: effective, efficient, engaging, error tolerant, easy to learn
• Many styles of evaluation (qualitative and quantitative):
  – Use/case studies
  – Usability testing
  – User studies
  – Longitudinal studies
  – Expert evaluation
  – Heuristic evaluation
Different Views
Mappings
• Based on data characteristics
  – Numbers, text, graphs, software, …
• Logical groupings of techniques (Keim)
  – Standard: bars, lines, pie charts, scatterplots
  – Geometrically transformed: landscapes, parallel coordinates
  – Icon-based: stick figures, faces, profiles
  – Dense pixels: recursive segments, pixel bar charts
  – Stacked: treemaps, dimensional stacking
Mappings
• Based on dimension management (Ward)
  – Dimension subsetting: scatterplots, pixel-oriented methods
  – Dimension reconfiguring: glyphs, parallel coordinates
  – Dimension reduction: PCA, MDS (Multi-Dimensional Scaling), Self-Organizing Maps
  – Dimension embedding: dimensional stacking, worlds within worlds
Sensor Network
[Figure: pairwise link quality plotted against distance between nodes, from the sensor lab at Berkeley]
Glyphs
Dimensional Stacking
• Break each dimension range into bins
• Break the screen into a grid using the number of bins for 2 dimensions
• Repeat the process for 2 more dimensions within the subimages formed by the first grid; recurse through all dimensions
• Look for repeated patterns, outliers, trends, gaps
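The binning arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides: `bin_index` and `dstack_cell` are hypothetical names, and the convention assumed here is that even-indexed dimensions drive columns, odd-indexed dimensions drive rows, with earlier dimensions forming the coarser (outer) grid.

```python
def bin_index(value, lo, hi, nbins):
    # Map value in [lo, hi] to a bin in [0, nbins-1]; values at hi land in the last bin.
    i = int((value - lo) / (hi - lo) * nbins)
    return min(max(i, 0), nbins - 1)

def dstack_cell(point, lows, highs, nbins):
    """Return the (col, row) grid cell of `point` under dimensional stacking.

    Even-indexed dimensions map to x, odd-indexed to y; the first pair
    forms the outer grid, later pairs subdivide it recursively.
    """
    idx = [bin_index(v, lo, hi, nb)
           for v, lo, hi, nb in zip(point, lows, highs, nbins)]
    col = 0
    for b, nb in zip(idx[0::2], nbins[0::2]):
        col = col * nb + b      # outer dims are most significant
    row = 0
    for b, nb in zip(idx[1::2], nbins[1::2]):
        row = row * nb + b
    return col, row
```

With two bins per dimension, a 4-D dataset maps onto a 4×4 grid of cells; rendering each cell's point count as intensity reproduces the kind of image the slides show.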
Dimensional Stacking
Pixel-oriented technique
Methods to cope with scale
• Many modern datasets contain large numbers of records (millions or billions) and/or dimensions (hundreds or thousands)
• Several strategies to handle scale problems:
  – Sampling
  – Filtering
  – Clustering/aggregation
• Techniques can be automated or user-controlled
• Visualization is a powerful component of the data analysis process
• Each stage of analysis can be enhanced
• Visualization can help guide computational analysis, and vice versa
• Multiple linked views and a rich assortment of interactions are key to success
Numerical Recipes in C & C++
• Numerical Recipes in C is a collection (or library) of C functions written by Press et al.
• Library of mathematical functions
• Useful when doing process or system modeling
• Break the model down into known mathematical functions; then one can use the corresponding routines
GNU Scientific Library
• Basic mathematical functions
• Complex numbers
• Polynomials
• Special functions
• Vectors and matrices
• Permutations and combinations
• Multisets
• Sorting
• Linear algebra
• Eigensystems
• Fast Fourier transforms
• Numerical integration (based on QUADPACK)
• Random number generation
• Quasi-random sequences
• Random number distributions
• Statistics
• Histograms
Interactive Data Language
• Data manipulation and visualization
• Commercially available package: IDL from ITT Visual Information Solutions
• Consists of:
  – Data analysis
  – Data visualization
  – Animation
Open Source IDL (GDL)
• Open-source equivalent of IDL, and much more
• GDL is used particularly in the geosciences
• GDL is dynamically typed, vectorized, and has object-oriented programming capabilities
• The library routines handle numerical calculations, data visualisation, signal/image processing, interaction with the host OS, and data input/output
• GDL supports several data formats such as netCDF, HDF4, HDF5, GRIB, PNG, TIFF, DICOM, etc.
• Graphical output is handled by X11, PostScript, SVG, or z-buffer terminals
Part II
• Analysis may, therefore, be categorized as:
  – Descriptive analysis
  – Inferential analysis (often known as statistical analysis)
  – Correlation analysis
  – Causal analysis (regression analysis)
  – Multivariate analysis
Descriptive analysis
• "Descriptive analysis" is largely the study of distributions of one variable. This study provides us with profiles of companies, work groups, persons, and other subjects on any of a multitude of characteristics such as size, composition, efficiency, preferences, etc. This sort of analysis may be in respect of one variable (unidimensional analysis), two variables (bivariate analysis), or more than two variables (multivariate analysis). In this context we work out various measures that show the size and shape of a distribution, along with measures of the relationships between two or more variables.
Correlation analysis
• Correlation analysis studies the joint variation of two or more variables to determine the amount of correlation between them. In most social and business research, interest lies in understanding and controlling relationships between variables, so correlation analysis is relatively more important.
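The amount of joint variation between two variables is commonly summarized by the Pearson product-moment coefficient. A minimal pure-Python sketch, not from the slides (the function name is illustrative):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations; ranges from -1 to +1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

A perfectly linear increasing relationship gives r = 1, a perfectly decreasing one gives r = -1, and values near 0 indicate little linear association.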
Causal Analysis
• Causal analysis (regression analysis) is concerned with the study of how one or more variables affect changes in another variable. It is a study of the functional relationships existing between two or more variables. Causal analysis is considered relatively more important in experimental research.
Multivariate analysis
• Multivariate analysis is defined as “all statistical methods which simultaneously analyze more than two variables on a sample of observations”. With the availability of computer facilities, there has been a rapid development of this kind of analysis.
• Multiple regression analysis: This analysis is adopted when the researcher has one dependent variable which is presumed to be a function of two or more independent variables. The objective of this analysis is to make a prediction about the dependent variable based on its covariance with all the concerned independent variables.
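The prediction described above can be sketched by fitting coefficients with ordinary least squares via the normal equations. This is an illustrative pure-Python sketch, not the authors' code; `fit_linear` is a hypothetical name:

```python
def fit_linear(X, y):
    """Ordinary least squares for y = b0 + b1*x1 + ... + bk*xk:
    build A = [1 | X], then solve (A'A) b = A'y by Gaussian elimination."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    # Normal equations: M = A'A, v = A'y
    M = [[sum(A[i][p] * A[i][q] for i in range(len(A))) for q in range(k)]
         for p in range(k)]
    v = [sum(A[i][p] * y[i] for i in range(len(A))) for p in range(k)]
    # Gaussian elimination with partial pivoting
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        v[c], v[piv] = v[piv], v[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for q in range(c, k):
                M[r][q] -= f * M[c][q]
            v[r] -= f * v[c]
    # Back substitution
    b = [0.0] * k
    for c in range(k - 1, -1, -1):
        b[c] = (v[c] - sum(M[c][q] * b[q] for q in range(c + 1, k))) / M[c][c]
    return b
```

The returned coefficients give the prediction equation for the dependent variable; in practice a numerical library (e.g. a least-squares solver) would be used instead of hand-rolled elimination.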
• Multiple discriminant analysis: This analysis is appropriate when the researcher has a single dependent variable that cannot be measured, but can be classified into two or more groups on the basis of some attribute. The objective of this analysis is to predict an entity's probability of belonging to a particular group based on several predictor variables.
• Multivariate analysis of variance (MANOVA): This analysis is an extension of two-way ANOVA, wherein the ratio of among-group variance to within-group variance is worked out on a set of variables.
• Canonical analysis: This analysis can be used in case of both measurable and non-measurable variables for the purpose of simultaneously predicting a set of dependent variables from their joint covariance with a set of independent variables.
• Analysis of variance (ANOVA) is a useful technique concerning researches in the fields of economics, biology, education, psychology, sociology, business/industry and several other disciplines. This technique is used when multiple sample cases are involved. The significance of difference between the means of two samples can be judged through either z-test or the t-test, but the difficulty arises when we happen to examine the significance of the difference amongst more than two sample means at the same time. The ANOVA technique enables us to perform this simultaneous test and as such is considered to be an important tool of analysis in the hands of a researcher.
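The simultaneous test described above reduces to an F ratio of among-group variance to within-group variance. A small pure-Python sketch of the standard one-way ANOVA computation (illustrative; the function name is not from the slides):

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of samples:
    F = (SS_between / (k-1)) / (SS_within / (n-k))."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (compared against the F distribution with k-1 and n-k degrees of freedom) indicates that at least one group mean differs significantly.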
Role of Simulation
Simulation is a method and application to mimic a real system, mostly via computer. Simulation is a numerical technique for conducting experiments on a computer; it involves logical and mathematical relationships that interact to describe the behavior and structure of a complex real-world system over extended periods of time.
• Simulation makes it possible to study and experiment with the complex internal interactions of a given system.
• It provides better understanding of the system.
• Simulation can be used as a pedagogical device for teaching students.
• The experience of designing a computer simulation model may be more valuable than the design of the actual model.
• Simulation can be used to experiment with new situation about which we have little or no information available.
• To verify analytical solutions
• Cheap – no need for costly equipment
• Complex scenarios can be easily tested
• Results can be obtained quickly
• More ideas can be tested in less time
Pitfalls of simulation
• It cannot provide insight for all possible scenarios, e.g., mobile networks must be tested with different mobility models
• Failure to have a well-defined set of objectives at the beginning of the simulation study.
• Failure to communicate with the decision-maker (or the client) on a regular basis.
• Lack of knowledge of simulation methodology and also of probability and statistics.
• Real systems are often too complex to model, leading to an inappropriate level of model detail.
• Failure to collect good system data.
• Belief that so-called "easy-to-use" simulation packages require a significantly lower level of technical competence.
• Blindly using simulation software without understanding its underlying assumptions.
• Replacing a probability distribution by its mean.
• Failure to perform a proper output-data analysis.
Simulation Checklist
• Checks before developing the simulation
  – Is the goal properly specified?
  – Is the detail in the model appropriate for the goal?
  – Does the team include the right mix (leader, modeling, programming, background)?
  – Has sufficient time been planned?
• Checks during simulation development
  – Is the random number generator truly random?
  – Is the model reviewed regularly?
  – Is the model documented?
Checklist cont…
• Checks after the simulation is running
  – Is the simulation length appropriate?
  – Are initial transients removed?
  – Has the model been verified?
  – Has the model been validated?
  – Are there any surprising results? If yes, have they been validated?
Terminology
• State variables
  – Variables whose values define the current state of the system
  – Saving them allows the simulation to be stopped and restarted later by restoring all state variables
• Event
  – A change in system state
  – Ex: three events: arrival of a job, beginning of a new execution, departure of a job
• Continuous-time and discrete-time models
  – If state is defined at all times: continuous
  – If state is defined only at instants: discrete
  – Ex: a class that meets M-F 2-3 is discrete, since it is not defined at other times
• Continuous-state and discrete-state models
  – If uncountably infinite: continuous
    • Ex: time spent by students on homework
  – If countable: discrete
    • Ex: jobs in the CPU queue
  – Note: continuous time does not necessarily imply continuous state, and vice versa
• All combinations are possible
• Deterministic and probabilistic models
  – If output can be predicted with certainty: deterministic
  – If output differs across repetitions: probabilistic
  – Ex: for proj1, dog type-1 makes the simulation deterministic but dog type-2 makes it probabilistic
• Static and dynamic models
  – If time is not a variable: static
  – If the model changes with time: dynamic
  – Ex: a CPU scheduler is dynamic, while the matter-to-energy model E=mc² is static
• Linear and nonlinear models
  – Output is a linear combination of input: linear
  – Otherwise: nonlinear
• Open and closed models
  – Input is external and independent: open
  – A closed model has no external input
  – Ex: if the same jobs leave and re-enter the queue, then closed; if new jobs enter the system, then open
• Stable and unstable
  – Model output settles down: stable
  – Model output always changes: unstable
Selecting a Simulation Language
• Four choices: simulation language, general-purpose language, extension of a general-purpose language, simulation package
• Simulation language – built in facilities for time steps, event scheduling, data collection, reporting
• General-purpose – known to developer, available on more systems, flexible
• The major difference is the cost tradeoff – simulation language requires startup time to learn, while general purpose may require more time to add simulation flexibility
• Extension of general-purpose – collection of routines and tasks commonly used. Often, base language with extra libraries that can be called
• Simulation packages – allow definition of model in interactive fashion. Get results in one day
• Tradeoff is in flexibility, where packages can only do what developer envisioned, but if that is what is needed then is quicker to do so
Types of Simulation
• Variety of types, but the main ones: emulation, Monte Carlo, trace-driven, and discrete-event
• Emulation
  – A simulation that runs on a computer to make it appear to be something else
  – Examples: JVM, NIST Net
Monte Carlo Simulation
• A static simulation has no time parameter
  – Runs until some equilibrium state is reached
• Used to model physical phenomena, evaluate probabilistic systems, numerically estimate complex mathematical expressions
• Driven with a random number generator
  – Hence "Monte Carlo" (after the casinos) simulation
• Example: consider numerically determining the value of …
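The constant in the slide's example is lost in transcription; the textbook instance of this idea is estimating π, so that is assumed below. A minimal sketch (illustrative, not the slide's code):

```python
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi: sample n points uniformly in the unit
    square and count the fraction inside the quarter circle x^2 + y^2 <= 1;
    that fraction approaches pi/4 as n grows."""
    rng = random.Random(seed)   # fixed seed makes the run repeatable
    inside = sum(1 for _ in range(n)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n
```

The error of such an estimate shrinks only like 1/√n, which is why Monte Carlo runs are continued until some equilibrium (or accuracy) criterion is met.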
Trace-Driven Simulation
• Uses a time-ordered record of events on a real system as input
  – Ex: to compare memory management, use a trace of page reference patterns as input, and model and simulate page replacement algorithms
• Note: the trace needs to be independent of the system
  – Ex: a trace of disk events could not be used to study page replacement, since those events are dependent
Advantages of trace-driven simulation
• Credibility – easier to sell than random inputs
• Easy validation – when gathering a trace, one often gets performance stats and can validate against those
• Accurate workload – preserves the correlation of events; no need to simplify as for a workload model
• Less randomness – input is deterministic, so output may be too (or will at least have less non-determinism)
• Fair comparison – allows comparison of alternatives under the same input stream
• Similarity to actual implementation – the simulated system often needs to be similar to the real one, so one can get an accurate idea of its complexity
Disadvantages of trace-driven simulation
• Complexity – requires a more detailed implementation
• Representativeness – a trace from one system may not represent all traces
• Finiteness – traces can be long, so they are often limited by space; but then that time window may not represent other times
• Single point of validation – take care: validation of performance gathered during a trace represents only one case
• Trade-off – it is difficult to change the workload, since the trace cannot be changed. Changing the trace would first require a workload model
Discrete-event simulation
• Continuous-event simulations model things like weather or chemical reactions, while computer systems are usually discrete-event
• Typical components:
• Event scheduler – linked list of events
  – Schedule event X at time T
  – Hold event X for interval dt
  – Cancel previously scheduled event X
  – Hold event X indefinitely until scheduled by another event
  – Schedule an indefinitely scheduled event
  – Note: the event scheduler executes often, so it has a significant impact on performance
• Simulation clock and time advancing
  – Global variable holding the time
  – Scheduler advances time, in one of two ways:
    • Unit time – increment time by a small amount and check whether any events occur
    • Event-driven – increment time to the next event and execute it (typical)
• System state variables
  – Global variables describing state
  – Can be used to save and restore
• Event routines
  – Specific routines to handle each event
  – Ex: job arrival, job scheduling, job departure
  – Often handled by call-back from the event scheduler
• Input routines
  – Get input from the user (or a config file, or a script)
  – Often get all input before the simulation starts
  – May allow a range of inputs, a number of repetitions, etc.
• Report generators
  – Routines executed at the end of the simulation to compute and print the final results
  – Can include graphical representation, too
  – Ex: may compute total wait time in queue or number of processes scheduled
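The components above can be sketched as a minimal event-driven simulation core. This is an illustrative Python sketch, not from the slides; it uses a heap rather than the linked list the slides mention, since a heap gives O(log n) scheduling, which matters given the performance note about the scheduler:

```python
import heapq

class EventScheduler:
    """Event-driven core: a priority queue of (time, seq, callback)
    entries; the simulation clock jumps directly to each next event."""
    def __init__(self):
        self.now = 0.0          # simulation clock (global state)
        self._q = []
        self._seq = 0           # tie-breaker: equal-time events run FIFO
    def schedule(self, delay, callback):
        heapq.heappush(self._q, (self.now + delay, self._seq, callback))
        self._seq += 1
    def run(self):
        while self._q:
            self.now, _, cb = heapq.heappop(self._q)
            cb()                # event routine invoked by call-back

# Toy use: two jobs arrive; each departs 2.0 time units after arriving.
log = []                        # stands in for a report generator
sim = EventScheduler()

def depart():
    log.append(("depart", sim.now))

def arrive():
    log.append(("arrive", sim.now))
    sim.schedule(2.0, depart)

sim.schedule(0.0, arrive)
sim.schedule(1.0, arrive)
sim.run()
```

After `run()` returns, `log` holds the time-ordered event history, from which a report generator would compute statistics such as total wait time.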
Verification and Validation (Analysis of Simulation Results)
• Would like model output to be close to that of the real system
• We made assumptions about the behavior of the real system
• 1st step: test whether the assumptions are reasonable
  – Validation, or representativeness of assumptions
• 2nd step: test whether the model implements the assumptions
  – Verification, or correctness
• Good software engineering practices will result in fewer bugs
• Top-down, modular design
• Assertions (antibugging)
  – Say, total packets = packets sent + packets received
  – If not, can halt or warn
• Structured walk-through
• Simplified, deterministic cases
  – Even if the final simulation will be complicated and non-deterministic, use simple repeatable values (maybe fixed seeds) to debug
• Tracing (via print statements or a debugger)
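The antibugging assertion above might look like this in Python. This is one plausible reading of the slide's packet identity; the function and variable names are hypothetical:

```python
def check_packet_accounting(total, sent, received):
    """Antibugging check: an identity that must hold by construction is
    asserted at run time; a violation halts the simulation immediately
    with a message instead of silently corrupting results."""
    assert total == sent + received, (
        f"packet accounting broken: {total} != {sent} + {received}")
```

Sprinkling such assertions through event routines catches bookkeeping bugs close to where they occur, which is far cheaper than diagnosing them from distorted end-of-run statistics.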
• Continuity tests
  – A slight change in input should yield a slight change in output; otherwise, error
• Degeneracy tests
  – Try the extremes (lowest and highest), since they may reveal bugs
• Consistency tests – similar inputs produce similar outputs
• Seed independence – the random number generator's starting value should not affect the final conclusion (maybe individual outputs, but not the overall conclusion)
• Ensure the assumptions used are reasonable
  – Want the final simulated system to be like the real system
• Unlike verification, techniques to validate one simulation may differ from one model to another
• Three key aspects to validate:
  – Assumptions
  – Input parameter values and distributions
  – Output values and conclusions
• Compare the validity of each to one or more of:
  – Expert intuition
  – Real-system measurements
  – Theoretical results
• Theoretical results can be used to compare a simplified system with simulated results
• May not be useful as the sole validation, but can complement measurements or expert intuition
  – Ex: measurement validates for one processor, while an analytic model validates for many processors
• Note: there is no such thing as a "fully validated" model
  – It would require too many resources and may be impossible
  – One can only show that a model is invalid
• Instead, show validation in a few select cases, to lend confidence to the overall model results
Transient removal
• Most simulations only want steady state
  – Remove the initial transient state
• Trouble is, it is not possible to define exactly what constitutes the end of the transient state
• Use heuristics:
  – Long runs
  – Proper initialization
  – Truncation
  – Initial data deletion
  – Moving average of replications
  – Batch means
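Two of these heuristics, averaging across independent replications and then smoothing with a moving average, are easy to sketch. Illustrative Python, not from the slides; the knee where the smoothed curve flattens marks the end of the transient:

```python
def replication_mean(replications):
    """Average the j-th observation across replications, giving a
    smoother trajectory in which the transient is easier to spot."""
    m = len(replications)
    return [sum(rep[j] for rep in replications) / m
            for j in range(len(replications[0]))]

def moving_average(xs, k):
    """Centered moving average with window 2k+1; plot the result and
    look for the point after which it stays roughly flat."""
    return [sum(xs[j - k: j + k + 1]) / (2 * k + 1)
            for j in range(k, len(xs) - k)]
```

In practice one increases k until the plotted curve is smooth, then truncates all observations before the knee (initial data deletion) before computing steady-state statistics.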
Long runs
• Use very long runs
• Effects of the transient state will be amortized
• But … wastes resources
• And it is tough to choose how long is "enough"
• Recommendation: don't use long runs alone
Proper Initialization
• Start the simulation in a state close to the expected state
• Ex: a CPU scheduler may start with some jobs already in the queue
• Determine starting conditions from previous simulations or simple analysis
• May result in decreased run length, but still may not provide confidence that the simulation is in a stable condition
Terminating Simulation
• For some simulations, the transient state itself is of interest, so no transient removal is required
• Sometimes upon termination you also get final conditions that do not reflect steady state
  – Can apply transient-removal techniques to the end of the simulation
• Take care when gathering statistics at the end of the simulation
  – Ex: mean service time should include only those jobs that finish
• Also, take care with values at event times
Stopping Criteria
• Important to run long enough
  – Stopping too soon may give variable results
  – Stopping too long may waste resources
• Should get confidence intervals on the mean to the desired width:
  x̄ ± z_{1−α/2} · √(Var(x̄))
• Variance of the sample mean of independent observations:
  Var(x̄) = Var(x) / n
• But this holds only if the observations are independent! Most simulations are not
  – Ex: if the queuing delay for packet i is large, then it will likely be large for packet i+1
• So, use: independent replications, batch means, regeneration (all next)
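Of these, independent replications is the simplest to sketch: run the simulation m times with different seeds, treat each run's mean as one observation, and build the interval from those means, which are independent even when observations within a run are autocorrelated. Illustrative Python (for small m, a Student-t quantile should replace the normal z used here):

```python
from math import sqrt

def ci_from_replications(rep_means, z=1.96):
    """Confidence interval for the mean from m independent replication
    means: xbar +/- z * s / sqrt(m), using the sample variance s^2
    of the replication means (z = 1.96 gives ~95% coverage)."""
    m = len(rep_means)
    xbar = sum(rep_means) / m
    var = sum((x - xbar) ** 2 for x in rep_means) / (m - 1)
    half = z * sqrt(var / m)
    return xbar - half, xbar + half
```

If the resulting interval is wider than desired, add replications (or lengthen runs) and recompute; the half-width shrinks like 1/√m.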