Computing In Research Dr. S.N. Pradhan Professor, CSE Department

Post on 23-Feb-2016


TRANSCRIPT

Page 1: Computing In Research

Computing In Research

Dr. S.N. Pradhan, Professor, CSE Department

Page 2: Computing In Research

Agenda

• Introduction
• Data analysis and Visualization
• Interactive Data Language (IDL)
• Scilab & Scicos
• Symbolic Computation
• Mathematica / Maxima

Page 3: Computing In Research

A Data Analysis Pipeline

[Diagram] Raw data →(A: cleaning, filtering, transforming)→ Processed data →(B: statistical analysis, pattern recognition, knowledge discovery)→ Hypothesis or Model →(C: validation)→ Results →(D: presentation)

Page 4: Computing In Research

Where can visualization come in?

• All stages can benefit from visualization
• A: identify bad data, select subsets, help choose transforms (exploratory)
• B: help choose computational techniques, set parameters, use vision to recognize, isolate, classify patterns (exploratory)
• C: superimpose derived models on data (confirmatory)
• D: present results (presentation)

Page 5: Computing In Research

What decides how to visualize?

• Characteristics of data
  – Types, size, structure
  – Semantics, completeness, accuracy
• Characteristics of user
  – Perceptual and cognitive abilities
  – Knowledge of domain, data, tasks, tools
• Characteristics of graphical mappings
  – What are the possibilities
  – Which convey data effectively and efficiently
• Characteristics of interactions
  – Which support the tasks best
  – Which are easy to learn, use, remember

Page 6: Computing In Research

Issues Regarding Data

• Type may indicate which graphical mappings are appropriate
  – Nominal vs. ordinal
  – Discrete vs. continuous
  – Ordered vs. unordered
  – Univariate vs. multivariate
  – Scalar vs. vector vs. tensor
  – Static vs. dynamic
  – Values vs. relations
• Trade-offs between size and accuracy needs
• Different orders/structures can reveal different features/patterns

Page 7: Computing In Research

User perceptions

• What graphical attributes do we perceive accurately?
• What graphical attributes do we perceive quickly?
• Which combinations of attributes are separable?
• Coping with change blindness
• How can visuals support the development of accurate mental models of the data?
• Relative vs. absolute judgements – impact on tasks

Page 8: Computing In Research

Issues regarding mappings

• Variables include shape, size, orientation, color, texture, opacity, position, motion, …
• Some of these have an order, others don’t
• Some use up significant screen space
• Sensitivity to occlusion
• Domain customs/expectations

Page 9: Computing In Research

Issues regarding Interactions

• Interaction is a critical component
• Many categories of techniques
  – Navigation, selection, filtering, reconfiguring, encoding, connecting, and combinations of the above
• Many “spaces” in which interactions can be applied
  – Screen/pixels, data, data structures, graphical objects, graphical attributes, visualization structures

Page 10: Computing In Research

Importance of Evaluation

• Easy to design bad visualizations
• Many design rules exist – many conflict, many routinely violated
• 5 E’s of evaluation: effective, efficient, engaging, error tolerant, easy to learn
• Many styles of evaluation (qualitative and quantitative):
  – Use/case studies
  – Usability testing
  – User studies
  – Longitudinal studies
  – Expert evaluation
  – Heuristic evaluation

Page 11: Computing In Research

Different Views

Page 12: Computing In Research

Mappings

• Based on data characteristics
  – Numbers, text, graphs, software, …
• Logical groupings of techniques (Keim)
  – Standard: bars, lines, pie charts, scatterplots
  – Geometrically transformed: landscapes, parallel coordinates
  – Icon-based: stick figures, faces, profiles
  – Dense pixels: recursive segments, pixel bar charts
  – Stacked: treemaps, dimensional stacking

Page 13: Computing In Research

Mappings

• Based on dimension management (Ward)
  – Dimension subsetting: scatterplots, pixel-oriented methods
  – Dimension reconfiguring: glyphs, parallel coordinates
  – Dimension reduction: PCA, MDS (Multi-Dimensional Scaling), Self-Organizing Maps
  – Dimension embedding: dimensional stacking, worlds within worlds
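The dimension-reduction techniques above can be sketched in a few lines. Here is a minimal PCA example in NumPy; the synthetic dataset and the choice of two output dimensions are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy dataset: 100 samples, 5 dimensions (synthetic, for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# PCA via the covariance matrix: centre, decompose, project
Xc = X - X.mean(axis=0)                 # centre each dimension
cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort descending by variance
W = eigvecs[:, order[:2]]               # keep the top-2 principal axes
X2 = Xc @ W                             # 2-D embedding for plotting

print(X2.shape)  # → (100, 2)
```

The projected points `X2` could then feed a standard scatterplot; the same pattern extends to keeping any number of components.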

Page 14: Computing In Research

Sensor Network

SENSOR LAB AT BERKELEY

Page 15: Computing In Research

Pairwise link quality

[Plot: link quality vs. distance between nodes]

Page 16: Computing In Research

Glyphs

Page 17: Computing In Research

Dimensional Stacking

• Break each dimension range into bins
• Break the screen into a grid using the number of bins for 2 dimensions
• Repeat the process for 2 more dimensions within the subimages formed by the first grid; recurse through all dimensions
• Look for repeated patterns, outliers, trends, gaps
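The recursive binning above can be sketched for the 4-dimensional case. The bin counts, data ranges, and sample record below are invented for illustration:

```python
# Minimal sketch of dimensional stacking for 4 dimensions.
def bin_index(value, lo, hi, nbins):
    """Map a value in [lo, hi] to a bin number 0..nbins-1."""
    b = int((value - lo) / (hi - lo) * nbins)
    return min(b, nbins - 1)          # clamp the top edge into the last bin

def stack_cell(record, ranges, bins):
    """Map a 4-D record to an (x, y) grid cell.

    Dims 0 and 1 choose the outer grid cell; dims 2 and 3 choose the
    sub-cell inside it, giving a (bins[0]*bins[2]) x (bins[1]*bins[3]) grid.
    """
    b = [bin_index(v, lo, hi, n) for v, (lo, hi), n in zip(record, ranges, bins)]
    x = b[0] * bins[2] + b[2]         # outer column, then inner column
    y = b[1] * bins[3] + b[3]         # outer row, then inner row
    return x, y

ranges = [(0.0, 10.0)] * 4            # assumed data range per dimension
bins = [3, 3, 4, 4]                   # 3x3 outer grid, 4x4 inner grid
print(stack_cell((5.0, 9.9, 0.0, 2.5), ranges, bins))  # → (4, 9)
```

Plotting one marker per record at its `(x, y)` cell reproduces the stacked display; deeper recursion handles more than four dimensions.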

Page 18: Computing In Research

Dimensional Stacking

Page 19: Computing In Research

Pixel oriented technique

Page 20: Computing In Research

Methods to cope with scale

• Many modern datasets contain large number of records (millions and billions) and/or dimensions (hundreds and thousands)

• Several strategies to handle scale problems
  – Sampling
  – Filtering
  – Clustering/aggregation

• Techniques can be automated or user-controlled

Page 21: Computing In Research

• Visualization is a powerful component of the data analysis process
• Each stage of analysis can be enhanced
• Visualization can help guide computational analysis, and vice versa
• Multiple linked views and a rich assortment of interactions are key to success

Page 22: Computing In Research

Numerical Recipes in C & C++

• Numerical Recipes in C is a collection (or a library) of C functions written by Press et al.
• Library of mathematical functions
• Useful while doing process or system modeling.
• Break the model down into known mathematical functions, then use the corresponding routines.

Page 25: Computing In Research

Interactive Data language

• Data manipulation and visualization
• Commercially available package: IDL from ITT Visual Information Solutions
• Consists of
  – Data analysis
  – Data visualization
  – Animation

Page 26: Computing In Research

Open Source IDL(GDL)

• Open-source equivalent of IDL and much more
• GDL is used particularly in geosciences.
• GDL is dynamically typed, vectorized, and has object-oriented programming capabilities.
• The library routines handle numerical calculations, data visualisation, signal/image processing, interaction with the host OS, and data input/output.
• GDL supports several data formats such as netCDF, HDF4, HDF5, GRIB, PNG, TIFF, DICOM, etc.
• Graphical output is handled by X11, PostScript, SVG or z-buffer terminals.

Page 27: Computing In Research

Part II

• Analysis may, therefore, be categorized as:
  – Descriptive analysis
  – Inferential analysis (often known as statistical analysis)
  – Correlation analysis
  – Causal analysis (regression analysis)
  – Multivariate analysis

Page 28: Computing In Research

Descriptive analysis

• Descriptive analysis is largely the study of distributions of one variable. This study provides us with profiles of companies, work groups, persons and other subjects on any of a multitude of characteristics such as size, composition, efficiency, preferences, etc. This sort of analysis may be in respect of one variable (described as unidimensional analysis), in respect of two variables (described as bivariate analysis), or in respect of more than two variables (described as multivariate analysis). In this context we work out various measures that show the size and shape of a distribution(s), along with the study of measuring relationships between two or more variables.

Page 29: Computing In Research

Correlation analysis

• Correlation analysis studies the joint variation of two or more variables to determine the amount of correlation between them. In most social and business research, interest lies in understanding and controlling relationships between variables, and so correlation analysis is relatively more important.
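The joint variation described above is typically summarised by the Pearson correlation coefficient. A minimal sketch, using two invented variables (hypothetical study hours vs. test scores):

```python
import math

# Pearson correlation: covariance normalised by both standard deviations.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical study hours
score = [52.0, 55.0, 61.0, 68.0, 70.0]  # hypothetical test scores

n = len(hours)
mh = sum(hours) / n
ms = sum(score) / n
cov = sum((h - mh) * (s - ms) for h, s in zip(hours, score))
r = cov / math.sqrt(sum((h - mh) ** 2 for h in hours)
                    * sum((s - ms) ** 2 for s in score))
print(round(r, 3))  # → 0.986, a strong positive correlation
```

Values of r near +1 or -1 indicate strong linear association; values near 0 indicate little linear relationship.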

Page 30: Computing In Research

Causal Analysis

• Causal analysis (regression analysis) is concerned with the study of how one or more variables affect changes in another variable. It is a study of the functional relationships existing between two or more variables. Causal analysis is considered relatively more important in experimental research.

Page 31: Computing In Research

Multivariate analysis

• Multivariate analysis is defined as “all statistical methods which simultaneously analyze more than two variables on a sample of observations”. With the availability of computer facilities, there has been a rapid development of this kind of analysis.

Page 32: Computing In Research

• Multiple regression analysis: This analysis is adopted when the researcher has one dependent variable which is presumed to be a function of two or more independent variables. The objective of this analysis is to make a prediction about the dependent variable based on its covariance with all the concerned independent variables.
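The prediction step described above reduces to a least-squares fit. A minimal NumPy sketch, using synthetic data generated from known coefficients (the data and coefficient values are invented for illustration):

```python
import numpy as np

# Multiple regression y = b0 + b1*x1 + b2*x2, solved by least squares.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 + 0.5 * x2           # exact linear relation, no noise

A = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 3))                # recovers [1.  2.  0.5]

y_new = coef @ [1.0, 7.0, 2.0]          # predict for x1=7, x2=2
```

With noise-free data the fit recovers the generating coefficients exactly; with real data the same call gives the best-fitting plane through the observations.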

Page 33: Computing In Research

• Multiple discriminant analysis: This analysis is appropriate when the researcher has a single dependent variable that cannot be measured, but can be classified into two or more groups on the basis of some attribute. The object of this analysis happens to be to predict an entity’s possibility of belonging to a particular group based on several predictor variables.

Page 34: Computing In Research

• Multivariate analysis of variance (or multi-ANOVA): This analysis is an extension of two-way ANOVA, wherein the ratio of among-group variance to within-group variance is worked out on a set of variables.

Page 35: Computing In Research

• Canonical analysis: This analysis can be used in case of both measurable and non-measurable variables for the purpose of simultaneously predicting a set of dependent variables from their joint covariance with a set of independent variables.

Page 36: Computing In Research

• Analysis of variance (ANOVA) is a useful technique for research in the fields of economics, biology, education, psychology, sociology, business/industry and several other disciplines. This technique is used when multiple sample cases are involved. The significance of the difference between the means of two samples can be judged through either the z-test or the t-test, but the difficulty arises when we happen to examine the significance of the difference amongst more than two sample means at the same time. The ANOVA technique enables us to perform this simultaneous test and as such is considered to be an important tool of analysis in the hands of a researcher.
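The simultaneous test above comes down to the F statistic: between-group variance divided by within-group variance. A minimal one-way ANOVA sketch with three invented groups:

```python
# One-way ANOVA: F statistic for three groups (data invented for illustration).
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total observations
mean = lambda xs: sum(xs) / len(xs)
grand = mean([x for g in groups for x in g])

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# F = (between-group mean square) / (within-group mean square)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 2))  # → 19.0
```

A large F (compared against the F distribution with k-1 and n-k degrees of freedom) indicates that at least one group mean differs significantly from the others.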

Page 37: Computing In Research
Page 38: Computing In Research

Role of Simulation

Simulation is a method and application to mimic a real system, mostly via computer. Simulation is a numerical technique for conducting experiments on a computer, involving logical and mathematical relationships that interact to describe the behavior and structure of a complex real-world system over extended periods of time.

Page 39: Computing In Research

• Simulation makes it possible to study and experiment with the complex internal interactions of a given system.

• It provides better understanding of the system.

• Simulation can be used as a pedagogical device for teaching students.

• The experience of designing a computer simulation model may be more valuable than the design of the actual model.

Page 40: Computing In Research

• Simulation can be used to experiment with new situations about which we have little or no information available.
• To verify analytical solutions.
• Cheap – no need for costly equipment
• Complex scenarios can be easily tested
• Results can be obtained quickly
• More ideas can be tested in less time

Page 41: Computing In Research

Pitfalls of simulation

• It cannot provide insight for all possible scenarios, e.g. mobile networks must be tested with different mobility models

• Failure to have a well-defined set of objectives at the beginning of the simulation study.

• Failure to communicate with the decision-maker (or the client) on a regular basis.

• Lack of knowledge of simulation methodology and also of probability and statistics.

Page 42: Computing In Research

• Real systems too complex to model, leading to an inappropriate level of model detail.
• Failure to collect good system data.
• Belief that so-called ”easy-to-use” simulation packages require a significantly lower level of technical competence.
• Blindly using simulation software without understanding its underlying assumptions.
• Replacing a probability distribution by its mean.
• Failure to perform a proper output-data analysis.

Page 43: Computing In Research

Simulation Checklist

• Checks before developing the simulation
  – Is the goal properly specified?
  – Is the detail in the model appropriate for the goal?
  – Does the team include the right mix (leader, modeling, programming, background)?
  – Has sufficient time been planned?
• Checks during simulation development
  – Are the random numbers actually random?
  – Is the model reviewed regularly?
  – Is the model documented?

Page 44: Computing In Research

Checklist cont…

• Checks after the simulation is running
  – Is the simulation length appropriate?
  – Are initial transients removed?
  – Has the model been verified?
  – Has the model been validated?
  – Are there any surprising results? If yes, have they been validated?

Page 45: Computing In Research

Terminology

• State variables
  – Variables whose values define the current state of the system
  – Saving them can allow a simulation to be stopped and restarted later by restoring all state variables
• Event
  – A change in system state
  – Ex: three events: arrival of a job, beginning of a new execution, departure of a job

Page 46: Computing In Research

• Continuous-time and discrete-time models
  – If state is defined at all times → continuous
  – If state is defined only at instants → discrete
  – Ex: a class that meets M–F 2–3 is discrete, since it is not defined at other times
• Continuous-state and discrete-state models
  – If uncountably infinite → continuous
    • Ex: time spent by students on homework
  – If countable → discrete
    • Ex: jobs in the CPU queue
  – Note: continuous time does not necessarily imply continuous state, and vice versa
• All combinations possible

Page 47: Computing In Research

• Deterministic and probabilistic models
  – If output is predicted with certainty → deterministic
  – If output differs across repetitions → probabilistic
  – Ex: for proj1, dog type-1 makes the simulation deterministic, but dog type-2 makes the simulation probabilistic

Page 48: Computing In Research

• Static and dynamic models
  – If time is not a variable → static
  – If the model changes with time → dynamic
  – Ex: a CPU scheduler is dynamic, while the matter-to-energy model E = mc² is static
• Linear and nonlinear models
  – Output is a linear combination of inputs → linear
  – Otherwise → nonlinear

Page 49: Computing In Research

• Open and closed models
  – Input is external and independent → open
  – A closed model has no external input
  – Ex: if the same jobs leave and re-enter the queue then closed, while if new jobs enter the system then open
• Stable and unstable
  – Model output settles down → stable
  – Model output always changes → unstable

Page 50: Computing In Research

Selecting a Simulation Language

• Four choices: simulation language, general-purpose language, extension of a general-purpose language, simulation package
• Simulation language – built-in facilities for time steps, event scheduling, data collection, reporting
• General-purpose – known to the developer, available on more systems, flexible
• The major difference is the cost tradeoff – a simulation language requires startup time to learn, while a general-purpose language may require more time to add simulation flexibility

Page 51: Computing In Research

• Extension of general-purpose – a collection of commonly used routines and tasks. Often a base language with extra libraries that can be called
• Simulation packages – allow definition of the model in an interactive fashion; get results in one day
• Tradeoff is in flexibility: packages can only do what the developer envisioned, but if that is what is needed then it is quicker to do so

Page 52: Computing In Research

Types of Simulation

• Variety of types, but the main ones: emulation, Monte Carlo, trace-driven, and discrete-event
• Emulation
  – Simulation that runs on a computer to make it appear to be something else
  – Examples: JVM, NIST Net

Page 53: Computing In Research

Monte Carlo Simulation

• A static simulation has no time parameter
  – Runs until some equilibrium state is reached
• Used to model physical phenomena, evaluate probabilistic systems, numerically estimate complex mathematical expressions
• Driven with a random number generator
  – Hence “Monte Carlo” (after the casinos) simulation
• Example: consider numerically determining the value of …
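The example above trails off; the constant being determined in this classic demonstration is commonly π (an assumption here, as the slide is truncated). A minimal sketch, with an arbitrary sample count and seed:

```python
import random

# Classic Monte Carlo sketch: estimate pi by sampling random points in the
# unit square and counting how many fall inside the quarter circle.
random.seed(1)                          # fixed seed for repeatability
n = 100_000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)
pi_est = 4 * inside / n
print(pi_est)                           # close to 3.14159
```

The fraction of points inside the quarter circle approaches π/4, so multiplying by 4 gives the estimate; accuracy improves roughly as 1/√n.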

Page 54: Computing In Research

Trace-Driven Simulation

• Uses a time-ordered record of events on a real system as input
  – Ex: to compare memory management, use a trace of page reference patterns as input, and then model and simulate page replacement algorithms
• Note: the trace needs to be independent of the system
  – Ex: a trace of disk events could not be used to study page replacement, since those events are dependent
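The page-replacement example above can be sketched directly: replay a reference trace through a FIFO policy and count faults. The trace and frame counts below are illustrative (a real study would use a recorded trace from a live system):

```python
from collections import deque

# Trace-driven sketch: FIFO page replacement driven by a page-reference trace.
def fifo_faults(trace, n_frames):
    frames, queue, faults = set(), deque(), 0
    for page in trace:
        if page not in frames:
            faults += 1
            if len(frames) == n_frames:        # memory full: evict oldest page
                frames.discard(queue.popleft())
            frames.add(page)
            queue.append(page)
    return faults

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]   # classic reference string
print(fifo_faults(trace, 3), fifo_faults(trace, 4))  # → 9 10
```

Note that this particular trace shows Belady's anomaly: adding a fourth frame *increases* FIFO faults from 9 to 10. Because the same trace drives every alternative, the comparison between policies or frame counts is fair.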

Page 55: Computing In Research

Advantages of Trace-Driven Simulation

• Credibility – easier to sell than random inputs
• Easy validation – when gathering the trace, one often gets performance stats and can validate against those
• Accurate workload – preserves correlation of events; no need to simplify as for a workload model
• Less randomness – input is deterministic, so output may be too (or will at least have less non-determinism)
• Fair comparison – allows comparison of alternatives under the same input stream
• Similarity to actual implementation – often the simulated system needs to be similar to the real one, so one can get an accurate idea of how complex it is

Page 56: Computing In Research

Disadvantages of Trace-Driven Simulation

• Complexity – requires a more detailed implementation
• Representativeness – a trace from one system may not represent all traces
• Finiteness – traces can be long, so they are often limited by space, but then that time period may not represent other times
• Single point of validation – need to be careful: validation of performance gathered during a trace represents only one case
• Trade-off – it is difficult to change the workload, since the trace cannot be changed; changing the trace would first need a workload model

Page 57: Computing In Research

Discrete-Event Simulation

• Continuous-event simulations model things like weather or chemical reactions, while computer systems are usually discrete-event
• Typical components:
• Event scheduler – linked list of events
  – Schedule event X at time T
  – Hold event X for interval dt
  – Cancel previously scheduled event X
  – Hold event X indefinitely until scheduled by another event
  – Schedule an indefinitely held event
  – Note: the event scheduler is executed often, so it has a significant impact on performance
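The scheduler component above is often implemented as a priority queue keyed on event time rather than a linked list. A minimal sketch (class and event names are illustrative, not from the slides):

```python
import heapq
import itertools

# Minimal discrete-event scheduler: a heap ordered by event time, with a
# tie-breaking counter so simultaneous events fire in insertion order.
class Scheduler:
    def __init__(self):
        self._queue = []
        self._counter = itertools.count()
        self.now = 0.0                       # simulation clock

    def schedule(self, time, action):
        heapq.heappush(self._queue, (time, next(self._counter), action))

    def run(self):
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)                     # event routine (call-back)

log = []
sched = Scheduler()
sched.schedule(2.0, lambda s: log.append(("departure", s.now)))
sched.schedule(1.0, lambda s: log.append(("arrival", s.now)))
sched.run()
print(log)   # events fire in time order: arrival at 1.0, departure at 2.0
```

The heap gives O(log n) schedule and extract operations, which matters because, as the slide notes, the scheduler is executed very often.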

Page 58: Computing In Research

• Simulation clock and time advancing
  – Global variable holding the time
  – Scheduler advances time
    • Unit time – increment time by a small amount and see if any events occur
    • Event-driven – increment time to the next event and execute it (typical)
• System state variables
  – Global variables describing state
  – Can be used to save and restore

Page 59: Computing In Research

• Event routines
  – Specific routines to handle each event
  – Ex: job arrival, job scheduling, job departure
  – Often handled by call-back from the event scheduler
• Input routines
  – Get input from the user (or a config file, or a script)
  – Often get all input before the simulation starts
  – May allow a range of inputs, number of repetitions, etc.

Page 60: Computing In Research

• Report generators
  – Routines executed at the end of the simulation to compute and print the final results
  – Can include graphical representation, too
  – Ex: may compute total wait time in queue or number of processes scheduled

Page 61: Computing In Research

Verification and Validation (Analysis of Simulation Results)

• Would like model output to be close to that of the real system
• Assumptions were made about the behavior of the real system
• 1st step: test whether the assumptions are reasonable
  – Validation, or representativeness of assumptions
• 2nd step: test whether the model implements the assumptions
  – Verification, or correctness

Page 62: Computing In Research

• Good software engineering practices will result in fewer bugs
• Top-down, modular design
• Assertions (antibugging)
  – Say, total packets = packets sent + packets received
  – If not, can halt or warn
• Structured walk-through
• Simplified, deterministic cases
  – Even if the end simulation will be complicated and non-deterministic, use simple repeatable values (maybe fixed seeds) to debug
• Tracing (via print statements or a debugger)

Page 63: Computing In Research

• Continuity tests
  – A slight change in input should yield a slight change in output; otherwise, error
• Degeneracy tests
  – Try extremes (lowest and highest), since they may reveal bugs

Page 64: Computing In Research

• Consistency tests – similar inputs produce similar outputs

• Seed independence – random number generator starting value should not affect final conclusion (maybe individual output, but not overall conclusion)

Page 65: Computing In Research

• Ensure the assumptions used are reasonable
  – Want the final simulated system to be like the real system
• Unlike verification, techniques to validate one simulation may differ from one model to another
• Three key aspects to validate:
  – Assumptions
  – Input parameter values and distributions
  – Output values and conclusions
• Compare the validity of each to one or more of:
  – Expert intuition
  – Real system measurements
  – Theoretical results

Page 66: Computing In Research

• Theoretical results can be used to compare a simplified system with simulated results
• May not be useful as sole validation, but can complement measurements or expert intuition
  – Ex: measurement validates for one processor, while an analytic model validates for many processors
• Note: there is no such thing as a “fully validated” model
  – It would require too many resources and may be impossible
  – One can only show a model is invalid
• Instead, show validation in a few select cases, to lend confidence to the overall model results

Page 67: Computing In Research

Transient Removal

• Most simulations only want steady state
  – Remove the initial transient state
• Trouble is, it is not possible to define exactly what constitutes the end of the transient state
• Use heuristics:
  – Long runs
  – Proper initialization
  – Truncation
  – Initial data deletion
  – Moving average of replications
  – Batch means
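One of the heuristics above, initial data deletion, can be sketched in a few lines: drop the first l observations and watch the remaining mean stabilise. The output sequence below (a transient decaying toward 1.0) is invented for illustration:

```python
# Initial data deletion sketch: mean of the observations after dropping
# the first l, for increasing l.
def trimmed_mean(xs, l):
    rest = xs[l:]
    return sum(rest) / len(rest)

xs = [5.0, 3.0, 2.0, 1.5] + [1.0] * 16     # transient, then steady state
means = [round(trimmed_mean(xs, l), 3) for l in range(6)]
print(means)  # → [1.375, 1.184, 1.083, 1.029, 1.0, 1.0]
```

Once the trimmed mean stops changing as l grows (here at l = 4), the deleted prefix is taken to be the transient, and only the remaining observations feed the steady-state analysis.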

Page 68: Computing In Research

Long runs

• Use very long runs
• Effects of the transient state will be amortized
• But … wastes resources
• And it is tough to choose how long is “enough”
• Recommendation: don’t use long runs alone

Page 69: Computing In Research

Proper Initialization

• Start the simulation in a state close to the expected state
• Ex: a CPU scheduler may start with some jobs in the queue
• Determine starting conditions by previous simulations or simple analysis
• May result in decreased run length, but still may not provide confidence that the simulation is in a stable condition

Page 70: Computing In Research

Terminating Simulations

• For some simulations the transient state is of interest, so no transient removal is required
• Sometimes upon termination you also get final conditions that do not reflect steady state
  – Can apply transient removal to the end of the simulation
• Take care when gathering statistics at the end of the simulation
  – Ex: mean service time should include only those jobs that finish
• Also, take care with values at event times

Page 71: Computing In Research

Stopping Criteria

• Important to run long enough
  – Stopping too short may give variable results
  – Stopping too long may waste resources
• Should get confidence intervals on the mean to the desired width:
  x̄ ± z_{1−α/2} · √Var(x̄)
• Variance of the sample mean of independent observations:
  Var(x̄) = Var(x) / n
• But only if the observations are independent! In most simulations they are not
  – Ex: if the queuing delay for packet i is large, then it will likely be large for packet i+1
• So, use: independent replications, batch means, regeneration (all next)
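The batch-means method mentioned above can be sketched briefly: split the correlated output into batches, treat the batch means as approximately independent, and build the confidence interval from them. The synthetic data, batch size, and 95% z value are illustrative assumptions:

```python
import math

# Batch-means confidence interval for the mean of correlated output.
def batch_means_ci(xs, batch_size, z=1.96):        # z = z_{1-alpha/2}, 95% CI
    n_batches = len(xs) // batch_size
    means = [sum(xs[i * batch_size:(i + 1) * batch_size]) / batch_size
             for i in range(n_batches)]
    grand = sum(means) / n_batches
    var = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    half = z * math.sqrt(var / n_batches)          # z * sqrt(Var(grand mean))
    return grand - half, grand + half

xs = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 3.0, 2.0, 1.0, 2.0, 2.0, 2.0]
lo, hi = batch_means_ci(xs, batch_size=4)
print(lo <= 2.0 <= hi)  # → True: the interval covers the true mean here
```

Larger batches reduce the residual correlation between batch means; in practice the batch size is grown until adjacent batch means are approximately uncorrelated.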