Computing In Research
Dr. S.N. Pradhan, Professor, CSE Department
Agenda
• Introduction
• Data analysis and visualization
• Interactive Data Language (IDL)
• Scilab & Scicos
• Symbolic computation
• Mathematica / Maxima
A Data Analysis Pipeline
[Diagram: Raw data → Processed data → Hypothesis or Model → Results. Stage A (cleaning, filtering, transforming) produces the processed data; stage B (statistical analysis, pattern recognition, knowledge discovery) produces the hypothesis or model; stage C (validation) produces the results; stage D presents them.]
Where can visualization come in?
• All stages can benefit from visualization
• A: identify bad data, select subsets, help choose transforms (exploratory)
• B: help choose computational techniques, set parameters, use vision to recognize, isolate, classify patterns (exploratory)
• C: superimpose derived models on data (confirmatory)
• D: present results (presentation)
What decides how to visualize?
• Characteristics of data
  – Types, size, structure
  – Semantics, completeness, accuracy
• Characteristics of user
  – Perceptual and cognitive abilities
  – Knowledge of domain, data, tasks, tools
• Characteristics of graphical mappings
  – What are the possibilities
  – Which convey data effectively and efficiently
• Characteristics of interactions
  – Which support the tasks best
  – Which are easy to learn, use, remember
Issues Regarding Data
• Type may indicate which graphical mappings are appropriate
  – Nominal vs. ordinal
  – Discrete vs. continuous
  – Ordered vs. unordered
  – Univariate vs. multivariate
  – Scalar vs. vector vs. tensor
  – Static vs. dynamic
  – Values vs. relations
• Trade-offs between size and accuracy needs
• Different orders/structures can reveal different features/patterns
User perceptions
• What graphical attributes do we perceive accurately?
• What graphical attributes do we perceive quickly?
• Which combinations of attributes are separable?
• Coping with change blindness
• How can visuals support the development of accurate mental models of the data?
• Relative vs. absolute judgements – impact on tasks
Issues regarding mappings
• Variables include shape, size, orientation, color, texture, opacity, position, motion, …
• Some of these have an order, others don't
• Some use up significant screen space
• Sensitivity to occlusion
• Domain customs/expectations
Issues regarding Interactions
• Interaction is a critical component
• Many categories of techniques
  – Navigation, selection, filtering, reconfiguring, encoding, connecting, and combinations of the above
• Many "spaces" in which interactions can be applied
  – Screen/pixels, data, data structures, graphical objects, graphical attributes, visualization structures
Importance of Evaluation
• Easy to design bad visualizations
• Many design rules exist – many conflict, many routinely violated
• 5 E's of evaluation: effective, efficient, engaging, error tolerant, easy to learn
• Many styles of evaluation (qualitative and quantitative):
  – Use/case studies
  – Usability testing
  – User studies
  – Longitudinal studies
  – Expert evaluation
  – Heuristic evaluation
Different Views
Mappings
• Based on data characteristics
  – Numbers, text, graphs, software, …
• Logical groupings of techniques (Keim)
  – Standard: bars, lines, pie charts, scatterplots
  – Geometrically transformed: landscapes, parallel coordinates
  – Icon-based: stick figures, faces, profiles
  – Dense pixels: recursive segments, pixel bar charts
  – Stacked: treemaps, dimensional stacking
Mappings
• Based on dimension management (Ward)
  – Dimension subsetting: scatterplots, pixel-oriented methods
  – Dimension reconfiguring: glyphs, parallel coordinates
  – Dimension reduction: PCA, MDS (Multi-Dimensional Scaling), Self-Organizing Maps
  – Dimension embedding: dimensional stacking, worlds within worlds
Sensor Network
[Figure: pairwise link quality plotted against distance between nodes, from the sensor lab at Berkeley]
Glyphs
Dimensional Stacking
• Break each dimension range into bins
• Break the screen into a grid using the number of bins for 2 dimensions
• Repeat the process for 2 more dimensions within the subimages formed by the first grid; recurse through all dimensions
• Look for repeated patterns, outliers, trends, gaps
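The binning arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides: `bin_index` and `dstack_cell` are hypothetical names, and the convention assumed here is that even-indexed dimensions drive columns, odd-indexed dimensions drive rows, with earlier dimensions forming the coarser (outer) grid.

```python
def bin_index(value, lo, hi, nbins):
    # Map value in [lo, hi] to a bin in [0, nbins-1]; values at hi land in the last bin.
    i = int((value - lo) / (hi - lo) * nbins)
    return min(max(i, 0), nbins - 1)

def dstack_cell(point, lows, highs, nbins):
    """Return the (col, row) grid cell of `point` under dimensional stacking.

    Even-indexed dimensions map to x, odd-indexed to y; the first pair
    forms the outer grid, later pairs subdivide it recursively.
    """
    idx = [bin_index(v, lo, hi, nb)
           for v, lo, hi, nb in zip(point, lows, highs, nbins)]
    col = 0
    for b, nb in zip(idx[0::2], nbins[0::2]):
        col = col * nb + b      # outer dims are most significant
    row = 0
    for b, nb in zip(idx[1::2], nbins[1::2]):
        row = row * nb + b
    return col, row
```

With two bins per dimension, a 4-D dataset maps onto a 4×4 grid of cells; rendering each cell's point count as intensity reproduces the kind of image the slides show.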
Dimensional Stacking
Pixel-oriented technique
Methods to cope with scale
• Many modern datasets contain large numbers of records (millions or billions) and/or dimensions (hundreds or thousands)
• Several strategies to handle scale problems:
  – Sampling
  – Filtering
  – Clustering/aggregation
• Techniques can be automated or user-controlled
• Visualization is a powerful component of the data analysis process
• Each stage of analysis can be enhanced
• Visualization can help guide computational analysis, and vice versa
• Multiple linked views and a rich assortment of interactions are key to success
Numerical Recipes in C & C++
• Numerical Recipes in C is a collection (or library) of C functions written by Press et al.
• Library of mathematical functions
• Useful when doing process or system modeling
• Break the model down into known mathematical functions; then one can use the corresponding routines
GNU Scientific Library
• Basic mathematical functions
• Complex numbers
• Polynomials
• Special functions
• Vectors and matrices
• Permutations and combinations
• Multisets
• Sorting
• Linear algebra
• Eigensystems
• Fast Fourier transforms
• Numerical integration (based on QUADPACK)
• Random number generation
• Quasi-random sequences
• Random number distributions
• Statistics
• Histograms
Interactive Data Language
• Data manipulation and visualization
• Commercially available package: IDL from ITT Visual Information Solutions
• Consists of:
  – Data analysis
  – Data visualization
  – Animation
Open Source IDL (GDL)
• Open-source equivalent of IDL, and much more
• GDL is used particularly in the geosciences
• GDL is dynamically typed, vectorized, and has object-oriented programming capabilities
• The library routines handle numerical calculations, data visualisation, signal/image processing, interaction with the host OS, and data input/output
• GDL supports several data formats such as netCDF, HDF4, HDF5, GRIB, PNG, TIFF, DICOM, etc.
• Graphical output is handled by X11, PostScript, SVG, or z-buffer terminals
Part II
• Analysis may, therefore, be categorized as:
  – Descriptive analysis
  – Inferential analysis (often known as statistical analysis)
  – Correlation analysis
  – Causal analysis (regression analysis)
  – Multivariate analysis
Descriptive analysis
• "Descriptive analysis" is largely the study of distributions of one variable. This study provides us with profiles of companies, work groups, persons, and other subjects on any of a multitude of characteristics such as size, composition, efficiency, preferences, etc. This sort of analysis may be in respect of one variable (unidimensional analysis), two variables (bivariate analysis), or more than two variables (multivariate analysis). In this context we work out various measures that show the size and shape of a distribution, along with measures of the relationships between two or more variables.
Correlation analysis
• Correlation analysis studies the joint variation of two or more variables to determine the amount of correlation between them. In most social and business research, interest lies in understanding and controlling relationships between variables, so correlation analysis is relatively more important.
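The amount of joint variation between two variables is commonly summarized by the Pearson product-moment coefficient. A minimal pure-Python sketch, not from the slides (the function name is illustrative):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations; ranges from -1 to +1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

A perfectly linear increasing relationship gives r = 1, a perfectly decreasing one gives r = -1, and values near 0 indicate little linear association.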
Causal Analysis
• Causal analysis (regression analysis) is concerned with the study of how one or more variables affect changes in another variable. It is a study of the functional relationships existing between two or more variables. Causal analysis is considered relatively more important in experimental research.
Multivariate analysis
• Multivariate analysis is defined as “all statistical methods which simultaneously analyze more than two variables on a sample of observations”. With the availability of computer facilities, there has been a rapid development of this kind of analysis.
• Multiple regression analysis: This analysis is adopted when the researcher has one dependent variable which is presumed to be a function of two or more independent variables. The objective of this analysis is to make a prediction about the dependent variable based on its covariance with all the concerned independent variables.
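The prediction described above can be sketched by fitting coefficients with ordinary least squares via the normal equations. This is an illustrative pure-Python sketch, not the authors' code; `fit_linear` is a hypothetical name:

```python
def fit_linear(X, y):
    """Ordinary least squares for y = b0 + b1*x1 + ... + bk*xk:
    build A = [1 | X], then solve (A'A) b = A'y by Gaussian elimination."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    # Normal equations: M = A'A, v = A'y
    M = [[sum(A[i][p] * A[i][q] for i in range(len(A))) for q in range(k)]
         for p in range(k)]
    v = [sum(A[i][p] * y[i] for i in range(len(A))) for p in range(k)]
    # Gaussian elimination with partial pivoting
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        v[c], v[piv] = v[piv], v[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for q in range(c, k):
                M[r][q] -= f * M[c][q]
            v[r] -= f * v[c]
    # Back substitution
    b = [0.0] * k
    for c in range(k - 1, -1, -1):
        b[c] = (v[c] - sum(M[c][q] * b[q] for q in range(c + 1, k))) / M[c][c]
    return b
```

The returned coefficients give the prediction equation for the dependent variable; in practice a numerical library (e.g. a least-squares solver) would be used instead of hand-rolled elimination.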
• Multiple discriminant analysis: This analysis is appropriate when the researcher has a single dependent variable that cannot be measured, but can be classified into two or more groups on the basis of some attribute. The objective of this analysis is to predict an entity's probability of belonging to a particular group based on several predictor variables.
• Multivariate analysis of variance (MANOVA): This analysis is an extension of two-way ANOVA, wherein the ratio of among-group variance to within-group variance is worked out on a set of variables.
• Canonical analysis: This analysis can be used in case of both measurable and non-measurable variables for the purpose of simultaneously predicting a set of dependent variables from their joint covariance with a set of independent variables.
• Analysis of variance (ANOVA) is a useful technique concerning researches in the fields of economics, biology, education, psychology, sociology, business/industry and several other disciplines. This technique is used when multiple sample cases are involved. The significance of difference between the means of two samples can be judged through either z-test or the t-test, but the difficulty arises when we happen to examine the significance of the difference amongst more than two sample means at the same time. The ANOVA technique enables us to perform this simultaneous test and as such is considered to be an important tool of analysis in the hands of a researcher.
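The simultaneous test described above reduces to an F ratio of among-group variance to within-group variance. A small pure-Python sketch of the standard one-way ANOVA computation (illustrative; the function name is not from the slides):

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of samples:
    F = (SS_between / (k-1)) / (SS_within / (n-k))."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (compared against the F distribution with k-1 and n-k degrees of freedom) indicates that at least one group mean differs significantly.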
Role of Simulation
Simulation is a method and application to mimic a real system, mostly via computer. Simulation is a numerical technique for conducting experiments on a computer; it involves logical and mathematical relationships that interact to describe the behavior and structure of a complex real-world system over extended periods of time.
• Simulation makes it possible to study and experiment with the complex internal interactions of a given system.
• It provides better understanding of the system.
• Simulation can be used as a pedagogical device for teaching students.
• The experience of designing a computer simulation model may be more valuable than the design of the actual model.
• Simulation can be used to experiment with new situation about which we have little or no information available.
• To verify analytical solutions
• Cheap – no need for costly equipment
• Complex scenarios can be easily tested
• Results can be obtained quickly
• More ideas can be tested in less time
Pitfalls of simulation
• It cannot provide insight for all possible scenarios, e.g., mobile networks must be tested with different mobility models
• Failure to have a well-defined set of objectives at the beginning of the simulation study.
• Failure to communicate with the decision-maker (or the client) on a regular basis.
• Lack of knowledge of simulation methodology and also of probability and statistics.
• Real systems are often too complex to model, leading to an inappropriate level of model detail.
• Failure to collect good system data.
• Belief that so-called "easy-to-use" simulation packages require a significantly lower level of technical competence.
• Blindly using simulation software without understanding its underlying assumptions.
• Replacing a probability distribution by its mean.
• Failure to perform a proper output-data analysis.
Simulation Checklist
• Checks before developing the simulation
  – Is the goal properly specified?
  – Is the detail in the model appropriate for the goal?
  – Does the team include the right mix (leader, modeling, programming, background)?
  – Has sufficient time been planned?
• Checks during simulation development
  – Is the random number generator truly random?
  – Is the model reviewed regularly?
  – Is the model documented?
Checklist cont…
• Checks after the simulation is running
  – Is the simulation length appropriate?
  – Are initial transients removed?
  – Has the model been verified?
  – Has the model been validated?
  – Are there any surprising results? If yes, have they been validated?
Terminology
• State variables
  – Variables whose values define the current state of the system
  – Saving them allows the simulation to be stopped and restarted later by restoring all state variables
• Event
  – A change in system state
  – Ex: three events: arrival of a job, beginning of a new execution, departure of a job
• Continuous-time and discrete-time models
  – If state is defined at all times: continuous
  – If state is defined only at instants: discrete
  – Ex: a class that meets M-F 2-3 is discrete, since it is not defined at other times
• Continuous-state and discrete-state models
  – If uncountably infinite: continuous
    • Ex: time spent by students on homework
  – If countable: discrete
    • Ex: jobs in the CPU queue
  – Note: continuous time does not necessarily imply continuous state, and vice versa
• All combinations are possible
• Deterministic and probabilistic models
  – If output can be predicted with certainty: deterministic
  – If output differs across repetitions: probabilistic
  – Ex: for proj1, dog type-1 makes the simulation deterministic but dog type-2 makes it probabilistic
• Static and dynamic models
  – If time is not a variable: static
  – If the model changes with time: dynamic
  – Ex: a CPU scheduler is dynamic, while the matter-to-energy model E=mc² is static
• Linear and nonlinear models
  – Output is a linear combination of input: linear
  – Otherwise: nonlinear
• Open and closed models
  – Input is external and independent: open
  – A closed model has no external input
  – Ex: if the same jobs leave and re-enter the queue, then closed; if new jobs enter the system, then open
• Stable and unstable
  – Model output settles down: stable
  – Model output always changes: unstable
Selecting a Simulation Language
• Four choices: simulation language, general-purpose language, extension of a general-purpose language, simulation package
• Simulation language – built in facilities for time steps, event scheduling, data collection, reporting
• General-purpose – known to developer, available on more systems, flexible
• The major difference is the cost tradeoff – simulation language requires startup time to learn, while general purpose may require more time to add simulation flexibility
• Extension of general-purpose – collection of routines and tasks commonly used. Often, base language with extra libraries that can be called
• Simulation packages – allow definition of model in interactive fashion. Get results in one day
• Tradeoff is in flexibility, where packages can only do what developer envisioned, but if that is what is needed then is quicker to do so
Types of Simulation
• Variety of types, but the main ones: emulation, Monte Carlo, trace-driven, and discrete-event
• Emulation
  – A simulation that runs on a computer to make it appear to be something else
  – Examples: JVM, NIST Net
Monte Carlo Simulation
• A static simulation has no time parameter
  – Runs until some equilibrium state is reached
• Used to model physical phenomena, evaluate probabilistic systems, numerically estimate complex mathematical expressions
• Driven with a random number generator
  – Hence "Monte Carlo" (after the casinos) simulation
• Example: consider numerically determining the value of …
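The constant in the slide's example is lost in transcription; the textbook instance of this idea is estimating π, so that is assumed below. A minimal sketch (illustrative, not the slide's code):

```python
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi: sample n points uniformly in the unit
    square and count the fraction inside the quarter circle x^2 + y^2 <= 1;
    that fraction approaches pi/4 as n grows."""
    rng = random.Random(seed)   # fixed seed makes the run repeatable
    inside = sum(1 for _ in range(n)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n
```

The error of such an estimate shrinks only like 1/√n, which is why Monte Carlo runs are continued until some equilibrium (or accuracy) criterion is met.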
Trace-Driven Simulation
• Uses a time-ordered record of events on a real system as input
  – Ex: to compare memory management, use a trace of page reference patterns as input, and model and simulate page replacement algorithms
• Note: the trace needs to be independent of the system
  – Ex: a trace of disk events could not be used to study page replacement, since those events are dependent
Advantages of trace-driven simulation
• Credibility – easier to sell than random inputs
• Easy validation – when gathering a trace, one often gets performance stats and can validate against those
• Accurate workload – preserves the correlation of events; no need to simplify as for a workload model
• Less randomness – input is deterministic, so output may be too (or will at least have less non-determinism)
• Fair comparison – allows comparison of alternatives under the same input stream
• Similarity to actual implementation – the simulated system often needs to be similar to the real one, so one can get an accurate idea of its complexity
Disadvantages of trace-driven simulation
• Complexity – requires a more detailed implementation
• Representativeness – a trace from one system may not represent all traces
• Finiteness – traces can be long, so they are often limited by space; but then that time window may not represent other times
• Single point of validation – take care: validation of performance gathered during a trace represents only one case
• Trade-off – it is difficult to change the workload, since the trace cannot be changed. Changing the trace would first require a workload model
Discrete-event simulation
• Continuous-event simulations model things like weather or chemical reactions, while computer systems are usually discrete-event
• Typical components:
• Event scheduler – linked list of events
  – Schedule event X at time T
  – Hold event X for interval dt
  – Cancel previously scheduled event X
  – Hold event X indefinitely until scheduled by another event
  – Schedule an indefinitely scheduled event
  – Note: the event scheduler executes often, so it has a significant impact on performance
• Simulation clock and time advancing
  – Global variable holding the time
  – Scheduler advances time, in one of two ways:
    • Unit time – increment time by a small amount and check whether any events occur
    • Event-driven – increment time to the next event and execute it (typical)
• System state variables
  – Global variables describing state
  – Can be used to save and restore
• Event routines
  – Specific routines to handle each event
  – Ex: job arrival, job scheduling, job departure
  – Often handled by call-back from the event scheduler
• Input routines
  – Get input from the user (or a config file, or a script)
  – Often get all input before the simulation starts
  – May allow a range of inputs, a number of repetitions, etc.
• Report generators
  – Routines executed at the end of the simulation to compute and print the final results
  – Can include graphical representation, too
  – Ex: may compute total wait time in queue or number of processes scheduled
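The components above can be sketched as a minimal event-driven simulation core. This is an illustrative Python sketch, not from the slides; it uses a heap rather than the linked list the slides mention, since a heap gives O(log n) scheduling, which matters given the performance note about the scheduler:

```python
import heapq

class EventScheduler:
    """Event-driven core: a priority queue of (time, seq, callback)
    entries; the simulation clock jumps directly to each next event."""
    def __init__(self):
        self.now = 0.0          # simulation clock (global state)
        self._q = []
        self._seq = 0           # tie-breaker: equal-time events run FIFO
    def schedule(self, delay, callback):
        heapq.heappush(self._q, (self.now + delay, self._seq, callback))
        self._seq += 1
    def run(self):
        while self._q:
            self.now, _, cb = heapq.heappop(self._q)
            cb()                # event routine invoked by call-back

# Toy use: two jobs arrive; each departs 2.0 time units after arriving.
log = []                        # stands in for a report generator
sim = EventScheduler()

def depart():
    log.append(("depart", sim.now))

def arrive():
    log.append(("arrive", sim.now))
    sim.schedule(2.0, depart)

sim.schedule(0.0, arrive)
sim.schedule(1.0, arrive)
sim.run()
```

After `run()` returns, `log` holds the time-ordered event history, from which a report generator would compute statistics such as total wait time.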
Verification and Validation (Analysis of Simulation Results)
• Would like model output to be close to that of the real system
• We made assumptions about the behavior of the real system
• 1st step: test whether the assumptions are reasonable
  – Validation, or representativeness of assumptions
• 2nd step: test whether the model implements the assumptions
  – Verification, or correctness
• Good software engineering practices will result in fewer bugs
• Top-down, modular design
• Assertions (antibugging)
  – Say, total packets = packets sent + packets received
  – If not, can halt or warn
• Structured walk-through
• Simplified, deterministic cases
  – Even if the final simulation will be complicated and non-deterministic, use simple repeatable values (maybe fixed seeds) to debug
• Tracing (via print statements or a debugger)
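The antibugging assertion above might look like this in Python. This is one plausible reading of the slide's packet identity; the function and variable names are hypothetical:

```python
def check_packet_accounting(total, sent, received):
    """Antibugging check: an identity that must hold by construction is
    asserted at run time; a violation halts the simulation immediately
    with a message instead of silently corrupting results."""
    assert total == sent + received, (
        f"packet accounting broken: {total} != {sent} + {received}")
```

Sprinkling such assertions through event routines catches bookkeeping bugs close to where they occur, which is far cheaper than diagnosing them from distorted end-of-run statistics.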
• Continuity tests
  – A slight change in input should yield a slight change in output; otherwise, error
• Degeneracy tests
  – Try the extremes (lowest and highest), since they may reveal bugs
• Consistency tests – similar inputs produce similar outputs
• Seed independence – the random number generator's starting value should not affect the final conclusion (maybe individual outputs, but not the overall conclusion)
• Ensure the assumptions used are reasonable
  – Want the final simulated system to be like the real system
• Unlike verification, techniques to validate one simulation may differ from one model to another
• Three key aspects to validate:
  – Assumptions
  – Input parameter values and distributions
  – Output values and conclusions
• Compare the validity of each to one or more of:
  – Expert intuition
  – Real-system measurements
  – Theoretical results
• Theoretical results can be used to compare a simplified system with simulated results
• May not be useful as the sole validation, but can complement measurements or expert intuition
  – Ex: measurement validates for one processor, while an analytic model validates for many processors
• Note: there is no such thing as a "fully validated" model
  – It would require too many resources and may be impossible
  – One can only show that a model is invalid
• Instead, show validation in a few select cases, to lend confidence to the overall model results
Transient removal
• Most simulations only want steady state
  – Remove the initial transient state
• Trouble is, it is not possible to define exactly what constitutes the end of the transient state
• Use heuristics:
  – Long runs
  – Proper initialization
  – Truncation
  – Initial data deletion
  – Moving average of replications
  – Batch means
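Two of these heuristics, averaging across independent replications and then smoothing with a moving average, are easy to sketch. Illustrative Python, not from the slides; the knee where the smoothed curve flattens marks the end of the transient:

```python
def replication_mean(replications):
    """Average the j-th observation across replications, giving a
    smoother trajectory in which the transient is easier to spot."""
    m = len(replications)
    return [sum(rep[j] for rep in replications) / m
            for j in range(len(replications[0]))]

def moving_average(xs, k):
    """Centered moving average with window 2k+1; plot the result and
    look for the point after which it stays roughly flat."""
    return [sum(xs[j - k: j + k + 1]) / (2 * k + 1)
            for j in range(k, len(xs) - k)]
```

In practice one increases k until the plotted curve is smooth, then truncates all observations before the knee (initial data deletion) before computing steady-state statistics.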
Long runs
• Use very long runs
• Effects of the transient state will be amortized
• But … wastes resources
• And it is tough to choose how long is "enough"
• Recommendation: don't use long runs alone
Proper Initialization
• Start the simulation in a state close to the expected state
• Ex: a CPU scheduler may start with some jobs already in the queue
• Determine starting conditions from previous simulations or simple analysis
• May result in decreased run length, but still may not provide confidence that the simulation is in a stable condition
Terminating Simulation
• For some simulations, the transient state itself is of interest, so no transient removal is required
• Sometimes upon termination you also get final conditions that do not reflect steady state
  – Can apply transient-removal techniques to the end of the simulation
• Take care when gathering statistics at the end of the simulation
  – Ex: mean service time should include only those jobs that finish
• Also, take care with values at event times
Stopping Criteria
• Important to run long enough
  – Stopping too soon may give variable results
  – Stopping too long may waste resources
• Should get confidence intervals on the mean to the desired width:
  x̄ ± z_{1−α/2} · √(Var(x̄))
• Variance of the sample mean of independent observations:
  Var(x̄) = Var(x) / n
• But this holds only if the observations are independent! Most simulations are not
  – Ex: if the queuing delay for packet i is large, then it will likely be large for packet i+1
• So, use: independent replications, batch means, regeneration (all next)
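Of these, independent replications is the simplest to sketch: run the simulation m times with different seeds, treat each run's mean as one observation, and build the interval from those means, which are independent even when observations within a run are autocorrelated. Illustrative Python (for small m, a Student-t quantile should replace the normal z used here):

```python
from math import sqrt

def ci_from_replications(rep_means, z=1.96):
    """Confidence interval for the mean from m independent replication
    means: xbar +/- z * s / sqrt(m), using the sample variance s^2
    of the replication means (z = 1.96 gives ~95% coverage)."""
    m = len(rep_means)
    xbar = sum(rep_means) / m
    var = sum((x - xbar) ** 2 for x in rep_means) / (m - 1)
    half = z * sqrt(var / m)
    return xbar - half, xbar + half
```

If the resulting interval is wider than desired, add replications (or lengthen runs) and recompute; the half-width shrinks like 1/√m.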