Applied Statistics for Civil and Environmental Engineers


  • P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFKBLUK154-Kottegoda April 13, 2008 12:36

    APPLIED STATISTICS FOR CIVIL AND ENVIRONMENTAL ENGINEERS


    APPLIED STATISTICS FOR CIVIL AND ENVIRONMENTAL ENGINEERS
    Second Edition

    Nathabandu T. Kottegoda
    Department of Hydraulic, Environmental, and Surveying Engineering
    Politecnico di Milano, Italy

    Renzo Rosso
    Department of Hydraulic, Environmental, and Surveying Engineering
    Politecnico di Milano, Italy


    This edition first published 2008
    © 2008 by Blackwell Publishing Ltd and 1997 by The McGraw-Hill Companies, Inc.

    Blackwell Publishing was acquired by John Wiley & Sons in February 2007. Blackwell's publishing programme has been merged with Wiley's global Scientific, Technical, and Medical business to form Wiley-Blackwell.

    Registered office
    John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

    Editorial office
    9600 Garsington Road, Oxford, OX4 2DQ, United Kingdom

    For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell.

    The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

    All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

    Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

    Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

    ISBN: 978-1-4051-7917-1

    Library of Congress Cataloging-in-Publication Data

    Kottegoda, N. T.
    Applied statistics for civil and environmental engineers / Nathabandu T. Kottegoda, Renzo Rosso. 2nd ed.
    p. cm.
    Prev. ed. published as: Statistics, probability, and reliability for civil and environmental engineers. New York : McGraw-Hill, c1997.
    Includes bibliographical references and index.
    ISBN-13: 978-1-4051-7917-1 (hardback : alk. paper)
    ISBN-10: 1-4051-7917-1 (hardback : alk. paper)
    1. Civil engineering--Statistical methods. 2. Environmental engineering--Statistical methods. 3. Probabilities. I. Rosso, Renzo. II. Kottegoda, N. T. Statistics, probability, and reliability for civil and environmental engineers. III. Title.
    TA340.K67 2008
    519.5024624--dc22
    2007047496

    A catalogue record for this book is available from the British Library.

    Set in 10/12pt Times by Aptara Inc., New Delhi, India
    Printed in Singapore by Utopia Press Pte Ltd

    1 2008


    Contents

    Dedication
    Preface to the First Edition
    Preface to the Second Edition
    Introduction

    1 Preliminary Data Analysis
        1.1 Graphical Representation
            1.1.1 Line diagram or bar chart
            1.1.2 Dot diagram
            1.1.3 Histogram
            1.1.4 Frequency polygon
            1.1.5 Cumulative relative frequency diagram
            1.1.6 Duration curves
            1.1.7 Summary of Section 1.1
        1.2 Numerical Summaries of Data
            1.2.1 Measures of central tendency
            1.2.2 Measures of dispersion
            1.2.3 Measure of asymmetry
            1.2.4 Measure of peakedness
            1.2.5 Summary of Section 1.2
        1.3 Exploratory Methods
            1.3.1 Stem-and-leaf plot
            1.3.2 Box plot
            1.3.3 Summary of Section 1.3
        1.4 Data Observed in Pairs
            1.4.1 Correlation and graphical plots
            1.4.2 Covariance and the correlation coefficient
            1.4.3 Q-Q plots
            1.4.4 Summary of Section 1.4
        1.5 Summary for Chapter 1
        References
        Problems

    2 Basic Probability Concepts
        2.1 Random Events
            2.1.1 Sample space and events
            2.1.2 The null event, intersection, and union
            2.1.3 Venn diagram and event space
            2.1.4 Summary of Section 2.1
        2.2 Measures of Probability
            2.2.1 Interpretations of probability
            2.2.2 Probability axioms
            2.2.3 Addition rule
            2.2.4 Further properties of probability functions
            2.2.5 Conditional probability and multiplication rule
            2.2.6 Stochastic independence
            2.2.7 Total probability and Bayes' theorems
            2.2.8 Summary of Section 2.2
        2.3 Summary for Chapter 2
        References
        Problems

    3 Random Variables and Their Properties
        3.1 Random Variables and Probability Distributions
            3.1.1 Random variables
            3.1.2 Probability mass function
            3.1.3 Cumulative distribution function of a discrete random variable
            3.1.4 Probability density function
            3.1.5 Cumulative distribution function of a continuous random variable
            3.1.6 Summary of Section 3.1
        3.2 Descriptors of Random Variables
            3.2.1 Expectation and other population measures
            3.2.2 Generating functions
            3.2.3 Estimation of parameters
            3.2.4 Summary of Section 3.2
        3.3 Multiple Random Variables
            3.3.1 Joint probability distributions of discrete variables
            3.3.2 Joint probability distributions of continuous variables
            3.3.3 Properties of multiple variables
            3.3.4 Summary of Section 3.3
        3.4 Associated Random Variables and Probabilities
            3.4.1 Functions of a random variable
            3.4.2 Functions of two or more variables
            3.4.3 Properties of derived variables
            3.4.4 Compound variables
            3.4.5 Summary of Section 3.4
        3.5 Copulas
        3.6 Summary for Chapter 3
        References
        Problems

    4 Probability Distributions
        4.1 Discrete Distributions
            4.1.1 Bernoulli distribution
            4.1.2 Binomial distribution
            4.1.3 Poisson distribution
            4.1.4 Geometric and negative binomial distributions
            4.1.5 Log-series distribution
            4.1.6 Multinomial distribution
            4.1.7 Hypergeometric distribution
            4.1.8 Summary of Section 4.1
        4.2 Continuous Distributions
            4.2.1 Uniform distribution
            4.2.2 Exponential distribution
            4.2.3 Erlang and gamma distribution
            4.2.4 Beta distribution
            4.2.5 Weibull distribution
            4.2.6 Normal distribution
            4.2.7 Lognormal distribution
            4.2.8 Summary of Section 4.2
        4.3 Multivariate Distributions
            4.3.1 Bivariate normal distribution
            4.3.2 Other bivariate distributions
        4.4 Summary for Chapter 4
        References
        Problems

    5 Model Estimation and Testing
        5.1 A Review of Terms Related to Random Sampling
        5.2 Properties of Estimators
            5.2.1 Unbiasedness
            5.2.2 Consistency
            5.2.3 Minimum variance
            5.2.4 Efficiency
            5.2.5 Sufficiency
            5.2.6 Summary of Section 5.2
        5.3 Estimation of Confidence Intervals
            5.3.1 Confidence interval estimation of the mean when the standard deviation is known
            5.3.2 Confidence interval estimation of the mean when the standard deviation is unknown
            5.3.3 Confidence interval for a proportion
            5.3.4 Sampling distribution of differences and sums of statistics
            5.3.5 Interval estimation for the variance: chi-squared distribution
            5.3.6 Summary of Section 5.3
        5.4 Hypothesis Testing
            5.4.1 Procedure for testing
            5.4.2 Probabilities of Type I and Type II errors and the power function
            5.4.3 Neyman-Pearson lemma
            5.4.4 Tests of hypotheses involving the variance
            5.4.5 The F distribution and its use
            5.4.6 Summary of Section 5.4
        5.5 Nonparametric Methods
            5.5.1 Sign test applied to the median
            5.5.2 Wilcoxon signed-rank test for association of paired observations
            5.5.3 Kruskal-Wallis test for paired observations in k samples
            5.5.4 Tests on randomness: runs test
            5.5.5 Spearman's rank correlation coefficient
            5.5.6 Summary of Section 5.5
        5.6 Goodness-of-Fit Tests
            5.6.1 Chi-squared goodness-of-fit test
            5.6.2 Kolmogorov-Smirnov goodness-of-fit test
            5.6.3 Kolmogorov-Smirnov two-sample test
            5.6.4 Anderson-Darling goodness-of-fit test
            5.6.5 Other methods for testing the goodness-of-fit to a normal distribution
            5.6.6 Summary of Section 5.6
        5.7 Analysis of Variance
            5.7.1 One-way analysis of variance
            5.7.2 Two-way analysis of variance
            5.7.3 Summary of Section 5.7
        5.8 Probability Plotting Methods and Visual Aids
            5.8.1 Probability plotting for uniform distribution
            5.8.2 Probability plotting for normal distribution
            5.8.3 Probability plotting for Gumbel or EV1 distribution
            5.8.4 Probability plotting of other distributions
            5.8.5 Visual fitting methods based on the histogram
            5.8.6 Summary of Section 5.8
        5.9 Identification and Accommodation of Outliers
            5.9.1 Hypothesis tests
            5.9.2 Test statistics for detection of outliers
            5.9.3 Dealing with nonnormal data
            5.9.4 Estimation of probabilities of extreme events when outliers are present
            5.9.5 Summary of Section 5.9
        5.10 Summary of Chapter 5
        References
        Problems

    6 Methods of Regression and Multivariate Analysis
        6.1 Simple Linear Regression
            6.1.1 Estimates of the parameters
            6.1.2 Properties of the estimators and errors
            6.1.3 Tests of significance and confidence intervals
            6.1.4 The bivariate normal model and correlation
            6.1.5 Summary of Section 6.1
        6.2 Multiple Linear Regression
            6.2.1 Formulation of the model
            6.2.2 Linear least squares solutions using the matrix method
            6.2.3 Properties of least squares estimators and error variance
            6.2.4 Model testing
            6.2.5 Model adequacy
            6.2.6 Residual plots
            6.2.7 Influential observations and outliers in regression
            6.2.8 Transformations
            6.2.9 Confidence intervals on mean response and prediction
            6.2.10 Ridge regression
            6.2.11 Other methods and discussion of Section 6.2
        6.3 Multivariate Analysis
            6.3.1 Principal components analysis
            6.3.2 Factor analysis
            6.3.3 Cluster analysis
            6.3.4 Other methods and summary of Section 6.3
        6.4 Spatial Correlation
            6.4.1 The estimation problem
            6.4.2 Spatial correlation and the semivariogram
            6.4.3 Some semivariogram models and physical aspects
            6.4.4 Spatial interpolations and Kriging
            6.4.5 Summary of Section 6.4
        6.5 Summary of Chapter 6
        References
        Problems

    7 Frequency Analysis of Extreme Events
        7.1 Order Statistics
            7.1.1 Definitions and distributions
            7.1.2 Functions of order statistics
            7.1.3 Expected value and variance of order statistics
            7.1.4 Summary of Section 7.1
        7.2 Extreme Value Distributions
            7.2.1 Basic concepts of extreme value theory
            7.2.2 Gumbel distribution
            7.2.3 Frechet distribution
            7.2.4 Weibull distribution as an extreme value model
            7.2.5 General extreme value distribution
            7.2.6 Contagious extreme value distributions
            7.2.7 Use of other distributions as extreme value models
            7.2.8 Summary of Section 7.2
        7.3 Analysis of Natural Hazards
            7.3.1 Floods, storms, and droughts
            7.3.2 Earthquakes and volcanic eruptions
            7.3.3 Winds
            7.3.4 Sea levels and highest sea waves
            7.3.5 Summary of Section 7.3
        7.4 Summary of Chapter 7
        References
        Problems

    8 Simulation Techniques for Design
        8.1 Monte Carlo Simulation
            8.1.1 Statistical experiments
            8.1.2 Probability integral transform
            8.1.3 Sample size and accuracy of Monte Carlo experiments
            8.1.4 Summary for Section 8.1
        8.2 Generation of Random Numbers
            8.2.1 Random outcomes from standard uniform variates
            8.2.2 Random outcomes from continuous variates
            8.2.3 Random outcomes from discrete variates
            8.2.4 Random outcomes from jointly distributed variates
            8.2.5 Summary of Section 8.2
        8.3 Use of Simulation
            8.3.1 Distributions of derived design variates
            8.3.2 Sampling statistics
            8.3.3 Simulation of time- or space-varying systems
            8.3.4 Design alternatives and optimal design
            8.3.5 Summary of Section 8.3
        8.4 Sensitivity and Uncertainty Analysis
        8.5 Summary and Discussion of Chapter 8
        References
        Problems

    9 Risk and Reliability Analysis
        9.1 Measures of Reliability
            9.1.1 Factors of safety
            9.1.2 Safety margin
            9.1.3 Reliability index
            9.1.4 Performance function and limiting state
            9.1.5 Further practical solutions
            9.1.6 Summary of Section 9.1
        9.2 Multiple Failure Modes
            9.2.1 Independent failure modes
            9.2.2 Mutually dependent failure modes
            9.2.3 Summary of Section 9.2
        9.3 Uncertainty in Reliability Assessments
            9.3.1 Reliability limits
            9.3.2 Bayesian revision of reliability
            9.3.3 Summary of Section 9.3
        9.4 Temporal Reliability
            9.4.1 Failure process and survival time
            9.4.2 Hazard function
            9.4.3 Reliable life
            9.4.4 Summary of Section 9.4
        9.5 Reliability-Based Design
        9.6 Summary for Chapter 9
        References
        Problems

    10 Bayesian Decision Methods and Parameter Uncertainty
        10.1 Basic Decision Theory
            10.1.1 Bayes' rules
            10.1.2 Decision trees
            10.1.3 The minimax solution
            10.1.4 Summary of Section 10.1
        10.2 Posterior Bayesian Decision Analysis
            10.2.1 Subjective probabilities
            10.2.2 Loss and utility functions
            10.2.3 The discrete case
            10.2.4 Inference with conditional binomial and prior beta
            10.2.5 Poisson hazards and gamma prior
            10.2.6 Inferences with normal distribution
            10.2.7 Likelihood ratio testing
            10.2.8 Summary of Section 10.2
        10.3 Markov Chain Monte Carlo Methods
        10.4 James-Stein Estimators
        10.5 Summary and Discussion of Chapter 10
        References
        Problems

    Appendix A: Further mathematics
        A.1 Chebyshev Inequality
        A.2 Convex Function and Jensen Inequality
        A.3 Derivation of the Poisson distribution
        A.4 Derivation of the normal distribution
        A.5 MGF of the normal distribution
        A.6 Central limit theorem
        A.7 Pdf of Student's T distribution
        A.8 Pdf of the F distribution
        A.9 Wilcoxon signed-rank test: mean and variance of the test statistic
        A.10 Spearman's rank correlation coefficient

    Appendix B: Glossary of Symbols

    Appendix C: Tables of Selected Distributions

    Appendix D: Brief Answers to Selected Problems

    Appendix E: Data Lists

    Index


    Dedication

    To my parents. To estimate the debt I owe them requires a lifespan of nibbanic extent. To Mali, Shani, Siraj, and Natasha. N.T.K.

    To my mother Aria, to Donatella, to the two Riccardos of my life, and to our unforgettable Rufus. R.R.


    Preface to the First Edition

    Statistics, probability, and reliability are subject areas that students of civil and environmental engineering seldom find easy. Such difficulties notwithstanding, greater emphasis is currently being placed on the teaching of these methods throughout institutions of higher learning. Many professors with whom we have spoken have expressed the need for a single textbook of sufficient breadth and clarity to cover these topics.

    One might ask why it is necessary to write a new book specifically for civil and environmental engineers. Firstly, we see a particular importance of statistical and associated methods in our disciplines. For example, some modes of failure, interactions, probability distributions, outliers, and spatial relationships that one encounters are unique and require different approaches. Secondly, colleagues have said that existing books are either old and outdated or omit particularly important engineering problems, emphasizing instead areas that may not be directly relevant to the practitioner.

    We set ourselves several objectives in writing this book. First, it was necessary to update much of the older material, which has rightly stood for decades, even centuries. Second, we had to look at the engineer's structures, waterways, and the like and bring in as much material as possible for the tasks at hand. We felt an urgent need to modernize, incorporate new concepts throughout, and reduce or eliminate the impact of some topics. We aimed to order the material in a logical sequence. In particular, we tried to adopt a writing style and method of presentation that are lively and without overrigorous drudgery. These had to be accomplished without compromising a deep and thorough treatment of fundamentals.

    The layout of the book is straightforward, so it can be used to suit one's personal needs. We apologize to any readers who think we have strayed from the path of simplicity in certain parts, such as the associated variables and contagious distributions of Chapter 3 and the order statistics of Chapter 7. One might wish to omit these sections on a first reading. The introductions to the chapters will be helpful for this purpose.

    The explanation of the theory is accompanied by the assumptions made. Definitions are separately highlighted. In many places we point out the limitations and pitfalls or violations. There are warnings of possible misuses, misunderstandings, and misinterpretations. We provide guidance to the proper interpretation of statistical results.

    The numerous examples, for which we have for the most part used recorded observations, will be helpful to beginners as well as to mature students who will consult the text as a reference. We hope these examples will lead to a better understanding of the material and design variabilities, a prelude to the making of sound decisions.

    Each chapter concludes with extensive homework problems. In many instances, as in Chapter 1, they are based on real data not used elsewhere in the text. We have not used cards or dice or coins or black and red balls in any of the problems and examples. Answers to selected problems are summarized in Appendix D. A detailed manual of solutions is available.

    Computers are continuously becoming cheaper and more powerful. Newer ways of handling data are being devised. At the inception, we seriously considered the use of commercial software packages to enhance the scope of the book. However, the problem of choosing one from the many suitable packages acted as a deterrent. Our concern was the serious limitations imposed by utilizing a source that necessitates corresponding purchase by an adopting school or by individual engineers. Besides, the calculations illustrated in the book can be made using worksheets available as standard software for personal computers. As an aid, the data in Appendix E will be placed on the Internet.

    We have utilized the space saved (from jargon and notation of a particular software, output, graphs, and tables) to widen the scope, make our explanations more thorough, and insert additional illustrations and problems. Readers also have an almost all-inclusive index, a comprehensive glossary of notation, additional mathematical explanations, and other material in the appendixes. Furthermore, we hope that the extensive, annotated bibliographies at the end of each chapter, numerous citations and tables, will make this a useful reference source.

    The book is written for use by students, practicing engineers, teachers, and researchers in civil and environmental engineering and applied statistics; female readers will find no hint of male chauvinism here. It is designed for a one- or two-semester course and is suitable for final-year undergraduate and first-year graduate students. The text is self-contained for study by engineers. A background of elementary calculus and matrix algebra is assumed.

    ACKNOWLEDGMENTS

    We acknowledge with thanks the work of the staff at Publication Services, Inc., in Champaign, IL. Gianfausto Salvadori gave his time generously in reviewing the manuscript and providing solutions to some homework problems. Thanks are due again to Adri Buishand for his elaborate and painstaking reviews. Our publisher solicited other reviewers whose reports were useful. Howard Tillotson and colleagues at the University of Birmingham, England, provided data and some student problems. Discussions with Tony Lawrance at lunch in the University Staff House and the example problem he solved at Helsinki Airport are appreciated. Valuable assistance was provided by Giovanni Solari and Giulio Ballio in wind and steel engineering, respectively. In addition, Giovanni Vannuchi was consulted on geotechnical engineering. Research staff and doctoral students at the Politecnico di Milano helped with the homework problems and the preparation of the index. Dora Tartaglia worked diligently on revisions to the manuscript. We thank the publishers, companies, and individuals who gave us permission to use their material, data, and tables; some of the tables were obtained through our own resources. We shall be pleased to have any omissions brought to our notice. The support and hospitality provided at the Università degli Studi di Pavia by Luigi Natale and others are acknowledged with thanks. Most importantly, without the patience and tolerance of our families this book could not have been completed.

    N. T. Kottegoda
    R. Rosso

    Milano, Italy
    1 July 1996


    Preface to the Second Edition

    Last year a senior European professor, who uses our book, was visiting us in Milano. When told of the revisions underway he expressed some surprise. "There is nothing to revise," he said. But all books need revision sooner or later, especially a multidimensional one. The equations, examples, problems, figures, tables, references, and footnotes are all subject to inevitable human fallibilities: typographical errors and errors of fact. Our first objective was to bring the text as close to the ideal state as possible. The second priority was to modernize.

    In Chapter 10, a new section is added on Markov chain Monte Carlo modeling; this has popularized Bayesian methods in recent years; there is a full description and case study on Gibbs sampling. In Chapter 8 on simulation, we include a new section on sensitivity analysis and uncertainty analysis; a clear and detailed distinction is made between epistemic and aleatory uncertainties; their implications in decision-making are discussed. In Chapter 7 on Frequency Analysis of Extreme Events, natural hazards and flood hydrology are updated. In Chapter 6 on regression analysis, further considerations have been made on the diagnostics of regression; there are new discussions on general and generalized linear models. In Chapter 5 on Model Estimation and Testing we give special importance to the Anderson-Darling goodness-of-fit test because of its sensitivity to departures in the tail areas of a probability density function; we make applications to nonnormal distributions using the same data as in the estimation of parameters. In Chapter 3 a section is added on the novel method of copulas with particular emphasis on bivariate distributions. We have revised the problems following Basic Probability Concepts in Chapter 2. Other chapters are also revised and modernized and the annotated references are updated.

    As before, we have kept in mind the scientific method of Claude Bernard, the French medical researcher of the nineteenth century. This had three essential parts: observation of phenomena in nature (seen in Appendix E, and in the examples and problems), observation of experiments (as reported in each chapter), and the theoretical part (clear enough for the audience in mind, but without over-simplification).

    Nobody trusts a model except the one who originated it; everybody trusts data except those who record it. Models and data are subject to uncertainty. There is still a gap between models and data. We attempt to bridge this gap.

    The title of the book has been abridged from Statistics, Probability, and Reliability for Civil and Environmental Engineers to Applied Statistics for Civil and Environmental Engineers. The applications and problems pertain almost equally to both disciplines and all areas are included.

    Another aspect we emphasized before was that the calculations illustrated in the book can be made using worksheets available as standard software for personal computers. Alternatively, R, which is now commonplace, can be downloaded free of charge and adopted to run some of the homework problems, if one so prefers. Our decision not to recommend the use of particular commercial software packages, by giving details of jargon, notation, and so on, seems to be justified. We find that a specific version soon becomes obsolete with the advent of a new version.

    A limited access solutions manual is available with the data from Appendix E on the Wiley-Blackwell website [www.blackwellpublishing.com/kottegoda].


    We are grateful for the encouragement given by many users of the first edition, and to the few who pointed out some discrepancies. We thank the anonymous reviewers for their useful comments. Gianfausto Salvadori, Carlo De Michele, Adri Buishand, and Tony Lawrance assisted us again in the revisions. Julia Burden and Lucy Alexander of Blackwell Publishing supported us throughout the project. Università degli Studi di Pavia is thanked for continued hospitality. The help provided by Fabrizio Borsa and Enrico Raiteri in the preparation of some figures is acknowledged.

    N. T. Kottegoda
    R. Rosso

    Milano, Italy
    14 September 2007


  • P1: SFK/RPW P2: SFK/RPW QC: SFK/RPW T1: SFKBLUK154-Kottegoda April 2, 2008 15:39

    Introduction

    As a wide-ranging discipline, statistics concerns numerous procedures for deriving information from data that have been affected by chance variations. On the basis of scientific experiments, one may record and make summaries of observations, quantify variations, or other changes of significance, and compare data sequences by means of some numbers or characteristics. The use of statistics in this way is for descriptive purposes. At a more sophisticated level of analysis and interpretation, one can, for instance, test hypotheses using the inferential approach developed during the twentieth century. Thus it may be ascertained, for instance, whether the change of an ingredient affects the properties of a concrete or whether a particular method of surfacing produces a longer-lasting road; this approach often includes the estimation by means of observations of the parameters of a statistical model. Then inferences can be drawn from data and predictions made or decisions taken. When faced with uncertainty, this last phase is the principal aim of a civil or environmental engineer acting as an applied statistician.

    In all activities, engineers have to cope with possible uncertainties. Observations of soil pressures, tensile strengths of concrete, yield strengths of steel, traffic densities, rainfalls, river flows, and pollution loads in streams vary from one case to the next for apparently unknown reasons or on account of factors that cannot be assessed to any degree of accuracy. However, designs need to be completed and structures, highways, water supply, and sewerage schemes constructed. Sound engineering judgment, in fact, springs from physical and mathematical theories, but it goes far beyond that. Randomness in nature must be taken into account. Thus the onus of dealing with the uncertainties lies with the engineer.

    The appropriate methods of tackling the uncertainty vary with different circumstances. The key is often the dispersion that is commonly evidenced in available data sets. Some phenomena may have negligible or low variability. In such a case, the mean of past observations may be used as a descriptor, for example, the elastic constant of a steel. Nevertheless, the consequences of a possible change in the mean should also be considered. Frequently, the variability in observations is found to be quite substantial. In such situations, an engineer sometimes uses, rather conservatively, a design value such as the peak storm runoff or the compressive strength of a concrete. Alternatively, it has been the practice to express the ability of a component in a structure to withstand a specified loading without failure or a permissible deflection by a so-called factor of safety; this is in effect a blanket to cover all possible contingencies. However, we envisage some problems here in following a purely deterministic approach because there are doubts concerning the consistency of specified strengths, flows, loads, or factors from one case to another. These cannot be lightly dismissed or easily compounded when the consequences of ignoring variability are detrimental or, in general, if the decision is sensitive to a particular uncertainty. (Often there are crucial economic considerations in these matters.) This obstacle strongly suggests that the way forward is by treating statistics and probability as necessary aids in decision making, thus coping with uncertainty through the engineering process.
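    The point that a single factor of safety can hide very different levels of risk may be illustrated with a short sketch. It is an illustration under stated assumptions rather than a procedure taken from this Introduction: resistance and load are assumed to be independent and normally distributed, the function name failure_probability is hypothetical, and the numerical values are invented purely for demonstration.

```python
from math import sqrt
from statistics import NormalDist


def failure_probability(mean_r, sd_r, mean_s, sd_s):
    """Probability that load S exceeds resistance R.

    Assumes R and S are independent normal variables (an assumption
    made for this sketch only), so the margin M = R - S is normal with
    mean (mean_r - mean_s) and standard deviation sqrt(sd_r^2 + sd_s^2).
    """
    margin = NormalDist(mean_r - mean_s, sqrt(sd_r**2 + sd_s**2))
    return margin.cdf(0.0)  # P(M < 0), i.e. failure


# Two hypothetical designs with the same ratio of means (30 / 20 = 1.5)
# but different dispersion in resistance and load.
low_variability = failure_probability(30.0, 1.5, 20.0, 2.0)
high_variability = failure_probability(30.0, 6.0, 20.0, 8.0)

print(f"low variability:  P(failure) = {low_variability:.2e}")
print(f"high variability: P(failure) = {high_variability:.2e}")
```

    Both hypothetical designs share the same ratio of mean resistance to mean load, yet under these assumptions the one with larger dispersion has a failure probability several orders of magnitude higher, which is why the dispersion in the data cannot be ignored.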

    Note that statistical methods are in no way intended to replace the physical knowledge and experience of the engineer and his or her skills in experimentation. The engineer should know how the measurements are made and recorded and how errors may arise from possible limitations in the equipment. There should be readiness to make changes and improvements so that the data-gathering process is as reliable and representative as possible. On this basis, statistics can be a complementary and a valuable aid to technology. In prudent hands it can lead to the best practical assessment of what is partially known or uncertain.

    The quantification of uncertainty and the assessment of its effects on design and implementation must include concepts and methods of probability, because statistics is built on the foundation of probability theory. In addition, decision making under risk involves the use of applied probability. Historically, probability theory arose as a branch of mathematics concerned with the analysis of certain games of chance; it consequently found applications in the measurement and understanding of uncertainty in innumerable natural phenomena and human activities. The fundamental interrelationship between statistics and probability is clearly evident in practice. As seen in past decades, there has been an irreversible change in emphasis from descriptive to inferential statistics. In this respect we must note that statistical inferences and the risk and reliability of decision making under uncertainty are evaluated through applied probability, using frequentist or Bayesian estimation. This applies to the most widely used methods. Alternatives that come under generalized information theory are now available.

    The reliability of a system, structure, or component is the complement of its probability of failure. Risk and reliability analysis, however, entail many activities. The survival probability of a system is usually stated in terms of the reliabilities of its components. The modeling process is an essential part of the analysis, and time can be an important factor. Also, the risk factor that one computes may be inherent, additional, or composite. All these points show that reliability design deserves special emphasis.

    Methods of reducing data, reviewed in Chapter 1, begin with tabulation and graphical representation, which are necessary first steps in understanding the uncertainty in data and the inherent variability. Numerical summaries provide descriptions for further analysis. Exploratory methods are followed by relationships between data observed in pairs. Thus the investigation begins. The route is long and diverse, because statistics is the science and art of experimenting, collecting, analyzing, and making inferences from data. This opening chapter provides a route map of what is to follow so that one can gain insight into the numerous tools statistics offers and realize the variety of problems that can be tackled. In Chapters 2 and 3, we develop a background in probability theory for coping with uncertainty in engineering. Using basic concepts, we then discuss the total probability and Bayes theorems and define statistical properties of distributions used for estimation purposes. Chapter 4 examines various mathematical models of random processes. There is a wide-ranging discussion of discrete and continuous distributions; joint and derived types are also given in Chapters 3 and 4; we introduce copulas that can effectively model joint distributions. Model estimation and testing methods, such as confidence intervals, hypothesis testing, analysis of variance, probability plotting, and identification of outliers, are treated in Chapter 5. The estimation and testing are based on the principle that all suppositions need to be carefully examined in light of experimentation and observation. Details of regression and multivariate statistical methods are provided in Chapter 6, along with principal component analysis and associated methods and spatial correlation. Extreme value analysis applied to floods, droughts, winds, earthquakes, and other natural hazards is found in Chapter 7; some special types of models are included. Simulation is the subject of Chapter 8, which comprises the use of simulation in design and for other practical purposes; also, we discuss sensitivity analysis and uncertainty analysis of the aleatory and epistemic types. In Chapter 9, risk and reliability analysis and reliability design are developed in detail. Chapter 10 is devoted to Bayesian and other types of economic decision making, used when the engineer faces uncertainty; we include here Markov chain Monte Carlo methods that have recently popularized the Bayesian approach.

    Chapter 1 Preliminary Data Analysis

    All natural processes, as well as those devised by humans, are subject to variability. Civil engineers are aware, for example, that crushing strengths of concrete, soil pressures, strengths of welds, traffic flow, floods, and pollution loads in streams have wide variations. These may arise on account of natural changes in properties, differences in interactions between the ingredients of a material, environmental factors, or other causes. To cope with uncertainty, the engineer must first obtain and investigate a sample of data, such as a set of flow data or triaxial test results. The sample is used in applying statistics and probability at the descriptive stage. For inferential purposes, however, one needs to make decisions regarding the population from which the sample is drawn. By this we mean the total or aggregate, which, for most physical processes, is the virtually unlimited universe of all possible measurements. The main interest of the statistician is in the aggregation; the individual items provide the hints, clues, and evidence.

    A data set comprises a number of measurements of a phenomenon such as the failure load of a structural component. The quantities measured are termed variables, each of which may take any one of a specified set of values. Because of its inherent randomness and hence unpredictability, a phenomenon that an engineer or scientist usually encounters is referred to as a random variable, a name given to any quantity whose value depends on chance.¹ Random variables are usually denoted by capital letters. These are classified by the form that their values can possibly take (or are assumed to take). The pattern of variability is called a distribution. A continuous variable can have any value on a continuous scale between two limits, such as the volume of water flowing in a river per second or the amount of daily rainfall measured in some city. A discrete variable, on the contrary, can only assume countable isolated numbers like integers, such as the number of vehicles turning left at an intersection, or other distinct values.

    Having obtained a sample of data, the first step is its presentation. Consider, for example, the modulus of rupture data for a certain type of timber shown in Table E.1.1, in Appendix E. The initial problem facing the civil engineer is that such an array of data by itself does not give a clear idea of the underlying characteristics of the stress values in this natural type of construction material. To extract the salient features and the particular types of information one needs, one must summarize the data and present them in some readily comprehensible forms. There are several methods of presentation, organization, and reduction of data. Graphical methods constitute the first approach.

    1.1 GRAPHICAL REPRESENTATION

    If a picture is worth a thousand words, then graphical techniques provide an excellent method to visualize the variability and other properties of a set of data. To the powerful interactive system of one's brain and eyes, graphical displays provide insight into the form and shape of the data and lead to a preliminary concept of the generating process. We proceed by assembling the data into graphs, scanning the details, and noting the important characteristics. There are numerous types of graphs. Line and dot diagrams, histograms, relative frequency polygons, and cumulative frequency curves are given in this section. Subsequently, exploratory methods, such as stem-and-leaf plots and box diagrams and graphs depicting a possible association between two variables, are presented in Sections 1.3 and 1.4. We begin with the simple task of counting.

    ¹ The term will be formally defined in Section 3.1.

    1.1.1 Line diagram or bar chart

    The occurrences of a discrete variable can be classified on a line diagram or bar chart. In this type of graph, the horizontal axis gives the values of the discrete variable and the occurrences are represented by the heights of vertical lines. The horizontal spread of these lines and their relative heights indicate the variability and other characteristics of the data.

    Example 1.1. Flood occurrences. Consider the annual number of floods of the Magra River at Calamazza, situated between Pisa and Genoa in northwestern Italy, over a 34-year period, as shown in Table 1.1.1.

    A flood in the river at the point of measurement means the river has risen above a specified level, beyond which the river poses a threat to lives and property. The data are plotted in Fig. 1.1.1 as a line diagram.

    The data suggest a roughly symmetrical distribution with a midlocation of four floods per year. In some other river basins, there is a nonlinear decrease in the occurrences for increasing numbers of floods in a year commencing at zero, showing a negative exponential type of variation.
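    The counting behind Example 1.1 can be sketched in a few lines of Python. The counts are those of Table 1.1.1; the variable names and the text rendering (one row of marks per value, so the diagram is rotated relative to Fig. 1.1.1) are illustrative assumptions, not part of the original text.

```python
# Flood counts from Table 1.1.1: {floods in a year: number of years}.
occurrences = {0: 0, 1: 2, 2: 6, 3: 7, 4: 9, 5: 4, 6: 1, 7: 4, 8: 1, 9: 0}

total_years = sum(occurrences.values())

# Most frequent annual count (the tallest line in the diagram).
mode_floods = max(occurrences, key=occurrences.get)

# Mean number of floods per year, each count weighted by its frequency.
mean_floods = sum(k * c for k, c in occurrences.items()) / total_years


def line_diagram(counts):
    """Return a text line diagram with one row of '|' marks per value."""
    return "\n".join(f"{k:>2} {'|' * c}" for k, c in sorted(counts.items()))


print(line_diagram(occurrences))
print(f"years = {total_years}, mode = {mode_floods}, mean = {mean_floods:.2f}")
```

    The longest row falls at four floods per year, which is the midlocation noted in the example.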

    Table 1.1.1 Number of flood occurrences per year from 1939 to 1972 at the gauging station of Calamazza on the Magra River, between Pisa and Genoa in northwestern Italyᵃ

    Number of floods in a year    Number of occurrences
    0                             0
    1                             2
    2                             6
    3                             7
    4                             9
    5                             4
    6                             1
    7                             4
    8                             1
    9                             0
    Total                         34

    ᵃ A flood occurrence is defined as river discharge exceeding 300 m³/s.

    Fig. 1.1.1 Line diagram for flood occurrences in the Magra River at Calamazza between Genoa and Pisa in northwestern Italy (horizontal axis: number of floods; vertical axis: number of occurrences).

    1.1.2 Dot diagram

    A different type of graph is required to present continuous data. If the data are few (say, less than 25 items) a dot diagram is a useful visual aid. Consider the possibility that only the first 15 items of data in Table E.1.1, which shows the modulus of rupture in N/mm² for 50 mm × 150 mm Swedish redwood and whitewood, are available. The abridged data are ranked in ascending order and are given in Table 1.1.2 and plotted in Fig. 1.1.2.

    The reader can see that the midlocation is close to 40 N/mm² but the wide spread makes this location difficult to discern. A larger sample should certainly be helpful.
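    As a rough numerical check on this reading of the dot diagram, the short sample of Table 1.1.2 can be summarized with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
from statistics import mean, median

# The 15 ranked values of Table 1.1.2 (N/mm^2).
sample = [29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
          40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35]

sample_mean = mean(sample)
sample_median = median(sample)            # the 8th of the 15 ranked values
sample_range = max(sample) - min(sample)  # spread of the short sample

print(f"mean = {sample_mean:.2f}, median = {sample_median:.2f}, "
      f"range = {sample_range:.2f} N/mm^2")
```

    Both measures of location lie near 40 N/mm², while the range is almost as large as the central value itself, which is why the midlocation is hard to discern by eye.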

    1.1.3 Histogram

    If there are at least, say, 25 observations, one of the most common graphical forms is a block diagram called the histogram. For this purpose, the data are divided into groups according to their magnitudes. The horizontal axis of the graph gives the magnitudes. Blocks are drawn to represent the groups, each of which has a distinct upper and lower limit. The area of a block is proportional to the number of occurrences in the group. The variability of the data is shown by the horizontal spread of the blocks, and the most common values are found in blocks with the largest areas. Other features such as the symmetry of the data or lack of it are also shown.

    The first step is to take into account the range r of the observations, that is, the difference between the largest and smallest values.

    Example 1.2. Timber strength. We go back to the timber strength data given in Table E.1.1. They are arranged in order of magnitude in Table 1.1.3.

    There are n = 165 observations with somewhat high variability, as expected, because timber is a naturally variable material. Here the range r = 70.22 − 0.00 = 70.22 N/mm².

    To draw a histogram, one divides the range into a number of classes or cells nc. The number of occurrences in each class is counted and tabulated. These are called frequencies.

    Table 1.1.2 The first 15 items of modulus of rupture data measuring timber strengths in N/mm², from Table E.1.1 (commencing with the top row), ranked in increasing order

    29.11  29.93  32.02  32.40  33.06  34.12  35.58  39.34
    40.53  41.64  45.54  48.37  48.78  50.98  65.35

    Fig. 1.1.2 Dot diagram for a short sample of timber strengths from Table 1.1.2 (horizontal axis: modulus of rupture, N/mm², from 25 to 70).

    The width of the classes is usually made equal to facilitate interpretation. For some work such as the fitting of a theoretical function to observed frequencies, however, unequal class widths are used. Care should be exercised in the choice of the number of classes, nc. Too few will cause an omission of some important features of the data; too many will not give a clear overall picture because there may be high fluctuations in the frequencies. A rule of thumb is to make nc = √n or an integer close to this, but it should be at least 5 and not greater than 25. Thus, histograms based on fewer than 25 items may not be meaningful. Sturges (1926) suggested the approximation

    $n_c = 1 + 3.3 \log_{10} n.$    (1.1.1)

    A more theoretically based alternative follows the work of Freedman and Diaconis (1981):²

    $n_c = \dfrac{r\, n^{1/3}}{2\, \mathrm{iqr}}.$    (1.1.2)

    Here iqr is the interquartile range. To clarify this term, we must define Q2, or the median. This denotes the middle term of a set of data when the values are arranged in ascending order, or the average of the two middle terms if n is an even number. The first or lower quartile, Q1, is the median of the lower half of the data, and likewise the third or upper quartile, Q3, is the median of the upper half of the data. This definition will be used throughout.³ Thus,

    $\mathrm{iqr} = Q_3 - Q_1.$    (1.1.3)

    ² See also Scott (1979).

    Table 1.1.3 Ranked modulus of rupture data for timber strengths in N/mm², in ascending orderᵃ

     0.00  28.00  31.60  34.44  36.84  39.21  41.75  44.30  47.25  53.99
    17.98  28.13  32.02  34.49  36.85  39.33  41.78  44.36  47.42  54.04
    22.67  28.46  32.03  34.56  36.88  39.34  41.85  44.36  47.61  54.71
    22.74  28.69  32.40  34.63  36.92  39.60  42.31  44.51  47.74  55.23
    22.75  28.71  32.48  35.03  37.51  39.62  42.47  44.54  47.83  56.60
    23.14  28.76  32.68  35.17  37.65  39.77  43.07  44.59  48.37  56.80
    23.16  28.83  32.76  35.30  37.69  39.93  43.12  44.78  48.39  57.99
    23.19  28.97  33.06  35.43  37.78  39.97  43.26  44.78  48.78  58.34
    24.09  28.98  33.14  35.58  38.00  40.20  43.33  45.19  49.57  65.35
    24.25  29.11  33.18  35.67  38.05  40.27  43.33  45.54  49.59  65.61
    24.84  29.90  33.19  35.88  38.16  40.39  43.41  45.92  49.65  69.07
    25.39  29.93  33.47  35.89  38.64  40.53  43.48  45.97  50.91  70.22
    25.98  30.02  33.61  36.00  38.71  40.71  43.48  46.01  50.98
    26.63  30.05  33.71  36.38  38.81  40.85  43.64  46.33  51.39
    27.31  30.33  33.92  36.47  39.05  40.85  43.99  46.50  51.90
    27.90  30.53  34.12  36.53  39.15  41.64  44.00  46.86  53.00
    27.93  31.33  34.40  36.81  39.20  41.72  44.07  46.99  53.63

    ᵃ The original data set is given in Table E.1.1; n = 165. Values are read down the columns. The median is 39.05 (row 15, column 5).

    Table 1.1.4 Frequency computations for the modulus of rupture data ranked in Table 1.1.3ᵃ

    Class upper limit   Class center   Absolute    Relative    Cumulative relative
    (N/mm²)             (N/mm²)        frequency   frequency   frequency (%)
     5                   2.5            1          0.006         0.61
    10                   7.5            0          0.000         0.61
    15                  12.5            0          0.000         0.61
    20                  17.5            1          0.006         1.21
    25                  22.5            9          0.055         6.67
    30                  27.5           18          0.109        17.58
    35                  32.5           26          0.158        33.33
    40                  37.5           38          0.230        56.36
    45                  42.5           34          0.206        76.97
    50                  47.5           20          0.121        89.09
    55                  52.5            9          0.055        94.55
    60                  57.5            5          0.030        97.58
    65                  62.5            0          0.000        97.58
    70                  67.5            3          0.018        99.39
    75                  72.5            1          0.006       100.00

    ᵃ The width of each class is 5 N/mm² in this example.

    Example 1.3. Timber strength. For the timber strength data of Table E.1.1, the median, that is, Q2, is 39.05 N/mm². Also Q3 and Q1 are 44.57 and 32.91 N/mm², respectively, and hence iqr = 11.66 N/mm². From the simple square-root rule, the number of classes, nc = 12.84. However, by using Eqs. (1.1.1) and (1.1.2), the numbers of classes are 8.32 and 16.52, respectively. If these are taken as 9 and 15 and the range is extended to 72 and 75 N/mm² for graphical purposes, the equal class widths become 8 and 5 N/mm², respectively. Let us use these widths. It is important to specify the class boundaries without ambiguity for the counting of frequencies; for example, in the first case, these should be from 0 to 7.99, 8.00 to 15.99, and so on. As already mentioned, the vertical axis of a histogram is made to represent the frequency and the horizontal axis is used as a measurement scale on which the class boundaries are marked. For each of these class widths, 8 and 5 N/mm², class boundaries are made and counting of frequencies is completed using Table 1.1.3; the lowest boundary is at 0 and the highest boundaries are at 72 and 75 N/mm², respectively. Table 1.1.4 gives the absolute and relative frequencies for class widths of 5 N/mm².
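    The quartiles and the three class-count rules of Example 1.3 can be reproduced with the following Python sketch. The data are the ranked values of Table 1.1.3, entered one table column per line. The function names are illustrative, and excluding the middle value from both halves when n is odd is an assumption made here because it reproduces the quartiles quoted in the example.

```python
import math

# Ranked timber strengths of Table 1.1.3 (N/mm^2), one table column per line.
strengths = [
    0.00, 17.98, 22.67, 22.74, 22.75, 23.14, 23.16, 23.19, 24.09, 24.25, 24.84, 25.39, 25.98, 26.63, 27.31, 27.90, 27.93,
    28.00, 28.13, 28.46, 28.69, 28.71, 28.76, 28.83, 28.97, 28.98, 29.11, 29.90, 29.93, 30.02, 30.05, 30.33, 30.53, 31.33,
    31.60, 32.02, 32.03, 32.40, 32.48, 32.68, 32.76, 33.06, 33.14, 33.18, 33.19, 33.47, 33.61, 33.71, 33.92, 34.12, 34.40,
    34.44, 34.49, 34.56, 34.63, 35.03, 35.17, 35.30, 35.43, 35.58, 35.67, 35.88, 35.89, 36.00, 36.38, 36.47, 36.53, 36.81,
    36.84, 36.85, 36.88, 36.92, 37.51, 37.65, 37.69, 37.78, 38.00, 38.05, 38.16, 38.64, 38.71, 38.81, 39.05, 39.15, 39.20,
    39.21, 39.33, 39.34, 39.60, 39.62, 39.77, 39.93, 39.97, 40.20, 40.27, 40.39, 40.53, 40.71, 40.85, 40.85, 41.64, 41.72,
    41.75, 41.78, 41.85, 42.31, 42.47, 43.07, 43.12, 43.26, 43.33, 43.33, 43.41, 43.48, 43.48, 43.64, 43.99, 44.00, 44.07,
    44.30, 44.36, 44.36, 44.51, 44.54, 44.59, 44.78, 44.78, 45.19, 45.54, 45.92, 45.97, 46.01, 46.33, 46.50, 46.86, 46.99,
    47.25, 47.42, 47.61, 47.74, 47.83, 48.37, 48.39, 48.78, 49.57, 49.59, 49.65, 50.91, 50.98, 51.39, 51.90, 53.00, 53.63,
    53.99, 54.04, 54.71, 55.23, 56.60, 56.80, 57.99, 58.34, 65.35, 65.61, 69.07, 70.22,
]


def median_sorted(values):
    """Median of an already ranked list."""
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3), with Q1 and Q3 the medians of the two halves.

    Assumption: for odd n the middle value belongs to neither half.
    """
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return median_sorted(lower), median_sorted(values), median_sorted(upper)


n = len(strengths)
r = strengths[-1] - strengths[0]            # range
q1, q2, q3 = quartiles(strengths)
iqr = q3 - q1                               # Eq. (1.1.3)

nc_sqrt = math.sqrt(n)                      # square-root rule
nc_sturges = 1 + 3.3 * math.log10(n)        # Eq. (1.1.1)
nc_fd = r * n ** (1 / 3) / (2 * iqr)        # Eq. (1.1.2)

print(f"n = {n}, r = {r:.2f}, Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
print(f"sqrt rule: {nc_sqrt:.2f}, Sturges: {nc_sturges:.2f}, "
      f"Freedman-Diaconis: {nc_fd:.2f}")
```

    The three rules give class counts of roughly 13, 8, and 17, which is the spread that motivates comparing two class widths in the example.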

    Rectangles are then erected over each of the classes, proportional in area to the class frequencies. When equal class widths are used, as shown here, the heights of the rectangles represent the frequencies. Thus, Figs. 1.1.3 and 1.1.4 are obtained.

    The information conveyed by the two histograms seems to be similar. The diagrams are almost symmetrical with a peak in the class below 40 N/mm² and a steady decrease on either side. This type of diagram usually brings out any possible imperfections in the data, such as the gaps at the ends. Further investigations are required to understand the true nature of the population. More on these aspects will follow in this and subsequent chapters.

    ³ There are alternatives, such as rounding (n + 1)/4 and (n + 1)(3/4) to the nearest integers to calculate the locations of Q1 and Q3, respectively. The rounding is upward or downward, respectively, when the numbers fall exactly between two integers.

    Fig. 1.1.3 Histogram for timber strength data with class width of 8 N/mm² (horizontal axis: modulus of rupture, N/mm², in classes from 0-7.99 to 72-79.99; vertical axis: relative frequency).

    1.1.4 Frequency polygon

    A frequency polygon is a useful diagnostic tool to determine the distribution of a variable. It can be drawn by joining the midpoints of the tops of the rectangles of a histogram after extending the diagram by one class on both sides. We assume that equal class widths are used. If the ordinates of a histogram are divided by the total number of observations, then a relative frequency histogram is obtained. Thus, the ordinates for each class denote the probabilities bounded by 0 and 1, by which we simply mean the chances of occurrence. The resulting diagram is called the relative frequency polygon.

    Example 1.4. Timber strength. Corresponding to the histogram of Fig. 1.1.4, the values of class center are computed and a relative frequency polygon is obtained; this is shown in Fig. 1.1.5.
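    The construction of Example 1.4 can be sketched as follows, starting from the absolute frequencies of Table 1.1.4. The variable names are illustrative, and the extra vertex on the left falls at a negative class center, which is only a drawing device for closing the polygon on the axis.

```python
# Table 1.1.4: class upper limits (N/mm^2) and absolute frequencies.
width = 5.0
upper_limits = [width * k for k in range(1, 16)]   # 5, 10, ..., 75
abs_freq = [1, 0, 0, 1, 9, 18, 26, 38, 34, 20, 9, 5, 0, 3, 1]

n = sum(abs_freq)
centers = [u - width / 2 for u in upper_limits]    # 2.5, 7.5, ..., 72.5
rel_freq = [f / n for f in abs_freq]               # ordinates of the polygon

# Vertices of the relative frequency polygon: the class centers, extended by
# one empty class on each side so that the polygon returns to the axis.
polygon = ([(centers[0] - width, 0.0)]
           + list(zip(centers, rel_freq))
           + [(centers[-1] + width, 0.0)])

for x, y in polygon:
    print(f"{x:6.1f}  {y:.3f}")
```

    Because the relative frequencies sum to one, the ordinates can be read as chances of occurrence for the respective classes.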

    Fig. 1.1.4 Histogram for timber strength data with class width of 5 N/mm² (horizontal axis: modulus of rupture, N/mm², in classes from 0-4.99 to 70-74.99; vertical axis: relative frequency).

    Fig. 1.1.5 Relative frequency polygon for timber strength data with class width of 5 N/mm² (horizontal axis: modulus of rupture, N/mm²; vertical axis: relative frequency).

    As the number of observations becomes large, the class widths theoretically tend to decrease and, in the limiting case of an infinite sample, a relative frequency polygon becomes a frequency curve. This is in fact a probability curve, which represents a mathematical probability density function, abbreviated as pdf, of the population.⁴

    1.1.5 Cumulative relative frequency diagram

    If a cumulative sum is taken of the relative frequencies step by step from the smallest class to the largest, then the line joining the ordinates (cumulative relative frequencies) at the ends of the class boundaries forms a cumulative relative frequency or probability diagram. On the vertical axis of the graph, this line gives the probabilities of nonexceedance of values shown on the horizontal axis. In practice, this plot is made by utilizing and displaying every item of data distinctly, without the necessity of proceeding via a histogram and the restrictive categories that it entails. For this purpose, one may simply determine (e.g., from the ranked data of Table 1.1.3) the number of observations less than or equal to each value and divide these numbers by the total number of observations. This procedure is adopted here.⁵

    Thus, the probability diagram, as represented by the cumulative relative frequency diagram, becomes an important practical tool. This diagram yields the median and other quartiles directly. Also, one can find the 9 values that divide the total frequency into 10 equal parts called deciles and the so-called percentiles, where the pth percentile is the value that is greater than p percent of the observations. In general, it is possible to obtain the (n − 1) values that divide the total frequency into n equal parts called the quantiles. Hence a cumulative frequency polygon is also called a quantile or Q-plot; a Q-plot, though, has quantiles on the vertical axis unlike a cumulative frequency diagram.

    Example 1.5. Timber strength. Figure 1.1.6 is the cumulative frequency diagram obtained from the ranked timber strength data of Table 1.1.3 using each item of data as just described.

    ⁴ This function is discussed in Chapter 3. One of the first tasks in applying inferential statistics, as presented in Chapters 4 and 5, will be to estimate the mathematical function from a finite sample and examine its closeness to the histogram.

    ⁵ Further aspects of this subject, as related to probability plots, are described in Chapter 5.

    Fig. 1.1.6 Cumulative relative frequency diagram for timber strength data (horizontal axis: modulus of rupture, N/mm²; vertical axis: cumulative relative frequency).

    The deciles and percentiles can be abstracted. By convention a vertical probability or proportionality scale is used rather than one giving percentages (except in duration curves, discussed shortly). The 90th percentile, for instance, is 51 N/mm² approximately and the value 40 N/mm² has a probability of nonexceedance of approximately 0.56.
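    The two readings quoted above can be approximated from the class frequencies of Table 1.1.4. The sketch below is a grouped-data approximation with linear interpolation inside each class, which is an assumption made for brevity; the diagram in Fig. 1.1.6 is built from every item of data instead, so the values agree only approximately. The function names are illustrative.

```python
from itertools import accumulate

# Table 1.1.4: class upper limits (N/mm^2) and absolute frequencies.
upper_limits = [5 * k for k in range(1, 16)]
abs_freq = [1, 0, 0, 1, 9, 18, 26, 38, 34, 20, 9, 5, 0, 3, 1]
n = sum(abs_freq)

# Class boundaries and cumulative relative frequencies, starting at (0, 0).
xs = [0] + upper_limits
cs = [0.0] + [c / n for c in accumulate(abs_freq)]


def nonexceedance(x):
    """Cumulative relative frequency at x, interpolated within a class."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    for i in range(1, len(xs)):
        if x <= xs[i]:
            frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return cs[i - 1] + frac * (cs[i] - cs[i - 1])


def quantile(p):
    """Value whose cumulative relative frequency is p, for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    for i in range(1, len(cs)):
        if cs[i] >= p:  # first class in which p is reached, so cs[i] > cs[i-1]
            frac = (p - cs[i - 1]) / (cs[i] - cs[i - 1])
            return xs[i - 1] + frac * (xs[i] - xs[i - 1])


print(f"P(X <= 40) is about {nonexceedance(40):.2f}")
print(f"90th percentile is about {quantile(0.90):.1f} N/mm^2")
```

    The same `quantile` function gives the deciles by calling it with p = 0.1, 0.2, and so on.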

    If the sample size increases indefinitely, the cumulative relative frequency diagram will become a distribution curve in the limit. This represents the population by means of a (mathematical) distribution function, usually called a cumulative distribution function, abbreviated to cdf, just as a relative frequency polygon leads to a probability density function.

    As a graphical method of ascertaining the distribution of the population, the quantile plot can be drawn using a modified nonlinear scale for the probabilities, which represents one of several types of theoretical distributions.⁶ Also, as shown in Section 1.4, two distributions can be compared using a Q-Q plot.

    1.1.6 Duration curves

    For the assessment of water resources and for associated design and planning purposes, engineers find it useful to draw duration curves. When dealing with flows in rivers, this type of graph is known as a flow duration curve. It is in effect a cumulative frequency diagram with specific time scales. The vertical axis can represent, for example, the percentage of the time a flow is exceeded; and in addition, the number of days per year or season during which the flow is exceeded (or not) may be given. The volume of flow per day is given on the horizontal axis. For some purposes, the vertical and horizontal axes are interchanged as in a Q-plot. One example of a practical use is the scaled area enclosed by the curve, a horizontal line representing 100% of the time, and a vertical line drawn at a minimum value of flow, which is desirable to be maintained in the river. This area represents the estimated supplementary volume of water that should be diverted to the river on an annual basis to meet such an objective.

    Example 1.6. Streamflow duration. Figure 1.1.7 gives the flow duration curve of the Dora Riparia River in the Alpine region of northern Italy, calculated over a period of 47 years from the records at Salbertrand gauging station. This figure is drawn using the same procedure adopted for a cumulative relative frequency diagram, such as Fig. 1.1.6. For instance, suppose it is decided to divert a proportion of the discharges above 10 m³/s and below 20 m³/s from the river. Then the area bounded by the curve and the vertical lines drawn at these discharges, using the vertical scale on the left-hand side, will give the estimated maximum amount available for diversion during the year in m³ after multiplication by the number of seconds in a day. This area is hatched in Fig. 1.1.7. If such a decision were to be implemented over a long-term basis, it would be essential to use a long series of data and to estimate the distribution function.

    ⁶ This method is demonstrated in Section 5.8.

    Fig. 1.1.7 Flow duration curve of Dora Riparia River at Salbertrand in the Alpine region of Italy (horizontal axis: daily streamflow, m³/s; left vertical axis: duration, days per year flow is exceeded; right vertical axis: percentage duration).
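    The area calculation described in Example 1.6 can be sketched numerically. The daily flows below are hypothetical values chosen for illustration, not Dora Riparia records, and the function names are likewise assumptions. The sketch relies on the fact that the area under a duration curve between two discharges equals the sum, over days, of the part of each day's flow lying between those discharges.

```python
SECONDS_PER_DAY = 86_400


def exceedance_days(flows, threshold):
    """Number of days on which the daily flow exceeds the threshold."""
    return sum(1 for q in flows if q > threshold)


def divertible_volume(flows, lower, upper):
    """Volume in m^3 of daily flow lying between two discharges in m^3/s.

    Each day contributes the part of its flow above `lower`, capped at
    `upper`. Summed over days, this is the area under the duration curve
    between the two discharges, times the number of seconds in a day.
    """
    band = upper - lower
    return sum(min(max(q - lower, 0.0), band) for q in flows) * SECONDS_PER_DAY


# Hypothetical daily mean flows (m^3/s) for a 10-day record.
flows = [4.0, 8.0, 12.0, 15.0, 25.0, 30.0, 9.0, 11.0, 18.0, 6.0]

print("days above 10 m^3/s:", exceedance_days(flows, 10.0))
print("days above 20 m^3/s:", exceedance_days(flows, 20.0))
print("divertible volume (m^3):", divertible_volume(flows, 10.0, 20.0))
```

    With a full year of daily records in place of the short hypothetical list, the same function would give the annual volume corresponding to the hatched area.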

    1.1.7 Summary of Section 1.1

    In this section we have introduced some of the basic graphical methods. Other procedures such as stem-and-leaf plots and scatter diagrams are presented in Sections 1.3 and 1.4, respectively. More advanced plots are introduced in Chapters 5 and 6. In the next section we discuss associated numerical methods.

    1.2 NUMERICAL SUMMARIES OF DATA

    Useful graphical procedures for presenting data and extracting knowledge on variability and other properties were shown in Section 1.1. There is a complementary method through which much of the information contained in a data set can be represented economically and conveyed or transmitted with greater precision. This method utilizes a set of characteristic numbers to summarize the data and highlight their main features. These numerical summaries represent several important properties of the histogram and the relative frequency polygon. The most important purpose of these descriptive measures is for statistical inference, a role that graphs cannot fulfill. Basically, there are three distinctive types: measures of central tendency, of dispersion, and of asymmetry, all of which can be visualized through the histogram as discussed in Section 1.1. The additional measure of peakedness, that is, the relative height of the peak, requires a large sample for its estimation and is mainly relevant in the case of symmetric distributions.

    1.2.1 Measures of central tendency

    Generally data from many natural systems, as well as those devised by humans, tend to cluster around some values of variables. A particular value, known as the central value, can be taken as a representative of the sample. This feature is called central tendency because the spread seems to take place about a center. The definition of the central value is flexible, and its magnitude is obtained through one of the measures of its location. There are three such well-known measures: the mean, the mode, and the median. The choice depends on the use or application of the central value.

    The sample arithmetic mean is estimated from a sample of observations, x1, x2, . . . , xn, as

    $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i.$    (1.2.1)

    If one uses a single number to represent the data, the sample mean seems ideal for the purpose. After counting, this calculation is the next basic step in statistics. For theoretical purposes the mean is the most important numerical measure of location. As stated in Section 1.1, if the sample size increases indefinitely a curve is obtained from a frequency polygon; the mean is the centroid of the area between this curve and the horizontal axis and it is thus the balance point of the frequency curve.

    The population value of the mean is denoted by μ. We reiterate our definition of population with reference to a phenomenon such as that represented by the timber strength data of Table E.1.1. A population is the aggregate of observations that might result by making an experiment in a particular manner.

    The sample mean has a disadvantage because it may sometimes be affected by unexpectedly high or low values, called outliers. Such values do not seem to conform to the distribution of the rest of the data. There may be physical reasons for outliers. Their presence may be attributed to conditions that have perhaps changed from what were assumed, or because the data are generated by more than one process. On the other hand, they may arise on account of errors of faulty instrumentation, measurement, observation, or recording. The engineer must examine any visible outliers and ascertain whether they are erroneous or whether their inclusion is justifiable. The occurrence of any improbable value requires careful scrutiny in practice, and this should be followed by rectification or elimination if there are valid reasons for doing so.

    Example 1.7. Timber strength. A case in point is the value of zero in the timber strength data of Table E.1.1. This value is retained here for comparative purposes. The mean of the 165 items, which is 39.09 N/mm², becomes 39.33 N/mm² without the value of zero.
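    The shift in the mean follows directly from Eq. (1.2.1): removing one observation x_k from a sample of n items changes both the sum and the count. A short worked check, using the rounded mean quoted in the example (so the last digit is approximate):

```latex
\bar{x}_{\text{without } x_k} = \frac{n\,\bar{x} - x_k}{n - 1}
  = \frac{165 \times 39.09 - 0}{164}
  \approx 39.33\ \text{N/mm}^2
```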

    Example 1.8. Concrete test. Table E.1.2 is a list of the densities and compressive strengths at 28 days from the results of 40 concrete cube test records conducted in Barton-on-Trent, England, during the period 8 July 1991 to 21 September 1992, and arranged in reverse chronological order.

    These have sample means of 2445 kg/m³ and 60.14 N/mm², respectively. The two numbers are measures of location representing the density and compressive strength of concrete.

    With many discordant values at the extremes, a trimmed mean, such as a 5% trimmed mean, may be calculated. For this purpose, the data are ranked and the mean is obtained after ignoring 5% of the observations from each of the two extremities (see Problem 1.16).
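    A trimmed mean can be sketched as below. The function name is illustrative, and rounding the number of trimmed observations downward is an assumption, since the text does not state a convention. The short sample of Table 1.1.2 is used; with only 15 values, 5% is less than one observation, so a 10% trim is shown as well.

```python
import math


def trimmed_mean(values, fraction):
    """Mean after dropping floor(fraction * n) values from each end."""
    ranked = sorted(values)
    k = math.floor(fraction * len(ranked))   # assumed rounding convention
    kept = ranked[k:len(ranked) - k]
    if not kept:
        raise ValueError("fraction too large: no observations left")
    return sum(kept) / len(kept)


# The 15 values of Table 1.1.2 (N/mm^2).
sample = [29.11, 29.93, 32.02, 32.40, 33.06, 34.12, 35.58, 39.34,
          40.53, 41.64, 45.54, 48.37, 48.78, 50.98, 65.35]

print(f"untrimmed mean:   {trimmed_mean(sample, 0.0):.2f}")
print(f"5% trimmed mean:  {trimmed_mean(sample, 0.05):.2f}")
print(f"10% trimmed mean: {trimmed_mean(sample, 0.10):.2f}")
```

    The 10% trim drops the smallest and largest values, so the high value of 65.35 N/mm² no longer pulls the mean upward.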

    The technique of coding is sometimes used to facilitate calculations when the data are given to several significant figures but the digits are constant except for the last few. For example, the densities in Table E.1.2 are higher than 2400 kg/m³ and less than 2500 kg/m³, so that the number 2400 can be subtracted from the densities. The remainders will retain the essential characteristics of the original set (apart from the enforced shift in the mean), thus simplifying the arithmetic.
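    In symbols, coding with a constant c leaves the spread untouched and shifts the mean by exactly c. A brief worked form, using the density mean of 2445 kg/m³ quoted in Example 1.8 (the symbol c is introduced here for illustration):

```latex
\bar{x} = c + \frac{1}{n}\sum_{i=1}^{n} (x_i - c),
\qquad c = 2400:\quad
\frac{1}{n}\sum_{i=1}^{n} (x_i - 2400) = 2445 - 2400 = 45\ \text{kg/m}^3
```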

    In considering the entire data set, a weighted mean is obtained if the values of a sample are multiplied by numbers called weights and the sum of these products is then divided by the sum of the weights. It is used if some values should contribute more (or less) to the average than others.
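    One familiar instance of a weighted mean is the mean estimated from grouped data, in which the class centers are weighted by the class frequencies. The sketch below applies this to Table 1.1.4; the function name is illustrative, and using the class center to stand for every value in its class is an approximation.

```python
def weighted_mean(values, weights):
    """Sum of value-times-weight products divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Table 1.1.4: class centers (N/mm^2) weighted by absolute frequencies.
centers = [2.5 + 5.0 * k for k in range(15)]       # 2.5, 7.5, ..., 72.5
abs_freq = [1, 0, 0, 1, 9, 18, 26, 38, 34, 20, 9, 5, 0, 3, 1]

grouped_mean = weighted_mean(centers, abs_freq)
print(f"grouped mean = {grouped_mean:.2f} N/mm^2")
```

    The result lies close to the sample mean of 39.09 N/mm² computed from the individual values, the small difference being the price of grouping.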

    The median is the central value in an ordered set or the average of the two central values if the number of values, n, is even, as specified in Section 1.1.

    Example 1.9. Concrete test. The calculation of the median and other measures of location will be greatly facilitated if the data are arranged in order of magnitude. For example, the compressive strengths of concrete given in Table E.1.2 are rewritten in ascending order in Table 1.2.1.

    The median of these data is 60.1 N/mm², which is the average of 60.0 and 60.2 N/mm².

    The median of the timber strength data of Table 1.1.3 is 39.05 N/mm², as noted in the table. The median has an advantage over the mean. It is relatively unaffected by outliers and is thus often referred to as a resistant measure. For instance, the exclusion of the zero value in Table 1.1.3 results only in a minor change of the median from 39.05 to 39.10 N/mm².

    One of the countless practical uses of the median is the application of a disinfectant to many samples of bacteria. Here, one seeks an association between the proportion of bacteria destroyed and the strength of the disinfectant. The concentration that kills 50% of the bacteria is the median dose. This is termed LD50 (lethal dose for 50%) and provides an excellent measure.

    The mode is the value that occurs most frequently. Quite often the mode is not unique because two or more sets of values have equal status. For this reason and for convenience, the mode is often taken from the histogram or frequency polygon.

    Example 1.10. Concrete test. For the ranked compressive strengths of concrete in Table 1.2.1, the mode is 60.5 N/mm².
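The median and mode quoted for the concrete strengths can be checked with Python's standard `statistics` module. This is a sketch for verification only; `multimode` (Python 3.8 or later) returns every value that attains the highest count, which makes any non-uniqueness of the mode visible.

```python
import statistics

# Compressive strengths (N/mm²) from Table 1.2.1, in ascending order.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]

# n = 40 is even, so the median averages the 20th and 21st ranked values.
median = statistics.median(strengths)

# All values sharing the highest frequency; a single entry means a unique mode.
modes = statistics.multimode(strengths)

print(f"median = {median:.1f} N/mm²")
print(f"mode(s) = {modes}")
```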

    Example 1.11. Timber strength. From Fig. 1.1.4, for example, the mode of the timber strength data is 37.5 N/mm², which corresponds to the midpoint of the class with the highest frequency. However, there is ambiguity in the choice of the class widths as already noted. On the other hand, in Table 1.1.3 there are nine values in the range 38.64 to 39.34 N/mm², and thus 39 N/mm² seems a more representative value, but this problem can only be resolved theoretically.

    As the sample size becomes indefinitely large, the modal value will correspond to the peak of the relative frequency curve on a theoretical basis. The mode may often have greater practical significance than the mean and the median. It becomes more useful as the asymmetry of the distribution increases. For instance, if an engineer were to ask a person who sits habitually on the banks of a river fishing to indicate the mean level of the river, he or she is inclined to point out the modal level. It is the value most likely to occur and it is not affected by exceptionally high or low values. Clearly, the deletion of the zero value from Table 1.1.3 does not alter the mode, as we have also seen in the case of the median.

    Table 1.2.1 Ordered data of density and compressive strength of concreteᵃ

    Order   Density (kg/m³)   Compressive strength (N/mm²)
      1          2411                 49.9
      2          2415                 50.7
      3          2425                 52.5
      4          2427                 53.2
      5          2427                 53.4
      6          2428                 54.4
      7          2429                 54.6
      8          2433                 55.8
      9          2435                 56.3
     10          2435                 56.7
     11          2436                 56.9
     12          2436                 57.8
     13          2436                 57.9
     14          2436                 58.8
     15          2437                 58.9
     16          2437                 59.0
     17          2441                 59.6
     18          2441                 59.8
     19          2444                 59.8
     20          2445                 60.0
     21          2445                 60.2
     22          2446                 60.5
     23          2447                 60.5
     24          2447                 60.5
     25          2448                 60.9
     26          2448                 60.9
     27          2449                 61.1
     28          2450                 61.5
     29          2454                 61.9
     30          2454                 63.3
     31          2455                 63.4
     32          2456                 64.9
     33          2456                 64.9
     34          2457                 65.7
     35          2458                 67.2
     36          2469                 67.3
     37          2471                 68.1
     38          2472                 68.3
     39          2473                 68.9
     40          2488                 69.5

    ᵃ The original data sets are given in Table E.1.2.

    These positive attributes of the mode and median notwithstanding, the mean is indispensable for many theoretical purposes. Also in the same class as the sample arithmetic mean, there are two other measures of location that are used in special situations. These are the harmonic and geometric means.

    The harmonic mean is the reciprocal of the mean of the reciprocals. Thus the harmonic mean for a sample of observations, $x_1, x_2, \ldots, x_n$, is defined as

    $$\bar{x}_h = \frac{1}{\dfrac{1}{n}\left[(1/x_1) + (1/x_2) + \cdots + (1/x_n)\right]}. \tag{1.2.2}$$

    It is applied in situations where the reciprocal of a variable is averaged.

    Example 1.12. Stream flow velocity. A practical example of the harmonic mean is the determination of the mean velocity of a stream based on measurements of travel times over a given reach of the stream using a floating device. For instance, if three velocities are calculated as 0.20, 0.24, and 0.16 m/s, then the sample harmonic mean is

    $$\bar{x}_h = \frac{1}{(1/3)\left[(1/0.20) + (1/0.24) + (1/0.16)\right]} = 0.19 \text{ m/s}.$$
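The arithmetic of Example 1.12 can be reproduced directly from Eq. (1.2.2). The sketch below also calls `statistics.harmonic_mean` from the Python standard library as a cross-check; the variable names are illustrative.

```python
import statistics

velocities = [0.20, 0.24, 0.16]  # m/s, from Example 1.12

# Eq. (1.2.2): reciprocal of the mean of the reciprocals.
n = len(velocities)
harmonic = 1.0 / (sum(1.0 / v for v in velocities) / n)

# Cross-check with the standard library and compare with the arithmetic mean.
harmonic_lib = statistics.harmonic_mean(velocities)
arithmetic = sum(velocities) / n

print(f"harmonic mean   = {harmonic:.2f} m/s")
print(f"arithmetic mean = {arithmetic:.2f} m/s")
```

The harmonic mean is slightly below the arithmetic mean of 0.20 m/s because the slower velocities correspond to longer travel times over the reach.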

    The geometric mean is used in averaging values that represent a rate of change. Here the variable follows an exponential, that is, a logarithmic law. For a sample of observations, $x_1, x_2, \ldots, x_n$, the geometric mean is the positive nth root of the product of the n values. This is the same as the antilog of the mean of the logarithms:

    $$\bar{x}_g = (x_1 x_2 \cdots x_n)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right) = \left(\prod_{i=1}^{n} x_i^{1/n}\right). \tag{1.2.3}$$

    Example 1.13. Population growth. Consider the case of populations of towns and cities that increase geometrically, which means that a future increase is expected that is proportional to the current population. Such information is invaluable for planning and designing urban water supplies and sewerage systems. Suppose, for example, that according to a census conducted in 1970 and again in 1990 the population of a city had increased from 230,000 to 310,000. An engineer needs to verify, for purposes of design, the per capita consumption of water in the intermediate period and hence tries to estimate the population in 1980. The central value to use in this situation is the geometric mean of the two numbers, which is

    $$\bar{x}_g = (230{,}000 \times 310{,}000)^{1/2} = 267{,}021.$$

    (Note that the sample arithmetic mean $\bar{x}$ = 270,000.)
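Both forms of Eq. (1.2.3), the root of the product and the antilog of the mean logarithm, can be evaluated for the census figures of Example 1.13. The following is a short sketch with illustrative variable names.

```python
import math

pop_1970 = 230_000
pop_1990 = 310_000

# Root-of-product form of Eq. (1.2.3) for n = 2.
geometric = math.sqrt(pop_1970 * pop_1990)

# Equivalent form: antilog of the mean of the natural logarithms.
geometric_log = math.exp((math.log(pop_1970) + math.log(pop_1990)) / 2)

arithmetic = (pop_1970 + pop_1990) / 2

print(f"geometric mean  = {geometric:,.0f}")
print(f"arithmetic mean = {arithmetic:,.0f}")
```

The logarithmic form is usually preferable for long series, since the direct product of many large values can overflow or lose precision.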

    As we see in Example 1.13, the geometric mean is less than the arithmetic mean.⁷

    1.2.2 Measures of dispersion

    Whereas a measure of central tendency is obtained by locating a central or representative value, a measure of dispersion represents the degree of scatter shown by observations or the inherent variability in a phenomenon under observation. Dispersion also indicates the precision of the data. One method of quantification is through an order statistic, that is, one of ranked data.⁸ The simplest in the category is the range, which is the difference between the largest and smallest values, as defined in Section 1.1.

    ⁷ This theoretical property is demonstrated in Example 3.10.
    ⁸ We shall discuss order statistics formally in Chapter 7; see also Chapter 5.

    Example 1.14. Timber strength. As noted before, the range of the timber strength data of Table 1.1.3 is 70.22 − 0.00 = 70.22 N/mm².

    Example 1.15. Concrete test. For the compressive strengths of concrete given in Table E.1.2 and ranked in Table 1.2.1, the range is r = 69.5 − 49.9 = 19.6 N/mm²; the range of the concrete densities is 2488 − 2411 = 77 kg/m³. These numbers provide a measure of the spread of the data in each case.

    The range, however, is a nondecreasing function of the sample size and thus characterizes the population poorly. Moreover, the range is unduly affected by high and low values that may be somewhat incompatible with the rest of the data even though they may not always be classified as outliers. For this reason, the interquartile range, iqr, which is a relatively resistant measure, is preferable. As defined in Section 1.1, in a ranked set of data this is the difference between the median of the top half and the median of the bottom half.

    Example 1.16. Concrete test. For the compressive strengths of concrete, the iqr is 6.55 N/mm².
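The range and interquartile range quoted in Examples 1.15 and 1.16 can be reproduced with the median-of-halves definition given above. In this sketch the treatment of an odd sample size (the middle value is left out of both halves) is an assumption, since the text only illustrates an even n.

```python
from statistics import median

# Compressive strengths (N/mm²) from Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]

ranked = sorted(strengths)
half = len(ranked) // 2

# Bottom and top halves of the ranked data.
# Assumption: for odd n the middle value belongs to neither half.
lower_half = ranked[:half]
upper_half = ranked[len(ranked) - half:]

q1 = median(lower_half)   # lower quartile
q3 = median(upper_half)   # upper quartile
iqr = q3 - q1
data_range = ranked[-1] - ranked[0]

print(f"range = {data_range:.1f} N/mm², iqr = {iqr:.2f} N/mm²")
```

Other quartile conventions (for example, interpolated percentiles) can give slightly different values for the same data.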

    Example 1.17. Timber strength. The timber strength data in Table 1.1.3 have an iqr of 11.66 and 11.47 N/mm², respectively, with and without the zero value. A similar and more general measure is given by the interval between two symmetrical percentiles. For example, the 90 to 10 percentile range for the timber strength data is approximately 52 − 28 = 24 N/mm² from Fig. 1.1.6.

    The aforementioned measures of dispersion can be easily obtained. However, their shortcoming is that, apart from two values or numbers equivalent to them, the vast information usually found in a sample of data is ignored. This criticism is not applicable if one determines the average deviation about some central value, thus including all the observations. For example, the mean absolute deviation, denoted by d, measures the average absolute deviation from the sample mean. For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined as

    $$d = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n} = \sum_{i=1}^{n} \frac{|x_i - \bar{x}|}{n}. \tag{1.2.4}$$

    Example 1.18. Annual rainfall. If the annual rainfalls in a city are 50, 56, 42, 53, and 49 cm over a 5-year period, the mean absolute deviation with respect to the sample mean of 50 cm is given by

    $$d = \frac{1}{5}\left(|50 - 50| + |56 - 50| + |42 - 50| + |53 - 50| + |49 - 50|\right) = 3.6 \text{ cm}.$$
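Eq. (1.2.4) translates almost word for word into code. The short sketch below repeats the rainfall calculation of Example 1.18; the function name is illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute deviation from the sample mean, Eq. (1.2.4)."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


rainfall = [50, 56, 42, 53, 49]  # cm, from Example 1.18
d = mean_absolute_deviation(rainfall)

print(f"d = {d:.1f} cm")
```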

    This measure of dispersion is easily understood and practically useful. However, it is valid only if the large and small deviations are as significant as the average deviations. There are strong theoretical reasons (as seen in Chapters 3, 4, and 5), on the other hand, for using the sample standard deviation, denoted by $\hat{s}$, which is the root mean square deviation about the mean. Indeed, this is the principal measure of dispersion (although the interquartile range is meaningful and expedient). For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined by

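    $$\hat{s} = \sqrt{\frac{1}{n}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}. \tag{1.2.5}$$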

    By expanding and summarizing the terms on the extreme right-hand side,

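    $$\hat{s} = \sqrt{\frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2\right)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}. \tag{1.2.6}$$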

    Engineers will recognize that this measure is analogous to the radius of gyration of a structural cross section. In contrast to the mean absolute deviation, it is highly influenced by the largest and smallest values. The standard deviation of the population is denoted by σ. It is common practice to replace the divisor n of Eq. (1.2.5) by (n − 1) and denote the left-hand side by s. Consequently, the estimate of the standard deviation is, on average, closer to the population value because it is said to have smaller bias. Therefore, Eq. (1.2.5) will, on average, give an underestimate of σ except in the rare case in which the population mean, μ, is known.⁹ The required modification to Eq. (1.2.6) is as follows:

    $$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} x_i^2 - \frac{n}{n-1}\bar{x}^2}. \tag{1.2.7}$$

    This reduction in n can be justified by means of the concept of degrees of freedom. It is a consequence of the fact that the sum of the n deviations $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$ is zero, which follows from Eq. (1.2.1) for the mean. Hence, regardless of the arrangement of the data, if any (n − 1) terms are specified the remaining term is fixed or known, because

    $$x_n - \bar{x} = -\sum_{i=1}^{n-1}(x_i - \bar{x}).$$

    It follows from this equation that one degree of freedom is lost in defining the sample standard deviation. The concept of degrees of freedom was introduced by the English statistician R. A. Fisher on the analogy of a dynamical system in which the term denotes the number of independent coordinate values necessary to determine the system.

    Example 1.19. Annual rainfall. From the annual rainfall data in Example 1.18 (50, 56, 42, 53, and 49 cm), one can estimate the standard deviation by using Eq. (1.2.5), as follows:

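    $$\hat{s} = \sqrt{\frac{1}{5}\left[(50 - 50)^2 + (56 - 50)^2 + (42 - 50)^2 + (53 - 50)^2 + (49 - 50)^2\right]} = \sqrt{\frac{1}{5}\left(0^2 + 6^2 + 8^2 + 3^2 + 1^2\right)} = \sqrt{\frac{110}{5}} = 4.69 \text{ cm}.$$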

    An alternative estimate of σ (which is, on average, less biased) is obtained using Eq. (1.2.7) as follows:

    $$s = \sqrt{\frac{110}{4}} = 5.24 \text{ cm}.$$
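The two estimates of Example 1.19 can be checked numerically. The sketch below evaluates the divisor-n form of Eq. (1.2.5), its shortcut form in Eq. (1.2.6), and the divisor-(n − 1) form of Eq. (1.2.7), and compares them with `pstdev` and `stdev` from the Python standard library. The variable names are illustrative.

```python
import math
import statistics

rainfall = [50, 56, 42, 53, 49]  # cm, from Example 1.18
n = len(rainfall)
mean = sum(rainfall) / n

# Sum of squared deviations about the mean (110 for these data).
ss = sum((x - mean) ** 2 for x in rainfall)

s_n = math.sqrt(ss / n)                                            # Eq. (1.2.5)
s_shortcut = math.sqrt(sum(x * x for x in rainfall) / n - mean ** 2)  # Eq. (1.2.6)
s_n1 = math.sqrt(ss / (n - 1))                                     # Eq. (1.2.7)

print(f"divisor n:     {s_n:.2f} cm")
print(f"divisor n - 1: {s_n1:.2f} cm")

# Library equivalents: pstdev divides by n, stdev by n - 1.
print(statistics.pstdev(rainfall), statistics.stdev(rainfall))
```

The shortcut form subtracts two nearly equal quantities when the mean is large relative to the spread, so the deviation form is numerically safer for data such as the concrete densities.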

    ⁹ Terms such as bias are discussed formally in Section 5.2. It is shown in Example 5.1 that s² is unbiased; however, s is known to have bias, though less than $\hat{s}$ on average.

    Example 1.20. Timber strength. By using Eq. (1.2.7), the sample standard deviation of the timber strength data of Table E.1.1 is 9.92 N/mm² (or 9.46 N/mm² if the zero value is excluded).

    Example 1.21. Concrete test. By using Eq. (1.2.7), the sample standard deviations for the density and compressive strength of concrete in Table E.1.2 are 15.99 kg/m³ and 5.02 N/mm², respectively.

    Dividing the standard deviation by the mean gives the dimensionless measure of dispersion called the sample coefficient of variation, v:

    $$v = \frac{s}{\bar{x}}. \tag{1.2.8}$$

    This is usually expressed as a percentage. The coefficient of variation is useful in comparing different data sets with respect to central location and dispersion.

    Example 1.22. Comparison of timber and concrete strength data. From the values of mean and standard deviation in Examples 1.7 and 1.20, the sample coefficient of variation of the timber strength data is 25.3% (or 24.0% without the value of zero). Similarly, from Examples 1.8 and 1.21 the density and compressive strength of concrete data have sample coefficients of 0.65 and 8.35%, respectively. The higher variation in the timber strength data is a reflection of the variability of the natural material, whereas the low variation in the density of the concrete is evidence of a uniform quality in the constituents and a high standard of workmanship, including care taken in mixing. The variation in the compressive strength of concrete is higher than that of its density. This can be attributed to random factors that influence strength, such as some subtle changes in the effectiveness of the concrete that do not alter its density.
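As a brief sketch of Eq. (1.2.8), the coefficient of variation is computed below for the rainfall data of Example 1.18 rather than for the timber or concrete data; this result does not appear in the text and is given only as an illustration. The standard deviation uses the divisor n − 1, as in Eq. (1.2.7).

```python
import statistics

rainfall = [50, 56, 42, 53, 49]  # cm, from Example 1.18

mean = statistics.mean(rainfall)
s = statistics.stdev(rainfall)   # divisor n - 1, as in Eq. (1.2.7)

# Eq. (1.2.8), expressed as a percentage.
v_percent = 100 * s / mean

print(f"coefficient of variation = {v_percent:.1f}%")
```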

    From the square of the sample standard deviation one obtains the sample variance, s², which is the mean of the squared deviations from the mean. The population variance is denoted by σ². The variance, like the mean, is important in theoretical distributions.

    By squaring Eqs. (1.2.6) and (1.2.7), two estimators of the population variance are found. Here estimator refers to a method of estimating a constant in a parent population. As in all the foregoing equations, this term means the random variable of which the estimate is a realization. An unbiased estimator is obtained from Eq. (1.2.7) because on average (that is, by repeated sampling) the estimator tends to the population variance σ². In other words, the expectation E, which is in effect the average from an infinite number of observations, of the square of the right-hand side of Eq. (1.2.7) is equal to σ².

    There are also measures of dispersion pertaining to the mean of the deviations between the observations. Gini's mean difference, for example, is a long-standing method.¹⁰ This is given by

    $$g = \frac{2}{n(n-1)} \sum_{i>j} \sum_{j=1}^{n} \left[x_{(i)} - x_{(j)}\right], \tag{1.2.9}$$

    in which the observations $x_1, x_2, \ldots, x_n$ are arranged in ascending order.
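A direct reading of Eq. (1.2.9) sums the differences over all pairs of ranked observations. The sketch below does this with `itertools.combinations` and applies it to the rainfall data of Example 1.18; the resulting value is an illustration and is not quoted in the text.

```python
from itertools import combinations


def gini_mean_difference(values):
    """Gini's mean difference, Eq. (1.2.9), over all pairs with i > j."""
    ranked = sorted(values)
    n = len(ranked)
    # combinations() yields (lower, higher) pairs because the list is sorted.
    total = sum(high - low for low, high in combinations(ranked, 2))
    return 2 * total / (n * (n - 1))


rainfall = [50, 56, 42, 53, 49]  # cm, from Example 1.18
g = gini_mean_difference(rainfall)

print(f"g = {g:.1f} cm")
```

This pairwise form needs n(n − 1)/2 differences, which is acceptable for samples of the sizes used in this chapter.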

    ¹⁰ See, for example, Stuart and Ord (1994, p. 58) for more details of this method originated by the Italian mathematician, Gini. See also Problem 1.7.

    1.2.3 Measure of asymmetry

    Another important property of the histogram or frequency polygon is its shape with respect to symmetry (on either side of the mode). The sample coefficient of skewness measures the asymmetry of a set of data about its mean. For a sample of observations, $x_1, x_2, \ldots, x_n$, it is defined as

    $$g_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{n s^3}. \tag{1.2.10}$$

    Division by the cube of the sample standard deviation gives a dimensionless measure.

    A histogram is said to have positive skewness if it has a longer tail on the right, which is toward increasing values, than on the left. In this case the number of values less than the mean is greater than the number that exceeds the mean. Many natural phenomena tend to have this property. For a positively skewed histogram,

    mode < median < mean.

    This inequality is reversed if skewness is negative. A symmetrical histogram suggests zero skewness.

    Example 1.23. Comparison of timber and concrete strength data. The coefficient of skewness of the timber strength data of Table E.1.1 and the compressive strength data of Table E.1.2 are 0.15 (or 0.53 after excluding the zero value) and 0.03, respectively. These indicate a small skewness in the first case and a symmetrical distribution in the second case.

    The example indicates that this measure of skewness is sensitive to the tails of the distribution.

    1.2.4 Measure of peakedness

    The extent of the relative steepness of ascent in the vicinity and on either side of the mode in a histogram or frequency polygon is said to be a measure of its peakedness or tail weight. This is quantified by the dimensionless sample coefficient of kurtosis, which is defined for a sample of observations, $x_1, x_2, \ldots, x_n$, by

    $$g_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{n s^4}. \tag{1.2.11}$$

    Example 1.24. Comparison of timber and concrete strength data. The kurtosis of the timber strength data of Table E.1.1 is 4.46 (or 3.57 without the zero value) and that of the compressive strengths of Table E.1.2 is 2.33. One can easily see from Eq. (1.2.11) that even a small variation in one of the items of data may influence the kurtosis significantly. This observation warrants a large sample size, perhaps 200 or greater, for the estimation of the kurtosis. Small sample sizes, particularly in the second case with n = 40, preclude the attachment of any special significance to these estimates.
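Eqs. (1.2.10) and (1.2.11) can be coded in a few lines. The sketch below assumes that s in both denominators is the standard deviation of Eq. (1.2.7), with divisor n − 1; with the divisor-n form the results would differ slightly. It is applied to the small rainfall sample of Example 1.18 purely as an illustration, and with n = 5 the values carry no statistical weight.

```python
import math


def skewness_and_kurtosis(values):
    """Sample coefficients g1 (Eq. 1.2.10) and g2 (Eq. 1.2.11).

    Assumption: s is computed with divisor n - 1, as in Eq. (1.2.7).
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    s = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))
    g1 = sum(d ** 3 for d in deviations) / (n * s ** 3)
    g2 = sum(d ** 4 for d in deviations) / (n * s ** 4)
    return g1, g2


rainfall = [50, 56, 42, 53, 49]  # cm, from Example 1.18
g1, g2 = skewness_and_kurtosis(rainfall)

print(f"g1 = {g1:.3f}, g2 = {g2:.3f}")
```

Changing a single observation in such a short list alters both coefficients markedly, which is the sensitivity the example above warns about.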

    1.2.5 Summary of Section 1.2

    Of the numerical summaries listed here, the mean, standard deviation, and coefficient of skewness are the best representative measures of the histogram or frequency polygon, from both visual and theoretical aspects. These provide economical measures for summarizing the information in a data set. Sample estimates for the data we have been discussing here, including the coefficients of variation and kurtosis, are given in Table 1.2.2.

    Table 1.2.2 Sample estimates of numerical summaries of the timber strength data of Table 1.1.3 and the concrete strength and density data of Table 1.2.1

    Data set                                   Sample   Meanᵃ    Standard     Coefficient of   Coefficient   Coefficient
                                               size              deviationᵃ   variation (%)    of skewness   of kurtosis
    Estimated by equation                               1.2.1    1.2.7        1.2.8            1.2.10        1.2.11
    Timber strength, full sample               165      39.09    9.92         25.3             0.15          4.46
    Timber strength without the zero value     164      39.33    9.46         24.0             0.53          3.57
    Compressive strength of concrete           40       60.14    5.02         8.35             0.03          2.33
    Density of concrete                        40       2445     15.99        0.65             0.38          3.15

    ᵃ Units for strength are N/mm²; units for density are kg/m³.

    1.3 EXPLORATORY METHODS

    Some graphical displays are used when one does not have any specific questions in mind before examining a data set. These methods were appropriately called exploratory data analysis by Tukey (1977). Among such procedures the box plot is advantageous, and the stem-and-leaf plot is also a valuable tool.

    1.3.1 Stem-and-leaf plot

    The histogram is a highly effective graphical procedure for showing various characteristics of data as seen in Section 1.1. However, for smaller samples, less than, say, 40 in size, it may not give a clear indication of the variability and other properties of the data. The stem-and-leaf plot, which resembles a histogram turned through a right angle, is a useful procedure in such cases. Its advantage is that the data are grouped without loss of information because the magnitudes of all the values are presented. Furthermore, its intrinsic tabular form highlights extreme values and other characteristics that a histogram may obscure. As in a histogram, the data are initially ranked in ascending order but a different approach is adopted in finding the number of classes. The class widths are almost invariably equal. For the increments or class intervals (and hence class widths) one uses 0.5, 1, or 2 multiplied by a power of 10, which means that the intervals are in units such as 0.1 or 200 or 10,000, which are more tractable than, say, 0.13 or 140 or 12,000. The terminology is best explained through the following worked example.

    Example 1.25. Concrete test. For the concrete strength data of Table E.1.2, the maximum and minimum values are 69.5 and 49.9 N/mm², respectively. As a first choice, the data can be divided into 21 classes in intervals of 1 N/mm² with lower boundaries at 49, 50, 51 N/mm², and so on, up to 69 N/mm². For the ordered stem-and-leaf plot of Fig. 1.3.1, a vertical line is drawn with the class boundaries marked in increasing order immediately to its left.

    The boundary values are called the leading digits and, together with the vertical line, constitute the stem. The trailing digits on the right represent the items of data in increasing order when read jointly with the leading digits using the indicated units. They are termed leaves, and their counts are the class frequencies. Thus the digits 49 (stem) and 9 (leaf) constitute 49.9. It is useful to provide an additional column at the extreme left, as shown here, giving the cumulative frequencies, called depths, up to each class. This is completed firstly by starting at the top and totaling downward to the line containing the median, for which the individual frequency is given in parentheses, and secondly by starting at the bottom and totaling upward to the line containing the median.

      1    49 | 9
      2    50 | 7
      2    51 |
      3    52 | 5
      5    53 | 2 4
      7    54 | 4 6
      8    55 | 8
     11    56 | 3 7 9
     13    57 | 8 9
     15    58 | 8 9
     19    59 | 0 6 8 8
    (7)    60 | 0 2 5 5 5 9 9
     14    61 | 1 5 9
     11    62 |
     11    63 | 3 4
      9    64 | 9 9
      7    65 | 7
      6    66 |
      6    67 | 2 3
      4    68 | 1 3 9
      1    69 | 5

    Fig. 1.3.1 Stem-and-leaf plot for compressive strengths of concrete in Table E.1.2; units for stem: 1 N/mm²; units for leaves: 0.1 N/mm².

    The diagram gives all the information in the data, which is its main advantage. Furthermore, the range, median, symmetry, or gaps in the data, frequently occurring values, and any possible outliers can be highlighted. In this example, a symmetrical distribution is indicated. The plot may be redrawn with a smaller number of classes, perhaps for greater clarity, using the guidelines for choosing the intervals stipulated previously. The units of data in a plot can be rounded to any number of significant figures as necessary. Also, the number of stems in a plot can be doubled by dividing each stem into two lines. When 1 multiplied by a power of 10 is used as an interval, for example, the first line, which is denoted by an asterisk (*), will thus have leaves 0 to 4, and the leaves of the second, represented by a period (.), will be from 5 to 9. Likewise, one may divide a stem into five lines. The stem-and-leaf plot is best suited for small to moderate sample sizes, say, less than 200.
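The construction described above can be sketched in Python. This minimal version fixes the stem unit at 1 and the leaf unit at 0.1, as in Fig. 1.3.1, and omits the depth column; the function name and output layout are assumptions made for illustration.

```python
from collections import defaultdict

# Compressive strengths (N/mm²) from Table 1.2.1.
strengths = [
    49.9, 50.7, 52.5, 53.2, 53.4, 54.4, 54.6, 55.8, 56.3, 56.7,
    56.9, 57.8, 57.9, 58.8, 58.9, 59.0, 59.6, 59.8, 59.8, 60.0,
    60.2, 60.5, 60.5, 60.5, 60.9, 60.9, 61.1, 61.5, 61.9, 63.3,
    63.4, 64.9, 64.9, 65.7, 67.2, 67.3, 68.1, 68.3, 68.9, 69.5,
]


def stem_and_leaf(values):
    """Return plot lines with stems in units of 1 and leaves in units of 0.1."""
    # Work in integer tenths to avoid floating-point trouble with the leaves.
    tenths = sorted(round(v * 10) for v in values)
    leaves = defaultdict(list)
    for t in tenths:
        stem, leaf = divmod(t, 10)
        leaves[stem].append(str(leaf))
    first, last = tenths[0] // 10, tenths[-1] // 10
    # Stems without leaves are kept so that gaps in the data stay visible.
    return [f"{stem:>3} | {' '.join(leaves[stem])}".rstrip()
            for stem in range(first, last + 1)]


plot = stem_and_leaf(strengths)
print("\n".join(plot))
```

Splitting each stem into two or five lines, as mentioned above, would only require a finer grouping key than the integer stem used here.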

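    [Figure: two box plots on a vertical strength axis (N/mm²) graduated from 10 to 80. Timber strength, excluding the zero value: minimum 17.98, quartiles 33.10, 39.10, and 44.57, maximum 70.22, critical values for detecting outliers 15.89 and 61.78, other high values 65.35, 65.61, and 69.07. Compressive strength of concrete: minimum 49.9, quartiles 56.8, 60.1, and 63.4, maximum 69.5, critical values for detecting outliers 46.97 and 73.23. Legend: maximum and minimum values; critical values for detecting outliers; other high values; quartiles.]

    Fig. 1.3.2 Box plots for timber strength and compressive strength of concrete data from Tables 1.1.3 and 1.2.1.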

    1.3.2 Box plot

    Another plot that is highly useful in data presentation is the box plot, which displays the three quartiles, Q1, Q2, Q3, on a rectangular box aligned either horizontally or vertically. The box, together with the minimum and maximum values, which are shown at the ends of lines extended at either side from the box from the midpoints of its ex