business statistics may module

Upload: michel-kabonga

Post on 03-Apr-2018


  • 7/28/2019 Business Statistics May Module

    1/72

    Business Statistics BCM 307

    INTRODUCTION

    Expected Learning Outcomes: At the end of the course a student should be able to:

    Demonstrate an understanding of statistics and its importance for business and management

    Demonstrate proficiency with qualitative and quantitative measures through the ability to organize and present data in tables, charts, graphs and polygons

    Demonstrate an understanding of measures of central tendency and measures of dispersion

    Demonstrate an understanding of time series and their application

    Draw scatter diagrams, and construct and apply a simple linear regression equation

    Demonstrate proficiency in contingency tables, probability concepts and their application in business

    Demonstrate some level of understanding and application of hypothesis testing

    1. Introduction

    Definition of statistics; Application of statistics in business; Terminologies used in statistics

    2. Collection, Organization and Presentation of Data

    Qualitative data: Summary tables, bar, pie and Pareto charts

    Quantitative data: Summary tables, histogram, graphs, polygons, ogive and Lorenz curve

    Time Series: Time series graphs, application and forecasting

    3. Numerical Descriptive Measures: Discrete and Continuous Data

    Measures of Central Tendency: Mode, Median, Mean

    Measures of dispersion: Significance of the measures, Range, Interquartile range, Variance, Standard deviation, Coefficient of variation

    4. Probability Distributions

    a) Discrete distribution

    b) Normal distribution (Continuous)


    Introduction

    Standard Normal Distribution

    Z-scores

    Areas to the Left and Right of x

    Calculations of Probabilities Using the Central Limit Theorem

    5. Confidence Interval

    Introduction

    Confidence Interval

    Single Population Mean, Population Standard Deviation Known

    Confidence Interval, Single Population Mean, Standard Deviation

    6. Linear Regression Model and Scatter Diagram (Simple and Multi-linear)

    Drawing scatter diagrams, describing relationships, equation of the relationship between variables

    Use the least squares method to derive the simple regression equation

    Explain the coefficients of the variables and their significance

    7. Hypothesis Testing

    Introduction

    Definition of Hypothesis Testing

    Null and Alternate

    Hypothesis Testing for the Mean

    Hypothesis Testing for the Proportion

    Support One of the Hypotheses

    Decision and Conclusion

    Course Texts:

    1. Dean, S. and Illowsky, B.; Principles of Business Statistics

    2. Berenson, M. et al.; Basic Business Statistics: Concepts and Applications. 11th edition (2009)

    3. Lucey, T.; Quantitative Techniques. 6th edition (2002)

    4. R.I. et al.; Quantitative Approaches to Management. 8th edition (1992)

    INTRODUCTION

    Statistics is a branch of mathematics that transforms numbers into useful information for decision making. It does this by providing a set of methods for analyzing the numbers.

    Statistics is therefore the science of data: it involves collecting, classifying, summarizing, organizing, analyzing and interpreting numerical information.

    Definition of Terms

    The application of statistics can be divided into two broad areas:

    Descriptive statistics

    Inferential statistics

    Descriptive statistics: It uses numerical and graphical methods to look for patterns in a data set, to summarize the information revealed and to present that information in a convenient form. This can be referred to as analysis of data.

    The data is usually presented in the form of tables, charts and graphs, and analyzed using statistics such as the mean, median, mode, variance, standard deviation and coefficient of variation.

    Inferential statistics: It uses data collected from a small group to draw conclusions about a larger group. The conclusions may be decisions, predictions or other generalizations about the larger set of data.

    Important applications can therefore be summarized as:

    Summarizing business data

    Drawing conclusions from the data

    Making reliable forecasts about business activities

    Improving business processes

    Statistics: The word statistics has two meanings:

    Numerical facts derived from the analysis of sample data, for example the mean, standard deviation and proportions. Any numerical fact can also be referred to as a statistic, e.g. number of people, number of countries, marks scored in a test etc.

    A field or discipline of study. It is a branch of mathematics that transforms numbers into useful information for decision making. It does this by providing a set of methods for analyzing the numbers. These methods help to find patterns in the numbers and so enable one to determine whether differences in the numbers are just due to chance. In this sense statistics can be seen as the science of data. It involves collecting, classifying, summarizing, analyzing and interpreting numerical information.

    Population (target population): It is the set of units that we are interested in studying and about which we need to draw a conclusion: the whole set of elements of focus. Its characteristics are called parameters, e.g. population mean, $\mu$, population standard deviation, $\sigma$, and population proportion, $p$.

    Sample: It is the portion or subset of the population that is selected for analysis. The sample is randomly selected so that it reflects the characteristics of the population. Its characteristics are called statistics, e.g. sample mean, $\bar{x}$, sample standard deviation, $s$, and sample proportion, $\hat{p}$.

    Representative sample: A sample is representative if it exhibits the typical characteristics possessed by the population of interest (the target population). The most common way to satisfy the representative sample requirement is to use methods that allow us to select a random sample, that is, to give every element an equal chance of being selected.

    Random sample: A random sample is selected from the population in such a way that every element, and every possible sample of a given size, has an equal chance of selection.

    Element/member: An element of a sample or population is a specific subject or object about which information is collected, e.g. a firm, a country, a person, a university etc.

    Variable: It is a characteristic under study that assumes different values for different elements. For example, scores in a test: different scores are expected for different students when they do a given test. In the case of a firm, the profit made at different times may be different; people's tastes for a given product may be different etc.

    Observation/measurement: The value of a variable for an element, e.g. 120 cm as a height, a yes for an opinion, Sh. 20 000 as an income.

    Data set: A collection of observations on one or more variables.
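    The idea of a random sample defined above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered firms, the function name `simple_random_sample` and the seed are assumptions made for the example, not part of the course material.

```python
import random


def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct elements; every element has an equal chance of selection."""
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return rng.sample(population, size)


# Hypothetical population: 100 firms identified by number.
firms = list(range(1, 101))
sample = simple_random_sample(firms, 10, seed=1)
print(sample)
```

    Sampling is done without replacement here, so no firm can appear twice in the sample.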

    Types of variables

    A variable may be classified as qualitative or quantitative.

    Qualitative Variable

    These are characteristics that cannot be measured on a natural numerical scale but can only be classified into groups or categories.

    Data from a categorical variable are measured on a nominal or an ordinal scale.

    Nominal scale: It classifies data into distinct categories that cannot be ranked. For example gender (female or male), preference for a product or a service (e.g. a soft drink), a yes or no response etc. This is the weakest form of measurement.

    Ordinal scale: It classifies data into distinct categories that can be ranked, e.g. responses such as excellent, very good, fair, poor etc. Though it is possible to rank the categories, the scale is still weak in that the size of the difference between categories cannot be measured.

    Quantitative Variable

    The measurements are recorded on a naturally occurring numerical scale. The scale can be either an interval or a ratio scale. On both scales the values can be ranked and the difference between two values can be calculated and interpreted.

    Interval scale: Differences between values are meaningful but ratios are not. For example, a student who scores 100% is not twice as intelligent as one who scores 50%.

    Ratio scale: It refers to data whose values can also be compared as ratios. This includes data that supports all arithmetic operations (addition, subtraction, multiplication and division). For example sales of a company, incomes of families, returns from an investment etc.

    2. COLLECTION, ORGANIZATION AND PRESENTATION OF DATA

    The first step is to identify the type of data one wants to collect: quantitative or qualitative. The second step is to devise a suitable method for collecting the data. Various methods are used to collect data, for example surveys, designed experiments and observational studies.

    Survey: The researcher samples a group of people and asks them questions, from which responses are obtained. Some tools used in the data collection are questionnaires, mail, and telephone or in-person interviews.

    Experiment: Designed experiments normally involve strict control over the elements in the study. Two groups are formed: a treatment (experimental) group and a control group.

    Observational study: The researcher observes the elements in their natural setting and records the variables of interest.

    Regardless of the data collection method, it is likely that the data will come from a sample. Data is classified as primary or secondary. Primary data is collected by the person analyzing the data, while secondary data is obtained by the analyst from publications such as books, journals, newspapers etc.

    Organization and Presentation of Data

    After data is collected it is cleaned to remove errors and unnecessary entries from the records. It is then ready for analysis, the process that transforms the raw data into meaningful information which the analyst can use in the decision making process. The analysis process includes organization, presentation and description of the data, and inference from the results obtained from the sample to make generalizations about the population.

    The techniques or methods used to present and analyze data depend on the type of data that was collected: qualitative or quantitative.

    Qualitative Data

    The data can be organized and presented in:

    Summary tables: frequency, relative frequency or percentage frequency tables

    Bar charts

    Pie charts

    Pareto charts

    Example 1

    A sample was taken of 25 high school seniors who were planning to join college. The following are the categories of majors they intended to choose: business (BUS), economics (ECON), management information science systems (MIS), behavioral science (BS) and others.

    The responses of the students when asked their choice are listed below:

    ECON   MIS    ECON   BUS    BUS
    BUS    BUS    other  other  other
    other  BS     MIS    other  MIS
    ECON   BUS    MIS    BUS    other
    BS     MIS    other  other  other

    Required:

    Organize and present the data by constructing a

    i. Frequency distribution table

    ii. Bar chart

    iii. Pie chart

    Solution

    The above data is measured on a categorical basis. The analyst collected the data by identifying the major each student intended to choose from the above categories.

    i. Frequency Distribution Table:

    This is a summary table that organizes the raw data into three columns, as demonstrated below:

    Categories   Tally        Frequency
    BUS          //// /       6
    ECON         ///          3
    MIS          //// /       6
    BS           //           2
    Other        //// ///     8
    Total                     25

    To make sure that all the responses in each category are included, the student goes through the raw data, putting a slash through each response and recording it as a tally in the tally column.

    The tallies are recorded as slashes, bundled in groups of five, and the frequency is the numerical count of the tallies. To ensure that all items were considered, the student must write the total frequency as shown above; it must match the sample size of the data set.

    The result in the frequency distribution table gives us the number of students who took

    that particular major. From the number or frequency we can identify the categories

    with the highest number of students, or the least and generally we can describe how

    the frequency is distributed among the different categories.

    A relative frequency distribution table can also be used as a summary table. In this case an additional column gives the relative frequency of each category, obtained by dividing its frequency by the total frequency:

    Categories   Tally        Frequency   Relative frequency
    BUS          //// /       6           6/25 = 0.24
    ECON         ///          3           3/25 = 0.12
    MIS          //// /       6           0.24
    BS           //           2           0.08
    Other        //// ///     8           0.32
    Total                     25          1.00

    The total relative frequency is equal to 1.00. From this table we obtain the relative frequency, or proportion, of the students in each category.

    ii. Percentage frequency distribution Table

    The summary table can also show percentage frequencies, with a column added to represent them:

    Categories   Tally        Frequency   Percentage frequency
    BUS          //// /       6           6/25 × 100 = 24
    ECON         ///          3           12
    MIS          //// /       6           24
    BS           //           2           8
    Other        //// ///     8           32
    Total                     25          100

    The column has a total of 100. Percentages are a simpler way of expressing proportions.

    NB: To develop either a relative frequency distribution table or a percentage frequency distribution table, one must first construct the frequency distribution table. It is advisable to put all of them in the same table.
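    As a check on the arithmetic, the relative and percentage frequencies can be computed from the frequency column. The sketch below is illustrative only: the frequencies are copied from the table in Example 1, and the function name `summary_table` is an assumption made for the example.

```python
# Frequencies copied from the frequency distribution table (Example 1).
frequencies = {"BUS": 6, "ECON": 3, "MIS": 6, "BS": 2, "Other": 8}


def summary_table(freqs):
    """Return (category, frequency, relative frequency, percentage) rows."""
    total = sum(freqs.values())
    return [(cat, f, f / total, 100 * f / total) for cat, f in freqs.items()]


rows = summary_table(frequencies)
for cat, f, rel, pct in rows:
    print(f"{cat:<6}{f:>4}{rel:>8.2f}{pct:>7.0f}")
print(f"{'Total':<6}{sum(frequencies.values()):>4}")
```

    All three columns come from one pass over the frequencies, which mirrors the advice to build them in the same table.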

    iii) Bar Chart

    The bar chart presents the data using a horizontal and a vertical axis. The horizontal axis takes the categories, which are represented by bars of equal width separated from each other by uniform spaces. The frequency (or relative frequency or percentage) is shown on the vertical axis.

    The scale of the vertical axis is determined by the highest frequency among the categories. The scale must be easy to construct and read.

    The vertical axis may also use relative frequencies or percentages; the scale must be selected to include the highest relative frequency or percentage value. The bar chart is a good presentation of the data from which various information can be drawn. For example, the BUS and MIS bars are equal in height: the numbers of students choosing business and MIS majors are equal.

    The category "other" is the highest, which means that other courses not distinctly identified are also offered. Behavioral science has the least number of students. Bar charts are best suited for comparing different categories by checking the heights of the bars.

    iv) Pie Chart

    The information could also be presented on a pie chart. The chart represents each category by a sector whose size reflects its proportion. The relative frequencies (RF) are converted to degrees by multiplying by 360.

    Categories   RF     Degrees (RF × 360)
    BUS          0.24   86.4
    ECON         0.12   43.2
    MIS          0.24   86.4
    BS           0.08   28.8
    Others       0.32   115.2
    Total        1.00   360

    Then draw the pie chart to show the different categories as sectors of different sizes.

    The pie chart makes it easy to identify the sizes of the different portions and to draw conclusions.
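    The conversion to degrees can be verified with a short calculation. This is a minimal sketch using the Example 1 frequencies; the function name `pie_angles` is an assumption made for illustration.

```python
# Frequencies copied from the frequency distribution table (Example 1).
frequencies = {"BUS": 6, "ECON": 3, "MIS": 6, "BS": 2, "Other": 8}


def pie_angles(freqs):
    """Sector angle in degrees for each category: relative frequency x 360."""
    total = sum(freqs.values())
    return {cat: round(f * 360 / total, 1) for cat, f in freqs.items()}


angles = pie_angles(frequencies)
for cat, angle in angles.items():
    print(f"{cat:<6}{angle:>7.1f}")
```

    The angles should always add up to 360 degrees, which is a quick way to catch an arithmetic slip before drawing the chart.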

    v) Pareto charts

    Pareto charts classify categories into the "vital few" and the "trivial many".

    The Pareto principle holds when the majority of items in a set of data occur in a small number of categories and the few remaining items are spread out over a larger number of categories.

    The separation helps to identify and focus on the important categories.

    Example 2

    Hotel XYZ samples complaints about its hotel rooms and categorizes them as in the table below. The sample that gave the responses was made up of 106 customers. The table summarizes the complaint categories and the number of customers that complained about each issue.

         Reasons for complaint            Number of customers
    A    Dirty room                       32
    B    Not stocked                      17
    C    Not ready                        12
    D    Too noisy                        10
    E    Needs maintenance                17
    F    Has too few beds                 9
    G    Doesn't have promised features   7
    H    No special accommodation         2

    Required

    a. Construct a Pareto chart

    b. Which reasons for complaint do you think the hotel managers should focus on if they want to reduce the number of complaints? Explain.

    c. Also construct a pie chart and a bar chart and compare the suitability of each chart for presenting this data

    Solution

    Order the categories from the one with the highest frequency to the one with the lowest, convert the frequencies to percentages and show them in a column. Also add columns for cumulative frequency and cumulative percentage. The categories can be identified with symbols to avoid a lot of writing. Let the categories in the question be labelled A to H before rearrangement. Arranging them in order from the highest frequency to the lowest gives:

    A B E C D F G H

    Reasons   Frequency   Cumulative frequency   %      Cumulative %
    A         32          32                     30.2   30.2
    B         17          49                     16.0   46.2
    E         17          66                     16.0   62.2
    C         12          78                     11.3   73.5
    D         10          88                     9.4    82.9
    F         9           97                     8.5    91.4
    G         7           104                    6.6    98.0
    H         2           106                    2.0    100
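    The ordering and the cumulative columns of a Pareto table can be sketched as follows. The complaint counts are taken from Example 2; the function name `pareto_table` is an assumption. Note that this sketch computes each cumulative percentage from the cumulative frequency, so the last decimal may differ slightly from a table built by adding already rounded percentages.

```python
# Complaint counts from Example 2, keyed by the labels A to H.
complaints = {"A": 32, "B": 17, "C": 12, "D": 10, "E": 17, "F": 9, "G": 7, "H": 2}


def pareto_table(freqs):
    """Rows of (label, frequency, cumulative frequency, %, cumulative %), highest first."""
    total = sum(freqs.values())
    # sorted() is stable, so tied categories (B and E) keep their original order.
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for label, f in ordered:
        running += f
        rows.append((label, f, running,
                     round(100 * f / total, 1),
                     round(100 * running / total, 1)))
    return rows


table = pareto_table(complaints)
for row in table:
    print(*row, sep="\t")
```

    Reading down the cumulative percentage column shows how few categories account for most of the complaints, which is the point of the "vital few" separation.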

    Summary:

    Summary tables together with charts are used to describe the proportion of items of interest in each category.

    Each chart best suits certain situations, for example:

    A bar chart is more suitable for comparing the sizes of categories, especially when they are not many in number; in our case, not more than six. If there are more than that, the chart becomes crowded.

    Pie charts are best suited to situations where the main objective is to investigate the portion a category occupies in relation to the whole. Coloring the portions with different colors enhances the display. It is also best for few categories.

    A Pareto chart sorts the frequencies in descending order and provides the cumulative curve on the same graph. This allows the viewer to see which categories matter most in the given situation. The chart allows presentation of many categories, including those with small differences in percentage, because the curve makes it easy to identify the additional proportion contributed by each added category.


    Quantitative Data

    These are measurements that are recorded on a naturally occurring numerical scale. They are measured on an interval or ratio scale as explained earlier.

    Quantitative data can be organized and presented in a number of ways that include:

    Ordered array

    Stem-and-leaf display

    Summary tables

    Histogram

    Frequency polygons

    The cumulative percentage polygon: ogive

    Quantitative data can be either discrete or continuous.

    Discrete data: It is a variable whose values are countable, i.e. they assume whole number values, e.g. number of persons, cars, companies etc.

    Continuous data: It is a variable that can assume any numerical value over a certain interval or intervals, e.g. time taken to serve a customer in a bank, amount of money, height of individuals etc.

    Discrete data:

    It can be organized and presented in:

    Ordered array

    Stem-and-leaf display

    Bar chart

    Summary tables

    Example 1

    The following data represents the stock prices of 25 companies.

    31 15 13 17 23
    16 22 12 23 30
    22 18 33 21 18
    13 26 16 26 27
    22 27 20 20 22

    Required: Construct

    i. Ordered array

    ii. Stem-and-leaf display

    i) Ordered Array:

    This requires that the data be written in ascending or descending order.

    12 13 13 15 16 16 17 18 18

    20 20 21 22 22 22 22 23 23

    26 26 27 27 30 31 33

    An ordered array is most applicable when the data set is not large.

    ii) Stem-and-leaf display:

    It splits each value into a stem (the leading part: one, two or three digits, depending on the nature of the data) and a leaf (the remaining digits).

    Since the above data values have two digits, the tens digit forms the stem and the ones digit the leaf. The stems 1, 2 and 3 represent the tens, twenties and thirties.

    Stem | Leaf
    1    | 2 3 3 5 6 6 7 8 8
    2    | 0 0 1 2 2 2 2 3 3 6 6 7 7
    3    | 0 1 3

    The ones digits are written after the appropriate tens digit. From the display, values in the twenties are the most frequent and values in the thirties the least.
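    Both displays for the stock price data can be produced with a short script. This is a sketch under simple assumptions: the function name `stem_and_leaf` and its `leaf_unit` parameter are made up for the example, with `leaf_unit=10` meaning the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict

# Stock prices of the 25 companies (Example 1).
prices = [31, 15, 13, 17, 23, 16, 22, 12, 23, 30, 22, 18, 33,
          21, 18, 13, 26, 16, 26, 27, 22, 27, 20, 20, 22]


def stem_and_leaf(values, leaf_unit=10):
    """Map each stem (value // leaf_unit) to its sorted leaves (value % leaf_unit)."""
    display = defaultdict(list)
    for value in sorted(values):
        display[value // leaf_unit].append(value % leaf_unit)
    return dict(display)


ordered_array = sorted(prices)
print(ordered_array)
for stem, leaves in stem_and_leaf(prices).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

    Counting the leaves on each stem gives the number of values in the tens, twenties and thirties, and the total number of leaves must equal the sample size of 25.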

    Example 2

    The following data represent the monthly rents paid by a sample of 30 households selected from a city.

    429 732 550 1020 750

    540 956 1070 871 880

    650 950 780 900 750

    585 675 989 620 660

    578 1030 930 765 975

    1020 840 870 800 820

    Solution:

    The values contain either three or four digits. We can take a one-digit stem for the three-digit numbers and a two-digit stem for the four-digit numbers. The leaf is then a two-digit number.

    The stem-and-leaf display does not require the data to be arranged in order first. If the data is arranged, the pattern obtained is the same.

    Stem-and-leaf display

     4 | 29
     5 | 85 50 40 78
     6 | 75 20 60 50
     7 | 32 50 65 80 50
     8 | 71 80 40 70 00 20
     9 | 89 56 30 75 50 00
    10 | 20 30 70 20

    By looking at the stem-and-leaf display we can observe how the data values are distributed. The stem-and-leaf display does not lose the information on individual observations or measurements.

    Example 3

    The following data give the number of computer courses taken by 30 business majors who recently graduated from a university.

    2 3 2 3 1 4 2 2 3 4

    2 3 4 1 2 3 2 1 4 2

    1 2 3 1 1 3 2 2 4 1

    Required

    a. Prepare a frequency distribution table.

    b. Compute the relative frequency and percentage distributions.

    c. Draw a bar graph for the frequency distribution.

    d. What percentage of the graduates took 2 or 3 computer courses?

    Solution

    Identify all the distinct values present in the data set: 1, 2, 3 and 4.

    Construct the summary table with the columns: number of courses, tally and frequency. The relative and percentage frequency distributions can be included in the same table.

    Number of courses   Tally          Frequency (f)   Relative frequency (f/30)   Percentage frequency (× 100)
    1                   //// //        7               0.2333                      23.33
    2                   //// //// /    11              0.3667                      36.67
    3                   //// //        7               0.2333                      23.33
    4                   ////           5               0.1667                      16.67
    Total                              30              1.000                       100

    Bar graph (chart): the numbers of courses (1 to 4) are placed on the horizontal axis and the frequencies on the vertical axis.

    The percentage of graduates who took 2 or 3 courses is the total of those who took 2 and those who took 3: (36.67 + 23.33)% = 60%
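    The frequency count and the 60% answer can be reproduced from the raw data. The sketch below uses `collections.Counter` from the Python standard library; the helper name `percentage_taking` is an assumption made for the example.

```python
from collections import Counter

# Number of computer courses taken by the 30 graduates (Example 3).
courses = [2, 3, 2, 3, 1, 4, 2, 2, 3, 4,
           2, 3, 4, 1, 2, 3, 2, 1, 4, 2,
           1, 2, 3, 1, 1, 3, 2, 2, 4, 1]

counts = Counter(courses)  # frequency of each distinct value


def percentage_taking(counts, values):
    """Percentage of observations whose value is one of `values`."""
    total = sum(counts.values())
    return 100 * sum(counts[v] for v in values) / total


for value in sorted(counts):
    print(value, counts[value], round(counts[value] / len(courses), 4))
print(percentage_taking(counts, [2, 3]))
```

    Working from the counts rather than from rounded percentages avoids small rounding differences when categories are combined.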

    Grouped/Continuous data

    Discrete data can be presented like categorical data in a bar graph, where the numbers take the horizontal axis. Frequency, relative frequency and percentage distribution tables can be constructed as for categorical variables, with each discrete data value standing as a category. For grouped data, however, the frequencies, relative frequencies and percentages are assigned to an interval of numbers in the table.

    A stem-and-leaf display may not be very applicable, and in place of a bar chart grouped data is presented in a histogram.

    Sometimes it becomes necessary to look at the values in a data set in the form of classes or groups. Each class gives the total number of values that fall within a given range. One must identify the class width, that is, the range of values accommodated in each class.

    Number of classes or groups: there should be neither too few (not less than 3 classes) nor too many (not more than 10) in the context of our class work. In real life, however, we may have data grouped into many more classes.

    This is necessary because we are interested in presenting data in a more organized and easily interpretable form.

    Example 1

    The data on the stock prices of 25 companies:

    31 15 13 33 23 16 12 12 23 26 22 18 27 21 18 13 26 16 17 27 22 22 26 20 30

    To group the data we can choose a class width of 4 or 5. If we choose 4, the approximate number of classes will be 25/4 = 6.25, or about 6. If we choose 5, there will be 25/5 = 5 classes. Either can be used.

    Let us use a class width of 5. Identify the smallest value, which is 12. The first class can start at this lowest value, or we can decide to start at 10. Starting at 10, the first class contains 10, 11, 12, 13, 14, i.e. 10-14. The next class contains 15, 16, 17, 18, 19, i.e. 15-19, followed by 20-24 and so on until all the values are included. The lowest and highest values in each class are included in the interval.

    The above classes can also be written as 10 to less than 15, 15 to less than 20 etc. In this style the upper value of each class is not included. In each case the class width is five. Be careful to use each style correctly.

    The summary table: We can consider including the relative frequency distribution and the percentage distribution in the table.

    Grouped data can be presented in a

    i. summary table

    ii. histogram

    iii. frequency polygon

    iv. cumulative frequency curve (ogive)

    i) Summary table

    Class   Tally        Frequency   Relative frequency (f/25)   % frequency (× 100)   Cumulative %
    10-14   ////         4           0.16                        16                    16
    15-19   //// /       6           0.24                        24                    40
    20-24   //// ///     8           0.32                        32                    72
    25-29   ////         4           0.16                        16                    88
    30-34   ///          3           0.12                        12                    100
    Total                25          1.00                        100

    ii) Histogram

    This is a graph in which the classes are marked on the horizontal axis. The classes are written using class boundaries: 0.5 is subtracted from the lower limit of each class and 0.5 is added to the upper limit, giving 9.5 to 14.5, 14.5 to 19.5, 19.5 to 24.5 etc.

    The vertical axis takes the frequency, relative frequency or percentage frequency. The scale must include the highest frequency, in this case 8.

    Draw bars with heights corresponding to the frequency in each class, making sure that the bars are adjacent (touching) because the data is continuous and any value can occur. The information that can be obtained from a histogram is much like that from a bar chart for discrete or qualitative data. Like a stem-and-leaf display, a histogram can also show the distribution pattern of the data.

    Data can be normally or approximately normally distributed, or skewed, and a histogram displays this information well.
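    The columns of the grouped summary table, together with the class boundaries used for the histogram and the class midpoints, can be derived from the class limits and frequencies. The sketch below takes the class frequencies from the summary table as given; the function name `grouped_summary` and the dictionary keys are assumptions made for illustration.

```python
from itertools import accumulate

# Class limits and frequencies taken from the grouped summary table.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]
class_frequencies = [4, 6, 8, 4, 3]


def grouped_summary(classes, freqs):
    """One dictionary per class with boundaries, midpoint and percentage columns."""
    total = sum(freqs)
    rows = []
    for (lower, upper), f, cumulative in zip(classes, freqs, accumulate(freqs)):
        rows.append({
            "class": f"{lower}-{upper}",
            "boundaries": (lower - 0.5, upper + 0.5),  # used for histogram bars
            "midpoint": (lower + upper) / 2,           # class mark
            "frequency": f,
            "percent": 100 * f / total,
            "cumulative_percent": 100 * cumulative / total,
        })
    return rows


for row in grouped_summary(classes, class_frequencies):
    print(row)
```

    The last cumulative percentage must equal 100, which is a useful check that no class has been missed.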


    iii) Frequency polygon

    It is formed by plotting the midpoint of each class against its frequency and joining the points with straight lines. The polygon can be drawn on the histogram by marking the middle of the top of each bar and joining the points. It is also used effectively to display the pattern of the data across the classes.

    iv) Cumulative frequency graph: Ogive

    The graph is drawn by plotting the upper limit of each class against the cumulative frequency, cumulative relative frequency or cumulative percentage. It can be used to read off, for example, the proportion of companies that had their stock prices between 15 and 22.

    We may also want to know the number of companies whose stock prices were 27 and below.

    Locate 27 along the price axis and draw a vertical line to meet the curve. Then draw a horizontal line to read the frequency, which is 21. Therefore 21 companies had stock prices of 27 and below, and 4 companies had stock prices above 27.

    Figure: The cumulative frequency curve (ogive)

    Lorenz curve: It is a special ogive that can be used to plot either the income or the wealth of a country against its population. It shows how wealth is distributed in a given country. Many Lorenz curves form a long S, showing some level of unequal distribution of wealth among citizens.

    Tax policies can be used to level out the inequality by charging higher tax rates for the wealthier and lower rates for the less wealthy population. For a perfectly equal distribution the long S shape becomes a straight line, an ideal situation; the more equitably wealth is distributed, the nearer the shape is to a straight line.

    The curve is drawn with the cumulative percentage of the population on the vertical axis and the amount of wealth or income on the horizontal axis.

    3. NUMERICAL DESCRIPTIVE MEASURES AND ANALYSIS

    The descriptive measures can be classified as:

    - Measures of central tendency: mode, median and mean

    - Measures of dispersion or spread: range, variance, standard deviation, semi-interquartile range, coefficient of variation etc.

    Measures of Central Tendency

    These are summary measures that give averages. The measures of central tendency can be calculated for discrete (ungrouped) or continuous (grouped) data.

    Discrete data:

    a. Mode: this is the most popular or common item in the data set. It is the value with the highest frequency. A data set can be unimodal (one mode), bimodal (two modes) or multimodal. Example:

    29 31 35 39 39 40 43 44 44

    The above set is bimodal, with modes 39 and 44.

    b. Median: it is the value of the middle term in a data set that has been ranked in ascending or descending order. The position of the median is identified as:

    $$\frac{N+1}{2}$$

    where $N$ is the total frequency.

    The median in the above data is at position $\frac{9+1}{2} = 5$, i.e. the 5th value, which is 39.

    Example 1

    23 36 210 249 257 506 385 13 50 97 234 275

    Find the median

    Solution

    Arrange the data in ascending order

    13 23 36 50 97 210 234 249 257 275 385 506

    Middle position $= \frac{n+1}{2} = \frac{13}{2} = 6.5$

    The median lies between the 6th and 7th positions: $\frac{210 + 234}{2} = 222$

    The advantage of using the median as a measure of central tendency is that it is not influenced by outliers. It is preferred to the mean for data sets that contain outliers.

    Outliers are the few figures in the data that have extreme values compared with the rest: either very low or very high.

    Mean (arithmetic mean)

    It is the most frequently used measure of central tendency. It is the sum of all values divided by the total frequency. The mean is preferred in that it represents the whole data set from which it is computed.

    Sample data: $\bar{x} = \frac{\sum x}{n}$, where n = sample size

    Population data: $\mu = \frac{\sum x}{N}$, where N = population size

    Example 2

    The following data give the profits (in thousand dollars) of a sample of five companies in a given year.

    4725 1884 3807 4939 and 1625

    x̄ = Σx/n = 16980/5 = 3396

    The average profit of the five companies is 3,396,000 dollars. A major shortcoming of the mean is that it is very sensitive to outliers.

    Example 3

    The following data give the number of years eight employees have been with their current employers.

    11 9 13 12 8 9 24 10

    a) Identify the outlier.

    b) What would be the mean if the outlier was i) excluded ii) included?

    Solution

    a) The outlier is 24, which is an extreme number of years compared with the rest of the employees.

    b) i) Mean excluding 24:

    (11 + 9 + 13 + 12 + 8 + 9 + 10)/7 = 72/7 = 10.286

    ii) Mean including 24:

    (11 + 9 + 13 + 12 + 8 + 9 + 10 + 24)/8 = 96/8 = 12

    The one extreme value changes the mean by almost 2 units, i.e. from 10.286 to 12 (a difference of 1.714).

    The mean is very sensitive to outliers. For example, the mean mark of a BCM 307 test can easily be affected by a few very poorly performing students or a few very well performing students. The mean may then not accurately represent the whole class.
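The effect of the outlier can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the module text. It compares the mean and the median of the years-of-service data with and without the value 24, and lists the modes of the earlier nine-value data set.

```python
from statistics import mean, median, multimode

# Years of service for the eight employees in Example 3.
years = [11, 9, 13, 12, 8, 9, 24, 10]
without_outlier = [y for y in years if y != 24]

mean_with = mean(years)                    # 96 / 8
mean_without = mean(without_outlier)       # 72 / 7
median_with = median(years)
median_without = median(without_outlier)

print(f"mean:   {mean_without:.3f} -> {mean_with:.3f}")
print(f"median: {median_without} -> {median_with}")

# Modes of the nine-value data set used to illustrate the mode.
modes = multimode([29, 31, 35, 39, 39, 40, 43, 44, 44])
print("modes:", modes)
```

The mean moves by about 1.7 years when the outlier is added, while the median moves by only half a year, which is why the median is preferred for data sets that contain outliers.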

    Example 4

    The mean of 60, 80, 90, 120:

    (60 + 80 + 90 + 120)/4 = 350/4 = 87.5

    The arithmetic mean is very useful because it represents the values of most

    observations in the population.

    The mean therefore describes the population quite well in terms of the magnitudes

    attained by most of the members of the population

    Measures of Dispersion

    Discrete data

    These are statistics or measures that show how data is dispersed. The measures include:

    Range

    Inter-quartile range

    Variance

    Standard deviation

    Range: The difference between the highest and the lowest value.

    Example 1

    The example on the number of years the employees have stayed with the employer.

    11 9 13 12 8 9 24 10

    Range: 24-8 = 16

    The range is influenced by outliers as it is only based on two values. Its disadvantage is

    that it ignores the rest of values in a data set and so it is not a satisfactory measure of

    dispersion.

    Inter-quartile range: The difference between the upper quartile Q3 and the lower quartile Q1. It contains the middle 50% of the data.

    Example 1

    Arrange the data in order:

    8 9 9 10 11 12 13 24

    1st quartile (Q1): position = 1/4 × 8 = 2nd; the 2nd value = 9

    Q1 = 9

    3rd quartile (Q3): position = 3/4 × 8 = 6th; the 6th value = 12

    Q3 = 12

    Inter-quartile range: 12 − 9 = 3.

    Example 2

    The following is a set of discrete data:

    2, 5, 8, 10, 11, 14, 17, 20

    Required:

    (i) Find the 30th percentile

    (ii) The quartiles.

    Solution

    30th percentile

    Position = 0.3(n + 1) = 0.3(9) = 2.7

    30th percentile = 5 + 0.7(8 − 5) = 5 + 2.1 = 7.1

    Lower quartile (25th percentile)

    Position = 0.25(n + 1) = 0.25(9) = 2.25

    Q1 = 5 + 0.25(8 − 5) = 5 + 0.75 = 5.75

    Median (50th percentile)

    Position = 0.5(n + 1) = 0.5(9) = 4.5

    Median: Q2 = 10 + 0.5(11 − 10) = 10.5

    Upper quartile (75th percentile)

    Position = 0.75(n + 1) = 0.75(9) = 6.75

    Q3 = 14 + 0.75(17 − 14) = 16.25

    Interquartile range

    IQR = Q3 − Q1 = 16.25 − 5.75 = 10.50
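The position rule used in this example, p(n + 1) followed by linear interpolation between neighbouring values, can be sketched as below. The function name `percentile_n_plus_1` is hypothetical and chosen only for illustration. As a cross-check, `statistics.quantiles` with its default "exclusive" method uses the same n + 1 positions for the quartiles.

```python
from statistics import quantiles


def percentile_n_plus_1(values, p):
    """Percentile p (0-100) by the (n + 1) position rule with linear interpolation."""
    data = sorted(values)
    n = len(data)
    position = p / 100 * (n + 1)            # 1-based position, may be fractional
    position = min(max(position, 1), n)     # keep the position inside the data
    lower = int(position)                   # whole part of the position
    fraction = position - lower             # fractional part of the position
    if lower >= n:
        return float(data[-1])
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


data = [2, 5, 8, 10, 11, 14, 17, 20]
p30 = percentile_n_plus_1(data, 30)
q1 = percentile_n_plus_1(data, 25)
q2 = percentile_n_plus_1(data, 50)
q3 = percentile_n_plus_1(data, 75)
iqr = q3 - q1

print(f"P30 = {p30:.2f}, Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print("statistics.quantiles:", quantiles(data, n=4))
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ a little from these values.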

    Example 2 (Grouped Data)

    The following table shows the levels of retirement benefits given to a group of workers in a given establishment.

    Retirement benefits (000)   No. of retirees (f)   Upper class limit   cf
    20 – 29                     50                    29.5                50
    30 – 39                     69                    39.5                119
    40 – 49                     70                    49.5                189
    50 – 59                     90                    59.5                279
    60 – 69                     52                    69.5                331
    70 – 79                     40                    79.5                371
    80 – 89                     11                    89.5                382

    Required

    i. Determine the semi-interquartile range for the above data

    ii. Determine the minimum value for the top ten per cent (10%)

    iii. Determine the maximum value for the lower 40% of the retirees

    Solution

    i. The lower quartile (Q1) lies at position

    (N + 1)/4 = (382 + 1)/4 = 95.75

    This position falls in the class 30 – 39, so the value of Q1 = 29.5 + ((95.75 − 50)/69) × 10

    = 29.5 + 6.63

    = 36.13

    The upper quartile (Q3) lies at position

    3(N + 1)/4 = 3(382 + 1)/4 = 287.25

    The value of Q3 = 59.5 + ((287.25 − 279)/52) × 10

    = 59.5 + 1.59

    = 61.09

    The semi-interquartile range = (Q3 − Q1)/2

    = (61.09 − 36.13)/2

    = 12.48

    = 12,480

    ii. The top 10% is separated from the rest by the value below which the lower 90% of the retirees fall.

    The position corresponding to the lower 90%

    = (90/100)(n + 1) = 0.9(382 + 1)

    = 0.9 × 383

    = 344.7

    The benefits (value) corresponding to the minimum value for the top 10%

    = 69.5 + ((344.7 − 331)/40) × 10

    = 72.925

    = 72,925

    iii. The lower 40% corresponds to position

    = (40/100)(382 + 1)

    = 153.20

    Retirement benefits corresponding to this position

    = 39.5 + ((153.2 − 119)/70) × 10

    = 39.5 + 4.89

    = 44.39

    = 44,390
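The interpolation used for grouped data can be written as one small function. This is a sketch under the assumptions of the example: equal class widths of 10, upper class boundaries as tabulated, and the (N + 1) position rule. The names `CLASSES`, `CLASS_WIDTH` and `grouped_quantile` are illustrative only.

```python
# (upper class boundary, frequency) for each retirement-benefit class, in thousands
CLASSES = [(29.5, 50), (39.5, 69), (49.5, 70), (59.5, 90),
           (69.5, 52), (79.5, 40), (89.5, 11)]
CLASS_WIDTH = 10


def grouped_quantile(fraction, classes=CLASSES, width=CLASS_WIDTH):
    """Value below which `fraction` of the observations fall, by interpolation."""
    total = sum(freq for _, freq in classes)
    position = fraction * (total + 1)       # (N + 1) position rule from the text
    cumulative = 0                          # cumulative frequency of classes passed
    for upper, freq in classes:
        if cumulative + freq >= position:
            lower = upper - width           # lower boundary of this class
            return lower + (position - cumulative) / freq * width
        cumulative += freq
    return classes[-1][0]                   # position lies beyond the last class


q1 = grouped_quantile(0.25)
q3 = grouped_quantile(0.75)
semi_iqr = (q3 - q1) / 2
top_10_minimum = grouped_quantile(0.90)
lower_40_maximum = grouped_quantile(0.40)

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, semi-IQR = {semi_iqr:.2f}")
print(f"minimum of top 10% = {top_10_minimum:.3f}")
print(f"maximum of lower 40% = {lower_40_maximum:.2f}")
```

All results are in thousands, matching the units of the table.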

    e. The 10th–90th percentile range

    This is a measure of dispersion which uses percentiles. A percentile is a value which separates one division from the next when a given data set is divided into 100 equal divisions.

    This measure of dispersion is very important when calculating the coefficient of skewness.

    Variance: Variance is the square of the standard deviation. Formula:

    $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

    where x: the values in the data, N: size of the population, μ: the mean

    Standard Deviation: It is a measure of the average deviation of the values of a variable from the mean. The deviation of each value from the mean is squared, and the sum of all the squared deviations is divided by the total frequency (N) for population data, or by the sample size less 1 (n − 1) if sample data was used; then take the square root.

    Formula for calculation:

    Population data

    $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

    Example 1

    Assume the data on the number of years employees remained with the employer were collected from a sample:

    Variance: $S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

    Mean = x̄ = 12 (obtained earlier)

    S² = [(11−12)² + (9−12)² + (13−12)² + (12−12)² + (8−12)² + (9−12)² + (24−12)² + (10−12)²] / (8 − 1)

    S² = (1 + 9 + 1 + 0 + 16 + 9 + 144 + 4)/7 = 184/7 = 26.286

    The average squared deviation of the values from the mean is 26.286.

    Standard deviation: square root of the variance

    S = √26.286 = 5.127

    On average each value deviates from the mean by about 5.127 years.
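The sample variance and standard deviation of the years-of-service data can be computed directly from the definition. The sketch below sums the squared deviations of all eight values by hand and then checks the result against `statistics.variance` and `statistics.stdev`, which also use the n − 1 divisor; the variable names are illustrative.

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance

years = [11, 9, 13, 12, 8, 9, 24, 10]

x_bar = mean(years)                                   # 12
squared_deviations = [(x - x_bar) ** 2 for x in years]
sum_of_squares = sum(squared_deviations)              # sum of (x - mean)^2
s_squared = sum_of_squares / (len(years) - 1)         # sample variance, divisor n - 1
s = sqrt(s_squared)                                   # sample standard deviation

print(f"sum of squared deviations = {sum_of_squares}")
print(f"variance = {s_squared:.3f}, standard deviation = {s:.3f}")

# The standard library gives the same values.
assert isclose(s_squared, variance(years))
assert isclose(s, stdev(years))
```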

    In general, the lower the value of the standard deviation for a data set, the closer together the values are around the mean, while a higher value of the standard deviation indicates that the values are relatively spread out or scattered.

    If the standard deviation of scores obtained by students in a BCM 307 class was

    obtained to be higher compared to score obtained in different class, it means the

    abilities of students are spread out. Some are very poor while others may be good in

    their performance.

    If data set is larger the working can be done from a frequency distribution table.

    Example 2

    A sample comprises the following observations: 14, 18, 17, 16, 25, 31

    Determine the standard deviation of this sample.

    x            x − x̄       (x − x̄)²
    14          −6.17       38.03
    18          −2.17        4.69
    17          −3.17       10.03
    16          −4.17       17.36
    25           4.83       23.36
    31          10.83      117.36
    Total 121              210.83

    x̄ = 121/6 = 20.17 (the squares are computed from the unrounded mean)

    Standard deviation = √(Σ(x − x̄)²/n) = √(210.83/6)

    = 5.93

    Using the sample divisor n − 1 described above instead of n gives √(210.83/5) = 6.49.

    Example 3

    The data represent the number of bedrooms in homes owned by 30 families

    3 5 2 3 2 3 1 2 1 3

    4 1 4 3 1 3 3 2 2 3

    3 4 3 1 2 4 2 2 5 3

    Required: a) identify the mode b) calculate the

    i) mean

    ii) variance and standard deviation

    Solution

    Construct a frequency distribution table.

    Number of rooms (x)   Frequency (f)   fx    x − x̄     f(x − x̄)²
    1                     5               5     −1.67     13.9445
    2                     8               16    −0.67      3.5912
    3                     11              33     0.33      1.1979
    4                     4               16     1.33      7.0756
    5                     2               10     2.33     10.8578
    Total                 30              80              36.667

    Σf = 30, Σfx = 80, Σf(x − x̄)² = 36.667

    a) The mode is 3 bedrooms.

    b) i) Mean: x̄ = Σfx/Σf = 80/30 = 2.67

    ii) Variance: S² = Σf(x − x̄)²/(n − 1) = 36.667/(30 − 1) = 1.264

    Standard deviation: S = √1.264 = 1.124

    On average, each value differs from the mean by about 1.124 bedrooms.

    Coefficient of Variation

    The variances or standard deviations of different data sets are not easy to compare directly. The coefficient of variation makes it possible to compare different data sets by relating a measure of dispersion (normally the standard deviation) to a measure of central tendency (normally the mean).

    Coefficient of variation: CV = standard deviation / mean

    In the above example: CV = 1.124/2.67 = 0.421

    CV can also be written as a percentage: CV = (1.124/2.67) × 100 = 0.421 × 100 = 42.1%

    The lower the CV, the less the spread of the values relative to the mean, i.e. the values are closer together.
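The bedroom example can be reproduced from its frequency table. This is a minimal sketch assuming the data are a sample (divisor n − 1, as in the worked solution); the dictionary `freq` and the other names are illustrative. Because the script keeps the unrounded mean, the coefficient of variation comes out at about 0.42, very slightly different from the value obtained from the rounded figures 1.124 and 2.67.

```python
from math import sqrt

# Number of bedrooms -> number of families (frequency table from the example).
freq = {1: 5, 2: 8, 3: 11, 4: 4, 5: 2}

n = sum(freq.values())                                   # total frequency
mode = max(freq, key=freq.get)                           # value with highest frequency
x_bar = sum(x * f for x, f in freq.items()) / n          # sum(fx) / sum(f)
sum_sq = sum(f * (x - x_bar) ** 2 for x, f in freq.items())
variance = sum_sq / (n - 1)                              # sample variance
std_dev = sqrt(variance)
cv = std_dev / x_bar                                     # coefficient of variation

print(f"n = {n}, mode = {mode}, mean = {x_bar:.2f}")
print(f"variance = {variance:.3f}, sd = {std_dev:.3f}, CV = {cv:.1%}")
```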

    Measures of Central Tendency and Measures of Dispersion for Continuous Data

    Example 1

    The table gives the frequency distribution of the daily commuting time from home to work for all employees of a company.

    Time (minutes)         Number of employees
    0 to less than 10      4
    10 to less than 20     9
    20 to less than 30     6
    30 to less than 40     4
    40 to less than 50     2
    Total                  25

    Solution:

    The computation of the measures is similar to that for discrete data, with the value of x taken as the mid-point of each class.

    x = (sum of the class boundaries)/2, e.g. (0 + 10)/2 = 5 is the mid-point of the 1st class

    Time                   Mid-point (x)   Frequency (f)   fx     (x − x̄)² f
    0 to less than 10      5               4               20     1075.84
    10 to less than 20     15              9               135     368.64
    20 to less than 30     25              6               150      77.76
    30 to less than 40     35              4               140     739.84
    40 to less than 50     45              2               90     1113.92
    Total                                  25              535    3376.00

    Σf = 25, Σfx = 535, Σ(x − x̄)² f = 3376

    The symbol μ can be used instead of x̄ given that the data is from a population. Whether the column is written as (x − x̄)² or (x − μ)² makes no difference to the value.

    Mean = μ = Σfx/N = 535/25

    = 21.4

    Standard deviation = σ = √(Σ(x − μ)² f / N) = √(3376/25)

    = √135.04 = 11.62

    NB: For continuous data the mode is replaced by the term modal class, which is simply the class with the highest frequency. For the above example the modal class is 10 to less than 20.
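For grouped continuous data the same calculation runs on class mid-points. The sketch below treats the commuting-time table as population data (divisor N, as in the worked solution); the tuple layout and variable names are assumptions made for illustration.

```python
from math import sqrt

# (lower boundary, upper boundary, frequency) for each commuting-time class
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

N = sum(f for _, _, f in classes)
midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [f for _, _, f in classes]

mu = sum(x * f for x, f in zip(midpoints, frequencies)) / N       # sum(fx) / N
sum_sq = sum(f * (x - mu) ** 2 for x, f in zip(midpoints, frequencies))
sigma = sqrt(sum_sq / N)                                          # population sd

# Modal class: the class with the highest frequency.
modal_class = max(classes, key=lambda c: c[2])[:2]

print(f"N = {N}, mean = {mu}, sum of squares = {sum_sq:.2f}")
print(f"standard deviation = {sigma:.2f}, modal class = {modal_class}")
```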

    Practice Questions:

    1. The following data represent the age of a sample of 10 employees of a given

    company

    39 29 43 52 39 44 40 31 44 35

    Required:

    i) identify the mode and the median

    ii) compute the

    a) mean

    b) standard deviation

    c) coefficient of variation

    2. The data give the frequency distribution of the number of orders received each day during the past 50 days at the office of a mail order company.

    Number of orders   Number of days
    10 – 12            4
    13 – 15            12
    16 – 18            20
    19 – 21            14

    a) Identify the modal class

    b) Calculate

    i) mean

    ii) variance and standard deviation

    iii) coefficient of variation

    3. The price of the ordinary 25p shares of Manco PLC quoted on the stock exchange at the close of business on successive Fridays is tabulated below

    126 120 122 105 129 119 131 138

    125 127 113 112 130 122 134 136

    128 126 117 114 120 123 127 140

    124 127 114 111 116 131 128 137

    127 122 106 121 116 135 142 130

    Required

    a) Group the above data into eight classes.

    b) Calculate the cumulative frequency, the median value, the quartile values and the semi-interquartile range

    c) Calculate the mean and standard deviation of your frequency distribution.

    d) Compute :

    i) The median and mean

    ii) The semi-interquartile range and the standard deviation

    5. The managers of an import agency are investigating the length of time that customers take to pay their invoices, the normal terms for which are 30 days net. They have checked the payment record of 100 customers chosen at random and have compiled the following table:

    Payment in         Number of customers
    5 to 9 days        4
    10 to 14 days      10
    15 to 19 days      17
    20 to 24 days      20
    25 to 29 days      22
    30 to 34 days      16
    35 to 39 days      8
    40 to 44 days      3

    Required:

    a) Calculate the arithmetic mean.

    b) Calculate the standard deviation

    c) Construct a histogram and insert the modal value.

    d) Estimate the probability that an unpaid invoice chosen at random will be between

    30 and 39 days old.

    4. PROBABILITY DISTRIBUTIONS

    A probability distribution can be either discrete or continuous. The distribution can also assume a uniform, normal or skewed shape.

    For numerical data, for any distribution (discrete, continuous or probability), the mean and standard deviation can be used to find the proportion or percentage of the total observations that fall within a given interval about the mean.

    The pattern of any distribution of data values throughout the entire range of all values gives a certain shape. The shape can be identified from a bar chart for discrete data or a histogram for continuous data. The shape of the distribution can be

    i) Uniform

    ii) Bell-shaped, that is, symmetrical

    iii) Skewed

    i) Uniform or rectangular

    ii) Symmetrical: bell-shaped

    For a symmetrical, bell-shaped continuous distribution the measures of central tendency (mode, median and mean) are equal and the value is at the middle of the shape. Such a distribution is called the normal distribution or Gaussian distribution.

    a) DISCRETE DATA

    A probability distribution for a discrete random variable is a mutually exclusive listing of all the possible numerical outcomes, together with the probability of occurrence of each outcome.

    Example 1

    The following table contains the probability distribution for the number of traffic

    accidents daily in a small city.

    Number of accidents (x)   Probability p(x)
    0                         0.10
    1                         0.20
    2                         0.45
    3                         0.15
    4                         0.05
    5                         0.05

    Required: Compute:

    a) The expected number of accidents

    b) The variance and standard deviation

    c) The coefficient of variation

    Solution

    Probability is a term that reflects uncertainty. It is used to make predictions by assigning to an event the probability of its happening.

    The mean or average of such a distribution is referred to as the expected value E(x):

    $E(x) = \sum x_i P(x_i)$

    where x_i is a value of the variable and P(x_i) is the probability that event x_i will occur.

    The variance: $\sigma^2 = \sum (x_i - E(x))^2 P(x_i)$

    Number of accidents (x_i)   Probability P(x_i)   x_i P(x_i)   (x_i − E(x))² P(x_i)
    0                           0.10                 0.00         0.40
    1                           0.20                 0.20         0.20
    2                           0.45                 0.90         0.00
    3                           0.15                 0.45         0.15
    4                           0.05                 0.20         0.20
    5                           0.05                 0.25         0.45
    Total                       1.00                 2.00         1.40

    ΣP(x_i) = 1.00, Σx_i P(x_i) = 2.00, Σ(x_i − E(x))² P(x_i) = 1.4

    i) Expected value E(x) = 2.00

    ii) Variance = σ² = Σ(x_i − E(x))² P(x_i) = 1.4

    iii) Standard deviation = σ = √1.4 = 1.1832

    iv) Coefficient of variation CV = 1.1832/2

    = 0.592 (59.2%)
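The expected value, variance, standard deviation and coefficient of variation of a discrete distribution follow directly from the two formulas above. A minimal sketch for the traffic-accident table, with the dictionary name `pmf` chosen only for illustration:

```python
from math import isclose, sqrt

# Number of accidents -> probability (the table from Example 1).
pmf = {0: 0.10, 1: 0.20, 2: 0.45, 3: 0.15, 4: 0.05, 5: 0.05}

total_probability = sum(pmf.values())
assert isclose(total_probability, 1.0)          # a valid distribution sums to 1

expected = sum(x * p for x, p in pmf.items())                     # E(x)
variance = sum((x - expected) ** 2 * p for x, p in pmf.items())   # sigma squared
std_dev = sqrt(variance)
cv = std_dev / expected

print(f"E(x) = {expected:.2f}, variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.4f}, CV = {cv:.1%}")
```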

    Example 2

    Given the following probability distributions A and B

    Distribution A        Distribution B
    X     P(X)            X     P(X)
    0     0.25            0     0.15
    1     0.25            1     0.25
    2     0.25            2     0.45
    3     0.25            3     0.15

    a) Compute:

    i) The expected value for each distribution

    ii) The standard deviation for each distribution

    iii) Compare the results of distribution A and B.

    Solution

    Distribution A                                 Distribution B
    X   P(X)   X P(X)   [X − E(X)]² P(X)           X   P(X)   X P(X)   [X − E(X)]² P(X)
    0   0.25   0.00     0.5625                     0   0.15   0.00     0.384
    1   0.25   0.25     0.0625                     1   0.25   0.25     0.090
    2   0.25   0.50     0.0625                     2   0.45   0.90     0.072
    3   0.25   0.75     0.5625                     3   0.15   0.45     0.294

    Distribution A: E(X) = 1.5, σ² = 1.25, σ = √1.25 = 1.118

    Distribution B: E(X) = 1.6, σ² = 0.84, σ = √0.84 = 0.917

    Distribution A is uniform and symmetric. Distribution B has one mode, the value 2, and its smaller standard deviation shows that its values are less spread about the mean.

    b) CONTINUOUS DISTRIBUTION: NORMAL DISTRIBUTION

    A frequency distribution for continuous data can be converted to a probability distribution by calculating the relative frequency for each class. This column is taken as the equivalent of the probabilities for each class.

    Like the total sum of relative frequencies, the total probability is also equal to 1, i.e. ΣP(x) = 1

    The normal distribution is the most common continuous distribution used in statistics, for the following main reasons.

    Numerous continuous variables common in business and other natural occurrences have distributions that closely resemble the normal distribution.

    The normal distribution can be used to approximate various discrete probability distributions.

    It provides the basis for classical statistical inference.

    The normal distribution is represented by the classical bell shape; its probability density function is denoted by the symbol f(x).

    The mean (μ) is in the middle of the symmetrical distribution. The standard deviation (σ) measures the distance from the mean to a point on the x (horizontal) axis. In order to work with a set of standard values it is necessary to convert or transform any normal distribution to a standard normal distribution, which has a mean of 0 and a standard deviation of 1.

    The total area of the distribution is 1, and each half of the curve is 0.5. Any value of x in a distribution can be converted to a value called a z value or z-score, by the formula:

    $Z = \frac{x - \mu}{\sigma}$

    where x is the variable, μ the mean and σ the standard deviation.

    The areas corresponding to Z values are obtained from standard normal probability tables. Each Z value corresponds to a shaded area under the normal curve.

    Example 1

    The heights of adult males are normally distributed with mean 170 cm and standard

    deviation 10cm.

    Find the probability that the height of an adult male is:

    a) Between 180 cm and 190 cm

    b) Taller than 190 cm

    c) Shorter than 180 cm

    d) Shorter than 165 cm

    Solution

    The heights are normally distributed with μ = 170 (the mean, at the middle of the curve) and σ = 10.

    Use the formula Z = (x − μ)/σ to convert each height to a Z value, then find the corresponding area (probability) from the normal tables.

    a) Between 180 cm and 190 cm

    Z = (x − μ)/σ = (180 − 170)/10 = 1 (180 is 1 standard deviation above the mean)

    Z = (190 − 170)/10 = 2 (190 is 2 standard deviations above the mean)

    Find the areas in the normal tables, P(z):

    Z     Area under the curve between Z = 0 and z
    1     0.3413
    2     0.4772

    P(180 < x < 190) = P(1 < z < 2) = 0.4772 − 0.3413 = 0.1359

    b) Taller than 190 cm: P(x > 190)

    Z = (190 − 170)/10 = 2

    Z = 2 and P(z) or area = 0.4772

    P(z > 2) = 0.5 − 0.4772 = 0.0228

    c) Shorter than 180 cm: P(x < 180)

    Z = (180 − 170)/10 = 1

    P(x < 180) = P(z < 1) = 0.5 + 0.3413 = 0.8413

    d) Shorter than 165 cm: P(x < 165)

    Z = (165 − 170)/10 = −0.5

    P(x < 165) = P(z < −0.5) = 0.5 − 0.1915 = 0.3085

    Example 2

    X is normally distributed; the z values in the solution correspond to a mean of 3 and a standard deviation of 3. Find (i) P(2 < X < 5), (ii) P(X > 0), (iii) P(X > 9).

    Solution

    (i) P(2 < X < 5) = P(−0.33 < Z < 0.67)

    = 0.1293 + 0.2486 = .3779.

    (ii) P(X > 0) = P(Z > −1) = P(Z < 1)

    = .8413.

    (iii) P(X > 9) = P(Z > 2.0)

    = 0.5 − 0.4772 = .0228
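Table look-ups for the normal distribution can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) for the adult-height example with mean 170 cm and standard deviation 10 cm; the variable names are illustrative. The `cdf` method returns the area to the left of a value, so areas to the right are obtained by subtracting from 1.

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=10)

# z-score of 190 cm: (x - mu) / sigma
z_190 = (190 - heights.mean) / heights.stdev

p_between_180_190 = heights.cdf(190) - heights.cdf(180)
p_taller_190 = 1 - heights.cdf(190)
p_shorter_180 = heights.cdf(180)
p_shorter_165 = heights.cdf(165)

print(f"z for 190 cm = {z_190}")
print(f"P(180 < x < 190) = {p_between_180_190:.4f}")
print(f"P(x > 190)       = {p_taller_190:.4f}")
print(f"P(x < 180)       = {p_shorter_180:.4f}")
print(f"P(x < 165)       = {p_shorter_165:.4f}")
```

Small differences in the last decimal place compared with printed tables come from the rounding of z values and table entries.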

    Example 3

    A sample of students had a mean age of 35 years with a standard deviation of 5 years. A student was randomly picked from a group of 200 students. Find the probability that the age of the student turned out to be as follows:

    i. Lying between 35 and 40

    ii. Lying between 30 and 40

    iii. Lying between 25 and 30

    iv. Lying beyond 45 yrs

    v. Lying beyond 30 yrs

    vi. Lying below 25 years

    Solution

    (i). The standardized value for 35 years:

    Z = (x − μ)/σ = (35 − 35)/5 = 0

    The standardized value for 40 years:

    Z = (x − μ)/σ = (40 − 35)/5 = 1

    The area between Z = 0 and Z = 1 is 0.3413 (these values are checked from the normal tables; see appendix).

    The values from the standard normal curve tables:

    When z = 0, p = 0

    And when z = 1, p = 0.3413

    The area under the curve between z = 0 and z = 1

    = 0.3413 − 0 = 0.3413

    The probability of the age lying between 35 and 40 yrs is 0.3413

    (ii). 30 and 40 years

    Z = (x − μ)/σ = (30 − 35)/5 = −5/5 = −1

    Z = (x − μ)/σ = (40 − 35)/5 = 1

    The area between Z = −1 and Z = 1 is

    = 0.3413 (lying on the negative side of zero) + 0.3413 (lying on the positive side of zero)

    P = 0.6826

    The probability of the age lying between 30 and 40 yrs is 0.6826

    (iii). 25 and 30 years

    Z = (x − μ)/σ = (25 − 35)/5 = −10/5 = −2

    Z = (x − μ)/σ = (30 − 35)/5 = −1

    The area between Z = −2 and Z = −1:

    Probability area corresponding to Z = −2

    = 0.4772 (the z value to check from the tables is 2)

    Probability area corresponding to Z = −1

    = 0.3413 (the z value for this case is 1)

    The probability that the age lies between 25 and 30 yrs

    = 0.4772 − 0.3413 (the area under this part of the curve)

    P(Z) = 0.1359

    (iv). P(beyond 45 years) is determined as follows: P(x > 45)

    Z = (x − μ)/σ = (45 − 35)/5 = +10/5 = +2

    The area corresponding to Z = 2 is 0.4772, the probability of an age between 35 and 45

    P(Age > 45 yrs) = 0.5000 − 0.4772

    = 0.0228

    Practice Questions

    1. Identify the following as discrete or continuous random variables.

    (i) The market value of a publicly listed security on a given day

    (ii) The number of printing errors observed in an article in a weekly news magazine

    (iii) The time to assemble a product (e.g. a chair)

    (iv) The number of emergency cases arriving at a city hospital

    (v) The number of sophomores in a randomly selected Math. class at a university

    (vi) The rate of interest paid by your local bank on a given day

    2. A random variable X has the following probability distribution:

    X      1     2     3     4     5
    P(x)   .05   .10   .15   .45   .25

    (i) Verify that X has a valid probability distribution.

    (ii) Find the probability that X is greater than 3, i.e. P(X > 3).

    (iii) Find the probability that X is greater than or equal to 3, i.e. P(X ≥ 3).

    (iv) Find the probability that X is less than or equal to 2, i.e. P(X ≤ 2).

    (v) Find the probability that X is an odd number.

    (vi) Graph the probability distribution for X.

    3. Calculate the area under the standard normal curve between the following values.

    (i) Z = 0 and z = 1.6 (i.e. P(0 ≤ Z ≤ 1.6))

    (ii) Z = −1.6 and z = 0 (i.e. P(−1.6 ≤ Z ≤ 0))

    (iii) Z = .86 and z = 1.75 (i.e. P(.86 ≤ Z ≤ 1.75))

    (iv) Z = −1.75 and z = −.86 (i.e. P(−1.75 ≤ Z ≤ −.86))

    (v) Z = 1.26 and z = 1.86 (i.e. P(1.26 ≤ Z ≤ 1.86))

    (vi) Z = −1.0 and z = 1.0 (i.e. P(−1.0 ≤ Z ≤ 1.0))

    (vii) Z = −2.0 and z = 2.0 (i.e. P(−2.0 ≤ Z ≤ 2.0))

    (viii) Z = −3.0 and z = 3.0 (i.e. P(−3.0 ≤ Z ≤ 3.0))

    4. Let Z be a standard normal distribution. Find z0 such that

    (i) P (Z z0) = 0.05

    (ii) P (Z z0) = 0.99

    (iii) P (Z z0) = 0.0708

    (iv) P (Z z0) = 0.0708

    (v) P(−z0 ≤ Z ≤ z0) = 0.68

    (vi) P(−z0 ≤ Z ≤ z0) = 0.95

    5. A normally distributed random variable X possesses a mean of μ = 10 and a standard deviation of σ = 5.

    Find the following probabilities.

    (i) X falls between 10 and 12 (i.e. P(10 ≤ X ≤ 12)).

    (ii) X falls between 6 and 14 (i.e. P(6 ≤ X ≤ 14)).

    (iii) X is less than 12 (i.e. P(X ≤ 12)).

    (iv) X exceeds 10 (i.e. P(X ≥ 10)).

    6. CONFIDENCE INTERVAL

    An interval estimate, or confidence interval, consists of a range (between a lower confidence limit and an upper confidence limit) within which we are confident that a population parameter lies, and we assign a probability that this interval contains the true population value.

    The confidence interval is the interval between the confidence limits. The higher the confidence level, the wider the confidence interval.

    For example

    A normal distribution has the following characteristics:

    i. Mean ± 1.960σ includes 95% of the population

    ii. Mean ± 2.576σ includes 99% of the population

    Large Samples

    The Central Limit Theorem: The theorem states that if we select a large number of simple random samples of size n from any population and determine the mean of each sample, the distribution of these sample means will tend to be described by the normal probability distribution with mean μ and variance σ²/n. This is true even if the population itself is not normally distributed. In other words, the sampling distribution of sample means approaches a normal distribution irrespective of the distribution of the population from which the samples are taken, and the approximation to the normal distribution becomes increasingly close as the sample size increases.

    Large samples are those with a sample size greater than 30 (i.e. n > 30). Such samples can use levels of confidence based on the normal distribution.

    Estimation of population mean

    Here we assume that if we take a large sample from a population then the mean of the population is very close to the mean of the sample.

    The steps to follow to estimate the population mean are:

    i. Take a random sample of n items, where n > 30; n is the sample size

    ii. Compute the sample mean (x̄) and standard deviation (s)

    iii. Compute the standard error of the mean using the following formula

    $S_{\bar{x}} = \frac{s}{\sqrt{n}}$

    where $S_{\bar{x}}$ = standard error of the mean, s = standard deviation of the sample, n = sample size

    iv. Choose a confidence level, e.g. 95% or 99%

    v. Estimate the population mean as

    Population mean μ = x̄ ± (appropriate number × S_x̄)

    The appropriate number depends on the confidence level, e.g. at the 95% confidence level it is 1.96. This number is usually denoted by Z and is obtained from the normal tables. The value of Z corresponds to the chosen confidence level expressed as a probability.

    Example 1

    The quality department of a wire manufacturing company periodically selects a sample of wire specimens in order to test for breaking strength. Past experience has shown that the breaking strengths of a certain type of wire are normally distributed with a standard deviation of 200 kg. A random sample of 64 specimens gave a mean of 6200 kg. Find the population mean at the 95% level of confidence.

    Solution

    Population mean μ = x̄ ± 1.96 S_x̄

    Note that the sample size is already n > 30, and s and x̄ are given; thus steps i), ii) and iv) are provided.

    Here: x̄ = 6200 kg

    S_x̄ = s/√n = 200/√64 = 25

    Population mean μ = 6200 ± 1.96(25)

    = 6200 ± 49

    = 6151 to 6249

    At the 95% level of confidence, the population mean lies between 6151 and 6249 kg.
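The steps for a large-sample interval can be collected into one function. This is a sketch, not a prescribed method: the function name `z_interval_for_mean` is hypothetical, and the multiplier Z is taken from `statistics.NormalDist().inv_cdf` instead of printed tables, which gives 1.96 at the 95% level.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_for_mean(sample_mean, sample_sd, n, confidence=0.95):
    """Confidence interval for a population mean from a large sample (n > 30)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    standard_error = sample_sd / sqrt(n)             # s / sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


# Wire breaking-strength example: mean 6200 kg, s = 200 kg, n = 64.
low, high = z_interval_for_mean(6200, 200, 64, confidence=0.95)
print(f"95% interval: {low:.0f} kg to {high:.0f} kg")
```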

    Estimation of population proportions

    This type of estimation applies when information cannot be given as a mean or as a measure but only as a fraction or percentage.

    Sampling theory stipulates that if repeated large random samples are taken from a population, the sample proportion p will be normally distributed with mean equal to the population proportion and standard error equal to

    $S_p = \sqrt{\frac{pq}{n}}$ = standard error for sampling of population proportions

    where n is the sample size and q = 1 − p.

    The procedure for estimating a proportion is similar to that for estimating a mean; only the formula for calculating the standard error is different.

    Example 1

    In a sample of 800 candidates, 560 were male. Estimate the population proportion at the 95% confidence level.

    Solution

    Here

    Sample proportion p = 560/800 = 0.70

    q = 1 − p = 1 − 0.70 = 0.30

    n = 800

    S_p = √(pq/n) = √((0.70)(0.30)/800)

    S_p = 0.016

    Population proportion

    = p ± 1.96 S_p, where 1.96 = Z

    = 0.70 ± 1.96(0.016)

    = 0.70 ± 0.03

    = 0.67 to 0.73

    = between 67% and 73%

    Example 2

    A sample of 600 accounts was taken to test the accuracy of posting and balancing of accounts, and 45 mistakes were found. Find the population proportion. Use the 99% level of confidence.

    Solution

    Here

    n = 600; p = 45/600 = 0.075

    q = 1 − 0.075 = 0.925

    S_p = √(pq/n) = √((0.075)(0.925)/600)

    = 0.011

    Population proportion

    = p ± 2.58(S_p)

    = 0.075 ± 2.58(0.011)

    = 0.075 ± 0.028

    = 0.047 to 0.103

    = between 4.7% and 10.3%
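The same procedure applies to proportions, with the standard error √(pq/n). The sketch below reproduces both proportion examples; the function name `proportion_interval` is an assumption made for illustration, and Z is again taken from `statistics.NormalDist` (about 1.96 at 95% and 2.58 at 99%).

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p = successes / n
    q = 1 - p
    standard_error = sqrt(p * q / n)                 # S_p = sqrt(pq / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p - z * standard_error, p + z * standard_error


# Example 1: 560 males in a sample of 800 candidates, 95% confidence.
male_low, male_high = proportion_interval(560, 800, confidence=0.95)

# Example 2: 45 mistakes in a sample of 600 accounts, 99% confidence.
error_low, error_high = proportion_interval(45, 600, confidence=0.99)

print(f"males:    {male_low:.2f} to {male_high:.2f}")
print(f"mistakes: {error_low:.3f} to {error_high:.3f}")
```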

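    Small Samples

    Estimation of population mean

    If the sample size is small (n < 30), the Student's t distribution is used in place of the normal distribution, and the population mean is estimated as

    $\mu = \bar{x} \pm t S_{\bar{x}}$

    where

    $S_{\bar{x}} = \frac{s}{\sqrt{n}}$

    s = standard deviation of the sample = $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ for small samples

    n = sample size

    v = n − 1 degrees of freedom

    The value of t is obtained from Student's t distribution tables for the required confidence level.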

    Example

    A random sample of 12 items is taken and is found to have a mean weight of 50 grams and a standard deviation of 9 grams.

    What is the mean weight of the population

    a) with 95% confidence

    b) with 99% confidence

    Solution

    x̄ = 50; s = 9; v = n − 1 = 12 − 1 = 11; S_x̄ = s/√n = 9/√12

    μ = x̄ ± t S_x̄

    At 95% confidence level

    μ = 50 ± 2.201 × 9/√12

    = 50 ± 5.72 grams

    Therefore we can state with 95% confidence that the population mean is between 44.28 and 55.72 grams.

    At 99% confidence level

    μ = 50 ± 3.106 × 9/√12

    = 50 ± 8.07 grams

    Therefore we can state with 99% confidence that the population mean is between 41.93 and 58.07 grams.

    Note: To use the t distribution tables it is important to find the degrees of freedom (v = n − 1). In the example above v = 12 − 1 = 11.

    From the tables we find that at the 95% confidence level, against 11 degrees of freedom and under 0.05, the value of t = 2.201.
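For small samples only the multiplier changes. Python's standard library has no t-distribution quantile function, so this sketch hard-codes the two-tailed table values for 11 degrees of freedom (2.201 at 95% and 3.106 at 99%); the dictionary `T_TABLE_DF11` and the function name are hypothetical.

```python
from math import sqrt

# Two-tailed critical values of Student's t for v = 11 degrees of freedom,
# copied from t tables (an assumption of this sketch, not computed here).
T_TABLE_DF11 = {0.95: 2.201, 0.99: 3.106}


def t_interval_for_mean(sample_mean, sample_sd, n, t_value):
    """Confidence interval for a population mean from a small sample."""
    standard_error = sample_sd / sqrt(n)       # s / sqrt(n)
    margin = t_value * standard_error
    return sample_mean - margin, sample_mean + margin


# Sample of 12 items: mean 50 g, standard deviation 9 g.
low_95, high_95 = t_interval_for_mean(50, 9, 12, T_TABLE_DF11[0.95])
low_99, high_99 = t_interval_for_mean(50, 9, 12, T_TABLE_DF11[0.99])

print(f"95%: {low_95:.2f} g to {high_95:.2f} g")
print(f"99%: {low_99:.2f} g to {high_99:.2f} g")
```

For other sample sizes the critical value has to be looked up against the matching degrees of freedom, since t depends on v = n − 1.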

    7. SIMPLE LINEAR REGRESSION EQUATION

    A regression model is a mathematical equation that describes the relationship between
    two or more variables. A simple regression model includes only two variables:

    The independent variable: the variable used to explain the variation in the
    dependent variable, i.e. it is used to make predictions about the dependent
    variable.

    The dependent variable: the variable being explained.

    A linear regression model gives the equation of a linear relationship between
    two variables x (independent) and y (dependent) as shown below:

    y = a + bx

    The constant a is the y-intercept: the value of y where the line cuts the y-axis. The
    constant b is the slope or gradient of the line.

    The linear relationship between x and y is defined once the values of the constants a
    and b are determined. The values of a and b can be determined in the following ways.

    Scatter plots:

    Scatter plots are used to examine the relationship between two variables. One variable

    takes the horizontal axis (x) while the other takes the vertical axis (y). The variation

    between the variables can show a relationship that is positive or negative.

    A positive relationship that is linear or close to linear indicates that the
    variables move together in a linear manner. The scatter will show points lying in a
    band along a line. When one variable increases the other also increases,
    and when one decreases the other also decreases.

    In a negative relationship, an increase in one variable is accompanied by a decrease in
    the other variable. A linear relationship shows points scattered so that they lie close
    to a line.

    Relationships between variables can also be non-linear. In such cases the points will
    concentrate in a region that reflects a curve. The relationship between two variables
    may therefore assume many possible shapes, which can be classified as linear
    relationships or as non-linear relationships described by more complicated
    mathematical functions. The simplest relationship is a straight-line, or linear,
    relationship.

    See the scatter plots on page 606 of the main textbook.

    From a scatter plot, a line of best fit can be drawn through the points so that
    they lie approximately along a straight line.

    This requires good and refined skill: the better the line fits the points, the greater
    the accuracy of the line obtained. However, there will always be an error due to
    random causes, the random error term. This is the difference between the actual
    value of y obtained from the survey and the estimated value of y obtained by
    assuming the points fall along the line. For every value of x, the value of y
    estimated from the line may differ from the observed value. The error for each value
    of y can be written as

    $e = y - \hat{y}$, where $y$ is the actual value and $\hat{y}$ is the estimated value.

    Across all the values of y, some estimates are less than the actual values while
    others are more than the actual values.

    For the line of best fit, the sum of the negative differences and the sum of the
    positive differences is zero.

    Example:

    The following data represent a sample of seven households showing their incomes and
    food expenditures for a given month.

    Income                   Food expenditure
    (hundreds of dollars)    (hundreds of dollars)
    35                       9
    49                       15
    21                       7
    39                       11
    15                       5
    28                       8
    25                       9

    Required:

    Construct a scatter diagram, with income on the x-axis and food expenditure on the
    y-axis.

    Draw the prediction line.

    Identify the y-intercept (a) and the slope (gradient) (b).

    Write the simple linear regression equation.

    Scatter diagram

    Depending on one's skill, different lines such as L1 and L2 can be drawn. Using L2:

    y-intercept, i.e. constant a = 1.2

    The gradient, i.e. coefficient of x: $b = \dfrac{12 - 6}{46 - 22} = 0.25$

    The linear equation is generally written as

    y = a + bx

    = 1.2 + 0.25x

    The line is an estimate of values of y for different values of x. It can be used to predict
    values of y given x.

    However, since the line is only an estimate, there is a difference between the observed
    or actual value of y and the value given by the prediction line. This difference is
    called the random error, also called the residual. It measures the surplus (positive or
    negative) difference. The random error obtained from a population is denoted by
    $\varepsilon$, while that of a sample is denoted by $e$. In the above example,

    $e$ = actual food expenditure − predicted food expenditure $= y - \hat{y}$.

    If the predicted line is the line of best fit, the sum of the positive errors and the
    negative errors is equal to zero.
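
    The zero-sum property can be tested on the hand-drawn line. The sketch below is illustrative only: the function name residuals is hypothetical, and the data are the seven (income, food expenditure) pairs as they appear in the least squares computation table.

```python
# Seven households: income (x) and food expenditure (y), in hundreds of dollars
incomes = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]

def residuals(xs, ys, a, b):
    """Error e = y - (a + b*x) for every observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hand-drawn line L2: y = 1.2 + 0.25x
errors = residuals(incomes, food, a=1.2, b=0.25)
total_error = sum(errors)
print(f"Sum of errors for the hand-drawn line: {total_error:.2f}")
```

    For this line the errors sum to about 2.6 rather than zero, which shows that a line drawn by eye is generally not the line of best fit.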

    Drawing a line by eye on a scatter diagram may not give the line of best fit. The other
    option, which results in a sum of errors equal to zero,

    $\sum e = \sum (y - \hat{y}) = 0$,

    is the use of the least squares method.

    The Least squares method

    The least squares method minimizes the sum of the squared random errors. It helps to
    determine the constants a and b for the equation

    $\hat{y} = a + bx$ that results in the line of best fit. The method gives the values of a
    and b for the equation (model) such that the sum of squared errors (SSE) is a minimum:

    $SSE = \sum e^2 = \sum (y - \hat{y})^2$.

    The values of a and b which give the minimum SSE are called the least squares
    estimates, and the line is called the least squares line.

    For the line $\hat{y} = a + bx$,

    $b = \dfrac{SS_{xy}}{SS_{xx}}$ and $a = \bar{y} - b\bar{x}$

    where $SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$

    and $SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$

    To find the least squares regression line for the data on incomes and food expenditures
    of seven households, we construct a table to guide the computation of a and b.

    The table has the following:

    Income (x)    Food expenditure (y)    xy      x^2
    35            9                       315     1225
    49            15                      735     2401
    21            7                       147     441
    39            11                      429     1521
    15            5                       75      225
    28            8                       224     784
    25            9                       225     625
    Total 212     64                      2150    7222

    $\sum x = 212$, $\sum y = 64$

    $\bar{x} = \dfrac{\sum x}{n} = \dfrac{212}{7} = 30.286$

    $\bar{y} = \dfrac{\sum y}{n} = \dfrac{64}{7} = 9.143$

    $\sum xy = 2150$, $\sum x^2 = 7222$

    $SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 2150 - \dfrac{(212)(64)}{7} = 211.714$

    $SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 7222 - \dfrac{(212)^2}{7} = 801.429$

    $b = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{211.714}{801.429} = 0.2642$

    $a = \bar{y} - b\bar{x} = \dfrac{\sum y}{n} - b\,\dfrac{\sum x}{n} = \dfrac{64}{7} - 0.2642\left(\dfrac{212}{7}\right) = 1.1414$

    $\hat{y} = 1.1414 + 0.2642x$
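
    The table-based computation can be sketched in code. This is an illustrative sketch, not the module's own method of working: the function name least_squares is hypothetical, and it follows the SSxy and SSxx formulas given for the least squares method.

```python
def least_squares(xs, ys):
    """Return (a, b, ss_xy, ss_xx) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_xx - sum_x ** 2 / n
    b = ss_xy / ss_xx                  # slope
    a = sum_y / n - b * sum_x / n      # intercept: y-bar minus b times x-bar
    return a, b, ss_xy, ss_xx

incomes = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]
a, b, ss_xy, ss_xx = least_squares(incomes, food)
print(f"SSxy = {ss_xy:.3f}, SSxx = {ss_xx:.3f}")
print(f"b = {b:.4f}, a = {a:.4f}")
```

    Because the script carries b at full precision instead of rounding it to 0.2642 first, the intercept it prints can differ from the hand-computed 1.1414 in the third decimal place.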

    Interpretation:

    The line gives the coefficients a and b to four decimal places, making it more accurate
    for prediction. We can check the accuracy as follows:

    A household with a monthly income of $3,500, denoted by x = 35, would be expected to
    spend on food:

    y = 1.1414 + 0.2642x

    x = 35

    y = 1.1414 + 0.2642 (35) = 10.3884

    i.e. $1,038.84, since y is in hundreds of dollars.

    The actual value in the data gives y = 9. The value 10.3884 should be regarded as an
    average, i.e. households having an income of $3,500 (x = 35) spend an average of
    $1,038.84 (y = 10.3884) on food.

    The constant a is the value of y when x = 0, that is, the amount of money a household
    would spend on food per month if it had no income.

    It means that food expenditure does not depend only on income; there could be
    other factors.

    For purposes of prediction using the linear regression line obtained, we can only predict
    values of y for values of x that lie within the range of our data.

    For example, the incomes lie between $1,500 and $4,900, i.e. x = 15 and x = 49. We can
    only predict values of y for values of x between 15 and 49.

    Prediction outside this range may not hold true (the prediction is not reliable). x = 0 is
    a value not within the range, and so the prediction that households with no income
    spend $114.14 per month cannot be supported by our equation.
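
    The range restriction can be built into a prediction helper. The following is a minimal sketch: the function name predict_food and its parameters are hypothetical, and the limits 15 and 49 are simply the smallest and largest incomes in the sample.

```python
def predict_food(income, a=1.1414, b=0.2642, x_min=15, x_max=49):
    """Predict food expenditure (hundreds of dollars) from income (hundreds of dollars).

    Refuses to extrapolate outside the range of incomes observed in the sample.
    """
    if not x_min <= income <= x_max:
        raise ValueError(
            f"income {income} is outside the observed range {x_min} to {x_max}"
        )
    return a + b * income

estimate = predict_food(35)
print(f"Predicted food expenditure at x = 35: {estimate:.4f}")

try:
    predict_food(0)                     # x = 0 lies outside the data
except ValueError as error:
    print("Not predicted:", error)
```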

    The constant b in the model gives the gradient, i.e. the change in y due to a change of
    one unit in x.

    For example, when x increases by one unit of income (one hundred dollars), y increases
    by 0.2642 (in hundreds of dollars) spent on food.

    Example:

    If the income of a household changes from x = 30 to x = 31,
    y will change as follows:

    y = 1.1414 + 0.2642 (30) = 9.0674

    y = 1.1414 + 0.2642 (31) = 9.3316

    9.3316 − 9.0674 = 0.2642

    When b is positive, y increases as x increases and decreases as x decreases. There is a
    positive linear relationship between the variables, i.e. the change in y and the change
    in x are in the same direction; the variables move together.

    If the value of b is negative, the change in y is in the opposite direction to the change
    in x, i.e. there is a negative linear relationship between the variables.

    When b is greater than zero (b > 0) the line slopes upwards from left to right. If b < 0
    the line slopes downwards from left to right.

    Assumptions of the regression model

    The mean value of the error is zero. From the above example, among the households
    with the same income some spend more on food and others less. The sum of the
    differences (positive errors and negative errors) is equal to zero.

    The errors associated with different observations are independent. That is, all
    households decide independently how much to spend on food.

    For any given x, the distribution of errors is normal, i.e. in the above example
    the food expenditures of all households with the same income are normally
    distributed.

    The distribution of population errors for each x has the same (constant) standard

    deviation. The assumption is that the spread of points around the regression line

    is similar for all x values.

    Example 2

    A random sample of eight auto drivers insured with a company and having similar auto
    insurance policies was selected. The following table lists their driving experience (in
    years) and the monthly auto insurance premium (in dollars) paid by them. Find the
    linear regression equation using the least squares method.

    Driving experience (years)    Monthly auto insurance premium (dollars)
    5                             64
    2                             87
    12                            50
    9                             71
    15                            44
    6                             56
    25                            42
    16                            60

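    $\sum x = 90$, $\sum y = 474$

    $\bar{x} = \dfrac{\sum x}{n} = \dfrac{90}{8} = 11.25$

    $\bar{y} = \dfrac{\sum y}{n} = \dfrac{474}{8} = 59.25$

    $\sum xy = 4739$, $\sum x^2 = 1396$

    $SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 4739 - \dfrac{(90)(474)}{8} = -593.5$

    $SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 1396 - \dfrac{(90)^2}{8} = 383.5$

    $b = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{-593.5}{383.5} = -1.5476$

    $a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605$

    $\hat{y} = 76.6605 - 1.5476x$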
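
    As a cross-check on Example 2, the slope can also be computed from deviations about the means, which is algebraically equivalent to the SSxy/SSxx formulas but is not the form used in the module. The sketch below is illustrative, and the function name fit_by_deviations is hypothetical.

```python
def fit_by_deviations(xs, ys):
    """Least squares line using deviations from the means.

    Equivalent form: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    return a, b

experience = [5, 2, 12, 9, 15, 6, 25, 16]      # years of driving experience
premium = [64, 87, 50, 71, 44, 56, 42, 60]     # monthly premium in dollars
a2, b2 = fit_by_deviations(experience, premium)
print(f"premium = {a2:.4f} + ({b2:.4f}) * experience")
```

    The negative slope indicates that, in this sample, the monthly premium falls as driving experience increases.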

    Practice Questions

    1. The age versus price data for printers are reported in the table below. Age is in
    years while prices are in hundreds of dollars.

    Age in years (x)    Price in hundreds of dollars (y)
    5                   80
    7                   57
    6                   58
    6                   55
    5                   70
    4                   88
    7                   43
    6                   60
    5                   69
    5                   63
    2                   118

    Required:

    i. Find the equation of the regression line.

    ii. Describe the apparent relationship between age and price for the printers

    iii. What does the slope of the regression equation represent in terms of the price for

    printers?

    iv. Panama enterprise wants to buy 3 year old and 4 year old printers from the firm.

    How much do you predict the firm will spend in buying the two printers?

    8.HYPOTHESIS TESTING

    Definition

    A hypothesis is a claim or an opinion about an item or issue. It therefore has to be
    tested statistically in order to establish whether or not it is correct.

    When testing a hypothesis, one must fully understand the two basic hypotheses to be
    tested, namely

    i. The null hypothesis (H0)

    ii. The alternative hypothesis(H1)

    The null hypothesis

    This is the hypothesis being tested, the belief about a certain characteristic. For
    example, the Kenya Bureau of Standards (KBS) may visit a sugar-making company with
    the intention of confirming that the 2 kg bags of sugar produced actually weigh 2 kg
    and not less. They conduct hypothesis testing with the null hypothesis being
    H0: each bag weighs 2 kg.

    The testing will set out to confirm this or to refute it.

    The alternative hypothesis

    While formulating a null hypothesis we also consider the fact that the belief might be
    found to be untrue, in which case we will reject it. We therefore formulate an
    alternative hypothesis which contradicts the null hypothesis, so that when we reject
    the null hypothesis we accept the alternative hypothesis.

    In our example the alternative hypothesis would be

    H1: each bag does not weigh 2 kg

    Acceptance and rejection regions

    All possible values which a test statistic may assume are either consistent with the
    null hypothesis (acceptance region) or lead to the rejection of the null hypothesis
    (rejection region or critical region).

    The values which separate the rejection region from the acceptance region are called
    critical values.

    Type I and type II errors

    While testing a null hypothesis (H0) and deciding either to accept or to reject it,
    there are four possible outcomes.

    a) Acceptance of a true hypothesis (correct decision): the null hypothesis is accepted
    and this happens to be the correct decision. Note that statistics does not give
    absolute information, so its conclusion could be wrong; it is only that the probability
    of it being right is high.

    b) Rejection of a false hypothesis (correct decision).

    c) Rejection of a true hypothesis (incorrect decision): this is called a type I error, with
    probability = $\alpha$.

    d) Acceptance of a false hypothesis (incorrect decision): this is called a type II error,
    with probability = $\beta$.

    Levels of significance

    A level of significance is a probability value which is used when conducting tests of
    hypothesis. It is the probability of making an incorrect decision, namely rejecting a
    true null hypothesis (a type I error), after the statistical testing has been done. The
    probabilities used are usually very small, e.g. 1% or 5%.

    [Figure: normal curve with areas 0.5000 and 0.4900 marked and a 1% tail labelled "provision for errors"]

    Hypothesis testing procedure

    Whenever a business complaint comes up there is a recommended procedure for

    conducting a statistical test. The purpose of such a test is to establish whether the null

    hypothesis or alternative hypothesis is to be accepted.

    The following are steps normally adopted

    1. Statement of the null and alternative hypothesis

    2. Statement of the level of significance to be used.

    3. Statement about the test statistic, i.e. what is to be tested, e.g. the sample mean,
    sample proportion, difference between sample means or sample proportions

    4. Type of test, whether two-tailed or one-tailed.

    5. Statement on critical values using the appropriate level of significance

    6. Standardizing the test statistic

    7. Conclusion showing whether to accept or reject the null hypothesis

    Hypothesis testing for the mean

    Example 1

    A certain NGO carried out a survey in a certain community in order to establish the
    average age at which girls are married. The results of the survey indicated that the
    mean marriage age for girls is 19 years.

    In order to establish the validity of this mean marital age, a sample of 50 women was
    interviewed, and the sample indicated that on average they got married at the age of
    16 years. The ages at which they were married varied, with a standard deviation of
    2.1 years.

    The sample data indicate that the marital age is less than 19 years. Is this conclusion
    true or not?

    Required

    Conduct a statistical test to either support or refute the above conclusion drawn from
    the sample statistics, i.e. that the marriage age is less than 19 years. Use a level of
    significance of 5%.

    Solution

    1. Null hypothesis

    H0: $\mu$ (mean marital age) = 19 years

    Alternative hypothesis H1: $\mu$ (mean marital age) < 19 years

    2. The level of significance is 5%

    3. The test statistic is the sample mean age, $\bar{X}$ = 16 years

    4. The critical value of the one-tailed test (one-tailed because the alternative
    hypothesis is an inequality in one direction) at the 5% level of significance is −1.65

    The test statistic is standardized as

    $Z = \dfrac{\bar{X} - \mu}{S_{\bar{x}}}$, where $S_{\bar{x}} = \dfrac{S}{\sqrt{n}}$

    where $\bar{X}$ = sample mean

    $\mu$ = population mean

    $S$ = sample standard deviation

    $n$ = sample size

    $Z$ = standard value (as per computation)

    5. The standard value Z must fall within the acceptance region for us to accept the
    null hypothesis. Thus it must be greater than −1.65; otherwise we accept the
    alternative hypothesis.

    $Z = \dfrac{16 - 19}{2.1/\sqrt{50}} = -10.1$

    [Figure: standard normal curve with axis from −3 to 3; rejection region on the left, acceptance region on the right]

    6. Since −10.1 < −1.65, we reject the null hypothesis and accept the alternative
    hypothesis at the 5% level of significance, i.e. the marriage age in this community is
    significantly lower than 19 years.
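
    Steps 5 and 6 can be sketched in code. This is an illustrative sketch only: the function names z_statistic and reject_null and the tail labels are hypothetical, and the critical value 1.65 is taken from the solution rather than computed.

```python
import math

def z_statistic(sample_mean, hypothesized_mean, s, n):
    """Standardized test statistic Z = (x_bar - mu) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (s / math.sqrt(n))

def reject_null(z, critical, tail):
    """Decision rule; critical is the positive table value.

    tail is 'lower' (H1: mu < mu0), 'upper' (H1: mu > mu0) or 'two' (H1: mu != mu0).
    """
    if tail == "lower":
        return z < -critical
    if tail == "upper":
        return z > critical
    if tail == "two":
        return abs(z) > critical
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

# Example 1: sample mean 16, hypothesized mean 19, s = 2.1, n = 50, 5% one-tailed test
z_marriage = z_statistic(16, 19, 2.1, 50)
reject_marriage = reject_null(z_marriage, critical=1.65, tail="lower")
print(f"Z = {z_marriage:.1f}, reject H0: {reject_marriage}")
```

    The same two functions cover upper-tailed and two-tailed tests by changing the tail argument and the critical value.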

    Example 2

    Test the hypothesis that weight loss in a new diet program exceeds 20 pounds during

    the first month.

    Sample data: $n = 36$, $\bar{x} = 21$, $s^2 = 25$, $\mu_0 = 20$, $\alpha = 0.05$

    H0: $\mu = 20$ ($\mu$ is not larger than 20)

    Ha: $\mu > 20$ ($\mu$ is larger than 20)

    $Z = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{21 - 20}{5/\sqrt{36}} = 1.2$

    $z_{\alpha} = 1.645$

    [Figure: standard normal curve with axis from −3 to 3; acceptance region on the left, rejection region on the right]

    At 5%, the critical value is z = 1.645

    RR: Reject H0 if z > 1.645

    Decision: Do not reject H0, because the calculated value Z = 1.2 is outside the
    rejection region.

    Conclusion: At the 5% significance level there is insufficient statistical evidence to
    conclude that weight loss in the new diet program exceeds 20 pounds in the first month.

    Exercise: Test the claim that weight loss is not equal to 19.5.

    Example 3

    A machine is set to cut bars to an average length of 150 mm. An operator wants to
    check whether the setting is accurate. She samples 50 bars and finds a mean of
    148 mm. The standard deviation is known to be 5 mm. Is the machine still reliable? Test
    this at the 1% significance level.

    Solution

    H0: $\mu = 150$ (machine may be reliable)

    Ha: $\mu \neq 150$ (two-tailed test; machine not reliable, as it may produce lengths
    that are too long or too short. We cannot get a direction from the wording of the
    question.)

    $\alpha = 0.01$; critical value: $z_{\alpha/2} = z_{0.005} = 2.575$

    $Z = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{148 - 150}{5/\sqrt{50}} = -2.83$

    [Figure: standard normal curve with axis from −3 to 3; rejection regions of area 0.005 in each tail]

    Hypothesis testing for proportion

    A member of parliament (MP) claims that in his constituency only 50% of the total
    youth population lacks university education. A local media company wanted to
    ascertain that claim, so they conducted a survey taking a sample of 400 youths; of
    these, 54% lacked university education.

    Required:

    At the 5% level of significance, test whether the MP's claim is wrong.

    Solution.

    Note: This is a two-tailed test, since we wish to test whether the proportion is
    different ($\neq$) from the claimed value, and not against a specific directional
    alternative, e.g. < (less than) or > (more than).

    H0: $p = 50\%$ of all youth in the constituency lack university education.

    H1: $p \neq 50\%$ of all youth in the constituency lack university education.

    $S_p = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{0.5 \times 0.5}{400}} = 0.025$

    $Z = \dfrac{0.54 - 0.50}{0.025} = 1.6$
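
    The standardized value for the proportion can be checked in the same way. A minimal sketch follows; the function name proportion_z is hypothetical, the standard error uses the claimed proportion as in the solution, and 1.96 is the usual two-tailed table value at the 5% level.

```python
import math

def proportion_z(sample_p, claimed_p, n):
    """Return (z, standard_error) for a test of a claimed population proportion."""
    standard_error = math.sqrt(claimed_p * (1 - claimed_p) / n)
    z = (sample_p - claimed_p) / standard_error
    return z, standard_error

# MP's claim: 50% lack university education; sample of 400 gives 54%
z_youth, se_youth = proportion_z(0.54, 0.50, 400)
reject_claim = abs(z_youth) > 1.96      # two-tailed decision at the 5% level
print(f"S_p = {se_youth:.3f}, Z = {z_youth:.2f}, reject H0: {reject_claim}")
```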

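    At the 5% level of significance for a two-tailed test the critical value is 1.96. Since the
    calculated Z value (1.6) is less than 1.96, it falls within the acceptance region, so we
    do not reject the null hypothesis: the sample does not provide enough evidence that
    the MP's claim is wrong.

    Conclusion of Example 3 (machine setting):

    Z (sample) = −2.83

    Since −2.83 < −2.575, this falls in the rejection region, i.e. the region of Ha.

    Conclusion: reject H0. There is enough evidence to support that the machine is no
    longer reliable.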

    Practice Questions

    Kenya Commercial Bank Ltd. commissioned research whose results indicated that an
    automatic teller machine (ATM) reduces the cost of routine banking transactions.
    Following this information, the bank installed an ATM facility at the premises of Joy
    Processing Company Ltd., which for the last several months has been used exclusively
    by Joy's 605 employees. A survey on the usage of the ATM facility by 100 of the
    employees in a month indicated the following:

    Number of times ATM used    Frequency
    0                           20
    1                           32
    2                           20
    3                           13
    4                           10
    5                           5

    Required:

    a) An estimate of the proportion of Joy's employees who do not use the ATM facility in
    a month

    b) i) Determine the 95% confidence interval for the estimate in (a) above

    ii) Can the bank be certain that at least 40% of Joy's employees will use the ATM
    facility?

    c) The number of ATM transactions an employee of Joy makes per month, on average

    d) Determine the 95% confidence interval of the mean number of transactions made
    by an employee in a month.

    e) Is it possible that the population mean number of transactions is four?
    Explain.