- Last Time: More on linear regression
- Today:
- Collect any optional assignments (chapter 13 review problems). You may
hand those in as late as noon Friday as a homework replacement.
- Re: your test grade: instead of counting the tests equally, I'll
count your
- Best
- Second best
- Average of best and worst.
- Questions on linear regression
(especially interpreting Minitab output)?
- Interpreting the scatterplot
- Interpreting the coefficients of the linear
equation (slope and y-intercept)
- Interpreting fit (R^2)
- Hypothesis testing for the coefficients (observed
test statistic and p-value)
- Prediction and confidence intervals
- Extrapolation is risky! Interpolation is safer.
- Review:
- Generalities/Logistics
- The final: 5/13, 10:10-12:10. Contact me with questions up
  to the final.
- For the final you may use both sides of one 8.5x11 inch
  sheet of paper for formulas and graphs, which you will
  turn in with your test. I will provide a Z-table and a
  t-table.
- Use complete sentences and complete thoughts when writing
  up a solution, and address the substantive issues of the
  problem.
- Remember that your linear regression homework will count
as part of your final grade. You will attach it to the
test when you hand it in.
- Statistical basics
- Statistics is a means of teasing information out of data.
  Data is messy, noisy, and full of errors. We'll make
  mistakes in our conclusions, but we seek to control those
  mistakes and to predict how often they occur.
- Random variables (e.g. NKU heights) are our focus,
  especially those which are quantitative. They are
  distributed in various ways, e.g.
  - normally (e.g. heights)
  - uniformly (e.g. values on a fair die, or coin tosses)
  We often seek to characterize their distributions
  (e.g. find the parameters which lurk behind the
  distributions).
- In the context of any example, be able to identify the
  population and parameter versus the sample and statistic,
  which we hope shed light on the population and parameter.
- Be able to identify examples of descriptive statistics
  (which simply characterize samples of a random variable)
  and inferential statistics (which seek to characterize
  the underlying distribution via its parameters).
- Is the sample random and unbiased?
Remember that associated with inferential statistics
are confidence (and significance) levels. The price of
more confidence is a better (usually larger) sample.
- Histograms - give us insight into the true distribution of
a random variable - e.g. NKU heights
- symmetry
- unimodal, bimodal, or multimodal?
- What are typical (central) values?
- What are the largest and smallest values?
- Are there unusual or atypical values?
- Measures of central tendency of a random variable
- answer the question "Where is the data centered?"
- mean - average value (center of mass of histogram)
- sensitive to outliers
- median - middle value - insensitive to outliers
- mode - most common value (or class midpoint)
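For a concrete feel, here's a minimal Python sketch (standard library only; the heights are made up) showing the outlier sensitivity mentioned above:

    from statistics import mean, median, mode

    heights = [64, 66, 67, 68, 68, 70]   # hypothetical heights (inches)
    with_outlier = heights + [84]        # add one unusually tall value

    print(mean(heights), mean(with_outlier))      # 67.17 -> 69.57: the mean chases the outlier
    print(median(heights), median(with_outlier))  # 67.5 -> 68: the median barely moves
    print(mode(with_outlier))                     # 68, the most common value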
- Measures of spread of a random variable
- answer the question "How does the data 'deviate'
from the center?"
- range, variance, and standard deviation (square
root of the variance)
- Empirical rule (for bell-shaped distributions): the 68/95/99.7 rule
- Chebysheff's Theorem: the (conservative) 0/75/89 rule
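To see where the two rules come from, a minimal Python sketch (assuming scipy is available); the normal probabilities are exact for bell-shaped distributions, while Chebysheff's bound holds for any distribution:

    from scipy.stats import norm

    for k in (1, 2, 3):
        bell = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations of the mean
        cheb = 1 - 1 / k**2                 # Chebysheff's guaranteed minimum fraction
        print(f"k={k}: bell-shaped {bell:.3f}, Chebysheff at least {cheb:.3f}")
    # k=1: 0.683 vs 0.000; k=2: 0.954 vs 0.750; k=3: 0.997 vs 0.889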
- The normal distribution and probabilities
- Think of the normal distribution as the limiting case of a
  histogram with a huge number of samples
- Know how to use the Z table of normal probabilities (p. 235, or
appendix B-8)
- Writing areas as probabilities: e.g. P(0 < Z < 2.17) is
  the value in the table on p. 235 for a Z value of 2.17,
  namely .4850.
- Using the Z table backwards: given an area, find the
  associated value (or values) of Z.
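The same two lookups can be sketched in Python (assuming scipy; note that norm.cdf gives the area to the left of z, so we subtract the half below zero):

    from scipy.stats import norm

    # Forward: P(0 < Z < 2.17), the table entry for z = 2.17
    print(norm.cdf(2.17) - norm.cdf(0))   # 0.4850

    # Backwards: which z traps 47.5% of the area between 0 and z?
    print(norm.ppf(0.5 + 0.475))          # 1.96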
- Sampling distributions (obtained using the central limit theorem):
  - of a mean (n > 30? A relatively normal-looking histogram
    means we can get away with fewer.)
  - of a proportion (np > 5 and n(1-p) > 5?)
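A quick simulation (numpy assumed; the seed and counts are arbitrary) showing the central limit theorem at work on a fair die:

    import numpy as np

    rng = np.random.default_rng(0)
    # Single rolls are uniform on 1..6, but means of n = 30 rolls pile up in a bell shape
    means = rng.integers(1, 7, size=(10_000, 30)).mean(axis=1)
    print(means.mean())   # close to the true mean, 3.5
    print(means.std())    # close to sigma/sqrt(n) = 1.71/sqrt(30), about 0.31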
- Estimators
- point estimators
- unbiasedness
- consistency
- relative efficiency (comparing estimators)
- interval estimators - generally, the larger the sample
  size, the smaller the interval - e.g. the CI for the mean
  - Compute the sample size necessary to obtain an
    interval estimator of fixed width
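Both computations in one minimal sketch (hypothetical numbers; sigma assumed known, so the Z-based formulas apply):

    import math
    from scipy.stats import norm

    xbar, sigma, n = 98.2, 15.0, 100       # hypothetical sample mean, known sigma, sample size
    z = norm.ppf(0.975)                    # z_{alpha/2} for 95% confidence, about 1.96
    half = z * sigma / math.sqrt(n)
    print(xbar - half, xbar + half)        # 95% CI: roughly 95.26 to 101.14

    B = 2.0                                # desired half-width of the interval
    print(math.ceil((z * sigma / B)**2))   # required sample size: 217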
- Hypothesis testing
- null and alternative hypotheses
- test statistic
- distribution of test statistic
- one- and two-tailed tests
- significance and confidence levels
- p-values - the probability of a statistic at least as
  extreme as the one observed, assuming the null
  hypothesis is true
- rejection regions - region of extreme statistics
- Type I (reject true null) and Type II errors (fail
to reject false null)
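Putting those pieces together in a small sketch (made-up numbers; a one-tailed Z test with sigma known):

    from scipy.stats import norm

    # H0: mu = 100 versus H1: mu > 100, sigma known (hypothetical values)
    xbar, mu0, sigma, n = 103.0, 100.0, 15.0, 36
    z = (xbar - mu0) / (sigma / n**0.5)   # observed test statistic: 1.2
    p = 1 - norm.cdf(z)                   # one-tailed p-value: about 0.115
    print(z, p)                           # p > 0.05, so we fail to reject H0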
- t-distribution
- Major issue: sample standard
deviation (s), rather than population
value (sigma).
- Major twist: "degrees of freedom" is now required.
- Looks like a normal distribution, more so as the degrees
  of freedom head to infinity
- Assure yourself that the use of t is appropriate
The tests remain essentially the same; only the
distribution changes. We have to use a t-table rather
than a Z-table.
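The same machinery in a Python sketch (scipy assumed; the data are made up): ttest_1samp uses s and n-1 degrees of freedom behind the scenes, and t.ppf plays the role of the t-table:

    from scipy import stats

    sample = [12.1, 11.6, 12.4, 12.0, 11.8, 12.3]     # made-up measurements, small n
    t, p = stats.ttest_1samp(sample, popmean=12.0)    # H0: mu = 12, two-tailed
    print(t, p)
    print(stats.t.ppf(0.975, df=len(sample) - 1))     # t-table critical value, df = 5: 2.571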
- Test statistic and CI for a proportion p
  - use normal statistics
  - the variance is tied to the point estimate: use the
    point estimate phat in the standard error formula for
    the CI, but the hypothesized value p0 for the test
    (see the sketch below)
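A sketch of that distinction (hypothetical counts; scipy assumed):

    from math import sqrt
    from scipy.stats import norm

    n, x, p0 = 400, 220, 0.5                      # hypothetical: 220 successes in 400 trials
    phat = x / n

    # Test H0: p = p0 -- the standard error uses p0, since we compute assuming H0 is true
    z = (phat - p0) / sqrt(p0 * (1 - p0) / n)
    print(z, 2 * (1 - norm.cdf(abs(z))))          # z = 2.0, two-tailed p about 0.046

    # CI -- no hypothesized value to lean on, so the standard error uses phat
    half = norm.ppf(0.975) * sqrt(phat * (1 - phat) / n)
    print(phat - half, phat + half)               # roughly 0.501 to 0.599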
- Difference in two means
- Independent samples
- Paired samples - reduces to a standard t-test of
the mean of the difference variable.
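Both cases in one sketch (scipy assumed, data invented); note that the paired version amounts to a one-sample t-test on the differences:

    from scipy import stats

    a = [5.1, 4.8, 5.6, 5.0, 5.3]
    b = [4.6, 4.9, 4.4, 5.0, 4.7]
    print(stats.ttest_ind(a, b))   # independent samples (pools the variances by default)
    print(stats.ttest_rel(a, b))   # paired samples: a t-test on the differences a[i] - b[i]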
- New material:
- Difference in two proportions (section 13.6 -
e.g. the M&M experiments we conducted in class)
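A minimal sketch of the pooled two-proportion test from 13.6 (the counts below are invented, not our class M&M data):

    from math import sqrt
    from scipy.stats import norm

    x1, n1, x2, n2 = 60, 250, 45, 250        # hypothetical: blue M&M counts in two batches
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)           # under H0: p1 = p2, pool the two samples
    z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2))
    print(z, 2 * (1 - norm.cdf(abs(z))))     # z about 1.65, two-tailed p about 0.10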
- Linear regression (interpreting Minitab output,
right along the lines of the homework
exercises):
- Interpreting the scatterplot
- Interpreting the coefficients of the linear
equation (slope and y-intercept)
- Interpreting R^2
- hypothesis testing for the coefficients (observed test
statistic and p-value)
- prediction and confidence intervals (avoiding
extrapolation)
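We read these quantities off Minitab output in class, but they can also be reproduced in a short Python sketch (scipy assumed; the data are invented):

    from scipy import stats

    x = [1, 2, 3, 4, 5, 6, 7, 8]                   # made-up predictor values
    y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.8]   # made-up responses
    res = stats.linregress(x, y)
    print(res.intercept, res.slope)   # coefficients of yhat = b0 + b1*x
    print(res.rvalue**2)              # R^2: the fraction of variation in y explained by x
    print(res.pvalue)                 # p-value for H0: slope = 0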
- Evaluations
- Next time: See you at the final!
Website maintained by Andy Long.
Comments appreciated.