Chapter 4 - Empirical models
Empirical models are the antithesis of theoretical models. If you have
information about the process being modeled, then a carefully constructed
theoretical model should be in close agreement with the "empirical data":
that is, if we just collect data, we expect the results to agree well with
the theoretical predictions of the model. Theoretical models are generally
considered "better" than empirical ones (the latter are somewhat "dirty").
We'll start with data and create a model. Since nature involves some
stochasticity, we expect our data to be noisy - nature always seems to be
trying to conceal the deterministic aspects of relationships between variables.
True to the modeling process, we're going to start simple, and only complexify
as needed: KISS - Keep It Simple Stupid! For that reason, we'll begin today
with the linear model.
Sections:
- Covariance and correlation (a short computational sketch follows this
outline)
- Interested in the dependence of y on x.
- Is there something systematic about the variation of x and
y? Do they vary together?
- Covariance of a variable with itself is variance.
- Correlation standardizes covariance.
- Correlation does not imply causation! (And we're usually more
interested in causation.)
- Fitting a line - using calculus and the least squares criterion (see the
least-squares sketch after this outline)
- Derivation: (multivariate) calculus solution
- Relationship between slope and covariance.
- R^2: a measure of fit (a small R^2 sketch follows this outline)
- See figure 4.5, p. 158
- R^2 represents the amount of variation explained by the
model relative to a constant model (using the mean as a baseline
model)
- "Some books call R^2 the coefficient of
determination" (p. 159)
- Exercise #2, p. 226
- Finding R^2 in the non-linear case
- same as R^2 in the linear case!
- Example - The X-Files (using results from Mathematica; a prediction
sketch follows this outline)
- Typical use of regression: predict year two's results from year
one's results
- Discussion of the "failures" of the process
- R^2 lowered with more data?! (i.e., incorporating season two,
we'd expect to do better....)
- season two would have done a woeful job of predicting year
one....
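For the covariance and correlation item above, here is a minimal Python
sketch (Python is used only for illustration, and the data values are made
up; the formulas are the usual sample versions with an n - 1 denominator):

    def mean(v):
        return sum(v) / len(v)

    def covariance(x, y):
        # average product of deviations from the means (n - 1 denominator)
        xbar, ybar = mean(x), mean(y)
        return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (len(x) - 1)

    def correlation(x, y):
        # correlation standardizes covariance by the two standard deviations,
        # so it is unitless and lies in [-1, 1]
        return covariance(x, y) / (covariance(x, x) ** 0.5 * covariance(y, y) ** 0.5)

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.2, 1.9, 3.2, 3.8, 5.1]

    print(covariance(x, x))   # covariance of x with itself is the variance of x
    print(covariance(x, y))
    print(correlation(x, y))  # near +1: x and y vary together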
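For the line-fitting item: minimizing the sum of squared errors
SSE(a, b) = sum_i (y_i - (a + b x_i))^2 by setting the partial derivatives
with respect to a and b to zero gives the familiar closed form, with
slope = cov(x, y) / var(x) and an intercept that forces the line through
(xbar, ybar). A minimal sketch, again with made-up data:

    def fit_line(x, y):
        n = len(x)
        xbar = sum(x) / n
        ybar = sum(y) / n
        # slope = cov(x, y) / var(x); the (n - 1) denominators cancel
        num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        den = sum((xi - xbar) ** 2 for xi in x)
        b = num / den
        a = ybar - b * xbar   # the fitted line passes through (xbar, ybar)
        return a, b

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.2, 1.9, 3.2, 3.8, 5.1]
    a, b = fit_line(x, y)
    print(f"y = {a:.3f} + {b:.3f} * x")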
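For the R^2 items: R^2 compares the model's squared error to that of the
constant "just use the mean" model, and the same formula is applied in the
non-linear case. A small sketch (the predictions here are hypothetical):

    def r_squared(y, y_hat):
        ybar = sum(y) / len(y)
        sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))   # model error
        sst = sum((yi - ybar) ** 2 for yi in y)                   # mean-model error
        return 1.0 - sse / sst

    y     = [1.2, 1.9, 3.2, 3.8, 5.1]
    y_hat = [1.1, 2.1, 3.0, 4.0, 5.0]   # predictions from some fitted model
    print(r_squared(y, y_hat))          # 1 = perfect; 0 = no better than the mean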
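For the X-Files example: the sketch below shows only the mechanics of fitting
a trend to "year one," extrapolating it to "predict" year two, and scoring
the prediction with R^2. The numbers are placeholders, not the actual
X-Files data (the results discussed in class came from Mathematica):

    import numpy as np

    episode = np.arange(1, 11)                      # episodes 1..10 of year one
    year1 = np.array([7.4, 7.9, 7.3, 8.1, 8.0, 8.6, 8.4, 8.9, 9.1, 9.0])
    year2 = np.array([9.2, 8.8, 9.5, 9.1, 9.9, 9.4, 10.1, 9.8, 10.3, 10.0])

    # degree-1 polyfit returns (slope, intercept)
    b, a = np.polyfit(episode, year1, 1)
    predicted = a + b * (episode + len(episode))    # extrapolate to episodes 11..20

    # how well does the year-one trend explain year two?
    sse = np.sum((year2 - predicted) ** 2)
    sst = np.sum((year2 - year2.mean()) ** 2)
    print("R^2 =", 1 - sse / sst)

Note that if the extrapolation is poor, this R^2 can even come out negative,
which is one way the "failures" in the outline can show up.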
Note:
In the end, the result is to be a model. The question then becomes
one of interpretability: can the resulting model explain anything? Or is it
simply a way of predicting values that are close to the actual ones, without
any knowledge being gained? We certainly hope that knowledge increases,
based on the resulting model.
Some Questions:
- What is the range for which the model might remain valid? How far can you
trust this model (how far can you throw this model)?
- When is complexifying worthwhile? How do I decide whether R^2 is
okay?