\(\newenvironment{mat}{\left[\begin{array}}{\end{array}\right]}
\newcommand{\colvec}[1]{\left[\begin{matrix}#1 \end{matrix}\right]}
\newcommand{\rowvec}[1]{[\begin{matrix} #1 \end{matrix}]}
\newcommand{\definiteintegral}[4]{\int_{#1}^{#2}\,#3\,d#4}
\newcommand{\indefiniteintegral}[2]{\int#1\,d#2}
\def\u{\underline}
\def\summ{\sum\limits}
\newcommand{\lt}{ < }
\newcommand{\gt}{ > }
\newcommand{\amp}{ & }
\)
One of the most important techniques for constructing empirical models in
math modeling is regression, most frequently linear regression (but we'll
want to consider non-linear regression as well).
So, usually when we think of linear things, we think of straight
line functions:
\(y(x)=a+bx\)
But if you're willing to go into higher dimensions, then planes
are linear functions (and so on into higher and higher dimensions):
\(z(x,y)=a+bx+cy\)
They're linear in \(x\) and linear in \(y\). But it's also
important to realize that these two kinds of functions are linear in
their parameters (\(a\), \(b\), \(c\)) too. And for that reason,
we can use the process of linear regression to find appropriate models
non-linear in \(x\). So we can fit quadratics, say, and even more
complicated functions (such as sums of sines and cosines).
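To make that concrete, here is a minimal sketch (in Python, assuming numpy is available; the data values are made up) of fitting a quadratic by ordinary linear least squares: the model is non-linear in \(x\), but linear in the parameters, so all we need is an extra column of \(x^2\) values in the design matrix.
```python
# A quadratic y = a + b*x + c*x^2 is nonlinear in x but linear in the
# parameters a, b, c, so linear least squares still applies once we
# build the right design matrix. (numpy assumed; data values are made up.)
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 7.2, 13.1, 20.8])

X = np.column_stack([np.ones_like(x), x, x**2])   # columns: 1, x, x^2
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares (a, b, c)
print(coeffs)                                     # roughly (1, 1, 1) for this data
```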
We start by converting our linear equation, given in slope-intercept
form, into vector notation: we write \(y(x)\) as a so-called "inner
product" of a "row vector" and a "column vector",
\begin{equation*}
y(x)=a+bx=1\cdot a+x\cdot b=
\rowvec{ 1 \amp x}
\colvec{ a \\ b}
\end{equation*}
Now we'd like this to be a line so wonderful that every point fits it,
so that
\begin{equation*}
\begin{array}{cc}
y_1=a+bx_1= \amp \rowvec{ 1 \amp x_1}\colvec{ a \\ b}\\
y_2=a+bx_2= \amp \rowvec{ 1 \amp x_2}\colvec{ a \\ b} \\
\vdots \amp \vdots \\
y_n=a+bx_n= \amp \rowvec{ 1 \amp x_n}\colvec{ a \\ b}
\end{array}
\end{equation*}
Let's write that stack of equations in "matrix" and "vector" format:
the matrix product on the right just acts like a bunch of inner
products of vectors, as in the first example above:
\begin{equation*}
\left[
\begin{array}{c}
y_1\\
y_2\\
\vdots\\
y_n
\end{array}
\right]_{n \times 1}
=
\left[
\begin{array}{cc}
1 \amp x_1\\
1 \amp x_2\\
\vdots \amp \vdots \\
1 \amp x_n
\end{array}
\right]_{n \times 2}
\colvec{ a \\ b}_{2 \times 1}
\end{equation*}
- On the left, we have a column vector (call it \(\u{y}\) --
notation: an underline means that it's a vector, and we are going
to assume that vectors are column vectors -- tall and thin, not
short and wide);
- on the right we have a matrix, call it \(X\) (note the notation:
capital letters are usually matrices), and
- on the far right a column vector of parameters (call
it \(\u{\beta}\)):
\begin{equation*}
\u{\beta}=\colvec{ a \\ b}
\end{equation*}
It's \(\u{\beta}\) that we'd like to solve for -- we know everything
else!
Observe that we have written the dimensions of each object at the
bottom right. Vectors are really just narrow matrices.
Written as an equation, we have
\begin{equation*}\u{y}=X\u{\beta}\end{equation*}
which looks linear! Our job is to find \(\u{\beta}\), the
"slope
vector".
I hope that your impulse is to "divide by X" - but we're not sure what
that means for matrices. In fact, it turns out that it's meaningless
unless \(X\) is square (that is, \(n=2\) in this case, so that
there are two rows and two columns).
If \(X\) IS square, \(2 \times 2\), then we should be able to
find a unique line, provided the two points are distinct (i.e., they're
not the same point!). This is because there's a unique line passing
through two distinct points.
Otherwise, if there are more than two points, then we can find a unique
line only if the points are collinear - that is, they all fall exactly
on the same line. And we don't expect this to happen, except by dumb
luck. So generally our line won't fit all the points. We want to find
some way to choose the "best line", of course. So how do we proceed?
Suppose \(X\) is not square - that is, there are more than two
points. We use a trick of sorts to create a
matrix that IS square: we multiply the matrix \(X\) by
its transpose, which is to say the matrix \(X\) with its
columns and rows interchanged (so the first column becomes the first
row, and the first row becomes the first column, etc.).
It looks like this:
\begin{equation*}
X^T= \left[
\begin{array}{cccc}
1 \amp 1 \amp \cdots \amp 1 \\
x_1 \amp x_2 \amp \cdots \amp x_n
\end{array}
\right]_{2 \times n}
\end{equation*}
Note the interchange in dimensions (so this matrix is \(2 \times
n\)).
Now, we multiply both sides of the equation
\begin{equation*}\u{y}=X\u{\beta}\end{equation*}
by \(X^T\), i.e.
\begin{equation*}X^T\u{y}=X^TX\u{\beta}\end{equation*}\begin{equation*}
\displaystyle
\left[
\begin{array}{cccc}
1 \amp 1 \amp \cdots \amp 1 \\
x_1 \amp x_2 \amp \cdots \amp x_n
\end{array}
\right]
\left[
\begin{array}{c}
y_1\\
y_2\\
\vdots\\
y_n
\end{array}
\right]_{n \times 1}
=
\left[
\begin{array}{cccc}
1 \amp 1 \amp \cdots \amp 1 \\
x_1 \amp x_2 \amp \cdots \amp x_n
\end{array}
\right]
\left[
\begin{array}{cc}
1 \amp x_1\\
1 \amp x_2\\
\vdots \amp \vdots \\
1 \amp x_n
\end{array}
\right]_{n \times 2}
\colvec{ a \\ b}_{2 \times 1}
\end{equation*}
Using the same approach as for a row vector times a column vector, we
multiply the rows of \(X^T\) by the column vector \(\u{y}\) on the left,
and by the columns of \(X\) on the right (producing the matrix
\(X^TX\)), to obtain
\begin{equation*}
\displaystyle
\colvec{\summ_{i=1}^ny_i \\ \summ_{i=1}^nx_iy_i}
=
\left[
\begin{array}{cc}
n \amp \summ_{i=1}^nx_i \\
\summ_{i=1}^nx_i \amp \summ_{i=1}^nx_i^2
\end{array}
\right]_{2 \times 2}
\u{\beta}
\end{equation*}
Those sums are kind of ugly, with the dummy index, so let's just
understand that each sum runs over the \(n\) data points, and
drop the limits and index moving forward:
\begin{equation*}
\displaystyle
\colvec{\sum y_i \\ \sum x_iy_i}
=
\left[
\begin{array}{cc}
n \amp \sum x_i \\
\sum x_i \amp \sum x_i^2
\end{array}
\right]
\u{\beta}
\end{equation*}
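As a quick check, here is a sketch (numpy assumed; x and y are hypothetical data) showing that \(X^T\u{y}\) and \(X^TX\) really are the sums appearing in the normal equations above:
```python
# X^T y and X^T X reduce to the sums in the normal equations above.
# (numpy assumed; x, y are hypothetical data.)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
X = np.column_stack([np.ones_like(x), x])

print(X.T @ y)    # [sum(y_i), sum(x_i*y_i)]
print(X.T @ X)    # [[n, sum(x_i)], [sum(x_i), sum(x_i**2)]]
```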
Okay, now you're ready to "divide" by the matrix
multiplying \(\u{\beta}\), since it's square. Well, we don't so
much divide as we multiply by the multiplicative inverse,
and it turns out that the multiplicative inverse of a \(2 \times 2\)
matrix is easy to compute:
If \(M\) is a \(2 \times 2\) matrix, with entries
\begin{equation*}
M=
\left[
\begin{array}{cc}
p \amp q \\
r \amp s
\end{array}
\right]
\end{equation*}
then the multiplicative inverse \(M^{-1}\) (which exists provided \(ps-rq\neq 0\)) is given by
\begin{equation*}
M^{-1}=\frac{1}{ps-rq}
\left[
\begin{array}{cc}
s \amp -q \\
-r \amp p
\end{array}
\right]
\end{equation*}
The product \(M^{-1}M=MM^{-1}=\left[
\begin{array}{cc}
1 \amp 0 \\
0 \amp 1
\end{array}
\right]\) is called the "identity matrix" \(I\),
so named because \(I\u{\beta}=\u{\beta}\) for all \(2 \times 1\) vectors \(\u{\beta}\).
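Here is the \(2 \times 2\) inverse formula as a little function (a sketch; numpy is assumed only for the check against the general-purpose inverse):
```python
# The 2x2 inverse formula above, checked against numpy's general inverse.
import numpy as np

def inv2(M):
    """Inverse of [[p, q], [r, s]]; requires the determinant p*s - r*q != 0."""
    (p, q), (r, s) = M
    det = p * s - r * q
    return np.array([[s, -q], [-r, p]]) / det

M = np.array([[2.0, 1.0], [5.0, 3.0]])
print(inv2(M) @ M)                             # approximately the identity I
print(np.allclose(inv2(M), np.linalg.inv(M)))  # True
```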
So now we can solve for \(\u{\beta}\). Let
\begin{equation*}
M=
\left[
\begin{array}{cc}
p \amp q \\
r \amp s
\end{array}
\right]
=
\left[
\begin{array}{cc}
n \amp \sum x_i \\
\sum x_i \amp \sum x_i^2
\end{array}
\right]
\end{equation*}
Then, provided the \(x_i\) are not all equal (so that the determinant \(n\sum x_i^2-\left(\sum x_i\right)^2\) is nonzero),
\begin{equation*}
M^{-1}=
\frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
\left[
\begin{array}{cc}
\sum x_i^2 \amp -\sum x_i \\
-\sum x_i \amp n
\end{array}
\right]
\end{equation*}\begin{equation*}
M^{-1}
\displaystyle
\colvec{\sum y_i \\ \sum x_iy_i}
=
M^{-1}
M
\u{\beta}
=
\u{\beta}
\end{equation*}
So
\begin{equation*}
\colvec{ a \\ b}
= \left\{
\begin{array}{c}
M^{-1}
\displaystyle
\colvec{\sum y_i \\ \sum x_iy_i} \cr
\frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
\left[
\begin{array}{cc}
\sum x_i^2 \amp -\sum x_i \\
-\sum x_i \amp n
\end{array}
\right]
\colvec{\sum y_i \\ \sum x_iy_i} \cr
\frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}
\colvec{
\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i \\
n\sum x_i y_i - \sum x_i\sum y_i
} \cr
\colvec{
\frac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}
{n\sum x_i^2-\left(\sum x_i\right)^2}\\
\frac{n\sum x_i y_i - \sum x_i\sum y_i}
{n\sum x_i^2-\left(\sum x_i\right)^2}
} \cr
\colvec{
\frac{\overline{y}\sum x_i^2 - \overline{x}\sum x_iy_i}
{\sum x_i^2-n\overline{x}^2}\\
\frac{\sum x_i y_i - n\overline{x}\overline{y}}
{\sum x_i^2-n\overline{x}^2}
}
\end{array}
\right.
\end{equation*}
So
\begin{equation*}
a =
\frac{\overline{y}\sum x_i^2 - \overline{x}\sum x_iy_i}
{\sum x_i^2-n\overline{x}^2}
\end{equation*}\begin{equation*}
b = \frac{\sum x_i y_i - n\overline{x}\ \overline{y}}
{\sum x_i^2-n\overline{x}^2}
= \frac{\overline{xy} - \overline{x}\ \overline{y}}
{\overline{x^2}-\overline{x}^2}
\end{equation*}
You can check that \(a=\overline{y}-b\overline{x}\).
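Here is a sketch (numpy assumed; x and y are made-up data) that evaluates these closed-form expressions for \(a\) and \(b\) and compares them with numpy's built-in polynomial fit:
```python
# Closed-form intercept a and slope b from the formulas above, compared
# against numpy's least-squares fit. (numpy assumed; x, y are made-up data.)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.9, 6.3])
n = len(x)

b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()
print(a, b)
print(np.polyfit(x, y, 1))   # [slope, intercept] -- should agree
```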
Now how do we get the regression diagnostics that are commonly used
when performing linear regression?
The first order of business is to check that the residuals are
well-behaved. We look for certain aspects of the residuals (a minimal
check is sketched after this list):
- Is there a pattern to the residuals? If so, go back to the
modeling step to capture that pattern.
- Are the residuals independent? (This is related to the presence of
a pattern.)
- Are the residuals normally distributed (or at least
identically distributed)? The assumptions of the regression
procedure include the assumption that the errors are independent
and identically distributed.
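Here is the minimal check promised above (numpy assumed; same made-up data and fit as in the earlier sketches). In practice you would also plot the residuals against \(x\) (or against \(\hat{y}\)) and look at a histogram or Q-Q plot.
```python
# A minimal residual check. (numpy assumed; x, y are made-up data,
# and a, b come from the closed-form fit above.)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.9, 6.3])
n = len(x)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)
print(residuals)          # look for a systematic pattern (there shouldn't be one)
print(residuals.mean())   # essentially zero for a least-squares line with intercept
```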
Having established that our residuals are well-behaved, we can move
on to ask how well our model is capturing the variation in the data.
There are certain quantities that are related to variances that we
begin by computing. The \(SS_{Tot}\) is the TOTAL sum of squared errors of
the null model, which is \(y(x)=\overline{y}\), the mean
value of the \(y_i\) values. So
\begin{equation*}
SS_{Tot}=\sum_{i=1}^n(y_i-\overline{y})^2
\end{equation*}
The SSE is the minimal sum of squared errors, the minimum value
obtained by using our optimal regression
parameters. Defining \(\hat{y}_i=a+bx_i\) as the estimate provided
by the model,
\begin{equation*}
SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2
\end{equation*}
Now we can also define a sum of squared errors of the regression
model, via
\begin{equation*}
SS_{Reg}=\sum_{i=1}^n(\hat{y}_i-\overline{y})^2
\end{equation*}
It turns out (perhaps surprisingly) that
\begin{equation*}
SS_{Tot}=SSE + SS_{Reg}
\end{equation*}
The fraction of the variance explained by the model is the \(R^2\) value,
\begin{equation*}
R^2=\frac{SS_{Tot}-SSE}{SS_{Tot}}=\frac{SS_{Reg}}{SS_{Tot}}
\end{equation*}
which gives us the "big picture" fit to the data in the
scatterplot. \(R^2\) is between 0 and 1, with 0 meaning the fit is no
better than the horizontal null model, and 1 meaning the data fall
perfectly on the fitted line.
\(R^2\) provides a measure of how much of the total variability is
explained by the model.
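A sketch (numpy assumed; same made-up data) that computes these sums of squares, verifies the decomposition, and reports \(R^2\):
```python
# SS_Tot, SSE, SS_Reg and R^2 for the straight-line fit.
# (numpy assumed; x, y are made-up data.)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.9, 6.3])
n = len(x)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

SS_Tot = np.sum((y - y.mean())**2)
SSE    = np.sum((y - y_hat)**2)
SS_Reg = np.sum((y_hat - y.mean())**2)

print(np.isclose(SS_Tot, SSE + SS_Reg))   # the decomposition above
print(SS_Reg / SS_Tot)                    # R^2
```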
Our next objective is to obtain measures of fit for the parameters:
we would like to know their standard errors, which indicate how much
variation we expect from the "best values" obtained by using the normal
equations.
The MSE (mean squared error) is found from the SSE as
\begin{equation*}
MSE=\frac{SSE}{n-p}
\end{equation*}
where \(p\) is the number of parameters in the model (including
the intercept term).
In addition to the
MSE we need the inverse of the matrix \(X^TX\) (which was computed
to derive the normal equations) to get the standard errors of the
parameters. The standard error of the \(i^{th}\) parameter is the
square root of the MSE times the \(i^{th}\) diagonal entry of the
inverse \((X^TX)^{-1}\):
\begin{equation*}
SE_{i}=\sqrt{MSE*(X^TX)^{-1}_{ii}}
\end{equation*}
The t-statistics are the parameters divided by these standard
errors SE: \(t_i=\frac{\beta_i}{SE_i}\). Then, in order to
calculate the (two-sided) probability of a value at least as large
in magnitude as \(t_i\) (assuming a mean of zero), we compute
\begin{equation*}
p_i=2(1-CDF(|t_i|,\nu=n-p))
\end{equation*}
where CDF is the cumulative distribution function of the
t-distribution with \(\nu\) degrees of freedom.
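Putting the last few pieces together, here is a sketch (numpy and scipy assumed; same made-up data) of the MSE, standard errors, t-statistics, and two-sided p-values:
```python
# MSE, standard errors, t-statistics and two-sided p-values for (a, b).
# (numpy and scipy assumed; x, y are made-up data.)
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.9, 6.3])
n, p = len(x), 2                            # p = number of parameters (incl. intercept)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)    # [a, b] via the normal equations
resid = y - X @ beta

MSE = np.sum(resid**2) / (n - p)
SE  = np.sqrt(MSE * np.diag(np.linalg.inv(X.T @ X)))
t   = beta / SE
p_values = 2 * (1 - stats.t.cdf(np.abs(t), df=n - p))
print(beta, SE, t, p_values)
```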
Equivalently, we could construct \(\rho\)-confidence intervals of the form
\begin{equation*}
[\beta_i - t^{crit}_{\rho,\nu=n-p} SE_i,\beta_i +
t^{crit}_{\rho,\nu=n-p} SE_i]
\end{equation*}
where we find \(t^{crit}\) by asking how wide a central interval of
the standardized t-distribution captures the given confidence level
\(\rho\), and then we multiply that critical value by the appropriate
standard error \(SE_i\).
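And a final sketch (numpy and scipy assumed; continuing the previous example) of the corresponding confidence intervals, here with \(\rho = 0.95\):
```python
# rho-confidence intervals for the parameters, using the critical value of
# the t-distribution with n - p degrees of freedom.
# (numpy and scipy assumed; x, y are made-up data.)
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.9, 6.3])
n, p = len(x), 2
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
MSE = np.sum((y - X @ beta)**2) / (n - p)
SE = np.sqrt(MSE * np.diag(np.linalg.inv(X.T @ X)))

rho = 0.95
t_crit = stats.t.ppf(1 - (1 - rho) / 2, df=n - p)   # two-sided critical value
print(np.column_stack([beta - t_crit * SE, beta + t_crit * SE]))  # [lower, upper] per parameter
```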