\(\newenvironment{mat}{\left[\begin{array}}{\end{array}\right]}
\newcommand{\colvec}[1]{\left[\begin{matrix}#1 \end{matrix}\right]}
\newcommand{\rowvec}[1]{[\begin{matrix} #1 \end{matrix}]}
\newcommand{\definiteintegral}[4]{\int_{#1}^{#2}\,#3\,d#4}
\newcommand{\indefiniteintegral}[2]{\int#1\,d#2}
\def\u{\underline}
\def\summ{\sum\limits}
\newcommand{\lt}{ < }
\newcommand{\gt}{ > }
\newcommand{\amp}{ & }
\)
The downside of linear regression is that it only works when the
parameters appear linearly in the model, and many models don't meet that
standard. What if we have a model like
\begin{equation*}
y(x)=ae^{bx}
\end{equation*}
We can make a transformation of the data, and so seek a model of the form
\begin{equation*}
\ln(y(x))=\ln a + bx
\end{equation*}
but this has an impact when it comes to the error structure. The linear
regression model assumes independent identically distributed (iid)
errors, so
\begin{equation*}
\ln(y_i)=\ln a + bx_i + \epsilon_i
\end{equation*}
When we transform back, to talk about \(y\), or rather \(y(x)\), we
have data with an error structure of the form
\begin{equation*}
y_i=ae^{bx_i + \epsilon_i}=ae^{bx_i}e^{\epsilon_i}
\end{equation*}
So the errors become multiplicative. Now that might actually be
appropriate: it depends on exactly what the nature of those errors is.
If the error is all due to reading a thermometer with an error of .05
degree
(i.e. it only reads to the nearest tenth), then the error is going to be
additive, rather than multiplicative - it's about .05 degree across the
spectrum. If the error is in counting things, and you tend to miss about
10 percent no matter the size of the crowd, then it's multiplicative. So
you would like to have some understanding of the type of error structure
you're expecting, and capture it properly.
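To make the transformed-data approach concrete, here is a minimal sketch (in Python, with made-up data) of fitting the log-transformed model by ordinary least squares; remember that, as discussed above, this implicitly treats the errors as multiplicative in \(y\).
```python
import numpy as np

# Hypothetical data, assumed to follow y ~ a*e^(b*x) with y > 0.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.4, 15.2, 40.9, 110.3])

# Ordinary least squares on the transformed model ln(y) = ln(a) + b*x.
b, ln_a = np.polyfit(x, np.log(y), 1)   # slope, intercept
a = np.exp(ln_a)
print(a, b)   # estimates of a and b back on the original scale
```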
If we truly want to work with an error structure of
\begin{equation*}
y_i=ae^{bx_i} + \epsilon_i
\end{equation*}
then non-linear regression permits us an approach. A downside of
non-linear regression is that the diagnostics are not quite as good as
they are for linear regression. But we can get the essentials (especially
the residuals, the \(R^2\), and the standard errors). So here we go.
Subsection 11.1 The Motivation: Newton's Method
I want to start by talking about Newton's method. You might have seen
this in calculus class, but perhaps not. It's one of those topics which
seem to get passed over, sadly, because we're in such a damned hurry to
race through the standard calculus curriculum. It's one of the coolest
things, is tremendously valuable, and has Newton's name on it - so it's a
bit of a surprise that it gets the shaft, but so it goes.
At any rate, I'm not bitter....
So let's take a look at that. It starts with a non-linear function for
which we want a root:
Now let's suppose that we make a guess, and it's not too bad: we
think maybe \(x=.5\) (call this starting guess \(x_0\)). We
draw in the tangent line to the function at the
point \((x_0,f(x_0))\), and construct the tangent line function:
Note: the tangent line function is actually the first-degree Taylor
polynomial approximation to \(f\), centered at \(x_0\). This
is important for the general extension to non-linear regression.
To solve for the root of the original function is a non-linear problem,
and "hard". We can solve for the root of the tangent line function (the
linear function) very easily, however: so we trade our hard non-linear
problem for a problem that we know how to do. Are we clever, or lazy?
Interesting mathematical strategy: if you don't like the problem you are
given, do an easier one!:) We cast the problem into the linear world,
solve it, bring back the solution, and see how we're doing.
Well, in this case, we're closer. By the way, the root we're after can be
approximated numerically using Sage.
Notice that the root is just \(\frac{\pi}{3}\)! So you could have solved it
yourself, just by thinking about it: it's a special triangle. (Did you
think about it?)
At any rate, our method is to use two points and the known slope to find
the next approximation, which comes from
\begin{equation*}
m=\frac{0-f(x_0)}{x_1-x_0}=f'(x_0)
\end{equation*}
or
\begin{equation*}
x_1=x_0-\frac{f(x_0)}{f'(x_0)}
\end{equation*}
In this case, that gives a new estimate \(x_1\).
So by solving the linear problem, we're closer to the solution of
the non-linear problem! So just do it again, do it again, do it
again... until satisfied, whatever that means. Here's the next picture:
You can see that we've gotten a lot closer now, and that, if we just do
it about 10 times, we're going to be golden:
Now isn't that cool? How much cooler than that can you get? We do a whole
bunch of linear problems, and suddenly we have solved the non-linear
problem.
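If you'd like to play with it, here is a minimal sketch of the iteration in Python. The particular function and starting guess are my assumptions (the notes only tell us that the guess is \(x_0=.5\) and the root is \(\frac{\pi}{3}\), so \(f(x)=\cos(x)-\frac{1}{2}\) is a natural stand-in):
```python
import math

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Repeat x <- x - f(x)/f'(x) until the step is negligible."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Stand-in example (my assumption): a function whose root is pi/3.
f = lambda x: math.cos(x) - 0.5
fprime = lambda x: -math.sin(x)
print(newton(f, fprime, 0.5), math.pi / 3)   # both approximately 1.0472
```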
And that's the big idea behind non-linear regression, too.
Let's assume that the scalar value \(y\) is some non-linear function
of several variables, which we will denote by the (column)
vector \(\u{x}\). (Remember, we are the people of the columns,
not the people of the rows.)
So, for example, we might have test score \(y\) as a function of
several variables, making a vector of predictors
\begin{equation*}
\u{x}=\colvec{ACT \\ GPA \\ Age \\ Rank}
\end{equation*}
and those variables might be associated with several parameters, which we
might denote as
\begin{equation*}
\u{\theta}=\colvec{a \\ b \\ c \\ d \\ e}
\end{equation*}
and the non-linear model could be
\begin{equation*}
y = f(\u{x};\u{\theta}) = a \cdot ACT^b \cdot GPA^c \cdot Age^d \cdot Rank^e
\end{equation*}
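As a quick aside (purely illustrative; the variable names are just stand-ins), such a model is easy to encode as a function of the predictor vector and the parameter vector:
```python
def f(x, theta):
    """Hypothetical model: y = a * ACT^b * GPA^c * Age^d * Rank^e."""
    ACT, GPA, Age, Rank = x       # the predictor (column) vector
    a, b, c, d, e = theta         # the parameter vector
    return a * ACT**b * GPA**c * Age**d * Rank**e
```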
Now let's suppose that we have \(n\) data values \(y_i\), and their
associated predictors \(\u{x}_i\). Let's denote the unknown
parameters by the vector \(\u\theta\). We're going to
assume that
\begin{equation*}
y_i = f({\u{x}}_i;{\u{\theta}}) + \epsilon_i,
\end{equation*}
where the residuals \(\epsilon_i\) are iid (independent and
identically distributed).
We're going to need a pretty good starting guess for the parameters (call
this \(\u{\theta}^*\)). That's one of the big differences
between linear and non-linear regression: you frequently need a good
starting guess. Of course the closer you are to
a "solution" the better (and we can't assume a unique solution in the
non-linear case, like we can in the linear case - another important
difference).
Our strategy is going to rely on the Taylor series expansion, which you
should recall from calculus (but may not -
time for review in that case!).
So we have \(n\) equations of the form
\begin{equation*}
y_i = f({\u{x}}_i;{\u{\theta}})
\end{equation*}
which is generally an over-determined system for the
parameters \(\u{\theta}\). So now we use the multivariate Taylor
series, to expand out the right-hand side:
\begin{equation*}
y_i = f({\u{x}}_i;{\u{\theta}})
= f({\u{x}}_i;{\u{\theta}^*})
+
f_{\theta_1}({\u{x}}_i;{\u{\theta}^*})(\theta_1-\theta_1^*)
+
f_{\theta_2}({\u{x}}_i;{\u{\theta}^*})(\theta_2-\theta_2^*)
+
\ldots
+
f_{\theta_p}({\u{x}}_i;{\u{\theta}^*})(\theta_p-\theta_p^*)
\end{equation*}
or
\begin{equation*}
y_i - f({\u{x}}_i;{\u{\theta}^*})
=
f_{\theta_1}({\u{x}}_i;{\u{\theta}^*})(\theta_1-\theta_1^*)
+
f_{\theta_2}({\u{x}}_i;{\u{\theta}^*})(\theta_2-\theta_2^*)
+
\ldots
+
f_{\theta_p}({\u{x}}_i;{\u{\theta}^*})(\theta_p-\theta_p^*)
\end{equation*}
where \(f_{\theta_j}\) denotes the partial derivative of \(f\) with respect
to the \(j^{th}\) parameter \(\theta_j\).
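For instance, for the exponential model \(y=ae^{bx}\) from the start of this
section, with parameters \(\u{\theta}=\colvec{a \\ b}\), these partials are
\begin{equation*}
f_a(x;a,b)=e^{bx},
\qquad
f_b(x;a,b)=axe^{bx}
\end{equation*}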
Now we can write this in vector form as
\begin{equation*}
y_i - f({\u{x}}_i;{\u{\theta}^*})
=
[f_{\theta_1} f_{\theta_2} \ldots f_{\theta_p}]\bigg\rvert_{{\u{x}}_i;{\u{\theta}^*}}(\u{\theta}-\u{\theta}^*)
\end{equation*}
where the vertical bar means to evaluate that vector of partials at the
known vector value \({\u{x}}_i\) and parameter
estimate \({\u{\theta}^*}\).
The only thing we don't know in this equation is \(\u{\theta}\), and
that's what we want to know!
Define now for each data value the difference between the data and the best
model so far,
\begin{equation*}
w_i \equiv y_i - f({\u{x}}_i;{\u{\theta}^*})
\end{equation*}
Then we can write this as a big matrix equation of the form
\begin{equation*}
\u{w}=X(\u{\theta}-\u{\theta}^*)
\end{equation*}
where the rows of \(X\) are the partial derivatives evaluated at the
appropriate \({\u{x}}_i\) and \({\u{\theta}^*}\).
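To make that concrete, here is a minimal sketch (in Python, with made-up data and a made-up starting guess) of building \(X\) and \(\u{w}\) for the exponential model \(y=ae^{bx}\), using the partials written out above:
```python
import numpy as np

# Hypothetical data and a starting guess theta* = (a, b) for y ~ a*e^(b*x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.4, 15.2, 40.9, 110.3])
a, b = 2.0, 0.9                                  # theta*

# Rows of X: the partials [f_a, f_b] evaluated at (x_i, theta*).
X = np.column_stack([np.exp(b * x),              # df/da = e^(b x)
                     a * x * np.exp(b * x)])     # df/db = a x e^(b x)

# w_i = y_i - f(x_i; theta*): data minus the current best model.
w = y - a * np.exp(b * x)
```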
Now we do linear regression to find \(\u{\theta}-\u{\theta}^*\) in the
usual way:
\begin{equation*}
\u{\theta}-\u{\theta}^* = (X^TX)^{-1}X^T{\u{w}}
\end{equation*}
from which we obtain our next estimate for \({\u{\theta}}\) as
\begin{equation*}
{\u{\theta}}=\u{\theta}^* + (X^TX)^{-1}X^T{\u{w}}
\end{equation*}
Let's take a moment to say "Hallelujah and Amen!" We've solved an
associated linear problem, and hope that the new estimate
for \(\u{\theta}\) is closer to the true solution.
Now we reset our guess \(\u{\theta}^*\) to this new value,
\begin{equation*}
\u{\theta}^* = {\u{\theta}}
\end{equation*}
and iterate -- that is, do it again -- computing the new, better
estimate \({\u{\theta}}\) for as long as the system is converging.
By "converging" I mean that the difference \(\u{\theta}-\u{\theta}^*\)
goes to zero as you iterate. Then you just cross your fingers and hope
that you've found a minimum!
(Don't you hate it when it comes down to crossing your fingers and
hoping? Well heh, at least it's a start!:)
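Here is a minimal sketch of the whole loop for the exponential model \(y=ae^{bx}\) (again in Python, with made-up data; a library routine such as scipy.optimize.curve_fit would do the same job, but the point here is to see the iteration itself):
```python
import numpy as np

def gauss_newton(x, y, theta, max_iter=50, tol=1e-10):
    """Refine theta = (a, b) for y ~ a*e^(b*x) by repeatedly solving
    the linearized least-squares problem w = X (theta - theta*)."""
    for _ in range(max_iter):
        a, b = theta
        X = np.column_stack([np.exp(b * x),            # df/da
                             a * x * np.exp(b * x)])   # df/db
        w = y - a * np.exp(b * x)                      # current residuals
        step, *_ = np.linalg.lstsq(X, w, rcond=None)   # = (X^T X)^{-1} X^T w
        theta = theta + step
        if np.linalg.norm(step) < tol:   # "converging": theta - theta* -> 0
            break
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.4, 15.2, 40.9, 110.3])
print(gauss_newton(x, y, np.array([2.0, 0.9])))   # estimates of (a, b)
```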
The cool thing that you'll notice in the linear regression modeling is
that the parameter estimates (the differences \(\u{\theta}-\u{\theta}^*\))
are going to go to zero, but their standard
errors won't. So the p-values are going to go to 1, which is really funny
for linear regression, and the standard errors will converge to fixed
values -- which will be the standard errors of the non-linear parameters!
From those we can compute our parameter confidence intervals in the usual
way. That's really sensational. I hope that you're as excited as I am
about now (but somehow I'm imagining that you're thinking, um, this guy
is off his rocker...:).
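In code terms, here is a minimal sketch (my rendering of the standard nonlinear-least-squares formula, not anything specific to these notes) of turning the final linearization into standard errors for the parameters:
```python
import numpy as np

def nls_standard_errors(X, w):
    """Standard errors from the final linearization: with s^2 = RSS/(n - p),
    SE_j = sqrt( s^2 * [(X^T X)^{-1}]_{jj} )."""
    n, p = X.shape
    s2 = (w @ w) / (n - p)                  # residual variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)       # approximate covariance of theta
    return np.sqrt(np.diag(cov))
```
Here \(X\) and \(\u{w}\) are the Jacobian and residual vector evaluated at the converged \(\u{\theta}\); the usual \(t\)-based confidence intervals then follow exactly as in linear regression.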
Here's
a Mathematica file that illustrates the process in action for a
really interesting data set, involving the onset of rigor mortis
in human bodies. Kind of a macabre problem, but we deal with all kinds of
modeling problems in this class!