Samuel, Silas, and others found that temperatures are rising a degree or so per century. It doesn't sound like much, but it has our land-based ice melting (with disastrous consequences such as flooding, sea-level rise, loss of freshwater sources, no more hydro power, etc.).
I've already said that I wish you all had just said "Hey, what the heck are you talking about?" -- well, before midnight of the day it was due, which is when I got about three emails.... :)
So if you did check out 3/8/47 in the data as it arrived from NOAA, along with those other dates, you might have noticed that they have negative DTRs (Diurnal Temperature Range -- which explains how they were discovered). NOAA doesn't report the DTR, so I computed it myself and noticed something that shouldn't happen: one of the cardinal rules is that you shouldn't have a min greater than a max!
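Here's the kind of check I mean -- a minimal R sketch (the file and column names are my placeholders, not NOAA's actual headers):

    wx <- read.csv("fletcher.csv")   # hypothetical file of daily TMIN/TMAX
    wx$DTR <- wx$TMAX - wx$TMIN      # the DTR isn't reported, so compute it
    bad <- wx[wx$DTR < 0, ]          # a min greater than its max is an error
    bad[, c("DATE", "TMIN", "TMAX", "DTR")]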
If I get really ambitious, I might email NOAA to let them know about the errors in their database. That would be a good thing to do, to help them clean up. We'll see if I get a round tuit... for Christmas.
So we try to find anything like this -- obvious data errors -- that will cause trouble for our analysis. And then we fix them.
I don't have any other software for doing that at the moment, so this was a necessary labor.
It has been quite the labor of love -- no, not love, but rather pain -- to get gstat to do what I wanted. When you're working with different software packages, stuff happens. For example, I define a Gaussian variogram as \[ \gamma(h)=1.0 - e^{-\ln(20)\left(\frac{h}{r}\right)^2}; \] gstat defines it as \[ \gamma(h)=1.0 - e^{-\left(\frac{h}{r}\right)^2}. \] I had to download the source code for gstat (which is open source, thankfully) to discover that. But, in the end, we're getting along.
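For the record, once you know the two definitions you can reconcile them just by rescaling the range: \[ e^{-\ln(20)\left(\frac{h}{r}\right)^{2}} = e^{-\left(\frac{h}{r'}\right)^{2}} \quad\text{where}\quad r' = \frac{r}{\sqrt{\ln(20)}} \approx 0.578\,r, \] so handing gstat the range $r'$ reproduces my variogram exactly.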
Another important trick: gstat is designed for spatial data, so it assumes two spatial dimensions. So I created a constant second coordinate to pair with time, to trick it: $(1,1), (2,1), (3,1), \ldots, (45782,1)$.
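In R the trick looks something like this (a sketch; the variable names are mine, not from class):

    library(sp); library(gstat)
    d <- data.frame(t = 1:45782, dummy = 1, Zmin = zmin)  # zmin: the daily minima
                                     # (or residuals, after removing the periodic mean)
    coordinates(d) <- ~ t + dummy    # gstat insists on two "spatial" dimensions
    v <- variogram(Zmin ~ 1, d, cutoff = 30)  # empirical variogram, out 30 "days"
    plot(v)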
Many statisticians don't know about variograms, and those who do tend to associate them with mining and other spatial problems. But they work just fine for time series.
I used R to compute the empirical variogram, and attempted to let gstat fit a model -- but its non-linear regression isn't as good as other software I use (I used xlispstat; Mathematica does fine, too), so I did the fit myself and handed the variogram model to gstat. Now that we agree on definitions, the pictures from gstat show variograms fitting the data a whole lot better. (The "x" axis is actually "distance in years" -- so we're looking at data spaced by days.)
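For the curious, here's a hypothetical R analogue of the fit I did in xlispstat -- fit my Gaussian form with nls(), then hand gstat a model with the rescaled range from above:

    # fit gamma(h) = s * (1 - exp(-ln(20) * (h/r)^2)) to the empirical variogram
    fit <- nls(gamma ~ s * (1 - exp(-log(20) * (h / r)^2)),
               data = data.frame(h = v$dist, gamma = v$gamma),
               start = list(s = 80, r = 10))  # starting guesses are placeholders
    m <- vgm(psill = coef(fit)[["s"]], model = "Gau",
             range = coef(fit)[["r"]] / sqrt(log(20)))  # rescale the range for gstat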
So if we're going to simulate a temperature (let's say a min), you need the mean and standard deviation for that day of the year. Again, these were computed by linear regression using trig functions that are 1-year periodic; in the end I wanted periodic mean and sd functions, so that was important.
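One plausible way to get those periodic mean and sd functions in R (a sketch -- I'm assuming the sd comes from regressing squared residuals on the same trig terms, which may not be exactly what we did):

    wx$t <- (1:nrow(wx)) / 365.25             # time in years
    fit.mean <- lm(TMIN ~ cos(2*pi*t) + sin(2*pi*t), data = wx)  # periodic mean
    wx$r2 <- residuals(fit.mean)^2            # squared residuals track the variance
    fit.var <- lm(r2 ~ cos(2*pi*t) + sin(2*pi*t), data = wx)     # periodic variance
    sd.hat <- sqrt(pmax(fitted(fit.var), 0))  # clamp to avoid negative variances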
Variogram for the $Zmin$ data | Variogram for the simulated data |
Here's the result of regressing the simulated minima and the actual minima against $\sin(2 \pi t)$ and $\cos(2 \pi t)$:
(And, by the way, my results will be slightly different from yours unless you too removed the erroneous data -- and maybe you found even more to remove!...:)
simulated minima $Smin$ (Mean: 40.017, SD: 18.079):

    Linear Regression:      Estimate    SE              Prob
    Constant                 40.0818    (4.257565E-2)   0.00000
    cos(2 pi t)             -20.4231    (6.012267E-2)   0.00000
    sin(2 pi t)              -8.29364   (6.029912E-2)   0.00000
    R Squared: 0.746108     Sigma hat: 9.10964
    Number of cases: 45782  Degrees of freedom: 45779

actual minima (Mean: 39.993, SD: 18.217):

    Linear Regression:      Estimate    SE              Prob
    Constant                 40.0546    (4.318030E-2)   0.00000
    cos(2 pi t)             -20.9435    (6.097651E-2)   0.00000
    sin(2 pi t)              -7.24789   (6.115547E-2)   0.00000
    R Squared: 0.742800     Sigma hat: 9.23901
    Number of cases: 45782  Degrees of freedom: 45779

simulated maxima $Smax$ (Mean: 60.567, SD: 21.099):

    Linear Regression:      Estimate    SE              Prob
    Constant                 60.6425    (4.549809E-2)   0.00000
    cos(2 pi t)             -24.7985    (6.424956E-2)   0.00000
    sin(2 pi t)              -9.12010   (6.443812E-2)   0.00000
    R Squared: 0.787120     Sigma hat: 9.73493
    Number of cases: 45782  Degrees of freedom: 45779

actual maxima (Mean: 60.530, SD: 21.247):

    Linear Regression:      Estimate    SE              Prob
    Constant                 60.6009    (4.605287E-2)   0.00000
    cos(2 pi t)             -25.3769    (6.503299E-2)   0.00000
    sin(2 pi t)              -7.87640   (6.522385E-2)   0.00000
    R Squared: 0.784930     Sigma hat: 9.85363
    Number of cases: 45782  Degrees of freedom: 45779
(Because our model has no climate change in it: the same means and standard deviations repeat year after year....)
But, to be honest, you've got to love banging your head against a wall...:)
After the "picture show" I'll make a few comments about the final exam material.
The other two are simulations with the same temporal autocorrelation (or pretty close) -- that is, the simulated data has the same variogram -- with each simulation producing its own version of the Fletcher data (see the sketch below the pictures).
(If you guessed that Fletcher's was the one on the left, I'm grateful!)
Simulated data (mins) follows a beautiful variogram model. | Simulated data (maxes) also follows a beautiful -- and slightly different -- variogram model. |
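For what it's worth, gstat can produce such simulations via unconditional sequential simulation; here's a sketch (the parameters are placeholders, and m is the fitted variogram model from the sketch above):

    g <- gstat(formula = z ~ 1, locations = ~ t + dummy, dummy = TRUE,
               beta = 0, model = m, nmax = 30)     # mean-zero Gaussian field
    grid <- data.frame(t = 1:45782, dummy = 1)
    sim <- predict(g, newdata = grid, nsim = 2)    # two simulated series
    # then rescale each day by its periodic mean and sd to get temperatures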
Because these are simulated, they're a little better behaved than the empirical/theoretical variograms we obtained from the data.
As I described last time, one interprets the models. So these models suggest that temperatures are well correlated over only a few days; the correlation falls off to "background" levels after about 10 days. By three days they're at about 75% of the background variance, so they've already lost a lot of their similarity to a given day. I would say that basically these indicate that you can expect today's temps to tell you roughly what tomorrow's will be, maybe the next day's, too -- but after that, it gets tricky...:)
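(To connect correlation and variogram: for a stationary series, \[ \rho(h) = 1 - \frac{\gamma(h)}{\sigma^{2}}, \] where $\sigma^{2}$ is the background (sill) variance -- so being at 75% of the background variance by day three means the correlation has already fallen to about 0.25.)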
Here's a very preliminary answer (again, using only two simulations):
In this graphic we look at the chi-square test of independence, to tell whether there's any relationship between decade and a pair of datasets: Fletcher and Sim1, Fletcher and Sim2, or Sim1 and Sim2.
As you can see from the results, the Fletcher data is much different from the others.
Here (below) I compare a simulated minmin with a uniform distribution across the decades, and the chi-square says that decade and these counts are independent -- i.e. that the simulation is essentially uniform. But let's start with the "whoops":
What "failure to round" means is that there were unlikely to be ties, so there were exactly 366 extreme years. That was very unlikely given the other simulation results, and so I swooped in. And now I know the reason for that. But I still don't know the reason for one of the data sets having only 365 extreme years (see the results below)...!...:)
This illustrates the importance of carefully examining everything for errors, and why it's important to have collaborators who check each other's work.
Let's look over the one below, and then I'll just show you the others with a little commentary. I'm doing a $\chi^2$ test for independence -- the null hypothesis being that there's no relationship between decade and these variables.
And that null is rejected: there is a relationship. (In fact, minmin was our most dramatic departure from uniformity, based on the histogram).
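Mechanically, the test is just a chi-square on a decade-by-source contingency table; a sketch in R, with invented placeholder counts (and only three decades shown):

    tab <- matrix(c(30, 28, 31,
                    25, 35, 29,
                    61, 33, 30), nrow = 3, byrow = TRUE,
                  dimnames = list(decade = c("1950s", "1960s", "1970s"),
                                  source = c("Fletcher", "Sim1", "Sim2")))
    chisq.test(tab)  # a small p-value rejects independence of decade and source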
So I look at the results pair-wise in the following graphic, to determine which pair is "causing the problem", and conclude that the Fletcher results are really quite unlike the simulation results.
The other three variables are shown below (with the grain of salt that one of the simulations should have been rounded, and wasn't). So these are preliminary results; but they show the idea -- what we're trying to accomplish: Fletcher's results are unlike results obtained assuming no climate change, and so they are the result of something else.
The "something" could be error; or it could be bad luck; or it could be climate change. Only God knows for sure. But as scientists, we present the most likely hypothesis, and conclude that Fletcher's data is most consistent with an hypothesis of climate change in Wood County.
If we were lucky (blessed, etc.), the simulations we generate that way might be indistinguishable from Fletcher's data, which would add credibility to our model. But it would also suggest that climate change is indeed occurring, which is sad.... And I assure you that it is occurring, and we're not preparing in the least.
Let's review briefly our journey this semester. The major themes are linear and non-linear modeling, with a focus on projects to guide us to the use of various techniques.
By the way, some of you are using "significant" in an entirely un-statistical sense. The p-value attached to a parameter in a typical regression addresses whether that parameter is "different from 0" (it's the probability of an estimate at least that extreme, assuming the true value is 0). So some of you should say "significantly different from 0" (rather than just "significant").
Some of you seem to be using "significant" as a synonym for "important".
The scientists and mathematicians know this. "The politicians" are counting on miraculous cures and vaccines that do not yet exist. Or other miracles. That's not a sane policy, but then some of our politicians aren't sane....
In the case of the Covid-19 model of Italy, we could attempt to match the data, and make predictions for the number of cases and deaths.
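For those curious, here's a minimal SIR sketch in R using the deSolve package -- the rates below are made-up placeholders, not the values we fit for Italy:

    library(deSolve)
    sir <- function(t, y, p) {
      with(as.list(c(y, p)), {
        dS <- -beta * S * I / N              # susceptibles become infected
        dI <-  beta * S * I / N - gamma * I  # infecteds recover at rate gamma
        dR <-  gamma * I
        list(c(dS, dI, dR))
      })
    }
    p   <- c(beta = 0.25, gamma = 0.1, N = 6.0e7)  # N roughly Italy's population
    y0  <- c(S = 6.0e7 - 1, I = 1, R = 0)          # one initial case
    out <- ode(y = y0, times = 0:365, func = sir, parms = p)
    plot(out)  # S, I, R trajectories over a year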
I'm sorry that a real-life SIR scenario had to alter our course so dramatically, but I hope that it drives home the importance of our work, and perhaps shows you how we can apply these ideas and techniques in our lives -- hopefully to improve them.