My name is Keith and I have a hammer. It is called “curve fitting.” The world is full of nails that could be, and in my opinion, absolutely should be hit with this hammer.
I know, I know, there are critics of curve fitting. I know that no analysis procedure can increase the amount of information in your data. I know all that. But your data came from some real-world signal, and if you have noise added on top of it, then a procedure that removes the noise, and pushes your data back towards the original signal, is good. The processed data can be closer to the real signal than the noisy data, that’s good, so no more apologies.
I’m going to post some shtuff on this blog to demonstrate curve fitting using a variety of methods. The first of these installments is for Excel. I know! Excel can do arbitrary functional fitting! You can download the worksheet to follow along here: curve fitting in Excel
Let’s get going: As a simple example, let’s fit a straight line to some data. That’s something of the form y = ax + b, where we’ll have data for x and y, and we’ll figure out the slope parameter, a, and the offset parameter, b, for a straight line fit. Later we’ll do something more complicated.
- Step zero: generate some synthetic data
- Generate an estimate based on an initial guess of a and b.
- Overlay the fit on top of the data
- Optimize the fit
- Go on about your day
Okay, getting things started, let’s generate some data to test our setup. In the figure below, I made some synthetic data which is described by a straight line, and is corrupted by some noise. NOte the three variables in the first column, “actual slope”, “actual_offset”, and “noise”. These are named variables, meaning that I highlighted the cell and typed a name for the variable in the little box on the top left (where in the figure it says “actual_slope”. That way I can use the name of the variable in my equations and I don’t have to worry about having the correct cell reference (i.e. I can type “actual_slope” rather than $A$2 – you could use $A$2 in this case with the dollar signs so that the reference stays constant for all cells.)
So the equations for the data are as follows:
X is random : “rand()”
true Y is calculated from the X : “=C2*actual_slope+actual_offset”
estimated Y is a noisy version of true Y : “=D2+NORM.INV(RAND(),0,noise)”, where the NORM.INV stuff is a way to get normally distrubed noise, as is more common in the world. Important note, before you start the fitting, if you generate your data this way, with the rand() functions, you have to stabilize those values before you start the fitting. Rand() will regenerate random numbers each time there is a calculation, so your data will be jumping around and around while you’re algorithm is trying to fit it. So, highlight your data, right click and copy it, then right click and paste special – values over the data. Then your data are numbers, not rand() functions.
Next, we’ll guess at what the slope and offset of the best fit line are, and plot that. We don’t have to guess well, that’s the whole point, we can make a rough guess and let Excel optimize it. Below, you’ll see the updated spreadsheet. I have made new parameters, “estimated_slope” in cell A12, and “estimated_offset” in cell A15. I selected 1 and 5 for the starting point, and as you can see by the plot of the estimated line, it is not that good of a guess. Column G is the estimated Y value for the X values in column C, using those guessed parameters. Column H is the error (i.e. Estimated Y – Measured Y). Note that the error is calculated always from the measured Y, not true Y, because you’re not supposed to know that, if you did you’d be finished already.
Look at cell A18, which holds a value I named “SSE” which stands for sum of squared error. This is the sum of all the squared values in Column H, and this is our estimate of how good this line fits our data. 0.405 is the sum of the squared errors of our fit – our measured Y values, and this is what gets minimized in our optimization.
Now we go to the critical step. Optimizing our parameters so that our fit is as good as possible. In the Data menu, in the Analysis group, select “Solver”.
If the Solver command or the Analysis group is not available, you need to load the Solver Add-in program.
How to load the Solver Add-in program
1.Click the File tab, click Options, and then click the Add-Ins category.
2.In the Manage box, click Excel Add-ins, and then click Go.
3.In the Add-ins available box, select the Solver Add-in check box, and then click OK.
Set your Objective, which means to ‘maximize’ or ‘minimize’ a certain value, in our case we are ‘minimizing’ the SSE for our fit, so put the address for the SSE, and make sure you select “Minimize”. The other imporant factor is to determine which cells hold your variables, in our case, the named variables “estimated_slope” and “estimated_offset”.
And here is the result, notice how the red fit line is a much better estimate of our true hidden values. Also, our cells which held our estimated variables are now different, and are relatively good approximations to the true values. The more data you have, or the less noise you have, the better that approximation will be.
Now, let’s do something harder. (something that couldn’t be done with ‘add trendline’). Below you’ll see a more complicated dataset, which is roughly the shape of the current induced by a synaptic impulse in your brain.
Here I’ve generated a data trace, which is what we in the neuroscience biz call an excitatory post-synaptic conductance (EPSC). It is an impulse, with a fast onset, and a slower offset, and is pretty closely approximated by an exponential rise and decay. Here we’re going to try to figure out via fitting, how fast the EPSC rises, and decays, and at exactly what time it comes on, and the size and all that stuff.
I’ve set it up the exact same way, made some initial guesses as to the variables, and I’ve made a cell which holds the objective measure of the goodness-of-fit, and away we go, running Solver. And okay, it didn’t work the first time, with those initial starting conditions, let me try again with a closer guess. Oh, and I forgot to select Minimize, but wait it still isn’t working. Okay, well, I guess it’s too hard, it’s not working. Maybe I’m doing something wrong, oh wait! I forgot to copy, and paste-special-values my data. Probably that’s it, my data is changing every iteration and that’s making it hard. Let’s see if that works now. Yeah, that’s better.
Note the estimated parameters are pretty close to the “true parameters” and the fit looks pretty good. I hope this helps in your data analysis. Here’s the worksheet for these two problems. curve fitting in Excel