|Subject:||Re: [office-comment] RSQ - Unstable Algorithm|
|From:||Leonard Mada (disc...@gmx.net)|
|Date:||Jun 14, 2007 12:53:16 pm|
I refined the RSQ() algorithm. For real case examples (including the majestic failure of the old algorithm) AND for recalculations using a stable algorithm (see 2nd spreadsheet), see the following issue on the OOo bugs page: http://www.openoffice.org/issues/show_bug.cgi?id=78250
Eike Rathke wrote:
On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote:
RSQ(), STDEV(), STDEVA(), STDDEVP(), STDEVPA(), STEYX(), and VAR(), VARA(), VARP(), VARPA() suffer from the same shortcoming. They all mention the naive algorithm (algorithms based on [n*sum(x[i]2) - sum(x[i])2] are inherently unstable, see paragraphs 6.16.69-73, and 6.16.79-82)
Actually the OOoCalc implementation does the two-pass approach using the mean value for the STDEV*() and VAR*() functions. We could give both formulas, the "naive" mathematically correct one (because that is the one you'll usually find in books) and the one interesting for better numerical stability.
Unfortunately, this stable algorithm implies passing *two times* through the data, which bears a significant time penalty for large data sets or data sets on a network drive!!!
It seems to me the two passes nowadays don't add that much penalty anymore, as long as you build up a memory array and don't try to read data twice from the very source, e.g. a network drive.
One month ago I had a very unpleasant surprise: I changed the formatting of a single column from TEXT to NUMBERS on a medium sized spreadsheet (~4,000 - 5,000 rows) on a 2 GHz machine in a competitors program and had to wait more than one hour for the recalculations to complete. I would be very cautious about how fast modern computers are. Programs are usually NOT and they can break even the most advanced computer.
There are alternative *one-pass* algorithms that are more stable than the notoriously unstable algorithm (see on wikipedia, http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).
Well, yes, but that's about correlation.
This one could be adapted to all other situations (CORREL() is by the way broken, too).
D. RSQ() [...] Though, I would think that rewriting the formula to use [ sum(x[i] - MEAN(x))2 ] instead of [n*sum(x[i]2) - sum(x[i])2] is a fairly safer alternative. Also, the numerator should be rewritten as: sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ] (or else the same risk exists, that the computed r2 is negative OR bigger then 1)
I haven't verified, are these really valid transformations of the former?
I have improved this algorithm even further. Basically, the algorithm goes like this: 1.) N*sum(Xi2) - (sum(Xi))2 is replaced by: N * sum(Xi - Xm)2 , where Xm = the MEAN of X the N will cancel later, so we do NOT need to multiply with N
2.) N*sum(Xi*Yi) - (sum(Xi))*(sum(Yi)) is equivalent to: N*sum[(Xi-Xm)*(Yi-Ym)]
3.) dividing the previous 2 quantities and canceling N gives: b = sum[(Xi-Xm)*(Yi-Ym)]/sum(Xi-Xm)2
4.) Ycalc[i] = a + b * Xi adding every Ycalc gives: sum(Yc[i]) = sum(a) + b*sum(Xi) however sum(Yc[i]) = sum(Yi) = N*Ym, therefore N*Ym = N*a + N*b*Xm => canceling N Ym = a + b * Xm => a = Ym - b * Xm [I actually can show mathematically that *this is indeed TRUE*.]
5.) and final step, calculating R2 first calculate: Yfirst = Yi - Ym Ysecond = Yi - Ycalc[i] , then compute Yfirst - Ysecond, and Yfirst + Ysecond and finally: R2 = sum[(Yf - Ys)*(Yf + Ys)]/sum(Yi - Ym)2
As you can see, this algorithm works even for the very difficult test case I provided. (see the second spreadsheet on the OOo bug tracking site)
For steps 1-4 I am pretty confident that they are the best method to compute "a" and "b". I am still unsure IF step 5 is the best way to get R2, though this is surely more stable than the one provided in the original formula. There could exist even more stable algorithms (by decomposing sum[(Yf - Ys)*(Yf + Ys)], BUT I have NO idea how to decompose it, yet).
Hope this will help.