| From | Sent On | Attachments |
|---|---|---|
| Leonard Mada | Jun 7, 2007 12:58 pm | |
| Eike Rathke | Jun 8, 2007 9:04 am | |
| Leonard Mada | Jun 8, 2007 11:53 am | |
| robe...@us.ibm.com | Jun 8, 2007 12:43 pm | |
| Eike Rathke | Jun 13, 2007 10:06 am | |
| Leonard Mada | Jun 14, 2007 12:53 pm | |
| Leonard Mada | Jul 6, 2007 12:10 pm | |
| Leonard Mada | Aug 9, 2007 2:23 pm |
| Subject: | Re: [office-comment] RSQ - Unstable Algorithm | |
|---|---|---|
| From: | Leonard Mada (disc...@gmx.net) | |
| Date: | Jun 8, 2007 11:53:13 am | |
| List: | org.oasis-open.lists.office-comment | |
Hi,
I would like to discuss the following topics: A. FUNCTIONS AFFECTED: STDEV...(), VAR...(), STEYX() and RSQ() B. STABLE ALGORITHM: use 2-pass C. ALTERNATIVE FUNCTIONS: implement alternative functions using fast algorithms D. RSQ(): special problem: three-pass algorithm
<!-- @page { size: 21.59cm 27.94cm; margin: 2cm } H1 { margin-top: 0.85cm; margin-bottom: 0.21cm; border-top: 1px solid #000000; border-bottom: none; border-left: none; border-right: none; padding-top: 0.21cm; padding-bottom: 0cm; padding-left: 0cm; padding-right: 0cm; color: #000099; page-break-before: always } H1.western { font-family: "Times New Roman", serif; font-size: 18pt } H1.cjk { font-family: "Lucida Sans Unicode"; font-size: 18pt } H1.ctl { font-family: "Arial", sans-serif; font-size: 16pt } P { margin-bottom: 0cm } H2 { margin-bottom: 0.21cm; border: none; padding: 0cm; color: #000099; page-break-before: auto; page-break-after: auto } H2.western { font-family: "Times New Roman", serif; font-size: 14pt } H2.cjk { font-family: "Lucida Sans Unicode"; font-size: 14pt } H2.ctl { font-family: "Arial", sans-serif; font-size: 14pt; font-style: italic; font-weight: medium } H3 { margin-bottom: 0.21cm; border: none; padding: 0cm; color: #000099; page-break-before: auto; page-break-after: auto } H3.western { font-family: "Times New Roman", serif; font-size: 13pt } H3.cjk { font-family: "Lucida Sans Unicode"; font-size: 13pt } H3.ctl { font-family: "Arial", sans-serif; font-size: 13pt; font-style: italic } --> A. FUNCTIONS AFFECTED It seems that: RSQ(), STDEV(), STDEVA(), STDDEVP(), STDEVPA(), STEYX(), and VAR(), VARA(), VARP(), VARPA() suffer from the same shortcoming. They all mention the naive algorithm (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently unstable, see paragraphs 6.16.69-73, and 6.16.79-82)
The *Note* to some of these functions mentions that naive algorithms *run into trouble*, but I do not understand the reason for mentioning the naive algorithm in the ODF specification then. [NO such note for the RSQ(), and VAR...() functions!]
B. STABLE ALGORITHM For the STDEV...() and VAR...() functions: using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] is a safe alternative.
Unfortunately, this stable algorithm implies passing *two times* through the data, which bears a significant time penalty for large data sets or data sets on a network drive!!!
There are alternative *one-pass* algorithms that are more stable than the notoriously unstable algorithm (see on wikipedia, http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).
C. ALTERNATIVE FUNCTIONS I am very supportive of the idea to implement *two sets* of functions: 1.) the stable two-pass version (as currently named) 2.) a *fast implementation* for every function mentioned above: - should use a fast one-pass algorithm - BUT more stable than the naive algorithm - should be named: STDEV...FAST()
D. RSQ() As mentioned earlier, the RSQ() formula uses two (three) times the naive algorithm (AND uses both times the computed values as divisors). Without doing a formal calculation of the error terms, this algorithm feels *very, very unstable*.
Unfortunately, calculating the error of the coding algorithm even for something as simple as the standard deviation is NOT trivial, and is certainly ugly for the RSQ() function. I also lost any contact with mathematics in 1995 and took a very different career. I would need to dig up old notes and search through books to be able to formally calculate the error term.
Though, I would think that rewriting the formula to use [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] is a fairly safer alternative. Also, the numerator should be rewritten as: sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ] (or else the same risk exists, that the computed r^2 is negative OR bigger then 1)
However, this implies a *significant time penalty*!
While computing the STDEV() and VAR() functions using this approach implies a two-pass algorithm, the RSQ() will need a *three pass* algorithm: 1.) first compute MEAN(X) and MEAN(Y) 2.) then compute 'a' and 'b' 3.) and ONLY then one is able to compute r^2 (as Ycalc is needed in this last step, AND Ycalc depends on 'a' and 'b')
This is really ugly. I would definitely recommend here a faster implementation, e.g. an RSQFAST() function, though I do NOT know the performance and stability of such an algorithm.
I will search around to see IF I find a better alternative.
Sincerely,
Leonard Mada
Eike Rathke wrote:
Hi Leonard,
On Thursday, 2007-06-07 23:04:10 +0300, Leonard Mada wrote:
Unfortunately, the formulas/algorithms given for 'a' and 'b' seem to be computationally unstable - they both use terms of the form n*sum(x^2) - (sum(x))^2 which are notoriously unstable
Although I did NOT do a formal calculation of the error term, having *two* occurrences of this unstable calculation, and both times as *divisors* makes the formula very unreliable.
Thanks for pointing out. Did you try it out in OOoCalc and noticed shortcomings?
SUGGESTIONS: - remove formula from specification
I think that is not a good option because it would leave the definition without a mathematical description. Even if the given formula may numerically be instable at some point it is mathematically correct. However, we could add a note saying so.
- consult statistician knowledgeable of stable coding algorithms
That of course would be favourable. As I also know from the OOo mailing lists that you're interested in that area, could you help us or are you in contact with one?
Eike





