Subject:Re: [office-comment] DISTINCT Values
From:Leonard Mada (disc@gmx.net)
Date:Jun 9, 2007 12:33:51 pm
List:org.oasis-open.lists.office-comment

Hi Patrick,

Patrick Durusau wrote:

...

What we need are details. Use case scenarios are useful but only up to a point.

For example, you mention R's as.factor() below.

Maybe I was not able to clearly explain what I meant.

In simple words, I wanted an *enhanced function* corresponding to *pivot tables*. Well, maybe it is now easier to understand. Current implementations of pivot tables seem quite weak to me. And they are NOT functions. I therefore want:
- something more advanced
- easily expandable / flexible
- and defined as spreadsheet functions

The function DISTINCT() was meant as the first step in this process. This would generate the groups of data / make the categories. Indeed, these categories would behave like factors (in R, and generally in statistics, these are called factors - respectively levels of a variable). Further functions should have followed, which would generate the various reports (these would imply extensive vector operations). Indeed, factors are extensively used in vector/matrix operations.
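A minimal sketch of the idea, in Python purely for illustration. The names `distinct` and `group_sum` are hypothetical and are not proposed function names; returning the categories in order of first appearance is also an assumption, since the ordering is not specified above.

```python
def distinct(values):
    """Return the distinct values (the categories), in order of first appearance.

    The ordering is an assumption made for this sketch.
    """
    seen = set()
    levels = []
    for v in values:
        if v not in seen:
            seen.add(v)
            levels.append(v)
    return levels


def group_sum(keys, amounts):
    """Hypothetical follow-up report function: sum the amounts per category."""
    totals = {level: 0 for level in distinct(keys)}
    for key, amount in zip(keys, amounts):
        totals[key] += amount
    return totals


# Made-up sample data: one key column and one value column.
regions = ["North", "South", "North", "East", "South"]
sales = [10, 20, 5, 7, 3]

categories = distinct(regions)      # the grouping step
report = group_sum(regions, sales)  # one pivot-table-like summary built on it
```

The first step only builds the categories; any number of report functions could then be layered on top of it, which is what would make the approach expandable.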

Recalling that OpenDocument is an *interchange* format, how do we deal with the following issue?

Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers. Rather unfortunately users often make use of the implementation in order to make some calculations easier. This, however, is an implementation issue and is not guaranteed to hold in all implementations of R. (Section 2.3.1 Factors, R Language Definition)

## THIS IS A SIDE NOTE
- the previous WARNING is irrelevant both to ODF and to R-users that stick to the S+ standard
- for someone working with factors, it is irrelevant how factors are *INTERNALLY* stored in R
- 'is.factor()' will ALWAYS return TRUE for a factor-object irrespective of its internal storage ('as.factor()' interprets something as a factor)
- internally (in R), factors are currently stored in a way that uses integers
- THIS data structure should however NEVER be known nor assumed by users, and therefore, it should NEVER be used (as open-source, of course you can get the details)
- these are hidden methods (that's why you declare 'private' and 'protected' in C++ classes, to hide the implementation)
- however, obviously, there are users who make use of this and, even worse, perform mathematical calculations with factors (it makes NO sense to compare mathematically a level "A" with a level "B", or with an integer, BUT some do exactly that)

## END SIDE NOTE
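The point about hidden storage can be sketched as follows, again in Python for illustration only. The class `Factor` and the helper `is_factor` are hypothetical names; the zero-based integer codes and the sorted level order are choices made for this sketch, not a description of how R or any spreadsheet actually stores factors.

```python
class Factor:
    """A categorical variable whose storage is a private detail."""

    def __init__(self, values):
        # Private storage: level names plus one integer code per value.
        # Callers are not supposed to read or depend on these fields.
        self._levels = sorted(set(values))
        index = {level: i for i, level in enumerate(self._levels)}
        self._codes = [index[v] for v in values]

    def levels(self):
        """Return the category names."""
        return list(self._levels)

    def values(self):
        """Return the data as category names, never as integers."""
        return [self._levels[c] for c in self._codes]

    def counts(self):
        """Return the number of observations per level."""
        result = {level: 0 for level in self._levels}
        for c in self._codes:
            result[self._levels[c]] += 1
        return result


def is_factor(obj):
    """True for any Factor, regardless of how it is stored internally."""
    return isinstance(obj, Factor)


blood = Factor(["B", "A", "O", "A", "B", "A"])
```

Because the public methods only ever return level names, the integer codes could be replaced by any other representation without changing what `levels()`, `values()`, `counts()` or `is_factor()` return.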

We do not specify implementation details, so it is possible for an "as.factor()" function to work differently depending upon implementation details.

Having a function defined by a standard work differently is a bad thing.

## SIDE NOTE
- 'is.factor()' and 'as.factor()' WILL work as expected in R even in the future
- users who interpret this result as an integer are affected, and I fully support this idea, they should have never supposed those factors to be stored as integers
- *A factor may be purely nominal or may have ordered categories*!!! NO mention of integers.

## END SIDE NOTE

CONCLUSIONS
===========

Indeed, spreadsheets should have functions that assign data to *categories*. DISTINCT() was supposed to do so. These categories would then behave like the factors described above. Pivot Tables (aka Data Tables) currently do similar things, though I wanted something more advanced. And I wanted a function.

Hope this explanation clarifies some of the issues.

Sincerely,

Leonard

I don't know whether that would actually change the result of a function or not but it is an example of the level of detail that is necessary to consider when defining a function in a standard.

I suspect it would be possible to define "as.factor()" such that it had a standardized result, and if someone allowed use based on implementation details, they would be non-conformant. I say that not having looked at the details. And by details I do not mean use or test cases but a formal definition of the function.

I know the formula SC has a number of functions that still need some work so maybe we need a rule that welcomes new function proposals but grants priority to requests accompanied by work on functions already accepted for standardization.

David, what say you?

Hope you are having a great weekend!