Subject: User metadata support
From: Guenther Brunthaler (spam@gmx.nospam.net)
Date: Nov 20, 2006 7:45:06 am
List: com.selenic.mercurial-devel

Hi all,

I'm rather new to Mercurial, but I have used a lot of different SCMs before.

Mercurial looks in many aspects like the "right thing" to me.

I used to check out Monotone some time ago, but it had just too many shortcomings to be useful - most of which Mercurial managed to avoid.

Actually, there is only one single big thing left (except for symlink support) which would make everyone happy: User metadata support.

By user metadata, I mean something like the "properties" of Subversion.

In essence, it's just a version-controlled key/value list associated with each file.

You might think nobody actually needs such a thing?

Let's illustrate a few cases where such metadata would be highly useful:

* Additional permission bits. Currently, Mercurial supports the executable bit right out of the box. Fine. But what if more permission bits should be associated with a file, such as the sticky bit? Or a file should be created read-only? Or it needs a special POSIX ACL? If hooks for checkin/checkout had access to file metadata, the hooks could set the appropriate bits on checkout as required, without any need to integrate such features into the Mercurial core.

* Actually, even the executable bit would no longer need to be supported directly by the Mercurial core: Hooks could take over that job too, provided they have access to metadata items such as a property named "hg:executable".

* Line-ending conversion. While I agree that line-ending conversions should normally be performed based on heuristics because users tend to forget about setting special properties, there are exceptions. What if a file with extension .txt is a texture in some project subdirectory rather than a text file like in the rest of the project? If the autodetection heuristics for binary files fail, we'll be screwed as soon as line-ending conversion is attempted on that file. Using a property such as "hg:eol-style" set to "binary" would let a hook script override autodetection in such cases.

* Character set conversion. What if a single directory contains text files in different character set encodings? Just think of text files on a Windows machine which shall also be edited on a UTF-8 Linux workstation: On the Windows side, most files will be using the "ANSI" character set (in fact WINDOWS-1252 because of that EURO-Symbol), but some files intended to be used by the Console are instead represented using the "OEM" character set (IBM CP 437 or CP 850). Using the same conversion for all text files cannot work in this case. And they all share the same filename extension. It is necessary to override the conversions on a per-file basis. Metadata properties would allow the hook to also take care of this.

* Stream metadata. Machines like the Apple Macintosh can use different streams in a file, the so-called "data fork" and "resource fork". Think of it as a kind of sub-file. Same for NTFS which also supports streams. Each stream of a file has the potential to contain data a user would like to be subject to version control. The "normal file contents" are just the contents of the default stream of each file. But the best is yet to come:

* Symlinks could be implemented using hooks having access to stream metadata! In this case, a symlink would be treated as a stream with a reserved name, belonging to a file which has no default stream (and thus no normal file data) at all. That means no file will be created when checking out. But the checkout hooks (which will be run for all streams, not just for file data contents) can check the stream type and create a symlink.

* Any kind of additional information to be attached to files/streams, such as MIME types etc. The hooks can make use of this information if required.

* Directory attributes! Directories can also have metadata streams, storing version-controlled metadata about the directories, such as "hg:ignore".

* Tracking even empty directories and renaming or moving of directories. If we assign an (empty) dummy stream such as "hg:dir" to each directory, we can deduce the existence of a directory from the mere existence of that stream. This means we would no longer need workarounds such as dummy ".keep" files.
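
The character set case from the list above could be handled by a pair of hook helpers along the following lines. This is only a sketch in Python: the property name "user:charset", the function names, and the idea that a hook receives the properties as a dictionary are all assumptions made for illustration, not anything Mercurial provides.

```python
# Sketch of per-file character set conversion driven by a metadata property.
# Assumption: a hook is handed the raw file contents plus a dict of the
# file's properties. "user:charset" is a hypothetical property name.

def to_working_copy(data, properties, local_charset="utf-8"):
    """Convert repository bytes to the local charset on checkout."""
    stored = properties.get("user:charset")
    if stored is None:
        return data  # no property set: leave the bytes untouched
    return data.decode(stored).encode(local_charset)


def to_repository(data, properties, local_charset="utf-8"):
    """Convert working-copy bytes back to the stored charset on checkin."""
    stored = properties.get("user:charset")
    if stored is None:
        return data
    return data.decode(local_charset).encode(stored)


# Two files with the same extension but different encodings:
ansi_file = {"user:charset": "cp1252"}  # Windows "ANSI"
oem_file = {"user:charset": "cp850"}    # Console "OEM"
print(to_working_copy(b"\x80", ansi_file).decode("utf-8"))  # Euro sign
print(to_working_copy(b"\x81", oem_file).decode("utf-8"))   # u with umlaut
```

Because the conversion is selected per file, two files sharing one extension can still use different encodings.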

You see, metadata support would in fact be exceptionally useful for everyone, and would even make implementation of some things easier.

For instance, you could forget about the executable bit or symlinks in the core Mercurial project, and delegate such problems to the hook scripts.

Of course, metadata support should be implemented in a way that requires the least changes to the existing implementation.

First, what is actually needed:

* Streams are versioned rather than files. It's pretty much the same from the perspective of a revlog, but we have to add an additional level below the current leaf level.

For instance, instead of having a

.hg/data/somedir/somefile.d

we then could have a

.hg/data/somedir/somefile/hg_data.d

which means: This represents a stream with name "hg:data" of the version-controlled object "somedir/somefile". We say "object" here rather than "file", because that object could be a symlink as well, depending on which stream properties it has.

In this case, it is a file, because it has a "hg:data" property (which contains the actual file contents).

But it could have additional properties as well:

.hg/data/somedir/somefile/hg_executable.d

could indicate the fact that file "somedir/somefile" has an additional property "hg:executable", which means its executable bit should be set on checkout.

"hg:executable" is also an example of a "switch"-style property: Its mere existence indicates something; the actual revlog contents will typically be an empty file because all that matters is whether this property is there or not.

You can also see: Streams and Properties are pretty much the same in this model - from the viewpoint of the revlog they are just more files to be version-controlled.

It is only their *interpretation* as streams/properties which makes them special.
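
As an illustration of such an interpretation, here is a minimal Python sketch of a checkout hook helper that treats "hg:executable" as a pure presence flag. The function names and the assumption that the hook receives the properties as a dictionary are hypothetical; only os.stat, os.chmod and the stat constants are real library calls.

```python
import os
import stat

# All three execute permission bits (user, group, other).
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def checkout_mode(mode, properties):
    """Return the permission bits a file should get on checkout.

    Only the *presence* of "hg:executable" matters; its value is ignored.
    """
    if "hg:executable" in properties:
        # Grant execute permission wherever read permission is granted.
        return mode | ((mode & 0o444) >> 2)
    return mode & ~EXEC_BITS


def apply_executable(path, properties):
    """Hypothetical hook body: adjust the mode of a checked-out file."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, checkout_mode(current, properties))


print(oct(checkout_mode(0o644, {"hg:executable": b""})))  # 0o755
print(oct(checkout_mode(0o755, {})))                      # 0o644
```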

Another example: In order to save a symlink instead of a file using the same name as above, we could use a property-revlog like this:

.hg/data/somedir/somefile2/hg_symlink.d

which represents a stream with name "hg:symlink", and the contents of the current revision of that revlog contain the symlink target.

And now how to store a "hg:ignore" property for directory "somedir/somesubdir":

.hg/data/somedir/somesubdir/hg_ignore.d

So, the most important thing to be changed for implementing that feature is to add an additional subdirectory level at the leaves of the version-controlled directory tree that is omitted when checking out a revision, but available to the hooks.
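
The extra leaf level can be sketched in a few lines of Python. This assumes the store lives under ".hg/data" and uses only the simple colon-to-underscore mapping; the function names are invented for the example.

```python
import posixpath


def revlog_path(object_path, stream="hg:data", store=".hg/data"):
    """Build the revlog path for one stream of a versioned object."""
    leaf = stream.replace(":", "_") + ".d"
    return posixpath.join(store, object_path, leaf)


def split_revlog_path(path, store=".hg/data"):
    """Inverse of revlog_path: recover (object path, stream name)."""
    relative = posixpath.relpath(path, store)
    object_path, leaf = posixpath.split(relative)
    stream = leaf[:-len(".d")].replace("_", ":")
    return object_path, stream


print(revlog_path("somedir/somefile"))
# .hg/data/somedir/somefile/hg_data.d
print(split_revlog_path(".hg/data/somedir/somefile2/hg_symlink.d"))
# ('somedir/somefile2', 'hg:symlink')
```

On checkout, only the object path would be materialized in the working directory; the stream name stays available to the hooks.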

In the output of "hg manifest" the streams could be displayed using the -v option, while the "hg:data" stream should be suppressed from output in the normal case (because it is the default).

For instance,

$ hg manifest

hexstuff... somedir/somefile
hexstuff... somedir/somefile2 [hg:symlink]
hexstuff... somedir/somesubdir [hg:ignore]

$ hg manifest -v

hexstuff... somedir/somefile [hg:data]
hexstuff... somedir/somefile2 [hg:symlink]
hexstuff... somedir/somesubdir [hg:ignore]

Of course, streams could be displayed in any other way as well; it's just an example.

The internal manifest format would appear to need an update as well:

$ hg debugdata .hg/00manifest.d 0

somedir/somefile/hg_data<hexstuff>
somedir/somefile2/hg_symlink<hexstuff>
somedir/somesubdir/hg_ignore<hexstuff>

So, actually it does NOT need to be changed; it simply lists the *uninterpreted* contents of the .hg/data directory, including the leaf revlogs which always represent streams.

To be more precise: *All* the revlogs now represent streams! Because the actual files or directories or symlinks will be represented by subdirectories now which contain the stream revlogs. And whether such a directory will be interpreted as the name of a version-controlled file, directory, symlink, fifo, device file or something different depends solely on which stream revlogs exist in that directory.
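
A rough Python sketch of that decision follows. It assumes the set of stream names for one object is already known. The precedence order, and the choice to treat "hg:ignore" as implying a directory, are assumptions made for the example.

```python
# Marker streams and the object kind each one implies, checked in order.
# The ordering is an assumption; the proposal does not define one.
KIND_BY_STREAM = (
    ("hg:symlink", "symlink"),
    ("hg:data", "file"),
    ("hg:dir", "directory"),
    ("hg:ignore", "directory"),
)


def classify(streams):
    """Decide what a versioned object is from the streams it carries."""
    present = set(streams)
    for stream, kind in KIND_BY_STREAM:
        if stream in present:
            return kind
    return "unknown"


print(classify(["hg:data", "hg:executable"]))  # file
print(classify(["hg:symlink"]))                # symlink
print(classify(["hg:ignore"]))                 # directory
```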

And to make the best of properties, there should be a means of *inheriting* them from the parent directory, possibly overriding them in nested subdirectories. But that's worth its own thread I think.

Regarding stream names, it might be wise to enforce a naming policy to avoid name clashes with user-defined properties.

A suggestion of such a policy:

* All property/stream names optionally start with a namespace prefix, followed by a colon, and then an identifier. For instance, in "hg:data", "hg" is the name of the namespace, and "data" is the namespace-relative name of the stream.

* Namespace "hg" is reserved for Mercurial's "well-known" or specially interpreted streams. For instance, while "hg:executable" could be a user-defined property as well which is only of interest for user-defined hooks, "hg:data" is clearly of essential interest for the internal checkout and checkin functions of Mercurial.

* Namespace "urn" is reserved for property names which conform to the URN syntax, i.e. globally unique and *persistent* identifiers. (Persistence is also the big difference between a URL and a URN. URLs cannot truly be considered persistent: Domains come into existence and go away all the time.) For instance, there is a "urn:uuid" scheme which allows creating URNs based on UUIDs for those who like this. But numerous other schemes exist as well.

* Namespace "rdn" is reserved for the "reversed domain name" identifiers which are so popular in Java (or Monotone). This covers properties such as "rdn:com.sun.java/bigproject/specialstream". However, as stated in the previous paragraph, DNS names might not be the best choice to guarantee uniqueness of a name, at least not over time. URNs, in contrast, will do (if an appropriate URN scheme is chosen, such as "urn:uuid").

* All other names with or without a namespace prefix are free to be used by users in any way they like.
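
The policy above is easy to check mechanically. Here is a small Python sketch; the function names are made up, and the rule that everything after the first colon belongs to the identifier is an assumption (it is what makes "urn:uuid:..." land in the "urn" namespace).

```python
# Namespaces reserved by the suggested policy.
RESERVED_NAMESPACES = {"hg", "urn", "rdn"}


def split_stream_name(name):
    """Split a stream name into (namespace, identifier).

    The namespace is None when the name has no prefix at all.
    """
    namespace, colon, identifier = name.partition(":")
    if not colon:
        return None, name
    return namespace, identifier


def is_reserved(name):
    """True if the name lives in one of the reserved namespaces."""
    namespace, _ = split_stream_name(name)
    return namespace in RESERVED_NAMESPACES


print(split_stream_name("hg:data"))           # ('hg', 'data')
print(split_stream_name("plain"))             # (None, 'plain')
print(is_reserved("usernamespace:whatever"))  # False
```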

However, there is a problem here: The "urn" and "rdn" namespaces allow slashes, colons and other characters which are better avoided in filenames, especially under Windows.

So I suggest a simple name mapping strategy:

* We add a pseudo-namespace "b32" which encodes whatever follows it in BASE-32 encoding.

* That pseudo-namespace will only be recognized at the beginning of a stream name and will encode whatever follows it in BASE-32.

* Colon characters are mapped to underscores.

Here are some examples of stream names and the mapped revlog file names which will represent them:

"hg:data" -> "hg_data.d"
"hg:symlink" -> "hg_symlink.d"
"plain" -> "plain.d"
"usernamespace:whatever" -> "usernamespace_whatever.d"
"funny_name" -> "b32_<base32stuff>.d"
"urn:uuid:11223344-5566-3353-aabbccddeeff" -> "b32_<base32stuff>.d"

In those examples, <base32stuff> is a placeholder for the BASE-32 encoding of the string on the left side.
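
One possible reading of this mapping, as a Python sketch. Several details are assumptions, not part of the suggestion itself: the exact set of "safe" characters, lowercasing the BASE-32 output, stripping its "=" padding, and forcing names in a literal "b32" namespace through the encoder so that decoding stays unambiguous.

```python
import base64
import re

# Assumed "safe" shape: optional namespace plus colon, then an identifier,
# using only lowercase letters, digits and hyphens. Anything else (slashes,
# underscores, uppercase, further colons) is BASE-32 encoded instead.
_SAFE = re.compile(r"^(?:[a-z0-9-]+:)?[a-z0-9-]+$")


def stream_to_filename(stream):
    """Map a stream name to the revlog file name that represents it."""
    if _SAFE.match(stream) and not stream.startswith("b32:"):
        return stream.replace(":", "_") + ".d"
    encoded = base64.b32encode(stream.encode("utf-8")).decode("ascii")
    return "b32_" + encoded.rstrip("=").lower() + ".d"


def filename_to_stream(filename):
    """Recover the stream name from a revlog file name."""
    stem = filename[:-len(".d")]
    if stem.startswith("b32_"):
        payload = stem[len("b32_"):].upper()
        payload += "=" * (-len(payload) % 8)  # restore stripped padding
        return base64.b32decode(payload).decode("utf-8")
    return stem.replace("_", ":")


for name in ("hg:data", "plain", "funny_name",
             "urn:uuid:11223344-5566-3353-aabbccddeeff"):
    mapped = stream_to_filename(name)
    assert filename_to_stream(mapped) == name  # the mapping is reversible
    print(name, "->", mapped)
```

Note that "funny_name" has to be encoded because a plain underscore would otherwise be indistinguishable from a mapped colon.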

Why BASE-32 instead of BASE-64 one might ask?

Because BASE-32 does not use both upper- and lowercase characters in the encoding it generates, which eliminates problems with filesystems that do not preserve letter case in file or directory names. (See the RFC about BASE-32 encoding for more details.)

Anyway, all the above is a mere suggestion to show how streams could be implemented; I'll be happy to keep it open to discussion.

But I would really be happy to see properties supported by Mercurial some day, which will also be the day I convert my SVK repositories into Mercurial!

Currently I cannot use Mercurial because I have lots of symlinks under version control in SVK.

I am using SVK because it is still the best distributed SCM I have encountered so far: It has (nearly) all the features of Subversion, but adds fully disconnected (off-line) operation.

SVK also has some disadvantages. The biggest one is its lack of concise documentation and its largely opaque operation. It's very obscure. *And* written in Perl. ;-)

Mercurial clearly excels here: All basic data structures (i.e. the revlog) are well defined and the interconnections between the components of the data structures (revlog, nodeid, manifest, etc.) are nicely explained. This is how it should be.

In SVK I do not even fully understand the options and operation modes of its 3 merge commands... especially in disconnected or mirrored operation.

However, it works.

Somehow.

But I would really prefer Mercurial, if only it could support properties for things like symlinks and character conversion attributes.

Greetings from Vienna, Guenther