4 messages in com.perforce.revml: [revml] Re: RevML DTD

| From | Sent On | Attachments |
|---|---|---|
| Barrie Slaymaker | 07 Jan 2005 07:18 | |
| Peter Miller | 09 Jan 2005 13:50 | |
| Barrie Slaymaker | 15 Jan 2005 04:51 | |
| Peter Miller | 16 Jan 2005 03:11 | .revml, .dtd, .entities |

| Subject: | [revml] Re: RevML DTD |
|---|---|
| From: | Peter Miller (mill...@canb.auug.org.au) |
| Date: | 01/09/2005 01:50:34 PM |
| List: | com.perforce.revml |

On Sat, 2005-01-08 at 02:18, Barrie Slaymaker wrote:
See this link for the latest:
I've grabbed this one (0.38)
What follows could be seen as me not liking RevML. The reverse is the case: I think it is a great idea, and one whose time has come. I just have a stack of questions concerning implementation, which have arisen as I sit here and code an Aegis RevML import/export tool.
I can send you some [example].
Yes, please!
VCP generates only valid XML (it checks elements against the DTD).
I was going to check my output via nsgmls and the DTD. I'm writing C++, not Perl. Is there a "DTD to C++" tool?
How come you introduced <char code="0xNN"> instead of using the existing &#xNN; mechanism? Maybe some words of explanation in the DTD would help.
It's an XML thing: no matter what XML method you use, you are not allowed to encode any code point below a space (32), with the exception of a few control characters like carriage return and line feed. Even in XML 1.1 you can't encode a NUL (0x00). So we need a non-builtin way to carry the occasional illegal character through XML.
If I understand correctly, this means that for 0x7F to 0xFF I use &#xNN; and for 0x00 to 0x1F I use <char code="0xNN">.
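A minimal sketch of that reading, in Python for brevity rather than C++. It assumes the input is a byte string whose high bytes are treated as Latin-1 code points, that tab, LF and CR pass through literally (the exceptions mentioned above), and that the element is written in the empty-element form `<char code="0xNN"/>`. The function name `escape_revml_text` is made up for illustration and is not part of VCP or Aegis.

```python
def escape_revml_text(data: bytes) -> str:
    """Escape raw bytes for use as RevML character data (illustrative only)."""
    out = []
    for b in data:
        if b in (0x09, 0x0A, 0x0D):
            # Tab, LF and CR are the control characters XML accepts as-is.
            out.append(chr(b))
        elif b < 0x20:
            # Other control characters cannot appear in XML 1.0 at all,
            # so they travel in the RevML-specific <char> element.
            out.append('<char code="0x%02X"/>' % b)
        elif b >= 0x7F:
            # High bytes go out as ordinary numeric character references.
            out.append("&#x%02X;" % b)
        elif b == ord("&"):
            out.append("&amp;")
        elif b == ord("<"):
            out.append("&lt;")
        elif b == ord(">"):
            out.append("&gt;")
        else:
            out.append(chr(b))
    return "".join(out)
```

A reader would need the inverse mapping, and the DTD comments would still have to say which character set the &#xNN; references are meant to be interpreted in.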
Most of RevML has few assumptions other than a series of revisions linked in some way.
For a SCM with real change sets (a few pre-CVS and most post-CVS VC/SCM systems have them) is it expected that a single RevML file describes a single change set?
The typical usage of diff/patch is that of a change set, even when the underlying repository is CVS, which does not itself understand change sets. The change set model of diff/patch is one developers understand intuitively. I'm not sure putting anything else in a RevML file would meet user expectations. Aggregating several related change sets into one big change set is still a change set and doesn't break the model.
Systems like Subversion and, I presume, Aegis, would need their own element; the DTD above defines <cvs_info>, <p4_info>, etc. in each rev, currently as PCDATA blobs, but we can define structured information into them at some point as well.
But that's inherently invalid. It means that you are (implicitly) encouraging vendors to add tags which *aren't* in the DTD to support their own platforms.
For example, if I produce a tool which writes RevML files with <aegis_info> tags, this is not (yet) a valid RevML file, until you happen one day to see one and add it to the next rev of the DTD. This doesn't scale for an arbitrary number of VC/SCM systems that the RevML DTD author has never, and probably will never, see or use or even hear of.
BTW: maybe "vendor" is the wrong word... it could be interpreted to exclude Aegis, arch, darcs, monotone, OpenCM, etc.
What if a system supports file attributes beyond the ones in the DTD?
We'd open the DTD up to allow them. Let's define them :).
I'd rather add attributes in a way that didn't require DTD changes (see above). That way a new vendor can appear on the scene, and still produce valid RevML files.
I want to capture standard stuff in the DTD to prevent accidental or overly creative misuse. By standardizing the commonly available pieces in the DTD, including element ordering where convenient, we narrow the range of variation and limit accidental dependence on unspecified ordering, for instance.
Agreed. But having commentary in the DTD which says it's OK to add vendor tags as required just means you get actual misuse (not valid RevML), instead of just creative misuse.
Given the presence of the <REP_TYPE>, why is the rep_type redundantly present in the names of all the <*_INFO> forms?
It is not now, not sure why it ever was.
Err... the <!ELEMENT p4_info (#PCDATA|char)* > (etc) definitions are still in the DTD, and still referenced by the <!ELEMENT rev> definition.
What if a change set moves a file *and* changes it?
That has not been considered. [...] I've not implemented any backends that support a discrete "move"
Aegis, Arch, Subversion, almost any VC/SCM project started since 2000, all support file moves as a first-class operation. Some also support "forking" (my term) a file, so the copies share common history up to a certain point, and then diverge... a semantic quagmire.
Yes, although two of the four systems (CVS, VSS) have no changeset concept
It's not especially difficult code to extract implicit change set info from CVS... the Aegis import facility does this. (Sliding time window across the set of all files, plus different users mean different change sets.) I've tried it on large projects. At worst it produces too many change sets rather than too few (e.g. half a commit before lunch, the other half after lunch).
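A rough sketch of the sliding-window idea, in Python. This is not the Aegis import code: the record shape `(filename, author, timestamp)`, the function name `infer_change_sets` and the 300 second default window are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: (filename, author, unix_timestamp)
FileRev = Tuple[str, str, int]


def infer_change_sets(revs: List[FileRev], window: int = 300) -> List[List[FileRev]]:
    """Group per-file CVS revisions into implicit change sets."""
    change_sets: List[List[FileRev]] = []
    latest: Dict[str, int] = {}  # author -> index of that author's newest set

    for rev in sorted(revs, key=lambda r: r[2]):
        _, author, when = rev
        idx = latest.get(author)
        # The window slides: each revision is compared with the most recent
        # revision already in the set, not with the set's first revision.
        if idx is not None and when - change_sets[idx][-1][2] <= window:
            change_sets[idx].append(rev)
        else:
            # A different user, or too long a gap, starts a new change set.
            latest[author] = len(change_sets)
            change_sets.append([rev])
    return change_sets
```

With a window of a few minutes, a commit split by a lunch break comes out as two change sets, which is the "too many rather than too few" failure mode described above.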
and so the DTD does not assume changesets. I'd like to see some explicit support for declaring changeset-wide information and then referring to it in individual revs, but that would mean a whole lot more logic to handle indirection and save little or no disk space when RevML is compressed.
This gets back to my earlier question: does a single RevML contain a single change set? Or does it contain all change sets for the entire history of a project? Or something else?
If it only contains a single change set, then no extra machinery beyond additional non-<REV> attributes is required.
I have found that the users of Aegis accept the branch-as-a-single-change-set model with few problems. However, grabbing (and applying) all the change sets of a branch _as separate change sets_ requires more machinery, for which there are few successful implementations.
If RevML was supposed to be able to encapsulate change sets, maybe <!ELEMENT revml (change_set*)> would be enough, with <!ELEMENT change_set> being defined as the current <REVML> definition. Adding recursion would probably be helpful... meaning a change set can be composed of a sequenced set of sub-change-sets.
I want to limit the ad-hoc use of a generic form to truly generic attributes; common attributes should be embodied in the DTD to encourage standardization, and once a common attribute escapes into the wild encapsulated in a generic form, it can never be recaptured in a standard form without having every tool support both forms (ugh).
Yes, this can be a problem. But mostly it is a case of looking at attributes with several synonyms: extra rows in a lookup table. The different formats for the values would be a pain, though.
But in a way this is like email headers. The truly generic ones don't have an X- prefix.
For Aegis, the change set attributes include brief_description, description (RevML's <COMMENT>?), cause, several testing flags, a pile of history information including developer(s) and reviewer(s), plus arbitrary user-defined attributes. The file-in-a-change-set attributes include action, usage (source, test, etc), Content-Type plus arbitrary user-defined attributes.
The <TYPE> form is too limited.
It is sufficient for the systems we've used RevML with
And will always be for the vast majority of VC/SCM uses, I expect; however, there was an interesting thread on the OpenCM mailing list some time in early 2004 about content types and their applications.
Maybe allowing "text/*" to be understood to mean "text" would be sufficient in the DTD comments.
Ignoring portions of an XML grammar is easy :). Coping with multiple authors who do not happen to choose the same spelling for a <name> is difficult, I think.
Yes. This is something I would like to avoid. I think it needs extension mechanisms which are in the RevML content, rather than the RevML structure.
Re: <attribute><name>blah</name><value>blah</value></attribute>
Plus, they can all have X-system-blah-blah extensions. The ones that support arbitrary user defined attributes could have User-blah-blah attributes, too.
Nice approach, actually.
Not new with me, it's how extension email headers are written.
I like the idea of a <user_attribute> and <site_attribute> if an SCM makes some true semantic difference between them.
Why "site", why not "vendor"? And why make it a different part of the RevML structure?
Ideally, I'd like to be able to receive a change set in RevML into an Aegis repository, and all attributes that Aegis doesn't understand, it simply inserts into the arbitrary attributes of the change set. When the change set is exported again via RevML, it gets all those attributes Aegis didn't understand, plus all of the ones Aegis did understand. All it takes is a little code to say that "[xX]-*-*" attributes don't get a "User-" prefix.
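The round trip might look something like the following Python sketch. The `KNOWN` set is a small illustrative subset of the Aegis attributes listed earlier, and the function names and the rule that an existing "User-" prefix is not doubled are assumptions for the example, not Aegis behaviour.

```python
from fnmatch import fnmatchcase
from typing import Dict, Tuple

# Illustrative subset of attributes the receiving system understands.
KNOWN = {"brief_description", "description", "cause"}


def import_attributes(revml_attrs: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split incoming attributes into understood and merely carried ones."""
    native: Dict[str, str] = {}
    arbitrary: Dict[str, str] = {}
    for name, value in revml_attrs.items():
        (native if name in KNOWN else arbitrary)[name] = value
    return native, arbitrary


def export_attributes(native: Dict[str, str], arbitrary: Dict[str, str]) -> Dict[str, str]:
    """Emit understood attributes plus everything that was carried along."""
    out = dict(native)
    for name, value in arbitrary.items():
        # "X-system-name" style attributes came from some other tool and go
        # back out untouched; local user attributes gain a "User-" prefix.
        if fnmatchcase(name, "[xX]-*-*") or name.startswith("User-"):
            out[name] = value
        else:
            out["User-" + name] = value
    return out
```

For example, an incoming "X-p4-change" attribute would be stored among the arbitrary attributes and exported again under the same name, while a locally defined "ticket" attribute would leave as "User-ticket".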
Note that some systems give each file a unique ID (at least two that I know of use the standard GUID/UUID format) which is immutable; they model filenames as an editable attribute of a file, thus a file rename is a simple change of the filename attribute.
The <rev id="..."> should contain the GUID/UUID while the <name> should be its current public identity.
Time to clarify things... what is the rev id supposed to be? The language in the DTD comments is too loose for me.
Each change set has a UUID, meaning that when I package it up (using aedist) and email it to a developer, when it unpacks at their end, it gets the same UUID. Each change set is "the same" no matter which repository it is in.
But... each file also has its own UUID, from a completely different pool of UUIDs. (Change set UUIDs have nothing to do with file UUIDs, and vice versa.) Now, when the REV element is given an ID attribute, is it the ID of the file, or the ID of the change set?
It makes sense that it would be the ID of the change set, because this allows all the file revisions of a single change set to be grouped together... if a RevML file can contain more than one change set.
But if a RevML file only ever contains a single change set, it would make sense that the REV element's ID would be the file's ID, because this would allow grouping file histories in the face of renames.
Specific RevML DTD 0.38 comments:
Maybe a preamble comment with a glossary? Especially when "site", "vendor", "repository", "tool", "user" (etc) are used as adjectives. Being fairly pedantic about wording is a Good Thing in a standard.
In the REVML element, does the COMMENT element refer to the tool and/or vendor, the site, the specific repository/project at a site, the specific repository/project replicated to several sites, or a change set? Or something else?
The REVML element references a BRANCHES element which is never defined.
The REP_TYPE element's data can be a "vendor" name. Is this case sensitive? I also notice that you interchangeably use p4/perforce, vss/sourcesafe, etc, all through the DTD comments. Are aliases allowed in the REP_TYPE tag value?
The REP_DESC description talks about the repository as if it was a site specific attribute, but the suggested values look more like a tool ("vendor") attribute. Which is it?
The REV_ROOT element appears to describe what Aegis calls a project... a unique (within a site) repository identifier. A site could potentially host many, many projects. Would this map to a CVS module name, or to the actual path to the CVS_ROOT? I ask because CVS_ROOT potentially refers to *many* projects (it doesn't help that CVS itself is rather fuzzy about the distinction).
The BRANCH_ID element comment talks about exporting a branch. Does this mean that a single RevML file is intended to describe all of the change sets to a branch? (What if the branch has sub-branches? Are they in there too?)
Is it really necessary to have ACTION, P4_ACTION and SOURCESAFE_ACTION? Surely a single ACTION with add/create, delete/remove, edit/modify and move/rename values is sufficient?
(Well, not 8 alternatives, 4 will do. Aegis has more, but they can all be encoded as "edit".)
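As a sketch of what a single ACTION could look like to a reader of RevML files, here is a small Python lookup table. The choice of "add", "delete", "edit" and "move" as the canonical spellings, the name `normalize_action`, and the fallback of unrecognised actions to "edit" are assumptions for illustration; nothing in DTD 0.38 defines them.

```python
# Synonym -> canonical action. The canonical spellings are an arbitrary pick.
CANONICAL_ACTIONS = {
    "add": "add",
    "create": "add",
    "delete": "delete",
    "remove": "delete",
    "edit": "edit",
    "modify": "edit",
    "move": "move",
    "rename": "move",
}


def normalize_action(action: str) -> str:
    """Map a tool-specific action name onto one of four generic actions."""
    # Anything not in the table is treated as an edit, on the assumption that
    # a tool's extra actions are refinements of "the file content changed".
    return CANONICAL_ACTIONS.get(action.strip().lower(), "edit")
```

A writer that wanted to preserve the tool-specific spelling could still carry it in an extension attribute alongside the generic value.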
The DIGEST element - what exactly does it contain? Is it the md5sum of the content/delta text, or is it the md5sum of the file after the delta is applied? Or something else? It is only by context that I guess an md5sum is the value.
What happened to the <FILE_COUNT> element? I like the idea of a progress bar. (It was probably misnamed; maybe it should have been a <REV_COUNT>, although it does highlight the need to carefully distinguish between a REV and a file in the glossary and then rigorously use the terms as defined.)
That's plenty for now.
-- Peter Miller <mill...@canb.auug.org.au>




