|Subject:||Re: [docbook-apps] Dynamic web serving of large Docbook|
|From:||Frans Englich (fran...@telia.com)|
|Date:||Oct 16, 2004 6:04:59 pm|
Michael, thanks for your extensive replies. I have been looking into this fairly extensively myself, and it sure is tricky. DocBook is a very attractive format to have underneath, and being able to use it swiftly in large web projects would make it even more powerful. I think this applies to many people, so a clean, thorough solution that is pushed upstream (into a CMS or the stylesheets) would benefit many.
It should be noted that financing or proprietary solutions are not an option for me, for several reasons; one is that this is for an open source project. Also, sorry about the late reply :|
On Wednesday 13 October 2004 13:29, Michael Smith wrote:
Reading through your message a little more...
The perfect solution, AFAICT, would be dynamic, cached generation: when a certain section is requested, only that part is transformed, and the result is cached for future deliveries. It sounds nice, and it sounds like it would be fast.
I looked at Cocoon (cocoon.apache.org) for help with this, and it does many things well; it caches XSLT sheets, the source files, and even CIncludes (essentially the same as XIncludes).
However, AFAICT, DocBook makes this difficult:
* If one section is to be transformed, the stylesheets must parse /all/ the sources in order to resolve references and so forth. There's no way to work around this, right?
It seems like your main requirement as far as HTML output is to be able to preserve stable cross-references among your rendered pages. And you would like to be able to dynamically regenerate just a certain HTML page without regenerating every HTML page that it needs to cross-reference.
And, if I understand you right, your requirement for PDF output is to be able to generate a PDF file with the same content as each HTML chunk, without regenerating the whole set/book it belongs to. (At least that's what I take your mention of "chunked PDF" in your original message to mean.)
Yes, correct interpretation.
(But -- this is just an incidental question -- in the case of the PDF chunks, you're not able to preserve cross-references between individual PDF files, right? There's no easy way to do that. Not that I know of, at least.)
Nope, the PDF would simply contain the content of the viewed page without any web specifics such as navigation; it would be used for printing. Example (upper right corner): http://xml.apache.org/
If the above is all an accurate description of your requirements, then I think a partial solution is:
- set up the relationship between your source files and HTML output such that the DocBook XML source for your parts is stored as separate physical files that correspond one-to-one with the HTML files in your chunked output
- use olinks for cross-references (instead of using xref or link)
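For reference, an olink in a source file looks roughly like this; the targetdoc and targetptr values below are hypothetical names standing in for whatever this project's target database would define:

```xml
<!-- In a per-part source file: an olink instead of <xref>/<link>.
     "adminguide" and "install-intro" are made-up placeholder names. -->
<para>See <olink targetdoc="adminguide" targetptr="install-intro"/>
for installation instructions.</para>
```

The olink resolves against a separately generated target database rather than against an id in the same transformed document, which is what makes per-file regeneration possible.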
If you were to do those two things, then maybe:
1. You could do an initial "transform everything" step of your set/book file, with the individual XML files brought together using XInclude or entities; that would generate your TOC & index and one big PDF file for the whole set/book
2. You would then need to generate a target data file for each of your individual XML files, using a unique filename value for the targets.filename parameter for each one, and then regenerate the HTML page for each individual XML file, and also the corresponding PDF output file.
3. After doing that initial setup once, then each time an individual part is requested (HTML page or individual PDF file), you could regenerate just that from its corresponding XML source file.
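Step 1 -- bringing the separately stored parts together for the "transform everything" pass -- could look roughly like this (the file names are hypothetical):

```xml
<!-- book.xml: master file pulling in the per-chunk source files
     via XInclude, for the full TOC/index/PDF build -->
<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Project Documentation</title>
  <xi:include href="intro.xml"/>
  <xi:include href="install.xml"/>
  <xi:include href="reference.xml"/>
</book>
```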
The cross-references in your HTML output will then be preserved (as long as the relationship between files hasn't changed and you use the target.database.document and current.docid parameters when calling your XSLT engine).
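As a sketch of how those parameters fit together on the command line (the file names and the "install" docid are hypothetical; collect.xref.targets, targets.filename, target.database.document, and current.docid are the actual DocBook XSL parameter names):

```shell
# Pass 1: emit olink target data for one part, no normal output.
xsltproc --stringparam collect.xref.targets only \
         --stringparam targets.filename install.targets.xml \
         docbook-xsl/html/docbook.xsl install.xml

# Pass 2: transform just that part, resolving olinks against the
# combined target database.
xsltproc --stringparam target.database.document targetdatabase.xml \
         --stringparam current.docid install \
         docbook-xsl/html/docbook.xsl install.xml > install.html
```

The target data files from pass 1 get collected into targetdatabase.xml (a small hand-maintained wrapper document), which pass 2 then reads to resolve cross-references without touching the other sources.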
I _think_ that all would work. But Bob Stayton would know best. (He's the one who developed the olink implementation in the DocBook XSL stylesheets.)
A limitation of it all is that, if a writer adds a new section to a document, you're still going to need to re-generate the whole set/book to get that new section to show up in the master TOC. Same thing if a writer adds an index marker, in order to get that marker to show up in the index.
But one way to deal with that is, you could just do step 3 above on-demand, and have steps 1 and 2 re-run, via a cron job or equivalent, at some regular interval -- once a day or once an hour or at whatever the minimum interval is that you figure would be appropriate given how often writers are likely to add new sections or index markers.
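The periodic full rebuild could be as simple as a crontab entry (the script path and the hourly interval are hypothetical):

```
# Re-run steps 1 and 2 (full TOC/index/target-data rebuild) hourly;
# step 3 still runs on demand, per request.
0 * * * * /usr/local/bin/rebuild-docs-full.sh
```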
And during that interval, of course there would be some possibility of an end user not being aware of a certain newly added section because the TOC hasn't been regenerated yet, and similarly, not finding anything about that section in the index because it hasn't been regenerated yet.
* Cocoon-specific: it cannot cache "a part" of a transformation, which means the point above can't be worked around, right? Otherwise, the transformation of all unchanged sources would be cached.
Caching is something that you could do with or without Cocoon, and something that's entirely separate from the transformation phase. You wouldn't necessarily need Cocoon or anything Cocoon-like if you used the solution above (and if it would actually work as I think). And using Cocoon just to handle caching would probably be overkill; there are probably some lighter-weight ways to handle it.
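As a sketch of what a lighter-weight cache could look like, assuming the per-part transformation is wrapped in a callable: regenerate a chunk only when its cached output is older than its source file. The function and file names here are hypothetical, not part of any existing tool.

```python
import os

def cached_render(src_path, cache_path, render):
    """Return the rendered output for src_path, reusing the cached
    copy on disk while it is still at least as new as the source
    (a simple mtime comparison)."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(src_path)):
        with open(cache_path) as f:
            return f.read()
    output = render(src_path)   # e.g. run the XSLT transformation here
    with open(cache_path, "w") as f:
        f.write(output)
    return output
```

A request handler would call cached_render per page; touching a source file invalidates exactly that page's cache on the next request.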
Anyway, I think the solution I described would be some work to set up -- but you could hire some outside expertise to help you do that (Bob Stayton comes to mind for some reason...).
I looked at the solution of using an olink database, but perhaps I discarded it too quickly. Perhaps I'm setting the threshold too high (I am...), but I find it hackish; it isn't transparent, and above all it disturbs the creation of content: one can't use standard DocBook, and authors have to bother with technical problems. It's messy.
One thing worth remembering is that the source document doesn't have to be split in proportion to the pieces that are rendered; it only has to be kept in pieces small enough that performance is acceptable (a small detail, but from an editing perspective it can be practical to work with a document larger than what is to be viewed), /assuming/ the CMS (or whatever content generation mechanism is used) can map the generated output to a certain part of the source file (as with XInclude).
To recapitulate, the problem is the initial transformation of the requested content -- that the XSL stylesheets must traverse "all" the sources -- and that the performance hit is the same regardless of whether the output is PDF or HTML, and regardless of how small the requested content is. Once it's generated, all is fine, since it's cached for later deliveries. That's the key problem -- everything depends on it.
Here are possible solutions:
1. The olink way you described. It works, but it's complex, restrictive, and intrusive on content creation.
2. Truly static content (cron-driven). Not intrusive on content creation, but it's perhaps too simple (too dumb), and it can actually become a performance issue too; generating PDFs for each section means a lot of megabytes written to disk each time the cron job runs.
3. Actually going for the long transformation we're trying to avoid; that is, all the sources are transformed for each requested section. This long transformation happens only on the first request -- for the first user -- and then it's cached. How long does it take, then? Cocoon caches includes and the source files, so when the cache is invalidated only one source file is reloaded (the one which has changed), while all the others and the DocBook XSL stylesheets (they're huge) are kept in memory (as DOM, I presume) -- perhaps that's enough to reduce that first transformation to reasonable speeds. I'm only speculating; no doubt it's the transformation that takes the longest time (perhaps someone knows whether I'm being unrealistic, but otherwise real testing gives the definitive answer). If this worked, it would be the best solution.
These approaches can also be combined: the HTML output could be static (cron-driven), while the PDFs are dynamic. That way the performance trouble of 2) is gone (writing tons of PDF files), and perhaps the delay is acceptable for PDF. From my shallow reading about Forrest, I have understood that it's good at combining serving dynamic content with generating static content, so perhaps it could be a way to pull it all together under one technical framework.
Another problem with flexible website integration -- or at least something which requires action -- is navigation. As I see it, DocBook is tricky on that front: the XSL stylesheets are quite focused on static content generation, the chunked output for example. Since dynamic generation basically takes a node and transforms it with docbook.xsl, navigation must be hand written -- for example, if one wants the TOC as a sidebar that changes depending on what is viewed (flexible integration). I bet this is relatively easy to do, considering how the stylesheets are written, and it would be good to have in a generic way somewhere (Forrest, the DocBook XSLs, perhaps...).
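As a sketch of how a hand-written sidebar TOC could start: walk the DocBook source directly and collect (depth, id, title) tuples for chapters and sections, then render them however the site needs. The sample document below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature DocBook source, just for demonstration.
DOC = """
<book>
  <title>Guide</title>
  <chapter id="intro"><title>Introduction</title>
    <section id="goals"><title>Goals</title></section>
  </chapter>
  <chapter id="setup"><title>Setup</title></chapter>
</book>
"""

def toc(elem, depth=0):
    """Yield (depth, id, title) for each chapter/section, depth-first."""
    for child in elem:
        if child.tag in ("chapter", "section"):
            yield (depth, child.get("id"), child.findtext("title", default=""))
            yield from toc(child, depth + 1)

entries = list(toc(ET.fromstring(DOC)))
```

Marking the entry whose id matches the currently viewed chunk would give the "changes depending on what is viewed" behavior.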
Yes, speculation. When I write something, have actual numbers or a proof of concept, or know what I'm actually talking about, I will definitely share it on this list.
Hm.. That's as far as I see.