Ours & Hippy — le blog

Not dead yet: xmltools, long after the summer

2010-04-06T19:29:07+02:00

Soon, it will be the Summer (of Code) again, and this year, too, NetBSD will be participating.^α At around the same time, last year, I was busy writing technical proposals for what would become xmltools, one of the accepted NetBSD SoC projects of 2009. Well, most of you have probably forgotten about that small project; some may even think it was a useless attempt and a waste of time and money, which could have been better spent elsewhere… Err, in fact, I’m writing this to say that my xmltools are not dead! And I still intend to contribute the code to the NetBSD tree sometime in the (near) future.

α : And if you’re a student who’s into systems and stuff, looking for a cool organization to join, with a broad selection of projects, why not give NetBSD a try? :)

That being said, if you take a look at the Git repo, you’ll surely notice that nothing has moved for months… If you fool around a bit more, though, you may notice another project on the same Git host: regxml. Yep, this is a rewrite of my xmltools!

Why a rewrite, you ask? Well, basically because the old implementation was flawed by design, in addition to having become a mess.

The little boring story

And here’s the long answer, for those who care. Feel free to skip this section if it bores you.

During the second part of last year’s SoC, as I was trying to work on xmlsed, I realized my design was flawed. The main problem was that I had based all my efforts around a concept that was, for our purpose, completely inappropriate: node sets.

So what’s a node set? A node set is simply that: a set of nodes. Now, that’s great if you are extracting data from an XML document, as did xmlgrep, but it sucks if you’re going to operate further on that. In particular, removing arbitrary nodes from an XML tree doesn’t yield a tree, and replacing a node set made no sense at all. (How would you replace two far away nodes with one big chunk of XML? There are many ways to go about it… but they’re all wrong IMO.)

So what emerged at the end of the summer was that… I needed a new design, and as my original algorithm had become crippled with dubious ad hoc attempts to support operations that it never could, I needed a new algorithm too. (Besides, there was a deficiency in my algorithm that made it perform in super-polynomial time with some patterns and data sets, which was bad.)

It took me something like one month (since I wasn’t full-time on it) to jot down the new concepts and a (hopefully) good algorithm on paper. Then, for months, I’d try to implement it, but each time, something would feel really wrong, and I’d trash my new implementation. Well, that was till some days ago, when somehow, I got it right, and so the new xmltools were born!

XML intervals

So all this boils down to the following fundamental change: XML node sets have been replaced by XML intervals, as the core objects the xmltools operate on. An XML interval is really just that: a bunch of adjacent siblings and all their children. You should read the included xmltools(7) manual page for more information, from a user’s point of view.

From a more theoretical viewpoint, intervals have two important advantages over sets:

Inserting or removing an XML interval in an XML tree keeps it a tree. In other words, XML trees are stable by insertion and deletion of arbitrary XML intervals. (It may not have any meaning anymore, but remember xmltools are about syntax, not semantics.)
XML intervals map (injectively) to byte intervals in the XML file, hence all editing operations on XML intervals (sequences of insertions and deletions) can be expressed as operations on sequences of bytes in the file itself.

The first property is what I am convinced will make xmlsed possible. Previously, with node sets, it was a nightmare to give proper semantics to the various operations. But now, we have simple objects that still retain a great deal of expressivity but can be manipulated easily, with simple rules.

Although I do not use the second property yet, I’m definitely thinking about introducing an API for that sometime in the (not quite near) future.

The idea is that you run the XML matcher on a file mapped in memory (mapping new chunks as needed), and get a program script consisting of byte operations, that you can then run very efficiently on the memory-mapped file! Instead of printing an interval, you’d just copy over the entire memory region. Instead of deleting an interval, you’d just ignore that chunk. You get the idea. It could be used for efficient processing of massive data with complex transformation rules, but we’re not quite there, yet. :)

xmltools and NetBSD

As I’ve discussed with David (<dyoung>), I intend to import my core library, as well as xmlgrep into base sometime in the near future (once it has undergone more thorough testing). It depends on Expat for XML parsing, so I’ll need to import that too.

After much thinking, I think I’ll settle with a subdirectory in src/external/bsd, though I use the NetBSD build system, as I want to keep my development model using Git, and the ability to easily make packages of all my tools for non-NetBSD platforms.

The new code base uses (Net)BSD idioms more extensively than the old one (because I’ve got the time to read more NetBSD code since I started in May of last year). (E.g. I’ve followed David’s suggestion that I should make use of sys/queue.h.) Quite surprisingly, it compiles almost as is on GNU/Linux, with three additional defines and two trivial functions. I do provide a simple GNU makefile in addition to the default NetBSD build system makefiles, for GNU/Linux builds. Of course, you’re welcome to try to build it on other systems I don’t have access to!

What the future holds

Well, what’ll come next depends on a lot of factors, some of which are completely outside of my control. But there is one thing you can do: please download, compile, read the man pages, and test xmlgrep and give me some feedback, so I can fix bugs!

$ git clone git://git.huoc.org/regxml.git
Or through Git-Web: http://git.huoc.org/?p=regxml.git;a=summary

As far as the xmltools project is concerned, I’m now going to concentrate on xmlsed. I already have an idea of the implementation, and I think it’s not going to be too hard… hopefully, I’m not wrong. :)

P.S.: Oh, I forgot to talk about the new algorithm. Well, to get the idea, it’s basically an automaton backed by a backtracking memoizing work-queue-based interpreter, which roams the XML tree and try to match intervals. You should read the xmltools(7) man page for a high-level overview of the technique. Maybe I’ll write something more consistent on the subject later…

xmlsed prototype

2009-10-12T22:30:46+02:00

Following David Young’s post on the official NetBSD blog, here’s my own small announcement: xmlsed is here, or at least a somewhat working first version. It’s in the Git repository, under the master branch (everything’s been merged back into master some time ago).

It’s mostly untested, and the exact interface/behavior is likely to change, but if you’re interested in helping out, please read the man pages included in the distribution (xmlsed(1) and xml_pattern(7) should get you started) and play a bit with some XML data (you can find sample files in tests/data/).

So, what does my xmlsed do? Basically, it’s somewhat like sed(1) except it does only one of the many things sed can do (though it’s probably the most popular feature of sed): it replaces things with other things, where, in our case, a "thing" is one or more nodes.

It is based on the new event-handling mini internal framework and uses the same patterns as xmlgrep (and the same matching code), augmented with the register-binding operator >= which lets you put captured nodes into a named variable, called a register. Then you can put nodes of the form <$REGNAME/> in your replacement templates, and voilà! The register’s contents gets inserted in place of the <$REGNAME/>. Something like xmlsed '((a#)>=a%(b#)>=b) <$b/><$a/>' should swap around two consecutive nodes a and b… well, I very much hope it does.

Well, that’s for the good part. Now, there are many things left unresolved in xmlsed, mainly the fact that its behavior is not consistent with xmlgrep. In xmlsed, if you replace a node, it also deletes all of its descendants, whereas when you select a node in xmlgrep (or in a capture in xmlsed), it comes alone, without any of its descendants (except attributes). In fact, you can even do a lot more weird stuffs in selections like selecting and grouping together non directly connected parts of the tree. To sum up: selection is way more powerful than replacement, and by default, they don’t really look alike (and they aren’t quite implemented the same way either…).

I’m thinking about rationalizing all this, along with the syntax. Through the many feature additions (as David and I decided we needed more power), selection patterns have probably become too convoluted, unnecessarily complex and fine-grained where it doesn’t matter. However, we should not give up on all that flexibility and probably bring some over to the replacement mechanism. It’ll also be a good opportunity to revise the syntax (more XPath-isms?) and migrate to some automated parser generator, one that can create reentrant and push-style parsers. I have that big patch of NetBSD yacc doing just that, waiting in my Git repository, but it probably needs more polishing before I can hope to submit it to the mailing lists to get myself flamed for doing things on my own, but oh well, we’ll see.

Now’s not the time for that yet. First, I need to test xmlsed more thoroughly to ensure that the implementation underlying the syntax really works.

Also, other things that I’ll want/need to do once I’m done with xmlsed (to the point where it runs well enough, doing something useful) are, in no particular order:

Rewrite the matching code to be more generic, simple, and readable: the idea is to split various constructs into more elementary blocks (I haven’t decided exactly what they would be, but I’m thinking about it) and have some kind of automaton feed on these instructions and move accordingly.
Improve the data structures in use, both for efficiency reasons and to make it possible to (theorically) run multiple instances of the matcher on the same stream of data, hopefully opening the way for some more concurrency. This last point is almost possible now, but not quite yet; there are still some matching data that remain in the node itself while they shouldn’t.
Implement the amended algorithm I’ve devised that should (if I’m correct…) eliminate the "bad" parts of the complexity that make it theorically inferior to the transducer network algorithm, while taking somewhat more memory (though the asymptotic bounds on memory consumption should stay the same).
Do something about locales; also handle integration into base, and packaging for non-NetBSD systems. In short: better integration.

Then of course, there’ll also be work on the remaining tools: some kind of xmljoin is probably next after xmlsed. In which order I address the various points, I don’t know yet; it’ll probably depend on what I feel is more urgent, and what I feel like doing first. :)

xmltools: after the GSoC

2009-09-07T17:14:48+02:00

Since the end of the Summer of Code, I have been busy doing various miscellaneous tasks around my xmltools, so that it be ready for inclusion into the NetBSD tree some time in the future. This is a short summary of what has happened (just so people know that my project’s not dead, I’ve not vanished, or something along these lines); please take a look at the code repo for details:

ATF test suites

I’ve finally decided to convert my bunch of shell scripts to ATF tests. These are basic tests which check the various constructs of the pattern language. There are 46 tests, three of which are optional (they are made against rather big data sets that I do not distribute with the sources). These include a test suite centered around property list matching. If you have relevant test cases and would like to see them included, please send them over.

Library API

Some people had been asking me if there would be a library, well, yes, and I’ve libified all the existing code in anticipation of that, and also because I wanted to eventually commit somthing clean to the NetBSD tree. All routines have been made to support proper error reporting, and expose clean interfaces. The library is divided into four main sub-namespaces:^α

hhxml_chunk_* : XML in-memory tree manipulations.
hhxml_stream_* : XML streams and (push-style) parsers. The streams have a hybrid design: they can be used to read nodes one at a time in a pure stream fashion or to generate partial in-memory trees. See misc/atomheadlines.c in the source distribution for an example.
hhxml_path_* : patterns and matching. This one is incomplete and the old match.c will need to be converted to fit into the new API.
hhxml_pprint_* : XML pretty printer. The implementation is also mostly incomplete at this time: it only supports printing to stdout, well, because the command-line tools don’t require any more than that.

The various parts of the API will be completed as I go about writing the tools themselves, but I wanted to have clean well-defined interfaces to build onto.

α: By the way, "hhxml" stands for "(the) very short XML (library)".

Non-visible changes to the pattern matching engine and plans

Well, I’ll write more on this one when I get the time, but I’ve been looking into other implementations and theories, especially SPEX, which boasts algorithms with good complexities (better than what is currently implemented in my tools). I have plans for improvements to the engine which I believe, if I’m not wrong, would bring us similar performance, but that will have to wait, since it is not critically important, at the moment, in my opinion.

That’s all for today!

xmltools: what's done, what's left

2009-08-13T23:44:37+02:00

So the Summer of Code ends soon and it’s time to report what has been done, and what is still lacking. Obviously, everybody can see there is no xmlsed yet in the repository, so what have I been doing since I completed xmlgrep one month ago? What went wrong?

Well, if you ask me, nothing really went wrong, I have just deviated somewhat from my original goals and taken the time to do some things that weren’t part of the initial plan. And even though I am disappointed not to have finished, this has brought its share of good additions. Most of these modifications stem from discussions with David (Young) about features we’d like to see in my xmltools. Also in all these three months, I’ve spent about three weeks optimizing the matcher code.

What we have

Let’s talk about the good stuff first. Of course, the visible part of my work is the existence of xmlgrep, but most of the code actually lives in a common static library. It provides:

definitions for patterns and trees;
utilities: (tree) buffers, bit vectors, symbol look-up, state and transition caching;
a string to pattern parser (that I often just call "pattern compiler");
an XML parser (which can handle fragments and multi-roots documents, or work as an interface to Expat, for strict XML parsing);
an XML pretty printer;
the code for the matcher itself;
and a higher-level API used to build the tools themselves.

The library has managed to remain reasonably small, with less than 5k lines of code, counting comments and everything, and the core matching algorithm fits in less than 1.5k lines or so (compare this to the 15k-line xpath.c from libxml2). It is stream-oriented and care has been taken so it eats as little memory as possible while remaining efficient.

The pattern matching code

The matcher is undeniably the most important and sensible piece of code in there. It is a blend of a kind-of-NFA and a backtracking machine, which processes one node at the time (NFA) but is able to do unlimited look-ahead (backtracking). Recent work has involved adding new advanced constructs to the matcher (besides forward and look-ahead mechanisms, which were already implemented) and improving its performances when look-ahead is in use.

Performances

On the performance side, the main addition is that of a state cache and pre-computed states created by the pattern compiler (instead of the matching code itself).

The state cache (implemented as a hash table) operates at two levels :

propagation (from parent to child and sibling to sibling);
and filtering (deciding whether based on local properties, states match or not).

One effect of using a cache is that bit vectors need not be pooled anymore and a lot of copying has disappeared in favor of pointer swapping. Also, relying on states pre-computed by the pattern compiler has shifted the focus from looping bit by bit through bit vectors to whole-vector operations, most of which can be auto-vectorized by a sufficiently smart compiler (ICC on Linux can vectorize a dozen of these in the whole xmlgrep code, but as I’ve pointed before, GCC auto-vectorizer seems to lag behind its non-use, at least on my code and setup).

With these optimizations merged in, xmlgrep now outperforms xmlstarlet both in terms of speed and memory consumption on my silly little benchmark (for those who have not been there, it consists in grepping for a word and its definition in a 50M XML-formatted English dictionary, and yes, it’s simple-minded), even when using suboptimal look-ahead subpatterns.

Indeed, knowledge of the XML format often enables us to write optimized patterns by anchoring a more costly part with a basic predicate. In our case, instead of searching for p[hw/.="somestring"] (the suboptimal case), we could search for body/p[hw/.="somestring"] since p can only occur inside body. This is an improvement since the p[] part, being a look-ahead subpattern, is sensibly slower than a forward pattern, so reducing the number of tentative matches is essential.

Compared to benchmarks done at the time of the release of xmlgrep, handling of this suboptimal case is about three times faster now.

Features

On the feature side, which is probably more interesting for casual would-be users, I have added grouping, Kleene-like operators, and subpattern negation, as well as reworked some of the old syntax.

Subpattern negation is the simplest extension, algthough it’s fairly powerful. It can be achieved because of backtracking so it’s only available on (look-ahead) subpatterns, not groups (see below). Negation lets you write something like {a}!{a/b} which selects all elements a that do not have a b element. This can’t be done with regular patterns because a node can have many children, and thus, unlike with regexes, something like a/!b (would-be syntax) would match some a node with a child which is not a b, but there can be many other children.

For example:

$ xmlgrep -x '{dict}!{dict/@id}/key/.' test.plist

will extract only keys from unnamed dictionaries.

The second addition is Kleene-like operators. Well, sorry about the name, but it’s just what comes to mind when I use these. They are, as the name implies, somewhat like their regex counter part, the Kleene star, in that they match zero or more repetitions of the preceding predicate. E.g. a*%b matches <a/><a/><a/>...<b/> but not, say, <a/><c/><d/>...<b/>. The %% and // operators have now become special cases of the Kleene operators. a%%b can be written a%**%b and a//b is really a/**/b, where the first star stands for "any node" and the second is part of the operator.

But having these repetition operators limited to repeating a single node predicate would have been quite boring, and so groups were born. Groups, written inside parentheses () (notice these have replaced the old subpattern syntax which has been moved back to being {} from its first days), well, group a pattern fragment, so that it can be repeated. They also accept alternation so you can write a%(b|c)*%d, which just seems natural.

Let’s take an example:

$ cat test.xml
<B/><a/><b/><a/><b/><a/><b/><a/><b/><a/><b/><E/>
$ xmlgrep 'B%(a|b)*%E' test.xml
<E/>

Groups are implemented without backtracking so using a group is not much more costly than using plain forward patterns, and is faster than using look-ahead subpatterns.

Yet, that is only true of normal groups. When you use a selection operator (+, #, - or =) on a group, it becomes a capturing group, a.k.a. look-ahead group. A look-ahead group is really a look-ahead subpattern combined with a group and has a somewhat subtle behavior. It behaves like a group, with the following difference: any selection operator inside a look-ahead group will only apply if the whole group matches. This can be used to bind selectors together. Thus we can revisit the classical key/value pair example: how do we match a key/pair with some properties on the key and some other on the value? You can always do that with simple look-ahead subpatterns combined with forward patterns, but groups just make it more convenient.

Imagine you want to match all key/value pairs where the key contains the letter a and the value is a string. Easy, with a capturing group:

$ xmlgrep -x '(key[.~"a"]#%string#)+' test.plist

Other components, reusability

Some people have discussed with me the possibility of making the library reusable. As it is now, it is not yet ready for wider exposure, but a lot of work has gone into rationalizing its API, in order to prepare for that. The API needs to be stabilized and the various components need to be better segregated.

Currently, the data structures in use are tightly integrated. For instance, the parser actually builds a tree with nodes able to hold matching states. Somehow, I would need to break that dependency, so that the matcher depends on the parser, but not the opposite, so you could use the XML parser as a stand-alone component (it has a nice hybrid design). Other parts are less affected, though.

What’s left to do

Now for the bitter part. Obviously, there are still a lot of things left to do, and I intend to continue working on this project, but whether it will be integrated into NetBSD at some point in the future is not for me to decide, though I really hope it will.

In short, here is the plan:

write support for xmlsed templates in the library;
write xmlsed proper;
change the/add a high-level API to the library to make it possible to write xmljoin (because the high level framework only accepts one input file at this point in time);
see what more needs to be adjusted for xmljoin to fit into the framework;
write xmljoin proper;
see what needs to be done for xmlsort to fit into the framework;
write xmlsort.

Of course, I have given some thoughts to the "see what needs to be done" phases, but now is not right time to talk about that.

Also, at some point, I’ll need to address several shortcomings of the current code base:

add locale support;
convert my bunch of "regression shell scripts" to proper test suites (ATF, probably);
add non-NetBSD build support (currently, I have to hand-compile it on my Linux box).

And I will also want to:

clean-up the API and release the core components as a stand-alone library;
write documentation on the internals to accompany that library;
and of course, integrate it into NetBSD, if I am given permission. :)

Lastly, about xmlsed, though you can see no xmlsed directory in the repository, I can say it is pretty close to completion since the pieces needed are almost all there in the library (the only one missing is the template support, and that’s not the hard part).

Conclusion

The Google Summer of Code is nearing its end, but my summer vacations are not over yet. I won’t have classes until September, so I plan to carry on as if nothing much happened at least till the end of August. After that, I won’t have nearly as much free time, but I’ll do what I can; let’s not be hasty and wait till then to see how far the project has gone.

I will continue to update this blog and the code will be available at my Git repository, as usual. (Anyway, I have just recently added comments to my blog so feel free to drop by and give me your opinion.)

xmltools implementation: automata and backtracking

2009-07-20T00:39:10+02:00

At the moment, I’m implementing submatches in patterns, in preparation for xmlsed. Since most of what I do draws from formal languages and automata (except I’m too lazy to write anything formal), I’ve been looking into TNFAs for inspiration. Seeing how TRE seems to implement look-ahead without backtracking, I began to ask myself if such a thing could be done for my pattern matcher. Unfortunately, I can’t seem to figure out a solution by myself, so I’m blogging instead. :)

Basically, the problem is with constructs such as a[b]/c which match c, child of a only if a has a child b. However, when we reach a, we don’t know yet whether it has a child b or not, so we’re in need of some look-ahead.

This is akin to a(?=b)c in Perl regexes, but as you can already see, it’s fairly different. Perhaps the most obvious difference is that a tree has "two dimensions" while a string has only one, so it’s possible for a to have two children b and c, in a tree, while it’s not possible for a to be followed by two different characters, in a string.

Backtracking

The easiest way to solve this problem is to use backtracking. This is how the matcher is currently implemented. Even though about everything else is based solely on states and transitions, the part implementing look-ahead predicates uses backtracking.

The algorithm looks like this: everytime we have to match a look-ahead predicate, we stop, read as many nodes as we need to decide and then either drop the state or keep it, depending on the outcome.

This is a simple approach and it works well (i.e. it gives the expected results). Only, it’s slow. As a comparison, using the patterns I’ve used before for my benchmarks, the difference between hw/.="Ab\"a*cus" and p[hw/.="Ab\"a*cus"]# is the second one is about 2.66 times slower. This can be hand-optimized by rewriting the second pattern as body/p[hw/.="Ab\"a*cus"]#, with knowledge of the input format, where p can only appear as a child of body. This cuts the slowness to a mere 1.62 times slower.

At the same time, it demonstrates a major source of inefficiency in my backtracking implementation: loose (non-anchored) subpatterns, which may potentially apply to a great number of nodes yet would fail immediately for most of them, result in a lot of unwanted short backtracking segments where the matcher tries the subpattern and then immediately backtracks after failing, thus still examining the node twice.

Product automata

In theory, a pattern such as a(?=b)c matches if a is followed by both b and c so another natural idea would be to use a product automaton which transitions between couples of states: if we end up in two final states, then we win.

Only, this works well because we want to know whether the string matches as a whole. However, with an XML pattern such as a[b]/c, we need to identify which node matched the c part. This is where things start to get complicated.

Now we do know how to extract a submatch from a string with tagged transitions. So, great, does this apply to our XML matcher? The answer is "no", as far as I know (I’d be glad if someone proved me wrong, though). Tagged transitions are good at extracting one submatch, the last one, but in our case, the problem is we want all occurrences of c, not just the last one.

Keeping this information globally associated with the state is not an option since there may well be multiple instances of the look-ahead predicate trying to match at the same time. But trying to differentiate between two states based on the matching position is not an option either, because that could easily lead to linear memory usage. Consider, for example, the following XML fragment:

<a/><a/><a/><a/><a/>...<b/>

Trying to match {a%%b} (or (a%%b) using the current syntax in the master branch; sorry, I reverted to the old syntax because () will be used for groups) against that would result in a match at every a element. Hence, keeping the position as a state property is not a solution, since we’d have to distinguish between as many a nodes as there are in the input data.

Optimizing the backtracking method

Although the implementation of the matcher has undergone many changes in order to improve performances recently, mostly so as to accommodate new features while keeping a reasonable running time, I believe there’s still room for much improvement, maybe in the form of algorithmic enhancements rather than implementation optimizations.

Some of my more realistic ideas are:

Read one element ahead to reject some subpatterns without entering a backtracking context.
Try to predict the outcome of a look-ahead predicate and compute the more likely transitions along with the predicate itself, saving on backtracking (except to prune the tree) in case we’re right (but incuring additional recomputations if we’re wrong).

Also, this is not directly related to the backtracking approach, but I’m thinking of putting more information in the state cache; at the moment, we’re only caching "raw" transitions which account for the old state set and the direction (i.e. successor or child); we could try to cache local information as well, which would move us one step closer to the way NFA transition functions get cached to get DFAs.

Conclusion: do we need performance that badly?

Well, that’s a good question. And the answer is "we probably don’t". This is kind of my hobby, so don’t get the wrong idea. The whole thing is not that slow. On my simple example, xmlgrep was only 0.5s slower than libxml-powered xmlstarlet, taking less than six seconds to look up a word definition in a 58M dictionary, on my 2x2GHz Core 2, while using 100 times less memory.

I doubt anybody is going to use xmlgrep to look up words in dictionaries anytime soon; such a specialized task would best be served using an indexed collection of some sort, although I should mention that there are XML formats which are more suited to being parsed using my tools than others, so converting to a more appropriate (e.g. annotated) XML format using xmlsed then searching with xmlgrep could well be a viable solution too (given the conversion is one-time or infrequent compared to searches). More on that once xmlsed is out!

xmltools update: new I/O layer and further plans

2009-07-08T15:45:28+02:00

It has been nearly two weeks since my last update on my plans for xmlsed. Since that time, David and I have come to an agreement on the (hypothetical) syntax and my hybrid model has been retained.

Rewriting the I/O module

Although I now have a fair idea of what xmlsed should look like and how to implement it, some elements introduced after the recent discussions with my mentor required changes to the existing code.

In particular, the new template syntax required an XML parser able to handle partial XML documents. Besides, I wanted some syntactic sugar on top of XML to make writing templates easier. Expat being a well-behaved XML parser is quite strict about the syntax and there is no easy way to work around that. Though I did spend some days trying, at first, I eventually gave up, as it proved hard to write, let alone maintain in the future.

At the same time, I began to realize the original I/O abstraction I designed was showing its limits. I initially wanted to be able to read and write multiple tree representation formats, but in the end, the need to fully support XML, with all its specifics (doctypes, PIs, etc.), has convinced me to drop the idea and focus on XML alone.

All these reasons led me to rewrite the I/O layer (almost) completely. This work is being committed to yet another temporary branch: xmlpush.

Maybe the most visible change is that multi-backend support was dropped and the I/O module now consists of only two drivers: the Expat-based parser (the strict parser) and a home-made loose parser.

All tools now support two modes: the strict mode (-s flag) and the default loose mode.

In the strict mode, compliance with the XML standard is a priority: all extensions are disabled, including multi-root and partial documents (which was implemented with Expat as an ugly kludge before, so it was removed), the XML prolog is honored (the specified encoding is used and entity declarations are parsed), and stricter rules are enforced (e.g. in names).
In the loose mode, extensions are supported and the XML prolog is ignored (instead, we use the system locale).

The encoding issue

Although I’ve just stated how encodings will be handled, at the moment, I have not written any code to suppor that yet. Actually, it’s not a simple matter.

First of all, the XML standard mandates support of Unicode. For one thing, XML character entities (&xxxx;) are references to character codes from the Unicode tables. This makes support for Unicode pretty much mandatory.

In order to handle this, the Expat people have chosen to serve all data as UTF-8 (or UTF-16, depending on the build-time configuration), no matter what input encoding is used. This should have given you a hint: Expat has its own encoding conversion engine.

Now, even though XML documents can specify an encoding, and so it would be alright to always use UTF-8, that is not an acceptable solution in the real world: Unix users expect their programs to comply to the current locale. But Expat does not care one way or the other about, or integrate with, the locale system.

Fortunately, NetBSD has iconv(3) (though neither FreeBSD nor OpenBSD does, apparently, which will be a problem in making my programs portable; there seems to be an ongoing GSoC effort to port NetBSD libiconv to FreeBSD though), so I plan to run everything through that on input and again on output. Sounds like a waste of resources? Well, don’t blame me.

However, that’s not all: there is also the loose parser. Since this one was written by me, and I had no reason to force UTF-8 everywhere, it does no conversion and assumes all sources are in the current locale. The only problem will be for character entities (not currently implemented), but I intend to localize the Unicode-to-locale transformation to these, since it is costly.

Remember the push-style vs event-driven parser debate?

Well, it wasn’t really a debate, but maybe some of you remember that I posted some weeks ago a ticket on this blog about how I grew dissatisfied with the rigid event-driven behavior of Expat and wanted a push-style parser.

I took the occasion this time to make my wish come true (almost) and the new I/O design is mostly push-style. I say mostly because I’ve made some compromises in order not to impair performances.

At first, I thought about having the parser run on a fragment of input text and build a queue of events to be fed to the application one by one, on a on-demand basis. But then I realized I could just use the internal tree directly as my event queue, and have the application read that. Since we have support for look-ahead in the matching engine (and hence need to keep a partial internal tree in any case) and most tools will use that, this effectively moved the tree building code from the matcher to the I/O module, at the same time eliminating direct event responses (a bit less than 500 lines of code). This last point needs some clarification: sure we do use a little more memory (but honestly that just doesn’t show) since we build the tree first and only then process it, but this is limited to how many nodes can be represented in one read buffer, which means it’s mostly insignificant. But we have gained what I think is far more important: an unified implementation of the matcher (which is the most sensitive part).

Further plans

At the moment, the new parser does not support things such as entity declarations and I am still pondering whether to write code for that or not. In any case, it would be a good idea to have a xmlcat(1) utility which fully supports XML, including external entities, with the ability to fetch and include external documents (fetch(3)?) and substitute references. But let’s leave this discussion for another time.

As for the locale support, I think it will come after xmlsed is somewhat ready (i.e. as xmlgrep is right now). So for now, some more testing needs to be done on the new I/O components, and when this is done, I will resume work on xmlsed proper.