<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Ours &amp; Hippy — le blog</title>
  <link href="http://blog.huoc.org/" />
  
  <author>
    <name>Ours &amp; Hippy</name>
    <email>ourshippy@huoc.org</email>
  </author>
  <id>tag:blog.huoc.org,2009:atom</id>
  <updated>2010-04-06T19:29:07+02:00</updated>
  <entry>
      <id>tag:blog.huoc.org,2009:posts/xmltools-not-dead</id>
      <link rel="alternate" href="http://blog.huoc.org/xmltools-not-dead.html" />
      <title xml:lang="en">
        Not dead yet: xmltools, long after the summer
      </title>
      <published>2010-04-06T19:29:07+02:00</published>
      <updated>2010-04-06T19:29:07+02:00</updated>
      <author>
          <name>Nhat Minh Lê  (rz0)</name>
          </author>
      <category term="netbsd" />
      <category term="parsing" />
      <category term="xmlgrep" />
      <category term="xmlsed" />
      <category term="xmltools" />
      <content xml:lang="en" type="html">
         
&lt;p&gt;Soon, it will be the Summer (of Code) again, and this year, too,
NetBSD will be participating.&lt;sup&gt;α&lt;/sup&gt; At around the same time, last year,
I was busy writing technical proposals for what would become xmltools,
one of the accepted NetBSD SoC projects of 2009. Well, most of you
have probably forgotten about that small project; some may even think
it was a useless attempt and a waste of time and money, which could
have been better spent elsewhere… Err, in fact, I’m writing this to
say that my xmltools are not dead! And I still intend to contribute
the code to the NetBSD tree sometime in the (near) future.
&lt;/p&gt;&lt;div class="Notes"&gt;&lt;p&gt;α : And if you’re a student who’s into systems and stuff, looking
for a cool organization to join, with a broad selection of projects,
why not &lt;a class="extern" href="http://www.netbsd.org/contrib/soc-projects.html"&gt;give NetBSD
  a try&lt;/a&gt;? :)
&lt;/p&gt;&lt;/div&gt;&lt;p&gt;That being said, if you take a look at the Git repo, you’ll surely
notice that nothing has moved for months… If you fool around a bit
more, though, you may notice another project on the same Git host:
&lt;a class="extern" href="http://git.huoc.org/?p=regxml.git;a=summary"&gt;regxml&lt;/a&gt;. Yep, this is
a rewrite of my xmltools!
&lt;/p&gt;&lt;p&gt;Why a rewrite, you ask? Well, basically because the old implementation
was flawed by design, in addition to having become a mess.
&lt;/p&gt;&lt;h3&gt;The little boring story
&lt;/h3&gt;&lt;p&gt;And here’s the long answer, for those who care. Feel free to skip this
section if it bores you.
&lt;/p&gt;&lt;p&gt;During the second part of last year’s SoC, as I was trying to work on
xmlsed, I realized my design was flawed. The main problem was that
I had based all my efforts around a concept that was, for our purpose,
completely inappropriate: node sets.
&lt;/p&gt;&lt;p&gt;So what’s a node set? A node set is simply that: a set of nodes. Now,
that’s great if you are extracting data from an XML document, as did
&lt;b&gt;xmlgrep&lt;/b&gt;, but it sucks if you’re going to operate further on
that. In particular, removing arbitrary nodes from an XML tree doesn’t
yield a tree, and replacing a node set made no sense at all. (How
would you replace two far away nodes with one big chunk of XML? There
are many ways to go about it… but they’re all &lt;em&gt;wrong&lt;/em&gt; IMO.)
&lt;/p&gt;&lt;p&gt;So what emerged at the end of the summer was that… I needed a new
design, and as my original algorithm had become crippled with dubious
ad hoc attempts to support operations that it never could, I needed
a new algorithm too. (Besides, there was a deficiency in my algorithm
that made it perform in super-polynomial time with some patterns and
data sets, which was bad.)
&lt;/p&gt;&lt;p&gt;It took me something like one month (since I wasn’t full-time on it)
to jot down the new concepts and a (hopefully) good algorithm on
paper. Then, for months, I’d try to implement it, but each time,
something would feel really wrong, and I’d trash my new
implementation. Well, that was till some days ago, when somehow, I got
it right, and so the new xmltools were born!
&lt;/p&gt;&lt;h3&gt;XML intervals
&lt;/h3&gt;&lt;p&gt;So all this boils down to the following fundamental change: XML node
sets have been replaced by XML intervals, as the core objects the
xmltools operate on. An XML interval is really just that: a bunch of
adjacent siblings and all their children. You should read the included
&lt;b&gt;xmltools(7)&lt;/b&gt; manual page for more information, from a user’s point
of view.
&lt;/p&gt;&lt;p&gt;From a more theoretical viewpoint, intervals have two important
advantages over sets:
&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Inserting or removing an XML interval in an XML tree keeps it
a tree. In other words, XML trees are stable by insertion and
deletion of arbitrary XML intervals. (It may not have any meaning
anymore, but remember xmltools are about &lt;em&gt;syntax&lt;/em&gt;, not semantics.)
&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;XML intervals map (injectively) to byte intervals in the XML file,
hence all editing operations on XML intervals (sequences of
insertions and deletions) can be expressed as operations on
sequences of bytes in the file itself.
&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The first property is what I am convinced will make &lt;b&gt;xmlsed&lt;/b&gt;
possible. Previously, with node sets, it was a nightmare to give
proper semantics to the various operations. But now, we have simple
objects that still retain a great deal of expressivity but can be
manipulated easily, with simple rules.
&lt;/p&gt;&lt;p&gt;Although I do not use the second property yet, I’m definitely thinking
about introducing an API for that sometime in the (not quite near)
future.
&lt;/p&gt;&lt;p&gt;The idea is that you run the XML matcher on a file mapped in memory
(mapping new chunks as needed), and get a program script consisting of
byte operations, that you can then run very efficiently on the
memory-mapped file! Instead of printing an interval, you’d just copy
over the entire memory region. Instead of deleting an interval, you’d
just ignore that chunk. You get the idea. It could be used for
efficient processing of massive data with complex transformation
rules, but we’re not quite there, yet. :)
&lt;/p&gt;&lt;h3&gt;xmltools and NetBSD
&lt;/h3&gt;&lt;p&gt;As I’ve discussed with David (&amp;lt;dyoung&amp;gt;), I intend to import my core
library, as well as xmlgrep into base sometime in the near future
(once it has undergone more thorough testing). It depends on Expat for
XML parsing, so I’ll need to import that too.
&lt;/p&gt;&lt;p&gt;After much thinking, I think I’ll settle with a subdirectory in
&lt;code&gt;src/external/bsd&lt;/code&gt;, though I use the NetBSD build system, as I want to
keep my development model using Git, and the ability to easily make
packages of all my tools for non-NetBSD platforms.
&lt;/p&gt;&lt;p&gt;The new code base uses (Net)BSD idioms more extensively than the old
one (because I’ve got the time to read more NetBSD code since
I started in May of last year). (E.g. I’ve followed David’s suggestion
that I should make use of &lt;code&gt;sys/queue.h&lt;/code&gt;.) Quite surprisingly, it
compiles almost as is on GNU/Linux, with three additional defines and
two trivial functions. I do provide a simple GNU makefile in addition
to the default NetBSD build system makefiles, for GNU/Linux builds. Of
course, you’re welcome to try to build it on other systems I don’t
have access to!
&lt;/p&gt;&lt;h3&gt;What the future holds
&lt;/h3&gt;&lt;p&gt;Well, what’ll come next depends on a lot of factors, some of which are
completely outside of my control. But there is one thing &lt;em&gt;you&lt;/em&gt; can do:
please download, compile, read the man pages, and test xmlgrep and
give me some feedback, so I can fix bugs!
&lt;/p&gt;&lt;blockquote&gt;&lt;div&gt;&lt;pre&gt;&lt;code&gt;$ git clone git://git.huoc.org/regxml.git
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or through Git-Web:
&lt;a class="extern" href="http://git.huoc.org/?p=regxml.git;a=summary"&gt;http://git.huoc.org/?p=regxml.git;a=summary&lt;/a&gt;
&lt;/p&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;p&gt;As far as the xmltools project is concerned, I’m now going to
concentrate on xmlsed. I already have an idea of the implementation,
and I think it’s not going to be too hard… hopefully, I’m not
wrong. :)
&lt;/p&gt;&lt;p&gt;P.S.: Oh, I forgot to talk about the new algorithm. Well, to get the
idea, it’s basically an automaton backed by a backtracking memoizing
work-queue-based interpreter, which roams the XML tree and try to
match intervals. You should read the &lt;b&gt;xmltools(7)&lt;/b&gt; man page for
a high-level overview of the technique. Maybe I’ll write something
more consistent on the subject later…
&lt;/p&gt;

      </content>
    </entry>
  <entry>
      <id>tag:blog.huoc.org,2009:posts/xmlsed-prototype</id>
      <link rel="alternate" href="http://blog.huoc.org/xmlsed-prototype.html" />
      <title xml:lang="en">
        xmlsed prototype
      </title>
      <published>2009-10-12T22:30:46+02:00</published>
      <updated>2009-10-12T22:30:46+02:00</updated>
      <author>
          <name>Nhat Minh Lê  (rz0)</name>
          </author>
      <category term="api" />
      <category term="xmlsed" />
      <category term="xmltools" />
      <content xml:lang="en" type="html">
         
&lt;p&gt;Following David Young’s post on the official NetBSD blog, here’s my
own small announcement: xmlsed is here, or at least a somewhat working
first version. It’s in the Git repository, under the master branch
(everything’s been merged back into master some time ago).
&lt;/p&gt;&lt;div class="Avertissement"&gt;&lt;p&gt;It’s mostly untested, and the exact interface/behavior is likely to
change, but if you’re interested in helping out, please read the man
pages included in the distribution (&lt;b&gt;xmlsed(1)&lt;/b&gt; and
&lt;b&gt;xml_pattern(7)&lt;/b&gt; should get you started) and play a bit with some
XML data (you can find sample files in &lt;code&gt;tests/data/&lt;/code&gt;).
&lt;/p&gt;&lt;/div&gt;&lt;p&gt;So, what does my xmlsed do? Basically, it’s somewhat like &lt;b&gt;sed(1)&lt;/b&gt;
except it does only one of the many things sed can do (though it’s
probably the most popular feature of sed): it replaces things with
other things, where, in our case, a &amp;quot;thing&amp;quot; is one or more nodes.
&lt;/p&gt;&lt;p&gt;It is based on the new event-handling mini internal framework and uses
the same patterns as xmlgrep (and the same matching code), augmented
with the register-binding operator &lt;code&gt;&amp;gt;=&lt;/code&gt; which lets you put captured
nodes into a named variable, called a register. Then you can put nodes
of the form &lt;code&gt;&amp;lt;$REGNAME/&amp;gt;&lt;/code&gt; in your replacement templates, and &lt;i&gt;voilà&lt;/i&gt;!
The register’s contents gets inserted in place of the
&lt;code&gt;&amp;lt;$REGNAME/&amp;gt;&lt;/code&gt;. Something like &lt;code&gt;xmlsed '((a#)&amp;gt;=a%(b#)&amp;gt;=b) &amp;lt;$b/&amp;gt;&amp;lt;$a/&amp;gt;'&lt;/code&gt;
should swap around two consecutive nodes &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;… well, I very
much hope it does.
&lt;/p&gt;&lt;p&gt;Well, that’s for the good part. Now, there are many things left
unresolved in xmlsed, mainly the fact that its behavior is not
consistent with xmlgrep. In xmlsed, if you replace a node, it also
deletes all of its descendants, whereas when you select a node in
xmlgrep (or in a capture in xmlsed), it comes alone, without any of
its descendants (except attributes). In fact, you can even do a lot
more weird stuffs in selections like selecting and grouping together
non directly connected parts of the tree. To sum up: selection is way
more powerful than replacement, and by default, they don’t really look
alike (and they aren’t quite implemented the same way either…).
&lt;/p&gt;&lt;p&gt;I’m thinking about rationalizing all this, along with the
syntax. Through the many feature additions (as David and I decided we
needed more power), selection patterns have probably become too
convoluted, unnecessarily complex and fine-grained where it doesn’t
matter. However, we should not give up on all that flexibility and
probably bring some over to the replacement mechanism. It’ll also be
a good opportunity to revise the syntax (more XPath-isms?) and migrate
to some automated parser generator, one that can create reentrant and
push-style parsers. I have that &lt;a class="extern" href="http://git.huoc.org/?p=yacc.git;a=summary"&gt;big patch of NetBSD yacc&lt;/a&gt;
doing just that, waiting in my Git repository, but it probably needs
more polishing before I can hope to submit it to the mailing lists to
get myself flamed for doing things on my own, but oh well, we’ll see.
&lt;/p&gt;&lt;p&gt;Now’s not the time for that yet. First, I need to test xmlsed more
thoroughly to ensure that the implementation underlying the syntax
really works.
&lt;/p&gt;&lt;p&gt;Also, other things that I’ll want/need to do once I’m done with xmlsed
(to the point where it runs well enough, doing something useful) are,
in no particular order:
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Rewrite the matching code to be more generic, simple, and readable:
the idea is to split various constructs into more elementary blocks
(I haven’t decided exactly what they would be, but I’m thinking
about it) and have some kind of automaton feed on these
instructions and move accordingly.
&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Improve the data structures in use, both for efficiency reasons and
to make it possible to (theorically) run multiple instances of the
matcher on the same stream of data, hopefully opening the way for
some more concurrency. This last point is almost possible now, but
not quite yet; there are still some matching data that remain in the
node itself while they shouldn’t.
&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Implement the amended algorithm I’ve devised that should (if I’m
correct…) eliminate the &amp;quot;bad&amp;quot; parts of the complexity that make it
theorically inferior to the transducer network algorithm, while
taking somewhat more memory (though the asymptotic bounds on memory
consumption should stay the same).
&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Do something about locales; also handle integration into base, and
packaging for non-NetBSD systems. In short: better integration.
&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Then of course, there’ll also be work on the remaining tools: some
kind of xmljoin is probably next after xmlsed. In which order
I address the various points, I don’t know yet; it’ll probably depend
on what I feel is more urgent, and what I feel like doing first. :)
&lt;/p&gt;

      </content>
    </entry>
  <entry>
      <id>tag:blog.huoc.org,2009:posts/xmltools-after-gsoc</id>
      <link rel="alternate" href="http://blog.huoc.org/xmltools-after-gsoc.html" />
      <link rel="tag:huoc.org,2009:atom/parent" href="tag:blog.huoc.org,2009:posts/gsoc-2009" />
      <title xml:lang="en">
        xmltools: after the GSoC
      </title>
      <published>2009-09-07T17:14:48+02:00</published>
      <updated>2009-09-07T17:14:48+02:00</updated>
      <author>
          <name>Nhat Minh Lê  (rz0)</name>
          </author>
      <category term="api" />
      <category term="atf" />
      <category term="xmltools" />
      <content xml:lang="en" type="html">
         
&lt;p&gt;Since the end of the Summer of Code, I have been busy doing various
miscellaneous tasks around my xmltools, so that it be ready for
inclusion into the NetBSD tree some time in the future. This is
a short summary of what has happened (just so people know that my
project’s not dead, I’ve not vanished, or something along these
lines); please take a look at the code repo for details:
&lt;/p&gt;&lt;dl&gt;&lt;dt&gt;ATF test suites&lt;/dt&gt;&lt;dd&gt;&lt;p&gt;I’ve finally decided to convert my bunch of shell scripts to ATF
tests. These are &lt;em&gt;basic tests&lt;/em&gt; which check the various constructs of
the pattern language. There are 46 tests, three of which are
optional (they are made against rather big data sets that I do not
distribute with the sources). These include a test suite centered
around property list matching. If you have relevant test cases and
would like to see them included, please send them over.
&lt;/p&gt;&lt;/dd&gt;&lt;dt&gt;Library API&lt;/dt&gt;&lt;dd&gt;&lt;p&gt;Some people had been asking me if there would be a library, well,
yes, and I’ve libified all the existing code in anticipation of
that, and also because I wanted to eventually commit somthing clean
to the NetBSD tree. All routines have been made to support proper
error reporting, and expose clean interfaces. The library is divided
into four main sub-namespaces:&lt;sup&gt;α&lt;/sup&gt;
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;code&gt;hhxml_chunk_*&lt;/code&gt; : XML in-memory tree manipulations.
&lt;/li&gt;&lt;li&gt;&lt;code&gt;hhxml_stream_*&lt;/code&gt; : XML streams and (push-style) parsers. The
streams have a hybrid design: they can be used to read nodes one
at a time in a pure stream fashion or to generate partial
in-memory trees. See &lt;code&gt;misc/atomheadlines.c&lt;/code&gt; in the source
distribution for an example.
&lt;/li&gt;&lt;li&gt;&lt;code&gt;hhxml_path_*&lt;/code&gt; : patterns and matching. This one is incomplete and
the old &lt;code&gt;match.c&lt;/code&gt; will need to be converted to fit into the new
API.
&lt;/li&gt;&lt;li&gt;&lt;code&gt;hhxml_pprint_*&lt;/code&gt; : XML pretty printer. The implementation is also
mostly incomplete at this time: it only supports printing to
stdout, well, because the command-line tools don’t require any
more than that.
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The various parts of the API will be completed as I go about writing
the tools themselves, but I wanted to have clean well-defined
interfaces to build onto.
&lt;/p&gt;&lt;div class="Notes"&gt;&lt;p&gt;α: By the way, &amp;quot;hhxml&amp;quot; stands for &amp;quot;(the) very short XML
(library)&amp;quot;.
&lt;/p&gt;&lt;/div&gt;&lt;/dd&gt;&lt;dt&gt;Non-visible changes to the pattern matching engine and plans&lt;/dt&gt;&lt;dd&gt;&lt;p&gt;Well, I’ll write more on this one when I get the time, but I’ve been
looking into other implementations and theories, especially &lt;a class="extern" href="http://spex.sourceforge.net/"&gt;SPEX&lt;/a&gt;,
which boasts algorithms with good complexities (better than what is
currently implemented in my tools). I have plans for improvements to
the engine which I believe, if I’m not wrong, would bring us similar
performance, but that will have to wait, since it is not critically
important, at the moment, in my opinion.
&lt;/p&gt;&lt;/dd&gt;&lt;/dl&gt;&lt;p&gt;That’s all for today!
&lt;/p&gt;

      </content>
    </entry>
  <entry>
      <id>tag:blog.huoc.org,2009:posts/xmltools-whats-done-whats-left</id>
      <link rel="alternate" href="http://blog.huoc.org/xmltools-whats-done-whats-left.html" />
      <link rel="tag:huoc.org,2009:atom/parent" href="tag:blog.huoc.org,2009:posts/gsoc-2009" />
      <title xml:lang="en">
        xmltools: what's done, what's left
      </title>
      <published>2009-08-13T23:44:37+02:00</published>
      <updated>2009-08-13T23:44:37+02:00</updated>
      <author>
          <name>Nhat Minh Lê  (rz0)</name>
          </author>
      <category term="gsoc" />
      <category term="xmltools" />
      <content xml:lang="en" type="html">
         
&lt;p&gt;So the Summer of Code ends soon and it’s time to report what has been
done, and what is still lacking. Obviously, everybody can see there is
no xmlsed yet in the repository, so what have I been doing since
I completed xmlgrep one month ago? What went wrong?
&lt;/p&gt;&lt;p&gt;Well, if you ask me, &lt;em&gt;nothing really went wrong&lt;/em&gt;, I have just deviated
somewhat from my original goals and taken the time to do some things
that weren’t part of the initial plan. And even though I am
disappointed not to have finished, this has brought its share of good
additions. Most of these modifications stem from discussions with
David (Young) about features we’d like to see in my xmltools. Also in
all these three months, I’ve spent about three weeks optimizing the
matcher code.
&lt;/p&gt;&lt;h3&gt;What we have
&lt;/h3&gt;&lt;p&gt;Let’s talk about the good stuff first. Of course, the visible part of
my work is the existence of xmlgrep, but most of the code actually
lives in a common static library. It provides:
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;definitions for patterns and trees;
&lt;/li&gt;&lt;li&gt;utilities: (tree) buffers, bit vectors, symbol look-up, state and
transition caching;
&lt;/li&gt;&lt;li&gt;a string to pattern parser (that I often just call &amp;quot;pattern compiler&amp;quot;);
&lt;/li&gt;&lt;li&gt;an XML parser (which can handle fragments and multi-roots documents,
or work as an interface to Expat, for strict XML parsing);
&lt;/li&gt;&lt;li&gt;an XML pretty printer;
&lt;/li&gt;&lt;li&gt;the code for the matcher itself;
&lt;/li&gt;&lt;li&gt;and a higher-level API used to build the tools themselves.
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The library has managed to remain reasonably small, with less than 5k
lines of code, counting comments and everything, and the core matching
algorithm fits in less than 1.5k lines or so (compare this to the
15k-line &lt;code&gt;xpath.c&lt;/code&gt; from libxml2). It is stream-oriented and care has
been taken so it eats as little memory as possible while remaining
efficient.
&lt;/p&gt;&lt;h4&gt;The pattern matching code
&lt;/h4&gt;&lt;p&gt;The matcher is undeniably the most important and sensible piece of
code in there. It is a &lt;a href="http://blog.huoc.org/./16-xmltools-implementation-automata-and-backtracking.html"&gt;blend of a kind-of-NFA and a backtracking
machine&lt;/a&gt;, which processes one node at the time (NFA) but
is able to do unlimited look-ahead (backtracking). Recent work has
involved adding new advanced constructs to the matcher (besides
forward and look-ahead mechanisms, which were already implemented) and
improving its performances when look-ahead is in use.
&lt;/p&gt;&lt;h5&gt;Performances
&lt;/h5&gt;&lt;p&gt;On the performance side, the main addition is that of a state cache
and pre-computed states created by the pattern compiler (instead of
the matching code itself).
&lt;/p&gt;&lt;p&gt;The state cache (implemented as a hash table) operates at two levels :
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;propagation (from parent to child and sibling to sibling);
&lt;/li&gt;&lt;li&gt;and filtering (deciding whether based on local properties, states
match or not).
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;One effect of using a cache is that bit vectors need not be pooled
anymore and a lot of copying has disappeared in favor of pointer
swapping. Also, relying on states pre-computed by the pattern compiler
has shifted the focus from looping bit by bit through bit vectors to
whole-vector operations, most of which can be auto-vectorized by
a sufficiently smart compiler (ICC on Linux can vectorize a dozen of
these in the whole xmlgrep code, but as I’ve &lt;a href="http://blog.huoc.org/./9-more-benchmarks-the-bitvopt-branch.html"&gt;pointed before&lt;/a&gt;,
GCC auto-vectorizer seems to lag behind its non-use, at least on my
code and setup).
&lt;/p&gt;&lt;p&gt;With these optimizations merged in, xmlgrep now outperforms xmlstarlet
both in terms of speed and memory consumption on my silly little
benchmark (for those who have not been there, it consists in grepping
for a word and its definition in a 50M XML-formatted English
dictionary, and yes, it’s simple-minded), even when using suboptimal
look-ahead subpatterns.
&lt;/p&gt;&lt;p&gt;Indeed, knowledge of the XML format often enables us to write
optimized patterns by &lt;em&gt;anchoring&lt;/em&gt; a more costly part with a basic
predicate. In our case, instead of searching for
&lt;code&gt;p[hw/.=&amp;quot;somestring&amp;quot;]&lt;/code&gt; (the suboptimal case), we could search for
&lt;code&gt;body/p[hw/.=&amp;quot;somestring&amp;quot;]&lt;/code&gt; since &lt;code&gt;p&lt;/code&gt; can only occur inside
&lt;code&gt;body&lt;/code&gt;. This is an improvement since the &lt;code&gt;p[]&lt;/code&gt; part, being
a look-ahead subpattern, is sensibly slower than a forward pattern, so
reducing the number of tentative matches is essential.
&lt;/p&gt;&lt;p&gt;Compared to benchmarks done at the time of the release of xmlgrep,
handling of this suboptimal case is about three times faster now.
&lt;/p&gt;&lt;h5&gt;Features
&lt;/h5&gt;&lt;p&gt;On the feature side, which is probably more interesting for casual
would-be users, I have added grouping, Kleene-like operators, and
subpattern negation, as well as reworked some of the old syntax.
&lt;/p&gt;&lt;p&gt;Subpattern negation is the simplest extension, algthough it’s fairly
powerful. It can be achieved because of backtracking so it’s only
available on (look-ahead) subpatterns, not groups (see
below). Negation lets you write something like &lt;code&gt;{a}!{a/b}&lt;/code&gt; which
selects all elements &lt;code&gt;a&lt;/code&gt; that do &lt;em&gt;not&lt;/em&gt; have a &lt;code&gt;b&lt;/code&gt; element. This can’t
be done with regular patterns because a node can have many children,
and thus, unlike with regexes, something like &lt;code&gt;a/!b&lt;/code&gt; (would-be syntax)
would match some &lt;code&gt;a&lt;/code&gt; node with a child which is not a &lt;code&gt;b&lt;/code&gt;, but there
can be many other children.
&lt;/p&gt;&lt;p&gt;For example:
&lt;/p&gt;&lt;pre&gt;&lt;code&gt;$ xmlgrep -x '{dict}!{dict/@id}/key/.' test.plist
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;will extract only keys from unnamed dictionaries.
&lt;/p&gt;&lt;p&gt;The second addition is Kleene-like operators. Well, sorry about the
name, but it’s just what comes to mind when I use these. They are, as
the name implies, somewhat like their regex counter part, the Kleene
star, in that they match zero or more repetitions of the &lt;em&gt;preceding
predicate&lt;/em&gt;. E.g. &lt;code&gt;a*%b&lt;/code&gt; matches &lt;code&gt;&amp;lt;a/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;a/&amp;gt;...&amp;lt;b/&amp;gt;&lt;/code&gt; but not, say,
&lt;code&gt;&amp;lt;a/&amp;gt;&amp;lt;c/&amp;gt;&amp;lt;d/&amp;gt;...&amp;lt;b/&amp;gt;&lt;/code&gt;. The &lt;code&gt;%%&lt;/code&gt; and &lt;code&gt;//&lt;/code&gt; operators have now become
special cases of the Kleene operators. &lt;code&gt;a%%b&lt;/code&gt; can be written &lt;code&gt;a%**%b&lt;/code&gt;
and &lt;code&gt;a//b&lt;/code&gt; is really &lt;code&gt;a/**/b&lt;/code&gt;, where the first star stands for &amp;quot;any
node&amp;quot; and the second is part of the operator.
&lt;/p&gt;&lt;p&gt;But having these repetition operators limited to repeating a single
node predicate would have been quite boring, and so groups were
born. Groups, written inside parentheses &lt;code&gt;()&lt;/code&gt; (notice these have
replaced the old subpattern syntax which has been moved back to being
&lt;code&gt;{}&lt;/code&gt; from its first days), well, group a pattern fragment, so that it
can be repeated. They also accept alternation so you can write
&lt;code&gt;a%(b|c)*%d&lt;/code&gt;, which just seems natural.
&lt;/p&gt;&lt;p&gt;Let’s take an example:
&lt;/p&gt;&lt;pre&gt;&lt;code&gt;$ cat test.xml
&amp;lt;B/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;b/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;b/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;b/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;b/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;b/&amp;gt;&amp;lt;E/&amp;gt;
$ xmlgrep 'B%(a|b)*%E' test.xml
&amp;lt;E/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Groups are implemented without backtracking so using a group is not
much more costly than using plain forward patterns, and is faster than
using look-ahead subpatterns.
&lt;/p&gt;&lt;p&gt;Yet, that is only true of normal groups. When you use a selection
operator (&lt;code&gt;+&lt;/code&gt;, &lt;code&gt;#&lt;/code&gt;, &lt;code&gt;-&lt;/code&gt; or &lt;code&gt;=&lt;/code&gt;) on a group, it becomes a capturing
group, a.k.a. look-ahead group. A look-ahead group is really
a look-ahead subpattern combined with a group and has a somewhat
subtle behavior. It behaves like a group, with the following
difference: any selection operator inside a look-ahead group will only
apply &lt;em&gt;if the whole group matches&lt;/em&gt;. This can be used to bind selectors
together. Thus we can revisit the classical key/value pair example:
how do we match a key/pair with some properties on the key and some
other on the value? You can always do that with simple look-ahead
subpatterns combined with forward patterns, but groups just make it
more convenient.
&lt;/p&gt;&lt;p&gt;Imagine you want to match all key/value pairs where the key contains
the letter &lt;code&gt;a&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; the value is a string. Easy, with a capturing
group:
&lt;/p&gt;&lt;pre&gt;&lt;code&gt;$ xmlgrep -x '(key[.~&amp;quot;a&amp;quot;]#%string#)+' test.plist
&lt;/code&gt;&lt;/pre&gt;&lt;h4&gt;Other components, reusability
&lt;/h4&gt;&lt;p&gt;Some people have discussed with me the possibility of making the
library reusable. As it is now, it is not yet ready for wider
exposure, but a lot of work has gone into rationalizing its API, in
order to prepare for that. The API needs to be stabilized and the
various components need to be better segregated.
&lt;/p&gt;&lt;p&gt;Currently, the data structures in use are tightly integrated.  For
instance, the parser actually builds a tree with nodes able to hold
matching states. Somehow, I would need to break that dependency, so
that the matcher depends on the parser, but not the opposite, so you
could use the XML parser as a stand-alone component (it has a &lt;a href="http://blog.huoc.org/./15-xmltools-update-new-io-layer-and-further-plans.html"&gt;nice
hybrid design&lt;/a&gt;). Other parts are less affected, though.
&lt;/p&gt;&lt;h3&gt;What’s left to do
&lt;/h3&gt;&lt;p&gt;Now for the bitter part. Obviously, there are still a lot of things
left to do, and I intend to continue working on this project, but
whether it will be integrated into NetBSD at some point in the future
is not for me to decide, though I really hope it will.
&lt;/p&gt;&lt;p&gt;In short, here is the plan:
&lt;/p&gt;&lt;ol&gt;&lt;li&gt;write support for xmlsed templates in the library;
&lt;/li&gt;&lt;li&gt;write xmlsed proper;
&lt;/li&gt;&lt;li&gt;change the/add a high-level API to the library to make it possible
to write xmljoin (because the high level framework only accepts one
input file at this point in time);
&lt;/li&gt;&lt;li&gt;see what more needs to be adjusted for xmljoin to fit into the
framework;
&lt;/li&gt;&lt;li&gt;write xmljoin proper;
&lt;/li&gt;&lt;li&gt;see what needs to be done for xmlsort to fit into the framework;
&lt;/li&gt;&lt;li&gt;write xmlsort.
&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Of course, I have given some thoughts to the &amp;quot;see what needs to be
done&amp;quot; phases, but now is not right time to talk about that.
&lt;/p&gt;&lt;p&gt;Also, at some point, I’ll need to address several shortcomings of the
current code base:
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;add locale support;
&lt;/li&gt;&lt;li&gt;convert my bunch of &amp;quot;regression shell scripts&amp;quot; to proper test suites
(ATF, probably);
&lt;/li&gt;&lt;li&gt;add non-NetBSD build support (currently, I have to hand-compile it
on my Linux box).
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;And I will also want to:
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;clean-up the API and release the core components as a stand-alone
library;
&lt;/li&gt;&lt;li&gt;write documentation on the internals to accompany that library;
&lt;/li&gt;&lt;li&gt;and of course, integrate it into NetBSD, if I am given
permission. :)
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Lastly, about xmlsed, though you can see no &lt;code&gt;xmlsed&lt;/code&gt; directory in the
repository, I can say it is pretty close to completion since the
pieces needed are almost all there in the library (the only one
missing is the template support, and that’s not the hard part).
&lt;/p&gt;&lt;h3&gt;Conclusion
&lt;/h3&gt;&lt;p&gt;The Google Summer of Code is nearing its end, but my summer vacations
are not over yet. I won’t have classes until September, so I plan to
carry on as if nothing much happened at least till the end of
August. After that, I won’t have nearly as much free time, but I’ll do
what I can; let’s not be hasty and wait till then to see how far the
project has gone.
&lt;/p&gt;&lt;p&gt;I will continue to update this blog and the code will be available at
my &lt;a class="extern" href="http://git.huoc.org"&gt;Git repository&lt;/a&gt;, as usual. (Anyway, I have
just recently added comments to my blog so feel free to drop by and
give me your opinion.)
&lt;/p&gt;

      </content>
    </entry>
  <entry>
      <id>tag:blog.huoc.org,2009:posts/xmltools-implementation-automata-and-backtracking</id>
      <link rel="alternate" href="http://blog.huoc.org/xmltools-implementation-automata-and-backtracking.html" />
      <link rel="tag:huoc.org,2009:atom/parent" href="tag:blog.huoc.org,2009:posts/gsoc-2009" />
      <title xml:lang="en">
        xmltools implementation: automata and backtracking
      </title>
      <published>2009-07-20T00:39:10+02:00</published>
      <updated>2009-07-20T00:39:10+02:00</updated>
      <author>
          <name>Nhat Minh Lê  (rz0)</name>
          </author>
      <category term="backtracking" />
      <category term="nfa" />
      <category term="parsing" />
      <category term="xmltools" />
      <content xml:lang="en" type="html">
         
&lt;p&gt;At the moment, I’m implementing submatches in patterns, in preparation
for xmlsed. Since most of what I do draws from formal languages and
automata (except I’m too lazy to write anything formal), I’ve been
looking into &lt;a class="extern" href="http://www.laurikari.net/ville/regex-submatch.pdf"&gt;TNFAs&lt;/a&gt; for inspiration. Seeing how &lt;a class="extern" href="http://laurikari.net/tre/"&gt;TRE&lt;/a&gt; seems to
implement look-ahead without backtracking, I began to ask myself if
such a thing could be done for my pattern matcher. Unfortunately,
I can’t seem to figure out a solution by myself, so I’m blogging
instead. :)
&lt;/p&gt;&lt;p&gt;Basically, the problem is with constructs such as &lt;code&gt;a[b]/c&lt;/code&gt; which match
&lt;code&gt;c&lt;/code&gt;, child of &lt;code&gt;a&lt;/code&gt; only if &lt;code&gt;a&lt;/code&gt; has a child &lt;code&gt;b&lt;/code&gt;. However, when we reach
&lt;code&gt;a&lt;/code&gt;, we don’t know yet whether it has a child &lt;code&gt;b&lt;/code&gt; or not, so we’re in
need of some &lt;em&gt;look-ahead&lt;/em&gt;.
&lt;/p&gt;&lt;p&gt;This is akin to &lt;code&gt;a(?=b)c&lt;/code&gt; in Perl regexes, but as you can already see,
it’s fairly different. Perhaps the most obvious difference is that
a tree has &amp;quot;two dimensions&amp;quot; while a string has only one, so it’s
possible for &lt;code&gt;a&lt;/code&gt; to have two children &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt;, in a tree, while
it’s not possible for &lt;code&gt;a&lt;/code&gt; to be followed by two different characters,
in a string.
&lt;/p&gt;&lt;h3&gt;Backtracking
&lt;/h3&gt;&lt;p&gt;The easiest way to solve this problem is to use backtracking. This is
how the matcher is currently implemented. Even though about everything
else is based solely on states and transitions, the part implementing
look-ahead predicates uses backtracking.
&lt;/p&gt;&lt;p&gt;The algorithm looks like this: everytime we have to match a look-ahead
predicate, we stop, read as many nodes as we need to decide and then
either drop the state or keep it, depending on the outcome.
&lt;/p&gt;&lt;p&gt;This is a simple approach and it works well (i.e. it gives the
expected results). Only, it’s slow. As a comparison, using the
patterns I’ve used before for my benchmarks, the difference between
&lt;code&gt;hw/.=&amp;quot;Ab\&amp;quot;a*cus&amp;quot;&lt;/code&gt; and &lt;code&gt;p[hw/.=&amp;quot;Ab\&amp;quot;a*cus&amp;quot;]#&lt;/code&gt; is the second one is
about 2.66 times slower. This can be hand-optimized by rewriting the
second pattern as &lt;code&gt;body/p[hw/.=&amp;quot;Ab\&amp;quot;a*cus&amp;quot;]#&lt;/code&gt;, with knowledge of the
input format, where &lt;code&gt;p&lt;/code&gt; can only appear as a child of &lt;code&gt;body&lt;/code&gt;. This
cuts the slowness to a mere 1.62 times slower.
&lt;/p&gt;&lt;p&gt;At the same time, it demonstrates a major source of inefficiency in my
backtracking implementation: loose (non-anchored) subpatterns, which
may potentially apply to a great number of nodes yet would fail
immediately for most of them, result in a lot of unwanted short
backtracking segments where the matcher tries the subpattern and then
immediately backtracks after failing, thus still examining the node
twice.
&lt;/p&gt;&lt;h3&gt;Product automata
&lt;/h3&gt;&lt;p&gt;In theory, a pattern such as &lt;code&gt;a(?=b)c&lt;/code&gt; matches if &lt;code&gt;a&lt;/code&gt; is followed by
&lt;em&gt;both&lt;/em&gt; &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt; so another natural idea would be to use a product
automaton which transitions between couples of states: if we end up in
two final states, then we win.
&lt;/p&gt;&lt;p&gt;Only, this works well because we want to know whether the string
matches as a whole. However, with an XML pattern such as &lt;code&gt;a[b]/c&lt;/code&gt;, we
need to identify &lt;em&gt;which&lt;/em&gt; node matched the &lt;code&gt;c&lt;/code&gt; part. This is where
things start to get complicated.
&lt;/p&gt;&lt;p&gt;Now we do know how to extract a submatch from a string with tagged
transitions. So, great, does this apply to our XML matcher? The answer
is &amp;quot;no&amp;quot;, as far as I know (I’d be glad if someone proved me wrong,
though). Tagged transitions are good at extracting &lt;em&gt;one&lt;/em&gt; submatch, the
last one, but in our case, the problem is we want &lt;em&gt;all&lt;/em&gt; occurrences of
&lt;code&gt;c&lt;/code&gt;, not just the last one.
&lt;/p&gt;&lt;p&gt;Keeping this information globally associated with the state is not an
option since there may well be multiple instances of the look-ahead
predicate trying to match at the same time. But trying to
differentiate between two states based on the matching position is not
an option either, because that could easily lead to linear memory
usage. Consider, for example, the following XML fragment:
&lt;/p&gt;&lt;pre&gt;&lt;code&gt;&amp;lt;a/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;a/&amp;gt;&amp;lt;a/&amp;gt;...&amp;lt;b/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Trying to match &lt;code&gt;{a%%b}&lt;/code&gt; (or &lt;code&gt;(a%%b)&lt;/code&gt; using the current syntax in the
master branch; sorry, I reverted to the old syntax because &lt;code&gt;()&lt;/code&gt; will
be used for groups) against that would result in a match at every &lt;code&gt;a&lt;/code&gt;
element. Hence, keeping the position as a state property is not
a solution, since we’d have to distinguish between as many &lt;code&gt;a&lt;/code&gt; nodes
as there are in the input data.
&lt;/p&gt;&lt;h3&gt;Optimizing the backtracking method
&lt;/h3&gt;&lt;p&gt;Although the implementation of the matcher has undergone many changes
in order to improve performances recently, mostly so as to accommodate
new features while keeping a reasonable running time, I believe
there’s still room for much improvement, maybe in the form of
algorithmic enhancements rather than implementation optimizations.
&lt;/p&gt;&lt;p&gt;Some of my more realistic ideas are:
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;Read one element ahead to reject some subpatterns without entering
a backtracking context.
&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Try to predict the outcome of a look-ahead predicate and compute the
more likely transitions along with the predicate itself, saving on
backtracking (except to prune the tree) in case we’re right (but
incuring additional recomputations if we’re wrong).
&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Also, this is not directly related to the backtracking approach, but
I’m thinking of putting more information in the state cache; at the
moment, we’re only caching &amp;quot;raw&amp;quot; transitions which account for the old
state set and the direction (i.e. successor or child); we could try to
cache local information as well, which would move us one step closer
to the way NFA transition functions get cached to get DFAs.
&lt;/p&gt;&lt;h3&gt;Conclusion: do we need performance that badly?
&lt;/h3&gt;&lt;p&gt;Well, that’s a good question. And the answer is &amp;quot;we probably
don’t&amp;quot;. This is kind of my hobby, so don’t get the wrong idea. The
whole thing is not that slow. On my simple example, xmlgrep was only
0.5s slower than libxml-powered xmlstarlet, taking less than six
seconds to look up a word definition in a 58M dictionary, on my 2x2GHz
Core&amp;#xA0;2, while using 100 times less memory.
&lt;/p&gt;&lt;p&gt;I doubt anybody is going to use xmlgrep to look up words in
dictionaries anytime soon; such a specialized task would best be
served using an indexed collection of some sort, although I should
mention that there are XML formats which are more suited to being
parsed using my tools than others, so converting to a more appropriate
(e.g. annotated) XML format using xmlsed then searching with xmlgrep
could well be a viable solution too (given the conversion is one-time
or infrequent compared to searches). More on that once xmlsed is out!
&lt;/p&gt;

      </content>
    </entry>
  <entry>
      <id>tag:blog.huoc.org,2009:posts/xmltools-update-new-io-layer-and-further-plans</id>
      <link rel="alternate" href="http://blog.huoc.org/xmltools-update-new-io-layer-and-further-plans.html" />
      <link rel="tag:huoc.org,2009:atom/parent" href="tag:blog.huoc.org,2009:posts/gsoc-2009" />
      <title xml:lang="en">
        xmltools update: new I/O layer and further plans
      </title>
      <published>2009-07-08T15:45:28+02:00</published>
      <updated>2009-07-08T15:45:28+02:00</updated>
      <author>
          <name>Nhat Minh Lê  (rz0)</name>
          </author>
      <category term="io" />
      <category term="xml" />
      <category term="xmltools" />
      <content xml:lang="en" type="html">
         
&lt;p&gt;It has been nearly two weeks since my last update on my plans for
xmlsed. Since that time, David and I have come to an agreement on the
(hypothetical) syntax and my hybrid model has been retained.
&lt;/p&gt;&lt;h3&gt;Rewriting the I/O module
&lt;/h3&gt;&lt;p&gt;Although I now have a fair idea of what xmlsed should look like and
how to implement it, some elements introduced after the recent
discussions with my mentor required changes to the existing code.
&lt;/p&gt;&lt;p&gt;In particular, the new template syntax required an XML parser able to
handle partial XML documents. Besides, I wanted some syntactic sugar
on top of XML to make writing templates easier. Expat being
a well-behaved XML parser is quite strict about the syntax and there
is no easy way to work around that. Though I did spend some days
trying, at first, I eventually gave up, as it proved hard to write,
let alone maintain in the future.
&lt;/p&gt;&lt;p&gt;At the same time, I began to realize the original I/O abstraction
I designed was showing its limits. I initially wanted to be able to
read and write multiple tree representation formats, but in the end,
the need to fully support XML, with all its specifics (doctypes, PIs,
etc.), has convinced me to drop the idea and focus on XML alone.
&lt;/p&gt;&lt;p&gt;All these reasons led me to rewrite the I/O layer (almost)
completely. This work is being committed to yet another temporary
branch: &lt;code&gt;xmlpush&lt;/code&gt;.
&lt;/p&gt;&lt;p&gt;Maybe the most visible change is that multi-backend support was
dropped and the I/O module now consists of only two drivers: the
Expat-based parser (the &lt;i&gt;strict&lt;/i&gt; parser) and a home-made &lt;i&gt;loose&lt;/i&gt;
parser.
&lt;/p&gt;&lt;p&gt;All tools now support two modes: the strict mode (&lt;code&gt;-s&lt;/code&gt; flag) and the
default loose mode.
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;In the &lt;strong&gt;strict mode&lt;/strong&gt;, compliance with the XML standard is
a priority: all extensions are disabled, including multi-root and
partial documents (which was implemented with Expat as an ugly
kludge before, so it was removed), the XML prolog is honored (the
specified encoding is used and entity declarations are parsed), and
stricter rules are enforced (e.g. in names).
&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;In the &lt;strong&gt;loose mode&lt;/strong&gt;, extensions are supported and the XML prolog
is ignored (instead, we use the system locale).
&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;The encoding issue
&lt;/h3&gt;&lt;p&gt;Although I’ve just stated how encodings will be handled, at the
moment, I have not written any code to suppor that yet. Actually, it’s
not a simple matter.
&lt;/p&gt;&lt;p&gt;First of all, the XML standard mandates support of Unicode. For one
thing, XML character entities (&lt;code&gt;&amp;amp;xxxx;&lt;/code&gt;) are references to character
codes from the Unicode tables. This makes support for Unicode pretty
much mandatory.
&lt;/p&gt;&lt;p&gt;In order to handle this, the Expat people have chosen to serve all
data as UTF-8 (or UTF-16, depending on the build-time configuration),
no matter what input encoding is used. This should have given you
a hint: Expat has its own encoding conversion engine.
&lt;/p&gt;&lt;p&gt;Now, even though XML documents can specify an encoding, and so it
would be alright to always use UTF-8, that is not an acceptable
solution in the real world: Unix users expect their programs to comply
to the current locale. But Expat does not care one way or the other
about, or integrate with, the locale system.
&lt;/p&gt;&lt;p&gt;Fortunately, NetBSD has &lt;b&gt;iconv(3)&lt;/b&gt; (though neither FreeBSD nor
OpenBSD does, apparently, which will be a problem in making my
programs portable; there seems to be an &lt;a class="extern" href="http://kovesdan.org/doc/en_US.ISO8859-1/soc2009/soc2009.html"&gt;ongoing GSoC effort to port
NetBSD libiconv to FreeBSD&lt;/a&gt; though), so I plan to run
everything through that on input and again on output. Sounds like
a waste of resources?  Well, don’t blame me.
&lt;/p&gt;&lt;p&gt;However, that’s not all: there is also the loose parser. Since this
one was written by me, and I had no reason to force UTF-8 everywhere,
it does no conversion and assumes all sources are in the current
locale. The only problem will be for character entities (not currently
implemented), but I intend to localize the Unicode-to-locale
transformation to these, since it is costly.
&lt;/p&gt;&lt;h3&gt;Remember the push-style vs event-driven parser debate?
&lt;/h3&gt;&lt;p&gt;Well, it wasn’t really a debate, but maybe some of you remember that
I posted some weeks ago a ticket on this blog about how I grew
dissatisfied with the rigid event-driven behavior of Expat and wanted
a push-style parser.
&lt;/p&gt;&lt;p&gt;I took the occasion this time to make my wish come true (almost) and
the new I/O design is &lt;em&gt;mostly&lt;/em&gt; push-style. I say &lt;i&gt;mostly&lt;/i&gt; because I’ve
made some compromises in order not to impair performances.
&lt;/p&gt;&lt;p&gt;At first, I thought about having the parser run on a fragment of input
text and build a queue of events to be fed to the application one by
one, on a on-demand basis. But then I realized I could just use the
internal tree directly as my &lt;i&gt;event queue&lt;/i&gt;, and have the application
read that. Since we have support for look-ahead in the matching engine
(and hence need to keep a &lt;em&gt;partial&lt;/em&gt; internal tree in any case) and
most tools will use that, this effectively moved the tree building
code from the matcher to the I/O module, at the same time eliminating
direct event responses (a bit less than 500 lines of code). This last
point needs some clarification: sure we do use a little more memory
(but honestly that just doesn’t show) since we build the tree first
and only then process it, but this is limited to how many nodes can be
represented in one read buffer, which means it’s mostly
insignificant. But we have gained what I think is far more important:
an unified implementation of the matcher (which &lt;em&gt;is&lt;/em&gt; the most
sensitive part).
&lt;/p&gt;&lt;h3&gt;Further plans
&lt;/h3&gt;&lt;p&gt;At the moment, the new parser does not support things such as entity
declarations and I am still pondering whether to write code for that
or not. In any case, it would be a good idea to have a &lt;b&gt;xmlcat(1)&lt;/b&gt;
utility which fully supports XML, including external entities, with
the ability to fetch and include external documents (&lt;b&gt;fetch(3)&lt;/b&gt;?)
and substitute references. But let’s leave this discussion for another
time.
&lt;/p&gt;&lt;p&gt;As for the locale support, I think it will come after xmlsed is
somewhat ready (i.e. as xmlgrep is right now). So for now, some more
testing needs to be done on the new I/O components, and when this is
done, I will resume work on xmlsed proper.
&lt;/p&gt;

      </content>
    </entry>
  </feed>
