Which Qt XML parser are you?

Do you like getting the whole story all at once? Or are you more the kind of person that wants to hear things bit by bit, and think about each piece as it comes, and respond a little at a time?

That’s essentially the question to ask when choosing an XML parser to use in Qt. There’s the QDomDocument, which parses your XML document into a fully-navigable document object model (DOM) in memory that you can query, traverse, and change. And then there’s the QXmlStreamReader parser, which reads from a stream and lets you make incremental queries for entities and attributes as you go along. Finally, there’s also the QXmlSimpleReader parser, a simple SAX-like parser that can read the whole document in one shebang or incrementally and calls callbacks on objects you register when the parser encounters entities and attributes.

Which should you use in your project?

There’s a tradeoff between the simplicity of your code and the amount of memory you’re willing to sacrifice at run-time. DOM-based parsing as provided by QDomDocument is awesome, but comes at a cost—the parser has to churn over the entire document as it creates the DOM, and essentially mirrors the document in RAM in with its DOM. The streaming parser is great for larger documents, but requires that you build more structure into your parser, using that structure to mirror the internal hierarchy in your XML document.

For years, I’ve been an avid proponent of using stream-based parsers—readers of Dan’s and my book will remember us exhorting you to use stream-based parsers for mobile, because you can read the data as it comes over the stream and build your document representation in core as efficiently as possible. In a past life, I actually wrote a streaming XML parser for Qualcomm BREW, again adhering to the doctrine that you should hold as little in memory as possible, and process what comes in over the air (or off the wire) as quickly as possible, rather than buffering and processing. If you’re parsing a very large document, or if you’re working on very small devices (think legacy Symbian or Maemo), streaming is definitely the way to go.

While I still think for the large majority of applications that’s probably still the best approach, I had the opportunity recently to use a DOM parser in a commercial setting on a project, and wow, was it easy! In my case, I was parsing small documents—typically well under fifty kb—walking the DOM quickly and reducing their structure to a compact in-memory representation and discarding the DOM. I could have done the same thing with a streaming parser, but here the time it took me to write the code was of the essence; we wanted to get something up and running as quickly as possible. And this was a fast way to do it.

Someday, I’d like to write a benchmark app that uses all three methods to parse a longish (say, 250 kb) document and report on both performance and heap usage. It would also be interesting to see download-and-parse times for the three approaches over a benchmark network like 3G cellular, because depending on the application and document size, network performance becomes a factor, too. Until I can point at hard numbers, though, I’d advise you to consider both the DOM and streaming approach, and choose the approach that best balances network throughput, memory usage, and the amount of time and effort it takes you to get your project off the ground.

A forest of trees!

I recently needed a spatial index for some data I was working with at work, and decided to rough out both a quad tree and a k-d tree implementation for Qt.

Indexing data spatially is important if you have a lot of objects organized spatially, and need to do queries by an object’s location, rather than a single-dimensional key such as an ID. Qt has a framework for doing this in the context of the QGraphicsFramework, but doesn’t have generalized container classes that index along multiple dimensions — think, for example, of a multi-dimensional QMap.

I implemented the quad tree largely from scratch, looking at wikipedia and a couple implementations such as the one here. My quad tree uses C++ templates to abstract the data type, so you can stick anything in it (as long as it can be put in a QList; the underlying store uses QList for object management). Quad trees come up a lot in collision detection in computer graphics and gaming, where you want to know if an object is close enough to another object to collide, but don’t want to search through lots of objects and do collision detection on each object. It works by recursively subdividing the plane containing the objects, performing additional subdivisions each time a region contains more than a specific number of objects.

I cheated when it came to the k-d tree, because there’s a great C implementation on code.google.com that did precisely what I needed. Because I needed to store QVariants in the k-d tree, I just wrapped the entire C library in a Qt class, making the C code static to my implementation. It’s not quite as sexy as a templated version, because it’ll only store objects you can wrap in a QVariant, but for my application it was enough, and in the future if I have a different kind of object I want to store, it’s easy enough to make it storable as a QVariant, too. k-d trees are more general than quad trees, operating in more dimensions than two as they subdivide the search space.

Neither of these implementations are truly “high performance” — for clarity and ease of debugging, they use Qt’s data structures under the hood, which are fairly performant but not as high performance as managing all the data out of the application’s heap using pointers alone. This is especially true for the find methods, which return copies of the data stored in the trees rather than pointers to the actual data. This is, I believe, correct in my case, where I don’t want to expose the internal structure of my data, and where I’m using Qt classes that use references and copy-on-write semantics anyway. However, if you’re looking to squeeze every last bit of performance from your spatial index, you probably don’t want to start with the code you’ll find here. Nonetheless, for general computational tasks, these should serve you quite well.

You can find both of the implementations and a little standard I/O app for using them here.