posted on 2009-11-20, 12:23authored byO'Neil Davion Delpratt
Extensible Markup Language (XML) is a multi-purpose text-based format, used for storage, transmission and manipulation of data. XML documents are often held in main memory and processed via standard interfaces such as the Document Object Model (DOM). However, XML is inherently verbose, and the in-memory representation of XML documents by existing DOM implementations is up to ten times larger than the file size. This is a problem for machines with limited memory, such as mobile devices, where processing even moderately-sized XML documents requires more memory than is available. We focus on in-memory representations of XML documents for situations where space is limited and where rapid processing time is important. We propose a compact representation of XML documents that uses succinct or highly space-efficient data structures, that allows XML processing to be executed efficiently.
Succinct data structures use space that approaches the information-theoretic lower bound on the space that is required to represent the data, and support operations upon the representation in constant time. In the context of XML documents, we study and improve succinct representations for ordinal trees by adding features that make them more suitable for use in XML documents. We explore fast and space-efficient representations of the textual data of XML documents. Our basic approach is to concatenate all the textual data in the XML document into a single string, and extract individual textual values by computing the appropriate substring of the concatenated string. Computing the substring requires us to store offsets into the text. The storage of the offsets is surprisingly expensive, if stored naively (as 32 or 64-bit integer values).
We give a succinct representation and provide data-aware representations (adapted from work on inverted indices in information retrieval), and show their close connection. We describe Succinct DOM (SDOM), which is a DOM implementation that has low, stable and predictable memory usage. We show, via an experimental evaluation, that SDOM is extremely fast. A variant, SDOM-CT, applies BZip-based compression to textual and attribute data, and its space usage is comparable with “query-friendly” XML compressors. Some of these compressors support navigation and/or querying (e.g. subpath queries) of the compressed file. SDOM-CT does not support querying directly, but remains extremely fast: it is several orders of magnitude faster for navigation than query-friendly XML compressors that support navigation (and only a few times slower than popular DOM implementations such as the Apache Foundation’s Xerces-C).