posted on 2014-09-05, 15:30authored byMohammad Kamel Ahmad Kharabsheh
XML(Extensible Markup Language) is a language used in data representation and
storage, and transmission and manipulation of data. Excessive memory consumption
is an important challenge when representing XML documents in main memory.
Document Object Model (DOM) APIs are used in a processing level that provides
access to all parts of XML documents through the navigation operations. Although
DOM serves as a a general purpose tool that can be used in different applications,
it has high memory cost particularly if using na¨ıve. The space usage of DOM has
been reduced significantly while keeping fast processing speeds, by use of succinct
data structures in SiXDOM [1]. However, SiXDOM does not explore in depth XML
data compression principles to improve in-memory space usage. Such XML data
compression techniques have been proven to be very effective in on-disk compression
of XML document. In this thesis we propose a new approach to represent XML
documents in-memory using XML data compression ideas to further reduce space
usage while rapidly supporting operations of the kind supported by DOM.
Our approach is based upon a compression method [2] which represents an XML
document as a directed acyclic graph (DAG) by sharing common subtrees. However,
this approach does not permit the representation of attributes and textual data,
and furthermore, a naive implementation of this idea gives very poor space usage
relative to other space-efficient DOM implementations [1]. In order to realise the
potential of this compression method as an in-memory representation, a number
of optimisations are made by application of succinct data structures and variablelength
encoding. Furthermore, a framework for supporting attribute and textual
data nodes is introduced. Finally, we propose a novel approach to representing the
textual data using Minimal Perfect Hashing(MPH).
We have implemented our ideas in a software library called DAGDOMand performed
extensive experimental evaluation on a number of standard XML files. DAGDOM
yields a good result and we are able to obtain significant space reductions over existing
space-efficient DOM implementations (typically 2 to 5 times space reduction),
with very modest degradations in CPU time for navigational operations.