G V Subramanyam M S Thesis Abstract

Efficient binary structural join algorithms on XML data

M S Thesis, by G V Subramanyam, July 2005.

XML is the widely used format for representing and exchanging semi-structured data

on the web. XML queries naturally specify hierarchical patterns to select relevant

parts of an XML document. These patterns have a sequence of selection predicates

connected by operators representing structural relationships (either ancestor-descendant

or preceding-following sibling). In this context, the operation of structural

join involves discovering pairs of nodes that are structurally related from the cross

product of two sets of nodes.

Current XPath processing algorithms focus more on solving queries containing

ancestor-descendant relationship and pay little attention to the equally important

preceding-following sibling relationship. In this thesis, we present novel solutions for

solving queries containing structural join operators that specify ancestor-descendant

or sibling axes. For solving ancestor-descendant relationship, we present a solution

that involves maintaining an extra piece of information, called nDesc, for each element

in the XML document tree. The parameter nDesc specifies the number of descendants

having same tag name as that of the element. Using nDesc, one can effectively skip

most of the ancestors and descendants that do not participate in the join.

We present anti-structural join, which is new to the database literature, for processing

queries that specify don't-occur patterns. These patterns are specified using

not boolean function in XPath. We provide approaches for constructing XML subtrees

of the final output nodes that are obtained when a query is evaluated.

We also present a linear time algorithm for processing queries containing sibling

axes by using sibling-list at every level of the XML document tree. A sibling-list,

at any instance, holds set of elements that are under common parent. During the

join, we claim that each element in the element lists is visited once. An extensive

experimental evaluation shows that our join algorithms perform significantly better

than the currently existing join algorithms.