
6.3. Including Subdocuments

In XML, external parsed entities are used to merge one file into another. This mechanism is used to partition larger XML documents (such as this book) into smaller ones (such as this chapter). Such external entities aren't quite the same as actual XML documents. They do not have DTDs; they have zero or more top-level elements instead of exactly one; and they may begin with text declarations rather than XML declarations.[27]

[27] These might show only the text encoding: <?xml encoding='Big5'?> is a legal text declaration. To be an XML declaration, it would need to include a version first, like version='1.0'; it's good practice to include both. Documents that use encoding declarations with no version number cannot be opened as XML directly. They can only be included in XML documents by way of an entity.

Those entities are in some ways awkward to use. Some people don't like to use DTDs, and their tools might not let them declare and create references to such entities. In any case, DTDs add the requirement that such entities be declared in advance. When you're building big documents out of little ones, widely spreading such knowledge can be undesirable. It's often easier to keep a local reference accurate than to update the remote declarations it depends on. Also, documents nest inside others, and small changes nested inside one document could force updates to many DTDs if the document is included in several others. In short, external parsed entities aren't as easy or natural to use as the #include "filename" syntax widely known to C/C++ developers. This is often viewed as a problem.
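The declare-in-advance requirement is easy to see in a small sketch. The document below must declare the chapter6 entity in its DTD before it can reference it, and the entity text starts with a text declaration and has two top-level elements. The class name EntityDemo, the entity name, and the urn:x-example: system identifier are assumptions made for illustration; the entity is served from memory by an EntityResolver so that no files or network access are needed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's examples.
class EntityDemo {
    // The base document has to declare the entity before using it.
    static final String BASE =
        "<!DOCTYPE book [\n"
      + "  <!ENTITY chapter6 SYSTEM 'urn:x-example:chapter6'>\n"
      + "]>\n"
      + "<book>&chapter6;</book>";

    // An external parsed entity: text declaration, two top-level elements.
    static final String CHAPTER =
        "<?xml encoding='UTF-8'?><sect>one</sect><sect>two</sect>";

    // Returns the element names in document order, separated by spaces.
    static String elementNames() throws Exception {
        final StringBuilder names = new StringBuilder();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver((publicId, systemId) -> {
            // Assumption: every external entity request gets CHAPTER.
            InputSource src = new InputSource(new StringReader(CHAPTER));
            src.setSystemId(systemId);
            return src;
        });
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (names.length() > 0)
                    names.append(' ');
                names.append(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(BASE)));
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames());
    }
}
```

The point of the sketch is the coupling: renaming or relocating the chapter means editing the declaration in every document that includes it, not just the reference.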

The response is obvious: use some other part of XML syntax to define a more natural inclusion construct. The W3C has a draft called XInclude, but as of the most current draft it doesn't quite do this. XInclude uses element syntax, which is fine, but it doesn't just define a simple and familiar inclusion mechanism. XInclude supports the XPointer superset of XPath to embed almost arbitrary fragments of XML text. In effect, W3C's XInclude is a generalized linking model, and one which depends on significant infrastructure. The model hasn't met with widespread acceptance, and in any case is too complex to use for an example here. That's really too bad; normal inclusion is a strict streaming model, ideal for implementing with SAX, and the model of including fragments is exotic pretty much everywhere except within the linking community.

Here we show how to implement a simplified variant of XInclude. Because it doesn't use XPointer, it includes only whole documents, which is enough to replace many uses of external entities. To emphasize the difference, we'll use a different syntax:

<?XInclude http://www.example.com/data/included.xml?>
    <!-- instead of what XInclude uses: -->
<xi:include
	xmlns:xi="http://www.w3.org/2001/XInclude"
	href="http://www.example.com/data/included.xml"
	parse='xml'
	encoding='euc-jp'
	>
    content of xi:include is ignored,
    the whole element gets replaced
</xi:include>
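To see what a SAX application receives for the processing instruction form, the following sketch parses a small in-memory document and captures the target and data arguments passed to processingInstruction(). The class name PIDemo and the sample document are assumptions made for illustration; only standard JAXP and SAX2 calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Example 6-9.
class PIDemo {
    // Returns "target|data" for the first processing instruction seen,
    // or null if the document contains none.
    static String firstPI(String xml) throws Exception {
        final String[] seen = new String[1];
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void processingInstruction(String target, String data) {
                if (seen[0] == null)
                    seen[0] = target + "|" + data;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>"
            + "<?XInclude http://www.example.com/data/included.xml?>"
            + "</doc>";
        System.out.println(firstPI(xml));
    }
}
```

The target is the name right after the opening delimiter, and everything after the separating whitespace arrives as a single data string, so a filter only has to compare the target and treat the data as a URI.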

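This example highlights several different SAX2 mechanisms. It uses the XMLFilterImpl class in two different modes and pays careful attention to the data it passes through. The outer XI class is a full filter: it can be handed events as a consumer, or it can drive a parent parser as a producer. The inner Scrubber class is used only as an event consumer, cleaning up the event stream from each included document before passing it on.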

The code in Example 6-9 takes a few shortcuts but implements the essential inclusion functionality.

Example 6-9. XInclude processing instruction

import java.io.IOException;
import java.net.URL;
import java.util.Vector;
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

public final class XI extends XMLFilterImpl
implements LexicalHandler, Locator
{
    // Act as a proxy for whatever the current locator is.
    private Locator		locator;

    // to avoid circular inclusion
    private Vector		pending = new Vector (5, 5);

    private LexicalHandler	lexicalHandler;

    private static String	lexicalID =
	    "http://xml.org/sax/properties/lexical-handler";

    public void setDocumentLocator (Locator l)
    {
	locator = l;
	super.setDocumentLocator (this);
    }

    public String getSystemId ()
	{ return (locator == null) ? null : locator.getSystemId (); }
    public String getPublicId ()
	{ return (locator == null) ? null : locator.getPublicId (); }
    public int getLineNumber ()
	{ return (locator == null) ? -1 : locator.getLineNumber (); }
    public int getColumnNumber ()
	{ return (locator == null) ? -1 : locator.getColumnNumber (); }

    // Inner Filter Class: manage the current locator,
    // and filter out events that would be incorrect to report
    private class Scrubber extends XMLFilterImpl implements LexicalHandler
    {
	Locator		savedLocator;
	LexicalHandler	next;

	Scrubber (Locator l, LexicalHandler n)
	    { savedLocator = l; next = n; }

	// maintain proxy locator
	// only one startDocument()/endDocument() pair per event stream
	public void setDocumentLocator (Locator l)
	    { locator = l; }
	public void startDocument ()
	    { }
	public void endDocument ()
	    { locator = savedLocator; }
	
	private void reject (String message) throws SAXException
	    { throw new SAXParseException (message, locator); }

	// only the DTD from the base document gets reported
	public void startDTD (String root, String publicId, String systemId)
	throws SAXException
	    { reject ("DTD: " + systemId); }
	public void endDTD ()
	throws SAXException
	    { reject ("DTD"); }
	// ... so this should never happen
	public void skippedEntity (String name) throws SAXException
	    { reject ("entity: " + name); }

	// since we rejected DTDs, only built-in entities can be reported
	public void startEntity (String name)
	throws SAXException
	    { next.startEntity (name); }
	public void endEntity (String name)
	throws SAXException
	    { next.endEntity (name); }

	// other lexical events cause no worries
	public void startCDATA () throws SAXException
	    { next.startCDATA (); }
	public void endCDATA () throws SAXException
	    { next.endCDATA (); }
	public void comment (char buf[], int off, int len) 
		throws SAXException
	    { next.comment (buf, off, len); }
    }

    // count is zero in the document prologue and epilogue
    private int		count;

    public void startElement (String u, String l, String q, Attributes a)
    throws SAXException
	{ count++; super.startElement (u, l, q, a); }

    public void endElement (String u, String l, String q)
    throws SAXException
	{ --count; super.endElement (u, l, q); }
    
    public void startDocument () throws SAXException
    {
	// use the null-safe proxy method: parsers need not supply a locator
	pending.addElement (getSystemId ());
	super.startDocument ();
    }

    
    public void endDocument () throws SAXException
	{ pending.clear (); super.endDocument (); }

    // handle  processing instructions
    public void processingInstruction (String target, String data)
    throws SAXException
    {
	if ("XInclude".equals (target)) {
	    // this should do full XML base processing
	    // instead we just handle relative and absolute URLs
	    try {
		URL		url = new URL (getSystemId ());

		url = new URL (url, data.trim ());
		data = url.toString ();
	    } catch (Exception e) {
		throw new SAXParseException (
		    "XInclude, can't use URI: " + data, locator, e);
	    }
	    xinclude (data);
	} else
	    super.processingInstruction (target, data);
    }

    // this might be called from startElement too
    private void xinclude (String uri)
    throws SAXException
    {
	XMLReader	helper;
	Scrubber	scrubber;

	if (count == 0)
	    throw new SAXParseException (
		    "XInclude, illegal location", locator);
	if (pending.contains (uri))
	    throw new SAXParseException (
		    "XInclude, circular inclusion", locator);

	// start with another parser acting just like us
	helper = XMLReaderFactory.createXMLReader ();
	helper.setEntityResolver (this);
	helper.setErrorHandler (this);

	// Set up the proxy locator and inner filter.
	scrubber = new Scrubber (locator, this);
	locator = null;
	scrubber.setContentHandler (this);
	helper.setContentHandler (scrubber);
	helper.setProperty (lexicalID, scrubber);

	// we INTEND to discard DTDHandler and DeclHandler events

	// Merge the included document, except its DTD
	try {
	    pending.addElement (uri);
	    helper.parse (uri);
	} catch (java.io.IOException e) {
	    SAXParseException	err;
	    ErrorHandler	handler;
	    
	    err = new SAXParseException (uri, locator, e);
	    handler = getErrorHandler ();
	    if (handler != null)
		handler.fatalError (err);
	    throw err;
	} finally {
	    pending.removeElement (uri);
	}
    }

    // LexicalHandler interface
    public void startEntity (String name)
    throws SAXException
	{ if (lexicalHandler != null) lexicalHandler.startEntity (name); }

    public void endEntity (String name)
    throws SAXException
	{ if (lexicalHandler != null) lexicalHandler.endEntity (name); }
    
    public void startDTD (String root, String publicId, String systemId)
    throws SAXException
	{ if (lexicalHandler != null)
	    lexicalHandler.startDTD (root, publicId, systemId); }

    public void endDTD () throws SAXException
	{ if (lexicalHandler != null) lexicalHandler.endDTD (); }
    public void startCDATA () throws SAXException
	{ if (lexicalHandler != null) lexicalHandler.startCDATA (); }
    public void endCDATA () throws SAXException
	{ if (lexicalHandler != null) lexicalHandler.endCDATA (); }
    public void comment (char buf[], int off, int len) throws SAXException
	{ if (lexicalHandler != null) lexicalHandler.comment (buf, off, len); }

    // so this works as a "consumer"
    public void setProperty (String uri, Object handler)
    throws SAXNotRecognizedException, SAXNotSupportedException
    {
	if (lexicalID.equals (uri))
	    lexicalHandler = (LexicalHandler) handler;
	else
	    super.setProperty (uri, handler);
    }

    // so this works as a "producer"
    public void parse (InputSource in)
    throws SAXException, IOException
    {
	XMLReader	parent = getParent ();

	if (parent != null)
	    parent.setProperty (lexicalID, this);
	super.parse (in);
    }
}

The most significant shortcut in this code is that, to simplify the example, XML Base isn't supported. That's easily fixed using the technique shown earlier, in Example 5-1. Similarly, the namespace reporting and validation modes of the default parser are assumed to be OK; they should be copied or specified as part of this event consumer's API.
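As a rough illustration of what base URI handling involves, the sketch below resolves an inclusion reference against a document URI, optionally adjusted by one xml:base value. It is not the technique of Example 5-1, and it uses java.net.URI rather than the java.net.URL class used in Example 6-9; the class name BaseDemo, the method name resolve, and the sample URIs are assumptions made for illustration. A real implementation would track xml:base attributes on a stack as elements start and end.

```java
import java.net.URI;

// Hypothetical helper; a sketch of base URI resolution only.
class BaseDemo {
    // documentUri: absolute URI of the including document
    // xmlBase:     value of an in-scope xml:base attribute, or null
    // ref:         the reference found in the inclusion instruction
    static String resolve(String documentUri, String xmlBase, String ref) {
        URI base = URI.create(documentUri);
        if (xmlBase != null)
            base = base.resolve(xmlBase);  // xml:base may itself be relative
        // Absolute references pass through; relative ones use the base.
        return base.resolve(ref.trim()).toString();
    }

    public static void main(String[] args) {
        String doc = "http://www.example.com/data/book.xml";
        System.out.println(resolve(doc, null, "included.xml"));
        System.out.println(resolve(doc, "chapters/", "ch6.xml"));
    }
}
```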

Merging SAX event streams from two different sources is quite simple, except for DTD-related information. One basic problem is structural: DTD events may be reported only at the beginning of a SAX event stream, and the chance to do that has been lost by the time an included document is processed. Another basic problem is semantic: the events from the two sources could easily conflict with each other. Neither of those problems can be solved with a pure stream processing model, unless the included documents use the same DTD as the base document. Accordingly, this example treats DTD events from included streams as errors.
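The structural constraint can be observed directly. This sketch records the order of a few events for a document with an internal DTD subset, using DefaultHandler2 as both content handler and lexical handler; the class name DtdOrderDemo and the sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; not part of Example 6-9.
class DtdOrderDemo {
    // Returns the names of selected events in the order they were reported.
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDocument() { log.add("startDocument"); }
            @Override
            public void startDTD(String name, String publicId,
                                 String systemId) { log.add("startDTD"); }
            @Override
            public void endDTD() { log.add("endDTD"); }
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts)
                { log.add("startElement " + qName); }
        };
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(events(
            "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]><doc>text</doc>"));
    }
}
```

The DTD events arrive before the first startElement() call. An inclusion is only legal once at least one element has started, so by then a merged stream has no place left to report a second DTD.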

The best way to use XML inclusions is with XML text that doesn't use DTDs, perhaps using "XML 1.0 plus Namespaces" rules to help assign meaning to individual elements and attributes. Eliminating DTDs means some important bits of the XML Infoset will be unavailable, such as the attribute-typing information that tells you which attributes are used as IDs. If all the files in question are themselves well-formed XML documents with both version and encoding in any XML declarations (and without a DTD), they can easily be included without significant restrictions. Such an inclusion facility can be convenient in a variety of application contexts, such as template-driven document processing and other cases where it's important to build larger documents from smaller ones.
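The cost of dropping the DTD shows up in the Attributes interface. In this sketch the same element is parsed with and without an ATTLIST declaration, and the type reported for its first attribute is returned; the class name AttrTypeDemo and the sample documents are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Example 6-9.
class AttrTypeDemo {
    // Returns the declared type of the first attribute on the first
    // element that has any attributes, or null if there is none.
    static String firstAttrType(String xml) throws Exception {
        final String[] type = new String[1];
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (type[0] == null && atts.getLength() > 0)
                    type[0] = atts.getType(0);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return type[0];
    }

    public static void main(String[] args) throws Exception {
        // With a declaration, the parser knows the attribute is an ID.
        System.out.println(firstAttrType(
            "<!DOCTYPE doc [<!ATTLIST doc id ID #IMPLIED>]><doc id='x1'/>"));
        // Without one, every attribute is reported as CDATA.
        System.out.println(firstAttrType("<doc id='x1'/>"));
    }
}
```

Applications that merge DTD-less documents and still need ID semantics have to get that knowledge some other way, for example from a naming convention they define themselves.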




Copyright © 2002 O'Reilly & Associates. All rights reserved.