XML Schema Languages (XML Schema)

A.1. What Is an XML Schema Language?

Roughly speaking, XML schema languages describe XML documents. Different approaches to that task, however, provide a wide range of functionality.

A.1.1. XML Schema Languages Are Not Schemas

The first thing we can say about XML schema languages is that they are not schemas. At least, they do not match the definition of a schema given by Webster's dictionary, which states: "an outline or image universally applicable to a general conception, under which it is likely to be presented to the mind; as, five dots in a line are a schema of the number five; a preceding and succeeding event are a schema of cause and effect."

This definition does not apply to the languages known as "XML schema languages"; most of these are more complex than the documents they describe and are too difficult to "be presented to the mind." They focus on defining validation rules more than on representing or modeling a class of documents. When they do model a class of documents, they often want to add information to the documents they model.

Looking past the formal label of schemas, how can we classify so-called "XML schema languages"? Looking at all of them (DTDs, W3C XML Schema, RELAX NG, and also languages such as Schematron), the one thing they have in common is that they are transformations: each takes a "schema" and an instance document as input and transforms them into a validation report and, optionally, into a PSVI (Post-Schema-Validation Infoset), a set of information added to the XML infoset of the source document. This PSVI (when it exists) includes information such as default values and datatypes.

Changing the category of XML schema languages not only alters our perception of what they are, but also widens the field, since general-purpose transformation or programming languages, such as XSLT, Prolog, Java, and C#, can then be considered XML schema languages.
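To make this "schema language as transformation" view concrete, here is a minimal sketch in Python, using only the standard library. The rule format (`RULES`), the function name `validate`, and the sample vocabulary are all hypothetical and invented for illustration; no real schema language works exactly this way. The function takes a "schema" and an instance document and returns a validation report plus an augmented copy of the document that plays the role of a PSVI.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical "schema": for each element name, the attributes that are
# required and the default values to add when an attribute is absent.
RULES = {
    "book": {"required": ["id"], "defaults": {"available": "true"}},
}

def validate(rules, root):
    """Transform (schema, instance) into (report, augmented instance)."""
    report = []
    augmented = copy.deepcopy(root)  # leave the source document untouched
    for elem in augmented.iter():
        rule = rules.get(elem.tag)
        if rule is None:
            continue  # this sketch ignores elements it has no rule for
        for name in rule["required"]:
            if name not in elem.attrib:
                report.append(
                    f"<{elem.tag}> is missing required attribute '{name}'"
                )
        for name, value in rule["defaults"].items():
            # PSVI-like contribution: information added by the schema
            elem.attrib.setdefault(name, value)
    return report, augmented

doc = ET.fromstring('<library><book id="b1"/><book/></library>')
report, psvi = validate(RULES, doc)
```

Here `report` lists the single missing `id` attribute, while `psvi` carries the `available` default on every `book` element; the source `doc` itself is left unchanged.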

A.1.2. Firewalls Against Diversity

The "X" in XML stands for "Extensible." XML is, in fact, so extensible and diverse that few if any applications are able to support this diversity. XML schema languages were created as firewalls against diversity, protecting applications from unexpected information and formats.

This analogy gives us our first classification of schema languages. Like firewalls, they can be open and allow any construction that isn't forbidden (as in the case of Schematron), or they can be closed and forbid anything that has not been allowed (as is the case for most of the other schema languages, including W3C XML Schema).
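The difference between the two policies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `closed_check` and `open_check` and the sample vocabulary are hypothetical, and real schema languages constrain far more than element names.

```python
import xml.etree.ElementTree as ET

def closed_check(root, allowed):
    """Closed: report every element name that was not explicitly allowed."""
    return sorted({e.tag for e in root.iter() if e.tag not in allowed})

def open_check(root, forbidden):
    """Open: report only element names that were explicitly forbidden."""
    return sorted({e.tag for e in root.iter() if e.tag in forbidden})

doc = ET.fromstring("<order><item/><note/><script/></order>")

closed_errors = closed_check(doc, allowed={"order", "item"})
open_errors = open_check(doc, forbidden={"script"})
```

The unexpected `note` element passes the open check but is rejected by the closed one, which is exactly the kind of extension a closed schema blocks until someone explicitly allows it.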

Closed firewalls are certainly much safer than open ones, but also much less extensible, since any new traffic must be allowed by the administrator before it can pass through. This is one of the reasons for the failure of protocols such as CORBA, as well as one of the reasons web services have chosen to use HTTP. Closed XML schema languages can represent the same kind of threat to the diversity of XML vocabularies as closed firewalls do to the diversity of IP protocols.

Therefore, there is a trade-off. Being able to check whether a document we have received or are about to send is valid and won't blow up our applications is not only useful but necessary. However, our schemas should stay open whenever possible to remain extensible. (Chapter 13, "Creating Extensible Schemas," presents the techniques available with W3C XML Schema to limit the danger of "closedness.")

A.1.3. Intrusive Modeling Tools

Schematron is the only XML schema language that doesn't base its validation on a model of the class of documents that are considered valid. All the other XML schema languages describe the structure of the valid documents (which is where the name "schema" comes from). The limited expressiveness of their description languages can be another threat to the diversity of XML vocabularies, since structures that cannot be described with one of the major XML schema languages might become deprecated automatically. Unfortunately, this will likely be the case with W3C XML Schema, whose expressiveness can be considered medium. Some existing vocabularies, such as RSS 1.0 and WebDAV, cannot be described with W3C XML Schema.

One may argue whether such and such a structure, which cannot be described by such and such a language, is good practice or not. However, I think that XML schema languages should be as neutral as possible and not add constraints to those defined by the XML 1.0 and Namespaces in XML Recommendations. XML is still a young technology, and many innovative ways of using it are still to be discovered. Some of them may be jeopardized by the lack of expressiveness of W3C XML Schema.

On the other hand, the modeling activity is valuable in itself, and its outcome, expressed as an XML schema, can be used to automate or enhance the processing of the XML document or the generation of applications and generic tools, such as Version 2 of XPath and XSLT. The initial version of XQuery will rely on the information provided by XML schemas for advanced features, but also for things as basic as knowing which sort order should be used for each node.

In modeling, we find the same basic differences that can be found between an API such as the DOM (in which each node is manipulated individually and is highly differentiated) and a model such as XPath, which enables splitting XML documents into sets of nodes ("nodesets"). RELAX NG (co-authored by James Clark, who was also the editor of XPath 1.0) is based on the definition of patterns: classes of undifferentiated nodesets or containers encapsulating elements, attributes, and text nodes. W3C XML Schema, by contrast, has defined differentiated and different constructions to define elements, attributes, their content (called simple or complex types), and groups of elements or attributes.
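The contrast between the two styles of access can be illustrated with Python's standard library. This is a sketch only: `xml.dom.minidom` stands in for the DOM, and the path expression uses the limited XPath subset supported by `xml.etree.ElementTree`, not full XPath. The sample document and variable names are hypothetical.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

SOURCE = (
    "<library>"
    "<book><title>A</title></book>"
    "<book><title>B</title></book>"
    "</library>"
)

# DOM style: walk the tree and test each node's type individually.
dom = xml.dom.minidom.parseString(SOURCE)
dom_titles = []
for book in dom.documentElement.childNodes:
    if book.nodeType != book.ELEMENT_NODE:
        continue
    for child in book.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == "title":
            dom_titles.append(child.firstChild.data)

# Path style: one expression selects the whole set of matching nodes.
tree = ET.fromstring(SOURCE)
path_titles = [t.text for t in tree.findall("./book/title")]
```

Both approaches collect the same titles, but the first differentiates every node it touches while the second treats the result as an undifferentiated set, which is closer in spirit to the pattern-based view.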

Although this differentiation will seem natural once we get used to W3C XML Schema, it is often useful to remember that those different constructions can be seen as a difference of perspective. Elements, attributes, types, and groups are all "patterns" as defined by RELAX NG, with a different granularity, and can be embedded within each other.

There is also a second consequence to any modeling activity, which is to change our perception of what is modeled. As the outside world is seen differently after Aristotle, Newton, and Einstein, your perception of a given XML document will vary depending upon which XML schema language you use.

A.1.4. Early Binding Tools

Another often-mentioned quality of XML is its ability to serve as a base for late binding and highly decoupled systems in which the sender and receiver applications are independent of each other. This late binding ability has two major advantages. The first, which is very practical, is a complete independence between the systems and applications that create the XML document on one side and those that use it on the other side. The second, which is more abstract, allows the receiver to apply its own treatment and project its own semantics to "understand" the document, leaving open the possibility of adding some value to the sent message.

XML schema languages may be a danger for late binding approaches, especially those that, like W3C XML Schema, produce a PSVI. Associating information from the schema with the document is a form of early binding: it binds the document to a specific XML schema language, which then becomes necessary to interpret the document. Here again, this danger seems to be the price of automating the writing of applications to process the documents.

Appendix A. XML Schema Languages
