Common Patterns (XML Schema)

6.5.1. String Datatypes

Regular expressions treat information in its textual form. This makes them an excellent mechanism for constraining strings.

6.5.1.1. Unicode blocks

Unicode is a great asset of XML; however, there are few applications able to process and display all the characters of the Unicode set correctly and still fewer users able to read them! If you need to check that the values of your string datatypes use only characters from one (or more) Unicode blocks, you can derive those datatypes from built-in types with patterns such as:

<xs:simpleType name="BasicLatinToken">
  <xs:restriction base="xs:token">
    <xs:pattern value="\p{IsBasicLatin}*"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="Latin-1Token">
  <xs:restriction base="xs:token">
    <xs:pattern value="[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*"/>
  </xs:restriction>
</xs:simpleType>

Note that such patterns do not impose a character encoding on the document itself. For instance, the Latin-1Token datatype could validate instance documents using UTF-8, UTF-16, ISO-8859-1, or any other encoding, provided that the characters used in the string belong to the two Unicode blocks BasicLatin and Latin-1Supplement. In other words, because patterns work on the lexical space (that is, after the parser has done its transformations), they do not control the physical format of the instance documents.
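To make the point about encodings concrete, here is a small sketch of instance content. The element names basicWord and latinWord are hypothetical and are assumed to be declared with the types BasicLatinToken and Latin-1Token respectively. The validity notes rely on the Unicode block ranges (BasicLatin is U+0000 to U+007F, Latin-1Supplement is U+0080 to U+00FF).

```xml
<!-- Hypothetical instance fragment; the element names are assumptions. -->
<examples>
  <basicWord>cafe</basicWord>        <!-- valid: only BasicLatin characters -->
  <basicWord>café</basicWord>        <!-- invalid: U+00E9 is in Latin-1Supplement -->
  <latinWord>café</latinWord>        <!-- valid -->
  <latinWord>caf&#xE9;</latinWord>   <!-- valid: same character, written as a reference -->
  <latinWord>10 €</latinWord>        <!-- invalid: U+20AC is outside both blocks -->
</examples>
```

The two valid latinWord values are indistinguishable to the validator, since the parser resolves the character reference before the pattern is applied.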

6.5.1.2. Counting words

We have already seen a trick to count words using a dummy derivation by list; however, this derivation counts only whitespace-separated "words" and treats punctuation like normal characters. We can also limit the number of words using a pattern. To do so, we define an atom, a sequence of one or more "word" characters (\w+) followed by one or more nonword characters (\W+), and control its number of occurrences. If we are not very strict about punctuation, we also need to allow an arbitrary number of nonword characters at the beginning of the value and to deal with the possibility of a value ending with a word (without a trailing separator). One way to avoid any ambiguity at the end of the string is to treat the last word separately and make its trailing separator optional:

<xs:simpleType name="story100-200words">
  <xs:restriction base="xs:token">
    <xs:pattern value="\W*(\w+\W+){99,199}\w+\W*"/>
  </xs:restriction>
</xs:simpleType>
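The same construction scales down to sizes that are easier to check by hand. The following sketch uses a hypothetical type name, phrase3-5words, and only changes the bounds: two to four word-plus-separator atoms followed by a final word give three to five words in total.

```xml
<xs:simpleType name="phrase3-5words">
  <xs:restriction base="xs:token">
    <!-- {2,4} atoms plus one final word = 3 to 5 words -->
    <xs:pattern value="\W*(\w+\W+){2,4}\w+\W*"/>
  </xs:restriction>
</xs:simpleType>
```

With this type, "Veni, vidi, vici." is accepted (three words), while "Hello, world!" is rejected (two words). Keep in mind that \w excludes punctuation, so an apostrophe acts as a separator and "don't" counts as two words.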

6.5.1.3. URIs

We have seen that xs:anyURI doesn't care about "absolutizing" relative URIs and it may be wise to impose the usage of absolute URIs, which are easier to process. Furthermore, it can also be interesting for some applications to limit the accepted URI schemes. This can easily be done by a set of patterns such as:

<xs:simpleType name="httpURI">
  <xs:restriction base="xs:anyURI">
    <xs:pattern value="http://.*"/>
  </xs:restriction>
</xs:simpleType>
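If both plain and secure HTTP are acceptable, the scheme test can be widened. This is only a sketch, with a hypothetical type name, webURI:

```xml
<xs:simpleType name="webURI">
  <xs:restriction base="xs:anyURI">
    <!-- "s?" makes the trailing "s" of "https" optional -->
    <xs:pattern value="https?://.*"/>
  </xs:restriction>
</xs:simpleType>
```

Patterns are case-sensitive, so a value such as HTTP://example.org would be rejected even though URI scheme names are case-insensitive. Whether that matters depends on the application.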

6.5.2. Numeric and Float Types

While numeric types aren't strictly text, patterns can still be used appropriately to constrain their lexical form.

6.5.2.1. Leading zeros

Getting rid of leading zeros is quite simple but requires some precautions if we want to keep the optional sign and the number "0" itself. This can be done using patterns such as:

<xs:simpleType name="noLeadingZeros">
  <xs:restriction base="xs:integer">
    <xs:pattern value="[+-]?([1-9][0-9]*|0)"/>
  </xs:restriction>
</xs:simpleType>

Note that in this pattern, we chose to redefine all the lexical rules that apply to an integer. This pattern would give the same lexical space whether applied to xs:token or to xs:integer. We could also have relied on our knowledge of the base datatype and written:

<xs:simpleType name="noLeadingZeros">
  <xs:restriction base="xs:integer">
    <!-- [^0+-] rejects a leading "0" and keeps the sign from matching in its place -->
    <xs:pattern value="[+-]?([^0+-].*|0)"/>
  </xs:restriction>
</xs:simpleType>

Relying on the base datatype in this manner can produce simpler patterns, but the result can also be more difficult to interpret, since we have to combine the lexical rules of the base datatype with the rules expressed by the pattern to understand it.
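A few sample values may help to check the intent. In this sketch, the element name count is hypothetical and is assumed to be declared with the noLeadingZeros type, using the first, fully explicit pattern:

```xml
<!-- Hypothetical instance fragment; <count> is assumed to be a noLeadingZeros. -->
<examples>
  <count>0</count>      <!-- valid: the number zero itself -->
  <count>-12</count>    <!-- valid: the sign is optional -->
  <count>+0</count>     <!-- valid: a signed zero still matches -->
  <count>007</count>    <!-- invalid: leading zeros -->
  <count>+012</count>   <!-- invalid: leading zero after the sign -->
</examples>
```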

6.5.2.2. Fixed format

The maximum number of digits can be fixed using xs:totalDigits and xs:fractionDigits. However, these facets are only maximum numbers and work on the value space. If we want to fix the format of the lexical space to be, for instance, "DDDD.DD", we can write a pattern such as:

<xs:simpleType name="fixedDigits">
  <xs:restriction base="xs:decimal">
    <xs:pattern value="[+-]?[0-9]{4}\.[0-9]{2}"/>
  </xs:restriction>
</xs:simpleType>
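For contrast, here is a sketch of the facet-based approach mentioned above, under a hypothetical type name, maxSixDigits:

```xml
<xs:simpleType name="maxSixDigits">
  <xs:restriction base="xs:decimal">
    <!-- Upper bounds on the value, not a layout for the text -->
    <xs:totalDigits value="6"/>
    <xs:fractionDigits value="2"/>
  </xs:restriction>
</xs:simpleType>
```

This type accepts 1234.56, but also 12.3 and 0012.30, because the facets only bound the value. A pattern that fixes the lexical form to "DDDD.DD" accepts 1234.56 and 0012.30 and rejects 12.3.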

6.5.3. Datetimes

Dates and times have complex lexical representations. Patterns can give developers extra control over how they are used.

6.5.3.1. Time zones

The time zone support of W3C XML Schema is quite controversial and needs some additional constraints to avoid comparison problems. These patterns can be kept relatively simple since the syntax of the datetime is already checked by the schema validator and only simple additional checks need to be added. Applications which require that their datetimes specify a time zone may use the following template, which checks that the time part ends with a "Z" or contains a sign:

<xs:simpleType name="dateTimeWithTimezone">
  <xs:restriction base="xs:dateTime">
    <xs:pattern value=".+T.+(Z|[+-].+)"/>
  </xs:restriction>
</xs:simpleType>

Still simpler, applications that want to make sure that none of their datetimes specify a time zone may just check that the time part doesn't contain the characters "+", "-", or "Z":

<xs:simpleType name="dateTimeWithoutTimezone">
  <xs:restriction base="xs:dateTime">
    <xs:pattern value=".+T[^Z+-]+"/>
  </xs:restriction>
</xs:simpleType>
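The following sample values sketch how the two types behave. The element names zoned and local are hypothetical and are assumed to be declared as dateTimeWithTimezone and dateTimeWithoutTimezone respectively:

```xml
<!-- Hypothetical instance fragment; the element names are assumptions. -->
<examples>
  <zoned>2001-10-26T21:32:52Z</zoned>        <!-- valid -->
  <zoned>2001-10-26T21:32:52+02:00</zoned>   <!-- valid -->
  <zoned>2001-10-26T21:32:52</zoned>         <!-- invalid: no time zone -->
  <local>2001-10-26T21:32:52</local>         <!-- valid -->
  <local>2001-10-26T21:32:52Z</local>        <!-- invalid: "Z" after the "T" -->
</examples>
```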

In these two datatypes, we used the separator "T". This is convenient, since after this delimiter the signs can appear only in the time zone definition. This delimiter is missing if we want to constrain dates instead of datetimes, but in that case we can detect the time zone by its ":" (or by a "Z") instead:

<xs:simpleType name="dateWithTimezone">
  <xs:restriction base="xs:date">
    <xs:pattern value=".+[:Z].*"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="dateWithoutTimezone">
  <xs:restriction base="xs:date">
    <xs:pattern value="[^:Z]*"/>
  </xs:restriction>
</xs:simpleType>
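As a quick check of the date variants, here is a sketch with hypothetical element names, zonedDate (assumed to be a dateWithTimezone) and localDate (assumed to be a dateWithoutTimezone):

```xml
<!-- Hypothetical instance fragment; the element names are assumptions. -->
<examples>
  <zonedDate>2001-10-26Z</zonedDate>        <!-- valid -->
  <zonedDate>2001-10-26+02:00</zonedDate>   <!-- valid: the ":" marks the offset -->
  <zonedDate>2001-10-26</zonedDate>         <!-- invalid: no time zone -->
  <localDate>2001-10-26</localDate>         <!-- valid -->
  <localDate>2001-10-26-05:00</localDate>   <!-- invalid: contains ":" -->
</examples>
```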

Applications may also simply impose a set of time zones to use:

<xs:simpleType name="dateTimeInMyTimezones">
  <xs:restriction base="xs:dateTime">
    <xs:pattern value=".+\+02:00"/>
    <xs:pattern value=".+\+01:00"/>
    <xs:pattern value=".+\+00:00"/>
    <xs:pattern value=".+Z"/>
    <xs:pattern value=".+-04:00"/>
  </xs:restriction>
</xs:simpleType>
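Several xs:pattern facets in the same restriction act as alternatives, so the same set of time zones can also be written as a single pattern with an alternation. The type name below is hypothetical, and the sketch is meant to accept the same values as the five patterns above:

```xml
<xs:simpleType name="dateTimeInMyTimezonesCompact">
  <xs:restriction base="xs:dateTime">
    <!-- Z, +00:00, +01:00, +02:00 or -04:00 -->
    <xs:pattern value=".+(Z|\+0[0-2]:00|-04:00)"/>
  </xs:restriction>
</xs:simpleType>
```

Either way, note that "Z" and "+00:00" denote the same time zone (UTC), so both spellings have to be listed if both should be accepted.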

We promised earlier to look at xs:duration and see how we can define two datatypes that have a complete sort order. The first datatype consists of durations expressed only in years and months, and the second of durations expressed only in days, hours, minutes, and seconds. The test for the first type can be the absence of a "D" (for day) and of a "T" (the time delimiter): if neither of those characters is present, the value uses only year and month parts. The test for the other type cannot be based on the absence of "Y" and "M", since there is also an "M" in the time part. Instead, we can test that, after an optional sign and the "P" designator, the first field is either the day part or the "T" delimiter:

<xs:simpleType name="YMduration">
  <xs:restriction base="xs:duration">
    <xs:pattern value="[^TD]+"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="DHMSduration">
  <xs:restriction base="xs:duration">
    <xs:pattern value="-?P((\d+D)|T).*"/>
  </xs:restriction>
</xs:simpleType>
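A few sample durations show how the two types partition the values. The element names ym and dhms are hypothetical and are assumed to be declared as YMduration and DHMSduration respectively:

```xml
<!-- Hypothetical instance fragment; the element names are assumptions. -->
<examples>
  <ym>P1Y6M</ym>           <!-- valid: years and months only -->
  <ym>P1Y2DT3H</ym>        <!-- invalid: contains "D" and "T" -->
  <dhms>P3DT4H30M</dhms>   <!-- valid: starts with the day part -->
  <dhms>-PT36H</dhms>      <!-- valid: starts with the "T" delimiter -->
  <dhms>P1Y2DT3H</dhms>    <!-- invalid: starts with the year part -->
</examples>
```

A mixed duration such as P1Y2DT3H belongs to neither type, which is the point: values within each type can then be fully ordered.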

