Lecture Outline

We shall study XML Path Language (XPath), Version 1.0, which is a W3C Recommendation published on November 16, 1999.
Link: http://www.w3.org/TR/xpath
We shall also look into several details of the XML Recommendation:
1. The xml:space attribute
2. The xml:lang attribute
3. Character encodings in XML (i.e., encoding)
Useful resources on the Web:
1. Chapter 9 of XML in a Nutshell (3rd edition) is on XPath. It is available for download.
2. Also check out Massimo Franceschet's "Caffè XML" web page.

XPath Data Model

An XML document can be viewed as a tree; XPath is a language for selecting nodes out of this tree. In the view of XPath, the tree contains 7 types of node:

the root node
element nodes
text nodes
attribute nodes
namespace nodes
processing instruction nodes
comment nodes

The root node and the element nodes each have an ordered list of child nodes. Note that XPath operates on an XML document after CDATA sections, entity references, and document type declarations have been merged into the document.
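The merging described above can be observed with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the fragment in `doc` is a hypothetical one made up for illustration, not part of the lecture's example document.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a character reference and a CDATA section
# sit next to ordinary character data.
doc = (
    "<name>"
    "<middle_initial>&#x50;</middle_initial>"
    "<note>a <![CDATA[<b>]]> c</note>"
    "</name>"
)

root = ET.fromstring(doc)

# The character reference has already been replaced by the character it names.
initial = root.find("middle_initial").text
# The CDATA section has been merged with the surrounding character data.
note = root.find("note").text

print(repr(initial))  # 'P'
print(repr(note))     # 'a <b> c'
```

By the time a path expression is evaluated, nothing in the tree records that the P was written as a character reference or that part of the note was a CDATA section.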

An Example XML Document

<?xml version="1.0"?>
<?xml-stylesheet type="application/xml" href="people.xsl"?>
<!DOCTYPE people [
 <!ATTLIST homepage xlink:type CDATA #FIXED "simple"
                  xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">
 <!ATTLIST person id ID #IMPLIED>
]>
<people>

  <person born="1912" died="1954" id="p342">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <!-- Did the word computer scientist exist in Turing's day? -->
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
    <homepage xlink:href="http://www.turing.org.uk/"/>
  </person>

  <person born="1918" died="1988" id="p4567">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>&#x50;</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>

</people>

Note: The above XML document is taken from Chapter 9 of XML in a Nutshell (3rd edition), by Elliotte Rusty Harold & W. Scott Means.

The Tree Structure of the Example

Note: The above figure is taken from Chapter 9 of XML in a Nutshell (3rd edition), by Elliotte Rusty Harold & W. Scott Means.

Correction: In the above figure,

text/xsl should be application/xml
xref:href="http://www.turinng.org.uk" should be xlink:href="http://www.turing.org.uk"
The text Richard should appear in element first_name

7 Types of Node in XPath

The root node is the root of the tree. The element node for the document element is a child of the root node. The root node also has as children the processing instruction and comment nodes for any processing instructions and comments that occur in the prolog or after the end of the document element.
There is an element node for every element in the document. The children of an element node are the element nodes, comment nodes, processing instruction nodes, and text nodes of its content.
Character data is grouped into text nodes. As much character data as possible is grouped into each text node.
There is a processing instruction node for every processing instruction, except for any processing instruction that occurs within the document type declaration.
There is a comment node for every comment, except for any comment that occurs within the document type declaration.

7 Types of Node in XPath, Continued

Each element node has an associated set of attribute nodes; the element is the parent of each of these attribute nodes; however, an attribute node is not a child of its parent element.
A defaulted attribute is treated the same as a specified attribute. If an attribute was declared for the element type in the DTD, but the default was declared as #IMPLIED, and the attribute was not specified on the element, then the element's attribute set does not contain a node for the attribute.
Each element has an associated set of namespace nodes, one for each distinct namespace prefix that is in scope for the element, and one for the default namespace if one is in scope for the element. The element is the parent of each of these namespace nodes; however, a namespace node is not a child of its parent element.
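The distinction between children and attributes can be illustrated with Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree has its own object model rather than the XPath data model, and the document below is a shortened, hypothetical variant of the example in which the second person has no id attribute.

```python
import xml.etree.ElementTree as ET

# Shortened variant of the example; the second person deliberately
# has no id attribute (an assumption made for this illustration).
doc = """<people>
  <person born="1912" died="1954" id="p342"><name>Alan Turing</name></person>
  <person born="1918" died="1988"><name>Richard Feynman</name></person>
</people>"""

root = ET.fromstring(doc)
turing, feynman = root.findall("person")

# Iterating over an element yields its child elements only.
child_tags = [child.tag for child in turing]
# Attributes are kept in a separate mapping and are not among the children.
attribute_names = sorted(turing.attrib)
# An attribute that was not specified (and has no default) is simply absent.
feynman_id = feynman.get("id")

print(child_tags)       # ['name']
print(attribute_names)  # ['born', 'died', 'id']
print(feynman_id)       # None
```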

XPath Location Paths, Examples

An XPath location path identifies a set of nodes in a document. A location path is built out of successive location steps. Each step is evaluated relative to a particular node in the document called the context node.

The root location path: /
Child element location steps. Examples:
/people/person
/people/person/name
Attribute location steps. Examples:
/people/person/@born
/people/person/@id
The node test comment() is true for any comment node, text() is true for any text node, and processing-instruction() is true for any processing instruction node. Example:
/people/person/profession/text()
Wildcards: * matches all nodes of the principal node type, node() matches any node, and @* matches all attribute nodes. Examples:
/people/person/*
/people/person/@*
Multiple matches with |. Example:
//first_name | //last_name
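The location paths above can be tried out with a short script. The sketch below uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath: its paths are evaluated relative to the element they are called on, and attribute steps, text() and | have no direct equivalent, so they are emulated in Python. The document is a trimmed copy of the example (no DTD, comment, or homepage element).

```python
import xml.etree.ElementTree as ET

doc = """<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name><first_name>Richard</first_name><last_name>Feynman</last_name></name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>"""

people = ET.fromstring(doc)  # the document element

# /people/person
persons = people.findall("./person")

# /people/person/@born (read the attribute from each selected element)
born = [p.get("born") for p in persons]

# /people/person/profession/text()
professions = [e.text for e in people.findall("./person/profession")]

# /people/person/*
child_tags = [e.tag for e in people.findall("./person/*")]

# //first_name | //last_name (emulated with two searches; unlike a real
# XPath union, the combined list is not in document order)
names = [e.text for e in people.findall(".//first_name")]
names += [e.text for e in people.findall(".//last_name")]

print(born)         # ['1912', '1918']
print(professions)
print(child_tags)
print(names)        # ['Alan', 'Richard', 'Turing', 'Feynman']
```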

Compound Location Paths & Predicates

XPath expressions can be formed much like Unix path expressions, using /, .., ., and //. The last one, //, has no Unix counterpart: it reaches all descendants of the context node. Examples:

//name//*
//name/../profession/.

Each step in a location path may (but does not have to) have a predicate that selects from the node-set current at that step in the expression. Examples:

//person[@id = "p4567"]
//person[@born <= 1920 and @born >= 1915]
/people/person[position()=1]/profession[position()=3]
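The three predicate examples can be approximated as follows. This is a sketch using Python's standard xml.etree.ElementTree module, which understands attribute-equality and positional predicates but not numeric comparisons, so the second example is filtered in Python instead. The document is a trimmed copy of the example.

```python
import xml.etree.ElementTree as ET

doc = """<people>
  <person born="1912" died="1954" id="p342">
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <profession>physicist</profession>
  </person>
</people>"""

people = ET.fromstring(doc)

# //person[@id = "p4567"]
by_id = people.findall(".//person[@id='p4567']")

# //person[@born <= 1920 and @born >= 1915]
# Numeric comparison is outside ElementTree's XPath subset, so the
# predicate is applied as an ordinary Python filter.
by_birth_year = [
    p for p in people.findall(".//person")
    if 1915 <= int(p.get("born")) <= 1920
]

# /people/person[position()=1]/profession[position()=3]
third_profession = people.findall("./person[1]/profession[3]")

print([p.get("id") for p in by_id])          # ['p4567']
print([p.get("id") for p in by_birth_year])  # ['p4567']
print([e.text for e in third_profession])    # ['cryptographer']
```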

Unabbreviated Location Paths

The following abbreviated XPath expression

//person[@born < 1950]/name

is the same as the following unabbreviated XPath expression

/descendant-or-self::node()/child::person[attribute::born < 1950]/child::name

There are 13 axes:

ancestor
ancestor-or-self
attribute
child
descendant
descendant-or-self
following
following-sibling
namespace
parent
preceding
preceding-sibling
self

Check out the illustration at Massimo Franceschet's "Caffè XML" web page.
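To get a feel for what a few of these axes contain, here is a minimal sketch that computes them by hand over an ElementTree tree in Python. It is an illustration under simplifying assumptions, not an XPath implementation: only element nodes are considered, the root node above the document element is not modeled, and the helper names (`ancestors`, `descendants`, and so on) are made up for this example.

```python
import xml.etree.ElementTree as ET

doc = (
    "<people><person>"
    "<name><first_name>Alan</first_name><last_name>Turing</last_name></name>"
    "<profession>mathematician</profession>"
    "<profession>cryptographer</profession>"
    "</person></people>"
)

root = ET.fromstring(doc)
# ElementTree elements do not know their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """ancestor axis (elements only), nearest ancestor first."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result

def descendants(node):
    """descendant axis (elements only), in document order."""
    return list(node.iter())[1:]  # iter() yields the node itself first

def following_siblings(node):
    """following-sibling axis (elements only)."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(node) + 1:]

def preceding_siblings(node):
    """preceding-sibling axis (elements only), in document order."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[:siblings.index(node)]

name = root.find("./person/name")
print([e.tag for e in ancestors(name)])           # ['person', 'people']
print([e.tag for e in descendants(name)])         # ['first_name', 'last_name']
print([e.tag for e in following_siblings(name)])  # ['profession', 'profession']
print([e.tag for e in preceding_siblings(name)])  # []
```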

More XPath Expressions

/child::doc/child::chapter[position()=5]/child::section[position()=2]
selects the second section of the fifth chapter of the doc document element
child::para[attribute::type='warning'][position()=5]
selects the fifth para child of the context node that has a type attribute with value warning
child::*[self::chapter or self::appendix]
selects the chapter and appendix children of the context node
More examples in the Location Paths section in XML Path Language (XPath), Version 1.0, a W3C Recommendation.
Also check out the examples at Massimo Franceschet's "XPath Functional Test" web page.

More about Predicates

Each location step may have zero or more predicates. A predicate is an XPath expression. This expression usually evaluates to a Boolean value (but it need not).
An XPath processor works from left to right in an expression. After it has evaluated everything that precedes the predicate, it is left with a context node list that may contain no node, one node, or more than one node.
The predicate is evaluated against each node in the context node list. If the expression returns true, then that node is retained in the list. If the expression returns false, then the node is removed from the list.
If the expression returns a number, then the node being evaluated is left in the list if and only if the number is the same as the position of that node in the context node list.
XPath 1.0 defines 27 built-in functions for use in XPath expressions (in particular, in the predicates). Check them out in Chapter 23 of XML in a Nutshell, or read them in the XML Path Language (XPath), Version 1.0, a W3C Recommendation.
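The filtering rules above (Boolean result versus numeric result) can be sketched in a few lines of Python. The function `apply_predicate` and its callback signature are hypothetical names introduced only for this illustration; nodes are represented by plain strings.

```python
def apply_predicate(nodes, predicate):
    """Filter a context node list according to the rules described above.

    `predicate` is a hypothetical callable taking (node, position, size)
    and returning a bool, a number, or any other value.
    """
    kept = []
    size = len(nodes)
    for position, node in enumerate(nodes, start=1):
        value = predicate(node, position, size)
        if isinstance(value, bool):
            keep = value                # true: retain, false: remove
        elif isinstance(value, (int, float)):
            keep = (value == position)  # a number is compared with the position
        else:
            keep = bool(value)          # assumption: other values are treated as Booleans
        if keep:
            kept.append(node)
    return kept

professions = ["computer scientist", "mathematician", "cryptographer"]

# profession[2]: a numeric predicate keeps the node at that position.
second = apply_predicate(professions, lambda node, pos, size: 2)

# profession[position() = last()]: a Boolean predicate.
last = apply_predicate(professions, lambda node, pos, size: pos == size)

# Two predicates in a row: positions are recounted after the first filter.
starts_with_c = apply_predicate(professions, lambda node, pos, size: node.startswith("c"))
second_of_those = apply_predicate(starts_with_c, lambda node, pos, size: 2)

print(second)           # ['mathematician']
print(last)             # ['cryptographer']
print(second_of_those)  # ['cryptographer']
```

The last two lines show why the order of predicates matters: the same numeric predicate selects a different node once an earlier predicate has shortened the context node list.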

XPath Location Paths

Location Paths:

LocationPath         ::= RelativeLocationPath	
                       | AbsoluteLocationPath	
AbsoluteLocationPath ::= '/' RelativeLocationPath?	
                       | AbbreviatedAbsoluteLocationPath	
RelativeLocationPath ::= Step	
                       | RelativeLocationPath '/' Step	
                       | AbbreviatedRelativeLocationPath

Location Steps:

Step          ::= AxisSpecifier NodeTest Predicate*	
                | AbbreviatedStep	
AxisSpecifier ::= AxisName '::'	
                | AbbreviatedAxisSpecifier

Axes:

AxisName ::= 'ancestor'	| 'ancestor-or-self' | 'attribute' | 'child' | 'descendant'	
           | 'descendant-or-self' | 'following'	| 'following-sibling' | 'namespace'	
           | 'parent' | 'preceding' | 'preceding-sibling' | 'self'

Node Tests and Predicates:

NodeTest      ::= NameTest | NodeType '(' ')' | 'processing-instruction' '(' Literal ')'
Predicate     ::= '[' PredicateExpr ']'	
PredicateExpr ::= Expr

Abbreviations:

AbbreviatedAbsoluteLocationPath	::= '//' RelativeLocationPath	
AbbreviatedRelativeLocationPath	::= RelativeLocationPath '//' Step	
AbbreviatedStep                 ::= '.'	| '..'	
AbbreviatedAxisSpecifier        ::= '@'?

White Space Handling in XML

The following is taken from Section 2.10 of Extensible Markup Language (XML) 1.0 (Fourth Edition), a W3C Recommendation.

In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.

A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose values are one or both of "default" and "preserve". For example:

<!ATTLIST poem  xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>

The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.
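As a small illustration of the point that xml:space only signals an intent to the application, the sketch below reads the attribute with Python's standard xml.etree.ElementTree module. The parser hands over all white space in either case; the function `rendered_text` and its "default" behaviour (collapsing runs of white space) are assumptions made for this example, and inheritance from ancestor elements is not handled.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predeclared XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

doc = (
    '<doc>'
    '<poem xml:space="preserve">Roses   are red,\n   Violets are blue</poem>'
    '<p>  loosely   spaced  </p>'
    '</doc>'
)
root = ET.fromstring(doc)

def rendered_text(element):
    """Hypothetical application policy: keep white space only on request."""
    text = element.text or ""
    if element.get(XML_SPACE) == "preserve":
        return text
    return " ".join(text.split())  # this application's "default" mode

poem_text = rendered_text(root.find("poem"))
p_text = rendered_text(root.find("p"))

print(repr(poem_text))  # 'Roses   are red,\n   Violets are blue'
print(repr(p_text))     # 'loosely spaced'
```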

Language Identification in XML

The following is taken from Section 2.12 of Extensible Markup Language (XML) 1.0 (Fourth Edition), a W3C Recommendation.

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, ...

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
  <l>Juristerei, und Medizin</l>
  <l>und leider auch Theologie</l>

  <l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>

The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang.

A simple declaration for xml:lang might take the form

xml:lang CDATA #IMPLIED

but specific default values may also be given, if appropriate. ...

<!ATTLIST poem   xml:lang CDATA 'fr'>
<!ATTLIST gloss  xml:lang CDATA 'en'>
<!ATTLIST note   xml:lang CDATA 'en'>
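The inheritance rule for xml:lang (the nearest enclosing declaration wins) can be sketched as follows with Python's standard xml.etree.ElementTree module. The document and the helper `effective_lang` are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = """<doc xml:lang="en">
  <p>What colour is it?</p>
  <sp xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>
  <p xml:lang="en-US">What color is it?</p>
</doc>"""

root = ET.fromstring(doc)
parent_of = {child: parent for parent in root.iter() for child in parent}

def effective_lang(element):
    """Return the xml:lang value in effect for an element, or None."""
    node = element
    while node is not None:
        if XML_LANG in node.attrib:
            return node.attrib[XML_LANG]
        node = parent_of.get(node)  # walk up towards the document element
    return None

langs = [(e.tag, effective_lang(e)) for e in root.iter()]
print(langs)
# [('doc', 'en'), ('p', 'en'), ('sp', 'de'), ('l', 'de'), ('p', 'en-US')]
```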

What Does `encoding="UTF-8"` Mean?

The following is taken from Section What is UTF-8?, in UTF-8 and Unicode FAQ for Unix/Linux, written by Markus Kuhn.

Unicode is just a code table that assigns integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2 or 4 bytes sequences. The official terms for these encodings are UCS-2 and UCS-4, respectively. ...

Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like "\0" or "/" which have a special meaning in filenames and other C library function parameters. ... The UTF-8 encoding described in RFC 3629 does not have these problems. ...
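The problem with "\0" and "/" bytes is easy to reproduce. In the sketch below, Python's utf-16-be codec stands in for big-endian UCS-2 (the two coincide for characters in the BMP), and the sample character U+012F was picked for illustration because its second byte happens to be 0x2F, the ASCII code of "/".

```python
text = "A\u012f"  # 'A' followed by U+012F, chosen for illustration

ucs2 = text.encode("utf-16-be")  # same bytes as big-endian UCS-2 for BMP characters
utf8 = text.encode("utf-8")

print(ucs2)  # b'\x00A\x01/'
print(utf8)  # b'A\xc4\xaf'

# The 2-byte encoding contains a NUL byte and a "/" byte that do not
# correspond to any NUL or "/" character in the text.
print(b"\x00" in ucs2, b"/" in ucs2)  # True True
print(b"\x00" in utf8, b"/" in utf8)  # False False
```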

What Does `encoding="UTF-8"` Mean? Continued

The following is taken from Section What is UTF-8?, in UTF-8 and Unicode FAQ for Unix/Linux, written by Markus Kuhn.

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All UCS characters > U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

All possible 2³¹ UCS codes can be encoded.

UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.

The sorting order of Bigendian UCS-4 byte strings is preserved.

The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
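Several of these properties can be spot-checked with Python's built-in UTF-8 codec. The sample characters are arbitrary choices for this sketch, and a handful of samples is of course only an illustration, not a proof.

```python
samples = ["A", "z", "\u00a9", "\u2260", "\u4e2d", "\U0001f600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    if ord(ch) <= 0x7F:
        # ASCII compatibility: one byte, identical to the ASCII encoding.
        assert encoded == ch.encode("ascii")
    else:
        # Every byte of a multibyte sequence has its most significant bit set.
        assert all(b >= 0x80 for b in encoded)
        # The first byte is in the lead-byte range, the rest are continuation bytes.
        assert 0xC0 <= encoded[0] <= 0xFD
        assert all(0x80 <= b <= 0xBF for b in encoded[1:])
    # 0xFE and 0xFF never occur.
    assert 0xFE not in encoded and 0xFF not in encoded

# Sorting by code number and sorting by UTF-8 byte string give the same order.
by_code_point = sorted(samples, key=ord)
by_utf8_bytes = sorted(samples, key=lambda ch: ch.encode("utf-8"))
print(by_code_point == by_utf8_bytes)  # True
```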

What Does `encoding="UTF-8"` Mean? Continued

The following is taken from Section What is UTF-8?, in UTF-8 and Unicode FAQ for Unix/Linux, written by Markus Kuhn.

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:

U-00000000 – U-0000007F:	0xxxxxxx
U-00000080 – U-000007FF:	110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF:	1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF:	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF:	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF:	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.

Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
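The table can be turned into a short encoder. The following is a minimal sketch in Python; the function name `utf8_encode` is made up for this example. It follows the six-row table as quoted above, so it accepts code numbers up to 0x7FFFFFFF, whereas RFC 3629 and Python's own codec stop at U+10FFFF.

```python
def utf8_encode(code_point):
    """Encode one code number following the six-row table above."""
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("code number out of range")
    if code_point <= 0x7F:
        return bytes([code_point])  # first row: 0xxxxxxx

    # (upper limit of the row, number of continuation bytes, lead-byte marker)
    rows = [
        (0x7FF, 1, 0xC0),       # 110xxxxx
        (0xFFFF, 2, 0xE0),      # 1110xxxx
        (0x1FFFFF, 3, 0xF0),    # 11110xxx
        (0x3FFFFFF, 4, 0xF8),   # 111110xx
        (0x7FFFFFFF, 5, 0xFC),  # 1111110x
    ]
    for limit, continuation, marker in rows:
        if code_point <= limit:
            break  # the first matching row gives the shortest sequence

    out = []
    value = code_point
    for _ in range(continuation):
        out.append(0x80 | (value & 0x3F))  # 10xxxxxx, lowest six bits first
        value >>= 6
    out.append(marker | value)             # remaining bits go into the lead byte
    return bytes(reversed(out))

print(utf8_encode(0x00A9).hex())  # c2a9
print(utf8_encode(0x2260).hex())  # e289a0
print(utf8_encode(0x2260) == "\u2260".encode("utf-8"))  # True
```

The two printed values match the worked examples for U+00A9 and U+2260 above.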
