XPath

2008-10-15

Tyng-Ruey Chuang
trc@iis.sinica.edu.tw

Institute of Information Science
Academia Sinica, Taipei, Taiwan

Lecture Outline

XPath Data Model

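An XML document can be viewed as a tree; XPath is a language for selecting nodes out of this tree. In the view of XPath, the tree contains 7 types of node: the root node, element nodes, text nodes, attribute nodes, namespace nodes, processing-instruction nodes, and comment nodes.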

The root node and element nodes each have an ordered list of child nodes. Note that XPath operates on an XML document after CDATA sections, entity references, and document type declarations have been merged into the document.
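
The merging described above can be observed with a small Python sketch. It assumes the standard library's xml.etree.ElementTree parser as a stand-in for an XPath processor's view of the document; the one-element sample string and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document mixing plain text, an entity reference,
# a CDATA section, and a character reference.
doc = "<note>a &lt; b, <![CDATA[<raw> & text]]> and &#x50;</note>"

note = ET.fromstring(doc)

# After parsing, all of these forms have been merged into one text value;
# nothing in the tree records which parts came from CDATA or references.
merged_text = note.text
print(merged_text)  # a < b, <raw> & text and P
```

The character reference &#x50; is the same one used for the middle initial in the example document on the next slide; it resolves to the letter P.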

An Example XML Document

<?xml version="1.0"?>
<?xml-stylesheet type="application/xml" href="people.xsl"?>
<!DOCTYPE people [
 <!ATTLIST homepage xlink:type CDATA #FIXED "simple"
                  xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">
 <!ATTLIST person id ID #IMPLIED>
]>
<people>

  <person born="1912" died="1954" id="p342">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <!-- Did the word computer scientist exist in Turing's day? -->
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
    <homepage xlink:href="http://www.turing.org.uk/"/>
  </person>

  <person born="1918" died="1988" id="p4567">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>&#x50;</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>

</people>

Note: The above XML document is taken from Chapter 9 of XML in a Nutshell (3rd edition), by Elliotte Rusty Harold & W. Scott Means.

The Tree Structure of the Example

Note: The above figure is taken from Chapter 9 of XML in a Nutshell (3rd edition), by Elliotte Rusty Harold & W. Scott Means.

Correction: In the above figure,

7 Types of Node in XPath

7 Types of Node in XPath, Continued

XPath Location Paths, Examples

An XPath location path identifies a set of nodes in a document. A location path is built out of successive location steps. Each step is evaluated relative to a particular node in the document called the context node.
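
As a rough illustration of location steps and context nodes, the following Python sketch uses the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and always evaluates a path relative to an element. The PEOPLE string is an abridged copy of the example document (DOCTYPE, comment, and homepage element dropped), and the variable names are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example document.
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name><first_name>Richard</first_name><last_name>Feynman</last_name></name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>
"""

people = ET.fromstring(PEOPLE)  # context node: the <people> element

# Three successive location steps: person, then name, then first_name.
first_names = [e.text for e in people.findall("person/name/first_name")]
print(first_names)  # ['Alan', 'Richard']

# The same kind of step gives a different result from a different context node.
feynman = people.findall("person")[1]
feynman_jobs = [e.text for e in feynman.findall("profession")]
print(feynman_jobs)  # ['physicist']
```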

Compound Location Paths & Predicates

XPath expressions can be formed much like Unix path expressions, using /, .., and ., together with //. The last one, which selects from all descendants of the context node (and the context node itself), has no Unix counterpart. Examples:

Each step in a location path may (but does not have to) have a predicate that filters the node-set selected at that step in the expression. Examples:
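
A few compound paths and predicates can be tried out with the Python sketch below. It again assumes the standard library's xml.etree.ElementTree, whose XPath subset covers ., .., //, and simple predicates on attributes, child elements, and position, but not comparisons such as <. The PEOPLE string is an abridged copy of the example document, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example document.
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name><first_name>Richard</first_name><last_name>Feynman</last_name></name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>
"""

people = ET.fromstring(PEOPLE)  # context node: the <people> element

# "//" reaches elements at any depth below the context node.
all_last_names = [e.text for e in people.findall(".//last_name")]

# ".." steps back up to the parent: the persons that have a hobby child.
hobbyists = [p.get("id") for p in people.findall(".//hobby/..")]

# Predicate on an attribute value.
born_1912 = [p.get("id") for p in people.findall("person[@born='1912']")]

# Predicate on position: the first profession of each person.
first_jobs = [e.text for e in people.findall("person/profession[1]")]

print(all_last_names)  # ['Turing', 'Feynman']
print(hobbyists)       # ['p4567']
print(born_1912)       # ['p342']
print(first_jobs)      # ['computer scientist', 'physicist']
```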

Unabbreviated Location Paths

The following abbreviated XPath expression

//person[@born < 1950]/name

is the same as the following unabbreviated XPath expression

/descendant-or-self::node()/child::person[attribute::born < 1950]/child::name
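
To make the three unabbreviated steps concrete, here is a hand-written walk in Python that mirrors them one by one. This is only a sketch: it assumes the standard library's xml.etree.ElementTree (which cannot evaluate the expression directly), starts from the people element rather than the XPath root node (the result is the same here, since the root node's only element child is people), and uses a hypothetical helper name and an abridged copy of the example document.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example document.
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name><first_name>Richard</first_name><last_name>Feynman</last_name></name>
  </person>
</people>
"""

people = ET.fromstring(PEOPLE)


def names_of_people_born_before(root, year):
    """Mirror the three unabbreviated location steps by hand."""
    result = []
    # Step 1: descendant-or-self::node(), restricted here to element nodes.
    for node in root.iter():
        # Step 2: child::person[attribute::born < year]
        for person in node.findall("person"):
            born = person.get("born")
            # XPath converts the attribute value to a number before comparing.
            if born is not None and float(born) < year:
                # Step 3: child::name
                result.extend(person.findall("name"))
    return result


before_1950 = [n.findtext("last_name")
               for n in names_of_people_born_before(people, 1950)]
before_1915 = [n.findtext("last_name")
               for n in names_of_people_born_before(people, 1915)]
print(before_1950)  # ['Turing', 'Feynman']
print(before_1915)  # ['Turing']
```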

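There are 13 axes: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, and self.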

Check out the illustration at Massimo Franceschet's "Caffè XML" web page.

More XPath Expressions

More about Predicates

XPath Location Paths

White Space Handling in XML

The following is taken from Section 2.10 of Extensible Markup Language (XML) 1.0 (Fourth Edition), a W3C Recommendation.

In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose values are one or both of "default" and "preserve". For example:
<!ATTLIST poem  xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>
The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.

Language Identification in XML

The following is taken from Section 2.12 of Extensible Markup Language (XML) 1.0 (Fourth Edition), a W3C Recommendation.

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, ...
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
  <l>Juristerei, und Medizin</l>
  <l>und leider auch Theologie</l>
  <l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>
The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang.
A simple declaration for xml:lang might take the form
xml:lang CDATA #IMPLIED
but specific default values may also be given, if appropriate. ...
<!ATTLIST poem   xml:lang CDATA 'fr'>
<!ATTLIST gloss  xml:lang CDATA 'en'>
<!ATTLIST note   xml:lang CDATA 'en'>
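
The inheritance rule for xml:lang quoted above can be sketched in a few lines of Python. The sketch assumes the standard library's xml.etree.ElementTree, which reports the attribute under the reserved XML namespace name; the book element, the sample document, and the function name are hypothetical, loosely based on the quoted examples.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml" prefix to this namespace name.
XML_NS = "http://www.w3.org/XML/1998/namespace"
LANG = "{%s}lang" % XML_NS

# Hypothetical sample document.
doc = """
<book xml:lang="en">
  <p>What colour is it?</p>
  <sp xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>
</book>
"""


def effective_languages(elem, inherited=None):
    """Yield (tag, language) pairs, applying xml:lang inheritance."""
    # An element's own xml:lang overrides whatever it would inherit.
    lang = elem.get(LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)


langs = list(effective_languages(ET.fromstring(doc)))
print(langs)  # [('book', 'en'), ('p', 'en'), ('sp', 'de'), ('l', 'de')]
```

The same walk would work for xml:space, since the specification describes the same scoping rule for it.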

What Does encoding="UTF-8" Mean?

The following is taken from the section "What is UTF-8?" in the UTF-8 and Unicode FAQ for Unix/Linux, written by Markus Kuhn.

Unicode is just a code table that assigns integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2 or 4 bytes sequences. The official terms for these encodings are UCS-2 and UCS-4, respectively. ...
Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like "\0" or "/" which have a special meaning in filenames and other C library function parameters. ... The UTF-8 encoding described in RFC 3629 does not have these problems. ...
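
The problem with "\0" and "/" bytes is easy to reproduce in Python. This is a small sketch, assuming that Python's "utf-16-be" codec is an acceptable stand-in for UCS-2 (the two agree for characters up to U+FFFF; Python has no codec named UCS-2); the two sample characters are chosen here for illustration.

```python
# "A" is U+0041 and "į" is U+012F; neither is a NUL or a slash.
text = "A\u012f"

ucs2 = text.encode("utf-16-be")  # two bytes per character
utf8 = text.encode("utf-8")

print(ucs2)  # b'\x00A\x01/'
print(utf8)  # b'A\xc4\xaf'

# The two-byte form contains both troublesome byte values...
ucs2_has_problem_bytes = b"\0" in ucs2 and b"/" in ucs2
# ...while the UTF-8 form of the same text contains neither.
utf8_has_problem_bytes = b"\0" in utf8 or b"/" in utf8
```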

What Does encoding="UTF-8" Mean? Continued

The following is taken from the section "What is UTF-8?" in the UTF-8 and Unicode FAQ for Unix/Linux, written by Markus Kuhn.

UTF-8 has the following properties:

What Does encoding="UTF-8" Mean? Continued

The following is taken from the section "What is UTF-8?" in the UTF-8 and Unicode FAQ for Unix/Linux, written by Markus Kuhn.

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
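
The table and the two worked examples can be checked with a short Python sketch of the encoding rule. The function name and the rows layout are illustrative choices, not part of the FAQ. Note that the 5- and 6-byte rows come from the original definition shown in the table; RFC 3629 restricts UTF-8 to code points up to U+10FFFF, so Python's own encoder never produces them.

```python
def utf8_encode(code_point):
    """Encode one code point using the bit patterns in the table above."""
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return bytes([code_point])  # single byte: 0xxxxxxx
    # (upper limit of the row, total bytes, marker bits of the first byte)
    rows = [
        (0x7FF, 2, 0b11000000),
        (0xFFFF, 3, 0b11100000),
        (0x1FFFFF, 4, 0b11110000),
        (0x3FFFFFF, 5, 0b11111000),
        (0x7FFFFFFF, 6, 0b11111100),
    ]
    # Pick the first row that fits, i.e. the shortest possible sequence.
    for limit, length, marker in rows:
        if code_point <= limit:
            break
    # Fill the continuation bytes (10xxxxxx) from the least-significant end.
    tail = []
    for _ in range(length - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # The remaining high bits go into the first byte, after its marker.
    return bytes([marker | code_point] + tail[::-1])


print(utf8_encode(0x00A9).hex())  # c2a9
print(utf8_encode(0x2260).hex())  # e289a0
```

For code points up to U+10FFFF (excluding surrogates) the result should agree with Python's built-in chr(cp).encode("utf-8").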