2008-10-15
Tyng-Ruey Chuang
trc@iis.sinica.edu.tw
Institute of Information Science
Academia Sinica, Taipei, Taiwan
We shall study XML Path Language (XPath), Version 1.0,
which is a W3C Recommendation published on November 16, 1999.
Link:
http://www.w3.org/TR/xpath
We shall also look into several details of the XML Recommendation:
xml:space attribute
xml:lang attribute
encoding
Useful resources on the Web:
An XML document can be viewed as a tree; XPath is a language for selecting nodes out of this tree. In the view of XPath, the tree contains 7 types of node:
the root node
element nodes
text nodes
attribute nodes
namespace nodes
processing instruction nodes
comment nodes
The root node and each element node have an ordered list of child nodes. Note that XPath operates on an XML document after CDATA sections, entity references, and document type declarations have been merged into the document.
<?xml version="1.0"?>
<?xml-stylesheet type="application/xml" href="people.xsl"?>
<!DOCTYPE people [
  <!ATTLIST homepage xlink:type CDATA #FIXED "simple"
                     xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">
  <!ATTLIST person id ID #IMPLIED>
]>
<people>
  <person born="1912" died="1954" id="p342">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <!-- Did the word computer scientist exist in Turing's day? -->
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
    <homepage xlink:href="http://www.turing.org.uk/"/>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>
Note: The above XML document is taken from Chapter 9 of XML in a Nutshell (3rd edition), by Elliotte Rusty Harold & W. Scott Means.
Note: The above figure is taken from Chapter 9 of XML in a Nutshell (3rd edition), by Elliotte Rusty Harold & W. Scott Means.
Correction: In the above figure,
text/xsl should be application/xml
xref:href="http://www.turinng.org.uk" should be xlink:href="http://www.turing.org.uk"
Richard should appear in element first_name
If an attribute was declared as #IMPLIED, and the attribute was not specified on the element, then the element's attribute set does not contain a node for the attribute.
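As a rough illustration of these node types, the sketch below walks a trimmed copy of the people document with Python's xml.dom.minidom and counts nodes by kind. The DOM is not the XPath data model: its Document node merely plays the part of the root node, and namespace nodes are not exposed. The names SRC, KIND, and count_nodes are made up for this example.

```python
from xml.dom import minidom, Node

# Trimmed copy of the people document (no DTD, one person only).
SRC = """<?xml version="1.0"?>
<?xml-stylesheet type="application/xml" href="people.xsl"?>
<people>
  <person born="1912" died="1954" id="p342">
    <!-- Did the word computer scientist exist in Turing's day? -->
    <profession>computer scientist</profession>
  </person>
</people>"""

# Assumed mapping from DOM node types to XPath node kinds.
KIND = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}

def count_nodes(node, counts):
    """Recursively count nodes by kind, including attributes of elements."""
    kind = KIND[node.nodeType]
    counts[kind] = counts.get(kind, 0) + 1
    if node.nodeType == Node.ELEMENT_NODE:
        counts["attribute"] = counts.get("attribute", 0) + node.attributes.length
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts

counts = count_nodes(minidom.parseString(SRC), {})
for kind, n in sorted(counts.items()):
    print(kind, n)
```

Note that the white space between elements shows up as text nodes, and that the XML declaration on the first line is not counted as a processing instruction.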
An XPath location path identifies a set of nodes in a document. A location path is built out of successive location steps. Each step is evaluated relative to a particular node in the document called the context node.
The root location path: /
Child element location steps. Examples:
/people/person
/people/person/name
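These two child steps can be tried with Python's xml.etree.ElementTree, which implements only a subset of XPath and evaluates paths relative to an element rather than to the root node. The sketch assumes a trimmed copy of the people document (no DTD, comment, or homepage element); PEOPLE is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the people document (assumption: DTD, comment, homepage omitted).
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name><first_name>Richard</first_name><middle_initial>P</middle_initial><last_name>Feynman</last_name></name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>
"""

root = ET.fromstring(PEOPLE)            # root is the <people> element itself

persons = root.findall("./person")      # corresponds to /people/person
names = root.findall("./person/name")   # corresponds to /people/person/name

print([p.get("id") for p in persons])
print([n.find("last_name").text for n in names])
```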
Attribute location steps. Examples:
/people/person/@born
/people/person/@id
The node test comment() is true for any comment node, text() is true for any text node, and processing-instruction() is true for any processing instruction node. Example:
/people/person/profession/text()
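ElementTree cannot return attribute or text nodes from a path, so the attribute steps and the text() step above can only be emulated: select the owning elements, then read the attribute or the text. This is a sketch on a trimmed copy of the people document; PEOPLE is a made-up name, and reading .text stands in for text() only because each profession element holds a single text node.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the people document (assumption: only what the example needs).
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <profession>physicist</profession>
  </person>
</people>
"""

root = ET.fromstring(PEOPLE)

# Emulates /people/person/@born: values of the born attributes.
born = [p.get("born") for p in root.findall("./person")]

# Emulates /people/person/@id.
ids = [p.get("id") for p in root.findall("./person")]

# Emulates /people/person/profession/text().
professions = [e.text for e in root.findall("./person/profession")]

print(born, ids, professions)
```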
Wildcards: * matches all nodes of the principal node type (elements, except on the attribute and namespace axes), node() matches any node, and @* matches all attribute nodes. Examples:
/people/person/*
/people/person/@*
Multiple matches with |. Example:
//first_name | //last_name
XPath expressions can be formed just like Unix path expressions using /, .., ., and //. The last one, which reaches all descendants of the context node (it abbreviates /descendant-or-self::node()/), is new! Examples:
//name//*
//name/../profession/.
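The // and .. steps are among the few abbreviations that ElementTree's XPath subset understands, so both examples can be approximated directly. A sketch on a trimmed copy of the people document (PEOPLE is a made-up name; the paths start with . because ElementTree evaluates them relative to the people element, and the trailing /. is dropped):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the people document (assumption: DTD, comment, homepage omitted).
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name><first_name>Richard</first_name><middle_initial>P</middle_initial><last_name>Feynman</last_name></name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>
"""

root = ET.fromstring(PEOPLE)

# Approximates //name//* : every element below any name element.
under_name = root.findall(".//name//*")

# Approximates //name/../profession : profession siblings of name elements.
professions = root.findall(".//name/../profession")

print([e.tag for e in under_name])
print([e.text for e in professions])
```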
Each step in a location path may (but does not have to) have a predicate that selects from the node-set current at that step in the expression. Examples:
//person[@id = "p4567"]
//person[@born <= 1920 and @born >= 1915]
/people/person[position()=1]/profession[position()=3]
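The three predicates can be approximated in Python. ElementTree supports attribute-equality and positional predicates, but not numeric comparisons, so the second example is written as an ordinary filter. This is a sketch on a trimmed copy of the people document; PEOPLE and born_between are names made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the people document (assumption: only what the example needs).
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <profession>physicist</profession>
  </person>
</people>
"""

root = ET.fromstring(PEOPLE)

# //person[@id = "p4567"]
by_id = root.findall(".//person[@id='p4567']")

def born_between(person, low, high):
    """Python stand-in for [@born >= low and @born <= high]."""
    born = person.get("born")
    return born is not None and low <= int(born) <= high

# //person[@born <= 1920 and @born >= 1915], filtered in Python
in_range = [p for p in root.iter("person") if born_between(p, 1915, 1920)]

# /people/person[position()=1]/profession[position()=3]
third = root.findall("./person[1]/profession[3]")

print([p.get("id") for p in by_id])
print([p.get("id") for p in in_range])
print([e.text for e in third])
```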
The following abbreviated XPath expression
//person[@born < 1950]/name
is the same as the following unabbreviated XPath expression
/descendant-or-self::node()/child::person[attribute::born < 1950]/child::name
There are 13 axes:
ancestor
ancestor-or-self
attribute
child
descendant
descendant-or-self
following
following-sibling
namespace
parent
preceding
preceding-sibling
self
Check out the illustration at Massimo Franceschet's "Caffè XML" web page.
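To get a feel for what some of these axes select, the sketch below computes four of them by hand over an ElementTree tree. It covers element nodes only (ElementTree does not model text, comment, or attribute nodes as separate objects), and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the people document (assumption: first person only).
PEOPLE = """
<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
</people>
"""

root = ET.fromstring(PEOPLE)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """ancestor axis (a reverse axis): nearest ancestor first."""
    while node in parent_of:
        node = parent_of[node]
        yield node

def descendants(node):
    """descendant axis: all elements below node, in document order."""
    for child in node:
        yield child
        yield from descendants(child)

def following_siblings(node):
    """following-sibling axis: later children of the same parent."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(node) + 1:]

def preceding_siblings(node):
    """preceding-sibling axis (a reverse axis): nearest sibling first."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[:siblings.index(node)][::-1]

context = root.find("./person/profession")   # the first profession element
print([e.tag for e in ancestors(context)])
print([e.text for e in following_siblings(context)])
print([e.tag for e in preceding_siblings(context)])
print([e.tag for e in descendants(root.find("person"))])
```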
/child::doc/child::chapter[position()=5]/child::section[position()=2]
selects the second section of the fifth chapter of the doc document element
child::para[attribute::type='warning'][position()=5]
selects the fifth para child of the context node that has a type attribute with value warning
child::*[self::chapter or self::appendix]
selects the chapter and appendix children of the context node
Location Paths:
LocationPath ::= RelativeLocationPath | AbsoluteLocationPath
AbsoluteLocationPath ::= '/' RelativeLocationPath? | AbbreviatedAbsoluteLocationPath
RelativeLocationPath ::= Step | RelativeLocationPath '/' Step | AbbreviatedRelativeLocationPath
Location Steps:
Step ::= AxisSpecifier NodeTest Predicate* | AbbreviatedStep
AxisSpecifier ::= AxisName '::' | AbbreviatedAxisSpecifier
Axes:
AxisName ::= 'ancestor' | 'ancestor-or-self' | 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'namespace' | 'parent' | 'preceding' | 'preceding-sibling' | 'self'
Node Tests and Predicates:
NodeTest ::= NameTest | NodeType '(' ')' | 'processing-instruction' '(' Literal ')'
Predicate ::= '[' PredicateExpr ']'
PredicateExpr ::= Expr
Abbreviations:
AbbreviatedAbsoluteLocationPath ::= '//' RelativeLocationPath
AbbreviatedRelativeLocationPath ::= RelativeLocationPath '//' Step
AbbreviatedStep ::= '.' | '..'
AbbreviatedAxisSpecifier ::= '@'?
The following is taken from Section 2.10 of Extensible Markup Language (XML) 1.0 (Fourth Edition), a W3C Recommendation.
In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications.
In valid documents, this attribute, like any other, must be declared if it is used.
When declared, it must be given as an enumerated type whose values are one or both of
"default" and "preserve". For example:
<!ATTLIST poem xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>
The value "default" signals that applications' default white-space
processing modes are acceptable for this element; the value "preserve"
indicates the intent that applications preserve all the white space. This
declared intent is considered to apply to all elements within the content
of the element where it is specified, unless overridden with
another instance of the xml:space attribute.
The following is taken from Section 2.12 of Extensible Markup Language (XML) 1.0 (Fourth Edition), a W3C Recommendation.
In document processing, it is often useful to identify the natural or formal
language in which the content is written. A special attribute
named xml:lang may be inserted in documents to specify the language
used in the contents and attribute values of any element in an XML document.
In valid documents, this attribute, like any other, must be declared if it is used.
The values of the attribute are language identifiers as defined by
[IETF RFC 3066],
Tags for the Identification of Languages, ...
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
  <l>Juristerei, und Medizin</l>
  <l>und leider auch Theologie</l>
  <l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>
The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang.
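This inheritance rule amounts to a walk from an element up through its ancestors until an xml:lang attribute is found. A minimal sketch with ElementTree, which reports the attribute under the predefined XML namespace; the wrapper element text and the names XML_LANG, DOC, and effective_lang are made up for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml prefix to its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical wrapper element <text> around fragments like those above.
DOC = """
<text xml:lang="en">
  <p>The quick brown fox jumps over the lazy dog.</p>
  <p xml:lang="en-GB">What colour is it?</p>
  <sp who="Faust" xml:lang="de">
    <l>Habe nun, ach! Philosophie,</l>
  </sp>
</text>
"""

root = ET.fromstring(DOC)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def effective_lang(elem):
    """Return the nearest xml:lang value on elem or its ancestors, else None."""
    while elem is not None:
        if XML_LANG in elem.attrib:
            return elem.attrib[XML_LANG]
        elem = parent_of.get(elem)
    return None

paragraphs = root.findall("p")
line = root.find("sp/l")
print(effective_lang(paragraphs[0]), effective_lang(paragraphs[1]), effective_lang(line))
```

The same walk would serve for xml:space, which is inherited in the same way.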
A simple declaration for xml:lang might take the form
xml:lang CDATA #IMPLIED
but specific default values may also be given, if appropriate. ...
<!ATTLIST poem xml:lang CDATA 'fr'>
<!ATTLIST gloss xml:lang CDATA 'en'>
<!ATTLIST note xml:lang CDATA 'en'>
What Does encoding="UTF-8" Mean?
The following is taken from the section "What is UTF-8?" in UTF-8 and Unicode FAQ for Unix/Linux, written by Markus Kuhn.
Unicode is just a code table that assigns integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2 or 4 bytes sequences. The official terms for these encodings are UCS-2 and UCS-4, respectively. ...
Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like "\0" or "/" which have a special meaning in filenames and other C library function parameters. ... The UTF-8 encoding described in RFC 3629 does not have these problems. ...
UTF-8 has the following properties:
- UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
- All UCS characters > U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
- The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
- All possible 2^31 UCS codes can be encoded.
- UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.
- The sorting order of Bigendian UCS-4 byte strings is preserved.
- The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
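Several of these properties can be checked directly against Python's built-in UTF-8 codec. The sketch below also shows the resynchronization idea: a reader dropped into the middle of a byte stream skips continuation bytes (0x80 to 0xBF) until it reaches the start of the next character. The function names and sample strings are made up for this example.

```python
def is_continuation(byte):
    """True for the non-leading bytes of a multibyte sequence (0x80 to 0xBF)."""
    return 0x80 <= byte <= 0xBF

def resync(data, offset):
    """Advance offset to the next byte that starts a character."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset

# ASCII compatibility: the same bytes under both encodings.
ascii_same = "XPath".encode("ascii") == "XPath".encode("utf-8")

# Every byte of a non-ASCII character has its most significant bit set.
msb_set = all(b & 0x80 for b in "\u2260".encode("utf-8"))

# Resynchronization: offset 2 falls inside the three-byte sequence for U+2260.
data = "a\u2260b\u00a9".encode("utf-8")    # 61 E2 89 A0 62 C2 A9
start = resync(data, 2)
tail = data[start:].decode("utf-8")

print(ascii_same, msb_set, start, tail)
```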
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
U-00000000 – U-0000007F: | 0xxxxxxx |
U-00000080 – U-000007FF: | 110xxxxx 10xxxxxx |
U-00000800 – U-0000FFFF: | 1110xxxx 10xxxxxx 10xxxxxx |
U-00010000 – U-001FFFFF: | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-00200000 – U-03FFFFFF: | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-04000000 – U-7FFFFFFF: | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
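The table and the two worked examples can be reproduced with a short encoder. This is a sketch of the table as quoted, including the five- and six-byte forms; RFC 3629 restricts UTF-8 to code points up to U+10FFFF, so Python's own codec produces at most four bytes. The names FORMS and utf8_encode are made up for this example.

```python
# (number of bytes, exclusive upper bound of the range, leading-byte prefix)
FORMS = [
    (2, 0x800, 0xC0),        # 110xxxxx
    (3, 0x10000, 0xE0),      # 1110xxxx
    (4, 0x200000, 0xF0),     # 11110xxx
    (5, 0x4000000, 0xF8),    # 111110xx
    (6, 0x80000000, 0xFC),   # 1111110x
]

def utf8_encode(code_point):
    """Encode one code point following the byte-sequence table above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        return bytes([code_point])           # 0xxxxxxx
    for nbytes, limit, prefix in FORMS:
        if code_point < limit:
            break                            # shortest form that fits
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (code_point & 0x3F))   # 10xxxxxx, low bits first
        code_point >>= 6
    # The remaining high bits go into the leading byte.
    return bytes([prefix | code_point] + tail[::-1])

print(utf8_encode(0x00A9).hex())   # copyright sign
print(utf8_encode(0x2260).hex())   # not equal to
```

For code points that Python can represent, the result can be compared with chr(code_point).encode("utf-8").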