Abstract

The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

Status of this document

This document has been reviewed by W3C Members and other interested parties and has been endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited as a normative reference from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web. It is a product of the W3C XML Activity, details of which can be found at http://www.w3.org/XML. A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.

This specification uses the term URI, which is defined by [Berners-Lee et al.], a work in progress expected to update [IETF RFC1738] and [IETF RFC1808].

The list of known errors in this specification is available at http://www.w3.org/XML/xml-19980210-errata. Please report errors in this document to xml-editor@w3.org.

Extensible Markup Language (XML) 1.0

Table of Contents

1. Introduction
    1.1 Origin and Goals
    1.2 Terminology
2. Documents
    2.1 Well-Formed XML Documents
    2.2 Characters
    2.3 Common Syntactic Constructs
    2.4 Character Data and Markup
    2.5 Comments
    2.6 Processing Instructions
    2.7 CDATA Sections
    2.8 Prolog and Document Type Declaration
    2.9 Standalone Document Declaration
    2.10 White Space Handling
    2.11 End-of-Line Handling
    2.12 Language Identification
3. Logical Structures
    3.1 Start-Tags, End-Tags, and Empty-Element Tags
    3.2 Element Type Declarations
        3.2.1 Element Content
        3.2.2 Mixed Content
    3.3 Attribute-List Declarations
        3.3.1 Attribute Types
        3.3.2 Attribute Defaults
        3.3.3 Attribute-Value Normalization
    3.4 Conditional Sections
4. Physical Structures
    4.1 Character and Entity References
    4.2 Entity Declarations
        4.2.1 Internal Entities
        4.2.2 External Entities
    4.3 Parsed Entities
        4.3.1 The Text Declaration
        4.3.2 Well-Formed Parsed Entities
        4.3.3 Character Encoding in Entities
    4.4 XML Processor Treatment of Entities and References
        4.4.1 Not Recognized
        4.4.2 Included
        4.4.3 Included If Validating
        4.4.4 Forbidden
        4.4.5 Included in Literal
        4.4.6 Notify
        4.4.7 Bypassed
        4.4.8 Included as PE
    4.5 Construction of Internal Entity Replacement Text
    4.6 Predefined Entities
    4.7 Notation Declarations
    4.8 Document Entity
5. Conformance
    5.1 Validating and Non-Validating Processors
    5.2 Using XML Processors
6. Notation

Appendices

A. References
    A.1 Normative References
    A.2 Other References
B. Character Classes
C. XML and SGML (Non-Normative)
D. Expansion of Entity and Character References (Non-Normative)
E. Deterministic Content Models (Non-Normative)
F. Autodetection of Character Encodings (Non-Normative)
G. W3C XML Working Group (Non-Normative)

1. Introduction

Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.

1.1 Origin and Goals

XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996. It was chaired by Jon Bosak of Sun Microsystems with the active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the WG's contact with the W3C.

The design goals for XML are:
This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.

1.2 Terminology

The terminology used to describe XML documents is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of an XML processor:
2. Documents

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures must nest properly, as described in "4.3.2 Well-Formed Parsed Entities".

2.1 Well-Formed XML Documents

A textual object is a well-formed XML document if:
Matching the
As a consequence of this, for each non-root
element

2.2 Characters

A parsed entity contains text, a sequence of characters, which may represent markup or character data. A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646. The use of "compatibility characters", as defined in section 6.8 of [Unicode], is discouraged.
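
As a rough illustration of the legal-character rule, the Python sketch below tests code points against the ranges of the XML 1.0 Char production (tab, line feed, carriage return, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The helper names `is_xml_char` and `first_illegal_char` are hypothetical and are not defined by this specification.

```python
# Code point ranges of the XML 1.0 Char production.
# The helper names below are illustrative assumptions, not part of the spec.
_XML_CHAR_RANGES = (
    (0x9, 0x9),          # tab
    (0xA, 0xA),          # line feed
    (0xD, 0xD),          # carriage return
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch is a legal XML 1.0 character."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _XML_CHAR_RANGES)


def first_illegal_char(text: str) -> int:
    """Return the index of the first illegal character in text, or -1 if none."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return -1
```

A caller might run `first_illegal_char` over text before serializing it, since a control character such as #x1 cannot appear in a well-formed document.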
The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in "4.3.3 Character Encoding in Entities".

2.3 Common Syntactic Constructs

This section defines some symbols used widely in the grammar.
Characters are classified for convenience as letters, digits, or other characters. Letters consist of an alphabetic or syllabic base character possibly followed by one or more combining characters, or of an ideographic character. Full definitions of the specific characters in each class are given in "B. Character Classes". A Name is a token beginning with a letter
or one of a few punctuation characters, and continuing with letters, digits,
hyphens, underscores, colons, or full stops, together known as name characters.
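
The shape of a Name can be sketched in Python as follows. This is an ASCII-only approximation: it assumes the permitted leading punctuation is the underscore and the colon, and it ignores the non-ASCII letters, digits, combining characters, and extenders listed in "B. Character Classes". The function name `looks_like_name` is hypothetical.

```python
import re

# ASCII-only approximation of a Name: a letter, underscore, or colon,
# followed by any number of name characters (letters, digits, full stops,
# hyphens, underscores, colons). A real processor must use the full
# Unicode-based character classes instead.
_ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")


def looks_like_name(token: str) -> bool:
    """Return True if token has the shape of an (ASCII-only) XML Name."""
    return _ASCII_NAME.fullmatch(token) is not None
```

For instance, `looks_like_name("a-b.c")` holds, while a token starting with a digit or a hyphen is rejected.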
Names beginning with the string "xml", in any combination of upper and lower case, are reserved for standardization in this or future versions of this specification.

Note: The colon character within XML names is reserved for experimentation with name spaces. Its meaning is expected to be standardized at some future point, at which point those documents using the colon for experimental purposes may need to be updated. (There is no guarantee that any name-space mechanism adopted for XML will in fact use the colon as a name-space delimiter.) In practice, this means that authors should not use the colon in XML names except as part of name-space experiments, but that XML processors should accept the colon as a name character.

An Nmtoken (name token) is any mixture of name characters.
Literal data is any quoted string not containing the quotation mark used
as a delimiter for that string. Literals are used for specifying the content
of internal entities (
2.4 Character Data and Markup

Text consists of intermingled character data and markup. Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions.

All text that is not markup constitutes the character data of the document.

The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. They are also legal within the literal entity value of an internal entity declaration; see "4.3.2 Well-Formed Parsed Entities". If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup. In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
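
The escaping rules for character data can be tried out with Python's standard `xml.sax.saxutils.escape`, which replaces the ampersand, the left angle bracket, and the right angle bracket. The wrapper names `escape_text` and `escape_attr_value` are hypothetical. This is a minimal sketch of one way to produce escaped text, not a required implementation.

```python
from xml.sax.saxutils import escape


def escape_text(data: str) -> str:
    """Escape character data for use in element content.

    escape() rewrites & as &amp;, < as &lt;, and > as &gt;.
    """
    return escape(data)


def escape_attr_value(data: str) -> str:
    """Escape character data for use inside a quoted attribute value.

    In addition to the replacements above, both quote characters are
    replaced so the result is safe with either delimiter.
    """
    return escape(data, {"'": "&apos;", '"': "&quot;"})
```

Escaping both quote characters is stricter than necessary, since only the character used as the delimiter has to be escaped, but it keeps the helper independent of the quoting style chosen by the serializer.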
2.5 Comments

Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor may, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double-hyphen) must not occur within comments.
An example of a comment:
2.6 Processing Instructions

Processing instructions (PIs) allow documents to contain instructions for applications.
PIs are not part of the document's character data, but must be passed through to the application.
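
To see PIs being handed to an application, the Python sketch below collects them with the standard `xml.dom.minidom` parser, which keeps processing instructions as nodes in the tree. The function name and the PI targets `render` and `app` in the usage note are hypothetical.

```python
from xml.dom import Node, minidom


def list_processing_instructions(xml_text: str):
    """Return (target, data) pairs for every PI in the document, in order."""
    document = minidom.parseString(xml_text)
    found = []

    def walk(node):
        # Record the node if it is a PI, then descend into its children.
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append((node.target, node.data))
        for child in node.childNodes:
            walk(child)

    walk(document)
    return found
```

Called on `'<?render mode="draft"?><doc><?app do-something?></doc>'`, the function reports one PI before the root element and one inside it. The XML declaration is not reported, since it is not a processing instruction.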
The PI begins with a target (

2.7 CDATA Sections

CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>".
Within a CDATA section, only the An example of a CDATA section, in which "
2.8 Prolog and Document Type Declaration

XML documents may, and should, begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML document, well-formed but not valid:
and so is this:
The version number "

The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute-value pairs with its logical structures. XML provides a mechanism, the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.

The document type declaration must appear before the first element in the document.
The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together. A markup declaration is an element type declaration, an attribute-list declaration, an entity declaration, or a notation declaration. These declarations may be contained in whole or in part within parameter entities, as described in the well-formedness and validity constraints below. For fuller information, see "4. Physical Structures".
The markup declarations may be made up in whole or in part of the replacement text of parameter entities. The productions later in this specification for individual nonterminals (

Validity Constraint: Root Element Type

Validity Constraint: Proper Declaration/PE Nesting

Well-Formedness Constraint: PEs in Internal Subset

Like the internal subset, the external subset and any external parameter entities referred to in the DTD must consist of a series of complete markup declarations of the types allowed by the non-terminal symbol

The external subset and external parameter entities also differ from the internal subset in that in them, parameter-entity references are permitted within markup declarations, not only between markup declarations.

An example of an XML document with a document type declaration:

The system identifier "

The declarations can also be given locally, as in this example:
If both the external and internal subsets are used, the internal subset is considered to occur before the external subset. This has the effect that entity and attribute-list declarations in the internal subset take precedence over those in the external subset.

2.9 Standalone Document Declaration

Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity.
In a standalone document declaration, the value "

If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value "

Any XML document for which

Validity Constraint: Standalone Document Declaration
An example XML declaration with a standalone document declaration:
2.10 White Space Handling

In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines, denoted by the nonterminal

An XML processor must always pass all characters in a document that are not markup through to the application. A validating XML processor must also inform the application which of these characters constitute white space appearing in element content.

A special attribute named

The value "

The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.

2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).

To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains either the literal two-character sequence "#xD#xA" or a standalone literal #xD, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line breaks to #xA on input, before parsing.)

2.12 Language Identification

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named
The

There may be any number of

It is customary to give the language code in lower case, and the country code (if any) in upper case. Note that these values, unlike other names in XML documents, are case insensitive. For example:

The intent declared with

A simple declaration for

but specific default values may also be given, if appropriate. In a collection of French poems for English students, with glosses and notes in English, the xml:lang attribute might be declared this way:
3. Logical Structures

Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its "generic identifier" (GI), and may have a set of attribute specifications. Each attribute specification has a name and a value.
This specification does not constrain the semantics, use, or (beyond syntax) names of the element types and attributes, except that names beginning with a match to

Well-Formedness Constraint: Element Type Match

Validity Constraint: Element Valid
3.1 Start-Tags, End-Tags, and Empty-Element Tags

The beginning of every non-empty XML element is marked by a start-tag.

The

Well-Formedness Constraint: Unique Att Spec

Validity Constraint: Attribute Value Type

Well-Formedness Constraint: No External Entity References

Well-Formedness Constraint: No < in Attribute Values

An example of a start-tag:
The end of every element that begins with a start-tag must be marked by an end-tag containing a name that echoes the element's type as given in the start-tag:
An example of an end-tag:
The text between the start-tag and end-tag is called the element's content:
If an element is empty, it must be represented either by a start-tag immediately followed by an end-tag or by an empty-element tag. An empty-element tag takes a special form:
Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY.

Examples of empty elements:
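
The two ways of writing an empty element can be compared with a parser. The Python sketch below uses the standard `xml.etree.ElementTree` module, and the helper name `is_empty_element` is hypothetical. It only illustrates that both forms yield an element with no content. It is not one of the specification's own examples.

```python
import xml.etree.ElementTree as ET


def is_empty_element(xml_text: str) -> bool:
    """Return True if the root element parsed from xml_text has no content.

    An element has no content when it has neither child elements nor
    character data.
    """
    element = ET.fromstring(xml_text)
    return len(element) == 0 and not element.text
```

Both `is_empty_element("<br></br>")` and `is_empty_element("<br/>")` hold, whereas an element with character data such as `<p>text</p>` is not empty.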
3.2 Element Type Declarations

The element structure of an XML document may, for validation purposes, be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content.

Element type declarations often constrain which element types can appear as children of the element. At user option, an XML processor may issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error.

An element type declaration takes the form:
where the

Validity Constraint: Unique Element Type Declaration

Examples of element type declarations:
3.2.1 Element Content

An element type has element content when elements of that type must contain only child elements (no character data), optionally separated by white space (characters matching the nonterminal

where each

The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if an element in the document can match more than one occurrence of an element type in the content model. For more information, see "E. Deterministic Content Models".

Validity Constraint: Proper Group/PE Nesting

Examples of element-content models:
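
As an informal illustration of the matching rule, a content model behaves much like a regular expression over the sequence of child element types. The Python sketch below translates a model written with the sequence (,), choice (|), and repetition (?, *, +) operators into a regular expression and tests an element's children against it. The function names and the sample models are assumptions made for illustration. The sketch handles ASCII names only and does not check the determinism requirement.

```python
import re
import xml.etree.ElementTree as ET

# Tokens of a content model: element type names and the operator characters.
_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9._\-]*|[(),|?*+]")


def content_model_to_regex(model: str):
    """Translate a content model into a regex over space-terminated names."""
    parts = []
    for token in _TOKEN.findall(model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in ")|?*+":
            parts.append(token)
        else:
            # An element type name; the trailing space keeps names distinct.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(element, model: str) -> bool:
    """Return True if the child element types of element match the model."""
    child_types = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(child_types) is not None
```

With the hypothetical model `(front, body, back?)`, an element whose children are `front` followed by `body` matches, while the reverse order does not.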
3.2.2 Mixed Content

An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements. In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:

where the

Validity Constraint: No Duplicate Types

Examples of mixed content declarations:
3.3 Attribute-List Declarations

Attributes are used to associate name-value pairs with elements. Attribute specifications may appear only within start-tags and empty-element tags; thus, the productions used to recognize them appear in "3.1 Start-Tags, End-Tags, and Empty-Element Tags". Attribute-list declarations may be used:
Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:
The

When more than one

3.3.1 Attribute Types

XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types have varying lexical and semantic constraints, as noted:
Validity Constraint: ID

Validity Constraint: One ID per Element Type

Validity Constraint: ID Attribute Default

Validity Constraint: IDREF

Validity Constraint: Entity Name

Validity Constraint: Name Token

Enumerated attributes can take one of a list of values provided in the declaration. There are two kinds of enumerated types:

A

Validity Constraint: Notation Attributes

Validity Constraint: Enumeration

For interoperability, the same

3.3.2 Attribute Defaults

An attribute declaration provides information on whether the attribute's presence is required, and if not, how an XML processor should react if a declared attribute is absent in a document.

In an attribute declaration,

Validity Constraint: Required Attribute

Validity Constraint: Attribute Default Legal

Validity Constraint: Fixed Attribute Default

Examples of attribute-list declarations:

3.3.3 Attribute-Value Normalization

Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize it as follows: