Design and Implementation of a validating XML parser in Haskell: Master's thesis; University of Applied Sciences Wedel | ||
---|---|---|
Prev | Next |
The following chapter gives a short introduction to XML and Haskell. Readers who are familiar with these techniques can skip this chapter.
The Extensible Markup Language (XML) was standardized by the World Wide Web Consortium (W3C) at the beginning of 1998. The current standard is described by the Extensible Markup Language (XML) 1.0 (Second Edition) W3C Recommendation [WWW01]. XML forms a subset of the Structured Generalized Markup Language (SGML), which is extremely complex. XML itself is a simplified form of SGML. Both languages are meta languages for defining markup languages.
XML documents are structured hierarchical. The documents base on a simple data model: trees with attributes as some extra structure.
A Document Type Definition (DTD) can be used to define formal requirements to the structure of an XML document. It specifies a grammar for a class of documents. These are rules for the elements, attributes and other data and how they are logically related in an XML document. Each document, which corresponds to a DTD, is called an instance of this DTD.
Today XML is often used to define and exchange data structures application-independent. A central role plays the validation of XML documents. Validation of XML documents describes the check if the document corresponds to the constraints of a DTD. The application that reads an XML document and performs validation is called XML processor. Not all XML processors are validating ones. Applications that do not need these checks can use non-validating XML processors for performance reasons.
Every application that processes information from XML documents needs an XML processor, which reads an XML document and provides access to its content and structure to the processing application. Two parse methods exist: event based and tree based parse method. They define essential differences for the communication between the XML processor and applications.
The document is processed sequentially; the most popular parse method is SAX - the Simple API for XML [WWW10]. Each element is an event trigger, which can initiate an action on the processing application. In terms of processing speed and memory consumption this is the most performant technique for accessing an XML document in sequential order. The event based parsed method is also very efficient for processing only a few specific elements of a document. The main disadvantage is that it is more difficult to generate output with a different order of elements than the input, because some kind of memory is needed.
The document is transformed from a sequential into a hierarchical structure by the XML processor (e.g. DOM/JDOM). This tree model of the document is stored in memory and can be accessed by the processing application. The main advantage of this parse method is that it supports random access to the document. Random access is needed for example for Extensible Stylesheet Language Transformations (XSLT) [WWW07]. XSLT implements a tree-oriented transformation language for transmuting XML documents.
The tree representation can be used for traversing the document several times or construct output in a different order from the input. The main disadvantage of this method is that the tree model is only accessible when parsing of the document is complete. For large documents this can be slow and memory intensive.
There exist two classes of correctness for XML documents. The first class form the well-formed documents. These documents meet the syntax rules of the XML 1.0 specification [WWW01]. If a document does not meet these syntax rules, it is not an XML document and is rejected by any XML processor.
Some XML syntax rules:
Every document must have a single unique root element that encloses all other elements.
All elements must be correctly nested.
All elements must have corresponding start and end tags.
Attribute values must be enclosed within single or double quotes.
The second class form the valid documents. These documents are well-formed and meet in addition the constraints of a DTD (described in Section 1.1.4) or a Schema [WWW08]. This class of correctness is required if XML documents must be exchanged between applications reliably. Only validating XML processors perform these checks, non-validating XML processors ignore the DTD.
XML is a meta language for specifying markup languages, the DTD is a tool to create and describe this language. A DTD defines the elements, attributes, entities and notations used in an XML document. Further it defines a context free grammar that states which is the allowed content of an element. The DTD can point to an external DTD subset containing declarations, it may contain the declarations directly in an internal DTD subset, or even both. This section will describe the definitions of elements and their attributes. They form the basic concepts of XML documents. Entity declarations that are mainly used for replacing text and notation declarations that are used for specifying the format of special contents are not explained. For a complete overview see [WWW04].
A DTD declares all allowed elements in an XML document. Elements are defined by <!ELEMENT> declarations. This declaration specifies the name of the element and its content model. Only one declaration for each element is allowed, multiple declarations are forbidden. The order of the declarations is irrelevant.
<!ELEMENT name content model>
ELEMENT - Keyword
name - Element name, respectively its tag names
content model - Definition of the allowed types of child elements (text data, other elements) and their order
For defining the content model of elements there exist some operators. These operators are described shortly in the following.
Table 1-2. Occurrence indicators
(no indicator) | Element must appear exactly one time |
? | Element is optional, 0 or 1 |
* | Element is optional and repeatable, 0 or more |
+ | Element must appear at least one time, 1 or more |
Table 1-4. #PCDATA in content model
(#PCDATA) | Element has only text data or nothing as content |
(#PCDATA | el1 | el2)* | Mixed content. The element can contain text data and elements. #PCDATA must be listed in the declaration at first. It cannot be defined that allowed child elements have to appear in a certain order, or that they at least must appear. |
#PCDATA is the abbreviation for "parsed character data" and means that text data is inspected by the XML parser for eventual markup. If for example the text data contains entity references, the entities are expanded.
Attributes define additional name-value pairs for elements. The attributes are defined by <!ATTLIST> declarations. It defines the attribute type and default values.
<!ATTLIST target_element attr_name attr_type default>
ATTLIST - Keyword
target_element - Element name
attr_name - Attribute name
attr_type - Type of the attribute value, or list of values
default - Keyword or default value
Table 1-6. Type of the attribute value
CDATA | Text data |
ID | Unique identifier, value of attribute must be unique within the XML document |
IDREF | Reference to an ID, must match the value of some ID attribute |
IDREFS | One or more references to IDs, separated by white space |
NMTOKEN | Name token, attribute value must only consist of letters, digits and some extra characters |
NMTOKENS | One or more name tokens, separated by white space |
ENTITY | Name of an unparsed entity, declared in the DTD |
ENTITIES | One or more names of unparsed entities declared in the DTD, separated by white space |
(a | b | c) | Enumeration, list of possible attribute values |