Design and Implementation of a validating XML parser in Haskell: Master's thesis; University of Applied Sciences Wedel | ||
---|---|---|
Prev | Chapter 2. Package hdom | Next |
XML documents are structured hierarchically and can be represented as trees. Trees can be modeled easily with lists, because trees can be seen as a generalization of lists: nested lists. As shown in the first chapter, Haskell has some nifty facilities for representing and manipulating lists.
XML documents consist of many different logical units, which can be represented by different types. Functional languages like Haskell are well-equipped to process typed tree-structured data by using pattern matching and recursion.
Basically there exist two different approaches for representing XML documents in Haskell. One approach is to use a generic tree structure by which all XML documents can be handled.
Another approach is to use problem specific types that are derived from a specific Document Type Definition. Both approaches have several advantages and disadvantages. They are discussed in depth in the paper "Haskell and XML: Generic Combinators or Type-Based Translation?" by Malcolm Wallace and Colin Runciman [WWW22].
The Haskell XML Toolbox uses a generic data structure for representing whole XML documents. Its data model is more general than the one used in HaXml. The differences are discussed in depth in Section 5.2. A generic data model is the only practical approach for implementing a generic XML parser and a framework for processing any XML documents.
The data type that models a generic tree structure is the data type NTree, defined in the module NTree. It defines an n-ary tree: a node with a list of children that are also nodes. If a node is a leaf it has an empty child list. The data type NTrees is an abbreviation for the node list. Together both types are mutually recursive and form a multi-branch tree structure.
data NTree node = NTree node (NTrees node) deriving (Eq, Ord, Show, Read) type NTrees node = [NTree node] |
Building on the very general tree data type NTree are defined the types XmlTree and XmlTrees that are defined in the module XmlTreeTypes. They are used to specify a general recursive data type for XML documents. The data, which can be stored in the nodes, must be of type XNode.
type XmlTree = NTree XNode type XmlTrees = NTrees XNode |
The algebraic data type XNode is used for representing all kinds of XML's logical units. The type is described in detail in the next section.
Before introducing the data type XNode, some type synonyms have to be presented which are used in the definitions of XNode. The type TagAttrl defines the attribute list of an element, it is a list of name-value pairs.
type TagAttrl = [TagAttr] type TagAttr = (AttrName, AttrValue) |
Several type synonyms exist to make the definitions more understandable:
type TagName = String type AttrName = String type AttrValue = String |
XML documents consist of several different logical units like elements, text data, comments, DTD definitions or entity references. Each kind can be represented by an own type in Haskell so that processing an XmlTree can be implemented very efficient by the use of pattern matching. The algebraic data type XNode defines the basic nodes and leafs for all kinds of XML's logical units.
Together with the algebraic type DTDElem, which defines constructors for the DTD declarations, this data model allows a uniform processing of the whole XML document by filter functions, described in Section 2.3.
data XNode = XText String | XCharRef Int | XEntityRef String | XCmt String | XCdata String | XPi TagName TagAttrl | XTag TagName TagAttrl | XDTD DTDElem TagAttrl | XError Int String deriving (Eq, Ord, Show, Read) |
Description of the constructors:
Ordinary text data (leaf)
Note that an XText node does not necessarily represent a maximal contiguous sequence of characters. The parser may split text data up into multiple XText nodes. The parser produces XText nodes from white space occurring before or after tags, too.
There exist special filter functions in the module EditFilters for summing up sequences of XText nodes into one node and removing XText nodes from the tree, which contain white space only.
Character reference (leaf)
XML syntax: &#nnn; (nnn is a hexadecimal or decimal representation of the characters code point in ISO/IEC 10646)
Entity reference (leaf)
XML syntax: &...;
Comment (leaf)
CDATA section (leaf)
Processing Instruction (leaf)
TagName stores the name of the processing instruction. If name is xml, it has the attributes "version", "encoding" and "standalone". Otherwise there is only the attribute "value" which stores a list of the processing instruction attributes.
Element (node or leaf)
Inner node if the element has content, respectively children. Leaf if the element is empty. TagName stores the name of the element. The attribute list TagAttrl contains all attributes of the element.
DTD element (node or leaf)
The algebraic data type DTDElem is used to specify the concrete kind of the element, e.g. element declaration, attribute declaration or entity declaration. The attribute list TagAttrl contains almost all information from the DTD declarations (see Section 2.2.2.2).
Error (node or leaf)
Not an XML component. Internal extension for representing errors with error level and error message. The error level can be of type: warning, error or fatal error. The error node can have a child list with the nodes where the error occurred.
Because a DTD is quite complex, there exists an extra algebraic data type DTDElem for representing DTD elements in the XmlTree. For each DTD declaration there exists an own type. The nodes store almost all information from DTD declarations in their attribute list. The library XmlKeywords provides constants for the names and values of these attributes. Variable contents like the definition of the content model or the names in the value-list of enumerated attribute types are modeled using the node's child list in combination with helper nodes.
data DTDElem = DOCTYPE | ELEMENT | CONTENT | ATTLIST | ENTITY | NOTATION | NAME | PENTITY | PEREF | CONDSECT deriving (Eq, Ord, Show, Read) |
Description of the constructors:
The root node of a DTD, has XDTD nodes as children.
Attributes:
name - Name of the root element
SYSTEM - Reference to external subset (optional)
PUBLIC - Reference to external subset (optional)
Element declaration. If the element is of type children or mixed, it has a list of XDTD CONTENT nodes as children. These children describe the content model of the element.
Attributes:
name - Element name
type - EMPTY | ANY | #PCDATA | children | mixed
Not an XML component. Specifies the content model of an element if the element is of type children or mixed. An XDTD CONTENT node has a list of children, which can be of type XDTD CONTENT if there is some grouping in the content model, or of type XDTD NAME to specify the name of a child element.
Attributes:
kind - seq | choice
modifier - "" | ? | * | +
Attribute declaration. If the attribute declaration is of type NOTATION or ENUMERATION, this node has a list of children of type XDTD NAME to specify the enumerated values.
Attributes:
name - Element name
value - Attribute name
kind - #REQUIRED | #IMPLIED | #DEFAULT | #FIXED
type - CDATA | ID | IDREF | IDREFS | ENTITY | ENTITIES | NMTOKEN | NMTOKENS | NOTATION | ENUMERATION
default - Default value (optional)
General or unparsed entity declaration. If the entity is a general entity, it has a child list with nodes of type XDTD XCharRef and XDTD XText that specify the replacement text.
Attributes:
name - Entity name
SYSTEM - Reference to external file (optional and only for external or unparsed entities)
NDATA - reference to a notation (optional and only for unparsed entities)
Notation declaration.
Attributes:
name - Notation name
SYSTEM - Reference to external application
Not an XML component. Used by the attribute declaration XDTD ATTLIST to specify the list of values for attributes of type NOTATION and ENUMERATION.
Used by the node XDTD CONTENT to specify element names in content models.
Attributes:
name - Value of the name
The following node types are not accessible by any application. They are for internal use of the XML parser and handled by it. However it is possible to turn of this processing of the parser, so they are shortly described.
Parameter entity declaration.
Attributes:
name - Name of parameter entity
value - Value of parameter entity
SYSTEM - Reference to external file (optional)
PUBLIC - Reference to external file (optional)
Parameter entity reference.
Attributes:
PERef - Name of referenced parameter entity
Node for conditional sections. The child list contains the DTD declarations that are defined in the conditional section.
Attributes:
type - INCLUDE, IGNORE or parameter entity reference (%...;)
The following example demonstrates how an XML document is transformed into the generic tree structure XmlTree.
Example 2-1. Input document
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?> <!DOCTYPE a [ <!ATTLIST a att1 CDATA #IMPLIED> <!ELEMENT a (b, c?)> <!ELEMENT b EMPTY> <!ELEMENT c (#PCDATA)> ]> <a att1="test"> <b/> <c>hello world</c> </a> |
The following graph shows the XmlTree representation of the document after parsing. The predefined general entities amp, lt, gt, apos and quot are automatically added to the DTD by the XML parser. A dummy node with the name "/" forms the root node of the tree so that the DTD, the document subset, processing instructions and comments can be modeled by one tree.
White space before or after tags is represented by XText nodes. There exist filter functions in the module EditFilters for removing XText nodes containing white space only from the tree if these nodes are not necessary for further processing.
Example 2-2. Input document as XmlTree
---XTag "/" [("standalone","yes"),("version","1.0"),("source","example.xml"), | ("encoding","UTF-8")] | +---XPi "xml" [("version","1.0"),("encoding","UTF-8"),("standalone","yes")] | +---XText "\n\n" | +---XDTD DOCTYPE [("name","a")] | | | +---XDTD ENTITY [("name","lt")] | | | | | +---XCharRef 38 | | | | | +---XText "#60;" | | | +---XDTD ENTITY [("name","gt")] | | | | | +---XCharRef 62 . . . . (Definitions of the other predefined entities amp, apos and quot) . . | +---XDTD ATTLIST [("name","a"),("value","att1"),("default","42"), | | ("kind","#DEFAULT"),("type","CDATA")] | | | +---XDTD ELEMENT [("name","a"),("type","children")] | | | | | +---XDTD CONTENT [("modifier",""),("kind","seq")] | | | | | +---XDTD NAME [("name","b")] | | | | | +---XDTD CONTENT [("modifier","?"),("kind","seq")] | | | | | +---XDTD NAME [("name","c")] | | | +---XDTD ELEMENT [("name","b"),("type","EMPTY")] | | | +---XDTD ELEMENT [("name","c"),("type","#PCDATA")] | +---XText "\n\n" | +---XTag "a" [("att1","test")] | | | +---XText "\n " | | | +---XTag "b" [] | | | +---XText "\n " | | | +---XTag "c" [] | | | | | +---XText "hello world" | | | +---XText "\n" | +---XText "\n" |
The following listing shows the internal representation of the document subset. The root node with the name "a" has a list of attributes and a list of children.