Design and Implementation of a validating XML parser in Haskell: Master's thesis; University of Applied Sciences Wedel | ||
---|---|---|
Prev | Chapter 4. Package hvalidator | Next |
The following section describes how the algorithm for validating the element content works. It is implemented in the modules XmlRE and RE. The most commonly-used technique to solve this problem is based on finite automaton.
Another algorithm, based on derivatives of regular expressions, is much better suited for implementation in a functional language. This technique does not need to backtrack, like some NFA-based algorithms do and when combined with lazy evaluation and memorization it can be very efficient.
The algorithm of derivating regular expressions was first described by Janusz A. Brzozowski [Brzozowski64]. Joe English suggested using this algorithm for validating element contents in XML. He gave a short description of his ideas combined with a small Haskell program fragment [WWW26].
Mark Hopkins wrote some small C programs based on this algorithm. A description of the algorithm and working source code is available from the comp.compilers archive [WWW27]. James Clark describes an algorithm for RELAX NG validation [WWW28] that bases on derivatives, too.
Validating the element content is an instance of regular expression matching problem. Basically the content model of an element describes a regular expression. If given the regular expression e and the content c it has to be checked if c is in L(e)? L(e) denotes the language accepted by the regular expression e.
Informal described, the algorithm works as follows. A regular expression is applied to a sentence. The remaining regular expression is checked if it can be derived to an empty sequence. If it can be derived to an empty sequence, the sentence matches the regular expression. Otherwise the sentence is not described by the regular expression.
The algorithm can be seen as working on an infinite DFA with states labeled by regular expressions instead of symbols as in the canonical construction. The final states are those states for which the regular expression can be derived to an empty sequence. The transition function is a function, which derives the regular expression by a single symbol.
The regular expression foobar should be derived with respect to the string "foo". After applying the regular expression to this string, the remaining regular expression is bar. This regular expression cannot be derived to an empty string, because it expects the characters 'b', 'a' and 'r'. This means that the string bar is not described by the regular expression. (Short form: foo\foobar = bar)
The regular expression a* should be derived with respect to the string "aa". After applying the regular expression to the string, the regular expression a* remains. This expression can be derived to an empty string, because the *-operator indicates that none or more 'a' characters are expected. This means that the regular expression matches the string. (Short form: aa\a* = a*)
The following code was adopted from Joe English [WWW26], but has been extended to support meaningful error messages, to accept any single symbol (dot operator) and to work with data types of XmlTree.
Regular expressions are defined by the algebraic data type RE.
data RE a = RE_ZERO String --' L(0) = {} (empty set, or failure with message) | RE_UNIT --' L(1) = { [] } (empty sequence, or success) | RE_SYM a --' L(x) = { [x] } | RE_DOT --' accept any single symbol | RE_REP (RE a) --' L(e*) = { [] } `union` L(e+) | RE_PLUS (RE a) --' L(e+) = { x ++ y | x <- L(e), y <- L(e*) } | RE_OPT (RE a) --' L(e?) = L(e) `union` { [] } | RE_SEQ (RE a) (RE a) --' L(e,f) = { x ++ y | x <- L(e), y <- L(f) } | RE_ALT (RE a) (RE a) --' L(e|f) = L(e) `union` L(f) deriving (Show, Eq) |
The core of the algorithm is the delta ("derivative") function. It takes a regular expression re and a symbol s. The function returns a new expression that matches all sentences that are a suffix of sentences in L(re) beginning with s. The function works by simple case wise analysis.
delta :: (Eq a, Show a) => RE a -> a -> RE a delta re s = case re of RE_ZERO m -> re_zero m RE_UNIT -> re_zero ("Symbol "++ show s ++" unexpected.") RE_SYM sym | s == sym -> re_unit | otherwise -> re_zero ("Symbol "++ show sym ++" expected, but "++ "symbol "++ show s ++" found.") RE_REP e -> re_seq (delta e s) (re_rep e) RE_PLUS e -> re_seq (delta e s) (re_rep e) RE_OPT e -> delta e s RE_SEQ e f | nullable e -> re_alt (re_seq (delta e s) f) (delta f s) | otherwise -> re_seq (delta e s) f RE_ALT e f -> re_alt (delta e s) (delta f s) RE_DOT -> re_unit |
The function matches derives a regular expression with respect to a sentence by using the higher-order function foldl which folds a function into a list of values.
matches :: (Eq a, Show a) => RE a -> [a] -> RE a matches e = foldl delta e |
The auxiliary function nullable tests if a regular expression matches the empty sequence. If regular expressions are compound ones and the outer expression cannot be itself derived to an empty sequence, the sub-expressions have to be tested if they are nullable.
nullable :: (Show a) => RE a -> Bool nullable (RE_ZERO _) = False nullable RE_UNIT = True nullable (RE_SYM _) = False nullable (RE_REP _) = True nullable (RE_PLUS e) = nullable e nullable (RE_OPT _) = True nullable (RE_SEQ e f) = nullable e && nullable f nullable (RE_ALT e f) = nullable e || nullable f nullable RE_DOT = True |
The function checkRE checks if an input matched a regular expression. The regular expression RE_ZERO indicates that an error occured during derivation and that the regular expression does not match the given sentence.
checkRE :: (Show a) => RE a -> String checkRE (RE_ZERO m) = m checkRE re | nullable re = "" | otherwise = "Input must match " ++ printRE re |
The constructor functions re_seq, re_opt and so on, used in the delta function, are not shown. Without these constructor functions the intermediate regular expressions would grow very rapidly. The constructors simplify them at each step, e.g. by replacing a sequence of epsilons (RE_UNIT) by a single epsilon.
The described delta function is not the one used for validating XmlTree types. There exists a special function for validating XmlTree nodes in the module XmlRE.
The algorithm of derivating regular expressions can be expressed in Haskell very naturally: regular expressions are represented by a special data type, sentences are just a list of symbols. Pattern-matching and Haskell's standard functions for list processing allow a very clear and short implementation.
The algorithm can handle ambiguous regular expressions very well. If a regular expression consists of sub-expressions, all sub-expressions are derived by a certain symbol. If for example the ambiguous regular expression (a , b) | (a , c) is derived with respect to the sentence "ab", the algorithm will construct the regular expression (b | c) after deriving the original expression by the symbol a. After deriving the remaining regular expression by the symbol b, the regular expression (RE_UNIT | RE_ZERO) remains. This expression can be derived to the empty sequence and therefore the regular expression matches the sentence.
Haskell's lazy evaluation avoids the evaluation of the whole regular expression. The expression has only to be evaluated as much that nullable can calculate an answer.