Time and space profiling is very useful for optimizing programs and removing bottlenecks. Profiling support has already been added to the XML parser, but this information has not been exploited yet; at the moment, the main focus of the parser lies on a clear and simple design.
The validating XML parser can be analyzed with the time and space profiling system of the Glasgow Haskell Compiler (GHC). Cost centers have been added to all main computations, so that it can be measured which parts of the parser consume the most time and memory.
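Cost centers are attached to expressions with GHC's SCC pragma. The following is a minimal, self-contained sketch of this mechanism; the function and its label are hypothetical and not taken from the parser's sources:

    module Main where

    -- Hypothetical sketch: a cost centre labelled "substituteEntities"
    -- is attached to the body of the function, so its time and
    -- allocation show up under that label in the profile.
    -- Compile with profiling enabled, e.g.:  ghc -prof -auto-all Main.hs
    substituteEntities :: String -> String
    substituteEntities s = {-# SCC "substituteEntities" #-} map id s

    main :: IO ()
    main = putStrLn (substituteEntities "<!DOCTYPE html SYSTEM 'xhtml1-strict.dtd'>")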
The tool Profiling can be used to run these tests with the XML parser. It can be built by executing make Profiling in the root directory of the Haskell XML Toolbox.
Profiling must be executed with an XML file as test input and several different parameters that influence GHC's profiling behavior. These parameters are described in depth in the documentation of GHC.
Four predefined profiling targets have been added to the Makefile in the root directory of the Haskell XML Toolbox. The input file for Profiling is set via the variable PROFILING_INPUT.
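Since PROFILING_INPUT is an ordinary Makefile variable, it can also be overridden on the command line when one of the profiling targets is invoked. A sketch, where the target placeholder and the file name are purely illustrative (the real target names are defined in the Makefile):

    make <profiling-target> PROFILING_INPUT=examples/xhtml.xml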
Predefined profiling targets:
Time and allocation profiling - Generate profiling information in XML format and display it with GHC's ghcprof tool.
Time and allocation profiling - Generate profiling information in text format and display it with less.
Heap profiling - Break down the graph by the cost-center stack which produced the data.
Heap profiling - Break down the graph by the retainer set.
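These four targets correspond to the standard runtime-system options of binaries compiled with GHC's profiling support. The following invocations are a sketch of what such targets typically run; the file name is illustrative, and the exact flags used by the Makefile may differ:

    ./Profiling xhtml.xml +RTS -px -RTS   # time/allocation profile in XML, for ghcprof
    ./Profiling xhtml.xml +RTS -p  -RTS   # time/allocation profile as text (Profiling.prof)
    ./Profiling xhtml.xml +RTS -hc -RTS   # heap profile by cost-centre stack (Profiling.hp)
    ./Profiling xhtml.xml +RTS -hr -RTS   # heap profile by retainer set
    hp2ps -c Profiling.hp                 # render a heap profile to PostScript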
This section shows how to interpret the results of the profiling tests. The W3C Recommendation of the Extensible HyperText Markup Language (XHTML), together with its Strict DTD, was chosen as test input. The test machine was a 1.6 GHz Pentium 4 with 256 MB of memory, running Debian 3.0.
The results from time and space profiling show that most time and memory are consumed by the function processDTD. This function performs the substitution of parameter entities in the internal and external DTD subset. The next most resource-intensive function is parseXmlDoc, which parses an XML document and constructs an XmlTree representation from it.
    total time  =  1.26 secs  (63 ticks @ 20 ms)
    total alloc = 102,664,004 bytes  (excludes profiling overheads)

    COST CENTRE              MODULE              %time  %alloc
    processDTD               HdomParser           27.0    45.6
    parseXmlDoc              HdomParser           19.0    29.3
    buildAllValFcts          DocValidation        14.3     5.0
    validateAttributes       DTDValidation        12.7     0.5
    processGeneralEntities   HdomParser            7.9     9.6
    MAIN                     MAIN                  6.3     5.0
    buildAllTransFcts        DocTransformation     4.8     1.8
    IdVal.validateIds        Validation            4.8     1.1
    validateElements         DTDValidation         3.2     0.2
    getXmlContents           HdomParser            0.0     1.1
The ranking of the cost centers might change dramatically if the profiling tests are performed with other documents. In this case, parameter entities are used exhaustively in the Strict DTD of XHTML, so it is not surprising that the function handling their substitution takes the most time and space. Another important aspect is the ratio of the DTD subset to the document subset: if, for example, the DTD subset is very large, its processing functions might also take a large percentage of the consumed time and space.
Heap profiling produces graphs that show the memory consumption of the cost centers over the execution time. These graphs show very clearly how the execution behavior of Haskell programs differs from that of programs written in imperative languages. Haskell uses lazy evaluation: arguments to functions are evaluated only when they are needed, and functions compute their results only as far as their callers demand them.
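A tiny, self-contained illustration of this demand-driven behavior (not taken from the parser):

    module Main where

    -- An infinite list of squares. Under lazy evaluation the list is
    -- never built as a whole; elements are produced one by one, only
    -- when the consumer (take 5) demands them.
    squares :: [Integer]
    squares = map (\n -> n * n) [1 ..]

    main :: IO ()
    main = print (take 5 squares)   -- prints [1,4,9,16,25]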
Lazy evaluation means that Haskell programs do not have a strict sequential evaluation order; instead, execution resembles a stream. The graphs show this behavior very clearly: first the parsing functions start; after enough output has been produced, processing of the DTD begins; finally, the validation process starts its calculations. The graphs show that almost all functions work in parallel once they have enough input.