Module Xml_lexer

module Xml_lexer: sig .. end
Simple XML lexer

This module provides an ocamllex lexer for XML files. It only supports the most basic features of the XML specification.

The lexer silently skips the following constructs: comments, processing instructions, the XML prolog, and the doctype declaration.

The predefined entities (&amp;, &lt;, etc.) are supported. Replacement text for other entities whose value consists of character data can be supplied to the lexer (see Xml_lexer.entities). Internal entity declarations are not taken into account (the lexer just skips the doctype declaration).

CDATA sections and character references are supported.

See Xml_lexer.strip_ws about whitespace handling.

Error reporting

type error =
| Illegal_character of char
| Bad_entity of string
| Unterminated of string
| Tag_expected
| Attribute_expected
| Other of string
val error_string : error -> string
exception Error of error * int
This exception is raised when an error occurs during parsing. The int argument indicates the character position in the buffer. Note that some non-conforming XML documents might not trigger an error.
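Because the exception carries only a character position, callers often want to turn a failure into a value and translate the position into a line and column. The following is a minimal sketch, not part of the library: the module `Mirror` redeclares the error type and exception locally so the block compiles on its own, and `try_lex` and `line_col` are hypothetical helper names. It also assumes the position is a 0-based offset into the input string, which the documentation does not state.

```ocaml
(* Local mirror of the error type and exception documented above.
   Real code would refer to Xml_lexer.error and Xml_lexer.Error. *)
module Mirror = struct
  type error =
    | Illegal_character of char
    | Bad_entity of string
    | Unterminated of string
    | Tag_expected
    | Attribute_expected
    | Other of string

  exception Error of error * int
end

(* Hypothetical helper: run [f] and turn a lexer error into a result value. *)
let try_lex (f : unit -> 'a) : ('a, Mirror.error * int) result =
  match f () with
  | v -> Ok v
  | exception Mirror.Error (err, pos) -> Error (err, pos)

(* Hypothetical helper: convert a character offset into a 1-based
   (line, column) pair. ASSUMPTION: [pos] is a 0-based offset into [src]. *)
let line_col (src : string) (pos : int) : int * int =
  let line = ref 1 and col = ref 1 in
  let stop = min pos (String.length src) in
  for i = 0 to stop - 1 do
    if src.[i] = '\n' then (incr line; col := 1) else incr col
  done;
  (!line, !col)
```

With the real library, `f` would typically be `fun () -> Xml_lexer.token lexbuf`, and the error could be passed to `Xml_lexer.error_string` for a message.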


type token =
| Tag of string * (string * string) list * bool (*Tag (name, attributes, empty) denotes an opening tag with the specified name and attributes. If empty is true, the tag ended in "/>", meaning that it has no sub-elements.*)
| Chars of string (*Some text between the tags*)
| Endtag of string (*A closing tag*)
| EOF (*End of input*)
The type of the XML document elements
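Since the tokens form a flat stream, a consumer usually has to rebuild the element nesting itself. The sketch below shows one possible way to do that over a list of tokens. The `token` type is redeclared locally so the block is self-contained; `tree`, `Mismatch`, `siblings` and `tree_of_tokens` are hypothetical names, not part of Xml_lexer. It assumes that a tag with empty = true is not followed by a matching Endtag.

```ocaml
(* Local mirror of Xml_lexer.token so this sketch compiles on its own. *)
type token =
  | Tag of string * (string * string) list * bool
  | Chars of string
  | Endtag of string
  | EOF

(* Hypothetical tree representation built from the token stream. *)
type tree =
  | Element of string * (string * string) list * tree list
  | Text of string

(* Raised when a closing tag is missing or does not match. *)
exception Mismatch of string

(* Parse sibling nodes until a closing tag or the end of input.
   Returns the nodes, the name of the closing tag that stopped the scan
   (None at end of input), and the remaining tokens. *)
let rec siblings (toks : token list) : tree list * string option * token list =
  match toks with
  | [] | EOF :: _ -> ([], None, [])
  | Endtag name :: rest -> ([], Some name, rest)
  | Chars s :: rest ->
      let (nodes, stop, rest') = siblings rest in
      (Text s :: nodes, stop, rest')
  | Tag (name, attrs, true) :: rest ->
      (* ASSUMPTION: an empty tag has no matching Endtag token. *)
      let (nodes, stop, rest') = siblings rest in
      (Element (name, attrs, []) :: nodes, stop, rest')
  | Tag (name, attrs, false) :: rest ->
      let (children, stop, rest') = siblings rest in
      if stop <> Some name then raise (Mismatch name);
      let (nodes, stop', rest'') = siblings rest' in
      (Element (name, attrs, children) :: nodes, stop', rest'')

(* Build the top-level nodes; a stray closing tag is an error. *)
let tree_of_tokens (toks : token list) : tree list =
  match siblings toks with
  | (nodes, None, _) -> nodes
  | (_, Some name, _) -> raise (Mismatch name)
```

The recursion is not tail-recursive, which keeps the sketch short but may matter for very deep or very long documents.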
val strip_ws : bool Pervasives.ref
Whitespace handling: if strip_ws is true (the default), whitespace adjacent to a tag is ignored. Character data consisting only of whitespace is thus suppressed (i.e. Chars "" tokens are skipped).
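Because strip_ws is a global reference, changing it affects every later call to the lexer. One way to limit the change to a single piece of work is to set the flag temporarily and restore it afterwards. The helper below is a generic sketch with the hypothetical name `with_ref`; it is not part of Xml_lexer and relies on `Fun.protect`, available from OCaml 4.08.

```ocaml
(* Hypothetical helper: set [r] to [v] while [f] runs, then restore the
   previous value even if [f] raises. *)
let with_ref (r : 'a ref) (v : 'a) (f : unit -> 'b) : 'b =
  let saved = !r in
  r := v;
  Fun.protect ~finally:(fun () -> r := saved) f
```

With the real library, a call might look like `with_ref Xml_lexer.strip_ws false (fun () -> Xml_lexer.token lexbuf)`.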
val entities : (string * string) list Pervasives.ref
An association list of entity definitions. Initially, it contains the predefined entities (["amp", "&"; "lt", "<"; ...]).
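Extra entities can be made known to the lexer by updating this reference before lexing starts. The sketch below uses a local stand-in for the reference so that it compiles on its own; `define_entity` is a hypothetical helper, and the entity name and replacement text are invented for illustration. Names are written without the leading `&` and trailing `;`, following the form shown above.

```ocaml
(* Stand-in for Xml_lexer.entities so the sketch compiles alone. Only the
   two entries shown in the documentation are listed here. *)
let entities : (string * string) list ref = ref ["amp", "&"; "lt", "<"]

(* Hypothetical helper: register [name] with replacement [text], replacing
   any earlier definition of the same name. *)
let define_entity (name : string) (text : string) : unit =
  entities := (name, text) :: List.remove_assoc name !entities

(* Illustrative entity; the name and text are made up. *)
let () = define_entity "copyright" "(c) Example Corp"
```

Against the real module, the same update would presumably be written as `Xml_lexer.entities := ("copyright", "(c) Example Corp") :: !Xml_lexer.entities`.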
val token : Lexing.lexbuf -> token
The entry point of the lexer.
Raises Error in case of an invalid XML document
Returns the next token in the buffer
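A common pattern is to call the entry point repeatedly until it returns EOF. The sketch below captures that loop in a hypothetical helper, `collect`, which takes the token source as a function so that the block stays self-contained. The `token` type is redeclared locally, and `of_list` is a stand-in source used only for demonstration.

```ocaml
(* Local mirror of Xml_lexer.token so this sketch compiles on its own;
   real code would use the library's type directly. *)
type token =
  | Tag of string * (string * string) list * bool
  | Chars of string
  | Endtag of string
  | EOF

(* Hypothetical helper: call [next] until EOF and return the tokens in
   order. With the real lexer, [next] would be
   [fun () -> Xml_lexer.token lexbuf]. *)
let collect (next : unit -> token) : token list =
  let rec loop acc =
    match next () with
    | EOF -> List.rev acc
    | tok -> loop (tok :: acc)
  in
  loop []

(* Stand-in token source: replays a fixed list, then returns EOF forever. *)
let of_list (toks : token list) : unit -> token =
  let rest = ref toks in
  fun () ->
    match !rest with
    | [] -> EOF
    | t :: tl -> rest := tl; t
```

With the real module, the lexbuf would typically come from `Lexing.from_string` or `Lexing.from_channel`, and any Error exception raised by the lexer would propagate out of `collect`.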