Module Xml_lexer

module Xml_lexer: sig .. end

Simple XML lexer

This module provides an ocamllex lexer for XML files. It only supports the most basic features of the XML specification.

The lexer ignores the following 'events' altogether: comments, processing instructions, the XML prolog, and the doctype declaration.

The predefined entities (&amp;, &lt;, etc.) are supported. The replacement text for other entities whose entity value consists of character data can be provided to the lexer (see Xml_lexer.entities). Internal entity declarations are not taken into account (the lexer just skips the doctype declaration).

CDATA sections and character references are supported.

See Xml_lexer.strip_ws about whitespace handling.

Error reporting

type error =
  | Illegal_character of char
  | Bad_entity of string
  | Unterminated of string
  | Tag_expected
  | Attribute_expected
  | Other of string

val error_string : error -> string

exception Error of error * int

This exception is raised when an error occurs during parsing. The int argument indicates the character position in the buffer. Note that some non-conforming XML documents might not trigger an error.
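Since the exception only carries a character position, a caller will often want to turn it into a line and column for display. The helper below is a minimal sketch and not part of Xml_lexer: the name line_col_of_pos is hypothetical, and it assumes the position is a 0-based offset into the same text that was handed to the lexer.

```ocaml
(* Hypothetical helper, not part of Xml_lexer.
   Converts a character offset into a 1-based (line, column) pair.
   Assumption: [pos] is a 0-based offset into [src], the text that was lexed. *)
let line_col_of_pos (src : string) (pos : int) : int * int =
  (* Clamp the offset so an out-of-range position cannot raise. *)
  let limit = max 0 (min pos (String.length src)) in
  let line = ref 1 and bol = ref 0 in
  for i = 0 to limit - 1 do
    if src.[i] = '\n' then begin
      incr line;
      bol := i + 1 (* offset of the first character of the new line *)
    end
  done;
  (!line, limit - !bol + 1)

(* Possible use, assuming Xml_lexer is linked and [src] holds the input:

   try ignore (Xml_lexer.token lexbuf)
   with Xml_lexer.Error (err, pos) ->
     let line, col = line_col_of_pos src pos in
     Printf.eprintf "line %d, column %d: %s\n"
       line col (Xml_lexer.error_string err)
*)
```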

API

type token =
  | Tag of string * (string * string) list * bool
      (* Tag (name, attributes, empty) denotes an opening tag with the specified name and attributes. If empty is true, the tag ended in "/>", meaning that it has no sub-elements. *)
  | Chars of string
      (* Some text between the tags *)
  | Endtag of string
      (* A closing tag *)
  | EOF
      (* End of input *)

The type of the XML document elements
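Because the lexer yields a flat stream of tokens, nesting has to be rebuilt by the caller. The following is a sketch of one way to do that, not something the module provides: the tree type and the functions nodes and forest_of_tokens are hypothetical, and the token type is redeclared locally (mirroring the one above) only so the example runs on its own.

```ocaml
(* Local copy of the token type so the sketch is self-contained;
   with the real module, use the Xml_lexer constructors instead. *)
type token =
  | Tag of string * (string * string) list * bool
  | Chars of string
  | Endtag of string
  | EOF

(* Hypothetical tree type, not defined by Xml_lexer. *)
type tree =
  | Element of string * (string * string) list * tree list
  | Text of string

(* Read sibling nodes until an end tag, EOF, or the end of the list.
   Returns the nodes in document order and the tokens left unread. *)
let rec nodes toks acc =
  match toks with
  | Tag (name, attrs, true) :: rest ->
      (* Empty tag: no children and no matching Endtag to consume. *)
      nodes rest (Element (name, attrs, []) :: acc)
  | Tag (name, attrs, false) :: rest ->
      let children, rest = nodes rest [] in
      (match rest with
       | Endtag n :: rest when n = name ->
           nodes rest (Element (name, attrs, children) :: acc)
       | _ -> failwith ("missing </" ^ name ^ ">"))
  | Chars s :: rest -> nodes rest (Text s :: acc)
  | (Endtag _ :: _ | EOF :: _ | []) -> (List.rev acc, toks)

let forest_of_tokens toks =
  match nodes toks [] with
  | forest, ([] | [EOF]) -> forest
  | _ -> failwith "unexpected closing tag"
```

The empty flag is what lets the builder decide whether to recurse: a Tag with empty set to true is a complete element, while one with empty set to false must be closed by an Endtag with the same name.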

val strip_ws : bool Pervasives.ref

Whitespace handling: if strip_ws is true (the default), whitespace next to a tag is ignored. Character data consisting only of whitespace is thus suppressed (i.e. Chars "" tokens are skipped).

val entities : (string * string) list Pervasives.ref

An association list of entity definitions. Initially, it contains the predefined entities (["amp", "&"; "lt", "<"; ...]).
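As the value is a reference to an association list, extra entities can be registered by prepending pairs to it. The sketch below uses a local stand-in named entities so that it runs without the library; the helper add_entity and the sample entity "copy" are hypothetical, and the stand-in lists only three of the predefined entities.

```ocaml
(* Local stand-in for Xml_lexer.entities so the sketch runs on its own.
   Only some of the predefined entities are listed here. *)
let entities : (string * string) list ref =
  ref ["amp", "&"; "lt", "<"; "gt", ">"]

(* Hypothetical helper: register a replacement text for an entity name.
   Prepending means a later definition shadows an earlier one when the
   list is searched with List.assoc. *)
let add_entity name text = entities := (name, text) :: !entities

let () = add_entity "copy" "(c)"

(* With the real module, the equivalent update would be:
   Xml_lexer.entities := ("copy", "(c)") :: !Xml_lexer.entities *)
```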

val token : Lexing.lexbuf -> token

The entry point of the lexer.
Raises Error in case of an invalid XML document
Returns the next token in the buffer
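A typical caller invokes the entry point repeatedly until it returns EOF. The function below is a minimal sketch of that loop, written generically so it can run without the library: tokens_until is a hypothetical name, next stands in for Xml_lexer.token, and is_eof is the test for the end-of-input token.

```ocaml
(* Hypothetical helper, not part of Xml_lexer.
   Calls [next lexbuf] until [is_eof] holds for the returned token and
   returns every earlier token in the order it was read.
   Any exception raised by [next] propagates to the caller. *)
let tokens_until ~is_eof next lexbuf =
  let rec loop acc =
    let t = next lexbuf in
    if is_eof t then List.rev acc else loop (t :: acc)
  in
  loop []

(* Possible use, assuming Xml_lexer is linked:

   let lexbuf = Lexing.from_string "<a>hi</a>" in
   tokens_until ~is_eof:(fun t -> t = Xml_lexer.EOF) Xml_lexer.token lexbuf
*)
```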