Author:
Derek Walker
The XmlTrans transducer takes as input a well-formed XML file and a set of transformation rules and gives as output the application of the rules on the input XML file. It was designed for the processing of large XML files, keeping only the minimum necessary part of the document in memory at all times. The program is written in Java and uses an XML DOM parser.
This document gives a brief overview of the transformation language and how it is used. We first introduce the language concepts and structure of XmlTrans rules and rule files. We then provide some tips on how to write rules and interpret errors.
The following sections detail the various aspects of the XmlTrans transformation language.
Any line which starts with a semi-colon is considered to be a comment and is ignored by XmlTrans.
For example:
; This is a comment.
The XmlTrans parser keeps a list of XML elements in the document which can be transformed by the rule set. All other elements are ignored by the parser and will be suppressed in the output, though each child is searched for transformable elements. At the top of the rule file at least one "trigger" is required to indicate which elements can be processed. This trigger associates an element with a rule set. Consequently, each XmlTrans rule file must contain at least one rule set. This is a collection of rules which are grouped together for convenience.
The syntax for a "trigger" is as follows:
[element name] : @ [rule set name]
Multiple triggers can be used to allow different kinds of rules to process different kinds of elements. For example:
ENTRY : @ normalEntryRules
COMPOUNDENTRY : @ compoundEntryRules
The rule set is declared with the following syntax:
@ [rule set name]
For example:
@ normalEntryRules
; the normal rules go here
The rule set is terminated either by the end of the file or with the declaration of another rule set. Every rule in the rule file must be contained within a rule set.
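To make the file layout concrete, the following Python sketch splits a rule file into its triggers and rule sets. It is not the XmlTrans parser: the name `split_rule_file`, the line-based handling, and the sample rules are assumptions made for illustration, and rule bodies are kept as raw text.

```python
import re

TRIGGER = re.compile(r"^(\w+)\s*:\s*@\s*(\w+)$")   # ELEMENT : @ ruleSet
RULE_SET = re.compile(r"^@\s*(\w+)$")               # @ ruleSet


def split_rule_file(text):
    """Return (triggers, rule_sets) for a rule file given as a string."""
    triggers, rule_sets, current = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment line
        trigger = TRIGGER.match(line)
        declaration = RULE_SET.match(line)
        if trigger:
            triggers[trigger.group(1)] = trigger.group(2)
        elif declaration:
            current = declaration.group(1)  # a new set ends the previous one
            rule_sets[current] = []
        elif current is None:
            raise ValueError("rule outside of any rule set: " + line)
        else:
            rule_sets[current].append(line)
    return triggers, rule_sets


sample = """\
; triggers come first
ENTRY : @ normalEntryRules
COMPOUNDENTRY : @ compoundEntryRules

@ normalEntryRules
X{$a*} -> Y{$a};

@ compoundEntryRules
X{$a*} -> $a;
"""

triggers, rule_sets = split_rule_file(sample)
print(triggers)
print(rule_sets)
```

A rule line that appears before any rule set declaration raises an error in this sketch, mirroring the requirement that every rule belongs to a rule set.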
In XmlTrans rule syntax variables are implicitly declared with their first use. Variables are always prefaced with a dollar sign ("$"). There are two types of variables:
Element variables: created by an assignment of a pattern of elements to a variable. For example: $a = LI, where LI is an element. Element variables can contain one or more elements. If a given variable $a contains a list of elements { A, B, C, ...}, transforming $a will apply the transformation in sequence to A, B, C and so on.
Attribute variables: created by an assignment of a pattern of attributes to a variable. For example: LI[ $a=TYPE ], where TYPE is a standard XML attribute.
While variables are not strongly typed (i.e. a list of elements is not distinguished from an individual element), attribute variables cannot be used in the place of element variables and vice versa.
The basic control structure of XmlTrans is the rule. A rule has a left-hand side (LHS) and a right-hand side (RHS). The LHS is a pattern of XML element(s) to match, while the RHS is a specification for a transformation on those elements. The two sides are separated by an arrow symbol ("->"). Rules can span multiple lines but must be terminated by a semicolon.
Rule failure.
We look at the left-hand side and the right-hand side in turn.
The basic building block of the LHS is the element pattern. This is a pattern involving a single element, its attributes, and children. The following table outlines some possible element patterns:
| Pattern | Description | Example Match |
|---|---|---|
| X | Match a specific "X" element. | <X>Text</X> |
| X{Y} | Match a specific element "X" with a specific child pattern. | <X><Y>Text</Y></X> |
| X[ATT == "VALUE"] | Match a specific element "X" with the attribute "ATT" of value "VALUE". | <X ATT="VALUE">Text</X> |
| X[ATT != "VALUE"] | Match a specific element "X" with the attribute "ATT" that does not contain the value "VALUE". | <X ATT="OTHER">Text</X> |
| $a | Match any element. | <X>Text</X> |
| X{$a*} | Match a specific element "X" with any number of children, assigning the children to the variable $a. | <X>Text</X> |
XmlTrans also allows for more complex regular expressions of elements. These expressions match over the children of the element being examined. XmlTrans always works on one element and one element only when matching the LHS of a rule. For instance, the following rule is not valid:
; NOT a valid XmlTrans rule
; ... indicates any completion of the rule.
;
( X Y ) -> ...;
In situations where such matches are desirable, it is necessary to work one level up, matching the multiple elements as children of their parent. Assuming that "X" and "Y" above are children of another element "Z", the following is a valid version of the above rule:
; A valid XmlTrans rule
; ... indicates any completion of the rule.
;
Z{ X Y } -> ...;
XmlTrans supports the notion of a logical NOT over an element expression. This is represented by the standard "!" symbol. For instance, the following rule will match any element which is not an "X":
; Any element which is not an X
; ... indicates any completion of the rule.
;
!(X) -> ...;
Support for general regular expressions is built into the language grammar. The standard symbols are used, as described in the following table:
| Symbol | Description | Example pattern | Example match |
|---|---|---|---|
| * | zero or more occurrences | X{ Y* } | <X></X> |
| + | one or more occurrences | X{ Y+ } | <X><Y/></X> |
| ? | zero or one occurrences | X{ Y? } | <X></X> |
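The behaviour of the three symbols over a list of children can be imitated with Python's `re` module. This is only an analogy, not how XmlTrans is implemented: the helpers `encode` and `children_match` are hypothetical, and each child name is wrapped in angle brackets so that names cannot run together.

```python
import re


def encode(children):
    """Turn a list of child element names into a string such as '<Y><Y>'."""
    return "".join("<%s>" % name for name in children)


def children_match(regex, children):
    """True if the whole child sequence matches the regular expression."""
    return re.fullmatch(regex, encode(children)) is not None


Y_STAR = r"(?:<Y>)*"  # analogue of X{ Y* }
Y_PLUS = r"(?:<Y>)+"  # analogue of X{ Y+ }
Y_OPT = r"(?:<Y>)?"   # analogue of X{ Y? }

for children in ([], ["Y"], ["Y", "Y"]):
    print(children,
          children_match(Y_STAR, children),
          children_match(Y_PLUS, children),
          children_match(Y_OPT, children))
```

An empty child list satisfies `*` and `?` but not `+`, and two "Y" children satisfy `*` and `+` but not `?`.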
In order to create rules of greater generality, elements and attributes in the LHS of a rule can be assigned to variables. For instance, we might want to transform a given element "X" in a certain way without specifying its children. The following rule would be used in such a case:
; Match X with zero or more unspecified children.
; ... indicates any completion of the rule.
;
X{$a*} -> ...;
In the rule above, the variable $a will be either empty (if "X" has no children), a single element (if "X" has one child), or a list of elements (if "X" has a series of children). Similarly, the pattern X{$a} matches an element "X" with exactly one child.
If an expression contains complex patterns, it is often useful to assign specific parts to different variables. This allows child nodes to be processed in groups on the RHS, perhaps being re-used several times or reordered. Consider the following rule:
;
; ... indicates any completion of the rule.
Z{ $a = (X Y)* $b = Q} -> ... ;
In this case $a contains a (possibly empty) list of "X Y" element pairs. The variable $b will contain exactly one "Q". If this pattern cannot be matched, the rule will fail.
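The binding of $a and $b can be pictured with named groups in a Python regular expression. This is an analogy under stated assumptions rather than the XmlTrans matcher: the helpers `encode`, `names`, and `bind` are hypothetical, and children are represented by their element names only.

```python
import re


def encode(children):
    """Turn a list of child element names into a string such as '<X><Y>'."""
    return "".join("<%s>" % name for name in children)


def names(encoded):
    """Recover the list of element names from an encoded string."""
    return re.findall(r"<(\w+)>", encoded)


# Analogue of Z{ $a = (X Y)* $b = Q }
PATTERN = re.compile(r"(?P<a>(?:<X><Y>)*)(?P<b><Q>)")


def bind(children):
    """Return the bindings for $a and $b, or None if the rule fails."""
    match = PATTERN.fullmatch(encode(children))
    if match is None:
        return None
    return {"a": names(match.group("a")), "b": names(match.group("b"))}


print(bind(["X", "Y", "X", "Y", "Q"]))  # two pairs bound to a, Q bound to b
print(bind(["Q"]))                      # a is empty, b holds Q
print(bind(["X", "Q"]))                 # incomplete pair: the rule fails
```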
In a similar fashion, attributes can be assigned to variables. The syntax is a bit obscure with the assignment "=" binding to the attribute name, followed by a possible "==" or "!=" expression. The following three rules demonstrate some possibilities:
; Match any X which has an attribute ATT
; The match will fail if X does not have an ATT
;
X[ $att = ATT ] -> ...;

; Match any X which has an attribute ATT with the value "VALUE"
; The match will fail if ATT is not equal to "VALUE" or if the X
; does not have an ATT attribute.
;
X[ $att = ATT == "VALUE"] -> ...;

; Match any X with an attribute which is NOT equal to "VALUE"
; The match will fail if X has an attribute ATT of value "VALUE"
; *AND* will fail if X does not have an attribute ATT.
;
X[ $att = ATT != "VALUE"] -> ...;
It is important to note that the absence of the specified attribute
means that all the above rules will fail. In the last rule above the
reason is not obvious. In fact, XmlTrans always checks to see that
the attribute exists before attempting to match the attribute
expression. Thus, even an expression of the form X[ $att = ATT
!= "VALUE"] fails if the attribute "ATT"
is not found.
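The order of the two checks is easy to get wrong, so here is a minimal Python sketch of the behaviour described. The function `attribute_matches` and its dictionary representation of attributes are assumptions for illustration, not part of XmlTrans.

```python
def attribute_matches(attributes, name, op=None, value=None):
    """Mimic the attribute test described in the text.

    attributes: dict mapping attribute names to values for one element.
    op: None (existence only), "==" or "!=".
    """
    if name not in attributes:
        return False  # existence is checked first, whatever the operator
    if op is None:
        return True
    if op == "==":
        return attributes[name] == value  # exact comparison, no wild cards
    if op == "!=":
        return attributes[name] != value
    raise ValueError("unknown operator: " + op)


print(attribute_matches({"ATT": "OTHER"}, "ATT", "!=", "VALUE"))  # True
print(attribute_matches({}, "ATT", "!=", "VALUE"))                # False
```

The second call is the surprising case: the element has no "ATT" at all, so even the "!=" test fails.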
It is also important to note that the string corresponding to the attribute value cannot contain any wild cards or regular expressions. All matches with attribute values are exact.
The last type of expression used on the LHS is the string expression. Strings are considered to be elements in their own right. Consider the following XML segment:
<X> A long time <Y>ago</Y> </X>
According to the XmlTrans parser, "X" has the following children: "A long time", "Y". The element "Y" has the child "ago". This can be extremely important when processing files which have many embedded spaces or line breaks between elements. Consider the following XML segment:
<X><Y/> <Y/></X>
The expression X{ Y Y } will not match because there is
a space " " element embedded between the "Y"
elements. Instead, the expression X{ Y " " Y}
matches.
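The same effect can be observed with any DOM parser that preserves whitespace. The sketch below uses Python's standard `xml.dom.minidom` rather than the XmlTrans parser, so it only illustrates the general DOM behaviour; `describe_children` is a hypothetical helper.

```python
from xml.dom import minidom


def describe_children(xml_text):
    """List the children of the root element, text nodes included."""
    root = minidom.parseString(xml_text).documentElement
    described = []
    for child in root.childNodes:
        if child.nodeType == child.TEXT_NODE:
            described.append(repr(child.data))  # quote text so spaces show
        elif child.nodeType == child.ELEMENT_NODE:
            described.append(child.tagName)
    return described


print(describe_children("<X><Y/> <Y/></X>"))  # ['Y', "' '", 'Y']
print(describe_children("<X><Y/><Y/></X>"))   # ['Y', 'Y']
```

Printing the child list of a problem element in this way is a quick check for stray whitespace when a pattern unexpectedly fails to match.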
Strings can also be matched as elements are matched in the top
level expression, except that they are enclosed in quotes and cannot
have attribute patterns like regular elements can. A special syntax,
/.*/, is used to mean any element which is a string. The
following are some sample string matching rules:
; Match any string
;
/.*/ -> ... ;

; Match the text "suppress".
; Note that this will match <X>suppress</X>,
; but not <X>suppress me</X>
;
"suppress" -> ...;

; Match a new line.
;
"\n" -> ... ;
String matches do not operate on substrings, but rather on entire strings only. Consequently, a rule will match exactly the text of the LHS and will fail if it is embedded in other text. It will even fail to match if the text element contains an end-of-line marker not contained in the LHS expression.
The RHS supplies a construction pattern for the transformed tree node. The basic building block of the RHS is the constructor. A constructor can be one of several possibilities:
| Type | Example | Input | Output |
|---|---|---|---|
| text | X -> "Hello world"; | <X>Text</X> | Hello world |
| entity reference | X{$a*} -> Y{$a}; | <X>Text</X> | <Y>Text</Y> |
| variable reference | X{$a*} -> $a; | <X>Text</X> | Text |
| attribute variable reference | X[$a = ATT]{$b*} -> Y[OLDATT=$a]{$b} | <X ATT="VAL">Text</X> | <Y OLDATT="VAL">Text</Y> |
| comments | | | |
| element | | | |
| same set reference | X{$a*} -> Y{@($a)}; | <X>Text</X> | <Y>txeT</Y> |
| set reference | @ set1 | <X>Text</X> | <Y>Nothing</Y> |
The last two rows in the table above demonstrate how to process
the input file recursively. This is a critical concept to understand
before attempting to write XmlTrans rules. The expression
@ [set name]( [variable name] )
tells the XmlTrans transformer to continue processing on the elements
contained in the indicated variable. For instance, @set1($a)
indicates that the elements contained in the variable $a
should be processed by the rules in the set set1. A
special notation @([variable name]) is used to tell
the transformer to continue processing with the current rule set.
Thus, if the current rule set is set2, the expression
@($a) indicates that processing should continue on the
elements in $a using the rule set set2.
Consider the following XmlTrans file:
X : @ identity

@ identity
$a{ $b* } -> $a{ @($b) };
$a = /.*/ -> $a;
This rule file essentially describes an identity operation on all elements "X" in the input file. Each child of "X" is written out exactly as it was read in. Note, however, that any parents of "X" elements will be eliminated.
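The recursion expressed by @($b) may be easier to follow as ordinary code. The Python sketch below imitates the identity rule set on a DOM tree built with the standard `xml.dom.minidom` module. It is an illustration under assumptions, not the XmlTrans algorithm: the functions `identity` and `transform` are hypothetical, and attributes are ignored for brevity.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape


def identity(node):
    """Apply the two rules of the 'identity' set to one node."""
    if node.nodeType == node.TEXT_NODE:
        return escape(node.data)  # string rule: copy the text
    if node.nodeType != node.ELEMENT_NODE:
        return ""  # comments and other node types are skipped in this sketch
    # Element rule: rebuild the element and recurse into its children,
    # which plays the role of @($b).
    inner = "".join(identity(child) for child in node.childNodes)
    return "<%s>%s</%s>" % (node.tagName, inner, node.tagName)


def transform(node, trigger="X"):
    """Search the tree for triggered elements; everything else is dropped."""
    if node.nodeType != node.ELEMENT_NODE:
        return ""
    if node.tagName == trigger:
        return identity(node)
    return "".join(transform(child, trigger) for child in node.childNodes)


doc = minidom.parseString("<R>intro<X>a<Y>b</Y></X><Z><X>c</X></Z></R>")
print(transform(doc.documentElement))  # <X>a<Y>b</Y></X><X>c</X>
```

The parents "R" and "Z" and the text "intro" disappear from the result, while both "X" elements are copied with their children.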
If there is only one constructor, then the RHS does not need to be contained in parentheses. Otherwise parentheses are obligatory. For example:
; Do not need parentheses
;
; Transforms X into Y
; ie. <X>Text</X> is
; transformed to <Y>Text</Y>
;
X{$a*} -> Y{$a};

; Parentheses needed
;
; Transforms X into Y
; ie. <X>Text</X> is
; transformed to <Y>Text</Y>again<Z>Text</Z>
;
X{$a*} -> ( Y{$a} "again" Z{$a} );
There is only one warning which is commonly encountered. It appears as:
Warning: elements variable identifier a defined at line [line] column [column] was never referenced.
This warning is generated when a variable is declared on the LHS but
is not used on the RHS, as with $a below:
ENTRY{ $all=( $a=BASE{ $a1=HEAD } $b* ) }
->
( H2{@getCW($all)}
@($a1)
DL{
DD{
@($b)
}}
);
These warnings are not strictly a problem, but may indicate places where the rules could be cleaned up.
An empty output usually means that none of the elements in the input file had a trigger at the top of the rule file. This is commonly caused when one file lists the elements in upper case and the other in lower case. Since XML is case sensitive, the trigger entry : @ entrySet will NOT match the elements <ENTRY/> or <Entry/>.
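A quick way to spot this kind of mismatch is to compare the trigger names with the element names that actually occur in the input. The Python sketch below does so with the standard `xml.etree.ElementTree` module. It is a hypothetical helper, not an XmlTrans feature, and it loads the whole document into memory, so it suits a small sample of the input rather than a very large file.

```python
import xml.etree.ElementTree as ET


def case_mismatches(trigger_names, xml_text):
    """Map each unmatched trigger to element names differing only in case."""
    names = {element.tag for element in ET.fromstring(xml_text).iter()}
    by_lower = {}
    for name in names:
        by_lower.setdefault(name.lower(), []).append(name)
    report = {}
    for trigger in trigger_names:
        if trigger not in names and trigger.lower() in by_lower:
            report[trigger] = sorted(by_lower[trigger.lower()])
    return report


print(case_mismatches(["entry"], "<DICT><ENTRY/><Entry/></DICT>"))
```

For the trigger "entry" and the sample document, the report lists "ENTRY" and "Entry" as the likely intended element names.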
These errors are caused when the rule file has a syntax error, such as mismatched parentheses:
ENTRY -> {)
The corresponding error is:
XMLTrans 0.3 fatal error:
Encountered errors during parse.
ch.unige.issco.xmltrans.parser.ParseException: Encountered "{" at line 14, column 8.
Was expecting one of:
    "$" ...
    "@" ...
    "(" ...
    <ID> ...
    <STRING> ...
        at java.lang.Throwable.<init>(Compiled Code)
        at java.lang.Exception.<init>(Compiled Code)
        at ch.unige.issco.xmltrans.parser.ParseException.<init>(Compiled Code)
        at ch.unige.issco.xmltrans.parser.XMLTransParser.generateParseException(Compiled Code)
        at ch.unige.issco.xmltrans.parser.XMLTransParser.jj_consume_token(Compiled Code)
        at ch.unige.issco.xmltrans.parser.XMLTransParser.Rhs(Compiled Code)
        at ch.unige.issco.xmltrans.parser.XMLTransParser.Rule(Compiled Code)
        at ch.unige.issco.xmltrans.parser.XMLTransParser.RuleSet(Compiled Code)
        at ch.unige.issco.xmltrans.parser.XMLTransParser.Grammar(Compiled Code)
        at XMLTrans.main(Compiled Code)
An attempt to use an attribute variable assigned on the LHS as an element on the RHS will result in a variable type error. The following rule demonstrates this error:
; This rule will give an "undeclared variable" error
;
X[ $att=NUM]{$child*} -> Y{ $att ". " $child };
To ensure that all the original elements are represented in the rule set, it is often useful to work from the DTD, writing at least one rule per element in the original DTD.
It is often useful to create a catch-all rule at the end of the main rule set to make errors clear in the output. The form of this rule could be:
; make errors clear in the output
;
$a -> ( "ERROR: " $a );
All elements which do not have corresponding rules will appear in the output prefaced by the string "ERROR: ".
The following rule can be used to suppress all elements which were
not processed by the other rules in a rule set: $a -> ().
Note that this might be dangerous as it may cause unwanted data loss.
It is sometimes desirable to re-use the text of an attribute in the body of an element. For example:
<X NUM="1">First point</X>
Becomes:
<Y>1. First point</Y>
Unfortunately, the typing of attributes prevents using an attribute variable as a text element on the RHS. Instead, the user must create a dummy element and remove the surrounding element using a sed script in a post-processing step. For example:
; Rule to create a dummy element
;
X[ $att=NUM]{$child*} -> Y{ DELME[NUM=$att]{} ". " $child };
With the above "X" element as input, this will output:
<Y><DELME NUM="1"> . First point</Y>
The sed script then eliminates the text surrounding the 1, leaving the desired output.
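The sed script itself is not shown here. As a stand-in, the following Python sketch performs the same kind of clean-up with a regular expression. It is not the original script, and the exact way XmlTrans serializes the empty DELME element is an assumption: the pattern accepts a self-closing tag, an open and close pair, or a bare opening tag, and swallows any whitespace that follows.

```python
import re

# Hypothetical pattern: keeps only the NUM value of the dummy element.
DUMMY = re.compile(r'<DELME NUM="([^"]*)"\s*/?>(?:\s*</DELME>)?\s*')


def strip_dummy(text):
    """Replace each DELME dummy element with the value of its NUM attribute."""
    return DUMMY.sub(r"\1", text)


print(strip_dummy('<Y><DELME NUM="1"/>. First point</Y>'))
```

With the sample input this prints `<Y>1. First point</Y>`. Choosing a dummy element name that cannot occur in real data, such as DELME, keeps the substitution from touching legitimate content.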
A BNF grammar for XmlTrans rules.
A sample XmlTrans rule file, with its corresponding XML source file and XML output file.
Last modified: Mon Oct 25 15:50:37 MET DST 1999