diffmark is an XML diff and merge package. It consists of a shared C++ library, libdiffmark, plus two programs wrapping the library into a command-line interface: dm and dm-merge. dm takes two XML files and prints their diff (also an XML document) on its standard output. dm-merge takes the first document passed to dm, together with dm's output, and produces the second document.
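The round trip described above can be sketched as a small Python wrapper. This is only an illustration under stated assumptions: both programs are installed and on PATH, each takes its two file arguments in the order given in the text, and dm-merge writes its result to standard output (the text only states this for dm). The function names are hypothetical.

```python
import shutil
import subprocess


def dm_command(first, second):
    """Command line that diffs two XML files (the diff goes to standard output)."""
    return ["dm", first, second]


def dm_merge_command(first, diff):
    """Command line that applies a diff to the first document."""
    return ["dm-merge", first, diff]


def round_trip(first, second, diff_path):
    """Diff two files, store the diff, then rebuild the second document."""
    if shutil.which("dm") is None or shutil.which("dm-merge") is None:
        raise RuntimeError("dm and dm-merge must be installed and on PATH")
    diff = subprocess.run(dm_command(first, second),
                          check=True, capture_output=True).stdout
    with open(diff_path, "wb") as handle:
        handle.write(diff)
    # Assumption: dm-merge also writes its result to standard output.
    merged = subprocess.run(dm_merge_command(first, diff_path),
                            check=True, capture_output=True).stdout
    return merged
```

If these assumptions hold, the bytes returned by round_trip should be an XML document equivalent to the second input file.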
diffmark has a close (not to say convoluted) relationship with the Perl module XML::DifferenceMarkup (available on CPAN). Current versions of XML::DifferenceMarkup are built on top of libdiffmark, making the packages compatible.
Thanks to Anatol Belski, libdiffmark now also has a PHP frontend.
libdiffmark depends on libxml2 (available from http://xmlsoft.org).
The diff format is meant to be human-readable (i.e. simple, as opposed to short) - basically the diff is a subset of the input trees, annotated with instruction element nodes specifying how to convert the source tree to the target by inserting and deleting nodes. To prevent name collisions with input trees, all added elements are in the namespace http://www.locus.cz/diffmark (the diff will fail on input trees which already use that namespace).
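Since inputs that already use the diffmark namespace cannot be diffed, it may be worth checking a document before handing it to dm. The following is a minimal sketch using only the Python standard library; it is not part of diffmark, the function name is hypothetical, and it is deliberately conservative: it reports any declaration of the namespace, even one that no element actually uses.

```python
import io
import xml.etree.ElementTree as ET

# Namespace reserved for diff instruction nodes, as stated in the text.
DIFFMARK_NS = "http://www.locus.cz/diffmark"


def declares_diffmark_namespace(xml_bytes):
    """Return True if any element of the document declares the diffmark namespace."""
    stream = io.BytesIO(xml_bytes)
    # "start-ns" events carry a (prefix, uri) pair for each namespace declaration.
    for _event, (_prefix, uri) in ET.iterparse(stream, events=("start-ns",)):
        if uri == DIFFMARK_NS:
            return True
    return False
```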
The top-level node of the diff is always <diff> (or rather <dm:diff xmlns:dm="http://www.locus.cz/diffmark"> ... </dm:diff> - this description omits the namespace specification from now on); under it are fragments of the input trees and instruction nodes: <insert/>, <delete/> and <copy/>. <copy/> is used in places where the input subtrees are the same - in the limit, the diff of 2 identical documents is
<?xml version="1.0"?>
<dm:diff xmlns:dm="http://www.locus.cz/diffmark">
  <dm:copy count="1"/>
</dm:diff>
(<copy/> always has the count attribute and no other content).
<insert/> and <delete/> have the obvious meaning - in the limit a diff of 2 documents which have nothing in common is something like
<?xml version="1.0"?>
<dm:diff xmlns:dm="http://www.locus.cz/diffmark">
  <dm:delete>
    <old/>
  </dm:delete>
  <dm:insert>
    <new>
      <tree>with the whole subtree, of course</tree>
    </new>
  </dm:insert>
</dm:diff>
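To make the three instruction nodes concrete, here is a small sketch that lists the instructions found in a diff, using only the Python standard library. It is an illustration, not part of diffmark: the function name is hypothetical, and the namespace is the one given earlier in this description (http://www.locus.cz/diffmark).

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced names as "{uri}local-name".
DM = "{http://www.locus.cz/diffmark}"


def list_instructions(diff_text):
    """Return the instruction nodes of a diff in document order.

    <copy/> is reported with its count, <insert/> and <delete/> with the
    names of their direct child elements.
    """
    root = ET.fromstring(diff_text)
    found = []
    for node in root.iter():
        if node.tag == DM + "copy":
            found.append(("copy", int(node.get("count"))))
        elif node.tag in (DM + "insert", DM + "delete"):
            names = [child.tag for child in node]
            found.append((node.tag[len(DM):], names))
    return found
```

Applied to the two limit cases above, it would report a single copy of count 1 for identical documents, and one delete followed by one insert for documents which have nothing in common.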
A combination of <insert/>, <delete/> and <copy/> can capture any difference, but it's sub-optimal for the case where (for example) the top-level elements in the two input documents differ while their subtrees are exactly the same. dm handles this case by putting the element from the second document into the diff, adding to it a special attribute dm:update (whose value is the element name from the first document) marking the element change:
<?xml version="1.0"?>
<dm:diff xmlns:dm="http://www.locus.cz/diffmark">
  <top-of-second dm:update="top-of-first">
    <dm:copy count="42"/>
  </top-of-second>
</dm:diff>
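A consumer of the diff can recognize such element changes by looking for the dm:update attribute. The sketch below does this with the Python standard library; it is an illustration only, the function name is hypothetical, and the namespace is the one given earlier in this description.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced names as "{uri}local-name".
DM = "{http://www.locus.cz/diffmark}"


def updated_elements(diff_text):
    """Return (name in first document, name in second document) pairs
    for every element in the diff that carries a dm:update attribute."""
    root = ET.fromstring(diff_text)
    pairs = []
    for node in root.iter():
        old_name = node.get(DM + "update")
        if old_name is not None:
            pairs.append((old_name, node.tag))
    return pairs
```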
<delete/> contains just one level of nested nodes - their subtrees are not included in the diff (but the element nodes which are included always come with all their attributes). <insert/> and <delete/> don't have any attributes and always contain some subtree.
Instruction nodes are never nested; all nodes above an instruction node (except the top-level <diff/>) come from the input trees. A node from the second input tree might be included in the output diff to provide context for instruction nodes when it's an element node whose subtree is not the same in the two input documents. When such an element has the same name, attributes (names and values) and namespace declarations in both input documents, it's always included in the diff (its differing subtrees guarantee that it will have some children there). If the corresponding elements are different, the one from the second document might still be included, with an added dm:update attribute, provided that both corresponding elements have non-empty subtrees, and these subtrees are so similar that deleting the first corresponding element and inserting the second would lead to a larger diff. And if this paragraph seems too complicated, don't despair - just ignore it and look at some examples.
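The structural rules stated so far (a top-level <diff/>, <copy/> with only a count attribute and no content, <insert/> and <delete/> without attributes and never empty, no nested instruction nodes) can be turned into a simple checker. This is a sketch written for this description, not code from diffmark; the function name and messages are hypothetical, the namespace is the one given earlier, and the rules about when context elements appear are not checked.

```python
import xml.etree.ElementTree as ET

DM = "{http://www.locus.cz/diffmark}"
INSTRUCTIONS = {DM + "copy", DM + "insert", DM + "delete"}


def check_diff(diff_text):
    """Return a list of violated rules (empty when none are found)."""
    root = ET.fromstring(diff_text)
    problems = []
    if root.tag != DM + "diff":
        problems.append("top-level element is not dm:diff")

    def visit(node, inside_instruction):
        for child in node:
            is_instruction = child.tag in INSTRUCTIONS
            if is_instruction and inside_instruction:
                problems.append("nested instruction node")
            if child.tag == DM + "copy":
                has_content = len(child) > 0 or (child.text or "").strip()
                if set(child.attrib) != {"count"} or has_content:
                    problems.append("copy must have only a count attribute")
            elif is_instruction:
                name = child.tag[len(DM):]
                if child.attrib:
                    problems.append(name + " must not have attributes")
                if len(child) == 0 and not child.text:
                    problems.append(name + " must not be empty")
            visit(child, inside_instruction or is_instruction)

    visit(root, False)
    return problems
```

A diff produced by dm should yield an empty list; a hand-edited diff that nests <copy/> inside <insert/>, for instance, would be reported.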
Download now: diffmark-0.10.tar.gz