public class Docx4jDriver
extends Object
docx4j uses Topologi's diffx project to determine the difference
between two pieces of WordML. (An XSLT is then used to convert
the diffx output to WordML with the changes tracked.)
If the two things being compared start or end with the same
XML, diffx slices that off.
After that, you are left with EventSequences representing the
two things being compared (an event for the start and end of
each element, for each attribute, and for each word of text).
The problem is that performance drops off rapidly as the event
sequences get longer. For example, if each event sequence is:
+ under say 500 entries, time is negligible
+ 1800 entries long, calculating the LCS length to fill the matrix
may take 17 seconds (on a 2.4 GHz Core 2 Duo, running from within
Eclipse)
+ 3000 entries, about 95 seconds (under IKVM)
+ 3500 entries, about 120 seconds
+ 5500 entries, about 550 seconds (under IKVM)
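The growth in those timings follows from the size of the matrix being filled. As a minimal sketch, assuming the textbook dynamic-programming formulation (the class and method names here are hypothetical, and this is not diffx's actual code), an LCS length calculation visits one cell per pair of events, so two 500-entry sequences need about 250,000 cells while two 5500-entry sequences need about 30 million:

```java
import java.util.List;

// Hypothetical illustration only: textbook LCS length, not diffx's implementation.
final class LcsLengthSketch {

    /** Fills an (n+1) x (m+1) table and returns the LCS length of the two sequences. */
    static <T> int lcsLength(List<T> left, List<T> right) {
        int n = left.size();
        int m = right.size();
        int[][] table = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (left.get(i - 1).equals(right.get(j - 1))) {
                    // Matching events extend the common subsequence by one.
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    // Otherwise carry forward the better of the two neighbours.
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[n][m];
    }

    /** Number of event pairs compared, i.e. the work done by the nested loops. */
    static long comparisons(int leftLength, int rightLength) {
        return (long) leftLength * rightLength;
    }
}
```

Both time and memory scale with the product of the two lengths, which is why shortening the sequences handed to the fine-grained diff matters so much.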
Ultimately, we should migrate to, or develop, a library which doesn't have
this problem, and supports:
- word-level diff (diffx supports this; Fuego doesn't, but could)
- 3-way merge
- move (though why? Can OpenXML represent a move?)
An intermediate step might be to add an implementation of the Lindholm
heuristically guided greedy matcher to the com.topologi.diffx.algorithm
package. See the Fuego Core XML Diff and Patch tool project
(which, as of 19 June 2009, was offline). This could be relatively
straightforward, since it also uses an event sequence concept.
But in the meantime this class attempts to divide up the problem. The strategy
is to look at the children of the nodes passed in, hoping to find
an LCS (longest common subsequence) amongst those. If we have that LCS,
then (at least in the default case) we don't need to diff the things in
the LCS, just the things between the LCS entries. I say 'default case'
because in that case each LCS entry is the hashcode of the child's diffx
EventSequence, so matched children are taken to be identical.
(But if you were operating on sdts, you might make the entries the sdt id,
in which case matched sdts may still differ in content.)
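The divide step described above can be sketched as follows. This is a hedged illustration, not the class's actual code: it assumes each child has already been reduced to an integer key (a hashcode in the default case), it uses a plain dynamic-programming LCS where the real class relies on the eclipse.compare package, and the names CoarseDiffSketch, Gap and gaps are hypothetical. It returns the index ranges between matched children, which are the only parts that would still need a fine-grained diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: coarse-grained matching of child keys, then report the gaps.
final class CoarseDiffSketch {

    /** Half-open index ranges [start, end) of children that still need diffing. */
    static final class Gap {
        final int leftStart, leftEnd, rightStart, rightEnd;

        Gap(int leftStart, int leftEnd, int rightStart, int rightEnd) {
            this.leftStart = leftStart;
            this.leftEnd = leftEnd;
            this.rightStart = rightStart;
            this.rightEnd = rightEnd;
        }
    }

    static List<Gap> gaps(List<Integer> leftKeys, List<Integer> rightKeys) {
        int n = leftKeys.size();
        int m = rightKeys.size();
        // table[i][j] = LCS length of leftKeys[i..] and rightKeys[j..]
        int[][] table = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                table[i][j] = leftKeys.get(i).equals(rightKeys.get(j))
                        ? table[i + 1][j + 1] + 1
                        : Math.max(table[i + 1][j], table[i][j + 1]);
            }
        }
        List<Gap> result = new ArrayList<>();
        int i = 0, j = 0, gapLeft = 0, gapRight = 0;
        while (i < n && j < m) {
            if (leftKeys.get(i).equals(rightKeys.get(j))) {
                // A matched child closes whatever unmatched run preceded it.
                if (i > gapLeft || j > gapRight) {
                    result.add(new Gap(gapLeft, i, gapRight, j));
                }
                i++;
                j++;
                gapLeft = i;
                gapRight = j;
            } else if (table[i + 1][j] >= table[i][j + 1]) {
                i++; // left child is not part of the LCS
            } else {
                j++; // right child is not part of the LCS
            }
        }
        // Anything left over after the last match forms a trailing gap.
        if (gapLeft < n || gapRight < m) {
            result.add(new Gap(gapLeft, n, gapRight, m));
        }
        return result;
    }
}
```

Because the coarse pass runs over one key per child rather than one event per word, its matrix stays small, and each gap can then be handed to the fine-grained diff separately. With sdt ids as keys, the matched children would also need diffing, since an equal id does not imply equal content.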
This approach might work on the children of w:body (paragraphs,
for example), or the children of an sdt:content.
It could also help if you run it on two w:body elements, where
all the w:p are inside w:sdts, provided you make use of the
sdt ids, *and* the sliced event sequences inside the sdts aren't
too long.
We use the eclipse.compare package for the coarse-grained divide and conquer.
TODO: If both sequences in any diffx-sliced event sequence pair are longer
than 2000 entries, this will log a warning, and just return
left tree deleted, right tree inserted. Or try to carve them up somehow?
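The fallback in that TODO amounts to a size check before the fine-grained diff. A minimal sketch, assuming the 2000-entry threshold from the note and using hypothetical names (SizeGuardSketch, Strategy, choose) that do not come from the class itself:

```java
// Hypothetical sketch of the size guard described in the TODO.
final class SizeGuardSketch {

    /** Threshold taken from the TODO note; treat the exact value as tunable. */
    static final int MAX_EVENTS = 2000;

    enum Strategy { FINE_GRAINED_DIFF, DELETE_LEFT_INSERT_RIGHT }

    static Strategy choose(int leftEvents, int rightEvents) {
        // Only give up when BOTH sliced sequences exceed the threshold.
        if (leftEvents > MAX_EVENTS && rightEvents > MAX_EVENTS) {
            System.err.println("warning: sequences too long to diff ("
                    + leftEvents + " and " + rightEvents + " events)");
            return Strategy.DELETE_LEFT_INSERT_RIGHT;
        }
        return Strategy.FINE_GRAINED_DIFF;
    }
}
```

Marking the whole left tree as deleted and the right tree as inserted is always a valid (if unhelpful) diff, so the guard trades precision for a bounded running time.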
The classes in src/diffx do not import any of org.docx4j proper;
keep it this way so that this package can be made into a DLL
using IKVM, and used in a .NET application, without extra
dependencies (though we do use commons-lang, for help in
creating good hashcodes).
- Author:
- jason