I've been building a rendering engine for a code editor the past couple of days. Rendering huge chunks of highlighted syntax can get laggy. It's not worth switching to React at this stage, so I wanted to just write a quick diff algorithm that would selectively update only changed lines.
I found this article: https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/
With a link to this paper, the initial Git diff implementation: http://www.xmailserver.org/diff2.pdf
I couldn't find the PDF to start with, but read "edit graph" and immediately thought — why don't I just use a hashtable to store lines from LEFT_TEXT and references to where they are, then iterate over RIGHT_TEXT and return matches one by one, also making sure that I keep track of the last match to prevent jumbling?
The algorithm I produced is only a few lines and seems accurate. It's O(N) time complexity, whereas the paper above gives a best case of O(ND) where D is minimum edit distance.
function lineDiff (left, right) {
left = left.split('\n');
right = right.split('\n');
let lookup = {};
// Store line numbers from LEFT in a lookup table
left.forEach(function (line, i) {
lookup[line] = lookup[line] || [];
lookup[line].push(i);
});
// Last line we matched
var minLine = -1;
return right.map(function (line) {
lookup[line] = lookup[line] || [];
var lineNumber = -1;
if (lookup[line].length) {
lineNumber = lookup[line].shift();
// Make sure we're looking ahead
if (lineNumber > minLine) {
minLine = lineNumber;
} else {
lineNumber = -1
}
}
return {
value: line,
from: lineNumber
};
});
}
RunKit link:
https://runkit.com/keithwhor/line-diffWhat am I missing? I can't find other references to doing diffing like this. Everything just links back to that one paper.
Consider the following sequences (here each letter represents a line).
AAAB
BAAA
Where the first one is "left" and the second one is "right".The LCS is AAA. The minimal edit script from left to right is:
1. Insert B at beginning.
2. Delete B at end.
The proposed O(N) algorithm finds B as the first matching line, and thereby exhaust `left`. So it finds the common subsequence of just "B" and its edit script is:
1. Delete 3 As at beginning
2. Insert 3 As at end
which is longer.
Like someone else said, I think the key is that you're not getting the shortest edit script, you're just getting a edit script.
Interestingly, even git's implementation of diff (which uses a variant of the Myers Diff algorithm) doesn't actually find the shortest edit script by default. It has a hard-coded heuristic that just quits early if it finds a 'good enough' solution:
https://github.com/git/git/blob/master/xdiff/xdiffi.c#L143
Same this for unix diff (unless you specify the commandline flag to find shortest):
https://github.com/Distrotech/diffutils/blob/distrotech-diff...
Having said that, if you're still confident that you've made an improvement, you can port the function to python and try it out as a test case in my unit test file that I wrote when I developed my variant of the Myers Diff algorithm:
https://github.com/RobertElderSoftware/roberteldersoftwaredi...
On that topic, I'm pretty sure my variant of the Myers Diff algorithm is actually slightly faster/uses less memory than the original (I'd be happy to be proved wrong). At least, I've had my blog post up for a few years and no one has corrected me, but then again, I think the original paper has a few mistakes/obvious flaws and it still has hundreds of citations and no one has written much about improvements.
You can tune how much CPU you want to spend getting a better diff by adjusting the compression ratio, the compressor is well tuned for speed and is generally more efficient than a diffing algorithm, it can find repetitions in the new content, and you get entropy coding.
The ideal trade-off (diff-size vs space-time characteristics) depends on the use-case. It might be more important for a VDOM library to obtain a good-enough diff in O(N) time, as opposed to an LCS in O(NM) space and time. Linear diffing for VDOMs works well, especially with heuristics and things like explicit and implicit key matching. Sometimes the asymptotic characteristics of a program are irrelevant if your system is bounded.
$ my-diff yours mine
--- yours 2019-10-28 15:11:07.150118090 -0700
+++ mine 2019-10-28 16:46:42.093923066 -0700
@@ -1,50 +1,75 @@
- yours
- yours
[...]
- yours
+ mine
+ mine
[...]
+ mine
Spit out the diff and hunk header, then dump yours preceded by "- " and then mine preceded by "+ ".
Finding any diff is not really the problem.
Love this post by the way, wished we would see more of these on HN.
More specifically, it's not just something at the end appearing at the beginning causing skipping like others have pointed out. The skipping can happen in the middle. Consider the two sequences A B B C and B C A. Your algo will match the first Bs and then the Cs and skip the common B C together.
Even if you could somehow make it work, it would be O(N) but note that with all of the split() and map() and array indexing, it starts to be more like O(N+NX) where X is that potentially non-insignificant overhead.
Don't let me pooh-pooh you from trying but if you do give-in and use the Myers' algo, here's my C diff implementation:
http://www.ioplex.com/~miallen/libmba/dl/src/diff.c
This is actually pretty compact as diff implementations go and would probably be a reference for implementing diff in JS.
Not sure about the diffing research though.