HACKER Q&A
📣 keithwhor

I just wrote an O(N) diffing algorithm – what am I missing?


Hey folks,

I've been building a rendering engine for a code editor the past couple of days. Rendering huge chunks of highlighted syntax can get laggy. It's not worth switching to React at this stage, so I wanted to just write a quick diff algorithm that would selectively update only changed lines.

I found this article: https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/

With a link to this paper, the initial Git diff implementation: http://www.xmailserver.org/diff2.pdf

I couldn't find the PDF to start with, but read "edit graph" and immediately thought — why don't I just use a hashtable to store lines from LEFT_TEXT and references to where they are, then iterate over RIGHT_TEXT and return matches one by one, also making sure that I keep track of the last match to prevent jumbling?

The algorithm I produced is only a few lines and seems accurate. It's O(N) time complexity, whereas the paper above gives a best case of O(ND) where D is minimum edit distance.

  function lineDiff (left, right) {
    left = left.split('\n');
    right = right.split('\n');
    let lookup = {};
    // Store line numbers from LEFT in a lookup table
    left.forEach(function (line, i) {
      lookup[line] = lookup[line] || [];
      lookup[line].push(i);
    });
    // Last line we matched
    var minLine = -1;
    return right.map(function (line) {
      lookup[line] = lookup[line] || [];
      var lineNumber = -1;
      if (lookup[line].length) {
        lineNumber = lookup[line].shift();
        // Make sure we're looking ahead
        if (lineNumber > minLine) {
          minLine = lineNumber;
        } else {
          lineNumber = -1
        }
      }
      return {
        value: line,
        from: lineNumber
      };
    });
  }
RunKit link: https://runkit.com/keithwhor/line-diff

What am I missing? I can't find other references to doing diffing like this. Everything just links back to that one paper.


  👤 ridiculous_fish Accepted Answer ✓
Diffing means computing the longest common subsequence (LCS); the edit script is everything that's not part of the LCS. The proposed algorithm greedily finds a matching element. However it may be that it would be better to skip this element, and not incorporate it into your sequence.

Consider the following sequences (here each letter represents a line).

   AAAB
   BAAA
Where the first one is "left" and the second one is "right".

The LCS is AAA. The minimal edit script from left to right is:

1. Insert B at beginning.

2. Delete B at end.

The proposed O(N) algorithm finds B as the first matching line, and thereby exhaust `left`. So it finds the common subsequence of just "B" and its edit script is:

1. Delete 3 As at beginning

2. Insert 3 As at end

which is longer.


👤 robertelder
I spent a bit of time learning the Myers Diff algorithm, (I made this page: https://blog.robertelder.org/diff-algorithm/)

Like someone else said, I think the key is that you're not getting the shortest edit script, you're just getting a edit script.

Interestingly, even git's implementation of diff (which uses a variant of the Myers Diff algorithm) doesn't actually find the shortest edit script by default. It has a hard-coded heuristic that just quits early if it finds a 'good enough' solution:

https://github.com/git/git/blob/master/xdiff/xdiffi.c#L143

Same this for unix diff (unless you specify the commandline flag to find shortest):

https://github.com/Distrotech/diffutils/blob/distrotech-diff...

Having said that, if you're still confident that you've made an improvement, you can port the function to python and try it out as a test case in my unit test file that I wrote when I developed my variant of the Myers Diff algorithm:

https://github.com/RobertElderSoftware/roberteldersoftwaredi...

On that topic, I'm pretty sure my variant of the Myers Diff algorithm is actually slightly faster/uses less memory than the original (I'd be happy to be proved wrong). At least, I've had my blog post up for a few years and no one has corrected me, but then again, I think the original paper has a few mistakes/obvious flaws and it still has hundreds of citations and no one has written much about improvements.


👤 im3w1l
Since the problem has already been resolved, I'll go on a slight tangent and mention there are some very interesting probabilistic diffing algorithms invented by the bio-informatics people.

E.g https://en.wikipedia.org/wiki/BLAST_(biotechnology)


👤 terrelln
Slightly off topic, but one great way to diff is to compress the new file using the old file as a dictionary. Both zlib and zstd support this (but zlib is limited to a 32KB file).

You can tune how much CPU you want to spend getting a better diff by adjusting the compression ratio, the compressor is well tuned for speed and is generally more efficient than a diffing algorithm, it can find repetitions in the new content, and you get entropy coding.


👤 anonytrary
You're most likely not getting a minimal diff, but it's highly likely that your diff is good enough for your use-case. React and most other similar VDOM libraries also fail to achieve minimal diffs, however, that's not necessarily a bad thing. Just because a diff isn't minimal doesn't mean it's wrong.

The ideal trade-off (diff-size vs space-time characteristics) depends on the use-case. It might be more important for a VDOM library to obtain a good-enough diff in O(N) time, as opposed to an LCS in O(NM) space and time. Linear diffing for VDOMs works well, especially with heuristics and things like explicit and implicit key matching. Sometimes the asymptotic characteristics of a program are irrelevant if your system is bounded.


👤 kazinator
I just found an O(M + N) diffing algorithm:

  $ my-diff yours mine
  --- yours 2019-10-28 15:11:07.150118090 -0700
  +++ mine  2019-10-28 16:46:42.093923066 -0700
  @@ -1,50 +1,75 @@
  - yours
  - yours
  [...]
  - yours
  + mine
  + mine
  [...]
  + mine
Spit out the diff and hunk header, then dump yours preceded by "- " and then mine preceded by "+ ".

👤 ysleepy
Does this algorithm find the shortest edit script?

Finding any diff is not really the problem.


👤 jacquesm
This doesn't answer your question but you really should read up on Needleman-Wunsch for matching up strings.

Love this post by the way, wished we would see more of these on HN.


👤 bodeadly
Ah, no, the Myers algorithm is also known as the "Sortest Edit Script" and your algo does not follow the shortest path or produce a script (just -1 for mismatches).

More specifically, it's not just something at the end appearing at the beginning causing skipping like others have pointed out. The skipping can happen in the middle. Consider the two sequences A B B C and B C A. Your algo will match the first Bs and then the Cs and skip the common B C together.

Even if you could somehow make it work, it would be O(N) but note that with all of the split() and map() and array indexing, it starts to be more like O(N+NX) where X is that potentially non-insignificant overhead.

Don't let me pooh-pooh you from trying but if you do give-in and use the Myers' algo, here's my C diff implementation:

  http://www.ioplex.com/~miallen/libmba/dl/src/diff.c
This is actually pretty compact as diff implementations go and would probably be a reference for implementing diff in JS.

👤 victorNicollet
Question: why do you ensure that you are only moving forward? Does it matter, for line highlighting, that a line has moved up, or now appears twice?

👤 pratchett
Your approach is called hash join in databases.

Not sure about the diffing research though.


👤 vince14
React has absolutely nothing to do with that.

👤 hhmc
hash map worst case insertion/search is O(N) so this is potentially worst case O(N^2). IIRC a typical implementation would be more likely to be O(N log N) for some high log factor in the limit, but still worse than O(ND). And this is before we account for the fact that hashing a line probably has an implicit factor of D in it.