0.0.3 • Published 2 years ago

@jrc03c/html-diff v0.0.3

License
ISC

Intro

This tool finds differences between HTML files on a per-element basis, in addition to finding differences on a per-line or per-character basis. This makes it easier to discover, for example, that two elements are nearly identical except that one lacks a class name that the other has, or that one has slightly different textContent than the other, or that they have the same children in a different order, or that one element has a particular child as an immediate descendant whereas the other has the same child as a deeply nested descendant.

Installation

For use in Node, bundlers, and the browser:

npm install @jrc03c/html-diff

For use at the command line:

npm install -g @jrc03c/html-diff

Usage

CLI

html-diff file1.html file2.html

Optionally, pass the "simple" flag (--simple or -s) to print the output in a YAML-ish format, which is sometimes a little easier to read than JS objects. For example:

html-diff -s file1.html file2.html

JS

In Node or bundlers:

const { getDifferences } = require("@jrc03c/html-diff")

Or in the browser:

<!--
  This defines all of the relevant functions, variables, and objects in the
  global scope.
-->
<script src="path/to/dist/html-diff.js"></script>

Then:

console.log(getDifferences(element1, element2))

NOTE: Some of the functions in this library expect HTMLElement inputs. If you're using this library in Node, I recommend that you use jsdom to construct virtual DOMs, and then pass elements from those DOMs into this library's functions. For example:

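const { JSDOM } = require("jsdom")
const { getDifferences } = require("@jrc03c/html-diff")

// Build two virtual DOMs whose <body> elements differ only in text content
const dom1 = new JSDOM("<div>Hello, world!</div>")
const dom2 = new JSDOM("<div>Goodbye, world!</div>")

console.log(
  getDifferences(dom1.window.document.body, dom2.window.document.body)
)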

API

DEFAULT_OPTIONS

DEFAULT_OPTIONS is an object that holds all of the constants used in the library's calculations. It has these properties and default values:

  • attributeWeight = represents how much element attribute differences should be weighted relative to other differences; has a default value of 1
  • childDifferenceWeight = represents how much the total differences between child elements (excluding the order of the children) should be weighted relative to other differences; has a default value of 1
  • childOrderWeight = represents how much the child order differences should be weighted relative to other differences; has a default value of 1
  • classWeight = represents how much element class differences should be weighted relative to other differences; has a default value of 1
  • differencePenalty = represents the power to which all differences should be raised, which is useful for exaggerating differences; has a default value of 1
  • idWeight = represents how much element ID differences should be weighted relative to other differences; has a default value of 1
  • shouldScoreChildren = represents whether or not child scores should contribute to the overall score; has a default value of true, but can be set to false to compare the given elements as though their children don't exist
  • tagNameWeight = represents how much element tag name differences should be weighted relative to other differences; has a default value of 1
  • textContentWeight = represents how much element text content differences should be weighted relative to other differences; has a default value of 1

To adjust any of the above properties, reassign their values, and then pass the entire object (or a copy of it) into the relevant functions below that take an options parameter. Note that the options parameter is optional everywhere it appears below.
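
As a minimal sketch of that pattern, the snippet below copies the defaults and overrides two of them. The DEFAULT_OPTIONS object defined here is a local stand-in that mirrors the documented defaults so that the snippet runs on its own; in real use it would come from the library itself (assuming it's exported alongside getDifferences).

```javascript
// Local stand-in mirroring the defaults documented above. In real use, this
// object would come from the library rather than being written out by hand.
const DEFAULT_OPTIONS = {
  attributeWeight: 1,
  childDifferenceWeight: 1,
  childOrderWeight: 1,
  classWeight: 1,
  differencePenalty: 1,
  idWeight: 1,
  shouldScoreChildren: true,
  tagNameWeight: 1,
  textContentWeight: 1,
}

// Copy the defaults and override only what needs to change, leaving the
// original object untouched.
const options = {
  ...DEFAULT_OPTIONS,
  classWeight: 2, // weight class differences twice as heavily
  shouldScoreChildren: false, // compare the elements as though childless
}

console.log(options)
```

The resulting object can then be passed as the last argument, e.g., getDifferences(element1, element2, options).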

getAttributes(el)

Returns a list of objects, each of which represents a single attribute on the element and which has properties of "name" and "value". Does not include "class" or "id" attributes because those are evaluated separately.
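
To make the return shape concrete, here's a rough sketch of the behavior described above. The function sketchGetAttributes and the fakeElement object are hypothetical and exist only for illustration; this is not the library's implementation, and the plain object merely stands in for a real HTMLElement.

```javascript
// Hypothetical re-implementation for illustration only; not the library's
// code. Works on any object with a DOM-like `attributes` collection.
function sketchGetAttributes(el) {
  return Array.from(el.attributes)
    .filter(attr => attr.name !== "class" && attr.name !== "id")
    .map(attr => ({ name: attr.name, value: attr.value }))
}

// Minimal stand-in for an element such as:
// <a id="home" class="nav" href="/" title="Home">
const fakeElement = {
  attributes: [
    { name: "id", value: "home" },
    { name: "class", value: "nav" },
    { name: "href", value: "/" },
    { name: "title", value: "Home" },
  ],
}

// Only the "href" and "title" entries remain
const attributes = sketchGetAttributes(fakeElement)
console.log(attributes)
```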

getDifferences(el1, el2, [options])

Returns a list of objects, each of which describes a difference between the two given elements. Each difference object has these properties:

  • el1 = the path from the document root to the first element in the relevant pair of conflicting elements
  • el2 = the path from the document root to the second element in the relevant pair of conflicting elements
  • type = the type of difference between the relevant pair of conflicting elements; can be one of:
    • ATTRIBUTE_DIFFERENCE
    • CHILD_CONTENT_DIFFERENCE
    • CHILD_ORDER_DIFFERENCE
    • CLASS_DIFFERENCE
    • ID_DIFFERENCE
    • ORDER_DIFFERENCE
    • TAG_NAME_DIFFERENCE
    • TEXT_CONTENT_DIFFERENCE
  • el1Value = the value in the first element where the difference occurred
  • el2Value = the value in the second element where the difference occurred
  • attribute = the name of the attribute where the difference occurred; this property is present only when the difference type is ATTRIBUTE_DIFFERENCE
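
Since the result is a flat list, it can be handy to group it by type before reading it. The sketch below does that with a hand-written sample in the shape described above. The sample values, and especially the path strings, are made up for illustration; the library's actual path and value formats may differ. The groupByType helper is hypothetical and not part of the library.

```javascript
// Hand-written sample in the documented shape. The paths and values are
// invented for illustration and may not match the library's real output.
const differences = [
  {
    el1: "body > div",
    el2: "body > div",
    type: "CLASS_DIFFERENCE",
    el1Value: "card",
    el2Value: "card active",
  },
  {
    el1: "body > div > a",
    el2: "body > div > a",
    type: "ATTRIBUTE_DIFFERENCE",
    attribute: "href",
    el1Value: "/a",
    el2Value: "/b",
  },
  {
    el1: "body > p",
    el2: "body > p",
    type: "CLASS_DIFFERENCE",
    el1Value: "",
    el2Value: "note",
  },
]

// Hypothetical helper: bucket difference objects by their `type` property
function groupByType(diffs) {
  const groups = {}
  for (const diff of diffs) {
    if (!groups[diff.type]) groups[diff.type] = []
    groups[diff.type].push(diff)
  }
  return groups
}

const grouped = groupByType(differences)

for (const [type, items] of Object.entries(grouped)) {
  console.log(`${type}: ${items.length}`)
}
```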

getDiffScore(el1, el2, [options])

Returns a score and list of differences (from getDifferences) between the two given elements. The lowest possible score is 1, in which case the elements are identical.

getMostSimilarElement(el, others, [options])

Given a list of elements called others, returns the element that's most similar to el (when compared using getDiffScore).
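
Conceptually, this is a "pick the lowest score" search. Here's a minimal sketch of that idea; it isn't the library's implementation. The names sketchMostSimilar and toyScore are hypothetical, and toyScore is only a stand-in for a real scoring function like getDiffScore (lower means more similar).

```javascript
// Hypothetical helper: returns the candidate with the lowest score, or null
// if there are no candidates. `scoreFn(a, b)` stands in for a real scoring
// function; lower scores mean "more similar".
function sketchMostSimilar(target, others, scoreFn) {
  let best = null
  let bestScore = Infinity

  for (const other of others) {
    const score = scoreFn(target, other)

    // Strict comparison means the earliest candidate wins ties
    if (score < bestScore) {
      best = other
      bestScore = score
    }
  }

  return best
}

// Toy scoring function: counts how many of two fields differ
const toyScore = (a, b) =>
  (a.tagName !== b.tagName ? 1 : 0) + (a.className !== b.className ? 1 : 0)

// Plain objects standing in for elements
const target = { tagName: "DIV", className: "card" }

const candidates = [
  { tagName: "SPAN", className: "card" },
  { tagName: "DIV", className: "card" },
  { tagName: "P", className: "note" },
]

const closest = sketchMostSimilar(target, candidates, toyScore)
console.log(closest)
```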

getNonChildTextContent(el)

Returns the text content of the given element that does not include any text content from child elements.
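
As a rough sketch of what "non-child" text means, the snippet below joins only the text nodes that are direct children of an element. The function sketchGetNonChildTextContent and the fakeParagraph object are hypothetical, and the library's own function may treat whitespace differently.

```javascript
const TEXT_NODE = 3 // same value as Node.TEXT_NODE in the DOM

// Hypothetical re-implementation for illustration only; not the library's
// code. Joins the element's own text nodes and skips child elements.
function sketchGetNonChildTextContent(el) {
  return Array.from(el.childNodes)
    .filter(node => node.nodeType === TEXT_NODE)
    .map(node => node.textContent)
    .join("")
}

// Minimal stand-in for: <p>Hello, <b>big</b> world!</p>
const fakeParagraph = {
  childNodes: [
    { nodeType: 3, textContent: "Hello, " },
    { nodeType: 1, textContent: "big" }, // the <b> element is skipped
    { nodeType: 3, textContent: " world!" },
  ],
}

const ownText = sketchGetNonChildTextContent(fakeParagraph)
console.log(JSON.stringify(ownText)) // prints "Hello,  world!"
```

Note the two spaces in the result: the text on either side of the skipped child is joined as-is, with no whitespace normalization.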

To do

  • Write unit tests for the main API functions.
  • Implement some dynamic programming features (like a dictionary that holds the differences between two elements so that they don't have to be recalculated multiple times). I'm not actually sure how big of a problem this is, but I do know that the functions recurse quite a bit, so it may make a difference in terms of performance.
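
For reference, the dictionary idea in the second item could look something like the sketch below. Nothing like this exists in the library yet; memoizePair, slowCompare, and fastCompare are hypothetical names. The sketch also assumes that results depend only on the two inputs, so a real version would need to account for the options parameter too.

```javascript
// Hypothetical helper: caches the result of fn(a, b) per pair of objects.
// Nested WeakMaps are used so that cached entries don't keep the objects
// (e.g., elements) alive after they're no longer referenced elsewhere.
function memoizePair(fn) {
  const cache = new WeakMap()

  return function (a, b) {
    let inner = cache.get(a)

    if (!inner) {
      inner = new WeakMap()
      cache.set(a, inner)
    }

    if (!inner.has(b)) {
      inner.set(b, fn(a, b))
    }

    return inner.get(b)
  }
}

// Toy comparison that counts how often it actually runs
let calls = 0

const slowCompare = (a, b) => {
  calls++
  return a.name === b.name ? 0 : 1
}

const fastCompare = memoizePair(slowCompare)
const x = { name: "div" }
const y = { name: "span" }

fastCompare(x, y)
fastCompare(x, y) // served from the cache

console.log(calls) // prints 1
```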