@fairwords/html-to-text-position-mapper

HTML to Plain Text transformer and position mapper

If you have access to the source code, there is a TS file with the same content as this README that you can build and run.

This library converts HTML to plain text while preserving a mapping between the positions of text fragments in the plain text and in the HTML. Simply put, for any given fragment of the plain text, it tells you the corresponding position in the original HTML string.
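To make the idea of a position mapping concrete, here is a deliberately naive sketch. It is not the library's implementation or API: toyMapHtml and htmlIndexOfPlainChar are hypothetical names invented for this illustration, and the sketch ignores entities, quotes generated by q elements, comments and everything else a real converter has to handle.

```typescript
// Toy illustration only: NOT the library's implementation and NOT its API.
// It drops everything between '<' and '>' and remembers, for each plain-text
// character, the index of the HTML character it came from.
interface ToyMapping {
  readonly plainText: string;
  // htmlIndexOfPlainChar[i] is the HTML index of plain-text character i
  readonly htmlIndexOfPlainChar: readonly number[];
}

const toyMapHtml = (html: string): ToyMapping => {
  let plainText = '';
  const htmlIndexOfPlainChar: number[] = [];
  let insideTag = false;
  for (let i = 0; i < html.length; i += 1) {
    const ch = html.charAt(i);
    if (ch === '<') {
      insideTag = true;
    } else if (ch === '>') {
      insideTag = false;
    } else if (!insideTag) {
      plainText += ch;
      htmlIndexOfPlainChar.push(i);
    }
  }
  return { plainText, htmlIndexOfPlainChar };
};

const toy = toyMapHtml('<p>my <i>ca</i>t</p>');
// toy.plainText is 'my cat'; the word 'cat' occupies plain-text indices 3..5.
const toyHtmlStart = toy.htmlIndexOfPlainChar[3]; // 9, the 'c' in the HTML
const toyHtmlEnd = toy.htmlIndexOfPlainChar[5]; // 15, the 't' after '</i>'
```

The three plain-text characters of cat end up spanning seven HTML characters, which is exactly the kind of difference the rest of this README deals with.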

First, let's say you have some code that searches for text in a plain string. For the sake of this example, let's write a simple one that looks for matches of a single regular expression.

// PositionInText has the shape { start: number; length: number } (see how it is used below)
interface TextSearchMatch { // Interface for a found match
  readonly regexPattern: string;
  readonly text: string;
  readonly positionInText: PositionInText;
}

// Helper function that converts regexp match to TextSearchMatch
const convertRegexMatch = (regexPattern: string) => (match: RegExpMatchArray): TextSearchMatch => {
  if (match.index === undefined) {
    throw new Error('convertRegexMatch: match index was undefined');
  }
  const matchText = match[0];
  if (matchText === undefined) {
    throw new Error('convertRegexMatch: matchText was undefined');
  }
  const matchLength = matchText.length;

  return {
    positionInText: {
      start: match.index,
      length: matchLength,
    },
    text: matchText,
    regexPattern,
  };
};

const createRegexTextSearcher = (regex: RegExp) => (text: string): TextSearchMatch[] =>
  Array.from(
    text.matchAll(regex),
  ).map(convertRegexMatch(regex.source));

Let's see if it works first:

const startsWithCat = /cat[a-z]*/ugi; // Any word that starts with a "cat" must be somewhat a feline, right?
const searchForCats = createRegexTextSearcher(startsWithCat);
const text =
  'The fact that my cat ate my pet caterpillar is a "catastrophe"';
//                  ^ ^            ^         ^       ^         ^
// 012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
// 0         10        20        30        40        50        60        70        80        90      100       110
// Adding a scale here, so it's easier to find fragments positions
const expectedSearchResultsInPlainText = [
  {
    positionInText: {
      start: 17,
      length: 3,
    },
    regexPattern: 'cat[a-z]*',
    text: 'cat',
  },
  {
    positionInText: {
      start: 32,
      length: 11,
    },
    regexPattern: 'cat[a-z]*',
    text: 'caterpillar',
  },
  {
    positionInText: {
      start: 50,
      length: 11,
    },
    regexPattern: 'cat[a-z]*',
    text: 'catastrophe',
  },
];

describe('createRegexTextSearcher', () => {
  it('Must search for a text in a plain string using provided regexp', () => {
    const searchResults = searchForCats(text);
    expect(searchResults).toStrictEqual(expectedSearchResultsInPlainText);
  });
});

Perfect. Now let's say you want to use this search function on HTML and find the same words in something like this:

const html =
  '<body>The fact that <i>my ca</i>t ate my pet&nbsp;ca<b>t<i>er</i>p</b>illar is a <q>catastrophe</q><br/></body>';
//                           ^     ^                 ^                       ^         ^         ^
// 012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
// 0         10        20        30        40        50        60        70        80        90      100       110
// Adding a scale here, so it's easier to find fragments positions

// Looks like you want to find 3 fragments in this HTML:
const expectedSearchResultsInHtml = [
  // `cat`: starts at 26, 7 characters long (due to HTML tags it's not 3 anymore)
  {
    positionInText: {
      start: 26,
      length: 7,
    },
    regexPattern: 'cat[a-z]*',
    text: 'cat',
  },
  // `caterpillar`: starts at 50, 25 characters long
  {
    positionInText: {
      start: 50,
      length: 25,
    },
    regexPattern: 'cat[a-z]*',
    text: 'caterpillar',
  },
  // `catastrophe`: starts at 84, 11 characters long
  {
    positionInText: {
      start: 84,
      length: 11,
    },
    regexPattern: 'cat[a-z]*',
    text: 'catastrophe',
  },
] as const;
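To see where the extra length comes from, it can help to slice the raw HTML at these positions. This is a small sketch that uses only String.prototype.slice; sliceByPosition is a hypothetical helper written for this illustration and is not part of the library.

```typescript
// The same HTML string as in the example above.
const exampleHtml =
  '<body>The fact that <i>my ca</i>t ate my pet&nbsp;ca<b>t<i>er</i>p</b>illar is a <q>catastrophe</q><br/></body>';

// Hypothetical helper for this illustration; it is not part of the library.
const sliceByPosition = (
  source: string,
  position: { readonly start: number; readonly length: number },
): string => source.slice(position.start, position.start + position.length);

const rawFragments = [
  { start: 26, length: 7 },
  { start: 50, length: 25 },
  { start: 84, length: 11 },
].map((position) => sliceByPosition(exampleHtml, position));
// rawFragments is ['ca</i>t', 'ca<b>t<i>er</i>p</b>illar', 'catastrophe']
```

The first two fragments contain markup in the middle of the word, which is why their lengths in the HTML are larger than in the plain text.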

Here's how you can use this library to get what you want:

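// Assumption: the functions used in this README are named exports of the package root.
import {
  getTagsFromInterval,
  mapHtmlToPlainText,
  mapPositionContainerFromPlainTextToOriginalHtml,
} from '@fairwords/html-to-text-position-mapper';

// First of all, you need to convert and map HTML to plain text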
const mappingResult = await mapHtmlToPlainText(html);

expect(mappingResult.plainText).toStrictEqual('The fact that my cat ate my pet caterpillar is a "catastrophe"');

// Run the search on the plain text
const foundCats = searchForCats(mappingResult.plainText);
expect(foundCats).toStrictEqual(expectedSearchResultsInPlainText);

// Transform search results to HTML index space:
const transformPositionsToHtml = mapPositionContainerFromPlainTextToOriginalHtml(
  mappingResult.plainTextToHtml,
  'positionInText', // Name of a field with positions
);
const foundCatsInHtml = foundCats.map(transformPositionsToHtml);
expect(foundCatsInHtml).toStrictEqual(expectedSearchResultsInHtml);

Congratulations, your plain text search engine can now target HTML as if it were plain text. Note that the array you get after mapping with mapPositionContainerFromPlainTextToOriginalHtml may contain undefined elements. An element is undefined only if the PositionInText you provided is not fully within the resulting plain text. For example, if your plain text is 10 characters long but you pass the position { start: 5, length: 20 }, the position extends past the end of the plain text and the result is undefined. The same happens if start is negative or the position is otherwise invalid. So either make sure your positions are within the plain text index bounds, or filter the undefined values out of the results.
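One way to filter out the undefined values is a small type guard, sketched below. MatchInHtml and isDefined are hypothetical names for this illustration, and the mapped array is hand-written stand-in data rather than real output of the library.

```typescript
// Stand-in shape for this sketch; in real code use the element type of your mapped results.
interface MatchInHtml {
  readonly text: string;
  readonly positionInText: { readonly start: number; readonly length: number };
}

// Type guard that lets TypeScript narrow `T | undefined` down to `T`.
const isDefined = <T>(value: T | undefined): value is T => value !== undefined;

// Hand-written stand-in for the result of the position mapping step.
const mapped: (MatchInHtml | undefined)[] = [
  { text: 'cat', positionInText: { start: 26, length: 7 } },
  undefined, // e.g. a position that was outside the plain text bounds
];

// After filtering, the element type no longer includes `undefined`.
const usable: MatchInHtml[] = mapped.filter(isDefined);
```

Using a type guard instead of a plain callback means the compiler knows the filtered array holds only real matches, so no further undefined checks are needed downstream.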

Additionally, if you want to highlight the found entries in the HTML, you probably want to know whether there are any tags in the middle of the matched text. For example, take a look at the first cat match. In the HTML it is ca</i>t. You can't just surround it with a span tag, e.g. like this: <span>ca</i>t</span>, because you would break the closing </i> tag. So in this particular case you potentially need 2 spans, e.g. <span>ca</span></i><span>t</span>. To make this decision, you need to know what these tags are. Here's how you can do it:

htmlTags is one of the fields of the result of a mapHtmlToPlainText call. It contains all the opening and closing tags of the parsed HTML. The original HTML is duplicated here for readability.

const html =
  '<body>The fact that <i>my ca</i>t ate my pet&nbsp;ca<b>t<i>er</i>p</b>illar is a <q>catastrophe</q><br/></body>';
//                           ^     ^                 ^                       ^         ^         ^
// 012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
// 0         10        20        30        40        50        60        70        80        90      100       110
// Adding a scale here, so it's easier to find fragments positions

const expectedTags = [
  {
    tag: 'body',
    type: 'TagOpened',
    start: 0,
    end: 5,
  },
  {
    tag: 'i',
    type: 'TagOpened',
    start: 20,
    end: 22,
  },
  {
    tag: 'i',
    type: 'TagClosed',
    start: 28,
    end: 31,
  },
  {
    tag: 'b',
    type: 'TagOpened',
    start: 52,
    end: 54,
  },
  {
    tag: 'i',
    type: 'TagOpened',
    start: 56,
    end: 58,
  },
  {
    tag: 'i',
    type: 'TagClosed',
    start: 61,
    end: 64,
  },
  {
    tag: 'b',
    type: 'TagClosed',
    start: 66,
    end: 69,
  },
  {
    tag: 'q',
    type: 'TagOpened',
    start: 81,
    end: 83,
  },
  {
    tag: 'q',
    type: 'TagClosed',
    start: 95,
    end: 98,
  },
  {
    tag: 'br',
    type: 'TagOpened',
    // Note that TagOpened tags can have an extra optional `isSelfClosing` field
    // it will only be there when it is true, you will never see it with the `false` value.
    isSelfClosing: true,
    start: 99,
    end: 103,
  },
  {
    tag: 'body',
    type: 'TagClosed',
    start: 104,
    end: 110,
  },
];
expect(mappingResult.htmlTags).toStrictEqual(expectedTags);

// Let's take a look at the tags that are inside our first match, i.e. in the word "cat".
const getTagsForMatch = getTagsFromInterval(mappingResult.htmlTags);
const firstMatch = expectedSearchResultsInHtml[0];
const firstMatchTags = getTagsForMatch(firstMatch.positionInText);
expect(firstMatchTags).toStrictEqual([
  { // There must be only one: the closing i
    tag: 'i',
    type: 'TagClosed',
    start: 28,
    end: 31,
  },
]);

// Let's take a look at the tags for the second match, which has many more tags inside it.
const secondMatchTags = getTagsForMatch(expectedSearchResultsInHtml[1].positionInText);
expect(secondMatchTags).toStrictEqual([
  {
    tag: 'b',
    type: 'TagOpened',
    start: 52,
    end: 54,
  },
  {
    tag: 'i',
    type: 'TagOpened',
    start: 56,
    end: 58,
  },
  {
    tag: 'i',
    type: 'TagClosed',
    start: 61,
    end: 64,
  },
  {
    tag: 'b',
    type: 'TagClosed',
    start: 66,
    end: 69,
  },
]);
// That's a lot of tags! But it looks correct.
// What is also important: within the span of this match there are no orphaned, unmatched closing tags, so you
// can surround the whole match with the highlighter element.
// In the future this library may get a function that checks for unmatched closing tags; it should be
// trivial.
// Note that both `start` and `end` of an interval are inclusive.
// If you want to use the `slice` method, remember that it expects an exclusive `end`.
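The two notes above (checking for unmatched tags, and inclusive end versus slice) can be sketched as follows. Nothing here is part of the library: TagLike, areTagsBalanced and textSegments are hypothetical names for this illustration, and the tag objects are copied by hand from the expected values shown earlier. The balance check treats only tags marked isSelfClosing as needing no closing tag, which is an assumption of this sketch.

```typescript
// Local stand-in for the tag shape shown in the examples above.
interface TagLike {
  readonly tag: string;
  readonly type: 'TagOpened' | 'TagClosed';
  readonly isSelfClosing?: boolean;
  readonly start: number;
  readonly end: number; // inclusive
}

// True when the list has no closing tag without an opener and no opener left open.
const areTagsBalanced = (tags: readonly TagLike[]): boolean => {
  const open: string[] = [];
  for (const t of tags) {
    if (t.type === 'TagOpened') {
      if (!t.isSelfClosing) open.push(t.tag);
    } else if (open.pop() !== t.tag) {
      return false;
    }
  }
  return open.length === 0;
};

// Returns the text-only pieces of a match, skipping the tags inside it.
// Assumes `tags` are sorted by `start` and lie inside the position.
const textSegments = (
  source: string,
  position: { readonly start: number; readonly length: number },
  tags: readonly TagLike[],
): string[] => {
  const segments: string[] = [];
  let cursor = position.start;
  for (const t of tags) {
    if (t.start > cursor) segments.push(source.slice(cursor, t.start));
    cursor = t.end + 1; // `end` is inclusive, `slice` expects an exclusive end
  }
  const stop = position.start + position.length;
  if (stop > cursor) segments.push(source.slice(cursor, stop));
  return segments;
};

const sourceHtml =
  '<body>The fact that <i>my ca</i>t ate my pet&nbsp;ca<b>t<i>er</i>p</b>illar is a <q>catastrophe</q><br/></body>';
const catTags: TagLike[] = [{ tag: 'i', type: 'TagClosed', start: 28, end: 31 }];
const caterpillarTags: TagLike[] = [
  { tag: 'b', type: 'TagOpened', start: 52, end: 54 },
  { tag: 'i', type: 'TagOpened', start: 56, end: 58 },
  { tag: 'i', type: 'TagClosed', start: 61, end: 64 },
  { tag: 'b', type: 'TagClosed', start: 66, end: 69 },
];

const catIsBalanced = areTagsBalanced(catTags); // false: orphaned closing i
const caterpillarIsBalanced = areTagsBalanced(caterpillarTags); // true
const catSegments = textSegments(sourceHtml, { start: 26, length: 7 }, catTags); // ['ca', 't']
```

When the tags are balanced, a single highlighter element around the whole match is enough. Otherwise, wrapping each text segment separately keeps the existing markup intact.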

Important disclaimers:

This code will not produce exactly the same HTML to plain text conversion as your favorite browser. In fact, the conversion result for the same HTML differs between browsers.
Another important point: it does not (want to) parse CSS, so it has no idea about styles that might change the order of the text in certain elements, e.g. flex-direction: row-reverse; and many others.
