1.0.1 • Published 1 year ago

cloud-vision-lines-phrases-parser v1.0.1

License
Apache-2.0

cloud-vision-lines-phrases-parser

Uses customizable parsers to find text located within OCR output. In this case, the OCR output is the line list object generated by the cloud-vision-lines-phrases package, which itself is a modified version of the output from Google Cloud Vision's small batch file annotation online.

Check out a UI demonstration of the module here as well as in CodeSandbox.

Installation

npm install cloud-vision-lines-phrases-parser

Then import the parser function:

const { getParsersTarget } = require('cloud-vision-lines-phrases-parser')

Use

//Refer to the cloud-vision-lines-phrases package for further detail on this setup for generating the lineList object
const getAnnotationFormats = require('cloud-vision-lines-phrases')

const batchFileAnnotation = ['...'] 
const bucketFileBasename = 'filename'
const annotationFormats = getAnnotationFormats(batchFileAnnotation, bucketFileBasename);
const lineList = annotationFormats.lineList
const parsers = [

  //parser #1
  {
    count: 2, //any number
    method: 'after', //'after' or 'below'
    target: {
      pattern: '.+', //use any pattern that you would input into a RegExp object's constructor function
      unit: 'phrase', //'phrase' or 'word'
    },
  },

  //parser #2
  {
    count: 2,
    method: 'below',
    target: {
      pattern: '\\S{8}',
      unit: 'word',
    },
  },

  //parser #3, etc.
]

The parsers array in the above example performs the following steps:

1) The first parser finds the <2nd> <phrase> <after> the <beginning of the document> that matches the pattern: <.+>

2) The second parser finds the <2nd> <word> <below> the <previously parsed value> that matches the pattern: <\S{8}>
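Because each target.pattern is a string handed to the RegExp constructor, backslashes have to be doubled inside the string literal. The snippet below is a small illustration using only the built-in RegExp object. It does not call the package, and whether the package anchors or otherwise modifies the pattern is not covered here.

```javascript
// Pattern strings as they would appear in a parser's target.pattern.
// '\\S{8}' in a string literal is the same regex as the literal /\S{8}/.
const wordPattern = new RegExp('\\S{8}');
const phrasePattern = new RegExp('.+');

console.log(wordPattern.source);                 // \S{8}
console.log(wordPattern.test('2/1/2023'));       // true  (8 non-whitespace characters)
console.log(wordPattern.test('2/1/23'));         // false (only 6 characters)
console.log(phrasePattern.test('Invoice Date')); // true  (any non-empty text)
```

Note that an unanchored pattern such as '\\S{8}' also matches longer words, since the regex only needs to find eight consecutive non-whitespace characters somewhere in the text.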

//Now we can input the lineList and parsers array into the package's function
const targetTextObject = getParsersTarget(lineList, parsers)

The returned object contains the matched text value, its location within the line list, its bounding box coordinates, and its unit type:

{
  value: '2/1/2023',
  normalizedVertices: ['...'],
  indices: {
    pageEnd: 0, lineEnd: 3, phraseEnd: 1, wordEnd: 0
  },
  unitType: 'word', 
}

Looking at the image file (assume this snippet is the entire document), the first parser captures the yellow-highlighted value, and the second parser (starting its parse from this value) captures the blue-highlighted value, which is the value ultimately returned:

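[Image: document snippet with the first parser's match highlighted in yellow and the second parser's match highlighted in blue]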

More details about how the units are defined (phrases and words) can be found in the cloud-vision-lines-phrases package.

When selecting the after method, think of this as parsing left to right across the page, and then down to the next line, just as you would read a page of text. More technically, the parser iterates through one of the two unit types (either phrase or word, per the target.unit) until it finds a unit whose text matches the target.pattern.

The count determines how many matching units it must find. The last matching unit is the returned value and the point where the parse terminates. If the count has not been met by the time the parser reaches the end of the line list, an empty value is returned.

Each parser starts at the unit where the previous parser ended, and the first parser always starts at the beginning of the line list. This applies to both the after and below methods, so it doesn't matter which of the two methods you select for the first parser.
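The after behavior described above can be sketched roughly as follows. This is not the package's implementation: findAfter is a hypothetical helper, and the flat units array of { text } objects is an assumed simplification of the real lineList structure, already arranged in reading order. The caller is assumed to pass the index at which scanning should begin.

```javascript
// Hypothetical sketch of the 'after' method (not the package's own code).
// `units` is an assumed flat array of { text } objects in reading order:
// left to right across each line, then down to the next line.
function findAfter(units, startIndex, pattern, count) {
  const regex = new RegExp(pattern);
  let matches = 0;
  for (let i = startIndex; i < units.length; i++) {
    if (regex.test(units[i].text)) {
      matches += 1;
      // The count-th matching unit is the result and the stopping point.
      if (matches === count) return { index: i, value: units[i].text };
    }
  }
  // End of the list reached before the count was met: empty value.
  return { index: -1, value: '' };
}

const words = [
  { text: 'Invoice' }, { text: 'Date:' }, { text: '2/1/2023' },
  { text: 'Due:' }, { text: '3/1/2023' },
];

const second = findAfter(words, 0, '\\S{8}', 2);
const third = findAfter(words, 0, '\\S{8}', 3);
console.log(second); // { index: 4, value: '3/1/2023' }
console.log(third);  // { index: -1, value: '' }
```

In this sketch a following parser would resume from the returned index, which mirrors the chaining described above. How the real package represents the empty value is not specified here, so the empty string is only a stand-in.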

CAVEAT: If a parser captures a word (yellow) that is embedded in the beginning or middle of a phrase, the next parser (if its target.unit is 'phrase') will use the remainder of that phrase (blue) in its count:

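[Image: a word highlighted in yellow inside a phrase, with the remainder of that phrase highlighted in blue]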

When the below method is selected, the parser will go line by line directly beneath the previous parser's stopping point until it finds a unit (either a phrase or word depending on the target.unit) with text matching the target.pattern. Directly beneath, in this context, means that a unit is in a line beneath, and horizontally overlaps, the previous parser's stopping point. When a matching unit is found, the count is incremented and the parse continues to go down to the next line directly beneath the previous parser's stopping point and looks for another matching unit. This continues until either the count is met, which will return the last count's value, or the end of the line list is reached, which will return an empty value. The positions of the units stay relative to each other across multiple pages, i.e. text on subsequent pages is considered to be directly beneath text on prior pages, as long as their x-coordinates overlap.
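A rough sketch of the below behavior, under stated assumptions: findBelow and overlapsHorizontally are hypothetical helpers, each unit is assumed to carry xMin and xMax values taken from its bounding box, and lines from all pages are assumed to be concatenated into one array so that later pages sit beneath earlier ones. The sketch also counts at most one match per line, which the description above does not confirm.

```javascript
// Hypothetical sketch of the 'below' method (not the package's own code).
// Two units overlap horizontally when their x-ranges intersect.
function overlapsHorizontally(a, b) {
  return a.xMin < b.xMax && b.xMin < a.xMax;
}

// `lines` is an assumed array of lines, each an array of { text, xMin, xMax }.
// `anchor` is the previous parser's stopping point: { lineIndex, xMin, xMax }.
function findBelow(lines, anchor, pattern, count) {
  const regex = new RegExp(pattern);
  let matches = 0;
  for (let i = anchor.lineIndex + 1; i < lines.length; i++) {
    // Only units directly beneath the anchor are candidates.
    const hit = lines[i].find(
      (unit) => overlapsHorizontally(unit, anchor) && regex.test(unit.text)
    );
    if (hit) {
      matches += 1;
      if (matches === count) return { lineIndex: i, value: hit.text };
    }
  }
  // End of the line list reached before the count was met: empty value.
  return { lineIndex: -1, value: '' };
}

const lines = [
  [{ text: 'Date', xMin: 0.10, xMax: 0.20 }, { text: 'Amount', xMin: 0.60, xMax: 0.75 }],
  [{ text: '1/1/2023', xMin: 0.08, xMax: 0.22 }, { text: '$100.00', xMin: 0.62, xMax: 0.74 }],
  [{ text: '2/1/2023', xMin: 0.09, xMax: 0.21 }, { text: '$250.00', xMin: 0.61, xMax: 0.73 }],
];
const anchor = { lineIndex: 0, xMin: 0.10, xMax: 0.20 };

const secondBelow = findBelow(lines, anchor, '\\S{8}', 2);
console.log(secondBelow); // { lineIndex: 2, value: '2/1/2023' }
```

The overlap test is always made against the anchor rather than the most recent match, following the description that each step looks directly beneath the previous parser's stopping point.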