1.0.3 • Published 5 years ago

extract-pdf-by-coordinates v1.0.3

License
MIT

extract-pdf-by-coordinates

Extract text from a specific area of a PDF file, defined by a pair of coordinates.

$ npm install extract-pdf-by-coordinates

Utilizes pdf.js-extract. OCR is not supported.

Usage

convert(file, options)

Returns a Promise that resolves to an array in which each item is a page from the PDF. Each page is itself an array containing all the text elements extracted from that page. Each text element is an object with x, y, and str properties.
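To make the nesting concrete, here is a small sketch of what the resolved value might look like for a two-page PDF. The sample values are invented for illustration; only the x, y, and str properties come from the description above, and real elements may carry additional fields.

```javascript
// Hypothetical sample of the structure convert() resolves to.
// All values are made up; only x, y, and str are documented.
const pages = [
  [
    { x: 72, y: 90, str: "Electricity bill" },
    { x: 310, y: 530, str: "1,234" }
  ],
  [
    { x: 72, y: 90, str: "Electricity bill" },
    { x: 310, y: 530, str: "987" }
  ]
]

// Number of text elements on each page.
const elementsPerPage = pages.map(page => page.length)

// Every string found on the first page, in array order.
const firstPageText = pages[0].map(element => element.str)
```

Each inner array is what you would pass to extract as its page argument.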

extract(page, start, end)

  • page: array of text elements (one item of the array that convert resolves to)
  • start & end: objects with numeric x and y properties, expressed in points (pt), not pixels (px)

Returns a string containing the text elements found inside the area bounded by start and end. Text elements are separated by a newline.
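The behaviour described above can be approximated with a simple bounding-box filter. The sketch below is not the module's actual implementation: extractArea is a hypothetical name, and it assumes an element counts as "inside" when its x and y anchor falls within the rectangle, edges included. The real module may apply different rules.

```javascript
// Hypothetical re-implementation for illustration only.
// Assumption: an element is inside the area when its (x, y) anchor
// lies within the rectangle spanned by start and end, edges included.
function extractArea(page, start, end) {
  const minX = Math.min(start.x, end.x)
  const maxX = Math.max(start.x, end.x)
  const minY = Math.min(start.y, end.y)
  const maxY = Math.max(start.y, end.y)

  return page
    .filter(el => el.x >= minX && el.x <= maxX && el.y >= minY && el.y <= maxY)
    .map(el => el.str)
    .join("\n")
}

// Usage with made-up text elements.
const samplePage = [
  { x: 72, y: 90, str: "Electricity bill" },
  { x: 310, y: 530, str: "1,234" },
  { x: 320, y: 535, str: "kWh" }
]
const areaText = extractArea(samplePage, { x: 300, y: 520 }, { x: 345, y: 540 })
```

Normalising with Math.min and Math.max means the two corners can be given in either order.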

To find the coordinates you need, you can open your PDF in GIMP and view it using pt units.

Example

This module is most useful when you work with PDF files that share a consistent template, such as bills.

Suppose we have an archive of electricity bills and want to know how much energy was consumed over the entire period.

Given that all of the files have the same structure, we can simply pinpoint the area that we want to extract the text from.

[Image: electricity bill example]

In this case we only care about the total kW⋅h on each bill, so our coordinates enclose a single text element. If the area contains several text elements, the module returns all of them, each separated by a newline.

const { convert, extract } = require("extract-pdf-by-coordinates")

let totalConsumed = 0

convert("./bills.pdf")
  .then(pages => {
    for (const page of pages) {
      let monthConsumption = extract(
        page,
        { x: 300, y: 520 }, // Start position
        { x: 345, y: 540 } // End position
      )

      // Remove thousands separators, then convert the string to a number.
      monthConsumption = parseFloat(monthConsumption.split(",").join(""))

      // Skip pages where no number was found, so the total does not become NaN.
      if (!Number.isNaN(monthConsumption)) {
        totalConsumed += monthConsumption
      }
    }

    console.log(totalConsumed)
  })
  .catch(err => {
    // Report failures (for example, an unreadable file) on stderr.
    console.error(err)
  })
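If you want to test the number handling separately from the PDF extraction, the parsing and summing step can be pulled out into small functions. This is a sketch under the same assumption as the example above, namely that commas are thousands separators. The names parseAmount and sumConsumption are hypothetical and not part of the module.

```javascript
// Hypothetical helpers, not part of extract-pdf-by-coordinates.
// Assumption: commas are thousands separators, as in the example above.
function parseAmount(text) {
  return parseFloat(text.replace(/,/g, ""))
}

// Sum a list of extracted strings, ignoring any that are not numeric.
function sumConsumption(values) {
  let total = 0
  for (const value of values) {
    const amount = parseAmount(value)
    if (!Number.isNaN(amount)) {
      total += amount
    }
  }
  return total
}

// Usage with made-up extracted strings; the empty string is skipped.
const total = sumConsumption(["1,234", "987", ""])
```

Bills that use a comma as the decimal separator would need a different parsing rule.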

Todo

  • Support for other units like px, cm, in, %
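
Until other units are supported, you can convert to pt yourself before calling extract. The helper below is a minimal sketch, not part of the module: toPoints, its dpi default of 96, and the pageSize option are all assumptions made for illustration. It relies on 1 in = 72 pt and 1 in = 2.54 cm. Pixel values depend on the resolution at which the page was rendered, and percentages need the page width or height in pt.

```javascript
// Hypothetical helper, not part of extract-pdf-by-coordinates.
const POINTS_PER_INCH = 72
const CM_PER_INCH = 2.54

// Convert a length in the given unit to points (pt).
// dpi:      resolution used for px values (assumed default: 96)
// pageSize: page width or height in pt, required for "%"
function toPoints(value, unit, { dpi = 96, pageSize } = {}) {
  switch (unit) {
    case "pt":
      return value
    case "in":
      return value * POINTS_PER_INCH
    case "cm":
      return (value / CM_PER_INCH) * POINTS_PER_INCH
    case "px":
      return (value * POINTS_PER_INCH) / dpi
    case "%":
      if (typeof pageSize !== "number") {
        throw new TypeError("pageSize (in pt) is required for % values")
      }
      return (value / 100) * pageSize
    default:
      throw new RangeError(`Unsupported unit: ${unit}`)
  }
}

// Usage: build a start coordinate from measurements taken in inches.
const startInPoints = { x: toPoints(4, "in"), y: toPoints(7, "in") }
```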

Versions

  • 1.0.4 (5 years ago)
  • 1.0.3 (5 years ago)
  • 1.0.2 (5 years ago)
  • 1.0.1 (5 years ago)
  • 1.0.0 (5 years ago)