percollate v4.2.3
Percollate is a command-line tool that turns web pages into beautifully formatted PDF, EPUB, HTML or Markdown files.
Installation
percollate is a Node.js command-line tool which you can install globally from npm:
npm install -g percollate
Percollate and its dependencies require Node.js 14.17.0 or later.
Community-maintained packages
There's a packaged version available on the Arch User Repository, which you can install using your local AUR helper (yay, pacaur, or similar):
yay -S nodejs-percollate
Some Docker images are available in this tracking issue.
Usage
Run percollate --help for a list of available commands and options.
Percollate is invoked on one or more operands (usually URLs):
percollate <command> [options] url [url]...
The following commands are available:
- percollate pdf produces a PDF file;
- percollate epub produces an EPUB file;
- percollate html produces an HTML file;
- percollate md produces a Markdown file.
The operands can be URLs, paths to local files, or the - character, which stands for stdin (the standard input).
Available options
Unless otherwise stated, these options apply to all four commands.
-o, --output
Specify the path of the resulting bundle relative to the current folder.
percollate pdf https://example.com -o my-example.pdf
-u, --url
Using the - operand you can read the HTML content from stdin, as fetched by a separate command, such as curl. In this sort of setup, percollate does not know the URL from which the content has been fetched, and relative paths on images, anchors, et cetera won't resolve correctly.
Use the --url option to supply the source's original URL.
curl https://example.com | percollate pdf - --url=https://example.com
-w, --wait
By default, percollate processes URLs in parallel. Use the --wait option to process them sequentially instead, with a pause between items. The delay is specified in seconds, and can be zero.
percollate epub --wait=1 url1 url2 url3
--individual
By default, percollate bundles all web pages in a single file. Use the --individual flag to export each source to a separate file.
percollate pdf --individual http://example.com/page1 http://example.com/page2
--template
Path to a custom HTML template. Applies to pdf, html, and md.
--style
Path to a custom CSS stylesheet, relative to the current folder.
--css
Additional CSS styles you can pass from the command-line to override styles specified by the default/custom stylesheet.
--no-amp
Don't prefer the AMP version of the web page.
--debug
Print more detailed information.
-t, --title
Provide a title for the bundle.
percollate epub http://example.com/page-1 http://example.com/page-2 --title="Best Of Example"
-a, --author
Provide an author for the bundle.
percollate pdf --author="Ella Example" http://example.com
--cover
Generate a cover. The option is implicitly enabled when the --title option is provided, or when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-cover flag.
--toc
Generate a hyperlinked table of contents. The option is implicitly enabled when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-toc flag.
Applies to pdf, html, and md.
--toc-level=<level>
By default, the table of contents is a flat list of article titles. With the --toc-level option the table of contents will include headings under each article title (<h2>, <h3>, etc.), up to the specified heading depth. A number between 1 and 6 is expected.
Using --toc-level with a value greater than 1 implies --toc.
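A minimal sketch of a bundle with a nested table of contents. The wrapper name bundle_with_toc is hypothetical, and the depth of 3 is only an illustrative choice within the documented 1 to 6 range:

```shell
# bundle_with_toc OUTPUT URL...: bundle the URLs into a single PDF whose table
# of contents lists headings down to <h3> under each article title.
# Hypothetical wrapper; --toc-level and --output are the options described above.
bundle_with_toc() {
	bundle_output=$1
	shift
	# --toc-level greater than 1 implies --toc, so --toc is not passed explicitly.
	percollate pdf --toc-level=3 --output="$bundle_output" "$@"
}
```

For example, `bundle_with_toc list.pdf https://example.com/page1 https://example.com/page2`.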
--hyphenate
Hyphenation is enabled by default for pdf, and disabled for epub, html, and md. You can opt into hyphenation with the --hyphenate flag, or disable it with the --no-hyphenate flag.
See also the Hyphenation and justification recipe.
--inline
Embed images inline with the document. Images are fetched and converted to Base64-encoded data URLs.
This option is particularly useful for html to produce self-contained HTML files.
--md.<option>=<value>
Pass options to the underlying Markdown stringifier, mdast-util-to-markdown. These are the default Markdown options:
const DEFAULT_MARKDOWN_OPTIONS = {
	fences: true,
	emphasis: '_',
	strong: '_',
	resourceLink: true,
	rule: '-'
};
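As a sketch of overriding these defaults from the command line, the hypothetical wrapper below switches emphasis and strong markers to asterisks. It assumes that the values are forwarded to mdast-util-to-markdown unchanged, which accepts '*' or '_' for both options:

```shell
# md_export URL OUTPUT: export URL to a Markdown file using asterisks for
# emphasis and strong text instead of the default underscores.
# Hypothetical wrapper; the quotes keep the shell from expanding '*' as a glob.
md_export() {
	percollate md \
		--md.emphasis='*' \
		--md.strong='*' \
		--output="$2" \
		"$1"
}
```

For example, `md_export https://example.com article.md`.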
--unsafe
Disables some JSDOM validations that may throw an error when parsing invalid HTML pages (See #177).
Recipes
Basic bundling
To turn a single web page into a PDF:
percollate pdf --output=some.pdf https://example.com
To bundle several web pages into a single PDF, specify them as separate arguments to the command:
percollate pdf --output=some.pdf https://example.com/page1 https://example.com/page2
You can use common Unix commands and keep the list of URLs in a newline-delimited text file:
cat urls.txt | xargs percollate pdf --output=some.pdf
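If the list also holds blank lines or notes, xargs would pass them along as operands. The sketch below filters them out first. The helper names are hypothetical, and treating lines that start with # as comments is a convention assumed here, not a percollate feature:

```shell
# clean_urls FILE: print the URLs in FILE, one per line, skipping blank lines
# and lines whose first non-blank character is '#'.
# The '#' comment convention is an assumption of this sketch.
clean_urls() {
	grep -v -E '^[[:space:]]*(#|$)' "$1"
}

# bundle_urls FILE OUTPUT: bundle every URL listed in FILE into a single PDF.
bundle_urls() {
	clean_urls "$1" | xargs percollate pdf --output="$2"
}
```

For example, `bundle_urls urls.txt some.pdf`.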
To transform several web pages into individual PDF files at once, use the --individual flag:
percollate pdf --individual https://example.com/page1 https://example.com/page2
If you'd like to fetch the HTML with an external command, you can use - as an operand, which stands for stdin (the standard input):
curl https://example.com/page1 | percollate pdf --url=https://example.com/page1 -
Notice we're using the --url option to tell percollate the source of the (now-anonymous) HTML it gets on stdin, so that relative URLs on links and images resolve correctly.
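The same recipe can be wrapped in a small function so that the URL is typed only once. This is a sketch: the name pdf_from_stdin is hypothetical, and the curl flags are one reasonable choice rather than anything percollate requires:

```shell
# pdf_from_stdin URL OUTPUT: fetch URL with curl and convert the HTML to a PDF,
# passing the same URL to percollate so that relative links resolve.
# Hypothetical helper; the percollate options are the ones described above.
pdf_from_stdin() {
	# -f: fail on HTTP errors, -sS: quiet but still report errors,
	# -L: follow redirects
	curl -fsSL "$1" | percollate pdf --url="$1" --output="$2" -
}
```

For example, `pdf_from_stdin https://example.com/page1 page1.pdf`. If the request is redirected, the final address may differ from the one given to --url, so relative links would still resolve against the original URL.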
The --css option
The --css option lets you pass a small snippet of CSS to percollate. Here are some common use-cases:
Custom page size / margins
The default page size is A5 (portrait). You can use the --css option to override it using any supported CSS size:
percollate pdf --css "@page { size: A3 landscape }" http://example.com
Similarly, you can define:
- custom margins, e.g. @page { margin: 0 }
- the base font size: html { font-size: 10pt }
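These overrides can be combined in a single --css value. A minimal sketch, assuming the option accepts several rules in one string like any other CSS snippet; the helper names page_css and print_pdf are hypothetical:

```shell
# page_css SIZE MARGIN FONT_SIZE: print a CSS snippet for the --css option that
# sets the page size, the page margins and the base font size.
# Hypothetical helper; the rules it emits are the ones listed above.
page_css() {
	printf '@page { size: %s; margin: %s } html { font-size: %s }' "$1" "$2" "$3"
}

# print_pdf URL OUTPUT: render URL as an A4 portrait PDF with 2cm margins
# and an 11pt base font.
print_pdf() {
	percollate pdf --css "$(page_css 'A4 portrait' 2cm 11pt)" --output="$2" "$1"
}
```

For example, `print_pdf http://example.com example.pdf`.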
Changing the font stacks
The default stylesheet includes CSS variables for the fonts used in the PDF:
:root {
	--main-font: Palatino, 'Palatino Linotype', 'Times New Roman',
		'Droid Serif', Times, 'Source Serif Pro', serif, 'Apple Color Emoji',
		'Segoe UI Emoji', 'Segoe UI Symbol';
	--alt-font: 'helvetica neue', ubuntu, roboto, noto, 'segoe ui', arial,
		sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol';
	--code-font: Menlo, Consolas, monospace;
}
CSS variable | What it does |
---|---|
--main-font | The font stack used for body text |
--alt-font | Used in headings, captions, et cetera |
--code-font | Used for code snippets |
To override them, use the --css option:
percollate pdf --css ":root { --main-font: 'PT Serif'; --alt-font: Roboto; }" http://example.com
💡 To work correctly, you must have the fonts installed on your machine. Custom web fonts currently require you to use a custom CSS stylesheet / HTML template.
Remove the appended hrefs from hyperlinks
The idea with percollate is to make PDFs that can be printed without losing where the hyperlinks point to. However, for some link-heavy pages, the appended hrefs can become bothersome. You can remove them using:
percollate pdf --css "a:after { display: none }" http://example.com
Hyphenation and justification
Hyphenation is only enabled by default for PDFs, but you can opt in or out of it for any output format with a flag.
When hyphenation is enabled, paragraphs will be justified:
.article__content p {
	text-align: justify;
}
If you prefer left-aligned text:
percollate pdf --css ".article__content p { text-align: left }" http://example.com
The --style option
The --style option lets you use your own CSS stylesheet instead of the default one. Here are some common use-cases for this option:
⚠️ TODO add examples here
The --template option
The --template option lets you use a custom HTML template for the PDF.
💡 The HTML template is parsed with nunjucks, which is a close JavaScript relative of Twig for PHP, Jinja2 for Python and Liquid for Ruby.
Here are some common use-cases:
Customizing the page header / footer
Puppeteer can print some basic information about the page in the PDF. The following CSS class names are available for the header / footer, into which the appropriate content will be injected:
- date: the formatted print date
- title: the document title
- url: the document location (Note: this will print the path of the temporary HTML, not the original web page URL)
- pageNumber: the current page number
- totalPages: the total number of pages in the document
👉 See the Chromium source code for details.
You place your header / footer template in a template element in your HTML:
<template class="header-template"> My header </template>
<template class="footer-template">
<div class="text center">
<span class="pageNumber"></span>
</div>
</template>
See the default HTML for example usage.
You can add CSS styles to the header / footer with either the --css option or a separate CSS stylesheet (the --style option).
💡 The header / footer templates do not inherit their styles from the rest of the page (i.e. they are not part of the cascade), so you'll have to write the full CSS you want to apply to them.
An example from the default stylesheet:
.footer-template {
	font-size: 10pt;
	font-weight: bold;
}
Updating
To keep the tool up-to-date, you can run:
npm install -g percollate
Occasionally, an upgrade might not go according to plan; in this case, you can uninstall and re-install percollate:
npm uninstall -g percollate && npm install -g percollate
How it works
All export formats follow a common pipeline:
- Fetch the page(s) using node-fetch
- If an AMP version of the page exists, use that instead (disable with the --no-amp flag)
- Enhance the DOM using jsdom
- Pass the DOM through mozilla/readability to strip unnecessary elements
- Apply the HTML template and the stylesheet to the resulting HTML
Different formats then use different tools to produce the final file.
PDFs are rendered with puppeteer.
EPUBs have external images fetched and bundled together with the HTML of each article. When the --inline option is used, images are instead converted to data URLs and embedded into the HTML.
HTML files are saved without any further changes. When the --inline option is used, images are converted to data URLs and embedded into the HTML. External images are not otherwise fetched.
Markdown files are produced the same way as HTMLs, then processed with a series of utilities from the unified.js umbrella.
Limitations
Percollate inherits the limitations of two of its main components, Readability and Puppeteer (headless Chrome).
The imperative approach Readability takes will not be perfect in each case, especially on HTML pages with atypical markup; you may occasionally notice that it either leaves in superfluous content, or that it strips out parts of the content. You can confirm the problem against Firefox's Reader View. In this case, consider filing an issue on mozilla/readability.
Using a browser to generate the PDF is a double-edged sword. On the one hand, you get excellent support for web platform features. On the other hand, print CSS as defined by W3C specifications is only partially implemented, and it seems unlikely that support will be improved any time soon. However, even with modest print support, I think Chrome is the best (free) tool for the job.
Troubleshooting
On some Linux machines you'll need to install a few more Chrome dependencies before percollate works correctly. (Thanks to @ptica for sorting it out)
The percollate pdf command supports the --no-sandbox Puppeteer flag, but make sure you're aware of the implications before disabling the sandbox.
Using Firefox to render PDFs
This feature is experimental. Please log an issue if you notice anything wrong.
Starting with percollate 3.x, it's possible to use Firefox Nightly as an alternative browser with which to render PDFs. To make Firefox available to percollate, use the following install command:
PUPPETEER_PRODUCT=firefox npm install percollate
After installation, percollate pdf commands can be run with the --browser=firefox option.
Limitations of Firefox PDF rendering
At the moment, rendering PDFs with Firefox has the following limitations:
- The pages can't have headers and footers, so there are no page numbers.
Contributing
Contributions of all kinds are welcome! See CONTRIBUTING.md for details.
See also
Here are some other projects to check out if you're interested in building books using the browser: