0.0.5 • Published 9 months ago

@harvard-lil/wacz-preparator v0.0.5

License
MIT

wacz-preparator 📚


CLI and JavaScript library for packaging a remote web archive collection into a single WACZ file.

wacz-preparator --extractor "archive-it" --username "lil" --password $PASSWORD --collection-id 12345

See also: wacz-exhibitor for embedding a self-contained web archive collection on a web page.




Foreword

⚠️🥼🧪 Experimental:

This pipeline was originally developed in the context of The Harvard Library Innovation Lab's partnership with the Radcliffe Institute's Schlesinger Library on experimental access to web archives.

We have only tested it on The Schlesinger #meToo Web Archives collection and would welcome feedback from the community to help solidify it.

In particular, we would love to hear more about:

  • Any edge cases this pipeline currently doesn't account for.
  • General interest in exploring new ways of storing, copying, and giving access to web archives.

Contact: info@perma.cc



How does it work?

Given a specific extractor and valid combination of credentials, wacz-preparator will perform the following steps in order to pull and package a remote web archives collection into a single WACZ file.

Example: Archive-It Extractor

| #  | Description | Notes |
|----|-------------|-------|
| 01 | Check validity of credentials and access to the collection | |
| 02 | Create local collection folder if not already present | Because the underlying files are kept around in that folder, processing can be interrupted, resumed, and run multiple times over. |
| 03 | Pull Collection Information | |
| 04 | Pull list of available WARC files | |
| 05 | Pull crawl information for all WARC files | This includes retrieving seeds (urls). |
| 06 | Pull page title for all of the crawled URLs | Will first try to fetch that information from the seed meta data. If not available, will try to pull that information from the Wayback Machine. |
| 07 | Delete "loose" WARCs from local collection folder | This comparison allows for discarding WARC files that may have previously been pulled locally but are no longer part of the collection. |
| 08 | Compare hashes of local WARC files against remote hashes (1) | This allows for determining what files need to be downloaded or re-downloaded. |
| 09 | Pull WARC files | Only the files that are not already present locally will be pulled. |
| 10 | Compare hashes of local WARC files against remote hashes (2) | At this stage, there should be no discrepancies. |
| 11 | Build pages list | |
| 12 | Prepare WACZ file | |
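
To illustrate steps 07 through 10, here is a minimal sketch of how local and remote WARC listings could be compared. It is not taken from the wacz-preparator codebase: the `planWarcSync` function, the filename-to-hash `Map` shape, and the sample filenames and hashes are all assumptions made for the example.

```javascript
/**
 * Compares local and remote WARC listings (hypothetical helper).
 * Both arguments are assumed to be Maps of filename -> hash string.
 */
function planWarcSync (localHashes, remoteHashes) {
  const toDelete = []
  const toDownload = []

  // "Loose" WARCs: present locally, but no longer part of the remote collection.
  for (const filename of localHashes.keys()) {
    if (!remoteHashes.has(filename)) {
      toDelete.push(filename)
    }
  }

  // Missing files and files whose hash differs need to be (re-)downloaded.
  for (const [filename, remoteHash] of remoteHashes) {
    if (localHashes.get(filename) !== remoteHash) {
      toDownload.push(filename)
    }
  }

  return { toDelete, toDownload }
}

// Made-up listings, for illustration only.
const local = new Map([
  ['a.warc.gz', 'hash-a'],
  ['b.warc.gz', 'stale-hash'],
  ['old.warc.gz', 'hash-old']
])

const remote = new Map([
  ['a.warc.gz', 'hash-a'],
  ['b.warc.gz', 'hash-b'],
  ['c.warc.gz', 'hash-c']
])

const plan = planWarcSync(local, remote)
console.log(plan)
```

Running the same comparison again after the download step should yield two empty lists, which is the "no discrepancies" condition described in step 10.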

At the end of this process, a WACZ file named after the collection ID should be available (e.g., 12345.wacz).

WACZ files can be read with any compatible playback software, such as replayweb.page.

Note: All of the operations that involve talking to the Archive-It API are run in parallel batches: the --concurrency option allows for determining how many requests can be run in parallel.
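
As a rough illustration of what "parallel batches" can mean, the sketch below runs a list of asynchronous tasks in groups of at most `concurrency` items. This is an assumption about the general technique, not a copy of how wacz-preparator implements it; `runInBatches` and `fakeRequests` are hypothetical names.

```javascript
/**
 * Runs async tasks in sequential batches of at most `concurrency` items.
 * `tasks` is an array of functions that each return a promise.
 */
async function runInBatches (tasks, concurrency) {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer')
  }

  const results = []

  for (let i = 0; i < tasks.length; i += concurrency) {
    const batch = tasks.slice(i, i + concurrency)
    // Start every task of the batch at once, then wait for all of them
    // to settle before moving on to the next batch.
    const batchResults = await Promise.all(batch.map((task) => task()))
    results.push(...batchResults)
  }

  return results
}

// Five fake "requests", with at most two in flight at any time.
const fakeRequests = [1, 2, 3, 4, 5].map((n) => async () => n * 10)

runInBatches(fakeRequests, 2).then((results) => {
  console.log(results)
})
```

With this approach, a lower `concurrency` value trades speed for a lighter load on the remote API.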



Getting Started

Dependencies

wacz-preparator requires Node.js 18+.

Compatibility

This program has been written for UNIX-like systems and is expected to work on Linux, macOS, and Windows Subsystem for Linux.

Installation

wacz-preparator is available on npmjs.org and can be installed as follows:

# As a CLI
npm install -g @harvard-lil/wacz-preparator

# As a library
npm install @harvard-lil/wacz-preparator --save



CLI

Here are a few examples of how wacz-preparator can be used in the command line to extract a full collection from Archive-It into a WACZ file:

# The program needs an Archive-It username, password, and collection-id to operate ...
wacz-preparator --extractor "archive-it" --username 'foo' --password 'bar' --collection-id 12345

# ... the password can / should be passed as an environment variable
wacz-preparator --extractor "archive-it" --username 'foo' --password $PASSWORD --collection-id 12345

# Unless specified otherwise with --output-path, wacz-preparator will work in the current directory
wacz-preparator --extractor "archive-it" --output-path "/path/to/directory" --username 'foo' --password $PASSWORD --collection-id 12345

# The resulting WACZ file can be signed using an authsign-compatible endpoint.
# See: https://specs.webrecorder.net/wacz-auth/0.1.0/#implementations
wacz-preparator --extractor "archive-it" --signing-url "https://example.com/sign" --username foo --password $PASSWORD --collection-id 12345

# Use --help to list the available options, and see what the defaults are.
wacz-preparator --help
Usage: wacz-preparator [options]

📚 CLI and Javascript library for packaging a remote web archive collection into a single WACZ file.
More info: https://github.com/harvard-lil/wacz-preparator

Options:
  -v, --version                 Display Library and CLI version.
  -e, --extractor <string>      Web Archiving platform to extract the collection from. (choices: "archive-it", default: "archive-it")
  -u, --username <string>       API username (required for Archive-it). (default: null)
  -p, --password <string>       API password (required for Archive-it). (default: null)
  -i, --collection-id <string>  Id of the collection to process (required for Archive-it). (default: null)
  -o, --output-path <string>    Path in which wacz-preparator will work. (default: pwd)
  -c, --concurrency <number>    Sets a limit for parallel requests to the Archive-It API. (default: 50)
  --auto-clear <bool>           Automatically delete the collection-specific folder that was created? (choices: "true", "false", default: "false")
  --signing-url <string>        Authsign-compatible endpoint for signing WACZ file.
  --signing-token <string>      Authentication token to --signing-url, if needed.
  --log-level <string>          Controls CLI verbosity. (choices: "silent", "trace", "debug", "info", "warn", "error", default: "info")
  -h, --help                    Show options list.
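
When calling the CLI from a Node.js script, it can help to build the argument list programmatically. The sketch below maps a plain options object to some of the flags listed above. The `buildCliArgs` helper and its camelCase keys are assumptions made for this example, not part of wacz-preparator.

```javascript
/**
 * Builds a CLI argument array from an options object (hypothetical helper).
 * Only flags documented in the --help output above are mapped.
 */
function buildCliArgs (options) {
  const flags = {
    extractor: '--extractor',
    username: '--username',
    password: '--password',
    collectionId: '--collection-id',
    outputPath: '--output-path',
    concurrency: '--concurrency'
  }

  const args = []

  for (const [key, flag] of Object.entries(flags)) {
    // Skip options that were not provided, so CLI defaults apply.
    if (options[key] !== undefined && options[key] !== null) {
      args.push(flag, String(options[key]))
    }
  }

  return args
}

const cliArgs = buildCliArgs({
  extractor: 'archive-it',
  username: 'foo',
  password: process.env.PASSWORD, // Read from the environment, never hard-coded
  collectionId: 12345
})

// The array can then be handed to Node's child_process module, for example:
// const { spawn } = require('node:child_process')
// spawn('wacz-preparator', cliArgs, { stdio: 'inherit' })
console.log(`${cliArgs.length / 2} options set`)
```

Passing arguments as an array avoids shell interpolation of the password.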



JavaScript Library

wacz-preparator can also be used as a JavaScript library in a Node.js project.

Example: Using the Preparator.process() method

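import { ArchiveItExtractor } from '@harvard-lil/wacz-preparator'

// Credentials and collection id for the Archive-It API
const collection = new ArchiveItExtractor({
  username: 'username',
  password: 'password',
  collectionId: 12345
})

// process() resolves to a truthy value once the WACZ file has been prepared
if (await collection.process()) {
  // WACZ file is ready!
  // ...
}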

The process() method runs through all the steps described in the "How does it work?" section.

It is also possible to go through each individual step manually and customize the behavior of wacz-preparator.
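
As with the CLI, credentials are better read from the environment than hard-coded. Here is a minimal sketch of a helper that builds the constructor options shown above from environment variables. The `optionsFromEnv` function and the `ARCHIVE_IT_*` variable names are assumptions made for this example; only the `username`, `password`, and `collectionId` option names come from the example above.

```javascript
/**
 * Builds extractor options from environment variables (hypothetical helper).
 * The ARCHIVE_IT_* variable names are illustrative, not defined by the library.
 */
function optionsFromEnv (env = process.env) {
  const username = env.ARCHIVE_IT_USERNAME
  const password = env.ARCHIVE_IT_PASSWORD
  const collectionId = Number.parseInt(env.ARCHIVE_IT_COLLECTION_ID, 10)

  if (!username || !password || Number.isNaN(collectionId)) {
    throw new Error('Missing or invalid ARCHIVE_IT_* environment variables')
  }

  return { username, password, collectionId }
}

// Demonstration with a fake environment object.
const demoOptions = optionsFromEnv({
  ARCHIVE_IT_USERNAME: 'foo',
  ARCHIVE_IT_PASSWORD: 'bar',
  ARCHIVE_IT_COLLECTION_ID: '12345'
})
console.log(demoOptions.username, demoOptions.collectionId)

// In an ES module, the result could then be passed to the extractor:
// import { ArchiveItExtractor } from '@harvard-lil/wacz-preparator'
// const collection = new ArchiveItExtractor(optionsFromEnv())
// if (await collection.process()) { /* WACZ file is ready */ }
```

Failing early on missing variables avoids starting a run that would stop at the credentials check.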



Development

Standard JS

This codebase uses the Standard JS coding style.

  • npm run lint can be used to check formatting.
  • npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
  • Most IDEs can be configured to automatically check and enforce this coding style.

JSDoc

JSDoc is used for both documentation and loose type checking purposes on this project.

Testing

⚠️ In its current state, this experimental codebase doesn't come with an automated test suite.

Available CLI

# Runs linter
npm run lint

# Runs linter and attempts to automatically fix issues
npm run lint-autofix

# Step-by-step NPM publishing helper
npm run publish-util
