Wikipedia-list-extractor NPM

wikipedia-list-extractor

Wikipedia has lists of objects (e.g. monuments), often referenced by governmental data (e.g. heritage protection). This module helps to extract data from these lists.

Example: The sub-pages of https://de.wikipedia.org/wiki/Denkmalgesch%C3%BCtzteObjekte_in%C3%96sterreich(Denkmalgeschützte Objekte in Österreich) will list all heritage protected objects in Austria. This module will return individual items of this list as JSON objects. The ID within this module for this list is 'AT-BDA'. The items can be referenced either by their ID (e.g. 'id-24536') or their Wikidata-ID (e.g. 'Q1534177') or their page plus index (e.g. 'Liste der denkmalgeschützten Objekte in Wien/Innere Stadt/E–He#69').

There's a demo-application where you can view items on a map: https://openstreetmap.at/demo-wikipedia-list-extractor (Source).

In data/ there are config files for each type of list.

Usage

Stand-alone with NodeJS server (included with the dev dependencies)

git clone https://github.com/plepe/wikipedia-list-extractor
cd wikipedia-list-extractor
npm install
npm start

Point your browser to http://localhost:8080/ for the interactive App.

You can try the list 'AT-BDA' and as ID 'id-24536' or 'Q1534177'. Both IDs should return the Goethedenkmal in Vienna.

Additionally, the standalone server exposes a HTTP API which you can query: http://localhost:8080/api//

where list is the ID of a list (e.g. INT-UNESCO)
where id is one or several ids, comma separated

Example:

curl http://localhost:8080/api/INT-UNESCO-de/91,80

As module within a NodeJS application

Wikipedia List Extractor uses a few modules (node-fetch, jsdom) as indirect dependencies (so they don't get compiled when using browserify). These have to be exposed as global variables. This can be done by requiring wikipedia-list-extractor/node.

let extractor = new MediawikiListExtractor('INT-UNESCO-de', null, {
  path: 'node_modules/wikipedia-list-extractor/data',
})
extractor.get(['91', '80'], (err, result) => {
  console.log(err, JSON.stringify(result, null, '  '))
})

Stand-alone on a PHP server (e.g. with Apache2)

cd /var/www/html
git clone https://github.com/plepe/wikipedia-list-extractor
cd wikipedia-list-extractor
npm install

Point your browser to https://server/wikipedia-list-extractor

You have to select 'Run code in browser', as the PHP code does not implement the server side.

As module within a web application in a browser

As Wikipedia does not allow requests from a web browser, when they do not originate from a wikipedia page, we have to use a proxy. The URL of the proxy has to be supplied with the options, when loading MediawikiListExtractor:

// def is the file data/INT-UNESCO.json as Javascript Object
let extractor = new MediawikiListExtractor('INT-UNESCO', null, {
  path: 'node_modules/wikipedia-list-extractor/data',
  proxy: 'proxy/?'
})
extractor.get(['91', '80'], (err, result) => {
  console.log(err, result)
})

See proxy/index.php or proxy/index.js for examples.

List definition files

The list definition files are in the data/ folder and these are YAML files. The basic structure:

title:
  en: List for something
param:
  ... Definition for a source or several sources

Definition of a source:

language: de
source: https://de.wikipedia.org
pageTitleMatch: Liste der Kunstwerke
renderedFields:
  id:
    column: 2
    regexp: /<a[^>]*>([0-9]+)<\/a>/
    type: html
  wikidata:
    column: 3
    regexp: /<a href="https:\/\/www.wikidata.org\/wiki\/(Q[0-9]+)">Wikidata<\/a>/
    type: html

For sources, the following options are possible

Field	Description
language	Language of this list
source	URL of the Mediawiki / Wikipedia where this list is to be found
pageTitleMatch	The template title for pages which build this page (e.g. there might be a list of artwork for each town). This is a regular expression for Mediawiki CirrusSearch, so there might be some restrictions.
template	Mediawiki pages use the specified template (or, when this is an array, templates) for rendering content.
rawIdField	The id of the item can be read from this field (in the template in page source).
rawAnchorField	The HTML anchor of the item can be read from this field (in the template in page source).
rawWikidataField	The wikidata id of the item can be read from this field (in the template in page source).
renderedTableClass	In rendered output, the table in the page can be detected from this class.
renderedIdField	The id of the item can be read from this field (in the rendered output, see renderedFields). If the id is empty ('', null, ...), the item will be ignored.
renderedAnchorField	The HTML anchor of the item can be read from this field (in the rendered output, see renderedFields).
renderedWikidataField	The wikidata id of the item can be read from this field (in the rendered output, see renderedFields).
renderedFields	Hashed array of fields, see below.
wikidataFields	Optionally load the specified list of fields from the matching wikidata item. Example: `[{property: P31, field: "is_a"}, ...]`

Advanced Fields:

Field	Description
pages	List of pages which constitutes the whole dataset (e.g. for getAll, which returns all items).
rawAnchorTemplate	Complex HTML anchor for the item. Uses Twig syntax to compile the anchor. Available parameters: `item.field` (with each field from the template), `page` (page title), `index` (index of the item on this page).
rawIdTemplate	More complex ID and aliases for the item. Uses Twig syntax to compile the ID/Aliases (one alias per line). Make sure that the first result is always the same as the first ID in `renderedIdTemplate`. Available parameters: `item.field` (with all fields from the template), `page` (page title), `index` (index of the item on this page).
renderedAnchorTemplate	Complex HTML anchor for the item. Uses Twig syntax to compile the anchor. Available parameters: `item.field` (with each field from the template), `page` (page title), `index` (index of the item on this page).
renderedIdTemplate	More complex ID and aliases for the item. Uses Twig syntax to compile the ID/Aliases (one alias per line). Make sure that the first result is always the same as the first ID in `rawIdTemplate`. Available parameters: `item.field` (with all parsed fields from the rendered page), `page` (page title), `index` (index of the item on this page).
wikidataIdTemplate	Additional aliases for the item. Uses Twig syntax to compile the alias (one alias per line). Available parameters: `item.P1234` (with all properties specified in wikidataFields).
idToQuery	When searching for an ID, how to search on the Mediawiki site. idToQuery uses Twig syntax to generate the query, with multiple lines prefixed by a query option and `=`; available parameter: `id` (the id we are looking for). Query options: `field` (which field to query), `value` (which value to query), `wikidataProperty` and `wikidataValue` (value can't be found in the page source, needs to query wikidata first -> use wikidata item id as value), `page` (doesn't need to search, just load the specified page).

Rendered Fields Parameter:

Parameter	Description
column	Table column
type	'html' (default), 'image' (parse url, width, height from first image in this field)
domQuery	CSS style query for a DOM node in the cell.
domAttribute	Use the value of the DOM node (or the cell, if domQuery was not specified).
regexp	A regular expression, where the first match is the resulting value (to exclude patterns, use: `/foo(?:bar)(bla)/` -> "bla".
modify	A TwigJS template, which can modify value. The following parameters are available: `value` (if `column` was specified; the result after column, domQuery, domAttribute, regexp), `row` (the full table row as array), `index` (the n'th item on this page), `page` (the name of the Wikipedia page).

argparse array.prototype.flat async find-wikidata-items jsdom node-fetch parse-mediawiki-template twig yaml

@infinitebrahmanuniverse/nolb-wik @everything-registry/sub-chunk-3147

2 years ago

2 years ago

4 years ago

4 years ago