2.0.0 • Published 2 years ago

@iebh/dedupe-sweep v2.0.0

License
MIT

IEBH/Dedupe-Sweep

Deduplicate reference libraries using the sweep method.

This library is intended to be used with Reflib compatible references.

// Simple example with an array of references
var Dedupe = require('@iebh/dedupe-sweep');

(new Dedupe())
	.set('strategy', 'doiOnly')
	.run([
		{doi: 'https://doi.org/10.1000/182'},
		{doi: '10.1000/182'},
	])
	.then(deduped => { /* ... */ });

// More complex example reading in a reference library with RefLib, deduping it and saving as another file
var Dedupe = require('@iebh/dedupe-sweep');
var reflib = require('reflib');

// In a CommonJS module, await is only valid inside an async function
(async () => {
	// Read in the library
	var refs = await reflib.promises.parseFile('my-large-reference-library.xml');

	// Dedupe
	var deduper = new Dedupe();
	deduper.set('strategy', 'clark');
	var dedupedRefs = await deduper.run(refs);

	// Save the deduped library
	await reflib.promises.outputFile('my-large-reference-library-deduped.xml', dedupedRefs);
})();

Testing

The strategies within this project are tested using the Systematic Reviews Data Sets for Testing Automation Tools by Beller et al., which are available in the test/data directory.

Tests can be run via npm test or mocha. See the test directory for more information on specifics.

Testing statistics are based on the methodology from Evaluating automated deduplication tools: protocol by Hair et al.

API

Constructor: Dedupe(options)

Returns a Dedupe instance. The Dedupe class extends a basic EventEmitter.

Dedupe.settings

Object storing all local settings for the class.

| Setting | Type | Default | Description |
|---|---|---|---|
| strategy | string | 'clark' | The strategy to use on the next run() |
| validateStratergy | boolean | true | Validate the strategy before beginning; only disable this if you are sure the strategy is valid |
| action | string | '0' | The action to take when detecting a duplicate. ENUM: ACTIONS |
| actionField | string | 'dedupe' | The field to use with actions |
| threshold | number | 0.1 | Floating value (between 0 and 1) used when marking or deleting refs automatically |
| markOk | string / function | 'OK' | Value to set the action field to when the action is MARK and the ref is a non-dupe; if a function, it is called as (ref) |
| markDupe | string / function | 'DUPE' | Value to set the action field to when the action is MARK and the ref is a dupe; if a function, it is called as (ref) |
| dupeRef | string | 0 | How to refer to other refs when the action is STATS. ENUM: DUPEREF |
| fieldWeight | number | 0 | How to calculate the duplication score. ENUM: FIELDWEIGHT |
| markOriginal | boolean | false | Whether to mark the original as a duplicate or not |

Static: Dedupe.ACTIONS

Actions to take when detecting duplicates

| Value | Setting | Description |
|---|---|---|
| 0 | 'STATS' | Add the field named in Dedupe.settings.actionField to each input reference, holding its duplicate chance |
| 1 | 'MARK' | Set the field named in Dedupe.settings.actionField to Dedupe.settings.markOk or Dedupe.settings.markDupe depending on duplicate status, but otherwise leave the input unchanged |
| 2 | 'DELETE' | Remove duplicates from the input and return the sliced output |
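
To illustrate how the three actions differ, the sketch below applies each one to references that already carry a duplicate score. It does not use this library's internals: `applyAction`, the `score` property and the rule that a score at or above the threshold counts as a duplicate are all assumptions made for the example, and function-valued `markOk` / `markDupe` are left out for brevity.

```javascript
// Hypothetical mirror of Dedupe.ACTIONS, values taken from the table above
const ACTIONS = {STATS: 0, MARK: 1, DELETE: 2};

// Illustrative only: `scored` is an array of {ref, score} pairs, score between 0 and 1
function applyAction(scored, {action, actionField = 'dedupe', threshold = 0.1, markOk = 'OK', markDupe = 'DUPE'}) {
	const isDupe = item => item.score >= threshold; // Assumed rule, not confirmed by the library
	switch (action) {
		case ACTIONS.STATS: // Attach the score to every reference
			return scored.map(item => ({...item.ref, [actionField]: item.score}));
		case ACTIONS.MARK: // Label every reference but keep them all
			return scored.map(item => ({...item.ref, [actionField]: isDupe(item) ? markDupe : markOk}));
		case ACTIONS.DELETE: // Drop the references judged to be duplicates
			return scored.filter(item => !isDupe(item)).map(item => item.ref);
		default:
			throw new Error(`Unknown action: ${action}`);
	}
}

const scored = [
	{ref: {title: 'A'}, score: 0},
	{ref: {title: 'A (copy)'}, score: 1},
];
const marked = applyAction(scored, {action: ACTIONS.MARK});
const kept = applyAction(scored, {action: ACTIONS.DELETE});
console.log(marked, kept);
```

Here `marked` keeps both references and labels them, while `kept` contains only the first reference.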

Static: Dedupe.DUPEREF

How to refer to other references.

| Value | Setting | Description |
|---|---|---|
| 0 | 'INDEX' | Refer to other references by their offset in the input array |
| 1 | 'RECNUMBER' | Refer to other references by their recnumber field |

Static: Dedupe.FIELDWEIGHT

How to calculate the duplication score.

| Value | Setting | Description |
|---|---|---|
| 0 | 'MINIMUM' | Calculate duplication score based on minimum field score |
| 1 | 'AVERAGE' | Calculate duplication score based on average field score |
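
The difference between the two weightings can be shown with a small standalone sketch. The `combineFieldScores` helper and the choice to return 0 for an empty list are assumptions for illustration, not the library's own code.

```javascript
// Hypothetical mirror of Dedupe.FIELDWEIGHT, values taken from the table above
const FIELDWEIGHT = {MINIMUM: 0, AVERAGE: 1};

// Illustrative only: combine per-field scores (each between 0 and 1) into one score
function combineFieldScores(scores, fieldWeight) {
	if (!scores.length) return 0; // Assumption: nothing compared means no evidence of duplication
	if (fieldWeight === FIELDWEIGHT.MINIMUM) return Math.min(...scores);
	if (fieldWeight === FIELDWEIGHT.AVERAGE) return scores.reduce((sum, s) => sum + s, 0) / scores.length;
	throw new Error(`Unknown field weight: ${fieldWeight}`);
}

const fieldScores = [1, 0.5, 0];
console.log(combineFieldScores(fieldScores, FIELDWEIGHT.MINIMUM)); // 0
console.log(combineFieldScores(fieldScores, FIELDWEIGHT.AVERAGE)); // 0.5
```

With MINIMUM a single mismatching field pulls the whole score down, whereas AVERAGE lets strong matches on other fields offset it.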

Dedupe.comparisons

A lookup object of comparison functions used within strategies.

Each comparison is made up of:

| Setting | Type | Description |
|---|---|---|
| key | string | Internal short name of the comparison in camelCase |
| title | string | Human friendly title of the comparison |
| description | string | Longer description of what the comparison does |
| handler | function | Function called as (a, b) with the two field values; expected to return a floating-point duplicate score |
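
As a rough picture of that shape, here is a hypothetical comparison entry. The `exactSketch` key and its handler are invented for this example and are not taken from the library's own comparison list.

```javascript
// Hypothetical comparison entry following the shape in the table above
const exactComparison = {
	key: 'exactSketch',
	title: 'Exact match (sketch)',
	description: 'Scores 1 when both values are strictly equal, otherwise 0',
	handler: (a, b) => (a === b ? 1 : 0),
};

// A lookup object keyed by each comparison's short name
const comparisons = {[exactComparison.key]: exactComparison};

console.log(comparisons.exactSketch.handler('10.1000/182', '10.1000/182')); // 1
console.log(comparisons.exactSketch.handler('10.1000/182', '10.1000/183')); // 0
```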

Dedupe.mutators

A lookup object of field mutators used within strategies.

Each mutator is made up of:

| Setting | Type | Description |
|---|---|---|
| key | string | Internal short name of the mutator in camelCase |
| title | string | Human friendly title of the mutator |
| description | string | Longer description of what the mutator does |
| handler | function | Function called as (value); expected to return the mutated input |
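
A mutator entry might look like the sketch below, which normalises DOIs so that a resolver URL and a bare DOI compare equal. The `doiSketch` name and its regular expression are assumptions for illustration; this is not the library's own doiRewrite mutator.

```javascript
// Hypothetical mutator entry following the shape in the table above
const doiSketch = {
	key: 'doiSketch',
	title: 'DOI normalise (sketch)',
	description: 'Strip a leading doi.org resolver URL and lower-case the DOI',
	handler: value => String(value).trim().replace(/^https?:\/\/(dx\.)?doi\.org\//i, '').toLowerCase(),
};

console.log(doiSketch.handler('https://doi.org/10.1000/182')); // 10.1000/182
console.log(doiSketch.handler('10.1000/182')); // 10.1000/182
```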

Static: Dedupe.strategies

A lookup object of strategies.

Each strategy is made up of:

| Setting | Type | Description |
|---|---|---|
| key | string | Internal short name of the strategy in camelCase |
| title | string | Human friendly title of the strategy |
| description | string | Longer description of the strategy |
| mutators | object | List of fields which will be mutated and how, prior to the strategy being run |
| steps | array | Array of steps to take when running the strategy |

Dedupe.set(option, value)

Convenience function to quickly set a single option, or merge an object of options. Returns the original Dedupe instance.
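
The single-option and object-merge forms, plus the chaining that returning the instance allows, can be pictured with a small stand-in class. `SettingsSketch` is hypothetical and only mimics the behaviour described above; it is not the library's implementation.

```javascript
// Illustrative only: a tiny settings holder with a chainable set()
class SettingsSketch {
	constructor() {
		// Two defaults borrowed from the settings table for demonstration
		this.settings = {strategy: 'clark', threshold: 0.1};
	}

	set(option, value) {
		if (typeof option === 'object' && option !== null) {
			Object.assign(this.settings, option); // Merge an object of options
		} else {
			this.settings[option] = value; // Set a single option
		}
		return this; // Returning the instance allows chaining
	}
}

const sketch = new SettingsSketch()
	.set('strategy', 'doiOnly')
	.set({threshold: 0.5});
console.log(sketch.settings);
```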

Dedupe.run(input)

Takes an array of input references and applies the action specified in Dedupe.settings.action. Returns a promise.

Strategies

This module includes a selection of deduplication strategies. Each strategy is a basic JavaScript object detailing the steps to take to detect reference duplication.

Each strategy should include a title, a description, optional mutators and a collection of steps to perform.

A simple example of the DOI only strategy:

module.exports = {
	title: 'DOI only',
	description: 'Compare references against DOI fields only',
	mutators: {
		doi: 'doiRewrite',
	},
	steps: [
		{
			fields: ['doi'],
			comparison: 'exact',
		},
	],
};
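
To show how such an object could drive a comparison, the sketch below walks every pair of references, applies the configured mutators and then tests each step. It is a naive pairwise loop written for illustration, not the sweep method this library implements, and the `mutators`, `comparisons` and `findDupePairs` names are hypothetical stand-ins.

```javascript
// Hypothetical lookups standing in for the library's mutators and comparisons
const mutators = {
	doiRewrite: value => String(value).replace(/^https?:\/\/(dx\.)?doi\.org\//i, ''),
};
const comparisons = {
	exact: (a, b) => (a === b ? 1 : 0),
};

const strategy = {
	mutators: {doi: 'doiRewrite'},
	steps: [{fields: ['doi'], comparison: 'exact'}],
};

// Naive pairwise walk (NOT the sweep method): returns index pairs judged duplicates
function findDupePairs(refs, strategy) {
	// Apply each configured mutator to a copy of every reference
	const mutated = refs.map(ref => {
		const copy = {...ref};
		for (const [field, mutator] of Object.entries(strategy.mutators || {})) {
			if (copy[field] !== undefined) copy[field] = mutators[mutator](copy[field]);
		}
		return copy;
	});

	const pairs = [];
	for (let i = 0; i < mutated.length; i++) {
		for (let j = i + 1; j < mutated.length; j++) {
			// A pair matches when any step has all of its fields present and equal
			const matched = strategy.steps.some(step =>
				step.fields.every(field =>
					mutated[i][field] !== undefined
					&& mutated[j][field] !== undefined
					&& comparisons[step.comparison](mutated[i][field], mutated[j][field]) === 1
				)
			);
			if (matched) pairs.push([i, j]);
		}
	}
	return pairs;
}

const pairs = findDupePairs([
	{doi: 'https://doi.org/10.1000/182'},
	{doi: '10.1000/182'},
	{doi: '10.1000/999'},
], strategy);
console.log(pairs); // one pair: indexes 0 and 1
```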

Strategy format:

| Path | Type | Default | Description |
|---|---|---|---|
| title | string | | The short human-readable title of the strategy |
| description | string | | A longer, HTML-compatible description of the strategy |
| mutators | object | | An object of the reference properties to mutate prior to processing; each value should be a known mutator |
| steps | array | | A collection of steps for the deduplication process |
| steps.skipOmitted | boolean | true | Skip field comparison where either side is not specified |
| steps[].fields | array | | An array of strings; each value should correspond to a known reference field |
| steps[].comparison | string | | The comparison method to use in this step; should correspond to a known comparison method |
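
A strategy object can be checked against this format with a few type tests. The `validateStrategySketch` function below is a hypothetical helper written for this example; it only checks the shapes listed in the table and does not confirm that mutator or comparison names are known to the library.

```javascript
// Illustrative only: check a strategy object against the format table above
function validateStrategySketch(strategy) {
	const errors = [];
	if (typeof strategy.title !== 'string') errors.push('title must be a string');
	if (typeof strategy.description !== 'string') errors.push('description must be a string');
	if (strategy.mutators !== undefined && (typeof strategy.mutators !== 'object' || strategy.mutators === null)) {
		errors.push('mutators must be an object');
	}
	if (!Array.isArray(strategy.steps)) {
		errors.push('steps must be an array');
	} else {
		strategy.steps.forEach((step, index) => {
			if (!Array.isArray(step.fields) || !step.fields.every(f => typeof f === 'string')) {
				errors.push(`steps[${index}].fields must be an array of strings`);
			}
			if (typeof step.comparison !== 'string') {
				errors.push(`steps[${index}].comparison must be a string`);
			}
		});
	}
	return errors; // An empty array means the shape looks valid
}

const doiOnly = {
	title: 'DOI only',
	description: 'Compare references against DOI fields only',
	mutators: {doi: 'doiRewrite'},
	steps: [{fields: ['doi'], comparison: 'exact'}],
};
console.log(validateStrategySketch(doiOnly)); // []
console.log(validateStrategySketch({title: 'Broken', steps: [{fields: 'doi'}]}));
```

The second call reports three problems: the missing description, the non-array fields value and the missing comparison.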