mongodb-collection-sample v5.0.0

Sample documents from a MongoDB collection.
Install
npm install --save mongodb-collection-sample
Example
npm install mongodb lodash mongodb-collection-sample
const sample = require('mongodb-collection-sample');
const { MongoClient } = require('mongodb');
const _ = require('lodash');

// The connection string goes to the constructor; connect() takes no URL.
const client = new MongoClient('mongodb://localhost:27017');

async function main() {
  await client.connect();
  // Use the default database for this connection.
  const db = client.db();

  // Generate 1000 documents
  const docs = _.range(0, 1000).map(function(i) {
    return {
      _id: 'needle_' + i,
      is_even: i % 2 === 0
    };
  });

  // Insert them into a collection.
  await db.collection('haystack').insertMany(docs);

  const options = {};
  // Size of the sample to capture [default: `5`].
  options.size = 5;
  // Query to restrict sample source [default: `{}`]
  options.query = {};

  // Get a stream of sample documents from the collection.
  const stream = sample(db, 'haystack', options);
  stream.on('error', function(err) {
    console.error('Error in sample stream', err);
    return process.exit(1);
  });
  stream.on('data', function(doc) {
    console.log('Got sampled document `%j`', doc);
  });
  stream.on('end', function() {
    console.log('Sampling complete! Goodbye!');
    // Close the client (not the db handle) before exiting.
    client.close().then(function() {
      process.exit(0);
    });
  });
}

main().catch(function(err) {
  console.error('Example failed', err);
  process.exit(1);
});
Options
Supported options that can be passed to sample(db, coll, options) are:
- query: the filter to be used, default is {}
- size: the number of documents to sample, default is 5
- fields: the fields you want returned (projection object), default is null
- raw: boolean to return documents as raw BSON buffers, default is false
- sort: the sort field and direction, default is {_id: -1}
- maxTimeMS: the maxTimeMS value after which the operation is terminated, default is undefined
- promoteValues: boolean whether certain BSON values should be cast to native JavaScript values or not, default is true
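As a quick illustration of how the documented defaults combine with caller-supplied values, here is a minimal sketch. The `DEFAULTS` object and the `withDefaults` helper are hypothetical names used only for this example; they are not part of the module's API, and the module may resolve defaults differently internally.

```javascript
// Defaults as listed in the Options section (copied from the list above).
const DEFAULTS = {
  query: {},
  size: 5,
  fields: null,
  raw: false,
  sort: { _id: -1 },
  maxTimeMS: undefined,
  promoteValues: true
};

// Hypothetical helper (not part of the module): caller options override defaults.
function withDefaults(options) {
  return Object.assign({}, DEFAULTS, options);
}

// Sample 100 even documents and return only their _id.
const opts = withDefaults({
  size: 100,
  query: { is_even: true },
  fields: { _id: 1 }
});
```

The resulting object could then be passed as the third argument, as in sample(db, 'haystack', opts).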
How It Works
Native Sampler
MongoDB version 3.1.6 and above generally uses the $sample aggregation operator:
db.collectionName.aggregate([
  {$match: <query>},
  {$sample: {size: <size>}},
  {$project: <fields>},
  {$sort: <sort>}
])
However, if more documents are requested than are available, the $sample stage is omitted as a performance optimization. If the sample size is above 5% of the result set count (but less than 100%), the algorithm falls back to reservoir sampling to avoid a blocking sort stage on the server.
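The decision rules above can be sketched roughly as follows. Both `chooseStrategy` and `buildPipeline` are hypothetical helpers written for this illustration, not the module's actual internals. The sketch assumes that a request for exactly as many documents as exist is treated like a request for more, and that the $project and $sort stages are skipped when no value is given.

```javascript
// Hypothetical helper: pick a strategy from the requested sample size and
// the number of documents matching the query.
function chooseStrategy(size, count) {
  if (size >= count) {
    return 'all'; // nothing to sample: omit the $sample stage
  }
  if (size > count * 0.05) {
    return 'reservoir'; // above 5% of the result set: sample client-side
  }
  return 'native'; // let the server run $sample
}

// Hypothetical helper: build the aggregation pipeline shown above.
function buildPipeline(options, includeSample) {
  const pipeline = [{ $match: options.query || {} }];
  if (includeSample) {
    pipeline.push({ $sample: { size: options.size } });
  }
  if (options.fields) {
    pipeline.push({ $project: options.fields });
  }
  if (options.sort) {
    pipeline.push({ $sort: options.sort });
  }
  return pipeline;
}

// 5 out of 1000 matching documents is well under 5%, so $sample is used.
const strategy = chooseStrategy(5, 1000);
const pipeline = buildPipeline(
  { query: { is_even: true }, size: 5, fields: null, sort: { _id: -1 } },
  strategy === 'native'
);
```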
Reservoir Sampling
For MongoDB version 3.1.5 and below, we use a client-side reservoir sampling algorithm.
- Query for a stream of _id values, limit 10,000.
- Read the stream of _ids and save sampleSize randomly chosen values.
- Then query the selected random documents by _id.
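The second step can be illustrated with the classic reservoir sampling scheme (often called Algorithm R). This is a minimal sketch over an in-memory list of _id values rather than a stream, and the module's own implementation may differ. The `reservoirSample` name and the injectable `random` parameter are assumptions made for this example.

```javascript
// Algorithm R: keep the first k items, then let each later item replace a
// random slot with probability k / (number of items seen so far).
function reservoirSample(ids, k, random = Math.random) {
  const reservoir = [];
  let seen = 0;
  for (const id of ids) {
    if (seen < k) {
      reservoir.push(id);
    } else {
      // Pick j uniformly from 0..seen; the item is kept only if j is a reservoir slot.
      const j = Math.floor(random() * (seen + 1));
      if (j < k) {
        reservoir[j] = id;
      }
    }
    seen += 1;
  }
  return reservoir;
}

// Stand-in for the stream of _id values from the first step.
const ids = Array.from({ length: 10000 }, (_, i) => 'needle_' + i);
const chosen = reservoirSample(ids, 5);
```

Every _id has the same chance of ending up in `chosen`, and only `k` values are held in memory regardless of how many are read.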
The two modes, illustrated:
Performance Notes
For peak performance of the client-side reservoir sampler, keep the following guidelines in mind.
- The initial query for a stream of _id values must be limited to some finite value. (Default 10k)
- This query should be covered by an index
- Since there's a limit, you may wish to bias for recent documents via a sort. (Default: {_id: -1})
- Don't sort on {$natural: -1}: this forces a collection scan!
  Queries that include a sort by $natural order do not use indexes to fulfill the query predicate
- When retrieving docs: batch using one $in to reduce network chattiness.
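To make these guidelines concrete, the two queries might be shaped roughly like this. `idQuerySpec` and `fetchByIdsFilter` are hypothetical helpers for illustration only, and the limit of 10000 simply mirrors the default mentioned above.

```javascript
// Hypothetical helper: describe the initial _id-only query.
function idQuerySpec(options) {
  return {
    filter: options.query || {},
    projection: { _id: 1 }, // only _id is needed at this stage
    sort: options.sort || { _id: -1 }, // bias toward recent documents
    limit: 10000 // finite cap on the ids streamed to the client
  };
}

// Hypothetical helper: one $in filter instead of one round trip per document.
function fetchByIdsFilter(ids) {
  return { _id: { $in: ids } };
}

const spec = idQuerySpec({ query: {} });
const fetchFilter = fetchByIdsFilter(['needle_1', 'needle_42']);
```

With the Node.js driver, the spec could be applied as db.collection('haystack').find(spec.filter).project(spec.projection).sort(spec.sort).limit(spec.limit). With an empty filter this query touches only _id, so the default _id index can cover it; a non-empty filter would need an index that includes the filtered fields to stay covered.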
License
Apache 2