1.0.0 • Published 7 years ago


License
ISC

fasta-filter

A quick-and-dirty NodeJS program (or maybe a library?) that filters DNA sequences into buckets in a few simple ways.

Alie, read this part if nothing else!

analyze.js is the only code here that non-technical users should need to read. My goal is to make that file editable by non-programmers who have a teensy knowledge of JavaScript, or even by those who don't.

Eventually I will turn this thing into a simple command line tool, and you won't even have to understand analyze.js. But if you do... you can MAKE YOUR OWN fasta-filtering command line tools easily, with this code as a starter.

I love you, I'll leave you alone now. For those of you who are curious, here is the rest of the story:

Nerdy parts:

This project was written with the builder pattern so that each of its functions stays simple to use. Between this pattern and the object destructuring feature of ES6, potentially complex function calls remain human-readable. For example:

const { longSequences, shortSequences } = filter(sequences).byLength(1300);
const buckets = filter(longSequences).intoBuckets().byExactMatch(bucketTests);
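
To make the chained style above concrete, here is a minimal sketch of how a filter(...).byLength(...) call could be built. This is not the repo's actual filter module: the implementation, and the choice to treat a sequence of exactly the cutoff length as long, are assumptions made for illustration.

```javascript
// Hypothetical sketch, NOT the filter module from this repo.
function filter(sequences) {
  return {
    // Returns a bucket: the cutoff that was used plus the two resulting arrays.
    // Assumption: a sequence of exactly cutoffLength counts as "long".
    byLength(cutoffLength) {
      return {
        cutoffLength,
        longSequences: sequences.filter(s => s.seq.length >= cutoffLength),
        shortSequences: sequences.filter(s => s.seq.length < cutoffLength)
      };
    }
  };
}

// Made-up sample data in the biojs-io-fasta sequence shape.
const demoSequences = [
  { seq: 'ATCGATCG', name: 'eight', id: 1 },
  { seq: 'ATC', name: 'three', id: 2 }
];

// Destructuring picks out only the bucket properties we care about.
const { longSequences: demoLong, shortSequences: demoShort } =
  filter(demoSequences).byLength(5);

console.log(demoLong.map(s => s.name));  // [ 'eight' ]
console.log(demoShort.map(s => s.name)); // [ 'three' ]
```

Because filter() returns a plain object of methods, adding another filter later means adding another method next to byLength.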

Thanks to https://github.com/biojs-io/biojs-io-fasta for the FASTA format parsing piece. The rest is just the magic of vanilla JavaScript's Array API. Simple as that.

How to use:

So far I haven't exported these modules for other npm packages to use. But you can use this repo by following these steps:

1. Download and install NodeJS (and npm with it) from https://nodejs.org/en/.
2. Clone/fork this repo. Open a terminal at the root of the clone, and run npm install.
3. Remove our starter input from input/.
4. Alter analyze.js to meet your needs (input and output filenames, and filter logic).
5. At the terminal, from the fasta-filter directory, run ./analyze.

Note: You can also run node analyze.js, but the ./analyze bash script will pump Node with a little more memory. So if you run into memory problems running directly with node, try my handy runner script.

The io.fasta module:

The io.fasta module reads fasta files and gives you an array of all the sequences in those files. It can also take one or more arrays of sequences and write them back out as one or more fasta files. The sequence objects here are biojs-io-fasta Sequence objects, which have the following shape:

const sequence = {
  seq: "ATCGATCG",
  name: "awesome-seq",
  id: "unique id" // usually a number, just the order in which they appeared in the fasta file
}

The above sequence object is the equivalent of the following lines of FASTA plain text sequence format:

>awesome-seq
ATCGATCG

NOTE: By convention, the sequence lines of FASTA format are kept to 80 characters or fewer. If your sequences are longer, they should be split across multiple lines of up to 80 characters each.
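
The mapping from a sequence object to FASTA text, including the 80-character wrapping from the note above, can be sketched as follows. The toFasta helper is hypothetical. It is not part of io.fasta or biojs-io-fasta, and it writes only the name into the header line, as in the example above.

```javascript
// Hypothetical helper, NOT part of io.fasta or biojs-io-fasta.
// Turns one sequence object into FASTA text, wrapping the sequence
// into lines of at most lineWidth characters.
function toFasta(sequence, lineWidth = 80) {
  const lines = [];
  for (let i = 0; i < sequence.seq.length; i += lineWidth) {
    lines.push(sequence.seq.slice(i, i + lineWidth));
  }
  // Header line first, then the wrapped sequence lines, newline-terminated.
  return [`>${sequence.name}`, ...lines].join('\n') + '\n';
}

console.log(toFasta({ seq: 'ATCGATCG', name: 'awesome-seq', id: 1 }));
// >awesome-seq
// ATCGATCG
```

A 100-nucleotide sequence passed through this sketch comes out as a header line followed by one 80-character line and one 20-character line.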

Examples:

// Reading files (parseFiles is awaited, so this must run inside an async function):
const io = require('./io');
const sequences = await io.fasta.parseFiles(['input/file1.fasta.txt', 'input/file2.fasta.txt']);

// Writing a single file:
io.fasta.saveFile(filteredSequences, 'output/results-filename.fasta.txt');

// Writing multiple files:
io.fasta.saveFiles({
  './output/long-sequences.fasta.txt': longSequences,
  './output/short-sequences.fasta.txt': shortSequences
});

The filter module:

The filter module takes arrays of sequences and turns them into buckets. In this context, a bucket is an object whose properties are the filter parameters plus one or more arrays of sequences.

In the examples below that do not use intoBuckets(), the result is a single bucket instead of an array of buckets. In these examples, the single bucket is immediately destructured and only some of its properties used.

filter().byLength(cutoffLength) returns a bucket object with properties cutoffLength, longSequences, and shortSequences.

filter().byExactMatch(testSequence) returns a bucket object with properties testSequence, matchingSequences, and nonMatchingSequences.

The arguments to the filter methods are provided along with the filtered sequences, so that you can tell which bucket you are looking at.

The intoBuckets() method on the object returned by filter() produces multiple buckets from an array of different inputs. It exposes the same methods as the single-result filter() (currently byLength and byExactMatch).
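
As a rough illustration of the relationship between the single-result methods and intoBuckets(), here is a sketch in which intoBuckets() simply maps a single-result method over an array of inputs. It is not the repo's implementation. It assumes that a test sequence is a sequence object and that an "exact match" means the test sequence's seq appears as a substring of the candidate's seq.

```javascript
// Hypothetical sketch, NOT the filter module from this repo.
function filter(sequences) {
  const single = {
    // Assumption: a match means testSequence.seq is a substring of s.seq.
    byExactMatch(testSequence) {
      const isMatch = s => s.seq.includes(testSequence.seq);
      return {
        testSequence,
        matchingSequences: sequences.filter(isMatch),
        nonMatchingSequences: sequences.filter(s => !isMatch(s))
      };
    }
  };
  return {
    ...single,
    // Same method name, but it takes an array of inputs and
    // returns one bucket per input.
    intoBuckets() {
      return {
        byExactMatch: testSequences =>
          testSequences.map(t => single.byExactMatch(t))
      };
    }
  };
}

// Made-up sample data.
const sampleSequences = [
  { seq: 'AAATTT', name: 'a', id: 1 },
  { seq: 'CCCGGG', name: 'b', id: 2 }
];
const sampleTests = [
  { seq: 'AAT', name: 't1', id: 1 },
  { seq: 'CG', name: 't2', id: 2 }
];

const sampleBuckets = filter(sampleSequences).intoBuckets().byExactMatch(sampleTests);
console.log(sampleBuckets.length); // 2
console.log(sampleBuckets[1].matchingSequences.map(s => s.name)); // [ 'b' ]
```

Each bucket carries its own testSequence, which is what lets you tell the buckets apart after the fact.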

Examples:

// Filtering by sequence length in number of nucleotides:
const filter = require('./filter');
const { longSequences, shortSequences } = filter(sequences).byLength(1300);

// Filtering by exact match on a substring in some of the sequences:
const { matchingSequences } = filter(longSequences).byExactMatch(testSequence);

// Filtering into 3 buckets by exact match on each of 3 test sequences:
const buckets = filter(longSequences).intoBuckets().byExactMatch(testSequencesArray);
console.log(buckets.length); // 3
const { testSequence, matchingSequences, nonMatchingSequences } = buckets[2];

Advanced Example (this one I should probably make another module method for):

// Generating multiple fasta files from multiple filtered buckets (avoiding slashes in filenames):
const removeSlashes = s => s.split('/').join('-');
io.fasta.saveFiles(buckets.reduce((sequenceArraysByFilename, bucket) => {
  const filename = `output/exact-match-bucket-${removeSlashes(bucket.testSequence.name)}.fasta.txt`;
  return {
    ...sequenceArraysByFilename,
    [filename]: bucket.matchingSequences
  };
}, {}));
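
If the reduce above were pulled out into its own function, it might look something like this sketch. The name bucketsToFileMap and its prefix parameter are hypothetical; nothing like it exists in the io or filter modules yet.

```javascript
// Hypothetical helper; the name and parameters are illustrative only.
const stripSlashes = s => s.split('/').join('-');

// Builds the { filename: sequences } object that io.fasta.saveFiles expects,
// with one entry per bucket, keyed by the bucket's test sequence name.
function bucketsToFileMap(buckets, prefix = 'output/exact-match-bucket-') {
  return Object.fromEntries(
    buckets.map(bucket => [
      `${prefix}${stripSlashes(bucket.testSequence.name)}.fasta.txt`,
      bucket.matchingSequences
    ])
  );
}

// Made-up bucket in the shape returned by byExactMatch.
const exampleBuckets = [
  {
    testSequence: { seq: 'AAT', name: 'primer/A', id: 1 },
    matchingSequences: [{ seq: 'AAATTT', name: 'a', id: 1 }],
    nonMatchingSequences: []
  }
];

console.log(Object.keys(bucketsToFileMap(exampleBuckets)));
// [ 'output/exact-match-bucket-primer-A.fasta.txt' ]
```

The result could then be passed straight to io.fasta.saveFiles. Note that Object.fromEntries needs Node 12 or later.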

These patterns can be mixed and matched. For example, you could filter into different buckets by length with no further changes to the filter module.
