docstream v0.0.3
DocStream
Installation
This utility requires Node.js and can be installed via the "npm" utility:
  sudo npm install -g docstream
After that, the "docstream" command should be available on the command line.
Introduction
CouchDB's "_all_docs" endpoint returns all the documents in a database in the following form.
e.g.
http://mycouchdbserver?_all_docs?include_docs=true
{
    "total_rows": 93985131,
    "offset": 0,
    "rows": [
        {
            "id": "0000230a35e724e12b8c18a8f700065d",
            "key": "0000230a35e724e12b8c18a8f700065d",
            "value": {
                "rev": "1-adf8311047fcdd953543118e7d501fa1"
            },
            "doc": {
                "_id": "0000230a35e724e12b8c18a8f700065d",
                "_rev": "1-adf8311047fcdd953543118e7d501fa1",
                "a": "1",
                "b": "2",
                "c": "3"
            }
        },
        {
            "id": "0000230a35e724e12b8c18a8f7000ccd",
            "key": "0000230a35e724e12b8c18a8f7000ccd",
            "value": {
                "rev": "1-5ce610ff79bc1cfe62b4a1a68e5b09cf"
            },
            "doc": {
                "_id": "0000230a35e724e12b8c18a8f7000ccd",
                "_rev": "1-5ce610ff79bc1cfe62b4a1a68e5b09cf",
                "a": "2",
                "b": "5",
                "c": "6"
            }
        }
    ]
}
Notice that the documents themselves are contained inside an object inside an array. In practice, the data comes out like this:
{"total_rows":93985131,"offset":0,"rows":[
{"id":"0000230a35e724e12b8c18a8f700065d","key":"0000230a35e724e12b8c18a8f700065d","value":{"rev":"1-adf8311047fcdd953543118e7d501fa1"},"doc":{"_id":"0000230a35e724e12b8c18a8f700065d","_rev":"1-adf8311047fcdd953543118e7d501fa1","a":"1","b":"2","c":"3"}},
{"id":"0000230a35e724e12b8c18a8f7000ccd","key":"0000230a35e724e12b8c18a8f7000ccd","value":{"rev":"1-5ce610ff79bc1cfe62b4a1a68e5b09cf"},"doc":{"_id":"0000230a35e724e12b8c18a8f7000ccd","_rev":"1-5ce610ff79bc1cfe62b4a1a68e5b09cf","a":"2","b":"5","c":"6"}}
]}
with each row object on its own line.
If you want to export the data and load it into Redshift, for example, the JSON needs to be in this form:
{"_id":"0000230a35e724e12b8c18a8f700065d","_rev":"1-adf8311047fcdd953543118e7d501fa1","a":"1","b":"2","c":"3"}
{"_id":"0000230a35e724e12b8c18a8f7000ccd","_rev":"1-5ce610ff79bc1cfe62b4a1a68e5b09cf","a":"2","b":"5","c":"6"}
Solution 1 - jq
The jq utility allows JSON to be parsed and reformatted on the command line, e.g.
  curl 'http://mycouchdbserver?_all_docs?include_docs=true' | jq '.rows[].doc'
(Thanks to Cloudant Support for this solution.) Unfortunately, it is not suitable for large data sets, as jq used this way requires all of the data to be in memory.
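To illustrate why a whole-document approach struggles with large exports, here is a minimal Node.js sketch of the same transformation done in memory. The function name extractDocsInMemory and the two-row sample body are made up for illustration; this is not how jq itself is implemented. The point is that the entire response must be parsed before a single document can be emitted.

```javascript
// Hypothetical helper (not part of jq or docstream): parse a complete
// _all_docs response and return one JSON string per document.
function extractDocsInMemory(body) {
  // JSON.parse needs the whole response at once, so memory use grows
  // with the size of the database export.
  const parsed = JSON.parse(body);
  return parsed.rows
    .filter((row) => row.doc) // skip rows that carry no "doc"
    .map((row) => JSON.stringify(row.doc));
}

// Made-up sample data in the same shape as an _all_docs response.
const body =
  '{"total_rows":2,"offset":0,"rows":[' +
  '{"id":"x","key":"x","value":{"rev":"1-a"},"doc":{"_id":"x","_rev":"1-a","a":"1"}},' +
  '{"id":"y","key":"y","value":{"rev":"1-b"},"doc":{"_id":"y","_rev":"1-b","a":"2"}}' +
  ']}';

console.log(extractDocsInMemory(body).join('\n'));
```

With tens of millions of rows, the parsed object would have to fit in memory in full, which is the limitation described above.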
Solution 2 - Use this docstream.js utility
DocStream takes _all_docs data in on stdin and outputs just the "doc" section of each row:
  curl 'http://mycouchdbserver?_all_docs?include_docs=true' | docstream
This should work with any size of data set, as long as each document appears on its own line.
e.g.
 cat sample.txt | docstream | gzip > output.txt.gz
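
The line-by-line idea can be sketched in Node.js as follows. This is an illustration under stated assumptions, not the actual docstream source: the names docFromLine and streamDocs are hypothetical, and the sample input is made up. It relies on the property noted above, that each row of the _all_docs output sits on its own line, so only one row is held in memory at a time.

```javascript
const readline = require('readline');
const { Readable } = require('stream');

// Hypothetical helper: turn one line of _all_docs output into one line of
// output, or null if the line holds no document (header, footer, blank).
function docFromLine(line) {
  // Drop surrounding whitespace and the trailing comma that separates rows.
  const trimmed = line.trim().replace(/,$/, '');
  if (!trimmed.startsWith('{')) return null;
  try {
    const row = JSON.parse(trimmed);
    return row && row.doc ? JSON.stringify(row.doc) : null;
  } catch (err) {
    // The opening line {"total_rows":...,"rows":[ is not complete JSON.
    return null;
  }
}

// Hypothetical helper: read `input` line by line and write each document
// to `output`, one per line.
function streamDocs(input, output) {
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  rl.on('line', (line) => {
    const doc = docFromLine(line);
    if (doc !== null) output.write(doc + '\n');
  });
  return rl;
}

// Made-up sample input; for real use, pass process.stdin instead.
const sample = [
  '{"total_rows":2,"offset":0,"rows":[',
  '{"id":"x","key":"x","value":{"rev":"1-a"},"doc":{"_id":"x","_rev":"1-a","a":"1"}},',
  '{"id":"y","key":"y","value":{"rev":"1-b"},"doc":{"_id":"y","_rev":"1-b","a":"2"}}',
  ']}',
].join('\n');

streamDocs(Readable.from([sample]), process.stdout);
```

Because each line is parsed and discarded independently, memory use depends on the size of the largest row rather than on the size of the whole export. The trade-off is that the sketch silently skips any row that spans more than one line.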