@bluehemoth/csvjsonify v1.2.0
csv2json
Description
A simple package for loading CSV data, transforming it to JSON format, and outputting the transformed data.
Usage
Description
A simple package which transforms data from csv to json format.
Options
--sourceFile Absolute path of the file to be transformed.
--resultFile Absolute path of the file where the transformed data will be stored.
--separator Symbol used in the source file to separate values. The value should be one of `,` `|` `;` `\t` (tab).
Defaults to a comma if not provided
Examples
```bash
csvToJson --sourceFile "D:\source.csv" --resultFile "D:\result.json" --separator ","
```
Environment variables
TRANSFORMER_CHOICE Feature flag which indicates which data transformer should be used
GOOGLE_DRIVE_STORAGE Feature flag which enables the upload of the transformation result to Google Drive
GOOGLE_APPLICATION_CREDENTIALS_FILE Name of the Google API service account key file
SHARED_FOLDER_ID Id of the Google Drive folder that is shared with the Google API service account; the transformation result will be uploaded to this folder
DATA_DIR Absolute path of the test data directory
CREDENTIALS_DIR Absolute path of the folder which contains the credentials file
SOURCE_FILE Name of the source file
RESULT_FILE Name of the results file
LOGGING_LEVEL Application logging level
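As an illustration, a .env file combining the variables above might look like the following. All values here are hypothetical placeholders; adjust them to your own setup:

```env
# Hypothetical example values - adjust to your own setup
TRANSFORMER_CHOICE=optimized_csv
GOOGLE_DRIVE_STORAGE=disabled
GOOGLE_APPLICATION_CREDENTIALS_FILE=credentials/service-account-key.json
SHARED_FOLDER_ID=your-shared-folder-id
DATA_DIR=/home/user/csv2json/testData
CREDENTIALS_DIR=/home/user/csv2json/credentials
SOURCE_FILE=source.csv
RESULT_FILE=result.json
LOGGING_LEVEL=info
```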
Feature flags
The package allows customization of its operation via feature flags in the .env file. The package supports these flags:
TRANSFORMER_CHOICE:
description: Decides which transformer will be used to transform the piped data (see the sketch after this list)
values:
legacy_csv: Transformer which transforms csv to json by building a JSON string via a simple forEach operation
optimized_csv: Transformer which extends the legacy transformer and builds JSON strings via the .reduce() method
GOOGLE_DRIVE_STORAGE:
description: Decides if the transformed file should be stored in Google Drive
values:
enabled: Enables the storage service
disabled: Disables the storage service
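For illustration only, the difference between the two transformers boils down to how the JSON string for one parsed CSV line is accumulated. The sketch below is not the package's actual code; the function and variable names are hypothetical:

```js
// Hypothetical sketch - not the package's actual implementation.
// Both functions build a JSON object string from one parsed CSV line.

// legacy_csv style: accumulate key/value pairs with a simple forEach
function buildLineLegacy(headers, values) {
  const pairs = [];
  headers.forEach((header, i) => {
    pairs.push(`"${header}":"${values[i]}"`);
  });
  return `{${pairs.join(",")}}`;
}

// optimized_csv style: accumulate the string via .reduce()
function buildLineOptimized(headers, values) {
  const body = headers.reduce(
    (acc, header, i) => acc + (i > 0 ? "," : "") + `"${header}":"${values[i]}"`,
    ""
  );
  return `{${body}}`;
}

console.log(buildLineLegacy(["id", "name"], ["1", "Ada"]));    // {"id":"1","name":"Ada"}
console.log(buildLineOptimized(["id", "name"], ["1", "Ada"])); // {"id":"1","name":"Ada"}
```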
Google drive storage requirements
To use the Google Drive storage service the user must provide the required authentication credentials. The following steps describe the authentication process:
1. First follow the provided steps and create a service account and a key assigned to this account
2. After key creation a credential file should be automatically downloaded to your system - move this file to the root directory of this package
3. Set the GOOGLE_APPLICATION_CREDENTIALS_FILE environment variable to the path of the credentials file (relative to the root directory of the package)
4. Create a Google Drive folder and share it with the service account (share -> type the service account email -> editor -> done)
5. Copy the id of the shared folder and save its value in the SHARED_FOLDER_ID environment variable
6. Set the GOOGLE_DRIVE_STORAGE environment variable to enabled
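For reference, a minimal upload through the official googleapis Node.js client looks roughly like the sketch below. This is an illustration under stated assumptions (file names and paths are hypothetical), not the package's own uploadToGoogleDrive.js:

```js
// Hypothetical sketch of a service-account upload to a shared Drive folder.
const fs = require("fs");
const { google } = require("googleapis");

async function uploadResult() {
  const auth = new google.auth.GoogleAuth({
    keyFile: process.env.GOOGLE_APPLICATION_CREDENTIALS_FILE, // path from step 3
    scopes: ["https://www.googleapis.com/auth/drive.file"],
  });
  const drive = google.drive({ version: "v3", auth });

  await drive.files.create({
    requestBody: {
      name: "result.json",                     // hypothetical file name
      parents: [process.env.SHARED_FOLDER_ID], // the shared folder from step 5
    },
    media: {
      mimeType: "application/json",
      body: fs.createReadStream("result.json"), // hypothetical local path
    },
  });
}

uploadResult().catch(console.error);
```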
Running the docker image
From release v1.2.0 the package can be run in Docker containers. To run the package in a container follow these steps:
1. Create a .env file in the package directory by following the env.example file and the descriptions in the Environment variables section
2. Run docker-compose up --build to build and run the package container (add -d to run it in detached mode)
3. After a successful run the transformed result will be available in the directory specified in the DATA_DIR environment variable
Note: built image is also available here
You can run this image via docker run with the following command:

```bash
sudo docker run \
  -v <absolute path of source/result files directory>:/app/testData \
  -v <absolute path of credentials directory>:/app/credentials \
  --env-file <relative path to env> \
  mind33z/csv2json:<version> \
  npm run start -- --sourceFile "/app/testData/<source file name>" --resultFile "/app/testData/<result file name>"
```
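Filled in with hypothetical paths and the v1.2.0 tag, an invocation might look like:

```bash
sudo docker run \
  -v /home/user/csv2json/testData:/app/testData \
  -v /home/user/csv2json/credentials:/app/credentials \
  --env-file .env \
  mind33z/csv2json:1.2.0 \
  npm run start -- --sourceFile "/app/testData/source.csv" --resultFile "/app/testData/result.json"
```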
Benchmarks
During performance measuring two metrics were tracked: execution time and memory. The screenshots below demonstrate the results of converting a sample 0.8 MB test file and the full bloated 13 GB test file.
V1.1
For this version only the execution time metric was tracked, as the results of the previous version showed that there was no need to optimise memory usage. The first screenshot shows the results of the test that was run after the _buildJSONStringFromLine function was enhanced. The second screenshot shows the results after the code in the _transform function was converted to asynchronous. Both tests were done with the 13 GB bloated data file.


Enhancement of _buildJSONStringFromLine had a positive influence on the execution time: the total time of the function decreased by roughly 10x, which in the end reduced the total runtime by roughly 30 seconds. Converting _transform had an awful effect on package runtime: the total time of each key transform function (except _buildJSONStringFromLine) increased by 2x. This may have happened because the event loop received too many simple task promises and struggled to schedule them efficiently (see the sketch below). Only the _buildJSONStringFromLine enhancement will be carried over into later versions.
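The suspected overhead can be reproduced in isolation. The snippet below (illustrative only, unrelated to the package's code) times the same trivial per-line work done synchronously versus wrapped in a promise per line:

```js
// Illustrative micro-benchmark: promise scheduling overhead on trivial work.
const lines = Array.from({ length: 1_000_000 }, (_, i) => `row${i}`);

console.time("sync loop");
let total = 0;
for (const line of lines) total += line.length; // trivial synchronous work
console.timeEnd("sync loop");

(async () => {
  console.time("promise per line");
  let total2 = 0;
  for (const line of lines) {
    // Each await queues a microtask, so the event loop is hit once per line.
    total2 += await Promise.resolve(line.length);
  }
  console.timeEnd("promise per line");
})();
```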
V1.0
Execution time
Sample data (0.8 MB):

Bloat data (13 GB):

The profiler results show that the functions _buildJSONStringFromLine, _removeEscapeSlashes, and _splitLineToArr influence the execution time the most (apart from node's own functions). It should be noted that on the bloated dataset the _splitLineToArr method overtakes the _buildJSONStringFromLine method in terms of execution time. The following releases should prioritize improving the highlighted methods.
Memory
Sample data (0.8 MB):

Bloat data (13 GB):

The results of memory tracking show that even though the package has to process large amounts of data, the memory used remains roughly the same. This can be attributed to the use of streams (see the sketch below). No further improvements in memory usage are required.
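A minimal sketch of such a stream pipe (a simplified stand-in, not the package's transformer) shows why memory stays flat: each chunk is transformed and written out before the next one is read:

```js
// Simplified stand-in for the package's pipe: ReadStream -> Transform -> WriteStream.
const fs = require("fs");
const { Transform, pipeline } = require("stream");

// Toy transformer; the package plugs its CsvToJsonStream in here instead.
const passThroughUpper = new Transform({
  transform(chunk, _encoding, callback) {
    // Only the current chunk lives in memory; the full file never does.
    callback(null, chunk.toString().toUpperCase());
  },
});

pipeline(
  fs.createReadStream("source.csv"),   // hypothetical input path
  passThroughUpper,
  fs.createWriteStream("result.json"), // hypothetical output path
  (err) => {
    if (err) console.error("Pipeline failed:", err);
  }
);
```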
Changelog
v1.2.0 - (2022-10-24)
Added
- Upload to google drive functionality
- Autodetect separator if no `--separator` argument is provided
- `docker-compose.yml`, `Dockerfile`, and `.dockerignore` files
- Workflow job for building a docker image from the project and pushing it to DockerHub
- Csv to json transformer tests and a workflow that runs these tests on push
- Custom logger implementation
Updated
- Added `Feature flags`, `Google drive storage requirements`, and `Running the docker image` sections to the README.md file
- Fixed a separator symbol bug in optimized JSON building method
- Fixed JSON formatting issues
v1.1.0 - (2022-10-19)
Added
- Feature flag toggling functionality via .env
- `CsvToJsonOptimizedStream`, a transform stream class which acts as an improved iteration of the previous transform stream
- Refactored the project structure
- `TransformerFactory`, a factory which handles the creation of different transformers
Updated
- Added benchmarks of the current version to benchmarks section in README.md
v1.0.0 - (2022-10-18)
Added
- Implemented `CsvToJsonStream` class. This class:
  - Transforms CSV to JSON data in chunks
  - Handles the case of a chunk having an incomplete line
  - Checks if the CSV line was parsed into an array correctly and that no unescaped separators were used in the data itself
- Measured the execution time and the memory usage of the converter when using the bloated 13 GB data file and the sample 0.8 MB test data file
- Created a pipe out of `ReadStream`, `CsvToJsonStream`, and `WriteStream` and achieved the basic functionality of the package
Updated
- README.md
v0.1.1 - (2022-10-14)
Added
- Input Handling
- Github actions workflow (on release bump if tag and package version mismatch and publish to npm)
- Test data file generation function
- README.md
Package structure
- `index.js` contains the main code of the package
- `handleArgs.js` contains logic related to handling input arguments
- `generate.js` contains test data file generation logic
- `transformers/CsvToJsonStream.js` contains the extended transform class used for transforming data from csv format to json
- `transformers/CsvToJsonOptimizedStream.js` contains the enhanced transform methods of the `CsvToJsonStream` class
- `factories/TransformerFactory.js` contains a factory which handles the creation of different transformers
- `uploadToGoogleDrive.js` contains a function which uploads the transformed result file to the shared folder provided in the .env file
- `tests/` directory contains the tests of the package and a custom test runner
- `utils/Logger.js` contains a custom logger implementation