@bluehemoth/csvjsonify v1.2.0

License: MIT

csv2json

Description

A simple package for loading CSV data, transforming it to JSON format, and outputting the transformed data.

Usage

    Description
    
    A simple package which transforms data from csv to json format.
    
    Options
    
    --sourceFile    Absolute path of the file to be transformed.
    --resultFile    Absolute path of the file where the transformed data will be stored.
    --separator     Symbol used in the source file to separate values. Must be one of , | ; or \t (tab).
                    Defaults to comma if not provided.
    
    Examples
    
    csvToJson --sourceFile "D:\source.csv" --resultFile "D:\result.json" --separator ","

Environment variables

  TRANSFORMER_CHOICE  Feature flag which indicates which data transformer should be used

  GOOGLE_DRIVE_STORAGE  Feature flag which enables the upload of the transformation result to Google Drive

  GOOGLE_APPLICATION_CREDENTIALS_FILE  Name of the Google API service account key file

  SHARED_FOLDER_ID  ID of the Google Drive folder that is shared with the Google API service account; the transformation result will be uploaded to this folder

  DATA_DIR  Absolute path of the test data directory

  CREDENTIALS_DIR  Absolute path of the folder which contains the credentials file

  SOURCE_FILE  Name of the source file

  RESULT_FILE  Name of the result file

  LOGGING_LEVEL  Application logging level
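
For reference, a complete .env file using these variables might look like the following. The values are illustrative only - the credentials file name, paths, and logging level shown here are examples, not required names:

    TRANSFORMER_CHOICE=optimized_csv
    GOOGLE_DRIVE_STORAGE=enabled
    GOOGLE_APPLICATION_CREDENTIALS_FILE=credentials.json
    SHARED_FOLDER_ID=<your shared folder id>
    DATA_DIR=/home/user/csv2json/testData
    CREDENTIALS_DIR=/home/user/csv2json/credentials
    SOURCE_FILE=source.csv
    RESULT_FILE=result.json
    LOGGING_LEVEL=info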

Feature flags

The package allows the customization of its operation via feature flags in the .env file. The package supports these flags:

    TRANSFORMER_CHOICE:
        description:    Decides which transformer will be used to transform the piped data
        values:
            legacy_csv: Transformer which transforms csv to json by building a JSON string via simple foreach operation
            optimized_csv:  Transformer which extends the legacy transformer and builds JSON strings via .reduce() method
    
    GOOGLE_DRIVE_STORAGE:
        description:    Decides if the transformed file should be stored in google drive
        values:
            enabled: Enables the storage service
            disabled: Disables the storage service
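
To illustrate the difference between the two transformers, the two JSON-string-building strategies might look like the following. This is a hypothetical sketch, not the package's actual code; the function names and signatures are invented for illustration:

    // Hypothetical sketch of the two strategies -- not the package's actual code.
    // legacy_csv: build the JSON string with a simple forEach loop.
    function buildJsonLineLegacy(headers, values) {
      const pairs = [];
      values.forEach((value, i) => {
        pairs.push(`"${headers[i]}":"${value}"`);
      });
      return `{${pairs.join(',')}}`;
    }

    // optimized_csv: build the same string via .reduce().
    function buildJsonLineOptimized(headers, values) {
      const body = values.reduce(
        (acc, value, i) => acc + (i > 0 ? ',' : '') + `"${headers[i]}":"${value}"`,
        ''
      );
      return `{${body}}`;
    }

    // Both produce {"name":"Ada","age":"36"} for:
    // buildJsonLineLegacy(['name', 'age'], ['Ada', '36']);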

Google drive storage requirements

To use the Google Drive storage service the user must provide the required authentication credentials. The following steps describe the authentication process:

1. First follow the provided steps and create a service account and a key assigned to this account.
2. After key creation a credentials file should be automatically downloaded to your system - move this file to the root directory of this package.
3. Assign the GOOGLE_APPLICATION_CREDENTIALS_FILE environment variable the path of the credentials file (relative to the root directory of the package).
4. Create a Google Drive folder and share it with the service account (Share -> type the service account email -> Editor -> Done).
5. Copy the ID of the shared folder and save its value in the SHARED_FOLDER_ID environment variable.
6. Set the GOOGLE_DRIVE_STORAGE environment variable to enabled.
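
Once the credentials are in place, the upload itself can be pictured roughly as follows. This is a minimal sketch assuming the googleapis npm package and the environment variables above; the package's own uploadToGoogleDrive.js may differ:

    // Minimal sketch assuming the googleapis npm package --
    // the package's own uploadToGoogleDrive.js may differ.
    const fs = require('fs');
    const path = require('path');
    const { google } = require('googleapis');

    async function uploadResultToDrive(resultFilePath) {
      const auth = new google.auth.GoogleAuth({
        keyFile: process.env.GOOGLE_APPLICATION_CREDENTIALS_FILE,
        scopes: ['https://www.googleapis.com/auth/drive.file'],
      });
      const drive = google.drive({ version: 'v3', auth });

      // Create the file inside the folder shared with the service account.
      await drive.files.create({
        requestBody: {
          name: path.basename(resultFilePath),
          parents: [process.env.SHARED_FOLDER_ID],
        },
        media: {
          mimeType: 'application/json',
          body: fs.createReadStream(resultFilePath),
        },
      });
    }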

Running the docker image

From release v1.2.0 the package supports its use in docker containers. To run the package in a container follow these steps:

1. Create a .env file in the package directory by following the env.example file and the descriptions in the Environment variables section.
2. Run docker-compose up --build to run the package container in detached mode.
3. After a successful run the transformed result will be available in the directory specified in the DATA_DIR environment variable.

Note: a built image is also available on DockerHub. You can run this image via docker run with the following command:

    sudo docker run -v <absolute path of source/result files directory>:/app/testData -v <absolute path of credentials directory>:/app/credentials --env-file <relative path to env> mind33z/csv2json:<version> npm run start -- --sourceFile "/app/testData/<source file name>" --resultFile "/app/testData/<result file name>"

Benchmarks

During performance measuring, two metrics were tracked - execution time and memory usage. The screenshots below demonstrate the results of converting a sample 0.8 MB test file and the full bloated 13 GB test file.

V1.1

For this version, only the execution time metric was tracked, as the results of the previous version showed that there was no need to optimise memory usage. The first screenshot shows the results of the test that was run after the _buildJSONStringFromLine function was enhanced. The second screenshot shows the results of the testing after the code in the _transform function was converted to asynchronous. Both tests were done with the 13 GB bloated data file.

[Screenshot: enhanced _buildJSONStringFromLine execution time]

[Screenshot: async _transform execution time]

Enhancement of the _buildJSONStringFromLine function had a positive influence on the execution time - the total time spent in the function decreased by roughly 10x, which in the end led to the total runtime decreasing by roughly 30 seconds. Converting _transform to asynchronous had an awful effect on package runtime - the total time of each key transform function (except _buildJSONStringFromLine) increased by 2x. This may have happened because the event loop was flooded with too many promises wrapping trivial tasks. Only the _buildJSONStringFromLine enhancement will be carried over into later versions.
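
To illustrate the suspected cause (hypothetical code, not the package's implementation): awaiting a promise per line turns every trivial piece of work into a separately scheduled microtask, and that scheduling overhead adds up over the millions of lines in a 13 GB file.

    // Hypothetical illustration of the regression -- not the package's code.
    const processLine = (line) => line.trim(); // stand-in for the real per-line work

    // Synchronous per-line processing: one tight loop, no scheduling overhead.
    function transformChunkSync(lines) {
      return lines.map(processLine).join('\n');
    }

    // Asynchronous variant: every line now costs a promise plus a microtask,
    // so the event loop spends time scheduling rather than transforming.
    async function transformChunkAsync(lines) {
      const out = [];
      for (const line of lines) {
        out.push(await Promise.resolve(processLine(line)));
      }
      return out.join('\n');
    }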

V1.0

Execution time

Sample data (0.8 MB):

[Screenshot: sample data execution time]

Bloat data (13 GB):

[Screenshot: bloat data execution time]

The results of the profiler show that the functions _buildJSONStringFromLine, _removeEscapeSlashes, and _splitLineToArr influence the execution time the most (apart from node's own functions). It should be noted that on the bloated dataset the _splitLineToArr method overtakes the _buildJSONStringFromLine method in terms of execution time. The following releases should prioritize improving the highlighted methods.

Memory

Sample data (0.8 MB):

[Screenshot: sample data memory usage]

Bloat data (13 GB):

[Screenshot: bloat data memory usage]

The results of memory tracking show that even though the package has to process large amounts of data, the memory used remains roughly constant. This can be attributed to the use of streams. No further improvements in memory usage are required.
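
This is the expected behaviour of a Node.js stream pipeline: data flows through in bounded chunks, so resident memory does not grow with file size. A simplified, illustrative sketch of such a pipe follows; the package's real transformer lives in transformers/CsvToJsonStream.js and differs in detail (for brevity this sketch emits one JSON object per line rather than a single JSON document):

    // Simplified, illustrative sketch of the streaming pipe -- the package's
    // real transformer lives in transformers/CsvToJsonStream.js.
    const fs = require('fs');
    const { Transform, pipeline } = require('stream');

    class CsvToJsonSketch extends Transform {
      constructor(separator = ',') {
        super();
        this.separator = separator;
        this.remainder = ''; // carries an incomplete line between chunks
        this.headers = null;
      }

      _transform(chunk, _encoding, callback) {
        const text = this.remainder + chunk.toString();
        const lines = text.split('\n');
        this.remainder = lines.pop(); // the last element may be a partial line
        let out = '';
        for (const line of lines) {
          const values = line.split(this.separator);
          if (!this.headers) { this.headers = values; continue; }
          out += JSON.stringify(
            Object.fromEntries(this.headers.map((h, i) => [h, values[i]]))
          ) + '\n';
        }
        if (out) this.push(out);
        callback();
        // A full implementation would also emit this.remainder in _flush().
      }
    }

    pipeline(
      fs.createReadStream('source.csv'),
      new CsvToJsonSketch(','),
      fs.createWriteStream('result.json'),
      (err) => { if (err) console.error(err); }
    );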

Changelog

v1.2.0 - (2022-10-24)

Added

  • Upload to google drive functionality
  • Autodetect separator if no --separator argument is provided
  • docker-compose.yml, Dockerfile, and .dockerignore files
  • Workflow job for building a docker image from the project and pushing it to DockerHub
  • Csv to json transformer tests and a workflow that runs these tests on push
  • Custom logger implementation

Updated

  • Added Feature flags, Google drive storage requirements, and Running the docker image sections to the README.md file
  • Fixed a separator symbol bug in optimized JSON building method
  • Fixed JSON formatting issues

v1.1.0 - (2022-10-19)

Added

  • Feature flag toggling functionality via .env
  • CsvToJsonOptimizedStream, a transform stream class which acts as an improved iteration of the previous transform stream
  • Refactored the project structure
  • TransformerFactory, a factory which handles the creation of different transformers

Updated

  • Added benchmarks of the current version to benchmarks section in README.md

v1.0.0 - (2022-10-18)

Added

  • Implemented CsvToJsonStream class. This class:
    • Transforms CSV to JSON data in chunks
    • Handles the case of chunk having an incomplete line
    • Checks if the CSV line was parsed into an array correctly and that no unescaped separators were used in the data itself
  • Measured the execution time and the memory usage of the converter when using the bloated 13 GB data file and sample 0.8 MB test data file.
  • Created a pipe out of ReadStream, CsvToJsonStream, and WriteStream and achieved the basic functionality of the package

Updated

  • README.md

v0.1.1 - (2022-10-14)

Added

  • Input Handling
  • Github actions workflow (on release: bump if the tag and package version mismatch, then publish to npm)
  • Test data file generation function
  • README.md

Package structure

  • index.js contains the main code of the package
  • handleArgs.js contains logic related to handling input arguments
  • generate.js contains test data file generation logic
  • transformers/CsvToJsonStream.js contains the extended transform class used for transforming data from csv format to json
  • transformers/CsvToJsonOptimizedStream.js contains the enhanced transform methods of the CsvToJsonStream class
  • factories/TransformerFactory.js contains a factory which handles the creation of different transformers
  • uploadToGoogleDrive.js contains a function which uploads the transformed result file to the shared folder provided in .env file
  • tests/ directory contains the tests of the package and a custom test runner
  • utils/Logger.js contains a custom logger implementation
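
As a hypothetical sketch of how such a factory might select a transformer based on the TRANSFORMER_CHOICE feature flag (the class names and exports are assumed from the file list above, not taken from the actual source):

    // Hypothetical sketch of factories/TransformerFactory.js; the exports and
    // class names are assumed from the file list above, not the actual source.
    const CsvToJsonStream = require('../transformers/CsvToJsonStream');
    const CsvToJsonOptimizedStream = require('../transformers/CsvToJsonOptimizedStream');

    class TransformerFactory {
      static create(separator) {
        switch (process.env.TRANSFORMER_CHOICE) {
          case 'optimized_csv':
            return new CsvToJsonOptimizedStream(separator);
          case 'legacy_csv':
          default:
            return new CsvToJsonStream(separator);
        }
      }
    }

    module.exports = TransformerFactory;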