1.0.1 • Published 3 years ago

tapestry-pipeline v1.0.1

License
MIT



Overview

Tapestry is an open source orchestration framework for the deployment of user entity data pipelines. It allows users to easily configure and launch an end-to-end data pipeline hosted on Amazon Web Services. Our automated solution combines best-in-class tools to create a warehouse-centric data stack, offering built-in data ingestion, transformation, and newly emerging data syncing (also known as "reverse ETL") technologies. Our inclusion of a reverse ETL component solves the "last mile" problem by providing the ability to operationalize collected user data in near real time.

Read our case study for more information about user data pipelines and to learn how we built Tapestry.

The Team

Katherine Beck Software Engineer Los Angeles, CA

Leah Garrison Software Engineer Atlanta, GA

Rick Molé Software Engineer New York, NY

Adam Peterson Software Engineer Lexington, KY




Prerequisites

  • Node.js (v12+)
  • NPM
  • AWS Account
  • AWS CLI configured locally
  • Docker

You'll need the accounts and tools above before running any Tapestry commands. Because Tapestry is a Node package, both Node.js and NPM must be installed on your machine. Tapestry also requires an AWS account and a locally configured AWS CLI, since it relies on your local environment to spin up AWS resources. Finally, Tapestry uses Docker images and containers to run both the ingestion and syncing phases of your pipeline, so Docker must be installed and running on your machine.
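The prerequisites above can be checked before installing. The following is a minimal sketch, not part of Tapestry: `check_tool` is a hypothetical helper, and the script assumes a POSIX shell. It only reports what it finds and changes nothing.

```shell
#!/bin/sh
# Sketch: report which Tapestry prerequisites are available on this machine.
# check_tool is a hypothetical helper, not part of the Tapestry CLI.

check_tool() {
  # command -v succeeds only if the name resolves to something runnable.
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

missing=0
for tool in node npm aws docker; do
  check_tool "$tool" || missing=$((missing + 1))
done

# Being installed is not enough: Node.js must be v12+, the AWS CLI needs
# working credentials, and the Docker daemon must be running.
if command -v node >/dev/null 2>&1; then
  major=$(node -p 'process.versions.node.split(".")[0]')
  [ "$major" -ge 12 ] || echo "warning: Node.js v12+ required, found v$major"
fi
if command -v aws >/dev/null 2>&1; then
  aws sts get-caller-identity >/dev/null 2>&1 \
    || echo "warning: AWS CLI has no working credentials"
fi
if command -v docker >/dev/null 2>&1; then
  docker info >/dev/null 2>&1 || echo "warning: Docker daemon is not running"
fi

echo "missing tools: $missing"
```

A result of `missing tools: 0` with no warnings suggests the machine is ready for the install step.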


Installing Tapestry

npm i -g tapestry-pipeline

Note: With the exception of init, all Tapestry commands should be run from the root directory of your Tapestry project.

Tapestry Commands

| Command | Description |
| --- | --- |
| tapestry init | Gathers project information and provisions the necessary project folders and template files. |
| tapestry deploy | Deploys a full data pipeline including Airbyte for ingestion and Grouparoo for syncing, both provisioned on AWS resources and connected to a Snowflake data warehouse. |
| tapestry kickstart | Deploys the same pipeline as deploy, but also includes configuration for Zoom and Salesforce as Airbyte sources and Mailchimp as a Grouparoo destination. |
| tapestry start-server | Launches the Tapestry UI dashboard locally on port 7777. |
| tapestry rebuild | Rebuilds the local Grouparoo image and pushes the updated image to the user's ECR repository, updating the Grouparoo CloudFormation stack in the process. |
| tapestry teardown | Stops pipeline ingestion and syncing by tearing down most of the provisioned AWS resources. |

Initialization

The first command to run when setting up your project is init.

tapestry init

  • Tapestry prompts you to give your project a name
  • Tapestry provisions a project folder of the same name, along with an AWS CloudFormation template for setting up and configuring your Airbyte stack
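Putting the install step, init, and the project-root note together, a first session might look like the sketch below. It uses only commands named in this README. `DRY_RUN`, `PROJECT_NAME`, `run`, and `first_time_setup` are assumptions made for this example, as is the idea that the project folder is created in the current directory. By default the script prints the commands without executing them.

```shell
#!/bin/sh
# Sketch of a first-time Tapestry session, using only commands from this README.
# DRY_RUN and PROJECT_NAME are assumptions made for this example.
DRY_RUN="${DRY_RUN:-1}"
PROJECT_NAME="${PROJECT_NAME:-my-pipeline}"   # hypothetical; use the name you give to init

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

first_time_setup() {
  run npm i -g tapestry-pipeline
  run tapestry init                 # the one command that can run outside the project
  run cd "$PROJECT_NAME"            # all later commands run from the project root
  run tapestry deploy               # or: tapestry kickstart
}

first_time_setup
```

Setting `DRY_RUN=0` would execute the steps for real, including the interactive prompts from init and deploy.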

Deploying a Tapestry Pipeline

Prerequisites:

  • Snowflake Account (deploy/kickstart)
  • Zoom Account (kickstart)
  • Salesforce Account (kickstart)
  • Mailchimp Account (kickstart)

You can deploy your pipeline with either of two commands. The deploy command configures and launches a full user data pipeline with Airbyte as the ingestion tool, Snowflake as the data warehouse, Grouparoo as the syncing tool, and the AWS resources needed to host and connect these tools.

The kickstart command is similar in that it does everything deploy does, but it also configures two Airbyte sources (Zoom and Salesforce) and a Grouparoo destination (Mailchimp).

A Snowflake account is required for both deploy and kickstart.

tapestry deploy

  • Tapestry prompts you for your Snowflake credentials
  • AWS resources for ingestion phase (Airbyte) are provisioned
  • Snowflake is configured as an Airbyte destination
  • AWS resources for syncing phase (Grouparoo) are provisioned
  • Snowflake is configured as a Grouparoo source

tapestry kickstart

  • Tapestry prompts you for your Snowflake credentials
  • AWS resources for ingestion phase (Airbyte) are provisioned
  • Snowflake is configured as an Airbyte destination
  • Zoom and Salesforce are configured as Airbyte sources
  • User is prompted to follow the instructions for DBT setup
  • AWS resources for syncing phase (Grouparoo) are provisioned
  • Snowflake is configured as a Grouparoo source
  • Mailchimp is configured as a Grouparoo destination
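The choice between the two commands follows from the prerequisite list above: Snowflake is needed either way, and kickstart additionally needs Zoom, Salesforce, and Mailchimp accounts. A minimal sketch of that decision, where `has_account` and `suggest_command` are hypothetical helpers and not part of the Tapestry CLI:

```shell
#!/bin/sh
# Sketch: suggest a Tapestry deployment command from the accounts you hold.
# has_account and suggest_command are hypothetical helpers.

has_account() {
  # Usage: has_account NAME "space separated list of accounts"
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

suggest_command() {
  accounts="$*"
  if ! has_account snowflake "$accounts"; then
    echo "none: a Snowflake account is required"
    return 1
  fi
  # kickstart preconfigures all three integrations, so all three are needed.
  for needed in zoom salesforce mailchimp; do
    if ! has_account "$needed" "$accounts"; then
      echo "tapestry deploy"
      return 0
    fi
  done
  echo "tapestry kickstart"
}

suggest_command snowflake zoom                        # prints: tapestry deploy
suggest_command snowflake zoom salesforce mailchimp   # prints: tapestry kickstart
```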

Management & Maintenance

tapestry start-server

  • Launches Tapestry dashboard UI

Note: Both deploy and kickstart automatically launch your Tapestry dashboard upon successful deployment. Use start-server to launch the dashboard at any other time.
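To see whether the dashboard is already running before calling start-server, you can probe the local port. This is a sketch that assumes `curl` is installed and that the dashboard answers plain HTTP on port 7777, the port listed in the command table. `dashboard_status` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: check whether the Tapestry dashboard answers on its local port.
# dashboard_status is a hypothetical helper; 7777 is the port this README names.

dashboard_status() {
  port="${1:-7777}"
  # -s: silent, -f: treat HTTP errors as failure, --max-time: do not hang.
  if curl -sf --max-time 5 -o /dev/null "http://localhost:$port"; then
    echo "dashboard is reachable on port $port"
  else
    echo "dashboard not reachable on port $port; try: tapestry start-server"
    return 1
  fi
}

dashboard_status || true
```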

tapestry rebuild

The rebuild command is specific to the syncing phase of the pipeline. While most updates to Airbyte can be made directly in its UI, Grouparoo’s dashboard is mainly for visibility and observation. To add, remove, or update any sources or destinations, you must edit the configuration files in your local Grouparoo directory.

  • Awaits confirmation from user for any changes made to configuration files
  • Rebuilds Grouparoo Docker image locally
  • Pushes local Grouparoo image to ECR
  • Updates the CloudFormation stack
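For orientation, the rebuild steps above correspond roughly to the manual Docker and AWS CLI commands printed by the sketch below. This is not Tapestry's implementation. The account ID, region, repository name, and stack name are hypothetical placeholders, and how the stack picks up the new image depends on the template Tapestry generates. The script only prints the commands and executes nothing.

```shell
#!/bin/sh
# Sketch: print manual commands that roughly mirror the rebuild steps.
# Nothing is executed. Every value below is a hypothetical placeholder,
# not a name Tapestry is known to use.
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"
REGION="${REGION:-us-east-1}"
REPO="${REPO:-grouparoo}"
STACK="${STACK:-grouparoo-stack}"

print_rebuild_plan() {
  registry="$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"
  cat <<EOF
# 1. Rebuild the Grouparoo image from the local configuration
docker build -t $REPO:latest .
# 2. Authenticate Docker against ECR and push the image
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $registry
docker tag $REPO:latest $registry/$REPO:latest
docker push $registry/$REPO:latest
# 3. Update the CloudFormation stack
aws cloudformation update-stack --stack-name $STACK --use-previous-template --capabilities CAPABILITY_IAM
EOF
}

print_rebuild_plan
```

In normal use, `tapestry rebuild` handles these steps for you. The sketch is only meant to show what each bullet involves.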

Metrics

Your Tapestry dashboard contains documentation on how to use Tapestry, along with a page for each section of your pipeline: Data Ingestion, Data Storage & Data Transformation, and Data Syncing. Each page displays metrics that give you better insight into the health of each component, such as CPU utilization and instance health. The pages also link to the UIs of the tools used at each stage of the pipeline: Airbyte, Grouparoo, Snowflake, and DBT.


Teardown

tapestry teardown

  • Most AWS resources for ingestion and syncing phases are completely torn down and deleted from your AWS account

Note: We say “most” AWS resources because we leave your S3 bucket and the parameters in your Parameter Store intact, so that you retain access to this data even after your pipeline has been torn down.
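After a teardown, you can confirm what was retained with the AWS CLI. The sketch below lists all S3 buckets and Parameter Store entries visible to your configured credentials. It does not filter for Tapestry resources, because their names are not documented here. `list_retained` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: list resources that may remain in the account after teardown.
# list_retained is a hypothetical helper, not part of the Tapestry CLI.

list_retained() {
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI not found; skipping"
    return 0
  fi
  echo "S3 buckets:"
  aws s3 ls
  echo "Parameter Store entries:"
  # --query narrows the JSON response to the parameter names only.
  aws ssm describe-parameters --query 'Parameters[].Name' --output text
}

list_retained
```

If you no longer need the retained data, it has to be deleted separately, since teardown leaves it in place.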


Tapestry Architecture

[Figure: Tapestry architecture diagram]

The above diagram shows the complete infrastructure of a Tapestry pipeline as provisioned by deploy or kickstart. This particular diagram also shows the sources and destinations that the kickstart command preconfigures. For a deeper understanding of this architecture and what each piece does, please read our case study.


Helpful Resources