Thursday, September 17th, 2020
You want to extract data from Postgres and load it into Snowflake. You prefer the flexibility of code -- perhaps there are some custom operations you want to do on the data before it is loaded into Snowflake for example.
But you recognize that writing such code is only half the problem. Setting up and maintaining infrastructure to run, monitor and orchestrate a code-based data pipeline can be a huge time sink and distraction.
This is where CloudReactor helps.
CloudReactor’s “deploy” command packages your program into Tasks backed by Docker images, and deploys them to your AWS ECS environment. You can then use the CloudReactor dashboard to monitor, manage and orchestrate these Tasks. For example, simply hit “start” in the CloudReactor dashboard, or configure a cron schedule, and your Task automatically starts running on AWS ECS Fargate. You can view all historical executions and their status too.
In a nutshell, you get the development & isolation benefits of Docker, plus easy deployment to AWS ECS, monitoring, alerting and Workflow authoring thanks to CloudReactor.
Running Tasks on ECS Fargate is often cost-effective because you only pay for resources you use, and saves your team time because there’s no server maintenance to worry about.
After following this guide, you’ll have a Task that will extract data from a Postgres table and load it into Snowflake. The Task keeps track of the most recent data loaded, and will only fetch new / updated records each time it runs. The script as-is doesn’t pick up deleted records however (i.e. assumes the source table is append-only).
The Task is deployed and run on an ECS cluster in your own AWS environment, and uses AWS Secrets Manager for runtime secrets. Once deployed, you can login to your CloudReactor account to stop, start and schedule this Task; to set alerts; and to build Workflows (connect this Task with other Tasks), etc.
While we believe using CloudReactor to manage and monitor Tasks is a great choice, you can also use the core “python_to_snowflake.py” script independent of CloudReactor. Scripts deployed with CloudReactor are just normal code. Feel free to check out the script itself here.
We’ll first run the script locally to confirm everything works, and then configure & deploy to AWS via CloudReactor.
Fork the CloudReactor postgres_to_snowflake repo. This includes the
postgres_to_snowflake.py script, all required dependencies, and tools that enable the Task to be built into a docker container and deployed to AWS:
git clone https://github.com/CloudReactor/postgres_to_snowflake.git [./new_dir]
new_dir is optional if you want to clone to a directory other than ./postgres_to_snowflake
Modify the following files:
/src/postgres_to_snowflake.py. This is the program that fetches data from postgres and loads it to Snowflake. You will want to modify the “key variables” section at the top of the page.
/deploy_config/files/.env.dev: Add your postgres and Snowflake credentials. These are used when the script is run locally.
/docker-compose.yml: Nothing to change here. Just for reference, this has a “postgres_to_snowflake” block under “services”. This allows you to run the Task locally by running
docker-compose run --rm postgres_to_snowflake:
postgres_to_snowflake: <<: *service-base command: python src/postgres_to_snowflake.py
docker-compose run --rm postgres_to_snowflake
Of course, you need to ensure connectivity from your local machine to source and destination databases. This might involve a VPN for example.
If the script succeeds 🥳 🎉, you should be able to login to your snowflake dashboard and query the relevant destination table.
Sign up for a CloudReactor account. Once created, you can login to view your CloudReactor API key.
To run Tasks in AWS via CloudReactor, you’ll need:
There are two ways to create this infrastructure and link it to CloudReactor.
git clone https://github.com/CloudReactor/cloudreactor-aws-setup-wizard.git cd cloudreactor-aws-setup-wizard ./build.sh ./wizard.sh (or .wizard.bat if using Windows)
Whether you use the AWS Setup Wizard or do things manually, you’ll end up with:
We’ll use AWS Secrets Manager to store runtime secrets. For this guide, we’ll create 3 secrets objects: one each to store your CloudReactor API key, PostgreSQL and Snowflake credentials. (If you don’t want to use AWS Secrets Manager, see https://docs.cloudreactor.io/secrets.html for an overview of some other options).
In the AWS Secrets Manager dashboard, click “Store a new secret”.
<env_name>with whatever your CloudReactor Run Environment is called -- this was created during the AWS Setup Wizard. You can also check the name of the Run Environment by logging into the CloudReactor dashboard and selecting “Run Environments”.
CloudReactor/<env_name>/common/prefix is important: Tasks deployed with CloudReactor are by default able to read secrets contained in this specific path (but not other paths). You can of course modify the permissions policy to enable access to other paths.
env_nameis the CloudReactor “Run Environment” mentioned just above. You can choose a different name to
postgresif you want.
Modify the following files:
/deploy_config/docker_deploy.env.example: copy this to
docker_deploy.env. Add your AWS key, secret access key, and AWS ECS cluster region. This allows you to use the
./docker_deploycommand in the next step to deploy Tasks directly from your local machine to AWS.
/deploy_config/vars/[env_name].yml: copy and rename
example.ymlto match the name of your CloudReactor Run Environment (e.g. prod, staging). Open the file. Fill in your CloudReactor API key next to
api_key. Next, further down that same file, you’ll see the following block:
- name: PROC_WRAPPER_API_KEY valueFrom: "arn:aws:secretsmanager:[aws_region]:[aws_account_id]:secret:CloudReactor/example/common/cloudreactor_api_key-xxx" - name: POSTGRES_SECRETS valueFrom: ... ...
Fill in the
valueFrom: values here with the actual secrets ARNs from AWS Secrets Manager (note:
PROC_WRAPPER_API_KEY = CloudReactor API secret).
Note the following block further down that file:
postgres_to_snowflake: <<: >*default_env_task_config
Nothing to modify here. This block allows the postgres_to_snowflake task to read the secrets we just defined above.
Deploy to ECS:
./cr_deploy.sh [env_name] postgres_to_snowflake
[env_name] is the name of your CloudReactor Run Environment (and the .yml file created just above with your CloudReactor API key), e.g. prod, staging.
Log into the CloudReactor dashboard at dash.cloudreactor.io.
You’ll see an entry for the postgres_to_snowflake Task:
To run it now, simply hit the start button. You can monitor its status from the dashboard, and see a list of historical executions by clicking on the Task name.
Click the gear icon to set a cron schedule. You can also setup success / failure alerts via email or PagerDuty.
Finally, if you deploy other Tasks using CloudReactor, you can click “Workflows” and link Tasks together (e.g. “immediately after this Task succeeds, start some related downstream Task”).
CloudReactor aims to simplify data engineers' lives, by making it really easy to deploy, run, monitor and manage Tasks in the cloud.
In this guide, we deployed a Task that fetched data from a Postgres source, and loaded it into Snowflake. The Task runs in your own AWS environment, but can be managed easily in CloudReactor.
Next, you may want to deploy other Tasks using CloudReactor - read our guide here. Once deployed, you can link these Tasks together into a Workflow.
We welcome your feedback! Feel free to contact me at firstname.lastname@example.org