This repository has been archived by the owner on Oct 11, 2021. It is now read-only.
Victor Villas edited this page Apr 29, 2020 · 65 revisions

CloudFormation Integration

Some of the stack template parameters map directly to Airflow configuration. Stack updates that change these parameters modify the Launch Configuration metadata of each Airflow service's Auto Scaling Group, so new instances come up with the updated parameters. Existing instances, however, still need those changes propagated to their already running services.
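Such a parameter-only update can be issued from the AWS CLI without touching the template itself. This is a hedged sketch: the stack name and the `LoadExampleDags` parameter are placeholders, not necessarily names this template defines.

```shell
# Update only a stack parameter, reusing the current template.
# "airflow" and "LoadExampleDags" are illustrative placeholders.
aws cloudformation update-stack \
  --stack-name airflow \
  --use-previous-template \
  --parameters ParameterKey=LoadExampleDags,ParameterValue=False \
  --capabilities CAPABILITY_IAM
```

Once the update completes, new instances launch with the changed parameter, while the mechanism described below propagates it to instances already in service.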

This is achieved by leveraging the CloudFormation Helper Scripts suite, which makes these operations easier to express as code. Every instance runs a cfn-hup service that watches for metadata changes in the CloudFormation template and triggers the setup process on already running EC2 instances. This overrides the old configuration and, thanks to the airflow-confapply-agent service, restarts the Airflow process so it picks up the new parameters.
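A typical cfn-hup hook for this pattern looks roughly like the fragment below. This is an illustrative sketch: the logical resource ID `AirflowScheduler` and the angle-bracket placeholders are assumptions, not values taken from this stack's template.

```ini
; /etc/cfn/hooks.d/cfn-auto-reloader.conf -- illustrative sketch only;
; the resource logical ID and placeholders are not from this stack.
[cfn-auto-reloader-hook]
triggers=post.update
path=Resources.AirflowScheduler.Metadata.AWS::CloudFormation::Init
action=/opt/aws/bin/cfn-init -v --stack <stack-name> --resource AirflowScheduler --region <region>
runas=root
```

When cfn-hup detects a change at the watched metadata path, it re-runs cfn-init, which rewrites the configuration files on the instance.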

Systems Manager Integration

When operating a distributed system like Airflow, it's frequently useful to manage or inspect some or all of its moving parts. Running the cluster on EC2, this would usually mean opening public SSH ports or provisioning bastion hosts for private subnets to reach each part securely. Even then, the extra shared secret keys enlarge the attack surface, while offering very little tooling to help operators automate maintenance tasks and refine operational procedures.

This stack uses the latest Amazon Linux AMIs, which ship with the amazon-ssm-agent service, so you can leverage the full capabilities of AWS Systems Manager to execute remote commands or scripts against a collection of EC2 instances at once. You can also use Session Manager for quick inspections and routine operation tasks that require CLI access to individual instances; it works on top of existing IAM policies and is also available in the AWS Console. SSM also adds auditing capabilities by logging past operations and managed sessions.
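For example, both remote execution and interactive access are available from the AWS CLI. The tag filter and the instance ID below are illustrative assumptions, not values guaranteed by this stack.

```shell
# Run a command on every instance belonging to the stack at once.
# The tag key/value and the command are illustrative placeholders.
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:aws:cloudformation:stack-name,Values=airflow" \
  --parameters 'commands=["systemctl status airflow-scheduler"]'

# Open an interactive shell on a single instance -- no SSH key,
# bastion host, or open inbound port required.
aws ssm start-session --target i-0123456789abcdef0
```

Both operations are authorized through IAM and recorded by SSM, which is where the auditing capabilities mentioned above come from.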

CodeDeploy Integration

Deploying Airflow on distributed persistent workers can be tricky. A common approach is a shared network directory that keeps all instances in sync with configuration and DAG files, since a single source of truth makes deployment much easier (e.g. using git-sync). Keep in mind, though, that updating files in the middle of a task's execution can have unintended consequences, such as the first few bash operators running scripts from the old revision and the last few running incompatible scripts from the new one. Safely deploying new code requires the workers to stop, and with a single shared directory that means all workers must stop simultaneously, which is troublesome to orchestrate.

Thanks to AWS CodeDeploy, distributing the Airflow configs and DAG files to all individual instances is fully automated and centralized, helping developers and operators make frequent, small, reversible changes. Each instance runs the codedeploy-agent, which polls for pending deployments, takes care of the installation process, and signals to the Auto Scaling Group that the instance is ready for service. Just generate a new deployment package (using the CLI or other tools like CodePipeline) and CodeDeploy will take care of the rest through its agents. This process is easy to adopt, enables fast release cycles, and is flexible enough to handle complex upgrade scenarios, such as restarting Airflow services or installing additional packages.
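A minimal appspec.yml for such a deployment might look like the sketch below. The file paths and the hook script name are assumptions for illustration, not the stack's actual deployment specification.

```yaml
# appspec.yml -- illustrative sketch; destinations and hook scripts
# are placeholders, not this stack's real deployment package layout.
version: 0.0
os: linux
files:
  - source: dags/
    destination: /airflow/dags
  - source: airflow.cfg
    destination: /airflow
hooks:
  AfterInstall:
    - location: scripts/restart-airflow.sh  # hypothetical hook script
      timeout: 120
      runas: root
```

A revision like this is then pushed with `aws deploy create-deployment` against the application's deployment group, and each codedeploy-agent applies it locally.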

Nevertheless, it's important to make sure deployments are backwards compatible in terms of the messages exchanged between the scheduler and the worker instances, just as with any asynchronous message-based system. More information on how to safely deploy DAG and configuration changes can be found in a dedicated document.

... under construction...

IAM Policy Selectivity

Implement a strong identity foundation

S3

Enable traceability

Secrets Manager

Apply security at all layers

RDS

Protect data in transit and at rest

Reliability

... under construction...

Performance Efficiency

... under construction...

Cost Optimization

... under construction...

Automated EC2 Lifecycle

Every EC2 instance goes through the following stages:

  1. Pending (cfn-init)
    1. Install Airflow + Celery
    2. Load secrets from the AWS SSM Parameter Store
    3. Depending on the instance, set up the appropriate service:
      • For scheduler instances, enable airflow-scheduler
      • For webserver instances, enable airflow-webserver
      • For celery worker instances, enable airflow-workerset
    4. Enable airflow-confapply-agent, a sidecar that restarts airflow services if:
      • Something changes in the airflow environment variables.
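The bootstrap stages above are driven by cfn-init metadata on each Auto Scaling Group's launch resource. A heavily trimmed sketch for a scheduler instance might look like the following; the command names, the secrets-loading helper, and the package spec are assumptions, not the stack's actual metadata.

```yaml
# Illustrative AWS::CloudFormation::Init sketch for a scheduler instance;
# commands and the load-ssm-secrets.sh helper are hypothetical.
Metadata:
  AWS::CloudFormation::Init:
    config:
      commands:
        01_install_airflow:
          command: pip install 'apache-airflow[celery]'
        02_load_secrets:
          command: /usr/local/bin/load-ssm-secrets.sh  # hypothetical helper
      services:
        sysvinit:
          airflow-scheduler:
            enabled: true
            ensureRunning: true
          airflow-confapply-agent:
            enabled: true
            ensureRunning: true
```

cfn-init runs the commands in lexicographic order and then ensures the listed services are enabled and running, which matches the Pending-stage sequence above.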

Shared EFS Mounting

Autoscaling

... under construction...