ECS Cluster Deployment

Deploy production-ready ECS clusters with AWS CDK.

Features

  • EC2 Capacity Provider with managed scaling (replaces custom SchedulableContainers metric)
  • Mixed Instances Policy for Spot support (replaces Autospotting)
  • Launch Templates with IMDSv2 and gp3 EBS volumes
  • Instance Draining via lifecycle hooks for graceful task migration
  • Optional Fargate capacity providers for serverless workloads
  • Internal/External ALBs with HTTPS support
  • Container Insights for monitoring
  • Automatic instance refresh via max instance lifetime

Quick Start

Minimal Jenkinsfile - Using CloudFormation Imports

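Minimal props: for networking, only vpcStackName is required. All VPC details are imported from the VPC stack's CloudFormation exports. The remaining required parameters are listed under Parameters Reference.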

@Library(["spicy-automation@main"]) _

spicyECSCluster(
    jenkinsAwsCredentialsId: "aws-credentials",
    region: "ca-central-1",
    stackName: "my-ecs-cluster",
    vpcStackName: "my-vpc",  // Auto-imports ALL VPC details (VPC ID, CIDR, subnets, AZs)
    ownerTag: "MyTeam",
    productTag: "my-product",
    componentTag: "ecs-cluster",
    environment: "dev"
)

What auto-imports from VPC stack:

  • VPC ID from ${vpcStackName}-VPCID
  • VPC CIDR from ${vpcStackName}-VPCCIDR
  • Number of AZs from ${vpcStackName}-NumberOfAZs
  • Private subnet IDs from ${vpcStackName}-PrivateSubnetA1ID, ${vpcStackName}-PrivateSubnetB1ID, etc.
  • Public subnet IDs from ${vpcStackName}-PublicSubnetAID, ${vpcStackName}-PublicSubnetBID, etc. (if createExternalLoadBalancer: true)
  • Availability zones auto-derived from region and number of AZs
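The naming pattern above can be sketched as a small helper. This is only an illustration of the convention, not the framework's code: the function name `vpc_import_names` is hypothetical, and it assumes the "etc." in the list means the subnet letters continue alphabetically (A, B, C, ...).

```python
from string import ascii_uppercase


def vpc_import_names(vpc_stack_name, number_of_azs, include_public=False):
    """Build the CloudFormation export names the cluster stack would import.

    Hypothetical helper: it only mirrors the naming pattern listed above and
    assumes subnet letters continue alphabetically for each extra AZ.
    """
    if not 1 <= number_of_azs <= len(ascii_uppercase):
        raise ValueError("number_of_azs must be between 1 and 26")

    letters = ascii_uppercase[:number_of_azs]
    names = {
        "vpc_id": f"{vpc_stack_name}-VPCID",
        "vpc_cidr": f"{vpc_stack_name}-VPCCIDR",
        "number_of_azs": f"{vpc_stack_name}-NumberOfAZs",
        "private_subnets": [
            f"{vpc_stack_name}-PrivateSubnet{letter}1ID" for letter in letters
        ],
    }
    # Public subnets are only needed when createExternalLoadBalancer is true.
    if include_public:
        names["public_subnets"] = [
            f"{vpc_stack_name}-PublicSubnet{letter}ID" for letter in letters
        ]
    return names


if __name__ == "__main__":
    for key, value in vpc_import_names("my-vpc", 2, include_public=True).items():
        print(key, value)
```

Listing the expected names this way can help when checking that a VPC stack actually exports everything the cluster stack will try to import.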

Production Jenkinsfile with All Options

@Library(["spicy-automation@main"]) _

spicyECSCluster(
    // AWS Configuration
    jenkinsAwsCredentialsId: "aws-credentials",
    region: "ca-central-1",
    accountId: "123456789012",
    stackName: "prod-ecs-cluster",

    // VPC Configuration - only vpcStackName required, all VPC details auto-import
    vpcStackName: "production-vpc",
    // VPC ID, CIDR, subnets, AZs, and numberOfAzs all auto-import from VPC stack exports

    // Tags
    ownerTag: "Platform",
    productTag: "spicy",
    componentTag: "ecs-cluster",
    environment: "prod",

    // Instance Configuration
    instanceType: "m5a.xlarge",
    additionalInstanceTypes: "m5.xlarge,m5d.xlarge,m5n.xlarge",
    keyName: "my-keypair",
    ebsVolumeSize: 100,

    // Scaling
    minClusterSize: 3,
    maxClusterSize: 10,
    targetCapacityPercent: 100,

    // Spot Configuration (for cost savings)
    spotEnabled: true,
    onDemandPercentage: 50,  // 50% On-Demand, 50% Spot
    spotAllocationStrategy: "capacity-optimized",

    // Load Balancers
    createExternalLoadBalancer: true,
    createInternalLoadBalancer: true,
    certificateArn: "arn:aws:acm:ca-central-1:123456789012:certificate/xxx",

    // Fargate (optional hybrid - enables both FARGATE and FARGATE_SPOT)
    enableFargate: false,

    // Timeouts
    drainingTimeout: 900,        // 15 minutes for task draining
    maxInstanceLifetime: 604800, // 7 days for instance refresh

    // Container Insights
    containerInsights: true,

    // Approval for production
    approvers: "admin,platform-team"
)

Parameters Reference

Required Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| jenkinsAwsCredentialsId | Jenkins credential ID for AWS | "aws-credentials" |
| region | AWS region | "ca-central-1" |
| stackName | CloudFormation stack name | "my-ecs-cluster" |
| vpcStackName | VPC stack name. All VPC details (VPC ID, CIDR, subnets, AZs) auto-import from VPC stack exports | "my-vpc" |
| ownerTag | Owner tag value | "MyTeam" |
| productTag | Product tag value | "my-product" |

Instance Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| instanceType | m5a.large | Primary EC2 instance type |
| additionalInstanceTypes | - | Additional instance types for Spot diversity (comma-separated) |
| keyName | - | EC2 key pair for SSH access |
| ebsVolumeSize | 100 | EBS volume size in GiB |
| containerInsights | true | Enable Container Insights |

Scaling Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| minClusterSize | 2 | Minimum number of instances |
| maxClusterSize | 4 | Maximum number of instances |
| targetCapacityPercent | 100 | Target utilization for managed scaling |

Spot Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| spotEnabled | false | Enable Spot instances |
| onDemandPercentage | 100 | Percentage of On-Demand (rest is Spot) |
| spotAllocationStrategy | capacity-optimized | Spot allocation strategy |

Spot Allocation Strategies:

  • capacity-optimized - Best for interruption avoidance (recommended)
  • lowest-price - Best for cost, higher interruption risk
  • capacity-optimized-prioritized - Prioritizes instance types you specify

Load Balancer Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| createExternalLoadBalancer | false | Create internet-facing ALB (public subnets auto-imported from VPC stack if enabled) |
| createInternalLoadBalancer | false | Create internal ALB |
| certificateArn | - | ACM certificate for HTTPS |

Fargate Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| enableFargate | false | Enable Fargate capacity providers (adds both FARGATE and FARGATE_SPOT) |

Lifecycle Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| drainingTimeout | 900 | Seconds to wait for task draining (15 minutes) |
| maxInstanceLifetime | 604800 | Maximum instance age in seconds (7 days) |

Environment-Specific Configuration

Development/Sandbox

spicyECSCluster(
    // ... base config ...
    environment: "dev",
    minClusterSize: 1,
    maxClusterSize: 2,
    spotEnabled: true,
    onDemandPercentage: 0   // 100% Spot for max savings
)

Staging

spicyECSCluster(
    // ... base config ...
    environment: "staging",
    minClusterSize: 2,
    maxClusterSize: 4,
    spotEnabled: true,
    onDemandPercentage: 20  // 80% Spot
)

Production

spicyECSCluster(
    // ... base config ...
    environment: "prod",
    minClusterSize: 3,
    maxClusterSize: 10,
    spotEnabled: true,
    onDemandPercentage: 50,  // 50% On-Demand baseline
    approvers: "admin,platform-team"
)

Stack Outputs

The stack exports these values for use by ECS services:

| Output | Export Name | Description |
| --- | --- | --- |
| ClusterName | {stackName}-cluster-name | ECS cluster name |
| ClusterArn | {stackName}-cluster-arn | ECS cluster ARN |
| VPC | {stackName}-VPC | VPC ID |
| ECSHostSecurityGroup | {stackName}-ecs-host-security-group | EC2 security group |
| AutoScalingGroupName | {stackName}-auto-scaling-group | ASG name |
| ExternalLoadBalancerDNS | {stackName}-internet-facing-url | External ALB DNS |
| ExternalLoadBalancerArn | {stackName}-internet-facing-arn | External ALB ARN |
| ExternalHTTPListenerArn | {stackName}-internet-facing-http-listener | External HTTP listener ARN |
| ExternalHTTPSListenerArn | {stackName}-internet-facing-https-listener | External HTTPS listener ARN |
| InternalLoadBalancerDNS | {stackName}-internal-url | Internal ALB DNS |
| InternalLoadBalancerArn | {stackName}-internal-arn | Internal ALB ARN |
| InternalHTTPListenerArn | {stackName}-internal-http-listener | Internal HTTP listener ARN |
| InternalHTTPSListenerArn | {stackName}-internal-https-listener | Internal HTTPS listener ARN |
| LogsBucketName | {stackName}-logs-s3-bucket | ALB access logs bucket |
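As a rough sketch of how a consuming service might resolve one of these exports, the helper below searches a list shaped like the "Exports" entries returned by CloudFormation's ListExports call (for example boto3's `list_exports`). The function name `find_export` and the sample values are hypothetical; fetching and paginating the real export list is left out.

```python
def find_export(exports, stack_name, suffix):
    """Return the value of the export named '{stack_name}-{suffix}'.

    `exports` is assumed to be a list of dicts with "Name" and "Value" keys,
    the shape used by CloudFormation ListExports responses.
    """
    wanted = f"{stack_name}-{suffix}"
    for export in exports:
        if export["Name"] == wanted:
            return export["Value"]
    raise KeyError(f"export not found: {wanted}")


if __name__ == "__main__":
    # Made-up sample data standing in for a real ListExports response.
    sample_exports = [
        {"Name": "prod-ecs-cluster-cluster-name", "Value": "prod-ecs-cluster"},
        {"Name": "prod-ecs-cluster-VPC", "Value": "vpc-0123456789abcdef0"},
    ]
    print(find_export(sample_exports, "prod-ecs-cluster", "cluster-name"))
```

Raising on a missing export, rather than returning an empty value, makes a disabled load balancer (and therefore an absent export) show up early in a service deployment.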

How It Works

Capacity Providers (Replaces Custom Scaling)

The cluster uses ECS Managed Scaling via Capacity Providers:

┌─────────────────────────────────────────────────────────┐
│                    ECS Cluster                          │
├─────────────────────────────────────────────────────────┤
│  Capacity Providers:                                    │
│  ┌─────────────────────────────────────────────────┐   │
│  │ EC2 Capacity Provider                           │   │
│  │ - Managed Scaling: ON                           │   │
│  │ - Target Capacity: 100%                         │   │
│  │ - Min Scaling Step: 1                           │   │
│  │ - Max Scaling Step: 10000                       │   │
│  └─────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────┐   │
│  │ FARGATE (optional)                              │   │
│  └─────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────┐   │
│  │ FARGATE_SPOT (optional)                         │   │
│  └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

This replaces the legacy SchedulableContainers Lambda metric with AWS-native scaling.
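To make the target capacity setting concrete, here is a small sketch of the CapacityProviderReservation metric that ECS managed scaling tracks: the number of instances needed divided by the number running, times 100. The function names are illustrative, and the two zero-instance special cases follow AWS's published description of cluster auto scaling as I understand it, so treat them as an assumption.

```python
def capacity_provider_reservation(needed, running):
    """Approximate CapacityProviderReservation = needed / running * 100."""
    if needed < 0 or running < 0:
        raise ValueError("instance counts must be non-negative")
    if running == 0:
        # Assumed special cases: nothing needed -> 100, something needed -> 200.
        return 100.0 if needed == 0 else 200.0
    return needed / running * 100


def scaling_direction(needed, running, target_capacity_percent=100):
    """Compare the metric with the target to see which way scaling would go."""
    reservation = capacity_provider_reservation(needed, running)
    if reservation > target_capacity_percent:
        return "scale out"
    if reservation < target_capacity_percent:
        return "scale in"
    return "steady"


if __name__ == "__main__":
    print(capacity_provider_reservation(6, 4), scaling_direction(6, 4))
    print(capacity_provider_reservation(2, 4), scaling_direction(2, 4))
```

With targetCapacityPercent at 100, the group settles when the instances running equal the instances needed. A lower target would keep spare capacity at the cost of idle instances.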

Mixed Instances Policy (Replaces Autospotting)

When spotEnabled: true:

┌─────────────────────────────────────────────────────────┐
│              Auto Scaling Group                         │
├─────────────────────────────────────────────────────────┤
│  Mixed Instances Policy:                                │
│  ┌─────────────────────────────────────────────────┐   │
│  │ On-Demand Base Capacity: 0                      │   │
│  │ On-Demand % Above Base: 50%                     │   │
│  │ Spot Allocation: capacity-optimized             │   │
│  └─────────────────────────────────────────────────┘   │
│                                                         │
│  Instance Type Overrides:                               │
│  - m5a.large (primary)                                  │
│  - m5.large                                             │
│  - m5d.large                                            │
│  - m5n.large                                            │
└─────────────────────────────────────────────────────────┘
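The On-Demand/Spot split in the diagram can be worked through with a little arithmetic. The sketch below is illustrative only: `split_capacity` is a hypothetical name, and it assumes fractional results are rounded up in favor of On-Demand, which should be checked against the Auto Scaling documentation before relying on exact counts.

```python
def split_capacity(desired, on_demand_percentage, on_demand_base=0):
    """Return (on_demand, spot) instance counts for a desired capacity.

    Assumption: the percentage applies to capacity above the base, and any
    fraction is rounded up toward On-Demand.
    """
    if desired < 0 or on_demand_base < 0:
        raise ValueError("capacities must be non-negative")
    if not 0 <= on_demand_percentage <= 100:
        raise ValueError("on_demand_percentage must be between 0 and 100")

    base = min(on_demand_base, desired)
    above_base = desired - base
    # Integer ceiling division avoids floating-point rounding surprises.
    on_demand_above = -(-above_base * on_demand_percentage // 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


if __name__ == "__main__":
    for desired in (3, 4, 10):
        print(desired, split_capacity(desired, 50))
```

For small clusters the rounding matters: under this assumption, a desired capacity of 3 at 50% yields 2 On-Demand and 1 Spot instance rather than an even split.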

Benefits over Autospotting:

  • No Lambda to maintain
  • Faster response (no polling delay)
  • Better capacity data (AWS-native)
  • Simpler architecture

Instance Draining

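Draining happens in two layers so that tasks are moved off an instance before it terminates:

  1. Spot Interruption Draining (native ECS):

    ECS_ENABLE_SPOT_INSTANCE_DRAINING=true

    When the two-minute Spot interruption notice arrives, the ECS agent sets the instance to DRAINING so its tasks are rescheduled elsewhere.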

  2. Lifecycle Hook Draining (Lambda):

    • ASG sends termination event to SNS
    • Lambda sets instance to DRAINING
    • Waits for running tasks to migrate
    • Completes lifecycle action
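The lifecycle hook steps above can be sketched roughly as follows. This is not the framework's Lambda: the function name `drain_instance` and its parameters are hypothetical, and the ECS and Auto Scaling clients are passed in (for example boto3 clients) so the flow is easy to follow and to exercise with stand-ins. A real Lambda would more likely re-trigger itself than sleep in a loop, because of the Lambda execution time limit.

```python
import time


def drain_instance(ecs, autoscaling, cluster, container_instance_arn, hook,
                   poll_seconds=15, timeout_seconds=900, sleep=time.sleep):
    """Drain one container instance, then release the ASG lifecycle hook.

    `hook` is assumed to be the parsed lifecycle notification, carrying
    LifecycleHookName, AutoScalingGroupName, LifecycleActionToken and
    EC2InstanceId. Returns the number of seconds spent waiting.
    """
    # Step 1: stop new task placement and start moving tasks away.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )

    # Step 2: wait until no tasks remain on the instance, or the timeout hits.
    waited = 0
    while waited < timeout_seconds:
        response = ecs.list_tasks(
            cluster=cluster, containerInstance=container_instance_arn
        )
        if not response["taskArns"]:
            break
        sleep(poll_seconds)
        waited += poll_seconds

    # Step 3: tell the Auto Scaling group it may proceed with termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook["LifecycleHookName"],
        AutoScalingGroupName=hook["AutoScalingGroupName"],
        LifecycleActionToken=hook["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=hook["EC2InstanceId"],
    )
    return waited
```

In this sketch the `timeout_seconds` default mirrors the documented drainingTimeout default of 900. How the real Lambda applies that value is not shown in this document.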

Launch Template Features

  • IMDSv2 Required: Enhanced metadata security
  • gp3 EBS Volumes: Better performance, lower cost than gp2
  • Encrypted Volumes: EBS encryption enabled
  • SSM Agent: Pre-installed for Session Manager access

Migrating from Legacy Automation

Parameter Mapping

| Legacy (Ansible) | New (CDK) |
| --- | --- |
| stackName | stackName |
| instanceType | instanceType |
| minClusterSize | minClusterSize |
| maxClusterSize | maxClusterSize |
| spotEnabled | spotEnabled |
| minOnDemandPercentage | onDemandPercentage |
| largestContainerCPUReservation | (not needed - managed scaling) |
| largestContainerMemoryReservation | (not needed - managed scaling) |
| clusterScaleUpAdjustment | (not needed - managed scaling) |
| clusterScaleDownAdjustment | (not needed - managed scaling) |
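When converting an existing parameter set, the mapping above can be applied mechanically. The sketch below is a hypothetical helper, not part of the framework; it renames what the table renames, drops what managed scaling makes unnecessary, and reports anything the table does not mention so it can be reviewed by hand.

```python
# Legacy parameter name -> new parameter name, taken from the mapping above.
LEGACY_TO_CDK = {
    "stackName": "stackName",
    "instanceType": "instanceType",
    "minClusterSize": "minClusterSize",
    "maxClusterSize": "maxClusterSize",
    "spotEnabled": "spotEnabled",
    "minOnDemandPercentage": "onDemandPercentage",
}

# Legacy parameters that managed scaling makes unnecessary.
NOT_NEEDED = {
    "largestContainerCPUReservation",
    "largestContainerMemoryReservation",
    "clusterScaleUpAdjustment",
    "clusterScaleDownAdjustment",
}


def migrate_params(legacy):
    """Split legacy parameters into (migrated, dropped, unknown)."""
    migrated, dropped, unknown = {}, [], []
    for name, value in legacy.items():
        if name in LEGACY_TO_CDK:
            migrated[LEGACY_TO_CDK[name]] = value
        elif name in NOT_NEEDED:
            dropped.append(name)
        else:
            unknown.append(name)
    return migrated, dropped, unknown


if __name__ == "__main__":
    print(migrate_params({
        "stackName": "legacy-ecs",
        "minOnDemandPercentage": 50,
        "clusterScaleUpAdjustment": 2,
    }))
```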

Removed Features

These legacy features are no longer needed:

  • SchedulableContainers Lambda: Replaced by Capacity Provider managed scaling
  • Autospotting: Replaced by Mixed Instances Policy
  • Launch Configurations: Replaced by Launch Templates
  • gp2 volumes: Upgraded to gp3
  • IMDSv1: Now requires IMDSv2

Troubleshooting

Instances Not Joining Cluster

Check the ECS agent logs:

docker logs ecs-agent
cat /var/log/ecs/ecs-agent.log

Verify cluster name in user data:

cat /etc/ecs/ecs.config
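If the check needs to be scripted, the file's KEY=VALUE lines can be parsed and the ECS_CLUSTER entry compared with the expected cluster name. This is a minimal sketch with hypothetical function names; it assumes the file contains only simple KEY=VALUE lines and comments.

```python
def parse_ecs_config(text):
    """Parse KEY=VALUE lines from ecs.config content into a dict."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def cluster_matches(text, expected_cluster):
    """True when ECS_CLUSTER in the config equals the expected cluster name."""
    return parse_ecs_config(text).get("ECS_CLUSTER") == expected_cluster


if __name__ == "__main__":
    with open("/etc/ecs/ecs.config") as config_file:
        print(parse_ecs_config(config_file.read()))
```

A missing or mismatched ECS_CLUSTER value usually means the agent registered with a different cluster (often the one named "default"), which would explain why the instance never appears in the expected one.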

Tasks Not Draining

Check Lambda logs in CloudWatch:

/aws/lambda/{stackName}-DrainingLambda

Spot Interruptions

Monitor with CloudWatch metrics:

  • AWS/EC2Spot: InterruptionRate
  • AWS/ECS: CPUReservation, MemoryReservation

Consider increasing onDemandPercentage for critical workloads.

Cost Optimization Tips

  1. Use Spot in non-prod: spotEnabled: true with onDemandPercentage: 0
  2. Multiple instance types: Better Spot availability
  3. Right-size instances: Match to your container sizes
  4. Enable Fargate Spot: For batch/background tasks
  5. Set max instance lifetime: Force instance refresh for patches