spicy-automation/docs/ECS_CLUSTER.md
Ryan Wilson 68684df471 Initial commit: Spicy CDK automation framework
Jenkins shared library and CDK constructs for AWS infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-11-18 22:21:00 -08:00

# ECS Cluster Deployment
Deploy production-ready ECS clusters with AWS CDK.
## Features
- **EC2 Capacity Provider** with managed scaling (replaces custom SchedulableContainers metric)
- **Mixed Instances Policy** for Spot support (replaces Autospotting)
- **Launch Templates** with IMDSv2 and gp3 EBS volumes
- **Instance Draining** via lifecycle hooks for graceful task migration
- **Optional Fargate** capacity providers for serverless workloads
- **Internal/External ALBs** with HTTPS support
- **Container Insights** for monitoring
- **Automatic instance refresh** via max instance lifetime
## Quick Start
### Minimal Jenkinsfile - Using CloudFormation Imports
**Minimal props:** Of the VPC settings, only `vpcStackName` is required. All other VPC details are imported automatically from the VPC stack's CloudFormation exports.
```groovy
@Library(["spicy-automation@main"]) _
spicyECSCluster(
    jenkinsAwsCredentialsId: "aws-credentials",
    region: "ca-central-1",
    stackName: "my-ecs-cluster",
    vpcStackName: "my-vpc", // Auto-imports ALL VPC details (VPC ID, CIDR, subnets, AZs)
    ownerTag: "MyTeam",
    productTag: "my-product",
    componentTag: "ecs-cluster",
    environment: "dev"
)
```
**Values auto-imported from the VPC stack:**
- VPC ID from `${vpcStackName}-VPCID`
- VPC CIDR from `${vpcStackName}-VPCCIDR`
- Number of AZs from `${vpcStackName}-NumberOfAZs`
- Private subnet IDs from `${vpcStackName}-PrivateSubnetA1ID`, `${vpcStackName}-PrivateSubnetB1ID`, etc.
- Public subnet IDs from `${vpcStackName}-PublicSubnetAID`, `${vpcStackName}-PublicSubnetBID`, etc. (if `createExternalLoadBalancer: true`)
- Availability zones auto-derived from region and number of AZs
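Before a first deploy it can help to confirm that the VPC stack really publishes these exports. The sketch below is not part of the library: `vpc_export_names` and `check_vpc_exports` are hypothetical helpers, and the suffixes beyond the A and B subnets (for example `PrivateSubnetC1ID`) are assumed to follow the same lettering pattern.

```bash
# Hypothetical helper: print the export names the cluster stack is expected
# to import. Names past the "B" subnets assume the same lettering pattern.
# Usage: vpc_export_names <vpcStackName> <numberOfAzs> [include_public]
vpc_export_names() {
  local stack="$1" azs="$2" public="${3:-false}" i=0 letter
  printf '%s-VPCID\n%s-VPCCIDR\n%s-NumberOfAZs\n' "$stack" "$stack" "$stack"
  for letter in A B C D E F; do
    if [ "$i" -ge "$azs" ]; then break; fi
    printf '%s-PrivateSubnet%s1ID\n' "$stack" "$letter"
    if [ "$public" = "true" ]; then
      printf '%s-PublicSubnet%sID\n' "$stack" "$letter"
    fi
    i=$((i + 1))
  done
}

# Report every expected export that CloudFormation does not list in the region.
# Usage: check_vpc_exports <vpcStackName> <numberOfAzs> <region> [include_public]
check_vpc_exports() {
  local existing name
  existing="$(aws cloudformation list-exports --region "$3" \
    --query 'Exports[].Name' --output text | tr '\t' '\n')"
  vpc_export_names "$1" "$2" "${4:-false}" | while read -r name; do
    printf '%s\n' "$existing" | grep -qxF -- "$name" || echo "missing export: $name"
  done
}
```

For example, `check_vpc_exports my-vpc 2 ca-central-1 true` prints one line per missing export and nothing when all of them exist.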
### Production Jenkinsfile with All Options
```groovy
@Library(["spicy-automation@main"]) _
spicyECSCluster(
    // AWS Configuration
    jenkinsAwsCredentialsId: "aws-credentials",
    region: "ca-central-1",
    accountId: "123456789012",
    stackName: "prod-ecs-cluster",
    // VPC Configuration - only vpcStackName required, all VPC details auto-import
    vpcStackName: "production-vpc",
    // VPC ID, CIDR, subnets, AZs, and numberOfAzs all auto-import from VPC stack exports
    // Tags
    ownerTag: "Platform",
    productTag: "spicy",
    componentTag: "ecs-cluster",
    environment: "prod",
    // Instance Configuration
    instanceType: "m5a.xlarge",
    additionalInstanceTypes: "m5.xlarge,m5d.xlarge,m5n.xlarge",
    keyName: "my-keypair",
    ebsVolumeSize: 100,
    // Scaling
    minClusterSize: 3,
    maxClusterSize: 10,
    targetCapacityPercent: 100,
    // Spot Configuration (for cost savings)
    spotEnabled: true,
    onDemandPercentage: 50, // 50% On-Demand, 50% Spot
    spotAllocationStrategy: "capacity-optimized",
    // Load Balancers
    createExternalLoadBalancer: true,
    createInternalLoadBalancer: true,
    certificateArn: "arn:aws:acm:ca-central-1:123456789012:certificate/xxx",
    // Fargate (optional hybrid - enables both FARGATE and FARGATE_SPOT)
    enableFargate: false,
    // Timeouts
    drainingTimeout: 900, // 15 minutes for task draining
    maxInstanceLifetime: 604800, // 7 days for instance refresh
    // Container Insights
    containerInsights: true,
    // Approval for production
    approvers: "admin,platform-team"
)
```
## Parameters Reference
### Required Parameters
| Parameter | Description | Example |
| ------------------------- | -------------------------------------------------------------------------------------------------------------- | ------------------- |
| `jenkinsAwsCredentialsId` | Jenkins credential ID for AWS | `"aws-credentials"` |
| `region` | AWS region | `"ca-central-1"` |
| `stackName` | CloudFormation stack name | `"my-ecs-cluster"` |
| `vpcStackName` | VPC stack name - **required**. All VPC details (VPC ID, CIDR, subnets, AZs) auto-import from VPC stack exports | `"my-vpc"` |
| `ownerTag` | Owner tag value | `"MyTeam"` |
| `productTag` | Product tag value | `"my-product"` |
### Instance Configuration
| Parameter | Default | Description |
| ------------------------- | ----------- | ----------------------------------- |
| `instanceType` | `m5a.large` | Primary EC2 instance type |
| `additionalInstanceTypes` | - | Additional types for Spot diversity |
| `keyName` | - | EC2 key pair for SSH access |
| `ebsVolumeSize` | `100` | EBS volume size in GB |
| `containerInsights` | `true` | Enable Container Insights |
### Scaling Configuration
| Parameter | Default | Description |
| ----------------------- | ------- | -------------------------------------- |
| `minClusterSize` | `2` | Minimum number of instances |
| `maxClusterSize` | `4` | Maximum number of instances |
| `targetCapacityPercent` | `100` | Target utilization for managed scaling |
### Spot Configuration
| Parameter | Default | Description |
| ------------------------ | -------------------- | -------------------------------------- |
| `spotEnabled` | `false` | Enable Spot instances |
| `onDemandPercentage` | `100` | Percentage of On-Demand (rest is Spot) |
| `spotAllocationStrategy` | `capacity-optimized` | Spot allocation strategy |
**Spot Allocation Strategies:**
- `capacity-optimized` - Best for interruption avoidance (recommended)
- `lowest-price` - Best for cost, higher interruption risk
- `capacity-optimized-prioritized` - Prioritizes instance types you specify
### Load Balancer Configuration
| Parameter | Default | Description |
| ---------------------------- | ------- | ----------------------------------------------------------------------------------- |
| `createExternalLoadBalancer` | `false` | Create internet-facing ALB (public subnets auto-imported from VPC stack if enabled) |
| `createInternalLoadBalancer` | `false` | Create internal ALB |
| `certificateArn` | - | ACM certificate for HTTPS |
### Fargate Configuration
| Parameter | Default | Description |
| --------------- | ------- | ---------------------------------------------------------------------- |
| `enableFargate` | `false` | Enable Fargate capacity providers (adds both FARGATE and FARGATE_SPOT) |
### Lifecycle Configuration
| Parameter | Default | Description |
| --------------------- | -------- | --------------------------------- |
| `drainingTimeout` | `900` | Seconds to wait for task draining |
| `maxInstanceLifetime` | `604800` | Max instance age (7 days) |
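Both lifecycle values are given in seconds, so converting from days by hand is an easy place to slip. A minimal sketch, assuming the value is passed straight through to the Auto Scaling group, which rejects a maximum instance lifetime shorter than one day (86400 seconds); both function names are hypothetical and not part of the library.

```bash
# Convert whole days to seconds, e.g. for maxInstanceLifetime.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# Hypothetical pre-flight check: succeed only if the lifetime is at least
# one day, the minimum EC2 Auto Scaling accepts for this setting.
valid_max_instance_lifetime() {
  [ "$1" -ge 86400 ]
}
```

`days_to_seconds 7` yields the default of `604800`.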
## Environment-Specific Configuration
### Development/Sandbox
```groovy
spicyECSCluster(
    // ... base config ...
    environment: "dev",
    minClusterSize: 1,
    maxClusterSize: 2,
    spotEnabled: true,
    onDemandPercentage: 0 // 100% Spot for max savings
)
```
### Staging
```groovy
spicyECSCluster(
    // ... base config ...
    environment: "staging",
    minClusterSize: 2,
    maxClusterSize: 4,
    spotEnabled: true,
    onDemandPercentage: 20 // 80% Spot
)
```
### Production
```groovy
spicyECSCluster(
    // ... base config ...
    environment: "prod",
    minClusterSize: 3,
    maxClusterSize: 10,
    spotEnabled: true,
    onDemandPercentage: 50, // 50% On-Demand baseline
    approvers: "admin,platform-team"
)
```
## Stack Outputs
The stack exports these values for use by ECS services:
| Output | Export Name | Description |
| -------------------------- | -------------------------------------------- | ---------------------- |
| `ClusterName` | `{stackName}-cluster-name` | ECS cluster name |
| `ClusterArn` | `{stackName}-cluster-arn` | ECS cluster ARN |
| `VPC` | `{stackName}-VPC` | VPC ID |
| `ECSHostSecurityGroup` | `{stackName}-ecs-host-security-group` | EC2 security group |
| `AutoScalingGroupName` | `{stackName}-auto-scaling-group` | ASG name |
| `ExternalLoadBalancerDNS` | `{stackName}-internet-facing-url` | External ALB DNS |
| `ExternalLoadBalancerArn` | `{stackName}-internet-facing-arn` | External ALB ARN |
| `ExternalHTTPListenerArn` | `{stackName}-internet-facing-http-listener` | HTTP listener ARN |
| `ExternalHTTPSListenerArn` | `{stackName}-internet-facing-https-listener` | HTTPS listener ARN |
| `InternalLoadBalancerDNS` | `{stackName}-internal-url` | Internal ALB DNS |
| `InternalLoadBalancerArn` | `{stackName}-internal-arn` | Internal ALB ARN |
| `InternalHTTPListenerArn` | `{stackName}-internal-http-listener` | HTTP listener ARN |
| `InternalHTTPSListenerArn` | `{stackName}-internal-https-listener` | HTTPS listener ARN |
| `LogsBucketName` | `{stackName}-logs-s3-bucket` | ALB access logs bucket |
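Service stacks normally consume these exports through CloudFormation imports, but looking one up from a shell is handy when debugging. The sketch below assumes the AWS CLI is installed and credentialed; `cluster_export_name` and `cluster_export_value` are hypothetical helper names.

```bash
# Build the export name for one of the outputs in the table above.
# Usage: cluster_export_name <stackName> <suffix>
cluster_export_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Look up the exported value with the AWS CLI.
# Usage: cluster_export_value <stackName> <suffix> <region>
cluster_export_value() {
  local name
  name="$(cluster_export_name "$1" "$2")"
  aws cloudformation list-exports --region "$3" \
    --query "Exports[?Name=='${name}'].Value" --output text
}
```

For example, `cluster_export_value prod-ecs-cluster cluster-arn ca-central-1` prints the cluster ARN if the export exists.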
## How It Works
### Capacity Providers (Replaces Custom Scaling)
The cluster uses **ECS Managed Scaling** via Capacity Providers:
```
┌─────────────────────────────────────────────────────────┐
│ ECS Cluster │
├─────────────────────────────────────────────────────────┤
│ Capacity Providers: │
│ ┌─────────────────────────────────────────────────┐ │
│ │ EC2 Capacity Provider │ │
│ │ - Managed Scaling: ON │ │
│ │ - Target Capacity: 100% │ │
│ │ - Min Scaling Step: 1 │ │
│ │ - Max Scaling Step: 10000 │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ FARGATE (optional) │ │
│ └─────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ FARGATE_SPOT (optional) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
This replaces the legacy `SchedulableContainers` Lambda metric with AWS-native scaling.
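To see what `targetCapacityPercent` does, recall that managed scaling tracks the `CapacityProviderReservation` metric, defined as instances needed divided by instances running, times 100. The sketch below is a simplified model of that arithmetic, not the ECS implementation: it ignores scaling step sizes and warm-up, and the function name is hypothetical.

```bash
# Simplified model: how many instances keep CapacityProviderReservation at
# the target, clamped to the cluster size bounds.
# Usage: desired_instances <needed> <targetCapacityPercent> <min> <max>
desired_instances() {
  local needed="$1" target="$2" min="$3" max="$4" n
  # ceil(needed * 100 / target) using integer arithmetic
  n=$(( (needed * 100 + target - 1) / target ))
  if [ "$n" -lt "$min" ]; then n="$min"; fi
  if [ "$n" -gt "$max" ]; then n="$max"; fi
  echo "$n"
}
```

With a target of 100 the group runs exactly as many instances as the tasks need (`desired_instances 8 100 2 10` gives 8); a target of 80 keeps roughly 20% spare capacity (`desired_instances 8 80 2 10` gives 10).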
### Mixed Instances Policy (Replaces Autospotting)
When `spotEnabled: true`:
```
┌─────────────────────────────────────────────────────────┐
│ Auto Scaling Group │
├─────────────────────────────────────────────────────────┤
│ Mixed Instances Policy: │
│ ┌─────────────────────────────────────────────────┐ │
│ │ On-Demand Base Capacity: 0 │ │
│ │ On-Demand % Above Base: 50% │ │
│ │ Spot Allocation: capacity-optimized │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Instance Type Overrides: │
│ - m5a.large (primary) │
│ - m5.large │
│ - m5d.large │
│ - m5n.large │
└─────────────────────────────────────────────────────────┘
```
Benefits over Autospotting:
- No Lambda to maintain
- Faster response (no polling delay)
- Better capacity data (AWS-native)
- Simpler architecture
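The On-Demand/Spot split in the policy above comes down to a little arithmetic. The sketch below assumes fractional results round up in favor of On-Demand, which is how EC2 Auto Scaling is documented to behave; treat it as an estimate and check the real group. `spot_split` is a hypothetical name.

```bash
# Estimate the On-Demand/Spot split for a mixed instances policy.
# Usage: spot_split <desired> <onDemandBase> <onDemandPercentAboveBase>
spot_split() {
  local desired="$1" base="$2" pct="$3" above on_demand
  if [ "$base" -gt "$desired" ]; then base="$desired"; fi
  above=$(( desired - base ))
  # Percentage applies only above the base; round up toward On-Demand.
  on_demand=$(( base + (above * pct + 99) / 100 ))
  echo "on_demand=${on_demand} spot=$(( desired - on_demand ))"
}
```

For the diagram's settings, `spot_split 10 0 50` prints `on_demand=5 spot=5`, while a three-instance group gives `on_demand=2 spot=1` because of the rounding.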
### Instance Draining
Draining happens in two layers so that tasks move off an instance before it terminates:
1. **Spot Interruption Draining** (native ECS), enabled in `/etc/ecs/ecs.config`:
   ```bash
   ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
   ```
   The ECS agent sets the instance to DRAINING when it receives the two-minute Spot interruption notice.
2. **Lifecycle Hook Draining** (Lambda):
   - The ASG sends the termination event to SNS
   - The Lambda sets the instance to DRAINING
   - It waits for running tasks to migrate
   - It completes the lifecycle action
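The lifecycle hook sequence can be reproduced by hand with the AWS CLI, which is useful when testing a hook or unblocking a stuck termination. This is a sketch of the same steps, not the library's Lambda: `drain_instance`, its arguments, and the 15-second poll interval are all assumptions.

```bash
# Manually walk one instance through the drain sequence.
# Usage: drain_instance <cluster> <instance-id> <asg-name> <hook-name> [timeout-seconds]
drain_instance() {
  local cluster="$1" instance="$2" asg="$3" hook="$4" timeout="${5:-900}"
  local arn running waited=0

  # Find the ECS container instance backed by this EC2 instance.
  arn="$(aws ecs list-container-instances --cluster "$cluster" \
    --filter "ec2InstanceId in ['${instance}']" \
    --query 'containerInstanceArns[0]' --output text)"
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    echo "no container instance found for ${instance}" >&2
    return 1
  fi

  # Stop new task placement; ECS reschedules service tasks elsewhere.
  aws ecs update-container-instances-state --cluster "$cluster" \
    --container-instances "$arn" --status DRAINING >/dev/null

  # Poll until no tasks remain or the draining timeout elapses.
  while [ "$waited" -lt "$timeout" ]; do
    running="$(aws ecs list-tasks --cluster "$cluster" \
      --container-instance "$arn" --query 'length(taskArns)' --output text)"
    if [ "$running" -eq 0 ]; then break; fi
    sleep 15
    waited=$(( waited + 15 ))
  done

  # Tell the Auto Scaling group that termination may proceed.
  aws autoscaling complete-lifecycle-action \
    --auto-scaling-group-name "$asg" --lifecycle-hook-name "$hook" \
    --instance-id "$instance" --lifecycle-action-result CONTINUE
}
```

Note that DRAINING only reschedules tasks that belong to a service; standalone tasks keep running until they exit or the timeout is reached.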
### Launch Template Features
- **IMDSv2 Required**: Enhanced metadata security
- **gp3 EBS Volumes**: Better performance, lower cost than gp2
- **Encrypted Volumes**: EBS encryption enabled
- **SSM Agent**: Pre-installed for Session Manager access
## Migrating from Legacy Automation
### Parameter Mapping
| Legacy (Ansible) | New (CDK) |
| ----------------------------------- | ------------------------------ |
| `stackName` | `stackName` |
| `instanceType` | `instanceType` |
| `minClusterSize` | `minClusterSize` |
| `maxClusterSize` | `maxClusterSize` |
| `spotEnabled` | `spotEnabled` |
| `minOnDemandPercentage` | `onDemandPercentage` |
| `largestContainerCPUReservation` | (not needed - managed scaling) |
| `largestContainerMemoryReservation` | (not needed - managed scaling) |
| `clusterScaleUpAdjustment` | (not needed - managed scaling) |
| `clusterScaleDownAdjustment` | (not needed - managed scaling) |
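When converting many legacy job definitions, the table above can be applied mechanically. The helper below is a hypothetical sketch that encodes only the rows shown; it prints the new parameter name, prints nothing for parameters that managed scaling made unnecessary, and fails for anything not in the table.

```bash
# Usage: map_legacy_param <legacy-name>
map_legacy_param() {
  case "$1" in
    stackName|instanceType|minClusterSize|maxClusterSize|spotEnabled)
      echo "$1" ;;                      # unchanged
    minOnDemandPercentage)
      echo "onDemandPercentage" ;;      # renamed
    largestContainerCPUReservation|largestContainerMemoryReservation)
      ;;                                # dropped: managed scaling
    clusterScaleUpAdjustment|clusterScaleDownAdjustment)
      ;;                                # dropped: managed scaling
    *)
      echo "unknown legacy parameter: $1" >&2
      return 1 ;;
  esac
}
```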
### Removed Features
These legacy features are no longer needed:
- **SchedulableContainers Lambda**: Replaced by Capacity Provider managed scaling
- **Autospotting**: Replaced by Mixed Instances Policy
- **Launch Configurations**: Replaced by Launch Templates
- **gp2 volumes**: Upgraded to gp3
- **IMDSv1**: Now requires IMDSv2
## Troubleshooting
### Instances Not Joining Cluster
Check the ECS agent logs:
```bash
docker logs ecs-agent
cat /var/log/ecs/ecs-agent.log
```
Verify cluster name in user data:
```bash
cat /etc/ecs/ecs.config
```
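A quick way to compare the configured cluster against the one you expect is to read `ECS_CLUSTER` from that file. The helper below is a hypothetical sketch; the expected name is supplied by the caller (for example, the value of the `{stackName}-cluster-name` export). If the key is missing, the agent registers with the cluster named `default`.

```bash
# Usage: check_ecs_cluster <expected-cluster-name> [config-file]
check_ecs_cluster() {
  local expected="$1" file="${2:-/etc/ecs/ecs.config}" actual
  # Take the last ECS_CLUSTER= line in case user data appended more than one.
  actual="$(sed -n 's/^ECS_CLUSTER=//p' "$file" | tail -n 1)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: agent is configured for cluster ${expected}"
  else
    echo "mismatch: ECS_CLUSTER='${actual}', expected '${expected}'"
    return 1
  fi
}
```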
### Tasks Not Draining
Check Lambda logs in CloudWatch:
```
/aws/lambda/{stackName}-DrainingLambda
```
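The log group can be followed from a terminal as well. This sketch assumes AWS CLI v2, which provides `aws logs tail`, and that the log group follows the naming shown above; both function names are hypothetical.

```bash
# Build the draining Lambda's log group name from the stack name.
draining_log_group() {
  printf '/aws/lambda/%s-DrainingLambda\n' "$1"
}

# Follow the last hour of draining logs.
# Usage: tail_draining_logs <stackName> <region>
tail_draining_logs() {
  aws logs tail "$(draining_log_group "$1")" --region "$2" --since 1h --follow
}
```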
### Spot Interruptions
Track cluster headroom and interruption events:
- `AWS/ECS` → `CPUReservation`, `MemoryReservation` (CloudWatch metrics)
- EventBridge → `EC2 Spot Instance Interruption Warning` events from source `aws.ec2`
Consider increasing `onDemandPercentage` for critical workloads.
## Cost Optimization Tips
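1. **Use Spot in non-prod**: `spotEnabled: true` with `onDemandPercentage: 0`
2. **Multiple instance types**: List several in `additionalInstanceTypes` for better Spot availability
3. **Right-size instances**: Match `instanceType` to your container sizes
4. **Enable Fargate Spot**: `enableFargate: true` adds `FARGATE_SPOT` for batch/background tasks
5. **Set max instance lifetime**: `maxInstanceLifetime` forces instance refresh so patches roll out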