Initial commit: Spicy CDK automation framework
Jenkins shared library and CDK constructs for AWS infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
384
docs/ECS_CLUSTER.md
Normal file
@@ -0,0 +1,384 @@
# ECS Cluster Deployment

Deploy production-ready ECS clusters with AWS CDK.

## Features

- **EC2 Capacity Provider** with managed scaling (replaces custom SchedulableContainers metric)
- **Mixed Instances Policy** for Spot support (replaces Autospotting)
- **Launch Templates** with IMDSv2 and gp3 EBS volumes
- **Instance Draining** via lifecycle hooks for graceful task migration
- **Optional Fargate** capacity providers for serverless workloads
- **Internal/External ALBs** with HTTPS support
- **Container Insights** for monitoring
- **Automatic instance refresh** via max instance lifetime

## Quick Start

### Minimal Jenkinsfile - Using CloudFormation Imports

**Minimal props:** For VPC configuration, only `vpcStackName` is required. All other VPC details are imported from the VPC stack's CloudFormation exports.

```groovy
@Library(["spicy-automation@main"]) _

spicyECSCluster(
    jenkinsAwsCredentialsId: "aws-credentials",
    region: "ca-central-1",
    stackName: "my-ecs-cluster",
    vpcStackName: "my-vpc", // Auto-imports ALL VPC details (VPC ID, CIDR, subnets, AZs)
    ownerTag: "MyTeam",
    productTag: "my-product",
    componentTag: "ecs-cluster",
    environment: "dev"
)
```

**What is imported from the VPC stack:**

- VPC ID from `${vpcStackName}-VPCID`
- VPC CIDR from `${vpcStackName}-VPCCIDR`
- Number of AZs from `${vpcStackName}-NumberOfAZs`
- Private subnet IDs from `${vpcStackName}-PrivateSubnetA1ID`, `${vpcStackName}-PrivateSubnetB1ID`, etc.
- Public subnet IDs from `${vpcStackName}-PublicSubnetAID`, `${vpcStackName}-PublicSubnetBID`, etc. (only if `createExternalLoadBalancer: true`)
- Availability zones, derived from the region and the number of AZs

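Before deploying, it can help to list the export names the cluster stack will look for and compare them with what the VPC stack actually exports. The sketch below is illustrative only: `vpc_export_names` is a hypothetical helper that is not part of the framework, and the A, B, C lettering for additional AZs is an assumption extrapolated from the "etc." in the list above.

```bash
#!/usr/bin/env bash
# Print the CloudFormation export names the cluster stack is expected to import.
# Hypothetical helper; the per-AZ lettering (A, B, C, ...) is an assumption.
# usage: vpc_export_names <vpcStackName> <numberOfAzs> [includePublic=true|false]
vpc_export_names() {
  local vpc_stack="$1" num_azs="$2" include_public="${3:-false}"
  local letter count

  printf '%s-VPCID\n%s-VPCCIDR\n%s-NumberOfAZs\n' \
    "$vpc_stack" "$vpc_stack" "$vpc_stack"

  # One private subnet export per AZ.
  count=0
  for letter in A B C D E F; do
    if [ "$count" -ge "$num_azs" ]; then break; fi
    printf '%s-PrivateSubnet%s1ID\n' "$vpc_stack" "$letter"
    count=$((count + 1))
  done

  # Public subnets are only needed when an external load balancer is created.
  if [ "$include_public" = "true" ]; then
    count=0
    for letter in A B C D E F; do
      if [ "$count" -ge "$num_azs" ]; then break; fi
      printf '%s-PublicSubnet%sID\n' "$vpc_stack" "$letter"
      count=$((count + 1))
    done
  fi
}

vpc_export_names my-vpc 2 true
```

The printed names can then be checked against the output of `aws cloudformation list-exports` for the target account and region.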
### Production Jenkinsfile with All Options

```groovy
@Library(["spicy-automation@main"]) _

spicyECSCluster(
    // AWS Configuration
    jenkinsAwsCredentialsId: "aws-credentials",
    region: "ca-central-1",
    accountId: "123456789012",
    stackName: "prod-ecs-cluster",

    // VPC Configuration - only vpcStackName required, all VPC details auto-import
    vpcStackName: "production-vpc",
    // VPC ID, CIDR, subnets, AZs, and numberOfAzs all auto-import from VPC stack exports

    // Tags
    ownerTag: "Platform",
    productTag: "spicy",
    componentTag: "ecs-cluster",
    environment: "prod",

    // Instance Configuration
    instanceType: "m5a.xlarge",
    additionalInstanceTypes: "m5.xlarge,m5d.xlarge,m5n.xlarge",
    keyName: "my-keypair",
    ebsVolumeSize: 100,

    // Scaling
    minClusterSize: 3,
    maxClusterSize: 10,
    targetCapacityPercent: 100,

    // Spot Configuration (for cost savings)
    spotEnabled: true,
    onDemandPercentage: 50, // 50% On-Demand, 50% Spot
    spotAllocationStrategy: "capacity-optimized",

    // Load Balancers
    createExternalLoadBalancer: true,
    createInternalLoadBalancer: true,
    certificateArn: "arn:aws:acm:ca-central-1:123456789012:certificate/xxx",

    // Fargate (optional hybrid - enables both FARGATE and FARGATE_SPOT)
    enableFargate: false,

    // Timeouts
    drainingTimeout: 900, // 15 minutes for task draining
    maxInstanceLifetime: 604800, // 7 days for instance refresh

    // Container Insights
    containerInsights: true,

    // Approval for production
    approvers: "admin,platform-team"
)
```

## Parameters Reference

### Required Parameters

| Parameter                 | Description                                                                                                    | Example             |
| ------------------------- | -------------------------------------------------------------------------------------------------------------- | ------------------- |
| `jenkinsAwsCredentialsId` | Jenkins credential ID for AWS                                                                                  | `"aws-credentials"` |
| `region`                  | AWS region                                                                                                     | `"ca-central-1"`    |
| `stackName`               | CloudFormation stack name                                                                                      | `"my-ecs-cluster"`  |
| `vpcStackName`            | VPC stack name - **required**. All VPC details (VPC ID, CIDR, subnets, AZs) auto-import from VPC stack exports | `"my-vpc"`          |
| `ownerTag`                | Owner tag value                                                                                                | `"MyTeam"`          |
| `productTag`              | Product tag value                                                                                              | `"my-product"`      |

### Instance Configuration

| Parameter                 | Default     | Description                         |
| ------------------------- | ----------- | ----------------------------------- |
| `instanceType`            | `m5a.large` | Primary EC2 instance type           |
| `additionalInstanceTypes` | -           | Additional types for Spot diversity |
| `keyName`                 | -           | EC2 key pair for SSH access         |
| `ebsVolumeSize`           | `100`       | EBS volume size in GiB              |
| `containerInsights`       | `true`      | Enable Container Insights           |

### Scaling Configuration

| Parameter               | Default | Description                                     |
| ----------------------- | ------- | ----------------------------------------------- |
| `minClusterSize`        | `2`     | Minimum number of instances                     |
| `maxClusterSize`        | `4`     | Maximum number of instances                     |
| `targetCapacityPercent` | `100`   | Target capacity utilization (%) for managed scaling |

### Spot Configuration

| Parameter                | Default              | Description                                           |
| ------------------------ | -------------------- | ----------------------------------------------------- |
| `spotEnabled`            | `false`              | Enable Spot instances                                 |
| `onDemandPercentage`     | `100`                | Percentage of On-Demand instances (the rest are Spot) |
| `spotAllocationStrategy` | `capacity-optimized` | Spot allocation strategy                              |

**Spot Allocation Strategies:**

- `capacity-optimized` - Best for interruption avoidance (recommended)
- `lowest-price` - Best for cost, higher interruption risk
- `capacity-optimized-prioritized` - Like `capacity-optimized`, but honors the priority order of the instance types you specify

### Load Balancer Configuration

| Parameter                    | Default | Description                                                                         |
| ---------------------------- | ------- | ----------------------------------------------------------------------------------- |
| `createExternalLoadBalancer` | `false` | Create internet-facing ALB (public subnets auto-imported from VPC stack if enabled) |
| `createInternalLoadBalancer` | `false` | Create internal ALB                                                                 |
| `certificateArn`             | -       | ACM certificate for HTTPS                                                           |

### Fargate Configuration

| Parameter       | Default | Description                                                            |
| --------------- | ------- | ---------------------------------------------------------------------- |
| `enableFargate` | `false` | Enable Fargate capacity providers (adds both FARGATE and FARGATE_SPOT) |

### Lifecycle Configuration

| Parameter             | Default  | Description                                           |
| --------------------- | -------- | ----------------------------------------------------- |
| `drainingTimeout`     | `900`    | Seconds to wait for task draining (default: 15 min)   |
| `maxInstanceLifetime` | `604800` | Maximum instance age in seconds (default: 7 days)     |

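Because `maxInstanceLifetime` is given in seconds, it is easy to pass a value that EC2 Auto Scaling will reject. The sketch below converts a value to days and hours and checks it against the Auto Scaling limits (0 to disable, otherwise between 86400 and 31536000 seconds). `check_max_instance_lifetime` is a hypothetical helper, and whether the framework performs a similar check itself is not stated here.

```bash
#!/usr/bin/env bash
# Validate a maxInstanceLifetime value and print it in days and hours.
# Hypothetical helper; the bounds are the EC2 Auto Scaling limits
# for MaxInstanceLifetime (0, or 1 to 365 days expressed in seconds).
check_max_instance_lifetime() {
  local seconds="$1" days hours

  if [ "$seconds" -eq 0 ]; then
    echo "ok: instance lifetime limit disabled"
    return 0
  fi
  if [ "$seconds" -lt 86400 ] || [ "$seconds" -gt 31536000 ]; then
    echo "invalid: ${seconds}s is outside 86400..31536000"
    return 1
  fi

  days=$((seconds / 86400))
  hours=$(((seconds % 86400) / 3600))
  echo "ok: ${seconds}s = ${days}d ${hours}h"
}

check_max_instance_lifetime 604800
```

For the default of `604800` this prints `ok: 604800s = 7d 0h`. A value such as `3600` is reported as invalid and the function returns a non-zero status.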
## Environment-Specific Configuration

### Development/Sandbox

```groovy
spicyECSCluster(
    // ... base config ...
    environment: "dev",
    minClusterSize: 1,
    maxClusterSize: 2,
    spotEnabled: true,
    onDemandPercentage: 0 // 100% Spot for max savings
)
```

### Staging

```groovy
spicyECSCluster(
    // ... base config ...
    environment: "staging",
    minClusterSize: 2,
    maxClusterSize: 4,
    spotEnabled: true,
    onDemandPercentage: 20 // 80% Spot
)
```

### Production

```groovy
spicyECSCluster(
    // ... base config ...
    environment: "prod",
    minClusterSize: 3,
    maxClusterSize: 10,
    spotEnabled: true,
    onDemandPercentage: 50, // 50% On-Demand baseline
    approvers: "admin,platform-team"
)
```

## Stack Outputs

The stack exports these values for use by ECS services:

| Output                     | Export Name                                  | Description            |
| -------------------------- | -------------------------------------------- | ---------------------- |
| `ClusterName`              | `{stackName}-cluster-name`                   | ECS cluster name       |
| `ClusterArn`               | `{stackName}-cluster-arn`                    | ECS cluster ARN        |
| `VPC`                      | `{stackName}-VPC`                            | VPC ID                 |
| `ECSHostSecurityGroup`     | `{stackName}-ecs-host-security-group`        | EC2 security group     |
| `AutoScalingGroupName`     | `{stackName}-auto-scaling-group`             | ASG name               |
| `ExternalLoadBalancerDNS`  | `{stackName}-internet-facing-url`            | External ALB DNS       |
| `ExternalLoadBalancerArn`  | `{stackName}-internet-facing-arn`            | External ALB ARN       |
| `ExternalHTTPListenerArn`  | `{stackName}-internet-facing-http-listener`  | HTTP listener ARN      |
| `ExternalHTTPSListenerArn` | `{stackName}-internet-facing-https-listener` | HTTPS listener ARN     |
| `InternalLoadBalancerDNS`  | `{stackName}-internal-url`                   | Internal ALB DNS       |
| `InternalLoadBalancerArn`  | `{stackName}-internal-arn`                   | Internal ALB ARN       |
| `InternalHTTPListenerArn`  | `{stackName}-internal-http-listener`         | HTTP listener ARN      |
| `InternalHTTPSListenerArn` | `{stackName}-internal-https-listener`        | HTTPS listener ARN     |
| `LogsBucketName`           | `{stackName}-logs-s3-bucket`                 | ALB access logs bucket |

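A downstream pipeline can look these exports up by name. The following is a minimal sketch using the AWS CLI. `cluster_export_name` and `resolve_export` are hypothetical helper names, and `resolve_export` assumes the CLI is already configured with credentials and the region the stack was deployed to.

```bash
#!/usr/bin/env bash
# Build an export name from the stack name and a suffix from the table above.
cluster_export_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Look up the value of a CloudFormation export by its exact name.
# Assumes AWS CLI credentials and region are already configured.
resolve_export() {
  aws cloudformation list-exports \
    --query "Exports[?Name=='$1'].Value" \
    --output text
}

cluster_export_name prod-ecs-cluster cluster-name
```

Calling `resolve_export "$(cluster_export_name prod-ecs-cluster cluster-name)"` would then return the cluster name, provided the stack has been deployed in that account and region.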
## How It Works

### Capacity Providers (Replaces Custom Scaling)

The cluster uses **ECS Managed Scaling** via Capacity Providers:

```
┌─────────────────────────────────────────────────────────┐
│                       ECS Cluster                       │
├─────────────────────────────────────────────────────────┤
│  Capacity Providers:                                    │
│  ┌─────────────────────────────────────────────────┐    │
│  │ EC2 Capacity Provider                           │    │
│  │ - Managed Scaling: ON                           │    │
│  │ - Target Capacity: 100%                         │    │
│  │ - Min Scaling Step: 1                           │    │
│  │ - Max Scaling Step: 10000                       │    │
│  └─────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────┐    │
│  │ FARGATE (optional)                              │    │
│  └─────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────┐    │
│  │ FARGATE_SPOT (optional)                         │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘
```

This replaces the legacy `SchedulableContainers` Lambda metric with AWS-native scaling.

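To see what `targetCapacityPercent` does, it helps to work through the arithmetic. ECS managed scaling tracks a `CapacityProviderReservation` metric, defined as the number of instances needed divided by the number running, times 100, and adjusts the Auto Scaling group until that metric matches the target. The sketch below is a simplification: `desired_instances` is a hypothetical helper, and it ignores the scaling step sizes and the `minClusterSize`/`maxClusterSize` bounds.

```bash
#!/usr/bin/env bash
# Estimate how many instances managed scaling aims for.
# usage: desired_instances <instances_needed> <target_capacity_percent>
# Simplified model: running = ceil(needed * 100 / target).
desired_instances() {
  local needed="$1" target="$2"
  echo $(((needed * 100 + target - 1) / target))
}

echo "target 100%: $(desired_instances 8 100) instances"
echo "target 80%:  $(desired_instances 8 80) instances"
```

With 8 instances' worth of tasks, a 100% target keeps exactly 8 instances running, while an 80% target keeps 10 running so that roughly 20% of capacity stays free for new tasks.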
### Mixed Instances Policy (Replaces Autospotting)

When `spotEnabled: true`:

```
┌─────────────────────────────────────────────────────────┐
│                   Auto Scaling Group                    │
├─────────────────────────────────────────────────────────┤
│  Mixed Instances Policy:                                │
│  ┌─────────────────────────────────────────────────┐    │
│  │ On-Demand Base Capacity: 0                      │    │
│  │ On-Demand % Above Base: 50%                     │    │
│  │ Spot Allocation: capacity-optimized             │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
│  Instance Type Overrides:                               │
│  - m5a.large (primary)                                  │
│  - m5.large                                             │
│  - m5d.large                                            │
│  - m5n.large                                            │
└─────────────────────────────────────────────────────────┘
```

Benefits over Autospotting:

- No Lambda to maintain
- Faster response (no polling delay)
- Better capacity data (AWS-native)
- Simpler architecture

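The effect of `onDemandPercentage` on a given cluster size can be estimated with simple arithmetic. The sketch below assumes an On-Demand base capacity of 0, as in the diagram above, and assumes the On-Demand share is rounded up to a whole instance. That rounding rule is an assumption made for illustration and is worth confirming against the Auto Scaling documentation. `spot_split` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Estimate the On-Demand/Spot split for a cluster of a given size.
# usage: spot_split <total_instances> <on_demand_percentage>
# Assumptions: base capacity is 0; the On-Demand count is rounded up.
spot_split() {
  local total="$1" pct="$2" on_demand spot
  on_demand=$(((total * pct + 99) / 100))
  spot=$((total - on_demand))
  echo "on-demand=${on_demand} spot=${spot}"
}

spot_split 10 50
spot_split 4 20
spot_split 2 0
```

Under these assumptions the production example (10 instances at 50%) splits evenly, the staging example (4 instances at 20%) keeps a single On-Demand instance, and the development example runs entirely on Spot.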
### Instance Draining

Draining happens in two layers to avoid downtime:

1. **Spot Interruption Draining** (native ECS):

   ```bash
   ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
   ```

   With this agent setting, the ECS agent puts the instance into DRAINING when it receives the two-minute Spot interruption notice.

2. **Lifecycle Hook Draining** (Lambda):
   - ASG sends termination event to SNS
   - Lambda sets instance to DRAINING
   - Waits for running tasks to migrate
   - Completes lifecycle action

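The lifecycle-hook flow above can be sketched with the AWS CLI. This is not the Lambda shipped with the stack, only an illustration of the same three steps. `drain_and_complete` and all of its arguments are hypothetical, and parsing of the SNS termination event is omitted.

```bash
#!/usr/bin/env bash
# Sketch of lifecycle-hook draining. All argument values are placeholders.
# usage: drain_and_complete <cluster> <container-instance-arn> <instance-id> \
#                           <hook-name> <asg-name> [timeout-seconds] [poll-seconds]
drain_and_complete() {
  local cluster="$1" container_instance="$2" instance_id="$3"
  local hook_name="$4" asg_name="$5"
  local timeout="${6:-900}" interval="${7:-15}"
  local waited=0 running

  # Step 1: stop new task placement and start migrating existing tasks.
  aws ecs update-container-instances-state \
    --cluster "$cluster" \
    --container-instances "$container_instance" \
    --status DRAINING >/dev/null || return 1

  # Step 2: poll until no tasks remain on the instance or the timeout elapses.
  while [ "$waited" -lt "$timeout" ]; do
    running=$(aws ecs list-tasks \
      --cluster "$cluster" \
      --container-instance "$container_instance" \
      --query 'length(taskArns)' --output text) || return 1
    if [ "$running" -eq 0 ]; then break; fi
    sleep "$interval"
    waited=$((waited + interval))
  done

  # Step 3: tell the Auto Scaling group it may terminate the instance.
  aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name "$hook_name" \
    --auto-scaling-group-name "$asg_name" \
    --instance-id "$instance_id" \
    --lifecycle-action-result CONTINUE
}
```

The default timeout of 900 seconds mirrors `drainingTimeout`. A Lambda would typically re-invoke itself or rely on lifecycle-hook heartbeats rather than sleeping in a loop, so treat the polling here as a simplification.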
### Launch Template Features

- **IMDSv2 Required**: Enhanced metadata security
- **gp3 EBS Volumes**: Better performance, lower cost than gp2
- **Encrypted Volumes**: EBS encryption enabled
- **SSM Agent**: Pre-installed for Session Manager access

## Migrating from Legacy Automation

### Parameter Mapping

| Legacy (Ansible)                    | New (CDK)                      |
| ----------------------------------- | ------------------------------ |
| `stackName`                         | `stackName`                    |
| `instanceType`                      | `instanceType`                 |
| `minClusterSize`                    | `minClusterSize`               |
| `maxClusterSize`                    | `maxClusterSize`               |
| `spotEnabled`                       | `spotEnabled`                  |
| `minOnDemandPercentage`             | `onDemandPercentage`           |
| `largestContainerCPUReservation`    | (not needed - managed scaling) |
| `largestContainerMemoryReservation` | (not needed - managed scaling) |
| `clusterScaleUpAdjustment`          | (not needed - managed scaling) |
| `clusterScaleDownAdjustment`        | (not needed - managed scaling) |

### Removed Features

These legacy features are no longer needed:

- **SchedulableContainers Lambda**: Replaced by Capacity Provider managed scaling
- **Autospotting**: Replaced by Mixed Instances Policy
- **Launch Configurations**: Replaced by Launch Templates
- **gp2 volumes**: Upgraded to gp3
- **IMDSv1**: Now requires IMDSv2

## Troubleshooting

### Instances Not Joining Cluster

Check the ECS agent logs on the instance:

```bash
docker logs ecs-agent
cat /var/log/ecs/ecs-agent.log
```

Verify that `ECS_CLUSTER` in the agent configuration (written by user data) matches the cluster name:

```bash
cat /etc/ecs/ecs.config
```

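The configuration check above can be scripted so it is easy to run on several instances. The sketch below is a hypothetical helper, not part of the framework. It reads `ECS_CLUSTER` from the agent configuration and compares it with the expected name, which can be taken from the stack's `ClusterName` output.

```bash
#!/usr/bin/env bash
# Read the ECS_CLUSTER value from an ECS agent config file.
# If the key appears more than once, the last assignment is used.
ecs_cluster_from_config() {
  local config="${1:-/etc/ecs/ecs.config}"
  sed -n 's/^ECS_CLUSTER=//p' "$config" | tail -n 1
}

# Compare the configured cluster with the expected one.
# usage: check_cluster_name <expected-cluster-name> [config-file]
check_cluster_name() {
  local expected="$1" config="${2:-/etc/ecs/ecs.config}" actual
  actual=$(ecs_cluster_from_config "$config")
  if [ "$actual" = "$expected" ]; then
    echo "ok: agent is configured for cluster '$expected'"
  else
    echo "mismatch: expected '$expected', found '${actual:-<unset>}'"
    return 1
  fi
}
```

On an instance, `check_cluster_name my-ecs-cluster` reports a mismatch when the instance would register with a different cluster (or the default one) than intended.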
### Tasks Not Draining

Check Lambda logs in CloudWatch:

```
/aws/lambda/{stackName}-DrainingLambda
```

### Spot Interruptions

Monitor with CloudWatch metrics:

- `AWS/EC2Spot` → `InterruptionRate`
- `AWS/ECS` → `CPUReservation`, `MemoryReservation`

Consider increasing `onDemandPercentage` for critical workloads.

## Cost Optimization Tips

1. **Use Spot in non-prod**: `onDemandPercentage: 0`
2. **Multiple instance types**: Better Spot availability
3. **Right-size instances**: Match to your container sizes
4. **Enable Fargate Spot**: For batch/background tasks
5. **Set max instance lifetime**: Force instance refresh for patches