Links#
- Amazon ECS Developer Guide
- Task definition parameters
- Amazon ECS services
- Deploy Amazon ECS services by replacing tasks
- Deployment circuit breaker
- AWS Fargate capacity providers
- Task IAM role
- Task execution IAM role
- aws ecs register-task-definition
- aws ecs update-service
1. Important Points#
Amazon ECS is the AWS-native container orchestration service. It suits teams that want to run containerized applications without managing a Kubernetes control plane themselves.
Use ECS to:
  run long-running container services
  run scheduled / one-off container tasks
  deploy apps behind ALB / NLB
  run workers / consumers / batch jobs
  integrate with IAM, CloudWatch Logs, ALB, Service Discovery
ECS is a poor fit for:
  strong Kubernetes ecosystem requirements
  custom scheduler / operator-heavy platforms
  portable K8S manifests as the primary deployment artifact
  complex multi-cluster service mesh governance
Core principles:
prefer Fargate first:
  no EC2 capacity management
  simple operational model
  good default for small / medium teams
use ECS on EC2 when:
  special instance type / GPU / local disk
  very high steady utilization
  daemon / host-level integration
  advanced placement or cost optimization requirement
deployment unit:
  task definition revision is immutable
  service points to one task definition revision
  update-service creates a new deployment
2. Core Concepts#
| Concept | Meaning | Production Note |
|---|---|---|
| Cluster | ECS logical scheduling boundary | usually one per environment or platform boundary |
| Task definition | immutable container spec | every deploy registers a new revision |
| Task | running copy of a task definition | one or more containers |
| Service | keeps desired number of tasks running | use for web/API/worker |
| Capacity provider | where tasks run | FARGATE, FARGATE_SPOT, or EC2 ASG |
| Task role | IAM role used by application code | least privilege per service |
| Execution role | IAM role used by ECS agent/Fargate | pull image, write logs, fetch secrets |
| Service discovery | DNS name for ECS services | Cloud Map or internal ALB |
| Deployment | replacing old tasks with new tasks | rolling update by default |
request path example:
  Route 53
  -> ALB
  -> target group
  -> ECS service
  -> ECS task
  -> container port
worker path example:
  SQS / EventBridge
  -> ECS service or run-task
  -> container
  -> downstream AWS service
3. Service Configuration#
Minimum production service config:
cluster:
  environment boundary, for example prod-apps
service:
  desiredCount >= 2 for production web service
  deploymentCircuitBreaker enabled with rollback
  healthCheckGracePeriodSeconds configured when using ALB
  enableExecuteCommand only when audited and needed
task definition:
  image uses immutable tag or image digest
  cpu / memory explicitly set
  logs go to CloudWatch Logs
  secrets come from Secrets Manager / SSM Parameter Store
  taskRoleArn and executionRoleArn are separated
network:
  awsvpc mode
  private subnets for backend service
  ALB in public subnets if internet-facing
  security group allows only required ingress
Fargate task definition skeleton:
{
  "family": "order-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::111122223333:role/order-api-task-role",
  "runtimePlatform": {
    "cpuArchitecture": "X86_64",
    "operatingSystemFamily": "LINUX"
  },
  "containerDefinitions": [
    {
      "name": "order-api",
      "image": "111122223333.dkr.ecr.ap-east-1.amazonaws.com/order-api:20260603-120000",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "APP_ENV",
          "value": "prod"
        }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-east-1:111122223333:secret:prod/order-api/database-url-AbCdEf"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/order-api",
          "awslogs-region": "ap-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
4. Capacity And Networking#
capacity choice#
| Option | When To Use | Notes |
|---|---|---|
| Fargate | default for most services | no node management |
| Fargate Spot | fault-tolerant worker / async job | can be interrupted |
| ECS on EC2 | special hardware or high utilization | you manage capacity |
| EC2 capacity provider | ASG-backed ECS capacity | better scaling than manual container instances |
Fargate capacity provider example:
aws ecs put-cluster-capacity-providers \
  --cluster prod-apps \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy \
    capacityProvider=FARGATE,weight=1,base=1 \
    capacityProvider=FARGATE_SPOT,weight=1
Production networking:
public web API:
  ALB in public subnets
  ECS tasks in private subnets
  task security group allows inbound only from ALB security group
  outbound restricted when possible
internal service:
  internal ALB or Cloud Map
  private subnets only
  no public IP
egress:
  use VPC endpoints for ECR, CloudWatch Logs, Secrets Manager, SSM, STS when possible
  use NAT Gateway only for required internet egress
5. IAM#
Task role trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Task role example for an app that reads one secret and one S3 prefix:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAppSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:ap-east-1:111122223333:secret:prod/order-api/*"
    },
    {
      "Sid": "ReadOrderBucketPrefix",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::company-prod-orders/order-api/*"
    }
  ]
}
Deploy principal policy for CI/CD:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DeployEcsService",
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeServices",
        "ecs:DescribeTaskDefinition",
        "ecs:DescribeTasks",
        "ecs:ListTasks",
        "ecs:RegisterTaskDefinition",
        "ecs:UpdateService"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassOnlyEcsTaskRoles",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
        "arn:aws:iam::111122223333:role/order-api-task-role"
      ],
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "ecs-tasks.amazonaws.com"
        }
      }
    }
  ]
}
security rule:
  app permissions go to task role
  image pull / log / secret injection permissions go to execution role
  CI/CD role can register task definition and update service
  CI/CD role should only pass approved ECS roles
6. Deployment Strategy#
ECS rolling update is the default deployment controller for most services.
rolling update:
  update-service points service to new task definition
  ECS starts new tasks and drains old tasks based on deployment config
  minimumHealthyPercent controls lower healthy task bound
  maximumPercent controls temporary upper task count
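As a rough illustration of the two bounds above, the sketch below computes the task-count window a rolling update may use. `rolling_bounds` is a hypothetical helper, not part of the AWS CLI. It assumes the documented rounding: the minimumHealthyPercent bound is rounded up and the maximumPercent bound is rounded down to whole tasks.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the lower and upper task-count bounds for a rolling update.
# Usage: rolling_bounds <desiredCount> <minimumHealthyPercent> <maximumPercent>
rolling_bounds() {
  local desired="$1" min_pct="$2" max_pct="$3"
  # Lower bound of healthy tasks, rounded up to a whole task.
  local min_tasks=$(( (desired * min_pct + 99) / 100 ))
  # Upper bound of running plus pending tasks, rounded down to a whole task.
  local max_tasks=$(( desired * max_pct / 100 ))
  echo "min_healthy=${min_tasks} max_total=${max_tasks}"
}

rolling_bounds 2 100 200   # prints: min_healthy=2 max_total=4
rolling_bounds 5 50 150    # prints: min_healthy=3 max_total=7
```

With desiredCount=2, 100 and 200 mean ECS never drops below two healthy tasks and may briefly run four, so the deployment needs headroom for two extra tasks.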
deployment circuit breaker:
  detects service deployment failure
  can roll back to last completed deployment
  works with ECS rolling update controller
blue/green:
  use CodeDeploy or ECS blue/green when traffic shifting / validation hooks are needed
  more moving parts than normal rolling update
Create service with circuit breaker:
aws ecs create-service \
  --cluster prod-apps \
  --service-name order-api \
  --task-definition order-api:1 \
  --desired-count 2 \
  --capacity-provider-strategy capacityProvider=FARGATE,weight=1 \
  --deployment-controller type=ECS \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100" \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-task],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/order-api/abc123,containerName=order-api,containerPort=8080" \
  --health-check-grace-period-seconds 60 \
  --region ap-east-1
7. CLI Deploy#
This section is meant for direct integration with systems such as Jenkins, Ansible, GitHub Actions, and GitLab CI, with no dependency on the AWS Console.
deploy flow#
1. build image
2. push image to ECR
3. render task definition with new image
4. register new task definition revision
5. update ECS service to new revision
6. wait services-stable
7. verify running task image / ALB target health
8. print diagnostics on failure
files#
deployment/ecs/order-api/prod/
├── metadata.env
└── task-definition.json
metadata.env:
AWS_REGION=ap-east-1
AWS_ACCOUNT_ID=111122223333
ECR_REPOSITORY=order-api
ECS_CLUSTER=prod-apps
ECS_SERVICE=order-api
CONTAINER_NAME=order-api
TARGET_GROUP_ARN=arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/order-api/abc123
task-definition.json keeps image as a placeholder:
{
  "family": "order-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::111122223333:role/order-api-task-role",
  "containerDefinitions": [
    {
      "name": "order-api",
      "image": "__IMAGE_URI__",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/order-api",
          "awslogs-region": "ap-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
deploy script#
#!/usr/bin/env bash
set -euo pipefail
export AWS_PAGER=""
SERVICE_DIR="${1:?usage: deploy-ecs.sh <service-dir> <image-tag>}"
IMAGE_TAG="${2:?usage: deploy-ecs.sh <service-dir> <image-tag>}"
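# metadata.env provides AWS_REGION, AWS_ACCOUNT_ID, ECR_REPOSITORY, ECS_CLUSTER, ECS_SERVICE, CONTAINER_NAME.
source "${SERVICE_DIR}/metadata.env"
IMAGE_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPOSITORY}:${IMAGE_TAG}"
# Keep the X's at the end of the template: some mktemp implementations (BSD/macOS) only substitute trailing X's.
TMP_TASK_DEF="$(mktemp "${TMPDIR:-/tmp}/ecs-task-definition.XXXXXX")"
trap 'rm -f "${TMP_TASK_DEF}"' EXIT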
print_service_diagnostics() {
  aws ecs describe-services \
    --cluster "${ECS_CLUSTER}" \
    --services "${ECS_SERVICE}" \
    --region "${AWS_REGION}" \
    --query 'services[0].deployments[*].[status,taskDefinition,desiredCount,runningCount,pendingCount,failedTasks,rolloutState,rolloutStateReason]' \
    --output table
  aws ecs describe-services \
    --cluster "${ECS_CLUSTER}" \
    --services "${ECS_SERVICE}" \
    --region "${AWS_REGION}" \
    --query 'services[0].events[0:10].[createdAt,message]' \
    --output table
}
print_target_health() {
  if [[ -z "${TARGET_GROUP_ARN:-}" ]]; then
    return 0
  fi
  aws elbv2 describe-target-health \
    --target-group-arn "${TARGET_GROUP_ARN}" \
    --region "${AWS_REGION}" \
    --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
    --output table
}
echo "Render task definition: ${IMAGE_URI}"
jq --arg image "${IMAGE_URI}" --arg name "${CONTAINER_NAME}" '
  .containerDefinitions |= map(
    if .name == $name then .image = $image else . end
  )
' "${SERVICE_DIR}/task-definition.json" > "${TMP_TASK_DEF}"
jq empty "${TMP_TASK_DEF}"
echo "Register task definition"
TASK_DEF_ARN="$(
  aws ecs register-task-definition \
    --region "${AWS_REGION}" \
    --cli-input-json "file://${TMP_TASK_DEF}" \
    --query 'taskDefinition.taskDefinitionArn' \
    --output text
)"
echo "Update service: ${ECS_CLUSTER}/${ECS_SERVICE} -> ${TASK_DEF_ARN}"
aws ecs update-service \
  --cluster "${ECS_CLUSTER}" \
  --service "${ECS_SERVICE}" \
  --task-definition "${TASK_DEF_ARN}" \
  --force-new-deployment \
  --region "${AWS_REGION}" \
  --output json > /dev/null
echo "Wait services-stable"
if ! aws ecs wait services-stable \
  --cluster "${ECS_CLUSTER}" \
  --services "${ECS_SERVICE}" \
  --region "${AWS_REGION}"; then
  echo "ECS service failed to stabilize."
  print_service_diagnostics
  print_target_health
  exit 1
fi
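echo "Verify service deployment"
aws ecs describe-services \
  --cluster "${ECS_CLUSTER}" \
  --services "${ECS_SERVICE}" \
  --region "${AWS_REGION}" \
  --query 'services[0].{serviceName:serviceName,taskDefinition:taskDefinition,desiredCount:desiredCount,runningCount:runningCount,pendingCount:pendingCount,rolloutState:deployments[0].rolloutState,rolloutStateReason:deployments[0].rolloutStateReason}' \
  --output table
# services-stable also succeeds after a circuit breaker rollback, so confirm the new revision is the primary deployment.
PRIMARY_TASK_DEF="$(
  aws ecs describe-services \
    --cluster "${ECS_CLUSTER}" \
    --services "${ECS_SERVICE}" \
    --region "${AWS_REGION}" \
    --query "services[0].deployments[?status=='PRIMARY'] | [0].taskDefinition" \
    --output text
)"
if [[ "${PRIMARY_TASK_DEF}" != "${TASK_DEF_ARN}" ]]; then
  echo "Service is running ${PRIMARY_TASK_DEF}, expected ${TASK_DEF_ARN} (possible rollback)."
  print_service_diagnostics
  exit 1
fi
print_target_health
echo "Deployment completed: ${IMAGE_URI}"
Run locally or from Jenkins: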
chmod +x deploy-ecs.sh
./deploy-ecs.sh deployment/ecs/order-api/prod 20260603-120000
rollback#
Rollback means updating the service to a previous task definition revision.
aws ecs list-task-definitions \
  --family-prefix order-api \
  --sort DESC \
  --region ap-east-1 \
  --max-items 10
aws ecs update-service \
  --cluster prod-apps \
  --service order-api \
  --task-definition order-api:42 \
  --force-new-deployment \
  --region ap-east-1
aws ecs wait services-stable \
  --cluster prod-apps \
  --services order-api \
  --region ap-east-1
Jenkins pattern#
pipeline {
  agent any
  parameters {
    string(name: 'IMAGE_TAG', defaultValue: '', description: 'ECR image tag')
    choice(name: 'ENV', choices: ['dev', 'uat', 'prod'], description: 'deploy env')
  }
  stages {
    stage('Deploy ECS') {
      steps {
        // The shebang must be the first line of the script; without it, sh may run under a POSIX shell that rejects pipefail.
        sh '''#!/usr/bin/env bash
          set -euo pipefail
          ./deploy-ecs.sh "deployment/ecs/order-api/${ENV}" "${IMAGE_TAG}"
        '''
      }
    }
  }
}
Ansible pattern#
---
- name: Deploy ECS service
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Run ECS deploy script
      ansible.builtin.command:
        cmd: ./deploy-ecs.sh deployment/ecs/order-api/prod 20260603-120000
      changed_when: true
8. Hands-on#
Create log group:
aws logs create-log-group \
  --log-group-name /ecs/order-api \
  --region ap-east-1
Create cluster:
aws ecs create-cluster \
  --cluster-name prod-apps \
  --region ap-east-1
Register task definition:
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region ap-east-1
Run one task for smoke test:
aws ecs run-task \
  --cluster prod-apps \
  --task-definition order-api:1 \
  --capacity-provider-strategy capacityProvider=FARGATE,weight=1 \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa],securityGroups=[sg-task],assignPublicIp=DISABLED}" \
  --region ap-east-1
Check service:
aws ecs describe-services \
  --cluster prod-apps \
  --services order-api \
  --region ap-east-1 \
  --query 'services[0].[status,desiredCount,runningCount,pendingCount,taskDefinition]'
Check stopped task reason:
aws ecs list-tasks \
  --cluster prod-apps \
  --service-name order-api \
  --desired-status STOPPED \
  --region ap-east-1
aws ecs describe-tasks \
  --cluster prod-apps \
  --tasks "arn:aws:ecs:ap-east-1:111122223333:task/prod-apps/example" \
  --region ap-east-1 \
  --query 'tasks[0].{lastStatus:lastStatus,stoppedReason:stoppedReason,containers:containers[*].[name,lastStatus,exitCode,reason]}'
9. Monitoring#
Must-have signals:
| Signal | Where | Alert |
|---|---|---|
| CPU / memory utilization | ECS / CloudWatch metrics | sustained high usage |
| running task count | ECS service metrics | running < desired |
| pending task count | ECS service metrics | pending > 0 for long time |
| task stopped reason | ECS task state change event | unexpected stop |
| deployment failed | ECS service deployment event | immediate alert |
| ALB target unhealthy | ALB target group metrics | unhealthy > 0 |
| HTTP 5xx | ALB / app logs | error budget burn |
| log error rate | CloudWatch Logs / log backend | app regression |
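For the "ALB target unhealthy" row, a minimal sketch of a CloudWatch alarm created from the CLI. The alarm name, the three-minute evaluation window, the load balancer dimension value, and the SNS topic are illustrative assumptions, not values from this setup. The command is wrapped in a function so nothing runs until it is called.

```shell
#!/usr/bin/env bash
# Sketch: alarm when the target group reports any unhealthy target for 3 consecutive minutes.
# Assumed inputs: dimension values for the target group and load balancer, and an SNS topic ARN.
create_unhealthy_target_alarm() {
  local target_group_dim="$1" load_balancer_dim="$2" sns_topic_arn="$3" region="$4"
  aws cloudwatch put-metric-alarm \
    --alarm-name "order-api-unhealthy-targets" \
    --namespace AWS/ApplicationELB \
    --metric-name UnHealthyHostCount \
    --dimensions "Name=TargetGroup,Value=${target_group_dim}" "Name=LoadBalancer,Value=${load_balancer_dim}" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "${sns_topic_arn}" \
    --region "${region}"
}

# Example call (requires AWS credentials; the load balancer and SNS values are placeholders):
# create_unhealthy_target_alarm \
#   "targetgroup/order-api/abc123" \
#   "app/order-api-alb/0123456789abcdef" \
#   "arn:aws:sns:ap-east-1:111122223333:ops-alerts" \
#   ap-east-1
```

Using the Maximum statistic over one-minute periods means a single unhealthy target in any period counts, which matches the "unhealthy > 0" alert in the table.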
EventBridge rule for failed deployment:
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Deployment State Change"],
  "detail": {
    "eventName": ["SERVICE_DEPLOYMENT_FAILED"]
  }
}
Operational commands:
aws ecs describe-services \
  --cluster prod-apps \
  --services order-api \
  --region ap-east-1 \
  --query 'services[0].events[0:20].[createdAt,message]' \
  --output table
aws logs tail /ecs/order-api \
  --since 30m \
  --follow \
  --region ap-east-1
10. Security Best Practices#
image:
  use ECR
  scan images
  avoid latest tag in production
  prefer immutable version tag or digest
network:
  tasks in private subnets
  no public IP for backend services
  security group ingress only from ALB or trusted service
  use VPC endpoints for AWS APIs where possible
identity:
  separate task role and execution role
  least privilege per service
  CI/CD uses OIDC assume role, not long-lived access keys
  restrict iam:PassRole
secrets:
  use Secrets Manager or SSM Parameter Store
  do not bake secrets into image
  do not store secrets in task definition environment values
operations:
  enable deployment circuit breaker with rollback
  log to central log backend
  alert on failed deployments and stopped tasks
11. Scaling And Cost#
Service autoscaling:
scale on:
  CPUUtilization
  MemoryUtilization
  ALBRequestCountPerTarget
  SQS ApproximateNumberOfMessagesVisible for workers
  custom business metric when possible
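A minimal sketch of wiring the first metric above into Application Auto Scaling target tracking. The capacity limits, the 60% target, the cooldowns, and the policy name are illustrative assumptions; the cluster and service names follow the examples in this document. The commands are wrapped in a function so nothing runs until it is called.

```shell
#!/usr/bin/env bash
# Sketch: register an ECS service as a scalable target and attach a CPU target-tracking policy.
# Assumed values: min 2 / max 10 tasks, 60% average CPU target, policy name "cpu-target-tracking".
configure_cpu_autoscaling() {
  local cluster="$1" service="$2" region="$3"
  local resource_id="service/${cluster}/${service}"

  # Step 1: tell Application Auto Scaling which dimension of the service it may change.
  aws application-autoscaling register-scalable-target \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id "${resource_id}" \
    --min-capacity 2 \
    --max-capacity 10 \
    --region "${region}"

  # Step 2: attach a target-tracking policy on the predefined ECS CPU metric.
  aws application-autoscaling put-scaling-policy \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id "${resource_id}" \
    --policy-name cpu-target-tracking \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration \
      '{"TargetValue":60.0,"PredefinedMetricSpecification":{"PredefinedMetricType":"ECSServiceAverageCPUUtilization"},"ScaleOutCooldown":60,"ScaleInCooldown":300}' \
    --region "${region}"
}

# Example call (requires AWS credentials):
# configure_cpu_autoscaling prod-apps order-api ap-east-1
```

Keeping min-capacity at 2 lines up with the desiredCount >= 2 guidance for production web services, so scale-in never leaves a single task.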
avoid:
  scaling only on CPU for I/O-bound app
  Fargate Spot for stateful or latency-critical service
  desiredCount=1 for production web API
Cost controls:
right size:
  set task CPU/memory from real metrics
  review p95 CPU/memory every release cycle
Fargate Spot:
  good for workers and retryable jobs
  handle SIGTERM and stopTimeout
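To make the SIGTERM point concrete: when a task is stopped, ECS sends SIGTERM to its containers, waits up to stopTimeout, then sends SIGKILL. Below is a minimal sketch of a worker entrypoint that finishes its current job and then exits. `process_one_job` is a hypothetical placeholder for real work, and the job limit only keeps the sketch finite.

```shell
#!/usr/bin/env bash
# Sketch of a worker entrypoint that drains on SIGTERM instead of dying mid-job.
shutting_down=0
trap 'shutting_down=1' TERM

# Hypothetical placeholder for real work (for example, handling one queue message).
process_one_job() {
  echo "processed job $1"
}

# Runs jobs until the limit is reached or a shutdown has been requested.
run_worker() {
  local max_jobs="$1" job=1
  while [ "${shutting_down}" -eq 0 ] && [ "${job}" -le "${max_jobs}" ]; do
    process_one_job "${job}"
    job=$((job + 1))
  done
  if [ "${shutting_down}" -eq 1 ]; then
    echo "SIGTERM received, exiting after current job"
  fi
}

run_worker 3
```

The script only sees the signal if it runs as the container's main process, so an exec-form entrypoint is the safer choice. Each job should also finish well inside stopTimeout.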
logs:
  set CloudWatch Logs retention
  avoid debug logs in production
network:
  compare NAT Gateway cost vs VPC endpoint cost
  avoid cross-AZ target / NAT traffic surprises
12. Production Checklist#
service:
  desired count >= 2 for production web service
  deployment circuit breaker enabled
  rollback tested
  health check path returns app readiness
  graceful shutdown implemented
task definition:
  immutable image tag or digest
  CPU/memory configured
  awslogs configured
  secrets injected from Secrets Manager / SSM
  task role least privilege
  execution role minimal and standard
network:
  private subnets for tasks
  security group scoped
  ALB target health checks correct
  VPC endpoints configured for private AWS API access when needed
deploy:
  CLI deploy script stored with app or ops repo
  register-task-definition and update-service are automated
  services-stable wait is required
  failed deploy prints ECS events and target health
  rollback command documented
monitoring:
  ECS deployment failure EventBridge alert
  running task count alert
  ALB 5xx / target unhealthy alert
  app logs searchable by service/env/version
13. Common Mistakes#
mistake:
  use latest image tag in production
result:
  rollback and audit become unreliable
mistake:
  put app AWS permissions on execution role
result:
  role boundary becomes unclear and over-permissioned
mistake:
  deploy only with update-service --force-new-deployment and same image tag
result:
  hard to know which image is running
mistake:
  no circuit breaker / no services-stable wait
result:
  pipeline reports success while service is unhealthy
mistake:
  task health check starts before app is ready
result:
  deployment flaps and rolls back
mistake:
  Fargate tasks need AWS APIs but private subnet has no NAT or VPC endpoints
result:
  image pull, logs, secrets, or STS calls fail
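The first mistake above can be caught before deploy time. This is a minimal sketch of a guard that a deploy script could call before rendering the task definition. `require_pinned_image` is a hypothetical helper, and its rules (reject a missing tag and `latest`, accept a digest or any other tag) are a policy assumption, not an ECS feature.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only for image references pinned by digest or by a non-latest tag.
require_pinned_image() {
  local image_uri="$1"
  # Drop the registry/repository path so a registry port such as host:5000 is not read as a tag.
  local last_segment="${image_uri##*/}"
  case "${last_segment}" in
    *@sha256:*) return 0 ;;   # pinned by digest
    *:latest)   return 1 ;;   # mutable tag
    *:?*)       return 0 ;;   # explicit tag
    *)          return 1 ;;   # no tag means implicit latest
  esac
}

for ref in \
  "111122223333.dkr.ecr.ap-east-1.amazonaws.com/order-api:20260603-120000" \
  "111122223333.dkr.ecr.ap-east-1.amazonaws.com/order-api:latest" \
  "111122223333.dkr.ecr.ap-east-1.amazonaws.com/order-api"; do
  if require_pinned_image "${ref}"; then
    echo "ok      ${ref}"
  else
    echo "reject  ${ref}"
  fi
done
```

A version tag is only as immutable as the repository allows, so this check works best alongside ECR tag immutability or digest-based references.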