AWS ECS


1. Important Points#

Amazon ECS is the AWS-native container orchestration service. It fits teams that want to run containerized applications without managing a Kubernetes control plane themselves.

Use ECS to:
    run long-running container service
    run scheduled / one-off container task
    deploy app behind ALB / NLB
    run worker / consumer / batch job
    integrate with IAM, CloudWatch Logs, ALB, Service Discovery

ECS is a poor fit for:
    strong Kubernetes ecosystem requirement
    custom scheduler / operator-heavy platform
    portable K8S manifest as primary deployment artifact
    complex multi-cluster service mesh governance

Core principles:

prefer Fargate first:
    no EC2 capacity management
    simple operational model
    good default for small / medium teams

use ECS on EC2 when:
    special instance type / GPU / local disk
    very high steady utilization
    daemon / host-level integration
    advanced placement or cost optimization requirement

deployment unit:
    task definition revision is immutable
    service points to one task definition revision
    update-service creates a new deployment

2. Core Concepts#

| Concept | Meaning | Production Note |
|---|---|---|
| Cluster | ECS logical scheduling boundary | usually one per environment or platform boundary |
| Task definition | immutable container spec | every deploy registers a new revision |
| Task | running copy of a task definition | one or more containers |
| Service | keeps desired number of tasks running | use for web/API/worker |
| Capacity provider | where tasks run | FARGATE, FARGATE_SPOT, or EC2 ASG |
| Task role | IAM role used by application code | least privilege per service |
| Execution role | IAM role used by ECS agent/Fargate | pull image, write logs, fetch secrets |
| Service discovery | DNS name for ECS services | Cloud Map or internal ALB |
| Deployment | replacing old tasks with new tasks | rolling update by default |

request path example:
    Route 53
        -> ALB
        -> target group
        -> ECS service
        -> ECS task
        -> container port

worker path example:
    SQS / EventBridge
        -> ECS service or run-task
        -> container
        -> downstream AWS service

3. Service Configuration#

Minimum production service config:

cluster:
    environment boundary, for example prod-apps

service:
    desiredCount >= 2 for production web service
    deploymentCircuitBreaker enabled with rollback
    healthCheckGracePeriodSeconds configured when using ALB
    enableExecuteCommand only when audited and needed

task definition:
    image uses immutable tag or image digest
    cpu / memory explicitly set
    logs go to CloudWatch Logs
    secrets come from Secrets Manager / SSM Parameter Store
    taskRoleArn and executionRoleArn are separated

network:
    awsvpc mode
    private subnets for backend service
    ALB in public subnets if internet-facing
    security group allows only required ingress

Fargate task definition skeleton:

{
  "family": "order-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::111122223333:role/order-api-task-role",
  "runtimePlatform": {
    "cpuArchitecture": "X86_64",
    "operatingSystemFamily": "LINUX"
  },
  "containerDefinitions": [
    {
      "name": "order-api",
      "image": "111122223333.dkr.ecr.ap-east-1.amazonaws.com/order-api:20260603-120000",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "APP_ENV",
          "value": "prod"
        }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-east-1:111122223333:secret:prod/order-api/database-url-AbCdEf"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/order-api",
          "awslogs-region": "ap-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
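
Fargate accepts only certain task-level cpu / memory pairs, so a pre-flight check before register-task-definition can catch a bad combination early. The sketch below is a hypothetical helper (valid_fargate_size is not an AWS CLI command). Its size table reflects the Linux Fargate combinations as commonly documented and should be verified against the current AWS documentation.

```shell
#!/usr/bin/env bash
# Sketch: validate a Fargate task-level cpu (CPU units) / memory (MiB) pair.
# valid_fargate_size is a hypothetical helper name, not part of any AWS tool.
valid_fargate_size() {
  local cpu="$1" mem="$2" min max step
  # Memory must be a plain non-negative integer.
  case "${mem}" in ''|*[!0-9]*) return 1 ;; esac
  case "${cpu}" in
    256)
      # 0.25 vCPU allows exactly three memory values.
      case "${mem}" in 512|1024|2048) return 0 ;; *) return 1 ;; esac ;;
    512)   min=1024;  max=4096;   step=1024 ;;
    1024)  min=2048;  max=8192;   step=1024 ;;
    2048)  min=4096;  max=16384;  step=1024 ;;
    4096)  min=8192;  max=30720;  step=1024 ;;
    8192)  min=16384; max=61440;  step=4096 ;;
    16384) min=32768; max=122880; step=8192 ;;
    *)     return 1 ;;
  esac
  # Inside the range and on an allowed increment.
  [ "${mem}" -ge "${min}" ] && [ "${mem}" -le "${max}" ] &&
    [ $(( (mem - min) % step )) -eq 0 ]
}

# The skeleton above uses cpu=512, memory=1024.
valid_fargate_size 512 1024 && echo "512/1024 ok"
```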

4. Capacity And Networking#

capacity choice#

| Option | When To Use | Notes |
|---|---|---|
| Fargate | default for most services | no node management |
| Fargate Spot | fault-tolerant worker / async job | can be interrupted |
| ECS on EC2 | special hardware or high utilization | you manage capacity |
| EC2 capacity provider | ASG-backed ECS capacity | better scaling than manual container instances |

Fargate capacity provider example:

aws ecs put-cluster-capacity-providers \
  --cluster prod-apps \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy \
      capacityProvider=FARGATE,weight=1,base=1 \
      capacityProvider=FARGATE_SPOT,weight=1

Production networking:

public web API:
    ALB in public subnets
    ECS tasks in private subnets
    task security group allows inbound only from ALB security group
    outbound restricted when possible

internal service:
    internal ALB or Cloud Map
    private subnets only
    no public IP

egress:
    use VPC endpoints for ECR (ecr.api and ecr.dkr), S3 (gateway endpoint, because ECR image layers are served from S3), CloudWatch Logs, Secrets Manager, SSM, STS when possible
    use NAT Gateway only for required internet egress
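
As a planning aid for the egress rules above, the sketch below prints the VPC endpoint service names a private Fargate task commonly needs in one region. ecs_vpc_endpoints is a hypothetical helper that only prints names and creates nothing. The exact set is an assumption and depends on which AWS APIs the task and the application actually call.

```shell
#!/usr/bin/env bash
# Sketch: list VPC endpoint service names for a private ECS / Fargate setup.
# ecs_vpc_endpoints is a hypothetical helper; it prints "<type> <service-name>".
ecs_vpc_endpoints() {
  local region="$1" svc
  for svc in ecr.api ecr.dkr logs secretsmanager ssm sts; do
    echo "Interface com.amazonaws.${region}.${svc}"
  done
  # ECR image layers are downloaded from S3, so image pulls also need S3 access.
  echo "Gateway com.amazonaws.${region}.s3"
}

ecs_vpc_endpoints ap-east-1
```

Interface endpoints are billed per hour and per GB, so endpoints for services the workload never calls can be dropped from the list.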

5. IAM#

Task role trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Task role example for an app that reads one secret and one S3 prefix:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAppSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:ap-east-1:111122223333:secret:prod/order-api/*"
    },
    {
      "Sid": "ReadOrderBucketPrefix",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::company-prod-orders/order-api/*"
    }
  ]
}

Deploy principal policy for CI/CD:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DeployEcsService",
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeServices",
        "ecs:DescribeTaskDefinition",
        "ecs:DescribeTasks",
        "ecs:ListTasks",
        "ecs:RegisterTaskDefinition",
        "ecs:UpdateService"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassOnlyEcsTaskRoles",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
        "arn:aws:iam::111122223333:role/order-api-task-role"
      ],
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "ecs-tasks.amazonaws.com"
        }
      }
    }
  ]
}

security rule:
    app permissions go to task role
    image pull / log / secret injection permissions go to execution role
    CI/CD role can register task definition and update service
    CI/CD role should only pass approved ECS roles

6. Deployment Strategy#

Rolling update (deployment controller type ECS) is the default deployment type and fits most services.

rolling update:
    update-service points service to new task definition
    ECS starts new tasks and drains old tasks based on deployment config
    minimumHealthyPercent controls lower healthy task bound
    maximumPercent controls temporary upper task count
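
To make the two percentages concrete, the sketch below converts them into task counts. It assumes the rounding ECS documents for these parameters: minimumHealthyPercent rounds up and maximumPercent rounds down. rolling_bounds is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: turn rolling update percentages into task counts.
# rolling_bounds is a hypothetical helper; it prints "<min healthy> <max total>".
rolling_bounds() {
  local desired="$1" min_pct="$2" max_pct="$3"
  # minimumHealthyPercent: lower bound, rounded up.
  local min_tasks=$(( (desired * min_pct + 99) / 100 ))
  # maximumPercent: upper bound, rounded down.
  local max_tasks=$(( desired * max_pct / 100 ))
  echo "${min_tasks} ${max_tasks}"
}

rolling_bounds 2 100 200   # prints: 2 4
rolling_bounds 3 50 150    # prints: 2 4
```

With desiredCount=2, minimumHealthyPercent=100, and maximumPercent=200, at least 2 healthy tasks stay up and up to 4 tasks may run during the rollout, so the cluster needs capacity headroom for the extra tasks.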

deployment circuit breaker:
    detects service deployment failure
    can rollback to last completed deployment
    works with ECS rolling update controller

blue/green:
    use CodeDeploy or ECS blue/green when traffic shifting / validation hooks are needed
    more moving parts than normal rolling update

Create service with circuit breaker:

aws ecs create-service \
  --cluster prod-apps \
  --service-name order-api \
  --task-definition order-api:1 \
  --desired-count 2 \
  --capacity-provider-strategy capacityProvider=FARGATE,weight=1 \
  --deployment-controller type=ECS \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},maximumPercent=200,minimumHealthyPercent=100" \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb],securityGroups=[sg-task],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/order-api/abc123,containerName=order-api,containerPort=8080" \
  --health-check-grace-period-seconds 60 \
  --region ap-east-1

7. CLI Deploy#

This section is meant for direct integration with systems such as Jenkins, Ansible, GitHub Actions, and GitLab CI, without depending on the AWS Console.

deploy flow#

1. build image
2. push image to ECR
3. render task definition with new image
4. register new task definition revision
5. update ECS service to new revision
6. wait services-stable
7. verify running task image / ALB target health
8. print diagnostics on failure

files#

deployment/ecs/order-api/prod/
├── metadata.env
└── task-definition.json

metadata.env:

AWS_REGION=ap-east-1
AWS_ACCOUNT_ID=111122223333
ECR_REPOSITORY=order-api
ECS_CLUSTER=prod-apps
ECS_SERVICE=order-api
CONTAINER_NAME=order-api
TARGET_GROUP_ARN=arn:aws:elasticloadbalancing:ap-east-1:111122223333:targetgroup/order-api/abc123
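
Because the deploy script sources metadata.env, a missing key would otherwise surface only as an unbound variable error halfway through a deploy. A minimal sketch of an up-front check follows. require_vars is a hypothetical helper, and treating TARGET_GROUP_ARN as optional is an assumption based on how the script guards it.

```shell
#!/usr/bin/env bash
# Sketch: fail fast when metadata.env lacks a required key.
# require_vars is a hypothetical helper; it reports every missing name.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage after sourcing metadata.env (TARGET_GROUP_ARN left out as optional):
#   require_vars AWS_REGION AWS_ACCOUNT_ID ECR_REPOSITORY ECS_CLUSTER ECS_SERVICE CONTAINER_NAME
```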

task-definition.json keeps image as a placeholder:

{
  "family": "order-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::111122223333:role/order-api-task-role",
  "containerDefinitions": [
    {
      "name": "order-api",
      "image": "__IMAGE_URI__",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/order-api",
          "awslogs-region": "ap-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

deploy script#

#!/usr/bin/env bash
set -euo pipefail

export AWS_PAGER=""

SERVICE_DIR="${1:?usage: deploy-ecs.sh <service-dir> <image-tag>}"
IMAGE_TAG="${2:?usage: deploy-ecs.sh <service-dir> <image-tag>}"

source "${SERVICE_DIR}/metadata.env"

IMAGE_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPOSITORY}:${IMAGE_TAG}"
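# Keep the X's at the end of the template: BSD/macOS mktemp does not substitute them otherwise.
TMP_TASK_DEF="$(mktemp "${TMPDIR:-/tmp}/ecs-task-definition.XXXXXX")"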
trap 'rm -f "${TMP_TASK_DEF}"' EXIT

print_service_diagnostics() {
  aws ecs describe-services \
    --cluster "${ECS_CLUSTER}" \
    --services "${ECS_SERVICE}" \
    --region "${AWS_REGION}" \
    --query 'services[0].deployments[*].[status,taskDefinition,desiredCount,runningCount,pendingCount,failedTasks,rolloutState,rolloutStateReason]' \
    --output table

  aws ecs describe-services \
    --cluster "${ECS_CLUSTER}" \
    --services "${ECS_SERVICE}" \
    --region "${AWS_REGION}" \
    --query 'services[0].events[0:10].[createdAt,message]' \
    --output table
}

print_target_health() {
  if [[ -z "${TARGET_GROUP_ARN:-}" ]]; then
    return 0
  fi

  aws elbv2 describe-target-health \
    --target-group-arn "${TARGET_GROUP_ARN}" \
    --region "${AWS_REGION}" \
    --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
    --output table
}

echo "Render task definition: ${IMAGE_URI}"
jq --arg image "${IMAGE_URI}" --arg name "${CONTAINER_NAME}" '
  .containerDefinitions |= map(
    if .name == $name then .image = $image else . end
  )
' "${SERVICE_DIR}/task-definition.json" > "${TMP_TASK_DEF}"

jq empty "${TMP_TASK_DEF}"

echo "Register task definition"
TASK_DEF_ARN="$(
  aws ecs register-task-definition \
    --region "${AWS_REGION}" \
    --cli-input-json "file://${TMP_TASK_DEF}" \
    --query 'taskDefinition.taskDefinitionArn' \
    --output text
)"

echo "Update service: ${ECS_CLUSTER}/${ECS_SERVICE} -> ${TASK_DEF_ARN}"
aws ecs update-service \
  --cluster "${ECS_CLUSTER}" \
  --service "${ECS_SERVICE}" \
  --task-definition "${TASK_DEF_ARN}" \
  --force-new-deployment \
  --region "${AWS_REGION}" \
  --output json > /dev/null

echo "Wait services-stable"
if ! aws ecs wait services-stable \
  --cluster "${ECS_CLUSTER}" \
  --services "${ECS_SERVICE}" \
  --region "${AWS_REGION}"; then
  echo "ECS service failed to stabilize."
  print_service_diagnostics
  print_target_health
  exit 1
fi

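echo "Verify service deployment"
aws ecs describe-services \
  --cluster "${ECS_CLUSTER}" \
  --services "${ECS_SERVICE}" \
  --region "${AWS_REGION}" \
  --query 'services[0].{serviceName:serviceName,taskDefinition:taskDefinition,desiredCount:desiredCount,runningCount:runningCount,pendingCount:pendingCount,rolloutState:deployments[0].rolloutState,rolloutStateReason:deployments[0].rolloutStateReason}' \
  --output table

# services-stable can also succeed after a circuit breaker rollback,
# so confirm the service runs the revision registered above.
ACTIVE_TASK_DEF="$(
  aws ecs describe-services \
    --cluster "${ECS_CLUSTER}" \
    --services "${ECS_SERVICE}" \
    --region "${AWS_REGION}" \
    --query 'services[0].taskDefinition' \
    --output text
)"
if [[ "${ACTIVE_TASK_DEF}" != "${TASK_DEF_ARN}" ]]; then
  echo "Service runs ${ACTIVE_TASK_DEF}, expected ${TASK_DEF_ARN}."
  print_service_diagnostics
  exit 1
fi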

print_target_health

echo "Deployment completed: ${IMAGE_URI}"

Run locally or from Jenkins:

chmod +x deploy-ecs.sh

./deploy-ecs.sh deployment/ecs/order-api/prod 20260603-120000

rollback#

Rollback means updating the service to a previous task definition revision.

aws ecs list-task-definitions \
  --family-prefix order-api \
  --sort DESC \
  --region ap-east-1 \
  --max-items 10

aws ecs update-service \
  --cluster prod-apps \
  --service order-api \
  --task-definition order-api:42 \
  --force-new-deployment \
  --region ap-east-1

aws ecs wait services-stable \
  --cluster prod-apps \
  --services order-api \
  --region ap-east-1
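
In a pipeline, the rollback target is often derived from the revision that was just deployed. The sketch below computes "family:N-1" from a task definition ARN. previous_revision is a hypothetical helper, and it assumes revision N-1 still exists, is ACTIVE, and was the last known-good one. Confirm this with list-task-definitions before using the result.

```shell
#!/usr/bin/env bash
# Sketch: derive the previous revision from a task definition ARN.
# previous_revision is a hypothetical helper; it does not call AWS.
previous_revision() {
  local arn="$1" family_rev family rev
  family_rev="${arn##*/}"      # e.g. order-api:43
  family="${family_rev%:*}"    # e.g. order-api
  rev="${family_rev##*:}"      # e.g. 43
  # The revision must be an integer greater than 1 to have a predecessor.
  case "${rev}" in ''|*[!0-9]*) return 1 ;; esac
  [ "${rev}" -gt 1 ] || return 1
  echo "${family}:$(( rev - 1 ))"
}

previous_revision "arn:aws:ecs:ap-east-1:111122223333:task-definition/order-api:43"   # prints: order-api:42
```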

Jenkins pattern#

pipeline {
  agent any

  parameters {
    string(name: 'IMAGE_TAG', defaultValue: '', description: 'ECR image tag')
    choice(name: 'ENV', choices: ['dev', 'uat', 'prod'], description: 'deploy env')
  }

  stages {
    stage('Deploy ECS') {
      steps {
        sh '''#!/usr/bin/env bash
          set -euo pipefail
          ./deploy-ecs.sh "deployment/ecs/order-api/${ENV}" "${IMAGE_TAG}"
        '''
      }
    }
  }
}

Ansible pattern#

---
- name: Deploy ECS service
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Run ECS deploy script
      ansible.builtin.command:
        cmd: ./deploy-ecs.sh deployment/ecs/order-api/prod 20260603-120000
      changed_when: true

8. Hands-on#

Create log group:

aws logs create-log-group \
  --log-group-name /ecs/order-api \
  --region ap-east-1

Create cluster:

aws ecs create-cluster \
  --cluster-name prod-apps \
  --region ap-east-1

Register task definition:

aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region ap-east-1

Run one task for smoke test:

aws ecs run-task \
  --cluster prod-apps \
  --task-definition order-api:1 \
  --capacity-provider-strategy capacityProvider=FARGATE,weight=1 \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-aaa],securityGroups=[sg-task],assignPublicIp=DISABLED}" \
  --region ap-east-1

Check service:

aws ecs describe-services \
  --cluster prod-apps \
  --services order-api \
  --region ap-east-1 \
  --query 'services[0].[status,desiredCount,runningCount,pendingCount,taskDefinition]'

Check stopped task reason:

aws ecs list-tasks \
  --cluster prod-apps \
  --service-name order-api \
  --desired-status STOPPED \
  --region ap-east-1

aws ecs describe-tasks \
  --cluster prod-apps \
  --tasks "arn:aws:ecs:ap-east-1:111122223333:task/prod-apps/example" \
  --region ap-east-1 \
  --query 'tasks[0].{lastStatus:lastStatus,stoppedReason:stoppedReason,containers:containers[*].[name,lastStatus,exitCode,reason]}'
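
The stoppedReason and container reason strings returned by describe-tasks usually start with a recognizable error class. The sketch below maps a few common classes to a first place to look. stop_reason_hint is a hypothetical helper, the list is not exhaustive, and the hints are starting points rather than diagnoses.

```shell
#!/usr/bin/env bash
# Sketch: map an ECS stop reason string to a first troubleshooting hint.
# stop_reason_hint is a hypothetical helper; it only inspects the text passed in.
stop_reason_hint() {
  case "$1" in
    *CannotPullContainerError*)
      echo "image pull: check image URI, execution role ECR permissions, NAT / VPC endpoints" ;;
    *ResourceInitializationError*)
      echo "task init: check secret ARNs, execution role, log group, NAT / VPC endpoints" ;;
    *OutOfMemoryError*)
      echo "memory: raise task / container memory or reduce app memory usage" ;;
    *"failed ELB health checks"*|*"failed container health checks"*)
      echo "health check: check path, port, grace period, app startup time" ;;
    *)
      echo "unclassified: read container logs and service events" ;;
  esac
}

# The argument here is a shortened sample string, not real output.
stop_reason_hint "CannotPullContainerError: sample"
```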

9. Monitoring#

Must-have signals:

| Signal | Where | Alert |
|---|---|---|
| CPU / memory utilization | ECS / CloudWatch metrics | sustained high usage |
| running task count | ECS service metrics | running < desired |
| pending task count | ECS service metrics | pending > 0 for long time |
| task stopped reason | ECS task state change event | unexpected stop |
| deployment failed | ECS service deployment event | immediate alert |
| ALB target unhealthy | ALB target group metrics | unhealthy > 0 |
| HTTP 5xx | ALB / app logs | error budget burn |
| log error rate | CloudWatch Logs / log backend | app regression |

EventBridge rule for failed deployment:

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Deployment State Change"],
  "detail": {
    "eventName": ["SERVICE_DEPLOYMENT_FAILED"]
  }
}

Operational commands:

aws ecs describe-services \
  --cluster prod-apps \
  --services order-api \
  --region ap-east-1 \
  --query 'services[0].events[0:20].[createdAt,message]' \
  --output table

aws logs tail /ecs/order-api \
  --since 30m \
  --follow \
  --region ap-east-1

10. Security Best Practices#

image:
    use ECR
    scan images
    avoid latest tag in production
    prefer immutable version tag or digest
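
A deploy pipeline can enforce the image rules above before it registers a task definition. The sketch below accepts a digest reference or any explicit tag other than latest. is_pinned_image is a hypothetical helper. A version tag is only as immutable as the repository allows, so this check assumes tag immutability is enabled on the ECR repository or that the team never re-pushes tags.

```shell
#!/usr/bin/env bash
# Sketch: reject image references that are untagged or use the "latest" tag.
# is_pinned_image is a hypothetical helper; it inspects the string only.
is_pinned_image() {
  local ref="$1" digest last tag
  case "${ref}" in
    *@sha256:*)
      # Digest reference: expect exactly 64 lowercase hex characters.
      digest="${ref##*@sha256:}"
      [ "${#digest}" -eq 64 ] || return 1
      case "${digest}" in *[!0-9a-f]*) return 1 ;; esac
      return 0 ;;
  esac
  last="${ref##*/}"                               # final path segment
  case "${last}" in *:*) ;; *) return 1 ;; esac   # no tag means implicit "latest"
  tag="${last##*:}"
  [ -n "${tag}" ] && [ "${tag}" != "latest" ]
}

is_pinned_image "111122223333.dkr.ecr.ap-east-1.amazonaws.com/order-api:20260603-120000" && echo "pinned"
```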

network:
    tasks in private subnets
    no public IP for backend services
    security group ingress only from ALB or trusted service
    use VPC endpoints for AWS APIs where possible

identity:
    separate task role and execution role
    least privilege per service
    CI/CD uses OIDC assume role, not long-lived access keys
    restrict iam:PassRole

secrets:
    use Secrets Manager or SSM Parameter Store
    do not bake secrets into image
    do not store secrets in task definition environment values

operations:
    enable deployment circuit breaker with rollback
    log to central log backend
    alert on failed deployments and stopped tasks

11. Scaling And Cost#

Service autoscaling:

scale on:
    CPUUtilization
    MemoryUtilization
    ALBRequestCountPerTarget
    SQS ApproximateNumberOfMessagesVisible for workers
    custom business metric when possible

avoid:
    scaling only on CPU for I/O-bound app
    Fargate Spot for stateful or latency-critical service
    desiredCount=1 for production web API
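
For SQS workers, ApproximateNumberOfMessagesVisible alone does not say how many tasks are needed. One common approach is to divide the backlog by an acceptable backlog per task. The sketch below shows that arithmetic with a clamp. desired_worker_tasks is a hypothetical helper, and the per-task backlog of 100 is an assumed value that would normally come from the latency target divided by the per-message processing time.

```shell
#!/usr/bin/env bash
# Sketch: backlog-per-task sizing for an SQS worker service.
# desired_worker_tasks is a hypothetical helper; arguments:
#   visible messages, acceptable backlog per task, min tasks, max tasks
desired_worker_tasks() {
  local visible="$1" per_task="$2" min_tasks="$3" max_tasks="$4"
  [ "${per_task}" -gt 0 ] || return 1
  # Ceiling division: tasks needed to keep the backlog per task at or below the target.
  local want=$(( (visible + per_task - 1) / per_task ))
  # Clamp to the allowed range.
  if [ "${want}" -lt "${min_tasks}" ]; then want="${min_tasks}"; fi
  if [ "${want}" -gt "${max_tasks}" ]; then want="${max_tasks}"; fi
  echo "${want}"
}

desired_worker_tasks 950 100 2 20   # prints: 10
```

The result could feed a custom metric for target tracking or a scheduled update-service --desired-count call. Which of the two fits better depends on how quickly the queue depth changes.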

Cost controls:

right size:
    set task CPU/memory from real metrics
    review p95 CPU/memory every release cycle

Fargate Spot:
    good for workers and retryable jobs
    handle SIGTERM and stopTimeout
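
When ECS stops a task, including on a Fargate Spot interruption, the container receives SIGTERM and is killed once stopTimeout expires (30 seconds by default, up to 120 seconds on Fargate). A minimal sketch of a worker loop that finishes the in-flight job and then exits is shown below. process_one_job is a hypothetical placeholder, and the sketch assumes the script is the container's main process so that it actually receives the signal.

```shell
#!/usr/bin/env bash
# Sketch: a worker loop that stops taking new jobs once SIGTERM arrives.
# process_one_job is a hypothetical placeholder for the real unit of work.
shutting_down=0
on_term() { shutting_down=1; }
trap on_term TERM

process_one_job() { sleep 1; }

run_worker() {
  # The flag is checked between jobs, so an in-flight job finishes first.
  while [ "${shutting_down}" -eq 0 ]; do
    process_one_job
  done
  echo "SIGTERM received, worker exiting cleanly"
}

# In a real container entrypoint the last line would be: run_worker
```

A single job should take comfortably less than stopTimeout. Otherwise the task is killed mid-job and the message has to be retried.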

logs:
    set CloudWatch Logs retention
    avoid debug logs in production

network:
    compare NAT Gateway cost vs VPC endpoint cost
    avoid cross-AZ target / NAT traffic surprises

12. Production Checklist#

service:
    desired count >= 2 for production web service
    deployment circuit breaker enabled
    rollback tested
    health check path returns app readiness
    graceful shutdown implemented

task definition:
    immutable image tag or digest
    CPU/memory configured
    awslogs configured
    secrets injected from Secrets Manager / SSM
    task role least privilege
    execution role minimal and standard

network:
    private subnets for tasks
    security group scoped
    ALB target health checks correct
    VPC endpoints configured for private AWS API access when needed

deploy:
    CLI deploy script stored with app or ops repo
    register-task-definition and update-service are automated
    services-stable wait is required
    failed deploy prints ECS events and target health
    rollback command documented

monitoring:
    ECS deployment failure EventBridge alert
    running task count alert
    ALB 5xx / target unhealthy alert
    app logs searchable by service/env/version

13. Common Mistakes#

mistake:
    use latest image tag in production
result:
    rollback and audit become unreliable

mistake:
    put app AWS permissions on execution role
result:
    role boundary becomes unclear and over-permissioned

mistake:
    deploy only with update-service --force-new-deployment and same image tag
result:
    hard to know which image is running

mistake:
    no circuit breaker / no services-stable wait
result:
    pipeline reports success while service is unhealthy

mistake:
    task health check starts before app is ready
result:
    deployment flaps and rolls back

mistake:
    Fargate tasks need AWS APIs but private subnet has no NAT or VPC endpoints
result:
    image pull, logs, secrets, or STS calls fail