SSM Session Recovery

1. Issue

Session Manager on an EC2 instance suddenly cannot connect, and the console reports:

The version of SSM Agent on the instance supports Session Manager,
but the instance is not configured for use with AWS Systems Manager.
Verify that the IAM instance profile attached to the instance includes the required permissions.

If Docker / K3S / privileged container / --cgroupns=host changes were made around the same time, also suspect the instance profile, IMDS, SSM endpoint connectivity, and the host OS being affected by the container.

Common causes:
    instance profile missing
    instance profile lacks AmazonSSMManagedInstanceCore
    IMDS disabled or unreachable
    SSM Agent cannot reach the AWS Systems Manager endpoints
    host resources / network affected by Docker, K3S, iptables, or cgroups

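Core principles:

Restore access first:
    do not keep starting the problem container
    do not start by troubleshooting the application
    get SSM / SSH / EC2 metadata working again first

Then address the root cause:
    stop Docker / containerd / the problem container from auto-starting
    verify SSM Agent, IAM, and endpoint connectivity
    rebuild the application container last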

2. Session Manager Baseline

An EC2 instance must meet these baseline requirements to use Session Manager:

Area	Requirement	Check
IAM	instance profile has SSM permissions	`AmazonSSMManagedInstanceCore`
Agent	SSM Agent installed and running	`systemctl status amazon-ssm-agent`
Network	outbound HTTPS to SSM endpoints	port `443`
Metadata	instance can get IAM role credentials	IMDS enabled
Region	instance appears as a managed node	Systems Manager console
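
The last row of the table can also be checked from a workstation instead of the console. This is a minimal sketch, assuming the AWS CLI is configured for the instance's region. `ssm_ping_status` is a hypothetical helper name and the instance ID is a placeholder.

```shell
# Print the SSM PingStatus for one instance (Online, ConnectionLost, Inactive).
# Prints "NotRegistered" when the instance is not a managed node in this region.
ssm_ping_status() {
  status=$(aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text)
  case "$status" in
    ""|None) echo "NotRegistered" ;;
    *)       echo "$status" ;;
  esac
}

# Usage (placeholder instance ID):
#   ssm_ping_status i-xxxxxxxxxxxxxxxxx
```

"NotRegistered" usually points back to the IAM, IMDS, or network rows of the table, since the agent has never managed to register.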

AWS managed policy:

AmazonSSMManagedInstanceCore

Required endpoint access for private subnet instances:

ssm.<region>.amazonaws.com
ssmmessages.<region>.amazonaws.com
ec2messages.<region>.amazonaws.com

If there is no NAT Gateway or other internet egress, create interface VPC endpoints for these three services and allow inbound TCP 443 to the endpoints from the instance security group.
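
Once you have a shell on the instance, reachability of the three endpoints can be tested directly. The sketch below assumes `curl` is installed and that the instance is in a standard-partition region (hostnames ending in `amazonaws.com`). The function names are illustrative only.

```shell
# Print the three Systems Manager endpoint hostnames for a region.
ssm_endpoints() {
  for svc in ssm ssmmessages ec2messages; do
    echo "${svc}.$1.amazonaws.com"
  done
}

# Try an HTTPS connection to each endpoint. Any HTTP response counts as
# reachable; a timeout or connection error does not.
check_ssm_endpoints() {
  for host in $(ssm_endpoints "$1"); do
    if curl -s -o /dev/null --max-time 5 "https://${host}"; then
      echo "OK   ${host}"
    else
      echo "FAIL ${host}"
    fi
  done
}

# Usage (run on the instance; the region is a placeholder):
#   check_ssm_endpoints us-east-1
```

This only proves that TCP 443 and TLS work. It does not prove that the instance role is allowed to call the service.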

3. Fast Recovery From Console

Use this path when the EC2 instance is still running but Session Manager reports IAM / configuration errors.

EC2 console:
    Instances
        -> select instance
        -> Security
        -> IAM Role

If IAM Role is empty:
    Actions
        -> Security
        -> Modify IAM role
        -> attach a role with AmazonSSMManagedInstanceCore

If IAM Role exists:
    IAM console
        -> Roles
        -> select the role
        -> verify AmazonSSMManagedInstanceCore is attached
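
The same two console checks can be done with the AWS CLI. This is a sketch, assuming the caller has `ec2:DescribeInstances` and `iam:ListAttachedRolePolicies`. The helper names are hypothetical, and the check only looks for the managed policy by name, so a role that grants equivalent permissions through a custom policy would be reported as missing.

```shell
# Print the instance profile ARN attached to an instance ("None" if empty).
instance_profile_arn() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].IamInstanceProfile.Arn' \
    --output text
}

# Succeeds (exit 0) when the role has AmazonSSMManagedInstanceCore attached.
role_has_ssm_core() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text | tr '\t' '\n' | grep -qx 'AmazonSSMManagedInstanceCore'
}

# Usage (placeholder names):
#   instance_profile_arn i-xxxxxxxxxxxxxxxxx
#   role_has_ssm_core my-instance-role && echo "policy attached"
```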

Check instance metadata options:

EC2 console:
    select instance
        -> Actions
        -> Instance settings
        -> Modify instance metadata options

Expected:
    IMDS enabled
    IMDSv2 optional and required are both acceptable, as long as the software on the instance supports IMDSv2 when it is required
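
The metadata options can also be read and changed with the AWS CLI. In this sketch the helper names are hypothetical, and the token mode is passed explicitly rather than relying on a default.

```shell
# Show the current IMDS settings (HttpEndpoint, HttpTokens, hop limit).
show_imds_options() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].MetadataOptions' \
    --output json
}

# Enable the metadata endpoint. $2 is the token mode: optional or required.
enable_imds() {
  aws ec2 modify-instance-metadata-options \
    --instance-id "$1" \
    --http-endpoint enabled \
    --http-tokens "$2"
}

# Usage (placeholder instance ID):
#   show_imds_options i-xxxxxxxxxxxxxxxxx
#   enable_imds i-xxxxxxxxxxxxxxxxx optional
```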

Then reboot the instance:

aws ec2 reboot-instances \
  --instance-ids i-xxxxxxxxxxxxxxxxx

Wait 2-5 minutes and retry Session Manager.
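
Instead of retrying by hand, the wait can be scripted. The sketch below polls the managed-node status. The default of 20 polls at 15 seconds is an assumption chosen to cover the 5 minute upper bound, and `wait_for_ssm_online` is a hypothetical name.

```shell
# Poll until the instance reports PingStatus "Online" or the attempts run out.
# $1 = instance ID, $2 = max attempts (default 20), $3 = seconds between polls (default 15).
wait_for_ssm_online() {
  attempts=${2:-20}
  interval=${3:-15}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(aws ssm describe-instance-information \
      --filters "Key=InstanceIds,Values=$1" \
      --query 'InstanceInformationList[0].PingStatus' \
      --output text)
    if [ "$status" = "Online" ]; then
      echo "online after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "still not online after $attempts attempts" >&2
  return 1
}

# Usage (placeholder instance ID):
#   wait_for_ssm_online i-xxxxxxxxxxxxxxxxx
```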

4. Rescue Disk Recovery

Use this path when Session Manager and SSH are both unavailable, or the instance reboots and immediately becomes unavailable again.

Most likely pattern:

problem container has restart policy:
    --restart=always
    docker compose restart: always

after EC2 reboot:
    Docker starts
    container starts K3S / containerd / kubelet
    host memory, cgroup, iptables, or metadata access becomes unhealthy
    amazon-ssm-agent cannot register or maintain session

Recovery steps:

1. Stop the broken EC2 instance
2. Detach the root EBS volume
3. Attach that volume to a healthy rescue EC2 in the same AZ
4. Mount the old root filesystem on the rescue EC2
5. Disable Docker / containerd / the problem container from auto-starting
6. Unmount the volume
7. Attach it back to the original instance as root volume
8. Start the original instance
9. Restore SSM first, then inspect the application container
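
Steps 1 to 3 can be sketched with the AWS CLI as follows. The function name and the `/dev/sdf` default are assumptions, and the script assumes the root volume is an EBS volume in the same AZ as the rescue instance. Take a snapshot before running anything like this.

```shell
# Stop the broken instance and move its root volume to a rescue instance.
# $1 = broken instance ID, $2 = rescue instance ID (same AZ),
# $3 = device name to use on the rescue instance (assumed /dev/sdf if omitted).
move_root_to_rescue() {
  broken=$1; rescue=$2; device=${3:-/dev/sdf}

  aws ec2 stop-instances --instance-ids "$broken" >/dev/null
  aws ec2 wait instance-stopped --instance-ids "$broken"

  # Look up the root device name, then the volume attached at that device.
  root_dev=$(aws ec2 describe-instances --instance-ids "$broken" \
    --query 'Reservations[0].Instances[0].RootDeviceName' --output text)
  volume=$(aws ec2 describe-instances --instance-ids "$broken" \
    --query "Reservations[0].Instances[0].BlockDeviceMappings[?DeviceName=='${root_dev}'].Ebs.VolumeId | [0]" \
    --output text)

  aws ec2 detach-volume --volume-id "$volume" >/dev/null
  aws ec2 wait volume-available --volume-ids "$volume"
  aws ec2 attach-volume --volume-id "$volume" \
    --instance-id "$rescue" --device "$device" >/dev/null

  # Print what is needed to attach the volume back later (step 7).
  echo "$volume $root_dev"
}

# Usage (placeholder IDs):
#   move_root_to_rescue i-broken0000000000 i-rescue0000000000
```

Keep the printed volume ID and root device name. Step 7 needs the original root device name, or the instance will not boot from the volume.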

find and mount the volume

On the rescue EC2:

lsblk
sudo mkdir -p /mnt/rescue
sudo mount /dev/nvme1n1p1 /mnt/rescue

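If the root volume uses a different device name or partition, identify it first and substitute it in the mount command:

lsblk -f
sudo file -s /dev/nvme1n1
# replace /dev/nvme1n1p1 with the partition reported above
sudo mount /dev/nvme1n1p1 /mnt/rescue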

For LVM based images:

sudo vgscan
sudo vgchange -ay
lsblk

Then mount the logical volume shown by lsblk.
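
One case the mount commands above do not cover: when the rescue instance was launched from the same image as the broken one and the root filesystem is XFS, both filesystems carry the same UUID and a plain mount is refused. The sketch below picks `nouuid` for XFS. The helper names are hypothetical and the device path is a placeholder.

```shell
# Print the mount options to use for a given filesystem type.
# XFS refuses to mount a filesystem whose UUID is already mounted,
# so nouuid is needed when the rescue and broken roots share an image.
rescue_mount_opts() {
  case "$1" in
    xfs) echo "nouuid" ;;
    *)   echo "defaults" ;;
  esac
}

# Detect the filesystem type of a partition and mount it on /mnt/rescue.
mount_rescue() {
  fstype=$(sudo blkid -o value -s TYPE "$1")
  sudo mkdir -p /mnt/rescue
  sudo mount -o "$(rescue_mount_opts "$fstype")" "$1" /mnt/rescue
}

# Usage (placeholder device):
#   mount_rescue /dev/nvme1n1p1
```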

disable Docker startup in the old OS

Mask Docker and containerd in the mounted filesystem:

sudo mkdir -p /mnt/rescue/etc/systemd/system
sudo ln -sf /dev/null /mnt/rescue/etc/systemd/system/docker.service
sudo ln -sf /dev/null /mnt/rescue/etc/systemd/system/containerd.service

This prevents Docker from starting before you can log in.
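
The masking step can be wrapped in a small helper that also keeps a copy of any real unit file it would otherwise overwrite. This is a sketch with hypothetical function names. Including `docker.socket` in the usage is an assumption for hosts where Docker is socket-activated.

```shell
# Mask a systemd unit inside a mounted (offline) root filesystem.
# $1 = mount point of the old root, $2 = unit name.
mask_offline_unit() {
  target="$1/etc/systemd/system/$2"
  mkdir -p "$1/etc/systemd/system"
  # Keep a copy if a real unit file lives here; ln -sf would replace it.
  if [ -f "$target" ] && [ ! -L "$target" ]; then
    mv "$target" "$target.bak"
  fi
  ln -sf /dev/null "$target"
}

# Succeeds when the unit is masked in the offline root.
is_masked_offline() {
  [ "$(readlink "$1/etc/systemd/system/$2")" = "/dev/null" ]
}

# Usage (run as root on the rescue instance):
#   for unit in docker.service docker.socket containerd.service; do
#     mask_offline_unit /mnt/rescue "$unit"
#   done
```

A later `systemctl unmask` on the recovered host removes the symlink. Any `.bak` copy has to be moved back by hand.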

If the application uses compose under a known directory, inspect it before booting:

sudo find /mnt/rescue -name 'docker-compose.yml' -o -name 'compose.yml'
sudo find /mnt/rescue -name '*.service' | grep -Ei 'docker|compose|ninedata|k3s'
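
To narrow the search to compose files that will actually bring a container back after reboot, the file list can be filtered by restart policy. This sketch uses a hypothetical function name and a simple text match, so it does not parse YAML and may also match commented lines.

```shell
# List compose files under a root that set restart: always or unless-stopped.
find_autorestart_compose() {
  find "$1" -type f \( -name 'docker-compose.yml' -o -name 'docker-compose.yaml' \
      -o -name 'compose.yml' -o -name 'compose.yaml' \) \
    -exec grep -lE "restart:[[:space:]]*[\"']?(always|unless-stopped)" {} +
}

# Usage (run as root on the rescue instance):
#   find_autorestart_compose /mnt/rescue
```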

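If there is a custom systemd unit that starts the container, mask it too. ln -sf replaces an existing file, so if the unit file itself lives under /etc/systemd/system, keep a copy first:

sudo cp -a /mnt/rescue/etc/systemd/system/ninedata.service /mnt/rescue/root/ninedata.service.bak
sudo ln -sf /dev/null /mnt/rescue/etc/systemd/system/ninedata.service

Only use the exact unit name that exists on the old system.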

unmount and restore

sync
sudo umount /mnt/rescue

Detach the volume from the rescue instance, attach it back to the original EC2 as the root device, then start the original EC2.
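
The detach, re-attach, and start sequence can be sketched with the AWS CLI. The function name is hypothetical. The root device name must be the one the original instance used, for example the `RootDeviceName` value reported by `aws ec2 describe-instances`.

```shell
# Move the repaired volume back to the original instance and start it.
# $1 = volume ID, $2 = original instance ID, $3 = original root device name.
restore_root_volume() {
  volume=$1; original=$2; root_dev=$3

  aws ec2 detach-volume --volume-id "$volume" >/dev/null
  aws ec2 wait volume-available --volume-ids "$volume"
  aws ec2 attach-volume --volume-id "$volume" \
    --instance-id "$original" --device "$root_dev" >/dev/null
  aws ec2 start-instances --instance-ids "$original" >/dev/null
  aws ec2 wait instance-running --instance-ids "$original"
}

# Usage (placeholder values):
#   restore_root_volume vol-xxxxxxxxxxxxxxxxx i-xxxxxxxxxxxxxxxxx /dev/xvda
```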

Once Session Manager or SSH works again, check SSM first:

sudo systemctl status amazon-ssm-agent
sudo systemctl restart amazon-ssm-agent
sudo journalctl -u amazon-ssm-agent -n 100 --no-pager

Check instance metadata and IAM credentials:

TOKEN=$(curl -s -X PUT \
  "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

curl -s \
  -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/

Expected result:

returns IAM role name
does not hang
does not return 404
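
The three expectations above can be turned into a check that cannot hang. This sketch adds `--max-time` to the same requests and maps the HTTP status to an outcome. The function names are hypothetical, and the reading of status 401 as a token problem is an assumption about IMDSv2 behavior.

```shell
# Map an HTTP status from the credentials path to the outcomes listed above.
# curl reports 000 when it got no response at all (timeout or refused).
interpret_imds_status() {
  case "$1" in
    200) echo "ok: role credentials are available" ;;
    404) echo "fail: no instance profile attached" ;;
    401) echo "fail: missing or invalid IMDSv2 token" ;;
    000) echo "fail: IMDS unreachable or request timed out" ;;
    *)   echo "fail: unexpected HTTP status $1" ;;
  esac
}

# Query IMDS with a hard timeout so a hang shows up as a failure.
check_imds() {
  token=$(curl -s --max-time 3 -X PUT \
    "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  code=$(curl -s -o /dev/null --max-time 3 -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: ${token}" \
    "http://169.254.169.254/latest/meta-data/iam/security-credentials/")
  interpret_imds_status "$code"
}

# Usage (run on the instance):
#   check_imds
```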

Check Docker state:

sudo systemctl status docker
docker ps -a
docker inspect ninedata --format '{{.HostConfig.RestartPolicy.Name}}'

Disable restart policy before testing:

docker update --restart=no ninedata
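
If more than one container might come back on its own, it helps to list every container with an automatic restart policy first. This is a sketch with a hypothetical function name. It treats an empty policy name the same as `no`.

```shell
# Print "name policy" for every container whose restart policy is not "no".
list_autorestart_containers() {
  ids=$(docker ps -aq)
  [ -n "$ids" ] || return 0
  # $ids is intentionally unquoted so each container ID is a separate argument.
  docker inspect --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' $ids \
    | awk '$2 != "no" && $2 != "" { sub(/^\//, "", $1); print $1, $2 }'
}

# Usage: review the list, then switch each container off explicitly.
#   list_autorestart_containers
#   docker update --restart=no <name>
```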

If Docker was masked during rescue, unmask it only after SSM is stable:

sudo systemctl unmask containerd
sudo systemctl unmask docker
sudo systemctl start docker

6. Container With K3S

Containers that run K3S, kubelet, or containerd inside Docker are host-sensitive. If they also need --cgroupns=host, then in combination with high privileges (such as --privileged) they can affect host resources and network behavior.

Risk checklist:

container uses:
    --privileged
    --cgroupns=host
    --network=host
    hostPath mounts into /sys, /var, /run, /etc
    K3S / kubelet / containerd inside container
    restart policy always / unless-stopped

host symptoms:
    Session Manager cannot connect
    SSH cannot connect
    amazon-ssm-agent unhealthy
    high memory usage or OOM
    iptables / routing changed
    IMDS request to 169.254.169.254 hangs

Safer test run:

docker run -d \
  --name ninedata \
  --restart=no \
  --cgroupns=host \
  [OTHER_OPTIONS] \
  ninedata-image:tag

Do not enable --restart=always until:

SSM Agent remains online
memory is stable
docker logs are clean
metadata endpoint works
iptables / route table are understood
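
Two of these conditions, SSM Agent health and metadata access, can be watched automatically while the test container runs. The sketch below uses hypothetical function names and arbitrary timings. Memory, container logs, and iptables / routing still need a manual review.

```shell
# Succeeds only when the SSM Agent is active and IMDS answers within 3 seconds.
host_access_healthy() {
  systemctl is-active --quiet amazon-ssm-agent \
    || { echo "ssm agent not active"; return 1; }
  curl -s -f -o /dev/null --max-time 3 -X PUT \
    "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60" \
    || { echo "IMDS not reachable"; return 1; }
  echo "healthy"
}

# Run the check repeatedly while the test container is up.
# $1 = number of checks, $2 = seconds between checks.
soak_check() {
  n=0
  while [ "$n" -lt "$1" ]; do
    host_access_healthy >/dev/null \
      || { echo "failed on check $((n + 1))"; return 1; }
    n=$((n + 1))
    sleep "$2"
  done
  echo "passed $1 checks"
}

# Usage (run on the host, for example 60 checks, 30 seconds apart):
#   soak_check 60 30
```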

For small EC2 instances, avoid running K3S-in-container workloads on the same host that you depend on for emergency access. Use a larger instance or a dedicated test host.

7. Production Checklist

Before running the workload again:

EC2:
    instance profile attached
    AmazonSSMManagedInstanceCore attached
    IMDS enabled
    SSM Agent running
    outbound 443 to SSM endpoints works

Access:
    Session Manager tested
    SSH or EC2 Instance Connect fallback tested if used
    no problem container auto-start before access is verified

Docker:
    restart policy set to no during testing
    memory limit considered
    logs monitored
    privileged / host namespace options documented

Recovery:
    root volume snapshot exists before risky change
    rescue instance available in same AZ
    rollback command documented
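
The snapshot item can be scripted so it is never skipped before a risky change. This is a minimal sketch, assuming the AWS CLI is configured and the root volume ID is known. `snapshot_root_volume` is a hypothetical name.

```shell
# Snapshot a volume and wait until the snapshot is complete.
# $1 = volume ID, $2 = description. Prints the snapshot ID.
snapshot_root_volume() {
  snap=$(aws ec2 create-snapshot \
    --volume-id "$1" \
    --description "$2" \
    --query 'SnapshotId' --output text) || return 1
  aws ec2 wait snapshot-completed --snapshot-ids "$snap" || return 1
  echo "$snap"
}

# Usage (placeholder volume ID):
#   snapshot_root_volume vol-xxxxxxxxxxxxxxxxx "before container change"
```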