Links
- Setting up Session Manager
- Verify or add instance permissions for Session Manager
- Session Manager prerequisites
- Access instance metadata for an EC2 instance
- Detach an Amazon EBS volume
- Attach an Amazon EBS volume
1. Issue
Session Manager on an EC2 instance suddenly stops connecting, and the console reports:
The version of SSM Agent on the instance supports Session Manager,
but the instance is not configured for use with AWS Systems Manager.
Verify that the IAM instance profile attached to the instance includes the required permissions.
If Docker, K3S, privileged-container, or --cgroupns=host changes were made around the same time, suspect all of the following together: the instance profile, IMDS, SSM endpoint connectivity, and the host OS being affected by the container.
Common causes:
instance profile missing
instance profile lacks AmazonSSMManagedInstanceCore
IMDS disabled or unreachable
SSM Agent cannot reach the AWS Systems Manager endpoints
host resources / network affected by Docker, K3S, iptables, cgroup
Core principles:
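Restore access first:
do not keep starting the problem container
do not start by debugging the application
get SSM / SSH / EC2 metadata working again first
Then address the root cause:
disable automatic startup of Docker / containerd / the problem container
verify SSM Agent, IAM, endpoint connectivity
Finally, rebuild the application containers
2. Session Manager Baseline
An EC2 instance must meet these baseline requirements to use Session Manager: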
| Area | Requirement | Check |
|---|---|---|
| IAM | instance profile has SSM permissions | AmazonSSMManagedInstanceCore |
| Agent | SSM Agent installed and running | systemctl status amazon-ssm-agent |
| Network | outbound HTTPS to SSM endpoints | port 443 |
| Metadata | instance can get IAM role credentials | IMDS enabled |
| Region | instance appears as a managed node | Systems Manager console |
AWS managed policy:
AmazonSSMManagedInstanceCore
Required endpoint access for private subnet instances:
ssm.<region>.amazonaws.com
ssmmessages.<region>.amazonaws.com
ec2messages.<region>.amazonaws.com
If there is no NAT Gateway or other internet egress, create interface VPC endpoints for Systems Manager and, on the endpoints' security group, allow inbound 443 from the instance security group.
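The endpoint requirement above can be tested from the instance itself, or from another host in the same subnet. This is a minimal sketch, assuming curl is installed. The function names ssm_endpoints and check_ssm_endpoints are illustrative, not part of any AWS tooling. Any HTTP response, including an error status, is treated as "reachable", because the goal is only to confirm that a TLS connection on port 443 can be established.

```shell
# Print the three Systems Manager endpoint hostnames for a region.
ssm_endpoints() {
  for svc in ssm ssmmessages ec2messages; do
    printf '%s.%s.amazonaws.com\n' "$svc" "$1"
  done
}

# Try an HTTPS connection to each endpoint; any HTTP response counts as reachable.
# Returns non-zero if at least one endpoint cannot be reached within the timeout.
check_ssm_endpoints() {
  rc=0
  for host in $(ssm_endpoints "$1"); do
    if curl -s -o /dev/null --connect-timeout 5 --max-time 10 "https://$host/"; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      rc=1
    fi
  done
  return "$rc"
}
```

Call it with the instance's region, for example `check_ssm_endpoints "$AWS_REGION"`. A FAIL line points at routing, security groups, DNS, or missing VPC endpoints rather than at IAM.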
3. Fast Recovery From Console
Use this path when the EC2 instance is still running but Session Manager reports IAM / configuration errors.
EC2 console:
Instances
-> select instance
-> Security
-> IAM Role
If IAM Role is empty:
Actions
-> Security
-> Modify IAM role
-> attach a role with AmazonSSMManagedInstanceCore
If IAM Role exists:
IAM console
-> Roles
-> select the role
-> verify AmazonSSMManagedInstanceCore is attached
Check instance metadata options:
EC2 console:
select instance
-> Actions
-> Instance settings
-> Modify instance metadata options
Expected:
IMDS enabled
IMDSv2 optional or required are both acceptable, as long as the software on the instance supports the chosen mode
Then reboot the instance:
aws ec2 reboot-instances \
  --instance-ids i-xxxxxxxxxxxxxxxxx
Wait 2-5 minutes and retry Session Manager.
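The console checks in this section have AWS CLI equivalents, which help when the same verification has to be repeated. The sketch below assumes the AWS CLI is configured with permission to describe EC2 instances, IAM roles, and SSM nodes. The function names and the SSM_POLL_INTERVAL variable are illustrative. Note that role_has_ssm_policy takes the role name, which can differ from the instance profile name, and it only detects the managed policy, not equivalent inline permissions.

```shell
# Print the instance profile ARN, or "None" if no profile is attached.
instance_profile_arn() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].IamInstanceProfile.Arn' --output text
}

# Succeed if the role has the AmazonSSMManagedInstanceCore managed policy attached.
role_has_ssm_policy() {
  aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' --output text |
    tr '\t' '\n' | grep -qx 'AmazonSSMManagedInstanceCore'
}

# Print the IMDS endpoint state and token mode, tab-separated.
imds_options() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].MetadataOptions.[HttpEndpoint,HttpTokens]' \
    --output text
}

# Poll until Systems Manager reports the node as Online.
# $1 = instance ID, $2 = number of tries (default 30, 10 seconds apart).
wait_for_ssm_online() {
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    status=$(aws ssm describe-instance-information \
      --filters "Key=InstanceIds,Values=$1" \
      --query 'InstanceInformationList[0].PingStatus' --output text)
    [ "$status" = "Online" ] && return 0
    tries=$((tries - 1))
    sleep "${SSM_POLL_INTERVAL:-10}"
  done
  return 1
}
```

After the reboot, `wait_for_ssm_online i-xxxxxxxxxxxxxxxxx` replaces manual retries: it returns 0 as soon as the node is Online, and non-zero if it never gets there within the polling window.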
4. Rescue Disk Recovery
Use this path when Session Manager and SSH are both unavailable, or the instance reboots and immediately becomes unavailable again.
Most likely pattern:
problem container has restart policy:
--restart=always
docker compose restart: always
after EC2 reboot:
Docker starts
container starts K3S / containerd / kubelet
host memory, cgroup, iptables, or metadata access becomes unhealthy
amazon-ssm-agent cannot register or maintain a session
Recovery steps:
1. Stop the broken EC2 instance
2. Detach the root EBS volume
3. Attach that volume to a healthy rescue EC2 in the same AZ
4. Mount the old root filesystem on the rescue EC2
5. Disable Docker / containerd / the problem container from auto-starting
6. Unmount the volume
7. Attach it back to the original instance as root volume
8. Start the original instance
9. Restore SSM first, then inspect the application container
find and mount the volume
On the rescue EC2:
lsblk
sudo mkdir -p /mnt/rescue
sudo mount /dev/nvme1n1p1 /mnt/rescue
If the root volume uses a different device name or partition, identify it first:
lsblk -f
sudo file -s /dev/nvme1n1
sudo mount /dev/nvme1n1p1 /mnt/rescue   # substitute the partition reported by lsblk -f
For LVM-based images:
sudo vgscan
sudo vgchange -ay
lsblk
Then mount the logical volume shown by lsblk.
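Before changing anything under /mnt/rescue, it is worth confirming that the mounted partition really is the old root filesystem and not a boot or data partition. The check below is a minimal sketch. The function name looks_like_rootfs and the two marker paths it tests are assumptions about a typical systemd-based Linux image.

```shell
# Succeed only if the directory looks like a systemd-based Linux root filesystem:
# it must contain etc/systemd and an etc/os-release file (or symlink).
looks_like_rootfs() {
  [ -d "$1/etc/systemd" ] || return 1
  [ -e "$1/etc/os-release" ] || [ -L "$1/etc/os-release" ]
}
```

Usage: `looks_like_rootfs /mnt/rescue && echo "old root filesystem is mounted"`. If the mount itself fails on an XFS volume with a duplicate UUID error, which can happen when the rescue instance was launched from the same AMI, retrying with `sudo mount -o nouuid` usually resolves it.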
disable Docker startup in the old OS
Mask Docker and containerd in the mounted filesystem:
sudo mkdir -p /mnt/rescue/etc/systemd/system
sudo ln -sf /dev/null /mnt/rescue/etc/systemd/system/docker.service
sudo ln -sf /dev/null /mnt/rescue/etc/systemd/system/containerd.service
This prevents Docker from starting before you can log in.
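The same masking step can be wrapped in a small helper that also verifies the result, which is useful when several units need to be masked. This is a sketch with hypothetical function names (mask_unit_offline, list_masked_offline). It assumes GNU find and readlink on the rescue host and must run from a root shell, because the mounted filesystem is owned by root.

```shell
# Mask a unit inside an offline root by linking it to /dev/null,
# then confirm the link points where expected.
# $1 = mount point of the old root, $2 = unit name (e.g. docker.service)
mask_unit_offline() {
  root=$1 unit=$2
  mkdir -p "$root/etc/systemd/system" &&
    ln -sfn /dev/null "$root/etc/systemd/system/$unit" &&
    [ "$(readlink "$root/etc/systemd/system/$unit")" = "/dev/null" ]
}

# List every unit currently masked in the offline root.
list_masked_offline() {
  find "$1/etc/systemd/system" -maxdepth 1 -type l -lname /dev/null \
    -exec basename {} \;
}
```

For example, `mask_unit_offline /mnt/rescue docker.service` followed by `list_masked_offline /mnt/rescue` gives a record of what has to be unmasked later.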
If the application uses compose under a known directory, inspect it before booting:
sudo find /mnt/rescue -name 'docker-compose.yml' -o -name 'compose.yml'
sudo find /mnt/rescue -name '*.service' | grep -Ei 'docker|compose|ninedata|k3s'
If there is a custom systemd unit that starts the container, mask it too:
sudo ln -sf /dev/null /mnt/rescue/etc/systemd/system/ninedata.service
Only use the exact unit name that exists on the old system.
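Two quick offline checks support the steps above: confirming that a unit name really exists before masking it, and finding compose files that would bring the container back automatically. Both functions are illustrative sketches with hypothetical names. The unit search covers only the three standard systemd directories, and the compose check is a plain text match on the restart key, not a YAML parse.

```shell
# Succeed if the named unit file exists in one of the standard systemd
# unit directories inside the offline root.
unit_exists_offline() {
  for dir in etc/systemd/system usr/lib/systemd/system lib/systemd/system; do
    [ -e "$1/$dir/$2" ] && return 0
  done
  return 1
}

# Print compose files under a directory that set an auto-restart policy
# (restart: always or restart: unless-stopped).
compose_autorestart_files() {
  find "$1" \( -name 'docker-compose.yml' -o -name 'compose.yml' \) -type f \
    -exec grep -lE 'restart:[[:space:]]*"?(always|unless-stopped)' {} +
}
```

For example, run `unit_exists_offline /mnt/rescue ninedata.service` before creating the /dev/null link, and `compose_autorestart_files /mnt/rescue` to list compose files worth editing before the next boot.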
unmount and restore
sync
sudo umount /mnt/rescue
Detach the volume from the rescue instance, attach it back to the original EC2 as the root device (using the original root device name, such as /dev/xvda or /dev/sda1), then start the original EC2.
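The stop, detach, attach, and start steps of this section can also be driven from the AWS CLI. The following is a sketch under stated assumptions: the AWS CLI is configured, both instances are in the same Availability Zone, and the helper names root_volume_of and move_volume are hypothetical. The root device name should be recorded before detaching, because the volume must be re-attached under exactly that name to act as the root device again.

```shell
# Print the root device name and root volume ID of an instance, tab-separated.
root_volume_of() {
  dev=$(aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].RootDeviceName' --output text) || return 1
  vol=$(aws ec2 describe-instances --instance-ids "$1" \
    --query "Reservations[0].Instances[0].BlockDeviceMappings[?DeviceName=='$dev'].Ebs.VolumeId | [0]" \
    --output text) || return 1
  printf '%s\t%s\n' "$dev" "$vol"
}

# Move a volume to another instance: detach, wait, attach, wait.
# $1 = volume ID, $2 = target instance ID, $3 = device name on the target
move_volume() {
  aws ec2 detach-volume --volume-id "$1" >/dev/null &&
    aws ec2 wait volume-available --volume-ids "$1" &&
    aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3" >/dev/null &&
    aws ec2 wait volume-in-use --volume-ids "$1"
}

# Example sequence (all IDs and variable names are placeholders):
#   aws ec2 stop-instances --instance-ids "$BROKEN"
#   aws ec2 wait instance-stopped --instance-ids "$BROKEN"
#   root_volume_of "$BROKEN"                  # note ROOT_DEV and VOL
#   move_volume "$VOL" "$RESCUE" /dev/sdf     # then mount, fix, unmount
#   move_volume "$VOL" "$BROKEN" "$ROOT_DEV"
#   aws ec2 start-instances --instance-ids "$BROKEN"
```

The wait subcommands keep each step from racing ahead of the previous one, which is the most common cause of failed attach calls when the sequence is run by hand.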
5. After Login
Once Session Manager or SSH works again, check SSM first:
sudo systemctl status amazon-ssm-agent
sudo systemctl restart amazon-ssm-agent
sudo journalctl -u amazon-ssm-agent -n 100 --no-pager
Check instance metadata and IAM credentials:
TOKEN=$(curl -s -X PUT \
"http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s \
-H "X-aws-ec2-metadata-token: ${TOKEN}" \
http://169.254.169.254/latest/meta-data/iam/security-credentials/
Expected result:
returns IAM role name
does not hang
does not return 404
Check Docker state:
sudo systemctl status docker
docker ps -a
docker inspect ninedata --format '{{.HostConfig.RestartPolicy.Name}}'
Disable restart policy before testing:
docker update --restart=no ninedata
If Docker was masked during rescue, unmask it only after SSM is stable:
sudo systemctl unmask containerd
sudo systemctl unmask docker
sudo systemctl start docker
6. Container With K3S
Containers that run K3S, kubelet, or containerd inside Docker are host-sensitive. If they need --cgroupns=host, they can also affect host resources and networking, especially when combined with elevated privileges.
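For a container that already exists, the host-sensitive settings can be read from its configuration instead of from memory of the original docker run command. This is a minimal sketch. The function names are illustrative, and the CgroupnsMode field assumes a Docker version that supports --cgroupns (20.10 or later); on older versions the field may be empty.

```shell
# Print the host-sensitive settings of a container on one line:
# privileged flag, cgroup namespace mode, network mode, restart policy.
container_risk_flags() {
  docker inspect "$1" --format \
    'privileged={{.HostConfig.Privileged}} cgroupns={{.HostConfig.CgroupnsMode}} network={{.HostConfig.NetworkMode}} restart={{.HostConfig.RestartPolicy.Name}}'
}

# Succeed if the container runs privileged or shares the host's
# cgroup or network namespace.
container_is_host_sensitive() {
  container_risk_flags "$1" |
    grep -Eq 'privileged=true|cgroupns=host|network=host'
}
```

For example, `container_is_host_sensitive ninedata && container_risk_flags ninedata` prints the settings only when at least one of them gives the container direct influence over the host.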
Risk checklist:
container uses:
--privileged
--cgroupns=host
--network=host
hostPath mounts into /sys, /var, /run, /etc
K3S / kubelet / containerd inside container
restart policy always / unless-stopped
host symptoms:
Session Manager cannot connect
SSH cannot connect
amazon-ssm-agent unhealthy
high memory usage or OOM
iptables / routing changed
IMDS request to 169.254.169.254 hangs
Safer test run:
docker run -d \
--name ninedata \
--restart=no \
--cgroupns=host \
[OTHER_OPTIONS] \
ninedata-image:tag
Do not enable --restart=always until:
SSM Agent remains online
memory is stable
docker logs are clean
metadata endpoint works
iptables / route table are understood
For small EC2 instances, avoid running K3S-in-container workloads on the same host that you depend on for emergency access. Use a larger instance or a dedicated test host.
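Two of the conditions above, SSM Agent health and a working metadata endpoint, can be checked mechanically while the test container is running. The sketch below assumes the agent runs as the amazon-ssm-agent systemd unit (snap-based installs use a different unit name) and that IMDSv2 is reachable at the standard link-local address. The function names are hypothetical. The short curl timeouts turn a hanging IMDS request into a failure, and -f does the same for a 404.

```shell
# Print the IAM role name from IMDSv2; fail on timeout or HTTP error.
imds_role_name() {
  token=$(curl -sf --max-time 3 -X PUT \
    "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  curl -sf --max-time 3 -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
}

# Check the access-critical conditions once and print one line per check.
# Returns non-zero if any check fails.
access_gate() {
  failed=0
  if systemctl is-active --quiet amazon-ssm-agent; then
    echo "PASS ssm-agent"
  else
    echo "FAIL ssm-agent"; failed=1
  fi
  if imds_role_name >/dev/null; then
    echo "PASS imds"
  else
    echo "FAIL imds"; failed=1
  fi
  return "$failed"
}
```

Running `access_gate` every 30 seconds or so for several minutes after starting the container gives a simple signal: if it ever fails, stop the container before it takes emergency access down with it.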
7. Production Checklist
Before running the workload again:
EC2:
instance profile attached
AmazonSSMManagedInstanceCore attached
IMDS enabled
SSM Agent running
outbound 443 to SSM endpoints works
Access:
Session Manager tested
SSH or EC2 Instance Connect fallback tested if used
no problem container auto-start before access is verified
Docker:
restart policy set to no during testing
memory limit considered
logs monitored
privileged / host namespace options documented
Recovery:
root volume snapshot exists before risky change
rescue instance available in same AZ
rollback command documented
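The "root volume snapshot exists before risky change" item can be scripted so it is not skipped under time pressure. This is a minimal sketch, assuming the AWS CLI is configured and the root volume ID is already known. The function name snapshot_volume and the default description text are illustrative.

```shell
# Snapshot a volume, wait until the snapshot completes, and print its ID.
# $1 = volume ID, $2 = optional description
snapshot_volume() {
  snap=$(aws ec2 create-snapshot --volume-id "$1" \
    --description "${2:-pre-change snapshot}" \
    --query 'SnapshotId' --output text) || return 1
  aws ec2 wait snapshot-completed --snapshot-ids "$snap" || return 1
  echo "$snap"
}
```

Recording the printed snapshot ID next to the documented rollback command keeps the two halves of the recovery plan together.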