Links#
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/metrics-dimensions.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices-security-preventative.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Point-in-time-recovery.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html
1. Important Points#
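DynamoDB is a serverless NoSQL database, not a relational database.
Good fit:
  key-value / document data
  high-traffic OLTP
  predictable access patterns
  low-latency reads and writes
  event / session / cart / order state / metadata
Poor fit:
  ad-hoc queries
  complex joins
  heavy analytical queries
  query dimensions that change frequently
  strong transactions that span a large number of items
Core principles:
  design the access patterns first, then the table / keys / indexes
  prefer Query; avoid Scan wherever possible
  partition keys need high cardinality and evenly distributed traffic
  a GSI adds write cost and its own capacity surface; it is not a free query helper
  hot keys / hot partitions are among the most common DynamoDB production problems
  PITR, deletion protection, and least privilege should be on by default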
2. Service Configuration#
table#
| Item | Recommendation |
| --- | --- |
| Table name | <env>-<service>-<entity>, for example prod-order-orders |
| Primary key | prefer a composite key: pk + sk |
| Billing mode | PAY_PER_REQUEST for unpredictable traffic; provisioned + auto scaling for steady, high-volume traffic |
| Deletion protection | enable by default on production tables |
| PITR | enable by default on production tables; recovery window is 1-35 days |
| TTL | enable for temporary data / sessions / cache / event retention |
| SSE | everything is encrypted at rest by default; use a customer managed KMS key for sensitive data |
| Tags | env, service, owner, cost-center, data-classification |
table config checklist:
deletion protection enabled
PITR enabled
tags complete
billing mode reviewed
table quota reviewed
GSI quota reviewed
CloudWatch alarms created
resource policy / IAM reviewed
capacity mode#
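| Mode | When To Use | Notes |
| --- | --- | --- |
| On-demand | new workloads, unpredictable traffic, low operational overhead is the priority | default per-table quotas still apply; confirm Service Quotas before an expected surge |
| Provisioned | steady traffic, cost-sensitive, predictable peaks | needs auto scaling / reserved capacity / alarms |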
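on-demand:
  pros:
    no need to estimate RCU / WCU up front
    suits spiky and early-stage workloads
  caveats:
    throughput is not unlimited
    a hot partition still throttles
    load test and check quotas ahead of big promotions / migrations
provisioned:
  pros:
    cost is more controllable
    can be cheaper for steady, high-volume traffic
  caveats:
    auto scaling lags behind load because of CloudWatch evaluation and UpdateTable delays
    a sudden spike may throttle before capacity scales up
    each GSI needs its own capacity and auto scaling configuration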
quotas#
Common quotas:
  item max size: 400 KB
  LSI per table: 5
  GSI per table: default 20
  projected attributes across secondary indexes: 100
  tables per account per region: default 2500
  default table-level throughput quota:
    on-demand: 40000 read request units / 40000 write request units
    provisioned: 40000 RCU / 40000 WCU
  default account-level provisioned quota:
    80000 RCU / 80000 WCU per region
  hot partition limit:
    one partition max:
      3000 RCU / second
      1000 WCU / second
Notes:
  quotas are per region
  quotas marked "default" are not architectural ceilings and can be raised on request; the 400 KB item size, the 5 LSIs per table, and the per-partition limits are fixed
  raising the table quota alone does not fix a hot partition
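As a rough sizing aid, the per-partition ceilings above give a lower bound on how many distinct key values (shards) one hot logical key needs. This is a minimal sketch that assumes traffic spreads evenly across shards; the function name `min_shards` and the sample numbers are illustrative and not part of any AWS API.

```python
import math

# Per-partition ceilings quoted in the quota list above.
PARTITION_MAX_RCU = 3000
PARTITION_MAX_WCU = 1000


def min_shards(peak_rcu: float, peak_wcu: float) -> int:
    """Lower bound on shards needed to keep one logical key under the ceilings."""
    by_read = math.ceil(peak_rcu / PARTITION_MAX_RCU)
    by_write = math.ceil(peak_wcu / PARTITION_MAX_WCU)
    return max(1, by_read, by_write)


# A tenant peaking at 2000 RCU/s and 4500 WCU/s is write-bound.
print(min_shards(peak_rcu=2000, peak_wcu=4500))  # 5
```

Treat the result as a floor: distinct key values are not guaranteed to land on distinct partitions, so leave headroom and confirm with a load test.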
3. Data Modeling Best Practices#
access pattern first#
Write down the queries first:
  1. get order by order_id
  2. list orders by user_id and created_at
  3. list unpaid orders by tenant_id and created_at
  4. update order status if version matched
  5. expire session after 7 days
Then design the keys:
  table:
    pk = USER#<user_id>
    sk = ORDER#<created_at>#<order_id>
  gsi1:
    gsi1pk = ORDER#<order_id>
    gsi1sk = META
  gsi2:
    gsi2pk = TENANT#<tenant_id>#STATUS#UNPAID
    gsi2sk = CREATED#<created_at>#ORDER#<order_id>
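The key layout above can be kept in one place in application code so that every writer formats keys identically. A minimal sketch, assuming string IDs and an ISO 8601 UTC timestamp; the helper name `order_keys` and the sample values are hypothetical.

```python
def order_keys(user_id: str, tenant_id: str, order_id: str, created_at: str) -> dict:
    """Build every key attribute for one unpaid order item."""
    return {
        # base table: list orders by user, ordered by creation time
        "pk": f"USER#{user_id}",
        "sk": f"ORDER#{created_at}#{order_id}",
        # gsi1: get order by order_id
        "gsi1pk": f"ORDER#{order_id}",
        "gsi1sk": "META",
        # gsi2: list unpaid orders by tenant and creation time
        "gsi2pk": f"TENANT#{tenant_id}#STATUS#UNPAID",
        "gsi2sk": f"CREATED#{created_at}#ORDER#{order_id}",
    }


keys = order_keys("u-1001", "t-01", "o-1001", "2026-05-29T10:00:00Z")
print(keys["sk"])  # ORDER#2026-05-29T10:00:00Z#o-1001
```

Putting the timestamp before the order ID in the sort key is what makes time-range queries possible; the order ID only breaks ties.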
partition key#
good partition key:
  high cardinality
  traffic evenly distributed
  queries can target the partition key directly
  does not concentrate all writes on a single value
bad partition key:
  status
  date only
  boolean flag
  tenant_id only, if one tenant can be very hot
  country / region only
  fixed value like ORDER / USER
hot key example:
  pk = TENANT#big_customer
  all writes and reads go to one partition key
mitigation:
  write sharding:
    pk = TENANT#big_customer#SHARD#00..15
  time bucket:
    pk = TENANT#big_customer#DAY#2026-05-29
  split access pattern:
    write path uses the sharded key
    read path queries every shard and merges the results
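Write sharding needs two pieces: a stable way to pick a shard on the write path and the full shard list on the read path. A minimal sketch, assuming 16 shards as in the example above and a hash of a per-item ID as the shard selector; `sharded_pk` and `all_shard_pks` are hypothetical helper names.

```python
import hashlib

SHARD_COUNT = 16  # matches SHARD#00..15 in the example above


def sharded_pk(tenant_id: str, item_id: str, shard_count: int = SHARD_COUNT) -> str:
    """Write path: derive a stable shard from the item ID."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"TENANT#{tenant_id}#SHARD#{shard:02d}"


def all_shard_pks(tenant_id: str, shard_count: int = SHARD_COUNT) -> list:
    """Read path: every partition key to query before merging results."""
    return [f"TENANT#{tenant_id}#SHARD#{n:02d}" for n in range(shard_count)]


pk = sharded_pk("big_customer", "o-1001")
assert pk in all_shard_pks("big_customer")
```

Hashing the item ID (rather than picking a random shard) keeps the mapping deterministic, so a later read or update of the same item can recompute its partition key.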
sort key#
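sort keys are commonly used for:
  time range
  hierarchy
  entity type
  state transition history
examples:
  ORDER#2026-05-29T10:00:00Z#<order_id>
  PROFILE
  SESSION#<session_id>
  EVENT#<timestamp>#<event_id>
query patterns:
  begins_with(sk, "ORDER#")
  sk between "ORDER#2026-05-01" and "ORDER#2026-06-01"
    (BETWEEN is inclusive and compares strings; an upper bound of "ORDER#2026-05-31" would miss keys such as "ORDER#2026-05-31T10:00:00Z#<order_id>")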
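Range bounds on a string sort key are compared byte by byte, so a bound that stops at the last day of the month sorts before any key carrying a time on that day. A minimal sketch, assuming ISO 8601 dates embedded after a prefix; `month_range` is a hypothetical helper. Python compares ASCII strings in the same order as DynamoDB compares their UTF-8 bytes, so the comparisons below mirror the query behaviour for these keys.

```python
def month_range(prefix: str, year: int, month: int) -> tuple:
    """Return (low, high) bounds covering every sort key in the given month."""
    low = f"{prefix}{year:04d}-{month:02d}-01"
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    # The first day of the next month is a safe upper bound: real keys for that
    # day carry a time suffix and therefore sort after the bare date.
    high = f"{prefix}{next_year:04d}-{next_month:02d}-01"
    return low, high


low, high = month_range("ORDER#", 2026, 5)
last_day_key = "ORDER#2026-05-31T23:59:59Z#o-1"
print(low <= last_day_key <= high)            # True
print(last_day_key <= "ORDER#2026-05-31")     # False: this bound would miss it
```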
single table design#
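single table fits when:
  access patterns are stable
  entity relationships are clear
  related data must be updated transactionally or queried together
  the team understands DynamoDB modeling
single table does not fit when:
  business queries change often
  the team has no experience maintaining it
  ease of debugging / analytics matters more
  the only motivation is having one table
practical rule:
  single table is fine, but do not cram every system into one table
  table boundaries can follow bounded contexts / services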
4. Query / Write Best Practices#
Query vs Scan#
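Query:
  must specify the partition key
  can add a sort key condition
  suitable for online requests
Scan:
  reads many items
  a filter expression is applied after the items are read, so it does not reduce consumed read capacity
  only suitable for backfill / admin jobs / small tables
production rule:
  do not run a full table scan on the API request path
  if a scan is unavoidable:
    limit page size
    use pagination
    run in background
    rate limit
    monitor consumed capacity and throttle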
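The scan rules above (small pages, pagination, a pause between pages) can be sketched as a generator. This is an illustration under assumptions: `scan_page` is a hypothetical callable you would back with a real Scan call that passes `Limit` and, when a start key is present, `ExclusiveStartKey`; the in-memory `fake_scan_page` exists only to exercise the loop.

```python
import time
from typing import Callable, Iterator, Optional


def scan_in_pages(
    scan_page: Callable[[Optional[dict], int], dict],
    page_size: int = 100,
    pause_seconds: float = 0.2,
) -> Iterator[dict]:
    """Yield items page by page, pausing between pages as a crude rate limit."""
    start_key: Optional[dict] = None
    while True:
        page = scan_page(start_key, page_size)
        yield from page.get("Items", [])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break  # no LastEvaluatedKey means the scan is complete
        time.sleep(pause_seconds)


# In-memory stand-in for a real Scan call, used only to exercise the loop.
_DATA = [{"id": n} for n in range(5)]


def fake_scan_page(start_key: Optional[dict], limit: int) -> dict:
    offset = 0 if start_key is None else start_key["offset"]
    page = {"Items": _DATA[offset:offset + limit]}
    if offset + limit < len(_DATA):
        page["LastEvaluatedKey"] = {"offset": offset + limit}
    return page


items = list(scan_in_pages(fake_scan_page, page_size=2, pause_seconds=0))
print(len(items))  # 5
```

A fixed pause is the simplest throttle; a background job would more likely adapt the pause to the consumed capacity reported for each page.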
read consistency#
eventually consistent read:
  default
  cost = 0.5 RCU per read of up to 4 KB
  suitable for most read paths
strongly consistent read:
  cost = 1 RCU per read of up to 4 KB
  supported only on the base table and LSIs
  GSIs and Streams do not support strongly consistent reads
transactional read:
  cost = 2 RCU per read of up to 4 KB
  use only when real transaction semantics are needed
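The three read costs above differ only in the multiplier applied to each 4 KB block. A minimal sketch for a single-item read, assuming the item size in bytes is known; the function name and mode labels are illustrative.

```python
import math

RCU_PER_4KB = {"eventual": 0.5, "strong": 1.0, "transactional": 2.0}


def read_capacity_units(item_bytes: int, mode: str = "eventual") -> float:
    """RCUs consumed by reading one item of the given size."""
    blocks = max(1, math.ceil(item_bytes / 4096))  # size rounds up to 4 KB blocks
    return blocks * RCU_PER_4KB[mode]


print(read_capacity_units(4096))                         # 0.5
print(read_capacity_units(4097, "strong"))               # 2.0
print(read_capacity_units(10 * 1024, "transactional"))   # 6.0
```

Crossing a 4 KB boundary by a single byte doubles the cost of a small read, which is one reason to measure item sizes before launch.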
write#
write capacity:
  1 WCU = one write per second for an item up to 1 KB
  item size is rounded up to the next 1 KB
  each GSI the write touches adds its own write cost
best practices:
  use UpdateExpression instead of rewriting a large item
  keep items small
  avoid huge attributes that change frequently
  do not store large blobs in DynamoDB; store them in S3 and keep a pointer
  use ReturnConsumedCapacity during tests to estimate cost
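Write cost can be estimated the same way before measuring it with ReturnConsumedCapacity. A minimal sketch, assuming sizes in bytes; it covers only the simple case of putting a brand-new item that appears in each listed GSI (one index write per GSI, sized by the projected attributes), and the function names are illustrative. Transactional writes cost twice the standard amount.

```python
import math


def write_capacity_units(item_bytes: int, transactional: bool = False) -> int:
    """WCUs for one write of an item of the given size."""
    units = max(1, math.ceil(item_bytes / 1024))  # size rounds up to 1 KB units
    return units * 2 if transactional else units


def new_item_put_cost(item_bytes: int, projected_bytes_per_gsi: list) -> int:
    """Base-table write plus one write into each GSI that receives the new item."""
    index_cost = sum(write_capacity_units(size) for size in projected_bytes_per_gsi)
    return write_capacity_units(item_bytes) + index_cost


print(write_capacity_units(1025))                # 2
print(new_item_put_cost(3500, [200, 1500]))      # 4 + 1 + 2 = 7
```

Updates that change an indexed key attribute can cost more per index than this estimate, so treat it as a lower bound for planning.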
conditional write#
use condition expression for correctness:
create only if not exists
update only if version matched
decrement stock only if quantity > 0
avoid lost update
aws dynamodb update-item \
--table-name prod-order-orders \
--key '{"pk":{"S":"USER#u-1001"},"sk":{"S":"ORDER#2026-05-29T10:00:00Z#o-1001"}}' \
--update-expression "SET #status = :paid, version = version + :one" \
--condition-expression "version = :expected AND #status = :pending" \
--expression-attribute-names '{"#status":"status"}' \
--expression-attribute-values '{
":paid":{"S":"PAID"},
":pending":{"S":"PENDING"},
":expected":{"N":"1"},
":one":{"N":"1"}
}' \
--return-values ALL_NEW
retries#
client should retry:
  ProvisionedThroughputExceededException
  ThrottlingException
  InternalServerError
  ServiceUnavailable
  TransactionConflictException
retry policy:
  exponential backoff
  jitter
  max attempts
  idempotency token for write path
do not retry blindly:
  ConditionalCheckFailedException
  ValidationException
  AccessDeniedException
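The retry policy above combines an error classification with capped exponential backoff and jitter. A minimal sketch using the "full jitter" variant (a uniform delay between zero and the capped exponential); the base delay, cap, and attempt limit are assumed values, and the AWS SDKs already ship their own retry logic, so this only illustrates the shape of the policy.

```python
import random

RETRYABLE = {
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
    "InternalServerError",
    "ServiceUnavailable",
    "TransactionConflictException",
}


def should_retry(error_code: str, attempt: int, max_attempts: int = 5) -> bool:
    """attempt is zero-based; stop once max_attempts calls have been made."""
    return error_code in RETRYABLE and attempt + 1 < max_attempts


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0,
                  rng=random.random) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 4))
```

Errors such as ConditionalCheckFailedException fall outside `RETRYABLE` on purpose: repeating the same request cannot change the outcome.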
5. Index Best Practices#
GSI#
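A GSI serves a new access pattern:
  add one only when the base table key cannot support the pattern
  GSI keys must also be designed against hot partitions
  GSI reads are eventually consistent
  a GSI has its own throttling / capacity / metrics
projection:
  KEYS_ONLY:
    cheapest
    use when only the keys are needed
  INCLUDE:
    selected attributes
    good default for query result lists
  ALL:
    convenient but expensive
    high write amplification and storage cost
Notes:
  base table writes are propagated to each affected GSI asynchronously and consume that GSI's write capacity
  a GSI backfill can consume a large amount of write capacity
  a throttled GSI can throttle base table writes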
LSI#
LSI:
  must be created with the table
  same partition key as the base table
  different sort key
  supports strongly consistent reads
  max 5 per table
practical rule:
  when in doubt, do not add an LSI up front
  access patterns that may change later are usually served by a GSI
sparse index#
sparse index:
  only items that carry the GSI key attributes appear in the index
  useful for status / workflow / pending items
example:
  unpaid orders only have:
    gsi2pk = TENANT#<tenant_id>#STATUS#UNPAID
    gsi2sk = CREATED#<created_at>#ORDER#<order_id>
  paid orders remove gsi2pk / gsi2sk
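Removing an order from the sparse index when it is paid comes down to one update that sets the new status and drops the two GSI key attributes. A minimal sketch that only builds the request parameters in the shape accepted by the low-level UpdateItem API; the helper name `mark_paid_update` and the optimistic-lock condition on `version` are assumptions for illustration.

```python
def mark_paid_update(expected_version: int) -> dict:
    """UpdateItem parameters that mark an order paid and drop it from gsi2."""
    return {
        # REMOVE deletes the GSI key attributes, so the item leaves the index.
        "UpdateExpression": (
            "SET #status = :paid, version = version + :one "
            "REMOVE gsi2pk, gsi2sk"
        ),
        "ConditionExpression": "version = :expected",
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {
            ":paid": {"S": "PAID"},
            ":one": {"N": "1"},
            ":expected": {"N": str(expected_version)},
        },
    }


params = mark_paid_update(expected_version=1)
print(params["UpdateExpression"])
```

These parameters would be passed alongside the table name and the item's base-table key; the GSI entry disappears once the change propagates to the index.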
6. Security Best Practices#
IAM#
principle:
use IAM role, not long-term access key
least privilege
separate read role / write role / migration role / admin role
allow table ARN and index ARN explicitly
avoid dynamodb:* in application role
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowAppReadWriteOrders",
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:DeleteItem",
"dynamodb:Query",
"dynamodb:BatchGetItem",
"dynamodb:BatchWriteItem"
],
"Resource": [
"arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders",
"arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders/index/*"
]
}
]
}
resource-based policy#
use cases:
  cross-account access
  restrict source VPC endpoint
  central data account
Notes:
  an explicit deny always takes precedence
  policy size is limited
  for cross-account access, check both the principal's policy and the resource policy
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyAccessUnlessFromVpce",
"Effect": "Deny",
"Principal": "*",
"Action": "dynamodb:*",
"Resource": [
"arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders",
"arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders/index/*"
],
"Condition": {
"StringNotEquals": {
"aws:SourceVpce": "vpce-0123456789abcdef0"
}
}
}
]
}
VPC endpoint#
recommendation:
EC2 / ECS / Lambda in VPC -> use DynamoDB VPC endpoint
add endpoint policy
add IAM/resource policy condition with aws:SourceVpce when appropriate
gateway endpoint:
common for DynamoDB access from VPC
no NAT gateway needed for DynamoDB traffic
interface endpoint / PrivateLink:
use when architecture requires interface endpoint behavior
client may need endpoint URL configuration
encryption#
at rest:
  DynamoDB encrypts all data at rest
  the default is an AWS owned key
  sensitive / compliance workloads should use a customer managed KMS key
in transit:
  use HTTPS/TLS
application side:
  client-side encryption is an option for especially sensitive fields
  an encrypted field usually cannot serve directly as a query key
data protection#
production defaults:
PITR enabled
deletion protection enabled
CloudTrail enabled
AWS Config / Security Hub rule reviewed
backup restore runbook tested
PITR notes:
recovery window: 1-35 days
restore creates a new table
LatestRestorableDateTime usually lags current time by about 5 minutes
7. TTL / Streams / Global Tables#
TTL#
TTL:
attribute must be Number
value is Unix epoch time in seconds
expired item is deleted asynchronously
expired item can still appear before background deletion
filter expired items in read path if correctness requires it
good use cases:
session
idempotency record
temporary token
cache item
short-lived event
not good:
precise scheduled deletion
compliance deletion with strict second-level SLA
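Because deletion is asynchronous, the write path has to store the expiry as epoch seconds and the read path has to ignore items that are already past it. A minimal sketch, assuming the TTL attribute is named `expire_at` (as in the hands-on commands) and that items have already been unmarshalled into plain Python values; both helper names are hypothetical.

```python
import time
from typing import Optional


def expire_at(ttl_seconds: int, now: Optional[float] = None) -> int:
    """Epoch-seconds value to store in the TTL attribute."""
    current = time.time() if now is None else now
    return int(current) + ttl_seconds


def is_live(item: dict, now: Optional[float] = None, ttl_attr: str = "expire_at") -> bool:
    """Read-path filter: treat items past their TTL as already deleted."""
    current = time.time() if now is None else now
    value = item.get(ttl_attr)
    # Items without the TTL attribute never expire.
    return value is None or int(value) > current


session = {"pk": "SESSION#s-1", "expire_at": expire_at(7 * 86400, now=1_700_000_000)}
print(session["expire_at"])                      # 1700604800
print(is_live(session, now=1_700_000_001))       # True
print(is_live(session, now=1_700_604_800))       # False
```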
streams#
DynamoDB Streams:
near real-time change capture
common integration with Lambda
use for projection / async workflow / audit / cache invalidation
stream view types:
KEYS_ONLY
NEW_IMAGE
OLD_IMAGE
NEW_AND_OLD_IMAGES
Notes:
  the Lambda consumer must handle duplicates and retries
  downstream processing should be idempotent
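Idempotent stream processing usually means remembering which records were already applied. A minimal sketch that deduplicates on each record's `eventID`; the in-memory `seen_ids` set and the `apply` callback are assumptions for illustration, and a real consumer would keep the processed IDs in a durable store (for example a table with conditional writes) so that retries across invocations are also caught.

```python
from typing import Callable


def handle_stream_event(event: dict, seen_ids: set,
                        apply: Callable[[str, dict], None]) -> int:
    """Apply each stream record at most once; return how many were applied."""
    applied = 0
    for record in event.get("Records", []):
        event_id = record["eventID"]
        if event_id in seen_ids:
            continue  # duplicate delivery or retry: skip
        apply(record["eventName"], record["dynamodb"])
        seen_ids.add(event_id)  # mark only after the side effect succeeded
        applied += 1
    return applied


seen, log = set(), []
event = {"Records": [
    {"eventID": "1", "eventName": "INSERT", "dynamodb": {"Keys": {"pk": {"S": "USER#u-1001"}}}},
]}
print(handle_stream_event(event, seen, lambda name, data: log.append(name)))  # 1
print(handle_stream_event(event, seen, lambda name, data: log.append(name)))  # 0
```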
global tables#
global tables:
multi-region active-active
replication is eventually consistent
conflict reconciliation is last writer wins
use when:
low latency read/write in multiple regions
regional resilience is required
Notes:
  not a replacement for relational global transactions
  the conflict model must be acceptable
  monitor replication latency and pending replication
  TTL deletes that replicate may incur replicated write cost in replica regions
8. Monitoring#
important metrics#
| Area | Metrics | What To Watch |
| --- | --- | --- |
| Capacity | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits | consumed vs provisioned / quota |
| Throttle | ThrottledRequests, ReadThrottleEvents, WriteThrottleEvents | any non-zero spike in production |
| Hot partition | ReadKeyRangeThroughputThrottleEvents, WriteKeyRangeThroughputThrottleEvents | key range / partition bottleneck |
| Provisioned throttle | ReadProvisionedThroughputThrottleEvents, WriteProvisionedThroughputThrottleEvents | provisioned capacity insufficient |
| Account limit | ReadAccountLimitThrottleEvents, WriteAccountLimitThrottleEvents | account-level quota hit |
| On-demand max | ReadMaxOnDemandThroughputThrottleEvents, WriteMaxOnDemandThroughputThrottleEvents | on-demand table maximum hit |
| Latency | SuccessfulRequestLatency | p90 / p95 / p99 by operation |
| Errors | SystemErrors, UserErrors | AWS-side vs client-side errors |
| Conditional | ConditionalCheckFailedRequests | optimistic locking / business conflicts |
| Transaction | TransactionConflict | high-contention transactions |
| Result size | ReturnedItemCount (Query / Scan), ReturnedBytes (Streams GetRecords) | query efficiency, stream read volume |
| TTL | TimeToLiveDeletedItemCount | TTL deletion activity |
| GSI backfill | OnlineIndexPercentageProgress, OnlineIndexThrottleEvents, OnlineIndexConsumedWriteCapacity | impact of adding a GSI |
| Global table | ReplicationLatency, PendingReplicationCount, AgeOfOldestUnreplicatedRecord (Kinesis Data Streams for DynamoDB) | replica lag, Kinesis replication lag |
alert rules#
critical:
  SystemErrors > 0 for 5m
  ReadThrottleEvents / WriteThrottleEvents > 0 for 5m on critical table
  ReadAccountLimitThrottleEvents / WriteAccountLimitThrottleEvents > 0
  ReplicationLatency above RPO expectation
warning:
  ConsumedReadCapacityUnits > 80% provisioned for 10m
  ConsumedWriteCapacityUnits > 80% provisioned for 10m
  SuccessfulRequestLatency p95 above service SLO
  ConditionalCheckFailedRequests sudden spike
  ReturnedItemCount too high for online query
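The 80% rules compare a per-period total with a per-second setting: the consumed-capacity metrics are read with the Sum statistic over the period, while provisioned capacity is expressed per second, so the sum has to be divided by the period length first. A minimal sketch of that arithmetic; the function name and sample numbers are illustrative.

```python
def provisioned_utilization(consumed_sum: float, period_seconds: int,
                            provisioned_units: float) -> float:
    """Fraction of provisioned capacity used during one metric period."""
    if period_seconds <= 0 or provisioned_units <= 0:
        raise ValueError("period_seconds and provisioned_units must be positive")
    return consumed_sum / (period_seconds * provisioned_units)


# 4800 units consumed in a 60-second period against 100 provisioned RCU.
ratio = provisioned_utilization(4800, 60, 100)
print(ratio)           # 0.8
print(ratio >= 0.8)    # True: this period counts toward the 10-minute warning
```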
dashboard#
dashboard should include:
request latency p50 / p95 / p99 by operation
read/write consumed capacity
read/write throttle events
key range throttle events
system errors / user errors
top application errors from logs
GSI capacity and throttles
table size / item count
TTL deletes
global table replication latency, if used
Prometheus / YACE#
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/DynamoDB
      regions:
        - ap-east-1
      metrics:
        - name: ConsumedReadCapacityUnits
          statistics: [Sum]
          period: 60
          length: 300
        - name: ConsumedWriteCapacityUnits
          statistics: [Sum]
          period: 60
          length: 300
        - name: ReadThrottleEvents
          statistics: [Sum]
          period: 60
          length: 300
        - name: WriteThrottleEvents
          statistics: [Sum]
          period: 60
          length: 300
        - name: SuccessfulRequestLatency
          statistics: [Average, p95]
          period: 60
          length: 300
        - name: SystemErrors
          statistics: [Sum]
          period: 60
          length: 300
      dimensionNameRequirements:
        - TableName
9. Hands-on#
create table#
export AWS_PAGER=""
export AWS_REGION="ap-east-1"
export TABLE_NAME="dev-order-orders"
aws dynamodb create-table \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--billing-mode PAY_PER_REQUEST \
--attribute-definitions \
AttributeName=pk,AttributeType=S \
AttributeName=sk,AttributeType=S \
AttributeName=gsi1pk,AttributeType=S \
AttributeName=gsi1sk,AttributeType=S \
--key-schema \
AttributeName=pk,KeyType=HASH \
AttributeName=sk,KeyType=RANGE \
--global-secondary-indexes '[
{
"IndexName": "gsi1",
"KeySchema": [
{"AttributeName": "gsi1pk", "KeyType": "HASH"},
{"AttributeName": "gsi1sk", "KeyType": "RANGE"}
],
"Projection": {"ProjectionType": "INCLUDE", "NonKeyAttributes": ["status", "amount", "created_at"]}
}
]' \
--deletion-protection-enabled \
--tags \
Key=env,Value=dev \
Key=service,Value=order \
Key=owner,Value=platform
enable PITR#
aws dynamodb update-continuous-backups \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true,RecoveryPeriodInDays=35
aws dynamodb describe-continuous-backups \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--query 'ContinuousBackupsDescription.PointInTimeRecoveryDescription'
enable TTL#
aws dynamodb update-time-to-live \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--time-to-live-specification Enabled=true,AttributeName=expire_at
put items#
aws dynamodb put-item \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--condition-expression "attribute_not_exists(pk) AND attribute_not_exists(sk)" \
--item '{
"pk": {"S": "USER#u-1001"},
"sk": {"S": "ORDER#2026-05-29T10:00:00Z#o-1001"},
"gsi1pk": {"S": "ORDER#o-1001"},
"gsi1sk": {"S": "META"},
"order_id": {"S": "o-1001"},
"user_id": {"S": "u-1001"},
"status": {"S": "PENDING"},
"amount": {"N": "99.90"},
"version": {"N": "1"},
"created_at": {"S": "2026-05-29T10:00:00Z"},
"expire_at": {"N": "1790599200"}
}'
query by user#
aws dynamodb query \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--key-condition-expression "pk = :pk AND begins_with(sk, :prefix)" \
--expression-attribute-values '{
":pk": {"S": "USER#u-1001"},
":prefix": {"S": "ORDER#"}
}' \
--return-consumed-capacity TOTAL
query by order id through GSI#
aws dynamodb query \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--index-name gsi1 \
--key-condition-expression "gsi1pk = :pk AND gsi1sk = :sk" \
--expression-attribute-values '{
":pk": {"S": "ORDER#o-1001"},
":sk": {"S": "META"}
}' \
--return-consumed-capacity TOTAL
cloudwatch alarms#
aws cloudwatch put-metric-alarm \
--region "${AWS_REGION}" \
--alarm-name "dynamodb-${TABLE_NAME}-write-throttle" \
--namespace AWS/DynamoDB \
--metric-name WriteThrottleEvents \
--statistic Sum \
--period 60 \
--evaluation-periods 5 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--dimensions Name=TableName,Value="${TABLE_NAME}"
# SystemErrors is published with TableName + Operation dimensions, so this alarm
# targets one operation; repeat it for each operation that matters.
aws cloudwatch put-metric-alarm \
--region "${AWS_REGION}" \
--alarm-name "dynamodb-${TABLE_NAME}-system-errors" \
--namespace AWS/DynamoDB \
--metric-name SystemErrors \
--statistic Sum \
--period 60 \
--evaluation-periods 5 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--dimensions Name=TableName,Value="${TABLE_NAME}" Name=Operation,Value=PutItem
aws cloudwatch put-metric-alarm \
--region "${AWS_REGION}" \
--alarm-name "dynamodb-${TABLE_NAME}-p95-latency" \
--namespace AWS/DynamoDB \
--metric-name SuccessfulRequestLatency \
--extended-statistic p95 \
--period 60 \
--evaluation-periods 5 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--dimensions Name=TableName,Value="${TABLE_NAME}" Name=Operation,Value=Query
cleanup#
# do not run cleanup directly against a production table
aws dynamodb update-table \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}" \
--no-deletion-protection-enabled
aws dynamodb delete-table \
--region "${AWS_REGION}" \
--table-name "${TABLE_NAME}"
10. Production Checklist#
before launch:
access patterns reviewed
partition key distribution tested
item size measured
ReturnConsumedCapacity sampled in load test
on-demand/provisioned mode decided
Service Quotas checked
PITR enabled
deletion protection enabled
IAM least privilege reviewed
KMS key policy reviewed, if using customer managed key
VPC endpoint / resource policy reviewed
CloudWatch alarms created
dashboard created
backup restore tested
migration / backfill has rate limit
when incident happens:
check throttle reason first
check table and GSI metrics separately
check key range throttle events for hot partition
check account limit throttle events for quota issue
check app retry and timeout
check recent GSI creation / backfill / migration