AWS DynamoDB

Links#

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/metrics-dimensions.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices-security-preventative.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Point-in-time-recovery.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html

1. Important Points#

DynamoDB is a serverless NoSQL database, not a relational database:
    good fit:
        key-value / document
        high traffic OLTP
        predictable access pattern
        low latency read/write
        event / session / cart / order state / metadata

    poor fit:
        ad-hoc query
        complex join
        heavy analytical query
        query dimensions that change frequently
        strong transactions spanning a large number of items

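core principles:
    design access patterns first, then the table / key / index
    prefer Query; avoid Scan wherever possible
    partition key should have high cardinality and evenly distributed traffic
    a GSI adds write cost and its own capacity surface; it is not a free query helper
    hot key / hot partition is among the most common DynamoDB production problems
    PITR / deletion protection / least privilege should be on by default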

2. Service Configuration#

table#

Item	Recommendation
Table name	`<env>-<service>-<entity>`, e.g. `prod-order-orders`
Primary key	prefer a composite key: `pk` + `sk`
Billing mode	`PAY_PER_REQUEST` for unpredictable traffic; provisioned + auto scaling for steady high traffic
Deletion protection	enabled by default on production tables
PITR	enabled by default on production tables; recovery window `1-35` days
TTL	enable for temporary data / session / cache / event retention
SSE	everything is encrypted at rest by default; use a customer managed KMS key for sensitive data
Tags	`env`, `service`, `owner`, `cost-center`, `data-classification`

table config checklist:
    deletion protection enabled
    PITR enabled
    tags complete
    billing mode reviewed
    table quota reviewed
    GSI quota reviewed
    CloudWatch alarms created
    resource policy / IAM reviewed

capacity mode#

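Mode	When To Use	Notes
On-demand	new workloads, unpredictable traffic, low operational overhead as the priority	default per-table quota still applies; confirm Service Quotas before an expected surge
Provisioned	steady traffic, cost sensitive, predictable peaks	needs auto scaling / reserved capacity / alarms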

on-demand:
    pros:
        no need to estimate RCU / WCU up front
        fits spike / early stage workload

    notes:
        throughput is not unlimited
        hot partition still throttles
        load test and check quota ahead of a big promotion / migration

provisioned:
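    pros:
        cost is more controllable
        can be cheaper for steady high traffic

    notes:
        auto scaling lags behind load because of CloudWatch evaluation and UpdateTable latency
        a sudden spike may throttle before capacity scales out
        each GSI needs its own capacity and auto scaling configuration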

quotas#

常见 quota:
    item max size: 400 KB
    LSI per table: 5
    GSI per table: default 20
    projected attributes across secondary indexes: 100
    tables per account per region: default 2500
    default table-level throughput quota:
        on-demand: 40000 read request units / 40000 write request units
        provisioned: 40000 RCU / 40000 WCU
    default account-level provisioned quota:
        80000 RCU / 80000 WCU per region

hot partition limit:
    one partition max:
        3000 RCU / second
        1000 WCU / second

notes:
    quotas are per region
    default quotas are not architectural limits; increases can be requested
    hot partition cannot be fixed just by raising the table quota
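
The per-partition limits above give a quick lower bound on how many key shards a single hot logical key needs. The following is a rough sketch, assuming traffic spreads evenly across shards; the function name and constants are illustrative, not part of any AWS SDK.

```python
import math

# Per-partition limits quoted in the notes above.
PARTITION_MAX_RCU = 3000
PARTITION_MAX_WCU = 1000


def min_shards_for_hot_key(peak_rcu: float, peak_wcu: float) -> int:
    """Smallest shard count that keeps one logical key under both
    per-partition limits, assuming perfectly even traffic across shards."""
    by_reads = math.ceil(peak_rcu / PARTITION_MAX_RCU)
    by_writes = math.ceil(peak_wcu / PARTITION_MAX_WCU)
    return max(1, by_reads, by_writes)
```

Real traffic is rarely perfectly even, so treating the result as a floor and adding headroom is probably wise.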

3. Data Modeling Best Practices#

access pattern first#

write down the queries first:
    1. get order by order_id
    2. list orders by user_id and created_at
    3. list unpaid orders by tenant_id and created_at
    4. update order status if version matched
    5. expire session after 7 days

then design the keys:
    table:
        pk = USER#<user_id>
        sk = ORDER#<created_at>#<order_id>

    gsi1:
        gsi1pk = ORDER#<order_id>
        gsi1sk = META

    gsi2:
        gsi2pk = TENANT#<tenant_id>#STATUS#UNPAID
        gsi2sk = CREATED#<created_at>#ORDER#<order_id>
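
The key layout above can be captured in one small builder so every writer formats keys the same way. This is a minimal sketch; the function name, its parameters and the `"UNPAID"` status literal are assumptions taken from the key templates, not an existing library.

```python
def order_item_keys(tenant_id: str, user_id: str, order_id: str,
                    created_at: str, status: str) -> dict:
    """Build table and index key attributes for one order item.

    created_at is expected to be an ISO-8601 UTC string so that sort keys
    order chronologically under plain string comparison.
    """
    keys = {
        "pk": f"USER#{user_id}",
        "sk": f"ORDER#{created_at}#{order_id}",
        "gsi1pk": f"ORDER#{order_id}",
        "gsi1sk": "META",
    }
    # gsi2 attributes exist only while the order is unpaid, so other
    # orders never appear in that index.
    if status == "UNPAID":
        keys["gsi2pk"] = f"TENANT#{tenant_id}#STATUS#UNPAID"
        keys["gsi2sk"] = f"CREATED#{created_at}#ORDER#{order_id}"
    return keys
```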

partition key#

good partition key:
    high cardinality
    traffic evenly distributed
    queries can hit the partition key directly
    does not concentrate all writes on a single value

bad partition key:
    status
    date only
    boolean flag
    tenant_id only, if one tenant can be very hot
    country / region only
    fixed value like ORDER / USER

hot key example:
    pk = TENANT#big_customer
    all write/read goes to one partition key

mitigation:
    write sharding:
        pk = TENANT#big_customer#SHARD#00..15

    time bucket:
        pk = TENANT#big_customer#DAY#2026-05-29

    split access pattern:
        write path use sharded key
        read path query multiple shards and merge
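
Write sharding as described above can be sketched with a deterministic hash, so that the same item always maps to the same shard. The shard count of 16 mirrors the `SHARD#00..15` example; `item_id` and both function names are hypothetical.

```python
import hashlib

SHARD_COUNT = 16  # matches the SHARD#00..15 example; tune per workload


def sharded_pk(tenant_id: str, item_id: str, shard_count: int = SHARD_COUNT) -> str:
    """Pick a shard deterministically from item_id so repeated writes of the
    same item land on the same partition key."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"TENANT#{tenant_id}#SHARD#{shard:02d}"


def all_shard_pks(tenant_id: str, shard_count: int = SHARD_COUNT) -> list:
    """Partition keys the read path has to query and then merge."""
    return [f"TENANT#{tenant_id}#SHARD#{i:02d}" for i in range(shard_count)]
```

The read path would issue one Query per key returned by `all_shard_pks` and merge the results, which is the cost paid for spreading the writes.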

sort key#

sort key is commonly used for:
    time range
    hierarchy
    entity type
    state transition history

examples:
    ORDER#2026-05-29T10:00:00Z#<order_id>
    PROFILE
    SESSION#<session_id>
    EVENT#<timestamp>#<event_id>

query patterns:
    begins_with(sk, "ORDER#")
    sk between "ORDER#2026-05-01" and "ORDER#2026-06-01"
    (upper bound "ORDER#2026-05-31" would exclude May 31: "ORDER#2026-05-31T..." sorts after it)
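
Because sort keys are compared as strings, the bounds of a time-range query need some care. The sketch below computes bounds for one calendar month, assuming sort keys shaped like `ORDER#<ISO-8601 timestamp>#<order_id>`; the function name is illustrative.

```python
from datetime import date


def month_sort_key_bounds(year: int, month: int) -> tuple:
    """Return (low, high) for `sk BETWEEN low AND high` covering one month.

    The upper bound is the first day of the NEXT month: a key from the last
    day, such as "ORDER#2026-05-31T10:00:00Z#o-1", sorts after
    "ORDER#2026-05-31" but before "ORDER#2026-06-01".
    """
    first = date(year, month, 1)
    nxt = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return f"ORDER#{first.isoformat()}", f"ORDER#{nxt.isoformat()}"
```

Usage: `low, high = month_sort_key_bounds(2026, 5)` yields the two values to bind in the key condition expression.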

single table design#

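single table fits when:
    access patterns are stable
    entity relationships are clear
    related data must be updated transactionally or queried together
    the team understands DynamoDB modeling

single table does not fit when:
    business queries change frequently
    the team has no experience maintaining it
    debugging / analytics matter more
    the only goal is to have one table

practical rule:
    single table is fine, but do not cram every system into one table
    table boundaries can follow bounded context / service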

4. Query / Write Best Practices#

Query vs Scan#

Query:
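    partition key must be specified
    sort key condition is optional
    suited to online requests

Scan:
    reads many items
    filter expression is applied after the read; it does not reduce consumed read capacity
    only suited to backfill / admin job / small table

production rule:
    do not run a full table scan on the API request path
    if a scan is unavoidable: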
        limit page size
        use pagination
        run in background
        rate limit
        monitor consumed capacity and throttle
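
The scan rules above (small pages, pagination, rate limiting) can be sketched as a generator. `fetch_page` is a hypothetical callable standing in for the real Scan call; it is assumed to return a dict shaped like a Scan response with `Items` and an optional `LastEvaluatedKey`.

```python
import time
from typing import Callable, Iterator, Optional


def paged_scan(
    fetch_page: Callable[[Optional[dict], int], dict],
    page_size: int = 100,
    pause_seconds: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield items page by page, pausing between pages as a crude rate limit."""
    start_key = None
    while True:
        page = fetch_page(start_key, page_size)
        yield from page.get("Items", [])
        # A missing LastEvaluatedKey means the scan has reached the end.
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            return
        sleep(pause_seconds)
```

With boto3, `fetch_page` could wrap `client.scan(...)`, passing `Limit` and, when a start key exists, `ExclusiveStartKey`. A fixed pause is the simplest throttle; a token bucket driven by consumed capacity would be a more precise alternative.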

read consistency#

eventually consistent read:
    default
    cost = 0.5 RCU per 4 KB
    suitable for most read path

strongly consistent read:
    cost = 1 RCU per 4 KB
    only table / LSI support
    GSI / streams do not support strong consistency

transactional read:
    cost = 2 RCU per 4 KB
    only use when real transaction semantics are needed
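
The read-cost rules above reduce to a small calculation for a single-item read. This is a minimal sketch; the function name and the consistency labels are illustrative.

```python
import math

# RCU charged per 4 KB block, as listed in the notes above.
RCU_PER_4KB = {"eventual": 0.5, "strong": 1.0, "transactional": 2.0}


def read_capacity_units(item_size_bytes: int, consistency: str = "eventual") -> float:
    """Estimate RCU for reading one item: size is rounded up to whole 4 KB blocks."""
    blocks = max(1, math.ceil(item_size_bytes / 4096))
    return blocks * RCU_PER_4KB[consistency]
```

For a Query, the total size of the items read is summed before rounding, so passing that total as `item_size_bytes` should give a closer estimate than summing per-item results.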

write#

write capacity:
    1 WCU = one write per second for item up to 1 KB
    item size is rounded up
    each GSI projection adds write cost

best practices:
    use UpdateExpression instead of rewriting large item
    keep item small
    avoid frequently changing huge attributes
    do not store large blob in DynamoDB; store in S3 and keep pointer
    use ReturnConsumedCapacity during test to estimate cost
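
Before measuring with ReturnConsumedCapacity, write cost can be roughed out from item size. The sketch below covers a standard or transactional write plus a simple estimate for putting a new item that appears in several GSIs; function names and the `gsi_projected_sizes` parameter are assumptions, and the GSI estimate ignores updates that change index key attributes.

```python
import math


def write_capacity_units(item_size_bytes: int, transactional: bool = False) -> int:
    """WCU for one write: size rounds up to whole 1 KB units, doubled for
    transactional writes."""
    units = max(1, math.ceil(item_size_bytes / 1024))
    return units * 2 if transactional else units


def estimate_put_wcu(item_size_bytes: int, gsi_projected_sizes: list) -> int:
    """Rough total for putting a NEW item: the base table write plus one
    write per GSI, each sized by the bytes projected into that index."""
    total = write_capacity_units(item_size_bytes)
    for projected in gsi_projected_sizes:
        total += write_capacity_units(projected)
    return total
```

The gap between an `ALL` projection and a narrow `INCLUDE` projection shows up directly in the second function's result.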

conditional write#

use condition expression for correctness:
    create only if not exists
    update only if version matched
    decrement stock only if quantity > 0
    avoid lost update

aws dynamodb update-item \
  --table-name prod-order-orders \
  --key '{"pk":{"S":"USER#u-1001"},"sk":{"S":"ORDER#2026-05-29T10:00:00Z#o-1001"}}' \
  --update-expression "SET #status = :paid, version = version + :one" \
  --condition-expression "version = :expected AND #status = :pending" \
  --expression-attribute-names '{"#status":"status"}' \
  --expression-attribute-values '{
    ":paid":{"S":"PAID"},
    ":pending":{"S":"PENDING"},
    ":expected":{"N":"1"},
    ":one":{"N":"1"}
  }' \
  --return-values ALL_NEW
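
The same optimistic-locking update can be prepared in application code. The sketch below only builds the request dictionary, using the low-level parameter names of the DynamoDB UpdateItem API; the function name is hypothetical and nothing is sent to AWS.

```python
def build_mark_paid_request(table_name: str, pk: str, sk: str,
                            expected_version: int) -> dict:
    """Build UpdateItem parameters that move an order from PENDING to PAID
    only if the stored version still equals expected_version."""
    return {
        "TableName": table_name,
        "Key": {"pk": {"S": pk}, "sk": {"S": sk}},
        "UpdateExpression": "SET #status = :paid, version = version + :one",
        "ConditionExpression": "version = :expected AND #status = :pending",
        # "status" is a DynamoDB reserved word, hence the #status placeholder.
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {
            ":paid": {"S": "PAID"},
            ":pending": {"S": "PENDING"},
            ":expected": {"N": str(expected_version)},
            ":one": {"N": "1"},
        },
        "ReturnValues": "ALL_NEW",
    }
```

With boto3 this would be passed as `boto3.client("dynamodb").update_item(**request)`. A ConditionalCheckFailedException then means another writer changed the item first, so the caller should re-read and decide rather than retry blindly.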

retries#

client should retry:
    ProvisionedThroughputExceededException
    ThrottlingException
    InternalServerError
    ServiceUnavailable
    TransactionConflictException

retry policy:
    exponential backoff
    jitter
    max attempts
    idempotency token for write path

do not retry blindly:
    ConditionalCheckFailedException
    ValidationException
    AccessDeniedException
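
The retry policy above can be sketched as exponential backoff with full jitter plus a retry/no-retry decision. Error names are the ones listed in these notes; the base delay, cap and attempt limit are illustrative defaults, not recommended values.

```python
import random

RETRYABLE = {
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
    "InternalServerError",
    "ServiceUnavailable",
    "TransactionConflictException",
}


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0,
                  rng=random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def should_retry(error_code: str, attempt: int, max_attempts: int = 5) -> bool:
    """attempt is zero-based; stop once max_attempts calls have been made."""
    return error_code in RETRYABLE and attempt + 1 < max_attempts
```

The AWS SDKs already retry throttling and 5xx errors with their own backoff, so a helper like this mainly matters for application-level retries layered on top.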

5. Index Best Practices#

GSI#

a GSI serves a new access pattern:
    add one only when the base table key cannot support the pattern
    GSI keys must also avoid hot partitions
    GSI read is eventually consistent
    a GSI has its own throttling / capacity / metrics

projection:
    KEYS_ONLY:
        cheapest
        only need keys

    INCLUDE:
        selected attributes
        good default for query result list

    ALL:
        convenient but expensive
        write amplification and storage cost high

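notes:
    every base table write that touches indexed attributes is propagated to the GSI asynchronously and consumes GSI write capacity
    GSI backfill can consume a large amount of write capacity
    a throttled GSI can throttle base table writes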

LSI#

LSI:
    must be created with table
    same partition key as base table
    different sort key
    supports strongly consistent read
    max 5 per table

practical rule:
    when unsure, do not add an LSI up front
    access patterns that may change later are usually served by a GSI

sparse index#

sparse index:
    only items with GSI key attributes appear in index
    useful for status / workflow / pending items

example:
    unpaid orders only have:
        gsi2pk = TENANT#<tenant_id>#STATUS#UNPAID
        gsi2sk = CREATED#<created_at>#ORDER#<order_id>

    paid orders remove gsi2pk / gsi2sk
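
The sparse-index behaviour can be modelled in memory to show why removing the key attributes takes an item out of the index. This is only a model of the semantics with hypothetical function names; against a real table the same transition would use an UpdateExpression with `REMOVE gsi2pk, gsi2sk`.

```python
GSI2_KEYS = ("gsi2pk", "gsi2sk")


def mark_paid(item: dict) -> dict:
    """Return a copy of the item with status PAID and the gsi2 key
    attributes removed, so it drops out of the sparse index."""
    updated = {k: v for k, v in item.items() if k not in GSI2_KEYS}
    updated["status"] = "PAID"
    return updated


def visible_in_gsi2(items: list) -> list:
    """Only items carrying every gsi2 key attribute appear in the index."""
    return [it for it in items if all(k in it for k in GSI2_KEYS)]
```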

6. Security Best Practices#

IAM#

principle:
    use IAM role, not long-term access key
    least privilege
    separate read role / write role / migration role / admin role
    allow table ARN and index ARN explicitly
    avoid dynamodb:* in application role

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppReadWriteOrders",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:Query",
        "dynamodb:BatchGetItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders",
        "arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders/index/*"
      ]
    }
  ]
}

resource-based policy#

use cases:
    cross-account access
    restrict source VPC endpoint
    central data account

notes:
    explicit deny always takes precedence
    policy size is limited
    for cross-account access, check both the principal policy and the resource policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessUnlessFromVpce",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "dynamodb:*",
      "Resource": [
        "arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders",
        "arn:aws:dynamodb:ap-east-1:123456789012:table/prod-order-orders/index/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpce": "vpce-0123456789abcdef0"
        }
      }
    }
  ]
}

VPC endpoint#

recommendation:
    EC2 / ECS / Lambda in VPC -> use DynamoDB VPC endpoint
    add endpoint policy
    add IAM/resource policy condition with aws:SourceVpce when appropriate

gateway endpoint:
    common for DynamoDB access from VPC
    no NAT gateway needed for DynamoDB traffic

interface endpoint / PrivateLink:
    use when architecture requires interface endpoint behavior
    client may need endpoint URL configuration

encryption#

at rest:
    DynamoDB encrypts all data at rest
    default uses AWS owned key
    sensitive / compliance workload use customer managed KMS key

in transit:
    use HTTPS/TLS

application side:
    client-side encryption is an option for especially sensitive fields
    but encrypted fields usually cannot be used directly as query keys

data protection#

production defaults:
    PITR enabled
    deletion protection enabled
    CloudTrail enabled
    AWS Config / Security Hub rule reviewed
    backup restore runbook tested

PITR notes:
    recovery window: 1-35 days
    restore creates a new table
    LatestRestorableDateTime usually lags current time by about 5 minutes

7. TTL / Streams / Global Tables#

TTL#

TTL:
    attribute must be Number
    value is Unix epoch time in seconds
    expired item is deleted asynchronously
    expired item can still appear before background deletion
    filter expired items in read path if correctness requires it

good use cases:
    session
    idempotency record
    temporary token
    cache item
    short-lived event

not good:
    precise scheduled deletion
    compliance deletion with strict second-level SLA
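
Since TTL deletion is asynchronous, the write path computes the expiry timestamp and the read path filters anything already expired. A minimal sketch, assuming items are plain dicts with the TTL attribute already deserialized to a number; the attribute name `expire_at` follows the hands-on section and the function names are illustrative.

```python
import time


def expire_at(ttl_seconds: int, now: float = None) -> int:
    """Unix epoch time in whole seconds at which the item should expire."""
    now = time.time() if now is None else now
    return int(now) + int(ttl_seconds)


def is_live(item: dict, now: float = None, ttl_attr: str = "expire_at") -> bool:
    """Treat an item as expired once its TTL has passed, even if the
    background deletion has not removed it yet."""
    now = time.time() if now is None else now
    ttl = item.get(ttl_attr)
    return ttl is None or ttl > now
```

Usage: a 7-day session would be written with `expire_at(7 * 24 * 3600)` and checked with `is_live(item)` after each read.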

streams#

DynamoDB Streams:
    near real-time change capture
    common integration with Lambda
    use for projection / async workflow / audit / cache invalidation

stream view types:
    KEYS_ONLY
    NEW_IMAGE
    OLD_IMAGE
    NEW_AND_OLD_IMAGES

notes:
    Lambda consumer must handle duplicate/retry
    downstream processing should be idempotent
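
One way to make a stream consumer tolerate duplicates is to deduplicate on the record's `eventID`. In the sketch below the in-memory set is only a stand-in for a durable store (for example a table written with a conditional put), and `handle` is a hypothetical callback.

```python
def process_records(event: dict, handle, seen_event_ids: set) -> int:
    """Call handle(record) once per unseen eventID; return how many ran.

    An ID is recorded only after handle succeeds, so a record whose handler
    raised is processed again on the next delivery.
    """
    processed = 0
    for record in event.get("Records", []):
        event_id = record["eventID"]
        if event_id in seen_event_ids:
            continue  # duplicate delivery or retry of an already handled record
        handle(record)
        seen_event_ids.add(event_id)
        processed += 1
    return processed
```

Recording the ID after the handler runs gives at-least-once behaviour, which is why the handler itself should still be idempotent.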

global tables#

global tables:
    multi-region active-active
    replication is eventually consistent
    conflict reconciliation is last writer wins

use when:
    low latency read/write in multiple regions
    regional resilience is required

notes:
    not a replacement for relational global transaction
    conflict model must be acceptable
    monitor replication latency and pending replication
    TTL replicated deletes may incur replicated write cost in replica regions

8. Monitoring#

important metrics#

Area	Metrics	What To Watch
Capacity	`ConsumedReadCapacityUnits`, `ConsumedWriteCapacityUnits`	consumed vs provisioned / quota
Throttle	`ThrottledRequests`, `ReadThrottleEvents`, `WriteThrottleEvents`	any non-zero spike on production
Hot partition	`ReadKeyRangeThroughputThrottleEvents`, `WriteKeyRangeThroughputThrottleEvents`	key range / partition bottleneck
Provisioned throttle	`ReadProvisionedThroughputThrottleEvents`, `WriteProvisionedThroughputThrottleEvents`	provisioned capacity insufficient
Account limit	`ReadAccountLimitThrottleEvents`, `WriteAccountLimitThrottleEvents`	account-level quota hit
On-demand max	`ReadMaxOnDemandThroughputThrottleEvents`, `WriteMaxOnDemandThroughputThrottleEvents`	on-demand table max hit
Latency	`SuccessfulRequestLatency`	p90 / p95 / p99 by operation
Errors	`SystemErrors`, `UserErrors`	AWS side vs client side errors
Conditional	`ConditionalCheckFailedRequests`	optimistic locking / business conflict
Transaction	`TransactionConflict`	high contention transaction
Result size	`ReturnedItemCount`, `ReturnedBytes`	query efficiency
TTL	`TimeToLiveDeletedItemCount`	TTL deletion activity
GSI backfill	`OnlineIndexPercentageProgress`, `OnlineIndexThrottleEvents`, `OnlineIndexConsumedWriteCapacity`	adding GSI impact
Global table	`ReplicationLatency`, `PendingReplicationCount`, `AgeOfOldestUnreplicatedRecord`	replica lag

alert rules#

critical:
    SystemErrors > 0 for 5m
    ReadThrottleEvents / WriteThrottleEvents > 0 for 5m on critical table
    ReadAccountLimitThrottleEvents / WriteAccountLimitThrottleEvents > 0
    ReplicationLatency above RPO expectation

warning:
    ConsumedReadCapacityUnits (Sum / period seconds) > 80% of provisioned RCU for 10m
    ConsumedWriteCapacityUnits (Sum / period seconds) > 80% of provisioned WCU for 10m
    SuccessfulRequestLatency p95 above service SLO
    ConditionalCheckFailedRequests sudden spike
    ReturnedItemCount too high for online query
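
The 80% warning compares a per-second provisioned value with a consumed metric that CloudWatch reports as a Sum over the period, so the Sum has to be normalised first. A small sketch of that arithmetic, with a hypothetical function name:

```python
def provisioned_utilization(consumed_sum: float, period_seconds: int,
                            provisioned_units: float) -> float:
    """Fraction of provisioned capacity used during one CloudWatch period.

    consumed_sum is the Sum statistic of ConsumedRead/WriteCapacityUnits for
    the period; dividing by the period length gives average units per second.
    """
    if period_seconds <= 0 or provisioned_units <= 0:
        raise ValueError("period_seconds and provisioned_units must be positive")
    return consumed_sum / (period_seconds * provisioned_units)
```

For example, a Sum of 4800 over a 60-second period on a table provisioned with 100 units works out to 80 consumed units per second on average.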

dashboard#

dashboard should include:
    request latency p50 / p95 / p99 by operation
    read/write consumed capacity
    read/write throttle events
    key range throttle events
    system errors / user errors
    top application errors from logs
    GSI capacity and throttles
    table size / item count
    TTL deletes
    global table replication latency, if used

Prometheus / YACE#

apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/DynamoDB
      regions:
        - ap-east-1
      metrics:
        - name: ConsumedReadCapacityUnits
          statistics: [Sum]
          period: 60
          length: 300
        - name: ConsumedWriteCapacityUnits
          statistics: [Sum]
          period: 60
          length: 300
        - name: ReadThrottleEvents
          statistics: [Sum]
          period: 60
          length: 300
        - name: WriteThrottleEvents
          statistics: [Sum]
          period: 60
          length: 300
        - name: SuccessfulRequestLatency
          statistics: [Average, p95]
          period: 60
          length: 300
        - name: SystemErrors
          statistics: [Sum]
          period: 60
          length: 300
      dimensionNameRequirements:
        - TableName

9. Hands-on#

create table#

export AWS_PAGER=""
export AWS_REGION="ap-east-1"
export TABLE_NAME="dev-order-orders"

aws dynamodb create-table \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --billing-mode PAY_PER_REQUEST \
  --attribute-definitions \
    AttributeName=pk,AttributeType=S \
    AttributeName=sk,AttributeType=S \
    AttributeName=gsi1pk,AttributeType=S \
    AttributeName=gsi1sk,AttributeType=S \
  --key-schema \
    AttributeName=pk,KeyType=HASH \
    AttributeName=sk,KeyType=RANGE \
  --global-secondary-indexes '[
    {
      "IndexName": "gsi1",
      "KeySchema": [
        {"AttributeName": "gsi1pk", "KeyType": "HASH"},
        {"AttributeName": "gsi1sk", "KeyType": "RANGE"}
      ],
      "Projection": {"ProjectionType": "INCLUDE", "NonKeyAttributes": ["status", "amount", "created_at"]}
    }
  ]' \
  --deletion-protection-enabled \
  --tags \
    Key=env,Value=dev \
    Key=service,Value=order \
    Key=owner,Value=platform

enable PITR#

aws dynamodb update-continuous-backups \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true,RecoveryPeriodInDays=35

aws dynamodb describe-continuous-backups \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --query 'ContinuousBackupsDescription.PointInTimeRecoveryDescription'

enable TTL#

aws dynamodb update-time-to-live \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --time-to-live-specification Enabled=true,AttributeName=expire_at

put items#

aws dynamodb put-item \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --condition-expression "attribute_not_exists(pk) AND attribute_not_exists(sk)" \
  --item '{
    "pk": {"S": "USER#u-1001"},
    "sk": {"S": "ORDER#2026-05-29T10:00:00Z#o-1001"},
    "gsi1pk": {"S": "ORDER#o-1001"},
    "gsi1sk": {"S": "META"},
    "order_id": {"S": "o-1001"},
    "user_id": {"S": "u-1001"},
    "status": {"S": "PENDING"},
    "amount": {"N": "99.90"},
    "version": {"N": "1"},
    "created_at": {"S": "2026-05-29T10:00:00Z"},
    "expire_at": {"N": "1790599200"}
  }'

query by user#

aws dynamodb query \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --key-condition-expression "pk = :pk AND begins_with(sk, :prefix)" \
  --expression-attribute-values '{
    ":pk": {"S": "USER#u-1001"},
    ":prefix": {"S": "ORDER#"}
  }' \
  --return-consumed-capacity TOTAL

query by order id through GSI#

aws dynamodb query \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --index-name gsi1 \
  --key-condition-expression "gsi1pk = :pk AND gsi1sk = :sk" \
  --expression-attribute-values '{
    ":pk": {"S": "ORDER#o-1001"},
    ":sk": {"S": "META"}
  }' \
  --return-consumed-capacity TOTAL

cloudwatch alarms#

aws cloudwatch put-metric-alarm \
  --region "${AWS_REGION}" \
  --alarm-name "dynamodb-${TABLE_NAME}-write-throttle" \
  --namespace AWS/DynamoDB \
  --metric-name WriteThrottleEvents \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=TableName,Value="${TABLE_NAME}"

aws cloudwatch put-metric-alarm \
  --region "${AWS_REGION}" \
  --alarm-name "dynamodb-${TABLE_NAME}-system-errors" \
  --namespace AWS/DynamoDB \
  --metric-name SystemErrors \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=TableName,Value="${TABLE_NAME}"

aws cloudwatch put-metric-alarm \
  --region "${AWS_REGION}" \
  --alarm-name "dynamodb-${TABLE_NAME}-p95-latency" \
  --namespace AWS/DynamoDB \
  --metric-name SuccessfulRequestLatency \
  --extended-statistic p95 \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --dimensions Name=TableName,Value="${TABLE_NAME}" Name=Operation,Value=Query

cleanup#

# do not run cleanup directly against a production table
aws dynamodb update-table \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}" \
  --no-deletion-protection-enabled

aws dynamodb delete-table \
  --region "${AWS_REGION}" \
  --table-name "${TABLE_NAME}"

10. Production Checklist#

before launch:
    access patterns reviewed
    partition key distribution tested
    item size measured
    ReturnConsumedCapacity sampled in load test
    on-demand/provisioned mode decided
    Service Quotas checked
    PITR enabled
    deletion protection enabled
    IAM least privilege reviewed
    KMS key policy reviewed, if using customer managed key
    VPC endpoint / resource policy reviewed
    CloudWatch alarms created
    dashboard created
    backup restore tested
    migration / backfill has rate limit

when incident happens:
    check throttle reason first
    check table and GSI metrics separately
    check key range throttle events for hot partition
    check account limit throttle events for quota issue
    check app retry and timeout
    check recent GSI creation / backfill / migration
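
The first incident step, finding the throttle reason, can be sketched as a lookup from the throttle metrics listed in the monitoring section to their likely cause. The input is assumed to be a dict of metric name to Sum over the incident window; the function name and the wording of the causes are illustrative.

```python
# Metric suffixes and causes follow the "important metrics" table above.
CAUSES = {
    "KeyRangeThroughputThrottleEvents": "hot partition / key range",
    "ProvisionedThroughputThrottleEvents": "provisioned capacity insufficient",
    "AccountLimitThrottleEvents": "account-level quota hit",
    "MaxOnDemandThroughputThrottleEvents": "on-demand table max hit",
}


def throttle_causes(metric_sums: dict) -> list:
    """List likely throttle causes for every metric with a non-zero sum."""
    findings = []
    for direction in ("Read", "Write"):
        for suffix, cause in CAUSES.items():
            if metric_sums.get(direction + suffix, 0) > 0:
                findings.append(f"{direction.lower()}: {cause}")
    return findings
```

Running it once for the table's metrics and once per GSI keeps the two investigated separately, as the checklist suggests.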