AWS Data Firehose

Links#

https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
https://docs.aws.amazon.com/firehose/latest/dev/basic-create.html
https://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html
https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
https://docs.aws.amazon.com/firehose/latest/dev/retry.html
https://docs.aws.amazon.com/firehose/latest/dev/monitoring-with-cloudwatch-metrics.html
https://docs.aws.amazon.com/firehose/latest/dev/monitoring-with-cloudwatch-logs.html
https://docs.aws.amazon.com/firehose/latest/APIReference/Welcome.html

1. Important Points#

Amazon Data Firehose 是 managed streaming delivery service:
    receive streaming records
    buffer records
    optionally transform with Lambda
    optionally convert JSON to Parquet / ORC for S3
    deliver to destination

常见用途:
    application logs -> S3 data lake
    CloudWatch Logs / metric stream -> S3 / third-party HTTP endpoint
    Kinesis Data Streams -> S3 / Redshift / OpenSearch
    clickstream / audit event -> S3

核心原则:
    Firehose focuses on delivery, not complex stream processing
    choose buffer size/interval based on latency vs file size
    S3 backup/error prefix must be configured and monitored
    use Lambda transform only for light transformation
    use Glue schema + Parquet/ORC when Athena query cost matters
    dynamic partitioning improves S3 layout but increases config complexity

Firehose 不是:
    message queue
    long-retention event log like Kinesis Data Streams / Kafka
    complex ETL engine
    exactly-once processing system
    low-latency millisecond delivery system

2. Service Configuration#

source#

Source	When To Use	Notes
Direct PUT	app directly calls `PutRecord` / `PutRecordBatch`	simplest app -> Firehose path
Kinesis Data Streams	need replay / multiple consumers / higher stream control	Firehose reads from stream
Amazon MSK	Kafka source in AWS	useful for Kafka -> destination
CloudWatch Logs	log subscription delivery	common centralized log pipeline
CloudWatch Metric Streams	metrics export	Firehose stream must match CloudWatch metric stream setup requirements

source decision:
    Direct PUT:
        simple delivery
        no replay from source
        app retries matter

    Kinesis Data Streams before Firehose:
        replay needed
        multiple consumers needed
        source stream retention needed
        Firehose should not be only consumer path

destination#

Destination	Use Case	Notes
Amazon S3	data lake / logs / raw events	most common destination
Amazon Redshift	warehouse loading	Firehose stages data in S3 and runs COPY
OpenSearch Service / Serverless	search / log analytics	index design matters
Splunk	log analytics	needs HEC endpoint/token
HTTP endpoint	SaaS / custom collector	retry and backup matter
Snowflake	analytics destination	destination-specific setup
Apache Iceberg Tables in S3	table format delivery	schema/table design matters

destination design:
    S3 should almost always have:
        compression
        clear prefix pattern
        error prefix
        lifecycle policy
        encryption

    Redshift / OpenSearch / HTTP should have:
        S3 backup enabled
        retry duration reviewed
        CloudWatch Logs enabled for delivery errors

buffer#

Workload	Buffer Size / Interval	Tradeoff
near real-time logs	smaller interval	lower latency, smaller files
S3 data lake / Athena	larger buffer	larger files, better query efficiency
low traffic stream	interval matters more	avoid too many tiny files
high traffic stream	size matters more	throughput and file size improve

buffering:
    Firehose buffers incoming records before delivery
    destination decides valid buffer range
    smaller buffer:
        lower latency
        more objects / requests

    larger buffer:
        better S3 object size
        lower request overhead
        higher delivery latency

3. Data Layout / Architecture#

S3 prefix#

recommended prefix:
    raw/service=order-api/env=prod/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/

error prefix:
    error/service=order-api/env=prod/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/

prefix rules:
    include service/env
    include time partition
    keep error records separated
    avoid user_id / request_id as partition key
    lifecycle raw logs if retention is limited

dynamic partitioning#

dynamic partitioning:
    partitions S3 output based on fields inside each record
    useful for:
        tenant_id
        service
        event_type
        region

avoid:
    high cardinality keys
    keys that are missing often
    keys that create tiny files

processing order with dynamic partitioning:
    deaggregation
    Lambda transformation
    partition key extraction
    delivery

format conversion#

JSON -> Parquet / ORC:
    useful for Athena / Glue / Redshift Spectrum
    needs AWS Glue Data Catalog schema
    schema must match input data structure

when to use:
    high query volume
    large S3 data lake
    columnar analytics

when not to use:
    schema changes constantly
    records are not JSON
    raw archive must be kept exactly as received

4. Transform / Write Best Practices#

Lambda transform#

Lambda transformation:
    Firehose invokes Lambda with buffered records
    Lambda returns one result per input record

record result:
    Ok:
        deliver transformed record

    Dropped:
        drop record intentionally

    ProcessingFailed:
        send to processing-failed/error path

import base64
import json


def lambda_handler(event, context):
    output = []

    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        data = json.loads(payload)

        data["service"] = data.get("service", "order-api")
        data["env"] = data.get("env", "prod")

        encoded = base64.b64encode((json.dumps(data) + "\n").encode("utf-8")).decode("utf-8")

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": encoded
        })

    return {"records": output}

transform rules:
    keep Lambda fast
    avoid network call per record
    preserve newline-delimited JSON for S3 analytics
    never silently drop invalid records unless business accepts loss
    monitor ProcessingFailed records

PutRecord / PutRecordBatch#

producer rules:
    batch records when possible
    retry throttling / transient errors
    include event timestamp in payload
    include service/env/schema_version
    do not send oversized records

{
  "event_id": "evt_001",
  "event_type": "order_created",
  "service": "order-api",
  "env": "prod",
  "schema_version": 1,
  "event_time": "2026-06-02T10:00:00Z",
  "order_id": "ord_001",
  "amount_cents": 1000
}

5. Security Best Practices#

IAM role#

Firehose delivery role should allow only:
    write to target S3 bucket/prefix
    write to backup/error S3 prefix
    invoke transform Lambda if enabled
    read Glue schema if format conversion enabled
    write CloudWatch Logs if logging enabled
    destination-specific permissions

S3 bucket policy sample#

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowFirehoseWrite",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/firehose-order-prod-role"
      },
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::order-prod-firehose",
        "arn:aws:s3:::order-prod-firehose/*"
      ]
    }
  ]
}

security checklist:
    S3 SSE-S3 or SSE-KMS enabled
    Firehose role scoped to bucket/prefix
    source app can only PutRecord to required delivery stream
    CloudWatch Logs do not contain sensitive payloads
    Splunk / HTTP endpoint token stored securely
    KMS key policy allows Firehose role when using SSE-KMS

6. Backup / Restore / Reliability#

failure behavior:
    Firehose retries destination delivery based on destination config
    if delivery fails after retry, records go to S3 backup/error location when configured
    Lambda transform / format conversion failures go to processing-failed/error prefix

must have:
    backup bucket
    error prefix
    CloudWatch error logging
    replay/runbook for failed records

replay pattern:
    inspect error prefix
    identify error-output-type
    fix transform/schema/destination issue
    replay valid records through PutRecordBatch or a controlled batch job
    keep original failed records for audit until retention expires

reliability checklist:
    destination retry duration reviewed
    error prefix monitored
    backup S3 lifecycle configured
    transform Lambda DLQ/alarms considered
    producer retry with backoff
    delivery stream quota reviewed before traffic launch

7. Monitoring#

CloudWatch namespace:
    AWS/Firehose

important metrics:
    IncomingBytes
    IncomingRecords
    DeliveryToS3.Success
    DeliveryToS3.DataFreshness
    DeliveryToS3.Records
    DeliveryToS3.Bytes
    DeliveryToS3Failures
    ThrottledRecords
    DataReadFromKinesisStream.Bytes
    DataReadFromKinesisStream.Records

alerts:
    DeliveryToS3.Success < 1
    DeliveryToS3.DataFreshness too high
    Delivery failures > 0
    ThrottledRecords > 0
    IncomingBytes drops to zero unexpectedly
    Lambda transform error spike
    S3 error prefix object count increases

logs:
    enable CloudWatch error logging
    inspect:
        DestinationDelivery
        BackupDelivery
        ProcessingFailed

dashboard:
    incoming bytes/records
    delivery success
    data freshness
    throttled records
    error log count
    S3 object count / size

8. Hands-on#

create S3 bucket#

export AWS_PAGER=""
export AWS_REGION="ap-east-1"
export ACCOUNT_ID="123456789012"
export BUCKET="order-prod-firehose"
export STREAM_NAME="prod-order-events-to-s3"
export FIREHOSE_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/firehose-order-prod-role"

aws s3api create-bucket \
  --bucket "$BUCKET" \
  --region "$AWS_REGION" \
  --create-bucket-configuration LocationConstraint="$AWS_REGION"

S3 destination config#

{
  "RoleARN": "arn:aws:iam::123456789012:role/firehose-order-prod-role",
  "BucketARN": "arn:aws:s3:::order-prod-firehose",
  "Prefix": "raw/service=order-api/env=prod/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/",
  "ErrorOutputPrefix": "error/service=order-api/env=prod/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
  "BufferingHints": {
    "SizeInMBs": 128,
    "IntervalInSeconds": 300
  },
  "CompressionFormat": "GZIP",
  "CloudWatchLoggingOptions": {
    "Enabled": true,
    "LogGroupName": "/aws/kinesisfirehose/prod-order-events-to-s3",
    "LogStreamName": "S3Delivery"
  }
}

create delivery stream#

aws firehose create-delivery-stream \
  --delivery-stream-name "$STREAM_NAME" \
  --delivery-stream-type DirectPut \
  --extended-s3-destination-configuration file://extended-s3-destination.json

put one record#

printf '{"event_type":"order_created","service":"order-api","env":"prod","order_id":"ord_001"}\n' > record.json

aws firehose put-record \
  --delivery-stream-name "$STREAM_NAME" \
  --record Data=fileb://record.json

describe stream#

aws firehose describe-delivery-stream \
  --delivery-stream-name "$STREAM_NAME"

check metrics#

aws cloudwatch list-metrics \
  --namespace AWS/Firehose \
  --dimensions Name=DeliveryStreamName,Value="$STREAM_NAME"

9. Production Checklist#

source:
    Direct PUT vs Kinesis Data Streams decision documented
    producer retries implemented
    event schema_version included
    event_time included

destination:
    S3 prefix and error prefix defined
    compression enabled
    backup enabled for non-S3 destinations
    lifecycle policy configured
    destination retry reviewed

transform:
    Lambda transform is lightweight
    ProcessingFailed monitored
    schema evolution plan exists
    format conversion schema matches input

security:
    Firehose role least privilege
    S3 bucket encrypted
    KMS key policy reviewed if used
    endpoint credentials stored securely

operations:
    CloudWatch metrics dashboard exists
    CloudWatch error logging enabled
    DataFreshness alarm exists
    delivery failure alarm exists
    replay runbook exists