Links#
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
https://docs.aws.amazon.com/firehose/latest/dev/basic-create.html
https://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html
https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
https://docs.aws.amazon.com/firehose/latest/dev/retry.html
https://docs.aws.amazon.com/firehose/latest/dev/monitoring-with-cloudwatch-metrics.html
https://docs.aws.amazon.com/firehose/latest/dev/monitoring-with-cloudwatch-logs.html
https://docs.aws.amazon.com/firehose/latest/APIReference/Welcome.html
1. Important Points#
Amazon Data Firehose 是 managed streaming delivery service:
receive streaming records
buffer records
optionally transform with Lambda
optionally convert JSON to Parquet / ORC for S3
deliver to destination
常见用途:
application logs -> S3 data lake
CloudWatch Logs / metric stream -> S3 / third-party HTTP endpoint
Kinesis Data Streams -> S3 / Redshift / OpenSearch
clickstream / audit event -> S3
核心原则:
Firehose focuses on delivery, not complex stream processing
choose buffer size/interval based on latency vs file size
S3 backup/error prefix must be configured and monitored
use Lambda transform only for light transformation
use Glue schema + Parquet/ORC when Athena query cost matters
dynamic partitioning improves S3 layout but increases config complexity
Firehose 不是:
message queue
long-retention event log like Kinesis Data Streams / Kafka
complex ETL engine
exactly-once processing system
low-latency millisecond delivery system
2. Service Configuration#
source#
| Source |
When To Use |
Notes |
| Direct PUT |
app directly calls PutRecord / PutRecordBatch |
simplest app -> Firehose path |
| Kinesis Data Streams |
need replay / multiple consumers / higher stream control |
Firehose reads from stream |
| Amazon MSK |
Kafka source in AWS |
useful for Kafka -> destination |
| CloudWatch Logs |
log subscription delivery |
common centralized log pipeline |
| CloudWatch Metric Streams |
metrics export |
Firehose stream must match CloudWatch metric stream setup requirements |
source decision:
Direct PUT:
simple delivery
no replay from source
app retries matter
Kinesis Data Streams before Firehose:
replay needed
multiple consumers needed
source stream retention needed
Firehose should not be only consumer path
destination#
| Destination |
Use Case |
Notes |
| Amazon S3 |
data lake / logs / raw events |
most common destination |
| Amazon Redshift |
warehouse loading |
Firehose stages data in S3 and runs COPY |
| OpenSearch Service / Serverless |
search / log analytics |
index design matters |
| Splunk |
log analytics |
needs HEC endpoint/token |
| HTTP endpoint |
SaaS / custom collector |
retry and backup matter |
| Snowflake |
analytics destination |
destination-specific setup |
| Apache Iceberg Tables in S3 |
table format delivery |
schema/table design matters |
destination design:
S3 should almost always have:
compression
clear prefix pattern
error prefix
lifecycle policy
encryption
Redshift / OpenSearch / HTTP should have:
S3 backup enabled
retry duration reviewed
CloudWatch Logs enabled for delivery errors
buffer#
| Workload |
Buffer Size / Interval |
Tradeoff |
| near real-time logs |
smaller interval |
lower latency, smaller files |
| S3 data lake / Athena |
larger buffer |
larger files, better query efficiency |
| low traffic stream |
interval matters more |
avoid too many tiny files |
| high traffic stream |
size matters more |
throughput and file size improve |
buffering:
Firehose buffers incoming records before delivery
destination decides valid buffer range
smaller buffer:
lower latency
more objects / requests
larger buffer:
better S3 object size
lower request overhead
higher delivery latency
3. Data Layout / Architecture#
S3 prefix#
recommended prefix:
raw/service=order-api/env=prod/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
error prefix:
error/service=order-api/env=prod/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
prefix rules:
include service/env
include time partition
keep error records separated
avoid user_id / request_id as partition key
lifecycle raw logs if retention is limited
dynamic partitioning#
dynamic partitioning:
partitions S3 output based on fields inside each record
useful for:
tenant_id
service
event_type
region
avoid:
high cardinality keys
keys that are missing often
keys that create tiny files
processing order with dynamic partitioning:
deaggregation
Lambda transformation
partition key extraction
delivery
JSON -> Parquet / ORC:
useful for Athena / Glue / Redshift Spectrum
needs AWS Glue Data Catalog schema
schema must match input data structure
when to use:
high query volume
large S3 data lake
columnar analytics
when not to use:
schema changes constantly
records are not JSON
raw archive must be kept exactly as received
Lambda transformation:
Firehose invokes Lambda with buffered records
Lambda returns one result per input record
record result:
Ok:
deliver transformed record
Dropped:
drop record intentionally
ProcessingFailed:
send to processing-failed/error path
import base64
import json
def lambda_handler(event, context):
output = []
for record in event["records"]:
payload = base64.b64decode(record["data"]).decode("utf-8")
data = json.loads(payload)
data["service"] = data.get("service", "order-api")
data["env"] = data.get("env", "prod")
encoded = base64.b64encode((json.dumps(data) + "\n").encode("utf-8")).decode("utf-8")
output.append({
"recordId": record["recordId"],
"result": "Ok",
"data": encoded
})
return {"records": output}
transform rules:
keep Lambda fast
avoid network call per record
preserve newline-delimited JSON for S3 analytics
never silently drop invalid records unless business accepts loss
monitor ProcessingFailed records
PutRecord / PutRecordBatch#
producer rules:
batch records when possible
retry throttling / transient errors
include event timestamp in payload
include service/env/schema_version
do not send oversized records
{
"event_id": "evt_001",
"event_type": "order_created",
"service": "order-api",
"env": "prod",
"schema_version": 1,
"event_time": "2026-06-02T10:00:00Z",
"order_id": "ord_001",
"amount_cents": 1000
}
5. Security Best Practices#
IAM role#
Firehose delivery role should allow only:
write to target S3 bucket/prefix
write to backup/error S3 prefix
invoke transform Lambda if enabled
read Glue schema if format conversion enabled
write CloudWatch Logs if logging enabled
destination-specific permissions
S3 bucket policy sample#
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowFirehoseWrite",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/firehose-order-prod-role"
},
"Action": [
"s3:AbortMultipartUpload",
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::order-prod-firehose",
"arn:aws:s3:::order-prod-firehose/*"
]
}
]
}
security checklist:
S3 SSE-S3 or SSE-KMS enabled
Firehose role scoped to bucket/prefix
source app can only PutRecord to required delivery stream
CloudWatch Logs do not contain sensitive payloads
Splunk / HTTP endpoint token stored securely
KMS key policy allows Firehose role when using SSE-KMS
6. Backup / Restore / Reliability#
failure behavior:
Firehose retries destination delivery based on destination config
if delivery fails after retry, records go to S3 backup/error location when configured
Lambda transform / format conversion failures go to processing-failed/error prefix
must have:
backup bucket
error prefix
CloudWatch error logging
replay/runbook for failed records
replay pattern:
inspect error prefix
identify error-output-type
fix transform/schema/destination issue
replay valid records through PutRecordBatch or a controlled batch job
keep original failed records for audit until retention expires
reliability checklist:
destination retry duration reviewed
error prefix monitored
backup S3 lifecycle configured
transform Lambda DLQ/alarms considered
producer retry with backoff
delivery stream quota reviewed before traffic launch
7. Monitoring#
CloudWatch namespace:
AWS/Firehose
important metrics:
IncomingBytes
IncomingRecords
DeliveryToS3.Success
DeliveryToS3.DataFreshness
DeliveryToS3.Records
DeliveryToS3.Bytes
DeliveryToS3Failures
ThrottledRecords
DataReadFromKinesisStream.Bytes
DataReadFromKinesisStream.Records
alerts:
DeliveryToS3.Success < 1
DeliveryToS3.DataFreshness too high
Delivery failures > 0
ThrottledRecords > 0
IncomingBytes drops to zero unexpectedly
Lambda transform error spike
S3 error prefix object count increases
logs:
enable CloudWatch error logging
inspect:
DestinationDelivery
BackupDelivery
ProcessingFailed
dashboard:
incoming bytes/records
delivery success
data freshness
throttled records
error log count
S3 object count / size
8. Hands-on#
create S3 bucket#
export AWS_PAGER=""
export AWS_REGION="ap-east-1"
export ACCOUNT_ID="123456789012"
export BUCKET="order-prod-firehose"
export STREAM_NAME="prod-order-events-to-s3"
export FIREHOSE_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/firehose-order-prod-role"
aws s3api create-bucket \
--bucket "$BUCKET" \
--region "$AWS_REGION" \
--create-bucket-configuration LocationConstraint="$AWS_REGION"
S3 destination config#
{
"RoleARN": "arn:aws:iam::123456789012:role/firehose-order-prod-role",
"BucketARN": "arn:aws:s3:::order-prod-firehose",
"Prefix": "raw/service=order-api/env=prod/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/",
"ErrorOutputPrefix": "error/service=order-api/env=prod/!{firehose:error-output-type}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
"BufferingHints": {
"SizeInMBs": 128,
"IntervalInSeconds": 300
},
"CompressionFormat": "GZIP",
"CloudWatchLoggingOptions": {
"Enabled": true,
"LogGroupName": "/aws/kinesisfirehose/prod-order-events-to-s3",
"LogStreamName": "S3Delivery"
}
}
create delivery stream#
aws firehose create-delivery-stream \
--delivery-stream-name "$STREAM_NAME" \
--delivery-stream-type DirectPut \
--extended-s3-destination-configuration file://extended-s3-destination.json
put one record#
printf '{"event_type":"order_created","service":"order-api","env":"prod","order_id":"ord_001"}\n' > record.json
aws firehose put-record \
--delivery-stream-name "$STREAM_NAME" \
--record Data=fileb://record.json
describe stream#
aws firehose describe-delivery-stream \
--delivery-stream-name "$STREAM_NAME"
check metrics#
aws cloudwatch list-metrics \
--namespace AWS/Firehose \
--dimensions Name=DeliveryStreamName,Value="$STREAM_NAME"
9. Production Checklist#
source:
Direct PUT vs Kinesis Data Streams decision documented
producer retries implemented
event schema_version included
event_time included
destination:
S3 prefix and error prefix defined
compression enabled
backup enabled for non-S3 destinations
lifecycle policy configured
destination retry reviewed
transform:
Lambda transform is lightweight
ProcessingFailed monitored
schema evolution plan exists
format conversion schema matches input
security:
Firehose role least privilege
S3 bucket encrypted
KMS key policy reviewed if used
endpoint credentials stored securely
operations:
CloudWatch metrics dashboard exists
CloudWatch error logging enabled
DataFreshness alarm exists
delivery failure alarm exists
replay runbook exists