Today we are announcing support for inline payloads for Amazon SageMaker AI asynchronous inference. Customers can now send inference payloads directly in the request body of the InvokeEndpointAsync API, eliminating the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each call.
For payloads up to 128,000 bytes, this eliminates a full network round trip, simplifies client-side code, and reduces the operational surface area of asynchronous inference workloads.
In this post, we explain the motivation behind this feature, compare the customer experience before and after, and show you how to start using inline payloads today.
Background: How asynchronous inference worked before
You can use Amazon SageMaker AI Async Inference to queue inference requests and process them asynchronously. It is best suited for workloads with large payloads, variable traffic, or a latency tolerance of seconds to minutes. It supports auto-scaling down to zero, making it cost-effective for burst or batch workloads.
Until now, the workflow required two steps for each call:
- Upload the input payload to an Amazon S3 bucket.
- Invoke the endpoint, passing the S3 object URI as InputLocation.
The endpoint processes the request asynchronously and writes the output to a configured S3 output location. The client either polls that location or is notified through Amazon Simple Notification Service (Amazon SNS).
This two-step model works well for large payloads (images, audio, multi-MB documents). But for customers whose input payloads are small (a few KB) and whose processing times are longer than real-time inference allows, the mandatory S3 dependency added unnecessary complexity.
What’s new: Inline payload via Body parameter
With today’s launch, InvokeEndpointAsync accepts a new Body parameter. When present, the payload is sent inline in the API request itself, with no S3 upload required.
Key details:
| Aspect | Details |
| New parameter | Body, raw bytes, limited to 128,000 bytes. |
| Maximum inline size | 128,000 bytes (raw payload). |
| Mutual exclusivity | Body and InputLocation are mutually exclusive. The API rejects requests that set both. |
| Output behavior | Unchanged. The output is written to the S3 OutputLocation. |
| Endpoint compatibility | Designed to work with existing asynchronous endpoints; no model or container changes are expected. |
| Error handling | Size and mutual exclusivity violations return synchronous ValidationError responses. |
| Availability | Available in 31 commercial AWS Regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, BK, KUL, KD, HYD, CPT, MXP, TLV). |
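The size and mutual exclusivity rules in the table can also be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: `validate_async_request` and `MAX_INLINE_BYTES` are hypothetical names introduced here for illustration, and the service remains the authority on what it accepts.

```python
MAX_INLINE_BYTES = 128_000  # inline limit stated in the table above


def validate_async_request(body=None, input_location=None):
    """Hypothetical pre-flight check mirroring the documented rules.

    Raises ValueError if both inputs are set or if the inline body
    exceeds the stated limit. It does not call any AWS API.
    """
    if body is not None and input_location is not None:
        raise ValueError("Body and InputLocation are mutually exclusive")
    if body is not None and len(body) > MAX_INLINE_BYTES:
        raise ValueError(
            f"Body is {len(body)} bytes; the inline limit is {MAX_INLINE_BYTES}"
        )


# The limit applies to encoded bytes, not characters.
text = "é" * 70_000              # 70,000 characters
encoded = text.encode("utf-8")   # "é" takes 2 bytes in UTF-8
print(len(text), len(encoded))
```

Measuring the encoded bytes rather than the string length matters for non-ASCII text, where one character can occupy several bytes.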
Before and after: customer experience
The change is clearest in the code. The following two examples make the same asynchronous call to the same endpoint. The first uses the S3 upload step that was required until now, and the second uses the inline Body parameter that replaces it.
Before: Upload to S3 first, then invoke
import boto3, json, uuid
s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")
# 1. Upload the request payload to S3 (additional latency + cost)
input_key = f"async-input/{uuid.uuid4()}.json"
s3.put_object(Bucket="my-async-bucket", Key=input_key, Body=payload)
input_location = f"s3://my-async-bucket/{input_key}"
# 2. Invoke endpoint
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation=input_location,
    ContentType="application/json",
)
print(response["OutputLocation"])
This approach requires:
- An S3 client and a provisioned input bucket.
- AWS Identity and Access Management (IAM) s3:PutObject permission on the caller.
- A naming scheme (UUID or similar) to avoid key collisions.
- A cleanup strategy for stale input objects.
After: Send the payload inline
import boto3, json
sagemaker_runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")
# One call, no S3 upload, no input bucket needed
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    Body=payload,
    ContentType="application/json",
)
print(response["OutputLocation"])
No S3 client, no uuid, no input bucket, no IAM permission on the input path, and no stale object cleanup.
Customer benefits
Sending the payload inline removes a network hop and dependency from each request. This translates into five concrete advantages:
- Reduced latency. One network round trip and one S3 PUT removed per request. For distributed workloads, these latency savings are significant.
- Simpler architecture. Avoids input bucket provisioning, lifecycle policies, cross-account access patterns, and the caller's IAM s3:PutObject permission on the input path.
- Fewer error paths. The request is a single API call. Either it gets queued or it doesn’t.
- Lower cost. Removes the S3 PUT charge for uploading the input on each inline call.
- Immediate validation feedback. Size and mutual exclusivity errors are returned synchronously.
When to use each approach
Inline payloads are usually the simplest choice for small payloads, but InputLocation still has its place. Use the following table to decide which path fits a given workload:
| Scenario | Recommended approach |
| Payload <= 128,000 bytes (JSON prompts, structured data) | Inline Body. Simpler. Avoids a network round trip and S3 PUT fees. |
| Payload > 128,000 bytes (images, audio, large documents) | InputLocation. Upload to S3 first. |
| Mixed workload with varying payload sizes | Branch on size. Use Body for small payloads and InputLocation for large ones. |
| Need to keep input data in S3 for audit or replay | InputLocation. Keeps the inputs in your bucket. |
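For the mixed-workload row, the branch on size can be sketched as a small helper that builds the arguments for `invoke_endpoint_async`. This is an illustrative sketch under stated assumptions: `build_invoke_kwargs` and the `upload_to_s3` callable are hypothetical names, not SDK functions, and the key naming follows the "before" example earlier in this post.

```python
import uuid

MAX_INLINE_BYTES = 128_000  # inline limit stated in this post


def build_invoke_kwargs(endpoint_name, payload, upload_to_s3,
                        content_type="application/json"):
    """Return keyword arguments for invoke_endpoint_async.

    upload_to_s3 is a hypothetical callable(key, payload) that stores the
    payload and returns its "s3://..." URI. It is called only when the
    payload is too large to send inline.
    """
    kwargs = {"EndpointName": endpoint_name, "ContentType": content_type}
    if len(payload) <= MAX_INLINE_BYTES:
        # Small payload: send it inline, no S3 upload.
        kwargs["Body"] = payload
    else:
        # Large payload: fall back to the S3 upload path.
        key = f"async-input/{uuid.uuid4()}.json"
        kwargs["InputLocation"] = upload_to_s3(key, payload)
    return kwargs
```

In an application, `upload_to_s3` could wrap `s3.put_object` as in the "before" example, and the result could be passed on as `sagemaker_runtime.invoke_endpoint_async(**kwargs)`. Passing the uploader in as a callable keeps the branching logic testable without AWS credentials.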
Getting started
See the example notebook for a complete walkthrough.
Before you begin, make sure you have:
- An existing Amazon SageMaker AI asynchronous inference endpoint (check with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).
- The latest version of the AWS SDK for Python (Boto3), installed and configured with credentials.
- IAM permissions for sagemaker:InvokeEndpointAsync.
- An S3 output bucket configured for your asynchronous endpoint (for example, my-output-bucket).
Note: Following this guide uses billable AWS resources. SageMaker AI asynchronous inference endpoints incur charges for instance time, and S3 buckets incur charges for storage and requests. Follow the cleanup steps after completing the tutorial to avoid recurring charges.
Steps
Support for inline payloads is available today. To use it:
- Update your AWS SDK. Install or upgrade Boto3 to the latest version: pip install --upgrade boto3.
- Check the installation: pip show boto3.
- Replace your calling code. In your application, replace the S3 upload + InputLocation pattern with a direct Body parameter, as shown in the previous code example.
- Test your invocation by calling the InvokeEndpointAsync API with the Body parameter.
- Check that the response contains an OutputLocation field.
- Poll or monitor the S3 OutputLocation to confirm that your inference result was written successfully.
No changes are necessary to your endpoint configuration, model container, or S3 output configuration.
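The last step, polling the output location, can be sketched as follows. This is a minimal illustration rather than a recommended production pattern: `split_s3_uri` and `wait_for_output` are hypothetical helpers, the timeout and interval values are arbitrary, and the S3 client is passed in so the logic does not depend on how it is created. It assumes the caller has permission to read the output object; without list permission on the bucket, S3 may report an access error instead of a missing key.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split "s3://bucket/key" into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(s3_client, output_location, timeout_s=300, interval_s=5):
    """Poll the output object until it exists, then return its bytes.

    s3_client is expected to behave like boto3.client("s3").
    Raises TimeoutError if the object has not appeared by the deadline.
    """
    bucket, key = split_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Output not written yet: give up at the deadline, else retry.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"No output at {output_location}")
            time.sleep(interval_s)
```

With Boto3, a call might look like `wait_for_output(boto3.client("s3"), response["OutputLocation"])`. For sustained traffic, the Amazon SNS notification mentioned earlier avoids repeated polling requests.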
Clean-up
To avoid ongoing charges, remove the resources used in this walkthrough:
- Delete the SageMaker AI endpoint if it was created for testing purposes: aws sagemaker delete-endpoint --endpoint-name my-async-endpoint
- Delete the output S3 bucket if it is no longer needed: aws s3 rb s3://my-output-bucket --force. Warning: Deleting an S3 bucket permanently deletes the objects it contains. Verify that you have saved any inference results that you need to keep.
- Delete any IAM policies created specifically for this tutorial.
Conclusion
Inline payload support for SageMaker AI Async Inference removes a common friction point in asynchronous inference workflows: the mandatory S3 upload for each request. For the majority of inference payloads that fit into 128,000 bytes, you can now make a single API call and let SageMaker AI handle the rest.
The feature is designed to be backwards compatible. Existing InputLocation workflows remain unchanged. Inline and S3 inputs are processed identically once the request is accepted, and models receive identical requests regardless of the input source.
Get started today by updating your AWS SDK and using the Body parameter on the SageMaker AI InvokeEndpointAsync API. To learn more about asynchronous inference, see the Amazon SageMaker AI Async Inference documentation.
About the authors
Dan Ferguson
Dan is a Solutions Architect at AWS, based in New York, USA. As an expert in machine learning services, Dan strives to support customers on their journey to integrating ML workflows effectively, efficiently, and sustainably.
Bruce Wang
Bruce is a Software Development Engineer on the SageMaker AI Inference DataPlane team at AWS. Bruce builds the infrastructure that powers real-time and asynchronous inference for SageMaker AI customers.

