Amazon SageMaker AI Async Inference now supports inline request payloads

Today we are announcing support for inline payloads for Amazon SageMaker AI asynchronous inference. Customers can now send inference payloads directly in the request body of the InvokeEndpointAsync API, eliminating the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each call.

For payloads up to 128,000 bytes, this eliminates a full network round trip, simplifies client-side code, and reduces the operational surface area of asynchronous inference workloads.

In this article, we explain the motivation behind this feature, review the before and after customer experience, and show you how to start using inline payloads today.

Background: How asynchronous inference worked before

You can use Amazon SageMaker AI Async Inference to queue inference requests and process them asynchronously. It is best suited for workloads with large payloads, variable traffic, or a latency tolerance of seconds to minutes. It supports auto-scaling down to zero, making it cost-effective for burst or batch workloads.

Until now, the workflow required two steps for each call:

Upload the input payload to an Amazon S3 bucket.

Invoke the endpoint, passing the S3 object URI as InputLocation.

The endpoint processes the request asynchronously and writes the output to a configured S3 output location, which the client polls or learns about through an Amazon Simple Notification Service (Amazon SNS) notification.

This two-step model works well for large payloads (images, audio, multi-MB documents). But for customers whose input payloads are small (a few KB) and whose processing times are longer than real-time inference allows, the mandatory S3 dependency added unnecessary complexity.

What’s new: Inline payload via Body parameter

With today’s launch, InvokeEndpointAsync accepts a new Body parameter. When present, the payload is sent inline in the API request itself, with no S3 upload required.

Key details:

Aspect	Details
New parameter	Body, raw bytes, limited to 128,000 bytes.
Maximum inline size	128,000 bytes (raw payload).
Mutual exclusivity	Body and InputLocation are mutually exclusive. The API rejects requests that set both.
Output behavior	Unchanged. The output is written to the S3 OutputLocation.
Endpoint compatibility	Designed to work with existing asynchronous endpoints; no model or container changes are expected.
Error handling	Size and mutual exclusivity violations return synchronous ValidationError responses.
Availability	Available in 31 commercial AWS Regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, BK, KUL, KD, HYD, CPT, MXP, TLV).
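The size limit and the mutual exclusivity rule can also be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: build_async_request and MAX_INLINE_BYTES are hypothetical names introduced for illustration, not part of the SDK, and the service remains the authority on what it accepts.

```python
MAX_INLINE_BYTES = 128_000  # inline limit stated in the table above


def build_async_request(endpoint_name, content_type, body=None, input_location=None):
    """Return keyword arguments for invoke_endpoint_async, or raise ValueError.

    Hypothetical helper: it mirrors the documented rules locally so that
    obviously invalid requests fail before any network call is made.
    """
    # Exactly one of Body and InputLocation must be provided.
    if (body is None) == (input_location is None):
        raise ValueError("Set exactly one of body or input_location")

    request = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"Inline body is {len(body)} bytes; the limit is {MAX_INLINE_BYTES}"
            )
        request["Body"] = body
    else:
        request["InputLocation"] = input_location
    return request
```

The returned dictionary can be unpacked into the call, for example sagemaker_runtime.invoke_endpoint_async(**build_async_request(...)).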

Before and after: customer experience

The change is clearest in the code. The following two examples make the same asynchronous call to the same endpoint. The first uses the S3 upload step that was required until now, and the second uses the inline Body parameter that replaces it.

Before: Upload to S3 first, then invoke



import json
import uuid

import boto3

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")

# 1. Upload the request payload to S3 (additional latency + cost)
input_key = f"async-input/{uuid.uuid4()}.json"
s3.put_object(Bucket="my-async-bucket", Key=input_key, Body=payload)
input_location = f"s3://my-async-bucket/{input_key}"

# 2. Invoke the endpoint, pointing it at the uploaded object
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation=input_location,
    ContentType="application/json",
)
print(response["OutputLocation"])

This approach requires:

A provisioned S3 client and input bucket.

AWS Identity and Access Management (IAM) s3:PutObject permission on the caller.

A naming scheme (UUID or similar) to avoid key collisions.

A cleanup strategy for stale input objects.

After: Send the payload inline



import json

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")

# One call, no S3 upload, no input bucket needed
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    Body=payload,
    ContentType="application/json",
)
print(response["OutputLocation"])

No S3 client, no uuid, no input bucket, no IAM permission on input path, no stale object cleanup.

Customer benefits

Sending the payload inline removes a network hop and dependency from each request. This translates into five concrete advantages:

Reduced latency. One network round trip and one S3 PUT removed per request. For distributed workloads, these latency savings are significant.

Simpler architecture. Avoids input bucket provisioning, lifecycle policies, cross-account access patterns, and the caller's IAM s3:PutObject permission on the input path.

Fewer error paths. The request is a single API call. Either it gets queued or it doesn’t.

Lower cost. Removes the S3 PUT charge for uploading the input of each inline call.

Immediate validation feedback. Size and mutual exclusivity errors are returned synchronously.
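Because validation failures come back synchronously, the caller can tell a rejected request apart from other failures at the call site. The following is a rough sketch, assuming the error surfaces the way Boto3 client errors usually do, as an exception carrying a response dictionary with an Error code, and assuming that code is the ValidationError string named above. The helpers error_code and invoke_inline are hypothetical. Real code would normally catch botocore.exceptions.ClientError; this sketch inspects the exception by attribute so that it stays free of dependencies.

```python
def error_code(exc):
    """Return the service error code carried by a Boto3-style exception, if any."""
    response = getattr(exc, "response", None)
    if isinstance(response, dict):
        return response.get("Error", {}).get("Code")
    return None


def invoke_inline(runtime_client, endpoint_name, payload, content_type="application/json"):
    """Send an inline payload and return the S3 URI where the output will be written.

    runtime_client is expected to be a boto3 "sagemaker-runtime" client.
    """
    try:
        response = runtime_client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            Body=payload,
            ContentType=content_type,
        )
    except Exception as exc:
        # Assumption: size and mutual exclusivity violations carry this code.
        if error_code(exc) == "ValidationError":
            raise ValueError(f"Request rejected before queuing: {exc}") from exc
        raise  # throttling, auth, and other failures are left to the caller
    return response["OutputLocation"]
```

Converting the rejection to ValueError is only one design choice; it lets callers treat an invalid request as a programming error rather than a transient failure worth retrying.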

When to use each approach

Inline payloads are usually the simplest choice for small payloads, but InputLocation still has its place. Use the following table to decide which path fits a given workload:

Scenario	Recommended approach
Payload <= 128,000 bytes (JSON prompts, structured data)	Inline Body. Simpler. Avoids a network round trip and S3 PUT fees.
Payload > 128,000 bytes (images, audio, large documents)	InputLocation. Upload to S3 first.
Mixed workload with varying payload sizes	Branch according to size. Use Body for small payloads and InputLocation for large ones.
Need to keep input data in S3 for audit or replay	InputLocation. Keeps inputs in your bucket.
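For the mixed-workload row, the branch can live in one small function. The following is a minimal sketch, assuming the same bucket and key layout as the earlier "before" example. invoke_by_size is a hypothetical helper, and the clients are passed in as arguments so that the function is easy to test.

```python
import uuid

MAX_INLINE_BYTES = 128_000  # threshold from the table above


def invoke_by_size(runtime_client, s3_client, endpoint_name, payload, bucket,
                   prefix="async-input", content_type="application/json"):
    """Send small payloads inline and large ones through S3; return OutputLocation.

    runtime_client and s3_client are expected to be boto3 "sagemaker-runtime"
    and "s3" clients. The ".json" key suffix assumes a JSON payload.
    """
    if len(payload) <= MAX_INLINE_BYTES:
        # Small payload: one API call, no S3 upload.
        response = runtime_client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            Body=payload,
            ContentType=content_type,
        )
    else:
        # Large payload: upload first, then reference the object.
        key = f"{prefix}/{uuid.uuid4()}.json"
        s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
        response = runtime_client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            InputLocation=f"s3://{bucket}/{key}",
            ContentType=content_type,
        )
    return response["OutputLocation"]
```

With this shape, the S3 client and s3:PutObject permission are still needed, but only the oversized requests pay for the extra round trip.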

Getting started

See the example notebook for a complete walkthrough.

Before you begin, make sure you have:

An existing Amazon SageMaker AI asynchronous inference endpoint (check with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).

The latest AWS SDK for Python (Boto3), installed and configured with credentials.

IAM permissions for sagemaker:InvokeEndpointAsync.

An S3 output bucket configured for your asynchronous endpoint (for example, my-output-bucket).

Note: Following this guide uses billable AWS resources. SageMaker AI asynchronous inference endpoints incur charges for instance time, and S3 buckets incur charges for storage and requests. Follow the cleanup steps after completing the tutorial to avoid recurring charges.

Steps

Support for inline payloads is available today. To use it:

Update your AWS SDK. Install or upgrade Boto3 to the latest version: pip install --upgrade boto3.

Check the installation: pip show boto3.

Replace your calling code. In your application, replace the S3 upload + InputLocation pattern with a direct Body parameter, as shown in the previous code example.

Test your invocation by calling the InvokeEndpointAsync API with the Body parameter.

Check that the response contains an OutputLocation field.

Poll or monitor the S3 OutputLocation to confirm that your inference result was written successfully.

No changes are necessary to your endpoint configuration, model container, or S3 output configuration.
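The last step, confirming that the result was written, can be done by polling the returned OutputLocation. The following is a minimal sketch, assuming a Boto3 S3 client and a caller that is allowed to read the output object. parse_s3_uri and wait_for_output are hypothetical helpers, and the timeout and interval defaults are arbitrary. An SNS notification, where configured, avoids polling altogether.

```python
import time
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not an S3 object URI: {uri!r}")
    return parsed.netloc, key


def wait_for_output(s3_client, output_location, timeout_s=300.0, interval_s=5.0):
    """Poll S3 until the output object exists, then return its bytes.

    Raises TimeoutError if the object has not appeared within timeout_s.
    """
    bucket, key = parse_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Not written yet: wait and retry until the deadline passes.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"No output at {output_location} after {timeout_s} seconds"
                ) from None
            time.sleep(interval_s)
```

A typical call would be wait_for_output(boto3.client("s3"), response["OutputLocation"]). A request that fails during inference never produces this object, so the timeout is what ends the loop in that case.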

Clean-up

To avoid ongoing charges, remove the resources used in this walkthrough:

Delete the SageMaker AI endpoint if it was created for testing purposes: aws sagemaker delete-endpoint --endpoint-name my-async-endpoint

Delete the output S3 bucket (if it is no longer needed). Warning: Deleting an S3 bucket permanently deletes the objects it contains. Verify that you have saved any inference results that you need to keep. aws s3 rb s3://my-output-bucket --force

Delete any IAM policies created specifically for this tutorial.

Conclusion

Inline payload support for SageMaker AI Async Inference removes a common friction point in asynchronous inference workflows: the mandatory S3 upload for each request. For inference payloads that fit within 128,000 bytes, you can now make a single API call and let SageMaker AI handle the rest.

The feature is designed to be backwards compatible. Existing InputLocation workflows remain unchanged. Inline and S3 inputs are processed identically once the request is accepted, and models receive identical requests regardless of the input source.

Get started today by updating your AWS SDK and using the Body parameter on the SageMaker AI InvokeEndpointAsync API. To learn more about asynchronous inference, see the Amazon SageMaker AI Async Inference documentation.

About the authors

Dan Ferguson

Dan is a Solutions Architect at AWS, based in New York, USA. As an expert in machine learning services, Dan strives to support clients on their journey to integrating ML workflows effectively, efficiently, and sustainably.

Bruce Wang

Bruce is a Software Development Engineer on the SageMaker AI Inference DataPlane team at AWS. Bruce builds the infrastructure that powers real-time and asynchronous inference for SageMaker AI customers.
