I've deployed a model with NVIDIA Triton Inference Server on AWS SageMaker and am trying to expose it to clients as a REST API through AWS API Gateway.
Initially, I wrote code that invokes the SageMaker endpoint directly using the MIME type application/vnd.sagemaker-triton.binary+json;json-header-size={NUMBER} (as described in the AWS documentation). Sending this MIME type in the Content-Type header, where {NUMBER} is the number of bytes to read as JSON before the binary data begins, works flawlessly.
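For reference, a minimal sketch of how such a request body is laid out (the dummy tensor and header mirror what my client code below builds; names are illustrative):

```python
import json

import numpy as np

# A 1x1 FP32 tensor: 4 bytes of binary payload
tensor = np.array([[-0.0024108887]], dtype=np.float32)

# JSON header describing the tensor, per Triton's binary data extension
header = json.dumps({
    "inputs": [{
        "name": "np_tensor",
        "shape": list(tensor.shape),
        "datatype": "FP32",
        "parameters": {"binary_data_size": tensor.nbytes},
    }],
}).encode()

# The body is the JSON header immediately followed by the raw tensor bytes
body = header + tensor.tobytes()

# json-header-size tells the server where the JSON ends and binary begins
content_type = (
    "application/vnd.sagemaker-triton.binary+json;"
    "json-header-size={}".format(len(header))
)
```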
Following the AWS blog instructions, I created an API configured to proxy requests to the SageMaker Runtime without modification. I also added application/vnd.sagemaker-triton.binary+json to the API's Binary Media Types so the payload is passed through as binary, unaltered.
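For completeness, registering that binary media type through the AWS CLI looks roughly like this (a sketch: the REST API id and stage name are placeholders, and the `/` in the media type must be escaped as `~1` in the patch path):

```shell
# Register the Triton binary MIME type as a binary media type
# (rest-api-id is a placeholder)
aws apigateway update-rest-api \
    --rest-api-id abc123 \
    --patch-operations "op=add,path=/binaryMediaTypes/application~1vnd.sagemaker-triton.binary+json"

# A new deployment is required for the change to take effect
aws apigateway create-deployment \
    --rest-api-id abc123 \
    --stage-name prod
```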
However, when I test through the API Gateway endpoint, I get an error: unexpected size for input 'np_tensor', expecting 4 additional bytes. This suggests the Triton server never receives the binary tail of the request (the 4 missing bytes match the single FP32 value exactly), presumably because of how API Gateway processes the request.
It appears that API Gateway is not preserving the Content-Type: application/vnd.sagemaker-triton.binary+json;json-header-size={NUMBER} header; omitting that header when calling the SageMaker endpoint directly produces the same error.
The logs show that the header is initially present, but subsequent entries contain only truncated output, which isn't much help.
Here is the client code I used (Python):

```python
import json

import boto3
import numpy as np
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

aws_region = 'us-east-1'

# API Gateway URL
url = ""
# SageMaker endpoint URL (commented out since we're using API Gateway)
# url = ""

# Sample dummy input data for testing
input_data = np.array([[-0.0024108887]]).astype('float32')

# Define the request body for the Triton server
json_request = {
    "inputs": [
        {
            "name": "np_tensor",
            "shape": list(input_data.shape),
            "datatype": "FP32",
            "parameters": {"binary_data_size": input_data.nbytes},
        },
    ],
    "outputs": [
        {"name": "transcription", "parameters": {"binary_data": True}},
    ],
}

# Serialize the JSON header and append the raw tensor bytes
json_header = json.dumps(json_request).encode()
request_body = json_header + input_data.tobytes()
header_length = len(json_header)  # size in bytes, not characters

# # Not needed for API Gateway
# # AWS session and credentials setup
# session = boto3.Session()
# credentials = session.get_credentials()
# # AWS request with SigV4 authentication
# request = AWSRequest(method="POST", url=url, data=request_body)
# SigV4Auth(credentials, 'sagemaker', aws_region).add_auth(request)
# signed_headers = dict(request.headers)

# Prepare headers, including the custom Content-Type header
signed_headers = {
    "Content-Type": "application/vnd.sagemaker-triton.binary+json;"
                    "json-header-size={}".format(header_length)
}

# Send the request and print the response
response = requests.post(url, headers=signed_headers, data=request_body)
print(response.content.decode("utf8"))
```
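Since the output `transcription` is requested with `binary_data: True`, the response itself comes back in the binary+json format; here is a sketch of splitting it (assuming the server echoes a `json-header-size` parameter in the response Content-Type, as the SageMaker docs describe; the helper name is mine):

```python
import json
import re


def split_binary_json(content, content_type):
    """Split a binary+json payload into (JSON header, binary tail).

    Applies the json-header-size rule; the function name is illustrative.
    """
    m = re.search(r"json-header-size=(\d+)", content_type)
    if not m:  # plain JSON response, no binary tail
        return json.loads(content), b""
    size = int(m.group(1))
    return json.loads(content[:size]), content[size:]


# Synthetic example of what a response payload might look like
header = json.dumps({"outputs": [{"name": "transcription"}]}).encode()
payload = header + b"\x00\x01\x02\x03"
ctype = ("application/vnd.sagemaker-triton.binary+json;"
         "json-header-size={}".format(len(header)))
parsed, tail = split_binary_json(payload, ctype)
```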
My questions are:
- How can I ensure that AWS Gateway preserves the custom Content-Type header when proxying requests to SageMaker?
- Are there any additional configurations or settings in AWS Gateway that I might be missing to handle this type of request?
- Has anyone successfully configured a similar setup with AWS Gateway and SageMaker using Triton's binary data extension?
Any insights or suggestions would be greatly appreciated.

