Getting started with Amazon Bedrock inference profiles


As anyone who works in the world of tech will know, almost as soon as you’ve written something down it’s out of date. Well, that happened to me! I published my Apportioning Amazon Bedrock invocation costs post on 31 October 2024, detailing how you can build a secure and scalable proxy layer in front of Amazon Bedrock to easily apportion model invocation costs per user. At the end of the article, I said it’d be great if AWS had a first-party solution to this problem. One day later, my wishes were answered! On 1 November 2024, AWS announced that Amazon Bedrock inference profiles now support Cost Allocation Tags.

Cost Allocation Tags aren’t new - they’ve been around for a long time and are the primary way of apportioning AWS costs and usage on custom dimensions. These could be departments in your organisation (e.g. Marketing, Sales, Back Office), individual people (e.g. Joe Bloggs, Jane Doe), or really anything you want! Support is broad across most AWS resources, so being able to track Bedrock model invocation costs with this mechanism is a great release.

Setting up an inference profile

At this stage it’s worth us differentiating between an application inference profile and a system inference profile.

System inference profiles are AWS-defined profiles that transparently route invocation requests to different regions during peak periods. An example is the US Anthropic Claude 3.5 Sonnet system inference profile, which allows you to make requests originating from us-east-2, us-east-1, or us-west-2 and use model capacity in the other regions. Something worth considering is that (at present) for the US system inference profiles, “requests originating from us-east-2 can be routed to us-east-1 or us-west-2. However, requests originating in us-east-1 and us-west-2 won’t be routed to us-east-2.”
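
If you want to see which system inference profiles are available to you, the CLI can list them. As a sketch, the filter below narrows the output to the AWS-defined profiles in us-east-1; swap the region for your own.

Terminal window
aws bedrock list-inference-profiles \
  --type-equals SYSTEM_DEFINED \
  --region us-east-1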

We’ll be working with application inference profiles, which let you define custom tags on top of an existing system inference profile or create your own profile from a single model. It doesn’t appear that you can build your own custom cross-region inference profile at present, which is a shame. The tags we define are the ones we’ll use for allocating cost and usage.

To maintain adherence to best practice, we’ll be creating our inference profile using infrastructure-as-code. Terraform doesn’t currently support this resource, so we’ll be using AWS CloudFormation.

The first thing we’ll need to do is find the ARN of a foundation model to build our application inference profile on top of. I’m working in the us-east-1 region here, but replace it with whatever region you’ve got access to models in.

Terminal window
aws bedrock list-foundation-models --query 'modelSummaries[].modelArn' --region us-east-1

I’ve chosen to use Claude 3.5 Sonnet for this, and the ARN is arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0.
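
As an aside, if you already know which provider and model family you’re after, you can narrow the listing down rather than scrolling through every model. The query below is just one way of filtering; adjust the model ID fragment to suit.

Terminal window
aws bedrock list-foundation-models \
  --by-provider anthropic \
  --query "modelSummaries[?contains(modelId, 'claude-3-5-sonnet')].modelArn" \
  --region us-east-1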

Now, let’s define the resources in a CloudFormation template. To make this template scalable, we’re going to use the Fn::ForEach extension to avoid repeating ourselves. To use this transform, we need to include it at the top of our template (as seen on line 1).

We need three pieces of information to be passed into the template to create our profiles. Firstly, the ARN of the foundation model or system inference profile that the application inference profiles should run on top of. This is the pModelSource parameter. Next, the pCostDimensionName parameter will define the name of the tag that’ll be applied to the profile. Finally, the pCostDimensionValues parameter is a comma-separated list of tag values to create profiles for. Take a look at the default values for suggestions of what could be included.

As mentioned already, we’ll be using the Fn::ForEach transform to make this a dynamic and scalable operation. For more details on this transform, check out my blog post on the feature when it was first released, or the AWS documentation.

Defining the application inference profile is fairly straightforward. All we need to provide is a name, description, the source (i.e. foundation model or system inference profile ARN), and the tags to apply. We construct all of this from the parameters.

template.yml
Transform: AWS::LanguageExtensions
Description: >
  Creates a set of Bedrock Application Inference Profiles for an
  organisation's departments.
Parameters:
  pModelSource:
    Type: String
    Description: >
      The Foundation Model ARN, or ARN of the System Inference Profile,
      to base the Application Inference Profiles off of
  pCostDimensionName:
    Type: String
    Description: The name of the cost dimension (e.g. Department)
    Default: Department
  pCostDimensionValues:
    Type: CommaDelimitedList
    Description: >
      A comma-delimited list of cost dimension values to create inference
      profiles for (e.g. Marketing,Sales,BackOffice)
    Default: Marketing,Sales,BackOffice
Resources:
  Fn::ForEach::ApplicationInferenceProfiles:
    - CostDimensionValue
    - !Ref pCostDimensionValues
    - "ApplicationInferenceProfile${CostDimensionValue}":
        Type: AWS::Bedrock::ApplicationInferenceProfile
        Properties:
          InferenceProfileName: !Sub "${AWS::StackName}-${CostDimensionValue}-${pCostDimensionName}"
          Description: !Sub "Inference profile for the ${CostDimensionValue} ${pCostDimensionName}"
          ModelSource:
            CopyFrom: !Ref pModelSource
          Tags:
            - Key: !Ref pCostDimensionName
              Value: !Ref CostDimensionValue

To create the CloudFormation stack containing these resources, you can use the management console or CLI. Find below an example CLI command that’ll create a stack with the default Department tag and the default values.

Terminal window
aws cloudformation create-stack \
  --stack-name bedrock-inference-profiles \
  --template-body file://template.yml \
  --parameters ParameterKey=pModelSource,ParameterValue=arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0 \
  --region us-east-1

You should find that this completes successfully after a short amount of time. As there’s no management console support for viewing created profiles yet, you’ll need to use the CLI to verify everything. First, check the resources created by the CloudFormation stack and grab one of the application inference profile ARNs.
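
One way to pull those out is with describe-stack-resources; I’m assuming here that the physical resource ID of each AWS::Bedrock::ApplicationInferenceProfile resource identifies the profile.

Terminal window
aws cloudformation describe-stack-resources \
  --stack-name bedrock-inference-profiles \
  --query "StackResources[?ResourceType=='AWS::Bedrock::ApplicationInferenceProfile'].PhysicalResourceId" \
  --region us-east-1

You can then run the following command, replacing the identifier and region with your applicable values, to verify successful creation.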

Terminal window
aws bedrock get-inference-profile \
--inference-profile-identifier arn:aws:bedrock:us-east-1:012345678901:application-inference-profile/a12rhfisf43d \
--region us-east-1

Using the profile for model invocation

Invoking a model using an application inference profile rather than the model directly is a one-line change. I’m using Python here, switching from the Claude 3.5 Sonnet v1 foundation model to one of the created application inference profiles that uses the same model. If you’re using a different model, the contents of your native_request variable will likely look a little different.

import json

import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the region the profile lives in.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Before: invoking the foundation model directly.
# model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
# After: the one-line change to invoke via the application inference profile.
model_id = "arn:aws:bedrock:us-east-1:012345678901:application-inference-profile/a12rhfisf43d"

prompt = """
This is a test prompt. What foundation model are you?
"""

# Format the request payload using the model's native structure.
native_request = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "temperature": 0.5,
    "messages": [
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}],
        }
    ],
}

# Convert the native request to JSON.
request = json.dumps(native_request)

try:
    # Invoke the model with the request.
    response = bedrock.invoke_model(modelId=model_id, body=request)

    # Decode the response body.
    model_response = json.loads(response["body"].read())

    # Extract and print the response text.
    response_text = model_response["content"][0]["text"]
    print(response_text)
except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)

Securing it

It’s all well and good creating application inference profiles to track cost and usage, but what good is it if people can still invoke the foundation model directly? Let’s now look at how IAM policies should be changed to ensure people are always using the inference profiles.

Typically, you might see an IAM policy where an action like bedrock:InvokeModel is allowed for a resource like arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0 (or *, but you shouldn’t). Not many changes are required to support and require the use of an application inference profile.

The first thing that needs to be added is the ARN of the inference profile(s) to the Resource array. This allows the IAM principal to use the profile, but doesn’t enforce that it’s always used. That’s where the Condition comes in. We’re defining here that the bedrock:InferenceProfileArn condition key must equal (StringEquals) the ARN of our inference profile. As this condition applies to the whole statement, it enforces use of the inference profile whenever a user wants to invoke the Claude 3.5 Sonnet v1 foundation model. Neat!

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel*"],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
        "arn:aws:bedrock:us-east-1:012345678901:application-inference-profile/a12rhfisf43d"
      ],
      "Condition": {
        "StringEquals": {
          "bedrock:InferenceProfileArn": "arn:aws:bedrock:us-east-1:012345678901:application-inference-profile/a12rhfisf43d"
        }
      }
    }
  ]
}
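
If you want to sanity-check the policy before rolling it out, the IAM policy simulator can help. A hedged example: the role ARN below is a placeholder, and because the bedrock:InferenceProfileArn condition key isn’t supplied for a direct-to-model call, the simulated direct invocation should come back as implicitDeny.

Terminal window
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::012345678901:role/bedrock-user \
  --action-names bedrock:InvokeModel \
  --resource-arns arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0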

Apportioning cost

With the application inference profiles created and always being used during model invocation, cost and usage can now be easily apportioned. If you’re using a tag that isn’t already activated as a Cost Allocation Tag, you’ll need to do that first from the Billing console. If you don’t see the tag there even after you’ve used your application inference profile, check back after around 24 hours as it can sometimes take a while for usage data to finalise.
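
If you’d rather script the activation than click through the console, the Cost Explorer API exposes it too. This assumes you’re running as the management (payer) account and that the tag key matches what you set in the template.

Terminal window
aws ce update-cost-allocation-tags-status \
  --cost-allocation-tags-status TagKey=Department,Status=Active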

You can backfill Cost Allocation Tag data for up to 12 months once you’ve activated a tag.

Once a tag is activated and recording data, you can use Cost Explorer to apportion costs. Cost Explorer lets you group by, or filter by, tag values. For example, you could group by the Department tag to see what costs have been incurred by resources tagged for each department (e.g. $100 for Marketing, $150 for Sales, $200 for BackOffice). Alternatively, you could filter by the Department tag with a value of Marketing to include only that department’s costs, and then group by AWS service to see where they’re spending the most.
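
The same grouping is available programmatically. As a sketch (the dates are placeholders, and this assumes the Department tag has been active for the period), the following asks for a month’s unblended costs grouped by the tag’s values.

Terminal window
aws ce get-cost-and-usage \
  --time-period Start=2024-11-01,End=2024-12-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=Department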

Rounding up

To conclude, the addition of Cost Allocation Tag support to application inference profiles is a great step forward in efforts to truly understand the costs incurred by different cost centres when invoking models through Amazon Bedrock.

What I’d love to see is further enhancements to application inference profiles - in particular, the ability to define multiple models like a system inference profile does. I’d like to be able to set something up where I can say “Priority 1 is Claude 3.5 Sonnet in us-west-1; if that’s unavailable, try Claude 3.5 Sonnet in us-west-2 or us-east-1; if that’s still unavailable, try Claude 3.5 Haiku in any of us-east-1, us-west-1 or us-west-2.” It’d be relatively easy to build something custom to do it (a rough sketch follows below), but having it natively and managed by AWS would be awesome.
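
For illustration only - the profile ARNs are placeholders and the error codes I’m treating as “capacity issues” are an assumption - here’s roughly what that custom fallback could look like:

import json

import boto3
from botocore.exceptions import ClientError

# Hypothetical priority-ordered fallback chain of (region, profile ARN) pairs.
FALLBACK_CHAIN = [
    ("us-west-1", "arn:aws:bedrock:us-west-1:012345678901:application-inference-profile/sonnetprofile"),
    ("us-west-2", "arn:aws:bedrock:us-west-2:012345678901:application-inference-profile/sonnetprofile"),
    ("us-east-1", "arn:aws:bedrock:us-east-1:012345678901:application-inference-profile/haikuprofile"),
]

# Error codes taken to mean "no capacity here, try the next option".
RETRYABLE_CODES = {"ThrottlingException", "ServiceUnavailableException", "ModelNotReadyException"}

def invoke_with_fallback(request: str) -> dict:
    """Try each profile in priority order, moving on when capacity is unavailable."""
    for region, profile_arn in FALLBACK_CHAIN:
        client = boto3.client("bedrock-runtime", region_name=region)
        try:
            response = client.invoke_model(modelId=profile_arn, body=request)
            return json.loads(response["body"].read())
        except ClientError as e:
            # Anything other than a capacity-style error is a real problem.
            if e.response["Error"]["Code"] not in RETRYABLE_CODES:
                raise
    raise RuntimeError("All fallback profiles were unavailable")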