Token Consumption
In the prior topic, you learned that a token is the atomic unit of an LLM. Besides being a core building block of the models themselves, tokens are also, in many cases, the unit of cost for companies building with AI.
Here's a quick refresher on how tokens are consumed:
- Some action, whether user- or system-generated, triggers a call to the LLM. That call often includes context data.
- Input data is tokenized; it is broken into smaller sub-units and run through the LLM to determine the most likely output.
- The LLM returns output tokens as a response, which is served back to the user or system.
Notice that tokens are consumed for both input data and output data. If you're building with LLMs, you will pay for both input and output. Often the amount you pay is different for the two. Understanding 'why' warrants a brief primer on Graphics Processing Units (GPUs).
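To make this concrete, here's a minimal tokenization sketch in Python using the tiktoken library (one common tokenizer, used here as an illustrative assumption; exact token counts vary by model and tokenizer):

```python
# A minimal tokenization sketch using the tiktoken library (pip install tiktoken).
# Token counts vary by model; cl100k_base is just one widely used encoding.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the following support ticket in one sentence."
input_token_ids = encoding.encode(prompt)

print(f"Input token count: {len(input_token_ids)}")
print(f"Token IDs: {input_token_ids}")

# The model's reply is billed the same way, as output tokens:
reply = "The customer reports that password resets fail on mobile devices."
print(f"Output token count: {len(encoding.encode(reply))}")
```

Both counts feed directly into your bill, which is why trimming unnecessary context is one of the simplest cost optimizations.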



Intro to GPUs
We could easily spend an entire topic on GPUs, but we'll keep this short and sweet. LLMs are powered by GPUs, which were originally developed to accelerate the rendering of images, videos, and animations. They turned out to be perfectly suited for AI workloads as well.
GPUs excel at parallel processing, meaning they can perform many computations simultaneously. This makes them great for training models, as well as for running inference once a model is trained.
To understand how GPU compute influences AI pricing, you can imagine that if an LLM is a thought process or thinking framework, the GPU is the brain.
Comparing a GPU to a brain is actually incredibly helpful. Here are a few examples you can relate to:
- Input vs Output: Reading a math problem is a lot easier than actually solving it. You can process what you're being asked to solve, but actually crunching the numbers takes a bit more brainpower.
- Context Windows: The more information you have to process, the more brainpower it takes to think through a problem.
- Model Size: Finding an answer in a very long book takes more effort than finding it in a brief picture book.

As a simple rule of thumb, if something takes more brainpower for you, it would take more brainpower for a GPU. Brainpower equals compute; compute equals cost.
The same scenarios we described above apply to LLMs:
- Input vs Output: Ingesting and tokenizing data is fairly easy and straightforward. Computing relationships between tokens or generating output tokens is harder.
- Context Windows: Taking in more input data costs more than using smaller context windows.
- Model Size: A larger model has more parameters, and each response takes more compute to generate.
These principles are all plainly visible on the pricing pages of LLM providers.
Model Pricing
Fixed vs Variable Pricing
If you're building an AI-centered SaaS or integrating AI into an existing product or tech stack, you'll come across two main service options for LLMs: Provisioned Throughput and PAYG (pay as you go).
Provisioned Throughput
Provisioned Throughput usually refers to a service model where you pay for guaranteed throughput: the ability to process a certain volume of input and output tokens within a specific window of time. A good comparison is paying for a specific internet speed for your home.
- Google offers Provisioned Throughput via 'Generative AI Scale Units (GSUs)'.01 Google publishes its GSU throughput estimates.02
- AWS offers Provisioned Throughput and sells 'model units' for 1-month or 6-month terms.03
- Microsoft Azure offers Provisioned Throughput through its Azure OpenAI service.04
According to Google, provisioned throughput services are a great solution if "You are building real-time generative AI production applications, such as chatbots and agents" or if "You want to provide a consistent and predictable experience for users of your applications".
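As a rough illustration, here's a back-of-the-envelope way to size provisioned capacity. The throughput-per-unit and workload numbers below are hypothetical; real figures come from your provider's published estimates (for example, Google's GSU throughput tables):

```python
import math

# Back-of-the-envelope sizing for provisioned throughput.
# All numbers below are hypothetical placeholders for illustration only.

def units_needed(peak_requests_per_sec: float,
                 avg_tokens_per_request: float,
                 tokens_per_sec_per_unit: float) -> int:
    """Provisioned units required to cover peak token throughput."""
    peak_tokens_per_sec = peak_requests_per_sec * avg_tokens_per_request
    return math.ceil(peak_tokens_per_sec / tokens_per_sec_per_unit)

# Hypothetical workload: 50 requests/sec, ~1,200 combined input+output tokens each,
# against a unit rated (hypothetically) at 3,000 tokens/sec.
print(units_needed(50, 1_200, 3_000))  # -> 20
```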
PAYG (Pay as you go)
Also referred to as 'on-demand' pricing, PAYG allows companies to tie cost to consumption. Under this model, you pay for the quantity of input and output tokens your product consumes. Prices are commonly quoted in units of one-million tokens (Mtok).
For certain models, you may notice that pricing varies based on context window. Additionally, more advanced models often cost more.
PAYG pricing is an excellent option for scaling startups that don't have strict throughput requirements. This structure ties cost to user activity and gives you flexibility while you grow and develop your product.
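For a sense of how PAYG billing adds up, here's a simple cost estimate. The per-Mtok prices are placeholders, not real rates; check your provider's pricing page for current numbers:

```python
# A simple PAYG cost estimate. Prices below are hypothetical placeholders;
# real per-Mtok rates come from your provider's pricing page.

def token_cost(input_tokens: int, output_tokens: int,
               input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Total cost given token volumes and per-million-token (Mtok) prices."""
    return (
        (input_tokens / 1_000_000) * input_price_per_mtok
        + (output_tokens / 1_000_000) * output_price_per_mtok
    )

# Example month: 200M input tokens, 40M output tokens,
# at an illustrative $3/Mtok for input and $15/Mtok for output.
print(f"${token_cost(200_000_000, 40_000_000, 3.00, 15.00):,.2f}")  # -> $1,200.00
```

Notice that the output rate is set higher than the input rate in this example, mirroring the input-vs-output compute difference covered earlier.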
Wrapping Up
We've covered a lot here — input vs output tokens, GPUs and compute power, Provisioned Throughput vs PAYG.
Whether you're an early startup still in the R&D phase or a scaled enterprise SaaS enriching your product with AI, there are industry solutions tailored to your consumption needs.
Regardless of your stage, one thing holds true: your goal should be to maximize customer value and revenue while optimizing cost and consuming the fewest tokens necessary. The model you select becomes a key factor here. We'll talk about that next.