
Every product manager I work with eventually asks the same question: "How much will this AI feature cost to run?" They want a real number — not "it depends" and not a range so wide it is useless for budgeting.
The problem is not that the cost is unknowable. It is that most engineers give PMs a per-call cost when PMs need a per-month cost at a specific usage volume. Those are different calculations, and skipping the translation is where bad estimates come from.
Here is the method I use to answer the question in under 10 minutes for any feature spec.
The four inputs you need before you can estimate
Token cost is a function of four variables. If you do not have all four, you are guessing, not estimating.
1. Model and pricing. Get the current price per million input tokens and per million output tokens from the provider's pricing page. As of April 2026: GPT-4o is roughly $2.50/1M input and $10/1M output. Claude Sonnet 4 is $3/1M input and $15/1M output. Groq Llama 3.3 70B is around $0.59/1M input and $0.79/1M output. Prices change — always pull from the pricing page on the day you are building the estimate, not from memory.
2. Prompt size in tokens. Count your system prompt tokens using the provider's tokenizer (OpenAI's tiktoken library, Anthropic's token counting API). Include the system prompt, any few-shot examples, the user input template with its variable slots, and any context injected at runtime (retrieved chunks from a vector database, user history, etc.). Do this for the realistic worst case, not the average — costs are driven by P90 prompt size, not median.
3. Average output size. Run 20–30 test calls against real or realistic inputs and measure the actual output token count. Do not use max_tokens as your output size estimate — that is the ceiling, not the average. The average is usually 30–60% of the ceiling.
4. Calls per user per day and MAU/DAU. The PM owns this number. Get a conservative and an expected estimate. If the feature is a chat interface, track how many messages an active user sends per session and how many sessions per day. If the feature is a background job (document summarization, daily digest), track how often it runs per user.
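Step 3 in particular benefits from being mechanical rather than eyeballed. A minimal sketch of turning a batch of measured output lengths into the averages and P90 figures described above — the token counts and max_tokens value here are invented sample data, not measurements from a real feature:

```python
# Sketch: reducing ~20 measured output lengths to the estimate inputs.
# These token counts are invented sample data, not real measurements.
from statistics import mean, quantiles

output_tokens = [210, 340, 280, 190, 450, 310, 260, 380, 220, 300,
                 270, 330, 240, 410, 290, 250, 360, 230, 320, 275]
max_tokens = 800  # the API ceiling -- NOT the output-size estimate

avg = mean(output_tokens)                  # use this in cost-per-call math
p90 = quantiles(output_tokens, n=10)[-1]   # use P90 for worst-case sizing

print(f"avg = {avg:.0f} tokens ({avg / max_tokens:.0%} of max_tokens)")
print(f"p90 = {p90:.0f} tokens")
```

Running this on your own test batch gives you both the average for the cost model and the P90 for worst-case sizing in one pass.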
Building the cost model
With those four inputs, the calculation is straightforward.
Cost per call:
(input_tokens / 1,000,000) × input_price + (output_tokens / 1,000,000) × output_price
Example: a feature with a 2,000-token prompt, 300 tokens of output, on GPT-4o:
(2000 / 1,000,000) × $2.50 + (300 / 1,000,000) × $10 = $0.005 + $0.003 = $0.008 per call
Monthly cost at volume:
cost_per_call × calls_per_user_per_day × DAU × 30
At 10,000 DAU making 3 calls per day:
$0.008 × 3 × 10,000 × 30 = $7,200/month
Run this calculation for your conservative, expected, and aggressive DAU scenarios. The spread between scenarios is the uncertainty range. If the spread is 10× (conservative: $800/month, aggressive: $8,000/month), the model selection or prompt design needs more work before you commit to a feature spec.
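The whole model fits in a few lines of code. A minimal sketch using the GPT-4o prices and the 2,000-in/300-out example above — the three DAU scenario volumes are placeholders, not recommendations:

```python
# Sketch of the cost model above. Prices are the article's GPT-4o example;
# always pull current numbers from the pricing page before estimating.

INPUT_PRICE = 2.50 / 1_000_000    # $ per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

def monthly_cost(per_call: float, calls_per_user_per_day: float,
                 dau: int, days: int = 30) -> float:
    return per_call * calls_per_user_per_day * dau * days

per_call = cost_per_call(2_000, 300)  # $0.008, as in the worked example
for label, dau in [("conservative", 1_000), ("expected", 10_000),
                   ("aggressive", 30_000)]:
    print(f"{label}: ${monthly_cost(per_call, 3, dau):,.0f}/month")
```

The expected scenario reproduces the $7,200/month figure above; swapping in your own prices, token counts, and DAU scenarios gives the full spread in one run.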
The three assumptions that kill estimates
1. Underestimating the system prompt. Engineers often estimate the system prompt at the size of the template they have written. In production, the system prompt grows. Few-shot examples get added. Safety instructions get appended. RAG retrieval injects 1,000–3,000 tokens of context per call. I have seen features go from a $0.004/call estimate to $0.019/call in production because the prompt ended up several times the size of the spec version by launch. Always measure a production-realistic prompt, not the minimal demo version.
2. Assuming output tokens are proportional to input tokens. They are not. A 2,000-token prompt asking for a one-sentence summary produces ~30 output tokens. A 500-token prompt asking for a detailed plan produces ~800 output tokens. Measure actual output from real test cases, stratified by the types of inputs your users will submit — not just average-case inputs.
3. Not accounting for retries. If your feature has error handling that retries on timeout or rate-limit errors, every retry incurs the full token cost again. A 5% retry rate on a feature with 100K calls per day adds 5,000 calls per day. At $0.008/call that is $40/day — $1,200/month — that shows up in your bill but not in your estimate.
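Retries fold into the model as a one-line adjustment. A sketch using the 5% rate, $0.008/call, and 100K calls/day from the example above:

```python
# Sketch: retries multiply effective call volume by (1 + retry_rate).
# Figures are the article's example: $0.008/call, 100K calls/day, 5% retries.

def monthly_cost_with_retries(per_call: float, calls_per_day: float,
                              retry_rate: float = 0.0, days: int = 30) -> float:
    # Every retry re-incurs the full token cost of the original call.
    return per_call * calls_per_day * (1 + retry_rate) * days

base = monthly_cost_with_retries(0.008, 100_000)                   # ~$24,000/month
padded = monthly_cost_with_retries(0.008, 100_000, retry_rate=0.05)
print(f"retry overhead: ${padded - base:,.0f}/month")
```

Put the retry rate in the model up front so the padded number is the one that goes into the spec.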
What to do when the estimate is too high
This happens. The feature the PM specified costs $18,000/month at projected volume and the budget is $3,000/month. There are three levers:
Downgrade the model. GPT-4o → GPT-4o-mini drops input cost from $2.50 to $0.15/1M tokens. For features where GPT-4o-level quality is not necessary (classification, extraction, formatting), mini-class models often deliver 85–95% of the quality at 10–15% of the cost. Build the estimate for both and give the PM the quality-cost trade-off to make.
Shrink the prompt. If retrieval is injecting 3,000 tokens of context per call, evaluate whether 1,500 tokens of the highest-scoring chunks achieves equivalent quality. Prompt compression reduces both input cost and latency.
Cache repeated inputs. If the system prompt is the same across all users (or all users in a cohort), prompt caching on Claude or OpenAI reduces the effective cost of the system prompt to near zero after the first call. A 1,500-token system prompt that is cached costs $0.30/1M tokens instead of $3/1M — a 10× reduction on the input side.
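To see what caching does to a per-call number, here is a sketch using the $3/1M base and $0.30/1M cached-read prices from the Claude example above. The 1,500-token system prompt and 500 dynamic tokens per call are illustrative, and the sketch ignores any one-time cache-write surcharge:

```python
# Sketch: input-side cost per call with and without a cached system prompt.
# Prices follow the Claude Sonnet example in the text; check live pricing.
BASE = 3.00 / 1_000_000     # $ per uncached input token
CACHED = 0.30 / 1_000_000   # $ per cached-read input token

def input_cost(system_tokens: int, dynamic_tokens: int, cached: bool) -> float:
    system_rate = CACHED if cached else BASE
    return system_tokens * system_rate + dynamic_tokens * BASE

uncached = input_cost(1_500, 500, cached=False)    # ~$0.006 per call
with_cache = input_cost(1_500, 500, cached=True)   # ~$0.00195 per call
print(f"input cost per call: ${uncached:.5f} -> ${with_cache:.5f}")
```

Note the overall input-side saving depends on how much of the prompt is cacheable: the 10× reduction applies to the shared system prompt, not to the per-call dynamic tokens.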
The AI Agent Cost Estimator builds this model automatically — enter your model, token counts, and volume, and it produces the monthly cost table with conservative/expected/aggressive scenarios. The output is a table you can paste directly into a product spec or a budget conversation without formatting work.
Giving PMs a real number requires 10 minutes and four inputs. The alternative — launching a feature without a cost model — is how you discover a $40,000/month bill in a budget review the quarter after launch.