The customer
Our customer is a national quick-service restaurant chain with a busy, fast-moving menu and a planning team that lives and dies by getting ahead of demand. Like a lot of chains its size, it leans heavily on limited-time items to keep the menu fresh and give people a reason to come back. Those items are exciting for customers and genuinely hard for the people behind the scenes, because every launch is a bet placed months before anyone knows whether it will pay off.
The planning, category, and supply teams we worked with are good at their jobs. They had years of instinct between them. What they didn't have was a way to turn that instinct into a number they could defend, fast, for an item that had never existed before.
The problem
About half of the chain's snack-category sales come from items that are only on the menu for a few weeks at a time. Every one of those items has to be ordered, staged, and staffed months before a customer ever sees it, and every one launches with zero sales history to forecast against.
So the planning team did the only thing they could. They guessed, carefully, from experience. But a guess that ran too high meant food the chain paid to throw away. A guess that ran too low meant empty trays in launch week and a crew stuck turning away the line out the door. Neither is a small miss when half the category rides on these items.
Why ordinary forecasting couldn't do it
Forecasting usually leans on one thing: what happened last time. A brand-new limited-time item has no last time. So standard forecasting either fell back on rough analogies or simply couldn't produce a number anyone would stand behind.
The chain set a deliberately hard target, under 20% average error on a 14-day forecast for an item with no direct history. That's below what's generally considered realistic for new-product forecasting, where the missing sales history usually pushes error much higher. Three things made it genuinely difficult.
- There was no direct history. The item had never been sold. Every usable signal had to be borrowed from somewhere else, similar products, comparable past promotions, the chain's seasonal patterns.
- A promotion bends demand out of shape. People don't buy a limited-time offer the way they buy a regular menu item. The surge that comes from something being new and temporary had to be measured, not waved away.
- One wrong number is a lot of food. Supply gets committed 8 to 12 weeks ahead of launch. Forecast too high and it becomes inventory the chain pays to destroy. Too low and it's lost sales and a rough opening week.
What we built
We worked alongside the chain's demand planning, category, and supply teams, and kept the engagement practical: prove the forecast on past launches, then prove it again on the next real one with the planners, not us, deciding what "good" meant.
Because the new items had no history of their own, we built their demand signal out of everything around them: comparable menu items, past promotions, seasonal patterns, and how individual stores tend to perform.
We also modelled the shape of a launch, not just its size. A limited-time item spikes because it's new, then settles, and that curve looks different by region and store type. Getting the first 14 to 28 days right meant predicting how demand would move week to week, not just its total.
Rather than betting on one model up front, the team tested several forecasting approaches side by side, all against past launches, all scored the same way, so the winner earned its place on evidence, not preference.
The scoring didn't stop once a model was chosen. It became part of the planning rhythm, so the team can watch forecast accuracy launch over launch, and catch a number drifting before it turns into a problem on a shelf, not after.
How the agent works, and where the Anthropic SDK comes in
The hard part of this problem was never running a forecasting model. Plenty of tools can fit a curve. The hard part was the judgment that wraps around the model: deciding which past items and which past promotions actually rhyme with a brand-new one, reading whether a seasonal pattern applies this time, and explaining the result well enough that a planner would commit real money to it. That judgment is what the agent does, and it runs on Claude through the Anthropic SDK.
We built the agent on the Anthropic Python SDK, calling Claude through the Messages API. Rather than a fixed pipeline that runs the same steps in the same order every time, the agent works as a loop. Claude looks at the new item, decides what it needs to know, goes and gets it, looks at what came back, and decides the next move. That flexibility matters here because no two launches are alike. A new dessert in summer and a new savoury snack in winter call for different comparisons, and the agent reasons its way to them instead of following a script.
The reaching-out part is built with tool use, the SDK's function-calling capability. We gave Claude a set of tools it can call on its own: pull the chain's comparable menu items, retrieve past promotions and how they actually performed, fetch seasonal and store-level patterns, run each of the candidate forecasting models, and score the results. Claude decides which tools to call and in what order, reads the data that comes back, and uses it to shape the next decision. So the model selection, the comparable-item picks, and the demand signal are all assembled by the agent in the moment, grounded in the chain's real numbers rather than in anything Claude made up.
For the backtesting, where we re-ran the agent across every past launch to prove the forecast before anyone trusted it live, we leaned on the SDK's batch processing. Running hundreds of historical launches one request at a time would have been slow and expensive. Batching let us score the whole back catalogue at a lower cost and turn the evaluation around quickly, which is a big part of how we got to a number the planners believed.
The last piece is the one that earned the agent its place in the room. Alongside every forecast, Claude writes a short, plain-language explanation of why the number is what it is: which items it borrowed signal from, which promotion it treated as the closest analogue, and what it expects the launch curve to do over the first few weeks. Planners don't have to take a black box on faith. They can read the reasoning, push back on it, and override it when their own instinct says the agent missed something. That transparency is what moved this from an interesting model into a tool the team actually uses.
The result
On 14-day demand for brand-new launches, the forecasts landed at 18.23% average error, under the chain's sub-20% target, on items with no sales history at all. Supply is now committed against a clear 14-to-28-day launch window, and a large share of category sales has moved from guesswork into a structured, measured process.