Managing AI Response States

A language model breaks the assumptions most frontend code is built on. A normal request is fast, gives the same answer every time, and arrives all at once. A model response is slow enough to watch, comes back a few words at a time over several seconds, and says something different every time you ask. The user can also stop it, ask for a new answer, or hit a refusal partway through. If you handle all of that with the same loading, success, and error states you would use for a REST call, the interface feels broken long before anything has actually gone wrong. This guide is about the states in between, and how to handle each one so the user knows what is happening and stays in control while the model works.

Map the Generation Lifecycle

A generation passes through more states than a fetch does, and each one needs its own treatment on screen. In my experience, the states that most often get skipped (thinking and stopped) are exactly the ones users notice.

Name every state: A single generation is idle before it starts, submitted once the request is sent, thinking while the model works before any text appears, streaming as tokens arrive, and finally complete, stopped, or error. Regeneration re-enters the cycle from submitted. Treating this as one state machine keeps your rendering logic honest.
Separate “sent” from “streaming”: There is often a gap of one to three seconds between sending the request and the first token. That pause is the most anxious moment for the user, so do not paper over it with a generic spinner. Give it its own Thinking Indicator.
Design the terminal states together: complete, stopped, and error all end a generation but demand different affordances. A finished message invites a follow-up, a stopped one invites a restart, a failed one invites a retry. Decide what each looks like before you write the happy path.
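One way to keep those states honest is to write them down as a single reducer. The sketch below is a minimal version, not a definitive implementation: the state names come from the list above, while the event names (`submit`, `accepted`, `token`, and so on) and the `transition` function are assumptions made for illustration. Map them to whatever your transport actually emits.

```typescript
// The seven states named above, as a discriminated union.
type GenerationState =
  | { kind: "idle" }
  | { kind: "submitted" }
  | { kind: "thinking" }
  | { kind: "streaming"; text: string }
  | { kind: "complete"; text: string }
  | { kind: "stopped"; text: string }
  | { kind: "error"; text: string; message: string };

// Hypothetical event names, chosen for this sketch only.
type GenerationEvent =
  | { type: "submit" }
  | { type: "accepted" }
  | { type: "token"; text: string }
  | { type: "done" }
  | { type: "stop" }
  | { type: "fail"; message: string }
  | { type: "regenerate" };

// Text received so far, or "" for states that carry none.
function partialText(state: GenerationState): string {
  return "text" in state ? state.text : "";
}

function transition(state: GenerationState, event: GenerationEvent): GenerationState {
  const inFlight =
    state.kind === "submitted" || state.kind === "thinking" || state.kind === "streaming";
  const terminal =
    state.kind === "complete" || state.kind === "stopped" || state.kind === "error";

  switch (event.type) {
    case "submit":
      return state.kind === "idle" ? { kind: "submitted" } : state;
    case "regenerate":
      // Regeneration re-enters the cycle from submitted.
      return terminal ? { kind: "submitted" } : state;
    case "accepted":
      return state.kind === "submitted" ? { kind: "thinking" } : state;
    case "token":
      return inFlight ? { kind: "streaming", text: partialText(state) + event.text } : state;
    case "done":
      return inFlight ? { kind: "complete", text: partialText(state) } : state;
    case "stop":
      // Stopping keeps whatever text had already arrived.
      return inFlight ? { kind: "stopped", text: partialText(state) } : state;
    case "fail":
      return inFlight
        ? { kind: "error", text: partialText(state), message: event.message }
        : state;
  }
}
```

Events that make no sense in the current state are ignored rather than thrown, so a late token arriving after a stop cannot drag the message back into streaming.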

Fill the Wait Before the First Token

The silence between submitting a prompt and seeing the first token is where users assume the app has frozen. Fill it deliberately.

Acknowledge immediately: The moment the request leaves, echo the user’s prompt into the transcript and show a Thinking Indicator in the model’s slot. Waiting until the first token arrives to change anything on screen makes even a fast model feel unresponsive.
Announce it to assistive tech: A silent animation tells a screen reader user nothing. Put the thinking state in a polite live region so it gets announced, then let the streamed text replace it.
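Both points can be sketched as two small pure functions. This is an illustration under assumptions: the `Turn` shape, the function names, and the "Thinking" copy are hypothetical, and the string returned by `liveRegionText` is meant to be rendered inside an element such as `<div role="status">`, which carries an implicit polite live region.

```typescript
interface Turn {
  role: "user" | "assistant";
  text: string;
  pending: boolean; // true while the model's reply is still in flight
}

// Called the moment the request leaves: echo the prompt and reserve the
// model's slot, so the screen changes before any token has arrived.
function acknowledge(transcript: readonly Turn[], prompt: string): Turn[] {
  return [
    ...transcript,
    { role: "user", text: prompt, pending: false },
    { role: "assistant", text: "", pending: true },
  ];
}

// Text for the polite live region. It announces the thinking state and
// goes quiet once streamed text has replaced the indicator.
function liveRegionText(turn: Turn): string {
  const waitingForFirstToken =
    turn.role === "assistant" && turn.pending && turn.text === "";
  return waitingForFirstToken ? "Thinking" : "";
}
```

The visual indicator and the announcement are driven by the same condition, so they cannot drift apart.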

Stream Tokens as They Arrive

Once text starts coming back, render it incrementally. A model that has produced three paragraphs should not look identical to one that has produced nothing.

Render partial content as it arrives: Use Streaming Response to add tokens to the message as they come in, rather than buffering the whole reply and revealing it at the end. Half-finished text is worth showing on its own, not something to hide behind a spinner.
Keep the latest text in view: Long replies scroll off-screen while they generate. Anchor the transcript to the bottom as content grows, but release that anchor the instant the user scrolls up to read. Yanking the view back down while they are reading is worse than letting the newest lines scroll out of sight.
Parse incremental markdown carefully: Model output is usually markdown, and a half-finished code fence or list will render as garbage if you re-parse on every token. Render the stabilized prefix and hold back the trailing, still-forming fragment until it resolves.
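The stabilized-prefix idea can be sketched as a function that splits the buffer in two. This is a deliberately simplified version, not a full markdown-aware implementation: `splitStable` is a hypothetical name, it only recognises backtick code fences and blank-line block boundaries, and it ignores tilde fences and nested structures.

```typescript
interface SplitResult {
  stable: string;  // safe to hand to the markdown parser
  pending: string; // trailing fragment that may still change meaning
}

function splitStable(buffer: string): SplitResult {
  const lines = buffer.split("\n");
  let offset = 0;       // character offset of the current line
  let inFence = false;  // inside an unclosed code fence
  let lastBlockEnd = 0; // offset just past the last settled block

  // The final element is a line still being written, so it is never
  // used to make a decision.
  for (let i = 0; i < lines.length - 1; i++) {
    const line = lines[i];
    const next = offset + line.length + 1; // +1 for the newline
    if (/^ {0,3}`{3}/.test(line)) {
      if (inFence) {
        inFence = false;
        lastBlockEnd = next; // the closing fence settles the whole block
      } else {
        inFence = true;
      }
    } else if (!inFence && line.trim() === "") {
      lastBlockEnd = next; // a blank line outside a fence ends a block
    }
    offset = next;
  }

  return {
    stable: buffer.slice(0, lastBlockEnd),
    pending: buffer.slice(lastBlockEnd),
  };
}
```

One reasonable way to use it is to parse `stable` as markdown and append `pending` as plain text, so the newest words stay visible without half-formed syntax reaching the parser.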

Give Users Control Over the Generation

A response that takes several seconds and comes out different every time is one users will often want to abandon or run again. Make both easy.

Always offer a stop: While a generation is streaming, the primary action should become Stop Generation, backed by an AbortController that actually cancels the underlying request. A stop that only hides the output while tokens keep burning in the background wastes money and misleads the user.
Let users ask again: Because the answer changes from one run to the next, Response Regeneration is a normal thing users will reach for, not an error path. Let them re-run the last prompt, and keep the previous answer so they can compare the two rather than lose it.
Lock the input while generating: Use a Prompt Input that disables submission (but never the stop button) while a response is in flight. Without this, users fire a second prompt into a busy stream and get two interleaved answers.
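Stop and the input lock can share one small object. The sketch below assumes a hypothetical `GenerationController` class; the only platform API it relies on is the standard `AbortController`, whose signal you would pass to the request, for example `fetch(url, { signal })`.

```typescript
class GenerationController {
  private controller: AbortController | null = null;

  // True while a response is in flight; bind the Prompt Input's
  // disabled state to this, but never the stop button's.
  get busy(): boolean {
    return this.controller !== null;
  }

  // Begins a generation and returns the signal to attach to the request.
  // Returns null when one is already running, which is the input lock.
  start(): AbortSignal | null {
    if (this.controller !== null) return null;
    this.controller = new AbortController();
    return this.controller.signal;
  }

  // Cancels the underlying request instead of merely hiding its output.
  stop(): void {
    if (this.controller !== null) {
      this.controller.abort();
      this.controller = null;
    }
  }

  // Call when the stream ends on its own (complete or error).
  finish(): void {
    this.controller = null;
  }
}
```

Because `start` refuses a second generation rather than queueing it, a double submit cannot produce two interleaved answers, and `stop` works at any point between submission and the last token.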

Handle Failures a Fetch Never Has

AI requests fail in ways a JSON endpoint does not, and several of those failures happen mid-stream, after the user has already seen half an answer.

Recover from mid-stream breaks: A dropped connection can leave a message half-written. Keep the tokens already received, mark the message as interrupted, and offer to continue or regenerate rather than discarding visible work.
Treat refusals and moderation as normal output, not errors: A content filter or a model refusal is a valid response rather than a server error. Render it as a message with a clear explanation and a path forward, instead of a red error toast.
Back off on rate limits and overload: HTTP 429 and model-overloaded responses are common under load. Apply Retry with exponential backoff and show a calm “the model is busy, retrying” state instead of a hard failure.
Distinguish context-length errors: When a conversation gets too long for the model’s context window, retrying will not help. You have to summarize or trim the history. Give that failure its own message rather than lumping it in with network errors.
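The four cases above call for four different responses, which a single classifier can make explicit. Treat this as a sketch under assumptions: the `reason` labels are hypothetical stand-ins for whatever your provider reports, `planRecovery` and `backoffDelay` are illustrative names, and the delays are arbitrary defaults. Production backoff usually adds random jitter, which is left out here to keep the function deterministic.

```typescript
type Recovery =
  | { action: "show_as_message" }        // refusal or moderation: normal output
  | { action: "trim_history" }           // context too long: retrying will not help
  | { action: "offer_continue" }         // mid-stream break: keep the partial text
  | { action: "retry"; delayMs: number } // rate limit or overload
  | { action: "show_error" };

interface FailureInfo {
  status?: number; // HTTP status, when the request was rejected outright
  // Hypothetical labels; real providers name these differently.
  reason?: "refusal" | "moderation" | "overloaded" | "context_length";
  receivedText: string; // tokens that arrived before the failure
  attempt: number;      // retries already made, starting at 0
}

// Exponential backoff: baseMs, then 2x, 4x, ... up to capMs.
function backoffDelay(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

function planRecovery(failure: FailureInfo, maxAttempts = 4): Recovery {
  if (failure.reason === "refusal" || failure.reason === "moderation") {
    return { action: "show_as_message" };
  }
  if (failure.reason === "context_length") {
    return { action: "trim_history" };
  }
  if (failure.receivedText.length > 0) {
    // The user has already seen part of an answer; do not discard it.
    return { action: "offer_continue" };
  }
  if (failure.status === 429 || failure.reason === "overloaded") {
    return failure.attempt < maxAttempts
      ? { action: "retry", delayMs: backoffDelay(failure.attempt) }
      : { action: "show_error" };
  }
  return { action: "show_error" };
}
```

The ordering is the design choice: refusals and context-length failures are checked first because no amount of retrying changes them, and partial text is protected before any automatic retry is considered.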

Keep the Conversation Coherent

Each generation sits inside a longer conversation, and where it sits changes how its states should look.

Anchor state to the right message: In a Message Thread, streaming, stopped, and error states belong to a specific turn. Track them per message so regenerating one answer never disturbs another.
Distinguish a first answer from a follow-up: A fresh conversation can show a generous empty state and suggested prompts; a continuing one should stay focused on the transcript. The same thinking state reads differently depending on where in the exchange it appears.
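Tracking state per message can be as simple as keying every update by turn id. In the sketch below the `AssistantTurn` shape and the helper names are assumptions for illustration; the `previous` array is one possible way to keep earlier answers available for comparison after a regeneration.

```typescript
type TurnStatus = "thinking" | "streaming" | "complete" | "stopped" | "error";

interface AssistantTurn {
  id: string;
  status: TurnStatus;
  text: string;
  previous: string[]; // earlier answers for this turn, oldest first
}

// Applies a change to exactly one turn and leaves every other turn
// untouched (the same object reference is reused).
function patchTurn(
  thread: readonly AssistantTurn[],
  id: string,
  patch: (turn: AssistantTurn) => AssistantTurn,
): AssistantTurn[] {
  return thread.map((turn) => (turn.id === id ? patch(turn) : turn));
}

function appendToken(thread: readonly AssistantTurn[], id: string, token: string): AssistantTurn[] {
  return patchTurn(thread, id, (turn) => ({
    ...turn,
    status: "streaming",
    text: turn.text + token,
  }));
}

// Archives the current answer, then restarts only this turn.
function regenerate(thread: readonly AssistantTurn[], id: string): AssistantTurn[] {
  return patchTurn(thread, id, (turn) => ({
    ...turn,
    status: "thinking",
    text: "",
    previous: turn.text === "" ? turn.previous : [...turn.previous, turn.text],
  }));
}
```

Since untouched turns keep their identity, a renderer that compares references can skip them entirely while one answer regenerates.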

Test the Parts You Cannot Reproduce

The hardest part of an AI interface to get right is the part you cannot reproduce on demand, so build in the hooks that let you trigger each state yourself.

Drive the states from fixtures: Wire your components to a fake stream you control, one that emits tokens on a timer, stops early, stalls before the first token, and fails mid-response. You should be able to see every state without calling a real model.
Test slow and interrupted paths explicitly: Local development with a fast, reliable model hides exactly the states this guide is about. Write cases that assert the thinking indicator appears, that stop aborts the request, and that a mid-stream error preserves partial text.
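A fake stream does not need to be elaborate. The sketch below describes a stream as a script of timed events and plays it back with timers; all names (`buildScript`, `play`, the option fields) are hypothetical, and a script with no delays plays synchronously, which keeps assertions simple.

```typescript
type StreamEvent =
  | { type: "wait"; ms: number }
  | { type: "token"; text: string }
  | { type: "end" }
  | { type: "error"; message: string };

interface FakeStreamOptions {
  tokens: string[];
  firstTokenDelayMs?: number; // stall before the first token
  intervalMs?: number;        // pause between tokens
  failAfter?: number;         // emit an error once this many tokens were sent
}

function buildScript(options: FakeStreamOptions): StreamEvent[] {
  const { tokens, firstTokenDelayMs = 0, intervalMs = 0, failAfter } = options;
  const script: StreamEvent[] = [];
  const wait = (ms: number): void => {
    if (ms > 0) script.push({ type: "wait", ms });
  };

  wait(firstTokenDelayMs);
  for (let i = 0; i < tokens.length; i++) {
    if (failAfter !== undefined && i === failAfter) {
      script.push({ type: "error", message: "fixture: connection dropped" });
      return script;
    }
    if (i > 0) wait(intervalMs);
    script.push({ type: "token", text: tokens[i] });
  }
  script.push({ type: "end" });
  return script;
}

// Plays a script, calling onEvent for everything except waits.
// Returns a cancel function, standing in for the user pressing stop.
function play(script: readonly StreamEvent[], onEvent: (event: StreamEvent) => void): () => void {
  let index = 0;
  let cancelled = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const step = (): void => {
    while (!cancelled && index < script.length) {
      const event = script[index++];
      if (event.type === "wait") {
        timer = setTimeout(step, event.ms);
        return;
      }
      onEvent(event);
    }
  };

  step();
  return () => {
    cancelled = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

A long `firstTokenDelayMs` exercises the thinking state, `failAfter` exercises the mid-stream error, and calling the returned cancel function partway through exercises stop, all without a real model.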

Remember: An AI response is not a slow fetch. It thinks before it speaks, it speaks a little at a time, and the user may cut it off or ask again at any moment. Model those states explicitly and the interface feels responsive and in the user’s control. Flatten them into loading/done and it feels frozen while the user waits, then jarring when the whole answer lands at once.

Frontend Patterns

A Monthly Email
from Den Odell
