Foundation

Resilience by Design

Build systems that gracefully handle failure, uncertainty, and the unexpected by making resilience a core architectural principle from the start.

resilience
error-handling
architecture
13 min read

By Den Odell


TL;DR

Things will break. Networks will drop. APIs will time out. Instead of assuming success and handling failure, assume failure and design for recovery. Resilience isn’t about preventing bad things—it’s about remaining functional when they happen.

Core principles:

Design for failure first, not as an afterthought
Contain failures: one broken part shouldn’t crash everything
Provide clear error experiences: users need understanding and agency
Preserve user work: never lose what users have invested

Hope is not a strategy.

— Traditional Engineering Wisdom

Accepting Reality

Things will break. Networks will drop. APIs will time out. Users will click buttons twice. Data will arrive malformed. Servers will return errors. JavaScript will fail to load. These aren’t edge cases to handle eventually—they’re the normal operating conditions of the web.

Yet most applications are designed for the happy path. We build features assuming everything works, then bolt on error handling as an afterthought. The result is brittle software that shatters unpredictably when reality intrudes.

Resilience by design inverts this approach. Instead of assuming success and handling failure, we assume failure and design for recovery.

Instead of asking “what should happen when this works?” we ask “what happens when this fails?” The difference sounds subtle but transforms how we architect systems.

This isn’t pessimism. It’s realism. The web is a hostile environment—distributed, asynchronous, running on devices and networks we don’t control. Designing for that reality produces applications that feel solid and trustworthy, even when individual pieces fail. Applications that maintain user trust precisely because they handle adversity gracefully.

What Resilience Means

Resilience is the ability to handle failure gracefully and recover when possible. It’s not about preventing all failures—that’s impossible. It’s about controlling what happens when failures occur.

Resilience exists on a spectrum. At one end is fault tolerance: the system continues functioning despite component failures. A data table that still renders when one cell’s data is malformed. A dashboard that displays available widgets even when some fail to load. The failure is absorbed; users might not even notice.

The Resilience Spectrum

Invisible Handling

Fault Tolerance: System continues functioning despite failures

  • Data table renders with malformed cell gracefully skipped
  • Dashboard shows available widgets, hides failed ones
  • Users might not notice anything went wrong

Graceful Degradation: Reduced functionality rather than none

  • Image gallery shows thumbnails when hi-res fails
  • Search shows cached results when live search unavailable
  • Rich editor falls back to plain textarea

Visible Recovery

Error Recovery: System returns to working state after failure

  • Form retries submission after network timeout
  • Video player rebuffers after connection interruption
  • Sync process resumes after going offline

Failure Transparency: Making failures visible and understandable

  • Clear error messages explaining what went wrong
  • Guidance on what users can do
  • Honest acknowledgment with path forward

Each level is appropriate in different contexts. The art of resilience is choosing the right response for each failure scenario. Some failures should be invisible; others need user awareness; still others require user action.

What unites all levels is a shift in mindset: failure isn’t an exception to handle—it’s a state to design for.

The question isn’t whether things will go wrong, but what the experience will be when they do.

The Brittleness of Happy-Path Design

Most code assumes success. We write the flow for when data loads, when APIs respond, when users behave predictably. Error handling, if it exists, is an afterthought—a catch block that logs something and maybe shows a generic error message.

This approach creates brittle systems. A single point of failure cascades into total collapse. An API timeout doesn’t just affect the data it was fetching—it crashes the entire page. A malformed response doesn’t just display incorrectly—it throws an exception that breaks unrelated components. A slow network doesn’t just delay content—it leaves users staring at spinners that never resolve.

The brittleness compounds. Components that assume their data arrives correctly propagate corrupt data downstream. Error boundaries that don’t exist let exceptions bubble up and kill the application. Loading states that never timeout leave users uncertain whether to wait or refresh. State that isn’t validated drifts into impossible configurations.

Consider a typical e-commerce checkout. Happy-path design builds the flow: enter shipping address, select payment method, review order, confirm purchase. But what happens when the address validation API is slow? When the payment processor returns an unexpected error code? When the user’s session expires mid-checkout? When the network drops after clicking “confirm” but before receiving confirmation? Each failure scenario needs its own design—not just technical handling, but user experience design.

Happy-path design fails users emotionally. When things break unexpectedly, users feel anxious and confused. Did I do something wrong? Is my data lost? The application gives no answers because it never anticipated the question.

The alternative is designing with failure as a primary concern, not an afterthought. Not because we’re pessimistic, but because we’re honest about the environment we’re building for.

Defensive Architecture

Defensive architecture starts from a single premise: assume everything can and will fail.

Defensive Principles

Every input is suspect. Data from users can be malicious or malformed. Data from APIs can be incomplete or incorrectly shaped. Data from your own storage can be corrupted or outdated. Validate at boundaries. Don’t assume that because your API “should” return a certain shape, it always will.
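At the code level, validating at a boundary can be a small parsing step that refuses to let unchecked data past the perimeter. A sketch in TypeScript—the `UserProfile` shape and `parseUserProfile` helper are illustrative, not from any particular library:

```typescript
// Treat anything crossing an API boundary as `unknown` until checked.
interface UserProfile {
  name: string;
  email: string;
}

// A result type makes the failure case explicit instead of throwing.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

function parseUserProfile(input: unknown): ParseResult<UserProfile> {
  if (typeof input !== "object" || input === null) {
    return { ok: false, reason: "response is not an object" };
  }
  const record = input as Record<string, unknown>;
  if (typeof record.name !== "string" || typeof record.email !== "string") {
    return { ok: false, reason: "missing or malformed fields" };
  }
  // Copy only the validated fields, dropping anything unexpected.
  return { ok: true, value: { name: record.name, email: record.email } };
}
```

The caller is forced to handle the `ok: false` branch, so malformed data stops at the boundary instead of propagating downstream.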

Every boundary is a failure point. Component boundaries. API boundaries. Storage boundaries. Module boundaries. Each boundary is where things can go wrong—where contracts can be violated, where communication can fail, where assumptions can be invalidated. Design boundaries as defensive perimeters.

Every operation needs a timeout. Network requests that hang indefinitely. Promises that never resolve. Animations that never complete. Users hate uncertainty more than failure. If something takes too long, that’s a failure state. Set timeouts. Show progress. Eventually give up and tell the user what happened.
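One way to enforce this across every async operation is a generic timeout wrapper. A minimal sketch—`withTimeout` and `TimeoutError` are names invented for illustration, not a library API:

```typescript
// Thrown when an operation exceeds its time budget.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Wraps any promise so it settles within `ms` milliseconds,
// rejecting with TimeoutError instead of hanging forever.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

A caller can then distinguish “this took too long” from other failures and show the user an honest message rather than an eternal spinner.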

Every dependency needs a fallback. What if the CDN is down? What if the API is slow? What if the third-party script fails to load? Critical functionality shouldn’t depend on things you don’t control without fallback plans. Cached data. Default values. Degraded experiences. Something is almost always better than nothing.
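A fallback chain can be expressed as an ordinary function: try the live source, fall back to a cached copy, then to a default. A sketch—`loadWithFallback` is a hypothetical helper, and reporting the `source` alongside the value is one possible design:

```typescript
// Prefer live data; fall back to cache, then to a default value.
// Returning the source lets the UI label stale data honestly.
async function loadWithFallback<T>(
  fetchLive: () => Promise<T>,
  readCache: () => T | undefined,
  fallback: T,
): Promise<{ value: T; source: "live" | "cache" | "default" }> {
  try {
    return { value: await fetchLive(), source: "live" };
  } catch {
    const cached = readCache();
    if (cached !== undefined) return { value: cached, source: "cache" };
    return { value: fallback, source: "default" };
  }
}
```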

This defensive posture affects architecture throughout. Components are designed with explicit failure states. Data fetching includes timeout and error handling from the start. State management accounts for partial and invalid data. Every feature includes consideration of what happens when it doesn’t work.

Containing Failure

Resilient systems don’t just handle failure—they contain it. A failure in one part of the system shouldn’t cascade to destroy unrelated parts.

Error boundaries are the architectural tool for containment. They wrap sections of the application and catch failures that occur within, displaying fallback UI instead of crashing the entire tree. A broken widget shouldn’t take down the dashboard. A failing comment section shouldn’t prevent reading the article. A corrupt user avatar shouldn’t crash the entire profile page. Error boundaries create blast zones—areas where failures can occur without propagating outward.

The placement of error boundaries is an architectural decision. Too few, and failures cascade too far. Too many, and the application becomes a patchwork of disconnected fallbacks.

Think about natural failure groupings. A settings page might have error boundaries around each settings section—if notification preferences fail to load, account settings should still work. A social feed might have boundaries around each post—if one post’s media fails, other posts should still display. The boundaries follow the seams in the user experience.

State isolation is another containment strategy. When state is shared broadly, corruption spreads. When state is isolated to where it’s used, problems stay local. A component that manages its own state can fail without affecting siblings. A feature that owns its data can break without corrupting global state. The boundaries that help separation of concerns also help failure containment.

Circuit breakers prevent repeated failures from cascading. If an API call fails three times, stop trying for a while. If a component keeps crashing, stop rendering it. Circuit breakers recognise that persistent failures are different from transient ones and respond by backing off rather than hammering a broken system. They give time for recovery—whether that’s the server coming back online or the user moving to a different part of the application.
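A minimal circuit breaker might look like the sketch below. The threshold, cooldown, and half-open trial call are all simplified for illustration; a production implementation would need tuning and observability:

```typescript
// After `threshold` consecutive failures the circuit opens and calls
// are rejected immediately until `cooldownMs` has passed, giving the
// failing dependency time to recover. The injectable clock (`now`)
// exists only to make the sketch testable.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 3,
    private cooldownMs = 30_000,
    private now: () => number = Date.now,
  ) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open: refusing call");
      }
      // Half-open: cooldown elapsed, allow one trial call through.
      this.openedAt = null;
      this.failures = 0;
    }
    try {
      const result = await operation();
      this.failures = 0; // success resets the failure count
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```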

The combination of these strategies creates defence in depth. No single mechanism handles everything; together they provide layered protection against cascading failure.

The User Experience of Failure

Resilience isn’t just a technical concern—it’s a user experience concern. How failures appear to users matters as much as how they’re handled technically.

What Users Need When Things Fail

Understanding & Agency

Understanding: What happened?

  • Not “Something went wrong” (too vague)
  • Not “Error 500: Internal Server Error” (too technical)
  • But “We couldn’t save your changes because the network connection was lost”

Agency: What can they do about it?

  • Provide actionable next steps
  • Offer retry options when appropriate
  • Give users control over their situation

Preservation & Path Forward

Preservation: Is their work safe?

  • “Your work is saved locally and will sync when you’re back online”
  • Never lose user input without warning
  • Auto-save drafts and preserve form data

Path Forward: How do they continue?

  • Clear guidance on next steps
  • Estimated resolution times when known
  • Alternative actions if available

Generic error messages fail on all counts. “Something went wrong” explains nothing, suggests nothing, preserves nothing, offers nothing. It leaves users anxious and helpless. Technical error messages are worse—showing stack traces or error codes to non-technical users communicates “we have no idea what happened either.”

Good failure UX is specific and actionable. “We couldn’t save your changes because the network connection was lost. Your work is saved locally and will sync when you’re back online.” This explains what happened, reassures about data preservation, and describes the path forward. Users can stop worrying and either fix their connection or continue working.

Different failures warrant different communication styles. Transient failures that will auto-recover might not need user acknowledgment at all—just a subtle indicator that syncing is paused. User-caused errors need clear guidance on what to fix. System failures outside user control need acknowledgment and estimated resolution if possible.
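One way to keep these communication styles distinct is to model the failure category explicitly and derive the message from it, rather than formatting ad hoc strings at each call site. A sketch—the categories and wording here are illustrative:

```typescript
// Each failure category maps to a different communication style.
type Failure =
  | { kind: "transient" }             // will auto-recover: subtle indicator only
  | { kind: "user"; hint: string }    // user can fix it: give clear guidance
  | { kind: "system"; eta?: string }; // out of user control: acknowledge honestly

function failureMessage(failure: Failure): string {
  switch (failure.kind) {
    case "transient":
      return "Syncing paused — we'll retry automatically.";
    case "user":
      return `Please check: ${failure.hint}`;
    case "system":
      return failure.eta
        ? `We're having trouble on our end. Expected back by ${failure.eta}.`
        : "We're having trouble on our end. Please try again shortly.";
  }
}
```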

Loading states are failure-adjacent. A spinner that never resolves is functionally a failure. Users need to know if they should wait or give up. Progress indicators, timeout messages, and cancel options turn ambiguous waits into manageable experiences. “Still loading…” after thirty seconds is better than eternal spinning. “This is taking longer than usual—you can wait or try again” gives agency.

Preserving user work is perhaps the most critical resilience concern. Nothing destroys trust faster than losing user input.

Form data should survive page refreshes. Drafts should auto-save. Submissions should retry on failure. When users invest effort, protecting that investment is paramount. Local storage, session storage, and IndexedDB provide tools; the architectural decision is using them proactively rather than assuming success.
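Proactive draft preservation can be as simple as serialising form state on every change. A sketch—the `DraftStore` interface mirrors the `localStorage` API so the same code can be backed by `window.localStorage` in a browser or an in-memory map anywhere else:

```typescript
// Minimal storage interface matching localStorage's relevant methods.
interface DraftStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

function saveDraft(store: DraftStore, formId: string, fields: Record<string, string>): void {
  store.setItem(`draft:${formId}`, JSON.stringify(fields));
}

function restoreDraft(store: DraftStore, formId: string): Record<string, string> | null {
  const raw = store.getItem(`draft:${formId}`);
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as Record<string, string>;
  } catch {
    return null; // corrupted draft: treat as absent rather than crash
  }
}

function clearDraft(store: DraftStore, formId: string): void {
  store.removeItem(`draft:${formId}`);
}
```

Call `saveDraft` from the form’s change handler and `clearDraft` only after a confirmed successful submission, so a refresh or crash in between never loses user input.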

Building Resilience In

Resilience can’t be added at the end. It shapes architecture from the beginning.

Data fetching in resilient systems includes error states, loading states, timeout handling, and retry logic as first-class concerns. Not “fetch data and then figure out errors” but “fetching data means handling the full spectrum of outcomes.” Libraries that manage server state increasingly build this in—providing loading, error, and stale states alongside the data itself.
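Retry logic along these lines might be sketched as a small helper with exponential backoff—`retry` here is an illustration, not any specific library’s API:

```typescript
// Retry a flaky async operation a bounded number of times,
// doubling the delay between attempts (exponential backoff).
async function retry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Wait 200ms, then 400ms, then 800ms, and so on.
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError; // exhausted: surface the final failure to the caller
}
```

A real implementation would also distinguish retryable failures (timeouts, 503s) from non-retryable ones (validation errors), so user mistakes aren’t hammered against the server.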

This means designing APIs and data structures that accommodate uncertainty. A user profile isn’t just { name, email, avatar }. It’s { data, isLoading, error, lastFetched }. The component that renders the profile receives all of this information and can make intelligent decisions about what to display in each state.
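In TypeScript, that richer shape might be modelled like the sketch below, with the rendering decision handling every combination. The names are illustrative:

```typescript
// The full spectrum of outcomes, carried alongside the data itself.
interface FetchState<T> {
  data: T | null;
  isLoading: boolean;
  error: Error | null;
  lastFetched: number | null;
}

// A pure rendering decision: what should the profile area show?
function describeProfile(state: FetchState<{ name: string }>): string {
  if (state.isLoading && state.data === null) return "Loading profile…";
  if (state.error !== null && state.data === null) return "Couldn't load profile";
  if (state.data !== null) return state.data.name; // possibly stale, still useful
  return "No profile";
}
```

Note that stale data with a background error still renders the name—showing something known-good often beats showing an error.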

Component design in resilient systems includes failure props and fallback rendering. A UserProfile component doesn’t just take user data—it handles the case when data is unavailable. The component’s API makes failure handling explicit rather than hoping consumers remember to wrap it in error handling. Components become self-contained units of resilience.

This extends to component composition. Parent components that render children should consider what happens when children fail. Wrapper components can provide default error boundaries. Layout components can handle gaps when content components fail. The composition model itself becomes a vehicle for resilience.

State management in resilient systems accounts for invalid and partial states. Not every combination of state values is valid; resilient systems make invalid states unrepresentable or handle them explicitly. State machines and explicit state modelling help here—making the possible states visible and ensuring transitions between them are controlled.
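A discriminated union makes this concrete: states like “success with no data” or “loading and failed at once” simply cannot be constructed, and a transition function keeps every state change controlled. A sketch:

```typescript
// Each status carries only the fields valid for that status,
// so invalid combinations are unrepresentable by construction.
type RequestState<T> =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: T }
  | { status: "failure"; error: string };

type RequestEvent<T> =
  | { type: "FETCH" }
  | { type: "RESOLVE"; data: T }
  | { type: "REJECT"; error: string };

// All transitions flow through one function; events that don't make
// sense in the current state are ignored rather than corrupting it.
function next<T>(state: RequestState<T>, event: RequestEvent<T>): RequestState<T> {
  switch (event.type) {
    case "FETCH":
      return { status: "loading" };
    case "RESOLVE":
      return state.status === "loading" ? { status: "success", data: event.data } : state;
    case "REJECT":
      return state.status === "loading" ? { status: "failure", error: event.error } : state;
  }
}
```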

Testing in resilient systems exercises failure paths as thoroughly as success paths. What happens when the API returns 500? What happens when data is malformed? What happens when the operation times out? What happens when the user is offline? These aren’t afterthought tests—they’re as important as testing that things work when everything goes right. If you don’t test failure handling, you don’t know if it works.

The Mindset Shift

Building resilient systems requires a shift in how we think about development.

Two Approaches to Building Software

The Optimistic Mindset

Asks: “What should happen when this works?”

  • Designs the happy path first
  • Grudgingly handles errors later
  • Produces code that works in demos
  • Breaks unpredictably in production
  • Leaves users confused when things fail

The Resilient Mindset

Asks: “What happens when this fails?”

  • Designs for failure first
  • Implements success as the optimistic path
  • Produces code that handles reality
  • Degrades gracefully under stress
  • Maintains user trust during adversity

This doesn’t mean pessimism or over-engineering. Not every failure needs handling. Not every edge case deserves attention. Resources are finite, and shipping matters. The art is identifying the failures that are likely, the failures that are costly, and the failures that are easily handled—then investing in those.

Common failures deserve robust handling: network interruptions, slow responses, invalid data, user mistakes. These happen constantly; users will encounter them. Rare catastrophic failures might warrant simple fallbacks: if everything breaks, show a friendly error page rather than a blank screen. Exotic edge cases might be acceptable to ignore: if something requires a perfect storm of conditions, maybe let it fail.

The key is making these decisions consciously. Knowing what failures you’re handling and which you’re accepting. Understanding the blast radius of each failure scenario. Designing the user experience of failure as carefully as the user experience of success.

Progressive enhancement and resilience by design are philosophical siblings. Both accept that ideal conditions are not guaranteed. Both design for less-than-ideal situations first.

An application built with progressive enhancement handles browser capability failures; an application built with resilience handles runtime failures. Together they create software that works in the real world.

Summary

Resilience by design transforms how we build for the web:

  1. Assume failure is normal: Networks drop, APIs time out, data arrives malformed—design for it
  2. Contain failures: Error boundaries and isolation prevent cascading collapse
  3. Design failure experiences: Users need understanding, agency, preservation, and a path forward
  4. Build resilience from the start: Data structures, components, and state that accommodate failure
  5. Test failure paths: Exercise error handling as thoroughly as success paths
  6. Make conscious trade-offs: Not every edge case needs handling, but know what you’re accepting

Resilience isn’t about preventing bad things from happening. It’s about remaining functional, trustworthy, and useful when they do. In a world where failure is inevitable, that’s the difference between software that feels solid and software that feels fragile.

