Foundation

Resilience by Design

Build systems that gracefully handle failure, uncertainty, and the unexpected by making resilience a core architectural principle from the start.

resilience
error-handling
architecture
13 min read

By Den Odell


TL;DR

Things will break. Networks will drop. APIs will time out. Instead of assuming success and handling failure, assume failure and design for recovery. Resilience isn’t about preventing bad things—it’s about remaining functional when they happen.

Core principles:

Design for failure first, not as an afterthought
Contain failures: one broken part shouldn’t crash everything
Provide clear error experiences: users need understanding and agency
Preserve user work: never lose what users have invested

Hope is not a strategy.

— Traditional Engineering Wisdom

Accepting Reality

Things will break. Networks will drop. APIs will time out. Users will click buttons twice. Data will arrive malformed. Servers will return errors. JavaScript will fail to load. These aren’t edge cases to handle eventually—they’re the normal operating conditions of the web.

Yet most applications are designed for the happy path. We build features assuming everything works, then bolt on error handling as an afterthought. The result is brittle software that shatters unpredictably when reality intrudes.

Resilience by design inverts this approach. Instead of assuming success and handling failure, we assume failure and design for recovery.

Instead of asking “what should happen when this works?” we ask “what happens when this fails?” The difference sounds subtle but transforms how we architect systems.

This isn’t pessimism. It’s realism. The web is a hostile environment—distributed, asynchronous, running on devices and networks we don’t control. Designing for that reality produces applications that feel solid and trustworthy, even when individual pieces fail. Applications that maintain user trust precisely because they handle adversity gracefully.

What Resilience Means

Resilience is the ability to handle failure gracefully and recover when possible. It’s not about preventing all failures—that’s impossible. It’s about controlling what happens when failures occur.

Resilience exists on a spectrum. At one end is fault tolerance: the system continues functioning despite component failures. A data table that still renders when one cell’s data is malformed. A dashboard that displays available widgets even when some fail to load. The failure is absorbed; users might not even notice.

The Resilience Spectrum

Invisible Handling

Fault Tolerance: System continues functioning despite failures

  • Data table renders with malformed cell gracefully skipped
  • Dashboard shows available widgets, hides failed ones
  • Users might not notice anything went wrong

Graceful Degradation: Reduced functionality rather than none

  • Image gallery shows thumbnails when hi-res fails
  • Search shows cached results when live search unavailable
  • Rich editor falls back to plain textarea

Visible Recovery

Error Recovery: System returns to working state after failure

  • Form retries submission after network timeout
  • Video player rebuffers after connection interruption
  • Sync process resumes after going offline

Failure Transparency: Making failures visible and understandable

  • Clear error messages explaining what went wrong
  • Guidance on what users can do
  • Honest acknowledgment with path forward

Each level is appropriate in different contexts. The art of resilience is choosing the right response for each failure scenario. Some failures should be invisible; others need user awareness; still others require user action.

What unites all levels is a shift in mindset: failure isn’t an exception to handle—it’s a state to design for.

The question isn’t whether things will go wrong, but what the experience will be when they do.

The Brittleness of Happy-Path Design

Most code assumes success. We write the flow for when data loads, when APIs respond, when users behave predictably. Error handling, if it exists, is an afterthought—a catch block that logs something and maybe shows a generic error message.

This approach creates brittle systems. A single point of failure cascades into total collapse. An API timeout doesn’t just affect the data it was fetching—it crashes the entire page. A malformed response doesn’t just display incorrectly—it throws an exception that breaks unrelated components. A slow network doesn’t just delay content—it leaves users staring at spinners that never resolve.

The brittleness compounds. Components that assume their data arrives correctly propagate corrupt data downstream. Error boundaries that don’t exist let exceptions bubble up and kill the application. Loading states that never timeout leave users uncertain whether to wait or refresh. State that isn’t validated drifts into impossible configurations.

Consider a typical e-commerce checkout. Happy-path design builds the flow: enter shipping address, select payment method, review order, confirm purchase. But what happens when the address validation API is slow? When the payment processor returns an unexpected error code? When the user’s session expires mid-checkout? When the network drops after clicking “confirm” but before receiving confirmation? Each failure scenario needs its own design—not just technical handling, but user experience design.

Happy-path design fails users emotionally. When things break unexpectedly, users feel anxious and confused. Did I do something wrong? Is my data lost? The application gives no answers because it never anticipated the question.

The alternative is designing with failure as a primary concern, not an afterthought. Not because we’re pessimistic, but because we’re honest about the environment we’re building for.

Defensive Architecture

Defensive architecture starts from a single premise: assume everything can and will fail.

Defensive Principles

Every input is suspect. Data from users can be malicious or malformed. Data from APIs can be incomplete or incorrectly shaped. Data from your own storage can be corrupted or outdated. Validate at boundaries. Don’t assume that because your API “should” return a certain shape, it always will.
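At the code level, validating at a boundary can be a small parsing step that refuses to let unchecked data past the perimeter. A sketch in TypeScript—the `UserProfile` shape and `parseUserProfile` helper are illustrative, not from any particular library:

```typescript
// Treat anything crossing an API boundary as `unknown` until checked.
interface UserProfile {
  name: string;
  email: string;
}

// A result type makes the failure case explicit instead of throwing.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

function parseUserProfile(input: unknown): ParseResult<UserProfile> {
  if (typeof input !== "object" || input === null) {
    return { ok: false, reason: "response is not an object" };
  }
  const record = input as Record<string, unknown>;
  if (typeof record.name !== "string" || typeof record.email !== "string") {
    return { ok: false, reason: "missing or malformed fields" };
  }
  // Copy only the validated fields, dropping anything unexpected.
  return { ok: true, value: { name: record.name, email: record.email } };
}
```

The caller is forced to handle the `ok: false` branch, so malformed data stops at the boundary instead of propagating downstream.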

Every boundary is a failure point. Component boundaries. API boundaries. Storage boundaries. Module boundaries. Each boundary is where things can go wrong—where contracts can be violated, where communication can fail, where assumptions can be invalidated. Design boundaries as defensive perimeters.

Every operation needs a timeout. Network requests that hang indefinitely. Promises that never resolve. Animations that never complete. Users hate uncertainty more than failure. If something takes too long, that’s a failure state. Set timeouts. Show progress. Eventually give up and tell the user what happened.
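One way to enforce this across every async operation is a generic timeout wrapper. A minimal sketch—`withTimeout` and `TimeoutError` are names invented for illustration, not a library API:

```typescript
// Thrown when an operation exceeds its time budget.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Wraps any promise so it settles within `ms` milliseconds,
// rejecting with TimeoutError instead of hanging forever.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

A caller can then distinguish “this took too long” from other failures and show the user an honest message rather than an eternal spinner.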

Every dependency needs a fallback. What if the CDN is down? What if the API is slow? What if the third-party script fails to load? Critical functionality shouldn’t depend on things you don’t control without fallback plans. Cached data. Default values. Degraded experiences. Something is almost always better than nothing.
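A fallback chain can be expressed as an ordinary function: try the live source, fall back to a cached copy, then to a default. A sketch—`loadWithFallback` is a hypothetical helper, and reporting the `source` alongside the value is one possible design:

```typescript
// Prefer live data; fall back to cache, then to a default value.
// Returning the source lets the UI label stale data honestly.
async function loadWithFallback<T>(
  fetchLive: () => Promise<T>,
  readCache: () => T | undefined,
  fallback: T,
): Promise<{ value: T; source: "live" | "cache" | "default" }> {
  try {
    return { value: await fetchLive(), source: "live" };
  } catch {
    const cached = readCache();
    if (cached !== undefined) return { value: cached, source: "cache" };
    return { value: fallback, source: "default" };
  }
}
```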

This defensive posture affects architecture throughout. Components are designed with explicit failure states. Data fetching includes timeout and error handling from the start. State management accounts for partial and invalid data. Every feature includes consideration of what happens when it doesn’t work.

Containing Failure

Resilient systems don’t just handle failure—they contain it. A failure in one part of the system shouldn’t cascade to destroy unrelated parts.

Error boundaries are the architectural tool for containment. They wrap sections of the application and catch failures that occur within, displaying fallback UI instead of crashing the entire tree. A broken widget shouldn’t take down the dashboard. A failing comment section shouldn’t prevent reading the article. A corrupt user avatar shouldn’t crash the entire profile page. Error boundaries create blast zones—areas where failures can occur without propagating outward.

The placement of error boundaries is an architectural decision. Too few, and failures cascade too far. Too many, and the application becomes a patchwork of disconnected fallbacks.

Think about natural failure groupings. A settings page might have error boundaries around each settings section—if notification preferences fail to load, account settings should still work. A social feed might have boundaries around each post—if one post’s media fails, other posts should still display. The boundaries follow the seams in the user experience.

State isolation is another containment strategy. When state is shared broadly, corruption spreads. When state is isolated to where it’s used, problems stay local. A component that manages its own state can fail without affecting siblings. A feature that owns its data can break without corrupting global state. The boundaries that help separation of concerns also help failure containment.

Circuit breakers prevent repeated failures from cascading. If an API call fails three times, stop trying for a while. If a component keeps crashing, stop rendering it. Circuit breakers recognise that persistent failures are different from transient ones and respond by backing off rather than hammering a broken system. They give time for recovery—whether that’s the server coming back online or the user moving to a different part of the application.
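A minimal circuit breaker might look like the sketch below. The threshold, cooldown, and half-open trial call are all simplified for illustration; a production implementation would need tuning and observability:

```typescript
// After `threshold` consecutive failures the circuit opens and calls
// are rejected immediately until `cooldownMs` has passed, giving the
// failing dependency time to recover. The injectable clock (`now`)
// exists only to make the sketch testable.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 3,
    private cooldownMs = 30_000,
    private now: () => number = Date.now,
  ) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open: refusing call");
      }
      // Half-open: cooldown elapsed, allow one trial call through.
      this.openedAt = null;
      this.failures = 0;
    }
    try {
      const result = await operation();
      this.failures = 0; // success resets the failure count
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```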

The combination of these strategies creates defence in depth. No single mechanism handles everything; together they provide layered protection against cascading failure.

The User Experience of Failure

Resilience isn’t just a technical concern—it’s a user experience concern. How failures appear to users matters as much as how they’re handled technically.

What Users Need When Things Fail

Understanding & Agency

Understanding: What happened?

  • Not “Something went wrong” (too vague)
  • Not “Error 500: Internal Server Error” (too technical)
  • But “We couldn’t save your changes because the network connection was lost”

Agency: What can they do about it?

  • Provide actionable next steps
  • Offer retry options when appropriate
  • Give users control over their situation

Preservation & Path Forward

Preservation: Is their work safe?

  • “Your work is saved locally and will sync when you’re back online”
  • Never lose user input without warning
  • Auto-save drafts and preserve form data

Path Forward: How do they continue?

  • Clear guidance on next steps
  • Estimated resolution times when known
  • Alternative actions if available

Generic error messages fail on all counts. “Something went wrong” explains nothing, suggests nothing, preserves nothing, offers nothing. It leaves users anxious and helpless. Technical error messages are worse—showing stack traces or error codes to non-technical users communicates “we have no idea what happened either.”

Good failure UX is specific and actionable. “We couldn’t save your changes because the network connection was lost. Your work is saved locally and will sync when you’re back online.” This explains what happened, reassures about data preservation, and describes the path forward. Users can stop worrying and either fix their connection or continue working.

Different failures warrant different communication styles. Transient failures that will auto-recover might not need user acknowledgment at all—just a subtle indicator that syncing is paused. User-caused errors need clear guidance on what to fix. System failures outside user control need acknowledgment and estimated resolution if possible.
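One way to keep these communication styles distinct is to model the failure category explicitly and derive the message from it, rather than formatting ad hoc strings at each call site. A sketch—the categories and wording here are illustrative:

```typescript
// Each failure category maps to a different communication style.
type Failure =
  | { kind: "transient" }             // will auto-recover: subtle indicator only
  | { kind: "user"; hint: string }    // user can fix it: give clear guidance
  | { kind: "system"; eta?: string }; // out of user control: acknowledge honestly

function failureMessage(failure: Failure): string {
  switch (failure.kind) {
    case "transient":
      return "Syncing paused — we'll retry automatically.";
    case "user":
      return `Please check: ${failure.hint}`;
    case "system":
      return failure.eta
        ? `We're having trouble on our end. Expected back by ${failure.eta}.`
        : "We're having trouble on our end. Please try again shortly.";
  }
}
```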

Loading states are failure-adjacent. A spinner that never resolves is functionally a failure. Users need to know if they should wait or give up. Progress indicators, timeout messages, and cancel options turn ambiguous waits into manageable experiences. “Still loading…” after thirty seconds is better than eternal spinning. “This is taking longer than usual—you can wait or try again” gives agency.

Preserving user work is perhaps the most critical resilience concern. Nothing destroys trust faster than losing user input.

Form data should survive page refreshes. Drafts should auto-save. Submissions should retry on failure. When users invest effort, protecting that investment is paramount. Local storage, session storage, and IndexedDB provide tools; the architectural decision is using them proactively rather than assuming success.
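Proactive draft preservation can be as simple as serialising form state on every change. A sketch—the `DraftStore` interface mirrors the `localStorage` API so the same code can be backed by `window.localStorage` in a browser or an in-memory map anywhere else:

```typescript
// Minimal storage interface matching localStorage's relevant methods.
interface DraftStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

function saveDraft(store: DraftStore, formId: string, fields: Record<string, string>): void {
  store.setItem(`draft:${formId}`, JSON.stringify(fields));
}

function restoreDraft(store: DraftStore, formId: string): Record<string, string> | null {
  const raw = store.getItem(`draft:${formId}`);
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as Record<string, string>;
  } catch {
    return null; // corrupted draft: treat as absent rather than crash
  }
}

function clearDraft(store: DraftStore, formId: string): void {
  store.removeItem(`draft:${formId}`);
}
```

Call `saveDraft` from the form’s change handler and `clearDraft` only after a confirmed successful submission, so a refresh or crash in between never loses user input.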

Building Resilience In

Resilience can’t be added at the end. It shapes architecture from the beginning.

Data fetching in resilient systems includes error states, loading states, timeout handling, and retry logic as first-class concerns. Not “fetch data and then figure out errors” but “fetching data means handling the full spectrum of outcomes.” Libraries that manage server state increasingly build this in—providing loading, error, and stale states alongside the data itself.
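Retry logic along these lines might be sketched as a small helper with exponential backoff—`retry` here is an illustration, not any specific library’s API:

```typescript
// Retry a flaky async operation a bounded number of times,
// doubling the delay between attempts (exponential backoff).
async function retry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Wait 200ms, then 400ms, then 800ms, and so on.
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError; // exhausted: surface the final failure to the caller
}
```

A real implementation would also distinguish retryable failures (timeouts, 503s) from non-retryable ones (validation errors), so user mistakes aren’t hammered against the server.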

This means designing APIs and data structures that accommodate uncertainty. A user profile isn’t just { name, email, avatar }. It’s { data, isLoading, error, lastFetched }. The component that renders the profile receives all of this information and can make intelligent decisions about what to display in each state.
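In TypeScript, that richer shape might be modelled like the sketch below, with the rendering decision handling every combination. The names are illustrative:

```typescript
// The full spectrum of outcomes, carried alongside the data itself.
interface FetchState<T> {
  data: T | null;
  isLoading: boolean;
  error: Error | null;
  lastFetched: number | null;
}

// A pure rendering decision: what should the profile area show?
function describeProfile(state: FetchState<{ name: string }>): string {
  if (state.isLoading && state.data === null) return "Loading profile…";
  if (state.error !== null && state.data === null) return "Couldn't load profile";
  if (state.data !== null) return state.data.name; // possibly stale, still useful
  return "No profile";
}
```

Note that stale data with a background error still renders the name—showing something known-good often beats showing an error.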

Component design in resilient systems includes failure props and fallback rendering. A UserProfile component doesn’t just take user data—it handles the case when data is unavailable. The component’s API makes failure handling explicit rather than hoping consumers remember to wrap it in error handling. Components become self-contained units of resilience.

This extends to component composition. Parent components that render children should consider what happens when children fail. Wrapper components can provide default error boundaries. Layout components can handle gaps when content components fail. The composition model itself becomes a vehicle for resilience.

State management in resilient systems accounts for invalid and partial states. Not every combination of state values is valid; resilient systems make invalid states unrepresentable or handle them explicitly. State machines and explicit state modelling help here—making the possible states visible and ensuring transitions between them are controlled.
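A discriminated union makes this concrete: states like “success with no data” or “loading and failed at once” simply cannot be constructed, and a transition function keeps every state change controlled. A sketch:

```typescript
// Each status carries only the fields valid for that status,
// so invalid combinations are unrepresentable by construction.
type RequestState<T> =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: T }
  | { status: "failure"; error: string };

type RequestEvent<T> =
  | { type: "FETCH" }
  | { type: "RESOLVE"; data: T }
  | { type: "REJECT"; error: string };

// All transitions flow through one function; events that don't make
// sense in the current state are ignored rather than corrupting it.
function next<T>(state: RequestState<T>, event: RequestEvent<T>): RequestState<T> {
  switch (event.type) {
    case "FETCH":
      return { status: "loading" };
    case "RESOLVE":
      return state.status === "loading" ? { status: "success", data: event.data } : state;
    case "REJECT":
      return state.status === "loading" ? { status: "failure", error: event.error } : state;
  }
}
```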

Testing in resilient systems exercises failure paths as thoroughly as success paths. What happens when the API returns 500? What happens when data is malformed? What happens when the operation times out? What happens when the user is offline? These aren’t afterthought tests—they’re as important as testing that things work when everything goes right. If you don’t test failure handling, you don’t know if it works.

The Mindset Shift

Building resilient systems requires a shift in how we think about development.

Two Approaches to Building Software

The Optimistic Mindset

Asks: “What should happen when this works?”

  • Designs the happy path first
  • Grudgingly handles errors later
  • Produces code that works in demos
  • Breaks unpredictably in production
  • Leaves users confused when things fail

The Resilient Mindset

Asks: “What happens when this fails?”

  • Designs for failure first
  • Implements success as the optimistic path
  • Produces code that handles reality
  • Degrades gracefully under stress
  • Maintains user trust during adversity

This doesn’t mean pessimism or over-engineering. Not every failure needs handling. Not every edge case deserves attention. Resources are finite, and shipping matters. The art is identifying the failures that are likely, the failures that are costly, and the failures that are easily handled—then investing in those.

Common failures deserve robust handling: network interruptions, slow responses, invalid data, user mistakes. These happen constantly; users will encounter them. Rare catastrophic failures might warrant simple fallbacks: if everything breaks, show a friendly error page rather than a blank screen. Exotic edge cases might be acceptable to ignore: if something requires a perfect storm of conditions, maybe let it fail.

The key is making these decisions consciously. Knowing what failures you’re handling and which you’re accepting. Understanding the blast radius of each failure scenario. Designing the user experience of failure as carefully as the user experience of success.

Progressive enhancement and resilience by design are philosophical siblings. Both accept that ideal conditions are not guaranteed. Both design for less-than-ideal situations first.

An application built with progressive enhancement handles browser capability failures; an application built with resilience handles runtime failures. Together they create software that works in the real world.

Summary

Resilience by design transforms how we build for the web:

  1. Assume failure is normal: Networks drop, APIs time out, data arrives malformed—design for it
  2. Contain failures: Error boundaries and isolation prevent cascading collapse
  3. Design failure experiences: Users need understanding, agency, preservation, and a path forward
  4. Build resilience from the start: Data structures, components, and state that accommodate failure
  5. Test failure paths: Exercise error handling as thoroughly as success paths
  6. Make conscious trade-offs: Not every edge case needs handling, but know what you’re accepting

Resilience isn’t about preventing bad things from happening. It’s about remaining functional, trustworthy, and useful when they do. In a world where failure is inevitable, that’s the difference between software that feels solid and software that feels fragile.

