Automation evangelism usually ends at “and then it runs forever, automatically.” The honest reality is that automations break, and when they break, the cost can be significant. The question isn’t whether to accept that reality — it’s how the business handles it when the inevitable failure comes at an inconvenient time.
This is what “tending” actually looks like in practice.
Why automations break
Five common causes:
Upstream changes. A platform updates an API, deprecates a field, or changes authentication requirements. The automation that depended on the old behavior fails silently or visibly.
Authentication expiration. Tokens, API keys, and OAuth grants expire on schedules that aren’t always obvious. A working automation stops working overnight when a credential expires.
Rate limit changes. A platform tightens its rate limits. The automation that used to run fine starts hitting throttles. Some events succeed, some fail; the failure pattern is hard to spot.
Edge cases. A new type of input — a customer with a different shape of data, a deal with unusual terms — exposes a case the original logic didn’t handle. The automation either errors out or produces wrong output.
Dependent service outages. A third-party service the automation relies on goes down. The automation fails for the duration of the outage; some inputs may be permanently lost depending on retry logic.
None of these are signs of poor automation work. They’re the operating cost of running automation against systems that change. Mature automation infrastructure is built knowing these will happen.
What “tending” actually involves
The work breaks into four layers, each with a different rhythm.
Layer one — monitoring
Automated checks against every critical automation:
- Did the expected number of executions run today? Last hour?
- Did each execution complete or did some fail?
- Are queue depths growing? (a sign of throttling or downstream failure)
- Are response times within normal range? (slow can be a precursor to broken)
- Did any data pass through with unusual shapes? (an edge case warning)
Monitoring runs continuously. The output is the absence of alerts during normal operation, and clear, actionable alerts when something is off.
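What those checks look like in code depends entirely on the stack, but a minimal sketch makes the layer concrete. Everything below is illustrative rather than any particular platform’s API: the Execution record, the thresholds, and check_automation_health are assumed names for the purpose of the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Execution:
    finished_at: datetime   # timezone-aware UTC timestamp
    succeeded: bool
    duration_ms: float

def check_automation_health(
    executions: list[Execution],
    expected_per_hour: int,
    max_failure_ratio: float = 0.02,
    max_p95_ms: float = 5_000.0,
) -> list[str]:
    """Return alert messages for one automation; an empty list means healthy."""
    alerts: list[str] = []
    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
    recent = [e for e in executions if e.finished_at >= cutoff]

    # Did the expected number of executions run in the last hour?
    if len(recent) < expected_per_hour:
        alerts.append(f"volume: {len(recent)} runs, expected ~{expected_per_hour}")

    if recent:
        # Did each execution complete, or did some fail?
        failures = sum(1 for e in recent if not e.succeeded)
        if failures / len(recent) > max_failure_ratio:
            alerts.append(f"failures: {failures} of {len(recent)} runs failed")

        # Are response times within normal range? Slow often precedes broken.
        durations = sorted(e.duration_ms for e in recent)
        p95 = durations[int(0.95 * (len(durations) - 1))]
        if p95 > max_p95_ms:
            alerts.append(f"latency: p95 {p95:.0f}ms exceeds {max_p95_ms:.0f}ms")

    return alerts
```

In practice the same checks run per automation on a schedule, and the returned messages feed directly into the alerting layer described next.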
Layer two — alerting and triage
When monitoring detects an issue, alerting routes it to someone who can act on it. The chain:
- Alert fires immediately to the on-call partner team member
- Severity classification within minutes (revenue-affecting? operational? cosmetic?)
- Customer or operator notification if business impact is imminent
- Diagnosis begins — not “investigation later,” but right now
The 11pm scenario in the title — an automation breaks after hours — is what alerting and triage exist for. The operator is asleep; the partner is responding.
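A minimal sketch of the routing step, assuming severity is known per automation. The mapping, page_on_call, and open_ticket below are stand-ins for a real pager and ticketing integration, not any specific product’s API.

```python
from enum import Enum

class Severity(Enum):
    REVENUE = 1      # revenue-affecting: page immediately, any hour
    OPERATIONAL = 2  # operational: page during extended hours
    COSMETIC = 3     # cosmetic: ticket for the next business day

# Illustrative mapping from automation name to business impact.
SEVERITY_BY_AUTOMATION = {
    "invoice-generation": Severity.REVENUE,
    "crm-sync": Severity.OPERATIONAL,
    "weekly-digest": Severity.COSMETIC,
}

def page_on_call(message: str) -> None:
    # Stand-in for a real pager integration (PagerDuty, SMS, etc.).
    print(f"PAGE: {message}")

def open_ticket(message: str) -> None:
    # Stand-in for a ticketing integration; reviewed next business day.
    print(f"TICKET: {message}")

def route_alert(automation: str, detail: str) -> None:
    severity = SEVERITY_BY_AUTOMATION.get(automation, Severity.OPERATIONAL)
    message = f"[{severity.name}] {automation}: {detail}"
    if severity in (Severity.REVENUE, Severity.OPERATIONAL):
        page_on_call(message)   # a human acts on this now, not later
    else:
        open_ticket(message)

route_alert("invoice-generation", "no invoices generated in 90 minutes")
```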
Layer three — fixing
The fix depends on the failure type. Common patterns:
| Failure cause | Typical fix | Typical resolution time |
|---|---|---|
| Auth token expired | Refresh token, update stored credential | 15–30 minutes |
| Field name changed upstream | Update mapping, redeploy | 30–60 minutes |
| Rate limit hit | Add throttling, queue management | 1–3 hours |
| Edge case revealed | Update logic to handle new case, add test | 2–6 hours |
| Third-party outage | Wait, then catch up missed events | Variable |
Most fixes are quick. The harder work comes after the fix: confirming no events were lost, catching up the queue if events stalled, and updating the automation to prevent the same class of failure.
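The catch-up step is worth sketching, because it is where events get silently lost or double-processed when done carelessly. A minimal version, assuming every event carries a stable ID that can serve as an idempotency key (all names here are illustrative):

```python
from typing import Callable

def replay_missed_events(
    events: list[dict],
    processed_ids: set[str],
    process: Callable[[dict], None],
) -> tuple[int, int]:
    """Re-run stalled events without double-processing any of them.

    `events` is the backlog captured while the automation was down;
    `processed_ids` is the idempotency guard (stable event IDs seen before);
    `process` is the same handler the automation calls in normal operation.
    """
    replayed = skipped = 0
    for event in events:
        if event["id"] in processed_ids:
            skipped += 1           # handled before the failure; don't repeat it
            continue
        process(event)             # same code path as a normal run
        processed_ids.add(event["id"])
        replayed += 1
    return replayed, skipped

# Example: replay a small backlog of stalled invoice events.
seen: set[str] = {"inv-100"}
backlog = [{"id": "inv-100"}, {"id": "inv-101"}, {"id": "inv-102"}]
print(replay_missed_events(backlog, seen, lambda e: print("invoice", e["id"])))
```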
Layer four — prevention
Periodically — weekly or biweekly for active automations, monthly for stable ones — a review of:
- Are any automations running on deprecated APIs that should be updated?
- Are authentication mechanisms current?
- Are rate limits being approached at peak times?
- Are error rates trending up?
- Are new edge cases appearing in the data?
This is where breaks get prevented rather than just fixed after the fact. It’s the layer that distinguishes mature automation tending from reactive fire-fighting.
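One of those review checks, error rates trending up, is simple enough to sketch. This assumes daily error rates are already being recorded per automation; the function name and thresholds are illustrative, not a specific tool’s API.

```python
from statistics import mean

def error_rate_trending_up(
    daily_error_rates: list[float],  # oldest to newest, e.g. errors / runs per day
    window: int = 7,
    factor: float = 1.5,
) -> bool:
    """Flag an automation whose recent error rate sits well above its baseline."""
    if len(daily_error_rates) <= window:
        return False  # not enough history to call a trend
    baseline = mean(daily_error_rates[:-window])
    recent = mean(daily_error_rates[-window:])
    if baseline == 0:
        return recent > 0        # any errors after a clean baseline count
    return recent > factor * baseline
```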
What an 11pm break actually looks like
A specific example. A business runs automated invoicing: when a project milestone is marked complete, an invoice is generated and sent. Late on a Thursday, the integration between the project tool and the invoicing system fails because the project tool deprecated an authentication method.
11:47pm Thursday. Monitoring detects that no invoices have been generated in 90 minutes despite milestones being marked complete. Alert fires to the on-call partner team member.
11:52pm. Triage. The pattern is recognizable — auth failure on the project tool side. Severity: high (invoicing affects cash flow). Operator does not need to be woken; the issue can be resolved without their input.
12:18am Friday. Fix deployed. The auth method is updated to the current version, the integration reconnects, and the queued events (six invoices that should have been generated) flow through.
12:35am. Verification. All six invoices generated correctly. Customer-facing emails went out as expected. CRM and accounting records reconcile.
8:00am Friday morning. Operator gets a routine summary email noting the issue, the fix, and the prevention work to come.
Following Tuesday. Prevention review. The auth method change is checked across the other automations that use the same project tool to make sure none will fail the same way. In this case none were affected, and one likely repeat failure was prevented.
The operator never had to think about it. The business never lost revenue. The customer never knew. That’s what tending looks like when it’s working.
Why the alternative is so much more expensive
Compare the same failure without continuous custody:
- The failure happens at 11pm Thursday. No one notices.
- Friday morning, the operator notices something feels off in cash flow but can’t quite identify what.
- Tuesday, a customer asks about an invoice they expected but haven’t received. Investigation begins.
- Wednesday, the operator or someone they call figures out invoicing has been broken for nearly a week.
- Thursday, an emergency engagement happens to fix it. Catch-up invoices go out apologetically.
- Some invoices are billed for the wrong amounts because the project tool’s data has shifted in the meantime.
- A handful of customers are confused enough to question the invoices, dragging out collection.
The cost: a week of broken cash flow, multiple hours of operator attention diverted to the problem, customer trust damage, and an emergency engagement at premium rates. The total impact of the same underlying issue is dramatically larger.
The economic case for tending
The math operators sometimes don’t run:
- A continuous-custody arrangement that includes automation tending costs typically $1,500–$5,000/month
- A single significant automation failure resolved as an emergency typically costs $5,000–$25,000 in direct cost plus indirect business impact
- Automation graphs of any complexity break two to six times per year on average
The break-even on continuous-custody for automation tending typically lands at one or two prevented incidents per year. Most businesses with non-trivial automation have well more than that.
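For concreteness, the break-even arithmetic with illustrative figures chosen from inside the ranges above (the specific numbers are assumptions, not quotes):

```python
# Illustrative figures from inside the ranges above.
monthly_custody = 2_500      # continuous custody, $/month
emergency_cost = 20_000      # one emergency resolution, direct cost only
breaks_per_year = 4          # "two to six times per year on average"

annual_custody = 12 * monthly_custody               # 30,000
unattended_cost = breaks_per_year * emergency_cost  # 80,000, before indirect impact

break_even = annual_custody / emergency_cost
print(f"prevented incidents to break even: {break_even:.1f}")  # 1.5
```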
What to ask any automation provider
Three questions that separate real tending from a sales pitch about it:
- What’s your monitoring stack? Show me a recent alert and the response. A real answer involves named tools and a real example. A vague answer is a red flag.
- What’s your on-call coverage outside business hours? If automations are revenue-touching, this matters. If they’re not, less so.
- What’s your post-incident process? A real answer includes documentation, prevention work, and review. A “we just fix it” answer means the same incident will happen again.
The point isn’t to interrogate. It’s to confirm that “we tend it” is something the provider has actually built, not just something they say.
The bottom line
Automation that runs is good. Automation that runs and is tended is what makes the model work for serious businesses. The 11pm break is going to happen — what determines whether it costs you anything is whether someone is awake to handle it.
You don't have to act on any of this yourself.
Everything in this article — the strategy, the build, the integration, the ongoing tending — is the kind of work we own end-to-end for premium operators. One partner. One number. Off your plate.