Automation evangelism usually ends at “and then it runs forever, automatically.” The honest reality is that automations break, and when they break, the cost can be significant. The question isn’t whether to accept that reality — it’s how the business handles it when the inevitable failure comes at an inconvenient time.
This is what “tending” actually looks like in practice.
Why automations break
Five common causes:
Upstream changes. A platform updates an API, deprecates a field, or changes authentication requirements. The automation that depended on the old behavior fails silently or visibly.
Authentication expiration. Tokens, API keys, and OAuth grants expire on schedules that aren’t always obvious. A working automation stops working overnight when a credential expires.
Rate limit changes. A platform tightens its rate limits. The automation that used to run fine starts hitting throttles. Some events succeed, some fail; the failure pattern is hard to spot.
Edge cases. A new type of input — a customer with a different shape of data, a deal with unusual terms — exposes a case the original logic didn’t handle. The automation either errors out or produces wrong output.
Dependent service outages. A third-party service the automation relies on goes down. The automation fails for the duration of the outage; some inputs may be permanently lost depending on retry logic.
None of these are signs of poor automation work. They’re the operating cost of running automation against systems that change. Mature automation infrastructure is built knowing these will happen.
What “tending” actually involves
The work breaks into four layers, each with a different rhythm.
Layer one — monitoring
Automated checks against every critical automation:
- Did the expected number of executions run today? Last hour?
- Did each execution complete or did some fail?
- Are queue depths growing? (a sign of throttling or downstream failure)
- Are response times within normal range? (slow can be a precursor to broken)
- Did any data pass through with unusual shapes? (an edge case warning)
Monitoring runs continuously. The output is the absence of alerts during normal operation, and clear, actionable alerts when something is off.
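What those checks look like in code depends entirely on the stack, but a minimal sketch makes the layer concrete. Everything below is illustrative rather than any particular platform’s API: the Execution record, the thresholds, and check_automation_health are assumed names for the purpose of the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Execution:
    finished_at: datetime   # timezone-aware UTC timestamp
    succeeded: bool
    duration_ms: float

def check_automation_health(
    executions: list[Execution],
    expected_per_hour: int,
    max_failure_ratio: float = 0.02,
    max_p95_ms: float = 5_000.0,
) -> list[str]:
    """Return alert messages for one automation; an empty list means healthy."""
    alerts: list[str] = []
    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
    recent = [e for e in executions if e.finished_at >= cutoff]

    # Did the expected number of executions run in the last hour?
    if len(recent) < expected_per_hour:
        alerts.append(f"volume: {len(recent)} runs, expected ~{expected_per_hour}")

    if recent:
        # Did each execution complete, or did some fail?
        failures = sum(1 for e in recent if not e.succeeded)
        if failures / len(recent) > max_failure_ratio:
            alerts.append(f"failures: {failures} of {len(recent)} runs failed")

        # Are response times within normal range? Slow often precedes broken.
        durations = sorted(e.duration_ms for e in recent)
        p95 = durations[int(0.95 * (len(durations) - 1))]
        if p95 > max_p95_ms:
            alerts.append(f"latency: p95 {p95:.0f}ms exceeds {max_p95_ms:.0f}ms")

    return alerts
```

In practice the same checks run per automation on a schedule, and the returned messages feed directly into the alerting layer described next.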
Layer two — alerting and triage
When monitoring detects an issue, alerting routes it to someone who can act on it. The chain:
- Alert fires immediately to the on-call partner team member
- Severity classification within minutes (revenue-affecting? operational? cosmetic?)
- Customer or operator notification if business impact is imminent
- Diagnosis begins — not “investigation later,” but right now
The 11pm scenario in the title — an automation breaks after hours — is what alerting and triage exist for. The operator is asleep; the partner is responding.
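A minimal sketch of the routing step, assuming severity is known per automation. The mapping, page_on_call, and open_ticket below are stand-ins for a real pager and ticketing integration, not any specific product’s API.

```python
from enum import Enum

class Severity(Enum):
    REVENUE = 1      # revenue-affecting: page immediately, any hour
    OPERATIONAL = 2  # operational: page during extended hours
    COSMETIC = 3     # cosmetic: ticket for the next business day

# Illustrative mapping from automation name to business impact.
SEVERITY_BY_AUTOMATION = {
    "invoice-generation": Severity.REVENUE,
    "crm-sync": Severity.OPERATIONAL,
    "weekly-digest": Severity.COSMETIC,
}

def page_on_call(message: str) -> None:
    # Stand-in for a real pager integration (PagerDuty, SMS, etc.).
    print(f"PAGE: {message}")

def open_ticket(message: str) -> None:
    # Stand-in for a ticketing integration; reviewed next business day.
    print(f"TICKET: {message}")

def route_alert(automation: str, detail: str) -> None:
    severity = SEVERITY_BY_AUTOMATION.get(automation, Severity.OPERATIONAL)
    message = f"[{severity.name}] {automation}: {detail}"
    if severity in (Severity.REVENUE, Severity.OPERATIONAL):
        page_on_call(message)   # a human acts on this now, not later
    else:
        open_ticket(message)

route_alert("invoice-generation", "no invoices generated in 90 minutes")
```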
Layer three — fixing
The fix depends on the failure type. Common patterns:
| Failure cause | Typical fix | Typical resolution time |
|---|---|---|
| Auth token expired | Refresh token, update stored credential | 15–30 minutes |
| Field name changed upstream | Update mapping, redeploy | 30–60 minutes |
| Rate limit hit | Add throttling, queue management | 1–3 hours |
| Edge case revealed | Update logic to handle new case, add test | 2–6 hours |
| Third-party outage | Wait, then catch up missed events | Variable |
Most fixes are quick. The harder work comes after the fix: confirming no events were lost, catching up the queue if events stalled, and updating the automation to prevent the same class of failure.
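The catch-up step is worth sketching, because it is where events get silently lost or double-processed when done carelessly. A minimal version, assuming every event carries a stable ID that can serve as an idempotency key (all names here are illustrative):

```python
from typing import Callable

def replay_missed_events(
    events: list[dict],
    processed_ids: set[str],
    process: Callable[[dict], None],
) -> tuple[int, int]:
    """Re-run stalled events without double-processing any of them.

    `events` is the backlog captured while the automation was down;
    `processed_ids` is the idempotency guard (stable event IDs seen before);
    `process` is the same handler the automation calls in normal operation.
    """
    replayed = skipped = 0
    for event in events:
        if event["id"] in processed_ids:
            skipped += 1           # handled before the failure; don't repeat it
            continue
        process(event)             # same code path as a normal run
        processed_ids.add(event["id"])
        replayed += 1
    return replayed, skipped

# Example: replay a small backlog of stalled invoice events.
seen: set[str] = {"inv-100"}
backlog = [{"id": "inv-100"}, {"id": "inv-101"}, {"id": "inv-102"}]
print(replay_missed_events(backlog, seen, lambda e: print("invoice", e["id"])))
```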
Layer four — prevention
Periodically — weekly or biweekly for active automations, monthly for stable ones — a review of:
- Are any automations running on deprecated APIs that should be updated?
- Are authentication mechanisms current?
- Are rate limits being approached at peak times?
- Are error rates trending up?
- Are new edge cases appearing in the data?
This is where breaks get prevented rather than just fixed after the fact. It’s the layer that distinguishes mature automation tending from reactive fire-fighting.
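One of those review checks, error rates trending up, is simple enough to sketch. This assumes daily error rates are already being recorded per automation; the function name and thresholds are illustrative, not a specific tool’s API.

```python
from statistics import mean

def error_rate_trending_up(
    daily_error_rates: list[float],  # oldest to newest, e.g. errors / runs per day
    window: int = 7,
    factor: float = 1.5,
) -> bool:
    """Flag an automation whose recent error rate sits well above its baseline."""
    if len(daily_error_rates) <= window:
        return False  # not enough history to call a trend
    baseline = mean(daily_error_rates[:-window])
    recent = mean(daily_error_rates[-window:])
    if baseline == 0:
        return recent > 0        # any errors after a clean baseline count
    return recent > factor * baseline
```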
What an 11pm break actually looks like
A specific example. A business runs automated invoicing: when a project milestone is marked complete, an invoice is generated and sent. Late on a Thursday, the integration between the project tool and the invoicing system fails because the project tool deprecated an authentication method.
11:47pm Thursday. Monitoring detects that no invoices have been generated in 90 minutes despite milestones being marked complete. Alert fires to the on-call partner team member.
11:52pm. Triage. The pattern is recognizable — auth failure on the project tool side. Severity: high (invoicing affects cash flow). Operator does not need to be woken; the issue can be resolved without their input.
12:18am Friday. Fix deployed. The auth method is updated to the current version, the integration reconnects, and the queued events (six invoices that should have been generated) flow through.
12:35am. Verification. All six invoices generated correctly. Customer-facing emails went out as expected. CRM and accounting records reconcile.
8:00am Friday morning. Operator gets a routine summary email noting the issue, the fix, and the prevention work to come.
Following Tuesday. Prevention review. The auth method change is checked across the other automations that use the same project tool to make sure none will fail the same way. In this case none were affected, and one likely repeat failure was prevented.
The operator never had to think about it. The business never lost revenue. The customer never knew. That’s what tending looks like when it’s working.
Why the alternative is so much more expensive
Compare the same failure without continuous custody:
- The failure happens at 11pm Thursday. No one notices.
- Friday morning, the operator notices something feels off in cash flow but can’t quite identify what.
- Tuesday, a customer asks about an invoice they expected but haven’t received. Investigation begins.
- Wednesday, the operator or someone they call figures out invoicing has been broken for nearly a week.
- Thursday, an emergency engagement happens to fix it. Catch-up invoices go out apologetically.
- Some invoices are billed for the wrong amounts because the project tool’s data has shifted in the meantime.
- A handful of customers are confused enough to question the invoices, dragging out collection.
The cost: a week of broken cash flow, multiple hours of operator attention diverted to the problem, customer trust damage, and an emergency engagement at premium rates. The total impact of the same underlying issue is dramatically larger.
The economic case for tending
The math operators sometimes don’t run:
- A continuous-custody arrangement that includes automation tending costs typically $1,500–$5,000/month
- A single significant automation failure resolved as an emergency typically costs $5,000–$25,000 in direct cost plus indirect business impact
- Automation graphs of any complexity break two to six times per year on average
The break-even on continuous-custody for automation tending typically lands at one or two prevented incidents per year. Most businesses with non-trivial automation have well more than that.
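For concreteness, the break-even arithmetic with illustrative figures chosen from inside the ranges above (the specific numbers are assumptions, not quotes):

```python
# Illustrative figures from inside the ranges above.
monthly_custody = 2_500      # continuous custody, $/month
emergency_cost = 20_000      # one emergency resolution, direct cost only
breaks_per_year = 4          # "two to six times per year on average"

annual_custody = 12 * monthly_custody               # 30,000
unattended_cost = breaks_per_year * emergency_cost  # 80,000, before indirect impact

break_even = annual_custody / emergency_cost
print(f"prevented incidents to break even: {break_even:.1f}")  # 1.5
```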
What to ask any automation provider
Three questions that separate real tending from a sales pitch about it:
- What’s your monitoring stack? Show me a recent alert and the response. A real answer involves named tools and a real example. A vague answer is a red flag.
- What’s your on-call coverage outside business hours? If automations are revenue-touching, this matters. If they’re not, less so.
- What’s your post-incident process? A real answer includes documentation, prevention work, and review. A “we just fix it” answer means the same incident will happen again.
The point isn’t to interrogate. It’s to confirm that “we tend it” is something the provider has actually built, not just something they say.
The bottom line
Automation that runs is good. Automation that runs and is tended is what makes the model work for serious businesses. The 11pm break is going to happen — what determines whether it costs you anything is whether someone is awake to handle it.
You don't have to act on any of this yourself.
Everything in this article — the strategy, the build, the integration, the ongoing tending — is the kind of work we own end-to-end for premium operators. One partner. One number. Off your plate.