Troubleshooting
1. Alerts are arriving but no incident is created
Symptoms: Webhook returns 200 OK but no incident appears in the dashboard.
Causes and fixes:
- Integration key mismatch — The `key` query parameter in your webhook URL must match the integration's `webhook_key`. Regenerate the key in Alert Sources and update your monitoring tool.
- Service is inactive — Check Services and confirm the target service has status `active`. Inactive services drop incoming alerts.
- Alert deduplication — An alert with the same `dedup_key` as an existing open alert is silently merged. Check whether an alert or incident is already open under that dedup key.
- Routing rule filtering — If you have routing rules on the integration, the alert payload may not match any of them. Review routing rules in Alert Sources → integration detail.
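As an illustration of a well-formed webhook call, the sketch below builds the URL with the `key` query parameter and a payload carrying a stable `dedup_key`. The endpoint path, key value, and payload field names here are assumptions for illustration; substitute the values from your own integration.

```python
from urllib.parse import urlencode

# Hypothetical integration details -- substitute your own values.
BASE_URL = "https://api.theoneoncall.app/api/webhook"
WEBHOOK_KEY = "wk_example123"  # must match the integration's webhook_key

def build_webhook_request(device_id: str, check_type: str, summary: str):
    """Build the URL and payload for a monitoring-tool webhook POST."""
    # If the key query parameter does not match the integration's
    # webhook_key, the POST returns 200 OK but no incident is created.
    url = f"{BASE_URL}?{urlencode({'key': WEBHOOK_KEY})}"
    payload = {
        "summary": summary,
        "severity": "critical",
        # A stable dedup_key prevents duplicate incidents for the same event.
        "dedup_key": f"{device_id}:{check_type}",
    }
    return url, payload

url, payload = build_webhook_request("dev-042", "disk-usage", "Disk > 90% on dev-042")
# Send with your HTTP client, e.g. requests.post(url, json=payload)
```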
2. No notification received when paged
Symptoms: An incident was created and escalated to your step, but you received no email/SMS.
Causes and fixes:
- Notification preferences not set — Go to Notifications and confirm you have at least one active notification rule with a valid email address.
- Delay > 0 on first rule — If your immediate rule has a delay of 5+ minutes, you will not be notified instantly. Set the delay to `0` for your primary rule.
- Email in spam — Check your spam/junk folder. Add `[email protected]` to your allowlist.
- Escalation policy targets a schedule, not you — The escalation may be targeting a different schedule. Open the escalation policy and verify Step 1 resolves to you.
- Not in the rotation at that time — Use On-Call Now to verify you are currently the on-call user for your schedule.
3. Wrong person is showing as on-call
Symptoms: On-Call Now shows a different technician than expected.
Causes and fixes:
- Rotation order changed — If someone modified the layer user list, the rotation recalculates from the schedule start date. Review the layer user order in the schedule detail.
- Override is active — An approved override may be substituting another technician. Check Overrides (via API: `GET /api/overrides?schedule_id=<id>`) for active overrides.
- Schedule timezone mismatch — If the schedule timezone differs from your local timezone, handoff times may not be what you expect. Check the schedule's configured timezone.
- Custom rotation length calculation — For custom rotations, verify that `rotation_length_hours` and `start_date` are correct. The current user index is `floor(elapsed_hours / rotation_length_hours) % user_count`.
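The custom-rotation arithmetic can be checked by hand with a minimal sketch of the stated formula. How elapsed hours are measured (UTC, from `start_date`) is an assumption here:

```python
from datetime import datetime, timezone
from math import floor

def current_on_call(users, start_date, rotation_length_hours, now=None):
    """Return the user currently on call for a custom rotation.

    Implements: index = floor(elapsed_hours / rotation_length_hours) % user_count
    """
    now = now or datetime.now(timezone.utc)
    elapsed_hours = (now - start_date).total_seconds() / 3600
    index = floor(elapsed_hours / rotation_length_hours) % len(users)
    return users[index]

users = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
# 100 hours elapsed, 24-hour rotation: floor(100/24) = 4, 4 % 3 = 1 -> "bob"
print(current_on_call(users, start, 24,
                      now=datetime(2024, 1, 5, 4, tzinfo=timezone.utc)))
```

If the result disagrees with On-Call Now, compare your `start_date` and `rotation_length_hours` against the schedule's configured values.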
4. Escalation is not firing (incident stuck in triggered)
Symptoms: An incident has been triggered for 30+ minutes with no escalation notifications.
Causes and fixes:
- No escalation policy assigned — Open the service the incident belongs to and confirm an escalation policy is assigned. If not, assign one and the escalation engine will pick it up on its next 60-second run.
- All steps have delay > current elapsed time — If Step 1 has `delay_minutes: 30`, it won't fire for 30 minutes. Check your policy step delays.
- Escalation engine not running — The background timer function must be deployed and running. Check the `theoneoncall-background` → `escalation-engine` function logs in Azure.
- All repeat cycles exhausted — If the policy has `repeat_count: 0` and all steps completed without acknowledgment, the policy stops escalating. Increase the repeat count, or acknowledge and reassign manually.
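To reason about whether an incident is genuinely stuck or simply waiting on a step delay, it can help to compute which steps should have fired by a given elapsed time. The sketch below assumes a simple model: each step has a `delay_minutes` offset from the start of a cycle, and `repeat_count` adds extra full cycles. The exact cycle semantics are an assumption, not the engine's documented behavior:

```python
def due_steps(steps, repeat_count, elapsed_minutes):
    """Return (cycle, step_index) pairs that should have fired by now.

    steps: list of delay_minutes offsets from the start of each cycle.
    repeat_count: extra full cycles after the first (0 = no repeats).
    """
    cycle_length = max(steps) if steps else 0
    fired = []
    for cycle in range(repeat_count + 1):
        for i, delay in enumerate(steps):
            if cycle * cycle_length + delay <= elapsed_minutes:
                fired.append((cycle, i))
    return fired

# Step 1 fires at 30 min, Step 2 at 45 min; no repeats (repeat_count: 0).
# At 20 minutes elapsed nothing is due yet -- the incident only *looks* stuck.
print(due_steps([30, 45], repeat_count=0, elapsed_minutes=20))  # []
```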
5. PSA ticket was not created after acknowledgment
Symptoms: Incident was acknowledged in On-Call but no ticket appeared in PSA.
Causes and fixes:
- PSA integration not configured — Check that `PSA_API_URL` and `PSA_SERVICE_KEY` are set in the `theoneoncall-api` app settings or Key Vault.
- Key mismatch — The `PSA_SERVICE_KEY` on On-Call must match `ONCALL_SERVICE_KEY` on PSA. Verify both values.
- PSA API unreachable — Check that the PSA API is accessible from On-Call's function app. Review the On-Call API logs for `psa-ticket-creation` errors.
- Tenant not linked to PSA company — The tenant's `org_id` may not map to a client company in PSA. Check PSA client records for the matching org ID.
6. Defend alerts are not triggering On-Call incidents
Symptoms: Defend detects a critical threat but no on-call incident is created.
Causes and fixes:
- `DEFEND_SERVICE_KEY` not set — Set this environment variable on both `theoneoncall-api` and `theonedefend-api`. The two values must match.
- Detection severity is medium or lower — Only critical and high Defend detections trigger On-Call. Review the detection severity in Defend.
- Defend API cannot reach On-Call API — Check network connectivity. Both apps run in Azure East US, so outbound calls should succeed. Review the Defend API logs for `internal/escalations/create` call failures.
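The severity gate above can be sketched as follows: only `critical` and `high` detections are forwarded to On-Call. The helper names, payload shape, and header name here are illustrative assumptions; only the endpoint path and the environment variable come from this guide.

```python
import os

FORWARDED_SEVERITIES = {"critical", "high"}

def should_forward(detection: dict) -> bool:
    """Only critical/high Defend detections trigger On-Call incidents."""
    return detection.get("severity") in FORWARDED_SEVERITIES

def build_escalation_call(detection: dict):
    """Sketch of the internal call Defend makes to On-Call; None if filtered."""
    if not should_forward(detection):
        return None
    service_key = os.environ.get("DEFEND_SERVICE_KEY")  # must match on both apps
    return {
        "url": "https://api.theoneoncall.app/internal/escalations/create",
        "headers": {"x-service-key": service_key},      # header name is assumed
        "body": {"summary": detection["title"], "severity": detection["severity"]},
    }

print(should_forward({"severity": "medium"}))    # False -> no incident, by design
print(should_forward({"severity": "critical"}))  # True
```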
7. Coverage gap alert email received
Symptoms: You received an email saying "Coverage gap detected" for a schedule.
What it means: The daily gap detection timer found a window in the next 7 days where no layer is active for a schedule.
Fixes:
- Add users to the empty layer — If a layer has no users, it provides no coverage.
- Extend restriction hours — If a layer with time-of-day restrictions ends at 6pm and another doesn't start until 8am, there's a 14-hour gap.
- Add a second layer — Create a second layer to cover the uncovered window (e.g., an overnight layer for a business-hours schedule).
- Set an end date on temporary schedules — If a schedule has expired, update or deactivate it to prevent false gap alerts.
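The second fix above (a layer ending at 6pm and the next starting at 8am) can be checked with a small sketch over hour-of-day coverage. This models time-of-day restrictions only, which is an assumption about what the gap detector considers:

```python
def uncovered_hours(layers):
    """Return the hours of day (0-23) not covered by any layer.

    Each layer is a (start_hour, end_hour) restriction with end > start,
    within a single day; an overnight layer is modeled as two ranges.
    """
    covered = set()
    for start, end in layers:
        covered.update(range(start, end))
    return sorted(set(range(24)) - covered)

# One business-hours layer 08:00-18:00 leaves a 14-hour overnight gap.
gap = uncovered_hours([(8, 18)])
print(len(gap))  # 14
# Adding an overnight layer (18:00-24:00 plus 00:00-08:00) closes it.
print(uncovered_hours([(8, 18), (18, 24), (0, 8)]))  # []
```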
8. Schedule rotation handoff notifications not arriving
Symptoms: Technicians are not receiving the 1-hour advance notification before their shift begins.
Causes and fixes:
- Schedule rotation timer not running — Check the `theoneoncall-background` → `schedule-rotation` function logs. The timer runs every 60 seconds.
- Notification preferences not set — The outgoing notification uses the technician's email preference. Confirm it is configured.
- Schedule start date years in the past — If the schedule started very long ago and the rotation window is short, the timer's 1-hour lookahead may behave unexpectedly. Set a recent start date for long-running schedules.
9. Duplicate incidents for the same alert
Symptoms: Multiple incidents appear for what should be a single alert event.
Causes and fixes:
- `dedup_key` not set in webhook payload — Without a dedup key, every POST creates a new alert and incident. Add a stable `dedup_key` to your webhook payload (e.g., device ID + check type).
- Alert grouping disabled — Enable time-based grouping on the service (5-minute window) to consolidate rapid-fire alerts.
- Two integrations pointing to the same service — If two webhook integrations both receive the same alert, each will create an incident. Ensure each alert source has exactly one integration configured.
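The first two mechanisms above can be sketched together: a stable `dedup_key` merges repeats of the same event, and a 5-minute grouping window consolidates rapid-fire alerts. This is an illustrative model, not the service's actual implementation:

```python
GROUP_WINDOW_SECONDS = 300  # 5-minute time-based grouping window

def ingest(alerts):
    """Simulate incident creation with dedup plus time-based grouping.

    alerts: list of (timestamp_seconds, dedup_key) tuples, in arrival order.
    Returns the number of incidents that would be created.
    """
    last_seen = {}  # dedup_key -> timestamp of last alert merged in
    incidents = 0
    for ts, key in alerts:
        last = last_seen.get(key)
        if last is not None and ts - last <= GROUP_WINDOW_SECONDS:
            last_seen[key] = ts   # merged into the existing incident
        else:
            incidents += 1        # new incident
            last_seen[key] = ts
    return incidents

# Three repeats of the same check within the window -> one incident.
print(ingest([(0, "dev-042:disk"), (60, "dev-042:disk"), (240, "dev-042:disk")]))  # 1
# Without a stable dedup_key, every POST would open its own incident.
```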
10. On-Call app shows a blank page or fails to load
Symptoms: app.theoneoncall.app loads but shows no data or a loading spinner that never resolves.
Causes and fixes:
- Not authenticated — Open Hub and click On-Call from the waffle menu to initiate SSO. Direct navigation to the SWA domain may bypass SSO.
- Hub SSO token expired — Sign out of Hub and sign back in, then access On-Call again.
- CORS error — If accessing On-Call from a custom domain, ensure the domain is in the `@theonefamily/cors` allowed origins list (the wildcard `*.theoneoncall.app` is already included).
- API unreachable — Check the Azure Function App `theoneoncall-api` health endpoint at `https://api.theoneoncall.app/api/health`. If it returns an error, check the function app logs.
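If you suspect the CORS cause, the wildcard match can be sketched with `fnmatch`. The pattern list below is illustrative; the real list lives in the `@theonefamily/cors` configuration:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Illustrative allowed-origins list; check @theonefamily/cors for the real one.
ALLOWED_ORIGIN_PATTERNS = ["*.theoneoncall.app", "app.theoneoncall.app"]

def origin_allowed(origin: str) -> bool:
    """Check whether a request Origin header matches any allowed pattern."""
    host = urlparse(origin).hostname or origin
    return any(fnmatch(host, pattern) for pattern in ALLOWED_ORIGIN_PATTERNS)

print(origin_allowed("https://app.theoneoncall.app"))  # True
print(origin_allowed("https://oncall.example.com"))    # False -> CORS error
```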
As a general debugging step, review the logs under `theoneoncall-api` → Monitor → Logs. Most errors are logged with enough detail to diagnose the root cause.

When to Contact Support
Contact support ([email protected]) for:
- Data loss or corruption (incidents/schedules missing unexpectedly)
- Authentication failures that persist after re-signing in to Hub
- Billing discrepancies between On-Call active user count and Hub Billing
- API errors with no actionable detail in logs (status 500 with no body)
- Integration key rotation that did not take effect
Self-resolve via the above steps for all configuration, notification preference, and routing issues.