Troubleshooting

1. Alerts are arriving but no incident is created

Symptoms: Webhook returns 200 OK but no incident appears in the dashboard.

Causes and fixes:

  • Integration key mismatch — The key query parameter in your webhook URL must match the integration's webhook_key. Regenerate the key in Alert Sources and update your monitoring tool.
  • Service is inactive — Check Services and confirm the target service has status active. Inactive services drop incoming alerts.
  • Alert deduplication — An alert with the same dedup_key as an existing open alert is silently merged. Check whether the existing alert/incident is already open under that dedup key.
  • Routing rule filtering — If you have routing rules on the integration, the alert payload may not match. Review routing rules in Alert Sources → integration detail.

2. No notification received when paged

Symptoms: An incident was created and escalated to your step, but you received no email/SMS.

Causes and fixes:

  • Notification preferences not set — Go to Notifications and confirm you have at least one active notification rule with a valid email address.
  • Delay > 0 on first rule — If your first notification rule has a delay greater than 0 (for example, 5 minutes), you will not be notified immediately. Set the delay to 0 on your primary rule.
  • Email in spam — Check your spam/junk folder. Add [email protected] to your allowlist.
  • Escalation policy targets a schedule, not you — Step 1 may target a schedule you are not on rather than you directly. Open the escalation policy and verify that Step 1 resolves to you.
  • Not in the rotation at that time — Use On-Call Now to verify you are currently the on-call user for your schedule.

3. Wrong person is showing as on-call

Symptoms: On-Call Now shows a different technician than expected.

Causes and fixes:

  • Rotation order changed — If someone modified the layer user list, the rotation recalculates from the schedule start date. Review the layer user order in the schedule detail.
  • Override is active — An approved override may be substituting another technician. Check Overrides (via API: GET /api/overrides?schedule_id=<id>) for active overrides.
  • Schedule timezone mismatch — If the schedule timezone differs from your local timezone, handoff times may not be what you expect. Check the schedule's configured timezone.
  • Custom rotation length calculation — For custom rotations, verify the rotation_length_hours and start_date are correct. The current user index is floor(elapsed_hours / rotation_length_hours) % user_count.
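
The index formula above can be checked by hand with a short script. This is a minimal sketch, not the product's implementation: the function name, the user names, and the dates are made up for illustration, and it assumes start_date and the current time are in the same timezone.

```python
from datetime import datetime, timezone
from math import floor


def current_on_call(users, start_date, rotation_length_hours, now):
    """Apply floor(elapsed_hours / rotation_length_hours) % user_count."""
    if not users:
        raise ValueError("layer has no users")
    elapsed_hours = (now - start_date).total_seconds() / 3600
    if elapsed_hours < 0:
        raise ValueError("schedule has not started yet")
    index = floor(elapsed_hours / rotation_length_hours) % len(users)
    return users[index]


# Hypothetical layer: weekly rotation (168 hours) that started on 1 Jan 2024.
users = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)

# 219 elapsed hours -> floor(219 / 168) = 1 -> 1 % 3 = 1
print(current_on_call(users, start, 168, now))  # → bob

# Reordering the user list changes the result for the same moment in time.
print(current_on_call(["bob", "alice", "carol"], start, 168, now))  # → alice
```

The second call illustrates why a reordered layer user list can change who is on call without any change to dates or rotation length.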

4. Escalation is not firing (incident stuck in triggered)

Symptoms: An incident has been in triggered status for 30+ minutes with no escalation notifications.

Causes and fixes:

  • No escalation policy assigned — Open the service the incident belongs to and confirm an escalation policy is assigned. If not, assign one and the escalation engine will pick it up on its next 60-second run.
  • All steps have delay > current elapsed time — If Step 1 has delay_minutes: 30, it won't fire for 30 minutes. Check your policy step delays.
  • Escalation engine not running — The background timer function must be deployed and running. Check Azure Function App theoneoncall-background → escalation-engine function logs.
  • All repeat cycles exhausted — If the policy has repeat_count: 0 and all steps completed without acknowledgment, the policy stops escalating. Increase repeat count or acknowledge and reassign manually.
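
To see whether step delays alone explain a quiet incident, compare each delay with the time elapsed since the incident triggered. The sketch below assumes that delay_minutes is measured from the trigger time; the "step" key and the example policy are hypothetical and only serve the illustration.

```python
def due_steps(steps, elapsed_minutes):
    """Return the step numbers whose delay has already elapsed."""
    return [s["step"] for s in steps if s["delay_minutes"] <= elapsed_minutes]


# Hypothetical policy: Step 1 waits 30 minutes, Step 2 waits 45 minutes.
policy_steps = [
    {"step": 1, "delay_minutes": 30},
    {"step": 2, "delay_minutes": 45},
]

print(due_steps(policy_steps, 20))  # → []
print(due_steps(policy_steps, 30))  # → [1]
print(due_steps(policy_steps, 45))  # → [1, 2]
```

An empty list means no step is due yet, so the absence of notifications is expected rather than a fault.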

5. PSA ticket was not created after acknowledgment

Symptoms: Incident was acknowledged in On-Call but no ticket appeared in PSA.

Causes and fixes:

  • PSA integration not configured — Check that PSA_API_URL and PSA_SERVICE_KEY are set in theoneoncall-api app settings or Key Vault.
  • Key mismatch — The PSA_SERVICE_KEY on On-Call must match ONCALL_SERVICE_KEY on PSA. Verify both values.
  • PSA API unreachable — Check that the PSA API is accessible from On-Call's function app. Review On-Call API logs for psa-ticket-creation errors.
  • Tenant not linked to PSA company — The tenant's org_id may not map to a client company in PSA. Check PSA client records for the matching org ID.

6. Defend alerts are not triggering On-Call incidents

Symptoms: Defend detects a critical threat but no on-call incident is created.

Causes and fixes:

  • DEFEND_SERVICE_KEY not set — Set this env var on both theoneoncall-api and theonedefend-api. They must match.
  • Detection severity is medium or lower — Only critical and high Defend detections trigger On-Call. Review the detection severity in Defend.
  • Defend API cannot reach On-Call API — Check network connectivity. Both are on Azure East US; outbound calls should succeed. Review Defend API logs for internal/escalations/create call failures.

7. Coverage gap alert email received

Symptoms: You received an email saying "Coverage gap detected" for a schedule.

What it means: The daily gap detection timer found a window in the next 7 days where no layer is active for a schedule.

Fixes:

  • Add users to the empty layer — If a layer has no users, it provides no coverage.
  • Extend restriction hours — If a layer with time-of-day restrictions ends at 6pm and another doesn't start until 8am, there's a 14-hour gap.
  • Add a second layer — Create a second layer to cover the uncovered window (e.g., an overnight layer for a business-hours schedule).
  • Set an end date on temporary schedules — If a schedule has expired, update or deactivate it to prevent false gap alerts.
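
The restriction-hours case can be reproduced with a small calculation. This is a simplified sketch under stated assumptions: it looks at a single 24-hour day at whole-hour granularity, represents each layer as a hypothetical (start_hour, end_hour) pair, and treats equal start and end hours as full-day coverage. The actual gap detection timer looks ahead 7 days and may work differently.

```python
def uncovered_windows(layers):
    """Return uncovered (start_hour, end_hour) windows in a 24-hour day.

    A layer or window whose end is not after its start wraps past midnight.
    """
    covered = [False] * 24
    for start, end in layers:
        length = (end - start) % 24 or 24  # equal start/end counts as all day
        for offset in range(length):
            covered[(start + offset) % 24] = True

    if all(covered):
        return []
    if not any(covered):
        return [(0, 24)]

    # Walk one full day starting just after a covered hour so that a gap
    # spanning midnight is reported as a single window.
    first_covered = covered.index(True)
    gaps = []
    gap_start = None
    for offset in range(1, 25):
        hour = (first_covered + offset) % 24
        if not covered[hour] and gap_start is None:
            gap_start = hour
        elif covered[hour] and gap_start is not None:
            gaps.append((gap_start, hour))
            gap_start = None
    return gaps


def gap_hours(window):
    """Length of a window in hours, allowing for midnight wrap."""
    start, end = window
    return (end - start) % 24 or 24


business_hours = [(8, 18)]  # 8am to 6pm
print(uncovered_windows(business_hours))                # → [(18, 8)]
print(gap_hours(uncovered_windows(business_hours)[0]))  # → 14

# Adding an overnight layer closes the gap.
print(uncovered_windows([(8, 18), (18, 8)]))            # → []
```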

8. Schedule rotation handoff notifications not arriving

Symptoms: Technicians are not receiving the 1-hour advance notification before their shift begins.

Causes and fixes:

  • Schedule rotation timer not running — Check Azure Function App theoneoncall-background → schedule-rotation function logs. The timer runs every 60 seconds.
  • Notification preferences not set — The outgoing notification uses the technician's email preference. Confirm it's configured.
  • Schedule start date is years in the past — If the schedule started long ago and the rotation length is short, the timer's 1-hour lookahead may behave unexpectedly. Set a recent start date on long-running schedules.

9. Duplicate incidents for the same alert

Symptoms: Multiple incidents appear for what should be a single alert event.

Causes and fixes:

  • dedup_key not set in webhook payload — Without a dedup key, every POST creates a new alert and incident. Add a stable dedup_key to your webhook payload (e.g., device ID + check type).
  • Alert grouping disabled — Enable time-based grouping on the service (5-minute window) to consolidate rapid-fire alerts.
  • Two integrations pointing to the same service — If two webhook integrations both receive the same alert, both will create an incident. Ensure each alert source has exactly one integration configured.
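
A stable dedup_key is one that comes out identical every time the same device and check fire. The sketch below shows one way to build it in the sending tool. The helper names and the "summary" field are hypothetical; only dedup_key is taken from the text above, so check your integration's payload format before reusing the other fields.

```python
def build_dedup_key(device_id, check_type):
    """Build a key that is identical for repeated alerts from the same check."""
    # Normalize case and whitespace so "SRV-01 " and "srv-01" do not diverge.
    return f"{device_id.strip().lower()}:{check_type.strip().lower()}"


def build_payload(device_id, check_type, summary):
    """Assemble a webhook payload; 'summary' is an illustrative field only."""
    return {
        "summary": summary,
        "dedup_key": build_dedup_key(device_id, check_type),
    }


first = build_payload("SRV-01", "disk_usage", "Disk at 91%")
second = build_payload("srv-01 ", "Disk_Usage", "Disk at 93%")

print(first["dedup_key"])                         # → srv-01:disk_usage
print(first["dedup_key"] == second["dedup_key"])  # → True
```

Avoid putting timestamps or measured values into the key, since those change on every alert and defeat deduplication.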

10. On-Call app shows a blank page or fails to load

Symptoms: app.theoneoncall.app loads but shows no data or a loading spinner that never resolves.

Causes and fixes:

  • Not authenticated — Open Hub and click On-Call from the waffle menu to initiate SSO. Direct navigation to the SWA domain may bypass SSO.
  • Hub SSO token expired — Sign out of Hub and sign back in, then access On-Call again.
  • CORS error — If accessing On-Call from a custom domain, ensure the domain is in the @theonefamily/cors allowed origins list (wildcard: *.theoneoncall.app is already included).
  • API unreachable — Check Azure Function App theoneoncall-api health at https://api.theoneoncall.app/api/health. If this returns an error, check function app logs.
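
The health check can also be scripted. This is a minimal sketch using only the Python standard library. It assumes that an HTTP 200 response means the API is healthy and makes no assumption about the response body; the function name and the opener parameter are illustrative.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

HEALTH_URL = "https://api.theoneoncall.app/api/health"


def check_health(url=HEALTH_URL, opener=urlopen, timeout=10):
    """Return (ok, detail) for the health endpoint.

    `opener` is injectable so the function can be exercised without network
    access; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"unreachable: {exc}"
```

Calling check_health() returns a tuple such as (True, "HTTP 200"). A False result with an HTTP status points at the function app itself, while an "unreachable" result points at DNS or network issues.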

ℹ️ For issues not covered here, check the On-Call API logs in Azure Function App theoneoncall-api → Monitor → Logs. Most errors are logged with enough detail to diagnose the root cause.

When to Contact Support

Contact support ([email protected]) for:

  • Data loss or corruption (incidents/schedules missing unexpectedly)
  • Authentication failures that persist after re-signing in to Hub
  • Billing discrepancies between On-Call active user count and Hub Billing
  • API errors with no actionable detail in logs (status 500 with no body)
  • Integration key rotation that did not take effect

For configuration, notification preference, and routing issues, use the steps above to resolve the problem yourself.