Support Process

Overview

The AI Delivery (AID) team provides support for Publix Pro Chat and Publix Mobile App Chat, including all associated APIs, Azure Function Apps, and Databricks jobs.
Support is handled through ServiceNow and PagerDuty, with standard coverage from 6:00 AM to 11:00 PM ET. P1 incidents are covered 24/7.


1. On-Call Rotation

  • We operate with a two-person rotation, alternating every two weeks.
  • The primary on-call engineer is responsible for triaging and responding to incidents.
  • If the primary does not acknowledge an incident within the window listed in the Expected Response column of the Priority Levels table (section 3), PagerDuty escalates to the backup.
  • Both members ensure clear handoffs at the start and end of each rotation.
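
As a rough illustration of the two-week alternation described above, the following sketch computes who is primary and who is backup on a given date. The rotation start date and the engineer names are hypothetical placeholders, and the PagerDuty schedule remains the source of truth for who is actually on call.

```python
from datetime import date

ROTATION_DAYS = 14  # "alternating every two weeks"


def on_call_pair(day: date, rotation_start: date,
                 engineers: tuple[str, str]) -> tuple[str, str]:
    """Return (primary, backup) for `day`.

    Assumes engineers[0] is primary for the first two-week period that
    begins on `rotation_start`, and that the roles swap every period.
    """
    if day < rotation_start:
        raise ValueError("day is before the rotation start")
    period = (day - rotation_start).days // ROTATION_DAYS
    primary = engineers[period % 2]
    backup = engineers[(period + 1) % 2]
    return primary, backup


# Hypothetical values for illustration only, not the team's real schedule.
start = date(2025, 1, 6)
team = ("engineer_a", "engineer_b")
print(on_call_pair(date(2025, 1, 10), start, team))  # ('engineer_a', 'engineer_b')
print(on_call_pair(date(2025, 1, 20), start, team))  # ('engineer_b', 'engineer_a')
```

Schedule overrides (for example, for planned time off) are not modeled here.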

Support Calendar

View the current on-call schedule: AI Delivery Support Calendar

Tip: Subscribe to this calendar in Outlook or your preferred calendar app to see who is on call at any time.

Planned Time Off During Your Rotation

If you have a planned vacation or time off that conflicts with your on-call rotation:

  1. Arrange coverage first – It is your responsibility to ensure support coverage is in place before submitting your time off request.
  2. Find a substitute – Coordinate with another team member to cover your rotation.
  3. Update PagerDuty – Make the necessary schedule overrides in PagerDuty to reflect the coverage change.
  4. Notify the team – Communicate the coverage arrangement to ensure everyone is aware of the schedule change.
  5. Submit time off – Only after coverage is confirmed and PagerDuty is updated should you submit your official time off request.

2. Support Channels

Channel | Purpose
ServiceNow | All incidents are ultimately routed through ServiceNow. The team uses the ISApp2 - AI Delivery assignment group, where incidents from across the organization can be directed for handling. ServiceNow serves as the central system for tracking, managing, and closing all incidents and requests.
PagerDuty | PagerDuty notifies our team of incidents or alerts in real time. From PagerDuty, you can view open ServiceNow incidents and Azure alerts, and access support calendars to see who is currently on call.
Email / Teams | Email and Teams are another key way to reach our team. They are used for direct communication with stakeholders and internal coordination. See the section below for important Teams channels and distribution lists.

Important Communication Channels

Channel / List | Purpose
PublixProChatSupport@publix.com | Receives Azure alerts and DBWS job failure notifications, as well as CIRT, Azure Policy, and other enterprise scanning tool notifications for Publix Pro Chat. Members of the organization can also email this address, though it is not monitored for active support issues. It is mainly used by the support team to send important communications during critical issues impacting Publix Pro Chat.
PublixProChatStakeholders@publix.com | Used for stakeholder communications and updates related to Publix Pro Chat.
MobileAppChatSupport@publix.com | Same as PublixProChatSupport, but for the Publix Mobile App Chat system.
PROD Incidents – Pro Chat | Central Teams channel where all incidents are published and ongoing status updates are posted for visibility across the team.

3. Priority Levels

Priority | Description | Expected Response | Communication | Schedule
P1 | Full production outage or major customer impact | Acknowledge within 10 minutes, engage immediately, resolve or mitigate ASAP | Bridge call required. Stakeholder email within 30 minutes, hourly updates until resolved. | 24/7
P2 | Partial degradation or non-blocking issue impacting performance or functionality | Acknowledge within 30 minutes, work toward resolution within the same day | Stakeholder email within 1 hour, updates every 4 hours until resolved. | Support hours only
P3 | General request, minor defect, or question | Acknowledge within 4 hours | No stakeholder email required. Updates as needed. | Business hours only
P4 | Low-impact issue, enhancement, or documentation update | Acknowledge within 8 hours | No stakeholder email required. Updates as available. | Business hours only

Reference: For official Publix incident SLA guidelines, see Incident Management Documentation.
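
The acknowledgment and communication targets in the table above can be captured as data, which makes deadlines easy to compute. This is a minimal sketch: the `PriorityPolicy` class, the `POLICIES` mapping, and `ack_deadline` are illustrative names, not part of ServiceNow or PagerDuty. It also assumes the acknowledgment clock runs continuously from the moment passed in, so the caller should pass the time the incident became actionable within its schedule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PriorityPolicy:
    ack_within: timedelta                          # Expected Response column
    first_stakeholder_email: Optional[timedelta]   # None = no email required
    update_every: Optional[timedelta]              # None = updates as needed/available
    schedule: str                                  # Schedule column


# Values copied from the Priority Levels table.
POLICIES = {
    "P1": PriorityPolicy(timedelta(minutes=10), timedelta(minutes=30),
                         timedelta(hours=1), "24/7"),
    "P2": PriorityPolicy(timedelta(minutes=30), timedelta(hours=1),
                         timedelta(hours=4), "Support hours only"),
    "P3": PriorityPolicy(timedelta(hours=4), None, None, "Business hours only"),
    "P4": PriorityPolicy(timedelta(hours=8), None, None, "Business hours only"),
}


def ack_deadline(priority: str, clock_start: datetime) -> datetime:
    """Latest acknowledgment time, assuming a continuously running clock."""
    return clock_start + POLICIES[priority].ack_within


print(ack_deadline("P1", datetime(2025, 1, 6, 9, 0)))  # 2025-01-06 09:10:00
```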


4. Support Hours and Escalation

  • Standard coverage: 6:00 AM – 11:00 PM ET
  • Business hours: Monday – Friday, 8:30 AM – 4:30 PM ET
  • After-hours incidents:
    • Only P1 incidents will trigger PagerDuty escalation to the on-call engineer (24/7).
    • P2 incidents will be queued for triage when support hours resume the next day. It is the on-call support engineer's responsibility to assess the impact of the P2 incident to determine if immediate action is required.
    • P3 and P4 incidents will be queued for triage during business hours.
  • If a P1 is not acknowledged within 10 minutes, PagerDuty escalates to the backup.
  • If both are unavailable or the issue is prolonged, the incident is escalated to Daniel Robinson for follow-up and, if needed, stakeholder communication.
  • Escalation Chain for True P1 Incidents (Major Impact):
    • If Daniel Robinson is not reachable within 10 minutes, escalate to Mahesh Gurusamy.
    • If Mahesh Gurusamy is not reachable within 10 minutes, escalate to Sunil Singh.
    • These escalations apply only for true P1 incidents with significant production impact.
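
The routing and escalation rules in this section can be sketched as a small decision function. Several points here are assumptions rather than stated policy: timestamps are naive datetimes already expressed in ET, standard coverage applies every day of the week, window end times are exclusive, and the return labels are invented for illustration. The escalation order is taken from the bullets above.

```python
from datetime import datetime, time
from typing import Optional

SUPPORT_START, SUPPORT_END = time(6, 0), time(23, 0)       # standard coverage
BUSINESS_START, BUSINESS_END = time(8, 30), time(16, 30)   # Monday to Friday

# Order described in this section; applies to true P1 incidents only.
P1_ESCALATION_CHAIN = [
    "primary on-call", "backup on-call",
    "Daniel Robinson", "Mahesh Gurusamy", "Sunil Singh",
]


def in_support_hours(now_et: datetime) -> bool:
    # Assumption: standard coverage applies every day of the week.
    return SUPPORT_START <= now_et.time() < SUPPORT_END


def in_business_hours(now_et: datetime) -> bool:
    return now_et.weekday() < 5 and BUSINESS_START <= now_et.time() < BUSINESS_END


def route(priority: str, now_et: datetime) -> str:
    """Return an illustrative routing label for a new incident."""
    if priority == "P1":
        return "page_on_call"  # P1 pages the on-call engineer 24/7
    if priority == "P2":
        return "triage_now" if in_support_hours(now_et) else "queue_for_support_hours"
    if priority in ("P3", "P4"):
        return "triage_now" if in_business_hours(now_et) else "queue_for_business_hours"
    raise ValueError(f"unknown priority: {priority}")


def next_escalation(current: str) -> Optional[str]:
    """Next contact in the P1 chain, or None at the end of the chain."""
    idx = P1_ESCALATION_CHAIN.index(current)
    return P1_ESCALATION_CHAIN[idx + 1] if idx + 1 < len(P1_ESCALATION_CHAIN) else None


print(route("P2", datetime(2025, 1, 6, 23, 30)))  # queue_for_support_hours
print(next_escalation("Daniel Robinson"))         # Mahesh Gurusamy
```

A queued P2 still needs the on-call engineer's impact assessment, which this sketch does not model.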

5. Communication Guidelines

  • Confirm acknowledgment in ServiceNow and PagerDuty.
  • For P1/P2 issues:
    • Create and share a Teams bridge call for coordination.
    • Send stakeholder emails following the cadence above.
  • Keep ServiceNow notes updated throughout resolution.
  • Mark incidents as "On-Hold" if they are waiting on external dependencies, customer response, or other factors preventing active work. Incidents in On-Hold status do not count toward SLA timers.
  • After resolution, document impact, root cause, and mitigation steps before closure.
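
To illustrate how On-Hold time is excluded from SLA timers, the sketch below subtracts on-hold intervals from the elapsed time. It is a simplified model, not ServiceNow's actual SLA engine: the function name is hypothetical, the intervals are assumed not to overlap, and business-hours schedules are ignored.

```python
from datetime import datetime, timedelta


def sla_elapsed(opened: datetime, now: datetime,
                on_hold: list[tuple[datetime, datetime]]) -> timedelta:
    """Elapsed time between `opened` and `now`, excluding on-hold intervals.

    Assumes the (start, end) intervals in `on_hold` do not overlap.
    """
    held = timedelta()
    for start, end in on_hold:
        # Clip each interval to the [opened, now] window before counting it.
        clipped_start = max(start, opened)
        clipped_end = min(end, now)
        if clipped_end > clipped_start:
            held += clipped_end - clipped_start
    return (now - opened) - held


opened = datetime(2025, 1, 6, 9, 0)
now = datetime(2025, 1, 6, 12, 0)
holds = [(datetime(2025, 1, 6, 10, 0), datetime(2025, 1, 6, 11, 30))]
print(sla_elapsed(opened, now, holds))  # 1:30:00
```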

6. Post-Incident Review (PIR)

  • A PIR is required for all P1 and major P2 incidents.
  • Complete within 3 business days of resolution.
  • PIRs must include:
    • Summary of impact and timeline
    • Root cause and contributing factors
    • Preventive actions or improvements
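
The 3-business-day PIR window can be computed with simple date arithmetic. This sketch assumes counting starts the day after resolution and that only weekends are skipped. Company holidays are not modeled, and `pir_due_date` is an illustrative name.

```python
from datetime import date, timedelta


def pir_due_date(resolved_on: date, business_days: int = 3) -> date:
    """Date by which the PIR should be complete.

    Counts Monday to Friday only, starting the day after resolution.
    """
    due = resolved_on
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return due


print(pir_due_date(date(2025, 1, 9)))  # Thursday -> 2025-01-14 (Tuesday)
print(pir_due_date(date(2025, 1, 6)))  # Monday   -> 2025-01-09 (Thursday)
```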

7. Continuous Improvement

  • The team meets as needed to review:
    • Recent incidents and PIR findings
    • SLA adherence and response times
    • Opportunities for automation or process refinement
  • Action items are tracked in ServiceNow or a shared backlog.

8. Process Flow