CCG Retry Mechanism for CASI API
Overview
The CCG (Convenient Checkout Gateway) implements a robust retry mechanism for all CASI (Communications API for Sycurio Integrations) API calls to ensure reliability and resilience in the face of transient network or service errors.
Importance of Retry for CASI API
Retries play a crucial role in ensuring reliability and a smooth user experience when interacting with the Sycurio system.
Network issues, temporary service outages, or unexpected server errors can cause requests to fail even if there is no problem with the user's action.
Problem with Transitional Call States
When CCG calls the CASI INSPECT endpoint, the underlying Semafone system must be in a stable state to provide accurate inspection data. However, phone calls go through several transitional states where the call is not yet stable:
Transitional Call States:
- Call Ringing: The phone is ringing but not yet answered
- Call Transferring: The call is being transferred to another agent or department
During these transitional states, the Semafone system within CASI cannot reliably inspect the call, resulting in an HTTP 500 error with the message: "Failed to make Semafone inspect URL call".
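As a concrete illustration, the sketch below shows how this specific response could be recognized as a transient condition rather than a fatal one. This is only a sketch: the function name and the assumption that the body is JSON with a detail field are illustrative, not CCG's actual implementation.

```python
# Sketch only: recognize the transitional-state INSPECT failure by its
# status code and detail message (assumed JSON body shape).
SEMAFONE_INSPECT_ERROR = "Failed to make Semafone inspect URL call"

def is_transient_inspect_error(status_code, body):
    """True when the response matches the known transitional-state failure."""
    return status_code == 500 and body.get("detail") == SEMAFONE_INSPECT_ERROR

# Example: the error returned while the call is still ringing or transferring
print(is_transient_inspect_error(500, {"detail": SEMAFONE_INSPECT_ERROR}))  # True
```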
Behavior Without Retry:
- CCG calls CASI INSPECT during a transitional state (e.g., call still ringing)
- CASI returns HTTP 500: {"detail": "Failed to make Semafone inspect URL call"}
- CCG treats this as a permanent failure and terminates the telephonic entry session
- The entire transaction fails, requiring manual intervention or user retry
Why This Is Problematic:
- These failures are temporary and transient, and typically resolve themselves within a short time
- Users experience unnecessary errors for conditions that would self-resolve
- Support teams must manually retry sessions or ask customers to restart the process
- The success rate is artificially lowered due to timing issues.
Without a retry mechanism, these transient errors would cause CCG to terminate the session prematurely, resulting in failed transactions and a poor user experience.
Problem with 504 Gateway Timeout Error
When CCG calls the CASI API, there are scenarios where the request may time out due to temporary network issues or delays in the downstream systems. In such cases, CASI returns an HTTP 504 Gateway Timeout error.
Behavior Without Retry:
- CCG calls a CASI endpoint and encounters a network delay or temporary unavailability
- CASI returns HTTP 504 Gateway Timeout error
- CCG treats this as a permanent failure and terminates the telephonic entry session
- The transaction fails, requiring manual intervention or user retry
Why This Is Problematic:
- These failures are often temporary and can resolve themselves if retried after a short interval
- Users may experience unnecessary errors and interruptions
- Support teams may need to manually retry or assist users, increasing operational overhead
- The overall success rate is reduced due to transient network or system issues
Without a retry mechanism, temporary network or service disruptions can cause avoidable failures, negatively impacting both user experience and operational efficiency.
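For illustration, a timeout can surface to CCG either as an HTTP 504 response from CASI or as a client-side timeout before any response arrives. The sketch below shows both paths using Python's requests library; the endpoint URL, timeout value, and function name are hypothetical, not CCG's actual code.

```python
# Sketch only: both an HTTP 504 from CASI and a client-side timeout are
# treated as retriable conditions. URL and timeout value are illustrative.
import requests

CASI_INSPECT_URL = "https://casi.example.com/inspect"  # hypothetical endpoint

def attempt_inspect(session):
    """Return (retriable, response); response is None when the client timed out."""
    try:
        response = session.get(CASI_INSPECT_URL, timeout=5)
    except requests.Timeout:
        return True, None       # no response at all: retriable network timeout
    if response.status_code == 504:
        return True, response   # gateway timeout reported by CASI: retriable
    return False, response      # anything else is classified separately
```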
When Retries Occur
- Network Timeouts: If a request to a CASI API endpoint times out (e.g., HTTP 504 Gateway Timeout), CCG will automatically retry the request.
- Server Errors: For retriable server-side errors (HTTP 5xx), CCG will retry the request up to a configured number of attempts.
- No Retries: For non-retriable server-side errors (HTTP 5xx) and client-side errors (HTTP 4xx), CCG does not retry, as these indicate issues that retrying will not resolve.
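To make this classification concrete, here is a minimal retriability check. The actual decisions live in the Retriability table, which is not reproduced here; the retriable 5xx set below (502, 503, 504 plus the known transient INSPECT 500) is an illustrative assumption.

```python
# Sketch only: an assumed retriability rule, not the authoritative table.
RETRIABLE_STATUS_CODES = {502, 503, 504}   # assumed retriable server-side errors
SEMAFONE_INSPECT_ERROR = "Failed to make Semafone inspect URL call"

def is_retriable(timed_out, status_code=None, detail=None):
    if timed_out:
        return True                                    # network timeout: retry
    if status_code in RETRIABLE_STATUS_CODES:
        return True                                    # retriable 5xx: retry
    if status_code == 500 and detail == SEMAFONE_INSPECT_ERROR:
        return True                                    # transient transitional-state 500
    return False                                       # other 5xx and all 4xx: no retry

print(is_retriable(False, 404))   # False: client errors are never retried
print(is_retriable(False, 504))   # True
```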
Example Retry Flow
- CCG sends a request to a CASI API endpoint.
- If the response is a network timeout or HTTP 5xx, CCG waits for a short delay and retries.
- If the retry also fails, CCG waits longer (exponential backoff) and retries again.
- If all retries fail, the error is logged and surfaced to the user or calling system.
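The flow above can be expressed as a short loop. The attempt count, initial delay, and doubling backoff mirror the defaults described in the next section; the function names and logger are illustrative, not CCG's actual code.

```python
# Sketch only: retry retriable failures with exponential backoff, then
# surface the final error. Names and defaults are illustrative.
import logging
import time

logger = logging.getLogger("ccg.casi")

MAX_ATTEMPTS = 4          # 1 initial attempt + 3 retries
INITIAL_DELAY_S = 0.6     # 600 ms before the first retry
BACKOFF_FACTOR = 2        # delay doubles after each failed attempt

def call_with_retry(call_once, is_retriable):
    """call_once() returns a result or raises; is_retriable(exc) decides whether to retry."""
    delay = INITIAL_DELAY_S
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_once()
        except Exception as exc:
            if attempt == MAX_ATTEMPTS or not is_retriable(exc):
                logger.error("CASI call failed on attempt %d, giving up: %s", attempt, exc)
                raise                       # surface the error to the caller
            logger.warning("CASI call failed on attempt %d, retrying in %.1fs: %s",
                           attempt, delay, exc)
            time.sleep(delay)
            delay *= BACKOFF_FACTOR         # exponential backoff
```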
Retry Strategy
- Triggering Retries: If CCG encounters an error that is marked as retriable in the Retriability table, it will automatically trigger a retry according to the configured strategy.
- Max Attempts: Typically 4 attempts (1 initial + 3 retries), but this can be changed as needed.
- Delay Between Retries: After a failed attempt, CCG waits a short time before trying again. This waiting period usually starts at 600 milliseconds, but can be adjusted.
- Backoff: The waiting time increases exponentially with each retry (exponential backoff) to avoid overwhelming the service.
- Logging: All retry attempts and failures are recorded for monitoring and troubleshooting.
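With these defaults (4 attempts and a 600 ms initial delay), the wait schedule stays short. A quick sketch of the implied delays, assuming the backoff factor is 2:

```python
# Delay schedule implied by the defaults above; the doubling factor is an assumption.
BASE_DELAY_MS = 600
BACKOFF_FACTOR = 2
MAX_RETRIES = 3  # retries after the initial attempt

delays = [BASE_DELAY_MS * BACKOFF_FACTOR ** i for i in range(MAX_RETRIES)]
print(delays)  # [600, 1200, 2400] -> at most ~4.2 s of waiting in total
```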
Impact of Retry
1. Positive Impact
- Improved User Experience: Users are less likely to encounter failures due to temporary call state issues, resulting in smoother telephonic payment sessions.
- Higher Success Rate: Telephonic entry sessions are more likely to complete successfully, reducing the need for manual intervention or user retries.
- Operational Efficiency: Support teams spend less time addressing transient failures, allowing them to focus on genuine issues.
- System Resilience: The system gracefully handles temporary network or telephony issues, increasing overall reliability.
2. Negative Impact
- Increase in 500 Error Count: If an INSPECT call is made after a REMOVE call, a 500 error with {"detail": "Failed to make Semafone inspect URL call"} may occur due to a known CASI-side bug (see issue).
- False Error Metrics: Because the retry logic treats this specific 500 error as retriable, the number of 500 errors may appear higher in monitoring dashboards. However, these errors do not impact user experience or system performance.
- Dashboard Consideration: To avoid misleading error metrics, dashboard queries should be adjusted to filter out or separately categorize these known, non-impactful 500 errors.
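As an illustration of that dashboard adjustment, the sketch below separates the known, non-impactful 500 from genuine server errors. The record shape (status and detail fields) and the filtering approach are assumptions, not the actual dashboard query.

```python
# Sketch only: exclude the known benign 500 from the error-rate metric.
KNOWN_TRANSIENT_DETAIL = "Failed to make Semafone inspect URL call"

def is_actionable_server_error(record):
    """True for 5xx log records that should count toward the error dashboard."""
    status = record.get("status", 0)
    if status == 500 and record.get("detail") == KNOWN_TRANSIENT_DETAIL:
        return False             # known, non-impactful CASI-side error
    return status >= 500          # every other server-side error still counts

records = [
    {"status": 500, "detail": KNOWN_TRANSIENT_DETAIL},  # filtered out
    {"status": 500, "detail": "unexpected failure"},    # still counted
]
print([is_actionable_server_error(r) for r in records])  # [False, True]
```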
Conclusion
The retry mechanism in CCG for the CASI API is designed to address the unique challenges of telephonic payment sessions, ensuring reliability and a seamless user experience. While it brings significant business value and operational benefits, it is important to be aware of its impact on error metrics and monitoring. Ongoing collaboration between development and monitoring teams will help ensure that dashboards accurately reflect system health and that users continue to benefit from a robust and resilient payment process.