Azure Website Errors

Incident Report for DPS AdTracker

Resolved

Earlier today, some customers experienced intermittent errors or timeouts connecting to our hosted services. We want to share what happened and what we're doing about it.

What Happened

The root cause was SNAT (Source Network Address Translation) port exhaustion on the Azure App Service infrastructure hosting our platform. Each instance of our App Service is allocated a finite pool of SNAT ports, which Azure uses to manage outbound network connections. When demand for those ports exceeds what's available — even briefly — new connections are queued or dropped until ports are freed.

Two factors converged to create the condition:
Scaling behavior — An auto-scale rule reduced our instance count during a period of declining usage, as designed. However, usage began climbing again shortly afterward, and it rose faster than the platform could provision additional instances to compensate. With fewer instances running, the combined SNAT port pool available to absorb the increased connection demand was smaller.
Shared infrastructure pressure — Our App Service runs on an Azure stamp (a regional cluster of shared infrastructure) that also serves other Microsoft customers unrelated to DPS. A significant uptick in activity across that stamp created substantial additional pressure on available port capacity — and it was this pressure that ultimately pushed the environment over the edge into exhaustion.
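
To make the scaling factor concrete, here is a rough sketch of how the combined SNAT port pool shrinks when instances are removed. The numbers are illustrative assumptions, not measurements from this incident: 128 ports per instance is the preallocated figure described in Azure's App Service documentation, and the instance counts and connection demand are hypothetical. The model also ignores pressure from other tenants on the stamp.

```python
# Illustrative model only. All figures are assumptions, not incident data.
PORTS_PER_INSTANCE = 128  # assumed preallocated SNAT ports per App Service instance


def snat_pool(instances: int, ports_per_instance: int = PORTS_PER_INSTANCE) -> int:
    """Combined SNAT ports available across all running instances."""
    return instances * ports_per_instance


def is_exhausted(instances: int, outbound_connections: int,
                 ports_per_instance: int = PORTS_PER_INSTANCE) -> bool:
    """True when concurrent outbound connections exceed the combined pool."""
    return outbound_connections > snat_pool(instances, ports_per_instance)


# Hypothetical scenario: demand of 300 concurrent outbound connections.
print(snat_pool(4), is_exhausted(4, 300))  # before scale-down
print(snat_pool(2), is_exhausted(2, 300))  # after scale-down
```

In this simplified model the same demand fits within four instances but exceeds the pool once the service has scaled down to two, which is the shape of the problem described above.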

What We Did

Microsoft was engaged and implemented mitigation steps on their end to help restore normal port availability. We also worked with them to identify configuration and architectural improvements on our side.

What We're Changing
Scaling policy adjustment — We are updating our auto-scale rules to be less aggressive about scaling down during active hours. Keeping additional instances running provides a larger SNAT port pool and reduces our exposure to demand spikes that follow a dip in usage.
Connection pooling improvement — We have identified a code-level optimization that will reduce the number of SNAT ports consumed by implementing more efficient connection pooling. This improvement is being developed and will be delivered as a platform update.
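
As a general illustration of why pooling helps, the sketch below reuses a bounded set of connections instead of opening a new one per request. Each reused connection holds one SNAT port rather than consuming a fresh one. This is not the platform's actual code, which is not shown in this report; the class name `ConnectionPool`, the `factory` callable, and the `max_size` parameter are hypothetical, and Python is used only to keep the example short.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical pool that hands out at most max_size reusable connections."""

    def __init__(self, factory, max_size):
        self._factory = factory          # callable that opens one new connection
        self._max_size = max_size        # upper bound on connections (and ports) held
        self._idle = queue.LifoQueue()   # connections waiting to be reused
        self._created = 0
        self._lock = threading.Lock()

    def _acquire(self, timeout):
        # Prefer an idle connection so no new outbound port is needed.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        # Open a new connection only while under the cap.
        with self._lock:
            if self._created < self._max_size:
                conn = self._factory()
                self._created += 1
                return conn
        # At the cap: wait for another caller to return a connection.
        return self._idle.get(timeout=timeout)

    @contextmanager
    def connection(self, timeout=None):
        conn = self._acquire(timeout)
        try:
            yield conn
        finally:
            # A real pool would also discard broken connections here.
            self._idle.put(conn)
```

With a pool like this, a burst of sequential requests shares a single connection, and concurrent requests are capped at `max_size` connections rather than growing with traffic.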

What to Expect Next
We will continue to monitor closely and adjust our scaling configuration as needed. We are treating this as a priority to ensure it does not recur.

Regarding the connection pooling update: when this change is ready for your BETA environment, we will need your prompt cooperation in testing and signing off so we can move quickly to production and realize the full benefit of the fix. We'll reach out with specifics as that work progresses — timing is still being assessed and could range from a quick turnaround to a few days of development work.

This type of condition has never happened in the 11 years we've been operating in Microsoft Azure, but it is something we are now adjusting for.

Posted Jun 15, 2026 - 20:48 EDT

Identified

We have identified the issue as socket errors occurring across multiple customer Azure tenants, including tenants owned and managed by customers rather than by DPS.

We are seeing that the Azure App Service is sometimes not responding at all, so requests never reach code execution. The website calls the API and the API's web server does not answer: "System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."

This is outside of DPS control and we are escalating to Microsoft.

Posted Jun 15, 2026 - 14:00 EDT

Investigating

We are currently investigating an issue where multiple customer ATOL and other hosted websites are reporting a 500 Internal Server Error message. No code or database change has happened, and we are gathering information to escalate to Microsoft, as these errors seem to be related to a socket error that occurs before any of our code or customer-provided code is involved. The random nature of the errors makes the cause harder to pinpoint and makes it likely that the problem lies in the provided infrastructure.

This issue is degrading performance for some customers but others are fine. Even within the same customer, some users are fine.

Posted Jun 15, 2026 - 13:52 EDT

This incident affected: DPS AdTracker®:Cloud Platform Services (AdTracker:Online (ATOL) Websites, Customer Provided Systems / Other) and Microsoft Azure + Office Cloud Services (DPSCLOUD - USA-based Hosting).