"The businesses that succeed are those that view their technical architecture as a core competitive advantage" – Khrystyna Terletska, Senior Software Engineer at DraftKings, on scaling systems for peak performance


– Vanessa Kim

Khrystyna Terletska has built some of the industry's most robust and high-performance systems. As someone who designs fault-tolerant architectures and optimizes systems under extreme loads, Khrystyna's expertise is key during events like the Super Bowl when millions of people place their bets. During this interview, she reveals how DraftKings scales its infrastructure to handle massive traffic surges, ensuring seamless performance and accurate data in real-time.

Khrystyna, the Super Bowl just took place recently. It's safe to assume that the load on DraftKings increases significantly on days like this. Could you walk us through what happens "behind the scenes" during such peak moments? How do you process millions of bets per second?

You're absolutely right — the Super Bowl is our highest-traffic event of the year. During peak moments like kickoff, halftime, and key plays, traffic can increase by 80 to 100 times, from 20,000 concurrent users to over 2 million.

To clarify, we don’t process "millions of bets per second." At our peak, we handle 300k to 400k bet requests per minute, but we simultaneously process over 1 million selection updates per second from our sports data feeds.

Behind the scenes, our .NET microservices architecture handles this load with a two-tier ETL system built on Kafka. Our Main Aggregator ETL consumes data from 40-50 Kafka topics, transforming raw sports data into our canonical event-market-selection model.
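
As a rough illustration of what a single consumer stage in a pipeline like this might look like in a .NET service, here is a minimal sketch using the Confluent.Kafka client. The broker address, topic names, consumer group, and the canonical record type are placeholders for illustration, not DraftKings' actual schema.

```csharp
using System;
using Confluent.Kafka;

// Minimal sketch of one ETL consumer stage: read raw feed messages from
// Kafka topics and map them into a canonical event-market-selection model.
// Topic names, the consumer group, and the record type are illustrative only.
public record CanonicalSelection(string EventId, string MarketId, string SelectionId, decimal Odds);

public class FeedConsumer
{
    public void Run()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",   // placeholder broker address
            GroupId = "main-aggregator-etl",       // hypothetical consumer group
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        using var consumer = new ConsumerBuilder<string, string>(config).Build();
        consumer.Subscribe(new[] { "feed.provider-a", "feed.provider-b" }); // illustrative topics

        while (true)
        {
            var result = consumer.Consume(TimeSpan.FromMilliseconds(100));
            if (result == null) continue;

            // Transform the raw provider payload into the canonical model
            // (parsing is elided; a real pipeline would deserialize the feed format here).
            CanonicalSelection canonical = Transform(result.Message.Value);
            Publish(canonical); // e.g. produce to a downstream topic
        }
    }

    private CanonicalSelection Transform(string rawPayload) =>
        new("event-1", "market-1", "selection-1", 1.95m); // stub for illustration

    private void Publish(CanonicalSelection selection) { /* produce downstream */ }
}
```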

The key to our performance is our stateful architecture — each service instance maintains the full dataset in memory using custom concurrent data structures in C#. MessagePack is used for serialization, reducing payload sizes by 40% compared to JSON and increasing serialization speed.
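
To make that concrete, here is a minimal sketch of a stateful in-memory store with MessagePack serialization, using the MessagePack-CSharp library and a ConcurrentDictionary as a stand-in for the custom concurrent structures Khrystyna mentions. The Selection type and the version-based update rule are illustrative assumptions.

```csharp
using System.Collections.Concurrent;
using MessagePack;

// Sketch of a stateful in-memory selection store with MessagePack used on
// the wire instead of JSON. The Selection type and store shape are assumptions.
[MessagePackObject]
public class Selection
{
    [Key(0)] public string Id { get; set; } = "";
    [Key(1)] public decimal Odds { get; set; }
    [Key(2)] public long Version { get; set; }
}

public class SelectionStore
{
    // Full dataset kept in memory; ConcurrentDictionary stands in for the
    // custom concurrent structures described in the interview.
    private readonly ConcurrentDictionary<string, Selection> _selections = new();

    public void Apply(Selection update) =>
        _selections.AddOrUpdate(update.Id, update,
            (_, current) => update.Version > current.Version ? update : current);

    public byte[] Snapshot(string id) =>
        MessagePackSerializer.Serialize(_selections[id]);   // compact binary payload

    public Selection Restore(byte[] payload) =>
        MessagePackSerializer.Deserialize<Selection>(payload);
}
```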

When a bet comes in, it flows through our ETL pipeline, transforming raw data into our internal format, hitting Kafka event streams, updating in-memory odds calculations in real-time, and confirming the bet back to the user — all within 200 milliseconds. During the Super Bowl, we scaled from 10-15 replicas per service to 200-300 replicas, distributed across seven GCP regions in an active-active configuration.
"The businesses that succeed are those that view their technical architecture as a core competitive advantage" – Khrystyna Terletska, Senior Software Engineer at DraftKings, on scaling system

Given that a real-time betting system cannot afford even a second of downtime, what strategies have you implemented to ensure fault tolerance and uninterrupted operation?

Zero downtime is critical for us — during the Super Bowl, even 30 seconds of downtime could cost millions and damage user trust. To ensure fault tolerance, we use active-active deployments across seven GCP regions with Kafka as our single source of truth. Every ETL service can rebuild its in-memory state by replaying Kafka topics, and we’ve implemented circuit breakers to isolate failures before they can spread. Continuous consistency checking and automatic recalculations ensure that any inconsistencies are quickly addressed without manual intervention. Additionally, we use rolling updates with PodDisruptionBudgets and practice chaos engineering to test our recovery mechanisms. Snapshotting is crucial for debugging, allowing us to dump and analyze the in-memory state at any point to reduce resolution time.
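
As an illustration of the Kafka replay approach she describes, here is a minimal sketch of a service rebuilding its in-memory state on startup by reading a topic from the beginning. The broker address, topic name, and applyToState callback are placeholders, and a production version would detect the end of the backlog more carefully than this loop does.

```csharp
using System;
using Confluent.Kafka;

// Sketch of state recovery by replaying a Kafka topic from the beginning.
// The topic name and the applyToState callback are placeholders.
public class StateRebuilder
{
    public void Rebuild(Action<string> applyToState)
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",          // placeholder broker address
            GroupId = $"rebuild-{Guid.NewGuid()}",        // fresh group so we start from the earliest offset
            AutoOffsetReset = AutoOffsetReset.Earliest,
            EnableAutoCommit = false
        };

        using var consumer = new ConsumerBuilder<string, string>(config).Build();
        consumer.Subscribe("selection-updates");           // illustrative topic

        while (true)
        {
            var result = consumer.Consume(TimeSpan.FromSeconds(1));
            if (result == null) break;                      // end of backlog (simplified check)
            applyToState(result.Message.Value);             // reapply each event to in-memory state
        }
    }
}
```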

What has been the most difficult engineering challenge for you personally in working on such processes?

For me, the most challenging problem was building our real-time state synchronization system across multiple data centers while maintaining consistency guarantees.

The challenge was ensuring that our sports offering data — odds, game states, and player information — remained identical across all seven of our GCP regions. At the same time, each region had to function independently if others went down. This created a fundamental tension between consistency and availability.

What made this particularly difficult was that I had to learn distributed systems theory on the fly while building a production system. I'm referring to consensus algorithms, vector clocks, and conflict resolution strategies, concepts I had only read about in papers but now had to implement for a system processing hundreds of thousands of updates per minute.

The breakthrough came when I realized that perfect consistency wasn't necessary; we just needed bounded inconsistency. Users could tolerate slight differences in odds between regions for a few hundred milliseconds, as long as they eventually converged to the same state.

I ended up designing a system where each region maintains its own authoritative copy of the data. These regions continuously exchange merkle tree hashes to detect and resolve conflicts. When conflicts arise, we use a combination of timestamps and business logic to determine the winning state.
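
A rough sketch of that detect-and-resolve step follows, assuming a simplified bucketing scheme (grouping by key prefix), SHA-256 digests in place of a full merkle tree, and a plain last-write-wins tie-break instead of the business-logic rules described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Rough sketch of cross-region consistency checking: hash buckets of the
// dataset, compare digests with a peer region, and resolve only the buckets
// that differ. Prefix bucketing and last-write-wins are simplifications.
public record RegionRecord(string Key, string Value, DateTimeOffset UpdatedAt);

public static class ConsistencyCheck
{
    public static Dictionary<string, string> BucketHashes(IEnumerable<RegionRecord> records)
    {
        return records
            .GroupBy(r => r.Key[..2])                      // illustrative bucketing by key prefix
            .ToDictionary(
                g => g.Key,
                g => Convert.ToHexString(SHA256.HashData(
                    Encoding.UTF8.GetBytes(string.Join("|",
                        g.OrderBy(r => r.Key).Select(r => $"{r.Key}={r.Value}"))))));
    }

    // Buckets whose digests disagree (or exist in only one region) need reconciliation.
    public static IEnumerable<string> DivergentBuckets(
        Dictionary<string, string> local, Dictionary<string, string> remote) =>
        local.Keys.Union(remote.Keys)
             .Where(k => !local.TryGetValue(k, out var l)
                      || !remote.TryGetValue(k, out var r)
                      || l != r);

    // Simplified tie-break: newest update wins (real resolution would also apply business rules).
    public static RegionRecord Resolve(RegionRecord a, RegionRecord b) =>
        a.UpdatedAt >= b.UpdatedAt ? a : b;
}
```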

The hardest part was debugging issues in this system. When distributed state is spread across seven regions, and conflicts need to be resolved, identifying why a specific piece of data is inconsistent becomes like solving a puzzle with missing pieces.


Looking more broadly at the business level, what are the most common mistakes companies make when scaling systems for peak loads, and how can these be avoided?

From my experience with various teams and companies, the biggest mistake is when leadership treats scaling purely as an engineering challenge. They tell the tech team to "make it faster" without understanding that scaling requires fundamental business decisions about priorities and trade-offs.

Scaling systems for peak loads requires both engineering expertise and business involvement. It's not just about adding more servers, but about making fundamental decisions on architecture and priorities, with product managers and business stakeholders understanding the trade-offs from the outset.

I've seen companies spend months optimizing their existing monolithic systems instead of recognizing they need to completely rebuild their architecture. It's like trying to make a bicycle as fast as a race car — eventually, you need entirely different infrastructure. For example, when we were using our legacy SQL Server setup, management initially wanted us to simply "tune the database" instead of investing in the Kafka-based streaming architecture we actually needed.

Another common issue is companies not investing in proper monitoring until after problems arise. I can't count how many times teams have scrambled during an outage because they didn’t know which part of their system was failing. We learned early on that you need comprehensive dashboards and alerts in place before issues occur, not after.

Why is the architecture of high-load systems becoming increasingly critical, not just in betting, but also in industries like fintech and e-commerce? What universal principles apply across these domains?

The core technical challenges are the same across industries, even though the business logic may differ significantly. Whether you're processing stock trades, managing e-commerce traffic, or handling live betting, the engineering challenges remain: massive concurrent loads, zero tolerance for downtime, and the need for real-time responses.

The universal technical principles we’ve discovered work across all these domains:

  • Event-driven architecture: Instead of services calling each other directly, everything communicates through events. For example, when a stock price changes, a product goes out of stock, or odds shift during a game, an event is triggered that other services react to. This decouples systems, preventing cascading failures. A minimal code sketch of this pattern appears after this list.

  • Eventual consistency: You can’t achieve both perfect consistency and high availability at massive scale, so we accept that data might be temporarily out of sync. For instance, a user’s account balance might take 100 milliseconds to update across all systems, but that’s acceptable if it means the system stays responsive.

  • Elastic scalability: Systems automatically scale up to handle increased load and scale down when it decreases. Whether more users are buying stocks, shopping online, or placing bets, the infrastructure responds the same way.
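
The sketch below is a deliberately simplified, in-process stand-in for the event-driven principle above: producers publish events to a bus and never call consumers directly, so a failing subscriber cannot take the publisher down with it. In production the bus would be Kafka rather than an in-memory list, and the event and handler names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Minimal in-process sketch of event-driven decoupling. Event and handler
// names are illustrative; a real system would publish to Kafka instead.
public record OddsChanged(string SelectionId, decimal NewOdds);

public class EventBus
{
    private readonly List<Action<OddsChanged>> _handlers = new();

    public void Subscribe(Action<OddsChanged> handler) => _handlers.Add(handler);

    public void Publish(OddsChanged evt)
    {
        foreach (var handler in _handlers)
        {
            try { handler(evt); }
            catch (Exception ex)
            {
                // A failing subscriber is isolated instead of propagating back to the publisher.
                Console.Error.WriteLine($"handler failed: {ex.Message}");
            }
        }
    }
}

// Usage: pricing publishes, and risk/UI services react independently.
// var bus = new EventBus();
// bus.Subscribe(e => Console.WriteLine($"risk check for {e.SelectionId}"));
// bus.Subscribe(e => Console.WriteLine($"push {e.NewOdds} to clients"));
// bus.Publish(new OddsChanged("selection-1", 1.85m));
```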

What’s particularly driving this convergence is changing user expectations. Ten years ago, people might have accepted slow website performance during peak times. Now, users expect sub-second responses, regardless of the load. If your trading app is slow during market volatility, users will switch to a competitor. If your e-commerce site crashes on Black Friday, you lose customers permanently.

The technical patterns we use (Kafka for handling millions of events per second, in-memory caching for ultra-fast lookups, circuit breakers to isolate failures, and auto-scaling to manage traffic spikes) aren't specific to betting. They address fundamental distributed systems principles.

Every industry is becoming a technology company because software performance directly impacts business outcomes. A slow trading platform loses money with every delayed transaction. A crashing e-commerce site loses sales. A lagging betting platform loses users to competitors.

The businesses that succeed are those that view their technical architecture as a core competitive advantage, rather than just a support function.

Which trends in real-time data processing do you consider the most promising over the next 3–5 years?

I'm particularly excited about several key trends that will have a significant impact on our high-throughput systems.

For instance, streaming-first architectures are quickly becoming the standard. We're shifting away from traditional batch processing to systems where everything is treated as a stream by default. Apache Kafka's evolution with KRaft mode, which removes the ZooKeeper dependency, is simplifying stream processing infrastructure and making it more resilient.

Another promising trend is AI-powered predictive scaling, which is transforming how we manage workloads. Instead of relying on reactive auto-scaling that responds to current load, we're developing systems that can predict traffic patterns. Our machine learning models can now forecast load spikes with 90% accuracy, based on factors like game situations, player injuries, and even social media sentiment about teams. This enables us to scale proactively, rather than just reacting to the current load.

Finally, native compilation and zero-allocation programming are revolutionizing performance. With .NET's Native AOT, our microservices now start in 200 milliseconds instead of 2 seconds, making auto-scaling far more responsive. Source generators eliminate reflection overhead, and we're achieving near-C++ performance while still maintaining the productivity benefits of C#.
