Toub’s 232-page tour-de-force on performance in .NET 10

Published by marco on

The book-length Performance Improvements in .NET 10 by Stephen Toub (Microsoft DevBlogs) arrived a couple of months ago.

He explains how the various compilers (AOT, JIT, etc.) have been optimized to eliminate allocations and to generate faster code in general. A reduction in allocations is a multi-win: performance improves because the allocator does less work, memory usage drops, and the garbage collector has less to do.

See previous coverage in:

Toub’s 234-page tour-de-force on performance in .NET 9 (2024)
Somehow, I never documented .NET 8. Huh.
Performance Improvements in .NET 7 (2022).

A presentation at .NET Build 2025

If you prefer a 30-minute video, then you’re in luck.

Performance Improvements in .NET 10 by dotnet | Stephen Toub (YouTube)

He compares .NET Framework 4.8 vs. .NET 9 vs. .NET 10. The most impressive improvements are from 4.8 to 9.0, of course, but he highlights some interesting places where .NET 10 eclipses .NET 9, where .NET 9 had already eclipsed .NET Framework 4.8.

The last example shows how regular expressions have been continually optimized so that an operation that took 24ms in .NET Framework 4.8 was improved by about 10x to 2.5ms in .NET 9 but has been further improved by about 62,500x to about 40ns in .NET 10.

Citations and Notes

And now, on to the citations from Toub’s book along with my notes.

He starts off with a bit of history and context in the wider world.

“What made Tudor’s ice last halfway around the world wasn’t one big idea. It was a plethora of small improvements, each multiplying the effect of the last. In software development, the same principle holds: big leaps forward in performance rarely come from a single sweeping change, rather from hundreds or thousands of targeted optimizations that compound into something transformative. .NET 10’s performance story isn’t about one Disney-esque magical idea; it’s about carefully shaving off nanoseconds here and tens of bytes there, streamlining operations that are executed trillions of times.”

“As with many languages, .NET historically has had an “abstraction penalty,” those extra allocations and indirections that can occur when using high-level language features like interfaces, iterators, and delegates. Each year, the JIT gets better and better at optimizing away layers of abstraction, so that developers get to write simple code and still get great performance. .NET 10 continues this tradition. The result is that idiomatic C# (using interfaces, foreach loops, lambdas, etc.) runs even closer to the raw speed of meticulously crafted and hand-tuned code.”

JIT

“If the compiler can prove an object doesn’t escape, then that object’s lifetime is bounded by the method, and it can be allocated on the stack instead of on the heap. Stack allocation is much cheaper (just pointer bumping for allocation and automatic freeing when the method exits) and reduces GC pressure because, well, the object doesn’t need to be tracked by the GC. .NET 9 had already introduced some limited escape analysis and stack allocation support; .NET 10 takes this significantly further.”
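To make the idea of a non-escaping object concrete, here is a minimal sketch. The names Point and SumOfSquares are hypothetical and not taken from Toub's text, and whether the JIT actually stack-allocates the object depends on the runtime version and its heuristics.

```csharp
using System;

int total = 0;
for (int i = 0; i < 3; i++)
{
    total += EscapeSketch.SumOfSquares(i, i + 1);
}
Console.WriteLine(total);

static class EscapeSketch
{
    public static int SumOfSquares(int x, int y)
    {
        // 'p' is never stored in a field, returned, or handed to other code,
        // so its lifetime is bounded by this method: it does not escape.
        // That makes it a candidate for stack allocation (not a guarantee).
        var p = new Point(x, y);
        return p.X * p.X + p.Y * p.Y;
    }
}

// Hypothetical helper type used only for this illustration.
sealed class Point
{
    public Point(int x, int y) { X = x; Y = y; }
    public int X { get; }
    public int Y { get; }
}
```

The source code stays the same either way; the point is that a plain `new` inside a method like this no longer has to cost a heap allocation.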

“[…] where things get interesting is around what the JIT is able to devirtualize. In .NET 9, it struggles to devirtualize calls to the interface implementations specifically on T[], so it won’t devirtualize either the _list.GetEnumerator() call or the _list[index] call. However, the enumerator that’s returned is just a normal type that implements IEnumerator<T>, and the JIT has no problem devirtualizing its MoveNext and Current members. Which means that we’re actually paying a lot more going through the indexer, because for N elements, we’re having to make N interface calls, whereas with the enumerator, we only need the one GetEnumerator interface call and then no more after that.”

To be clear: this has been addressed in .NET 10, so that the indexer is also almost always devirtualized.
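The two access patterns being compared look roughly like the sketch below. The class and method names are my own, and the comments describe what the quote says about .NET 9, not something this snippet measures.

```csharp
using System;
using System.Collections.Generic;

// An array exposed through its IList<int> interface, as in the quoted scenario.
IList<int> list = new[] { 1, 2, 3, 4 };
Console.WriteLine(LoopSketch.SumWithIndexer(list));
Console.WriteLine(LoopSketch.SumWithEnumerator(list));

static class LoopSketch
{
    public static int SumWithIndexer(IList<int> list)
    {
        int sum = 0;
        // Each list[i] is an interface call on the array; per the quote,
        // .NET 9 struggles to devirtualize these, so N elements cost N interface calls.
        for (int i = 0; i < list.Count; i++)
        {
            sum += list[i];
        }
        return sum;
    }

    public static int SumWithEnumerator(IList<int> list)
    {
        int sum = 0;
        // One GetEnumerator interface call; MoveNext and Current on the returned
        // enumerator are the calls the JIT can devirtualize, per the quote.
        foreach (int value in list)
        {
            sum += value;
        }
        return sum;
    }
}
```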

“dotnet/runtime#110827 from @hez2010 also helps more methods to be inlined by doing another pass looking for opportunities after later phases of devirtualization. The JIT’s optimizations are split up into multiple phases; each phase can make improvements, and those improvements can expose additional opportunities. If those opportunities would only be capitalized on by a phase that already ran, they can be missed. But for phases that are relatively cheap to perform, such as doing a pass looking for additional inlining opportunities, those phases can be repeated once enough other optimization has happened that it’s likely productive to do so again.”

“The static readonly field is immutable, arrays can’t be resized, and the JIT can guarantee that the field is initialized prior to generating the code for Read. Therefore, when generating the code for Read, it can know with certainty that the array is of length three, and we’re accessing the element at index two. Therefore, the specified array index is guaranteed to be within bounds, and there’s no need for a bounds check.”

The JIT has been doing these kinds of optimizations for a long time, but the number of cases in which it can “prove” that a bounds check is unnecessary increases with each release.
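The shape of code the quote describes is approximately the following. This is a reconstruction from the prose (a static readonly array of length three, read at index two); the names Lookup and s_values are assumptions.

```csharp
using System;

Console.WriteLine(Lookup.Read());

static class Lookup
{
    // Immutable reference to an array whose length can never change.
    private static readonly int[] s_values = { 10, 20, 30 };

    // Once the field is initialized, the JIT can know the length is 3,
    // so index 2 is provably in range and the bounds check can be dropped.
    public static int Read() => s_values[2];
}
```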

“My choice of benchmark in this case was not coincidental. This pattern shows up in the FormattingHelpers.CountDigits internal method that’s used by the core primitive types in their ToString and TryFormat implementations, in order to determine how much space will be needed to store rendered digits for a number. As with the previous example, this routine is considered core enough that it was using unsafe code to avoid the bounds check. With this fix, the code was able to be changed back to using a simple span access, and even with the simpler code, it’s now also faster.”

“Many of these different optimizations interact with each other. Dynamic PGO triggers a form of cloning, as part of the guarded devirtualization (GDV) mentioned earlier: if the instrumentation data reveals that a particular virtual call is generally performed on an instance of a specific type, the JIT can clone the resulting code into one path specific to that type and another path that handles any type. That then enables the specific-type code path to devirtualize the call and possibly inline it. And if it inlines it, that then provides more opportunities for the JIT to see that an object doesn’t escape, and potentially stack allocate it. dotnet/runtime#111473, dotnet/runtime#116978, dotnet/runtime#116992, dotnet/runtime#117222, and dotnet/runtime#117295 enable that, enhancing escape analysis to determine if an object only escapes when such a generated type test fails (when the target object isn’t of the expected common type).”

This led to several dozen performance-test improvements across the board when the PR landed. The whole section boils down to the JIT optimization working not only for regular loops and enumerable loops, but also for hand-unrolled code with multiple array accesses (where bounds checks can now be elided using clever cloning).

Inlining

“[…] generally the most benefit from inlining comes from knock-on benefits. Just as a simple example, if you have code like:”
int i = Divide(10, 5);

static int Divide(int n, int d) => n / d;
“if Divide doesn’t get inlined, then when Divide is called, it’ll need to perform the actual idiv, which is a relatively expensive operation. In contrast, if Divide is inlined, then the call site becomes:”
int i = 10 / 5;
“which can be evaluated at compile time and becomes just:”
int i = 2;

“Just inlining everything would be bad; inlining copies code, which results in more code, which can have significant negative repercussions. For example, inlining’s increased code size puts more pressure on caches. Processors have an instruction cache, a small amount of super fast memory in a CPU that stores recently used instructions, making them really fast to access again the next time they’re needed (such as the next iteration through a loop, or the next time that same function is called).”

“As part of these heuristics, the JIT has the notion of “boosts,” where observations it makes about things methods do boost the chances of that method being inlined. dotnet/runtime#114806 gives a boost to methods that appear to be returning new arrays of a small, fixed length; if those arrays can instead be allocated in the caller’s frame, the JIT might then be able to discover they don’t escape and enable them to be stack allocated. dotnet/runtime#110596 similarly looks for boxing, as the caller could possibly instead avoid the box entirely.”

Code Layout

“When the JIT compiler generates assembly from the IL emitted by the C# compiler, it organizes that code into “basic blocks,” a sequence of instructions with one entry point and one exit point, no jumps inside, no branches out except at the end. These blocks can then be moved around as a unit, and the order in which these blocks are placed in memory is referred to as “code layout” or “basic block layout.” This ordering can have a significant performance impact because modern CPUs rely heavily on an instruction cache and on branch prediction to keep things moving fast. If frequently executed (“hot”) blocks are close together and follow a common execution path, the CPU can execute them with fewer cache misses and fewer mispredicted jumps.”

“Consider a tight loop executed millions of times. A good layout keeps the loop entry, body, and backward edge (the jump back to the beginning of the body to do the next iteration) right next to each other, letting the CPU fetch them straight from the cache. In a bad layout, that loop might be interwoven with unrelated cold blocks (say, a catch block for a try in the loop), forcing the CPU to load instructions from different places and disrupting the flow. Similarly, for an if block, the likely path should generally be the next block so no jump is required, with the unlikely branch behind a short jump away, as that better aligns with the sensibilities of branch predictors.”

GC Write Barriers

“Whenever there’s a reference write that could cross a generation, the JIT emits a call to a helper that tracks the information in a “card table,” and when the GC runs, it consults this table to see if it needs to scan a portion of the higher generations. That helper is referred to as a “GC write barrier.” Since a write barrier is potentially employed on every reference write, it must be super fast, and in fact the runtime has several different variations of write barriers so that the JIT can pick one optimized for the given situation. Of course, the fastest write barrier is one that doesn’t need to exist at all, so as with bounds checks, the JIT also exerts energy to try to prove when write barriers aren’t needed, eliding them when it can. And it can even more in .NET 10.”

Miscellaneous

“As with most compilers, the JIT employs common subexpression elimination (CSE) to find identical computations and avoid doing them repeatedly. dotnet/runtime#106637 teaches the JIT how to do so in a more consistent manner by more fully integrating CSE with its Static Single Assignment (SSA) representation. This in turn allows for more optimizations to kick in, e.g. some of the strength reduction done around loop induction variables in .NET 9 wasn’t applying as much as it should have, and now it will.”

I just love how Toub manages to keep up his excitement so deep into this document. He’s really a great writer.

Native AOT

“Native AOT [Ahead Of Time [compilation]] is the ability for a .NET application to be compiled directly to assembly code at build-time. The JIT is still used for code generation, but only at build time; the JIT isn’t part of the shipping app at all, and no code generation is performed at run-time. As such, most of the optimizations to the JIT already discussed, as well as optimizations throughout the rest of this post, apply to Native AOT equally.”

VM

“With dotnet/runtime#114462, the runtime now uses a single shared “template” for many of the small executable “stubs” it needs at runtime; stubs are tiny chunks of machine code that act as jump points, call counters, or patchable trampolines. Previously, each memory allocation for stubs would regenerate the same instructions over and over. The new approach builds one copy of the stub code in a read-only page and then maps that same physical page into every place it’s needed, while giving each allocation its own writable page for the per-stub data that changes at runtime. This lets hundreds of virtual stub pages all point to one physical code page, cutting memory use, reducing startup work, and improving instruction cache locality.”

Threading

“If a thread is blocked on an operation that depends on work items in that thread’s local queue getting processed, that work item being picked off now depends on the global queue being exhausted and another thread coming along and stealing the work item from this thread’s queue. If there’s a steady stream of incoming work into the global queue, though, that will never happen; essentially, the highest priority work item has become the lowest priority work item.

“So, back to these PRs. The idea is fairly simple: when the thread is about to block, and in particular when it’s about to block waiting on a Task, it first dumps its entire local queue into the global queue. That way, this work which was highest priority for the blocked thread has a fairer chance of being processed by other threads, rather than it being the lowest priority work for everyone.”

“dotnet/runtime#107843 from @hamarb123 adds two new methods to the Volatile class: ReadBarrier and WriteBarrier. A read barrier has “load acquire” semantics, and is sometimes referred to as a “downward fence”: it prevents instructions from being reordered in such a way that memory accesses below/after the barrier move to above/before it. In contrast, a write barrier has “store release” semantics, and is sometimes referred to as an “upwards fence”: it prevents instructions from being reordered in such a way that memory accesses above/before the barrier move to below/after it.”

“These barriers are referred to as “half fences”; the read barrier prevents later things from moving earlier, but not the other way around, and the write barrier prevents earlier things from moving later, but not the other way around. (As it happens, though, while not required by specification, today the implementation of lock does use a full barrier on both enter and exit, so nothing before or after a lock will move into it.)”

Reflection

“System.Net.Http sits above System.Security.Cryptography, referencing it for critical features like X509Certificate. But System.Security.Cryptography needs to be able to make HTTP requests in order to download OCSP information, and with System.Net.Http referencing System.Security.Cryptography, System.Security.Cryptography can’t in turn explicitly reference System.Net.Http. It can, however, use reflection or [UnsafeAccessor] and [UnsafeAccessorType] to do so, and it does. It used to use reflection, now in .NET 10 it uses [UnsafeAccessor].”
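For readers who haven't used [UnsafeAccessor] yet, here is a small sketch of the mechanism, assuming .NET 8 or later. Counter and Accessors are hypothetical types invented for this example; the real code in System.Security.Cryptography targets types in another assembly, which this sketch does not attempt to reproduce.

```csharp
using System;
using System.Runtime.CompilerServices;

var counter = new Counter();
Accessors.Count(counter) = 41;   // write the private field through the returned ref
Accessors.Increment(counter);    // call the private method
Console.WriteLine(Accessors.Count(counter));

static class Accessors
{
    // The runtime binds this extern method directly to the private field '_count'.
    [UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_count")]
    public static extern ref int Count(Counter target);

    // The runtime binds this extern method to the private instance method 'Increment'.
    [UnsafeAccessor(UnsafeAccessorKind.Method, Name = "Increment")]
    public static extern void Increment(Counter target);
}

// Hypothetical type with private members that are otherwise unreachable.
sealed class Counter
{
    private int _count;
    private void Increment() => _count++;
}
```

Unlike reflection, there is no MethodInfo lookup or Invoke at each call; the access is resolved by the runtime and then behaves like a normal call.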

Primitives and Numerics

“dotnet/runtime#111505 from @alexcovington enables TensorPrimitives.Divide<T> to be vectorized for int. The operation already supported vectorization for float and double, for which there’s SIMD hardware-accelerated support for division, but it didn’t support int, which lacks SIMD hardware-accelerated support. This PR teaches the JIT how to emulate SIMD integer division, by converting the ints to doubles, doing double division, and then converting back.”

That fix, roundabout as it sounds, ends up making that operation 4x faster. You don’t use this, you say? Well, are you sure? Are you sure that there is no code in, say, handshake negotiation that needs to divide multiple integers in parallel? Because this is such a low-level primitive, it is exactly the kind of improvement that, as noted in Toub’s introduction, leads to smoother operation in many other places.
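The convert-divide-convert trick can be sketched in scalar form as follows. This is only an illustration of why the approach gives correct results, not the vectorized JIT implementation; DivideViaDouble is a hypothetical helper, and it assumes nonzero divisors and excludes int.MinValue / -1, which overflows in integer arithmetic.

```csharp
using System;

int[] numerators = { 7, -7, 100, int.MaxValue, int.MinValue };
int[] denominators = { 2, 2, 7, 3, 5 };
int[] results = new int[numerators.Length];

DivideSketch.DivideViaDouble(numerators, denominators, results);
Console.WriteLine(string.Join(", ", results));

static class DivideSketch
{
    public static void DivideViaDouble(ReadOnlySpan<int> x, ReadOnlySpan<int> y, Span<int> destination)
    {
        for (int i = 0; i < destination.Length; i++)
        {
            // Every 32-bit int is exactly representable as a double, and the
            // cast back to int truncates toward zero, just like integer '/'.
            destination[i] = (int)((double)x[i] / y[i]);
        }
    }
}
```

In the real implementation the same three steps run on whole vectors at once, which is where the speedup comes from.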

“We can then reuse those methods to do the same thing that’s already done for scalar operations but do it vectorized: take a vector of Halfs, convert them all to floats, process all the floats, and convert them all back to Halfs. Of course, I already stated that the vector types don’t support Half, so how can we “take a vector of Half”? By reinterpret casting the Span<Half> to Span<short> (or Span<ushort>), which allows us to smuggle the Halfs through. And, as it turns out, even for scalar, the very first thing Half’s float cast operator does is convert it to a short.

“The net result is that a ton of operations can now be accelerated for Half.”

These optimizations improve performance for processing Half in dozens of operations by 11x.
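The reinterpret cast and the widen-process-narrow pattern from the quote can be sketched like this. It is a scalar illustration under my own naming (HalfSketch, DoubleInPlace), not the vectorized TensorPrimitives code.

```csharp
using System;
using System.Runtime.InteropServices;

Half[] halves = { (Half)1.0f, (Half)(-2.0f), (Half)0.5f };

// Reinterpret the same memory as 16-bit integers; no data is copied.
Span<ushort> bits = MemoryMarshal.Cast<Half, ushort>(halves.AsSpan());
Console.WriteLine($"0x{bits[0]:X4}");

HalfSketch.DoubleInPlace(halves);
Console.WriteLine((float)halves[0]);

static class HalfSketch
{
    public static void DoubleInPlace(Span<Half> values)
    {
        for (int i = 0; i < values.Length; i++)
        {
            float widened = (float)values[i];   // Half -> float
            values[i] = (Half)(widened * 2f);   // compute as float, then float -> Half
        }
    }
}
```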

“with C# 14, it’s possible for a type to not only define a + operator, it can also define a += operator. If a type defines a += operator, it will be used rather than expanding a += b as shorthand for a = a + b. And that has performance ramifications.

“[…] that means that such compound operators on the tensor types can just update the target tensor in place rather than allocating a whole new (possibly very large) data structure for each computation. dotnet/runtime#117997 adds all of these compound operators for the tensor types. (Not only are these using C# 14 user-defined compound operators, they’re doing so as extension operators, using the new C# 14 extension types feature. Fun!)”

Collections

“[…] as noted earlier in the JIT section, the JIT has been gaining super powers around dynamic PGO, escape analysis, and stack allocation. This means that in many situations, the JIT is now able to see that the most common concrete type for a given call site is a specific enumerator type and generate code specific to when it is that type, devirtualizing the calls, possibly inlining them, and then, if it’s able to do so sufficiently, stack allocating the enumerator. With the progress that’s been made in .NET 10, this now happens very frequently for arrays and List<T>. While the JIT is able to do this in general regardless of an object’s type, the ubiquity of enumeration makes it all that much more important for IEnumerator<T>, so dotnet/runtime#116978 marks IEnumerator<T> as an [Intrinsic], giving the JIT the ability to better reason about it.”

“For shorter lists, dynamic PGO will see MoveNextRare invoked a reasonable number of times, and will consider it for inlining. And if all of the calls to the enumerator are inlined, the enumerator instance can avoid escaping the call frame, and can then be stack allocated. But once the list length grows to a much larger amount, that MoveNextRare method will start to look really cold, will struggle to be inlined, and will then allow the enumerator instance to escape, preventing it from being stack allocated.”

“While OSR is awesome, it unfortunately causes some complications here. Once the list gets long enough, an invocation of the tier 0 (unoptimized) method will transition to the OSR optimized method… but OSR methods don’t contain dynamic PGO instrumentation (they used to, but it was removed because it led to problems if the instrumented code never got recompiled again and thus suffered regressions due to forever-more running with the instrumentation probes in place). Without the instrumentation, and in particular without the instrumentation for the tail portion of the method (where the enumerator’s Dispose method is invoked), even though List<T>.Dispose is a nop, the JIT may not be able to do the guarded devirtualization that enables the IEnumerator<T>.Dispose to be devirtualized and inlined. Meaning, ironically, that the nop Dispose causes escape analysis to see the enumerator instance escape, such that it can’t be stack allocated. Whew.

“[…] Specifically for enumerators, this PR enables dynamic PGO to infer the missing instrumentation based on the earlier probes used with the other enumerator methods, which then enables it to successfully devirtualize and inline Dispose.”

“Labels A and B form a loop, but that loop can be entered by jumping to either A or to B. If the compiler could prove that this loop were only ever enterable from A or only ever enterable from B, then the loop would be “reducible.” Irreducible loops are much more complex than reducible loops for a compiler to deal with, as they have more complex control and data flow and in general are harder to analyze. dotnet/runtime#116949 rewrites the MoveNext method to be a more typical while loop, which is not only easier to read and maintain, it’s also reducible and more efficient, and because it’s more streamlined, it’s also inlineable and enables possible stack allocation.”

This results in a 7x performance improvement when iterating a list of integers.

There are also a ton of optimizations in LINQ, for Contains (with 10x–400x improvements), Fill (40x), Shuffle (2x–40x), LeftJoin, and RightJoin (2x). There are also specific improvements for many of the base collection types.

IO

The next section on IO is also interesting, with one case where they didn’t actually change any code but instead introduced an analyzer that discourages using the EndOfStream property in asynchronous code, where it can lead to pathological cases in which the calling thread blocks synchronously until more data arrives.
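A common way to avoid EndOfStream in async code is to loop on the result of ReadLineAsync, which returns null at the end of the input. The sketch below shows that pattern; LineCounter is a hypothetical helper, and the in-memory stream is only there to make the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

using var stream = new MemoryStream(Encoding.UTF8.GetBytes("first\nsecond\nthird"));
using var reader = new StreamReader(stream);

int count = await LineCounter.CountLinesAsync(reader);
Console.WriteLine(count);

static class LineCounter
{
    public static async Task<int> CountLinesAsync(TextReader reader)
    {
        int lines = 0;
        // Avoided pattern: while (!reader.EndOfStream) { ... }
        // That property may have to read synchronously to answer the question.
        // ReadLineAsync returns null once the input is exhausted.
        while (await reader.ReadLineAsync() is not null)
        {
            lines++;
        }
        return lines;
    }
}
```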

Searching / Regular Expressions

This section includes a longer discussion about the improvements included in previous versions of .NET, especially as it relates to avoiding backtracking. There are normalized forms of regular expressions that incur no backtracking penalty and can thus be evaluated with the faster version of the regular-expression engine that doesn’t have to account for it.

Here’s an example that I’ve lifted up from much further down in this section.

“Given the pattern ^abc|^abd, the code generators would end up emitting this exactly as it’s written, with an alternation with two branches, the first branch checking for the beginning and then matching “abc”, the second branch also checking for the beginning and then matching “abd”. Now in .NET 10, the anchor can be factored out, such that ^abc|^abd ends up being rewritten as ^ab[cd].”
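A quick way to convince yourself that the rewrite is safe is to run both forms against the same inputs. The input strings below are arbitrary examples of mine.

```csharp
using System;
using System.Text.RegularExpressions;

string[] inputs = { "abc", "abd", "abe", "xabc", "abcd" };

foreach (string input in inputs)
{
    bool original = Regex.IsMatch(input, "^abc|^abd"); // pattern as written
    bool factored = Regex.IsMatch(input, "^ab[cd]");   // anchor and common prefix factored out
    Console.WriteLine($"{input}: {original} / {factored}");
}
```

Both columns agree for every input, which is what allows the engine to substitute the cheaper form.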

The idea here is to search for pathological formulations for which there is a non-pathological equivalent and automatically use that version under the hood. That is my interpretation of the following rather-dense section.

“Consider a pattern a*b. a*b is observably identical to (?>a*)b, which says that the a* should not be backtracked into. That’s because there’s nothing the a* can “give back” (which can only be as) that would satisfy what comes next in the pattern (which is only b). It’s thus valid for a backtracking engine to transform how it processes a*b to instead be the equivalent of how it processes (?>a*)b. And the .NET regex engine has been capable of such transformations since .NET 5. This can result in massive improvements to throughput. With backtracking, waving my hands, we effectively need to execute everything after the backtracking construct for each possible position we could backtrack to. So, for example, with \w*SOMEPATTERN, if the \w* successfully initially consumes 100 characters, we then possibly need to try to match SOMEPATTERN up to 100 different times, as we may need to backtrack up to 100 times and re-evaluate SOMEPATTERN each time we give back one of the things initially matched. If we instead make that (?>\w*), we eliminate all but one of those! That makes improvements to this ability to automatically transform backtracking constructs to be non-backtracking possibly massive improvements in performance, and practically every release of .NET since .NET 5 has increased the set of patterns that are automatically transformed. .NET 10 included.”
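The claimed equivalence of a*b and (?>a*)b is easy to check by hand. This sketch compares the two patterns on a few inputs of my choosing; it demonstrates that the results are identical, not the performance difference.

```csharp
using System;
using System.Text.RegularExpressions;

string[] inputs = { "aaab", "b", "aaa", "caab", "aaac" };

foreach (string input in inputs)
{
    Match plain = Regex.Match(input, "a*b");       // greedy loop that may backtrack
    Match atomic = Regex.Match(input, "(?>a*)b");  // atomic group: never gives back an 'a'
    Console.WriteLine($"{input}: {plain.Success} '{plain.Value}' / {atomic.Success} '{atomic.Value}'");
}
```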

There are several detailed examples of 5x–6x improvements in performance for relatively common-looking regular expressions. Stephen Toub loves writing about very specific regular-expression examples. Like, each paragraph is a blog post just on its own. Needless to say, this section is, at the same time, fascinating, extremely detailed, and eminently uncitable (because it would just entail citing pages of detail, all of which is necessary to understand the optimization). The improvements are impressive and incredibly well described. Go check out that section if you like regular expressions and mathematical analysis (equivalence of expressions, reduction of solution space). The additional beauty is that the regular-expression evaluators are all source-generated C#, so it’s much, much easier to evaluate what’s going on than with the assembly-level listings in the JIT section, for example.

As a final example, here is the level of holistic analysis we’re talking about.

“Unfortunately, the helper that emits that IndexOf call was passed the wrong node from the pattern: it was being passed the object representing the (?:.|\n) any-set rather than the “*/” literal, which resulted in it emitting the equivalent of IndexOfAnyInRange((char)0, ‘\uFFFF’) rather than the equivalent of IndexOf(“*/”). Oops. It was still functionally correct, in that the IndexOfAnyInRange call would successfully match the first character and the loop would re-evaluate from that location, but that means that rather than efficiently skipping using SIMD over a bunch of positions that couldn’t possibly match, we were doing non-trivial work for each and every position along the way.”

As in the IO section above, some of the optimizations come in the form of analyzers that recommend an optimization that the user can apply rather than something that the runtime can do automatically.

“[…] the .NET 10 SDK includes a new analyzer related to Regex. It’s oddly common to see code that determines whether an input matches a Regex written like this: Regex.Match(…).Success. While functionally correct, that’s much more expensive than Regex.IsMatch(…). For all of the engines, Regex.Match(…) requires allocating a new Match object and supporting data structures (except when there isn’t a match found, in which case it’s able to use an empty singleton); in contrast, IsMatch doesn’t need to allocate such an instance because it doesn’t need to return such an instance (as an implementation detail, it may still use a Match object, but it can reuse one rather than creating a new one each time).”
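In code, the pattern the analyzer flags and its replacement look like this. The regular expression and input string are arbitrary examples.

```csharp
using System;
using System.Text.RegularExpressions;

var regex = new Regex(@"\d{4}-\d{2}-\d{2}");
string input = "order 2024-01-31 shipped";

// Flagged pattern: builds a Match object only to read one boolean from it.
bool viaMatch = regex.Match(input).Success;

// Preferred: same answer without handing a Match object back to the caller.
bool viaIsMatch = regex.IsMatch(input);

Console.WriteLine($"{viaMatch} {viaIsMatch}");
```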

MemoryExtensions

“These overloads all parallel existing methods, but remove the IEquatable<T> (or IComparable<T>) constraint on the generic method parameter and accept an optional IEqualityComparer<T>? (or IComparer<T>). When no comparer or a default comparer is supplied, they can fall back to using the same vectorized logic for relevant types, and otherwise can provide as optimal an implementation as they can muster, based on the nature of T and the supplied comparer.”

This part is very interesting because you see how the improvements to MemoryExtensions led to SearchValues being faster, which, in turn, led to methods like Normalize and Contains being faster (especially when working with strings that are automatically treated as Spans wherever possible).

JSON

A good method to know is JsonArray.RemoveAll(), which accepts a predicate lambda that selects the elements to remove. If, instead of looping over the items and calling RemoveAt(n), you write _arr.RemoveAll(static n => n!.GetValue<int>() % 2 == 0), you get a huge performance benefit because RemoveAll() adjusts the underlying buffer only once rather than on each call to remove each individual item.
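The same single-pass idea can be illustrated with List<T>.RemoveAll, which has been around for a long time. Using List<int> in place of the JSON array type is an assumption made so that the sketch compiles on older runtimes; RemovalSketch is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var viaRemoveAt = Enumerable.Range(0, 10).ToList();
var viaRemoveAll = Enumerable.Range(0, 10).ToList();

RemovalSketch.RemoveEvensOneByOne(viaRemoveAt);
viaRemoveAll.RemoveAll(static n => n % 2 == 0); // one compaction pass over the buffer

Console.WriteLine(viaRemoveAt.SequenceEqual(viaRemoveAll));

static class RemovalSketch
{
    public static void RemoveEvensOneByOne(List<int> list)
    {
        // Walk backwards so a removal does not shift items we have yet to visit.
        for (int i = list.Count - 1; i >= 0; i--)
        {
            if (list[i] % 2 == 0)
            {
                list.RemoveAt(i); // every call shifts all following items down by one
            }
        }
    }
}
```

Both lists end up with the same contents; the difference is how many times the tail of the buffer is moved along the way.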

“With JSON being used as an encoding for many modern protocols, streaming large JSON payloads has become very common. And for most use cases, it’s already possible to stream JSON well with System.Text.Json. However, in previous releases there wasn’t a good way to stream partial string properties; string properties had to have their values written in one operation. If you’ve got small strings, that’s fine. If you’ve got really, really large strings, and those strings are lazily-produced in chunks, however, you ideally want the ability to write those chunks of the property as you have them, rather than needing to buffer up the value in its entirety. dotnet/runtime#101356 augmented Utf8JsonWriter with a WriteStringValueSegment method, which enables such partial writes. […] These modern protocols often transmit large blobs of binary data within the JSON payloads. Typically, these blobs end up being Base64 strings as properties on some JSON object. Today, outputting such blobs requires Base64-encoding the whole input and then writing the resulting bytes or chars in their entirety into the Utf8JsonWriter. To address that, dotnet/runtime#111041 adds a WriteBase64StringSegment method to Utf8JsonWriter.”

Cryptography

“A ton of effort went into cryptography in .NET 10, almost entirely focused on post‑quantum cryptography (PQC). PQC refers to a class of cryptographic algorithms designed to resist attacks from quantum computers, machines that could one day render classic cryptographic algorithms like Rivest–Shamir–Adleman (RSA) or Elliptic Curve Cryptography (ECC) insecure by efficiently solving problems such as integer factorization and discrete logarithms. With the looming threat of “harvest now, decrypt later” attacks (where a well-funded attacker idly captures encrypted internet traffic, expecting that they’ll be able to decrypt and read it later) and the multi-year process required to migrate critical infrastructure, the transition to quantum‑safe cryptographic standards has become an urgent priority. In this light, .NET 10 adds support for ML-DSA (a National Institute of Standards and Technology PQC digital signature algorithm), Composite ML-DSA (a draft Internet Engineering Task Force specification for creating signatures that combine ML-DSA with a classical crypto algorithm like RSA), SLH-DSA (another NIST PQC signature algorithm), and ML-KEM (a NIST PQC key encapsulation algorithm).”

Conclusion

Overall, this is another amazing document (a book, really) that is edited to an incredibly high quality. I didn’t notice any grammatical errors, formatting errors, or typos, except maybe a missing `?` on IComparer<T> in “These overloads all parallel existing methods, but remove the IEquatable<T> (or IComparable<T>) constraint on the generic method parameter and accept an optional IEqualityComparer<T>? (or IComparer<T>).” and the hyphen in “frequently-requested” (the hyphen is only correct with adjectives, not adverbs).