
Title

AOT, JIT, and PGO in .NET

Description

The latest video by Nick Chapsas has a more-than-usually clickbait-y headline. The "big" problem that NativeAOT has is that it's 4% slower at runtime than the JIT-compiled version.

<media href="https://www.youtube.com/watch?v=gJcPqdbKF90" src="https://www.youtube.com/v/gJcPqdbKF90" source="YouTube" width="560px" author="Nick Chapsas" caption="NativeAOT in .NET 8 Has One Big Problem">

That doesn't seem like such a big problem to me, when the point of AOT is to improve cold-start times for applications launched on-demand. For that use-case, AOT shines. It's over 4x faster on startup than the JIT-compiled version. It's incredibly impressive that JIT compilation takes less than 1/10 of a second, but it's still 4x slower than AOT.

<img src="{att_link}aot_vs_jit.jpg" href="{att_link}aot_vs_jit.jpg" align="none" caption="Spider graph of AOT vs. JIT" scale="50%">

So, you get the app started 4x faster, but it then performs 4% more slowly than the non-AOT version. It really depends on the use-case, but it's great for the common one of starting a server to answer a function call---think Azure Functions or AWS Lambdas---and then shutting down again, possibly immediately.

<a href="https://www.linkedin.com/in/damianpedwards/">Damian P Edwards</a> (Principal Architect at Microsoft) commented on the post,

<bq>[There are a] few things that cause the slightly lower performance in native AOT apps right now. First (in apps using the web SDK) is <b>the new DATAS Server GC mode.</b> This new GC mode uses far less memory than traditional ServerGC by dynamically adapting memory use based on the app's demands, but in this <b>1st generation it impacts the performance slightly.</b> The goal is to remove the performance impact and enable DATAS for all Server GC apps in the future. Second is CoreCLR in .NET 8 has <b>Dynamic PGO enabled by default, which allows the JIT to recompile hot methods with more aggressive optimizations</b> based on what it observes while the app is running. Native AOT has static PGO with a default profile applied and by definition can never have Dynamic PGO. Thirdly, <b>JIT can detect hardware capabilities (e.g. CPU intrinsics) at runtime</b> and target those in the code it generates. Native AOT however defaults to a highly compatible target instruction set which won't have those optimizations but you can specify them at compile time based on the hardware you know you're going to run on. Running the tests in [the] video with DATAS disabled and native AOT configured for the target CPU could improve the results slightly.</bq>

To summarize:

<ol> <div>The DATAS GC mode is in use for AOT, but is still being fine-tuned.</div> <div>An AOT-compiled app cannot benefit from <i>dynamic</i> <a href="https://en.wikipedia.org/wiki/Profile-guided_optimization">PGO</a>. It benefits from <i>static</i> PGO, but cannot recompile itself on-the-fly because it doesn't have a JIT compiler to do so. The JIT-compiled app can dynamically recompile what it observes as performance hotspots with more highly optimized code. I wrote a bit about how Safari does something similar for JavaScript in <a href="{app}/view_article.php?id=3057">Optimizing compilation and execution for dynamic languages</a>---although for JavaScript, dynamic recompilation is sometimes necessary for backing out of an incorrect assumption about what type a variable is going to have.</div> <div>As well, a JIT-compiled app can take actual hardware capabilities into account, while an AOT-compiled app necessarily targets a static hardware profile. The generic hardware profile is going to be extremely conservative about capabilities because if it assumes a capability that doesn't exist, the app simply won't run. Choosing a hardware profile for AOT that matches the target hardware would boost performance; I've included a sketch of that configuration, along with a couple of related examples, at the end of this entry.</div> </ol>

I guess that was more of a rephrasing than a summary. Anyway, another commenter asked,

<bq>[...] would it be possible in the future for a JIT application with Dynamic PGO that has run for a while and has made all kinds of optimizations to then create a "profile" of sorts that could be used by the Native AOT compiler to build an application that is both fast in startup time and highly optimized for a given workload?</bq>

Yes. That should be possible. It's unclear what sort of extra performance boost this would give, especially if you'd already fine-tuned the target hardware profile---which is the first thing you should do. I could imagine adding this sort of profiling as a compilation step, though. Still, you always have to be careful whenever you're running something in production that is different from what you've tested. We put a lot of faith in the JIT and dynamic PGO, don't we?

I also wanted to note that, at the end of the video, Chapsas showed Microsoft's numbers, which confirm the performance drop, <i>but also show an over 50% reduction in working set!</i> Dude! How do you not mention that!? The app uses less than half of the memory and runs almost as fast? Yes, please! That's a huge win for people paying for cloud-based services.

For once, I'm somewhat surprised to see how naive Nick's take is---that a 4% drop in performance is at all significant, especially when the "slow" version is still processing 50,000 requests per second in a performance-constrained environment. He did mention a trade-off, but was very excited to tell people that AOT is <i>slower</i> at runtime. There are always trade-offs and you should be very aware of the actual non-functional requirements for your application before you decide whether to use a technology or not. For 99.9% of applications, the 4% drop in performance vis-à-vis a JIT-compiled version won't be the deciding factor. When it's accompanied by a working set that's only half the size, it becomes an even more attractive option.
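If you want to try Damian Edwards's suggestion of disabling DATAS and building for a known CPU, the project settings would look something like the following. This is only a sketch: I'm going from memory on the .NET 8 property names (PublishAot, IlcInstructionSet, and GarbageCollectionAdaptationMode), so verify them against the documentation before relying on them.

<code>
<!-- Sketch of a Native AOT project configured for a known target CPU,
     with the new DATAS GC mode switched off for comparison.
     Property names are from memory for .NET 8; double-check the docs. -->
<PropertyGroup>
  <PublishAot>true</PublishAot>

  <!-- Target a newer instruction-set baseline instead of the highly
       compatible default (x86-64-v3 enables AVX2, among other things). -->
  <IlcInstructionSet>x86-64-v3</IlcInstructionSet>

  <!-- Disable the dynamic-adaptation (DATAS) Server GC mode to rule out
       its first-generation performance impact. -->
  <GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>
</PropertyGroup>
</code>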
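On the dynamic-PGO side, you can see how much of the JIT build's advantage comes from re-optimizing hot methods at runtime by switching it off for a comparison run; it's on by default in .NET 8. Again, a sketch, using the property name as I remember it.

<code>
<!-- For the JIT-compiled comparison build only: dynamic PGO is on by
     default in .NET 8; turning it off shows how much it contributes to
     steady-state throughput. -->
<PropertyGroup>
  <TieredPGO>false</TieredPGO>
</PropertyGroup>
</code>

You should also be able to flip it at launch with the DOTNET_TieredPGO environment variable instead of rebuilding.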
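Finally, here's a small illustration of the hardware-capabilities point. The AddFour method below is a purely hypothetical helper that asks whether AVX2 is available: the JIT answers that question on the machine it's actually running on and folds the check away, whereas a Native AOT binary built for the conservative default baseline can't assume AVX2 unless you opt in at compile time (as in the instruction-set setting above).

<code>
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class IntrinsicsDemo
{
    // Illustrative helper: adds 4 to x, using AVX2 when it's available.
    static int AddFour(int x)
    {
        if (Avx2.IsSupported)
        {
            // Under the JIT, IsSupported is a constant for the current CPU and
            // the untaken branch is dropped entirely. Under default Native AOT,
            // AVX2 isn't part of the baseline, so this path is only usable if
            // the compile-time instruction set allows it.
            Vector256<int> v = Avx2.Add(Vector256.Create(x), Vector256.Create(4));
            return v.GetElement(0);
        }

        // Portable fallback that runs anywhere.
        return x + 4;
    }

    static void Main() => Console.WriteLine(AddFour(38)); // prints 42
}
</code>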