Optimize by keeping only the code you need
Published by marco on
In the video Context is Everything by Andreas Fredriksson (Vimeo), the author pinpoints that a dependency in his app—a JSON-handling library—is sucking all the performance out of it.
So, he takes a look at it.
It’s a general-purpose library, with a lot of edge cases…edge cases that his input data doesn’t have. That is, if he can guarantee a certain context, then he can use an optimized version of the JSON library’s code. This isn’t always going to be the solution—it will, in fact, rarely be the solution for a LOB app for which every line of maintenance is a burden—but, when you’re making something with performance constraints, it’s good to be able to think like this.[1]
He takes the original JSON library and profiles it. Then he starts to excise the slow bits—bits his app doesn’t need anyway. This gets him impressive performance boosts.
First, he gets it to be 2x faster with a simple linear fix (removing unneeded branches), then boosts it to over 11x faster by using a mixed-parsing mode.
Another profile shows that a function called isspace() is taking up 45% of the processing time now. He trims that down to just handle the whitespace characters his file might actually contain. He also ditches the locale check that happened every single time.
17x faster now.
OK. What else can we do? Ah, we could observe that the data doesn’t have to contain spaces at all! That is, instead of parsing the spaces as they come along, you can use a SIMD-based solution combined with a LUT (Look-Up Table) to normalize the input data before you even parse it. He uses a quick-and-dirty Perl script to build the LUT.
22x faster now.
That performance improvement alone is 5x more than the original speed of the parser.
- We just removed a bunch of poorly predicted branches, nothing else
- Low-level thinking = not paying for things you don’t need
- Low-level thinking = partition work in hardware-friendly ways
“[…]
“We didn’t change any of the behavior of the program. All we did was we separated these two passes in a way that was friendly for the hardware. We moved branches from being in the integer control flow to being inside masks in the SIMD flow.”
The next step is to reexamine what “white space” actually is: he reinterprets it to mean anything that’s not a printable character, which allows him to optimize the mask even further.
29x faster.
Over 1GB/s of throughput.
Are we done? Bitch, please.
He moves on to two more levels of optimization that still bring good-sized gains, but at the cost of more complexity. They also contain more assumptions but that’s OK if the assumptions will always be correct. You want to stop optimizing when it makes sense for your use case. If you’re writing code for a very tight loop on some low-level hardware—or in a game where the budget per frame is a maximum of 16ms—then it might be very important: you might end up saving incredible amounts of time for your users; you might be using a lot less power.
Solve the right problem
- Ask the right questions
- Consider the liabilities and overall economics of your approach
Consider the unique context and the potentially massive wins
- Generic means “not tuned for your use case”
- Don’t be afraid to look inside