Published by marco on
The long and technical article Files are hard by Dan Luu discusses several low-level and scholarly analyses of how common file-systems and user-space applications deal with read/write errors.
- How theoretically consistent is the file system?
- How well-documented are patterns that guarantee consistency?
- How well-understood are these patterns in the communities using them?
- How do common applications (e.g. source control, databases, etc.) use these patterns?
- Are these applications guaranteeing consistency?
- What about the file-system designs? Are those airtight?
- Are the file-system implementations correct?
- How do the various components deal with hardware degradation or failure?
Asynchronous programming is hard
File-system operations work with devices and are thus asynchronous by nature. The analyses discovered ordering issues similar to those found in multi-threaded code.
“The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn’t treated the same way, even though it’s actually harder in a number of ways.”
This is why most applications should use a framework or runtime support to access the file system. Even this might not be enough, though, if the implementation is still not robust enough for the application requirements. The .NET runtime has for quite a while now offered an API that uses async/await (i.e. a promise/future-based API), which at the very least indicates the asynchronous nature of these calls, with separate paths for success and error. This is better than nothing, even if the implementation occasionally fails to properly propagate errors (as we see with the POSIX APIs below).
At any rate, the article drives home the point that programming against file systems is hard.
“People almost always just run some tests to see if things work, rather than making sure they’re coding against what’s legal in a POSIX filesystem.”
Having a few tests is better than nothing, but it’s even better to hoist your code up as many levels of abstraction as possible and avoid having to know about how to interleave fsync calls at all. Unless you’re writing a database or a source-control system, right?
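To make the ordering problem concrete, here is a minimal Python sketch of the kind of sequence a wrapper might hide: write to a temporary file, flush it, rename it over the target, then flush the directory. The function name `write_file_durably` is hypothetical, and the sketch assumes a POSIX-style file system; how much durability each step actually buys depends on the file system and the drive, as the articles make clear.

```python
import os
import tempfile


def write_file_durably(path, data):
    """Replace `path` with `data` so readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the final
    # rename never crosses a file-system boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push the data to the device
        os.replace(tmp_path, path)  # swap the new file in under the final name
    except BaseException:
        # Something failed before the swap completed: drop the temporary file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    if os.name == "posix":
        # The rename is only durable once the directory entry is flushed too.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Callers would use `write_file_durably("settings.json", payload)` and never see the individual steps, which is the point: the ordering knowledge sits in one small place.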
A common problem: documentation
Luu goes on to discuss “how much misinformation is out there” and notes that “it’s hard for outsiders to troll through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted.”
This is a common problem that applies not just to low-level systems programming, but to any other programming problem. We have a surfeit of choice: just search online and you’ll find something that matches what you searched.
- Is the source authoritative?
- Is the source even competent?
- Is the source relevant? Or just kind of related?
- Is the source current? Or outdated?
- Are you in an echo chamber that feels authoritative but is just a bunch of low-skill developers at a local maximum when the real answer to your problem is elsewhere and is actually much more elegant?
I recently ran into this phenomenon when learning Docker. Docker has changed and improved so much that the Internet is literally littered with old and overly complicated solutions to problems that either no longer exist or that can be solved with a simple one-liner in a configuration file. If you follow the instructions you find online, it’s possible that you’ll have something that works the way you want it to, but it’s also very likely that you’ll end up with a Frankenstein’s Monster of a setup that kind of works but is fragile in unnecessary ways.
Drives are not infallible
From the article:
“So far, we’ve assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via SMART or some other kind of monitoring. I’d always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.”
That sounds bad, of course. It’s not something we user-space programmers ever really think about, is it? You read from a file, you write to a file, it works, right? And if it doesn’t work (super-rare, right?), then the runtime throws an exception.
If we assume that the runtime throws an exception, we’re also assuming that the runtime is notified when an error occurs during a read or write operation. This was, apparently, not the case (at least in 2005-2008; we’ll see improvements below).
“In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn’t happen here” in places where errors weren’t handled. […] NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.”
Ignoring write failures! That’s kind of incredible, but if you’ve ever relied heavily on NTFS, you know that there are bugs in it. Sometimes files are just mysteriously locked and inaccessible until the system is rebooted. Why does the problem go away on reboot? NTFS is journaled and can recover its data, but it needs to be unmounted and checked. Instead of panicking, the file system ignores the write error.
“At this point, we know that it’s quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it.”
Replicating the results
The papers referenced in the first article are quite old (a decade or more) but the conclusions are still fascinating. Luu discusses the need for replicating the study and laments that “replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.”
Happily, Luu followed up with another post, called File-system error-handling that reproduces some of the original results with the 2017 versions of the file systems. This is an interesting study in its own right, discussing in detail interesting nuggets like the fact that “apfs doesn’t checksum data because “[apfs] engineers contend that Apple devices basically don’t return bogus data”.” (from APFS in Detail: Data Integrity).
The second article concludes that “Filesystem error handling seems to have improved.” Basic write errors are now propagated to user-space wherever possible (i.e. if the drive is not dead). However, “[m]ost filesystems don’t have checksums for data and leave error detection and correction up to userspace software.” This is probably something that most user-space software developers never think about, but it’s crucially important. Does your software assume that the file system will always throw an error? Or does it “just assume[…] that filesystems and disks don’t have errors”?
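If detection is left to user space, an application that cares can store a checksum next to its data and verify it on every read. The following Python sketch assumes a made-up layout (a SHA-256 digest followed by the payload) and hypothetical names `seal`, `unseal` and `CorruptFileError`; it detects corruption but does nothing to repair it.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


class CorruptFileError(Exception):
    """The stored checksum does not match the stored payload."""


def seal(payload):
    # Prefix the payload with its digest before it is written to disk.
    return hashlib.sha256(payload).digest() + payload


def unseal(blob):
    # Split the blob again and recompute the digest over what was read back.
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    if len(digest) < DIGEST_SIZE or hashlib.sha256(payload).digest() != digest:
        raise CorruptFileError("stored checksum does not match the payload")
    return payload
```

The application would write `seal(data)` and call `unseal` on whatever it reads back, turning silently bogus data into an explicit error.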
Abstract it away!
The first article concludes with a citation from Butler Lampson:
“Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box.”
This is generally a good approach for anything complicated: programmers should use as high-level an API as possible for a given task. Problems like security, memory-allocation, file-system access, networking, asynchronous/parallel programming…these all fall into that category. Generally, the advice is, as usual, to get your requirements, make components that satisfy those requirements and include automated tests that verify that the components will continue to satisfy the requirements.
As Lampson says, don’t write code that’s beyond you—get a “wizard” to write it instead. That’s what most of us do when we use the runtime provided with our language.
The best you can usually do is to abstract away access to external systems (including the file system) so that you can improve behavior later, should it be required. The budget and reliability constraints of a project don’t always allow you to program perfectly safely. What you can do is to make sure that the system can be made safer later with a reasonable amount of effort. To be clear: don’t be unnecessarily sloppy, but don’t tank your project guaranteeing NASA-level safety where it’s not needed.
So what does that mean? If you’re programming on .NET, it means you should probably stay away from some constructs that you’ve previously considered safe and not worth wrapping, like Directory. Instead of using these directly, use them from an injected service. This level of abstraction is not difficult to enforce if introduced early in a project and will allow for improved testing anyway. If the filesystem is abstracted, components will no longer need their tests to actually write out files in order to work.
As discussed above, this isn’t to say that you jeopardize your deadline to abstract away every single file-system reference. For some applications, file-system access is so intrinsic as to be un-mockable (e.g. databases, source-control, etc.). However, your application is probably not one of those. It’s likely that your application reads/writes files in a highly localizable manner that could be wrapped in a simple component.
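As a rough illustration of that kind of wrapper, here is a sketch in Python rather than .NET. Every name in it (`FileStore`, `DiskFileStore`, `InMemoryFileStore`, `SettingsRepository`) is hypothetical; the idea is only that the component depends on a small interface, so a test can swap in an in-memory store or simulate a read failure.

```python
import json
from pathlib import Path
from typing import Protocol


class FileStore(Protocol):
    """The only file-system surface the application code is allowed to see."""

    def read_text(self, name: str) -> str: ...
    def write_text(self, name: str, text: str) -> None: ...


class DiskFileStore:
    """Production implementation backed by a directory on disk."""

    def __init__(self, root):
        self._root = Path(root)

    def read_text(self, name):
        return (self._root / name).read_text(encoding="utf-8")

    def write_text(self, name, text):
        (self._root / name).write_text(text, encoding="utf-8")


class InMemoryFileStore:
    """Test double that can also simulate a failing drive."""

    def __init__(self, fail_on_read=False):
        self.files = {}
        self.fail_on_read = fail_on_read

    def read_text(self, name):
        if self.fail_on_read:
            raise OSError("simulated read failure")
        if name not in self.files:
            raise FileNotFoundError(name)
        return self.files[name]

    def write_text(self, name, text):
        self.files[name] = text


class SettingsRepository:
    """Application component: receives its store instead of touching disk."""

    def __init__(self, store):
        self._store = store

    def load(self):
        return json.loads(self._store.read_text("settings.json"))

    def save(self, settings):
        self._store.write_text("settings.json", json.dumps(settings))
```

A test constructs `SettingsRepository(InMemoryFileStore())`, while production code passes a `DiskFileStore`. Hardening the disk implementation later does not touch the component.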
This advice is similar to the by-now common practice of not using the global DateTime.UtcNow. How can this be a problem? Well, if code uses an IClock component instead, then tests can adjust “now” to be a point in the past or future and test scheduling components more easily. It’s an easy pattern to follow in new code that pays for itself the first time you need to reproduce a timing problem.
At the end of the second article, there’s an interesting discussion of how to avoid these kinds of bugs, or just bugs in general.
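The clock pattern can be sketched in a few lines. This is Python rather than .NET, and the names `Clock`, `SystemClock`, `FakeClock` and `is_due` are stand-ins chosen for illustration, not an existing API.

```python
from datetime import datetime, timedelta, timezone
from typing import Protocol


class Clock(Protocol):
    def utc_now(self) -> datetime: ...


class SystemClock:
    """Production clock: simply asks the operating system."""

    def utc_now(self):
        return datetime.now(timezone.utc)


class FakeClock:
    """Test clock: "now" is whatever the test says it is."""

    def __init__(self, now):
        self.now = now

    def utc_now(self):
        return self.now

    def advance(self, delta):
        self.now += delta


def is_due(clock, due_at):
    # Scheduling logic asks the injected clock instead of the global time.
    return clock.utc_now() >= due_at
```

A test can then create a `FakeClock`, call `advance(timedelta(days=1))` and check that a job becomes due, without sleeping or patching globals.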
“There’s a very old debate over how to prevent things like this from accidentally happening.”
Better “tools or processes”? Be “better programmers”? Are tools like guardrails? Does it make sense to keep driving, bashing back and forth across the road, but happy that the guardrails are keeping us on the road at all? Would you do that in a car?
But, yes, if that’s the best option? What’s the other option? Just stop the car and don’t go anywhere anymore? Or get out and walk?
That analogy has been beaten to death—and I don’t think it’s very appropriate (as you can see from my discussion about abstraction above). Tools and processes are better than nothing. Proper programming practices and patterns are, as well. If you train yourself to use tried-and-true patterns, then you automatically avoid common errors.
- Use a language with static type-checking
- Abstract away interfaces to the system
- Use non-nullable references wherever possible
- Use immutable data wherever possible
- Segregate mutable data into dumb objects
The point isn’t to be able to say that “there are no bugs”; it’s to be able to say that “these tested bugs won’t happen”. The point is to use practices that avoid whole classes of problems.
What are better tools?
“Even better than a static analysis tool would be a language that makes it harder to accidentally forget about checking for an error.”
And now we come to the justification for some of the newer languages out there. Rust is such a language, which attempts to fix many of the shortcomings of C and C++ in the domain of allocating, sharing, modifying and freeing memory.
For error-handling, the article The Error Model by Joe Duffy discusses a very interesting and promising approach taken by a Microsoft Research team with Midori, a research operating system written almost entirely in managed code. The basic insight is to separate bugs from recoverable errors and unrecoverable errors.
A bug is something the user-space application did wrong (e.g. passing a null reference to a method that expects only non-null references). A recoverable error is a validation error encountered when processing user input. An unrecoverable error is a file-read error in a base configuration file or a stack overflow or an out-of-memory error.
For almost all software, file-system errors are something that should just be considered an unrecoverable error. There is no reason why most applications should attempt to continue when e.g. the main configuration cannot be loaded. Most applications don’t even need to be able to recover from that. The problem occurs so rarely that you should just get a file out of backup.
Lower-level applications like Git or PostgreSQL have to take more care to deal with file-system errors, but your software most likely doesn’t need to handle them. As discussed above, be aware that they can happen, abstract your code from the file-system so you can test error situations and improve handling where needed, but fail fast unless your project has a requirement to be able to recover in error conditions.
Generally, no one expects a user-space application to include robust file-recovery. It’s expected, though, that the application detects when something is wrong and reports it, failing fast rather than just limping along and corrupting data.
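The three categories can be kept apart even in a language without Midori's error model. The Python sketch below is only an approximation under assumed names (`ValidationError`, `FatalError`, `parse_port`, `load_base_config`, `endpoint`, `main`): bad input is reported so the user can retry, a broken configuration file stops the program with a clear message, and a violated precondition is treated as a bug rather than handled.

```python
import json
import sys


class ValidationError(Exception):
    """Recoverable: the input was bad and the user can be asked again."""


class FatalError(Exception):
    """Unrecoverable: the application cannot sensibly continue."""


def parse_port(text):
    # Recoverable error: bad user input is expected and reported politely.
    if not (text.isascii() and text.isdigit()) or not 1 <= int(text) <= 65535:
        raise ValidationError(f"not a valid port: {text!r}")
    return int(text)


def load_base_config(path):
    # Unrecoverable error: without its base configuration the application
    # has nothing to fall back on, so no repair is attempted here.
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, ValueError) as error:
        raise FatalError(f"cannot load configuration {path!r}: {error}") from error


def endpoint(host, port):
    # Bug: callers must never pass None. The assertion flags a programming
    # error and is deliberately not caught anywhere.
    assert host is not None, "bug: host must not be None"
    return f"{host}:{port}"


def main(config_path, port_text):
    try:
        config = load_base_config(config_path)
        port = parse_port(port_text)
    except ValidationError as error:
        print(f"please try again: {error}")
        return 2
    except FatalError as error:
        # Fail fast: report the problem and stop instead of limping along.
        print(f"fatal: {error}", file=sys.stderr)
        return 1
    print("listening on", endpoint(config["host"], port))
    return 0
```

The exit codes are arbitrary. What matters is that the file-system failure ends the run with a report, while the input error leaves room for another attempt.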
NULL bytes after certain catastrophic operations.