What I would like is a file open option for "create replacement file".
The idea is that this makes a new inode on the same file system as the target filename, with no directory entry of its own, and on close it replaces the directory entry of the specified path with the new file.
Why?
There are many situations where you want to make a new file to replace an existing file, but want the change from old to new file to be atomic at the file system level.
Something reading the half-written new file is bad.
So you make a new temporary file, and then finally use rename() to replace the original. But rename() only works within a single file system (it fails with EXDEV otherwise), so you cannot simply use mktemp() in /tmp, as the target file may not be on the same mount point as /tmp. So you make a file next to the target with a dot prefix and some suffix or something. Messy. And it needs cleaning up if you crash.
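For reference, the temp-file-and-rename dance described above looks roughly like this in Python. It is only a sketch: the function name `write_replacing` and the `.tmp` suffix are made up for illustration, and it assumes a POSIX system where `os.replace()` maps onto rename().

```python
import os
import tempfile


def write_replacing(path, data):
    """Write data to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target's own directory so the final
    # rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(prefix=".", suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # rename() swaps the directory entry atomically.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup. A hard crash (e.g. SIGKILL) never gets here,
        # so the dot-file can still be left behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that mkstemp() creates the file with mode 0600, so a real version would probably also want to copy the permissions of the file being replaced.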
Easy?
It really would not be hard, I am sure, for the underlying file system to support this as a file open mode. The atomic replacement of a directory entry is already a thing in rename(), and a file that is not in any directory but uses space while open is easy: make a file and unlink it before closing. So the underlying mechanisms for this exist.
One caveat, which would be a really useful extra feature: if the file is closed without a "clean" close() call, i.e. it is closed because the code aborted or exited without calling close(), it would *not* replace the original, and the new file would simply be lost as it is not in a directory. This covers the crash case, and inherently always cleans up.
This would be so useful.
And, of course, make gcc use this for making binaries!!!
Just to be clear, I am not suggesting "buffering" the whole file. The mechanism for having an open file that is not in a directory already exists: one can create a file, open it, unlink it, and still write to it, on "disk". That is the "buffering" here. Then atomically either lose the new file (on a crash, just as would happen with an unlinked file) or, on a clean close() call, replace the directory entry with it and lose the old file.
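The "open file not in a directory" behaviour is easy to see for yourself. A minimal sketch in Python, with a made-up function name, assuming a POSIX file system:

```python
import os
import tempfile


def unlinked_scratch_demo(directory):
    """Create a file, unlink it, and keep using it via the open descriptor."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.unlink(path)              # the directory entry is gone...
        os.write(fd, b"still here")  # ...but the inode can still be written
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)
        return os.path.exists(path), data
    finally:
        os.close(fd)  # last reference dropped, so the space is released
```

Calling it on any writable directory should report that the path no longer exists while the data written after the unlink is still readable through the descriptor.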
You can nearly do it!
Thanks for all the feedback, it is close... The open() man page explains a _GNU_SOURCE-specific option, O_TMPFILE. This allows you to create an unnamed file which will vanish when you close it, even if you crash. It then explains that you can use a slightly convoluted call to linkat() to name the file before you close it. This nearly does the job, but not quite.
- The open() call needs a directory so it knows which file system to use. Annoyingly you cannot pass the filename you want and have it work out the directory. It has to be the directory, meaning you have to faff about getting the directory from the file name. Not a big faff, obviously.
- The linkat() call (with AT_EMPTY_PATH) needs CAP_DAC_READ_SEARCH. There is a convoluted way around that using /proc/ otherwise. More faff. Also, given that there is a documented way around the limitation, why is it dependent on a capability in the first place?!
- The linkat() call expects the new filename not to exist.
This does allow a file to be atomically created as a complete file, with no temp files if you crash. But this last point is the show stopper as it means you have to unlink() first, leaving a small window where the file does not exist. That or you link to a temporary file name and use rename() which puts us almost back where we started, albeit with a smaller window for leaving a temp file behind.
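Putting those pieces together, the "link to a temporary name, then rename()" workaround might look like the following Python sketch. It is Linux only and needs a file system that supports O_TMPFILE. The function name and the temporary naming scheme are invented for illustration. The `dst_dir_fd` argument is there because, as far as I can tell, CPython only goes through linkat() with AT_SYMLINK_FOLLOW (rather than plain link()) when a directory descriptor is supplied.

```python
import os


def replace_via_tmpfile(path, data):
    """Linux only: write an unnamed file, then link and rename it over path."""
    directory, name = os.path.split(os.path.abspath(path))
    tmp_name = ".%s.%d.tmp" % (name, os.getpid())  # hypothetical naming scheme
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # O_TMPFILE takes the directory, not the target file name.
        fd = os.open(directory, os.O_TMPFILE | os.O_WRONLY, 0o644)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(fd)
            # The /proc route: link via the magic symlink for the descriptor,
            # which avoids needing CAP_DAC_READ_SEARCH.
            os.link("/proc/self/fd/%d" % fd, tmp_name,
                    dst_dir_fd=dir_fd, follow_symlinks=True)
        # linkat() refuses to overwrite, so a rename() is still needed.
        os.replace(tmp_name, name, src_dir_fd=dir_fd, dst_dir_fd=dir_fd)
    finally:
        os.close(dir_fd)
```

A crash before the os.link() call leaves nothing behind. The only window for a stray temp file is between the link and the rename, which is the smaller window mentioned above.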
The obvious fix would be a new flag to linkat() to allow replacing the new file. That or allow AT_EMPTY_PATH in renameat2().
it's so close to being supported, open() with O_TMPFILE + linkat()
but sadly linkat() doesn't support replacing the target
What should happen if two processes (maybe owned by different users) used this call? Presumably the first to close would replace the original, then the second to close would do - what?
Second to close would replace it too. Simples.
But the *original* file isn't there any more - it has been replaced! Should the second to close replace the replacement? I suppose this is the more UNIX-like approach. And what if someone removes the mount point before the file is closed? Should missing intermediate directories be re-created?
The original file stays until atomically replaced by the new file on a clean close()
Alas, the reason for the capability requirement on linkat() is that it is perfectly legitimate to pass fhs for files you can't open() (because you don't have search permission for their dir or open permission for them) to other processes, which you can then assume can read and write them but *not preserve them for later*. Real users rely on this behaviour, and if you silently break it you introduce security holes in all those previously innocent programs.
If you have CAP_DAC_READ_SEARCH, you can bypass permissions in any case and reanimate any fh you are passed by hunting them down in the fs and link()ing them, so it's safe to allow you to make links to descriptors too, but otherwise, alas, no.
It would be nice if there was a way to say "this descriptor came from *here*, check if you can get to it with your existing permissions before permitting linkat()", but given that that link might be arbitrarily deep in the fs hierarchy and in a different uid namespace, even thinking about how to implement that is making me come out in hives :(
You can still do it via a tiny helper which has CAP_DAC_READ_SEARCH set on it, which gets passed the fh, reanimates it, and notifies you that it's done that. It's a right drag but it's hard to see how to avoid it :(