So, how does this work on Linux?
For the past few days I have been talking about our findings with regard to creating an ACID storage solution. I’ve mostly been focusing on how it works on Windows, using Windows-specific terms and APIs.
The problem is that I am not sure if those are still relevant if we talk about Linux. I know that fsync perf is still an issue (if only because both Win & Lin are running on the same hardware). But would the same solutions apply?
For example, the nearest that I can find to FILE_FLAG_NO_BUFFERING is O_DIRECT and FILE_FLAG_WRITE_THROUGH appears to be similar to O_SYNC. But I am not sure if they are actually behaving in the same fashion.
Any ideas? Does anyone have something like Process Monitor for Linux, and can look at the actual commit behavior of industry-grade databases?
From my exploring, it appears that PostgreSQL uses fdatasync() as the default approach, but it can use O_DIRECT and O_DSYNC as well, so that is promising. But I would like someone who actually knows Linux intimately to tell me whether I am even heading in the right direction.
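To make the comparison concrete, here is a rough sketch of the two commit styles as I currently understand them: a buffered write followed by an explicit fdatasync(), versus opening the file with O_DSYNC so that each write() only returns once the data has been flushed. It uses Python's os module, which is a thin wrapper over the Linux system calls. The function names and the append-only layout are my own illustration, not how PostgreSQL or any other database actually structures its log, and O_DIRECT is left out because it needs aligned buffers.

```python
import os

def commit_with_fdatasync(path, payload):
    """Buffered write, then an explicit fdatasync() to flush the file data."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # A real implementation should loop here: write() may be partial.
        os.write(fd, payload)
        os.fdatasync(fd)  # flushes data, skips metadata not needed to read it back
    finally:
        os.close(fd)

def commit_with_o_dsync(path, payload):
    """Open with O_DSYNC: write() itself returns only after the data is flushed."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        # No separate sync call; same partial-write caveat as above.
        os.write(fd, payload)
    finally:
        os.close(fd)
```

Both variants should leave the same bytes on disk. The difference is where the flush cost is paid, which is exactly what I would want to measure.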
Comments
It's a bit more involved. Which file system are you using? Ext3?
Other things: do you have noatime and nodiratime set? File systems on Linux are far more varied and configurable than on Windows. In general, though, O_DIRECT is what you are looking for.
Cheers,
Greg
Use strace to see what system calls are issued on linux.
You may be interested in how Sybase does it:
http://www.sybase.com/content/1043413/DirectIO-082906-wp.pdf
fio is a very nice tool for read/write benchmarks, and has many 'engines', which are just methods of writing to the disk (sync, libaio, mmap, ...).
http://git.kernel.dk/?p=fio.git;a=tree
Check out the pg_test_fsync in PostgreSQL contrib modules:
http://www.postgresql.org/docs/devel/static/pgtestfsync.html
In general, if you can arrange for the data to be written with single write() calls, use O_DSYNC. Add O_DIRECT if you know that your writes are aligned and that you won't need to read the data back afterwards (PostgreSQL, for example, does need to: replication and WAL archiving get the WAL by reading it back from the OS).
Beware that with kernel <2.6.33 or glibc <2.12, O_SYNC actually means O_DSYNC. You need explicit fsync() calls for metadata operations like creating new files.