Thursday, January 14, 2010

Random disk corruption during reboot

This one's been driving me crazy for more than a year. Randomly, during a clean reboot, about half my systems will fail to come back up. I'll do a 'shutdown -r'; the system will start again, but then I'll be told that /tmp (and it's always /tmp) needs to be checked manually:

/tmp:  UNEXPECTED INCONSISTENCY:  RUN fsck MANUALLY

and I get the dreaded

Give root password for maintenance (or type Control-D to continue)

So I'll go to the console and type 'fsck -y', because I really don't care what the errors are (nobody but Ted Ts'o could fix them by hand anyway). It finds two or three errors, and I'm back in business.


Not a big deal if I'm on site, but a real pain if I need to reboot remotely and the system won't come up by itself.  It's not reproducible, and darn near impossible to troubleshoot.


I've decided to try two things. The first is a suggestion from Mr. Ts'o in another context: add the 'sync' option to the /tmp entry in /etc/fstab. By default, ext3 commits its journal to disk every 5 seconds; the sync option makes writes to that filesystem synchronous, so they go to disk immediately. Since it's only /tmp, I don't think it'll have a horrible effect on the system. We'll see.
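
For reference, here's roughly what the fstab change looks like, plus a small helper for checking that a box actually has it. This is only a sketch: the device name /dev/sda5 and the function name has_mount_option are made up for illustration, so substitute whatever your own fstab uses.

```shell
#!/bin/sh
# Hypothetical fstab line for /tmp with synchronous writes
# (the device name /dev/sda5 is an assumption; use your own):
#
#   /dev/sda5  /tmp  ext3  defaults,sync  1 2

# has_mount_option FSTAB MOUNTPOINT OPTION
# Succeeds if MOUNTPOINT is listed in FSTAB with OPTION among its options.
# (Hypothetical helper, not part of any standard tool.)
has_mount_option() {
    awk -v mp="$2" -v opt="$3" '
        $1 ~ /^#/ { next }                 # skip comment lines
        $2 == mp {
            n = split($4, opts, ",")       # 4th field is the option list
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

Something like `has_mount_option /etc/fstab /tmp sync || echo "/tmp is missing sync"` makes it easy to check a handful of machines after editing them by hand.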


The other is to always remember to run the 'sync' command before shutting down - it'll flush the buffers to the disk.  Since I'll never remember that, I'll add the command


/bin/sync


to the /etc/init.d/halt file.



