This page is not meant as a set of instructions. Rather, I hope that someone might learn from my mistakes, which I have tried to document here.
The problem description
My CentOS 4.2 server started showing problems with corrupted mbox files. It appeared that file locking was not working and separate processes were writing to the file at the same time. More specifically, it looked like (but not exactly) a block of data was being written over and over again in odd places. My email client, Thunderbird 18.104.22.168, would see multiple corrupt emails (up to thousands) which had never actually arrived. The top sections of these corrupt emails were almost repeats of one another; however, the bottom sections differed and could be any size. It was as if a slightly different top section was being spliced onto other mails of varying sizes.
It seemed like all real mails were getting through along with and despite this mess.
The problem was intermittent, but had come to happen at least once per day. I went on holidays and did not check my mails for 10 days. The problem did not occur during the break, but did occur when I came back and downloaded the 10 days' worth.
Procmail is my LDA and at first I thought it was something with my .procmailrc, which called clamassassin & spamc before delivering to the mbox.
I joined the procmail mailing list and posted a question. I got some useful tips on how to improve my procmailrc, but no help on the actual problem.
Then I put in procmail rules to dump copies of all emails before processing with clamassassin & spamc and again after processing.
This showed that all proper emails were coming in and going out of procmail as expected and they were not getting mangled there.
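A check along these lines can be sketched as follows. It assumes the "before" and "after" dumps are mbox-format files. The function name and the paths in the usage comment are hypothetical, not the ones used on this server.

```shell
# Count messages in an mbox file by counting "From " separator lines.
# Assumes body lines starting with "From " were escaped as ">From " on
# delivery (the usual mbox convention); otherwise the count is too high.
count_mbox_messages() {
    grep -c '^From ' "$1"
}

# Usage (hypothetical dump paths):
#   count_mbox_messages "$HOME/dump-before"
#   count_mbox_messages "$HOME/dump-after"
```

If the two counts match, and also match the number of mails that really arrived, the mangling is probably happening somewhere after procmail.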
The only other process which normally dealt with the mbox files was my pop3 server, dovecot. (Usermin & Webmin also read and write to the mbox files, but this problem was seen to occur when they were not in action.)
So, I figured the problem had to be with dovecot.
I googled. I did not find anyone with the same problem as me, but I kept seeing people advising an upgrade to dovecot 1.0 or better, as dovecot's mbox support was much improved over 0.99, which is what I had. The problem is that the final version of dovecot for CentOS 4.2 is 0.99.18.
I found some advice on using dotlocks. I configured dovecot to use dotlocks as a first preference. This seemed to fix the problem. It did not occur for about 8 days. But then it came back again.
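For anyone unfamiliar with the term, a dotlock is a lock file named after the mailbox with .lock appended, and whoever manages to create it first holds the lock. The shell sketch below only illustrates that idea, using noclobber so that creating an already existing file fails. It is not how dovecot or procmail implement it, and the function names are hypothetical.

```shell
# Try to take the dotlock for the mbox file given as $1.
# Succeeds only if "$1.lock" did not exist yet.
dotlock_acquire() {
    # noclobber makes ">" refuse to overwrite an existing file;
    # the subshell keeps the option from leaking into the caller.
    ( set -o noclobber; echo "$$" > "$1.lock" ) 2>/dev/null
}

# Give the lock back by removing the lock file.
dotlock_release() {
    rm -f "$1.lock"
}

# Usage (hypothetical mailbox path):
#   if dotlock_acquire /var/mail/someuser; then
#       ... append to the mbox ...
#       dotlock_release /var/mail/someuser
#   fi
```

The scheme only helps if every program touching the mbox agrees to use it, which is why the locking settings of the LDA and the pop3 server need to match.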
I decided I needed to upgrade to dovecot-1.0 which meant I would have to compile it myself.
I downloaded the source from http://dovecot.org/releases/1.0/dovecot-1.0.14.tar.gz
This compiled without any real problems, but it put files into /usr/local/sbin and /usr/local/etc instead of just /usr/sbin and /etc. I went messing with "ln" to fool the /etc/rc.d/init.d/dovecot into finding them. I eventually changed the init script too. I got myself completely confused about which was which and where was where, but just about pieced it together. Anyway, the damn thing wouldn't work. It gave an authentication error which, when googled, indicated PAM support wasn't compiled in.
dovecot: Jun 10 03:03:21 Info: Dovecot v1.0.14 starting up
dovecot: Jun 10 03:03:21 Error: Auth process died too early - shutting down
dovecot: Jun 10 03:03:21 Error: auth(default): Unknown passdb driver 'pam' (typo, or Dovecot was built without support for it? Check with dovecot --build-options)
dovecot: Jun 10 03:03:21 Error: child 9278 (auth) returned error 89
So I compiled in PAM support with
./configure --with-pam
make
make install
But still got the same error.
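The error message itself suggests checking dovecot --build-options. A minimal sketch of that check follows, assuming the output contains a line starting with "Passdb:" followed by space-separated driver names. That line format is an assumption, and the helper name is made up.

```shell
# Reads `dovecot --build-options` output on stdin and succeeds if the
# "Passdb:" line lists the driver named in $1 as a whole word.
# The "Passdb: a b c" line format is an assumption.
has_passdb_driver() {
    grep '^Passdb:' | tr ' ' '\n' | grep -qxF -- "$1"
}

# Usage, against one specific binary:
#   /usr/local/sbin/dovecot --build-options | has_passdb_driver pam \
#       && echo "pam compiled in" || echo "pam missing"
```

With copies of dovecot under both /usr/sbin and /usr/local/sbin at this point, it seems worth running the check against the exact binary the init script starts.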
I downloaded the source rpm for CentOS 5 - dovecot-1.0-1.2.rc15.el5.src.rpm
It flagged a load of dependencies which I manually yummed, the trickiest of which was
yum install gcc-c++
I managed to build the rpm, but it wanted a later version of ssl than I had. I wasn't planning on using ssl, so I hammered it in with rpm -i --nodeps
It installed and got a bit further, but still did not work - auth failure again I think.
Then I spotted http://wiki.dovecot.org/HowTo/DovecotLDAPostfixAdminMySQL
This had instructions for compiling from a dovecot-1.0.xxx.fc7.src rpm. So I went and got the latest one I could find: dovecot-1.0.13-18.fc7.src.rpm.
Finally, after editing /usr/src/redhat/SPECS/dovecot.spec to set
%define build_postgres 0
%define build_mysql 0
%define build_sieve 0
I got it to build and install the rpm.
I tweaked the /etc/dovecot.conf file a bit:
# 1.0.13: /etc/dovecot.conf
log_path: /var/log/dovecot.log
login_dir: /var/run/dovecot/login
login_executable(default): /usr/libexec/dovecot/imap-login
login_executable(imap): /usr/libexec/dovecot/imap-login
login_executable(pop3): /usr/libexec/dovecot/pop3-login
mail_executable(default): /usr/libexec/dovecot/imap
mail_executable(imap): /usr/libexec/dovecot/imap
mail_executable(pop3): /usr/libexec/dovecot/pop3
mail_plugin_dir(default): /usr/lib/dovecot/imap
mail_plugin_dir(imap): /usr/lib/dovecot/imap
mail_plugin_dir(pop3): /usr/lib/dovecot/pop3
pop3_uidl_format(default): %08Xu%08Xv
pop3_uidl_format(imap): %08Xu%08Xv
pop3_uidl_format(pop3): %v.%u
auth default:
  debug: yes
  debug_passwords: yes
  passdb:
    driver: pam
  userdb:
    driver: passwd
But I still had an auth problem.
Then I replaced the /etc/pam.d/dovecot file with the one from dovecot-0.99, and it worked.
# cat /etc/pam.d/dovecot
#%PAM-1.0
auth       required     pam_nologin.so
auth       required     pam_stack.so service=system-auth
account    required     pam_stack.so service=system-auth
session    required     pam_stack.so service=system-auth
While doing all this, I neglected to set the pop3_uidl_format and I downloaded a bunch of emails into Thunderbird. I got a large number of corrupt emails exactly as I was trying to describe above. I interrupted this download, set the pop3_uidl_format to %v.%u, started the download again and hey presto, no problems.
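To see why the format could matter: a POP3 client that leaves mail on the server recognises messages it already has by their UIDL, so if the server starts handing out differently formatted UIDLs, old messages can look new to the client. The sketch below mimics the two formats from the configuration above for a single message. It assumes %v is the mailbox's IMAP UIDVALIDITY, %u is the message's IMAP UID, and 08X means zero-padded 8-digit uppercase hex. The function names and sample numbers are invented for illustration.

```shell
# Build a POP3 UIDL from IMAP UIDVALIDITY ($1) and UID ($2).

# %v.%u : decimal UIDVALIDITY, a dot, decimal UID
uidl_v_dot_u() {
    printf '%s.%s\n' "$1" "$2"
}

# %08Xu%08Xv : UID, then UIDVALIDITY, each as 8-digit zero-padded hex
uidl_hex_u_v() {
    printf '%08X%08X\n' "$2" "$1"
}

uidl_v_dot_u 1000 42   # prints 1000.42
uidl_hex_u_v 1000 42   # prints 0000002A000003E8
```

The same message ends up with two unrelated-looking identifiers, which is consistent with the client misbehaving until the format was set back to %v.%u.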
Conclusion & First mistake
So, to conclude, I think the problem was connected with pop3_uidl all along. Perhaps if I had enabled proper logging at the beginning I might have avoided this whole nightmare ...grr!
When I first read this page, the problem descriptions did not seem to match my problem. Also I figured the standard logging being done by default to maillog & messages would show up any of the errors mentioned. I did not see any errors in my logs until after I had compiled and installed 1.0.14. It just flat refused to run, so I set up a dedicated dovecot.log file and set auth_debug = yes.