(Solved) Bug: MTA may not start with zmcontrol
Posted: Sun Dec 16, 2018 5:40 pm
After a zmcontrol restart, the MTA may fail to start. It's unlikely, but it can happen depending on how unlucky you are.
Should it happen, zmcontrol restart still prints "Stopping mta...Done." and "Starting mta...Done." as usual.
zmcontrol status can also report "mta Running", either immediately or days later depending on circumstances, when in fact the MTA is not running.
More specifically, it can happen during a reboot or restart, and it leaves you with the impression that the MTA is running. It is a result of Zimbra using the "kill -0" pattern on the PID stored in master.pid to determine whether the MTA is running. If that call succeeds, Zimbra assumes the MTA is already running and does not start it, which looks like a reasonable optimization by design. However, the master.pid file remains in the filesystem after postfix stops and still contains the PID of the previous postfix master process. If another process is assigned that same PID after you have stopped postfix, the check succeeds for the wrong process. This is very unlikely, but it does happen from time to time, judging by reports in these forums, in bugzilla, and what I observed this week.
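To see why the check can be fooled, here is a minimal shell sketch of the "kill -0" pattern described above. It is not Zimbra's actual code; the function name naive_mta_check and the temporary pid file are made up for illustration, and the shell's own PID stands in for an unrelated process that happened to reclaim the PID left in master.pid.

```shell
#!/bin/sh
# Illustrative only: mimics a "kill -0 <pid from file>" liveness test.
# naive_mta_check is a hypothetical name, not part of Zimbra.
naive_mta_check() (
    pidfile=$1
    [ -r "$pidfile" ] || { echo "stopped"; exit 0; }
    # The pid file may pad the PID with whitespace, so strip it.
    pid=$(tr -d '[:space:]' < "$pidfile")
    # kill -0 sends no signal; it only reports whether a process with this
    # PID exists and may be signalled. It says nothing about WHICH process.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stopped"
    fi
)

# Stand-in for a stale master.pid: it holds the PID of this shell,
# which is alive but is certainly not the postfix master.
stale=$(mktemp)
echo "$$" > "$stale"
naive_mta_check "$stale"   # reports "running" although no MTA is involved
rm -f "$stale"
```

Any check that only asks "is some process alive with this PID?" has this blind spot, whatever language it is written in.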
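How can a process get the same PID as a previous master process? Easy: the kernel wraps around and reuses PIDs once it reaches the limit in /proc/sys/kernel/pid_max (32768 here).
If you want to see how close you are to wrapping, compare, as root, the PID of your shell (echo $$) with the PID stored in /opt/zimbra/data/postfix/spool/pid/master.pid.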
If the process that reclaims this PID is long-lived (one that never exits, such as ldap), the symptoms are that no MTA is running while zmcontrol status keeps showing the MTA as running. If that process is short-lived, there will still be no MTA running, but a later zmcontrol status will at least show that it isn't running.
How do you know it happened to you? A normal zmcontrol restart leaves a "starting the Postfix mail system" message in /var/log/zimbra.log. That message is missing when /opt/zimbra/bin/zmmtactl believes the MTA is already running. If you ran the restart from the command line, you could be under the impression that it started, because it would have shown 'Done' and 'Running'.
For the long-lived version of the bug, the symptoms are STATUS lines in /var/log/zimbra-stats.log that show the MTA as running while the detailed stats show the mail system as down. zmcontrol status from the command line continues to show the MTA as Running.
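Besides reading the logs, one can look at which process actually owns the PID recorded in master.pid. The following is a rough, Linux-only sketch; describe_pidfile_owner is a hypothetical helper, and it assumes the postfix master process appears in /proc/<pid>/comm under the name "master". A name match is weaker evidence than the lock test that postfix status performs, so treat the result as a hint only.

```shell
#!/bin/sh
# describe_pidfile_owner PIDFILE   (hypothetical helper, not part of Zimbra)
# Prints one of:
#   no-pidfile, invalid, not-running, postfix-master, reused-by:<name>
# Assumption: the postfix master shows up in /proc/<pid>/comm as "master".
describe_pidfile_owner() (
    pidfile=$1
    [ -r "$pidfile" ] || { echo "no-pidfile"; exit 0; }
    # The pid file may pad the PID with whitespace, so strip it.
    pid=$(tr -d '[:space:]' < "$pidfile")
    case $pid in
        ''|*[!0-9]*) echo "invalid"; exit 0 ;;
    esac
    if [ ! -r "/proc/$pid/comm" ]; then
        echo "not-running"
        exit 0
    fi
    read -r comm < "/proc/$pid/comm"
    if [ "$comm" = "master" ]; then
        echo "postfix-master"
    else
        # Some other process now holds the PID that master.pid remembers.
        echo "reused-by:$comm"
    fi
)

# Example call on a Zimbra host (path taken from this post):
# describe_pidfile_owner /opt/zimbra/data/postfix/spool/pid/master.pid
```

A "reused-by:..." answer while zmcontrol status says "mta Running" would match the long-lived variant of the problem described above.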
Why does it happen? Instead of using postfix's own status command, which tests a lock on master.pid, zmmtastatus uses the "kill -0" pattern on the PID from master.pid. That check succeeds because some process is running with that PID, just not the MTA. The fix appears simple.
Modify /opt/zimbra/libexec/zmmtastatus so that it calls /opt/zimbra/common/sbin/postfix status instead of kill -0 $pid.
Until this is patched, how can one be certain that zmcontrol start/restart worked and that the MTA has actually been started? This has implications for anyone doing unattended certificate renewals using the officially recommended Zimbra practice of issuing a zmcontrol restart.
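One possible answer, sketched here under assumptions rather than offered as an official procedure, is to ask postfix itself after the restart instead of trusting the 'Done' line. The helper names mta_state and check_mta_after_restart are hypothetical. The default binary path is the one used in the zmmtastatus change in this post, and postfix may insist on being run by a privileged user.

```shell
#!/bin/sh
# mta_state [POSTFIX_BIN]   (hypothetical helper)
# Prints "running" if "POSTFIX_BIN status" exits 0, otherwise "down".
# A missing or non-executable binary also counts as "down".
mta_state() (
    postfix_bin=${1:-/opt/zimbra/common/sbin/postfix}
    if "$postfix_bin" status >/dev/null 2>&1; then
        echo "running"
    else
        echo "down"
    fi
)

# Hypothetical post-restart hook for an unattended job: fail loudly
# instead of trusting the 'Done' printed by zmcontrol restart.
check_mta_after_restart() {
    if [ "$(mta_state "$@")" != "running" ]; then
        echo "MTA is not running after restart" >&2
        return 1
    fi
}
```

An unattended renewal script could call check_mta_after_restart right after zmcontrol restart and alert, or retry, when it returns non-zero.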
Note: I have reported this to Zimbra along with my workaround, which may not be the proper fix; it has not yet been confirmed as a bug.
How to work around this without a patch, while keeping the kill -0 pattern: if the process that reclaimed the PID belongs to Zimbra, another zmcontrol restart will solve it. Otherwise, rm the master.pid file, which changes the logic and forces Zimbra to always start the MTA on a start. A reboot would likely have the same effect, as would killing the process associated with the PID in master.pid, restarting the MTA, and then restarting the non-Zimbra process that you killed.
What zmcontrol restart prints when this happens:
Code:
su - zimbra
zmcontrol restart
...
Stopping mta...Done.
...
Starting mta...Done.
..
What zmcontrol status then reports:
Code:
su - zimbra
...
mta Running
...
The PID limit at which the kernel wraps:
Code:
cat /proc/sys/kernel/pid_max
32768
Checking how close you are to wrapping, as root:
Code:
# echo $$
245
# cat /opt/zimbra/data/postfix/spool/pid/master.pid
350
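The two numbers above can be turned into a rough distance. This is a back-of-the-envelope sketch, assuming the kernel hands out PIDs sequentially and wraps at pid_max; pids_until_reuse is a hypothetical helper name. The result is only an upper bound, since PIDs still in use are skipped and Linux does not restart from 1 after a wrap.

```shell
#!/bin/sh
# pids_until_reuse CURRENT TARGET PID_MAX   (hypothetical helper)
# Prints how many PID allocations separate CURRENT from TARGET, assuming
# sequential allocation that wraps at PID_MAX. Upper bound only.
pids_until_reuse() (
    current=$1
    target=$2
    pid_max=$3
    if [ "$target" -gt "$current" ]; then
        echo $(( target - current ))
    else
        # The counter must first run up to pid_max, wrap, then climb to target.
        echo $(( pid_max - current + target ))
    fi
)

# Values from the listings in this post:
# shell PID 245, master.pid 350, pid_max 32768.
pids_until_reuse 245 350 32768    # prints 105

# On a live system the inputs could be gathered like this (as root):
# pids_until_reuse "$$" \
#     "$(tr -d '[:space:]' < /opt/zimbra/data/postfix/spool/pid/master.pid)" \
#     "$(cat /proc/sys/kernel/pid_max)"
```

With only about a hundred allocations to go, a stopped postfix on that host would not have to wait long before some other process picked up PID 350.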
The postfix start message that a normal restart leaves in /var/log/zimbra.log:
Code:
grep 'starting the Postfix' /var/log/zimbra.log
Dec 10 07:39:44 relay3 /postfix-script[24222]: starting the Postfix mail system
The modified check in /opt/zimbra/libexec/zmmtastatus:
Code:
% grep -A 2 kill /opt/zimbra/libexec/zmmtastatus
#system("kill -0 $pid 2> /dev/null");
#JAD 12/14/2018 (/opt/zimbra/libexec/zmmtastatus)
system("/opt/zimbra/common/sbin/postfix status 2> /dev/null");