(Solved) Bug: MTA may not start with zmcontrol

Discuss your pilot or production implementation with other Zimbra admins or our engineers.
User avatar
JDunphy
Outstanding Member
Outstanding Member
Posts: 883
Joined: Fri Sep 12, 2014 11:18 pm
Location: Victoria, BC
ZCS/ZD Version: 9.0.0_P39 NETWORK Edition

(Solved) Bug: MTA may not start with zmcontrol

Post by JDunphy »

This can occur when calling zmcontrol restart where the MTA may not start up. It's unlikely but it can occur depending how unlucky you are.

Should it happen, you would see this on running zmcontrol restart

Code: Select all

su - zimbra
zmcontrol restart
...
Stopping mta...Done.
...
Starting mta...Done.
..
you could also see this on running zmcontrol status immediately or days later depending on circumstances.

Code: Select all

su - zimbra
...
	mta                     Running
...
When in fact the MTA would would not be running.

More specifically it can happen during a reboot or restart and it can provide you with the impression that the MTA is running. It is a result of zimbra using the "kill -0 master.pid" pattern to determine if the MTA is running. If the return call from this is successful, then it will not start the MTA and assumes the MTA is running which appears to be a good optimization by design. Given this master.pid file remains in the filesystem and contains a pid of a previous postfix master process, it can be a problem should another process reclaim the same pid value used in this file after you have stopped postfix... which while super unlikely does happen from time to time given reports in these forums, bugzilla and what I observed this week.

How can a pid have the same pid as a previous running master pid? Easy, the kernel wraps and reuses pid's depending on this value.

Code: Select all

cat /proc/sys/kernel/pid_max
32768
If you want to see how close you are to wrapping, do this as root

Code: Select all

# echo $$
245
# cat /opt/zimbra/data/postfix/spool/pid/master.pid
350
If the process that reclaims this pid is long living (never dies such as ldap, etc), the symptoms will be no MTA running and zmcontrol status showing that the MTA is running when in fact it is not. If that process is short lived, there will be no MTA running but later zmcontrol status will at least show it isn't running. How do you know it happened to you? A normal zmcontrol restart for the MTA will have a postfix start message in the /var/log/zimbra.log. that would be missing should /opt/zimbra/bin/zmmtactl believe the mta is already running. If you ran it from the command line, you could be under the impression it started because it would have shown 'Done' and 'Running'.

Code: Select all

grep 'starting the Postfix' /var/log/zimbra.log
Dec 10 07:39:44 relay3 /postfix-script[24222]: starting the Postfix mail system
For the long living version of the bug the symptoms are STATUS lines in /var/log/zimbra-stats.log that show the MTA is running but the detailed stats will show the mail system as down. zmcontrol status from the command line would continue to show the MTA as Running.

Why does it happen? Instead of using postfix's status which tests a lock on master.pid, zmmtastatus uses the "kill -0 master.pid" pattern which returns successful because some process is running - just not the MTA. The fix appears simple.

Modify /opt/zimbra/libexec/zmmtastatus:

Code: Select all

% grep -A 2 kill /opt/zimbra/libexec/zmmtastatus

      #system("kill -0 $pid 2> /dev/null");
      #JAD 12/14/2018 (/opt/zimbra/libexec/zmmtastatus)
      system("/opt/zimbra/common/sbin/postfix status 2> /dev/null");
Until this is patched, how does one trust with certainty that zmcontrol start/restart is correct and that the MTA has been started? This has implications for those that do unattended certificate renewals using the official Zimbra recommended practice of issuing a zmcontrol restart.

Note: I have reported this to zimbra with my workaround which may not be the proper fix nor has this been confirmed as a bug.

How to workaround this without a patch and keeping the kill -0 pattern. If the reclaimed process is associated with zimbra.... another zmcontrol restart will solve it. Otherwise, rm the master.pid file which will change the logic and force zimbra to always restart the MTA on a start. Similarly, a reboot would likely have the same effect. As would killing the process associated with the master.pid and restarting the MTA ... then restarting the non zimbra process that you killed.
Last edited by JDunphy on Wed Jul 14, 2021 1:15 pm, edited 2 times in total.
User avatar
axslingr
Outstanding Member
Outstanding Member
Posts: 256
Joined: Sat Sep 13, 2014 2:20 am
ZCS/ZD Version: 8.8.15.GA.3869.UBUNTU18.64 UBUNTU18

Re: Potential Bug: MTA may not start with zmcontrol

Post by axslingr »

I'm wondering if this could happen just doing zmcontrol stop / start. I run a backup script that does a stop / start, and every so often I wake up in the morning to find hundreds of emails stuck on our external spam filter because the mta service isn't running. A simple zmcontrol restart fixes it, although zmcontrol status says the service is running. I've only ever seen this happen in a script; never from the command line.

Lance
User avatar
JDunphy
Outstanding Member
Outstanding Member
Posts: 883
Joined: Fri Sep 12, 2014 11:18 pm
Location: Victoria, BC
ZCS/ZD Version: 9.0.0_P39 NETWORK Edition

Re: Potential Bug: MTA may not start with zmcontrol

Post by JDunphy »

Yes it is possible with a stop/start.

zmcontrol has these 2 functions:
doShutdown() and doStartup().

zmcontrol start does the doStartup()
zmcontrol stop does doShutdown()
zmcontrol restart does doShutdown;doStartup

Look at your /var/log/zimbra.log for your zmcontrol restart... you should see. "starting the Postfix mail system"
Here is a failed case:

Code: Select all

Dec  9 06:36:59 relay3 zimbramon[12661]: 12661:info: Starting mta via zmcontrol
Dec  9 06:36:59 relay3 saslauthd[16132]: detach_tty      : master pid is: 16132
Dec  9 06:36:59 relay3 saslauthd[16132]: ipc_init        : listening on socket: /opt/zimbra/data/sasl2/state/mux
Dec  9 06:36:59 relay3 zimbramon[12661]: 12661:info: Starting stats via zmcontrol
Dec  9 06:36:59 relay3 zimbramon[12661]: 12661:info: Starting service via zmcontrol
Here is the working and normal case:

Code: Select all

Dec 10 07:39:43 relay3 zimbramon[21758]: 21758:info: Starting mta via zmcontrol
Dec 10 07:39:44 relay3 saslauthd[24120]: detach_tty      : master pid is: 24120
Dec 10 07:39:44 relay3 saslauthd[24120]: ipc_init        : listening on socket: /opt/zimbra/data/sasl2/state/mux
Dec 10 07:39:44 relay3 /postfix-script[24222]: starting the Postfix mail system
Dec 10 07:39:44 relay3 postfix/master[24224]: daemon started -- version 3.1.1, configuration /opt/zimbra/common/conf
Dec 10 07:39:44 relay3 zimbramon[21758]: 21758:info: Starting stats via zmcontrol
Should this happen, there is nothing listening on port 25 so the primary MX if it is external would have a lot of queued messages.
User avatar
JDunphy
Outstanding Member
Outstanding Member
Posts: 883
Joined: Fri Sep 12, 2014 11:18 pm
Location: Victoria, BC
ZCS/ZD Version: 9.0.0_P39 NETWORK Edition

Re: Potential Bug: MTA may not start with zmcontrol

Post by JDunphy »

I should add that the probability of hitting this increases the more often you perform a stop/start cycle. I do quarterly certificate renewals but if you are doing it nightly you are increasing your odds and have a much higher probability of triggering this problem. It is simple to reproduce. Here is 1 method.

1) zmmtactl stop # stop the MTA ... that is what zmcontrol stop calls.
2) find a process that is already running ... init is 1 so I'll use that.
3) change /opt/zimbra/data/postfix/spool/pid/master.pid so it contains 1
4) zmmtactl start

Observe that you don't have an MTA running on port 25 or perform a ps and look for postfix. Note you should also see this when the MTA starts:

Code: Select all

zimbra@tmail ~]$ zmmtactl start
Rewriting configuration files...done.
Starting saslauthd...done.
/postfix-script: starting the Postfix mail system
but in the above test case forcing the bug you would see:

Code: Select all

[zimbra@tmail ~]$ zmmtactl start
Rewriting configuration files...done.
Starting saslauthd...done.
Extra credit: run zmcontrol status :-)
User avatar
DavidMerrill
Advanced member
Advanced member
Posts: 126
Joined: Thu Jul 30, 2015 2:44 pm
Location: Portland, ME
ZCS/ZD Version: 8.8.15 P19
Contact:

Re: Potential Bug: MTA may not start with zmcontrol

Post by DavidMerrill »

Thanks for sharing this, I've often run in to this scenario (usually after a server reboot) and as described the zmcontrol stop/start has always resolved it. The OCD in me was always offended with not knowing why it was happening in the first place (yet being so busy-busy it's been hard to set aside time to chase this down), I really appreciate your insight on this.
___________________________________
David Merrill - Zimbra Practice Lead
OTELCO Zimbra Hosting, Licensing and Professional Services
Zeta Alliance
User avatar
JDunphy
Outstanding Member
Outstanding Member
Posts: 883
Joined: Fri Sep 12, 2014 11:18 pm
Location: Victoria, BC
ZCS/ZD Version: 9.0.0_P39 NETWORK Edition

Re: Potential Bug: MTA may not start with zmcontrol

Post by JDunphy »

Thanks David.

It appears this bug has been around forever. Bugzilla's history showed a lot of idea's including that it was a kernel bug, stuck process, postfix bug and/or postfix draining issue. I initially thought it was a timing bug or a missed signal but when I looked at master.c, it didn't explain the long lived zmcontrol status indicating that everything was running + postfix even had a 5 second watchdog they set in their signal handler should it's children take too long to die so it looked clean.

I saw that kill -0 and thought something was off but couldn't explain it. It wasn't until I looked at the kernel and how they handled SIG_0 and the associated user level lib that I realized that kill -0 doesn't even send a signal. It only determines if you would have permission to send a signal to SOME process. It's a pretty nifty call for determining signal/proc permission and it works with zombies too but definitely not a reliable way to determine if a specific program is running or not.

Poor Rick King got a small book from me as I shared my thoughts on what this could be. Every day, another piece of information or idea as I looked at my logs and test code hoping something would jump out for one of us. I think I told him at one point... just ignore everything I have written. LOL

I had pretty much given up the cause until I saw Malte's bug report in 2017 that it happened on start up and that extra piece made it all click for me.
GrnEyedDvl
Posts: 11
Joined: Wed Dec 05, 2018 7:34 am

Re: Potential Bug: MTA may not start with zmcontrol

Post by GrnEyedDvl »

Ok this is useful information, and I actually wonder if it applies to the problem I have been having here.
viewtopic.php?f=15&t=65285

Have you seen it with cron jobs and config reloads?
User avatar
JDunphy
Outstanding Member
Outstanding Member
Posts: 883
Joined: Fri Sep 12, 2014 11:18 pm
Location: Victoria, BC
ZCS/ZD Version: 9.0.0_P39 NETWORK Edition

Re: Potential Bug: MTA may not start with zmcontrol

Post by JDunphy »

GrnEyedDvl wrote: Have you seen it with cron jobs and config reloads?
The bug reported here is the MTA (postfix) being shutdown and not being restarted so issuing
zmmtactl stop would demonstrate this problem and its symptoms. I am not sure this is going on with yours given what you have reported but you can do a little more to help debug your end.
1) When it fails, telnet to port 25 from another ip address to verify not a FW issue/dynamic filter rule. If connection fails, verify that postfix is running on the zimbra box.
2) Verify you don't have some sort of dynamic fw rule kicking in based on incoming connections, etc if postfix is running.
3) Determine MTA traffic patterns around outage time to see if its load based if postfix is still running when things fail for you.

By default, the zimbra cron entries do not normally restart the MTA unless you have made changes to your configuration. As for config reloads, yes they have pathways to restart the mta depending on what is changing. see /opt/zimbra/conf/zmconfigd.cf for more details so that restart MTA pathway could trigger this bug if you were unlucky.
johnroberts
Posts: 31
Joined: Sat Sep 13, 2014 2:43 am

Re: Potential Bug: MTA may not start with zmcontrol

Post by johnroberts »

JDunphy wrote:This can occur when calling zmcontrol restart where the MTA may not start up. It's unlikely but it can occur depending how unlucky you are.
We have been plagued with this for the past four years. Thanks for the details of your investigation, it all makes sense what is causing it now...

Luckily we have monitoring software which alerts us, but still a pain at 4am after a backup and it has to be manually restarted.

Has anyone from Synacor acknowledged this as a bug yet?
User avatar
JDunphy
Outstanding Member
Outstanding Member
Posts: 883
Joined: Fri Sep 12, 2014 11:18 pm
Location: Victoria, BC
ZCS/ZD Version: 9.0.0_P39 NETWORK Edition

Re: Potential Bug: MTA may not start with zmcontrol

Post by JDunphy »

johnroberts wrote: Has anyone from Synacor acknowledged this as a bug yet?
ZBUG-798 hopefully they can agree on a fix. ...https://github.com/Zimbra/zm-core-utils/pull/42
Post Reply