Spamassassin with a genetical learning disability ? ;)

v1rtu4l
Posts: 36
Joined: Tue Jun 28, 2016 3:04 pm


Post by v1rtu4l »

hi guys,

Like probably everybody, we are annoyed by spam. Following the hints in https://wiki.zimbra.com/wiki/Anti-spam_Strategies and enabling a few RBLs resulted in too many false positives, so we do not see RBLs as a good first barrier against spam.
I hoped that SpamAssassin would actually learn when we flag a message as spam in 5 different mail accounts, always with the same body (just a different attachment), but it still gets delivered. There are even messages with exactly the same content that were flagged as spam more than 20 times, and they still get delivered again and again.

What I already tried:
- Following the hints in https://wiki.zimbra.com/wiki/Anti-spam_Strategies. I was surprised that antispam_enable_rule_updates and antispam_enable_restarts have defaulted to false since ZCS 6, even though the Anti-spam Strategies article recommends setting them to "true" (which I did). It did not change a thing (and yes, of course I restarted the services several times).

- Running zmtrainsa in the hope that the spam mails it allegedly processes actually lead to some proper rules being generated, which does not seem to happen.

- I checked SpamAssassin's Bayes DB, and it tells me that basically nothing has been learned (at least that's how I read it):

Code: Select all

zimbra@mail:~/data/spamassassin/localrules$ sa-learn --dump all
netset: cannot include 127.0.0.0/8 as it has already been included
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0          0          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
I even tried to write my own rules according to https://wiki.zimbra.com/wiki/Improving_Anti-spam_system but let's be honest, I will not write a rule for every spam I receive. This kind of worked, by the way: when I set the spam score above 5, the mail did not get delivered at all (not even into the spam folder), but with a lower score (3.1) the mail was not even flagged as spam and was delivered as ham.

So my question is: "Is this the expected behaviour and capability of SpamAssassin's learning, or should it actually filter out spam that I have already flagged more than 20 times?"
phoenix
Ambassador
Posts: 27272
Joined: Fri Sep 12, 2014 9:56 pm
Location: Liverpool, England

Re: Spamassassin with a genetical learning disability ? ;)

Post by phoenix »

The first thing you should do when posting a question in these forums is give the exact version of ZCS that's in use by posting the full output of the following command:

Code: Select all

zmcontrol -v
You say you've 'tried some RBLs' but you haven't given any details of which RBLs you've used, nor have you given any information on the types of spam you're seeing. You can do that by looking at the headers of a spammy email to see what's causing the problems. A full list of all the changes you've tried (RBLs, Kill/Tag percentages etc.) would also go a long way to understanding your problem.
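For readers unsure how the Kill/Tag percentages relate to SpamAssassin scores, here is a minimal sketch. It assumes that Zimbra maps the zimbraSpamTagPercent and zimbraSpamKillPercent settings onto a 20-point scale (100% equals a score of 20); that mapping is an assumption to verify against your own release, and the function names are purely illustrative.

```python
# Assumption: Zimbra expresses its thresholds as a percentage of a
# 20-point SpamAssassin scale. Check this against your ZCS release.
ASSUMED_FULL_SCALE = 20.0


def percent_to_score(percent, full_scale=ASSUMED_FULL_SCALE):
    """Convert a tag/kill percentage into a SpamAssassin score threshold."""
    return round(percent / 100.0 * full_scale, 3)


def classify(score, tag_percent, kill_percent):
    """Say what would happen to a message with the given score."""
    if score >= percent_to_score(kill_percent):
        return "discarded"
    if score >= percent_to_score(tag_percent):
        return "tagged as spam"
    return "delivered as ham"


# Hypothetical settings: tag at 33%, kill at 75%.
print(percent_to_score(33), percent_to_score(75))
print(classify(4.5, 33, 75))
```

Under that assumption, lowering the tag percentage is what makes borderline messages land in the Junk folder, while the kill percentage decides what is dropped outright.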
Regards

Bill

Rspamd: A high performance spamassassin replacement

Per ardua ad astra
v1rtu4l
Posts: 36
Joined: Tue Jun 28, 2016 3:04 pm

Re: Spamassassin with a genetical learning disability ? ;)

Post by v1rtu4l »

Sorry Bill, i thought it was clear that one uses the latest version if not otherwise told. My fault.

So we are using the very latest version of zimbra open source edition

Code: Select all

zimbra@mail:~$ zmcontrol -v
Release 8.7.0.GA.1659.UBUNTU14.64 UBUNTU14_64 FOSS edition.
I did not expand on the RBL issue because this is not the way we want our spam to be blocked: a few hosted domains receive customer enquiries from freemailers like gmx.net or web.de, which have a very bad integrity rating (if you check them with mail-tester.com), so this is not a good approach for us. The RBLs we used were zen.spamhaus.org, bl.spamcop.net, rhsbl.sorbs.net and multi.uribl.com.

This is a mail message that we already flagged days ago and we keep receiving mails with identical content without it getting filtered out:

Code: Select all

From: "Nadia Miles" <Miles.6716@littlemoonhills.co.uk>
Subject: sales report
Organization: Rathbone Brothers
Message-ID: <5403adc6-f44c-3397-8468-25217425af15@seminar-experts.ch>
Date: Sat, 23 Jul 2016 08:45:43 +0700
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.0
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------------E683C7CF4152C2FA455B0BDA"

This is a multi-part message in MIME format.
--------------E683C7CF4152C2FA455B0BDA
Content-Type: multipart/alternative;
 boundary="------------11DA6D95A8AAD537306BC46D"


--------------11DA6D95A8AAD537306BC46D
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

I am truly sorry that I was not available at the time you called me yesterday.
I attached the report with details on sales figures.

----- Yours truly, Nadia Miles
Rathbone Brothers Phone: +1 (672) 660-64-63 Fax: +1 (672) 660-64-14

--------------11DA6D95A8AAD537306BC46D
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit

<html>
  <head>
    <meta http-equiv="content-type" content="text/html;
      charset=windows-1251">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <br>
[b]    <big>I am truly sorry that I was not available at the time you
      called me yesterday.<br>
      I attached the report with details on sales figures.</big><br>
      <br>[/b]
    <address> <small>----- </small></address>
    <address><small> </small></address>
    <address><small> Yours truly,</small></address>
    <address><small> Nadia Miles</small></address>
    <address><small> </small></address>
    <address><small> <br>
        Rathbone Brothers</small></address>
    <small>Phone: +1 (672) 660-64-63</small>
    <address><small> Fax: +1 (672) 660-64-14 </small></address>
  </body>
</html>

--------------11DA6D95A8AAD537306BC46D--
The mail always has a zip attachment and the bold text is always the same (the name/signature sometimes changes). I got 65 mails like this in the last 48 hours. This exact mail is probably not the best example, because many of those get filtered out while a few slip through (so this is not the worst spam mail).

Sadly, in my trial-and-error quest I emptied the spam folder, thinking that this would probably trigger the processing of those mails by zmtrainsa.

So the way I understand it is: the user flags a mail as spam, the mail gets put into the user's spam folder and is sent to the internal spam account at the same time. The next time zmtrainsa runs, it will query the internal spam account and try to derive rules from the spam mails within that account. Is that correct?

Is the output of the "sa-learn --dump all" command I posted in my first post as expected, or should I actually see higher counts for the entries in the DB?

Is this the expected output of zmtrainsa?

Code: Select all

zimbra@mail:~$ zmtrainsa
20160723171952 Starting spam/ham extraction from system accounts.
[] INFO: Total messages processed: 355
[] INFO: Total messages processed: 201
20160723172015 Finished extracting spam/ham from system accounts.
20160723172015 Starting spamassassin training.
netset: cannot include 127.0.0.0/8 as it has already been included
Learned tokens from 206 message(s) (353 message(s) examined)
netset: cannot include 127.0.0.0/8 as it has already been included
Learned tokens from 201 message(s) (201 message(s) examined)
netset: cannot include 127.0.0.0/8 as it has already been included
20160723172044 Finished spamassassin training.
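To make a zmtrainsa log like the one above easier to read, here is a small sketch that pulls out the "Learned tokens from N message(s) (M message(s) examined)" lines. The function name is illustrative. As far as I know, sa-learn skips messages it has already learned, so a learned count below the examined count is not necessarily an error.

```python
import re

# Matches the summary line that sa-learn prints after each training run.
LEARNED = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def summarize_training(log_text):
    """Return one (learned, examined) pair per sa-learn run found in the log."""
    return [(int(m.group(1)), int(m.group(2))) for m in LEARNED.finditer(log_text)]


sample_log = """\
20160723172015 Starting spamassassin training.
Learned tokens from 206 message(s) (353 message(s) examined)
Learned tokens from 201 message(s) (201 message(s) examined)
20160723172044 Finished spamassassin training.
"""

for learned, examined in summarize_training(sample_log):
    print(f"learned {learned} of {examined} examined, skipped {examined - learned}")
```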
phoenix
Ambassador
Posts: 27272
Joined: Fri Sep 12, 2014 9:56 pm
Location: Liverpool, England

Re: Spamassassin with a genetical learning disability ? ;)

Post by phoenix »

Unfortunately we can't make any assumptions about which versions of ZCS people are using; we even get the "latest version" comment, and that's almost as bad as no information. In addition, it's sometimes important to know which Linux distribution is in use. :)

The headers that you've posted above, are they the complete output of a show original? If they are, then the message appears not to have been processed by the anti-spam system. I assume all of the ZCS services are running and the SpamAssassin rules are getting updated? Was this an upgraded system or a new install of ZCS? If this is an upgraded system, has this problem always existed or did it start after the upgrade?
Regards

Bill

v1rtu4l
Posts: 36
Joined: Tue Jun 28, 2016 3:04 pm

Re: Spamassassin with a genetical learning disability ? ;)

Post by v1rtu4l »

thank you for your further investigation, mate.

This system is sadly Ubuntu 14 LTS (but you already know that from my former post). It was initially set up with ZCS 8.6, patched with Patch 7, then upgraded to ZCS 8.7 RC2 and then to ZCS 8.7 GA (all as open source edition). The system only got test usage from the 8.7 RC2 state onwards.

all services are running:

Code: Select all

zimbra@mail:~$ zmcontrol status
Host mail.space4.local
        amavis                  Running
        antispam                Running
        antivirus               Running
        ldap                    Running
        logger                  Running
        mailbox                 Running
        memcached               Running
        mta                     Running
        opendkim                Running
        proxy                   Running
        service webapp          Running
        snmp                    Running
        spell                   Running
        stats                   Running
        zimbra webapp           Running
        zimbraAdmin webapp      Running
        zimlet webapp           Running
        zmconfigd               Running
The headers I posted were not complete, because I was simply too lazy to redact company-sensitive information. Here is the whole thing (sorry for leaving out the essential part before):

Code: Select all

Return-Path: Miles.6716@littlemoonhills.co.uk
Received: from myzimbra.myinternaldom.local (LHLO externalhost.publicdomain.com) (192.168.0.190) by
 myzimbra.myinternaldom.local with LMTP; Sat, 23 Jul 2016 03:48:17 +0200 (CEST)
Received: from localhost (localhost [127.0.0.1])
	by externalhost.publicdomain.com (Postfix) with ESMTP id 2D717342720
	for <realuser@mydomain.de>; Sat, 23 Jul 2016 03:48:17 +0200 (CEST)
X-Virus-Scanned: amavisd-new at space4.local
X-Spam-Flag: NO
X-Spam-Score: 4.504
X-Spam-Level: ****
X-Spam-Status: No, score=4.504 required=6.6 tests=[BAYES_99=3.5,
	BAYES_999=0.2, HTML_MESSAGE=0.001, RDNS_NONE=0.793,
	T_SPF_TEMPERROR=0.01] autolearn=no autolearn_force=no
Received: from externalhost.publicdomain.com ([127.0.0.1])
	by localhost (myzimbra.myinternaldom.local [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id DDBTUOHEa6nW for <realuser@mydomain.de>;
	Sat, 23 Jul 2016 03:48:01 +0200 (CEST)
Received: from [1.47.203.144] (unknown [1.47.203.144])
	by externalhost.publicdomain.com (Postfix) with ESMTP id DDDE734270B
	for <some_alias@mydomain.ch>; Sat, 23 Jul 2016 03:47:56 +0200 (CEST)
To: some_alias@mydomain.ch
From: "Nadia Miles" <Miles.6716@littlemoonhills.co.uk>
Subject: sales report
Organization: Rathbone Brothers
Message-ID: <5403adc6-f44c-3397-8468-25217425af15@seminar-experts.ch>
Date: Sat, 23 Jul 2016 08:45:43 +0700
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.0
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------------E683C7CF4152C2FA455B0BDA"

This is a multi-part message in MIME format.
--------------E683C7CF4152C2FA455B0BDA
Content-Type: multipart/alternative;
 boundary="------------11DA6D95A8AAD537306BC46D"


--------------11DA6D95A8AAD537306BC46D
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

I am truly sorry that I was not available at the time you called me yesterday.
I attached the report with details on sales figures.

----- Yours truly, Nadia Miles
Rathbone Brothers Phone: +1 (672) 660-64-63 Fax: +1 (672) 660-64-14

--------------11DA6D95A8AAD537306BC46D
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit

<html>
  <head>
    <meta http-equiv="content-type" content="text/html;
      charset=windows-1251">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <br>
    <big>I am truly sorry that I was not available at the time you
      called me yesterday.<br>
      I attached the report with details on sales figures.</big><br>
      <br>
    <address> <small>----- </small></address>
    <address><small> </small></address>
    <address><small> Yours truly,</small></address>
    <address><small> Nadia Miles</small></address>
    <address><small> </small></address>
    <address><small> <br>
        Rathbone Brothers</small></address>
    <small>Phone: +1 (672) 660-64-63</small>
    <address><small> Fax: +1 (672) 660-64-14 </small></address>
  </body>
</html>

--------------11DA6D95A8AAD537306BC46D--

--------------E683C7CF4152C2FA455B0BDA
Content-Type: application/x-compressed;
 name="stefano.conti_086FF.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="stefano.conti_086FF.zip"
Another spam mail that got flagged as spam numerous times and is still getting delivered is this one (shouldn't the To field have the name of the recipient too?):

Code: Select all

Return-Path: konto-aktualisierung@s19395345.onlinehome-server.info
Received: from zimbra.internaldomain.local (LHLO mymail.externaldom.com) (192.168.0.190) by
 zimbra.internaldomain.local with LMTP; Sat, 23 Jul 2016 18:42:03 +0200 (CEST)
Received: from localhost (localhost [127.0.0.1])
	by mymail.externaldom.com (Postfix) with ESMTP id 303B43426AA
	for <realuser@mydomain.de>; Sat, 23 Jul 2016 18:42:03 +0200 (CEST)
X-Virus-Scanned: amavisd-new at space4.local
X-Spam-Flag: NO
X-Spam-Score: 0.004
X-Spam-Level:
X-Spam-Status: No, score=0.004 required=6.6 tests=[BAYES_40=-0.001,
	HTML_EXTRA_CLOSE=0.001, HTML_MESSAGE=0.001, RCVD_IN_MSPIKE_BL=0.01,
	RCVD_IN_MSPIKE_L5=0.001, T_RP_MATCHES_RCVD=-0.01, URIBL_BLOCKED=0.001,
	WEIRD_PORT=0.001] autolearn=ham autolearn_force=no
Received: from mymail.externaldom.com ([127.0.0.1])
	by localhost (zimbra.internaldomain.local [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id oGfL_iHGO2uh for <realuser@mydomain.de>;
	Sat, 23 Jul 2016 18:42:02 +0200 (CEST)
Received: from s19395345.onlinehome-server.info (s19395345.onlinehome-server.info [82.165.42.2])
	by mymail.externaldom.com (Postfix) with ESMTP id 98DC43425A4
	for <aliasuser@mydomain.ch>; Sat, 23 Jul 2016 18:42:02 +0200 (CEST)
Received: from s19395345.onlinehome-server.info ([127.0.0.1]) by s19395345.onlinehome-server.info with Microsoft SMTPSVC(7.5.7601.17514);
	 Sat, 23 Jul 2016 18:34:11 +0200
Content-Type: multipart/alternative; boundary="===============0396095131=="
MIME-Version: 1.0
Subject: =?utf-8?q?Sch=C3=BCtzen_Sie_Ihre_Amazon=2Ede_Konto?=
To: Recipients <konto-aktualisierung@s19395345.onlinehome-server.info>
From: "Amazon.de" <konto-aktualisierung@s19395345.onlinehome-server.info>
Date: Sat, 23 Jul 2016 18:34:11 +0200
Message-ID: <S19395345vpvDtDJd5I0000637c@s19395345.onlinehome-server.info>
X-OriginalArrivalTime: 23 Jul 2016 16:34:11.0535 (UTC) FILETIME=[0ADB25F0:01D1E500]

You will not see this in a MIME-aware mail reader.
--===============0396095131==
Content-Type: text/plain; charset="iso-8859-1"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Description: Mail message body

Guten Tag, =


 Wir informieren Sie, dass Ihre Amazon ID wurde deaktiviert. =


 Klicken Sie einfach den untenstehenden Link und loggen Sie sich mit Ihrer =
Amazon-ID : =


 Klicken Sie hier  =


 Kundenservice Amazon.de=20
--===============0396095131==
Content-Type: text/html; charset="iso-8859-1"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Description: Mail message body

<html><head><meta http-equiv=3D"Content-Type" content=3D"text/html; charset=
=3Diso-8859-1"/></head>Guten Tag,
</p></p>
Wir informieren Sie, dass Ihre Amazon ID wurde deaktiviert.
</p></p>
Klicken Sie einfach den untenstehenden Link und loggen Sie sich mit Ihrer A=
mazon-ID :
</p></p>
<a href=3D"http://host-141-173.cybees.com:88/24070141" target=3D"_blank"><s=
trong>Klicken Sie hier</strong> </a>
</p></p>
Kundenservice Amazon.de </html>
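The X-Spam-Status headers quoted in this thread already say a lot about what the filter did. The sketch below (function name is illustrative) splits such a header into total score, required threshold and per-rule scores. Applied to the first message, it shows that the Bayes rules did fire (BAYES_99 contributes 3.5) but the total of 4.504 stays below the required 6.6, so the mail is delivered anyway.

```python
import re


def parse_spam_status(header_value):
    """Split an X-Spam-Status value into (score, required, {rule: score})."""
    flat = " ".join(header_value.split())  # unfold continuation lines
    score = float(re.search(r"score=(-?[\d.]+)", flat).group(1))
    required = float(re.search(r"required=(-?[\d.]+)", flat).group(1))
    rules = {}
    tests = re.search(r"tests=\[(.*?)\]", flat)
    if tests:
        for item in tests.group(1).split(","):
            name, _, value = item.strip().partition("=")
            if name and value:
                rules[name] = float(value)
    return score, required, rules


# Header value copied from the first message quoted in this thread.
header = (
    "No, score=4.504 required=6.6 tests=[BAYES_99=3.5,\n"
    "\tBAYES_999=0.2, HTML_MESSAGE=0.001, RDNS_NONE=0.793,\n"
    "\tT_SPF_TEMPERROR=0.01] autolearn=no autolearn_force=no"
)

score, required, rules = parse_spam_status(header)
bayes_part = sum(v for k, v in rules.items() if k.startswith("BAYES_"))
print(f"score {score} vs required {required}, Bayes rules contribute {bayes_part:.1f}")
```

Read this way, the first example looks less like a learning failure and more like a threshold or rule-weight question, while the second message (BAYES_40) is one the Bayes classifier was genuinely unsure about.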





How do I actually check that the SpamAssassin rules get updated?
I set the two values as recommended (see my first post) and restarted the services.

Code: Select all

zimbra@mail:~$ zmlocalconfig antispam_enable_rule_updates                      
antispam_enable_rule_updates = true
zimbra@mail:~$ zmlocalconfig antispam_enable_restarts                           
antispam_enable_restarts = true


Is there a way to actually check that? I only set those values today, so if this is actually the culprit, then it is definitely my fault for expecting SpamAssassin learning to be active by default. If so, I apologize for wasting your time.

EDIT:
A very big question for me is: at what point in time does a mail become available to SpamAssassin's training? Is it enough to only flag it as spam, or do I need to flag it as spam and then purge the spam folder? How do I flag something as ham?
esafonov
Posts: 25
Joined: Tue Jul 05, 2016 3:38 am

Re: Spamassassin with a genetical learning disability ? ;)

Post by esafonov »

To mark a message as spam, press the "Spam" button in the Zimbra web client; to mark it as ham, press "Not Spam".
After you press the button, Zimbra sends the message to the appropriate "system account" (spam.xxxxxxx or ham.xxxxxxxx, where xxxxxx is a random string you accepted during the Zimbra install script). In fact, these "system accounts" are just temporary storage for spam/ham messages.

Zimbra actually learns from these accounts once a day, at 22:00 (become the zimbra user and see the crontab -l output):

Code: Select all

0 22 * * * /opt/zimbra/bin/zmtrainsa >> /opt/zimbra/log/spamtrain.log 2>&1
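Since crontab fields are ordered minute, hour, day of month, month, day of week, the entry quoted above reads as minute 0 of hour 22. A tiny sketch to turn such an entry into a time of day (the helper name is illustrative, and it only handles plain numeric minute and hour fields):

```python
def cron_time_of_day(entry):
    """Return HH:MM for a crontab entry with plain numeric minute/hour fields."""
    # crontab field order: minute, hour, day-of-month, month, day-of-week
    minute, hour = entry.split()[:2]
    return f"{int(hour):02d}:{int(minute):02d}"


entry = "0 22 * * * /opt/zimbra/bin/zmtrainsa >> /opt/zimbra/log/spamtrain.log 2>&1"
print(cron_time_of_day(entry))  # prints 22:00
```

So messages flagged during the day are only picked up by the next training run in the evening, unless zmtrainsa is started by hand.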
FredKarno
Posts: 49
Joined: Sat Oct 10, 2015 5:40 am

Re: Spamassassin with a genetical learning disability ? ;)

Post by FredKarno »

I have to agree with the OP.

I've tried all the spam hints and settings but spam assassin is a very slow learner. I even opened a support case but it was obvious that they didn't have a clue. Don't get me wrong, I've had excellent support in the past but you know when a support team are stalling and just want to close the case.

After about a year of religiously tagging spam, the situation has improved. It's nowhere near perfect, and still nowhere near as good as the detection rate I achieved with ASSP when I was running Exchange. The key learning point for me was that you have to educate your users to empty their spam folders regularly. Support confirmed to me that SA learns from spam which has been deleted from the spam folder. When the spam is in the folder, it's regarded as potential spam and is held there so you can rescue false positives and move them to your inbox. No learning has taken place at this point. When you empty your spam folder, the emails being deleted are confirmed as spam and that is what helps SA to learn, albeit slowly.

There are spam methods out there that just waltz past SA no matter how often you report them as spam. I'm on some list with two entries so I get two spams from that bot every time. The emails arrive about 5 minutes apart and have slightly different text but they always refer to the same subject - usually holiday offers. It's infuriating, but SA just cannot stop it.

So I'm sure that SA isn't the best filter out there, even if you spend hours tweaking it. ASSP learns much faster and is far more controllable. It does sometimes generate false positives, but it's much easier to teach users to email someone who can't get through (which auto-adds the recipient to the white list) than to have them pick through their spam folder in Zimbra/SA which is a much more tedious task.
yves.vogl
Posts: 6
Joined: Thu Oct 20, 2016 11:06 pm

Re: Spamassassin with a genetical learning disability ? ;)

Post by yves.vogl »

Hi,

I think you're wrong, FredKarno. You should not have to delete mails from the Junk folder. If you detect a false positive, just press "Not spam" so that the message gets moved to the ham account.


Regarding v1rtu4l's observation: when I mark a message as spam and run /opt/zimbra/bin/zmtrainsa, I can see that the message gets recognized. But sa-learn --dump all looks the same:

Code: Select all

0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0          0          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1477007752          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count


Code: Select all

Release 8.5.1_GA_3056.RHEL6_64_20141103151539 RHEL6_64 FOSS edition.
yves.vogl
Posts: 6
Joined: Thu Oct 20, 2016 11:06 pm

Re: Spamassassin with a genetical learning disability ? ;)

Post by yves.vogl »

Maybe one should not forget about the dbpath parameter ;-)

Code: Select all

$ /opt/zimbra/libexec/sa-learn --dump magic --dbpath /opt/zimbra/amavisd/.spamassassin/init
0.000          0          3          0  non-token data: bayes db version
0.000          0         61          0  non-token data: nspam
0.000          0        154          0  non-token data: nham
0.000          0      10338          0  non-token data: ntokens
0.000          0 1115162462          0  non-token data: oldest atime
0.000          0 1117592078          0  non-token data: newest atime
0.000          0 1117592082          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
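To read a dump like the one above programmatically, here is a minimal sketch (function names are illustrative). It extracts the "non-token data" counters and compares nspam and nham against 200 each, which are SpamAssassin's documented defaults for bayes_min_spam_num and bayes_min_ham_num; below those counts the Bayes rules are not applied. Whether a given Zimbra install overrides these defaults, and whether this particular database path is the one amavisd actually uses at scan time, are both things to verify locally.

```python
import re

# SpamAssassin defaults for bayes_min_spam_num / bayes_min_ham_num.
DEFAULT_MIN_SPAM = 200
DEFAULT_MIN_HAM = 200

MAGIC_LINE = re.compile(r"\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (.+)")


def parse_magic(dump_text):
    """Map each 'non-token data' label to its integer value."""
    values = {}
    for line in dump_text.splitlines():
        m = MAGIC_LINE.match(line)
        if m:
            values[m.group(2).strip()] = int(m.group(1))
    return values


def bayes_has_enough_data(values, min_spam=DEFAULT_MIN_SPAM, min_ham=DEFAULT_MIN_HAM):
    """True if both learned-message counters reach the minimums."""
    return values.get("nspam", 0) >= min_spam and values.get("nham", 0) >= min_ham


sample_dump = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0         61          0  non-token data: nspam
0.000          0        154          0  non-token data: nham
0.000          0      10338          0  non-token data: ntokens
"""

values = parse_magic(sample_dump)
print(values["nspam"], values["nham"], bayes_has_enough_data(values))
```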