Rspamd: Fast, free and open-source spam filtering system

phoenix
Ambassador
Posts: 27262
Joined: Fri Sep 12, 2014 9:56 pm
Location: Liverpool, England

Re: Rspamd: A replacement for Spamassassin & Postscreen

Post by phoenix »

I must say that I haven't seen any of those problems, but then I'm not running a business mail server. :) I've left my zmtrainsa scripts at the initial install frequency, once per day, and I would have thought that five-minute intervals would cripple a system with a lot of users who get a ton of spam. Did you use the SpamAssassin rules in rspamd or just its default configuration? I haven't seen any DNS problems on my server despite the high number of lookups it's doing. I assume you use a caching name server on your network, or is this just using ZCS dnscache? I haven't seen anything about IPv6 in any of my log files and I'm just about to put my IPv6 record in DNS for the ZCS server, so I'll keep an eye open for any problems on that front.
Regards

Bill

Rspamd: A high performance spamassassin replacement

Per ardua ad astra
JDunphy
Outstanding Member
Posts: 883
Joined: Fri Sep 12, 2014 11:18 pm
Location: Victoria, BC
ZCS/ZD Version: 9.0.0_P39 NETWORK Edition

Post by JDunphy »

Something isn't right on my test server, so I need to dive deeper into the training script to understand why. I followed your script verbatim and everything appears to be working out of the box. My configuration is vanilla with the exception of adjusting the scores required for spam and reject. I would think our configurations are identical given that I followed your instructions, including not messing with the configuration. :-) I did upgrade the software to the latest on centos 6 given the comments by vstakhov about "bugfixes especially on the tokenization". That did make a difference in our experience.

As an aside, I am not a fan of rejecting spam back to the sender, which I think is the default action. That isn't a problem for a single zimbra/rspamd installation, but if you have front-ends that didn't block it, or you want to deliver it to another system to score later, then you are bouncing back to yourself for re-delivery. So my belief is that in production we would run this on the front-ends and not on the zimbra mailbox server. For testing it works because I just want to learn to trust the software and its decision process.

I like the ideas in rspamd but believe html5 will own rspamd and SA with html obfuscation in the future. Our view is that rspamd feels like a super grep, so the rules are very simplistic at present. In fact, I don't even know how to get rule/score updates without updating the software automatically. Hmm... That realization says I am not very far along. LOL

This might be the best explanation: Here is the entire message in the body.

Code:

SGkgbWFpbGVyLWRhZW1vbiwgbXkgbmFtZSBFbGVuYSBhbmQgaSdtIGZyb20gUnVzc2lhLg0KTm93
IGknbSBsaXZpbmcgaW4gdGhlIFVTLg0KQSB3ZWVrIGFnbyBJIGZvdW5kIHlvdXIgcHJvZmlsZSBv
biBCYWRvbyBvciBzb21ldGhpbmcgbGlrZSB0aGF0IDotKQ0KSSBkb250IHJlbWVtYmVyIHRvIGJl
IGhvbmVzdCA6LSkNCllvdSBhcmUgc3VwZXIgY3V0ZSENCk1heWJlIHdlIGNhbiBjaGF0PyBZb3Ug
d2FudCBteSBwaG90b3M/DQpQbGVhc2UgZW1haWwgbWUgdG8gYWRyZ2VyZGFvaDNAcmFtYmxlci5y
dQ0KDQpYb1hvWG8sIEVsZW5h
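
For anyone who wants to read the text behind that base64 body, a minimal Python sketch using only the standard-library `base64` module is enough. Only the first line of the body is reproduced here to keep it short, and the variable names are purely illustrative.

```python
import base64

# First line of the base64 body quoted above (76 characters, so no padding is needed).
first_line = "SGkgbWFpbGVyLWRhZW1vbiwgbXkgbmFtZSBFbGVuYSBhbmQgaSdtIGZyb20gUnVzc2lhLg0KTm93"

# Decode to bytes, then to text; the body is plain ASCII with CRLF line endings.
decoded = base64.b64decode(first_line).decode("utf-8")

# repr() makes the embedded "\r\n" visible instead of printing a line break.
print(repr(decoded))
```

The decoded text starts with "Hi mailer-daemon, my name Elena and i'm from Russia." The same call should work on the whole body once the six lines are joined together.
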
SA scored this as 13.318:

Code:

X-Spam-Flag: YES
X-Spam-Score: 13.318
X-Spam-Level: *************
X-Spam-Status: Yes, score=13.318 required=4.8 tests=[BAYES_60=1.5,
	BLACKLIST_COUNTRY=2.5, BL_ZEN_SPAMHAUS=4, FOUND_YOU=1,
	HELO_MISC_IP=0.25, J_RCVD_IN_TRUNCATE=2, J_URIBL_WHITE=-0.1,
	MIME_BASE64_TEXT=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793]
	autolearn=no autolearn_force=no
and rspamd scored it as 3:

Code:

X-Spamd-Result: default: False [3.00 / 150.00]
 RECEIVED_SPAMHAUS_XBL(3.00)[116.41.143.185.zen.spamhaus.org : 127.0.0.4]
 RCVD_NO_TLS_LAST(0.00)[]
 R_SPF_NEUTRAL(0.00)[?all]
 ASN(0.00)[asn:47066, ipnet:72.29.127.0/24, country:US]
 ARC_NA(0.00)[]
 RCPT_COUNT_ONE(0.00)[1]
 TO_DN_NONE(0.00)[]
 RCVD_COUNT_ONE(0.00)[2]
 FORGED_RECIPIENTS_FORWARDING(0.00)[]
 DMARC_NA(0.00)[laradorbecker.com]
 FROM_HAS_DN(0.00)[]
 MIME_GOOD(-0.10)[text/plain]
 R_DKIM_NA(0.00)[]
 PREVIOUSLY_DELIVERED(0.00)[mailer-daemon@example.com]
 MIME_BASE64_TEXT(0.10)[]
 FROM_EQ_ENVFROM(0.00)[]
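
To see where the two totals come from, the per-rule scores in both headers can be pulled apart and summed. The following is only a sketch: the header text is copied from the two results above (zero-score rspamd symbols are left out), while the function names and regular expressions are illustrative assumptions, not part of SA or rspamd.

```python
import re

# Header values copied from the two scan results above.
sa_status = """Yes, score=13.318 required=4.8 tests=[BAYES_60=1.5,
    BLACKLIST_COUNTRY=2.5, BL_ZEN_SPAMHAUS=4, FOUND_YOU=1,
    HELO_MISC_IP=0.25, J_RCVD_IN_TRUNCATE=2, J_URIBL_WHITE=-0.1,
    MIME_BASE64_TEXT=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793]"""

# Only the symbols with a non-zero score are kept here.
rspamd_result = """default: False [3.00 / 150.00]
 RECEIVED_SPAMHAUS_XBL(3.00)[116.41.143.185.zen.spamhaus.org : 127.0.0.4]
 MIME_GOOD(-0.10)[text/plain]
 MIME_BASE64_TEXT(0.10)[]"""


def sa_scores(status):
    """Map SA rule name -> score from the tests=[...] part of X-Spam-Status."""
    tests = re.search(r"tests=\[(.*?)\]", status, re.S).group(1)
    return {name: float(val) for name, val in re.findall(r"(\w+)=(-?[\d.]+)", tests)}


def rspamd_scores(result):
    """Map rspamd symbol -> score from the SYMBOL(score)[...] lines."""
    pairs = re.findall(r"^\s*(\w+)\((-?[\d.]+)\)", result, re.M)
    return {name: float(val) for name, val in pairs}


sa = sa_scores(sa_status)
rs = rspamd_scores(rspamd_result)

print("SA total:    ", round(sum(sa.values()), 3))
print("rspamd total:", round(sum(rs.values()), 2))

# Largest SA contributors first, to show which rules drive the difference.
for name, score in sorted(sa.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{score:6.3f}  {name}")
```

The sums reproduce the reported totals (13.318 and 3.00). The three largest SA contributors are BL_ZEN_SPAMHAUS, BLACKLIST_COUNTRY and J_RCVD_IN_TRUNCATE, while on the rspamd side a single RBL hit carries the whole score.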
 
But I digress. There is no reason that rspamd couldn't work better once we customize some of the rules for our needs.

Our load factor appears to be stable at 0.0 or 0.1 for the past few hours ... Because it was not a problem for the first 4+ weeks, I eventually need to track the root cause but we are in no rush to production here. Bind/named is running locally on the zimbra server so no I am not running the zimbra dns stuff. I have my resolv.conf pointing to 127.0.0.1 and I haven't looked to see if rspamd even uses it. Changing the socket pool size doesn't appear to be involved here. I probably would have missed the high cpu over 2 hours but got lucky that the test environment has some threshold reporting for this server. It doesn't happen very often but that it can happen means I need to understand why.

Note: this guess about bayes training HAS NOT been confirmed as the root cause of why rspamd shows Running and 99% CPU. That it has happened a few times (3-4) over the past 7-8 days might be an important clue. Normally, rspamd stays in a sleeping state, which is what I expect on such an idle machine.

Another idea: we haven't cleaned the junk folder. We observed our users not doing that in production, so we have left every message on the zimbra+rspamd instance. Is it safe to assume that this behavior combined with the zimbra training script doesn't cause a slowdown over time?

Here is a typical run:

Code:

20171120053351 List rspam stats after training.
Results for command: stat (0.157 seconds)
Messages scanned: 18004
Messages with action reject: 18, 0.10%
Messages with action soft reject: 0, 0.00%
Messages with action rewrite subject: 0, 0.00%
Messages with action add header: 11982, 66.55%
Messages with action greylist: 2, 0.01%
Messages with action no action: 6002, 33.34%
Messages treated as spam: 12000, 66.65%
Messages treated as ham: 6004, 33.35%
Messages learned: 995069
Connections count: 0
Control connections count: 29884
Pools allocated: 30008
Pools freed: 29990
Bytes allocated: 542k
Memory chunks allocated: 41
Shared chunks allocated: 10
Chunks freed: 0
Oversized chunks: 6111
Fuzzy hashes in storage "rspamd.com": 43432391
Fuzzy hashes stored: 43432391
Statfile: BAYES_SPAM type: sqlite3; length: 59.57M; free blocks: 0; total blocks: 702k; free: 0.00%; learned: 1464; users: 1; languages: 4
Statfile: BAYES_HAM type: sqlite3; length: 5.80M; free blocks: 0; total blocks: 69.37k; free: 0.00%; learned: 96; users: 1; languages: 1
Total learns: 1560

20171120053351 Finished rspamd training.
vstakhov
Posts: 7
Joined: Sat Sep 09, 2017 12:40 pm

Post by vstakhov »

Well, it is quite common that if you do strange things you get strange results. First of all, Rspamd is not SA and you should not expect it to work like SA. Why do you need to relearn the existing messages so often? It is trivial to check mtime and skip files that have not been updated, using a simple `find` pipeline.
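
The mtime check described here can be sketched in a few lines. This is a hypothetical helper, not part of zmtrainsa or rspamd: the function name, the directory layout and the idea of storing the previous run time as epoch seconds are all assumptions made for illustration.

```python
from pathlib import Path


def files_updated_since(directory, last_run):
    """Return regular files under `directory` whose mtime is newer than `last_run`.

    `last_run` is the time of the previous training run in epoch seconds.
    """
    return sorted(
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_mtime > last_run
    )
```

A training job could record the time of each run, call this helper on the next run, and hand only the returned files to `rspamc learn_spam` or `rspamc learn_ham`, so messages that were already learned are not submitted again.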

Secondly, if you use the default sqlite3 backend for statistics, then I have bad news: it is terrible and it won't ever be fixed. The only reason it is still the default is that I'm trying to be conservative with the default settings. For all new setups Redis is *strongly* recommended (and it is mentioned in the FAQ). Starting with the upcoming Rspamd 1.7, Redis will be the default backend and sqlite will eventually be deprecated.

Rspamd's HTML parser is NOT an advanced grep as you have claimed: it is a heuristic HTML *parser*, meaning that it is aware of HTML semantics, tags, encodings and so on. However, it does lack CSS support, and I have some samples from the wild where CSS tricks were used to poison statistical methods. In the future, I plan to implement some sort of CSS parsing to stop that.

Finally, in your comparison of the message scan results, the only meaningful rule is BLACKLIST_COUNTRY, which definitely involves some custom configuration. This is also possible with Rspamd via the multimap module, which can blacklist countries, specific ASNs and so on, with dynamic map support and other features. I understand that this knowledge is a bit hidden inside the Rspamd documentation, but this could be improved in the future (and I would appreciate any help with this task).

Post by JDunphy »

First, thank you for your reply. We are following Bill's configuration at present because this is a Zimbra forum, and we haven't deviated from those instructions. We call a training script via cron, as is the zimbra practice. I haven't looked at it to know whether it only submits new messages or resubmits everything. My guess and hope is that it only trains on new messages, given users' practice of not deleting junk folders. I believe in Zimbra we move it to a special spam account which is trained to get around this problem. We have not claimed the script to be the root cause, but it uncovered an edge case that we need to understand for production services. I also saw an outgoing dns lookup spike via ipv6 during the same time window, which I am still looking into.
vstakhov wrote: Rspamd's HTML parser is NOT an advanced grep as you have claimed: it is a heuristic HTML *parser*, meaning that it is aware of HTML semantics, tags, encodings and so on. However, it does lack CSS support, and I have some samples from the wild where CSS tricks were used to poison statistical methods. In the future, I plan to implement some sort of CSS parsing to stop that.
My concern is also with html obfuscation, and CSS is only part of the problem ... but I would be more interested in your thoughts on html 5, since any tag can have all the attributes, i.e. fonts, colors, etc. It's complicated to determine inheritance... If you don't build a proper parse tree, then how do you know which objects inherited which attributes? SA only handles this a little, so we have been updating the code base to understand this more. Note: we are not convinced a full html 5 parser is the solution given the performance implications, but we have seen some really difficult targeted business email, and ip reputation isn't helping as much as we would like. I am glad to hear that you are focusing on bayes poisoning methods. As an aside, some of our patches have been accepted into the next release of SA. It's a huge problem.
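
The inheritance question can be illustrated with a toy example. The sketch below uses Python's standard `html.parser` and a stack of open elements to work out which colour each text run ends up with. It is an assumption-laden illustration only: it handles just the `color` attribute and inline `style="color: ..."`, ignores stylesheets entirely, and does not reflect how SA or rspamd actually parse HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class InheritedColorParser(HTMLParser):
    """Track the effective text colour by keeping a stack of open elements."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", "black")]  # (tag, effective colour)
        self.runs = []                     # (text, effective colour)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        own = attrs.get("color") or self._style_colour(attrs.get("style") or "")
        # A child without its own colour inherits the parent's effective colour.
        self.stack.append((tag, own.lower() if own else self.stack[-1][1]))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.runs.append((data.strip(), self.stack[-1][1]))

    @staticmethod
    def _style_colour(style):
        """Return the value of a `color:` declaration in an inline style, if any."""
        for declaration in style.split(";"):
            name, _, value = declaration.partition(":")
            if name.strip().lower() == "color":
                return value.strip()
        return None


parser = InheritedColorParser()
parser.feed('<div style="color: white"><p>hidden <b>filler</b></p></div><p>Visible offer</p>')
for text, colour in parser.runs:
    print(f"{colour:>5}  {text}")
```

Without the stack, the words inside `<p>` and `<b>` would look like ordinary visible text, because neither tag carries a colour of its own; the white colour only shows up once the parent `<div>` is taken into account.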

Post by phoenix »

Hi Jim

Sorry for my late reply. The changes I made to zmtrainsa were just to implement the same functionality using rspamc to train rspamd A/S rather than SA or the now defunct DSPAM. It's always been my understanding that the system Junk & Not Junk folders are automatically emptied during the two zmtrainsa cron jobs that run overnight. I don't believe that anything I've changed would affect that, and I've also confirmed it by manually looking at those two accounts in the Admin UI.
Regards

Bill

Post by JDunphy »

phoenix wrote:it's always been my understanding that the system Junk & Not Junk folders were automatically emptied during the two zmtrainsa cron jobs that run overnight.
Thanks Bill.

And there you have it! Oops. adding this now. :-)

Code:

/opt/zimbra/bin/zmtrainsa --cleanup >> /opt/zimbra/log/spamtrain.log 2>&1
I'm always amazed how stupid mistakes uncover operational knowledge and failure modes. I am not sure we have the root cause, because our alerts require 2 hours of pegged CPU before we get that warning email. When I ran your training script by hand it finished very quickly, which is the normal behavior from our observation. The output I posted showed the training finishing immediately even with a sqlite3 db and 7K-9K of training data. In production, one would train and clean up stepwise on whatever training size/frequency window works best for the hardware and user community... which for us is once per day LOL

Post by phoenix »

Unfortunately I'm not able to hammer my server/rspamd as it's just a personal mail server and therefore doesn't see much volume, but I'd be surprised if it borked under heavy load as it seems to be used by some large sites. FWIW, I did implement Redis on my server, although that probably doesn't make much difference for me. Yes, I'd think the training schedule is something each ZCS user would have to work out to suit themselves. I look forward to the final analysis of rspamd in your environment. Is your profile up to date and are you still using CentOS6? I tend to keep my servers on the most recent version of CentOS so I'm on the latest CentOS7 version but again, I wouldn't have thought that would make any difference. I'm glad that my modifications to zmtrainsa are working for you.

Have a good week-end, what's left of it. :)

{EDIT}I guess I should have asked if your test machine is also on CentOS6 or CentOS7.
Regards

Bill

Post by JDunphy »

phoenix wrote: is your profile up to date and are you still using CentOS6? I tend to keep my servers on the most recent version of CentOS so I'm on the latest CentOS7 version but again, I wouldn't have thought that would make any difference. I'm glad that my modifications to zmtrainsa are working for you.

{EDIT}I guess I should have asked if your test machine is also on CentOS6 or CentOS7.
Yes my profile is accurate and I am still on centos 6. It's stock which is what I am using for our rspamd trial. Probably centos 6 until 11/30/2020. :-) I've had a few centos7 machines in production for the past few years (DNS, openvpn access servers, owncloud, etc) but not for zimbra yet. centos 6 has earned a level of trust here. I haven't noticed much difference between centos 6 and 7 for how we use them. Both have been equally reliable for us. That the first UNIX source code I ever modified was version 7 init.c tells you how much inertia I could have toward systemd. ;-)

How do you see rspamd being used in a multi-host zimbra architecture where the MXs are not on the same machine as mailboxd? Do you see your zmtrainsa connecting remotely to the rspamd on the MXs, or some other method such as replicated redis, etc?

It would be kind of interesting if zimbra sites could access remote/external bayes dbs the way we currently do with blacklists. One could compare against different bayes dbs and score them individually and in aggregate for more accuracy alongside local training. I wonder how accurate this would be and at what performance/latency cost? It would be an interesting market for zimbra sites to profit from or share the accuracy of their users' system training. One would probably want to weight different sites to handle variations in accuracy. Would the training improve the poor systems over time, with fewer false positives and false negatives? It doesn't need to be perfect, but there have been times when I wished we had a little more statistical help with some scoring. Now, if we add a blockchain to this somewhere we can write our own ticket. LOL
Hmmm...
https://arxiv.org/abs/1512.09327
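
One way to picture the "score individually and in aggregate, weighted per site" idea is a weighted average of log-odds. This is purely a thought experiment: no such shared bayes service exists in Zimbra or rspamd as far as this thread shows, and the function name, the weights and the choice of log-odds averaging are all assumptions.

```python
import math


def combine_spam_probabilities(probabilities, weights, eps=1e-6):
    """Combine per-site spam probabilities into one, weighting each site.

    Each probability is converted to log-odds, the log-odds are averaged
    using the site weights, and the result is mapped back to a probability.
    """
    if not probabilities or len(probabilities) != len(weights):
        raise ValueError("need one weight per probability")
    total_weight = float(sum(weights))
    logit = 0.0
    for p, w in zip(probabilities, weights):
        p = min(max(p, eps), 1 - eps)  # avoid log(0) for overconfident sites
        logit += w * math.log(p / (1 - p))
    return 1 / (1 + math.exp(-logit / total_weight))


# Local db says 0.9, two remote dbs disagree; the local db is trusted most.
print(round(combine_spam_probabilities([0.9, 0.6, 0.2], [3, 1, 1]), 3))
```

Sites with a poor track record would simply get a smaller weight, which is one possible reading of "weight different sites to handle variations in accuracy".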
MisterM75
Advanced member
Posts: 77
Joined: Sat Aug 05, 2017 7:10 am

Post by MisterM75 »

I love your friends conversation ...

@phoenix

I get a ton of phishing messages, and when I move them to my Junk folder they keep coming back. Any ideas?

Especially since I added extra filters to fight this via clamav:

https://wiki.zimbra.com/wiki/Clamav_unofficial_sigs

Hey that does not matter ...

Are you really sure that clamav is integrated with rspamd via Zimbra?

Another thing is Zimbra's dev.

Question about this :

https://rspamd.com/doc/modules/multimap.html

I see that we do it via a db

I'm talking about this:
# local.d/multimap.conf
reject_content {
  type = "content";
  filter = "text";
  map = "${LOCAL_CONFDIR}/local.d/content.map";
  symbol = "REJECT_CONTENT";
  prefilter = true;
  action = "reject";
  regexp = true;
}
More or less, could one imagine that when a message is moved to the Junk folder, Zimbra's "zmtrainsa" script writes it into a database, giving something like this?
# local.d/multimap.conf
reject_content {
  type = "content";
  filter = "text";
  map = "cdb:${LOCAL_CONFDIR}/local.d/spam.cdb";
  symbol = "REJECT_CONTENT";
  prefilter = true;
  action = "reject";
  regexp = true;
}
Mz
10424bofh
Outstanding Member
Posts: 285
Joined: Sat Sep 13, 2014 1:15 am

Post by 10424bofh »

just after a quick glance

it seems rspamd is a copycat of dspam on steroids with a lot of addons, or at least does many things like dspam.
oh on that note, no, it is dspam, at least they took most of the code, renamed the bins and even left the training flags the same, are you kidding me
not even an official fork, wtf
well in that case it will not work on zimbra for the same reason dspam never did work on zimbra (not with a lot of additional help)


alright, what im about to write is mainly about the statistical module. the policy modules and most other stuff are unaffected and should work as expected.

see, those self learning systems need 2 things: a lot of random data and a lot of accurate data.
so you need to train em with clean hams and spicy jams
also the question is whether you wanna filter serverwide, userbased or a combo of both (baseline serverwide + user)

in any case the more homogeneous your userbase is the better.
if youre an email provider with hundreds or thousands of different users, well, autolearn wont work as well as you would anticipate.

the key factor to make it work is a working connection between learning and the spam folder.
if zimbra informs the training service about every change (moved into or out of spam) then you have a chance, well it doesnt and never did properly.
rspamd (dspam) needs to be informed about mail moving out of or into the spam folder to properly unlearn wrongly learned stuff
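
Why unlearning matters can be shown with a deliberately tiny token counter. This is a toy model written for this example only; it is not how rspamd or dspam store or weigh tokens, and every name in it is made up.

```python
from collections import Counter


class ToyBayes:
    """Per-class token counts with learn, unlearn and relearn operations."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = Counter()

    def learn(self, text, label):
        self.tokens[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def unlearn(self, text, label):
        # Remove exactly what learn() added for this message.
        self.tokens[label].subtract(set(text.lower().split()))
        self.messages[label] -= 1

    def relearn(self, text, old_label, new_label):
        """A message was moved between folders: undo the old label first."""
        self.unlearn(text, old_label)
        self.learn(text, new_label)

    def spamminess(self, token):
        """Smoothed share of spam messages containing `token`, between 0 and 1."""
        spam = (self.tokens["spam"][token] + 1) / (self.messages["spam"] + 2)
        ham = (self.tokens["ham"][token] + 1) / (self.messages["ham"] + 2)
        return spam / (spam + ham)


bayes = ToyBayes()
bayes.learn("cheap pills now", "spam")
bayes.learn("lunch tomorrow", "ham")
bayes.learn("meeting invoice attached", "spam")   # user hit Junk by mistake
before = bayes.spamminess("invoice")

# The user drags it back out of Junk; the trainer must be told about the move.
bayes.relearn("meeting invoice attached", "spam", "ham")
after = bayes.spamminess("invoice")
print(before > 0.5, after < 0.5)
```

If the move out of the Junk folder is never reported, only the first `learn` call happens and "invoice" keeps counting towards spam.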


second, about server load: bad news, but it can be hefty if its trained properly. even a small database easily has 400 megs
and it will take its toll at every message.

and yes DO NOT EVER USE ANY SQL BACKEND period.
it wont work, cant work. the dataload is simply too big, you need to use the hash database

but here is the next quirk. one thing never resolved in dspam: there are a lot of issues with that hash database
so be prepared for one or more tears.


now if you wanna get it working well, there is no "make a few adjustments in the config and it will run".
first you should learn what it actually does, learn about the tokenizer and the classifiers.
i recommend the osb tokenizer but you need a lot of data there.
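
For readers who have not met it, OSB stands for orthogonal sparse bigrams: every word is paired with each of the few words that follow it, and the number of skipped words is kept as part of the feature. The sketch below shows only that pairing step, with an illustrative function name; real tokenizers also normalise and hash the features, which is left out here.

```python
def osb_tokens(words, window=5):
    """Pair each word with each of the next `window - 1` words.

    Each feature is (first_word, skipped_words, second_word), so the same
    two words at different distances produce different features.
    """
    features = []
    for i, first in enumerate(words):
        for distance in range(1, window):
            if i + distance >= len(words):
                break
            features.append((first, distance - 1, words[i + distance]))
    return features


for feature in osb_tokens("buy cheap pills online today".split(), window=3):
    print(feature)
```

With a window of 5, each word produces up to four features instead of one, which is one reason this kind of tokenizer needs a lot of training data.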

second, the classifiers are the problem described above. mixing those never really worked well.
depending on your userbase it wont ever

if youre setting up for a single entity (with similar types of mail) it will work if trained properly
since zimbra aint gonna help here you need to train yourself

also make use of the spamtraps. make heavy use of them. they work.


bottom line, it is in fact dspam what we see here in a new project.
with the same flaws.
so if you set up for one entity you can make it work, and if it does it can perform absolutely well


if youre setting up as an email provider, move on, never look back i dont think its worth the time youll need and the steady config adjustments to justify that
youll need an assload of classifiers and you need to adjust em to new customers, youll need a lot more training in total and at the end youll have a broken giant hash database and a lot of false positives.


the idea of dspam is awesome, on a single server for one entity it even works. but in bigger setups it would need a whole different approach and a couple years of development to really make it work.



ps: holy shit they even stole the term "neural networks" from our good old nuclear elephant