They can be difficult for sure. I wrote an image.pm module, but it works by observing tracking and structure rather than the image itself. SA has made quite a few attempts at this problem by default, and those could be used in targeted meta rules for your spam mix with better success. There is certainly no shortage of ideas out there for this problem. In recent years we have focused more on the HTML structure, tracking and obfuscation to target this.
Check out 20_html_tests.cf, which has the following tests that match on the amount of text alongside images and on the ratio of text to image area.
# HTML_IMAGE_ONLY - not much raw HTML with images (absolute)
body HTML_IMAGE_ONLY_04 eval:html_image_only('0000','0400')
body HTML_IMAGE_ONLY_08 eval:html_image_only('0400','0800')
body HTML_IMAGE_ONLY_12 eval:html_image_only('0800','1200')
body HTML_IMAGE_ONLY_16 eval:html_image_only('1200','1600')
body HTML_IMAGE_ONLY_20 eval:html_image_only('1600','2000')
body HTML_IMAGE_ONLY_24 eval:html_image_only('2000','2400')
body HTML_IMAGE_ONLY_28 eval:html_image_only('2400','2800')
body HTML_IMAGE_ONLY_32 eval:html_image_only('2800','3200')
describe HTML_IMAGE_ONLY_04 HTML: images with 0-400 bytes of words
describe HTML_IMAGE_ONLY_08 HTML: images with 400-800 bytes of words
describe HTML_IMAGE_ONLY_12 HTML: images with 800-1200 bytes of words
describe HTML_IMAGE_ONLY_16 HTML: images with 1200-1600 bytes of words
describe HTML_IMAGE_ONLY_20 HTML: images with 1600-2000 bytes of words
describe HTML_IMAGE_ONLY_24 HTML: images with 2000-2400 bytes of words
describe HTML_IMAGE_ONLY_28 HTML: images with 2400-2800 bytes of words
describe HTML_IMAGE_ONLY_32 HTML: images with 2800-3200 bytes of words
# HTML_IMAGE_RATIO - more image area than text (ratio)
body HTML_IMAGE_RATIO_02 eval:html_image_ratio('0.000','0.002')
body HTML_IMAGE_RATIO_04 eval:html_image_ratio('0.002','0.004')
body HTML_IMAGE_RATIO_06 eval:html_image_ratio('0.004','0.006')
body HTML_IMAGE_RATIO_08 eval:html_image_ratio('0.006','0.008')
describe HTML_IMAGE_RATIO_02 HTML has a low ratio of text to image area
describe HTML_IMAGE_RATIO_04 HTML has a low ratio of text to image area
describe HTML_IMAGE_RATIO_06 HTML has a low ratio of text to image area
describe HTML_IMAGE_RATIO_08 HTML has a low ratio of text to image area
...
...
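As a rough sketch of the kind of targeted meta rule these stock tests allow, the following combines the lowest text buckets with the lowest ratio buckets. The LOCAL_IMG_HEAVY name, its description and its score are placeholders of mine, not stock rules, and the sketch assumes the HTML_IMAGE_ONLY_* and HTML_IMAGE_RATIO_* rules listed above are active in your installed ruleset.

```
# Hypothetical local meta rule; the name and score are placeholders to tune.
# Fires when there is very little text AND the text-to-image-area ratio is low.
meta     LOCAL_IMG_HEAVY ((HTML_IMAGE_ONLY_04 || HTML_IMAGE_ONLY_08) && (HTML_IMAGE_RATIO_02 || HTML_IMAGE_RATIO_04))
describe LOCAL_IMG_HEAVY Very little text next to a large image area
score    LOCAL_IMG_HEAVY 0.5
```

Keep the score low until you have checked it against your own mail, since the stock rules it references may already carry scores of their own.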
Some other rules inside HTMLEval.pm that you might use...
# the important bit!
$self->register_eval_rule("html_tag_balance");
$self->register_eval_rule("html_image_only");
$self->register_eval_rule("html_image_ratio");
$self->register_eval_rule("html_charset_faraway");
$self->register_eval_rule("html_tag_exists");
$self->register_eval_rule("html_test");
$self->register_eval_rule("html_eval");
$self->register_eval_rule("html_text_match");
$self->register_eval_rule("html_text_match_count");
$self->register_eval_rule("html_body_text_match_count");
$self->register_eval_rule("html_title_subject_ratio");
$self->register_eval_rule("html_text_not_match");
$self->register_eval_rule("html_range");
$self->register_eval_rule("check_iframe_src");
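For illustration, one of the eval functions registered above could be wired into local sub-rules roughly like this. This is only a sketch: it assumes html_tag_exists takes a single tag name as its argument, and the LOCAL_ rule names and the score are hypothetical.

```
# Sketch only: assumes html_tag_exists('<tag>') matches when that tag appears in the HTML.
body     __LOCAL_HAS_IMG      eval:html_tag_exists('img')
body     __LOCAL_HAS_IFRAME   eval:html_tag_exists('iframe')
# Hypothetical meta rule; score is a placeholder.
meta     LOCAL_IMG_AND_IFRAME (__LOCAL_HAS_IMG && __LOCAL_HAS_IFRAME)
describe LOCAL_IMG_AND_IFRAME HTML contains both an image and an iframe
score    LOCAL_IMG_AND_IFRAME 0.1
```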
I see that we added a meta rule in our salocal.cf ... the arguments to html_image_only are the min and max bytes of words, so the rule below matches images with at most 600 bytes of words.
# serious phishing attempts
body __HTML_IMAGE_ONLY_LOW eval:html_image_only('0000','0600')
meta J_IMAGE_PHISH (J_DANGEROUS_ATTACH && __HTML_IMAGE_ONLY_LOW)
score J_IMAGE_PHISH 2.5
describe J_IMAGE_PHISH using an image (and not more) in HTML to disguise a phishing attack. Has a dangerous attachment
You might be able to create some custom meta rules via ImageInfo. Check /opt/zimbra/common/lib/perl5/Mail/SpamAssassin/Plugin/ImageInfo.pm ... A lot of these would yield false positives by themselves, which is probably why they are not mainline rules anymore, but with the correct meta statements they might be useful. There are probably a few other rules in that directory you might try to use.
# Usage:
# image_count()
#
# body RULENAME eval:image_count(<type>,<min>,[max])
# type: 'all','gif','png', or 'jpeg'
# min: required, message contains at least this
# many images
# max: optional, if specified, message must not
# contain more than this number of images
#
# image_count() examples
#
# body ONE_IMAGE eval:image_count('all',1,1)
# body ONE_OR_MORE_IMAGES eval:image_count('all',1)
# body ONE_PNG eval:image_count('png',1,1)
# body TWO_GIFS eval:image_count('gif',2,2)
# body MANY_JPEGS eval:image_count('jpeg',5)
#
# pixel_coverage()
#
# body RULENAME eval:pixel_coverage(<type>,<min>,[max])
# type: 'all','gif','png', or 'jpeg'
# min: required, message contains at least this
# much pixel area
# max: optional, if specified, message must not
# contain more than this much pixel area
#
# pixel_coverage() examples
#
# body LARGE_IMAGE_AREA eval:pixel_coverage('all',150000) # catches any images that are 150k pixel/sq or higher
# body SMALL_GIF_AREA eval:pixel_coverage('gif',1,40000) # catches only gifs that are 1 to 40k pixel/sq
#
# image_name_regex()
#
# body RULENAME eval:image_name_regex(<regex>)
# regex: full quoted regexp, see examples below
#
# image_name_regex() examples
#
# body CG_DOUBLEDOT_GIF eval:image_name_regex('/^\w{2,9}\.\.gif$/i') # catches double dot gifs abcd..gif
#
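Putting the usage notes above together, a custom meta rule might look something like the sketch below. It assumes the ImageInfo plugin is loaded in your setup (hence the ifplugin guard); the LOCAL_ names, the 150000 pixel threshold and the score are illustrative placeholders, not tested values.

```
ifplugin Mail::SpamAssassin::Plugin::ImageInfo
  # Sub-rules (double underscore prefix, so they carry no score of their own)
  body     __LOCAL_ONE_IMAGE      eval:image_count('all',1,1)
  body     __LOCAL_BIG_IMAGE_AREA eval:pixel_coverage('all',150000)
  body     __LOCAL_IMG_LOW_TEXT   eval:html_image_only('0000','0600')

  # Hypothetical meta rule: exactly one large image and very little text
  meta     LOCAL_SINGLE_BIG_IMAGE (__LOCAL_ONE_IMAGE && __LOCAL_BIG_IMAGE_AREA && __LOCAL_IMG_LOW_TEXT)
  describe LOCAL_SINGLE_BIG_IMAGE One large image with very little HTML text
  score    LOCAL_SINGLE_BIG_IMAGE 0.5
endif
```

On its own this would likely hit legitimate newsletters too, so it is probably best combined with another signal, as in the J_IMAGE_PHISH rule earlier in this post.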