
IPS spam service is harmful


Colonel_mortis


I run a large site, so I have a lot of legitimate members and also receive a fairly significant number of spammers. Several months ago I turned the IPS spam service's blocking off (but continued to collect members' spam scores), due to the volume of requests from legitimate people who were being caught by the spam filter. I've run some statistics on registrations over the past month, comparing whether each member has been flagged as a spammer (FASed) against the code that the IPS spam service gave them. Being flagged is very closely correlated with actually being a spammer, although this may not be entirely true; see my post later in the topic. The results are:

+-------+------+----------+
| FASed | code | count    |
+-------+------+----------+
|     0 | 0    |       77 |
|     0 | 1    |     5657 |
|     0 | 2    |        3 |
|     0 | 3    |        5 |
|     0 | 4    |      147 |
|     0 | null |      233 |
|     1 | 0    |        3 |
|     1 | 1    |      616 |
|     1 | 2    |       13 |
|     1 | 3    |       75 |
|     1 | 4    |      155 |
|     1 | null |        2 |
+-------+------+----------+

Breaking this down:

  • 12% of registrations are spammers
  • 6% of registrations receive a spam score > 1
  • 4% of registrations receive a spam score of 4

So far, these numbers don't seem unreasonable. However,

  • If someone receives a spam score of 4 ("user is a known spammer"), they have a 51% chance of actually being a spammer

A precision of 51% is totally useless.

  • If someone is actually a spammer, they have a 28% chance of receiving a spam score > 1.

That's a pretty shoddy recall too.

If I set my site to reject members with a spam score of 4, I will lose ~150 members to the spam filter each month; even if I use 2 as the threshold instead it will still only reduce spam by 28%. That's not an OK trade-off to me.
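For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the percentages above from the table. The counts are copied from the table; the variable names and the `total` helper are mine, and treating "FASed" as ground truth is the same assumption made in the post.

```python
# Registration counts copied from the table above: (FASed, code) -> count.
# FASed = 1 means the account was later flagged as a spammer.
counts = {
    (0, "0"): 77, (0, "1"): 5657, (0, "2"): 3,
    (0, "3"): 5, (0, "4"): 147, (0, "null"): 233,
    (1, "0"): 3, (1, "1"): 616, (1, "2"): 13,
    (1, "3"): 75, (1, "4"): 155, (1, "null"): 2,
}


def total(flagged=None, codes=None):
    """Sum counts, optionally filtered by FASed value and by a set of codes."""
    return sum(
        n
        for (fased, code), n in counts.items()
        if (flagged is None or fased == flagged)
        and (codes is None or code in codes)
    )


registrations = total()
spammers = total(flagged=1)
above_1 = {"2", "3", "4"}  # codes counted as "spam score > 1"

spam_rate = spammers / registrations
share_above_1 = total(codes=above_1) / registrations
share_code_4 = total(codes={"4"}) / registrations
# Precision: of the accounts given code 4, how many were flagged as spammers.
precision_code_4 = total(flagged=1, codes={"4"}) / total(codes={"4"})
# Recall: of the flagged spammers, how many received a code above 1.
recall_above_1 = total(flagged=1, codes=above_1) / spammers
# Unflagged accounts that a "reject code 4" rule would have turned away.
legit_blocked_at_4 = total(flagged=0, codes={"4"})

print(f"registrations:       {registrations}")
print(f"spam rate:           {spam_rate:.0%}")
print(f"score > 1:           {share_above_1:.0%}")
print(f"score of 4:          {share_code_4:.0%}")
print(f"precision at code 4: {precision_code_4:.0%}")
print(f"recall at score > 1: {recall_above_1:.0%}")
print(f"unflagged at code 4: {legit_blocked_at_4}")
```

The last figure is where the "~150 members" per month comes from.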

You may say that the answer to this is to set accounts to require admin validation instead when caught by the spam filter, and that's not unreasonable. However, I'm not logged into ACP every day, so I suspect this would result in the loss of a large portion of the potential registrations who will just go and ask their question on a different forum because they didn't want to wait. Furthermore, it's often not possible to tell whether the account is legitimate just based on the registration info, so that 28% hit rate is going to drop quite a lot more.

I appreciate that catching spam is a very hard problem. However, I believe these numbers demonstrate that the current system is not fit for purpose, at least with the level of confidence that you currently assign to it ("member is a known spammer" in the config page is a long way from the truth, and "certain spammer" from your marketing materials is an outright lie).

Edited by Colonel_mortis

On 11/19/2020 at 4:53 PM, Colonel_mortis said:

I appreciate that catching spam is a very hard problem. However, I believe these numbers demonstrate that the current system is not fit for purpose, at least with the level of confidence that you currently assign to it ("member is a known spammer" in the config page is a long way from the truth, and "certain spammer" from your marketing materials is an outright lie).

Thank you for this. We operate a site that is similarly sized to the one you are associated with, and have had similar suspicions.

We had far better success by focusing our efforts on the source of registrations. Once we looked at the ASNs associated with known spam registrants, we found that the majority of our spam registrations came from particular network segments. We used our CDN's firewall feature to handle registration attempts from those segments, and this has quite significantly reduced the number of spammers getting through. It's a rare occasion now.
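As a rough illustration of that kind of analysis, here is a minimal sketch that ranks ASNs by the share of their registrations that turned out to be spam. It is not the poster's actual tooling: the `rank_asns_by_spam` function, the `min_total` cut-off, and the sample data are all hypothetical, and looking up the ASN for a registration IP is assumed to have been done elsewhere.

```python
from collections import Counter


def rank_asns_by_spam(registrations, min_total=20):
    """Return (asn, spam_count, total_count, spam_share) rows, worst first.

    `registrations` is an iterable of (asn, is_spam) pairs. ASNs with fewer
    than `min_total` registrations are skipped, because a share computed
    from a handful of sign-ups is too noisy to act on.
    """
    totals, spam = Counter(), Counter()
    for asn, is_spam in registrations:
        totals[asn] += 1
        if is_spam:
            spam[asn] += 1
    rows = [
        (asn, spam[asn], n, spam[asn] / n)
        for asn, n in totals.items()
        if n >= min_total
    ]
    # Sort by spam share, breaking ties by absolute spam count.
    return sorted(rows, key=lambda row: (row[3], row[1]), reverse=True)


# Made-up data using AS numbers reserved for documentation (RFC 5398).
sample = (
    [("AS64500", True)] * 45
    + [("AS64500", False)] * 5
    + [("AS64501", True)] * 2
    + [("AS64501", False)] * 98
)

for asn, spam_n, n, share in rank_asns_by_spam(sample):
    print(f"{asn}: {spam_n}/{n} registrations were spam ({share:.0%})")
```

ASNs near the top of such a list would be the candidates for a challenge or block rule at the CDN firewall; whether to challenge or block outright depends on how many legitimate members share those networks.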


Actually, looking further into the members who were caught by the spam service but weren't flagged as spammers, there are several who have made 0 posts (and thus weren't caught in my previous audit) but who are likely to be actual spammers based on their profile information. Based on the sample that I checked, the false positive rate is still too high to be useful, but it is not as high as I had originally thought.


  • Management
On 11/21/2020 at 5:35 PM, Colonel_mortis said:

Actually, looking further into the members who were caught by the spam service but weren't flagged as spammers, there are several who have made 0 posts (and thus weren't caught in my previous audit) but who are likely to be actual spammers based on their profile information. Based on the sample that I checked, the false positive rate is still too high to be useful, but it is not as high as I had originally thought.

You originally quoted 51% as false positives. What do you think it could be now?

Trying to capture spam is a constantly evolving process. Trends wax and wane. We tweak things to account for this but we can only do so after a trend has established itself. Given the size of your community and the number of registrations you get, I'd love to know more about your data and look at our capturing system to increase its accuracy.


On 11/21/2020 at 11:35 AM, Colonel_mortis said:

but who are likely to be actual spammers based on their profile information.

I'd say there is likely a 50/50 split between spam registrants who simply create a profile with links, in an attempt at SEO or link-count nonsense, and those who actually attempt to post.

There is a need to be able to moderate links in profile fields. We turn off profile viewing for guests as one way to mitigate this, but it's less than ideal.


10 hours ago, Matt said:

You originally quoted 51% as false positives. What do you think it could be now?

Trying to capture spam is a constantly evolving process. Trends wax and wane. We tweak things to account for this but we can only do so after a trend has established itself. Given the size of your community and the number of registrations you get, I'd love to know more about your data and look at our capturing system to increase its accuracy.

Of the 147 members who were classified as 4 but not FASed, about 30 are likely spam accounts (based on manual classification by one of my moderators), which gives a precision of 61% (39% false positive rate) when classifying based on a score of 4.
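The revised figure can be reproduced with a few lines. This is only a sketch of the arithmetic: the counts come from the original table and the "about 30" estimate above, and the variable names are mine.

```python
code_4_flagged = 155        # code 4 and FASed, from the original table
code_4_not_flagged = 147    # code 4 but not FASed, from the original table
likely_spam_unflagged = 30  # "about 30", from the moderator's manual review

code_4_total = code_4_flagged + code_4_not_flagged

# Original estimate: only FASed accounts count as true positives.
original_precision = code_4_flagged / code_4_total
# Revised estimate: the manually identified accounts count as spam too.
revised_precision = (code_4_flagged + likely_spam_unflagged) / code_4_total

print(f"original precision at code 4: {original_precision:.0%}")
print(f"revised precision at code 4:  {revised_precision:.0%}")
print(f"code 4 but likely legitimate: {1 - revised_precision:.0%}")
```

Because the 30 is an estimate from a manual check, the revised precision is approximate too.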

I'd be happy to share some more detailed data with you if there's anything that you think would be helpful - feel free to reach out by ticket/email/PM(/slack), whatever is easiest.


  • Management
On 11/23/2020 at 10:33 PM, Colonel_mortis said:

Of the 147 members who were classified as 4 but not FASed, about 30 are likely spam accounts (based on manual classification by one of my moderators), which gives a precision of 61% (39% false positive rate) when classifying based on a score of 4.

I'd be happy to share some more detailed data with you if there's anything that you think would be helpful - feel free to reach out by ticket/email/PM(/slack), whatever is easiest.

Cheers Jack, I'll do that.


  • 2 months later...

I know this is a little old but the discussion needs to continue.

I have a fairly busy board and have found that the spam prevention hasn't been stopping much at all, given the spam attacks we've recently been getting. CleanTalk is fairly useless now too, and I'm relying more heavily on my WAF on CloudFlare.

So, with this in mind, I'm worried about disabling spam prevention, but I have changed all codes > 1 to moderate the registration, so I'll keep an eye on it and see what happens.

Note that 2 & 3 seem to never be triggered. 

For yesterday (17/2), there were 48 total registration attempts:

  • 1 x code 3
  • 32 x code 4
  • 15 x code 1
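To put that day in terms of the "moderate everything above 1" setting, here is a small sketch of the arithmetic. The counts are the ones listed above; the variable names are mine.

```python
# Registration attempts for the day, keyed by spam service code.
attempts_by_code = {1: 15, 3: 1, 4: 32}

attempts = sum(attempts_by_code.values())
# Codes above 1 are the ones now set to moderate the registration.
moderated = sum(n for code, n in attempts_by_code.items() if code > 1)

print(f"total attempts: {attempts}")
print(f"sent to moderation: {moderated} ({moderated / attempts:.0%})")
```

So roughly two thirds of that day's attempts would have needed a manual decision.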

 


  • 1 year later...

I'm a bit late, but I have some updated numbers based on the past 3 months of registrations. In the latest data:

  • Precision (proportion of users with a spam score >1 that are actually spammers) is 81%, which is up significantly on the 51% from 2020
  • Recall (proportion of spammers that get a spam score >1) is 24%, which is down slightly (but probably within margin of error) on the 28% from 2020

However, the number of spammers relative to actual users has increased since the data in the original post, which means the precision is not comparable. Correcting for that changes the precision to 66%, which is still a decent improvement over 2020.
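The post doesn't say how the correction was done. One common approach, sketched here under the assumption that the filter's recall and false positive rate stay fixed while only the mix of spammers and legitimate users changes, is to rescale the precision odds by the ratio of the prevalence odds. The function names and the 25% spam rate in the example are hypothetical placeholders, since the newer spam rate isn't given.

```python
def odds(p):
    """Convert a probability strictly between 0 and 1 into odds."""
    return p / (1 - p)


def precision_at_prevalence(precision, prevalence, target_prevalence):
    """Re-express a precision measured at one spam rate at another spam rate.

    With recall and false positive rate held constant, the odds that a
    caught account is spam are proportional to the odds that any account
    is spam, so the precision odds scale with the prevalence odds.
    """
    adjusted_odds = odds(precision) * odds(target_prevalence) / odds(prevalence)
    return adjusted_odds / (1 + adjusted_odds)


# 81% is the precision quoted above and 12% is the 2020 spam rate.
# The 25% spam rate for the newer data is a made-up placeholder.
adjusted = precision_at_prevalence(0.81, prevalence=0.25, target_prevalence=0.12)
print(f"precision re-expressed at a 12% spam rate: {adjusted:.0%}")
```

The intuition is that when a larger share of registrations is spam, anything the filter catches is more likely to be spam even if the filter itself hasn't improved, so the raw precision overstates the improvement.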

The biggest change for us though is that the spam service can mod queue members rather than blocking their registration outright, which means the sub-optimal precision is not as important as it was. This, combined with a collection of banned words, banned links, and custom spam rules (via a custom plugin), has meant that over 90% of the spam posts were mod queued before being posted.

