Colonel_mortis Posted November 19, 2020 (edited)

I run a large site, so I have a lot of legitimate members and receive a fairly significant number of spammers. I turned the spam filter off (but continued to collect members' spam scores) several months ago, due to the volume of requests by legitimate people who were being caught by it.

I've run some statistics based on registrations over the past month, comparing whether they have been flagged as a spammer (FASed), which is very closely correlated with whether they were actually a spammer (this may not be true - see my post later in the topic), and the code that the IPS spam service gave. The results are:

+-------+------+-------+
| FASed | code | count |
+-------+------+-------+
|     0 |    0 |    77 |
|     0 |    1 |  5657 |
|     0 |    2 |     3 |
|     0 |    3 |     5 |
|     0 |    4 |   147 |
|     0 | null |   233 |
|     1 |    0 |     3 |
|     1 |    1 |   616 |
|     1 |    2 |    13 |
|     1 |    3 |    75 |
|     1 |    4 |   155 |
|     1 | null |     2 |
+-------+------+-------+

Breaking this down:
- 12% of registrations are spammers
- 6% of registrations receive a spam score > 1
- 4% of registrations receive a spam score of 4

So far, these numbers don't seem unreasonable. However,
- If someone receives a spam score of 4 ("user is a known spammer"), they have a 51% chance of actually being a spammer. A precision of 51% is totally useless.
- If someone is actually a spammer, they have a 28% chance of receiving a spam score > 1. That's a pretty shoddy recall too.

If I set my site to reject members with a spam score of 4, I will lose ~150 members to the spam filter each month; even if I use 2 as the threshold instead, it will still only reduce spam by 28%. That's not an OK trade-off to me.

You may say that the answer to this is to set accounts to require admin validation instead when caught by the spam filter, and that's not unreasonable. However, I'm not logged into ACP every day, so I suspect this would result in the loss of a large portion of the potential registrations, who will just go and ask their question on a different forum because they didn't want to wait. Furthermore, it's often not possible to tell whether the account is legitimate just based on the registration info, so that 28% hit rate is going to drop quite a lot more.

I appreciate that catching spam is a very hard problem. However, I believe these numbers demonstrate that the current system is not fit for purpose, at least with the level of confidence that you currently assign to it ("member is a known spammer" in the config page is a long way from the truth, and "certain spammer" from your marketing materials is an outright lie).

Edited November 21, 2020 by Colonel_mortis
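For anyone who wants to check the percentages in the post above, here is a minimal Python sketch that recomputes them from the table. The counts are copied from the post; the helper names (`total`, `is_flagged`, `precision`, `recall`) are made up for illustration, and treating the `null` rows as "not flagged" is an assumption.

```python
# Counts copied from the table in the post: (FASed, code) -> count.
# None stands for the rows where the spam service returned null.
counts = {
    (0, 0): 77, (0, 1): 5657, (0, 2): 3, (0, 3): 5, (0, 4): 147, (0, None): 233,
    (1, 0): 3, (1, 1): 616, (1, 2): 13, (1, 3): 75, (1, 4): 155, (1, None): 2,
}


def total(pred):
    """Sum the counts of every (FASed, code) row that satisfies pred."""
    return sum(n for (fased, code), n in counts.items() if pred(fased, code))


def is_flagged(code, threshold):
    """Assumption: a null code is treated as not flagged by the spam service."""
    return code is not None and code >= threshold


all_regs = total(lambda f, c: True)
spammers = total(lambda f, c: f == 1)


def precision(threshold):
    """Share of accounts with code >= threshold that were FASed."""
    caught = total(lambda f, c: is_flagged(c, threshold))
    true_pos = total(lambda f, c: f == 1 and is_flagged(c, threshold))
    return true_pos / caught


def recall(threshold):
    """Share of FASed accounts that received code >= threshold."""
    true_pos = total(lambda f, c: f == 1 and is_flagged(c, threshold))
    return true_pos / spammers


print(f"registrations: {all_regs}, FASed: {spammers}")
print(f"spam share:          {spammers / all_regs:.0%}")
print(f"precision (code 4):  {precision(4):.0%}")
print(f"recall (code > 1):   {recall(2):.0%}")
```

A threshold of 2 corresponds to "spam score > 1" in the post, and a threshold of 4 to "spam score of 4".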
Jordan Miller Posted November 19, 2020

Me trying to understand if I should panic or not
jesuralem Posted November 20, 2020

What is surprising is that scores of 2 and 3 are more reliable than 4, as they have 13/16 and 75/80 actual spammers...
CoffeeCake Posted November 20, 2020

On 11/19/2020 at 4:53 PM, Colonel_mortis said:
> I appreciate that catching spam is a very hard problem. However, I believe these numbers demonstrate that the current system is not fit for purpose, at least with the level of confidence that you currently assign to it ("member is a known spammer" in the config page is a long way from the truth, and "certain spammer" from your marketing materials is an outright lie).

Thank you for this. We operate a site that is similarly sized to the one you are associated with, and have had similar suspicions.

We had far better success at addressing the issue by focusing our efforts on the source of registrations. For us, the majority of spam registrations came from particular network segments once we looked at the ASN associated with known spam registrants. We used our CDN's firewall feature to handle those registration attempts and have quite significantly reduced the number of spammers getting through. It's mostly a rare occurrence now.
Colonel_mortis Posted November 21, 2020 (Author)

Actually, looking further into the members who were caught by the spam service but weren't flagged as spammers, there are several who have made 0 posts (and thus weren't caught in my previous audit) but who are likely to be actual spammers based on their profile information. Based on the sample that I checked, the false positive rate is still too high to be useful, but it is not as high as I had originally thought.
Matt Posted November 23, 2020 (Management)

On 11/21/2020 at 5:35 PM, Colonel_mortis said:
> Actually, looking further into the members who were caught by the spam service but weren't flagged as spammers, there are several who have made 0 posts (and thus weren't caught in my previous audit) but who are likely to be actual spammers based on their profile information. Based on the sample that I checked, the false positive rate is still too high to be useful, but it is not as high as I had originally thought.

You originally quoted 51% as false positives; what do you think it could be now?

Trying to capture spam is a constantly evolving process. Trends wax and wane. We tweak things to account for this, but we can only do so after a trend has established itself.

Given the size of your community and the number of registrations you get, I'd love to know more about your data and look at our capturing system to increase its accuracy.
CoffeeCake Posted November 23, 2020

On 11/21/2020 at 11:35 AM, Colonel_mortis said:
> but who are likely to be actual spammers based on their profile information.

I'd say that there is likely a 50/50 split between spam registrants that are simply trying to create a profile with links in an attempt at SEO or link count nonsense and those that actually attempt to post. There is a need to be able to moderate links in profile fields. We turn off profile viewing to guests as one way to mitigate this, but it's less than ideal.
Colonel_mortis Posted November 23, 2020 (Author)

10 hours ago, Matt said:
> You originally quoted 51% as false positives; what do you think it could be now?
>
> Trying to capture spam is a constantly evolving process. Trends wax and wane. We tweak things to account for this, but we can only do so after a trend has established itself.
>
> Given the size of your community and the number of registrations you get, I'd love to know more about your data and look at our capturing system to increase its accuracy.

Of the 147 members who were classified as 4 but not FASed, about 30 are likely spam accounts (based on manual classification by one of my moderators), which gives a precision of 61% (39% false positive rate) when classifying based on a score of 4.

I'd be happy to share some more detailed data with you if there's anything that you think would be helpful - feel free to reach out by ticket/email/PM(/slack), whatever is easiest.
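The revised figure can be reproduced with a few lines of arithmetic. This is a rough sketch using the counts from the original table plus the "about 30" estimate from the post above; the variable names are illustrative only. Note that the "false positive rate" quoted here is computed as 1 minus precision, i.e. the share of flagged accounts that appear to be legitimate.

```python
# Code-4 rows from the original table.
code4_not_fased = 147
code4_fased = 155

# "About 30" of the non-FASed code-4 accounts look like spam on manual review.
likely_spam_among_not_fased = 30

flagged_as_4 = code4_not_fased + code4_fased
adjusted_true_positives = code4_fased + likely_spam_among_not_fased

adjusted_precision = adjusted_true_positives / flagged_as_4
false_positive_share = 1 - adjusted_precision

print(f"adjusted precision:   {adjusted_precision:.0%}")
print(f"false positive share: {false_positive_share:.0%}")
```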
Matt Posted November 25, 2020 (Management)

On 11/23/2020 at 10:33 PM, Colonel_mortis said:
> Of the 147 members who were classified as 4 but not FASed, about 30 are likely spam accounts (based on manual classification by one of my moderators), which gives a precision of 61% (39% false positive rate) when classifying based on a score of 4.
>
> I'd be happy to share some more detailed data with you if there's anything that you think would be helpful - feel free to reach out by ticket/email/PM(/slack), whatever is easiest.

Cheers Jack, I'll do that.
loccom Posted November 25, 2020

We temporarily get round this by setting them to moderator approval, but that is for around 40 a day; if you get hundreds, it would be painful.
Prank Posted February 18, 2021

I know this is a little old, but the discussion needs to continue. I have a fairly busy board and have found that the spam prevention hasn't been stopping much at all, given the spam attacks we've recently been getting. CleanTalk is fairly useless now as well, and I'm relying more heavily on my WAF on CloudFlare.

So, with this in mind, I'm worried about disabling spam prevention, but I have changed all codes > 1 to moderate the registration, so I'll keep an eye on it and see what happens. Note that 2 & 3 seem to never be triggered.

For yesterday (17/2), 48 total registration attempts:
- 1 x code 3
- 32 x code 4
- 15 x code 1
Matt Posted February 18, 2021 (Management)

We're currently revamping the spam defence system to make it more accurate and to catch more spam. We'll announce the changes soon.
Prank Posted February 18, 2021

11 hours ago, Matt said:
> We're currently revamping the spam defence system to make it more accurate and to catch more spam. We'll announce the changes soon.

Any rough estimate here? I'm looking at writing an integration with Akismet.
Sonya* Posted February 18, 2021

20 hours ago, Prank said:
> CleanTalk is fairly useless now as well

I use CleanTalk on my contact form and love it. Can you explain why it is useless for registration? Thanks!
Prank Posted February 18, 2021

40 minutes ago, Sonya* said:
> I use CleanTalk on contact form and love it. Can you explain why it is useless for registration? Thanks!

Yep, on my site the success rate of identifying and blocking a spammer is dismally low.
Matt Posted February 19, 2021 (Management)

14 hours ago, Prank said:
> Any rough estimate here? I'm looking at writing an integration with Akismet.

We should have something to announce in a few weeks.
Greek76 Posted February 20, 2021

I had to disable it completely. It was automatically flagging friends of mine as spam. I even complained about it in the around a month ago.
Colonel_mortis Posted April 16, 2022 (Author)

I'm a bit late, but I have some updated numbers based on the past 3 months of registrations. In the latest data:
- Precision (proportion of users with a spam score >1 that are actually spammers) is 81%, which is up significantly on the 51% from 2020
- Recall (proportion of spammers that get a spam score >1) is 24%, which is down slightly (but probably within margin of error) on the 28% from 2020

However, the number of spammers relative to actual users has increased since the data in the original post, which means the precision is not comparable. Correcting for that changes the precision to 66%, which is still a decent improvement over 2020.

The biggest change for us though is that the spam service can mod queue members rather than blocking their registration outright, which means the sub-optimal precision is not as important as it was. This, combined with a collection of banned words, banned links, and custom spam rules (via a custom plugin), has meant that over 90% of the spam posts were mod queued before being posted.
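The post does not say how the correction for the changed spammer ratio was done. One common way to make precision comparable across different spam prevalences is sketched below, purely as an assumption: back out the filter's false positive rate from the observed precision, recall and prevalence, then recompute precision at the old prevalence. The 2022 prevalence used here is a hypothetical placeholder, not a figure from the thread, so the output will not necessarily match the 66% quoted above.

```python
def false_positive_rate(precision, recall, prevalence):
    """Back out P(flagged | legitimate) from precision, recall and spam prevalence."""
    true_pos = recall * prevalence                     # flagged spammers, per registration
    false_pos = true_pos * (1 - precision) / precision  # flagged legitimate users
    return false_pos / (1 - prevalence)


def precision_at(recall, fpr, prevalence):
    """Precision the same filter would show at a different spam prevalence."""
    true_pos = recall * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# From the 2022 post.
precision_2022 = 0.81
recall_2022 = 0.24

# From the 2020 table: 864 FASed out of 6986 registrations.
prevalence_2020 = 864 / 6986

# HYPOTHETICAL: the 2022 spam share is not given in the thread.
prevalence_2022 = 0.25

fpr = false_positive_rate(precision_2022, recall_2022, prevalence_2022)
comparable_precision = precision_at(recall_2022, fpr, prevalence_2020)

print(f"implied false positive rate: {fpr:.2%}")
print(f"precision at 2020 prevalence: {comparable_precision:.0%}")
```

Because precision depends on how many spammers are in the mix, a filter with unchanged behaviour looks more precise when spam makes up a larger share of registrations; holding prevalence fixed removes that effect.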