spammer
Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions
Fleisig, Eve, Orlikowski, Matthias, Cimiano, Philipp, Klein, Dan
For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are more random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give relatively fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
- Europe > Austria > Vienna (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Singapore (0.04)
- Information Technology > Security & Privacy > Spam Filtering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Communications > Social Media > Crowdsourcing (0.69)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)
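To make the tradeoff in the abstract above concrete, here is a minimal sketch. It is not the authors' method or data: the toy ratings, the disagreement heuristic (mean distance from the per-item average), and all function names are assumptions chosen for illustration. It shows how a filter that treats disagreement as spam can remove an honest minority annotator while keeping a fixed-answer spammer, which raises the mean absolute error against the honest average label.

```python
from statistics import mean

def item_means(ratings, keep=None):
    """Average label per item, optionally restricted to the annotators in `keep`."""
    by_item = {}
    for annotator, labels in ratings.items():
        if keep is not None and annotator not in keep:
            continue
        for item, label in labels.items():
            by_item.setdefault(item, []).append(label)
    return {item: mean(values) for item, values in by_item.items()}

def disagreement(ratings):
    """Hypothetical heuristic: each annotator's mean distance from the item averages."""
    means = item_means(ratings)
    return {
        annotator: mean(abs(label - means[item]) for item, label in labels.items())
        for annotator, labels in ratings.items()
    }

def filter_top_disagreers(ratings, n_remove):
    """Return the annotators kept after dropping the n_remove highest scorers."""
    scores = disagreement(ratings)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[n_remove:])

def mae(estimate, reference):
    """Mean absolute error between two per-item averages."""
    return mean(abs(estimate[item] - reference[item]) for item in reference)

# Toy data (assumed): a3 holds an honest minority view, s1 always answers 3.
ratings = {
    "a1": {"x": 1, "y": 5},
    "a2": {"x": 1, "y": 5},
    "a3": {"x": 5, "y": 1},
    "s1": {"x": 3, "y": 3},
}
true_avg = item_means(ratings, keep={"a1", "a2", "a3"})
kept = filter_top_disagreers(ratings, n_remove=1)

print("kept annotators:", sorted(kept))
print("MAE without filtering:", mae(item_means(ratings), true_avg))
print("MAE after filtering:", mae(item_means(ratings, keep=kept), true_avg))
```

In this toy case the heuristic drops `a3` rather than `s1`, so the filtered average moves further from the honest average than the unfiltered one does.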
Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing
Nihar Bhadresh Shah, Dengyong Zhou
Crowdsourcing has gained immense popularity in machine learning applications for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and skip the rest. We show that surprisingly, under a mild and natural "no-free-lunch" requirement, this mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. Interestingly, this unique mechanism takes a "multiplicative" form. The simplicity of the mechanism is an added benefit. In preliminary experiments involving several hundred workers, we observe a significant reduction in the error rates under our unique mechanism for the same or lower monetary expenditure.
- Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.05)
- North America > United States > California > Alameda County > Berkeley (0.04)
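The abstract does not state the payment formula, so the following is only an illustrative sketch of a multiplicative "double or nothing" rule suggested by the title. It assumes payment is computed on a set of gold-standard questions: it starts from a base amount, is multiplied by a fixed factor for each correct answer, is unchanged by a skip, and drops to zero on any wrong answer. The function name, the base amount, and the factor of 2 are assumptions, not the paper's exact mechanism.

```python
def double_or_nothing_payment(responses, base=1.0, factor=2.0):
    """Multiplicative payment over gold-standard questions (illustrative only).

    responses: iterable of "correct", "wrong", or "skip".
    """
    payment = base
    for response in responses:
        if response == "correct":
            payment *= factor      # each correct answer multiplies the payment
        elif response == "wrong":
            return 0.0             # a single wrong answer forfeits everything
        elif response != "skip":   # skips leave the payment unchanged
            raise ValueError(f"unknown response: {response!r}")
    return payment

print(double_or_nothing_payment(["correct", "skip", "correct"]))  # 4.0
print(double_or_nothing_payment(["correct", "wrong", "correct"]))  # 0.0
```

Under this rule a worker who guesses risks losing the whole payment, while skipping costs nothing, which is the incentive the abstract describes.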
Data Quality in Crowdsourcing and Spamming Behavior Detection
Ba, Yang, Mancenido, Michelle V., Chiou, Erin K., Pan, Rong
As crowdsourcing emerges as an efficient and cost-effective method for obtaining labels for machine learning datasets, it is important to assess the quality of crowd-provided data, so as to improve analysis performance and reduce biases in subsequent machine learning tasks. Given the lack of ground truth in most cases of crowdsourcing, we refer to data quality as annotators' consistency and credibility. Unlike the simple scenarios where the Kappa coefficient and the intraclass correlation coefficient can usually be applied, online crowdsourcing requires dealing with more complex situations. We introduce a systematic method for evaluating data quality and detecting spamming threats via variance decomposition, and we classify spammers into three categories based on their different behavioral patterns. A spammer index is proposed to assess overall data consistency, and two metrics are developed to measure crowd workers' credibility by utilizing the Markov chain and generalized random effects models. Furthermore, we showcase the practicality of our techniques and their advantages by applying them to a face verification task with both simulation and real-world data collected from two crowdsourcing platforms.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Information Technology (0.67)
- Health & Medicine (0.67)
- Government > Regional Government > North America Government > United States Government (0.46)
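For reference, the Kappa coefficient mentioned in the abstract can be computed directly in the simple case of two annotators labelling the same items. The sketch below implements Cohen's kappa only; it is not the spammer index or the credibility metrics the paper proposes, and the function name and toy labels are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both annotators use one identical label
    return (observed - expected) / (1.0 - expected)

annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```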
The ChatGPT vs Bear Blog spam war
Ever since Bear Blog's infancy, spam has been an issue. Free services tend to attract those seeking to exploit them for backlinks and the alleged SEO benefits (although this is debatable given updates to the Google algorithm). I've previously discussed this in a post, detailing the manual review process which has been holding up well for the past 3 years. But alas, change is upon us. Spam used to be quite easy to spot: poorly worded, low-effort paragraphs sprinkled with backlinks to products or services.
Social Honeypot for Humans: Luring People through Self-managed Instagram Pages
Bardi, Sara, Conti, Mauro, Pajola, Luca, Tricomi, Pier Paolo
Social Honeypots are tools deployed in Online Social Networks (OSN) to attract malevolent activities performed by spammers and bots. To this end, their content is designed to be of maximum interest to malicious users. However, by choosing an appropriate content topic, this attractive mechanism could be extended to any OSN users, rather than only luring malicious actors. As a result, honeypots can be used to attract individuals interested in a wide range of topics, from sports and hobbies to more sensitive subjects like political views and conspiracies. With all these individuals gathered in one place, honeypot owners can conduct many analyses, from social to marketing studies. In this work, we introduce a novel concept of social honeypot for attracting OSN users interested in a generic target topic. We propose a framework based on fully-automated content generation strategies and engagement plans to mimic legit Instagram pages. To validate our framework, we created 21 self-managed social honeypots (i.e., pages) on Instagram, covering three topics, four content generation strategies, and three engaging plans. In nine weeks, our honeypots gathered a total of 753 followers, 5387 comments, and 15739 likes. These results demonstrate the validity of our approach, and through statistical analysis, we examine the characteristics of effective social honeypots.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Italy > Lazio (0.04)
- Europe > Italy > Emilia-Romagna (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Anticipating New Spam Domains Through Machine Learning
Researchers from France have devised a method for identifying newly-registered domains that are likely to be used in a 'hit and run' fashion by high-volume email spammers – sometimes, even before the spammers have sent out one unwanted email. The technique is based on analysis of the way that the Sender Policy Framework (SPF), a method of verifying email provenance, has been set up on newly-registered domains. Thanks to the use of passive DNS (Domain Name System) sensors, the researchers were able to obtain near real-time DNS data from Seattle-based company Farsight, yielding SPF activity for TXT records for a range of domains. Using a class weight algorithm originally designed for processing imbalanced medical data, and implemented in the scikit-learn machine learning Python library, the researchers were able to detect three quarters of the pending spam domains within moments, or even in advance of their operation. 'With a single request to the TXT record, we detect 75% of the spam domains, possibly before the start of the spam campaign.
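The article does not say which class weighting scheme the researchers used, so the following is only a sketch of one common choice: inverse-frequency weights, using the formula scikit-learn documents for its "balanced" class weight option, `n_samples / (n_classes * count)`. It is written in plain Python so that it runs without the library; the function name and the 90/10 toy split between benign and spam domains are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy imbalanced data (assumed): spam domains are the rare class.
labels = ["benign"] * 90 + ["spam"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives the larger weight, so a classifier trained with these weights is penalized more for missing a spam domain than for misjudging a benign one.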
La veille de la cybersécurité
One of the biggest problems with social media is spammers who post adult content. Detecting and removing such content quickly is essential to keep social media clean. Researchers from Jamia Millia Islamia have described how the user experience of young people can be improved if the spam content is filtered. Various machine learning tools can be used to detect such content and classify it as spam. Of the different models they tried, they found XGBoost to be the one with the highest accuracy, at 91%, and adapted the algorithm for effective classification.
Modeling User Behavior With Interaction Networks for Spam Detection
Agarwal, Prabhat, Srivastava, Manisha, Singh, Vishwakarma, Rosenberg, Charles
Spam is a serious problem plaguing web-scale digital platforms which facilitate user content creation and distribution. It compromises a platform's integrity, the performance of services like recommendation and search, and the overall business. Spammers engage in a variety of abusive and evasive behaviors that are distinct from those of non-spammers. Users' complex behavior can be well represented by a heterogeneous graph rich with node and edge attributes. Learning to identify spammers in such a graph for a web-scale platform is challenging because of its structural complexity and size. In this paper, we propose SEINE (Spam DEtection using Interaction NEtworks), a spam detection model over a novel graph framework. Our graph simultaneously captures rich user details and behavior and enables learning on a billion-scale graph. Our model considers neighborhood along with edge types and attributes, allowing it to capture a wide range of spammers. SEINE, trained on a real dataset of tens of millions of nodes and billions of edges, achieves a high performance of 80% recall with a 1% false positive rate. SEINE achieves comparable performance to the state-of-the-art techniques on a public dataset while remaining practical for use in a large-scale production system.
- Europe > Spain > Galicia > Madrid (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia (0.04)
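As a reminder of what the two reported figures measure, here is a minimal sketch that computes recall and false positive rate from binary labels. The toy data is an assumption constructed to reproduce an 80% recall and 1% false positive rate; it has nothing to do with SEINE's actual dataset.

```python
def recall_and_fpr(y_true, y_pred):
    """Return (recall, false positive rate) for binary labels, 1 = spammer."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # share of spammers caught
    fpr = fp / (fp + tn) if (fp + tn) else 0.0      # share of legitimate users flagged
    return recall, fpr

# Toy data (assumed): 10 spammers, 8 caught; 100 legitimate users, 1 flagged.
y_true = [1] * 10 + [0] * 100
y_pred = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 99
recall, fpr = recall_and_fpr(y_true, y_pred)
print(recall, fpr)  # 0.8 0.01
```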
How ML Systems can help in Detecting Spam Emails? Appknock
Cybersecurity has been in demand since the early days of the Internet. Today, given the tremendous increase in data threats via phishing emails, spam-borne malware, spear phishing, and similar attacks, the call for reliable and smart anti-spam email filters is inevitable. Digital theft and email spam are trending upwards! We have many detectors and anti-spam filters; nevertheless, more dependable, feature-rich digital products remain in high demand in the market. This is where we look to Artificial Intelligence, a computer technology with human-like abilities to reason and act intelligently. Humans have taught machines the empirical aspects of, and related approaches to, spam email filtering, which can help defeat the spammers who gather sensitive user information from websites, chat rooms, and viruses.