Lazovich, Tomo
Filter bubbles and affective polarization in user-personalized large language model outputs
Lazovich, Tomo
Echoing the history of search engines and social media content rankings, the advent of large language models (LLMs) has led to a push for increased personalization of model outputs to individual users. In the past, personalized recommendations and ranking systems have been linked to the development of filter bubbles (serving content that may confirm a user's existing biases) and affective polarization (strong negative sentiment towards those with differing views). In this work, we explore how prompting a leading large language model, ChatGPT-3.5, with a user's political affiliation prior to asking factual questions about public figures and organizations leads to differing results. We observe that left-leaning users tend to receive more positive statements about left-leaning political figures and media outlets, while right-leaning users see more positive statements about right-leaning entities. This pattern holds across presidential candidates, members of the U.S. Senate, and media organizations with ratings from AllSides. When qualitatively evaluating some of these outputs, there is evidence that particular facts are included or excluded based on the user's political affiliation. These results illustrate that personalizing LLMs based on user demographics carries the same risks of affective polarization and filter bubbles that have been seen in other personalized internet technologies. This "failure mode" should be monitored closely as there are more attempts to monetize and personalize these models.
TwERC: High Performance Ensembled Candidate Generation for Ads Recommendation at Twitter
Cai, Vanessa, Prabakar, Pradeep, Rebuelta, Manuel Serrano, Rosen, Lucas, Monti, Federico, Janocha, Katarzyna, Lazovich, Tomo, Raj, Jeetu, Shrinivasan, Yedendra, Li, Hao, Markovich, Thomas
Recommendation systems are a core feature of social media companies, with uses that include recommending organic and promoted content. Many modern recommendation systems are split into multiple stages (candidate generation and heavy ranking) to balance computational cost against recommendation quality. We focus on the candidate generation phase of a large-scale ads recommendation problem in this paper, and present a machine-learning-first, heterogeneous re-architecture of this stage which we term TwERC. We show that a system that combines a real-time light ranker with sourcing strategies capable of capturing additional information provides validated gains. We present two strategies. The first strategy uses a notion of similarity in the interaction graph, while the second strategy caches previous scores from the ranking stage. The graph-based strategy achieves a 4.08% revenue gain and the rankscore-based strategy achieves a 1.38% gain. These two strategies have biases that complement both the light ranker and one another. Finally, we describe a set of metrics that we believe are valuable as a means of understanding the complex product trade-offs inherent in industrial candidate generation systems.
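As a rough illustration of combining several candidate sources in one stage, the sketch below blends three sources by a fixed per-source quota and removes duplicates. It is not the TwERC implementation: the function name `blend_candidates`, the source names, the quotas, and the scores are all hypothetical, and the abstract does not say how TwERC actually merges its sources.

```python
def blend_candidates(sources, quotas, limit):
    """Blend candidates from several sources (hypothetical quota scheme).

    sources: dict mapping source name -> list of (ad_id, score) pairs.
    quotas:  dict mapping source name -> max candidates taken from that source.
    limit:   max total candidates passed on to the heavy ranker.
    Returns a list of (ad_id, source_name) in selection order.
    """
    selected = []
    seen = set()
    for name, candidates in sources.items():
        # Scores are only compared within a source, never across sources.
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        taken = 0
        for ad_id, _score in ranked:
            if taken >= quotas.get(name, 0) or len(selected) >= limit:
                break
            if ad_id in seen:
                continue  # already supplied by an earlier source
            seen.add(ad_id)
            selected.append((ad_id, name))
            taken += 1
    return selected


# Toy data; all identifiers and scores are made up for illustration.
sources = {
    "light_ranker": [("ad1", 0.9), ("ad2", 0.7), ("ad3", 0.4)],
    "graph": [("ad2", 0.8), ("ad4", 0.6)],
    "rankscore_cache": [("ad5", 0.95), ("ad1", 0.5)],
}
quotas = {"light_ranker": 2, "graph": 1, "rankscore_cache": 1}
blended = blend_candidates(sources, quotas, limit=4)
```

Quotas are used here because scores produced by different sources are generally not on a common scale, so sorting the union by raw score would favor whichever source happens to emit larger numbers.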
Learning to Repair Software Vulnerabilities with Generative Adversarial Networks
Harer, Jacob, Ozdemir, Onur, Lazovich, Tomo, Reale, Christopher, Russell, Rebecca, Kim, Louis, Chin, Peter
Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.
Automated Vulnerability Detection in Source Code Using Deep Representation Learning
Russell, Rebecca L., Kim, Louis, Hamilton, Lei H., Lazovich, Tomo, Harer, Jacob A., Ozdemir, Onur, Ellingwood, Paul M., McConley, Marc W.
Increasing numbers of software vulnerabilities are discovered every year, whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose a serious risk of exploitation and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.
Learning to Repair Software Vulnerabilities with Generative Adversarial Networks
Harer, Jacob, Ozdemir, Onur, Lazovich, Tomo, Reale, Christopher P., Russell, Rebecca L., Kim, Louis Y., Chin, Peter
Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.
Automated software vulnerability detection with machine learning
Harer, Jacob A., Kim, Louis Y., Russell, Rebecca L., Ozdemir, Onur, Kosta, Leonard R., Rangamani, Akshay, Hamilton, Lei H., Centeno, Gabriel I., Key, Jonathan R., Ellingwood, Paul M., McConley, Marc W., Opper, Jeffrey M., Chin, Peter, Lazovich, Tomo
Thousands of security vulnerabilities are discovered in production software each year, either reported publicly to the Common Vulnerabilities and Exposures database or discovered internally in proprietary code. Vulnerabilities often manifest themselves in subtle ways that are not obvious to code reviewers or the developers themselves. With the wealth of open-source code available for analysis, there is an opportunity to learn the patterns of bugs that can lead to security vulnerabilities directly from data. In this paper, we present a data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs. We first compile a large dataset of hundreds of thousands of open-source functions labeled with the outputs of a static analyzer. We then compare methods applied directly to source code with methods applied to artifacts extracted from the build process, finding that source-based models perform better. We also compare the application of deep neural network models with more traditional models such as random forests and find that the best performance comes from combining features learned by deep models with tree-based models. Ultimately, our highest-performing model achieves an area under the precision-recall curve of 0.49 and an area under the ROC curve of 0.87.
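For readers unfamiliar with the two metrics reported above, the following minimal sketch computes an area under the ROC curve and a step-wise area under the precision-recall curve (average precision) in plain Python. The labels and scores are toy values invented for illustration, not data from the paper, and the paper may have used a different estimator for the precision-recall area (for example, trapezoidal interpolation).

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Tied scores are ordered by the (stable) sort, not grouped together.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    ap = 0.0
    for rank, (_score, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


# Toy example: 1 = vulnerable function, 0 = not vulnerable (made-up values).
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

When positives are rare, as is typical for vulnerability labels, the precision-recall area is usually much lower than the ROC area for the same model, which is one reason both figures are reported together.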