hwt
Appendix
In practice, building f and g requires computing wtiwtj for all i, j. B.2 Classification: For the classification task with the logistic regression model, we modify the formula of logistic regression in the teaching objectives to make it convenient for derivation. It also indicates that, with probability at least p1, the LST teacher can achieve exponential teachability in iteration t. To achieve exponential teachability in T iterations, the sufficient condition in Eq. (22) must be satisfied in all T iterations. Then, we use a pre-trained DenseNet [65] shown in [53] to generate 1024-dimensional features and the confidence score for each image.
Linguistic Characteristics of AI-Generated Text: A Survey
Terčon, Luka, Dobrovoljc, Kaja
Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text. Their use is quickly becoming commonplace in fields such as education, healthcare, and scientific research. There is a growing need to study the linguistic features present in AI-generated text, as the increasing presence of such texts has profound implications in various disciplines such as corpus linguistics, computational linguistics, and natural language processing. Many observations have already been made; however, a broader synthesis of the findings made so far is required to provide a better understanding of the topic. The present survey paper aims to provide such a synthesis of extant research. We categorize the existing works along several dimensions, including the levels of linguistic description, the models included, the genres analyzed, the languages analyzed, and the approach to prompting. Additionally, the same scheme is used to present the findings made so far and expose the current trends followed by researchers. Among the most often reported findings is the observation that AI-generated text is more likely to exhibit a formal and impersonal style, signaled by the increased presence of nouns, determiners, and adpositions and the lower reliance on adjectives and adverbs. AI-generated text is also more likely to feature a lower lexical diversity, a smaller vocabulary size, and repetitive text. Current research, however, remains heavily concentrated on English data and mostly on text generated by the GPT model family, highlighting the need for broader cross-linguistic and cross-model investigation. In most cases, authors also fail to address the issue of prompt sensitivity, leaving much room for future studies that employ multiple prompt wordings in the text generation phase.
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Europe > Spain > Andalusia > Jaén Province > Jaén (0.04)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
- Research Report > New Finding (0.68)
- Health & Medicine (1.00)
- Education (0.93)
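The lexical-diversity finding in the survey abstract above can be made concrete with a type-token ratio (TTR), one common diversity measure. This is only a minimal sketch: the abstract does not say which measures the surveyed studies used, and the function name, the regex tokenizer, and both sample strings are invented here for illustration.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct tokens divided by total tokens (0.0 for empty input)."""
    # Hypothetical tokenizer: lowercase runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented sample strings, not data from any of the cited studies.
repetitive = "the model is good and the model is fast and the model is good"
varied = "a quick brown fox jumps over one lazy dog near some quiet river"

print(type_token_ratio(repetitive))  # 6 distinct tokens out of 14
print(type_token_ratio(varied))      # every token is distinct
```

TTR tends to fall as texts get longer, so it is only meaningful when comparing texts of similar length.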
AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing
Bessalah, Aniss, Abdelmoumen, Hatem Mohamed, Benatchba, Karima, Benmeziane, Hadjer
Analog In-memory Computing (AIMC) has emerged as a highly efficient paradigm for accelerating Deep Neural Networks (DNNs), offering significant energy and latency benefits over conventional digital hardware. However, state-of-the-art neural networks are not inherently designed for AIMC, as they fail to account for its unique non-idealities. Neural Architecture Search (NAS) is thus needed to systematically discover neural architectures optimized explicitly for AIMC constraints. However, comparing NAS methodologies and extracting insights about robust architectures for AIMC requires a dedicated NAS benchmark that explicitly accounts for AIMC-specific hardware non-idealities. To address this, we introduce AnalogNAS-Bench, the first NAS benchmark tailored specifically for AIMC. Our study reveals three key insights: (1) standard quantization techniques fail to capture AIMC-specific noises, (2) robust architectures tend to feature wider and branched blocks, (3) skip connections improve resilience to temporal drift noise. These insights highlight the limitations of current NAS benchmarks for AIMC and pave the way for future analog-aware NAS. All the implementations used in this paper can be found at https://github.com/IBM/analog-nas/tree/main/analognasbench.
- North America > Canada > Ontario > Toronto (0.14)
- Africa > Middle East > Algeria > Algiers Province > Algiers (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Pedrotti, Andrea, Papucci, Michele, Ciaccio, Cristiano, Miaschi, Alessio, Puccetti, Giovanni, Dell'Orletta, Felice, Esuli, Andrea
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
- Europe > United Kingdom > Northern Ireland (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Laos > Vientiane Prefecture > Vientiane (0.04)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Media > News (1.00)
- Health & Medicine > Health Care Providers & Services (1.00)
- Government > Regional Government > Europe Government > United Kingdom Government (0.92)
- Health & Medicine > Therapeutic Area (0.68)
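The Direct Preference Optimization step mentioned in the abstract above can be sketched through the standard per-example DPO loss. Treating the human-written text as the preferred ("chosen") completion and the machine-generated text as the "rejected" one is an assumption read off the abstract's stated goal, not a confirmed detail of the authors' pipeline. The function name, argument names, and the default beta are likewise illustrative.

```python
import math


def dpo_loss(logp_chosen_policy: float, logp_chosen_ref: float,
             logp_rejected_policy: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Standard per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is a sequence log-probability of a completion under either
    the model being tuned (policy) or the frozen reference model.
    """
    # How far the policy has moved relative to the reference on each completion.
    chosen_shift = logp_chosen_policy - logp_chosen_ref
    rejected_shift = logp_rejected_policy - logp_rejected_ref
    margin = beta * (chosen_shift - rejected_shift)
    # Numerically stable form of log(1 + exp(-margin)).
    if margin < 0:
        return -margin + math.log1p(math.exp(margin))
    return math.log1p(math.exp(-margin))


# With the policy equal to the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -12.0, -12.0))
# Raising the chosen completion's likelihood and lowering the rejected one's reduces the loss.
print(dpo_loss(-8.0, -10.0, -14.0, -12.0))
```

In practice the log-probabilities would come from summing token log-likelihoods under the two models; here they are plain numbers so the arithmetic is easy to follow.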
Human Variability vs. Machine Consistency: A Linguistic Analysis of Texts Generated by Humans and Large Language Models
Zanotto, Sergio E., Aroyehun, Segun
The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. Recent research has predominantly focused on using LLMs to classify text as either human-written or machine-generated. In our study, we adopt a different approach by profiling texts spanning four domains based on 250 distinct linguistic features. We select the M4 dataset from the Subtask B of SemEval 2024 Task 8. We automatically calculate various linguistic features with the LFTK tool and additionally measure the average syntactic depth, semantic similarity, and emotional content for each document. We then apply a two-dimensional PCA reduction to all the calculated features. Our analyses reveal significant differences between human-written texts and those generated by LLMs, particularly in the variability of these features, which we find to be considerably higher in human-written texts. This discrepancy is especially evident in text genres with less rigid linguistic style constraints. Our findings indicate that humans write texts that are less cognitively demanding, with higher semantic content, and richer emotional content compared to texts generated by LLMs. These insights underscore the need for incorporating meaningful linguistic features to enhance the understanding of textual outputs of LLMs.
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?
Petukhova, Kseniia, Kazakov, Roman, Kochmar, Ekaterina
In this paper, we present our submission to the SemEval-2024 Task 8 "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection", focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach combines embeddings from RoBERTa-base with diversity features and uses a resampled training set. We rank 12th out of 124 for Subtask A (monolingual track), and our results show that our approach generalizes across unseen models and domains, achieving an accuracy of 0.91.
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
AI can now copy your HANDWRITING - so, can you tell which of these was written by a robot?
AI tools like ChatGPT can draft letters, tell jokes and even give legal advice – but only in the form of computerized text. Now, scientists have created an AI that can imitate human handwriting, which could herald fresh issues regarding fraud and fake documents. Amazingly, the results are almost indistinguishable from the real thing drafted by human hands. Below is one column of writing by the team's AI model and another by humans, but can you tell which is which? Scroll down to reveal the answer!
- North America > Canada (0.05)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
Gao, Chujie, Chen, Dongping, Zhang, Qihui, Huang, Yue, Wan, Yao, Sun, Lichao
With the remarkable development and widespread applications of large language models (LLMs), the use of machine-generated text (MGT) is becoming increasingly common. This trend brings potential risks, particularly to the quality and completeness of information in fields such as news and education. Current research predominantly addresses the detection of pure MGT without adequately addressing mixed scenarios, including AI-revised Human-Written Text (HWT) or human-revised MGT. To confront this challenge, we introduce mixcase, a novel concept representing a hybrid text form involving both machine-generated and human-generated content. We collected mixcase instances generated from multiple daily text-editing scenarios and composed MixSet, the first dataset dedicated to studying these mixed modification scenarios. We conduct experiments to evaluate the efficacy of popular MGT detectors, assessing their effectiveness, robustness, and generalization performance. Our findings reveal that existing detectors struggle to identify mixcase as a separate class or MGT, particularly in dealing with subtle modifications and style adaptability. This research underscores the urgent need for more fine-grained detectors tailored for mixcase, offering valuable insights for future research. Code and models are available at https://github.com/Dongping-Chen/MixSet.
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Media (0.68)
- Health & Medicine (0.46)
- Education (0.46)
CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning
Liu, Xiaoming, Zhang, Zhaohan, Wang, Yichen, Pu, Hang, Lan, Yu, Shen, Chao
Machine-Generated Text (MGT) detection, a task that discriminates MGT from Human-Written Text (HWT), plays a crucial role in preventing misuse of text generative models, which have recently come to excel at mimicking human writing style. Recently proposed detectors usually take coarse text sequences as input and fine-tune pretrained models with a standard cross-entropy loss. However, these methods fail to consider the linguistic structure of texts. Moreover, they lack the ability to handle the low-resource problem, which can often arise in practice given the enormous amount of textual data online. In this paper, we present a coherence-based contrastive learning model named CoCo to detect possible MGT in low-resource scenarios. To exploit linguistic features, we encode coherence information in the form of a graph into the text representation. To tackle the challenge of limited data, we employ a contrastive learning framework and propose an improved contrastive loss that prevents the performance degradation caused by simple samples. Experimental results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods. Surprisingly, we also find that, in our experiments, MGTs produced by up-to-date language models can be easier to detect than those from previous models, and we propose some preliminary explanations for this counter-intuitive phenomenon. All code and datasets are open-sourced.
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.04)
- Asia > India (0.04)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Leisure & Entertainment > Sports > Soccer (1.00)
- Government (1.00)
- Media (0.93)
- Information Technology (0.67)