On the de-duplication of the Lakh MIDI dataset

Choi, Eunjin, Kim, Hyerin, Ryu, Jiwoo, Nam, Juhan, Jeong, Dasaem

arXiv.org Artificial Intelligence

A large-scale dataset is essential for training a well-generalized deep-learning model. Most such datasets are collected by scraping various internet sources, inevitably introducing duplicated data. In the symbolic music domain, these duplicates often arise from multiple user arrangements and from metadata changes after simple editing. Despite critical consequences, such as unreliable training evaluation caused by data leakage during random splitting, dataset duplication has not been extensively addressed in the MIR community. This study investigates dataset duplication in the Lakh MIDI Dataset (LMD), one of the largest publicly available sources in the symbolic music domain. To find and evaluate the best retrieval method for duplicated data, we employed the Clean MIDI subset of the LMD as a benchmark test set, in which different versions of the same songs are grouped together. We first evaluated rule-based approaches and previous symbolic music retrieval models for de-duplication, and then investigated a contrastive-learning-based BERT model with various augmentations for finding duplicate files. As a result, we propose three filtered versions of the LMD, which filter out at least 38,134 of the 178,561 files even in the most conservative setting.
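The abstract does not spell out the retrieval method, but the duplicate-grouping step it describes can be illustrated with a minimal sketch: fingerprint each file by its pitch n-grams and greedily merge files whose fingerprints overlap above a similarity threshold. All names, the Jaccard threshold, and the toy data below are hypothetical illustrations, not the paper's actual method:

```python
def note_ngrams(pitches, n=4):
    """Set of pitch n-grams used as a crude fingerprint of a piece."""
    return {tuple(pitches[i:i + n]) for i in range(len(pitches) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two fingerprint sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_duplicates(pieces, threshold=0.8):
    """Greedy grouping: pieces whose fingerprints overlap above the
    threshold are treated as versions of the same song."""
    groups = []
    for name, pitches in pieces.items():
        fp = note_ngrams(pitches)
        for group in groups:
            if jaccard(fp, group["fp"]) >= threshold:
                group["members"].append(name)
                group["fp"] |= fp  # extend the group's fingerprint
                break
        else:
            groups.append({"fp": fp, "members": [name]})
    return [g["members"] for g in groups]

# Toy MIDI pitch sequences: v2 is a lightly edited copy of v1.
pieces = {
    "song_a_v1": [60, 62, 64, 65, 67, 69, 71, 72],
    "song_a_v2": [60, 62, 64, 65, 67, 69, 71, 72, 74],
    "song_b":    [55, 57, 59, 60, 62, 64, 66, 67],
}
print(group_duplicates(pieces))  # → [['song_a_v1', 'song_a_v2'], ['song_b']]
```

A real pipeline would fingerprint quantized, transposition-normalized note events rather than raw pitch lists, but the grouping logic stays the same.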


Computing-In-Memory Dataflow for Minimal Buffer Traffic

Song, Choongseok, Jeong, Doo Seok

arXiv.org Artificial Intelligence

Computing-In-Memory (CIM) offers a potential solution to the memory-wall issue and can achieve high energy efficiency by minimizing data movement, making it a promising architecture for edge AI devices. Lightweight models such as MobileNet and EfficientNet, which rely on depthwise convolution for feature extraction, have been developed for these devices. However, CIM macros often struggle to accelerate depthwise convolution, suffering from underutilization of CIM memory and heavy buffer traffic. The latter in particular has been overlooked despite its significant impact on latency and energy consumption. To address this, we introduce a novel CIM dataflow that significantly reduces buffer traffic by maximizing data reuse and improving memory utilization during depthwise convolution. The proposed dataflow is grounded in solid theoretical principles, fully demonstrated in this paper. When applied to MobileNet and EfficientNet models, our dataflow reduces buffer traffic by 77.4-87.0%. Convolutional neural networks (CNNs) have achieved remarkable success in computer vision, excelling in spatial feature extraction [1].
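To see why buffer traffic dominates, a back-of-the-envelope count of input-buffer reads for a stride-1 depthwise convolution is instructive: a naive dataflow re-fetches the full k x k window for every output pixel, while an idealized reuse dataflow fetches each input pixel once. This arithmetic is illustrative only and is not the paper's dataflow:

```python
def buffer_reads(h, w, k, reuse=False):
    """Input-buffer reads for a k x k depthwise convolution over an
    h x w feature map (stride 1, no padding).
    reuse=False: naive dataflow, re-fetching the k x k window per output.
    reuse=True:  idealized full-reuse dataflow, one fetch per input pixel.
    """
    if reuse:
        return h * w
    out_h, out_w = h - k + 1, w - k + 1
    return out_h * out_w * k * k

# A 112x112 map with a 3x3 depthwise kernel (typical early MobileNet layer):
naive = buffer_reads(112, 112, 3)
ideal = buffer_reads(112, 112, 3, reuse=True)
print(f"naive: {naive}, ideal: {ideal}, saved: {1 - ideal / naive:.1%}")
```

The gap between the two counts (roughly an order of magnitude for a 3x3 kernel) is the headroom that reuse-maximizing dataflows like the one proposed here exploit.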


[Submission 1194: "DISK"] We thank all reviewers for their insightful comments and address their concerns

Neural Information Processing Systems

R1: DISK is based on previous work (U-Net, SuperPoint) and offers only moderate innovation. We will clarify this in the paper. We tuned the inference parameters (NMS window & RANSAC settings) by search, as described in L194-197. R1, R3, R5: What is the contribution of individual components of the pipeline? Experimentally, we observe that 19.9% of features come from grid selection. This has three potential downsides.


Video Forgery Detection for Surveillance Cameras: A Review

Tayfor, Noor B., Rashid, Tarik A., Qader, Shko M., Hassan, Bryar A., Abdalla, Mohammed H., Majidpour, Jafar, Ahmed, Aram M., Ali, Hussein M., Aladdin, Aso M., Abdullah, Abdulhady A., Shamsaldin, Ahmed S., Sidqi, Haval M., Salih, Abdulrahman, Yaseen, Zaher M., Ameen, Azad A., Nayak, Janmenjoy, Hamza, Mahmood Yashar

arXiv.org Artificial Intelligence

The widespread availability of video recording through smartphones and digital devices has made video-based evidence more accessible than ever. Surveillance footage plays a crucial role in security, law enforcement, and judicial processes. However, with the rise of advanced video editing tools, tampering with digital recordings has become increasingly easy, raising concerns about their authenticity. Ensuring the integrity of surveillance videos is essential, as manipulated footage can lead to misinformation and undermine judicial decisions. This paper provides a comprehensive review of existing forensic techniques used to detect video forgery, focusing on their effectiveness in verifying the authenticity of surveillance recordings. Various methods, including compression-based analysis, frame duplication detection, and machine learning-based approaches, are explored. The findings highlight the growing necessity for more robust forensic techniques to counteract evolving forgery methods. Strengthening video forensic capabilities will ensure that surveillance recordings remain credible and admissible as legal evidence.
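As a minimal illustration of one technique surveyed above, exact frame duplication can be flagged by hashing each frame's raw bytes and reporting any frame whose digest was already seen earlier in the video. This is a deliberately simple sketch; real forensic systems need perceptual hashing or learned features to survive re-compression, and the function names below are illustrative:

```python
import hashlib

def frame_hash(frame_bytes):
    """Digest of raw frame data; byte-identical frames collide by design."""
    return hashlib.sha256(frame_bytes).hexdigest()

def find_duplicated_frames(frames):
    """Return (copy_index, original_index) pairs for every frame whose
    content already appeared earlier -- a simple signal for copy-move
    (frame duplication) forgery in surveillance footage."""
    seen = {}
    duplicates = []
    for i, frame in enumerate(frames):
        h = frame_hash(frame)
        if h in seen:
            duplicates.append((i, seen[h]))
        else:
            seen[h] = i
    return duplicates

# Toy "video": frames 3 and 4 are re-inserted copies of frames 1 and 2.
frames = [b"frameA", b"frameB", b"frameC", b"frameB", b"frameC"]
print(find_duplicated_frames(frames))  # → [(3, 1), (4, 2)]
```

Compression-based analysis and machine-learning detectors, also covered in the review, target the harder case where duplicated content is re-encoded and no longer byte-identical.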


One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning

Goru, Ritesh, Mehta, Shanay, Jain, Prateek

arXiv.org Artificial Intelligence

Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N separate forward passes per conversation (one per turn) due to reasoning-token visibility constraints: reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces losses identical to the N-pass approach while reducing time complexity from $O(N^{3})$ to $O(N^{2})$ and maintaining the same memory complexity for a transformer-based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online (https://github.com/devrev/One-Pass-to-Reason).
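A rough sketch of the masking idea, under the simplifying assumption that each turn contributes user, reasoning, response, and duplicated-response segments: the mask stays causal, reasoning (and the original response) of a turn is hidden from all later turns, and later turns instead attend to the reasoning-free duplicated copy. The segment names and rules below are illustrative, not the repository's actual implementation:

```python
def build_mask(tokens):
    """tokens: list of (turn, kind) pairs with kind in
    {'user', 'reason', 'resp', 'resp_dup'}.
    Returns mask[q][k] == True when query q may attend to key k, under a
    simplified version of the block-sparse visibility rules:
      - causal: k <= q,
      - 'reason'/'resp' of turn t are invisible to later turns,
      - 'resp_dup' of turn t is used only by later turns, not its own."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_turn, _ = tokens[q]
        for k in range(q + 1):  # causal upper bound
            k_turn, k_kind = tokens[k]
            if k_kind in ("reason", "resp") and k_turn < q_turn:
                continue  # hidden from later turns
            if k_kind == "resp_dup" and k_turn == q_turn:
                continue  # same turn attends to the original response
            mask[q][k] = True
    return mask

# Turn 1: user, reason, resp, resp_dup; turn 2: user, reason, resp.
toks = [(1, "user"), (1, "reason"), (1, "resp"), (1, "resp_dup"),
        (2, "user"), (2, "reason"), (2, "resp")]
m = build_mask(toks)
# Turn-2 response sees turn-1 user and resp_dup, but not turn-1 reasoning:
print([m[6][k] for k in range(7)])
```

Building one such mask lets the whole conversation run in a single forward pass while each turn's loss still matches what the per-turn passes would compute.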