AITopics

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community. Corpus v. 1.0 https://huggingface.co/datasets/cis-lmu/GlotCC-v1, Pipeline v. 3.0 https://github.com/cisnlp/GlotCC.

computational linguistic, dataset, glotcc, (16 more...)

2410.23825

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
Asia > Indonesia > Bali (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(24 more...)

Genre: Research Report (1.00)

Industry:

Law (0.93)
Information Technology (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

Singh, Rajmeet, Seneviratne, Lakmal, Hussain, Irfan

A Comprehensive Review of Current Robot-Based Pollinators in Greenhouse Farming

The decline of bee and wind-based pollination systems in greenhouses due to controlled environments and limited access has boosted the importance of finding alternative pollination methods. Robot-based pollination systems have emerged as a promising solution, ensuring adequate crop yield even in challenging pollination scenarios. This paper presents a comprehensive review of the current robot-based pollinators employed in greenhouses. The review categorizes pollinator technologies into major categories such as air-jet, water-jet, linear actuator, ultrasonic wave, and air-liquid spray, each suitable for specific crop pollination requirements. However, these technologies are often tailored to particular crops, limiting their versatility. The advancement of science and technology has led to the integration of automated pollination technology, encompassing information technology, automatic perception, detection, control, and operation. This integration not only reduces labor costs but also fosters the ongoing progress of modern agriculture by refining technology, enhancing automation, and promoting intelligence in agricultural practices. Finally, the challenges encountered in pollinator design are addressed, and a forward-looking perspective is taken towards future developments, aiming to contribute to the sustainable advancement of this technology.

farming, pollination, pollinator, (15 more...)

2410.23747

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
South America > Brazil (0.14)
Asia > Japan (0.04)
(8 more...)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.48)

Industry: Food & Agriculture > Agriculture (1.00)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)

Holt, Samuel, Liu, Tennison, van der Schaar, Mihaela

Automatically Learning Hybrid Digital Twins of Dynamical Systems

Digital Twins (DTs) are computational models that simulate the states and temporal dynamics of real-world systems, playing a crucial role in prediction, understanding, and decision-making across diverse domains. However, existing approaches to DTs often struggle to generalize to unseen conditions in data-scarce settings, a crucial requirement for such models. To address these limitations, our work begins by establishing the essential desiderata for effective DTs. Hybrid Digital Twins (HDTwins) represent a promising approach to address these requirements, modeling systems using a composition of both mechanistic and neural components. This hybrid architecture simultaneously leverages (partial) domain knowledge and neural network expressiveness to enhance generalization, with its modular design facilitating improved evolvability. While existing hybrid models rely on expert-specified architectures with only parameters optimized on data, automatically specifying and optimizing HDTwins remains intractable due to the complex search space and the need for flexible integration of domain priors. To overcome this complexity, we propose an evolutionary algorithm (HDTwinGen) that employs Large Language Models (LLMs) to autonomously propose, evaluate, and optimize HDTwins. Specifically, LLMs iteratively generate novel model specifications, while offline tools are employed to optimize emitted parameters. Correspondingly, proposed models are evaluated and evolved based on targeted feedback, enabling the discovery of increasingly effective hybrid models. Our empirical results reveal that HDTwinGen produces generalizable, sample-efficient, and evolvable models, significantly advancing DTs' efficacy in real-world applications.

dataset, tensor, torch, (14 more...)

2410.23691

Country:

North America > United States (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)
Research Report > Promising Solution (0.85)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Epidemiology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Debiasing Alternative Data for Credit Underwriting Using Causal Inference

Lam, Chris

Alternative data provides valuable insights for lenders to evaluate a borrower's creditworthiness, which could help expand credit access to underserved groups and lower costs for borrowers. But some forms of alternative data have historically been excluded from credit underwriting because they could act as an illegal proxy for a protected class like race or gender, causing redlining. We propose a method for applying causal inference to a supervised machine learning model to debias alternative data so that it might be used for credit underwriting. We demonstrate how our algorithm can be used against a public credit dataset to improve model accuracy across different racial groups, while providing theoretically robust nondiscrimination guarantees.

alternative data, borrower, discrimination, (16 more...)

2410.22382

Country:

North America > United States > New York > Kings County > New York City (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Missouri > Jackson County > Kansas City (0.04)
(9 more...)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Banking & Finance > Credit (1.00)
Banking & Finance > Insurance (0.82)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Machine Learning Oct-31-2024

Bridging Geometric States via Geometric Diffusion Bridge

Luo, Shengjie, Xu, Yixian, He, Di, Zheng, Shuxin, Liu, Tie-Yan, Wang, Liwei

The accurate prediction of geometric state evolution in complex systems is critical for advancing scientific domains such as quantum chemistry and material modeling. Traditional experimental and computational methods face challenges in terms of environmental constraints and computational demands, while current deep learning approaches still fall short in terms of precision and generality. In this work, we introduce the Geometric Diffusion Bridge (GDB), a novel generative modeling framework that accurately bridges initial and target geometric states. GDB leverages a probabilistic approach to evolve geometric state distributions, employing an equivariant diffusion bridge derived by a modified version of Doob's $h$-transform for connecting geometric states. This tailored diffusion process is anchored by initial and target geometric states as fixed endpoints and governed by equivariant transition kernels. Moreover, trajectory data can be seamlessly leveraged in our GDB framework by using a chain of equivariant diffusion bridges, providing a more detailed and accurate characterization of evolution dynamics. Theoretically, we conduct a thorough examination to confirm our framework's ability to preserve joint distributions of geometric states and capability to completely model the underlying dynamics inducing trajectory distributions with negligible error. Experimental evaluations across various real-world scenarios show that GDB surpasses existing state-of-the-art approaches, opening up a new pathway for accurately bridging geometric states and tackling crucial scientific challenges with improved accuracy and applicability.
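
As background for the "diffusion bridge" idea mentioned in the abstract above, the following is a minimal sketch of the classical one-dimensional Brownian bridge, which is what Doob's $h$-transform yields when standard Brownian motion is conditioned to hit a fixed endpoint. It is not the paper's GDB method: there is no equivariance, no learned component, and no geometric state. The function name `sample_brownian_bridge` and all parameters are hypothetical, chosen for illustration only.

```python
import math
import random


def sample_brownian_bridge(x0, x1, n_steps=100, total_time=1.0, seed=0):
    """Sample one path of a scalar Brownian bridge pinned at x0 and x1.

    Illustrative only: uses the exact Gaussian transition of the classical
    Brownian bridge, not the equivariant bridge described in the abstract.
    """
    rng = random.Random(seed)
    dt = total_time / n_steps
    x = float(x0)
    path = [x]
    for k in range(n_steps):
        remaining = total_time - k * dt  # time left until the endpoint
        # Conditional mean drifts toward x1; the pull strengthens as time runs out.
        mean = x + (dt / remaining) * (x1 - x)
        # Conditional variance shrinks to zero at the final step.
        var = dt * (remaining - dt) / remaining
        x = mean + math.sqrt(max(var, 0.0)) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


path = sample_brownian_bridge(x0=0.0, x1=2.0, n_steps=50)
```

The drift term `(x1 - x) / remaining` is the part contributed by the $h$-transform: it steers every sample path to the target endpoint while the noise keeps intermediate states random.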

equivariant diffusion bridge, geometric state, international conference, (9 more...)

2410.24220

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(4 more...)

Genre:

Research Report > Promising Solution (0.34)
Overview > Innovation (0.34)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

AIHub Oct-30-2024, 11:19:11 GMT

Congratulations to the #ECAI2024 outstanding paper award winners

The 27th European Conference on Artificial Intelligence (ECAI-2024) took place from 19-24 October in Santiago de Compostela, Spain. The venue also played host to the 13th Conference on Prestigious Applications of Intelligent Systems (PAIS-2024). During the week, both conferences announced their outstanding paper award winners. The winning articles were chosen based on the reviews written during the paper selection process, nominations submitted by individual members of the programme committee, additional input solicited from outside experts, and the judgement of the programme committee chairs. Abstract: Proper losses such as cross-entropy incentivize classifiers to produce class probabilities that are well-calibrated on the training data.

classifier, outstanding paper award winner, temperature scaling, (13 more...)

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.25)
Europe > Spain > Galicia > A Coruña Province > Santiago de Compostela (0.25)

Genre: Personal > Honors > Award (0.61)

Industry: Water & Waste Management > Solid Waste Management (0.33)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.72)

LLMs as Research Tools: A Large Scale Survey of Researchers' Usage and Perceptions

Liao, Zhehui, Antoniak, Maria, Cheong, Inyoung, Cheng, Evie Yu-Yen, Lee, Ai-Heng, Lo, Kyle, Chang, Joseph Chee, Zhang, Amy X.

The rise of large language models (LLMs) has led many researchers to consider their usage for scientific work. Some have found benefits using LLMs to augment or automate aspects of their research pipeline, while others have urged caution due to risks and ethical concerns. Yet little work has sought to quantify and characterize how researchers use LLMs and why. We present the first large-scale survey of 816 verified research article authors to understand how the research community leverages and perceives LLMs as research tools. We examine participants' self-reported LLM usage, finding that 81% of researchers have already incorporated LLMs into different aspects of their research workflow. We also find that traditionally disadvantaged groups in academia (non-White, junior, and non-native English speaking researchers) report higher LLM usage and perceived benefits, suggesting potential for improved research equity. However, women, non-binary, and senior researchers have greater ethical concerns, potentially hindering adoption.

large language model, machine learning, natural language, (22 more...)

2411.05025

Country:

North America > United States > New York > New York County > New York City (0.05)
Europe > Denmark > Capital Region > Copenhagen (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(8 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Health & Medicine (1.00)
Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)

Beyond Ontology in Dialogue State Tracking for Goal-Oriented Chatbot

Lee, Sejin, Kim, Dongha, Song, Min

Goal-oriented chatbots are essential for automating user tasks, such as booking flights or making restaurant reservations. A key component of these systems is Dialogue State Tracking (DST), which interprets user intent and maintains the dialogue state. However, existing DST methods often rely on fixed ontologies and manually compiled slot values, limiting their adaptability to open-domain dialogues. We propose a novel approach that leverages instruction tuning and advanced prompt strategies to enhance DST performance, without relying on any predefined ontologies. Our method enables a Large Language Model (LLM) to infer dialogue states through carefully designed prompts and includes an anti-hallucination mechanism to ensure accurate tracking in diverse conversation contexts. Additionally, we employ a Variational Graph Auto-Encoder (VGAE) to model and predict subsequent user intent. Our approach achieved state-of-the-art performance with a joint goal accuracy (JGA) of 42.57%, outperforming existing ontology-less DST models, and performed well in open-domain real-world conversations. This work presents a significant advancement in creating more adaptive and accurate goal-oriented chatbots.
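
The JGA figure quoted above refers to joint goal accuracy, which is commonly defined as the fraction of dialogue turns whose predicted state matches the gold state on every slot. A minimal sketch of that common definition follows; the function name and the slot names in the example are hypothetical, and the paper's exact evaluation protocol (for instance, how slot values are normalized) may differ.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns where the predicted state equals the gold state exactly.

    Each state is a dict mapping slot name to value. A turn counts as correct
    only if every slot matches and no extra slot is predicted.
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold turn counts differ")
    if not gold_states:
        return 0.0
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)


# Hypothetical two-turn dialogue: the second turn gets one slot wrong.
gold = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "thai"},
]
pred = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "italian"},
]
score = joint_goal_accuracy(pred, gold)  # one of two turns is fully correct
```

Because a single wrong slot makes the whole turn count as incorrect, JGA is a strict metric, which helps explain why reported values can look low compared with per-slot accuracy.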

computational linguistic, dialogue state, ontology, (11 more...)

2410.22767

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Indonesia > Bali (0.05)
Asia > Singapore (0.04)
(13 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
(2 more...)

Barros, Daniel, Fraga-Lamas, Paula, Fernandez-Carames, Tiago M., Lopes, Sergio Ivan

A Cost-Effective Thermal Imaging Safety Sensor for Industry 5.0 and Collaborative Robotics

The Industry 5.0 paradigm focuses on industrial operator well-being and sustainable manufacturing practices, where humans play a central role, not only during the repetitive and collaborative tasks of the manufacturing process, but also in the management of the factory floor assets. Human factors, such as ergonomics, safety, and well-being, push the human-centric smart factory to efficiently adopt novel technologies while minimizing environmental and social impact. As operations at the factory floor increasingly rely on collaborative robots (CoBots) and flexible manufacturing systems, there is a growing demand for redundant safety mechanisms (i.e., automatic human detection in the proximity of machinery that is under operation). Fostering enhanced process safety for human proximity detection allows for the protection against possible incidents or accidents with the deployed industrial devices and machinery. This paper introduces the design and implementation of a cost-effective thermal imaging Safety Sensor that can be used in the scope of Industry 5.0 to trigger distinct safe mode states in manufacturing processes that rely on collaborative robotics. The proposed Safety Sensor uses a hybrid detection approach and has been evaluated under controlled environmental conditions. The obtained results show a 97% accuracy at low computational cost when using the developed hybrid method to detect the presence of humans in thermal images.

cost-effective thermal imaging safety sensor, industry 5, sensor, (10 more...)

doi: 10.1007/978-3-031-35982-8_1

2410.23377

Country:

Europe > Portugal > Viana do Castelo > Viana do Castelo (0.05)
Europe > Spain > Galicia > A Coruña Province > A Coruña (0.04)
Europe > Portugal > Braga > Braga (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Health & Medicine (0.89)
Information Technology (0.68)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Survey of Cultural Awareness in Language Models: Text and Beyond

Pawar, Siddhesh, Park, Junyeong, Jin, Jiho, Arora, Arnav, Myung, Junho, Yadav, Srishti, Haznitrama, Faiz Ghifari, Song, Inhwa, Oh, Alice, Augenstein, Isabelle

dataset creation methodology, neural information processing system 2022, neural information processing system 35, (16 more...)

Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure inclusivity. Culture has been widely studied in psychology and anthropology, and there has been a recent surge in research on making LLMs more culturally inclusive that goes beyond multilinguality and builds on findings from psychology and anthropology. In this paper, we survey efforts towards incorporating cultural awareness into text-based and multimodal LLMs. We start by defining cultural awareness in LLMs, taking the definitions of culture from anthropology and psychology as a point of departure. We then examine methodologies adopted for creating cross-cultural datasets, strategies for cultural inclusion in downstream tasks, and methodologies that have been used for benchmarking cultural awareness in LLMs. Further, we discuss the ethical implications of cultural alignment, the role of Human-Computer Interaction in driving cultural inclusion in LLMs, and the role of cultural alignment in driving social science research. We finally provide pointers to future research based on our findings about gaps in the literature.

2411.00860

Country:

North America > United States > Washington > King County > Seattle (0.27)
Asia > South Korea (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(67 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment (1.00)
Health & Medicine > Therapeutic Area (1.00)
Education > Educational Setting > K-12 Education (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)