PeLLE: Encoder-based language models for Brazilian Portuguese based on open data
de Mello, Guilherme Lamartine, Finger, Marcelo, Serras, Felipe, Carpi, Miguel de Mello, Jose, Marcos Menon, Domingues, Pedro Henrique, Cavalim, Paulo
In this paper we present PeLLE, a family of large language models based on the RoBERTa architecture, for Brazilian Portuguese, trained on curated, open data from the Carolina corpus. Aiming at reproducible results, we describe details of the pretraining of the models. We also evaluate PeLLE models against a set of existing multilingual and PT-BR refined pretrained Transformer-based LLM encoders, contrasting the performance of large versus smaller-but-curated pretrained models in several downstream tasks. We conclude that several tasks perform better with larger models, but some tasks benefit from smaller-but-curated data in their pretraining.
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- South America > Brazil > São Paulo (0.04)
- (4 more...)
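The RoBERTa-style pretraining mentioned above relies on a masked-language-modeling objective with dynamic masking. A minimal sketch of that corruption step follows; it is a simplification (real RoBERTa operates on subword tokens and sometimes substitutes random tokens instead of the mask symbol), and the example sentence is illustrative:

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Dynamically mask a fraction of tokens for MLM pretraining.

    Returns the corrupted sequence and the (position, original-token)
    pairs the model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets.append((i, tok))
        else:
            corrupted.append(tok)
    return corrupted, targets

sentence = "um modelo de linguagem treinado com dados abertos".split()
corrupted, targets = mask_tokens(sentence)
```

Because the masking is re-sampled each epoch (here, via the seed), the model sees different corruptions of the same sentence over training.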
Towards building a monitoring platform for a challenge-oriented smart specialisation with RIS3-MCAT
Fuster, Enric, Fernández, Tatiana, Carretero, Hermes, Duran-Silva, Nicolau, Guixé, Roger, Pujol, Josep, Rondelli, Bernardo, Rull, Guillem, Cortijo, Marta, Romagosa, Montserrat
In the new research and innovation (R&I) paradigm, aimed at a transformation towards more sustainable, inclusive and fair pathways to address societal and environmental challenges, and at generating new patterns of specialisation and new trajectories for socioeconomic development, it is essential to provide monitoring systems and tools to map and understand the contribution of R&I policies and projects. To address this transformation, we present the RIS3-MCAT platform, the result of a line of work aimed at exploring the potential of open data, semantic analysis, and data visualisation for monitoring challenge-oriented smart specialisation in Catalonia. RIS3-MCAT is an interactive platform that facilitates access to R&I project data in formats that allow for sophisticated analyses of a large volume of texts, enabling the detailed study of thematic specialisations and challenges beyond classical classification systems. Its conceptualisation, development framework and use are presented in this paper.

Keywords: open data, research and innovation policy, smart specialisation strategies, text mining, data visualisation, scientometrics

1. INTRODUCTION

The challenges posed by globalisation, technology, climate change, and the COVID-19 pandemic require significant changes in our way of living. Although large transition costs are associated with a successful attainment of all those challenges, the potential opportunities brought about are enormous (Bigas et al., 2021).
- North America > Montserrat (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Government (1.00)
- Health & Medicine (0.74)
Are Large Language Models a Threat to Digital Public Goods? Evidence from Activity on Stack Overflow
del Rio-Chanona, Maria, Laurentsyeva, Nadzeya, Wachs, Johannes
Large language models like ChatGPT efficiently provide users with information about various topics, presenting a potential substitute for searching the web and asking people for help online. But since users interact privately with the model, these models may drastically reduce the amount of publicly available human-generated data and knowledge resources. This substitution can present a significant problem in securing training data for future models. In this work, we investigate how the release of ChatGPT changed human-generated open data on the web by analyzing the activity on Stack Overflow, the leading online Q&A platform for computer programming. We find that relative to its Russian and Chinese counterparts, where access to ChatGPT is limited, and to similar forums for mathematics, where ChatGPT is less capable, activity on Stack Overflow significantly decreased. A difference-in-differences model estimates a 16% decrease in weekly posts on Stack Overflow. This effect increases in magnitude over time, and is larger for posts related to the most widely used programming languages. Posts made after ChatGPT's release receive similar voting scores to those made before, suggesting that ChatGPT is not merely displacing duplicate or low-quality content. These results suggest that more users are adopting large language models to answer questions, and that they are better substitutes for Stack Overflow for languages for which they have more training data. Using models like ChatGPT may be more efficient for solving certain programming problems, but its widespread adoption and the resulting shift away from public exchange on the web will limit the open data people and models can learn from in the future.
- Asia > Russia (0.28)
- Europe > Russia (0.14)
- North America > United States > New York (0.04)
- (7 more...)
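The difference-in-differences estimator behind the 16% figure compares the before/after change in a treated group against the same change in a control group; the interaction coefficient isolates the treatment effect. A self-contained sketch with synthetic numbers (illustrative only, not the paper's data):

```python
import numpy as np

# Synthetic log weekly post counts for a treated forum (ChatGPT
# accessible) and a control forum (access limited). Values are invented.
baseline = {"treated": 10.0, "control": 9.5}
trend = 0.02     # shift common to both groups in the post period
effect = -0.16   # simulated treatment effect (~16% drop in logs)

rows = []
for group in ("treated", "control"):
    for post in (0, 1):
        y = baseline[group] + trend * post
        if group == "treated" and post:
            y += effect
        treated = 1 if group == "treated" else 0
        # design: intercept, treated, post, treated*post
        rows.append((1.0, treated, post, treated * post, y))

data = np.array(rows)
X, y = data[:, :4], data[:, 4]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_estimate = beta[3]   # the interaction term is the DiD effect
```

Because the common trend cancels in the double difference, `did_estimate` recovers the simulated -0.16 exactly here; with real data one would add controls and standard errors.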
Open Data on GitHub: Unlocking the Potential of AI
Roman, Anthony Cintron, Xu, Kevin, Smith, Arfon, Vega, Jehu Torres, Robinson, Caleb, Ferres, Juan M Lavista
GitHub is the world's largest platform for collaborative software development, with over 100 million users. GitHub is also used extensively for open data collaboration, hosting more than 800 million open data files, totaling 142 terabytes of data. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research. We analyze the existing landscape of open data on GitHub and the patterns of how users share datasets. Our findings show that GitHub is one of the largest hosts of open data in the world and has experienced an accelerated growth of open data assets over the past four years. By examining the open data landscape on GitHub, we aim to empower users and organizations to leverage existing open datasets and improve their discoverability -- ultimately contributing to the ongoing AI revolution to help address complex societal issues. We release the three datasets that we have collected to support this analysis as open datasets at https://github.com/github/open-data-on-github.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Greece > Attica > Athens (0.04)
Towards the Automatic Generation of Conversational Interfaces to Facilitate the Exploration of Tabular Data
Gomez, Marcos, Cabot, Jordi, Clarisó, Robert
Tabular data is the most common format to publish and exchange structured data online. A clear example is the growing number of open data portals published by all types of public administrations. However, exploitation of these data sources is currently limited to technical people able to programmatically manipulate and digest such data. As an alternative, we propose the use of chatbots to offer a conversational interface that facilitates the exploration of tabular data sources. With our approach, any regular citizen can benefit from and leverage them. Moreover, our chatbots are not manually created: instead, they are automatically generated from the data source itself thanks to the instantiation of a configurable collection of conversation patterns.
- North America > United States > Texas > Kleberg County (0.04)
- North America > United States > Texas > Jack County (0.04)
- North America > United States > Texas > Chambers County (0.04)
- (3 more...)
- Research Report (0.64)
- Workflow (0.46)
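The pattern-instantiation idea above can be sketched in a few lines: inspect the columns of a tabular source and stamp out one intent per (column, pattern) pair. This is not the authors' implementation; the patterns and intent names below are hypothetical, and a real system would attach NLU training phrases rather than bare callables:

```python
import csv
import io
import statistics

def generate_chatbot(csv_text):
    """Instantiate simple conversation patterns from a tabular source.

    Numeric columns get 'max'/'average' intents; text columns get a
    'values' intent. Each intent maps to a callable answering it.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    intents = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        try:
            nums = [float(v) for v in values]   # raises for text columns
            intents[f"max of {col}"] = lambda ns=nums: max(ns)
            intents[f"average of {col}"] = lambda ns=nums: statistics.mean(ns)
        except ValueError:
            intents[f"values of {col}"] = lambda vs=values: sorted(set(vs))
    return intents

data = "city,population\nBarcelona,1620000\nGirona,103000\n"
bot = generate_chatbot(data)
answer = bot["max of population"]()
```

The key design point is that nothing here is hand-written per dataset: the same pattern collection regenerates a different chatbot for any CSV fed in.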
Trends and Challenges Towards an Effective Data-Driven Decision Making in UK SMEs: Case Studies and Lessons Learnt from the Analysis of 85 SMEs
Tawil, Abdel-Rahman, Mohamed, Muhidin, Schmoor, Xavier, Vlachos, Konstantinos, Haidar, Diana
The adoption of data science brings vast benefits to Small and Medium-sized Enterprises (SMEs), including business productivity, economic growth, innovation and job creation. Data science can support SMEs to optimise production processes, anticipate customers' needs, predict machinery failures and deliver efficient smart services. Businesses can also harness the power of Artificial Intelligence (AI) and Big Data and the smart use of digital technologies to enhance productivity and performance, paving the way for innovation. However, integrating data science decisions into an SME requires both skills and IT investments. In most cases, such expenses are beyond the means of SMEs due to limited resources and restricted access to financing. This paper presents trends and challenges towards effective data-driven decision making for organisations, based on a case study of 85 SMEs, mostly from the West Midlands region of England. The work is supported as part of a three-year ERDF (European Regional Development Fund) project in the areas of big data management, analytics and business intelligence. We present two case studies that demonstrate the potential of digitisation, AI and machine learning, and use these as examples to unveil challenges and showcase the wealth of opportunities currently available to SMEs.
- North America > United States > Hawaii (0.04)
- Europe > United Kingdom > Scotland (0.04)
- Europe > United Kingdom > Northern Ireland (0.04)
- (2 more...)
- Workflow (1.00)
- Research Report > New Finding (0.93)
- Overview (0.93)
- Law (1.00)
- Health & Medicine (1.00)
- Banking & Finance > Economy (1.00)
- (2 more...)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.87)
The Water Health Open Knowledge Graph
Carletti, Gianluca, Giulianelli, Elio, Lippolis, Anna Sofia, Lodi, Giorgia, Nuzzolese, Andrea Giovanni, Picone, Marco, Settanta, Giulio
Recently, an increasing interest in the management of water and health resources has been recorded. This interest is fed by the global sustainability challenges posed to humanity that have water scarcity and quality at their core. Thus, the availability of effective, meaningful and open data is crucial to address those issues in the broader context of the Sustainable Development Goals of clean water and sanitation as targeted by the United Nations. In this paper, we present the Water Health Open Knowledge Graph (WHOW-KG) along with its design methodology and an analysis of its impact. WHOW-KG is a semantic knowledge graph that models data on water consumption, pollution, infectious disease rates and drug distribution. The WHOW-KG is developed in the context of the EU-funded WHOW (Water Health Open Knowledge) project and aims at supporting a wide range of applications, from knowledge discovery to decision-making, making it a valuable resource for researchers, policymakers, and practitioners in the water and health domains. The WHOW-KG consists of a network of five ontologies and related linked open data, modelled according to those ontologies.
- Europe > Italy > Lazio > Rome (0.04)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- South America > Colombia > Bogotá D.C. > Bogotá (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
- Health & Medicine (1.00)
- Government (1.00)
- Water & Waste Management > Water Management > Water Supplies & Services (0.30)
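At its core, a knowledge graph like WHOW-KG is a set of subject-predicate-object triples queried by pattern matching. A minimal sketch follows; the `ex:` terms are invented stand-ins, not the actual WHOW ontology vocabulary:

```python
# Illustrative triples linking a monitoring station, a water-quality
# indicator, and a health outcome (all identifiers are hypothetical).
triples = {
    ("ex:station42", "ex:measures", "ex:nitrateLevel"),
    ("ex:station42", "ex:locatedIn", "ex:riverPo"),
    ("ex:nitrateLevel", "ex:relatedDisease", "ex:methemoglobinemia"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard,
    like a variable in a SPARQL basic graph pattern."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

station_facts = query(s="ex:station42")          # everything about one station
disease_links = query(p="ex:relatedDisease")     # all indicator-disease edges
```

Chaining such patterns is what lets a graph join water data to health data across the project's five ontologies, which separate tabular sources cannot do directly.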
Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics
Choksi, Madiha Zahrah, Goedicke, David
Intelligent or generative writing tools rely on large language models that recognize, summarize, translate, and predict content. This position paper probes the copyright interests of open data sets used to train large language models (LLMs). Our paper asks: how do LLMs trained on open data sets circumvent the copyright interests of the data they are trained on? We start by defining software copyright and tracing its history. We rely on GitHub Copilot as a modern case study challenging software copyright. Our conclusion outlines obstacles that generative writing assistants create for copyright, and offers a practical road map for copyright analysis for developers, software law experts, and general users to consider in the context of intelligent LLM-powered writing tools.
Mapping STI ecosystems via Open Data: overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark
Bovenzi, Nicandro, Duran-Silva, Nicolau, Massucci, Francesco Alessandro, Multari, Francesco, Parra-Rojas, Cèsar, Pujol-Llatse, Josep
Science, Technology and Innovation (STI) decision-makers often need a clear vision of what is researched and by whom in order to design effective policies. Such a vision is provided by effective and comprehensive mappings of the research activities carried out within their institutional boundaries. A major challenge in this context is the difficulty of accessing the relevant data and of combining information from different sources: indeed, STI data has traditionally been confined within closed data sources and, when available, is categorised with different taxonomies. Here, we present a proof-of-concept study of the use of open resources to map the research landscape on Sustainable Development Goal (SDG) 13 - Climate Action for an entire country, Denmark, and we map it onto the 25 ERC panels.
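Reconciling conflicting taxonomies ultimately requires scoring each document against a target scheme. A toy sketch of keyword-overlap scoring against ERC-style panels follows; the keyword lists are hypothetical placeholders, whereas real mappings would come from curated vocabularies or trained classifiers:

```python
# Hypothetical keyword sets per target panel (panel codes follow the
# ERC naming style; the keywords themselves are invented).
PANEL_KEYWORDS = {
    "PE10 Earth System Science": {"climate", "emissions", "ocean"},
    "LS8 Ecology and Evolution": {"biodiversity", "species", "habitat"},
    "SH7 Human Mobility, Environment and Space": {"urban", "mobility"},
}

def classify(abstract):
    """Score a text against each panel by fractional keyword overlap."""
    words = set(abstract.lower().split())
    scores = {
        panel: len(words & kws) / len(kws)
        for panel, kws in PANEL_KEYWORDS.items()
    }
    return max(scores, key=scores.get), scores

panel, scores = classify(
    "Reducing emissions to limit climate change in ocean systems"
)
```

The same scoring function can be run against any source taxonomy's documents, which is what makes a shared target scheme a bridge between otherwise incompatible classifications.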
Understanding the Ethical Use of Open Data While Protecting PII
People have been wondering for years when, and sometimes even if, artificial intelligence would live up to its incredible potential. The technology is finally beginning to change industries and lives. Now implemented across everything from smartphone cameras and self-driving vehicles to manufacturing facilities, AI has racked up numerous high-profile success stories: people now rely on AI to silently optimize photos, perfect their parallel parking, and discover product defects. AI can be either cool or creepy, and it is currently on the right side of that line. At the same time, however, the public is becoming increasingly aware of AI ethics, as researchers and journalists question the sources of data powering AI innovations and spotlight ways AI data is being misused by tech giants.
- Information Technology > Security & Privacy (1.00)
- Law (0.92)
- Government > Regional Government > North America Government > United States Government (0.48)