AITopics | Little Rock

A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

Chung, Simon, Vorland, Colby J., Maney, Donna L., Brown, Andrew W.

arXiv.org Machine Learning Dec-15-2025

Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculate weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
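
The idea of weighting whole label combinations rather than individual labels can be illustrated with a minimal sketch. This is not the authors' algorithm: the tempering exponent `alpha`, the function names, and the use of plain empirical combination frequencies (instead of fitted multivariate Bernoulli parameters) are all assumptions made for illustration.

```python
import random
from collections import Counter


def combination_weights(label_rows, alpha=0.5):
    """Weight each distinct label combination by (empirical frequency) ** -alpha.

    alpha is a hypothetical tempering knob: alpha=0 gives every row the same
    weight, and larger values favor rare combinations more strongly.
    """
    counts = Counter(label_rows)
    n = len(label_rows)
    return {combo: (c / n) ** -alpha for combo, c in counts.items()}


def weighted_subsample(label_rows, k, alpha=0.5, seed=0):
    """Draw k distinct row indices with probability tied to combination weights.

    Uses Efraimidis-Spirakis keys (u ** (1 / w)) for weighted sampling
    without replacement; the k largest keys are kept.
    """
    rng = random.Random(seed)
    weights = combination_weights(label_rows, alpha)
    keyed = [
        (rng.random() ** (1.0 / weights[row]), i)
        for i, row in enumerate(label_rows)
    ]
    keyed.sort(reverse=True)
    return sorted(i for _, i in keyed[:k])


def label_frequencies(label_rows, indices):
    """Marginal frequency of each label among the selected rows."""
    n_labels = len(label_rows[0])
    return [
        sum(label_rows[i][j] for i in indices) / len(indices)
        for j in range(n_labels)
    ]
```

Each row is a tuple of 0/1 indicators, one per label. Comparing `label_frequencies` on the full index range with the same call on the indices returned by `weighted_subsample` shows how far the sub-sample has moved from the population frequencies for a given `alpha`.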

algorithm, category, marginal distribution, (14 more...)

arXiv.org Machine Learning

arXiv: 2512.08371

Country:

North America > United States > Arkansas > Pulaski County > Little Rock (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Indiana > Monroe County > Bloomington (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Embedding Reliability Verification Constraints into Generation Expansion Planning

Liu, Peng, Cheng, Lian, Omell, Benjamin P., Burgard, Anthony P.

arXiv.org Machine Learning Apr-6-2025

Generation planning approaches face challenges in managing the incompatible mathematical structures between stochastic production simulations for reliability assessment and optimization models for generation planning, which hinders the integration of reliability constraints. This study proposes an approach to embedding reliability verification constraints into generation expansion planning by leveraging a weighted oblique decision tree (WODT) technique. For each planning year, a generation mix dataset, labeled with reliability assessment simulations, is generated. A WODT model is trained using this dataset. Reliability-feasible regions are extracted via a depth-first search technique and formulated as disjunctive constraints. These constraints are then transformed into mixed-integer linear form using a convex hull modeling technique and embedded into a unit commitment-integrated generation expansion planning model. The proposed approach is validated through a long-term generation planning case study for the Electric Reliability Council of Texas (ERCOT) region, demonstrating its effectiveness in achieving reliable and optimal planning solutions.

artificial intelligence, constraint, planning & scheduling, (16 more...)

arXiv.org Machine Learning

arXiv: 2504.07131

Country:

North America > United States > Texas (0.25)
Asia > China > Heilongjiang Province > Harbin (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry:

Government (1.00)
Energy > Renewable > Solar (1.00)
Energy > Power Industry > Utilities (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.54)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)

Building Machine Learning Challenges for Anomaly Detection in Science

Campolongo, Elizabeth G., Chou, Yuan-Tang, Govorkova, Ekaterina, Bhimji, Wahid, Chao, Wei-Lun, Harris, Chris, Hsu, Shih-Chieh, Lapp, Hilmar, Neubauer, Mark S., Namayanja, Josephine, Subramanian, Aneesh, Harris, Philip, Anand, Advaith, Carlyn, David E., Ghosh, Subhankar, Lawrence, Christopher, Moreno, Eric, Raikman, Ryan, Wu, Jiaman, Zhang, Ziheng, Adhi, Bayu, Gharehtoragh, Mohammad Ahmadi, Monsalve, Saúl Alonso, Babicz, Marta, Baig, Furqan, Banerji, Namrata, Bardon, William, Barna, Tyler, Berger-Wolf, Tanya, Dieng, Adji Bousso, Brachman, Micah, Buat, Quentin, Hui, David C. Y., Cao, Phuong, Cerino, Franco, Chang, Yi-Chun, Chaulagain, Shivaji, Chen, An-Kai, Chen, Deming, Chen, Eric, Chou, Chia-Jui, Ciou, Zih-Chen, Cochran-Branson, Miles, Choi, Artur Cordeiro Oudot, Coughlin, Michael, Cremonesi, Matteo, Dadarlat, Maria, Darch, Peter, Desai, Malina, Diaz, Daniel, Dillmann, Steven, Duarte, Javier, Duporge, Isla, Ekka, Urbas, Heravi, Saba Entezari, Fang, Hao, Flynn, Rian, Fox, Geoffrey, Freed, Emily, Gao, Hang, Gao, Jing, Gonski, Julia, Graham, Matthew, Hashemi, Abolfazl, Hauck, Scott, Hazelden, James, Peterson, Joshua Henry, Hoang, Duc, Hu, Wei, Huennefeld, Mirco, Hyde, David, Janeja, Vandana, Jaroenchai, Nattapon, Jia, Haoyi, Kang, Yunfan, Kholiavchenko, Maksim, Khoda, Elham E., Kim, Sangin, Kumar, Aditya, Lai, Bo-Cheng, Le, Trung, Lee, Chi-Wei, Lee, JangHyeon, Lee, Shaocheng, van der Lee, Suzan, Lewis, Charles, Li, Haitong, Li, Haoyang, Liao, Henry, Liu, Mia, Liu, Xiaolin, Liu, Xiulong, Loncar, Vladimir, Lyu, Fangzheng, Makarov, Ilya, Mao, Abhishikth Mallampalli Chen-Yu, Michels, Alexander, Migala, Alexander, Mokhtar, Farouk, Morlighem, Mathieu, Namgung, Min, Novak, Andrzej, Novick, Andrew, Orsborn, Amy, Padmanabhan, Anand, Pan, Jia-Cheng, Pandya, Sneh, Pei, Zhiyuan, Peixoto, Ana, Percivall, George, Leung, Alex Po, Purushotham, Sanjay, Que, Zhiqiang, Quinnan, Melissa, Ranjan, Arghya, Rankin, Dylan, Reissel, Christina, Riedel, Benedikt, Rubenstein, Dan, Sasli, Argyro, 
Shlizerman, Eli, Singh, Arushi, Singh, Kim, Sokol, Eric R., Sorensen, Arturo, Su, Yu, Taheri, Mitra, Thakkar, Vaibhav, Thomas, Ann Mariam, Toberer, Eric, Tsai, Chenghan, Vandewalle, Rebecca, Verma, Arjun, Venterea, Ricco C., Wang, He, Wang, Jianwu, Wang, Sam, Wang, Shaowen, Watts, Gordon, Weitz, Jason, Wildridge, Andrew, Williams, Rebecca, Wolf, Scott, Xu, Yue, Yan, Jianqi, Yu, Jai, Zhang, Yulei, Zhao, Haoran, Zhao, Ying, Zhong, Yibo

arXiv.org Artificial Intelligence Mar-3-2025

Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.

dataset, detection, university, (15 more...)

arXiv.org Artificial Intelligence

arXiv: 2503.02112

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Colorado > Boulder County > Boulder (0.28)
North America > United States > Wisconsin > Dane County > Madison (0.14)
(45 more...)

Genre: Research Report (0.40)

Industry:

Energy (0.68)
Health & Medicine (0.48)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging

Wang, Shansong, Safari, Mojtaba, Li, Qiang, Chang, Chih-Wei, Qiu, Richard LJ, Roper, Justin, Yu, David S., Yang, Xiaofeng

arXiv.org Artificial Intelligence Feb-22-2025

Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general representations for diverse types of data. These models can subsequently be fine-tuned for specific downstream tasks, significantly boosting performance across a broad range of applications. However, existing vision foundation models that claim to be applicable to various clinical tasks are mostly pre-trained on 3D computed tomography (CT), which benefits from the availability of extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may hinder their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. The above pre-training dataset is called Triad-131K, which is currently the largest 3D MRI pre-training dataset. We evaluate Triad across three tasks, namely, organ/tumor segmentation, organ/cancer classification, and medical image registration, in both within-domain and out-of-domain data-modality settings using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 2.51% compared to nnUNet-Scratch across 17 datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% compared to SwinUNETR-Scratch in registration tasks across two datasets. Our study demonstrates that pre-training can improve performance when the data modalities and organs of upstream and downstream tasks are consistent.

dataset, segmentation, triad, (14 more...)

arXiv.org Artificial Intelligence

arXiv: 2502.14064

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Belgium > Flanders (0.04)
North America > United States > Pennsylvania (0.04)
North America > United States > Arkansas > Pulaski County > Little Rock (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Assessing and Prioritizing Ransomware Risk Based on Historical Victim Data

Massengale, Spencer, Huff, Philip

arXiv.org Artificial Intelligence Feb-6-2025

We present an approach to identifying which ransomware adversaries are most likely to target specific entities, thereby assisting these entities in formulating better protection strategies. Ransomware poses a formidable cybersecurity threat characterized by profit-driven motives, a complex underlying economy supporting criminal syndicates, and the overt nature of its attacks. This type of malware has consistently ranked among the most prevalent, with a rapid escalation in activity observed. Recent estimates indicate that approximately two-thirds of organizations experienced ransomware attacks in 2023 [Sophos2023Ransomware]. A central tactic in ransomware campaigns is publicizing attacks to coerce victims into paying ransoms. Our study utilizes public disclosures from ransomware victims to predict the likelihood of an entity being targeted by a specific ransomware variant. We employ a Large Language Model (LLM) architecture that uses a unique chain-of-thought, multi-shot prompt methodology to define adversary SKRAM (Skills, Knowledge, Resources, Authorities, and Motivation) profiles from ransomware bulletins, threat reports, and news items. This analysis is enriched with publicly available victim data and is further enhanced by a heuristic for generating synthetic data that reflects victim profiles. Our work culminates in the development of a machine learning model that assists organizations in prioritizing ransomware threats and formulating defenses based on the tactics, techniques, and procedures (TTP) of the most likely attackers.

data mining, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

arXiv: 2502.04421

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Florida > Orange County > Orlando (0.04)
North America > United States > Arkansas > Pulaski County > Little Rock (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.82)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.69)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Improving Legal Entity Recognition Using a Hybrid Transformer Model and Semantic Filtering Approach

Rajamanickam, Duraimurugan

arXiv.org Artificial Intelligence Oct-11-2024

Legal Entity Recognition (LER) involves identifying key entities such as parties, dates, monetary amounts, and legal provisions from legal documents. Automating this process is crucial for improving efficiency in legal workflows, including contract review, compliance monitoring, and litigation support. Traditional Named Entity Recognition (NER) methods, such as rule-based systems and classical machine learning models like Conditional Random Fields (CRFs), require extensive feature engineering and struggle to adapt to new legal terminologies. Transformer-based models, particularly BERT [1], have shown great promise in various NLP tasks, including LER. Legal-BERT, a fine-tuned variant of BERT for legal texts, has demonstrated superior performance

information retrieval, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

arXiv: 2410.08521

Country:

North America > United States > Arkansas > Pulaski County > Little Rock (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)

Genre: Research Report (0.50)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.30)

Environment Scan of Generative AI Infrastructure for Clinical and Translational Science

Idnay, Betina, Xu, Zihan, Adams, William G., Adibuzzaman, Mohammad, Anderson, Nicholas R., Bahroos, Neil, Bell, Douglas S., Bumgardner, Cody, Campion, Thomas, Castro, Mario, Cimino, James J., Cohen, I. Glenn, Dorr, David, Elkin, Peter L, Fan, Jungwei W., Ferris, Todd, Foran, David J., Hanauer, David, Hogarth, Mike, Huang, Kun, Kalpathy-Cramer, Jayashree, Kandpal, Manoj, Karnik, Niranjan S., Katoch, Avnish, Lai, Albert M., Lambert, Christophe G., Li, Lang, Lindsell, Christopher, Liu, Jinze, Lu, Zhiyong, Luo, Yuan, McGarvey, Peter, Mendonca, Eneida A., Mirhaji, Parsa, Murphy, Shawn, Osborne, John D., Paschalidis, Ioannis C., Harris, Paul A., Prior, Fred, Shaheen, Nicholas J., Shara, Nawar, Sim, Ida, Tachinardi, Umberto, Waitman, Lemuel R., Wright, Rosalind J., Zai, Adrian H., Zheng, Kai, Lee, Sandra Soo-Jin, Malin, Bradley A., Natarajan, Karthik, Price, W. Nicholson II, Zhang, Rui, Zhang, Yiye, Xu, Hua, Bian, Jiang, Weng, Chunhua, Peng, Yifan

arXiv.org Artificial Intelligence Sep-27-2024

This study reports a comprehensive environmental scan of the generative AI (GenAI) infrastructure in the national network for clinical and translational science across 36 institutions supported by the Clinical and Translational Science Award (CTSA) Program led by the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) in the United States. With the rapid advancement of GenAI technologies, including large language models (LLMs), healthcare institutions face unprecedented opportunities and challenges. This research explores the current status of GenAI integration, focusing on stakeholder roles, governance structures, and ethical considerations by administering a survey among leaders of health institutions (i.e., representing academic medical centers and health systems) to assess the institutional readiness and approach towards GenAI adoption. Key findings indicate a diverse range of institutional strategies, with most organizations in the experimental phase of GenAI deployment. The study highlights significant variations in governance models, with a strong preference for centralized decision-making but notable gaps in workforce training and ethical oversight. Moreover, the results underscore the need for a more coordinated approach to GenAI governance, emphasizing collaboration among senior leaders, clinicians, information technology staff, and researchers. Our analysis also reveals concerns regarding GenAI bias, data security, and stakeholder trust, which must be addressed to ensure the ethical and effective implementation of GenAI technologies. This study offers valuable insights into the challenges and opportunities of GenAI integration in healthcare, providing a roadmap for institutions aiming to leverage GenAI for improved quality of care and operational efficiency.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

arXiv: 2410.12793

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > California > San Francisco County > San Francisco (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
(39 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
Education > Educational Setting > Higher Education (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Abstractive Text Summarization: State of the Art, Challenges, and Improvements

Shakil, Hassan, Farooq, Ahmad, Kalita, Jugal

arXiv.org Artificial Intelligence Sep-3-2024

Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, and limitations, and charts out future improvements, providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across the categorized techniques, offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.

abstractive text summarization, summarization, text summarization, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.neucom.2024.128255

arXiv: 2409.02413

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > India > NCT > New Delhi (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(13 more...)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Overview (1.00)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

MAMA-MIA: A Large-Scale Multi-Center Breast Cancer DCE-MRI Benchmark Dataset with Expert Segmentations

Garrucho, Lidia, Reidel, Claire-Anne, Kushibar, Kaisar, Joshi, Smriti, Osuala, Richard, Tsirikoglou, Apostolia, Bobowicz, Maciej, del Riego, Javier, Catanese, Alessandro, Gwoździewicz, Katarzyna, Cosaka, Maria-Laura, Abo-Elhoda, Pasant M., Tantawy, Sara W., Sakrana, Shorouq S., Shawky-Abdelfatah, Norhan O., Abdo-Salem, Amr Muhammad, Kozana, Androniki, Divjak, Eugen, Ivanac, Gordana, Nikiforaki, Katerina, Klontzas, Michail E., García-Dosdá, Rosa, Gulsun-Akpinar, Meltem, Lafcı, Oğuz, Mann, Ritse, Martín-Isla, Carlos, Prior, Fred, Marias, Kostas, Starmans, Martijn P. A., Strand, Fredrik, Díaz, Oliver, Igual, Laura, Lekadir, Karim

arXiv.org Artificial Intelligence Jun-19-2024

Current research in breast cancer Magnetic Resonance Imaging (MRI), especially with Artificial Intelligence (AI), faces challenges due to the lack of expert segmentations. To address this, we introduce the MAMA-MIA dataset, comprising 1506 multi-center dynamic contrast-enhanced MRI cases with expert segmentations of primary tumors and non-mass enhancement areas. These cases were sourced from four publicly available collections in The Cancer Imaging Archive (TCIA). Initially, we trained a deep learning model to automatically segment the cases, generating preliminary segmentations that significantly reduced expert segmentation time. Sixteen experts, averaging 9 years of experience in breast cancer, then corrected these segmentations, resulting in the final expert segmentations. Additionally, two radiologists conducted a visual inspection of the automatic segmentations to support future quality control studies. Alongside the expert segmentations, we provide 49 harmonized demographic and clinical variables and the pretrained weights of the well-known nnUNet architecture trained using the DCE-MRI full-images and expert segmentations. This dataset aims to accelerate the development and benchmarking of deep learning models and foster innovation in breast cancer diagnostics and treatment planning.

automatic segmentation, dataset, segmentation, (16 more...)

arXiv.org Artificial Intelligence

arXiv: 2406.13844

Country:

Europe > Austria > Vienna (0.14)
Europe > Greece (0.05)
Europe > Netherlands > South Holland > Rotterdam (0.04)
(13 more...)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection

Engelbach, Matthias, Klau, Dennis, Kintz, Maximilien, Ulrich, Alexander

arXiv.org Artificial Intelligence Jun-10-2024

Job descriptions are posted on many online channels, including company websites, job boards or social media platforms. These descriptions are usually published with varying text for the same job, due to the requirements of each platform or to target different audiences. However, for the purpose of automated recruitment and assistance of people working with these texts, it is helpful to aggregate job postings across platforms and thus detect duplicate descriptions that refer to the same job. In this work, we propose an approach for detecting duplicates in job descriptions. We show that combining overlap-based character similarity with text embedding and keyword matching methods leads to convincing results. In particular, we show that although no approach individually achieves satisfying performance, a combination of string comparison, deep textual embeddings, and the use of curated weighted lookup lists for specific skills leads to a significant boost in overall performance. A tool based on our approach is being used in production and feedback from real-life use confirms our evaluation.
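
The combination of string comparison, embeddings, and weighted skill lookups described above can be sketched roughly as follows. This is an illustrative assumption, not the authors' tool: the function names, the character-trigram Jaccard measure, the mixing weights in `mix`, and the expectation that embeddings arrive as precomputed vectors are all hypothetical choices.

```python
import math


def char_ngrams(text, n=3):
    """Set of character n-grams after lowercasing and whitespace normalization."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}


def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b| (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def skill_overlap(text_a, text_b, skill_weights):
    """Weighted share of skills from a curated lookup list found in both texts."""
    found_a = {s for s in skill_weights if s in text_a.lower()}
    found_b = {s for s in skill_weights if s in text_b.lower()}
    union = found_a | found_b
    if not union:
        return 0.0
    shared = sum(skill_weights[s] for s in found_a & found_b)
    return shared / sum(skill_weights[s] for s in union)


def duplicate_score(text_a, text_b, emb_a, emb_b, skill_weights,
                    mix=(0.3, 0.4, 0.3)):
    """Weighted sum of character, embedding, and skill similarity.

    The weights in mix are placeholders that would need tuning on labeled pairs.
    """
    w_char, w_emb, w_skill = mix
    return (
        w_char * jaccard(char_ngrams(text_a), char_ngrams(text_b))
        + w_emb * cosine(emb_a, emb_b)
        + w_skill * skill_overlap(text_a, text_b, skill_weights)
    )
```

A pair of postings would be flagged as a duplicate when `duplicate_score` exceeds a threshold chosen on validation data. The keys of `skill_weights` are expected in lowercase, since matching is done against the lowercased text.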

detection, duplicate, job description, (16 more...)

arXiv.org Artificial Intelligence

arXiv: 2406.06257

Country:

Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Arkansas > Pulaski County > Little Rock (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
