Information Retrieval
BIM-GPT: a Prompt-Based Virtual Assistant Framework for BIM Information Retrieval
Zheng, Junwen, Fischer, Martin
Efficient information retrieval (IR) from building information models (BIMs) poses significant challenges due to the necessity for deep BIM knowledge or extensive engineering efforts for automation. We introduce BIM-GPT, a prompt-based virtual assistant (VA) framework integrating BIM and generative pre-trained transformer (GPT) technologies to support NL-based IR. A prompt manager and dynamic template generate prompts for GPT models, enabling interpretation of NL queries, summarization of retrieved information, and answering BIM-related questions. In tests on a BIM IR dataset, our approach achieved 83.5% and 99.5% accuracy rates for classifying NL queries with no data and 2% data incorporated in prompts, respectively. Additionally, we validated the functionality of BIM-GPT through a VA prototype for a hospital building. This research contributes to the development of effective and versatile VAs for BIM IR in the construction industry, significantly enhancing BIM accessibility and reducing engineering efforts and training data requirements for processing NL queries.
Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models
Liu, Jiawei, Kang, Yangyang, Tang, Di, Song, Kaisong, Sun, Changlong, Wang, Xiaofeng, Lu, Wei, Liu, Xiaozhong
Neural text ranking models have witnessed significant advancement and are increasingly being deployed in practice. Unfortunately, they also inherit adversarial vulnerabilities of general neural models, which have been detected but remain underexplored by prior studies. Moreover, the inherit adversarial vulnerabilities might be leveraged by blackhat SEO to defeat better-protected search engines. In this study, we propose an imitation adversarial attack on black-box neural passage ranking models. We first show that the target passage ranking model can be transparentized and imitated by enumerating critical queries/candidates and then train a ranking imitation model. Leveraging the ranking imitation model, we can elaborately manipulate the ranking results and transfer the manipulation attack to the target ranking model. For this purpose, we propose an innovative gradient-based attack method, empowered by the pairwise objective function, to generate adversarial triggers, which causes premeditated disorderliness with very few tokens. To equip the trigger camouflages, we add the next sentence prediction loss and the language model fluency constraint to the objective function. Experimental results on passage ranking demonstrate the effectiveness of the ranking imitation attack model and adversarial triggers against various SOTA neural ranking models. Furthermore, various mitigation analyses and human evaluation show the effectiveness of camouflages when facing potential mitigation approaches. To motivate other scholars to further investigate this novel and important problem, we make the experiment data and code publicly available.
PMC-Patients: A Large-scale Dataset of Patient Summaries and Relations for Benchmarking Retrieval-based Clinical Decision Support Systems
Zhao, Zhengyun, Jin, Qiao, Chen, Fangyuan, Peng, Tuorui, Yu, Sheng
Objective: Retrieval-based Clinical Decision Support (ReCDS) can aid clinical workflow by providing relevant literature and similar patients for a given patient. However, the development of ReCDS systems has been severely obstructed by the lack of diverse patient collections and publicly available large-scale patient-level annotation datasets. In this paper, we aim to define and benchmark two ReCDS tasks: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR) using a novel dataset called PMC-Patients. Methods: We extract patient summaries from PubMed Central articles using simple heuristics and utilize the PubMed citation graph to define patient-article relevance and patient-patient similarity. We also implement and evaluate several ReCDS systems on the PMC-Patients benchmarks, including sparse retrievers, dense retrievers, and nearest neighbor retrievers. We conduct several case studies to show the clinical utility of PMC-Patients. Results: PMC-Patients contains 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations, which is the largest-scale resource for ReCDS and also one of the largest patient collections. Human evaluation and analysis show that PMC-Patients is a diverse dataset with high-quality annotations. The evaluation of various ReCDS systems shows that the PMC-Patients benchmark is challenging and calls for further research. Conclusion: We present PMC-Patients, a large-scale, diverse, and publicly available patient summary dataset with the largest-scale patient-level relation annotations. Based on PMC-Patients, we formally define two benchmark tasks for ReCDS systems and evaluate various existing retrieval methods. PMC-Patients can largely facilitate methodology research on ReCDS systems and shows real-world clinical utility.
METAM: Goal-Oriented Data Discovery
Galhotra, Sainyam, Gong, Yue, Fernandez, Raul Castro
Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, thus under exploiting data. In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of the: i) data, ii) utility function, and iii) solution set size. We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks. All in all, we demonstrate the promise of goal-oriented data discovery to modern data science applications.
Google Devising Radical Search Changes to Beat Back AI Rivals - The New York Times
The new features, under the project name Magi, are being created by designers, engineers and executives working in so-called sprint rooms to tweak and test the latest versions. The new search engine would offer users a far more personalized experience than the company's current service, attempting to anticipate users' needs. Lara Levin, a Google spokeswoman, said in a statement that "not every brainstorm deck or product idea leads to a launch, but as we've said before, we're excited about bringing new A.I.-powered features to search, and will share more details soon." Billions of people use Google's search engine every day for everything from finding restaurants and directions to understanding a medical diagnosis, and that simple white page with the company logo and an empty bar in the middle is one of the most widely used web pages in the world. Changes to it would have a significant impact on the lives of ordinary people, and until recently it was hard to imagine anything challenging it.
Google reportedly working on a new AI-powered search engine - Gizmochina
As the Artificial Intelligence race heats up, Google is feeling the pressure, and the company is now reportedly working on a new AI-powered search engine. According to the latest reports, the Mountain View-based technology giant is in the process of creating a new AI-powered search engine as well as updating technology in the existing search platform. Internal documents indicate that the company has a project named Magi which is aimed at updating the existing search engine with the new technology and about 160 employees are currently working on it. The new features are being created by designers, engineers, and executives working in sprint rooms to tweak and test the new versions. After the changes, the search engine would offer a more personalized experience than the company's current service.
Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org
Binette, Olivier, York, Sokhna A, Hickerson, Emma, Baek, Youngsoo, Madhavan, Sarvo, Jones, Christina
This paper introduces a novel evaluation methodology for entity resolution algorithms. It is motivated by PatentsView.org, a U.S. Patents and Trademarks Office patent data exploration tool that disambiguates patent inventors using an entity resolution algorithm. We provide a data collection methodology and tailored performance estimators that account for sampling biases. Our approach is simple, practical and principled -- key characteristics that allow us to paint the first representative picture of PatentsView's disambiguation performance. This approach is used to inform PatentsView's users of the reliability of the data and to allow the comparison of competing disambiguation algorithms.
Google is reportedly developing a new AI-powered search engine
Facing renewed competition from Microsoft and OpenAI, Google is reportedly "racing" to build an "all-new" AI-powered search engine. According to The New York Times, the company is in the early stages of creating a search service that will attempt to anticipate what you want from it in hopes of offering "a far more personalized experience." The project has "no clear timetable." However, knowing that Google is also developing a suite of new AI features for its existing search engine under the codename "Magi." Among the features Google is developing is a chatbot that can answer software engineering questions and generate code snippets.
Learning To Rank Resources with GNN
Ergashev, Ulugbek, Dragut, Eduard C., Meng, Weiyi
As the content on the Internet continues to grow, many new dynamically changing and heterogeneous sources of data constantly emerge. A conventional search engine cannot crawl and index at the same pace as the expansion of the Internet. Moreover, a large portion of the data on the Internet is not accessible to traditional search engines. Distributed Information Retrieval (DIR) is a viable solution to this as it integrates multiple shards (resources) and provides a unified access to them. Resource selection is a key component of DIR systems. There is a rich body of literature on resource selection approaches for DIR. A key limitation of the existing approaches is that they primarily use term-based statistical features and do not generally model resource-query and resource-resource relationships. In this paper, we propose a graph neural network (GNN) based approach to learning-to-rank that is capable of modeling resource-query and resource-resource relationships. Specifically, we utilize a pre-trained language model (PTLM) to obtain semantic information from queries and resources. Then, we explicitly build a heterogeneous graph to preserve structural information of query-resource relationships and employ GNN to extract structural information. In addition, the heterogeneous graph is enriched with resource-resource type of edges to further enhance the ranking accuracy. Extensive experiments on benchmark datasets show that our proposed approach is highly effective in resource selection. Our method outperforms the state-of-the-art by 6.4% to 42% on various performance metrics.
VaxxHesitancy: A Dataset for Studying Hesitancy towards COVID-19 Vaccination on Twitter
Mu, Yida, Jin, Mali, Grimshaw, Charlie, Scarton, Carolina, Bontcheva, Kalina, Song, Xingyi
Vaccine hesitancy has been a common concern, probably since vaccines were created and, with the popularisation of social media, people started to express their concerns about vaccines online alongside those posting pro- and anti-vaccine content. Predictably, since the first mentions of a COVID-19 vaccine, social media users posted about their fears and concerns or about their support and belief into the effectiveness of these rapidly developing vaccines. Identifying and understanding the reasons behind public hesitancy towards COVID-19 vaccines is important for policy markers that need to develop actions to better inform the population with the aim of increasing vaccine take-up. In the case of COVID-19, where the fast development of the vaccines was mirrored closely by growth in anti-vaxx disinformation, automatic means of detecting citizen attitudes towards vaccination became necessary. This is an important computational social sciences task that requires data analysis in order to gain in-depth understanding of the phenomena at hand. Annotated data is also necessary for training data-driven models for more nuanced analysis of attitudes towards vaccination. To this end, we created a new collection of over 3,101 tweets annotated with users' attitudes towards COVID-19 vaccination (stance). Besides, we also develop a domain-specific language model (VaxxBERT) that achieves the best predictive performance (73.0 accuracy and 69.3 F1-score) as compared to a robust set of baselines. To the best of our knowledge, these are the first dataset and model that model vaccine hesitancy as a category distinct from pro- and anti-vaccine stance.