Information Extraction
Zero-shot Slot Filling with DPR and RAG
Glass, Michael, Rossiello, Gaetano, Gliozzo, Alfio
The ability to automatically extract Knowledge Graphs (KG) from a given collection of documents is a long-standing problem in Artificial Intelligence. One way to assess this capability is through the task of slot filling. Given an entity query in the form of [Entity, Slot, ?], a system is asked to 'fill' the slot by generating or extracting the missing value from a relevant passage or passages. This capability is crucial for building systems for automatic knowledge base population, which is in ever-increasing demand, especially in enterprise applications. Recently, there has been a promising direction in evaluating language models in the same way we would evaluate knowledge bases, and the task of slot filling is the most suitable for this purpose. Recent advancements in the field try to solve this task in an end-to-end fashion using retrieval-based language models. Models like Retrieval Augmented Generation (RAG) show surprisingly good performance without involving complex information extraction pipelines. However, the results achieved by these models on the two slot filling tasks in the KILT benchmark are still not at the level required by real-world information extraction systems. In this paper, we describe several strategies we adopted to improve the retriever and the generator of RAG in order to make it a better slot filler. Our KGI0 system (available at https://github.com/IBM/retrieve-write-slot-filling) reached the top-1 position on the KILT leaderboard on both the T-REx and zsRE datasets by a large margin.
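To make the [Entity, Slot, ?] query format concrete, here is a minimal Python sketch. It is not the KGI0 system: the names `SlotQuery`, `tokenize`, and `rank_passages` are hypothetical, the `[SEP]` separator is an assumption about how such a query might be serialized as text, and a plain token-overlap score stands in for a trained dense retriever such as DPR. The passages are made-up examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotQuery:
    """A slot filling query [Entity, Slot, ?] (hypothetical helper)."""
    entity: str
    slot: str

    def as_text(self, sep: str = " [SEP] ") -> str:
        # Assumed serialization: entity and slot joined by a separator.
        return f"{self.entity}{sep}{self.slot}"


def tokenize(text: str) -> set:
    # Very rough tokenizer: lowercase, drop commas and periods, split on spaces.
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def rank_passages(query: SlotQuery, passages: list) -> list:
    # Stand-in retriever: sort passages by how many query tokens they share.
    # A real system would use learned dense embeddings instead.
    query_tokens = tokenize(f"{query.entity} {query.slot}")
    return sorted(
        passages,
        key=lambda passage: len(query_tokens & tokenize(passage)),
        reverse=True,
    )


query = SlotQuery(entity="Albert Einstein", slot="place of birth")
passages = [
    "Marie Curie was born in Warsaw.",
    "Albert Einstein was born in Ulm. His place of birth is in Germany.",
]
best_passage = rank_passages(query, passages)[0]
```

In a retrieve-then-generate setup, the top-ranked passage would then be handed to a generator or extractor that produces the missing slot value; that step is omitted here.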
Analyzing COVID Medical Papers with Azure and Text Analytics for Health
The idea of applying NLP methods to scientific literature seems quite natural. First of all, scientific texts are already well-structured: they contain keywords, an abstract, and well-defined terms. Thus, at the very beginning of the COVID pandemic, a research challenge was launched on Kaggle to analyze scientific papers on the subject. The dataset behind this competition is called CORD (publication), and it contains a constantly updated corpus of everything that is published on topics related to COVID. Currently, it contains more than 400,000 scientific papers, about half of them with full text.
Ireland's data privacy agency opens investigation into Facebook data leak
Ireland's Data Protection Commission (DPC) is investigating the recent leak of a Facebook user dataset that dates back to 2019. At the start of April, it came out that someone on a hacking forum had made the dataset public, exposing the personal information of about 533 million Facebook users in 106 countries. Depending on the account, there are details about phone numbers, birth dates, email addresses, locations and more. The source of the leak is an oversight Facebook fixed in August 2019. "The DPC, having considered the information provided by Facebook Ireland regarding this matter to date, is of the opinion that one or more provisions of the GDPR and/or the Data Protection Act 2018 may have been, and/or are being, infringed in relation to Facebook Users' personal data," the agency said in a statement spotted by TechCrunch.
The MuSe 2021 Multimodal Sentiment Analysis Challenge: Sentiment, Emotion, Physiological-Emotion, and Stress
Stappen, Lukas, Baird, Alice, Christ, Lukas, Schumann, Lea, Sertolli, Benjamin, Messner, Eva-Maria, Cambria, Erik, Zhao, Guoying, Schuller, Björn W.
Multimodal Sentiment Analysis (MuSe) 2021 is a challenge focusing on the tasks of sentiment and emotion, as well as physiological-emotion and emotion-based stress recognition through more comprehensively integrating the audio-visual, language, and biological signal modalities. The purpose of MuSe 2021 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), the sentiment analysis community (symbol-based), and the health informatics community. We present four distinct sub-challenges: MuSe-Wilder and MuSe-Stress, which focus on continuous emotion (valence and arousal) prediction; MuSe-Sent, in which participants recognise five classes each for valence and arousal; and MuSe-Physio, in which the novel aspect of 'physiological-emotion' is to be predicted. For this year's challenge, we utilise the MuSe-CaR dataset focusing on user-generated reviews and introduce the Ulm-TSST dataset, which displays people in stressful depositions. This paper also provides detail on the state-of-the-art feature sets extracted from these datasets for utilisation by our baseline model, a Long Short-Term Memory-Recurrent Neural Network. For each sub-challenge, a competitive baseline for participants is set; namely, on test, we report a Concordance Correlation Coefficient (CCC) of .4616 for MuSe-Wilder, .4717 for MuSe-Stress, and .4606 for MuSe-Physio. For MuSe-Sent, an F1 score of 32.82 % is obtained.
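For readers unfamiliar with the Concordance Correlation Coefficient reported above, the following is a minimal Python sketch of the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population (biased) variance and covariance. The function name and the sample values are illustrative assumptions; the challenge's own evaluation code may differ in details such as how sequences are concatenated before scoring.

```python
from statistics import fmean


def concordance_correlation_coefficient(y_true, y_pred):
    """CCC between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = fmean(y_true)
    mean_pred = fmean(y_pred)

    # Population variance and covariance (divide by n, not n - 1).
    var_true = fmean([(t - mean_true) ** 2 for t in y_true])
    var_pred = fmean([(p - mean_pred) ** 2 for p in y_pred])
    covariance = fmean(
        [(t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred)]
    )

    denominator = var_true + var_pred + (mean_true - mean_pred) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * covariance / denominator


# Illustrative values only: a prediction that tracks the target perfectly
# but is shifted by a constant offset is penalised, unlike with Pearson's r.
shifted = concordance_correlation_coefficient([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

The offset penalty is the reason CCC is commonly preferred over plain correlation for continuous valence and arousal prediction: it rewards agreement in both trend and scale.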
Facebook will not notify the half a billion users caught up in its huge data leak, it says
Facebook will not notify the more than half a billion people caught up in a huge leak of personal information, it has said. Over the weekend, it emerged that a vast trove of data on more than 530 million users, containing information including their phone numbers and dates of birth, was being made freely available online. Facebook said that the data was gathered before 2019. It later said that "malicious actors" had obtained the data prior to September 2019 by "scraping" profiles using a vulnerability in the platform's tool for synching contacts, and that the loophole that allowed them to do so had now been closed. But it said that it did not inform users when the leak happened, and does not have plans to do so now.
Facebook Data Breach: How To Check If You're Part Of The Leak, Preventive Measures To Take
Cybersecurity experts revealed a few days ago that the personal information of over half a billion Facebook users has been leaked. It's a gold mine of data, which includes users' full names, birthdays, locations and phone numbers. Although Facebook claims that the actual hack happened a couple of years ago, it won't hurt for users to make sure their account is not part of the breach and, if it is, to take a few preventive measures to ensure future incidents as messy as this one won't affect them. Australian security researcher and Have I Been Pwned founder Troy Hunt recently added the 533 million phone numbers exposed in the Facebook data leak to his website. Those worried that their mobile numbers were part of the leak can visit the site and check if their number is there.
What you need to know about the Facebook data leak
The news: The personal data of 533 million Facebook users in more than 106 countries was found to be freely available online last weekend. The data trove, uncovered by security researcher Alon Gal, includes phone numbers, email addresses, home towns, full names, and birth dates. Initially, Facebook claimed that the data leak was previously reported on in 2019 and that it had patched the vulnerability that caused it that August. But in fact, it appears that Facebook did not properly disclose the breach at the time. It only finally acknowledged it on Tuesday April 6 in a blog post by product management director Mike Clark.
Table Detection, Information Extraction and Structuring using Deep Learning
The amount of data being collected is increasing drastically day by day, with many applications, tools, and online platforms booming in the present technological era. To handle and access this enormous volume of data productively, it's necessary to develop useful information extraction tools. One of the sub-areas demanding attention in the Information Extraction field is fetching and accessing data from tabular forms. To illustrate, imagine you have lots of paperwork and documents containing tables, and you would like to work with the data in them. Conventionally, you can copy them manually (onto paper) or load them into Excel sheets. With table extraction, however, as soon as you send tables as pictures to the computer, it extracts all the information and stacks it into a neat document. This saves ample time and is less error-prone. As discussed in the previous section, tables are used frequently to represent data in a clean format. We see them across many areas, from organizing our work by structuring data in tables to storing the vast assets of companies.
Huge Facebook leak that contains information about 500 million people came from abuse of contacts tool, company says
Facebook says that a vast trove of personal information, uploaded freely to the internet, was harvested as part of a feature gone wrong. The data was not stolen in a hack but instead through malicious users of its "contact importer", it said. Though that feature was intended to allow people to upload their contacts from their phone to Facebook, and find people they might know, malicious actors were able to use it to scrape the personal information of people who were already on the platform. That happened before September 2019, Facebook said in a blog post, and the bug that made it possible has now been fixed. But over the weekend it became clear that the data had become publicly available online, vastly increasing the risk that anyone involved in it might face. That includes 535 million accounts, which belong to people including chief executive Mark Zuckerberg.
What Really Caused Facebook's 500M-User Data Leak?
Since Saturday, a massive trove of Facebook data has circulated publicly, splashing information from roughly 533 million Facebook users across the internet. The data includes things like profile names, Facebook ID numbers, email addresses, and phone numbers. It's all the kind of information that may already have been leaked or scraped from some other source, but it's yet another resource that links all that data together--and ties it to each victim--presenting tidy profiles to scammers, phishers, and spammers on a silver platter. Facebook's initial response was simply that the data was previously reported on in 2019 and that the company patched the underlying vulnerability in August of that year. But a closer look at where, exactly, this data comes from produces a much murkier picture.