 Information Retrieval


A Search Engine for Scientific Publications: a Cybersecurity Case Study

arXiv.org Artificial Intelligence

Cybersecurity is a very challenging research topic nowadays, as digitalization increases the interaction of people, software and services on the Internet by means of technology devices and the networks connected to it. The field is broad and has a lot of unexplored ground across numerous disciplines such as management, psychology, and data science. Its large disciplinary spectrum and many significant research topics generate a considerable amount of information, making it hard to find what we are looking for when researching a particular subject. This work proposes a new search engine for scientific publications which combines information retrieval and reading comprehension algorithms to extract answers from a collection of domain-specific documents. The proposed solution, although applied to the context of cybersecurity, exhibited great generalization capabilities and can be easily adapted to other knowledge domains.


Leveraging Language to Learn Program Abstractions and Search Heuristics

#artificialintelligence

Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains – string editing, image composition, and abstract reasoning about scenes – even when no natural language hints are available at test time.


Brave's privacy-focused search engine is available in beta

Engadget

You can now try Brave's search engine for yourself. Brave has launched a beta Search feature both as an option in all its browsers as well as through the web for everyone else. As you'd expect, it's billed as a privacy- and transparency-oriented platform that doesn't track your activity or use "secret" algorithms to curate results. You'll eventually have the option of an ad-free version if you're willing to pay, and Brave will make Search available for other engines. The site index is independent, although Brave noted that image searches and some other features will lean on Microsoft's Bing.


Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw

arXiv.org Artificial Intelligence

We build on the unsupervised representations of speech proposed by the organizers as a baseline, derived from CPC and clustered with the k-means algorithm. We demonstrate that simple methods of refining those representations can narrow the gap, or even improve upon the solutions which use a high computational budget. The results lead to the conclusion that the CPC-derived representations are still too noisy for training language models, but stable ... In this paper we present our submission which tries to address all four tasks. We extend the baseline solution in several directions: we refine the intermediate representations, extracted with CPC, to directly improve the ABX scores. We show that such representations can be used to perform simple fuzzy lookups in a large dataset, and even extract some common patterns that serve as pseudo-words. Our approach to the semantic word similarity task is also based on pseudo-words.


Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty Oracle

arXiv.org Machine Learning

Motivated by applications in crowdsourced entity resolution in databases, signed edge prediction in social networks and correlation clustering, Mazumdar and Saha [NIPS 2017] proposed an elegant theoretical model for studying clustering with a faulty oracle. In this model, given a set of $n$ items which belong to $k$ unknown groups (or clusters), our goal is to recover the clusters by asking pairwise queries to an oracle. The oracle answers queries of the form ``do items $u$ and $v$ belong to the same cluster?''. However, the answer to each pairwise query errs with probability $\varepsilon$, for some $\varepsilon\in(0,\frac12)$. Mazumdar and Saha provided two algorithms under this model: one is query-optimal but time-inefficient (i.e., running in quasi-polynomial time), the other is time-efficient (i.e., running in polynomial time) but query-suboptimal. Larsen, Mitzenmacher and Tsourakakis [WWW 2020] then gave a new time-efficient algorithm for the special case of $2$ clusters, which is query-optimal if the bias $\delta:=1-2\varepsilon$ of the model is large. It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters and other regimes of $\delta$. In this paper, we make progress on the above question and provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime when information-theoretic recovery is possible. Our algorithm is built on a connection to the stochastic block model.
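
The query model described above can be simulated in a few lines. The following is a minimal sketch, not the algorithm from the paper: the function names (`make_faulty_oracle`, `belongs_to`, `bias`) are hypothetical, and the sketch assumes that a repeated query returns the same (possibly wrong) answer. It only shows how a pairwise oracle with error probability $\varepsilon$ behaves and how a majority vote against items already believed to share a cluster can be used to place a new item.

```python
import random


def bias(eps):
    """Bias delta = 1 - 2*eps of the noisy oracle, as defined in the abstract."""
    return 1 - 2 * eps


def make_faulty_oracle(labels, eps, seed=0):
    """Return same_cluster(u, v), which is wrong with probability eps.

    Assumption of this sketch: answers are cached, so asking the same
    pair again returns the same (possibly wrong) answer.
    """
    rng = random.Random(seed)
    cache = {}

    def same_cluster(u, v):
        key = (min(u, v), max(u, v))  # queries are symmetric in u and v
        if key not in cache:
            truth = labels[u] == labels[v]
            flip = rng.random() < eps  # the answer errs with probability eps
            cache[key] = truth != flip
        return cache[key]

    return same_cluster


def belongs_to(u, anchors, oracle):
    """Majority vote: does u share a cluster with the anchor items?

    `anchors` is a list of items assumed to lie in one common cluster.
    """
    votes = sum(oracle(u, a) for a in anchors)
    return 2 * votes > len(anchors)


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]  # hidden ground truth, toy data
    oracle = make_faulty_oracle(labels, eps=0.2, seed=42)
    print(belongs_to(2, [0, 1], oracle))
```

With a larger anchor set the majority vote becomes more reliable, which is why the number of queries needed grows as the bias $\delta$ shrinks.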


How to Extract Relevant Keywords with KeyBERT

#artificialintelligence

There are many powerful techniques that perform keyword extraction (e.g. ...). However, they are mainly based on the statistical properties of the text and don't necessarily take into account the semantic aspects of the full document. KeyBERT is a minimal and easy-to-use keyword extraction technique that aims at solving this issue. It leverages the BERT language model and relies on the transformers library. So go check the KeyBERT repo (and clone it) if you're interested in using it.


Query Embedding on Hyper-relational Knowledge Graphs

arXiv.org Artificial Intelligence

Multi-hop logical reasoning is an established problem in the field of representation learning on knowledge graphs (KGs). It subsumes both one-hop link prediction and other, more complex types of logical queries. Existing algorithms operate only on classical, triple-based graphs, whereas modern KGs often employ a hyper-relational modeling paradigm. In this paradigm, typed edges may have several key-value pairs known as qualifiers that provide fine-grained context for facts. In queries, this context modifies the meaning of relations, and usually reduces the answer set. Hyper-relational queries are often observed in real-world KG applications, and existing approaches for approximate query answering cannot make use of qualifier pairs. In this work, we bridge this gap and extend the multi-hop reasoning problem to hyper-relational KGs, allowing us to tackle this new type of complex query. Building upon recent advancements in Graph Neural Networks and query embedding techniques, we study how to embed and answer hyper-relational conjunctive queries. Besides that, we propose a method to answer such queries and demonstrate in our experiments that qualifiers improve query answering on a diverse set of query patterns.
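
To make the role of qualifiers concrete, here is a small sketch of a hyper-relational fact store with exact symbolic matching. It is purely illustrative and is not the embedding-based method the abstract refers to: the `Fact` class, the `answer` function, the relation and qualifier names, and the toy facts are all assumptions made for this example. It shows only the point made above, that adding a qualifier to a query usually reduces the answer set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A typed edge (head, relation, tail) with optional qualifier pairs."""
    head: str
    relation: str
    tail: str
    qualifiers: tuple = ()  # tuple of (key, value) pairs


def answer(facts, head, relation, qualifiers=None):
    """Return all tails matching a one-hop query.

    A fact matches when head and relation agree and every requested
    qualifier pair is present on the fact.
    """
    wanted = set((qualifiers or {}).items())
    return {
        f.tail
        for f in facts
        if f.head == head
        and f.relation == relation
        and wanted <= set(f.qualifiers)
    }


# Toy data for illustration only.
FACTS = [
    Fact("Albert Einstein", "educated_at", "ETH Zurich",
         (("degree", "Bachelor"),)),
    Fact("Albert Einstein", "educated_at", "University of Zurich",
         (("degree", "Doctorate"),)),
]

if __name__ == "__main__":
    print(answer(FACTS, "Albert Einstein", "educated_at"))
    print(answer(FACTS, "Albert Einstein", "educated_at",
                 {"degree": "Doctorate"}))
```

Without qualifiers the query returns both institutions; with the `degree` qualifier it returns only one.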


This High Schooler Created a Drug Discovery Search Engine

#artificialintelligence

Between his mom's place in Manhattan, his dad in Queens, and his high school in the Bronx, Noah Getz is on the subway a lot. It gives him time to read and to think. Our first coronavirus summer was waning, and he'd been wrestling with a weighty science problem: using machine learning to hunt down tiny molecules that may help treat Alzheimer's. Thus far, his AI had been spitting out results that were "almost comically bad." The problem was that the algorithms Getz was using did their best when they had massive amounts of data to sift through and discover patterns in. Getz' data set was far smaller; he was working with one lab at Mount Sinai, not a multinational pharmaceutical company with a galaxy-sized drug library.


Bilateral Personalized Dialogue Generation with Dynamic Persona-Aware Fusion

arXiv.org Artificial Intelligence

Generating personalized responses is one of the major challenges in natural human-robot interaction. Current research in this field mainly focuses on generating responses consistent with the robot's pre-assigned persona, while ignoring the user's persona. Such responses may be inappropriate or even offensive, which may lead to a bad user experience. Therefore, we propose a bilateral personalized dialogue generation (BPDG) method with dynamic persona-aware fusion via multi-task transfer learning to generate responses consistent with both personas. The proposed method aims to accomplish three learning tasks: 1) an encoder is trained with dialogue utterances augmented with the corresponding personalized attributes and relative position (language model task), 2) a dynamic persona-aware fusion module predicts the persona presence to adaptively fuse the contextual and bilateral persona encodings (persona prediction task) and 3) a decoder generates natural, fluent and personalized responses (dialogue generation task). To make the generated responses more personalized and bilateral persona-consistent, the Conditional Mutual Information Maximum (CMIM) criterion is adopted to select the final response from the generated candidates. The experimental results show that the proposed method outperforms several state-of-the-art methods in terms of both automatic and manual evaluations.


Goal-Aware Neural SAT Solver

arXiv.org Artificial Intelligence

Modern neural networks obtain information about the problem and calculate the output solely from the input values. We argue that this is not always optimal, and that the network's performance can be significantly improved by augmenting it with a query mechanism that allows the network to make several solution trials at run time and get feedback on the loss value of each trial. To demonstrate the capabilities of the query mechanism, we formulate an unsupervised (not dependent on labels) loss function for the Boolean Satisfiability Problem (SAT) and theoretically show that it allows the network to extract rich information about the problem. We then propose a neural SAT solver with a query mechanism called QuerySAT and show that it outperforms the neural baseline on a wide range of SAT tasks and the classical baselines on SHA-1 preimage attack and 3-SAT tasks.
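
As an illustration of what a label-free loss for SAT can look like, the sketch below uses one common differentiable relaxation: each variable takes a value in [0, 1], a clause is scored by the probability that at least one of its literals is true, and the loss is the negative log of the product of the clause scores. This is an assumption made for the example and is not claimed to be the exact loss used by QuerySAT; the function names and the clause encoding (signed integers, DIMACS style) are likewise hypothetical choices.

```python
import math


def clause_value(clause, x):
    """Relaxed truth value of one clause.

    clause: list of non-zero ints; k means variable k, -k means its negation.
    x: dict mapping variable index to a value in [0, 1].
    """
    unsat = 1.0  # relaxed probability that every literal is false
    for lit in clause:
        p = x[abs(lit)]
        unsat *= (1.0 - p) if lit > 0 else p
    return 1.0 - unsat


def sat_loss(clauses, x, tiny=1e-12):
    """Negative log-likelihood that all clauses are satisfied.

    Values are clamped at `tiny` to avoid log(0) on violated clauses.
    """
    return -sum(math.log(max(clause_value(c, x), tiny)) for c in clauses)


if __name__ == "__main__":
    # (x1 OR NOT x2) AND (x2 OR x3)
    clauses = [[1, -2], [2, 3]]
    print(sat_loss(clauses, {1: 0.5, 2: 0.5, 3: 0.5}))
```

The loss needs no labelled solutions: it is zero exactly when the assignment satisfies every clause and grows as clauses become violated, so it can be evaluated on any trial assignment the network proposes.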