Information Retrieval
Barzilai and Borwein conjugate gradient method equipped with a non-monotone line search technique and its application on non-negative matrix factorization
Hafshejani, Sajad Fathi, Gaur, Daya, Hossain, Shahadat, Benkoczi, Robert
In this paper, we propose a new non-monotone conjugate gradient method for solving unconstrained nonlinear optimization problems. We first modify the non-monotone line search method by introducing a new trigonometric function to calculate the non-monotone parameter, which plays an essential role in the algorithm's efficiency. Then, we apply a convex combination of the Barzilai-Borwein method for calculating the value of step size in each iteration. Under some suitable assumptions, we prove that the new algorithm has the global convergence property. The efficiency and effectiveness of the proposed method are determined in practice by applying the algorithm to some standard test problems and non-negative matrix factorization problems.
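The step-size idea in the abstract above can be illustrated with a minimal sketch. It combines the two classical Barzilai-Borwein step sizes with a fixed weight and uses the result in plain gradient descent on a small quadratic. The weight `theta`, the fallback value, and the helper names `bb_step` and `minimize_quadratic` are assumptions made for illustration; the paper's own combination rule, its trigonometric non-monotone parameter, and its conjugate gradient direction are not given in the abstract and are not reproduced here.

```python
import numpy as np

def bb_step(s, y, theta=0.5, fallback=1.0):
    """Convex combination of the two classical Barzilai-Borwein step sizes.

    s = x_k - x_{k-1}, y = g_k - g_{k-1}.
    theta in [0, 1] is an assumed fixed weight (hypothetical choice).
    """
    sy = float(s @ y)
    if sy <= 0.0:
        # Curvature condition fails; fall back to a default step.
        return fallback
    bb1 = float(s @ s) / sy   # "long" BB step
    bb2 = sy / float(y @ y)   # "short" BB step
    return theta * bb1 + (1.0 - theta) * bb2

def minimize_quadratic(A, b, x0, iters=200, tol=1e-10):
    """Gradient descent with BB steps on f(x) = 0.5 x^T A x - b^T x (A SPD)."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b
    alpha = 1.0 / np.linalg.norm(A, 2)  # conservative first step
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - alpha * g
        g_new = A @ x_new - b
        alpha = bb_step(x_new - x, g_new - g)
        x, g = x_new, g_new
    return x
```

No line search is used in this sketch, so the objective value may rise on some iterations; a non-monotone line search of the kind the abstract describes is designed to tolerate such rises while still guaranteeing convergence.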
Sequential Modelling with Applications to Music Recommendation, Fact-Checking, and Speed Reading
Sequential modelling entails making sense of sequential data, which naturally occurs in a wide array of domains. One example is systems that interact with users, log user actions and behaviour, and make recommendations of items of potential interest to users on the basis of their previous interactions. In such cases, the sequential order of user interactions is often indicative of what the user is interested in next. Similarly, for systems that automatically infer the semantics of text, capturing the sequential order of words in a sentence is essential, as even a slight re-ordering could significantly alter its original meaning. This thesis makes methodological contributions and new investigations of sequential modelling for the specific application areas of systems that recommend music tracks to listeners and systems that process text semantics in order to automatically fact-check claims, or "speed read" text for efficient further classification.
Snapchat's Scan feature can identify dogs, plants, clothes, and more
Snapchat's camera has to date mostly been associated with sending disappearing messages and goofy AR effects, like a virtual dancing hot dog. But what if it did things for you, like suggest ways to make your videos look and sound better? Or show you a similar shirt based on the one you're looking at? Starting Thursday, a feature called Scan is being upgraded and placed front and center in the app's camera, letting it identify a range of things in the real world, like clothes or dog breeds. Scan's prominent placement in Snapchat means that the company is slowly becoming not just a messaging app, but a visual search engine.
Practical Entity Resolution on AWS to Reconcile Data in the Real World
This post was co-written with Mamoon Chowdry, Solutions Architect, previously at AWS. Businesses and organizations from many industries often struggle to ensure that their data is accurate. Data often has to match people or things exactly in the real world, such as a customer name, an address, or a company. Matching our data is important to validate it, de-duplicate it, or link records in different systems together. Know Your Customer (KYC) regulations also mean that we must be confident in who or what our data is referring to. We must match millions of records from different data sources.
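As a rough illustration of the de-duplication step mentioned above, the sketch below groups records whose names are identical after simple normalization. The functions `normalize` and `group_duplicates` and the record layout are hypothetical; this is not the approach or any AWS service described in the post, and real entity resolution usually also needs fuzzy matching.

```python
import re
from collections import defaultdict

def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(name.split())

def group_duplicates(records):
    """records: iterable of (record_id, name). Returns lists of ids sharing a key."""
    groups = defaultdict(list)
    for rec_id, name in records:
        groups[normalize(name)].append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

For example, the names `"ACME Corp."` and `"acme  corp"` both normalize to `acme corp` and would land in the same group.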
Top 30 NLP Use Cases: Comprehensive Guide for 2021
Natural language processing (NLP) is a subfield of AI and linguistics that enables computers to understand, interpret, and manipulate human language. Although NLP faces many challenges due to the complexity of human language, these have not hindered its growth. The global NLP market was estimated at $5B in 2018 and is expected to reach $43B by 2025, and this exponential growth can mostly be attributed to the vast range of NLP use cases in every industry today. You may already be familiar with many NLP applications such as autocorrection, translation, or chatbots. However, NLP is also the cornerstone of numerous applications we use every day without even noticing.
Biomedical Question Answering: A Survey of Approaches and Challenges
Jin, Qiao, Yuan, Zheng, Xiong, Guangzhi, Yu, Qianlan, Ying, Huaiyuan, Tan, Chuanqi, Chen, Mosha, Huang, Songfang, Liu, Xiaozhong, Yu, Sheng
Professionals as well as the general public need effective assistance to access, understand and consume complex biomedical concepts. For example, doctors always want to be aware of up-to-date clinical evidence for the diagnosis and treatment of diseases under the scheme of Evidence-based Medicine [165], and the general public is becoming increasingly interested in learning about their own health conditions on the Internet [54]. Traditionally, Information Retrieval (IR) systems, such as PubMed, have been used to meet such information needs. However, classical IR is still not efficient enough [71, 77, 99, 164]. For instance, Russell-Rose and Chamberlain [164] reported that it requires 4 expert hours to answer complex medical queries using search engines. Compared with retrieval systems, which typically return a list of relevant documents for the users to read, Question Answering (QA) systems that provide direct answers to users' questions are more straightforward and intuitive. In general, QA itself is a challenging benchmark Natural Language Processing (NLP) task for evaluating the abilities of intelligent systems to understand a question, retrieve and utilize relevant materials, and generate an answer. With the rapid development of computing hardware, modern QA models, especially those based on deep learning [30, 31, 42, 146, 171], achieve comparable or even better performance than humans on many benchmark datasets [67, 83, 154, 155, 215] and have been successfully adopted in general domain search engines and conversational assistants [150, 236]. The Text REtrieval Conference (TREC) QA Track triggered modern QA research [197], at a time when QA models were mostly based on IR.
An N-gram based approach to auto-extracting topics from research articles
Zhu, Linkai, Huang, Maoyi, Chen, Maomao, Wang, Wennan
A lot of manual work goes into identifying a topic for an article. With a large volume of articles, the manual process can be exhausting. Our approach aims to address this issue by automatically extracting topics from the text of large numbers of articles. This approach takes into account the efficiency of the process. Based on existing N-gram analysis, our research examines how often certain words appear in documents in order to support automatic topic extraction. In order to improve efficiency, we apply custom filtering standards to our research. Additionally, we delete as many noncritical or irrelevant phrases as possible. In this way, we can ensure we are selecting unique keyphrases for each article, which capture its core idea. For our research, we chose to center on the autonomous vehicle domain, since the research is relevant to our daily lives. Because most of the research papers are available only in PDF format, we have to convert them into editable file types such as TXT. To test our proposed automated approach, numerous articles on robotics have been selected. Next, we evaluate our approach by comparing the result with that obtained manually.
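A minimal sketch of frequency-based N-gram keyphrase extraction with filtering is shown below. The stopword list, the rule of discarding N-grams that start or end with a stopword, and the function names are assumptions chosen for illustration; the abstract does not specify the authors' custom filtering standards.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the authors' filter).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for",
             "is", "on", "with", "that", "this", "are"}

def ngrams(tokens, n):
    """All contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_keyphrases(text, n=2, k=3):
    """Return the k most frequent n-grams that do not start or end with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        g for g in ngrams(tokens, n)
        if g[0] not in STOPWORDS and g[-1] not in STOPWORDS
    )
    return [" ".join(g) for g, _ in counts.most_common(k)]
```

Applied to text already converted from PDF to TXT, `top_keyphrases(text, n=2, k=5)` would return the five most frequent bigrams that survive the filter.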
Representation Learning for Efficient and Effective Similarity Search and Recommendation
How data is represented and operationalized is critical for building computational solutions that are both effective and efficient. A common approach is to represent data objects as binary vectors, denoted hash codes, which require little storage and enable efficient similarity search through direct indexing into a hash table or through similarity computations in an appropriate space. Due to the limited expressibility of hash codes, compared to real-valued representations, a core open challenge is how to generate hash codes that well capture semantic content or latent properties using a small number of bits, while ensuring that the hash codes are distributed in a way that does not reduce their search efficiency. State-of-the-art methods use representation learning for generating such hash codes, focusing on neural autoencoder architectures where semantics are encoded into the hash codes by learning to reconstruct the original inputs of the hash codes. This thesis addresses the above challenge and makes a number of contributions to representation learning that (i) improve effectiveness of hash codes through more expressive representations and a more effective similarity measure than the current state of the art, namely the Hamming distance, and (ii) improve efficiency of hash codes by learning representations that are especially suited to the choice of search method. The contributions are empirically validated on several tasks related to similarity search and recommendation.
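The two search modes mentioned above, direct hash-table indexing and Hamming-distance comparison, can be sketched as follows. Hash codes are stored as Python integers; the function names and the restriction to a search radius of at most one bit are assumptions for illustration, and the sketch does not cover how the codes are learned.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")

def build_table(codes):
    """codes: dict item_id -> integer hash code. Returns code -> [item_id, ...]."""
    table = defaultdict(list)
    for item_id, code in codes.items():
        table[code].append(item_id)
    return table

def search(table, query, n_bits, radius=1):
    """Items whose code is within Hamming radius 0 or 1 of the query.

    Only radius 0 and 1 are handled in this sketch.
    """
    hits = list(table.get(query, []))
    if radius >= 1:
        for bit in range(n_bits):
            # Probe every bucket that differs from the query in exactly one bit.
            hits.extend(table.get(query ^ (1 << bit), []))
    return hits
```

Probing buckets by flipping bits avoids comparing the query against every stored code, which is why the distribution of codes across buckets matters for search efficiency.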
Scrape Search Engine Results in Real-time with Zenserp
If you have a project or service that requires scraping search results for data, you might be interested in this API that can streamline the process. Zenserp is able to get real-time data from search results on the major search platforms. Their simple API has scalable options that make it a great solution for projects of any size. You can try Zenserp for free to see how powerful this API is. Get detailed scrape results from APIs for specific situations.
An Exploratory Study on Utilising the Web of Linked Data for Product Data Mining
The Linked Open Data practice has led to a significant growth of structured data on the Web in the last decade. Such structured data describe real-world entities in a machine-readable way, and have created an unprecedented opportunity for research in the field of Natural Language Processing. However, there is a lack of studies on how such data can be used, for what kind of tasks, and to what extent they can be useful for these tasks. This work focuses on the e-commerce domain to explore methods of utilising such structured data to create language resources that may be used for product classification and linking. We process billions of structured data points in the form of RDF n-quads to create multi-million-word product-related corpora that are later used in three different ways for creating language resources: training word embedding models, continued pre-training of BERT-like language models, and training Machine Translation models that are used as a proxy to generate product-related keywords. Our evaluation on an extensive set of benchmarks shows word embeddings to be the most reliable and consistent method to improve the accuracy on both tasks (with up to 6.9 percentage points in macro-average F1 on some datasets). The other two methods, however, are not as useful. Our analysis shows that this could be due to a number of reasons, including the biased domain representation in the structured data and lack of vocabulary coverage. We share our datasets and discuss how our lessons learned could be taken forward to inform future research in this direction.