 Information Retrieval


Unsupervised Sentiment Analysis for Code-mixed Data

arXiv.org Artificial Intelligence

Code-mixing is the practice of alternating between two or more languages. It is mostly observed in multilingual societies, and as its occurrence increases, so does its importance. Most sentiment analysis research has been monolingual, and most monolingual methods perform poorly on code-mixed text. In this work, we introduce methods that use different kinds of multilingual and cross-lingual embeddings to efficiently transfer knowledge from monolingual text to code-mixed text for sentiment analysis. Our methods can handle code-mixed text through zero-shot learning. Our methods beat the state of the art on English-Spanish code-mixed sentiment analysis by an absolute 3% F1-score. We achieve 0.58 F1-score (without a parallel corpus) and 0.62 F1-score (with a parallel corpus) on the same benchmark in a zero-shot way, compared to 0.68 F1-score in supervised settings. Our code is publicly available.
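
The zero-shot idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the 2-D vectors in `EMB` are hypothetical hand-picked values standing in for learned cross-lingual embeddings, and the nearest-centroid classifier is an assumption chosen for brevity. The point is only that a classifier trained on English sentences can label a code-mixed sentence when both languages share one embedding space.

```python
# Hypothetical cross-lingual embeddings: translations sit close together
# in one shared 2-D space. Real systems learn these vectors from data.
EMB = {
    "good": (0.9, 0.1), "great": (1.0, 0.2),
    "bad": (-0.9, 0.1), "awful": (-1.0, 0.2),
    "bueno": (0.85, 0.15), "malo": (-0.85, 0.15),
    "movie": (0.0, 1.0), "pelicula": (0.05, 0.95),
}


def embed(sentence):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vecs = [EMB[t] for t in sentence.lower().split() if t in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))


def train_centroids(labelled):
    """Nearest-centroid classifier: store one mean vector per label."""
    by_label = {}
    for text, label in labelled:
        by_label.setdefault(label, []).append(embed(text))
    return {
        label: tuple(sum(c) / len(vs) for c in zip(*vs))
        for label, vs in by_label.items()
    }


def predict(centroids, sentence):
    """Return the label whose centroid is closest to the sentence vector."""
    v = embed(sentence)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sqdist(v, centroids[label]))


# Training data is English only; no code-mixed example is ever seen.
english_train = [
    ("good movie", "pos"), ("great movie", "pos"),
    ("bad movie", "neg"), ("awful movie", "neg"),
]
centroids = train_centroids(english_train)

# Zero-shot prediction on code-mixed English-Spanish input.
print(predict(centroids, "the pelicula was bueno"))
print(predict(centroids, "la movie was malo"))
```

Because the Spanish words land near their English counterparts, the code-mixed sentences fall on the same side of the decision boundary as the English training examples.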


Google Search Console unparsable structured data report data issue - Search Engine Land

#artificialintelligence

Google has informed us that you may see a spike in errors in the unparsable structured data report within Google Search Console. This is a bug in the reporting system and you do not need to worry. The issue happened between January 13, 2020 and January 16, 2020. Google wrote on the data anomalies page: "Some users may see a spike in unparsable structured data errors. This was due to an internal misconfiguration that will be fixed soon, and can be ignored."


On Making A Multilingual Search Engine

#artificialintelligence

You can read more about USE in this paper. Let's first read the data. Because the Quora dataset is huge and takes a long time to process, we will take only 1% of the data. This sample will have about 4000 questions, and encoding and indexing it will take around 3 minutes.
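
The sample, encode, index, and search steps described here might look roughly like the sketch below. It makes several assumptions: `toy_encode` is a hypothetical hashed bag-of-words stand-in for a real sentence encoder such as USE (it has no multilingual ability), the index is a plain list searched by brute force, and all function names are illustrative.

```python
import math
import random
import zlib

DIM = 64  # size of the toy embedding; an arbitrary choice


def toy_encode(text):
    """Stand-in for a sentence encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sample_fraction(items, fraction, seed=0):
    """Take a reproducible random sample, e.g. fraction=0.01 for 1%."""
    rng = random.Random(seed)
    k = max(1, int(len(items) * fraction))
    return rng.sample(items, k)


def build_index(questions):
    """Encode every question once and keep (text, vector) pairs."""
    return [(q, toy_encode(q)) for q in questions]


def search(index, query, top_k=3):
    """Rank indexed questions by cosine similarity to the query."""
    qv = toy_encode(query)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scored = [(sum(a * b for a, b in zip(qv, v)), q) for q, v in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


questions = [
    "how do i learn python",
    "what is the capital of france",
    "how to cook rice",
]
index = build_index(questions)
for score, question in search(index, "what is the capital of france", top_k=2):
    print(round(score, 3), question)
```

Swapping `toy_encode` for a real multilingual encoder would leave the rest of the pipeline unchanged, which is why the encoder is kept behind a single function.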


University of Warwick Job Search: Research Fellow or Senior Research Fellow (102493-0120)

#artificialintelligence

Research Fellow or Senior Research Fellow (Deep Learning for Health Trajectory Prediction). The full-time fixed-term post is available until 31st March 2023 (approximately 3 years). You will work with the Principal Investigator (Dr Leandro Pecchia), the project partners and the Warwick GATEKEEPER team for the successful execution of the project. Further information on the project can be read at https://www.gatekeeper-project.eu/. You will have a PhD in Biomedical Engineering or in a relevant discipline (e.g., Computer Science, Information Engineering, Applied Math or similar disciplines). The level of appointment (Research or Senior Research Fellow) will be determined by the successful candidate's skills and experience, including a proven ability and achievement in research and the ability to generate external funding to support research projects.


Privacy concerns over Russia's 'most popular search engine' Yandex as it uses facial recognition

Daily Mail - Science & tech

A Russian search engine is being accused of providing an unregulated facial recognition system to members of the public, violating personal privacy. Experts have slammed the feature as 'poor' and 'creepy' while dubbing it a 'definite privacy concern'. Yandex, much like Google, Bing and other search engines, allows users to input an image and see similar results. But only Yandex, which claims to conduct more than 50 per cent of Russian searches on Android, produces images of the exact same person. MailOnline tested the image search facilities of Yandex, Bing, Google and specialist site TinEye by submitting a photo that was not available online.


Verizon launches 'privacy-focused' search engine leaving some skeptical because of the firm's past

Daily Mail - Science & tech

There is a new internet watchdog in town and it is powered by Verizon. The tech giant released a 'privacy-focused' search engine, called OneSearch, which encrypts searches, leaves results unfiltered and claims to not store or transfer user information. The platform is also Advanced Privacy Mode enabled, meaning all search result links expire within an hour. However, some users are suspicious about the platform, as Verizon has come under fire in the past for tracking customers on the internet without permission.


Modeling Product Search Relevance in e-Commerce

arXiv.org Machine Learning

With the rapid growth of e-Commerce, online product search has emerged as a popular and effective paradigm for customers to find desired products and engage in online shopping. However, there is still a big gap between the products that customers really desire to purchase and relevance of products that are suggested in response to a query from the customer. In this paper, we propose a robust way of predicting relevance scores given a search query and a product, using techniques involving machine learning, natural language processing and information retrieval. We compare conventional information retrieval models such as BM25 and Indri with deep learning models such as word2vec, sentence2vec and paragraph2vec. We share some of our insights and findings from our experiments.
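
As a rough illustration of the BM25 baseline this abstract mentions, the sketch below scores product titles against a query. It is an assumption-laden example, not the paper's setup: it uses whitespace tokenisation, the common defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)`; the product titles are made up.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(d) for d in tokenised) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenised for term in set(d))

    scores = []
    for d in tokenised:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: long documents are penalised via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


products = [
    "red running shoes",
    "blue running jacket",
    "stainless steel water bottle",
]
scores = bm25_scores("red shoes", products)
ranked = sorted(zip(scores, products), reverse=True)
for score, title in ranked:
    print(round(score, 3), title)
```

A purely lexical scorer like this gives zero to any product that shares no token with the query, which is one reason such baselines are compared against embedding-based models.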


Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning

arXiv.org Machine Learning

Word embeddings, i.e., low-dimensional vector representations such as GloVe and SGNS, encode word "meaning" in the sense that distances between words' vectors correspond to their semantic proximity. This enables transfer learning of semantics for a variety of natural language processing tasks. Word embeddings are typically trained on large public corpora such as Wikipedia or Twitter. We demonstrate that an attacker who can modify the corpus on which the embedding is trained can control the "meaning" of new and existing words by changing their locations in the embedding space. We develop an explicit expression over corpus features that serves as a proxy for distance between words and establish a causative relationship between its values and embedding distances. We then show how to use this relationship for two adversarial objectives: (1) make a word a top-ranked neighbor of another word, and (2) move a word from one semantic cluster to another. An attack on the embedding can affect diverse downstream tasks, demonstrating for the first time the power of data poisoning in transfer learning scenarios. We use this attack to manipulate query expansion in information retrieval systems such as resume search, make certain names more or less visible to named entity recognition models, and cause new words to be translated to a particular target word regardless of the language. Finally, we show how the attacker can generate linguistically likely corpus modifications, thus fooling defenses that attempt to filter implausible sentences from the corpus using a language model.
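
The link between corpus statistics and embedding distance can be illustrated with a toy example. This is a simplified sketch, not the paper's explicit expression or attack: raw co-occurrence count vectors stand in for GloVe or SGNS embeddings, and the corpus and inserted sentences are invented. It only shows that adding sentences in which one word appears in another word's typical contexts raises their similarity.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(corpus, window=2):
    """Map each word to a Counter of the words seen within the window."""
    vecs = defaultdict(Counter)
    for sentence in corpus:
        toks = sentence.lower().split()
        for i, word in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[word][toks[j]] += 1
    return vecs


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


clean = [
    "the engineer wrote code",
    "the chef cooked dinner",
    "the engineer fixed code",
    "the chef served dinner",
]
# Inserted sentences place "chef" in the contexts typical of "engineer".
inserted = ["the chef wrote code", "the chef fixed code"] * 3

clean_vecs = cooccurrence_vectors(clean)
before = cosine(clean_vecs["chef"], clean_vecs["engineer"])

modified_vecs = cooccurrence_vectors(clean + inserted)
after = cosine(modified_vecs["chef"], modified_vecs["engineer"])

print(round(before, 3), round(after, 3))
```

On this toy corpus the similarity between "chef" and "engineer" rises from 0.4 to above 0.9 after six sentences are added, which mirrors the first adversarial objective of making one word a close neighbor of another.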


Search engine for Japanese sex hotels announces security breach - ZDNet

#artificialintelligence

HappyHotel, a Japanese search engine for finding and booking rooms in "love hotels," disclosed a security breach at the end of last year. Love hotels are hotels built and operated primarily for allowing guests privacy for sexual activities. Love hotels, also known as sex hotels, are used by both married couples and cheating spouses, alike, and are found all over the world, but they are particularly popular in East Asia, and especially Japan. HappyHotel.jp is a website that operates similarly to Booking.com, but lets registered users search and book rooms in love hotels across Japan. In a message posted on its website, Almex, the company behind the service, said it detected unauthorized access to its servers on December 22, last year.


Inductive Document Network Embedding with Topic-Word Attention

arXiv.org Machine Learning

Document network embedding aims at learning representations for a structured text corpus, i.e., one in which documents are linked to each other. Recent algorithms extend network embedding approaches by incorporating the text content associated with the nodes in their formulations. In most cases, it is hard to interpret the learned representations. Moreover, little importance is given to the generalization to new documents that are not observed within the network. In this paper, we propose an interpretable and inductive document network embedding method. We introduce a novel mechanism, the Topic-Word Attention (TWA), that generates document representations based on the interplay between word and topic representations. We train these word and topic vectors through our general model, Inductive Document Network Embedding (IDNE), by leveraging the connections in the document network. Quantitative evaluations show that our approach achieves state-of-the-art performance on various networks, and we qualitatively show that our model produces meaningful and interpretable representations of the words, topics and documents.