Information Retrieval
Efficient and Accurate Top-$K$ Recovery from Choice Data
The intersection of learning to rank and choice modeling is an active area of research with applications in e-commerce, information retrieval, and the social sciences. In some applications, such as recommendation systems, the statistician is primarily interested in recovering the set of top-ranked items from a large pool as efficiently as possible using passively collected discrete choice data, i.e., data in which the user picks one item from a set of multiple items. Motivated by this practical consideration, we propose the choice-based Borda count algorithm as a fast and accurate ranking algorithm for top-$K$ recovery, i.e., correctly identifying all of the top $K$ items. We show that the choice-based Borda count algorithm has optimal sample complexity for top-$K$ recovery under a broad class of random utility models. We prove that, in the limit, the choice-based Borda count algorithm produces the same top-$K$ estimate as the commonly used Maximum Likelihood Estimate method, but the former's speed and simplicity bring considerable advantages in practice. Experiments on both synthetic and real datasets show that the counting algorithm is competitive with commonly used ranking algorithms in terms of accuracy while being several orders of magnitude faster.
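The abstract does not spell out the scoring rule, so the following is only a minimal sketch of one plausible reading of a choice-based Borda count: score each item by the fraction of the choice sets containing it in which it was picked, then keep the $K$ highest scores. The function name `choice_borda_top_k`, the `(menu, chosen)` input format, and the alphabetical tie-break are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def choice_borda_top_k(observations, k):
    """Rank items by empirical win rate and return the top k.

    observations: iterable of (menu, chosen) pairs, where `menu` is the
    collection of items offered and `chosen` is the single item picked.
    This input format is an assumption made for this sketch.
    """
    offered = Counter()   # how often each item appeared in a menu
    wins = Counter()      # how often each item was the one picked
    for menu, chosen in observations:
        items = set(menu)
        if chosen not in items:
            raise ValueError(f"chosen item {chosen!r} is not in its menu")
        offered.update(items)
        wins[chosen] += 1

    # Score = (times chosen) / (times offered); items never offered get no score.
    scores = {item: wins[item] / offered[item] for item in offered}

    # Sort by descending score; break ties by string form (an arbitrary choice).
    ranked = sorted(scores, key=lambda item: (-scores[item], str(item)))
    return ranked[:k], scores


if __name__ == "__main__":
    data = [
        (("a", "b", "c"), "a"),
        (("a", "b"), "a"),
        (("b", "c"), "b"),
        (("a", "c"), "c"),
        (("b", "c"), "b"),
    ]
    top, scores = choice_borda_top_k(data, k=2)
    print(top, scores)
```

The scoring pass is a single sweep over the data with two counters, which fits the speed advantage the abstract claims over iterative likelihood maximization.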
Accelerating Business Growth with Natural Language Processing
Today, NLP is broadly adopted by businesses across industries in several forms. In fact, according to recent research, the global NLP market size is expected to reach $35.1 billion by 2026. The ubiquity of the technology can be attributed to the abundance of text and voice data as well as the shift from human-computer interaction to human-computer conversation. In my upcoming talk at the Open Data Science Conference (ODSC) East, I am excited to share my thoughts on how NLP is already aiding businesses, trends to watch in the near future, and things to keep in mind when adopting NLP solutions. Outlined below is what you can expect me to discuss in detail during the presentation.
Implementing Hearst Patterns with SpaCy
In this article, I will mostly concentrate on Hearst patterns: their implementation and their use for hypernym extraction. I will also use Named Entity Recognition (NER) and a dataset of patents, so I recommend checking my previous post in this cycle. Why do we care about patterns in the context of NLP? Because they significantly reduce and simplify the work; a pattern is essentially a simple model. Even in the era of Transformer neural networks, patterns can still be beneficial.
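To make the idea concrete, here is a minimal sketch of a single Hearst pattern, "X such as Y, Z and W", where X is read as the hypernym of Y, Z and W. It deliberately uses only Python's standard `re` module as a stand-in and is not the SpaCy implementation the article describes. It handles single-word terms only, and the name `extract_such_as` is an assumption made for illustration.

```python
import re

# One Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Simplification for this sketch: every term is a single word, so multi-word
# noun phrases (which a real NLP pipeline would chunk) are not handled.
_SUCH_AS = re.compile(
    r"(?P<hypernym>\w+),?\s+such as\s+"
    r"(?P<hyponyms>\w+(?:\s*,\s*(?:(?:and|or)\s+)?\w+|\s+(?:and|or)\s+\w+)*)"
)

# Separators between hyponyms: a comma (optionally followed by and/or),
# or a bare "and"/"or".
_SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def extract_such_as(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in _SUCH_AS.finditer(text):
        hypernym = match.group("hypernym")
        for hyponym in _SEPARATOR.split(match.group("hyponyms")):
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


if __name__ == "__main__":
    sentence = "The coating resists metals such as copper, zinc and nickel."
    print(extract_such_as(sentence))
```

A regular expression cannot tell where a noun phrase ends, so this sketch will over-capture in sentences like "metals such as copper and it works". Token-level matching over part-of-speech tags avoids that, which is one reason to build such patterns on an NLP library instead.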
Let's Talk About Microsites & Dips In Traffic: Ask An SEO - Channel969
Today's Ask an SEO question comes from Kate in Louisville, who wrote: "I work for a company that builds microsites for clients. What factors do I need to pay attention to when there's a dip in organic traffic? In Q4 2021, for example, we did a rebrand and metadata was altered. Would this have a big impact on traffic going forward?" Search engines still look at URLs, links, titles, content, and hundreds of other ranking factors, so the same SEO best practices for diagnosing a rankings drop will apply to microsites, too.
Research Topic Flows in Co-Authorship Networks
Schäfermeier, Bastian, Hirth, Johannes, Hanika, Tom
In scientometrics, scientific collaboration is often analyzed by means of co-authorships. An aspect that is often overlooked and more difficult to quantify is the flow of expertise between authors from different research topics, which is an important part of scientific progress. With the Topic Flow Network (TFN) we propose a graph structure for the analysis of research topic flows between scientific authors and their respective research fields. Based on a multi-graph and a topic model, our proposed network structure accounts for intratopic as well as intertopic flows. To construct a TFN, our method requires only a corpus of publications (i.e., author and abstract information). From this, research topics are discovered automatically through non-negative matrix factorization. The resulting TFN allows for the application of social network analysis techniques, such as common metrics and community detection. Most importantly, it allows for the analysis of intertopic flows on a large, macroscopic scale, i.e., between research topics, as well as on a microscopic scale, i.e., between certain sets of authors. We demonstrate the utility of TFNs by applying our method to two comprehensive corpora totaling 20 million publications and spanning more than 60 years of research in the fields of computer science and mathematics. Our results give evidence that TFNs are suitable, e.g., for the analysis of topical communities, the discovery of important authors in different fields, and, most notably, the analysis of intertopic flows, i.e., the transfer of topical expertise. Beyond that, our method opens new directions for future research, such as the investigation of influence relationships between research fields.
Does Twitter know your political views? POLiTweets dataset and semi-automatic method for political leaning discovery
Baran, Joanna, Kajstura, Michał, Ziółkowski, Maciej, Rajda, Krzysztof
Every day, the world is flooded by millions of messages and statements posted on Twitter or Facebook. Social media platforms try to protect users' personal data, but there is still a real risk of misuse, including election manipulation. Did you know that only 13 posts addressing important or controversial topics for society are enough to predict one's political affiliation with a 0.85 F1-score? To examine this phenomenon, we created a novel universal method of semi-automated political leaning discovery. It relies on a heuristic data annotation procedure, which was evaluated to achieve 0.95 agreement with human annotators (counted as an accuracy metric). We also present POLiTweets - the first publicly open Polish dataset for political affiliation discovery in a multi-party setup, consisting of over 147k tweets from almost 10k Polish-writing users annotated heuristically and almost 40k tweets from 166 users annotated manually as a test set. We used our data to study the aspects of domain shift in the context of topics and the type of content writers - ordinary citizens vs. professional politicians.
Writing for Search Engines: Optimize for Robots or People?
Google processes more than 8.5 billion searches every day. That's roughly 100,000 searches per second, thousands of which could lead a user to a purchase. It's no wonder, then, that 60% of marketers list SEO as their number one inbound marketing priority. But generating organic traffic comes with challenges. Google has hundreds of billions of webpages in its index, competing for the top spots on search result pages.
Virtual Openhouse
Research¹ shows that including videos in web pages can effectively improve user experience, improve Search Engine Optimization (SEO), and catch readers further down the sales funnel. To help agents with their business through Compass' website, the Compass AI Content Intelligence (AI-CI) team wants to make it easy for them to generate and share videos. We use state-of-the-art AI technologies to create the visual and textual content for the videos, and we use close-to-metal rendering algorithms together with a cloud-based distributed computation system to render the videos efficiently. With our current automatic video generation feature, agents can create a video with a single click, or customize it with just a few more clicks. They can then quickly review videos that have been created for them.