This week at Microsoft Ignite, a number of new Azure developments were in focus. While there were dozens of updates to the world's second-largest public cloud, data was once again in the spotlight. The company made a series of announcements to help users extract more value from the exponential increase in data. Satya Nadella, in his Ignite keynote, set out a new direction, or at least a new way of expressing the company's cloud endeavors. In short, the Microsoft cloud is evolving to further embrace edge, privacy, security, AI, and developers (both coders and no-coders), and to serve as an engine of job creation. On the surface, this shift appears subtle.
Building search systems is hard. Preparing them to work with machine learning is really hard. Developing a complete search engine framework integrated with AI is really, really hard. In this post, we'll build a search engine from scratch and discuss how to further optimize results by adding a machine learning layer using Kubeflow and Katib. This new layer, which is the main focus of this article, retrieves results that take the user's context into account. As we'll see, thanks to Kubeflow and Katib, the final result is quite simple, efficient and easy to maintain. To understand the concepts in practice, we'll implement the system hands-on. Because it's built on top of Kubernetes, you can use any infrastructure you like (given appropriate adaptations).
On Feb 15, 2019, John Abowd, chief scientist at the U.S. Census Bureau, announced the results of a reconstruction attack that the Bureau proactively launched using data released under the 2010 Decennial Census [19]. The decennial census released billions of statistics about individuals, such as "how many people aged 10-20 live in New York City" or "how many people live in four-person households." Using only the data publicly released in 2010, an internal team was able to correctly reconstruct records of address (by census block), age, gender, race, and ethnicity for 142 million people (about 46% of the U.S. population), and correctly match these data to commercial datasets circa 2010 to associate personal-identifying information such as names for 52 million people (17% of the population). This is not specific to the U.S. Census Bureau: such attacks can occur in any setting where statistical information in the form of deidentified data, statistics, or even machine learning models is released. That such attacks are possible was predicted over 15 years ago in a seminal paper by Irit Dinur and Kobbi Nissim [12]: releasing a sufficiently large number of aggregate statistics with sufficiently high accuracy provides sufficient information to reconstruct the underlying database with high accuracy. The practicality of such a large-scale reconstruction by the U.S. Census Bureau underscores the grand challenge that public organizations, industry, and scientific research face: How can we safely disseminate results of data analysis on sensitive databases? An emerging answer is differential privacy. An algorithm satisfies differential privacy (DP) if its output is insensitive to adding, removing or changing one record in its input database. DP is considered the "gold standard" for privacy for a number of reasons.
It provides a persuasive mathematical proof of privacy to individuals with several rigorous interpretations [25, 26]. The DP guarantee is composable: repeated invocations of differentially private algorithms lead to a graceful degradation of privacy.
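To make the definition concrete, here is a minimal sketch of one standard way to satisfy DP for a counting query, the Laplace mechanism. It is an illustration, not a method described in the text above: the function names, the made-up `ages` data, and the choice of epsilon are all assumptions, and a production implementation would need more care (for example, around floating-point behavior).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count using the Laplace mechanism.

    Adding, removing or changing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: "how many people aged 10-20?" released with epsilon = 0.5.
rng = random.Random(0)
ages = [12, 15, 34, 47, 19, 62, 11, 18]
noisy = private_count(ages, lambda a: 10 <= a <= 20, epsilon=0.5, rng=rng)
```

Under basic sequential composition, answering two such queries at epsilon = 0.5 each costs a total privacy budget of 1.0, which is one simple form of the graceful degradation mentioned above.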
The Coronavirus (COVID-19) pandemic has led to a rapidly growing 'infodemic' online. Accurate retrieval of reliable, relevant data from millions of documents about COVID-19 has therefore become urgently needed by the general public as well as by other stakeholders. The COVID-19 Multilingual Information Access (MLIA) initiative is a joint effort to improve the exchange of COVID-19 related information by developing applications and services through research and community participation. In this work, we present a search system called Multistage BiCross-Encoder, developed by team GATE for the MLIA task 2, Multilingual Semantic Search. Multistage BiCross-Encoder is a sequential three-stage pipeline which uses the Okapi BM25 algorithm, a transformer-based bi-encoder, and a cross-encoder to effectively rank the documents with respect to the query. The results of round 1 show that our models achieve state-of-the-art performance for all ranking metrics for both monolingual and bilingual runs.
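As an illustration of the first stage only, the following is a minimal sketch of Okapi BM25 candidate retrieval in plain Python. The parameter values (`k1=1.5`, `b=0.75`), the smoothed IDF variant, the function names, and the toy documents are assumptions for this sketch and are not taken from the system described above; the bi-encoder and cross-encoder stages would then re-rank the returned candidates.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def first_stage(query, docs, top_k=2):
    """Return indices of the top_k documents by BM25 score (candidate set)."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy, made-up corpus.
docs = [
    ["covid", "vaccine", "trial"],
    ["football", "match", "report"],
    ["covid", "symptoms", "covid", "fever"],
]
candidates = first_stage(["covid", "fever"], docs, top_k=2)
```

A cheap lexical first stage like this keeps the more expensive transformer stages limited to a small candidate set.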
Theoretical and abstract approaches to information have made great advances, but human information processing is still unmatched in many areas, including information management, representation and understanding. Neurocognitive informatics is a new, emerging field that should help to improve the matching of artificial and natural systems, and inspire better computational algorithms to solve problems that are still beyond the reach of machines. In this position paper, examples of neurocognitive inspirations and promising directions in this area are given.
Graph structures are powerful tools for modeling the relationships between textual elements. Graph-of-Words (GoW) has been adopted in many natural language tasks to encode the association between terms. However, GoW provides few document-level relationships, which is a limitation when the connections between documents are also essential. For identifying sub-events on social media such as Twitter, features at both the word and document level can be useful, as they supply different information about the event. We propose a hybrid Graph-of-Tweets (GoT) model which combines the word- and document-level structures for modeling tweets. To compress a large amount of raw data, we propose a graph merging method which utilizes FastText word embeddings to reduce the GoW. Furthermore, we present a novel method to construct the GoT from the reduced GoW and a Mutual Information (MI) measure. Finally, we identify maximal cliques to extract popular sub-events. Our model showed promising results in condensing lexical-level information and capturing keywords of sub-events.
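For readers unfamiliar with Graph-of-Words, the sketch below shows one common way to build such a graph from a token sequence using a sliding co-occurrence window. The window size, the function name, and the toy tokens are assumptions for illustration; the embedding-based merging and the MI-based GoT construction proposed above are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def graph_of_words(tokens, window=3):
    """Build an undirected co-occurrence graph from a token list.

    Terms are nodes. An edge links two distinct terms that share a sliding
    window, and its weight counts the windows in which both terms appear.
    """
    edges = defaultdict(int)
    for start in range(max(1, len(tokens) - window + 1)):
        span = tokens[start:start + window]
        # Sort so each undirected edge has a single canonical key.
        for a, b in combinations(sorted(set(span)), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Toy, made-up tweet tokens.
gow = graph_of_words(["flood", "hits", "city", "flood"], window=2)
```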
Most typical click models, such as PBM and UBM, assume that the probability of a document being examined by users depends only on its position. This assumption works well in various kinds of search engines. However, in a search engine where a large number of candidate documents display images as responses to the query, the examination probability should not depend on position alone. The visual appearance of an image-oriented document also plays an important role in its chance of being examined. In this paper, we assume that vision bias exists in an image-oriented search engine as another crucial factor, besides position, that affects the examination probability. Specifically, we apply this assumption to classical click models and propose an extended model to better capture the examination probabilities of documents. We use a regression-based EM algorithm to predict the vision bias given the visual features extracted from candidate documents. Empirically, we evaluate our model on a dataset developed from a real-world online image-oriented search engine, and demonstrate that our proposed model achieves significant improvements over its baseline model in data fitness and sparsity handling.
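The position-based model (PBM) mentioned above factors a click into examination and attractiveness. The sketch below shows that factorization; the `vision_factor` argument and its multiplicative, capped form are purely hypothetical placeholders for where an appearance-dependent term could enter, since the abstract does not spell out the extended model.

```python
def pbm_click_probability(position_exam, attractiveness, vision_factor=1.0):
    """Click probability under the position-based model (PBM).

    PBM: P(click) = P(examined | position) * P(attractive | query, document).

    `vision_factor` is a hypothetical multiplicative adjustment to the
    examination probability (1.0 recovers plain PBM). It is an assumption
    of this sketch, not the formulation of the extended model.
    """
    exam = min(1.0, position_exam * vision_factor)
    return exam * attractiveness


# Made-up numbers: same position and attractiveness, different appearance.
plain = pbm_click_probability(0.5, 0.4)
salient = pbm_click_probability(0.5, 0.4, vision_factor=1.5)
```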
Causal classification models are adopted across a variety of operational business processes to predict the effect of a treatment on a categorical business outcome of interest, depending on the characteristics of the process instance. This allows optimizing operational decision-making and selecting the optimal treatment to apply in each specific instance, with the aim of maximizing the positive outcome rate. While various powerful approaches have been presented in the literature for learning causal classification models, no formal framework has been elaborated for optimal decision-making based on the estimated individual treatment effects, given the cost of the various treatments and the benefit of the potential outcomes. In this article, we therefore extend the expected value framework and formally introduce a cost-sensitive decision boundary for double binary causal classification, which is a linear function of the estimated individual treatment effect, the positive outcome probability, and the cost and benefit parameters of the problem setting. The boundary allows causally classifying instances into the positive and negative treatment classes to maximize the expected causal profit, which is introduced as the objective in cost-sensitive causal classification. We introduce the expected causal profit ranker, which ranks instances to maximize the expected causal profit at each possible threshold for causally classifying instances, and which differs from the conventional ranking approach based on the individual treatment effect. The proposed ranking approach is experimentally evaluated on synthetic and marketing campaign data sets. The results indicate that the presented ranking method outperforms the cost-insensitive ranking approach and allows boosting profitability.
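The difference between ranking by treatment effect and ranking by expected profit can be seen in a small worked example. The sketch below uses a simplified cost structure chosen for illustration (a fixed cost per treated instance plus a cost incurred only when a treated instance has a positive outcome); it is an assumption of this sketch, not the paper's formula, and all names and numbers are hypothetical.

```python
def expected_gain(p_untreated, ite, benefit, cost_fixed, cost_if_positive):
    """Expected profit of treating minus not treating one instance.

    Assumed (illustrative) setting:
      - a positive outcome is worth `benefit`;
      - treating costs `cost_fixed`, plus `cost_if_positive` when the
        treated instance ends up with a positive outcome.
    The result is linear in both `ite` and `p_untreated`.
    """
    p_treated = p_untreated + ite
    profit_treated = p_treated * (benefit - cost_if_positive) - cost_fixed
    profit_untreated = p_untreated * benefit
    return profit_treated - profit_untreated


# Made-up instances: (name, P(positive | untreated), estimated ITE).
instances = [("A", 0.10, 0.10), ("B", 0.80, 0.12)]
params = dict(benefit=100.0, cost_fixed=1.0, cost_if_positive=10.0)

gains = {name: expected_gain(p0, ite, **params) for name, p0, ite in instances}
rank_by_ite = [n for n, _, ite in sorted(instances, key=lambda x: -x[2])]
rank_by_gain = sorted(gains, key=lambda n: -gains[n])
```

With these made-up numbers, B has the larger treatment effect but A has the larger expected gain, because B's high baseline outcome probability makes the outcome-dependent cost more likely to be paid. Under this assumed setting, an instance would be treated only when its expected gain is positive.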
On social media, there are a large number of potential zombie accounts, which may have a negative impact on public opinion. Traditionally, the PageRank algorithm is used to detect zombie accounts. However, it has two problems: it requires a large amount of RAM to store the adjacency matrix or adjacency list, and the importance values may approach zero for large graphs. To solve the first problem, since the structure of social media makes the graph divisible, we applied the Louvain community detection algorithm to decompose the whole graph into 1,002 subgraphs. The modularity of 0.58 shows that the decomposition is effective. To solve the second problem, we used the uneven assignation PageRank algorithm to calculate the importance of each node in each community. Then, a threshold is set to distinguish zombie accounts from normal accounts. The result shows that about 20% of the accounts in the dataset are zombie accounts and that they are concentrated in tier-one cities in China such as Beijing, Shanghai, and Guangzhou. In the future, a classification algorithm with semi-supervised learning can be used to detect zombie accounts.
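The following is a minimal sketch of standard PageRank by power iteration on an adjacency list, as it could be run on a single community subgraph. It is a stand-in for illustration: the "uneven assignation" variant used above is not specified in the text, and the damping factor, iteration count, function name, and toy graph are assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Standard PageRank via power iteration.

    `out_links` maps every node to the list of nodes it links to; every
    link target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank


# Toy, made-up follower graph for one community.
community = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(community)
```

Because scores sum to 1 over the `n` nodes of the graph, typical values shrink roughly as `1 / n`, which illustrates why scores on one very large graph approach zero and why scoring within smaller communities keeps them in a more usable range.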
Around five percent of the papers from the conference were on graphs, so there is lots to discuss. A new paper (with authors from every major big tech company) was recently published showing how one can attack language models like GPT-2 and extract information verbatim, such as personally identifiable information, just by querying the model. The extracted information came from the models' training data, which was based on scraped internet data. This is a big problem, especially when you train a language model on a private custom dataset. Looks like Booking.com wants a new recommendation engine, and they are offering up their dataset of over 1 million anonymized hotel reservations to get you in the game.