Goto

Collaborating Authors

 Information Retrieval


Ahana Cloud for Presto review: Fast SQL queries against data lakes

#artificialintelligence

Hope springs eternal in the database business. While we're still hearing about data warehouses (fast analysis databases, typically featuring in-memory columnar storage) and tools that improve the ETL step (extract, transform, and load), we're also hearing about improvements in data lakes (which store data in its native format) and data federation (on-demand data integration of heterogeneous data stores). Presto keeps coming up as a fast way to perform SQL queries on big data that resides in data lake files. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores.


FRAKE: Fusional Real-time Automatic Keyword Extraction

arXiv.org Artificial Intelligence

Keyword extraction is called identifying words or phrases that express the main concepts of texts in best. There is a huge amount of texts that are created every day and at all times through electronic infrastructure. So, it is practically impossible for humans to study and manage this volume of documents. However, the need for efficient and effective access to these documents is evident in various purposes. Weblogs, News, and technical notes are almost long texts, while the reader seeks to understand the concepts by topics or keywords to decide for reading the full text. To this aim, we use a combined approach that consists of two models of graph centrality features and textural features. In the following, graph centralities, such as degree, betweenness, eigenvector, and closeness centrality, have been used to optimally combine them to extract the best keyword among the candidate keywords extracted by the proposed method. Also, another approach has been introduced to distinguishing keywords among candidate phrases and considering them as a separate keyword. To evaluate the proposed method, seven datasets named, Semeval2010, SemEval2017, Inspec, fao30, Thesis100, pak2018 and WikiNews have been used, and results reported Precision, Recall, and F- measure.


INODE: Building an End-to-End Data Exploration System in Practice [Extended Vision]

arXiv.org Artificial Intelligence

A full-fledged data exploration system must combine different access modalities with a powerful concept of guiding the user in the exploration process, by being reactive and anticipative both for data discovery and for data linking. Such systems are a real opportunity for our community to cater to users with different domain and data science expertise. We introduce INODE -- an end-to-end data exploration system -- that leverages, on the one hand, Machine Learning and, on the other hand, semantics for the purpose of Data Management (DM). Our vision is to develop a classic unified, comprehensive platform that provides extensive access to open datasets, and we demonstrate it in three significant use cases in the fields of Cancer Biomarker Reearch, Research and Innovation Policy Making, and Astrophysics. INODE offers sustainable services in (a) data modeling and linking, (b) integrated query processing using natural language, (c) guidance, and (d) data exploration through visualization, thus facilitating the user in discovering new insights. We demonstrate that our system is uniquely accessible to a wide range of users from larger scientific communities to the public. Finally, we briefly illustrate how this work paves the way for new research opportunities in DM.


Topological Data Analysis of Database Representations for Information Retrieval

arXiv.org Machine Learning

Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval. This recently has been achieved by embedding the graphical structure of the database into a manifold so that the hierarchy is preserved. Persistent homology provides a rigorous characterization for the database topology in terms of both its hierarchy and connectivity structure. We compute persistent homology on a variety of datasets and show that some commonly used embeddings fail to preserve the connectivity. Moreover, we show that embeddings which successfully retain the database topology coincide in persistent homology. We introduce the dilation-invariant bottleneck distance to capture this effect, which addresses metric distortion on manifolds. We use it to show that distances between topology-preserving embeddings of databases are small.


Evidence-based Verification for Real World Information Needs

arXiv.org Artificial Intelligence

Claim verification is the task of predicting the veracity of written statements against evidence. Previous large-scale datasets model the task as classification, ignoring the need to retrieve evidence, or are constructed for research purposes, and may not be representative of real-world needs. In this paper, we introduce a novel claim verification dataset with instances derived from search-engine queries, yielding 10,987 claims annotated with evidence that represent real-world information needs. For each claim, we annotate evidence from full Wikipedia articles with both section and sentence-level granularity. Our annotation allows comparison between two complementary approaches to verification: stance classification, and evidence extraction followed by entailment recognition. In our comprehensive evaluation, we find no significant difference in accuracy between these two approaches. This enables systems to use evidence extraction to summarize a rationale for an end-user while maintaining the accuracy when predicting a claim's veracity. With challenging claims and evidence documents containing hundreds of sentences, our dataset presents interesting challenges that are not captured in previous work -- evidenced through transfer learning experiments. We release code and data to support further research on this task.


Classifying the Unstructured IT Service Desk Tickets Using Ensemble of Classifiers

arXiv.org Artificial Intelligence

Manual classification of IT service desk tickets may result in routing of the tickets to the wrong resolution group. Incorrect assignment of IT service desk tickets leads to reassignment of tickets, unnecessary resource utilization and delays the resolution time. Traditional machine learning algorithms can be used to automatically classify the IT service desk tickets. Service desk ticket classifier models can be trained by mining the historical unstructured ticket description and the corresponding label. The model can then be used to classify the new service desk ticket based on the ticket description. The performance of the traditional classifier systems can be further improved by using various ensemble of classification techniques. This paper brings out the three most popular ensemble methods ie, Bagging, Boosting and Voting ensemble for combining the predictions from different models to further improve the accuracy of the ticket classifier system. The performance of the ensemble classifier system is checked against the individual base classifiers using various performance metrics. Ensemble of classifiers performed well in comparison with the corresponding base classifiers. The advantages of building such an automated ticket classifier systems are simplified user interface, faster resolution time, improved productivity, customer satisfaction and growth in business. The real world service desk ticket data from a large enterprise IT infrastructure is used for our research purpose.


Effective Search Engine Optimization Requires a Commitment, Not a Campaign

#artificialintelligence

A successful product and brand strategy develops brand awareness and identity that distinguishes a product from countless others based on just the brand name. A well-crafted strategy repeatedly reminds prospective and existing customers why they should buy a particular product over others with similar characteristics. Brand is just a perception, and perception will match reality over time. A brand is any trade mark through which a product is correctly identified and described by consumers. Therefore, the brand includes any action and remedy by which the product is identified.


A PSO Strategy of Finding Relevant Web Documents using a New Similarity Measure

arXiv.org Artificial Intelligence

In the world of the Internet and World Wide Web, which offers a tremendous amount of information, an increasing emphasis is being given to searching services and functionality. Currently, a majority of web portals offer their searching utilities, be it better or worse. These can search for the content within the sites, mainly text the textual content of documents. In this paper a novel similarity measure called SMDR (Similarity Measure for Documents Retrieval) is proposed to help retrieve more similar documents from the repository thus contributing considerably to the effectiveness of Web Information Retrieval (WIR) process. Bio-inspired PSO methodology is used with the intent to reduce the response time of the system and optimizes WIR process, hence contributes to the efficiency of the system. This paper also demonstrates a comparative study of the proposed system with the existing method in terms of accuracy, sensitivity, F-measure and specificity. Finally, extensive experiments are conducted on CACM collections. Better precision-recall rates are achieved than the existing system. Experimental results demonstrate the effectiveness and efficiency of the proposed system.


Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store

#artificialintelligence

Amazon SageMaker Feature Store is a new capability of Amazon SageMaker that helps data scientists and machine learning (ML) engineers securely store, discover, and share curated data used in training and prediction workflows. As organizations build data-driven applications using ML, they're constantly assembling and moving features between more and more functional teams. This constant movement of data can lead to inconsistencies in features and become a bottleneck when designing ML initiatives spanning multiple teams. For example, an ecommerce company might have several data science and engineering teams working on different aspects of their platform. The Core Search team focuses on query understanding and information retrieval tasks. The Product Success team solves problems involving customer reviews and feedback signals. The Personalization team uses clickstream and session data to create ML models for personalized recommendations.


Database Systems Research in the Arab World

Communications of the ACM

From Hammurabi's stone tablets to papyrus rolls and leather-bound books, the Arab region has a rich history of recordkeeping and transactional systems that closely matches the evolution of data storage mediums. Even modern-day data management concepts like data provenance and lineage have historic roots in the Arab world; generations of scribes meticulously tracked Islamic prophetic narrations from one narrator to the next, forming lineage chains that originated from central Arabia. Database systems research has been part of the academic culture in the Arab world since the 1970s. High-quality computer science and database education was always available at several universities within the Arab region, such as Alexandria University in Egypt. Many students who went through these programs were drawn to database systems research and became globally prominent, such as Ramez Elmasri (professor at University of Texas, Arlington), Amr El Abbadi (professor at University of California, Santa Barbara), and Walid Aref (professor at Purdue University).