Country
Query Significance in Databases via Randomizations
Ojala, Markus, Garriga, Gemma C., Gionis, Aristides, Mannila, Heikki
Many sorts of structured data are commonly stored in a multi-relational format of interrelated tables. Under this relational model, exploratory data analysis can be done by using relational queries. As an example, in the Internet Movie Database (IMDb) a query can be used to check whether the average rank of action movies is higher than the average rank of drama movies. We consider the problem of assessing whether the results returned by such a query are statistically significant or just a random artifact of the structure in the data. Our approach is based on randomizing the tables occurring in the queries and repeating the original query on the randomized tables. It turns out that there is no unique way of randomizing in multi-relational data. We propose several randomization techniques, study their properties, and show how to find out which queries or hypotheses about our data result in statistically significant information. We give results on real and generated data and show how the significance of some queries vary between different randomizations.
Restricted Global Grammar Constraints
Katsirelos, George, Maneth, Sebastian, Narodytska, Nina, Walsh, Toby
We investigate the global GRAMMAR constraint over restricted classes of context free grammars like deterministic and unambiguous context-free grammars. We show that detecting disentailment for the GRAMMAR constraint in these cases is as hard as parsing an unrestricted context free grammar.We also consider the class of linear grammars and give a propagator that runs in quadratic time. Finally, to demonstrate the use of linear grammars, we show that a weighted linear GRAMMAR constraint can efficiently encode the EDITDISTANCE constraint, and a conjunction of the EDITDISTANCE constraint and the REGULAR constraint
Multiple Hypothesis Testing in Pattern Discovery
Hanhijรคrvi, Sami, Puolamรคki, Kai, Garriga, Gemma C.
The problem of multiple hypothesis testing arises when there are more than one hypothesis to be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, assessing simultaneously the significance of all frequent itemsets of a single dataset entails a host of hypothesis, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I error). Our contribution in this paper is to extend the multiple hypothesis framework to be used with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.
General combination rules for qualitative and quantitative beliefs
Martin, Arnaud, Osswald, Christophe, Dezert, Jean, Smarandache, Florentin
Martin and Osswald \cite{Martin07} have recently proposed many generalizations of combination rules on quantitative beliefs in order to manage the conflict and to consider the specificity of the responses of the experts. Since the experts express themselves usually in natural language with linguistic labels, Smarandache and Dezert \cite{Li07} have introduced a mathematical framework for dealing directly also with qualitative beliefs. In this paper we recall some element of our previous works and propose the new combination rules, developed for the fusion of both qualitative or quantitative beliefs.
High Dimensional Nonlinear Learning using Local Coordinate Coding
This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point $x$ on the manifold can be locally approximated by a linear combination of its nearby anchor points, with the linear weights offering a local-coordinate coding of $x$. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. The work also gives a theoretical justification to the empirical success of some biologically-inspired models using sparse coding of sensory data, since a local coding scheme must be sufficiently sparse. However, sparsity does not always satisfy locality conditions, and can thus possibly lead to suboptimal results. The properties and performances of the method are empirically verified on synthetic data, handwritten digit classification, and object recognition tasks.
A Novel Two-Stage Dynamic Decision Support based Optimal Threat Evaluation and Defensive Resource Scheduling Algorithm for Multi Air-borne threats
Naeem, Huma, Masood, Asif, Hussain, Mukhtar, Khan, Shoab A.
This paper presents a novel two-stage flexible dynamic decision support based optimal threat evaluation and defensive resource scheduling algorithm for multi-target air-borne threats. The algorithm provides flexibility and optimality by swapping between two objective functions, i.e. the preferential and subtractive defense strategies as and when required. To further enhance the solution quality, it outlines and divides the critical parameters used in Threat Evaluation and Weapon Assignment (TEWA) into three broad categories (Triggering, Scheduling and Ranking parameters). Proposed algorithm uses a variant of many-to-many Stable Marriage Algorithm (SMA) to solve Threat Evaluation (TE) and Weapon Assignment (WA) problem. In TE stage, Threat Ranking and Threat-Asset pairing is done. Stage two is based on a new flexible dynamic weapon scheduling algorithm, allowing multiple engagements using shoot-look-shoot strategy, to compute near-optimal solution for a range of scenarios. Analysis part of this paper presents the strengths and weaknesses of the proposed algorithm over an alternative greedy algorithm as applied to different offline scenarios.
Towards the Patterns of Hard CSPs with Association Rule Mining
The hardness of finite domain Constraint Satisfaction Problems (CSPs) is a very important research area in Constraint Programming (CP) community. However, this problem has not yet attracted much attention from the researchers in the association rule mining community. As a popular data mining technique, association rule mining has an extremely wide application area and it has already been successfully applied to many interdisciplines. In this paper, we study the association rule mining techniques and propose a cascaded approach to extract the interesting patterns of the hard CSPs. As far as we know, this problem is investigated with the data mining techniques for the first time. Specifically, we generate the random CSPs and collect their characteristics by solving all the CSP instances, and then apply the data mining techniques on the data set and further to discover the interesting patterns of the hardness of the randomly generated CSPs
Concept-based Recommendations for Internet Advertisement
Ignatov, Dmitry I., Kuznetsov, Sergei O.
The problem of detecting terms that can be interesting to the advertiser is considered. If a company has already bought some advertising terms which describe certain services, it is reasonable to find out the terms bought by competing companies. A part of them can be recommended as future advertising terms to the company. The goal of this work is to propose better interpretable recommendations based on FCA and association rules.
Node discovery in a networked organization
In this paper, I present a method to solve a node discovery problem in a networked organization. Covert nodes refer to the nodes which are not observable directly. They affect social interactions, but do not appear in the surveillance logs which record the participants of the social interactions. Discovering the covert nodes is defined as identifying the suspicious logs where the covert nodes would appear if the covert nodes became overt. A mathematical model is developed for the maximal likelihood estimation of the network behind the social interactions and for the identification of the suspicious logs. Precision, recall, and F measure characteristics are demonstrated with the dataset generated from a real organization and the computationally synthesized datasets. The performance is close to the theoretical limit for any covert nodes in the networks of any topologies and sizes if the ratio of the number of observation to the number of possible communication patterns is large.