A bagging SVM to learn from positive and unlabeled examples
Mordelet, Fantine, Vert, Jean-Philippe
In many applications, such as information retrieval or gene ranking, one is given a finite set of data of interest sharing a particular property and wishes to find other data sharing the same property. In information retrieval, for example, the finite set can be a user query or a set of documents known to belong to a specific category, and the goal is to scan a large database of documents to identify new documents related to the query or belonging to the same category. In gene ranking, the query is a finite list of genes known to have a given function or to be associated with a given disease, and the goal is to identify new genes sharing the same property (Aerts et al., 2006). In fact, this setting is ubiquitous in applications where identifying data of interest is difficult or expensive, e.g., because human intervention or costly experiments are needed, while unlabeled data can be easily collected. In such cases there is a clear opportunity to alleviate the burden and cost of identifying interesting data with the help of machine learning techniques. More formally, let us assign a binary label to each possible data point: positive (+1) for data of interest, negative (-1) for other data. Unlabeled data are data for which we do not know whether they are interesting or not. Denoting by X the set of data, we assume that the "query" is a finite set of data P = {x
Oct-5-2010