Human Performance on Clustering Web Pages: A Preliminary Study

AAAI Conferences

With the increase in information on the World Wide Web, it has become difficult to quickly find desired information without using multiple queries or a topic-specific search engine. One way to help in the search is to group together HTML pages that appear to be related in some way. To better understand this task, we performed an initial study of human clustering of web pages, in the hope that it would provide some insight into the difficulty of automating it. Our results show that subjects did not cluster identically; in fact, on average, any two subjects had little similarity in their web page clusters. We also found that subjects generally created rather small clusters, and that those with access only to URLs created fewer clusters than those with access to the full text of each web page. Generally, the overlap of documents between a given subject's clusters increased when the subject was given the full text, as did the percentage of documents clustered. When analyzing individual subjects, we found that each behaved differently across queries in terms of overlap, cluster size, and number of clusters. These results provide a sobering note on any quest for a single clearly correct clustering method for web pages.
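The abstract does not say how similarity between two subjects' clusterings was measured. As a minimal sketch, assuming a Jaccard score over co-clustered document pairs (a choice made here for illustration, not taken from the study), one way to compare two subjects is shown below. The function names and page identifiers are hypothetical. Working with pairs tolerates overlapping clusters and documents left unclustered, both of which the study reports.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered document pairs that share at least one cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes (a, b) and (b, a) collapse to the same tuple.
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def clustering_similarity(clusters_a, clusters_b):
    """Jaccard similarity of the co-clustered pair sets of two clusterings."""
    pairs_a = co_clustered_pairs(clusters_a)
    pairs_b = co_clustered_pairs(clusters_b)
    union = pairs_a | pairs_b
    if not union:
        # Neither subject grouped any two documents; treat as identical.
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical clusterings of five pages by two subjects.
subject_1 = [["p1", "p2", "p3"], ["p4", "p5"]]
subject_2 = [["p1", "p2"], ["p3", "p4", "p5"]]
score = clustering_similarity(subject_1, subject_2)
```

Here the two subjects agree on two of the six pairs that either of them grouped, so the score is one third, even though both produced two clusters over the same five pages.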

Syntactic Folding and its Application to the Information Extraction from Web Pages

AAAI Conferences

The focus is on folding principles and their influence on the recognition of certain data in a document undergoing extraction. Introduction: The topic of our work is information extraction from the Internet. Several approaches deal with the problem of recognizing structural data in semistructured documents in order to retrieve user-specified information from these and from similar documents (possibly from the same source) in an automatic or semi-automatic way (Freitag 1996), (Soderland 1997), (Kushmerick 1997). Ideally, structural information should be learned by presenting to a learning device only samples of the text segments that a user wants to extract from these pages, without any need to specify how the desired samples can be located within the document. The learning device should generate a procedure, a wrapper, that, reading the same documents, outputs a collection of information that includes the samples and, ideally, extends them by finding similar items. These approaches led to a variety of wrapper classes, e.g.
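To make the idea of learning a wrapper from samples concrete, here is a minimal sketch of a delimiter-based wrapper: it takes the text that consistently precedes and follows the user's samples as left and right delimiters, then extracts every segment enclosed by them. This is an illustration only, not the folding method of the paper; the function names, the `context` window, and the sample document are assumptions.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def learn_wrapper(document, samples, context=20):
    """Derive (left, right) delimiters from the text around each sample."""
    lefts, rights = [], []
    for sample in samples:
        start = document.find(sample)  # first occurrence only
        if start < 0:
            raise ValueError(f"sample not found: {sample!r}")
        end = start + len(sample)
        lefts.append(document[max(0, start - context):start])
        rights.append(document[end:end + context])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("samples share no common delimiters")
    return left, right


def apply_wrapper(document, wrapper):
    """Extract every segment enclosed by the learned delimiters."""
    left, right = wrapper
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, document, flags=re.DOTALL)


# Hypothetical page: the user marks two names, the wrapper finds the third.
page = (
    "<ul><li><b>Ada</b>, London</li>"
    "<li><b>Alan</b>, Manchester</li>"
    "<li><b>Grace</b>, New York</li></ul>"
)
wrapper = learn_wrapper(page, ["Ada", "Alan"])
names = apply_wrapper(page, wrapper)
```

With only "Ada" and "Alan" marked, the learned delimiters also match "Grace", which is the kind of extension to similar items described above. A sketch this simple breaks as soon as the delimiters vary between items, which is part of why richer wrapper classes exist.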

Facial recognition helps mom and dad see kids' camp photos, raises privacy concerns for some

USATODAY - Tech Top Stories

A photo from a summer camp, posted to the camp's website so parents can view it. Venture capital-backed Waldo Photos has been selling a service that identifies specific children in the flood of photos that many sleep-away camps provide to parents daily. Camps working with the Austin, Texas-based company give parents a private code to sign up. When the camp uploads photos taken during activities to its website, Waldo's facial recognition software scans them for matches against the parent-provided headshots. Once it finds a match, the Waldo system (as in "Where's Waldo?") automatically texts the photos to the child's parents.

China tech firms bypassing privacy concerns to apply facial recognition


China's technology firms are rushing to put facial recognition technology to commercial use, bypassing the privacy concerns that have slowed the rollout of the technology in Western markets, according to a report by The Financial Times. People in China are arguably less concerned about privacy rights violations than people in Western countries, as they are accustomed to having their faces scanned to conduct daily tasks such as making payments and accessing residential blocks, student dormitories and hotels. In addition, Chinese citizens are required to swipe their ID cards in chip readers to activate a mobile phone account, purchase a train ticket or check into a hotel. Ant Financial, the online payments division of ecommerce group Alibaba, allows users to take a selfie to access their online wallets, while China Construction Bank offers a similar service for customers at ATMs. Car-hailing service Didi Chuxing is using the technology to verify drivers' identities, while search engine Baidu has developed facial recognition-enabled entry for its offices and paid events.

A Direct Evolutionary Feature Extraction Algorithm for Classifying High Dimensional Data

AAAI Conferences

Among various feature extraction algorithms, those based on genetic algorithms are promising owing to their potential parallelizability and possible applications in large-scale and high-dimensional data classification. However, existing genetic-algorithm-based feature extraction algorithms are either limited in their search for optimal projection basis vectors or costly in both time and space complexity, and thus are not directly applicable to high-dimensional data. In this paper, a direct evolutionary feature extraction algorithm is proposed for classifying high-dimensional data. It constructs projection basis vectors using linear combinations of the basis of the search space and the technique of orthogonal complement. It also constrains the search space when seeking the optimal projection basis vectors. It evaluates individuals according to their classification performance on a subset of the training samples and the generalization ability of the projection basis vectors they represent. We compared the proposed algorithm with some representative feature extraction algorithms in face recognition, including the evolutionary pursuit algorithm, Eigenfaces, and Fisherfaces. The results on the widely used Yale and ORL face databases show that the proposed algorithm achieves excellent classification performance while reducing the space complexity by an order of magnitude.
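The construction step can be illustrated with a short sketch. It assumes that each individual supplies one row of coefficients per projection vector, that a vector is formed as a linear combination of the search-space basis, and that each new vector is restricted to the orthogonal complement of the ones already built, using Gram-Schmidt-style projection. The abstract does not give the paper's encoding or exact procedure, so every name and detail below is an assumption.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def combine(coefficients, basis):
    """Linear combination of the basis vectors with the given coefficients."""
    dim = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coefficients, basis)) for i in range(dim)]


def project_out(vector, orthonormal_vectors):
    """Remove the components of `vector` lying along the given orthonormal vectors."""
    result = list(vector)
    for q in orthonormal_vectors:
        scale = dot(result, q)
        result = [r - scale * qi for r, qi in zip(result, q)]
    return result


def normalize(vector, tol=1e-12):
    """Scale a vector to unit length, rejecting (near-)zero vectors."""
    norm = math.sqrt(dot(vector, vector))
    if norm < tol:
        raise ValueError("vector lies in the span of the earlier vectors")
    return [x / norm for x in vector]


def build_projection_vectors(coefficient_rows, basis):
    """Build orthonormal projection vectors from rows of combination coefficients."""
    vectors = []
    for coefficients in coefficient_rows:
        candidate = combine(coefficients, basis)
        # Restrict the candidate to the orthogonal complement of earlier vectors.
        candidate = project_out(candidate, vectors)
        vectors.append(normalize(candidate))
    return vectors


# Hypothetical individual: two coefficient rows over the standard basis of R^3.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
genes = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
projection = build_projection_vectors(genes, basis)
```

In an evolutionary loop, `genes` would be what selection, crossover, and mutation act on, and the fitness evaluation described above would score the resulting `projection`. That loop is omitted here.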