Human Performance on Clustering Web Pages: A Preliminary Study

AAAI Conferences

With the increase in information on the World Wide Web, it has become difficult to quickly find desired information without using multiple queries or a topic-specific search engine. One way to help in the search is to group together HTML pages that appear in some way to be related. To better understand this task, we performed an initial study of human clustering of web pages, in the hope that it would provide some insight into the difficulty of automating the task. Our results show that subjects did not cluster identically; in fact, on average, any two subjects had little similarity in their web page clusters. We also found that subjects generally created rather small clusters, and those with access only to URLs created fewer clusters than those with access to the full text of each web page. Generally, the overlap of documents between clusters for any given subject increased when subjects were given the full text, as did the percentage of documents clustered. When analyzing individual subjects, we found that each behaved differently across queries in terms of overlap, size of clusters, and number of clusters. These results provide a sobering note on any quest for a single clearly correct clustering method for web pages.
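The abstract does not say how similarity between two subjects' clusterings or overlap within one subject's clusters was measured. As a minimal sketch only, assuming a Jaccard score over co-clustered document pairs and a simple "appears in more than one cluster" overlap ratio (both are illustrative choices, not the study's metrics), the comparison could look like this in Python. The function names and the sample page identifiers are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered document pairs that share at least one cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes each pair canonical, so (a, b) and (b, a) are not both stored.
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def clustering_similarity(clusters_a, clusters_b):
    """Jaccard similarity over co-clustered pairs (an assumed metric, not the study's)."""
    pairs_a = co_clustered_pairs(clusters_a)
    pairs_b = co_clustered_pairs(clusters_b)
    union = pairs_a | pairs_b
    if not union:
        return 1.0  # neither subject grouped any two documents together
    return len(pairs_a & pairs_b) / len(union)


def cluster_overlap(clusters):
    """Fraction of clustered documents that appear in more than one cluster."""
    counts = {}
    for cluster in clusters:
        for doc in set(cluster):
            counts[doc] = counts.get(doc, 0) + 1
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c > 1) / len(counts)


# Hypothetical data: two subjects clustering the same result pages.
subject_1 = [["p1", "p2", "p3"], ["p3", "p4"]]
subject_2 = [["p1", "p2"], ["p4", "p5"]]

print(clustering_similarity(subject_1, subject_2))
print(cluster_overlap(subject_1))
```

A pair-based score tolerates overlapping clusters and unclustered documents, both of which the abstract reports, which is why it is used here instead of a measure that assumes a strict partition.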

Clearview AI, which has facial recognition database of 3 billion images, faces data theft

USATODAY - Tech Top Stories

Facial recognition software firm Clearview AI, which has been criticized for scraping together a database of as many as 3 billion online images, has been hit with a data breach. The New York-based firm apparently had its list of customers, including numerous law enforcement agencies, stolen, according to The Daily Beast, which first reported the incident. The news site reported it had obtained a notice sent to Clearview's customers that an intruder had "gained unauthorized access" to its customer list, the number of searches customers have conducted and other data. Clearview said in the notice that the company's servers were not breached and that there was "no compromise of Clearview's systems or network." However, Clearview's attorney Tor Ekeland said, in a statement sent to USA TODAY, "Security is Clearview's top priority. Unfortunately, data breaches are part of life in the 21st century. Our servers were never accessed. We patched the flaw, and continue to work to strengthen our security."

Syntactic Folding and its Application to the Information Extraction from Web Pages

The focus is on folding principles and their influence on the recognition of certain data in a document undergoing extraction. Introduction: The topic of our work is information extraction from the Internet. Several approaches deal with the problem of recognizing structural data in semistructured documents in order to retrieve user-specified information from these and from similar documents (possibly of the same source) in an automatic or semi-automatic way (Freitag 1996; Soderland 1997; Kushmerick 1997). Ideally, structural information should be learned by presenting a learning device only with samples of the text segments that a user wants to extract from these pages, without any need to specify how the desired samples can be located within the document. The learning device should generate a procedure, a wrapper, that, reading the same documents, outputs a collection of information that includes the samples and, ideally, extends them by finding similar items. These approaches led to a variety of wrapper classes, e.g.
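To make the wrapper idea concrete, here is a minimal sketch of a left/right delimiter wrapper learned from user-supplied samples. It is not the folding method this paper describes, which the excerpt does not specify; it only illustrates the general "learn from samples, then extract similar items" loop. The function names, the sample document, and the assumption that each sample's first occurrence is the intended one are all hypothetical.

```python
import os
import re


def learn_delimiters(document, samples):
    """Learn the common text immediately before and after each sample."""
    lefts, rights = [], []
    for sample in samples:
        start = document.find(sample)  # assumption: the first occurrence is the intended one
        if start < 0:
            raise ValueError(f"sample not found in document: {sample!r}")
        end = start + len(sample)
        # Reverse the preceding text so a common suffix becomes a common prefix.
        lefts.append(document[:start][::-1])
        rights.append(document[end:])
    left = os.path.commonprefix(lefts)[::-1]
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("samples share no common delimiter on one side")
    return left, right


def apply_wrapper(document, left, right):
    """Extract every segment enclosed by the learned delimiters."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, document, flags=re.DOTALL)


# Hypothetical page: the user marks two names, and the wrapper finds the third.
document = (
    "<ul><li><b>Alice</b> 555-1234</li>"
    "<li><b>Bob</b> 416-9876</li>"
    "<li><b>Carol</b> 212-0000</li></ul>"
)
left, right = learn_delimiters(document, ["Alice", "Bob"])
print(apply_wrapper(document, left, right))
```

Learned delimiters are only as general as the samples allow: if both samples happened to be followed by the same phone prefix, that prefix would become part of the right delimiter and the wrapper would miss items that differ there.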

Plan for massive facial recognition database sparks privacy concerns

The Guardian

If you've had a driver's licence photo or passport photo taken in Australia in the past few years, it's likely your face will end up in a massive new national network the federal government is trying to create. Victoria and Tasmania have already begun to upload driver's licence details to state databases that will eventually be linked to a future national one. Legislation before federal parliament will allow government agencies and private businesses to access facial IDs held by state and territory traffic authorities, and passport photos held by the foreign affairs department. The justification for what would be the most significant compulsory collection of personal data since My Health Record is cracking down on identity fraud. The home affairs department estimates that the annual cost of ID fraud is $2.2bn, and says introducing a facial component to the government's document verification service would help prevent it.

Facial recognition helps mom and dad see kids' camp photos, raises privacy concerns for some

USATODAY - Tech Top Stories

Venture capital-backed Waldo Photos has been selling the service to identify specific children in the flood of photos provided daily to parents by many sleep-away camps. Camps working with the Austin, Texas-based company give parents a private code to sign up. When the camp uploads photos taken during activities to its website, Waldo's facial recognition software scans for matches in the parent-provided headshots. Once it finds a match, the Waldo system (as in "Where's Waldo?") then automatically texts the photos to the child's parents.