When it comes to search, there's Google and there's everyone else -- the company is basically synonymous with searching the internet. But Omnity, a relatively new company from San Francisco, thinks its own search, which is based on "semantic mapping," offers something that Google can't. Omnity's trick is that it looks for connections between documents on the internet based on rare words -- the theory being that pieces of research that share several of the same rare words will likely be about related topics, even if they don't directly link to or cite each other. Thus far, Omnity has operated primarily by selling enterprise plans to companies and educational institutions. Omnity can search not only all of the public datasets it scans (like patents, scientific, engineering and medical documents, clinical trials, case law, SEC filings and so forth) but also a company's internal documents -- for some companies, Omnity indexes 150 petabytes of data.
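To make the rare-word idea concrete, here is a minimal Python sketch. It is not Omnity's algorithm, which the article does not describe in detail; the function name `rare_word_links`, the thresholds `max_doc_freq` and `min_shared`, and the three sample documents are all assumptions made up for illustration. The sketch treats a word as "rare" when it appears in only a few documents, then links any two documents that share enough of those words.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rare_word_links(docs, max_doc_freq=2, min_shared=2):
    """Link documents that share rare words (hypothetical helper).

    docs: dict mapping a document name to its text.
    A word counts as rare if it occurs in at most max_doc_freq documents.
    Returns a list of (name_a, name_b, shared_rare_words) tuples.
    """
    vocab = {name: tokenize(text) for name, text in docs.items()}
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in vocab.values() for word in words)
    rare = {word for word, df in doc_freq.items() if df <= max_doc_freq}

    links = []
    for a, b in combinations(sorted(vocab), 2):
        shared = vocab[a] & vocab[b] & rare
        if len(shared) >= min_shared:
            links.append((a, b, sorted(shared)))
    return links


# Made-up sample documents; none of them cites or links to another.
docs = {
    "patent": "A graphene membrane for desalination of seawater",
    "paper": "Desalination performance of a graphene oxide membrane",
    "filing": "Quarterly revenue of a shipping company",
}
links = rare_word_links(docs)
print(links)
```

In this toy corpus the patent and the paper are connected through "desalination," "graphene" and "membrane," while common words such as "a" and "of," which appear in every document, are ignored. A real system would presumably need far more careful weighting and much larger vocabularies.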
Search engines that aren't Google rarely have much of interest to offer the average consumer. But Omnity, a new search engine aimed at researchers -- or even just students doing their homework -- offers some glimmers of something new that make it worth noticing. Search, as we know it, is ripe for some sort of change, after all. Google is certainly working to bake search more fully into our cars, phones and other devices. Specialized search engines -- for flights, places to stay, even .gifs
By 2020, the market for machine-learning applications will reach $40 billion, per IDC. The next time you see Democratic presidential nominee Hillary Clinton with an unflattering look on her face in a TV spot supporting GOP rival Donald Trump, it's all but certain you can attribute the ad creative to artificial intelligence. The Republican National Committee is using machine-learning software from Veritone, a 2-year-old player that just secured $50 million in funding. Designed to work with laser-fast precision, its audio-based system lets the RNC zip through all the publicly available recordings of Clinton speaking on TV, radio or online video to scoop up her angriest or oddest moments. The company is about to add a visual-sentiment feature, which will zero in on facial expressions and make cringe-worthy moments even easier to find.
In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example, we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and that both sets of generated lists significantly outperform the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative.
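As a rough illustration of the frequent-word-list baseline mentioned above, the following Python sketch attributes a text to an author by comparing relative frequencies of the most frequent corpus words. It does not implement the paper's extraction method, and the classifier (cosine similarity against each author's profile) is an assumption, since the abstract does not say which classifier was used. The names `frequent_word_list`, `profile`, and `attribute`, as well as the sample texts, are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequent_word_list(texts, size):
    """Return the `size` most frequent words across all texts."""
    counts = Counter(tok for text in texts for tok in tokens(text))
    return [word for word, _ in counts.most_common(size)]


def profile(text, word_list):
    """Relative frequency of each listed word in the text."""
    toks = tokens(text)
    counts = Counter(toks)
    total = len(toks) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in word_list]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(unknown, known, word_list):
    """Return the author whose profile is most similar to the unknown text."""
    target = profile(unknown, word_list)
    return max(
        known,
        key=lambda author: cosine(target, profile(known[author], word_list)),
    )


# Made-up training texts, one per author.
known = {
    "alice": "the cat and the dog and the bird",
    "bob": "a cat or a dog or a bird",
}
word_list = frequent_word_list(known.values(), 4)
guess = attribute("the fish and the frog", known, word_list)
print(guess)
```

Swapping `word_list` for an expert-created function word list, or for a list produced by some other extraction method, leaves the rest of the pipeline unchanged, which is what makes this kind of comparison between lists straightforward to set up.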