Internet Archive
Publishers are blocking the Internet Archive for fear AI scrapers can use it as a workaround
The Internet Archive has often been a valuable resource for journalists, from finding records of deleted tweets to providing academic texts for background research. However, the advent of AI has created a new tension between the parties. A few major publications have begun blocking the nonprofit digital library's access to their content, based on concerns that AI companies' bots are using the Internet Archive's collections to indirectly scrape their articles. A lot of these AI businesses are looking for readily available, structured databases of content, said Robert Hahn, head of business affairs and licensing. "The Internet Archive's API would have been an obvious place to plug their own machines into and suck out the IP."
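The "structured database" worry is concrete: the Wayback Machine exposes a public CDX search API that can enumerate every archived capture under a publisher's domain. A minimal sketch of how a bot might use it, assuming the documented CDX endpoint and a placeholder domain:

```python
# Sketch of bulk-enumerating archived pages through the Wayback Machine's
# public CDX API. The endpoint and query parameters follow the documented
# CDX interface; "example.com" is a placeholder domain, not a real target.
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain: str, limit: int = 100) -> str:
    """Build a CDX query listing archived captures under a domain."""
    params = {
        "url": f"{domain}/*",        # match every capture under the domain
        "output": "json",            # structured rows instead of plain text
        "filter": "statuscode:200",  # only successful captures
        "collapse": "urlkey",        # one row per unique URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urllib.parse.urlencode(params)}"

def list_captures(domain: str, limit: int = 100):
    """Fetch capture rows: the first row is a header, the rest are records."""
    with urllib.request.urlopen(cdx_query_url(domain, limit)) as resp:
        rows = json.load(resp)
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]

# Example (hits the live API, so shown as a comment):
# for cap in list_captures("example.com", limit=5):
#     print(cap["timestamp"], cap["original"])
```

Each record carries a timestamp and the original URL, which is exactly the kind of ready-made index of a publisher's content that makes the API attractive for bulk collection.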
The Morning After: Nintendo's latest hardware is not the Switch 2
We've been waiting and waiting, and Nintendo finally did the right thing and announced an entirely new piece of hardware. Alas, it's not a new console but a very Nintendo-looking smart alarm clock. The Alarmo has motion sensors that let you snooze it based on your movement. You'll also be able to check how much you move around while you sleep, and the clock has sleeping sounds and music to drift off to. You can set the clock's background with scenes inspired by the likes of Super Mario Odyssey, The Legend of Zelda: Breath of the Wild, Splatoon 3, Pikmin 4 and, er, Ring Fit Adventure.
- North America > United States > California > San Francisco County > San Francisco (0.06)
- North America > Canada (0.06)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Information Technology (0.74)
- Information Technology > Artificial Intelligence (0.98)
- Information Technology > Communications > Networks > Sensor Networks (0.57)
The grassroots push to digitize India's most precious documents
"Getting access to many of our public libraries is so difficult, and after a point people will give up asking for access. That's the case in many of our public-funded educational institutes too," says Arul George Scaria, an associate professor at the National Law School of India University Bengaluru, who studies intellectual-property law. One of the best ways to liberate access to these libraries, he says, is through digitization. Technologist Omshivaprakash H L felt the acute lack of such resources when he needed references for writing Wikipedia articles in Kannada, a southwestern Indian language. Around 2019, he heard that Carl Malamud, who runs Public Resource, a registered US charity, was already archiving books like Gandhi's Hind Swaraj collection on Indian self-rule and works of the Indian government in the public domain.
- Law > Intellectual Property & Technology Law (0.57)
- Education > Educational Setting > Higher Education (0.57)
- Education > Curriculum > Subject-Specific Education (0.57)
Exploring Large Language Models for Classical Philology
Riemenschneider, Frederick; Frank, Anette
Recent advances in NLP have led to the creation of powerful language models for many languages including Ancient Greek and Latin. While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English. We evaluate all models on morphological and syntactic tasks, including lemmatization, which demonstrates the added value of T5's decoding abilities. We further define two probing tasks to investigate the knowledge acquired by models pre-trained on Classical texts. Our experiments provide the first benchmarking analysis of existing models of Ancient Greek. Results show that our models provide significant improvements over the SoTA. The systematic analysis of model types can inform future research in designing language models for Classical languages, including the development of novel generative tasks. We make all our models available as community resources, along with a large curated pre-training corpus for Ancient Greek, to support the creation of a larger, comparable model zoo for Classical Philology. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (7 more...)
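Among the morphological tasks the paper evaluates, lemmatization is the one where T5's decoding abilities pay off. A minimal sketch of the standard token-level accuracy metric for such an evaluation; the Greek examples are illustrative, not drawn from the paper's test sets:

```python
# Token-level lemmatization accuracy of the kind used to compare
# encoder-only and encoder-decoder models on Ancient Greek.

def lemma_accuracy(gold, predicted):
    """Fraction of tokens whose predicted lemma matches the gold lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token-for-token")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Hypothetical gold lemmas and model predictions; the last one is wrong
# (the model emitted an inflected form instead of the dictionary lemma).
gold_lemmas      = ["λόγος", "εἰμί", "ἄνθρωπος"]
predicted_lemmas = ["λόγος", "εἰμί", "ἄνθρωποι"]
print(round(lemma_accuracy(gold_lemmas, predicted_lemmas), 2))  # → 0.67
```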
🇺🇸 Machine learning job: Senior AI Research Scientist at Earth Species Project (work from anywhere!)
Senior AI Research Scientist at Earth Species Project Remote › Worldwide, 100% remote position (Posted Aug 3 2022) Job description The Earth Species Project (ESP) is a nonprofit organization dedicated to decoding animal communication and translating non-human language. ESP partners with biologists and machine learning researchers at universities and institutions around the world and we are honored to be supported by many forward-looking philanthropists and groups, including the Internet Archive, TED Audacious 2020, and the entrepreneur and author Reid Hoffman. Our work has been featured on NPR's Invisibilia documentary, "Two Heart Beats a Minute," "How to Talk to Animals" in Wall Street Journal's The Future of Everything, "The Challenges of Animal Translation" in the New Yorker, published in Scientific Reports, and was honored at the inaugural Anthem Awards. We aim to enable every person to more deeply understand our co-inhabitants on Earth and in doing so, to permanently alter human perspective and culture. Purpose of role You will join an incredible and global remote team, and will be responsible for developing pioneering research towards decoding and translating non-human communication, including extending unsupervised translation techniques and tackling cornerstone biological and computational problems on large-scale multimodal behavioral datasets.
- North America > United States > New York (0.25)
- North America > United States > Massachusetts (0.05)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.05)
- (3 more...)
- Media (0.50)
- Leisure & Entertainment (0.49)
- Law (0.30)
Documenting the English Colossal Clean Crawled Corpus
Dodge, Jesse; Sap, Maarten; Marasović, Ana; Agnew, William; Ilharco, Gabriel; Groeneveld, Dirk; Gardner, Matt
As language models are trained on ever more text, researchers are turning to some of the largest corpora available. Unlike most other types of datasets in NLP, large unlabeled text corpora are often presented with minimal documentation, and best practices for documenting them have not been established. In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin with a high-level summary of the data, including distributions of where the text came from and when it was written. We then give a more detailed analysis of salient parts of this data, including the most frequent sources of text (e.g., patents.google.com, which contains a significant percentage of machine-translated and/or OCR'd text), the effect the filters had on the data (they disproportionately remove text in African American English, AAE), and evidence that examples from other benchmark NLP datasets are contained in the text. We release a web interface to an interactive, indexed copy of this dataset, encouraging the community to continuously explore and report additional findings.
- Asia (1.00)
- Europe (0.93)
- North America > United States > Minnesota (0.28)
- Information Technology > Communications (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)
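The "set of filters" C4 applied to Common Crawl is heuristic: keep lines that end in terminal punctuation and have enough words, drop boilerplate and code-like pages, and drop pages with too few surviving sentences. A simplified sketch of those published rules; the thresholds mirror Raffel et al.'s description but this is not the actual pipeline code:

```python
# Simplified C4-style page cleaning: line-level heuristics followed by a
# page-level minimum, approximating the rules described for the corpus.
from typing import Optional

TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_page(text: str) -> Optional[str]:
    """Return cleaned page text, or None if the whole page is dropped."""
    lowered = text.lower()
    # Page-level drops: placeholder boilerplate and code-like content.
    if "lorem ipsum" in lowered or "{" in text:
        return None
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT)  # terminal-punctuation rule
        and len(line.split()) >= 5                # minimum-word rule
    ]
    if len(kept) < 3:                             # minimum-sentence rule (approx.)
        return None
    return "\n".join(kept)

page = (
    "Click here to subscribe\n"
    "This is the first proper sentence on the page.\n"
    "Here is another complete sentence of body text.\n"
    "And a third full sentence finishes the sample.\n"
)
print(clean_page(page))  # keeps only the three full sentences
```

The paper's finding that these filters disproportionately remove AAE text is a direct consequence of this design: purely surface-level rules about punctuation and line shape encode assumptions about what "clean" English looks like.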
The Internet Archive now has better scans of Computerworld magazine
The Internet Archive announced that it has made Computerworld's print issues from 1967 to 2014 available online in better quality than ever. The publication provides a historical record of sorts for computing, as it launched during a key period in the industry and documented significant events like a 75-minute Steve Jobs interview in 1995. Much of Computerworld was already available on the Internet Archive, thanks to cooperation from publisher IDG and its Patrick J. McGovern Foundation. However, those issues were digitized from poor-quality microfilms that "did not look very good," according to the Internet Archive's blog. Luckily, some better-quality microfilm collections were acquired and donated to be digitized and preserved on film. The Internet Archive worked with open-source communities and NextScan to improve the digitization by "adjusting for faded film, straightening the pages, performing optical character recognition" and more.
Stanford Launches AI-powered TV News Analyzer
Cable TV news is a primary source of information for millions of Americans each day. The people who appear on cable TV news and the topics they talk about shape public opinion and culture. While many newsrooms and monitoring organizations routinely audit the content of news broadcasts, these efforts typically involve manually counting who and what is on the air. Now researchers at the Brown Institute for Media Innovation at Stanford University have launched the Stanford Cable TV News Analyzer, an interactive tool that gives the public the ability to not only search transcripts but also compute the screen time of public figures in near-24/7 TV news broadcasts from CNN, Fox News and MSNBC dating back to January 2010. The site is updated daily with the previous day's coverage and enables searches of over 270,000 hours of news footage. The Analyzer increases transparency around these broadcasters' editorial decisions by using modern AI techniques to automatically measure who is on the news and what they talk about.
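"Computing screen time" reduces to a simple aggregation once per-frame face identifications exist: count the sampled frames in which each person appears and divide by the sampling rate. A minimal sketch with hypothetical detection data; the real Analyzer runs face detection and recognition over broadcast video to produce these labels:

```python
# Turning per-frame face identifications into screen-time totals, the kind
# of aggregate the Stanford Cable TV News Analyzer exposes. The frame
# labels below are illustrative, not real detections.
from collections import Counter

def screen_time_seconds(frame_labels, fps=1.0):
    """frame_labels: per-frame sets of person IDs, sampled at `fps` frames/sec.
    Returns seconds of on-screen time per person."""
    counts = Counter()
    for faces in frame_labels:
        counts.update(faces)  # each appearance adds one sampled frame
    return {person: n / fps for person, n in counts.items()}

# Three frames sampled at 1 frame/sec: the anchor is on screen throughout,
# the guest only in the last two frames.
frames = [{"anchor"}, {"anchor", "guest"}, {"anchor", "guest"}]
print(screen_time_seconds(frames))  # anchor: 3.0 s, guest: 2.0 s
```

At the Analyzer's scale the same computation runs over roughly a decade of three networks' footage, which is why the manual counting that newsrooms traditionally did does not scale to it.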