 Educational Setting


The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Neural Information Processing Systems

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb.


Task-Free Continual Learning via Online Discrepancy Distance Learning

Neural Information Processing Systems

Learning from non-stationary data streams, also called Task-Free Continual Learning (TFCL), remains challenging due to the absence of explicit task information in most applications. Although some algorithms have recently been proposed for TFCL, these methods lack theoretical guarantees. Moreover, there are no theoretical studies about forgetting during TFCL. This paper develops a new theoretical analysis framework that derives generalization bounds based on the discrepancy distance between the visited samples and the entire information made available for training the model. This analysis provides new insights into the forgetting behaviour in classification tasks. Inspired by this theoretical model, we propose a new approach enabled with the dynamic component expansion mechanism for a mixture model, namely Online Discrepancy Distance Learning (ODDL). ODDL estimates the discrepancy between the current memory and the already accumulated knowledge as an expansion signal, aiming to ensure a compact network architecture with optimal performance. We then propose a new sample selection approach that selectively stores samples in the memory buffer through the discrepancy-based measure, further improving the performance. We perform several TFCL experiments with the proposed methodology, which demonstrate that the proposed approach achieves state-of-the-art performance.


Chicago paper publishes AI-generated 'summer reading list' with books that don't exist

FOX News

The Chicago Sun-Times admitted on Tuesday that it published an AI-generated list of books that don't exist for its summer reading list. On Sunday, the publication released a special 64-page section titled "Heat Index: Your Guide to the Best of Summer" which featured a list of 15 recommended books for summer. However, upon closer inspection, it was found that 10 of the 15 books on the list were not real. One example included a book called "Nightshade Market" by Min Jin Lee, which was described as a "riveting tale set in Seoul's underground economy" and follows "three women whose paths intersect in an illegal night market" exploring "class, gender and the shadow economies beneath prosperous societies."


Interview with Gillian Hadfield: Normative infrastructure for AI alignment

AIHub

During the 33rd International Joint Conference on Artificial Intelligence (IJCAI), held in Jeju, I had the opportunity to meet with one of the keynote speakers, Gillian Hadfield. We spoke about her interdisciplinary research, career trajectory, path into AI alignment, law, and general thoughts on AI systems. Transcript: Note: the transcript has been lightly edited for clarity. This is an interview with Professor Gillian Hadfield who was a keynote speaker at IJCAI 2024. She gave a very insightful talk about normative infrastructures and how they can guide our search for AI alignment. Kumar Kshitij Patel (KKP): Could you talk a bit about your background and career trajectory? I want our readers to understand how much interdisciplinary work you've done over the years. Gillian Hadfield (GH): I did a PhD in economics and a law degree, a JD, at Stanford, originally motivated by wanting to think about the big questions about the world. So I read John Rawls' theory of justice when I was an undergraduate, and those are the big questions: how do we organize the world and just institutions, but I was very interested in using more formal methods and social scientific approaches. That's why I decided to do that joint degree. So, this is in the 1980s, and in the early days of starting to use a lot of game theory. I studied information theory, a student of Ken Arrow and Paul Milgrom at the economics department at Stanford. I did work on contract theory, bargaining theory, but I was still very interested in going to law school, not to practice law, but to learn about legal institutions and how those work. I was a member of this emerging area of law and economics early in my career, which of course, was interdisciplinary, using economics to think about law and legal institutions.


Feature-fortified Unrestricted Graph Alignment

Neural Information Processing Systems

The need to align two graphs, minimizing a structural distance metric, is prevalent in biology, chemistry, recommender systems, and social network analysis. Due to the problem's NP-hardness, prevailing graph alignment methods follow a modular and mediated approach, solving the problem restricted to the domain of intermediary graph representations or products like embeddings, spectra, and graph signals. Restricting the problem to this intermediate space may distort the original problem, so these methods are predisposed to miss high-quality solutions.


Learning-to-learn non-convex piecewise-Lipschitz functions

Neural Information Processing Systems

We analyze the meta-learning of the initialization and step-size of learning algorithms for piecewise-Lipschitz functions, a non-convex setting with applications to both machine learning and algorithms. Starting from recent regret bounds for the exponential forecaster on losses with dispersed discontinuities, we generalize them to be initialization-dependent and then use this result to propose a practical meta-learning procedure that learns both the initialization and the step-size of the algorithm from multiple online learning tasks. Asymptotically, we guarantee that the average regret across tasks scales with a natural notion of task-similarity that measures the amount of overlap between near-optimal regions of different tasks.


I'm a college professor. My advice to young people who feel hooked on tech

Mashable

When I was a child, computers were a fixture in my home, from the giant Atari on which I learned my ABCs, to the Commodore Amiga that my dad used for his videography business, to the PC towers that facilitated my first forays onto the internet. But tech was still a niche hobby back then. Even in college in the late 1990s and early 2000s, many of my friends got by just fine without computers. For people in college now--namely, my students--things are decidedly different. Gadgets are everywhere, and are increasingly designed to insert themselves into every aspect of our consciousness, colonizing every spare moment of our time and attention.


Jan P. Bauer

Neural Information Processing Systems

Exp. Psychology, Oxford; ELSC, HebrewU; Department of Computing; Brain Mind Institute, EPFL; Gatsby Unit, UCL; Imperial College London. Andrew M. Saxe, Christopher Summerfield, Ali Hummos


My Coworkers Keep Taking This Stupid Shortcut. I Am Filled With Rage.

Slate

Good Job is Slate's advice column on work. Have a workplace problem big or small? I am a hard-line hater of generative AI (ChatGPT, Midjourney, etc.). I think it's bad for the environment and bad for society. It burns water resources, exploits workers in the global south, plagiarizes art and writing, and eliminates badly needed entry-level jobs.


40 of the best MIT courses you can take online for free

Mashable

There's always a catch: These free courses do not come with a shareable certificate of completion or graded assignments/exams. But you can start learning at a pace that suits you, so there's really nothing stopping you from enrolling. Find the best free online courses from MIT on edX.