Sulawesi
World's oldest-known rock art found in Indonesian cave
The claw-like drawing of a human hand is roughly 67,800 years old. A drawing of a claw-like hand on the wall of a cave in Sulawesi, Indonesia, is now the oldest known rock art in the world. The roughly 67,800-year-old art exceeds the previous record holder in the same region of Southeast Asia by 15,000 years or more. The drawing is detailed in a study published today in the journal, and helps fill in the archaeological timeline of how and when Australia was first settled.
- Asia > Indonesia > Sulawesi (0.31)
- Asia > Southeast Asia (0.25)
- North America (0.15)
- (5 more...)
Indonesian rescuers find wreckage of plane that had 11 people on board
Indonesian rescuers have recovered wreckage from a missing plane that is believed to have crashed with 11 people on board while approaching a mountainous region on Sulawesi island during cloudy conditions. The discovery on Sunday comes after the small plane - on its way from Yogyakarta on Indonesia's main island of Java to Makassar, the capital city of South Sulawesi province - vanished from radar on Saturday. Rescuers on the ground then retrieved larger debris consistent with the main fuselage and tail scattered on a steep northern slope, Anwar told a news conference. "The discovery of the aircraft's main sections significantly narrows the search zone and offers a crucial clue for tightening the search area," Anwar said. "Our joint search and rescue teams are now focusing on searching for the victims, especially those who might still be alive." The plane, a turboprop ATR 42-500, was operated by Indonesia Air Transport and was last tracked in the Leang-Leang area of Maros, a mountainous district of South Sulawesi province.
- North America > United States (0.53)
- South America (0.42)
- North America > Central America (0.42)
- (10 more...)
- Transportation > Air (1.00)
- Transportation > Passenger (0.93)
Pigs have been island hopping for 50,000 years
With human help, the mammals can defy 'the world's most fundamental natural boundaries.' Despite not exactly being world-renowned swimmers, pigs have spread across the Asia-Pacific region for thousands of years. With the genetic and archeological data from over 700 pigs, a team of scientists documented how people helped the mammals make their way across thousands of miles. "This research reveals what happens when people transport animals enormous distances, across one of the world's most fundamental natural boundaries," evolutionary geneticist and study co-author Dr. David Stanton of the University of Cardiff and Queen Mary University of London said in a statement. "These movements led to pigs with a melting pot of ancestries. These patterns were technically very difficult to disentangle, but have ultimately helped us understand how and why animals came to be distributed across the Pacific islands."
- Asia > Southeast Asia (0.06)
- Oceania > Vanuatu (0.05)
- South America > Brazil (0.05)
- (14 more...)
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Omnilingual ASR team, Keren, Gil, Kozhevnikov, Artyom, Meng, Yen, Ropers, Christophe, Setzler, Matthew, Wang, Skyler, Adebara, Ife, Auli, Michael, Balioglu, Can, Chan, Kevin, Cheng, Chierh, Chuang, Joe, Droof, Caley, Duppenthaler, Mark, Duquenne, Paul-Ambroise, Erben, Alexander, Gao, Cynthia, Gonzalez, Gabriel Mejia, Lyu, Kehan, Miglani, Sagar, Pratap, Vineel, Sadagopan, Kaushik Ram, Saleem, Safiyyah, Turkatenko, Arina, Ventayol-Boada, Albert, Yong, Zheng-Xin, Chung, Yu-An, Maillard, Jean, Moritz, Rashel, Mourachko, Alexandre, Williamson, Mary, Yates, Shireen
Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.
- North America > Canada > Alberta (0.14)
- Europe > Austria > Vienna (0.14)
- Africa > Sudan (0.14)
- (53 more...)
- Health & Medicine (1.00)
- Education (0.67)
- Information Technology (0.67)
Culture Cartography: Mapping the Landscape of Cultural Knowledge
Ziems, Caleb, Held, William, Yu, Jane, Goldberg, Amir, Grusky, David, Yang, Diyi
To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher's goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Africa > Nigeria > Ogun State > Abeokuta (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (26 more...)
A Appendix A.1 LangID Details
The complete list may be seen in Table 8. Here are a few general notes about these strings: 1. Based on their recommendations, we did the following: 1. zh, zh_Latn: This resulted in the special filters described below. URLs) the corpora were in languages different from the LangID predictions. This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.
- Oceania > Tonga (0.04)
- North America > United States (0.04)
- South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
- (24 more...)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Europe > Poland (0.04)
- (2 more...)
- Europe > Moldova (0.14)
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.05)
- Europe > Ukraine (0.04)
- (47 more...)
- Energy (1.00)
- Education (0.93)
- Information Technology (0.93)
- (4 more...)