ancient language
A lost ancient language may be hiding in plain sight
Clues are left behind in the ruins of the Mesoamerican megacity Teotihuacan. At the height of its power, the ancient Mesoamerican city of Teotihuacan, near present-day Mexico City, was home to over 125,000 inhabitants.
- North America > Mexico > Mexico City > Mexico City (0.26)
- Europe > Denmark > Capital Region > Copenhagen (0.06)
- Africa > Middle East > Egypt (0.05)
- Retail > Online (0.35)
- Transportation (0.31)
ParsiPy: NLP Toolkit for Historical Persian Texts in Python
Farsi, Farhan, Fazel, Parnian, Haghighi, Sepand, Sabouri, Sadra, Goshtasb, Farzaneh, Hajipour, Nadia, Asgari, Ehsaneddin, Sameti, Hossein
The study of historical languages presents unique challenges due to their complex orthographic systems, fragmentary textual evidence, and the absence of standardized digital representations of text in those languages. Tackling these challenges requires specialized NLP tools that can handle phonetic transcriptions and analyze ancient texts. This work introduces ParsiPy, an NLP toolkit designed to facilitate the analysis of historical Persian languages by offering modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding. We demonstrate the utility of our toolkit through the processing of Parsig (Middle Persian) texts, highlighting its potential for expanding computational methods in the study of historical languages. Through this work, we contribute to computational philology, offering tools that can be adapted for the broader study of ancient texts and their digital preservation.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > California (0.14)
- Europe > Bulgaria > Varna Province > Varna (0.05)
HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja
Song, Seyoung, Yoo, Haneul, Jin, Jiho, Cho, Kyunghyun, Oh, Alice
While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, translation requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions for three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also offers an interactive glossary that gives the character-level reading of Hanja characters in modern Korean, as well as character-level English definitions. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier to unexplored Korean historical documents.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Chen, Danlu, Shi, Freda, Agarwal, Aditi, Myerston, Jacobo, Berg-Kirkpatrick, Taylor
Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription -- this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.
- Africa > Middle East > Egypt (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey
Nowak, Krzysztof, Ziębura, Jędrzej, Wróbel, Krzysztof, Smywiński-Pohl, Aleksander
This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models' performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.
- Europe > Poland > Lesser Poland Province > Kraków (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Europe > Western Europe (0.04)
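The accuracy figures quoted in the eFontes abstract are the kind of score usually computed token by token against a gold annotation. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function name `token_accuracy`, the sample sentence, and the predicted tags are all hypothetical and chosen only for illustration.

```python
def token_accuracy(gold, predicted):
    """Share of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical example: "in principio erat verbum" with gold annotations.
gold_lemmas = ["in", "principium", "sum", "verbum"]
gold_upos = ["ADP", "NOUN", "AUX", "NOUN"]

# Hypothetical model output: lemmas all correct, one tag wrong (AUX vs. VERB).
pred_lemmas = ["in", "principium", "sum", "verbum"]
pred_upos = ["ADP", "NOUN", "VERB", "NOUN"]

lemma_acc = token_accuracy(gold_lemmas, pred_lemmas)
upos_acc = token_accuracy(gold_upos, pred_upos)
print(f"lemmatization: {lemma_acc:.2%}, POS tagging: {upos_acc:.2%}")
```

Scoring each annotation layer separately, as here, is one way lemmatization and part-of-speech tagging can end up with different accuracy rates on the same text.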
Deciphering Oracle Bone Language with Diffusion Models
Guan, Haisu, Yang, Huanxin, Wang, Xinyu, Han, Shengwei, Liu, Yongge, Jin, Lianwen, Bai, Xiang, Liu, Yuliang
Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD. Code and decipherment results will be made available at https://github.com/guanhaisu/OBSD.
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
- Research Report > New Finding (0.34)
Sampling the Swadesh List to Identify Similar Languages with Tree Spaces
Ordway, Garett, Patrangenaru, Vic
Communication plays a vital role in human interaction. Studying language is a worthwhile task and more recently has become quantitative in nature with developments of fields like quantitative comparative linguistics and lexicostatistics. With respect to the authors' own native languages, the ancestry of the English language and the Latin alphabet are of primary interest. The Indo-European Tree traces many modern languages back to the Proto-Indo-European root. Swadesh's cognates played a large role in developing that historical perspective, where some of the primary branches are Germanic, Celtic, Italic, and Balto-Slavic. This paper will use data analysis on open books where the simplest singular space is the 3-spider - a union T3 of three rays with their endpoints glued at a point 0 - which can represent these tree spaces for language clustering. These trees are built using a single linkage method for clustering based on distances between samples from languages which use the Latin script. Taking three languages at a time, the barycenter is determined. Some initial results have found both non-sticky and sticky sample means. If the mean exhibits non-sticky properties, then one language may come from a different ancestor than the other two. If the mean is considered sticky, then the languages may share a common ancestor or all languages may have different ancestry.
- North America > United States > Florida > Hillsborough County > University (0.04)
- Europe > Middle East (0.04)
- Asia > Middle East (0.04)
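The single linkage step mentioned in the Swadesh abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' method: the five-word sample, the normalized edit distance used as the language distance, and the names `SAMPLE`, `edit_distance`, `language_distance`, and `single_linkage` are all hypothetical. The tree-space and barycenter analysis from the abstract is not attempted here.

```python
from itertools import combinations

# Hypothetical sample of five Swadesh-style concepts
# (water, night, two, mother, new) in four Latin-script languages.
SAMPLE = {
    "English": ["water", "night", "two", "mother", "new"],
    "German": ["wasser", "nacht", "zwei", "mutter", "neu"],
    "Spanish": ["agua", "noche", "dos", "madre", "nuevo"],
    "Italian": ["acqua", "notte", "due", "madre", "nuovo"],
}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def language_distance(x, y):
    """Mean length-normalized edit distance over the shared concept list."""
    pairs = list(zip(SAMPLE[x], SAMPLE[y]))
    return sum(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)


def single_linkage(names):
    """Agglomerative clustering; cluster distance is the closest cross pair."""
    clusters = [frozenset([n]) for n in names]
    merges = []

    def link(pair):
        return min(language_distance(a, b) for a in pair[0] for b in pair[1])

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=link)
        merges.append((set(left), set(right), link((left, right))))
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
    return merges


for left, right, dist in single_linkage(list(SAMPLE)):
    print(sorted(left), "+", sorted(right), f"at distance {dist:.3f}")
```

The sequence of merges defines a tree over the languages. With a sample this small the result says little about real ancestry, and surface edit distance is only a crude stand-in for cognate judgments.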
The hype around artificial intelligence
For example, computers obtained the ability to play and win games against humans, such as world champions in chess. Now, AI can be divided into two main categories: functionality-based and capability-based. Functionality-based AI ranges from reactive machines, with limited responsiveness, to self-aware ones, where theoretically computers could understand human emotions. Capability-based AI ranges from artificial narrow intelligence, where narrowly defined performance tasks can be carried out, to artificial super intelligence, where computers can perform tasks better than humans. A key trait of AI is its ability to store and process large amounts of data.
Heaven's Vault: A Linguist's Buried Treasure
I climb the stairs, my faithful robot Six warning me not to proceed. Do I heed their warning and take a step back? I can see a tall pillar-like statue up ahead, peering at me over a flight of stairs--the prospect of deciphering another fragment of glyphs is motivating me to proceed through the thinning air. As a linguist and writer, Heaven's Vault is the game that I've been waiting a very long time for. It brings together the craft of compelling narrative games and a BAFTA-nominated interactive story presented in a rich, visual novel interface, taking players on a journey of imagination and exploration within an entrancing game environment.
MIT's New Algorithm That Can Decipher Ancient Languages
In a few of my previous articles, I have discussed how technology has helped us to better understand the history behind many lost cultures and civilizations. However, today I am excited to present you with a new stage within this technological trend that will help us unveil so many mysteries from ancient history. The Massachusetts Institute of Technology (MIT) has just created an algorithm that can decipher ancient languages without the input of any sort of data. This sort of new technological trend is known as machine learning and it can be simply defined as an algorithm that can teach itself. Engineers from MIT who have worked on this algorithm for some time state that they have perfected it in such a way that it can read ancient languages without any information about the culture of the language or the relations the language may have to other similar ancient languages.
- North America > United States > Massachusetts (0.26)
- Europe > Spain (0.06)