unicode
Compositional Sculpting of Iterative Generative Processes
High training costs of generative models and the need to fine-tune them for specific tasks have created a strong interest in model reuse and composition.A key challenge in composing iterative generative processes, such as GFlowNets and diffusion models, is that to realize the desired target distribution, all steps of the generative process need to be coordinated, and satisfy delicate balance conditions.In this work, we propose Compositional Sculpting: a general approach for defining compositions of iterative generative processes. We then introduce a method for sampling from these compositions built on classifier guidance.We showcase ways to accomplish compositional sculpting in both GFlowNets and diffusion models. We highlight two binary operations $\\unicode{x2014}$ the $\\textit{harmonic mean}\\unicode{x00A0}(p_1 \\otimes p_2$) and the $\\textit{contrast}\\unicode{x00A0}(p_1 \\,\\unicode{x25D1}\\,\\, p_2$) between pairs, and the generalization of these operations to multiple component distributions.We offer empirical results on image and molecular generation tasks.
Reasoning Over the Glyphs: Evaluation of LLM's Decipherment of Rare Scripts
Shih, Yu-Fei, Lin, Zheng-Lin, Hsieh, Shu-Kai
We explore the capabilities of LVLMs and LLMs in deciphering rare scripts not encoded in Unicode. We introduce a novel approach to construct a multimodal dataset of linguistic puzzles involving such scripts, utilizing a tokenization method for language glyphs. Our methods include the Picture Method for LVLMs and the Description Method for LLMs, enabling these models to tackle these challenges. We conduct experiments using prominent models, GPT-4o, Gemini, and Claude 3.5 Sonnet, on linguistic puzzles. Our findings reveal the strengths and limitations of current AI methods in linguistic decipherment, highlighting the impact of Unicode encoding on model performance and the challenges of modeling visual language tokens through descriptions. Our study advances understanding of AI's potential in linguistic decipherment and underscores the need for further research.
- Europe > Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Europe > Bulgaria > Sofia City Province > Sofia (0.04)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation
Salim, Saiful Islam, Yang, Rubin Yuchan, Cooper, Alexander, Ray, Suryashree, Debray, Saumya, Rahaman, Sazzadur
While Large language model (LLM)-based programming assistants such as CoPilot and ChatGPT can help improve the productivity of professional software developers, they can also facilitate cheating in introductory computer programming courses. Assuming instructors have limited control over the industrial-strength models, this paper investigates the baseline performance of 5 widely used LLMs on a collection of introductory programming problems, examines adversarial perturbations to degrade their performance, and describes the results of a user study aimed at understanding the efficacy of such perturbations in hindering actual code generation for introductory programming assignments. The user study suggests that i) perturbations combinedly reduced the average correctness score by 77%, ii) the drop in correctness caused by these perturbations was affected based on their detectability.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Oceania > Australia > Victoria > Melbourne (0.14)
- North America > United States > Arizona > Pima County > Tucson (0.14)
- (14 more...)
- Questionnaire & Opinion Survey (1.00)
- Instructional Material > Course Syllabus & Notes (0.93)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.87)
- Information Technology > Security & Privacy (1.00)
- Education (1.00)
- Law (0.93)
Compositional Sculpting of Iterative Generative Processes
High training costs of generative models and the need to fine-tune them for specific tasks have created a strong interest in model reuse and composition.A key challenge in composing iterative generative processes, such as GFlowNets and diffusion models, is that to realize the desired target distribution, all steps of the generative process need to be coordinated, and satisfy delicate balance conditions.In this work, we propose Compositional Sculpting: a general approach for defining compositions of iterative generative processes. We then introduce a method for sampling from these compositions built on classifier guidance.We showcase ways to accomplish compositional sculpting in both GFlowNets and diffusion models.
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Chen, Danlu, Shi, Freda, Agarwal, Aditi, Myerston, Jacobo, Berg-Kirkpatrick, Taylor
Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription -- this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.
- Africa > Middle East > Egypt (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (8 more...)
Tamil Language Computing: the Present and the Future
This paper delves into the text processing aspects of Language Computing, which enables computers to understand, interpret, and generate human language. Focusing on tasks such as speech recognition, machine translation, sentiment analysis, text summarization, and language modelling, language computing integrates disciplines including linguistics, computer science, and cognitive psychology to create meaningful human-computer interactions. Recent advancements in deep learning have made computers more accessible and capable of independent learning and adaptation. In examining the landscape of language computing, the paper emphasises foundational work like encoding, where Tamil transitioned from ASCII to Unicode, enhancing digital communication. It discusses the development of computational resources, including raw data, dictionaries, glossaries, annotated data, and computational grammars, necessary for effective language processing. The challenges of linguistic annotation, the creation of treebanks, and the training of large language models are also covered, emphasising the need for high-quality, annotated data and advanced language models. The paper underscores the importance of building practical applications for languages like Tamil to address everyday communication needs, highlighting gaps in current technology. It calls for increased research collaboration, digitization of historical texts, and fostering digital usage to ensure the comprehensive development of Tamil language processing, ultimately enhancing global communication and access to digital services.
- Asia > Sri Lanka > Northern Province > Jaffna District > Jaffna (0.05)
- South America (0.04)
- North America > Central America (0.04)
- Asia > India (0.04)
UniCoder: Scaling Code Large Language Model via Universal Code
Sun, Tao, Chai, Linzheng, Yang, Jian, Yin, Yuwei, Guo, Hongcheng, Liu, Jiaheng, Wang, Bing, Yang, Liqun, Li, Zhoujun
Intermediate reasoning or acting steps have successfully improved large language models (LLMs) for handling various downstream natural language processing (NLP) tasks. When applying LLMs for code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then output code with the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks since the standard CoT has different logical structures and forms of expression with the code. In this work, we introduce the universal code (UniCode) as the intermediate representation. It is a description of algorithm steps using a mix of conventions of programming languages, such as assignment operator, conditional operator, and loop. Hence, we collect an instruction dataset UniCoder-Instruct to train our model UniCoder on multi-task learning objectives. UniCoder-Instruct comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal code representation and the final code solution significantly improves the quality of the generated code. The experimental results demonstrate that UniCoder with the universal code significantly outperforms the previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Africa > Rwanda > Kigali > Kigali (0.04)
- (13 more...)
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Zheng, Sipeng, Zhou, Bohan, Feng, Yicheng, Wang, Ye, Lu, Zongqing
In this paper, we propose \textbf{UniCode}, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts MLLM's ability to generate images and texts in a multimodal context. Towards this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term ``image decompression'', enabling our model to interpret compressed visual data and generate high-quality images.The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, Unicode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performances comparable to leading MLLMs across a spectrum of VQA benchmarks.
Unsupervised segmentation of irradiation$\unicode{x2010}$induced order$\unicode{x2010}$disorder phase transitions in electron microscopy
Ter-Petrosyan, Arman H, Bilbrey, Jenna A, Doty, Christina M, Matthews, Bethany E, Wang, Le, Du, Yingge, Lang, Eric, Hattar, Khalid, Spurgeon, Steven R
We present a method for the unsupervised segmentation of electron microscopy images, which are powerful descriptors of materials and chemical systems. Images are oversegmented into overlapping chips, and similarity graphs are generated from embeddings extracted from a domain$\unicode{x2010}$pretrained convolutional neural network (CNN). The Louvain method for community detection is then applied to perform segmentation. The graph representation provides an intuitive way of presenting the relationship between chips and communities. We demonstrate our method to track irradiation$\unicode{x2010}$induced amorphous fronts in thin films used for catalysis and electronics. This method has potential for "on$\unicode{x2010}$the$\unicode{x2010}$fly" segmentation to guide emerging automated electron microscopes.
- North America > United States > Tennessee > Knox County > Knoxville (0.14)
- North America > United States > Washington > Benton County > Richland (0.06)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Energy (1.00)
- Government > Regional Government > North America Government > United States Government (0.46)
Pinaki Laskar on LinkedIn: #AI #technology #Data
AI Researcher, Cognitive Technologist Inventor - AI Thinking, Think Chain Innovator - AIOT, XAI, Autonomous Cars, IIOT Founder Fisheyebox Spatial Computing Savant, Transformative Leader, Industry X.0 Practitioner Unicode is an information #technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of March 2020, there is a total of 143,859 characters, with Unicode 13.0 (these characters consist of 143,696 graphic characters and 163 format characters) covering 154 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical. The Universal Coded Character Set (UCS) is a standard set of characters defined by the International Standard ISO/IEC 10646, Universal Coded Character Set (UCS), which is the basis of many character encodings, improving as characters from previously unrepresented writing systems are added. To integrate AI into computers and system software means to create a Unicode abstraction level, the Universal Coded Data Set (UCDS), as AI Unidatacode or EIS UCDS.