Grammars & Parsing


Stream Output When Parsing Big Xml With Elixir

#artificialintelligence

There are two big players in Elixir's XML parsing ecosystem: Saxy and SweetXml. I want to read a huge XML file in which some elements are repeated many times, and produce some kind of "iterator" from it: one that, when iterated, produces those repeated items. Saxy is incredibly fast, but it is based on the idea that, as you read the XML file, you "fill" some state object (with whatever you want, and as much as you want, but you fill it nevertheless). In this scenario, I could "fill" the state with the list of items. That, of course, takes a lot less memory than holding the entire XML structure in memory. But it still ties the size of the in-memory list to the size of the XML file, which I don't like: with a big enough file, I can consume more memory than I'm allowed to. SweetXml provides a function called stream_tags, and from its description it seems to hit the spot, because it says it does just what I need: it parses an XML document and, as it finds certain tags, streams the SweetXml representation of them, without building in memory any structure representing the XML.
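The difference between filling a state object and streaming items can be sketched outside Elixir. What follows is a minimal Python illustration built on the standard library's xml.etree.ElementTree.iterparse, not on Saxy or SweetXml. The helper name stream_items, the sample document, and the assumption that the repeated elements are direct children of the root are all invented for the example.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Iterator, Optional


def stream_items(source: BinaryIO, tag: str) -> Iterator[Dict[str, Optional[str]]]:
    """Lazily yield one dict per <tag> element (hypothetical helper)."""
    # iterparse reads the input incrementally instead of loading it whole.
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # first event: the root element has just opened
    for event, elem in events:
        if event == "end" and elem.tag == tag:
            # Build the item before discarding the parsed element.
            yield {child.tag: child.text for child in elem}
            # Assumption: items are direct children of the root, so clearing
            # the root drops every item already handed to the consumer.
            root.clear()


# Made-up sample document standing in for a huge file.
SAMPLE = (
    b"<catalog>"
    b"<item><id>1</id><name>a</name></item>"
    b"<item><id>2</id><name>b</name></item>"
    b"</catalog>"
)

for item in stream_items(io.BytesIO(SAMPLE), "item"):
    print(item)
```

Because the generator hands out each item and then discards it, memory use depends on the size of one item rather than on the size of the file, which is the property the post is looking for.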


Resolution of the Burrows-Wheeler Transform Conjecture

Communications of the ACM

The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as the r-index [7]), central in modern bioinformatics. The compressibility of BWT is quantified by the number $r$ of equal-letter runs in the output. Despite the practical significance of BWT, no nontrivial upper bound on $r$ is known. By contrast, the sizes of nearly all other known compression methods have been shown to be either always within a $\mathrm{polylog}\, n$ factor (where $n$ is the length of the text) from $z$, the size of the Lempel–Ziv (LZ77) parsing of the text, or much larger in the worst case (by an $n^{\varepsilon}$ factor for $\varepsilon > 0$). In this paper, we show that $r = O(z \log^2 n)$ holds for every text. This result has numerous implications for text indexing and data compression; in particular: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, for example, it is possible to obtain functionality of the suffix tree in $O(z\, \mathrm{polylog}\, n)$ space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large $\mathrm{polylog}\, n$ factor; and (3) it implies the first nontrivial relation between the number of runs in the BWT of the text and of its reverse. In addition, we provide an $O(z\, \mathrm{polylog}\, n)$-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop several new data structures and techniques of independent interest. In particular, we define compressed string synchronizing sets (generalizing the recently introduced powerful technique of string synchronizing sets [11]) and show how to efficiently construct them. 
Next, we propose a new variant of wavelet trees for sequences of long strings, establish a nontrivial bound on their size, and describe efficient construction algorithms. Finally, we develop new indexes that can be constructed directly from the LZ77 parsing and efficiently support pattern matching queries on text substrings. Lossless data compression aims to exploit redundancy in the input data to represent it in a small space.
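To make the quantity r from the abstract concrete, here is a minimal Python sketch that computes the BWT by naively sorting suffixes and then counts equal-letter runs. It is an illustration only, not the paper's algorithm. The function names bwt and count_runs and the "$" end-of-text sentinel are assumptions made for the example, and the suffix sort is far too slow for real inputs.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform via explicit suffix sorting."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    # Assumption: the sentinel sorts before every character of the text.
    s = text + sentinel
    # Order suffix start positions by the lexicographical order of the suffixes.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # Each output symbol is the character preceding the suffix (cyclically).
    return "".join(s[i - 1] for i in order)


def count_runs(s: str) -> int:
    """Number of maximal blocks of equal adjacent letters (the quantity r)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")
print(transformed)              # annb$aa
print(count_runs(transformed))  # 5
```

For "banana" the transform groups the repeated letters into the runs a, nn, b, $, aa, so r = 5 for a text of length 7 including the sentinel.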


Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence

arXiv.org Artificial Intelligence

The logical negation property (LNP), which implies generating different predictions for semantically opposite inputs, is an important property that a trustworthy language model must satisfy. However, much recent evidence shows that large-size pre-trained language models (PLMs) do not satisfy this property. In this paper, we perform experiments using probing tasks to assess PLMs' understanding of the LNP. Unlike previous studies that only examined negation expressions, we expand the boundary of the investigation to lexical semantics. Through experiments, we observe that PLMs violate the LNP frequently. To alleviate the issue, we propose a novel intermediate training task, named meaning-matching, designed to directly learn a meaning-text correspondence, instead of relying on the distributional hypothesis. Through multiple experiments, we find that the task enables PLMs to learn lexical semantic information. Also, through fine-tuning experiments on 7 GLUE tasks, we confirm that it is a safe intermediate task that guarantees a similar or better performance of downstream tasks. Finally, we observe that our proposed approach outperforms our previous counterparts despite its time and resource efficiency.


GEC -- Grammatical Error Correction

#artificialintelligence

With millions of people trying to move abroad every year, doing so has become more and more difficult. One of the most important skills required for it is good English communication. Since the majority of the people in this category come from countries where English isn't the first language, they are already at a disadvantage. Automated Grammatical Error Correction (GEC) can be an essential and useful tool for millions of people who learn English as a second language. It can be used either to improve their grammatical knowledge or on a daily basis to communicate with other people efficiently.


Learn from Structural Scope: Improving Aspect-Level Sentiment Analysis with Hybrid Graph Convolutional Networks

arXiv.org Artificial Intelligence

Aspect-level sentiment analysis aims to determine the sentiment polarity towards a specific target in a sentence. The main challenge of this task is to effectively model the relation between targets and sentiments so as to filter out noisy opinion words from irrelevant targets. Most recent efforts capture relations through target-sentiment pairs or opinion spans from a word-level or phrase-level perspective. Based on the observation that targets and sentiments essentially establish relations following the grammatical hierarchy of phrase-clause-sentence structure, it is promising to exploit comprehensive syntactic information to better guide the learning process. Therefore, we introduce the concept of Scope, which outlines a structural text region related to a specific target. To jointly learn structural Scope and predict the sentiment polarity, we propose a hybrid graph convolutional network (HGCN) to synthesize information from the constituency tree and the dependency tree, exploring the potential of linking two syntax parsing methods to enrich the representation. Experimental results on four public datasets illustrate that our HGCN model outperforms current state-of-the-art baselines.


Is COVID more dangerous than driving? How scientists are parsing COVID-19 risks.

The Japan Times

Like it or not, the choose-your-own-adventure period of the pandemic is upon us. Some free testing sites have closed. Whatever parts of the United States were still trying to collectively quell the pandemic have largely turned their focus away from communitywide advice. Now, even as case numbers begin to climb again and more infections go unreported, the onus has fallen on individual Americans to decide how much risk they and their neighbors face from the coronavirus -- and what, if anything, to do about it. For many people, the threats posed by COVID-19 have eased dramatically over the two years of the pandemic.


Does BERT really agree? Fine-grained Analysis of Lexical Dependence on a Syntactic Task

arXiv.org Artificial Intelligence

Although transformer-based Neural Language Models demonstrate impressive performance on a variety of tasks, their generalization abilities are not well understood. They have been shown to perform strongly on subject-verb number agreement in a wide array of settings, suggesting that they learned to track syntactic dependencies during their training even without explicit supervision. In this paper, we examine the extent to which BERT is able to perform lexically-independent subject-verb number agreement (NA) on targeted syntactic templates. To do so, we disrupt the lexical patterns found in naturally occurring stimuli for each targeted structure in a novel fine-grained analysis of BERT's behavior. Our results on nonce sentences suggest that the model generalizes well for simple templates, but fails to perform lexically-independent syntactic generalization when as little as one attractor is present.


Breaking Down and Interpreting Human Language -- NLP

#artificialintelligence

From translation software, chatbots, spam filters, and search engines, to grammar correction software, voice assistants, and social media monitoring tools, NLP is at the core of tools in our everyday life. NLP (Natural Language Processing) tries to make machines that can think and act like humans (don't worry, they won't be as human as humans are). It is used to understand human behavior by feeding the machine syntax, language, accents, and many other forms of sensory data that humans capture. Algorithms then convert, or rather transform, this data into a language that the machine understands, so that the machine learns rules for performing actions and solving problems. So how does NLP work?


Design considerations for a hierarchical semantic compositional framework for medical natural language understanding

arXiv.org Artificial Intelligence

Medical natural language processing (NLP) systems are a key enabling technology for transforming Big Data from clinical report repositories to information used to support disease models and validate intervention methods. However, current medical NLP systems fall considerably short when faced with the task of logically interpreting clinical text. In this paper, we describe a framework inspired by mechanisms of human cognition in an attempt to jump the NLP performance curve. The design centers about a hierarchical semantic compositional model (HSCM) which provides an internal substrate for guiding the interpretation process. The paper describes insights from four key cognitive aspects including semantic memory, semantic composition, semantic activation, and hierarchical predictive coding. We discuss the design of a generative semantic model and an associated semantic parser used to transform a free-text sentence into a logical representation of its meaning.


Memory limitations are hidden in grammar

arXiv.org Artificial Intelligence

For many centuries, the goal of linguistics has been to capture this capacity by a formal description--a grammar--consisting of a systematic set of rules and/or principles that determine which sentences are part of a given language and which are not (Bod, 2013). Over the years, these formal grammars have taken many forms but common to them all is the assumption that they capture the idealized linguistic competence of a native speaker/hearer, independent of any memory limitations or other non-linguistic cognitive constraints (Chomsky, 1965; Miller, 2000). These abstract formal descriptions have come to play a foundational role in the language sciences, from linguistics, psycholinguistics, and neurolinguistics (Hauser et al., 2002; Pinker, 2003) to computer science, engineering, and machine learning (Klein and Manning, 2003; Dyer et al., 2016; Gómez-Rodríguez et al., 2018). Despite evidence that processing difficulty underpins the unacceptability of certain sentences (Morrill, 2010; Hawkins, 2004), the cognitive independence assumption that is a defining feature of linguistic competence has not been examined in a systematic way using the tools of formal grammar. It is therefore unclear whether these supposedly idealized descriptions of language are free of non-linguistic cognitive constraints, such as memory limitations.