What if I told a story here, how would that story start?" Thus, the summarization prompt: "My second grader asked me what this passage means: …" When a given prompt isn't working and GPT-3 keeps pivoting into other modes of completion, that may mean that one hasn't constrained it enough by imitating a correct output, and one needs to go further; writing the first few words or sentence of the target output may be necessary.
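The idea of constraining a completion by writing the opening words of the target output can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed method: `build_prompt`, its parameters, the sample passage, and the sample answer prefix are all hypothetical names and strings chosen for the example, and no particular model API is assumed.

```python
def build_prompt(framing: str, passage: str, output_prefix: str = "") -> str:
    """Assemble a completion prompt from three pieces.

    framing       -- the scene-setting sentence that precedes the passage
    passage       -- the text to be summarized, quoted in the prompt
    output_prefix -- optional first words of the desired answer; supplying
                     them leaves the model less room to drift into another
                     mode of completion
    """
    body = "\n\n".join([framing.strip(), '"' + passage.strip() + '"'])
    # The prompt ends exactly where the model is expected to continue.
    return body + "\n\n" + output_prefix


if __name__ == "__main__":
    prompt = build_prompt(
        framing="My second grader asked me what this passage means:",
        passage="(hypothetical passage text goes here)",
        # Hypothetical answer opening, written by the prompt author.
        output_prefix="I told her, in words a second grader would know:",
    )
    print(prompt)
```

If the model still wanders with an empty `output_prefix`, lengthening the prefix to a full first sentence of the intended answer is the natural next step.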
Existing computational methods for the analysis of corpora of text in natural language are still far from approaching a human level of understanding. We attempt to advance the state of the art by introducing a model and algorithmic framework to transform text into recursively structured data. We apply this to the analysis of news titles extracted from a social news aggregation website. We show that a recursive ordered hypergraph is a sufficiently generic structure to represent a significant number of fundamental natural language constructs, with advantages over conventional approaches such as semantic graphs. We present a pipeline of transformations from the output of conventional NLP algorithms to such hypergraphs, which we denote as semantic hypergraphs. The features of these transformations include the creation of new concepts from existing ones, the organisation of statements into regular structures of predicates followed by an arbitrary number of entities, and the ability to represent statements about other statements. We demonstrate knowledge inference from the hypergraph, identifying claims and expressions of conflicts, along with their participating actors and topics. We show how this enables the actor-centric summarization of conflicts, comparison of topics of claims between actors and networks of conflicts between actors in the context of a given topic. On the whole, we propose a hypergraphic knowledge representation model that can be used to provide effective overviews of a large corpus of text in natural language.
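To make the notion of a recursive ordered hypergraph concrete, the following is a minimal sketch under stated assumptions. It models a hyperedge as an ordered tuple whose first element is a predicate and whose remaining elements are either atoms (plain strings) or further hyperedges. The type names, the helper functions, and the example statement are illustrative choices and are not taken from the authors' implementation.

```python
from typing import Iterator, Tuple, Union

# An edge is either an atom (a string) or an ordered tuple of edges.
# Order matters: by convention here, position 0 holds the predicate.
Edge = Union[str, Tuple["Edge", ...]]


def atoms(edge: Edge) -> Iterator[str]:
    """Yield every atom in the edge, left to right, at any nesting depth."""
    if isinstance(edge, str):
        yield edge
    else:
        for child in edge:
            yield from atoms(child)


def depth(edge: Edge) -> int:
    """Nesting depth: 0 for an atom, 1 for a flat edge, and so on."""
    if isinstance(edge, str):
        return 0
    return 1 + max((depth(child) for child in edge), default=0)


def to_str(edge: Edge) -> str:
    """Render the edge in a parenthesized, space-separated notation."""
    if isinstance(edge, str):
        return edge
    return "(" + " ".join(to_str(child) for child in edge) + ")"


# Hypothetical statement about another statement:
# the inner edge is itself an argument of the outer predicate "says".
claim: Edge = ("says", "senator", ("opposes", "senator", "bill"))

if __name__ == "__main__":
    print(to_str(claim))
    print(list(atoms(claim)))
    print(depth(claim))
```

Because an edge can appear as an argument of another edge, a claim and the thing claimed live in the same structure, which is what an ordinary graph with binary edges cannot express directly.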
This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning representations of words called Multiview Latent Semantic Analysis (MVLSA). By incorporating up to 46 different types of co-occurrence statistics for the same vocabulary of English words, I show that MVLSA outperforms other state-of-the-art word embedding models. Next, I focus on learning entity representations for search and recommendation and present the second method of this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move away from unstructured data and focus on structured knowledge graphs. I present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints.