Towards Verifiable Text Generation with Symbolic References

Lucas Torroba Hennigen, Shannon Shen, Aniruddha Nrusimha, Bernhard Gapp, David Sontag, Yoon Kim

arXiv.org Artificial Intelligence 

Large language models (LLMs) have demonstrated an impressive ability to synthesize plausible and fluent text. However, they remain vulnerable to hallucinations, and thus their outputs generally require manual human verification for high-stakes applications, which can be time-consuming and difficult. This paper proposes symbolically grounded generation (SymGen) as a simple approach for enabling easier validation of an LLM's output. SymGen prompts an LLM to interleave its regular output text with explicit symbolic references to fields present in some conditioning data (e.g., a table in JSON format). The references can be used to display the provenance of different spans of text in the generation, reducing the effort required for manual verification. Across data-to-text and question-answering experiments, we find that LLMs are able to directly output text that makes use of symbolic references while maintaining fluency and accuracy.

Figure 1: Comparison of a standard LLM-generated description (A) with a SymGen description (B, ours) of a basketball game, based on statistics about it.
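To make the idea concrete, the following is a minimal sketch of how symbolic references in generated text might be resolved against conditioning data. The `{{dotted.path}}` reference syntax, the `resolve_symgen` helper, and the example game record are illustrative assumptions, not the paper's actual format:

```python
import re

def resolve_symgen(template: str, data: dict) -> str:
    """Replace symbolic references like {{home.team}} with values looked
    up in the conditioning data; each reference carries its provenance."""
    def lookup(path: str):
        node = data
        for key in path.split("."):
            # Hypothetical convention: numeric keys index into lists.
            node = node[int(key)] if isinstance(node, list) else node[key]
        return node

    # Substitute every {{...}} reference with the value it points to.
    return re.sub(r"\{\{([\w.]+)\}\}", lambda m: str(lookup(m.group(1))), template)

# Toy conditioning data (assumed structure, for illustration only).
game = {"home": {"team": "Celtics", "points": 112},
        "away": {"team": "Knicks", "points": 98}}

text = "The {{home.team}} beat the {{away.team}} {{home.points}}-{{away.points}}."
print(resolve_symgen(text, game))
# → The Celtics beat the Knicks 112-98.
```

A verifier can render the resolved values while highlighting which span of text each data field produced, so a human checks the reference paths rather than re-deriving every number.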