e2cfb719f58585f779d0a4f9f07bd618-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems

A.1 Creation of the Multimodal Web Document Dataset

A.1.1 Collecting of a Large Number of HTML Files

Our data collection process begins by considering the 25 most recent Common Crawl dumps available at the time of dataset creation. These dumps contain webpages spanning from February 2020 to January/February 2023. We use a modified version of readability-lxml to extract the main text from the pages, discarding any pages that contain text of excessively high perplexity. This process yields a total of 41.2 billion documents.

Selection of English content. To identify non-English content, we apply the FastText classifier (Joulin et al., 2017) to the extracted text, effectively filtering out 63.6% of the documents.

Early text deduplication. Often, a set of URLs is crawled repeatedly across different Common Crawl snapshots. However, the content of these websites may vary as web administrators make changes over time. Hence, at this stage, we refrain from deduplicating documents based on their URLs. Instead, we perform MinHash (Broder, 1997) deduplication with 16 hashes calculated over 5-grams. To further refine the data, we eliminate documents containing substantial proportions of repeated paragraphs and n-grams, employing the methodology described in MassiveText (Rae et al., 2022).
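
The MinHash step can be illustrated with a minimal sketch. It is not the authors' pipeline: the excerpt only states "16 hashes calculated over 5-grams", so the use of word-level 5-grams, the seeded BLAKE2b hash family, the 0.8 similarity threshold, and all function names below are assumptions made for illustration. A pipeline at this scale would also typically replace the pairwise comparison with locality-sensitive hashing.

```python
import hashlib

NUM_HASHES = 16   # from the text: 16 hashes
NGRAM_SIZE = 5    # from the text: 5-grams (word-level is an assumption)


def word_ngrams(text, n=NGRAM_SIZE):
    """Return the set of word n-grams (shingles) of a document."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed, shingle):
    """One member of a hash family, selected by `seed` (assumed construction)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """MinHash signature: the minimum hash value per hash function."""
    shingles = word_ngrams(text)
    if not shingles:
        return None
    return tuple(
        min(_seeded_hash(seed, s) for s in shingles)
        for seed in range(num_hashes)
    )


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep the first document of each near-duplicate group.

    The threshold is hypothetical; the comparison is quadratic and only
    suitable for small examples.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if sig is not None and any(
            estimated_jaccard(sig, other) >= threshold for other in kept_sigs
        ):
            continue  # near-duplicate of a document already kept
        kept.append(doc)
        if sig is not None:
            kept_sigs.append(sig)
    return kept
```

Because signatures are computed from text rather than URLs, two crawls of the same URL with different content are kept, while identical text under different URLs is collapsed.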


Appendix of Modeling

Neural Information Processing Systems

To create a passage representation, the passage title and text are concatenated ([CLS] title [SEP] passage [SEP]), following common practice (Karpukhin et al., 2020). We retrieve the top 10 passages and use them as input to mGEN. We differentiate those paragraphs from the question using special tokens (

vs. He graduated with a B.S. degree in Biology in 1957. As in the case of machine translation, we found that the language code does not need to be specified during inference as our model learns the question language automatically. Yet, we found that training with language codes is particularly useful to augment training data for $L_{\mathrm{target}}$ without any question data in $L_{\mathrm{target}}$.
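
The passage representation described in this appendix, [CLS] title [SEP] passage [SEP], and the selection of the top 10 passages can be sketched as follows. This is only a string-level illustration: the function names, the `(score, title, text)` tuple layout, and the example data are hypothetical, and a real system would normally let the tokenizer insert the special tokens.

```python
def format_passage(title, text, cls_token="[CLS]", sep_token="[SEP]"):
    """Concatenate title and passage text in the [CLS] title [SEP] passage [SEP] layout."""
    return f"{cls_token} {title} {sep_token} {text} {sep_token}"


def top_k_passages(scored_passages, k=10):
    """Return the k highest-scoring (score, title, text) tuples (k=10 in the text)."""
    return sorted(scored_passages, key=lambda p: p[0], reverse=True)[:k]


# Hypothetical retrieval results: (score, title, text)
candidates = [
    (0.2, "Title A", "Text of passage A."),
    (0.9, "Title B", "Text of passage B."),
]
formatted = [format_passage(t, x) for _, t, x in top_k_passages(candidates)]
```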


040ca38cefb1d9226d79c05dd25469cb-Supplemental.pdf

Neural Information Processing Systems

If there is a bingo on mode-$k$, the $m$-th row of the mode-$k$ expansion of $P$ is a constant multiple of the $(m-1)$-th row, where $m$ is a number determined by the bingo position. When a row is a constant multiple of another row, the rank of the matrix is reduced by a maximum of one, which means $\mathrm{Rank}(P^{(k)}) \le I_k - 1$. In the same way, if there are $b_k$ bingos, then $b_k$ rows are constant multiples of other rows, which means $\mathrm{Rank}(P^{(k)}) \le I_k - b_k$.

For any positive tensor $P$, $\mathrm{rank}(P) = 1$ if and only if all of its many-body θ-parameters are 0.

Proof. First, we show that $\mathrm{rank}(P) = 1$ implies that all many-body θ-parameters are 0. From the assumption of $\mathrm{rank}(P) = 1$, the $m$-th row of the mode-$k$ expansion of $P$ has to be a constant multiple of the $(m-1)$-th row for all $m \in \{2, \dots, I_k\}$ and $k \in [d]$.
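
The rank argument can be checked numerically on small matrices. The sketch below is an illustration only: `matrix_rank` is a hypothetical helper using exact Gaussian elimination over fractions, and the example matrices are made up rather than taken from the paper. It shows that a matrix with $I_k = 3$ rows, one of which is a constant multiple of the previous row, has rank at most $I_k - 1 = 2$, and that rank 1 corresponds to every row being a multiple of the one before it.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination (no floating-point error)."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Row 2 is 2x row 1 (one dependent row): rank <= I_k - 1 with I_k = 3.
one_dependent_row = [[1, 2, 3], [2, 4, 6], [1, 1, 2]]

# Every row is a multiple of the previous one: rank 1.
all_rows_proportional = [[1, 2], [2, 4], [3, 6]]

rank_a = matrix_rank(one_dependent_row)
rank_b = matrix_rank(all_rows_proportional)
```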


Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition

Neural Information Processing Systems

Offline handwriting recognition systems require cropped text line images for both training and recognition. On the one hand, the annotation of position and transcript at line level is costly to obtain. On the other hand, automatic line segmentation algorithms are prone to errors, compromising the subsequent recognition. In this paper, we propose a modification of the popular and efficient Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNNs) to enable end-to-end processing of handwritten paragraphs. More particularly, we replace the collapse layer transforming the two-dimensional representation into a sequence of predictions by a recurrent version which can select one line at a time. In the proposed model, a neural network performs a kind of implicit line segmentation by computing attention weights on the image representation. The experiments on paragraphs of Rimes and IAM databases yield results that are competitive with those of networks trained at line level, and constitute a significant step towards end-to-end transcription of full documents.



Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Neural Information Processing Systems

Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference.