Multimodal Side-Tuning for Document Classification
Zingaro, Stefano Pio, Lisanti, Giuseppe, Gabbrielli, Maurizio
arXiv.org Artificial Intelligence
Notwithstanding the many technological advances in computer vision and artificial intelligence, which are contributing to the "digital transformation" of many companies and industrial processes, there still exists a surprising number of tasks that are carried out almost entirely by humans. In particular, many tasks in different industries, from administrative procedures to the archival of old manuscripts, involve the manual processing of a huge number of paper documents, with consequent high costs for the companies and, ultimately, for their clients. There are two main reasons for this situation. The first is deeply connected to the internal rules and processes of some companies, banks in particular, which have a large number of legacy procedures and considerable inertia toward innovation. The second reason, which we consider in this paper, is the lack of completely satisfactory automatic tools for document classification, especially when documents contain different sources of information such as text, images, and handwritten parts. While some paper documents could be replaced by electronic means, paper documentation cannot be eliminated altogether, hence efficient and trustworthy tools for document classification are essential. As we discuss in the next section, document classification has been widely investigated, and methods can be roughly divided into three categories: those based on the textual content of the document, often obtained through Optical Character Recognition (OCR); those based on the visual structure of the document image; and multimodal methods that use both text and image. The latter family of solutions [1-8] has provided significant advances, yet dealing with both textual and visual content in full generality remains an open problem [8].
Jan-23-2023