AITopics | Golts, Alona

Collaborating Authors

Golts, Alona

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

DocVLM: Make Your VLM an Efficient Reader

Nacson, Mor Shpigel, Aberdam, Aviad, Ganz, Roy, Avraham, Elad Ben, Golts, Alona, Kittenplon, Yair, Mazor, Shai, Litman, Ron

arXiv.org Artificial IntelligenceDec-11-2024

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms compared to full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM significantly reduces reliance on high-resolution images for document understanding. In limited-token regimes (448$\times$448), DocVLM with 64 learned queries improves DocVQA results from 56.0% to 86.6% when integrated with InternVL2 and from 84.4% to 91.2% with Qwen2-VL. In LLaVA-OneVision, DocVLM achieves improved results while using 80% less image tokens. The reduced token usage allows processing multiple pages effectively, showing impressive zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA, highlighting DocVLM's potential for applications requiring high-performance and efficiency.

artificial intelligence, large language model, natural language, (13 more...)

arXiv.org Artificial Intelligence

2412.08746

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

GRAM: Global Reasoning for Multi-Page VQA

Blau, Tsachi, Fogel, Sharon, Ronen, Roi, Golts, Alona, Ganz, Roy, Avraham, Elad Ben, Aberdam, Aviad, Tsiper, Shahar, Litman, Ron

arXiv.org Artificial IntelligenceJan-7-2024

The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computationally-heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To enforce our model to utilize the newly introduced document-level tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our C-Former model, which reduces the encoded sequence length, thereby allowing a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.

information, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2401.03411

Country:

North America > United States > Texas (1.00)
North America > United States > New Jersey (1.00)
Europe (1.00)
(4 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Government > Space Agency (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

CLIPTER: Looking at the Bigger Picture in Scene Text Recognition

Aberdam, Aviad, Bensaïd, David, Golts, Alona, Ganz, Roy, Nuriel, Oren, Tichauer, Royee, Mazor, Shai, Litman, Ron

arXiv.org Artificial IntelligenceJul-23-2023

Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and achieve state-of-the-art results across multiple benchmarks. Furthermore, our analysis highlights improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.

artificial intelligence, machine learning, pattern recognition, (16 more...)

arXiv.org Artificial Intelligence

2301.07464

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.84)

Add feedback

Unsupervised Single Image Dehazing Using Dark Channel Prior Loss

Golts, Alona, Freedman, Daniel, Elad, Michael

arXiv.org Machine LearningDec-6-2018

Single image dehazing is a critical stage in many modern-day autonomous vision applications. Early prior-based methods often involved a time-consuming minimization of a hand-crafted energy function. Recent learning-based approaches utilize the representational power of deep neural networks (DNNs) to learn the underlying transformation between hazy and clear images. Due to inherent limitations in collecting matching clear and hazy images, these methods resort to training on synthetic data; constructed from indoor images and corresponding depth information. This may result in a possible domain shift when treating outdoor scenes. We propose a completely unsupervised method of training via minimization of the well-known, Dark Channel Prior (DCP) energy function. Instead of feeding the network with synthetic data, we solely use real-world outdoor images and tune the network's parameters by directly minimizing the DCP. Although our `Deep DCP' technique can be regarded as a fast approximator of DCP, it actually improves its results significantly. This suggests an additional regularization obtained via the network and learning process. Experiments show that our method performs on par with other large-scale, supervised methods.

deep learning, neural network, single image, (20 more...)

arXiv.org Machine Learning

1812.07051

Country: Asia > Middle East > Israel (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Deep Energy: Using Energy Functions for Unsupervised Training of DNNs

Golts, Alona, Freedman, Daniel, Elad, Michael

arXiv.org Machine LearningMay-31-2018

The success of deep learning has been due in no small part to the availability of large annotated datasets. Thus, a major bottleneck in the current learning pipeline is the human annotation of data, which can be quite time consuming. For a given problem setting, we aim to circumvent this issue via the use of an externally specified energy function appropriate for that setting; we call this the "Deep Energy" approach. We show how to train a network on an entirely unlabelled dataset using such an energy function, and apply this general technique to learn CNNs for two specific tasks: seeded segmentation and image matting. Once the network parameters have been learned, we obtain a high-quality solution in a fast feed-forward style, without the need to repeatedly optimize the energy function for each image.

deep learning, neural network, segmentation, (20 more...)

arXiv.org Machine Learning

1805.12355

Country: Asia > Middle East > Israel (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback