MammothModa: Multi-Modal Large Language Model
Qi She, Junwen Pan, Xin Wan, Rui Zhang, Dawei Lu, Kai Huang
In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: In addition to the vision encoder, we incorporated the Visual Attention Experts into the LLM to enhance its visual capabilities. (ii) Extending Context Window for High-Resolution and Long-Duration Visual Features: The Visual Merger Module effectively reduces the token count for high-resolution images, while frame position IDs manage long-duration visual data without resorting to position interpolation. (iii) High-Quality Bilingual Datasets: To minimize visual hallucinations and improve model robustness, we meticulously curated and filtered a high-quality bilingual dataset.
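A minimal sketch of two of the ideas described above, written in PyTorch under stated assumptions: (1) a visual merger that reduces the token count of high-resolution images by pooling neighbouring patch tokens, and (2) frame position IDs that give every token of a video frame the same position index so long clips fit in the context window without position interpolation. The abstract does not specify MammothModa's actual implementation; the merge factor, shapes, and helper names (`VisualMerger`, `frame_position_ids`) here are illustrative.

```python
import torch
import torch.nn as nn


class VisualMerger(nn.Module):
    """Merge each 2x2 block of patch tokens into one token (4x reduction)."""

    def __init__(self, dim: int, merge: int = 2):
        super().__init__()
        self.merge = merge
        self.proj = nn.Linear(dim * merge * merge, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, dim) patch tokens from the vision encoder
        b, _, d = x.shape
        m = self.merge
        x = x.view(b, h // m, m, w // m, m, d)           # split grid into m x m blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // m) * (w // m), m * m * d)
        return self.proj(x)                              # (batch, h*w / m^2, dim)


def frame_position_ids(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Assign one position ID per frame, shared by all tokens of that frame."""
    return torch.arange(num_frames).repeat_interleave(tokens_per_frame)


# Example: a 32x32 grid of patch tokens (1024 tokens) is merged down to 256,
# and a 4-frame clip with 3 tokens per frame uses only 4 distinct position IDs.
merger = VisualMerger(dim=1024)
tokens = torch.randn(1, 32 * 32, 1024)
print(merger(tokens, 32, 32).shape)   # torch.Size([1, 256, 1024])
print(frame_position_ids(4, 3))       # tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
```

With shared frame position IDs, the number of distinct positions grows with the number of frames rather than the total token count, which is why no position interpolation is needed for long videos.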
arXiv.org Artificial Intelligence
Jun-26-2024