

Ovis2.5 Technical Report

Lu, Shiyin, Li, Yang, Xia, Yu, Hu, Yuwei, Zhao, Shanshan, Ma, Yanqing, Wei, Zhichao, Li, Yinglun, Duan, Lunhao, Zhao, Jianshan, Han, Yuxuan, Li, Haijun, Chen, Wanying, Tang, Junke, Hou, Chengkun, Du, Zhixing, Zhou, Tianli, Zhang, Wenjie, Ding, Huping, Li, Jiahe, Li, Wen, Hu, Gui, Gu, Yiliang, Yang, Siran, Wang, Jiamang, Sun, Hailong, Wang, Yibo, Sun, Hui, Huang, Jinlong, He, Yuping, Shi, Shengze, Zhang, Weihong, Zheng, Guodong, Jiang, Junpeng, Gao, Sensen, Wu, Yi-Feng, Chen, Sijia, Chen, Yuhui, Chen, Qing-Guo, Xu, Zhao, Luo, Weihua, Zhang, Kaifu

arXiv.org Artificial Intelligence

We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
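
The abstract's optional "thinking mode" trades latency for accuracy by letting the model emit a reflective reasoning trace before answering. Below is a minimal sketch of how such a switch might look at inference time; all names here (GenerationConfig, model.generate, the prompt wording) are illustrative assumptions, not the released Ovis2.5 API.

```python
# Minimal sketch of an optional "thinking mode" switch at inference time.
# Every name below is an assumption for illustration, not Ovis2.5's interface.
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    enable_thinking: bool = False  # trade latency for accuracy on hard inputs
    max_new_tokens: int = 1024

def answer(model, image, question: str, cfg: GenerationConfig):
    if cfg.enable_thinking:
        # Reflective mode: prompt the model to produce an intermediate
        # reasoning trace (self-checking, revision) before the final answer.
        prompt = ("Think step by step, checking and revising your reasoning, "
                  f"then give the final answer.\n{question}")
    else:
        # Linear mode: direct answer, lower latency.
        prompt = question
    return model.generate(image=image, prompt=prompt,
                          max_new_tokens=cfg.max_new_tokens)
```

A caller would enable the flag only for visually dense or reasoning-heavy inputs, since reflection lengthens the decoded sequence and thus the response time.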


Creative Problem Solving in Large Language and Vision Models -- What Would it Take?

Nair, Lakshmi, Gizzi, Evana, Sinapov, Jivko

arXiv.org Artificial Intelligence

In this section, we discuss how typical task planning is achieved with LLVMs. We divide the discussion into three subsections based on the level of task planning abstraction where LLVMs are applied: a) high-level task planning, b) low-level task planning, and c) hybrid task planning. Given this overview, we see that LLVMs, both at the high level and the low level, can be modified to incorporate creative problem solving into task planning. For instance, the high-level task plans generated can encompass a novel substitution for a missing object, whereas the low-level task plan can generate ...
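
As a concrete illustration of the high-level case, a planner can ask an LLM for an ordered step list and, when a required object is missing, query it for a functional substitute. This is a hedged sketch under assumed names (`llm` as a generic text-completion callable, `plan_task`, `substitute_missing`); it is not code from the paper.

```python
# Illustrative sketch of high-level task planning with an LLM, including a
# creative substitution for a missing object. `llm` stands for any generic
# text-completion callable; all names are assumptions, not the paper's code.
def plan_task(llm, goal: str, available: list[str]) -> list[str]:
    # High-level planning: ask the LLM for an ordered list of steps.
    reply = llm(
        f"List, one step per line, the steps to achieve: {goal}. "
        f"Objects at hand: {', '.join(available)}."
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def substitute_missing(llm, required: str, available: list[str]) -> str:
    # Creative substitution: when `required` is absent, ask which available
    # object could serve the same function (e.g., a bowl in place of a cup).
    return llm(
        f"'{required}' is unavailable. Which one of {', '.join(available)} "
        f"could serve the same function? Reply with the object name only."
    )
```

A low-level planner would then map each returned step onto motion primitives or executable skills; the hybrid variant described in the paper combines both levels.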


AI Could Help Free Human Creativity

TIME - Tech

We're more distracted than ever. Why remember anything when I can just Google it? Why summon the attention to read a book when I can just scroll through Twitter? Some philosophers believe that ChatGPT and its siblings will further diminish our ability to do the kind of "deep work" needed to spark creativity and breed big ideas. What good are the tools if we begin to rely on them so much that we no longer have the capacity to think bigger?


Laser-Powered Robot Insect Achieves Lift Off

IEEE Spectrum Robotics

For robots of all sizes, power is a fundamental problem. Any robot that moves is constrained in one way or another by its power supply, whether it relies on carrying around heavy batteries, combustion engines, fuel cells, or anything else. It's particularly tricky to manage power as your robot gets smaller, since it's much more straightforward to scale these things up than down--and for really tiny robots (with masses in the hundreds-of-milligrams range), especially those that demand a lot of power, there really isn't a good solution. In practice, this means that robots on the scale of small insects often depend on tethers for power, which isn't ideal for making them practical in the long term. At the IEEE International Conference on Robotics and Automation in Brisbane, Australia, next week, roboticists from the University of Washington, in Seattle, will present RoboFly, a laser-powered insect-sized flapping-wing robot that performs the first (very brief) untethered flight of a robot at such a small scale.


Finding Faces in a Crowd - CMU News - Carnegie Mellon University

#artificialintelligence

An automated face detection method developed at Carnegie Mellon University enables computers to recognize faces in images at a variety of scales, including tiny faces composed of just a handful of pixels. Spotting a face in a crowd, or recognizing any small or distant object within a large image, is a major challenge for computer vision systems. The trick to finding tiny objects, say researchers at Carnegie Mellon University, is to look for larger things associated with them. An improved method for coding that crucial context from an image has enabled Deva Ramanan, associate professor of robotics, and Peiyun Hu, a Ph.D. student in robotics, to demonstrate a significant advance in detecting tiny faces. When applied to benchmark datasets of faces, their method reduced error by a factor of two, and 81 percent of the faces found using their method proved to be actual faces, compared with 29 to 64 percent for prior methods.
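
The key idea reported here, scoring a tiny candidate together with the larger context around it, can be illustrated with a short sketch. This is only a schematic of the principle, not Hu and Ramanan's actual architecture; `crop_with_context` and the 3x default window are assumptions for illustration.

```python
# Schematic of the context idea: evaluate a tiny candidate region together
# with an enlarged window around it, so a classifier also sees the "larger
# things" (hairlines, shoulders, bodies) associated with a small face.
# Illustrative only; not the authors' actual architecture.
import numpy as np

def crop_with_context(image: np.ndarray, box: tuple[int, int, int, int],
                      context: float = 3.0):
    """Return the tight crop and a context crop `context` times larger."""
    x, y, w, h = box
    H, W = image.shape[:2]
    cx, cy = x + w / 2, y + h / 2           # candidate center
    cw, ch = w * context, h * context       # enlarged window size
    x0, y0 = max(0, int(cx - cw / 2)), max(0, int(cy - ch / 2))
    x1, y1 = min(W, int(cx + cw / 2)), min(H, int(cy + ch / 2))
    tight = image[y:y + h, x:x + w]         # the face itself
    ctx = image[y0:y1, x0:x1]               # face plus surrounding context
    return tight, ctx
```

A detector that fuses scores from both crops can exploit surrounding cues that the tight crop alone, at a handful of pixels, cannot provide.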