MeterReader++: A Pointer Meter Reading Framework Based on Vision-Language Large Models and Its Applications

王昌鹏; 闫云凤; 齐冬莲; 沈潇军; 储海东

doi:10.13336/j.1003-6520.hve.20250426

MeterReader++: Pointer Meter Reading Framework and Applications Based on Visual Language Large Model

Abstract: Reading pointer-type meters is a key task in industrial digitalization. Current approaches rely mainly on conventional recognition algorithms such as object detection and keypoint localization, which generalize poorly and depend heavily on labeled data. This paper uses a vision-language large model to imitate the human cognitive process of reading a meter and proposes a general pointer meter reading framework: (1) to break the data-dependence bottleneck, a multimodal data synthesis pipeline for meter reading in industrial scenarios is constructed, which can automatically generate more than 20,000 question-answer pairs; (2) to overcome large-model "hallucination", DeepSeek-R1 is used to imitate human cognitive reading, decoupling meter semantic understanding from the reading inference process, which reduces the average reference error by 10% relative to the base model Qwen2.5-VL; (3) to improve generalization, a tolerance-adaptive reinforcement learning method based on generalized policy optimization is designed, which converts the absolute accuracy constraint into a learnable tolerance interval to strengthen generalization to out-of-distribution (OOD) data; in the OOD test, the reading error of the proposed method falls to 2%. Experiments show that the proposed framework achieves an average reference error of 1.2% on a simulated industrial meter test set and 3.16% on a public real-world meter test set, outperforming advanced large models such as Qwen2.5-VL-72B and GPT-4o. This work offers a reference for applying vision-language large models to fine-grained visual understanding and reasoning-based computation tasks.
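The abstract does not give the formula for the reference error or the form of the tolerance-based reward, so the following is only a minimal sketch under stated assumptions. It assumes the reference error is the absolute reading error normalized by the meter's full-scale span (the usual metrological definition), and it replaces the paper's learnable tolerance interval with a fixed, hypothetical parameter `tol_pct`. The function names and the linear decay outside the tolerance band are illustrative choices, not the authors' method.

```python
def reference_error(pred: float, truth: float,
                    scale_min: float, scale_max: float) -> float:
    """Absolute reading error as a percentage of the full-scale span.

    Assumption: this normalization is the standard metrological one;
    the abstract itself does not state the formula.
    """
    span = scale_max - scale_min
    if span <= 0:
        raise ValueError("scale_max must be greater than scale_min")
    return abs(pred - truth) / span * 100.0


def tolerance_reward(pred: float, truth: float,
                     scale_min: float, scale_max: float,
                     tol_pct: float = 2.0) -> float:
    """Hypothetical reward: 1.0 inside the tolerance band, then a linear
    decay that reaches 0.0 at twice the tolerance.

    Assumption: tol_pct is fixed here, whereas the paper describes a
    learnable tolerance interval.
    """
    if tol_pct <= 0:
        raise ValueError("tol_pct must be positive")
    err = reference_error(pred, truth, scale_min, scale_max)
    if err <= tol_pct:
        return 1.0
    return max(0.0, 1.0 - (err - tol_pct) / tol_pct)


if __name__ == "__main__":
    # A 0-1.6 MPa pressure gauge whose true reading is 0.60 MPa.
    for pred in (0.62, 0.648, 1.0):
        err = reference_error(pred, 0.60, 0.0, 1.6)
        rew = tolerance_reward(pred, 0.60, 0.0, 1.6)
        print(f"pred={pred:.3f}  error={err:.2f}%  reward={rew:.2f}")
```

Rewarding any prediction inside a band, instead of demanding an exact match, is one plausible reading of "converting the absolute accuracy constraint into a tolerance interval": small interpolation errors between scale marks are not penalized, while gross misreadings still receive zero reward.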
