Abstract:
Bolt images collected during transmission line inspection are typically low in resolution and carry limited visual information. Because traditional image classification models struggle to learn semantically rich visual representations from such images, this paper proposes a bolt defect classification method based on multi-modal contrastive learning. First, to inject bolt-related semantic information and prior knowledge into the visual representation in a cross-modal manner, a two-stage training algorithm is proposed that combines multi-modal contrastive pre-training with supervised fine-tuning. Second, to alleviate overfitting during multi-modal contrastive pre-training, an info noise contrastive estimation loss with label smoothing (infoNCE-LS) is proposed to improve the generalization of the pre-trained visual representation. Finally, to address the mismatch between the upstream and downstream tasks, three types of classification heads based on text prompts are designed to improve the transfer learning performance of the pre-trained visual representation in the supervised fine-tuning stage. Experimental results show that the two models, based on ResNet50 and ViT, achieve accuracies of 92.3% and 97.4% on the bolt defect classification dataset, which are 2.4% and 5.8% higher than the baseline, respectively. The study achieves cross-modal supplementation of semantic information from text to image and offers a new approach to bolt defect identification.
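The abstract does not state the infoNCE-LS formula, so the following is only a minimal sketch of one plausible reading: a symmetric image-text InfoNCE loss whose one-hot targets are replaced by uniformly smoothed targets. The function names, the temperature of 0.07, the smoothing factor `eps`, the uniform smoothing scheme, and the symmetric averaging are all assumptions made for illustration and may differ from the paper's actual definition.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def _cross_entropy_ls(logits, eps):
    """Mean cross-entropy where row i's positive is column i.

    Assumed smoothing scheme: the target is (1 - eps) * one_hot + eps / n,
    i.e. eps of the probability mass is spread uniformly over all n pairs.
    """
    n = len(logits)
    total = 0.0
    for i, row in enumerate(logits):
        log_probs = _log_softmax(row)
        for j, lp in enumerate(log_probs):
            target = (1.0 - eps) * (1.0 if j == i else 0.0) + eps / n
            total -= target * lp
    return total / n


def info_nce_ls(image_emb, text_emb, temperature=0.07, eps=0.1):
    """Symmetric InfoNCE with label smoothing (hypothetical formulation).

    image_emb[i] and text_emb[i] are assumed to form a matched pair.
    """
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    # Temperature-scaled cosine similarity between every image and every text.
    logits = [[sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
              for u in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_ls(logits, eps) + _cross_entropy_ls(logits_t, eps))


# Perfectly aligned pairs: smoothing keeps the loss from collapsing to zero.
pairs = [[1.0, 0.0], [0.0, 1.0]]
loss_plain = info_nce_ls(pairs, pairs, eps=0.0)
loss_smooth = info_nce_ls(pairs, pairs, eps=0.1)
```

With `eps=0.0` the sketch reduces to the ordinary symmetric InfoNCE loss. With `eps > 0`, a batch that is already perfectly separated still incurs a nonzero loss, which is the usual mechanism by which label smoothing discourages overconfident similarity scores.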