SHI Junjie, ZHAO Ziyan, HE Yongyuan, YAN Longchuan, PENG Yuanlong, MA Rui, WANG Jincheng. A Multi-source Heterogeneous Data Cleaning Scheme for Power Information and Communication Based on Lakehouse Platform[J]. Electric Power Information and Communication Technology, 2023, 21(7): 59-66. DOI: 10.16543/j.2095-641x.electric.power.ict.2023.07.08

A Multi-source Heterogeneous Data Cleaning Scheme for Power Information and Communication Based on Lakehouse Platform

  • Abstract: To clean the many kinds of multi-source heterogeneous data generated by power information and communication systems, this paper proposes an interactive cleaning scheme based on denial constraints (DCs), built on the lakehouse platform MHDP (multi-source heterogeneous data lake platform). First, the structured-data cleaning engine HoloClean is improved to achieve better results on small datasets; in tests, the improvement significantly raises the F1 score on a benchmark power communication dataset. A general scheme for parsing the various types of multi-source heterogeneous data, which can effectively reconstruct the data, is also proposed. Second, an interactive system with real-time feedback is implemented; it parses and visualizes the metadata of the data and provides a graphical interface through which users build constraint rules and take part in the cleaning work. Finally, the cleaned data is saved in the original data format, and the original data is not removed. Experimental results show that the proposed solution cleans multi-source heterogeneous data effectively, with high accuracy and good usability.
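As a rough illustration of what a denial constraint expresses, the sketch below checks one such rule in plain Python: no two records may share a device identifier while reporting different sites. The records, the column names `device_id` and `site`, and the helper functions are all hypothetical and chosen only for illustration; they are not taken from the paper, and the code does not use or reproduce HoloClean.

```python
from itertools import combinations

# Hypothetical records; the column names are illustrative, not from the paper.
rows = [
    {"device_id": "D1", "site": "Substation-A"},
    {"device_id": "D1", "site": "Substation-B"},
    {"device_id": "D2", "site": "Substation-C"},
]


def violates(t1, t2):
    """Assumed denial constraint: two rows must not share a device_id
    while disagreeing on site."""
    return t1["device_id"] == t2["device_id"] and t1["site"] != t2["site"]


def find_violations(records):
    """Return index pairs (i, j) with i < j whose rows jointly violate the constraint."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if violates(a, b)
    ]


violations = find_violations(rows)
print(violations)
```

The pairwise scan is quadratic in the number of records, which is acceptable for a small illustration. A constraint-based cleaning engine would go further and use the detected violations to decide which cells to repair, a step this sketch leaves out.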

     
