Abstract:
Building on the lakehouse MHDP (multi-source heterogeneous data lake platform), this paper proposes an interactive cleaning scheme based on denial constraints (DCs) for multi-source heterogeneous data in power information and communication systems. First, it optimizes HoloClean to perform better on small datasets, which significantly improves the F1 score on power communication data. It also proposes algorithms that parse various types of data and can effectively reconstruct them. Second, it implements an interactive system with real-time feedback that extracts and visualizes basic metadata and lets users take part in the cleaning process by building DCs. Finally, the cleaned data is saved in its original format, and the original data is retained rather than removed. Experimental results show that this solution cleans multi-source heterogeneous data effectively, combining high accuracy with ease of use.