Big Data Cleaning Based on Improved CLOF and Random Forest for Distribution Networks

Liu Jie; Cao Yijia; Li Yong; Guo Yixiu; Deng Wei

doi:10.17775/CSEEJPES.2020.04080

Liu Jie, Cao Yijia, Li Yong, Guo Yixiu, Deng Wei. Big Data Cleaning Based on Improved CLOF and Random Forest for Distribution Networks[J]. CSEE Journal of Power and Energy Systems, 2024, 10(6): 2528-2538. DOI: 10.17775/CSEEJPES.2020.04080


Abstract

This paper studies a big data cleaning method for distribution networks in order to improve data quality. First, a Local Outlier Factor (LOF) algorithm based on DBSCAN clustering is used to detect outliers. Because the LOF threshold is difficult to determine, a method is proposed that calculates the threshold dynamically according to transformer district and time. In addition, the LOF algorithm is combined with a statistical distribution method to reduce the misjudgment rate. To address the diversity and complexity of missing-data forms in power big data, the Random Forest imputation algorithm is improved so that it can be applied to various forms of missing data, especially block missing data and even data that are completely missing in the horizontal or vertical direction. The data used in this paper are real data from 44 transformer districts on a 10 kV line in a distribution network. Experimental results show that the outlier detection is accurate and is suitable for multidimensional power big data of any shape. The improved Random Forest imputation algorithm is suitable for all missing-data forms, with higher imputation accuracy and better model stability. A comparison of network loss prediction using data cleaned by this method against data from which outliers and missing values were simply removed shows that the proposed cleaning method improves the accuracy of network loss prediction by nearly 4%. Moreover, as the proportion of bad data increases, the gap between the prediction accuracy of cleaned data and that of uncleaned data becomes more significant.
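As a rough illustration of the outlier-detection idea summarized in the abstract, the following minimal sketch computes plain k-nearest-neighbour LOF scores and then derives a separate threshold for each group of records. It is not the authors' implementation: the DBSCAN clustering step and the statistical distribution check are omitted, and the threshold rule (group mean plus `c` standard deviations of the LOF scores) is an assumption chosen only to show what a per-district dynamic threshold might look like. The names `lof_scores`, `flag_outliers`, `k`, and `c` are hypothetical.

```python
import math
from statistics import mean, pstdev


def lof_scores(points, k=3):
    """Return the Local Outlier Factor of every point (plain k-NN LOF)."""
    n = len(points)
    if n <= k:
        raise ValueError("need more than k points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point, excluding itself.
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbour.
    k_dist = [dist[i][neighbours[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        # The small floor avoids division by zero for duplicate points.
        lrd.append(k / max(sum(reach), 1e-12))
    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)
    ]


def flag_outliers(groups, k=3, c=2.0):
    """Flag outliers per group using a group-specific LOF threshold.

    `groups` maps a group key (for example a transformer district id, or a
    district and time window pair) to a list of numeric tuples.
    ASSUMPTION: threshold = mean + c * std of that group's LOF scores; the
    paper's actual dynamic threshold rule is not given in the abstract.
    """
    flagged = {}
    for key, points in groups.items():
        scores = lof_scores(points, k)
        threshold = mean(scores) + c * pstdev(scores)
        flagged[key] = [i for i, s in enumerate(scores) if s > threshold]
    return flagged


if __name__ == "__main__":
    # Hypothetical toy data: a tight 3x3 grid of readings plus one distant point.
    grid = [(1.0 + 0.1 * i, 1.0 + 0.1 * j) for i in range(3) for j in range(3)]
    print(flag_outliers({"district_A": grid + [(5.0, 5.0)]}))
```

Computing the threshold inside each group means that a district with naturally noisier readings gets a looser cut-off than a stable one, which is the motivation for a dynamic threshold over a single global value. Note that with very small groups a mean-plus-standard-deviation rule can never exceed a z-score of about the square root of the group size minus one, so `c` has to be chosen with the group size in mind.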
