Research on Distribution System Big Data Preprocessing Technology Based on Apache Spark
Abstract: As the volume of data collected from distribution systems keeps growing, preprocessing that data efficiently has become one of the main problems facing distribution system data analysis. Given the complexity of distribution system big data, this paper proposes a parallel preprocessing method for large-scale data based on Apache Spark. First, a parallel big data computing platform is built with Spark as its computing engine so that distribution system big data can be processed more effectively. Next, several common problems in current distribution system big data are analyzed, and data governance schemes are proposed to address them. The specific workflow for preprocessing distribution system big data is then described in combination with the Spark computing engine. Finally, experiments verify that data preprocessing improves the accuracy of distribution system data prediction and that the distributed computing platform offers a speed advantage in data preprocessing.