A Case Study on Clustering of Datasets Using K-Means Algorithm in Spark

Authors

  • Prateek Srivastava

DOI:

https://doi.org/10.17762/msea.v70i2.2466

Abstract

Clustering, which groups similar data items according to their inherent characteristics, is a crucial step in data analysis. As contemporary datasets grow in volume and complexity, efficient and scalable clustering algorithms are needed to extract meaningful insights from them. In this article, we present a case study on clustering datasets with the K-means algorithm in Spark, a widely used distributed computing framework. The paper begins with an extensive review of the literature on clustering methods and the difficulties posed by large-scale datasets. We highlight Spark's benefits for big data processing and its suitability for implementing the K-means algorithm. We then give a brief overview of Spark and its main features, such as fault tolerance and in-memory processing, and examine how it suits distributed computing and addresses the problems raised by very large datasets. Next, we describe the case study design, covering the choice of experimental design and the steps needed to carry it out. Assessing the scalability and performance of the K-means algorithm in Spark requires a dataset that accurately represents real-world characteristics. The experimental setup consists of configuring the Spark cluster, running the K-means algorithm through the Spark programming interface, and choosing evaluation metrics to gauge the effectiveness of the clustering. The study aims to illustrate the advantages of using Spark for clustering analysis, including its performance, scalability, and usability. We test the K-means algorithm's ability to handle very large datasets by running the clustering experiments on the Spark cluster. In conclusion, this case study highlights Spark's potential for scalable and effective clustering analysis and advances our understanding of clustering algorithms in that environment. The results can help academics and practitioners use Spark's distributed computing capabilities to cluster very large datasets, gain useful insights, and make informed decisions.
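
To make the K-means procedure and the idea of an evaluation metric concrete, the following is a minimal single-machine sketch in plain Python. It is not the Spark implementation used in the study: the function names, the toy dataset, the seed, and the choice of within-cluster sum of squares as the metric are all assumptions made for illustration. A distributed framework would typically run the assignment step in parallel over data partitions and aggregate the per-cluster sums, but that part is not shown here.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: List[Point], centroids: List[Point]) -> List[int]:
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points: List[Point], labels: List[int], centroids: List[Point]) -> List[Point]:
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points: List[Point], k: int, max_iter: int = 100, seed: int = 0):
    """Lloyd-style K-means with k initial centroids sampled from the data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        new_labels = assign(points, new_centroids)
        converged = new_labels == labels
        centroids, labels = new_centroids, new_labels
        if converged:
            break
    return centroids, labels


def wcss(points: List[Point], centroids: List[Point], labels: List[int]) -> float:
    """Within-cluster sum of squares: lower means tighter clusters."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


# Hypothetical toy dataset: two well-separated groups of four points each.
points: List[Point] = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
centroids, labels = kmeans(points, k=2)
score = wcss(points, centroids, labels)
```

The two steps are kept as separate functions because the assignment step depends only on each point and the current centroids, which is the part that lends itself to parallel execution over partitions of a large dataset.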

Published

2021-02-26

How to Cite

Srivastava, P. (2021). A Case Study on Clustering of Datasets Using K-Means Algorithm in Spark. Mathematical Statistician and Engineering Applications, 70(2), 1741 –. https://doi.org/10.17762/msea.v70i2.2466

Section

Articles