Retail data predictive analysis using machine learning models
Citation: Güner, M. (2021). Retail Data Predictive Analysis Using Machine Learning Models. MEF Üniversitesi Fen Bilimleri Enstitüsü, Bilişim Teknolojileri Yüksek Lisans Programı. pp. 1-39
Machine Learning (ML) is a popular field that deals with training a system on data (experience), having it perform a task (such as regression or classification), and evaluating it against the desired performance metrics. ML automatically extracts useful and meaningful insights from data. ML models for sales prediction apply computational intelligence in many real-world domains, such as the stock market, production, economics, weather, retail, and census analysis. Sales prediction can be viewed as a regression problem to which various algorithms can be applied. In this project, real-life data has been analyzed to predict sales for four product categories: Cold Cereal, Bag Snacks, Oral Hygiene Products, and Frozen Pizza. Exploratory Data Analysis (EDA) has been applied to the dataset to support accurate predictions even in an unpredictable environment. The phases of EDA used in this project are Data Preprocessing and Analysis, Feature Selection and Feature Extraction, Model Building and Regression Analysis, Clustering, Time Series Analysis, and Model Evaluation using performance metrics. For outlier detection, the interquartile range (IQR) method has been used. For filter-based feature selection, univariate feature analysis with SelectKBest and SelectPercentile, together with a Decision Tree Regressor, has been used. For wrapper-based feature selection, the Sequential Feature Selector method has been deployed. For regression analysis, Linear Regression, XGBoost Regression, and Support Vector Regression (SVR) have been analyzed. The K-Means clustering algorithm has been applied to the dataset to generate four clusters. In the time series analysis, the week end date and average weekly basket attributes have been analyzed, and the sequential data has been rendered over a given time period.
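As an illustration of the IQR outlier rule mentioned above, the following is a minimal sketch, not the project's actual code. It assumes the conventional multiplier of 1.5 and uses only the Python standard library; the function names and the sample values are hypothetical and do not come from the thesis dataset.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule.

    k=1.5 is the conventional multiplier; it is an assumption here,
    since the abstract does not state which value was used.
    """
    # method="inclusive" interpolates linearly between sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Made-up weekly unit sales with one obviously extreme week
weekly_units = [10, 12, 11, 13, 12, 11, 95]
fences = iqr_bounds(weekly_units)
cleaned = drop_outliers(weekly_units)
print(fences, cleaned)
```

Whether flagged points should be dropped, capped, or inspected by hand depends on the data; the sketch simply drops them.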
In the model evaluation phase, the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R² and adjusted R² have been calculated and validated. The project has been implemented with Anaconda, an open-source distribution that includes the Jupyter Notebook platform for scientific computing. The Python programming language has been used with packages such as NumPy, pandas, and scikit-learn.
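The evaluation metrics listed above can be sketched in a few lines. This is an illustrative implementation using the standard definitions, not the project's code. The function name, the sample values, and the predictor count are hypothetical. Adjusted R² is computed as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute MAE, MSE, RMSE, R^2 and adjusted R^2 for a regression model.

    n_features is the number of predictors p used by the model; it is
    needed only for the adjusted R^2 correction.
    """
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if n - n_features - 1 <= 0:
        raise ValueError("adjusted R^2 needs more samples than predictors + 1")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)

    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)              # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adj_r2}


# Made-up targets and predictions from a model with a single predictor
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0], n_features=1)
print(metrics)
```

Adjusted R² penalizes additional predictors, so it is the more informative figure when comparing models built on different feature subsets.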