AGRICULTURAL COMMODITY PRICE PREDICTION WITH THE HELP OF DATA SCIENCE
(26 May 2021) Tanky
Note: This is the second blog in the series ‘Data Science in Daily Lives’. It looks at the feasibility of applying data science to agricultural commodity price prediction, for those of us who also try our hands at agriculture. Machine learning and time series models can be developed and deployed to predict commodity prices under the assumption that the future follows past patterns (trend and seasonality), but prices are also heavily subject to unpredictable external events such as developing political situations or sudden climatic upheavals. The models thus developed may therefore be more relevant over the longer term than in the short term. This is a small effort at price prediction for small cardamom cultivated in Kerala, aimed at alleviating the worries of cardamom cultivators, including my Father.
Agricultural Commodity Pricing. The wide variations of agricultural commodity prices have a significant impact on people’s daily lives as well as on the inputs and outputs of agricultural production.
In recent years, the price fluctuations of agricultural commodities have become more severe and have had negative effects on society. For consumers, an excessive increase in prices places a great burden on food expenditure and thus affects their general welfare. For agriculturists, large price fluctuations increase the uncertainty of production and add to the risks that must be managed. I have seen these dynamics play out in practice with small cardamom, which is widely cultivated in the high ranges of Kerala (Idukki District). I have seen many agriculturists, including my Father, store a full year's cardamom produce in the hope of better prices in the future, at the risk of natural vagaries doing more harm than good to the stored produce. The main purpose of agricultural commodity price forecasting is to allow cultivators to make better-informed decisions and to manage price risk.
Simple price forecast models such as the naïve model have performed well in the past in predicting agricultural commodity prices. Other models, such as autoregressive integrated moving average (ARIMA) models and composite models, give more accurate estimates. However, as the accuracy increases, so does the statistical complexity. Practical applications of the more complex models are limited by the absence of requisite data and the higher expense of data acquisition. To overcome these limitations, machine learning (ML) models can be used as an alternative to complex forecast models.
Forecasts are made under the assumption that market and other conditions in the future will continue to be very much like the present. This does not mean that there will be no changes, but that any change will be gradual rather than drastic. However, a shock like the 2008 financial crash that followed the bursting of the US housing bubble, or the Indian demonetization exercise of 2016, could send all forecasts into a tizzy.
Data
The data set used here is obtained from the archives of the Spices Board of India (Archive - Daily Auction Price of Small Cardamom | Spices Board, indianspices.com). The dataset collates information on the daily auctions at commodity exchanges in Kerala, with relevant data such as the quantity of cardamom received at the exchange, the quantity sold, the maximum price and the average price.
Time Series Approach
Time series analysis comprises methods for analysing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
Dataset Pre-processing. This step mainly consists of cleaning the dataset: looking for null values and duplicate data points and getting rid of them. An outlier check shows that most variables of the dataset have outliers, so we apply an outlier treatment to remove them. Next, we convert the date values, which are stored as strings in the dataset, to the datetime data type. Thereafter we sort the dataset into ascending order of dates. Finally, we obtain the cleaned dataset containing only the average price variable. Detailed steps are given in the referred Google Colab notebook. The cleaned dataset looks as follows:
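The cleaning steps above can be sketched in pandas roughly as follows. This is only an illustration on a few made-up rows: the column name "Date", the ISO date strings and the IQR rule for outliers are assumptions of mine, and the notebook may do each step differently. Only the column name 'Avg.Price(Rs./Kg)' is taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Spices Board archive.
# The "Date" column name and the date format are assumptions.
raw = pd.DataFrame({
    "Date": ["2020-01-03", "2020-01-01", "2020-01-02",
             "2020-01-02", "2020-01-04", "2020-01-05"],
    "Avg.Price(Rs./Kg)": [1510.0, 1500.0, 1520.0, 1520.0, None, 9000.0],
})

def clean_prices(df, price_col="Avg.Price(Rs./Kg)", date_col="Date"):
    """Drop nulls and duplicates, remove IQR outliers, parse and sort dates."""
    out = df.dropna(subset=[price_col]).drop_duplicates().copy()
    # Convert the string dates to real datetime values.
    out[date_col] = pd.to_datetime(out[date_col])
    # IQR rule: one common outlier treatment (an assumption here).
    q1 = out[price_col].quantile(0.25)
    q3 = out[price_col].quantile(0.75)
    iqr = q3 - q1
    keep = out[price_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Sort in ascending order of dates and keep only the average price.
    out = out[keep].sort_values(date_col).set_index(date_col)
    return out[[price_col]]

cleaned = clean_prices(raw)
print(cleaned)
```

On the sample rows, the null row, the duplicated row and the 9000 spike are dropped, leaving three rows in date order.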
Past Price Variations. There have been considerable variations in the average price of small cardamom over the past 6 years. Most of the outliers occurred in 2019 and early 2020. A plot of the variation is shown below:
Monthly price variations. A plot of the monthly price variations is shown below:
Trend & Seasonality. A decomposition plot of the time series data is shown below. The trend plot gives an impression of an increasing trend from 2015 to mid-2019 and a decreasing trend thereafter (a case of the trend component simply following the actual curve). It can be deduced that, in this case, there is no consistent trend and no seasonality pattern, and that the residuals show no remaining structure (white noise).
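To make the idea of a decomposition concrete, here is a hand-rolled classical additive decomposition (series = trend + seasonal + residual) written with NumPy and pandas. It is a sketch on a synthetic monthly series, not the cardamom data, and it is not necessarily the routine used in the notebook; the function name decompose_additive and the 12-month period are my own choices.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption, not the cardamom data):
# a straight-line trend plus a repeating 12-month pattern.
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
months = np.arange(48)
series = pd.Series(1000 + 10 * months + 50 * np.sin(2 * np.pi * months / 12),
                   index=idx)

def decompose_additive(y, period=12):
    """Classical additive decomposition: y = trend + seasonal + residual."""
    half = period // 2
    # Centred moving average over one full (even-length) period:
    # half weight on the two end points, full weight in between.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = pd.Series(np.nan, index=y.index)
    trend.iloc[half:len(y) - half] = np.convolve(y.values, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values by position in the cycle, then centre them.
    pos = np.arange(len(y)) % period
    seasonal_means = detrended.groupby(pos).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[pos].values, index=y.index)
    residual = y - trend - seasonal
    return trend, seasonal, residual

trend, seasonal, residual = decompose_additive(series)
print(residual.dropna().abs().max())
```

On this synthetic series the residual is essentially zero because the data were built from exactly a trend and a seasonal pattern. On real price data, a trend component that merely shadows the raw curve and a seasonal component with no stable shape are the kind of signs that lead to the conclusion drawn above.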
Train-Test Split. We divide the data into train and test sets. The train data comprises all the data points up to 31 Dec 2019, and the test dataset comprises the remaining data points.
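A date-based split like this one can be sketched as below. The 14 daily prices are invented purely so the example runs; only the 31 Dec 2019 cut-off comes from the text.

```python
import pandas as pd

# Made-up daily prices around the cut-off date (illustrative only).
idx = pd.date_range("2019-12-25", periods=14, freq="D")
prices = pd.Series([float(p) for p in range(1500, 1514)],
                   index=idx, name="Avg.Price(Rs./Kg)")

cutoff = pd.Timestamp("2019-12-31")
train = prices[prices.index <= cutoff]   # everything up to 31 Dec 2019
test = prices[prices.index > cutoff]     # the remaining data points

print(len(train), len(test))
```

Splitting by date rather than at random keeps the test period strictly after the training period, which is what a forecast has to face in practice.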
Simple Exponential Smoothing (SES) Model. Since both trend and seasonality are absent, the SES Model can be used for the prediction. SES is a short-range forecasting method that assumes a reasonably stable mean in the data with no trend (consistent growth or decline). The SES Model is fitted on the train dataset, and thereafter the test dataset is used to evaluate the prediction. The prediction for this dataset arrived at using SES is INR 1703. The root mean square error (RMSE) of the prediction is 355.
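The mechanics of SES and RMSE fit in a few lines of plain Python. The sketch below is not the notebook's code (which may well use a library routine); the prices and the smoothing factor alpha = 0.5 are invented for illustration.

```python
import math

def ses_forecast(train, alpha):
    """Simple exponential smoothing: return the final smoothed level.

    The level starts at the first observation and is updated as
    level = alpha * y + (1 - alpha) * level for each later observation.
    SES forecasts this same level for every future step.
    """
    level = train[0]
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up prices in INR per kg (illustrative only).
train = [1600.0, 1650.0, 1700.0, 1680.0]
test = [1690.0, 1710.0]

flat = ses_forecast(train, alpha=0.5)
preds = [flat] * len(test)      # the same value for every test point
print(flat, rmse(test, preds))
```

Because SES carries no trend term, its forecast is one flat value for the whole horizon, which is why the model's output can be quoted as a single price.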
We thereafter try the Double Exponential Smoothing (DES) Model, but its RMSE is higher, indicating that the SES Model gives the better time series forecast of the two.
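For comparison, double exponential smoothing (Holt's linear method) adds a smoothed trend term to the level. The plain-Python sketch below uses invented prices and invented smoothing factors, and a simple initialisation (first value as level, first difference as trend) that is my own assumption.

```python
def holt_forecast(train, alpha, beta, steps):
    """Holt's linear (double exponential) smoothing.

    Returns a list of `steps` forecasts: level + h * trend for h = 1..steps.
    """
    level = train[0]
    trend = train[1] - train[0]          # simple initial trend (assumption)
    for y in train[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, steps + 1)]

# On a perfectly linear series the method continues the line.
linear = holt_forecast([100.0, 110.0, 120.0, 130.0], 0.5, 0.5, steps=2)

# On made-up prices that rise and then level off, the trend term
# keeps extrapolating upwards.
bumpy = holt_forecast([1600.0, 1650.0, 1700.0, 1680.0], 0.5, 0.5, steps=2)
print(linear, bumpy)
```

When a series has no persistent trend, the extrapolated slope tends to overshoot, which is one plausible reason for DES scoring a higher RMSE than SES on data like this.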
Machine Learning Approach
Since the model is being created to predict 30 days ahead, an additional variable, ‘prediction’, is created. It is populated with the values of the variable ‘Avg.Price(Rs./Kg)’ shifted by 30 data points, so that each row pairs a price with the price recorded 30 data points later.
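In pandas, a shifted target column of this kind is typically built with shift. The 40 prices below are invented; the two column names are the ones used in the text.

```python
import pandas as pd

horizon = 30  # number of data points to look ahead

# Made-up prices: 1000, 1001, ..., 1039 (illustrative only).
df = pd.DataFrame({"Avg.Price(Rs./Kg)": [float(p) for p in range(1000, 1040)]})

# A negative shift pulls future values back onto the current row.
df["prediction"] = df["Avg.Price(Rs./Kg)"].shift(-horizon)

print(df.head(3))
print(df["prediction"].isna().sum())
```

The last 30 rows have no future value to borrow, so their ‘prediction’ is empty (NaN). Those are the rows that have to be left out when the inputs and targets are defined.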
Train-Test Split. Thereafter X (independent variable) and y (dependent variable) are defined. X pertains to the data points in variable ‘Avg.Price(Rs./Kg)’, duly omitting the last 30 data points. Further, y pertains to the data points in variable ‘prediction’. A train-test split is undertaken to facilitate running of ML models.
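A minimal sketch of defining X and y and splitting them with scikit-learn follows. The prices are synthetic, and the 80/20 ratio and random_state are assumptions of mine rather than values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

horizon = 30

# Synthetic prices (illustrative only), with the shifted target column.
df = pd.DataFrame({"Avg.Price(Rs./Kg)": np.linspace(1000, 1990, 100)})
df["prediction"] = df["Avg.Price(Rs./Kg)"].shift(-horizon)

# X: the price column without its last 30 rows (2-D, as scikit-learn expects).
X = df[["Avg.Price(Rs./Kg)"]].to_numpy()[:-horizon]
# y: the matching 'prediction' values, which are all non-empty.
y = df["prediction"].to_numpy()[:-horizon]

# test_size and random_state are assumed values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

By default train_test_split shuffles the rows. Passing shuffle=False keeps them in date order, which may be preferable for time-ordered data.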
Ensemble Random Forest Regressor Model. A random forest regressor is a meta estimator that fits several decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. We define the ensemble random forest regressor model and train it by fitting it on the training dataset. The model is thereafter used for predicting on the test dataset. The model gives an accuracy score of 93% on the training set and 88% on the test dataset (for a regressor, such a score is usually the coefficient of determination, R², rather than a classification accuracy). The predictions are available in the referred Google Colab notebook. The modest gap between the training and test scores suggests that the model is a robust one, with no serious overfitting or underfitting.
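The fit, predict and score cycle for a random forest regressor in scikit-learn looks roughly like this. The data are synthetic and the hyperparameters (n_estimators=100, the random seeds) are assumptions of mine, so the scores it prints have nothing to do with the 93% and 88% reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: today's price (X) and a noisy price 30 points later (y).
rng = np.random.default_rng(0)
X = rng.uniform(800, 4000, size=(300, 1))
y = 0.9 * X[:, 0] + 150 + rng.normal(0, 100, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are assumed values, not those of the notebook.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
# For regressors, score() returns the coefficient of determination (R^2).
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(round(train_r2, 3), round(test_r2, 3))
```

Comparing the two scores is the overfitting check referred to above: a training score far higher than the test score would suggest the trees have memorised the training data.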
Support Vector Regression Model. Support Vector Machines (SVMs) are well known for classification problems. Their use in regression, where the models are known as Support Vector Regression (SVR), is not as well documented. SVR gives us the flexibility to define how much error is acceptable in our model (the epsilon margin) and finds an appropriate line (or hyperplane in higher dimensions) to fit the data. We define the SVR model and train it by fitting it on the training dataset. The model is thereafter used for predicting on the test dataset. The model gives an accuracy score of 88% on the training set and 91% on the test dataset. The predictions are available in the referred Google Colab notebook. As seen from the scores, this model also appears to be a robust one, with no overfitting or underfitting.
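A comparable sketch for SVR is given below. The data are again synthetic, and the kernel, C and epsilon values are assumptions of mine, as is the feature scaling step, which the text does not mention but which SVR generally benefits from.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: today's price (X) and a noisy price 30 points later (y).
rng = np.random.default_rng(0)
X = rng.uniform(800, 4000, size=(300, 1))
y = 0.9 * X[:, 0] + 150 + rng.normal(0, 100, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# epsilon is the width of the tube inside which errors are not penalised;
# C controls how strongly errors outside the tube are penalised.
# All three settings are assumed values, not those of the notebook.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1000.0, epsilon=50.0))
svr.fit(X_train, y_train)

svr_predictions = svr.predict(X_test)
svr_train_r2 = svr.score(X_train, y_train)   # R^2 on the training set
svr_test_r2 = svr.score(X_test, y_test)      # R^2 on the test set
print(round(svr_train_r2, 3), round(svr_test_r2, 3))
```

Here epsilon=50.0 means that errors of up to 50 rupees per kg are treated as acceptable during fitting, which is the "how much error is acceptable" knob described above.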
The Google Colab link is added below for reference.
https://colab.research.google.com/drive/17W3OVCv8Y4lkwUiwzVx11Ns9L7DBuNDl?usp=sharing