Today, large shopping centers such as big malls and marts record data related to the sales of their items or products as an important step toward predicting sales and anticipating future demand, which helps with inventory management. Understanding what role certain properties of an item play and how they affect its sales is imperative for any retail business.
The Data Scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. Using this data, BigMart is trying to understand the properties of products and stores which play a key role in increasing sales.
The objective is to build a predictive model that can estimate the sales of each product at a particular store, and then provide actionable recommendations to the BigMart sales team about the properties of products and stores that play a key role in increasing sales.
Item_Identifier : Unique product ID
Item_Weight : Weight of the product
Item_Fat_Content : Whether the product is low fat or not
Item_Visibility : The % of the total display area of all products in a store allocated to the particular product
Item_Type : The category to which the product belongs
Item_MRP : Maximum Retail Price (list price) of the product
Outlet_Identifier : Unique store ID
Outlet_Establishment_Year : The year in which the store was established
Outlet_Size : The size of the store in terms of ground area covered
Outlet_Location_Type : The type of city in which the store is located
Outlet_Type : Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales : Sales of the product in the particular store. This is the outcome variable to be predicted.
We have two datasets - train (8,523 rows) and test (5,681 rows). The training dataset has both the input and output variables. We need to predict the sales for the test dataset.
Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.
# Importing libraries for data manipulation
import numpy as np
import pandas as pd
# Importing libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Importing libraries for building linear regression model
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Importing libraries for scaling the data
from sklearn.preprocessing import MinMaxScaler
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Loading both train and test datasets
train_df = pd.read_csv('Train.csv')
test_df = pd.read_csv('Test.csv')
# Checking the first 5 rows of the dataset
train_df.head()
Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FDA15 | 9.30 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380 |
1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
2 | FDN15 | 17.50 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700 |
3 | FDX07 | 19.20 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800 |
4 | NCD19 | 8.93 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
Observations:
# Dropping the identifier columns, as they are IDs rather than predictive features
train_df = train_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis = 1)
test_df = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis = 1)
Now, let's find out some more information about the training dataset, i.e., the total number of observations in the dataset, columns and their data types, etc.
train_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8523 entries, 0 to 8522 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Item_Weight 7060 non-null float64 1 Item_Fat_Content 8523 non-null object 2 Item_Visibility 8523 non-null float64 3 Item_Type 8523 non-null object 4 Item_MRP 8523 non-null float64 5 Outlet_Establishment_Year 8523 non-null int64 6 Outlet_Size 6113 non-null object 7 Outlet_Location_Type 8523 non-null object 8 Outlet_Type 8523 non-null object 9 Item_Outlet_Sales 8523 non-null float64 dtypes: float64(4), int64(1), object(5) memory usage: 666.0+ KB
Observations:
As we have already seen, two columns have missing values in the dataset. Let's check the percentage of missing values using the below code.
(train_df.isnull().sum() / train_df.shape[0])*100
Item_Weight 17.165317 Item_Fat_Content 0.000000 Item_Visibility 0.000000 Item_Type 0.000000 Item_MRP 0.000000 Outlet_Establishment_Year 0.000000 Outlet_Size 28.276428 Outlet_Location_Type 0.000000 Outlet_Type 0.000000 Item_Outlet_Sales 0.000000 dtype: float64
The percentage of missing values for the columns Item_Weight and Outlet_Size is ~17% and ~28% respectively. Now, let's see next how to treat these missing values.
Now that we understand the business problem and have loaded the datasets, the next step is to build a better understanding of the data: the distribution of each variable, the relationships that exist between variables, and so on. If there are data anomalies such as missing values or outliers, we also need to decide how to treat them to prepare the dataset for building the predictive model.
Let us now start exploring the dataset by performing univariate analysis, i.e., analyzing/visualizing one variable at a time. Data visualization is an important skill, and we need to decide which charts to plot to understand the data best: the appropriate chart depends on whether a variable is categorical or numerical, and on the relationship between variables we want to show.
Let's start with analyzing the categorical variables present in the data. There are five categorical variables in this dataset and we will create univariate bar charts for each of them to check their distribution.
Below we are creating 4 subplots to plot four out of the five categorical variables in a single frame.
fig, axes = plt.subplots(2, 2, figsize = (18, 10))
fig.suptitle('Bar plot for all categorical variables in the dataset')
sns.countplot(ax = axes[0, 0], x = 'Item_Fat_Content', data = train_df, color = 'blue',
order = train_df['Item_Fat_Content'].value_counts().index);
sns.countplot(ax = axes[0, 1], x = 'Outlet_Size', data = train_df, color = 'blue',
order = train_df['Outlet_Size'].value_counts().index);
sns.countplot(ax = axes[1, 0], x = 'Outlet_Location_Type', data = train_df, color = 'blue',
order = train_df['Outlet_Location_Type'].value_counts().index);
sns.countplot(ax = axes[1, 1], x = 'Outlet_Type', data = train_df, color = 'blue',
order = train_df['Outlet_Type'].value_counts().index);
Observations:
Below we are analyzing the categorical variable Item_Type.
fig = plt.figure(figsize = (18, 6))
sns.countplot(x = 'Item_Type', data = train_df, color = 'blue', order = train_df['Item_Type'].value_counts().index);
plt.xticks(rotation = 45);
Observation:
Before we move ahead with the univariate analysis for the numerical variables, let's first fix the data issues that we have found for the column Item_Fat_Content.
In the code below, we replace the categories low fat and LF with Low Fat using a lambda function, and we also replace the category reg with Regular.
train_df['Item_Fat_Content'] = train_df['Item_Fat_Content'].apply(lambda x: 'Low Fat' if x == 'low fat' or x == 'LF' else x)
train_df['Item_Fat_Content'] = train_df['Item_Fat_Content'].apply(lambda x: 'Regular' if x == 'reg' else x)
Whatever data preparation steps we apply to the training data must also be applied to the test data. So, below we perform the same transformations on the test dataset.
test_df['Item_Fat_Content'] = test_df['Item_Fat_Content'].apply(lambda x: 'Low Fat' if x == 'low fat' or x == 'LF' else x)
test_df['Item_Fat_Content'] = test_df['Item_Fat_Content'].apply(lambda x: 'Regular' if x == 'reg' else x)
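As a side note, the same cleanup can be written more compactly with pandas' replace method, which maps several raw labels at once. Below is a minimal, equivalent sketch (not used above), assuming the same train_df and test_df DataFrames.
# Equivalent cleanup using a single mapping with pandas' replace
fat_content_map = {'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'}
for df in (train_df, test_df):
    df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(fat_content_map)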
Below, we analyze all the numerical variables present in the data. Since we want to visualize the distribution of one numerical variable at a time, a histogram is a natural choice.
fig, axes = plt.subplots(1, 3, figsize = (20, 6))
fig.suptitle('Histogram for all numerical variables in the dataset')
sns.histplot(x = 'Item_Weight', data = train_df, kde = True, ax = axes[0]);
sns.histplot(x = 'Item_Visibility', data = train_df, kde = True, ax = axes[1]);
sns.histplot(x = 'Item_MRP', data = train_df, kde = True, ax = axes[2]);
Now, let's move ahead with bivariate analysis to understand how variables are related to each other and if there is a strong relationship between dependent and independent variables present in the training dataset.
In the plot below, we analyze the variables Outlet_Establishment_Year and Item_Outlet_Sales. Since Outlet_Establishment_Year represents a time component, a line plot is a suitable chart for this relationship.
fig = plt.figure(figsize = (18, 6))
sns.lineplot(x = 'Outlet_Establishment_Year', y = 'Item_Outlet_Sales', data = train_df, ci = None, estimator = 'mean');
Observations:
Next, we look for linear correlations between the variables. This will help us know which numerical variables are correlated with the target variable. It also lets us detect multicollinearity, i.e., which pairs of independent variables are correlated with each other.
fig = plt.figure(figsize = (18, 6))
sns.heatmap(train_df.corr(), annot = True);
plt.xticks(rotation = 45);
Observations:
Next, we are creating the bivariate scatter plots to check relationships between the pair of independent and dependent variables.
fig, axes = plt.subplots(1, 3, figsize = (20, 6))
fig.suptitle('Bi-variate scatterplot for all numerical variables with the dependent variable')
sns.scatterplot(x = 'Item_Weight', y = 'Item_Outlet_Sales', data = train_df, ax = axes[0]);
sns.scatterplot(x = 'Item_Visibility', y = 'Item_Outlet_Sales', data = train_df, ax = axes[1]);
sns.scatterplot(x = 'Item_MRP', y = 'Item_Outlet_Sales', data = train_df, ax = axes[2]);
Observations:
Data Description:
Observations from EDA:
Here, we impute the missing values for the variable Item_Weight. There are many ways to impute missing values: by the mean or median, or using more advanced imputation algorithms such as KNN. Here, however, we first look for relationships between Item_Weight and the other variables in the dataset to guide the imputation.
Also, after imputing the missing values, the overall distribution of the variable should not change significantly.
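For reference, the simplest baseline would be to impute Item_Weight with a single summary statistic such as the training median. We do not use this approach here, but a minimal sketch of it (applied to a copy of the data) would look as follows.
# Baseline alternative (not used here): impute Item_Weight with the training median
median_weight = train_df['Item_Weight'].median()
train_df_median_imputed = train_df.copy()
train_df_median_imputed['Item_Weight'] = train_df_median_imputed['Item_Weight'].fillna(median_weight)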
fig = plt.figure(figsize = (18, 3))
sns.heatmap(train_df.pivot_table(index = 'Item_Fat_Content', columns = 'Item_Type', values = 'Item_Weight'), annot = True);
plt.xticks(rotation = 45);
Observation:
fig = plt.figure(figsize = (18, 3))
sns.heatmap(train_df.pivot_table(index = 'Outlet_Type', columns = 'Outlet_Location_Type', values = 'Item_Weight'), annot = True);
plt.xticks(rotation = 45);
Observation:
fig = plt.figure(figsize = (18, 3))
sns.heatmap(train_df.pivot_table(index = 'Item_Fat_Content', columns = 'Outlet_Size', values = 'Item_Weight'), annot = True);
plt.xticks(rotation = 45);
Observation:
We will impute the missing values using a uniform distribution with parameters a=10 and b=14, as shown below:
item_weight_indices_to_be_updated = train_df[train_df['Item_Weight'].isnull()].index
train_df.loc[item_weight_indices_to_be_updated, 'Item_Weight'] = np.random.uniform(10, 14,
len(item_weight_indices_to_be_updated))
Performing the same transformation on the test dataset.
item_weight_indices_to_be_updated = test_df[test_df['Item_Weight'].isnull()].index
test_df.loc[item_weight_indices_to_be_updated, 'Item_Weight'] = np.random.uniform(10, 14,
len(item_weight_indices_to_be_updated))
Next, we will be imputing missing values for the column Outlet_Size. Below, we are creating two different datasets - one where we have non-null values for the column Outlet_Size and in the other dataset, all the values of the column Outlet_Size are missing. We then check the distribution of the other variables in the dataset where Outlet_Size is missing to identify if there is any pattern present or not.
outlet_size_data = train_df[train_df['Outlet_Size'].notnull()]
outlet_size_missing_data = train_df[train_df['Outlet_Size'].isnull()]
fig, axes = plt.subplots(1, 3, figsize = (18, 6))
fig.suptitle('Bar plot for all categorical variables in the dataset where the variable Outlet_Size is missing')
sns.countplot(ax = axes[0], x = 'Outlet_Type', data = outlet_size_missing_data, color = 'blue',
order = outlet_size_missing_data['Outlet_Type'].value_counts().index);
sns.countplot(ax = axes[1], x = 'Outlet_Location_Type', data = outlet_size_missing_data, color = 'blue',
order = outlet_size_missing_data['Outlet_Location_Type'].value_counts().index);
sns.countplot(ax = axes[2], x = 'Item_Fat_Content', data = outlet_size_missing_data, color = 'blue',
order = outlet_size_missing_data['Item_Fat_Content'].value_counts().index);
Observation:
Now, we are creating a cross-tab of all the above categorical variables against the column Outlet_Size, where we want to impute the missing values.
fig = plt.figure(figsize = (18, 3))
sns.heatmap(pd.crosstab(index = outlet_size_data['Outlet_Type'], columns = outlet_size_data['Outlet_Size']), annot = True, fmt = 'g')
plt.xticks(rotation = 45);
Observations:
fig = plt.figure(figsize = (18, 3))
sns.heatmap(pd.crosstab(index = outlet_size_data['Outlet_Location_Type'], columns = outlet_size_data['Outlet_Size']), annot = True, fmt = 'g')
plt.xticks(rotation = 45);
Observation:
fig = plt.figure(figsize = (18, 3))
sns.heatmap(pd.crosstab(index = outlet_size_data['Item_Fat_Content'], columns = outlet_size_data['Outlet_Size']), annot = True, fmt = 'g')
plt.xticks(rotation = 45);
Observation:
Now, we will use the patterns we have from the variables Outlet_Type and Outlet_Location_Type to impute the missing values for the column Outlet_Size.
Below, we are identifying the indices in the DataFrame where Outlet_Size is null/missing and Outlet_Type is Grocery Store, so that we can replace those missing values with the value Small based on the pattern we have identified in above visualizations. Similarly, we are also identifying the indices in the DataFrame where Outlet_Size is null/missing and Outlet_Location_Type is Tier 2, to impute those missing values.
grocery_store_indices = train_df[train_df['Outlet_Size'].isnull()].query(" Outlet_Type == 'Grocery Store' ").index
tier_2_indices = train_df[train_df['Outlet_Size'].isnull()].query(" Outlet_Location_Type == 'Tier 2' ").index
Now, we are updating those indices for the column Outlet_Size with the value Small.
train_df.loc[grocery_store_indices, 'Outlet_Size'] = 'Small'
train_df.loc[tier_2_indices, 'Outlet_Size'] = 'Small'
Performing the same transformations on the test dataset.
grocery_store_indices = test_df[test_df['Outlet_Size'].isnull()].query(" Outlet_Type == 'Grocery Store' ").index
tier_2_indices = test_df[test_df['Outlet_Size'].isnull()].query(" Outlet_Location_Type == 'Tier 2' ").index
test_df.loc[grocery_store_indices, 'Outlet_Size'] = 'Small'
test_df.loc[tier_2_indices, 'Outlet_Size'] = 'Small'
After we have imputed the missing values, let's check again if we still have missing values in both train and test datasets.
train_df.isnull().sum()
Item_Weight 0 Item_Fat_Content 0 Item_Visibility 0 Item_Type 0 Item_MRP 0 Outlet_Establishment_Year 0 Outlet_Size 0 Outlet_Location_Type 0 Outlet_Type 0 Item_Outlet_Sales 0 dtype: int64
test_df.isnull().sum()
Item_Weight 0 Item_Fat_Content 0 Item_Visibility 0 Item_Type 0 Item_MRP 0 Outlet_Establishment_Year 0 Outlet_Size 0 Outlet_Location_Type 0 Outlet_Type 0 dtype: int64
There are no missing values in the datasets.
Now that we have completed the data understanding and data preparation steps, and before starting with the modeling task, we note that some potentially useful features are not present in the dataset but can be created from the existing columns; such derived features may have predictive power for sales. This step of creating new features from existing ones is known as Feature Engineering. We start with a hypothesis: as a store gets older, its sales increase. How do we define old? We know the establishment year, and this data was collected in 2013, so the age of a store can be found by subtracting the establishment year from 2013. This is what we do in the code below.
We are creating a new feature Outlet_Age which indicates how old the outlet is.
train_df['Outlet_Age'] = 2013 - train_df['Outlet_Establishment_Year']
test_df['Outlet_Age'] = 2013 - test_df['Outlet_Establishment_Year']
fig = plt.figure(figsize = (18, 6))
sns.boxplot(x = 'Outlet_Age', y = 'Item_Outlet_Sales', data = train_df);
Observations:
Now that we have analyzed all the variables in the dataset, we are ready to start building the model. We have observed that not all the independent variables may be important for predicting the outcome variable. To begin with, we will use all of them and then, based on the model summary, decide which variables to remove. Model building is an iterative task.
# We are removing the outcome variable from the feature set
# Also removing the variable Outlet_Establishment_Year, as we have created a new variable Outlet_Age
train_features = train_df.drop(['Item_Outlet_Sales', 'Outlet_Establishment_Year'], axis = 1)
# And then we are extracting the outcome variable separately
train_target = train_df['Item_Outlet_Sales']
Whenever we have categorical variables as independent variables, we need to create a one-hot encoded representation of them (also known as dummy variables). The code below creates dummy variables and drops the first category of each variable, which becomes the reference category. The reference category helps in interpreting the linear regression coefficients, as we will see later.
# Creating dummy variables for the categorical variables
train_features = pd.get_dummies(train_features, drop_first = True)
train_features.head()
Item_Weight | Item_Visibility | Item_MRP | Outlet_Age | Item_Fat_Content_Regular | Item_Type_Breads | Item_Type_Breakfast | Item_Type_Canned | Item_Type_Dairy | Item_Type_Frozen Foods | ... | Item_Type_Snack Foods | Item_Type_Soft Drinks | Item_Type_Starchy Foods | Outlet_Size_Medium | Outlet_Size_Small | Outlet_Location_Type_Tier 2 | Outlet_Location_Type_Tier 3 | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9.30 | 0.016047 | 249.8092 | 14 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 5.92 | 0.019278 | 48.2692 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 17.50 | 0.016760 | 141.6180 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 19.20 | 0.000000 | 182.0950 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 8.93 | 0.000000 | 53.8614 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
5 rows × 27 columns
Observations:
Below, we scale the numerical variables in the dataset so that they have the same range. Without scaling, variables with larger ranges would dominate those with smaller ranges and make the coefficients hard to compare. There are many ways to scale data. Here, we use MinMaxScaler because the dataset contains both categorical (dummy) and numerical variables, and we do not want to change the 0/1 dummy encodings we have already created. For more information on different ways of scaling, refer to section 6.3.1 of the scikit-learn user guide on preprocessing.
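For each column, MinMaxScaler applies the transformation below, mapping the observed minimum to 0 and the maximum to 1. Since the dummy variables already take only the values 0 and 1, this transformation leaves them unchanged.
$x_{scaled} = \dfrac{x - x_{min}}{x_{max} - x_{min}}$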
# Creating an instance of the MinMaxScaler
scaler = MinMaxScaler()
# Applying fit_transform on the training features data
train_features_scaled = scaler.fit_transform(train_features)
# The above scaler returns the data in array format, below we are converting it back to pandas DataFrame
train_features_scaled = pd.DataFrame(train_features_scaled, index = train_features.index, columns = train_features.columns)
train_features_scaled.head()
Item_Weight | Item_Visibility | Item_MRP | Outlet_Age | Item_Fat_Content_Regular | Item_Type_Breads | Item_Type_Breakfast | Item_Type_Canned | Item_Type_Dairy | Item_Type_Frozen Foods | ... | Item_Type_Snack Foods | Item_Type_Soft Drinks | Item_Type_Starchy Foods | Outlet_Size_Medium | Outlet_Size_Small | Outlet_Location_Type_Tier 2 | Outlet_Location_Type_Tier 3 | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.282525 | 0.048866 | 0.927507 | 0.416667 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.081274 | 0.058705 | 0.072068 | 0.000000 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
2 | 0.770765 | 0.051037 | 0.468288 | 0.416667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.871986 | 0.000000 | 0.640093 | 0.458333 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
4 | 0.260494 | 0.000000 | 0.095805 | 0.916667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
5 rows × 27 columns
Now, as the dataset is ready, we are set to build the model using the statsmodels package.
# Adding the intercept term
train_features_scaled = sm.add_constant(train_features_scaled)
# Calling the OLS algorithm on the train features and the target variable
ols_model_0 = sm.OLS(train_target, train_features_scaled)
# Fitting the Model
ols_res_0 = ols_model_0.fit()
print(ols_res_0.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.563 Model: OLS Adj. R-squared: 0.562 Method: Least Squares F-statistic: 405.8 Date: Sat, 06 Aug 2022 Prob (F-statistic): 0.00 Time: 16:25:01 Log-Likelihood: -71993. No. Observations: 8523 AIC: 1.440e+05 Df Residuals: 8495 BIC: 1.442e+05 Df Model: 27 Covariance Type: nonrobust =================================================================================================== coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------------------- const 192.2250 508.450 0.378 0.705 -804.460 1188.910 Item_Weight -18.6789 48.666 -0.384 0.701 -114.075 76.718 Item_Visibility -99.6205 81.718 -1.219 0.223 -259.807 60.566 Item_MRP 3668.9239 46.594 78.742 0.000 3577.588 3760.259 Outlet_Age -746.0040 250.310 -2.980 0.003 -1236.672 -255.336 Item_Fat_Content_Regular 40.1790 28.244 1.423 0.155 -15.187 95.545 Item_Type_Breads 5.2058 84.085 0.062 0.951 -159.622 170.033 Item_Type_Breakfast 7.5314 116.664 0.065 0.949 -221.159 236.222 Item_Type_Canned 26.8954 62.797 0.428 0.668 -96.202 149.993 Item_Type_Dairy -39.8361 62.264 -0.640 0.522 -161.889 82.216 Item_Type_Frozen Foods -27.2245 58.897 -0.462 0.644 -142.678 88.229 Item_Type_Fruits and Vegetables 30.0232 54.984 0.546 0.585 -77.760 137.806 Item_Type_Hard Drinks -2.1841 90.229 -0.024 0.981 -179.055 174.687 Item_Type_Health and Hygiene -11.3017 68.044 -0.166 0.868 -144.685 122.082 Item_Type_Household -38.1290 59.952 -0.636 0.525 -155.650 79.392 Item_Type_Meat 0.8572 70.685 0.012 0.990 -137.702 139.416 Item_Type_Others -21.8458 98.673 -0.221 0.825 -215.269 171.578 Item_Type_Seafood 186.2079 148.080 1.257 0.209 -104.065 476.481 Item_Type_Snack Foods -10.0531 55.274 -0.182 0.856 -118.405 98.298 Item_Type_Soft Drinks -27.7325 70.204 -0.395 0.693 -165.350 109.885 Item_Type_Starchy Foods 23.9142 103.091 0.232 0.817 -178.170 225.999 Outlet_Size_Medium -728.1544 274.677 -2.651 0.008 -1266.588 -189.721 Outlet_Size_Small -762.6946 254.941 -2.992 0.003 -1262.441 -262.948 Outlet_Location_Type_Tier 2 -168.8740 87.637 -1.927 0.054 -340.665 2.917 Outlet_Location_Type_Tier 3 -421.2031 152.008 -2.771 0.006 -719.176 -123.230 Outlet_Type_Supermarket Type1 1516.2963 140.022 10.829 0.000 1241.819 1790.774 Outlet_Type_Supermarket Type2 1252.0527 123.873 10.108 0.000 1009.232 1494.874 Outlet_Type_Supermarket Type3 3724.0002 176.037 21.155 0.000 3378.925 4069.076 ============================================================================== Omnibus: 961.289 Durbin-Watson: 2.004 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2293.830 Skew: 0.667 Prob(JB): 0.00 Kurtosis: 5.163 Cond. No. 105. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Observations:
Interpreting the Regression Results:
Adj. R-squared: It reflects the fit of the model, adjusted for the number of predictors.
coef: It represents the change in the output Y due to a one-unit change in the corresponding independent variable (everything else held constant).
P>|t|: It is the p-value.
For each independent feature, there is a null hypothesis and an alternate hypothesis:
Ho : The independent feature is not significant.
Ha : The independent feature is significant.
A p-value of less than 0.05 is considered statistically significant at a 95% confidence level.
To understand in detail how p-values can help to identify statistically significant variables to predict the sales, we need to understand the hypothesis testing framework.
In the following example, we set up these hypotheses for the independent variable Outlet_Size and the dependent variable Item_Outlet_Sales to identify whether there is any relationship between them.
From the above model summary, if the p-value is less than the significance level of 0.05, then we will reject the null hypothesis in favor of the alternate hypothesis. In other words, we have enough statistical evidence that there is some relationship between the variables Outlet_Size and Item_Outlet_Sales.
Based on this, if we look at the model summary above, we can see that only some of the variables, or some categories of the categorical variables, have a p-value lower than 0.05.
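As a quick check, the p-values can also be pulled directly from the fitted results object. The short sketch below (supplementary to the summary above) lists the terms whose p-values are below 0.05.
# Listing the terms whose p-values are below the 0.05 significance level
significant_vars = ols_res_0.pvalues[ols_res_0.pvalues < 0.05]
print(significant_vars)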
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between independent variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
There are different ways of detecting (or testing for) multicollinearity. One such way is the Variance Inflation Factor.
Variance Inflation Factor (VIF): VIF measures the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is "inflated" by the existence of correlation among the predictor variables in the model.
General rule of thumb: If VIF is 1, then there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. If VIF exceeds 5, there is moderate multicollinearity, and if it reaches or exceeds 10, there are signs of high multicollinearity.
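Concretely, the VIF for the k-th predictor is computed from the R-squared ($R_k^2$) obtained by regressing that predictor on all the remaining predictors:
$VIF_k = \dfrac{1}{1 - R_k^2}$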
vif_series = pd.Series(
[variance_inflation_factor(train_features_scaled.values, i) for i in range(train_features_scaled.shape[1])],
index = train_features_scaled.columns,
dtype = float)
print("VIF Scores: \n\n{}\n".format(vif_series))
VIF Scores: const 1726.930249 Item_Weight 1.020436 Item_Visibility 1.101135 Item_MRP 1.013143 Outlet_Age 50.920787 Item_Fat_Content_Regular 1.216616 Item_Type_Breads 1.349952 Item_Type_Breakfast 1.158277 Item_Type_Canned 1.853154 Item_Type_Dairy 1.906438 Item_Type_Frozen Foods 2.093556 Item_Type_Fruits and Vegetables 2.497309 Item_Type_Hard Drinks 1.331226 Item_Type_Health and Hygiene 1.771880 Item_Type_Household 2.289816 Item_Type_Meat 1.581290 Item_Type_Others 1.264078 Item_Type_Seafood 1.091655 Item_Type_Snack Foods 2.468958 Item_Type_Soft Drinks 1.629245 Item_Type_Starchy Foods 1.211395 Outlet_Size_Medium 111.035924 Outlet_Size_Small 106.822011 Outlet_Location_Type_Tier 2 11.286485 Outlet_Location_Type_Tier 3 36.822583 Outlet_Type_Supermarket Type1 29.622481 Outlet_Type_Supermarket Type2 9.945411 Outlet_Type_Supermarket Type3 20.218127 dtype: float64
Outlet_Age has a high VIF score (it is the only non-dummy variable with a high VIF). Hence, we drop Outlet_Age and rebuild the model.
train_features_scaled_new = train_features_scaled.drop("Outlet_Age", axis = 1)
vif_series = pd.Series(
[variance_inflation_factor(train_features_scaled_new.values, i) for i in range(train_features_scaled_new.shape[1])],
index = train_features_scaled_new.columns,
dtype = float)
print("VIF Scores: \n\n{}\n".format(vif_series))
VIF Scores: const 121.393000 Item_Weight 1.020365 Item_Visibility 1.101115 Item_MRP 1.013102 Item_Fat_Content_Regular 1.216573 Item_Type_Breads 1.349651 Item_Type_Breakfast 1.158268 Item_Type_Canned 1.853093 Item_Type_Dairy 1.906435 Item_Type_Frozen Foods 2.093311 Item_Type_Fruits and Vegetables 2.497137 Item_Type_Hard Drinks 1.331145 Item_Type_Health and Hygiene 1.771845 Item_Type_Household 2.289797 Item_Type_Meat 1.581277 Item_Type_Others 1.264023 Item_Type_Seafood 1.091493 Item_Type_Snack Foods 2.468907 Item_Type_Soft Drinks 1.629240 Item_Type_Starchy Foods 1.211392 Outlet_Size_Medium 10.994155 Outlet_Size_Small 12.274728 Outlet_Location_Type_Tier 2 2.691632 Outlet_Location_Type_Tier 3 7.538596 Outlet_Type_Supermarket Type1 5.953573 Outlet_Type_Supermarket Type2 4.230171 Outlet_Type_Supermarket Type3 4.263294 dtype: float64
Now, let's build the model and observe the p-values.
ols_model_2 = sm.OLS(train_target, train_features_scaled_new)
ols_res_2 = ols_model_2.fit()
print(ols_res_2.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.563 Model: OLS Adj. R-squared: 0.561 Method: Least Squares F-statistic: 420.6 Date: Thu, 14 Apr 2022 Prob (F-statistic): 0.00 Time: 17:11:00 Log-Likelihood: -71997. No. Observations: 8523 AIC: 1.440e+05 Df Residuals: 8496 BIC: 1.442e+05 Df Model: 26 Covariance Type: nonrobust =================================================================================================== coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------------------- const -1270.9837 134.780 -9.430 0.000 -1535.186 -1006.782 Item_Weight -12.3498 48.666 -0.254 0.800 -107.747 83.047 Item_Visibility -98.4424 81.754 -1.204 0.229 -258.700 61.815 Item_MRP 3667.9397 46.614 78.688 0.000 3576.565 3759.314 Item_Fat_Content_Regular 40.7470 28.256 1.442 0.149 -14.643 96.137 Item_Type_Breads 1.7078 84.114 0.020 0.984 -163.176 166.591 Item_Type_Breakfast 8.4265 116.720 0.072 0.942 -220.373 237.226 Item_Type_Canned 25.8750 62.825 0.412 0.680 -97.278 149.028 Item_Type_Dairy -39.9257 62.291 -0.641 0.522 -162.031 82.179 Item_Type_Frozen Foods -25.4585 58.923 -0.432 0.666 -140.962 90.045 Item_Type_Fruits and Vegetables 28.4420 55.010 0.517 0.605 -79.391 136.275 Item_Type_Hard Drinks -4.0005 90.266 -0.044 0.965 -180.945 172.944 Item_Type_Health and Hygiene -10.5661 68.079 -0.155 0.877 -144.017 122.884 Item_Type_Household -38.9075 59.979 -0.649 0.517 -156.481 78.666 Item_Type_Meat 1.3535 70.719 0.019 0.985 -137.274 139.981 Item_Type_Others -24.1736 98.715 -0.245 0.807 -217.679 169.332 Item_Type_Seafood 180.8347 148.139 1.221 0.222 -109.554 471.224 Item_Type_Snack Foods -10.9284 55.305 -0.198 0.843 -119.339 97.482 Item_Type_Soft Drinks -27.2034 70.234 -0.387 0.699 -164.879 110.472 Item_Type_Starchy Foods 24.0722 103.143 0.233 0.815 -178.113 226.257 Outlet_Size_Medium 48.7150 86.480 0.563 0.573 -120.806 218.236 Outlet_Size_Small -48.0264 86.468 -0.555 0.579 -217.525 121.472 Outlet_Location_Type_Tier 2 59.0752 42.817 1.380 0.168 -24.858 143.008 Outlet_Location_Type_Tier 3 -17.3905 68.822 -0.253 0.801 -152.298 117.517 Outlet_Type_Supermarket Type1 1889.1598 62.814 30.075 0.000 1766.029 2012.290 Outlet_Type_Supermarket Type2 1531.9612 80.825 18.954 0.000 1373.524 1690.398 Outlet_Type_Supermarket Type3 3258.2514 80.874 40.288 0.000 3099.720 3416.783 ============================================================================== Omnibus: 962.884 Durbin-Watson: 2.004 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2300.313 Skew: 0.668 Prob(JB): 0.00 Kurtosis: 5.166 Cond. No. 28.3 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
It is generally not a good practice to rely on VIF values for dummy variables, as they tend to be correlated with the other categories of the same variable and therefore often show high VIFs. So, we rebuilt the model and examined the p-values instead. We can see that all the categories of the Item_Type column have p-values higher than 0.05, so we can drop the Item_Type dummy columns.
train_features_scaled_new2 = train_features_scaled_new.drop(['Item_Type_Breads',
'Item_Type_Breakfast',
'Item_Type_Canned',
'Item_Type_Dairy',
'Item_Type_Frozen Foods',
'Item_Type_Fruits and Vegetables',
'Item_Type_Hard Drinks',
'Item_Type_Health and Hygiene',
'Item_Type_Household',
'Item_Type_Meat',
'Item_Type_Others',
'Item_Type_Seafood',
'Item_Type_Snack Foods',
'Item_Type_Soft Drinks',
'Item_Type_Starchy Foods'], axis = 1)
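As a side note, the same drop can be written more compactly by selecting every dummy column whose name starts with Item_Type_. The sketch below is equivalent to the explicit list above.
# Equivalent, more compact way to drop all Item_Type dummy columns
item_type_cols = [col for col in train_features_scaled_new.columns if col.startswith('Item_Type_')]
train_features_scaled_new2 = train_features_scaled_new.drop(item_type_cols, axis = 1)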
vif_series = pd.Series(
[variance_inflation_factor(train_features_scaled_new2.values, i) for i in range(train_features_scaled_new2.shape[1])],
index = train_features_scaled_new2.columns,
dtype = float)
print("VIF Scores: \n\n{}\n".format(vif_series))
VIF Scores: const 109.452185 Item_Weight 1.007189 Item_Visibility 1.093214 Item_MRP 1.000729 Item_Fat_Content_Regular 1.003145 Outlet_Size_Medium 10.984463 Outlet_Size_Small 12.266452 Outlet_Location_Type_Tier 2 2.689564 Outlet_Location_Type_Tier 3 7.529063 Outlet_Type_Supermarket Type1 5.945893 Outlet_Type_Supermarket Type2 4.226082 Outlet_Type_Supermarket Type3 4.260733 dtype: float64
ols_model_3 = sm.OLS(train_target, train_features_scaled_new2)
ols_res_3 = ols_model_3.fit()
print(ols_res_3.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.563 Model: OLS Adj. R-squared: 0.562 Method: Least Squares F-statistic: 994.9 Date: Sat, 06 Aug 2022 Prob (F-statistic): 0.00 Time: 16:33:59 Log-Likelihood: -72000. No. Observations: 8523 AIC: 1.440e+05 Df Residuals: 8511 BIC: 1.441e+05 Df Model: 11 Covariance Type: nonrobust ================================================================================================= coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------- const -1277.0790 127.990 -9.978 0.000 -1527.970 -1026.188 Item_Weight -19.4482 48.343 -0.402 0.687 -114.213 75.317 Item_Visibility -94.6918 81.414 -1.163 0.245 -254.284 64.900 Item_MRP 3666.7944 46.302 79.192 0.000 3576.030 3757.559 Item_Fat_Content_Regular 52.0902 25.644 2.031 0.042 1.821 102.359 Outlet_Size_Medium 48.7242 86.384 0.564 0.573 -120.609 218.057 Outlet_Size_Small -49.2185 86.382 -0.570 0.569 -218.547 120.110 Outlet_Location_Type_Tier 2 60.3934 42.776 1.412 0.158 -23.459 144.245 Outlet_Location_Type_Tier 3 -17.8502 68.728 -0.260 0.795 -152.573 116.873 Outlet_Type_Supermarket Type1 1888.6502 62.726 30.110 0.000 1765.692 2011.608 Outlet_Type_Supermarket Type2 1532.3453 80.740 18.979 0.000 1374.076 1690.614 Outlet_Type_Supermarket Type3 3258.5172 80.803 40.327 0.000 3100.124 3416.911 ============================================================================== Omnibus: 961.630 Durbin-Watson: 2.003 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2292.565 Skew: 0.668 Prob(JB): 0.00 Kurtosis: 5.161 Cond. No. 25.6 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now, from the p-values, we can see that we can remove the Item_Weight column, as its p-value (0.687) is far above 0.05, i.e., it is statistically insignificant.
train_features_scaled_new3 = train_features_scaled_new2.drop("Item_Weight", axis = 1)
vif_series = pd.Series(
[variance_inflation_factor(train_features_scaled_new3.values, i) for i in range(train_features_scaled_new3.shape[1])],
index = train_features_scaled_new3.columns,
dtype = float)
print("VIF Scores: \n\n{}\n".format(vif_series))
VIF Scores: const 106.905754 Item_Visibility 1.093065 Item_MRP 1.000167 Item_Fat_Content_Regular 1.002641 Outlet_Size_Medium 10.977215 Outlet_Size_Small 12.259457 Outlet_Location_Type_Tier 2 2.689448 Outlet_Location_Type_Tier 3 7.519408 Outlet_Type_Supermarket Type1 5.938545 Outlet_Type_Supermarket Type2 4.225983 Outlet_Type_Supermarket Type3 4.255095 dtype: float64
Let's build the model again.
ols_model_4 = sm.OLS(train_target, train_features_scaled_new3)
ols_res_4 = ols_model_4.fit()
print(ols_res_4.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.563 Model: OLS Adj. R-squared: 0.562 Method: Least Squares F-statistic: 1095. Date: Sat, 06 Aug 2022 Prob (F-statistic): 0.00 Time: 16:34:16 Log-Likelihood: -72000. No. Observations: 8523 AIC: 1.440e+05 Df Residuals: 8512 BIC: 1.441e+05 Df Model: 10 Covariance Type: nonrobust ================================================================================================= coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------- const -1284.9327 126.486 -10.159 0.000 -1532.875 -1036.990 Item_Visibility -94.3088 81.405 -1.159 0.247 -253.882 65.264 Item_MRP 3666.3534 46.287 79.209 0.000 3575.619 3757.088 Item_Fat_Content_Regular 52.3215 25.637 2.041 0.041 2.068 102.575 Outlet_Size_Medium 47.8315 86.351 0.554 0.580 -121.437 217.100 Outlet_Size_Small -50.0483 86.353 -0.580 0.562 -219.321 119.224 Outlet_Location_Type_Tier 2 60.5062 42.773 1.415 0.157 -23.340 144.352 Outlet_Location_Type_Tier 3 -18.8403 68.680 -0.274 0.784 -153.470 115.790 Outlet_Type_Supermarket Type1 1887.7630 62.684 30.116 0.000 1764.887 2010.639 Outlet_Type_Supermarket Type2 1532.5026 80.735 18.982 0.000 1374.243 1690.762 Outlet_Type_Supermarket Type3 3259.6997 80.746 40.370 0.000 3101.419 3417.981 ============================================================================== Omnibus: 961.830 Durbin-Watson: 2.003 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2292.698 Skew: 0.668 Prob(JB): 0.00 Kurtosis: 5.161 Cond. No. 24.5 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the above p-values, both categories of the column Outlet_Location_Type have p-values higher than 0.05, i.e., they are statistically insignificant. So, we can remove the Outlet_Location_Type dummy columns.
train_features_scaled_new4 = train_features_scaled_new3.drop(["Outlet_Location_Type_Tier 2", "Outlet_Location_Type_Tier 3"], axis = 1)
vif_series = pd.Series(
[variance_inflation_factor(train_features_scaled_new4.values, i) for i in range(train_features_scaled_new4.shape[1])],
index = train_features_scaled_new4.columns,
dtype = float)
print("VIF Scores: \n\n{}\n".format(vif_series))
VIF Scores: const 27.169808 Item_Visibility 1.092348 Item_MRP 1.000143 Item_Fat_Content_Regular 1.002605 Outlet_Size_Medium 4.034011 Outlet_Size_Small 2.814532 Outlet_Type_Supermarket Type1 2.478493 Outlet_Type_Supermarket Type2 2.842880 Outlet_Type_Supermarket Type3 2.863427 dtype: float64
ols_model_5 = sm.OLS(train_target, train_features_scaled_new4)
ols_res_5 = ols_model_5.fit()
print(ols_res_5.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.562 Model: OLS Adj. R-squared: 0.562 Method: Least Squares F-statistic: 1368. Date: Sat, 06 Aug 2022 Prob (F-statistic): 0.00 Time: 16:35:24 Log-Likelihood: -72001. No. Observations: 8523 AIC: 1.440e+05 Df Residuals: 8514 BIC: 1.441e+05 Df Model: 8 Covariance Type: nonrobust ================================================================================================= coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------- const -1358.8933 63.766 -21.311 0.000 -1483.889 -1233.897 Item_Visibility -93.3375 81.378 -1.147 0.251 -252.859 66.184 Item_MRP 3666.0517 46.287 79.203 0.000 3575.318 3756.785 Item_Fat_Content_Regular 52.1432 25.636 2.034 0.042 1.890 102.396 Outlet_Size_Medium 66.6696 52.347 1.274 0.203 -35.943 169.282 Outlet_Size_Small 14.1489 41.376 0.342 0.732 -66.958 95.255 Outlet_Type_Supermarket Type1 1942.9093 40.496 47.978 0.000 1863.527 2022.291 Outlet_Type_Supermarket Type2 1568.8091 66.218 23.692 0.000 1439.005 1698.613 Outlet_Type_Supermarket Type3 3296.0104 66.238 49.760 0.000 3166.167 3425.854 ============================================================================== Omnibus: 961.728 Durbin-Watson: 2.002 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2296.342 Skew: 0.667 Prob(JB): 0.00 Kurtosis: 5.164 Cond. No. 13.3 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the above p-values, both categories of the column Outlet_Size have p-values higher than 0.05. So, we can remove the Outlet_Size dummy columns.
train_features_scaled_new5 = train_features_scaled_new4.drop(["Outlet_Size_Small", "Outlet_Size_Medium"], axis=1)
vif_series = pd.Series(
[variance_inflation_factor(train_features_scaled_new5.values, i) for i in range(train_features_scaled_new5.shape[1])],
index = train_features_scaled_new5.columns,
dtype = float)
print("VIF Scores: \n\n{}\n".format(vif_series))
VIF Scores: const 15.813401 Item_Visibility 1.092314 Item_MRP 1.000114 Item_Fat_Content_Regular 1.002581 Outlet_Type_Supermarket Type1 2.306663 Outlet_Type_Supermarket Type2 1.731428 Outlet_Type_Supermarket Type3 1.744734 dtype: float64
ols_model_6 = sm.OLS(train_target, train_features_scaled_new5)
ols_res_6 = ols_model_6.fit()
print(ols_res_6.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.562 Model: OLS Adj. R-squared: 0.562 Method: Least Squares F-statistic: 1824. Date: Sat, 06 Aug 2022 Prob (F-statistic): 0.00 Time: 16:36:39 Log-Likelihood: -72002. No. Observations: 8523 AIC: 1.440e+05 Df Residuals: 8516 BIC: 1.441e+05 Df Model: 6 Covariance Type: nonrobust ================================================================================================= coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------- const -1344.7104 48.647 -27.642 0.000 -1440.070 -1249.351 Item_Visibility -93.1427 81.377 -1.145 0.252 -252.661 66.376 Item_MRP 3665.7108 46.286 79.197 0.000 3574.979 3756.443 Item_Fat_Content_Regular 52.3194 25.636 2.041 0.041 2.067 102.572 Outlet_Type_Supermarket Type1 1949.3298 39.067 49.897 0.000 1872.749 2025.911 Outlet_Type_Supermarket Type2 1621.3566 51.677 31.375 0.000 1520.057 1722.657 Outlet_Type_Supermarket Type3 3348.5571 51.705 64.763 0.000 3247.203 3449.911 ============================================================================== Omnibus: 960.314 Durbin-Watson: 2.002 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2290.252 Skew: 0.667 Prob(JB): 0.00 Kurtosis: 5.161 Cond. No. 10.7 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Finally, let's drop Item_Visibility as well, since its p-value (0.252) is also above 0.05.
train_features_scaled_new6 = train_features_scaled_new5.drop("Item_Visibility", axis = 1)
vif_series = pd.Series(
[variance_inflation_factor(train_features_scaled_new6.values, i) for i in range(train_features_scaled_new6.shape[1])],
index = train_features_scaled_new6.columns,
dtype = float)
print("VIF Scores: \n\n{}\n".format(vif_series))
VIF Scores: const 11.452496 Item_MRP 1.000114 Item_Fat_Content_Regular 1.000048 Outlet_Type_Supermarket Type1 2.125686 Outlet_Type_Supermarket Type2 1.654765 Outlet_Type_Supermarket Type3 1.658941 dtype: float64
ols_model_7 = sm.OLS(train_target, train_features_scaled_new6)
ols_res_7 = ols_model_7.fit()
print(ols_res_7.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.562 Model: OLS Adj. R-squared: 0.562 Method: Least Squares F-statistic: 2188. Date: Sat, 06 Aug 2022 Prob (F-statistic): 0.00 Time: 16:37:04 Log-Likelihood: -72003. No. Observations: 8523 AIC: 1.440e+05 Df Residuals: 8517 BIC: 1.441e+05 Df Model: 5 Covariance Type: nonrobust ================================================================================================= coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------- const -1373.9504 41.400 -33.187 0.000 -1455.104 -1292.796 Item_MRP 3665.7374 46.287 79.196 0.000 3575.004 3756.471 Item_Fat_Content_Regular 50.8447 25.604 1.986 0.047 0.655 101.035 Outlet_Type_Supermarket Type1 1961.8549 37.504 52.311 0.000 1888.338 2035.371 Outlet_Type_Supermarket Type2 1633.8029 50.521 32.339 0.000 1534.769 1732.837 Outlet_Type_Supermarket Type3 3361.6803 50.418 66.676 0.000 3262.848 3460.513 ============================================================================== Omnibus: 962.316 Durbin-Watson: 2.002 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2299.500 Skew: 0.668 Prob(JB): 0.00 Kurtosis: 5.166 Cond. No. 8.43 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Observations:
Let's check the assumptions of the linear regression model.
In this step, we check whether the key assumptions hold for the model: the mean of the residuals is close to zero, the residuals are normally distributed, the predictors have a linear relationship with the target, and the residuals are homoscedastic. If any of these is violated, we will rebuild the model after fixing the issue.
# Residuals
residual = ols_res_7.resid
residual.mean()
1.404618687386992e-12
What is the test?
Error terms/Residuals should be normally distributed.
If the error terms are not normally distributed, the confidence intervals of the estimated coefficients may become too wide or too narrow. Once the confidence intervals become unstable, it is difficult to draw reliable conclusions from the least-squares estimates.
What does non-normality indicate?
How to check the normality?
We can plot the histogram of the residuals and check the distribution visually.
It can also be checked via a QQ plot: residuals that follow a normal distribution will fall approximately along a straight line, otherwise they will not.
Another test to check for normality is the Shapiro-Wilk test. (A quick sketch of both of these checks is shown after the histogram below.)
What if the residuals are not normal?
# Plot histogram of residuals
sns.histplot(residual, kde = True)
We can see that the error terms are approximately normally distributed. The assumption of normality is satisfied.
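The QQ plot and Shapiro-Wilk test mentioned above can be run as shown in the sketch below; this is supplementary to the histogram and assumes scipy is available.
# Supplementary normality checks (assumes scipy is installed)
from scipy import stats
# QQ plot: points lying close to the 45-degree line indicate approximately normal residuals
sm.qqplot(residual, line = '45', fit = True)
plt.show()
# Shapiro-Wilk test: a p-value below 0.05 would suggest non-normal residuals
# (for very large samples the reported p-value is approximate)
shapiro_stat, shapiro_p = stats.shapiro(residual)
print(shapiro_stat, shapiro_p)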
The linearity assumption states that the predictor variables must have a linear relationship with the dependent variable.
To test this assumption, we plot the residuals against the fitted values and check that the residuals do not form a strong pattern; they should be randomly and uniformly scattered around zero.
# Predicted values
fitted = ols_res_7.fittedvalues
sns.residplot(x = fitted, y = residual, color = "lightblue")
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()
Observations:
# Log transformation on the target variable to reduce skewness and improve the residual pattern
train_target_log = np.log(train_target)
# Fitting new model with the transformed target variable
ols_model_7 = sm.OLS(train_target_log, train_features_scaled_new6)
ols_res_7 = ols_model_7.fit()
# Predicted values
fitted = ols_res_7.fittedvalues
residual1 = ols_res_7.resid
sns.residplot(x = fitted, y = residual1, color = "lightblue")
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()
Observations:
print(ols_res_7.summary())
OLS Regression Results ============================================================================== Dep. Variable: Item_Outlet_Sales R-squared: 0.720 Model: OLS Adj. R-squared: 0.720 Method: Least Squares F-statistic: 4375. Date: Sat, 06 Aug 2022 Prob (F-statistic): 0.00 Time: 16:41:11 Log-Likelihood: -6816.7 No. Observations: 8523 AIC: 1.365e+04 Df Residuals: 8517 BIC: 1.369e+04 Df Model: 5 Covariance Type: nonrobust ================================================================================================= coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------- const 4.6356 0.020 234.794 0.000 4.597 4.674 Item_MRP 1.9555 0.022 88.588 0.000 1.912 1.999 Item_Fat_Content_Regular 0.0158 0.012 1.293 0.196 -0.008 0.040 Outlet_Type_Supermarket Type1 1.9550 0.018 109.305 0.000 1.920 1.990 Outlet_Type_Supermarket Type2 1.7737 0.024 73.618 0.000 1.726 1.821 Outlet_Type_Supermarket Type3 2.4837 0.024 103.297 0.000 2.437 2.531 ============================================================================== Omnibus: 829.137 Durbin-Watson: 2.007 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1164.564 Skew: -0.775 Prob(JB): 1.31e-253 Kurtosis: 3.937 Cond. No. 8.43 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Let's check the final assumption of homoscedasticity.
Homoscedasticity - If the variance of the residuals is symmetrically distributed across the regression line, the data is said to be homoscedastic.
Heteroscedasticity - If the variance of the residuals is unequal across the regression line, the data is said to be heteroscedastic. In this case, the residuals typically form a funnel shape or some other non-symmetrical pattern.
We will use the Goldfeld–Quandt test to check for homoscedasticity.
Null hypothesis : Residuals are homoscedastic
Alternate hypothesis : Residuals are heteroscedastic
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(train_target_log, train_features_scaled_new6)
lzip(name, test)
[('F statistic', 0.9395156175145154), ('p-value', 0.9790604597916552)]
Since the p-value (0.979) is much greater than 0.05, we fail to reject the null hypothesis, i.e., the residuals are homoscedastic. With this, we have verified all the assumptions of the linear regression model. The final equation of the model is as follows:
$\log(\text{Item\_Outlet\_Sales}) = 4.6356 + 1.9555 \times \text{Item\_MRP} + 0.0158 \times \text{Item\_Fat\_Content\_Regular} + 1.9550 \times \text{Outlet\_Type\_Supermarket Type1} + 1.7737 \times \text{Outlet\_Type\_Supermarket Type2} + 2.4837 \times \text{Outlet\_Type\_Supermarket Type3}$
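As an illustration of how this equation is used, the sketch below plugs in hypothetical values (a scaled Item_MRP of 0.5 for a low-fat item sold in a Supermarket Type1 outlet) and converts the predicted log sales back to the original scale; the chosen input values are for illustration only.
# Hypothetical example: scaled Item_MRP = 0.5, Low Fat item, Supermarket Type1 outlet
log_sales = (4.6356
             + 1.9555 * 0.5   # Item_MRP (on the scaled 0-1 range)
             + 0.0158 * 0     # Item_Fat_Content_Regular (0 means Low Fat)
             + 1.9550 * 1     # Outlet_Type_Supermarket Type1
             + 1.7737 * 0     # Outlet_Type_Supermarket Type2
             + 2.4837 * 0)    # Outlet_Type_Supermarket Type3
predicted_sales = np.exp(log_sales)  # back to the original scale, roughly 1936
print(log_sales, predicted_sales)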
Now, let's make the final test predictions.
without_const = train_features_scaled.iloc[:, 1:]
test_features = pd.get_dummies(test_df, drop_first = True)
test_features = test_features[list(without_const)]
# Applying transform on the test data
test_features_scaled = scaler.transform(test_features)
test_features_scaled = pd.DataFrame(test_features_scaled, columns = without_const.columns)
test_features_scaled = sm.add_constant(test_features_scaled)
test_features_scaled = test_features_scaled.drop(["Item_Weight", "Item_Visibility", "Item_Type_Breads", "Item_Type_Breakfast", "Item_Type_Canned", "Item_Type_Dairy","Item_Type_Frozen Foods","Item_Type_Fruits and Vegetables", "Item_Type_Hard Drinks", "Item_Type_Health and Hygiene", "Item_Type_Household", "Item_Type_Meat", "Item_Type_Others", "Item_Type_Seafood", "Item_Type_Snack Foods", "Item_Type_Soft Drinks", "Item_Type_Starchy Foods", "Outlet_Size_Medium", "Outlet_Size_Small", "Outlet_Location_Type_Tier 2", "Outlet_Location_Type_Tier 3", 'Outlet_Age'], axis = 1)
test_features_scaled.head()
const | Item_MRP | Item_Fat_Content_Regular | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 | |
---|---|---|---|---|---|---|
0 | 1.0 | 0.325012 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 1.0 | 0.237819 | 1.0 | 1.0 | 0.0 | 0.0 |
2 | 1.0 | 0.893316 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 0.525233 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | 1.0 | 0.861381 | 1.0 | 0.0 | 0.0 | 1.0 |
The R-squared metric indicates how good our model is relative to a baseline model that has no independent variables. Here, the model explains ~72% of the variance in the (log-transformed) target compared to that baseline.
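For reference, R-squared compares the model's residual sum of squares with the total sum of squares of the target:
$R^2 = 1 - \dfrac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$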
print(ols_res_7.rsquared)
0.71975057509795
Mean Squared Error (MSE) measures the average of the squares of the errors, i.e., the average squared difference between the predicted values and the actual values.
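In symbols, with $n$ observations, actual values $y_i$, and predictions $\hat{y}_i$:
$MSE = \dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Note that statsmodels' mse_resid divides by the residual degrees of freedom rather than $n$, which gives an almost identical value for a dataset of this size.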
print(ols_res_7.mse_resid)
0.2900908064681798
Root Mean Squared Error (RMSE) is simply the square root of the MSE. Taking the square root brings the metric back to the same units as the target variable.
print(np.sqrt(ols_res_7.mse_resid))
0.5386007858035298
Below, we check the cross-validation scores to identify whether the model we have built is underfitted, overfitted, or a good fit.
# Fitting linear model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
linearregression = LinearRegression()
cv_Score11 = cross_val_score(linearregression, train_features_scaled_new6, train_target_log, cv = 10)
cv_Score12 = cross_val_score(linearregression, train_features_scaled_new6, train_target_log, cv = 10,
scoring = 'neg_mean_squared_error')
print("RSquared: %0.3f (+/- %0.3f)" % (cv_Score11.mean(), cv_Score11.std()*2))
print("Mean Squared Error: %0.3f (+/- %0.3f)" % (-1*cv_Score12.mean(), cv_Score12.std()*2))
RSquared: 0.718 (+/- 0.049) Mean Squared Error: 0.290 (+/- 0.030)
Observations:
It seems that our model is a good fit: the cross-validation scores are very close to the training performance, so it generalizes well.
Since the model we have developed is a linear model, it cannot capture non-linear patterns in the data. We may want to build more advanced regression models that capture such non-linearities and improve on this model further, but that is beyond the scope of this case study.
Once our model is built and validated, we can now use this to predict the sales in our test data as shown below:
# These test predictions will be on a log scale
test_predictions = ols_res_7.predict(test_features_scaled)
# We are converting the log scale predictions to its original scale
test_predictions_inverse_transformed = np.exp(test_predictions)
test_predictions_inverse_transformed
0 1374.895044 1 1177.811420 2 591.395560 3 2033.800973 4 6765.005719 ... 5676 1843.793296 5677 1937.766915 5678 1504.855760 5679 3388.112463 5680 1106.508932 Length: 5681, dtype: float64
Point to remember: The output of this model is on a log scale. So, after making predictions, we need to transform the values from the log scale back to the original scale by applying the inverse of the log transformation, i.e., exponentiation.
fig, ax = plt.subplots(1, 2, figsize = (24, 12))
sns.histplot(test_predictions, ax = ax[0]);
sns.histplot(test_predictions_inverse_transformed, ax = ax[1]);
Lastly below is the model equation:
$\log(\text{Item\_Outlet\_Sales}) = 4.6356 + 1.9555 \times \text{Item\_MRP} + 0.0158 \times \text{Item\_Fat\_Content\_Regular} + 1.9550 \times \text{Outlet\_Type\_Supermarket Type1} + 1.7737 \times \text{Outlet\_Type\_Supermarket Type2} + 2.4837 \times \text{Outlet\_Type\_Supermarket Type3}$
From the above equation, we can interpret that a one-unit increase in the scaled Item_MRP (i.e., moving from the lowest to the highest MRP in the data) increases the log of Item_Outlet_Sales by 1.9555 units, everything else held constant. So, if we want to increase sales, we may want to place higher-MRP items in high-visibility areas.
On average, keeping other factors constant, the coefficient for Supermarket Type 3 (2.4837) is about 1.4 times that of Supermarket Type 2 (1.7737) and about 1.27 times that of Supermarket Type 1 (1.9550). In other words, relative to a grocery store, Type 3 outlets see the largest lift in log sales.
After interpreting this linear regression equation, it is clear that Supermarket Type 3 outlets generate higher sales than the other outlet types. So, we want to maintain or improve sales in these stores, and for the remaining ones we may want to devise strategies to improve sales, for example, providing better customer service, better training for store staff, and more visibility for high-MRP items.