walmart sales forecast
Post on 15-Apr-2017
519 Views
Preview:
TRANSCRIPT
(Praxis Business School)
Data Mining Assignment
A report on
Sales forecasting for Walmart
Submitted to
Prof. Suman K Mazumdar
In partial fulfillment of the requirements of the subject
(iSAS)
On (26th September, 2015)
By
Anurag Mukherjee
1
Sales forecasting for Walmart
2
Table of Content
Sl No Topic Page
1 Cover Page 12 Title Page 23 Executive Summary 34 Background 35 Business Problem 36 Data Overview 47 Exploratory Analysis 58 Examining the final features dataset : 19
9Merging of train and features for the final data set creation 20
10 Model Building 23
Executive Summary :
3
Walmart is the world's largest company by revenue, according to the Fortune Global 500 list in 2014, as well as the biggest private employer in the world with 2.2 million employees. Walmart is a family-owned business, as the company is controlled by the Walton family. Sam Walton's heirs own over 50 percent of Walmart through their holding company, Walton Enterprises, and through their individual holdings. It is also one of the world's most valuable companies by market value,[10]and is also the largest grocery retailer in the U.S. In 2009, it generated 51 percent of its US$258 billion (equivalent to $284 billion in 2015) sales in the U.S. from its grocery business.
We are provided with datasets containing sales per store,per department on weekly basis.We are are about to forecast sales for Walmart to help the company in taking much better data driven decisions for inventory planning and channel optimization.
Background :
Wal-Mart Stores, Inc. is an American multinational retail corporation that operates a chain ofdiscount department stores and warehouse stores. Headquartered in Bentonville, Arkansas, United States, the company was founded by Sam Walton in 1962 and incorporated on October 31, 1969. It has over 11,000 stores in 28 countries, under a total of 65 banners. The company operates under the Walmart name in the United States and Canada. It operates asWalmart de México y Centroamérica in Mexico, as Asda in the United Kingdom, as Seiyu in Japan, and as Best Price in India. It has wholly owned operations in Argentina, Brazil, and Canada. It also owns and operates the Sam's Club retail warehouses.
Business Problem :
With historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and the aim is to project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales.
Data Overview :
train.csv
4
This is the historical training data, which covers to 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store
IsHoliday - whether the week is a special holiday week
features.csv
This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week
Exploratory Analysis :
5
1.train.csv
1.1 Importing the raw dataset :
proc import out=walmart_train datafile='/folders/myshortcuts/myfolder/train_walmart.csv' dbms=csv replace;
getnames=yes;
run;
1.2 Checking the contents of train.csv :
proc contents data=walmart_train;
run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 Date Num 8 DDMMYY10. DDMMYY10.
2 Dept Num 8 BEST12. BEST32.
6 IsHoliday Char 5 $5. $5.
4 Month_Year Num 8 DATETIME. ANYDTDTM40.
1 Store Num 8 BEST12. BEST32.
5 Weekly_Sales
Num 8 BEST12. BEST32.
1.3 Checking the basic statistical measures
proc means data=walmart_train;
6
var Weekly_Sales;
run;
Analysis Variable : Weekly_Sales
N Mean Std Dev Minimum Maximum
421570
15981.26
22711.18
-4988.94 693099.36
Negative Sales Indicate Returns.
1.4 Plot of Weekly_Sales Vs Date :
1.5 Plotting Sales Year Wise :
7
proc sql;
create table walmart_train_data as
select Date,sum(Weekly_Sales) as Sales
from walmart_train
group by Date;
run;
2010 Sales Report :
data Sales_2010;
set walmart_train_data(keep=Sales Date where=(Date between '05Feb2010'd and '31Dec2010'd));
run;
*plotting 2010 Sales by Date;
ods graphics / reset imagemap;
proc sgplot data=WORK.SALES_2010;
vbar Date / response=Sales stat=Mean name='Bar';
yaxis grid;
run;
ods graphics / reset;
proc print data=Sales_2010;
run;
8
0
20000000
40000000
60000000
80000000Sa
les
(Mea
n)
05/02/2010
12/02/2010
19/02/2010
26/02/2010
05/03/2010
12/03/2010
19/03/2010
26/03/2010
02/04/2010
09/04/2010
16/04/2010
23/04/2010
30/04/2010
07/05/2010
14/05/2010
21/05/2010
28/05/2010
04/06/2010
11/06/2010
18/06/2010
25/06/2010
02/07/2010
09/07/2010
16/07/2010
23/07/2010
30/07/2010
06/08/2010
13/08/2010
20/08/2010
27/08/2010
03/09/2010
10/09/2010
17/09/2010
24/09/2010
01/10/2010
08/10/2010
15/10/2010
22/10/2010
29/10/2010
05/11/2010
12/11/2010
19/11/2010
26/11/2010
03/12/2010
10/12/2010
17/12/2010
24/12/2010
31/12/2010
Date
Obs Date Sales
1 05/02/2010 49750740.50
2 12/02/2010 48336677.63
3 19/02/2010 48276993.78
4 26/02/2010 43968571.13
5 05/03/2010 46871470.30
6 12/03/2010 45925396.51
7 19/03/2010 44988974.64
8 26/03/2010 44133961.05
(First 8 Sales figures for 2010 for convenience)
9
2011 Sales Report :
data Sales_2011;
set walmart_train_data(keep=Sales Date where=(Date between '07Jan2011'd and '30Dec2011'd));
run;
*plotting 2011 Sales by Date;
ods graphics / reset imagemap;
proc sgplot data=WORK.SALES_2011;
vbar Date / response=Sales stat=Mean name='Bar';
yaxis grid;
run;
ods graphics / reset;
proc print data=Sales_2011;
run;
0
20000000
40000000
60000000
80000000
Sale
s (M
ean)
07/01/2011
14/01/2011
21/01/2011
28/01/2011
04/02/2011
11/02/2011
18/02/2011
25/02/2011
04/03/2011
11/03/2011
18/03/2011
25/03/2011
01/04/2011
08/04/2011
15/04/2011
22/04/2011
29/04/2011
06/05/2011
13/05/2011
20/05/2011
27/05/2011
03/06/2011
10/06/2011
17/06/2011
24/06/2011
01/07/2011
08/07/2011
15/07/2011
22/07/2011
29/07/2011
05/08/2011
12/08/2011
19/08/2011
26/08/2011
02/09/2011
09/09/2011
16/09/2011
23/09/2011
30/09/2011
07/10/2011
14/10/2011
21/10/2011
28/10/2011
04/11/2011
11/11/2011
18/11/2011
25/11/2011
02/12/2011
09/12/2011
16/12/2011
23/12/2011
30/12/2011
Date
10
Sales in tabular Form :
Obs Date Sales
1 07/01/2011
42775787.77
2 14/01/2011
40673678.04
3 21/01/2011
40654648.03
4 28/01/2011
39599852.99
5 04/02/2011
46153111.12
6 11/02/2011
47336192.79
7 18/02/2011
48716164.12
(First 8 Sales figures for 2011 for convenience)
11
2012 Sales Report :
data Sales_2012;
set walmart_train_data(keep=Sales Date where=(Date between '06Jan2012'd and '26Oct2012'd));
run;
*plotting 2012 Sales by Date;
ods graphics / reset imagemap;
proc sgplot data=WORK.SALES_2012;
vbar Date / response=Sales stat=Mean name='Bar';
yaxis grid;
run;
ods graphics / reset;
proc print data=Sales_2012; run;
;
0
10000000
20000000
30000000
40000000
50000000
Sales
(Mea
n)
Date
12
Sales in tabular form - 2012
Obs Date Sales
1 06/01/2012
44955421.95
2 13/01/2012
42023078.48
3 20/01/2012
42080996.56
4 27/01/2012
39834974.67
5 03/02/2012
46085608.09
6 10/02/2012
50009407.92
7 17/02/2012
50197056.96
8 24/02/2012
45771506.57
13
1.6.Outlier Treatment for train.csv :
The data being a time series record have some seasonalities .During the month of December there’s a sales spike.This can be explained further by Markdowns.
Markdown 1,2,4,5 doesnt seem to be that effective as compared to Markdown 3.
14
15
As the spike in the sales would affect the entire model,the difference of excess sales has been distributed across all the records.
data wal;
set walmart_train_data;
where Sales > 50000000;
sales_diff=Sales-46243899.58;
run;
proc sql;
create table mapper as
select sum(Sales_diff) from
wal;
run;
*total excess sales from weeks having > 50000000 = 181638262.18;
data walmart_final;
set walmart_train;
if Weekly_Sales > 50000000 then Weekly_Sales=46243899.58;
Weekly_Sales_new=Weekly_Sales+(181638262.18/421570);
run;
proc univariate data=walmart_final;
var Weekly_Sales;
run;
16
2.features.csv
2.1 Importing raw data set :
proc import out=walmart_features datafile='/folders/myshortcuts/myfolder/features.csv' dbms=csv replace;
getnames=yes;
guessingrows=200;
run;
2.2 Checking the contents of features.csv :
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
4 CPI Char 11 $11. $11.
2 Date Num 8 YYMMDD10. YYMMDD10.
6 Fuel_Price Num 8 BEST12. BEST32.
13 IsHoliday Char 5 $5. $5.
7 MarkDown1 Char 8 $8. $8.
8 MarkDown2 Char 8 $8. $8.
9 MarkDown3 Char 8 $8. $8.
10 MarkDown4 Char 8 $8. $8.
11 MarkDown5 Char 8 $8. $8.
1 Store Num 8 BEST12. BEST32.
5 Temperature Num 8 BEST12. BEST32.
12 Unemployment
Char 5 $5. $5.
14 VAR14 Char 1 $1. $1.
3 Weekly_Sales Char 8 $8. $8.
17
2.3 Checking the basic statistical measures of features.csv :
proc means data=walmart_features;
run;
2.4 Outlier Treatment :
data walmart_f;
set walmart_features;
format Date DDMMYY10.;
if MarkDown1="NA" or MarkDown1="#N/A" then MarkDown1=0;
if MarkDown2="NA" or MarkDown2="#N/A" then MarkDown2=0;
if MarkDown3="NA" or MarkDown3="#N/A" then MarkDown3=0;
if MarkDown4="NA" or MarkDown4="#N/A" then MarkDown4=0;
if MarkDown5="NA" or MarkDown5="#N/A" then MarkDown5=0;
if IsHoliday="TRUE" then IsHoliday_Yes=1;
else IsHoliday_Yes=0;
if Weekly_Sales="#N/A" then Weekly_Sales=0;
run;
18
data walmart_features_1 (keep=Store Date Weekly_Sales_n Fuel_Price IsHoliday_Yes MarkDown1_n MarkDown1_n MarkDown2_n MarkDown3_n MarkDown4_n MarkDown5_n Temperature Unemployment CPI) ;
set walmart_f;
MarkDown1_n=MarkDown1*1;
MarkDown2_n=MarkDown2*1;
MarkDown3_n=MarkDown3*1;
MarkDown4_n=MarkDown4*1;
MarkDown5_n=MarkDown5*1;
Weekly_Sales_n=Weekly_Sales*1;
run;
19
Examining the final features dataset :
proc contents data=walmart_features_1;
run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 CPI Char 11 $11. $11.
2 Date Num 8 DDMMYY10. YYMMDD10.
5 Fuel_Price Num 8 BEST12. BEST32.
7 IsHoliday_Yes Num 8
8 MarkDown1_n Num 8
9 MarkDown2_n Num 8
10 MarkDown3_n Num 8
11 MarkDown4_n Num 8
12 MarkDown5_n Num 8
1 Store Num 8 BEST12. BEST32.
4 Temperature Num 8 BEST12. BEST32.
6 Unemployment Char 5 $5. $5.
13 Weekly_Sales_n
Num 8
20
Merging of train and features for the final data set creation :
proc sql;
create table walmart_final_1 as
select a.*,b.CPI,b.Temperature,b.Fuel_Price,b.MarkDown1_n,b.MarkDown2_n,b.MarkDown3_n,b.MarkDown4_n,b.MarkDown5_n,b.Unemployment,b.IsHoliday_Yes
from walmart_final as a left join walmart_features_1 as b on
a.Date=b.Date and a.Store=b.Store;
run;
data walmart_final_2 (drop=IsHoliday Month_Year Unemployment Weekly_Sales);
set walmart_final_1;
run;
21
proc contents data=walmart_final_2;
run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
5 CPI Char 11 $11. $11.
3 Date Num 8 DDMMYY10. DDMMYY10.
2 Dept Num 8 BEST12. BEST32.
7 Fuel_Price Num 8 BEST12. BEST32.
13 IsHoliday_Yes Num 8
8 MarkDown1_n Num 8
9 MarkDown2_n Num 8
10 MarkDown3_n Num 8
11 MarkDown4_n Num 8
12 MarkDown5_n Num 8
1 Store Num 8 BEST12. BEST32.
6 Temperature Num 8 BEST12. BEST32.
4 Weekly_Sales_new
Num 8
22
Printing the final dataset after merge :
proc print data=walmart_final_2(obs=10);
run;
Obs
Store
De
pt
Date Weekly_Sales_ne
w
CPI Temperatur
e
Fuel_Pric
e
MarkDown1_
n
MarkDown2_
n
MarkDown3_
n
MarkDown4_
n
MarkDown5_
n
IsHoliday_Y
es
1 1 45
05/02/20
10
468.30 211.0963582
42.31 2.572
0 0 0 0 0 0
2 1 5 05/02/20
10
32660.24 211.0963582
42.31 2.572
0 0 0 0 0 0
3 1 9 05/02/20
10
17361.85 211.0963582
42.31 2.572
0 0 0 0 0 0
4 1 29
05/02/20
10
7455.81 211.0963582
42.31 2.572
0 0 0 0 0 0
5 1 92
05/02/20
10
140315.80
211.0963582
42.31 2.572
0 0 0 0 0 0
6 1 42
05/02/20
10
8797.57 211.0963582
42.31 2.572
0 0 0 0 0 0
7 1 80
05/02/20
10
16125.03 211.0963582
42.31 2.572
0 0 0 0 0 0
8 1 19
05/02/20
10
2377.91 211.0963582
42.31 2.572
0 0 0 0 0 0
9 1 32
05/02/20
10
12306.70 211.0963582
42.31 2.572
0 0 0 0 0 0
1 1 4 05/0 67211.49 211.0 42.31 2.57 0 0 0 0 0 0
23
Obs
Store
De
pt
Date Weekly_Sales_ne
w
CPI Temperatur
e
Fuel_Pric
e
MarkDown1_
n
MarkDown2_
n
MarkDown3_
n
MarkDown4_
n
MarkDown5_
n
IsHoliday_Y
es
0 0 2/2010
963582
2
Model Building :
proc reg data=walmart_final_2;
model Weekly_Sales_new= Fuel_Price MarkDown3_n Temperature ;
run;
24
top related