Applying Data Science Principles to Build a Dress Recommendation System
The 21st century has seen the advent of the 4th industrial revolution. The accelerated growth of the internet over the last 15 years has made retail one of the most competitive markets. What began with brick & mortar has evolved into e-commerce, with worldwide retail e-commerce sales predicted to hit 4.88 trillion US dollars in 2021 . Every brick & mortar store is having to increase efficiency by reducing costs in order to compete. Knowing exactly what a customer is going to buy the following season helps with stock control, competitive advantage, stock wastage, customer approval rate and storage rental. A recommendation model of what was favored in one season will allow the retailer to predict, and gain a better understanding of, what will or will not sell the following season.
Keywords—Classification, Data Mining, Feature Selection, Weka, Instances, Attributes
A recommendation system is a way of providing a user with product guidance based on the past preferences of other users . Recommendation models are becoming more prevalent, with some of the most renowned companies, such as Netflix and Amazon, adopting them to drive adoption and sales respectively. In the chosen dataset ‘DressData.csv’, the aim is to predict the likelihood of a dress being purchased based on historical data, with the goal that the model is able to generalize, i.e. adapt to new data and give an accurate result. The dress dataset and the desired outcome form a supervised learning task, more specifically a classification problem. A classification problem involves training a learning algorithm to correctly separate the input into two predefined classes. With the dress data we have two classes in the Target Feature: ‘Recommendation’=Yes and ‘Recommendation’=No. The classifier (algorithm) used will try to find similarities between different attributes and instances using the labels as a guide, and can check its predictions against the actual values stored in the labels. I have made the assumption that the Target Feature (Recommendation) was added after the sale (or non-sale) occurred, and thus is a source of truth as to whether these dresses have sold in the past within some time threshold.
For the dataset description, various feature engineering techniques will be used, based on both qualitative intuition and quantitative proof, to identify the most suitable version of the dataset on which to build a classification model.
For some features (e.g. decoration), some experimenting will take place whereby the missing value ‘?’ will be treated both as a class and as ‘Null’. Intuitive reasoning suggests that a decoration that does not fit into a predefined class could still have some predictive value, so a class will be created for the unclassified. However, this may dilute the dataset, and thus several versions of the dataset will be run through the selected models. Rating will be handled differently: ‘?’ will not be used as a class there, as this is a regressive feature, with a higher value meaning higher quality. I will experiment with feature selection to understand the relevance of the different attributes and make changes accordingly.
III. Preparing the data
Before the data was uploaded to the chosen web application, it was inspected first in Excel to understand the quality of the data. The term data quality is defined as ‘fitness for use’: an evaluation of the extent to which the data serves its purposes, often divided into four dimensions: accuracy, timeliness, completeness and consistency . The first obvious problem was the completeness of the dataset. To gain more insight, the Count function seen below was used to calculate the number of null values:
The next decision was to fill the missing cells with ‘?’, as this is the chosen machine learning software’s default value for missing data. The following formula was used to replace each missing value with a “?”:
This was then spread across each of the attributes. The data was then accumulated into a table, seen below in table 1:
TABLE 1. MISSING VALUES FOR ATTRIBUTES
| Missing Data | Total Instances | Percentage missing (%) |
Fig 1. Table showing Excel summary of missing data
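The Excel steps above, counting the nulls per attribute and then replacing them with ‘?’, can also be sketched in pandas (a minimal illustration; the column names and values here are hypothetical, not the real DressData columns):

```python
import pandas as pd

# Hypothetical stand-in frame; two attributes with some missing cells.
df = pd.DataFrame({
    "Price": ["Low", None, "High", None],
    "Rating": [4.6, None, 3.0, 5.0],
})

missing = df.isna().sum()              # count nulls per attribute
pct_missing = 100 * missing / len(df)  # percentage missing per attribute

df = df.fillna("?")                    # Weka's default missing-value marker
print(missing.to_dict())               # {'Price': 2, 'Rating': 1}
print(df["Price"].tolist())            # ['Low', '?', 'High', '?']
```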
From table 1 it is evident that there are key attributes that have a lack of completeness. This will be considered when the data is uploaded into Weka.
A. Pre-Building the Model
Before the experimentation process starts, different versions of the dataset will be run through different models, as some versions may be better suited to some models. Before putting the dataset through each model, a prediction will be made on the likelihood of each model making a strong prediction on the cross-validation set. Based on my intuition, I then want to see how well my assumptions held up and what, if anything, was counter-intuitive. As this is a classification problem, only models designed for classification will be used; these are: kNN, SVM, Decision Tree and Naïve Bayes.
B. Model Classifiers
- kNN (Lazy IBk in Weka) is based on feature similarity, with the choice of k governing a degree of smoothing . This classifier is a nonparametric method, meaning it makes no assumptions about the underlying data distribution; however, its shortcomings include the curse of dimensionality, whereby performance degrades as the number of attributes grows . It may therefore not work well on the dress data, which contains a number of attributes with incomplete data.
- SVM (SMO in Weka) looks for the optimal separating hyperplane between two classes by maximising the margin between the classes’ closest points. This classifier works well with small datasets and outliers. SVM’s objective is to determine a classifier which minimises the training set error and the confidence interval that affects generalization . This classifier should respond well to the dress data because of its ability to work with heterogeneous data, including both numerical and categorical attributes.
- Decision Tree (J48 in Weka) constructs a tree of possibilities. This classifier measures the uncertainty in a sequence of random events: when the entropy (uncertainty) of a system is high, the knowledge that can be derived from the system is low, and vice versa; decision trees will therefore often overfit on a small dataset. J48 tends to work better with homogeneous data, so the combination of numerical and categorical attributes may not work well in the dress data.
- Naïve Bayes is based on Bayes’ theorem, which offers a natural way to unfold an experimental distribution in order to get the best estimates of the true one . It is a probabilistic classifier and, along with SMO, should work well with the dress data.
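As a rough illustration of the four candidate classifiers, the same comparison can be sketched outside Weka with scikit-learn equivalents; the synthetic data below is a stand-in, not the Dress Data, and the scikit-learn implementations only approximate Weka's IBk, SMO and J48:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 500 instances, 11 input attributes, binary class.
X, y = make_classification(n_samples=500, n_features=11, random_state=0)

models = {
    "kNN (IBk)": KNeighborsClassifier(n_neighbors=5),
    "SVM (SMO)": SVC(kernel="linear"),
    "Decision Tree (J48)": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    # 10-fold cross-validation accuracy, as used later in the report.
    results[name] = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: {results[name]:.3f}")
```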
C. Uploading the Data
The csv file was uploaded into Jupyter Notebook, an open-source web application. The decision was made to import pandas, a software library written for the Python programming language, as it understands labelled data, is easy to visualise and, most importantly, can handle missing data . Table 2 shows the code used to read the csv. The query ‘DressData.head()’ is used to show the first 5 rows and their features. As seen below, all 500 rows and 12 columns have been imported into the Jupyter notebook, with question marks replacing any null values.
TABLE 2 DATA UPLOAD
Fig 2. Table showing imported ‘dressdata’ csv into Jupyter
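Since the imported table itself appears only as a screenshot, a self-contained sketch of the same import step is given below; the inline CSV and its column names are assumptions standing in for the real 500-row file:

```python
import io
import pandas as pd

# Stand-in for DressData.csv (the real file has 500 rows and 12 columns);
# the column names and values here are illustrative only.
csv_text = """Style,Price,Rating,Recommendation
Sexy,Low,4.6,Yes
Casual,Average,?,No
vintage,High,4.8,Yes
"""
DressData = pd.read_csv(io.StringIO(csv_text))
print(DressData.head())   # first 5 rows (all 3 in this stand-in)
print(DressData.shape)    # (3, 4) here; (500, 12) for the real file
```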
To ensure no loss of data it is important to run a sanity check on the imported data. The descriptive statistics were gathered using ‘.info’ and ‘.describe’. In Table 3 below, the describe function is examined. The count is the overall number of values, which according to the Excel table is 500. The only outlier is ‘Decoration’, which pandas had interpreted as a numerical column, so its missing data has been recorded as ‘NaN’ (Not a Number). When the data is uploaded to Weka, measures will be taken to ensure that the missing values match the Excel table. The ‘top’ row notes the mode value; something to note is the ‘?’ value in ‘Rating’ and its possible relationship with the ‘No’ in Recommendation. This will be explored later.
TABLE 3 DESCRIBE
Fig 3. Table showing describe method
In Table 4 the ‘.info’ function has printed a concise summary of the dress data frame, giving information on the index, columns and non-null values. The same observation can be made for the ‘Decoration’ attribute, so that is as expected. What wasn’t expected is the type of the ‘Rating’ column: as this is numeric and includes decimals, the assumption would be that it is a float type, but the ‘?’ placeholders force pandas to store it as a generic object column instead.
TABLE 4 INFO
Fig 4. Table showing info method
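The unexpected type can be demonstrated in isolation: a single ‘?’ string among the numbers is enough to stop pandas inferring a float column (a minimal sketch):

```python
import pandas as pd

# One '?' marker mixed with numbers forces the object dtype.
with_marker = pd.Series([4.6, "?", 3.0])
# Coercing the marker to NaN restores the expected float column.
numeric = pd.to_numeric(with_marker, errors="coerce")

print(with_marker.dtype)   # object
print(numeric.dtype)       # float64
```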
The next step was to gather a subset of the data based on specific columns: Season, Sleevelength and Material. To do this the ‘iloc’ function was used, as seen in Table 5. The iloc function performs purely positional indexing. As the dress data is small, when wanting to extract particular columns it was possible to count the column numbers of the attributes being extracted. If the dataset were larger, this could have been achieved by calling the columns by their string names.
TABLE 5 SUBSET DATASET
Fig 5. Table showing use of ‘iloc’ indexing to retrieve data
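The positional lookup can be sketched as follows; the small frame and the column positions are hypothetical, and the label-based alternative mentioned above is shown alongside:

```python
import pandas as pd

# Hypothetical frame: the real positions of Season, Sleevelength and
# Material within DressData.csv are assumptions here.
df = pd.DataFrame({
    "Style": ["Sexy", "Casual"],
    "Season": ["Summer", "Winter"],
    "Sleevelength": ["sleeveless", "full"],
    "Material": ["chiffon", "cotton"],
})

# Positional indexing: all rows, columns 1 to 3.
subset = df.iloc[:, 1:4]
# Equivalent label-based call, preferable for larger datasets.
subset_by_name = df[["Season", "Sleevelength", "Material"]]

print(list(subset.columns))   # ['Season', 'Sleevelength', 'Material']
```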
D. Uploading the Data to Weka
After statistical information had been gathered and a deeper insight gained into the quality of the data, it could be uploaded into Weka. Weka is a data mining package and was chosen as it contains the algorithms necessary for the analysis and has a graphical user interface built in for ease of use . The dress data was uploaded into Weka, see Table 6. As shown in the current relation box, we have relation=DressData, instances=500, attributes=12 and sum of weights=500. The output class has been highlighted below in Table 6. By going through each feature, I checked that the missing-data values match the previous Excel results. For ‘Decoration’ the missing data has been recorded as a ‘null’ class.
Fig 6. Table showing screenshot of loaded file indicating output classes
The output class ratio is calculated by dividing both class counts by the lesser value. Therefore, for every one output recommended ‘Yes’, 1.38 were recommended ‘No’.
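This ratio calculation can be sketched in pandas; the Yes/No counts below are hypothetical values chosen only to reproduce the 1 : 1.38 ratio, as the actual counts are read from Weka's class histogram:

```python
import pandas as pd

# Hypothetical class counts (210 Yes / 290 No) chosen to illustrate the
# 1 : 1.38 ratio; they are not the counts reported by Weka.
recommendation = pd.Series(["Yes"] * 210 + ["No"] * 290)
counts = recommendation.value_counts()

# Divide both values by the lesser value.
ratio = counts.max() / counts.min()
print(f"1 : {ratio:.2f}")   # 1 : 1.38
```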
E. Building the Model
Weka offers four applications; Explorer has been chosen as it works well for small to medium datasets. Before the dress data was divided into training and test sets, it was important to understand the relationship between each attribute and the class, and how this may affect accuracy, so a method of feature selection was used. Feature selection evaluates the attributes and uncovers the ranked value each attribute brings . Information gain-based feature selection was used as it ranks each attribute from 0 to 1; an attribute that contributes more information has a higher value. The attribute evaluator ‘InfoGainAttributeEval’ was used and the results are shown in Table 7. At the bottom, with a rank of 0 and therefore adding no value, is ‘Rating’; this could be a result of the incompleteness of the dataset. The next two attributes from the bottom were ‘Size’ and ‘Waistline’, which have low values as they are specific to an individual’s characteristics and therefore shouldn’t affect the recommendation. Something to note is the ‘Decoration’ attribute, which ranks highly: its missing value has been treated as a distinct class rather than as a missing value, which may affect the accuracy.
TABLE 7 INFO GAIN
Fig 7. Table showing ranked attributes in Feature selection method
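The information gain measure behind ‘InfoGainAttributeEval’ can be sketched from first principles (entropy of the class minus the entropy remaining after a split on the attribute); the toy feature values below are hypothetical:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Entropy of the class minus the weighted entropy after splitting."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: 'season' separates the class perfectly, 'size' not at all.
labels = ["Yes", "Yes", "No", "No"]
season = ["Summer", "Summer", "Winter", "Winter"]
size = ["M", "L", "M", "L"]

print(info_gain(season, labels))   # 1.0 (perfect split)
print(info_gain(size, labels))     # 0.0 (no information, like 'Rating')
```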
It is important to note the worth of each attribute for experimental purposes. The dataset will remain untouched for now and will be evaluated at a later stage.
F. Training, test and cross-validation sets
In order to produce the most effective generalised model, it is crucial to use both a training and a test set. The training set is fed through the chosen algorithm to produce a classifier, and a cross-validation test is then run using the same dataset. The result is then tested separately on a test set. The training and test sets must be kept independent of one another so the information is not misleading. Cross-validation is a holdout method used when the dataset is small: it allows a proportion of the available data to be used for training while making use of all the data to assess performance .
After uploading the dress dataset, the first step was to split the data into two datasets: training and test. The training set should contain the higher proportion of the data, so it takes 70% of the dataset, and the remaining 30% forms the test set. A proportion of the training data will be reserved for the holdout method. The split is achieved in the ‘Preprocess’ stage. The filter ‘Resample’ was used, and its properties were edited for each sample: ‘noReplacement’ was changed to True to ensure the set contained no duplicates, and ‘sampleSizePercent’ was changed to 70. This file was then saved as ‘DressData(training)’. To gather the test set, we return to the full dataset and repeat the process, but with ‘invertSelection’ changed to True, so that the remaining 30% of instances, none of which appear in the training set, are drawn without replacement. Cross-validation with k=10 folds will be used on the training set as the holdout method, and the result will then be re-evaluated on the test set. All three separate values will be recorded.
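The 70/30 split without replacement can be sketched in pandas as a rough analogue of Weka's Resample filter; the frame below is a stand-in for the real dataset:

```python
import pandas as pd

# Stand-in frame with 500 rows, mimicking the dress data's size.
df = pd.DataFrame({"x": range(500), "Recommendation": ["Yes", "No"] * 250})

# 70% sample without replacement (analogue of noReplacement=True,
# sampleSizePercent=70), then the complementary 30% as the test set
# (analogue of invertSelection=True).
train = df.sample(frac=0.70, random_state=1)
test = df.drop(train.index)

print(len(train), len(test))   # 350 150
```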
G. Performance of models.
Each classifier was run on the training set and the cross-validation set. The cross-validation model was then re-evaluated on the test data. As stated previously, the four algorithms used were kNN (IBk), SVM (SMO), Decision Tree (J48) and Naïve Bayes. The results are shown in Table 8:
TABLE 8 RESULTS TABLE
| Classifier used | Training set (%) | Cross-Validation (hold-out) (%) | Test (%) | ROC Area |
| Decision Tree (J48) | 67.4 | 62.5 | 68.6 | 0.51 |
Fig 8. Results table showing correctly classified instances
Beside each classifier output there is added information. Table 8 outlines the percentage of correctly classified instances. The test set is independent of the training set, so it shows how the classifier performs outside of the training model. The cross-validation % estimates performance by taking the available data and partitioning it into training and hold-out folds . The confusion matrix included in the summary of each classifier outlines how many instances were labelled as class a or b, i.e. ‘yes’ and ‘no’ respectively. This is useful for understanding the incorrectly classified instances.
The ROC Area has been calculated for the cross-validation dataset. The ROC (receiver operating characteristic) area represents the probability that a randomly chosen positive subject is correctly rated with greater suspicion than a randomly chosen negative subject . A ROC value around 0.5 suggests the classifier is no better than a 50/50 chance, whereas a value over 0.90 would suggest a problem with overfitting. The ideal benchmark is around 0.80.
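This pairwise-ranking interpretation of the ROC area can be illustrated with scikit-learn on hypothetical scores:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores for four instances.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.1]

# AUC = fraction of positive/negative pairs ranked correctly:
# here 3 of the 4 pairs put the positive above the negative.
auc = roc_auc_score(y_true, y_score)
print(auc)   # 0.75
```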
The next step was to understand whether the ‘null’ value associated with the ‘Decoration’ attribute had an effect. If we refer to Table 1, the ‘Decoration’ attribute has 238 missing values; however, when uploaded these did not comply with Weka’s default ‘?’ missing-value marker, so ‘null’ was seen as a class. This was not necessarily a problem: as stated earlier, it may be that such decorations simply did not fit one of the predefined classes. However, to ensure this did not affect the classifiers, a duplicated dataset was created in Excel with the missing values changed to ‘?’. When uploaded to Weka, ‘Decoration’ now showed 238 missing values. The same process was run as before with each classifier, and the results are recorded in Table 9:
TABLE 9 ALL MISSING VALUES
| Classifier used | Training set (%) | Cross-Validation (%) | Test (%) | ROC Area |
| Decision Tree (J48) | 58 | 61.4 | 58 | 0.59 |
Fig 9. Results table showing correctly classified instances
Lastly, the decision was made to experiment with the ‘Rating’ attribute. As seen when the full dataset was run through ‘InfoGainAttributeEval’, it received a value of 0, meaning it brought no value to the dataset. The same CSV file was used; however, in the ‘Preprocess’ stage the Rating attribute was selected and removed. The same process was followed as before, and the results recorded in Table 10:
TABLE 10 DROP RATING
| Classifier used | Training set (%) | Cross-Validation (%) | Test (%) | ROC Area |
| Decision Tree (J48) | 86.5 | 69.1 | 90 | 0.74 |
Fig 10. Results table showing correctly classified instances
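The removal of the zero-gain attribute amounts to a simple column drop; a minimal pandas sketch (with illustrative column names) of the same step performed in Weka's Preprocess tab:

```python
import pandas as pd

# Hypothetical one-row frame; only the column names matter here.
df = pd.DataFrame({"Style": ["Sexy"], "Rating": [4.6], "Recommendation": ["Yes"]})

# Drop the zero-information 'Rating' attribute before retraining.
df_no_rating = df.drop(columns=["Rating"])
print(list(df_no_rating.columns))   # ['Style', 'Recommendation']
```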
From reviewing the three experiments run on each model, it is clear that the ‘Rating’ attribute had affected the correctly classified instances in Table 8; this was probably down to the lack of completeness in this regressive feature. My earlier assumption that SMO and Naïve Bayes would be the most suitable classifiers appears to be correct. If we look at kNN, in all three experiments its percentage of correctly identified output classes on the training set is nearly 90%, which suggests the classifier is overfitting and would not work well on unseen data. The Decision Tree appeared inconsistent across the three experiments; the main variance came in Table 10, where the lack of numerical data meant it was able to correctly classify more instances, which agrees with the earlier statement that decision trees work best with homogeneous data. To conclude, more data, in terms of both number of features and number of instances, would give a better result and would allow more advanced techniques such as boosted trees or neural networks to be explored. Secondly, if the historic sales numbers, dress price and dress cost were known, the return on investment of the recommendation system could be shown, demonstrating the value of the system to the business.
 The Statistics Portal, “Retail e-commerce sales worldwide from 2014 to 2021 (in billion U.S. dollars),” Statista, 2018. [Online]. Available: https://www.statista.com/statistics/379046/worldwide-retail-e-commerce-sales/. [Accessed: 30-Nov-2018].
 P. Morgan, Data Science from Scratch With Python. AI Sciences LLC, 2018.
 A. Haug, F. Zachariassen, and D. Van Liempd, “The costs of poor data quality,” J. Ind. Eng. Manag., vol. 4, no. 2, pp. 168–193, 2011.
 C. M. Bishop, Pattern Recognition and Machine Learning. Cambridge: Springer Science + Business Media LLC, 2006.
 P. Rani and J. Vashishtha, “An Appraise of KNN to the Perfection.”
 D. Meyer and F. H. T. Wien, “Support vector machines,” R News, vol. 1, no. 3, pp. 23–26, 2001.
 S. A. Sanap, M. Nagori, and V. Kshirsagar, “Classification of anemia using data mining techniques,” in International Conference on Swarm, Evolutionary, and Memetic Computing, 2011, pp. 113–121.
 G. D’Agostini, “A multidimensional unfolding method based on Bayes’ theorem,” Nucl. Instruments Methods Phys. Res. Sect. A Accel. Spectrometers, Detect. Assoc. Equip., vol. 362, no. 2–3, pp. 487–498, 1995.
 W. McKinney, “pandas: a Python data analysis library,” http://pandas.pydata.org/, 2015.
 M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, 2009.
 H. Liu and H. Motoda, Feature selection for knowledge discovery and data mining, vol. 454. Springer Science & Business Media, 2012.
 J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve.,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.