This is the second assignment for MSBD5002. The task is to use an MLP for binary and multi-class classification. The multi-class task is really image classification, so I also adapted a CNN model from an MNIST CNN demo; the CNN performs a bit better. But because of the dataset itself, accuracy did not reach 90%.

I additionally used skorch, a library built on top of PyTorch that bundles a lot of the basic machinery and makes training remarkably simple: a single net.fit() does the job.

This post records the process, for reference only. The main text follows.

# Neural Network Models for Binary Classification Data Sets

## Define the network

For the first task we need to build single-hidden-layer neural network models, and I chose the stochastic gradient descent algorithm, minimizing the cross-entropy loss. With PyTorch, it's pretty easy to define my own network:

```python
class Net_1hidden(torch.nn.Module):
```
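Filled out, the module might look like the following sketch (the hidden-layer ReLU and the two-logit output are my assumptions; only the class header survives in the original post):

```python
import torch

class Net_1hidden(torch.nn.Module):
    """Single-hidden-layer MLP; n_feature and n_hidden come in as parameters."""
    def __init__(self, n_feature, n_hidden, n_output=2):
        super().__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.out = torch.nn.Linear(n_hidden, n_output)

    def forward(self, x):
        x = torch.relu(self.hidden(x))   # hidden layer with ReLU activation
        return self.out(x)               # raw logits; CrossEntropyLoss applies softmax
```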

For this net, both `n_feature` and `n_hidden` need to be given as parameters. `n_feature` depends on the features of X, and `n_hidden`

is chosen by cross-validation.

## Build the pipeline

To make the training process easier, it's better to build a whole pipeline that reads the data, splits it into different parts, and shuffles it to make the model more stable.

So I used skorch, a scikit-learn compatible neural network library that wraps PyTorch. The goal of skorch is to make it possible to use PyTorch with sklearn, which it achieves by providing a wrapper around PyTorch with an sklearn interface. In that sense, skorch is the spiritual successor to nolearn, but instead of using Lasagne and Theano, it uses PyTorch.

With it, fitting the model is as easy as `net.fit(train_X, train_Y)`.

```python
from skorch import NeuralNetClassifier
```

To keep the program simple, I fixed max_epochs at 20, a value found by testing several times. I didn't use early stopping, because every dataset is small enough that training doesn't take much time.

## Cross validation to find best parameter

In the assignment description, we need to find the best number of hidden units H for each dataset. So I used `GridSearchCV` from `sklearn`. GridSearchCV implements "fit" and "score" methods, and also "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used, which makes it very convenient for searching for the best parameter. The `NeuralNet` class allows direct access to the parameters of the wrapped `pytorch module` via the `module__` prefix.

For each dataset, this is done by randomly sampling 80% of the training instances to train a classifier and then testing it on the remaining 20%, i.e. 5 folds.

```python
params = {'module__n_hidden': [1,2,3,4,5,6,7,8,9,10],}
```

With these two pieces, we can find the best number of hidden units between 1 and 10.

After the training process, the best values I found are shown in the results table below.

It's weird that the best hidden number for `wine.npz` is 1. I will discuss it in the next part.

## All result

On all five datasets, the model performance is as follows.

| filename | best_params | train_acc | test_acc | test_AUC | train_time |
|---|---|---|---|---|---|
| diabetes.npz | 8 | 0.652 | 0.647 | 0.500 | 0.725 |
| breast-cancer.npz | 10 | 0.952 | 0.963 | 0.957 | 0.568 |
| iris.npz | 10 | 1.000 | 1.000 | 1.000 | 0.214 |
| wine.npz | 1 | 0.401 | 0.389 | 0.500 | 0.244 |
| digit.npz | 8 | 0.959 | 0.935 | 0.939 | 0.833 |

As we can see, more hidden units mean more training time, because they require more computation.

The model performs well on `breast-cancer.npz`, `iris.npz`, and `digit.npz`. But on `diabetes.npz` and `wine.npz` the model does not seem to work. There are several possible reasons for this problem.

- The features and labels of the dataset are not informative.
- The model is underfitting, because there are only 20 epochs.
- The model is too simple and cannot capture the relationship.

To find the reason, I tried 2 hidden layers and more epochs. The training result looks as follows:

```
>>>>>>>>>>>>>>>diabetes.npz (615, 8) (153, 8)
```

As we can see, no matter how many epochs or layers are used, the model does not work on these two datasets. So I think it is the first reason: the features and labels of these datasets are not informative.

# Neural Network Models for Multi-class Data Sets

## Dataset

For this task, we have 10000 training images and 1000 test images. The labels are 0-9, ten different classes. Each image is a 784-dimensional array, which can be converted to a 28*28 square picture.
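For example, a single flat row can be reshaped back into a square image for plotting (the row below is synthetic):

```python
import numpy as np

flat = np.arange(784, dtype=np.float32)  # stand-in for one 784-dimensional image row
img = flat.reshape(28, 28)               # 784 = 28 * 28, so the reshape is exact
```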

As we can see, there are shoes, clothes, skirts, and T-shirts, and each class is represented by one number between 0 and 9.

## Build a 2-hidden-layer net

With PyTorch, it's pretty easy to define the network. Again I use ReLU as the activation function and cross-entropy as the loss function. The input is a flattened array of 784 features, and the output layer has 10 units, equal to the number of classes.

```python
class Net_2hidden(torch.nn.Module):
```
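A sketch of what that module might contain (only the class header survives in the original post; the layer wiring is my assumption, with defaults set to the sizes found by the search below):

```python
import torch

class Net_2hidden(torch.nn.Module):
    """Two-hidden-layer MLP: 784 inputs -> n_hidden1 -> n_hidden2 -> 10 logits."""
    def __init__(self, n_hidden1=75, n_hidden2=20):
        super().__init__()
        self.h1 = torch.nn.Linear(784, n_hidden1)
        self.h2 = torch.nn.Linear(n_hidden1, n_hidden2)
        self.out = torch.nn.Linear(n_hidden2, 10)

    def forward(self, x):
        x = torch.relu(self.h1(x))       # first hidden layer with ReLU
        x = torch.relu(self.h2(x))       # second hidden layer with ReLU
        return self.out(x)               # 10 raw logits, one per class
```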

Then I used GridSearchCV to try different numbers of hidden units, again with cross-validation (randomly sampling 80% of the training instances to train a classifier and testing it on the remaining 20%). I set max epochs to 30. And this time I used Adam instead of SGD, because in my experiments Adam converges faster than SGD.

```python
from sklearn.model_selection import GridSearchCV
```

After about `10 minutes` of training and testing, GridSearchCV found the best params to be `{'module__n_hidden1': 75, 'module__n_hidden2': 20}`. I was surprised that the first layer was 75 instead of 100, because in general, the more units a layer has, or the closer it is to the input, the better the model tends to perform. After many experiments, I came to a conclusion: the first hidden layer is 75 instead of 100 mainly because the largest candidate for the second hidden layer is 20, and the jump from 100 down to 20 is too large, which may lose too much information there. So when the second layer has 20 units, a first layer of 75 units performs better.

The final model gets an `accuracy of 84%` on the 1000-image test dataset. The per-class results are as follows.

```
class    precision    recall    f1-score    support
```

As we can see, class 6 gets the lowest accuracy.

## Improve: build CNN net

As we know, CNNs perform well on images. So I tried a model with 3*3 convolution kernels to improve the model's performance.

```python
class Cnn(torch.nn.Module):
```
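A sketch of such a CNN (the channel counts, pooling, and dropout rate are my assumptions; the text only fixes the 3x3 kernels and, later, the presence of a Dropout layer):

```python
import torch

class Cnn(torch.nn.Module):
    """Small CNN with 3x3 kernels and dropout for 28x28 single-channel images."""
    def __init__(self, n_classes=10, p_drop=0.5):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = torch.nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = torch.nn.MaxPool2d(2)           # halves spatial size: 28 -> 14 -> 7
        self.drop = torch.nn.Dropout(p_drop)
        self.fc = torch.nn.Linear(64 * 7 * 7, n_classes)

    def forward(self, x):                           # x: (N, 1, 28, 28)
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.drop(x.flatten(1))                 # flatten before the classifier
        return self.fc(x)                           # 10 raw logits
```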

After many attempts, I chose `max_epochs = 200`, a `learning rate of 0.0005`, and `Adam` as the optimizer. The internal parameters of the model were fixed after several attempts. In the end, **88.8% accuracy** is obtained on the 1000 test samples.

```
epoch    train_loss    valid_acc    valid_loss    dur
```

Again, I computed the accuracy for each class.

```
class    precision    recall    f1-score    support
```

As we can see, class 6 is still hard to classify, with only 66% accuracy.
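The per-class tables above come from `sklearn.metrics.classification_report`; the per-class "accuracy" being discussed is its recall column. A toy illustration with made-up labels:

```python
from sklearn.metrics import classification_report, recall_score

y_true = [0, 0, 1, 1, 2, 2]                        # made-up ground truth
y_pred = [0, 0, 1, 2, 2, 2]                        # made-up predictions

print(classification_report(y_true, y_pred))       # precision / recall / f1 / support
per_class_recall = recall_score(y_true, y_pred, average=None)  # one value per class
```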

## Compare CNN and MLP

Using TensorBoard, we can intuitively compare the loss curves of the two models. When I set the maximum number of iterations of both models to 50, the two-layer MLP's training loss decreased noticeably faster, but on the test set its loss first decreased and then increased again, i.e. it overfit. So the optimal number of iterations for the two-layer MLP is around 10.

For the CNN model, thanks to the Dropout layer I added and the nature of the CNN itself, the loss is still decreasing at iteration 50, and the validation accuracy is still increasing.

## Error analysis and Optimization direction

To study further why the model goes wrong, I printed out samples of prediction errors and chose the most representative group to explain.

In the picture, these six images are all shoes, and their true labels belong to [5, 7, 9]. The predicted labels are also in [5, 7, 9], but the model does not find the differences between the shoes. To be honest, it's hard even for the naked eye to see obvious differences between classes 5, 7, and 9, so it's understandable that the model is confused.

For the above problem, I think we can use a **hierarchical prediction method**: for example, first train a model to distinguish shoes from clothes, then train a shoe-specific model to capture the subtle differences between shoes. In this way, I think the accuracy can be improved further.
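That two-stage idea can be sketched as follows (the helper function and the stand-in "models" are hypothetical; classes 5, 7, 9 are the shoe labels discussed above):

```python
def hierarchical_predict(x, coarse_model, shoe_model, other_model):
    """First decide shoe vs. non-shoe, then hand off to the matching specialist."""
    if coarse_model(x) == 'shoe':
        return shoe_model(x)      # specialist trained only on classes 5, 7, 9
    return other_model(x)         # specialist trained on the remaining classes

# toy stand-ins for trained classifiers
coarse = lambda x: 'shoe' if x['looks_like_shoe'] else 'other'
shoe_expert = lambda x: 9         # would pick among {5, 7, 9}
other_expert = lambda x: 0        # would pick among the other classes
```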