

Revisiting PyTorch

This is the second assignment of MSBD5002, which asks us to implement binary and multi-class classification with an MLP. The multi-class task is actually image classification, so I also adapted a CNN model from the MNIST CNN demo, and the CNN performs a bit better. However, because of the dataset itself, accuracy did not reach 90%.

I additionally used skorch, a wrapper library built on top of PyTorch that bundles many basic training utilities and makes the training process remarkably simple: a single net.fit() does the job.

I am recording the process here for reference. The main text follows.


Neural Network Models for Binary Classification Data Sets

Define the network

For the first task we need a set of single-hidden-layer neural network models, and I chose the stochastic gradient descent algorithm to minimize the cross-entropy loss. With PyTorch it is easy to define the network:

import torch
import torch.nn.functional as F

class Net_1hidden(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output=2):
        super(Net_1hidden, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)  # hidden layer
        self.out = torch.nn.Linear(n_hidden, n_output)      # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))          # ReLU activation for the hidden layer
        x = F.softmax(self.out(x), dim=-1)  # class probabilities
        return x

For this net both n_feature and n_hidden need to be given as parameters. n_feature depends on the number of features in X, and n_hidden is chosen by cross validation.

Build the pipeline

To make the training process easier, it is better to build a whole pipeline that reads the data, splits it into different parts, and shuffles it so that the model is more stable.
So I used skorch, a scikit-learn compatible neural network library that wraps PyTorch. The goal of skorch is to make it possible to use PyTorch with sklearn, which is achieved by providing a wrapper around PyTorch that has an sklearn interface. In that sense, skorch is the spiritual successor to nolearn, but instead of using Lasagne and Theano, it uses PyTorch.
With skorch, fitting the model is as simple as net.fit(train_X, train_Y).

import numpy as np
from skorch import NeuralNetClassifier

def get_data(filename):
    data = np.load('./datasets/bi-class/' + filename)
    train_X, test_X = torch.FloatTensor(data['train_X']), torch.FloatTensor(data['test_X'])
    train_Y, test_Y = torch.LongTensor(data['train_Y']), torch.LongTensor(data['test_Y'])
    print('\n>>>>>>>>>>>>>>>' + filename, data['train_X'].shape, data['test_X'].shape)
    return train_X, test_X, train_Y, test_Y

# n_feature is set per dataset (the number of columns of train_X)
net = NeuralNetClassifier(
    Net_1hidden(n_feature=n_feature, n_hidden=10, n_output=2),
    max_epochs=20,
    lr=0.1,
    optimizer=torch.optim.SGD,
    criterion=torch.nn.CrossEntropyLoss,
    iterator_train__shuffle=True,
)
net.fit(train_X, train_Y)

To keep the program simple, I chose max_epochs = 20, which was determined by testing a few times. I did not use early stopping, because all the datasets are small enough that training does not take much time.
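For reference, skorch does ship an EarlyStopping callback; the following is only a sketch of how it could be wired in (it was not used in this assignment), assuming the same Net_1hidden and n_feature as above.

from skorch.callbacks import EarlyStopping

# Not used in the assignment: stop training once the validation loss has not
# improved for 5 consecutive epochs (all other settings mirror the net above).
net_es = NeuralNetClassifier(
    Net_1hidden(n_feature=n_feature, n_hidden=10, n_output=2),
    max_epochs=100,
    lr=0.1,
    optimizer=torch.optim.SGD,
    criterion=torch.nn.CrossEntropyLoss,
    iterator_train__shuffle=True,
    callbacks=[EarlyStopping(monitor='valid_loss', patience=5)],
)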

Cross validation to find best parameter

In the assignment description, we need to find the best number of hidden units H for each dataset, so I used GridSearchCV from sklearn. GridSearchCV implements "fit" and "score" methods, and also "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the underlying estimator, which makes it very useful for searching for the best parameters. skorch's NeuralNet class allows direct access to the parameters of the PyTorch module through the module__ prefix.

For each dataset, this is done by randomly sampling 80% of the training instances to train a classifier and then testing it on the remaining 20%, i.e. 5-fold cross validation.

from sklearn.model_selection import GridSearchCV

params = {'module__n_hidden': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
gs = GridSearchCV(net, params, refit=False, cv=5, scoring='accuracy')

With these two objects we can find the best number of hidden units between 1 and 10.
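As an illustration, the search for a single binary dataset then boils down to the following sketch (the file name here is just an example; the wrapped module's n_feature must match the feature count of the chosen dataset):

# Illustrative usage on one dataset; the same steps are repeated for the other .npz files.
train_X, test_X, train_Y, test_Y = get_data('breast-cancer.npz')
gs.fit(train_X, train_Y)
print(gs.best_params_, gs.best_score_)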

After the training process, the best values for each dataset are shown in the results table below.

It is strange that the best number of hidden units for wine.npz is 1; I will discuss this in the next part.

All results

On all five datasets, the model performance is as follows.

filename            best_hidden   train_acc   test_acc   test_AUC   train_time
diabetes.npz                  8       0.652      0.647      0.500        0.725
breast-cancer.npz            10       0.952      0.963      0.957        0.568
iris.npz                     10       1.000      1.000      1.000        0.214
wine.npz                      1       0.401      0.389      0.500        0.244
digit.npz                     8       0.959      0.935      0.939        0.833

As we can see, more hidden units means more training time, because more computation is needed.

The model performs well on breast-cancer.npz, iris.npz and digit.npz, but for diabetes.npz and wine.npz it does not seem to work. There are several possible reasons for this:

  1. The features and labels of the dataset are not informative enough.
  2. The model is underfitting, because it is trained for only 20 epochs.
  3. The model is too simple and cannot capture the relationship.

To find the reason, I tried 2 hidden layers and more epochs.
The training result is as follows:

>>>>>>>>>>>>>>>diabetes.npz (615, 8) (153, 8)
1layer
epoch: 0, loss: 0.716,accuracy: 0.346
epoch: 100, loss: 0.664,accuracy: 0.647
epoch: 200, loss: 0.644,accuracy: 0.647
epoch: 300, loss: 0.635,accuracy: 0.647
epoch: 400, loss: 0.631,accuracy: 0.647
epoch: 500, loss: 0.628,accuracy: 0.647
2layer
epoch: 0, loss: 0.706,accuracy: 0.353
epoch: 100, loss: 0.689,accuracy: 0.647
epoch: 200, loss: 0.676,accuracy: 0.647
epoch: 300, loss: 0.668,accuracy: 0.647
epoch: 400, loss: 0.661,accuracy: 0.647
epoch: 500, loss: 0.657,accuracy: 0.647

>>>>>>>>>>>>>>>wine.npz (142, 13) (36, 13)
1layer
epoch: 0, loss: 0.912,accuracy: 0.389
epoch: 100, loss: 0.912,accuracy: 0.389
epoch: 200, loss: 0.912,accuracy: 0.389
epoch: 300, loss: 0.912,accuracy: 0.389
epoch: 400, loss: 0.912,accuracy: 0.389
epoch: 500, loss: 0.912,accuracy: 0.389
2layer
epoch: 0, loss: 0.912,accuracy: 0.389
epoch: 100, loss: 0.912,accuracy: 0.389
epoch: 200, loss: 0.912,accuracy: 0.389
epoch: 300, loss: 0.912,accuracy: 0.389
epoch: 400, loss: 0.912,accuracy: 0.389
epoch: 500, loss: 0.912,accuracy: 0.389

As we can see, no matter how many epochs or how many layers are used, the model does not work on these two datasets. So I think the first reason applies: the features and labels of these datasets are not informative enough for this model.

Neural Networks Models for Multi-class Data Sets

Dataset

For this dataset, we train on 10000 images and test on 1000 images. The labels are 0-9, i.e. ten different classes. Each image is a 784-dimensional array, which can be reshaped into a 28×28 square picture.

As we can see, there are shoes, clothes, skirts, T-shirts, and so on, and each class is represented by one number between 0 and 9.
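The sample images were inspected roughly like this (a sketch, assuming train_images and train_labels are the raw arrays loaded from the dataset, as used later for training):

import matplotlib.pyplot as plt

# Each row of train_images is a flat 784-vector; reshape to 28x28 to display it.
fig, axes = plt.subplots(1, 6, figsize=(12, 2))
for ax, img, label in zip(axes, train_images[:6], train_labels[:6]):
    ax.imshow(img.reshape(28, 28), cmap='gray')
    ax.set_title(str(label))
    ax.axis('off')
plt.show()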

Build a net with 2 hidden layers

With PyTorch it is easy to define such a network. Again I use ReLU as the activation function and cross-entropy as the loss function. The input is a flattened array of 784 features, and the output layer has 10 units, equal to the number of classes.

class Net_2hidden(torch.nn.Module):
    def __init__(self, n_hidden1, n_hidden2, n_output=10, n_feature=784):
        super(Net_2hidden, self).__init__()
        self.hidden1 = torch.nn.Linear(n_feature, n_hidden1)  # first hidden layer
        self.hidden2 = torch.nn.Linear(n_hidden1, n_hidden2)  # second hidden layer
        self.out = torch.nn.Linear(n_hidden2, n_output)       # output layer

    def forward(self, x):
        x1 = F.relu(self.hidden1(x))
        x2 = F.relu(self.hidden2(x1))
        x = self.out(x2)  # raw logits; CrossEntropyLoss applies softmax internally
        return x

Then I use GridSearchCV to try different numbers of hidden units, again with cross validation (randomly sampling 80% of the training instances to train a classifier and testing on the remaining 20%). I set max_epochs to 30 for this dataset. This time I used Adam instead of SGD, because in my experiments Adam converges faster.
from sklearn.model_selection import GridSearchCV

net = NeuralNetClassifier(
    Net_2hidden(n_hidden1=500, n_hidden2=100),
    max_epochs=30,
    lr=0.001,
    optimizer=torch.optim.Adam,  # Adam instead of SGD
    criterion=torch.nn.CrossEntropyLoss,
    iterator_train__shuffle=True,
)
params = {
    'module__n_hidden1': [50, 75, 100],
    'module__n_hidden2': [10, 15, 20],
}
gs = GridSearchCV(net, params, refit=False, cv=5, scoring='accuracy')
train_X, train_Y = torch.FloatTensor(train_images), torch.LongTensor(train_labels)
gs.fit(train_X, train_Y)
print(gs.best_score_, gs.best_params_)

After about 10 minutes of training and testing, GridSearchCV finds the best parameters: {'module__n_hidden1': 75, 'module__n_hidden2': 20}.
I was surprised that the best first hidden layer size was 75 instead of 100, because in general, the more units a layer has, especially close to the input, the better the performance tends to be. After several experiments I reached a conclusion: the first hidden layer ends up at 75 rather than 100 mainly because the largest candidate for the second hidden layer is only 20 units, and the jump from 100 down to 20 is too large, which may lose too much information at that point. So with a 20-unit second layer, a 75-unit first layer performs better.
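Since the grid search used refit=False, the test numbers reported below were obtained by refitting with the selected hidden sizes and scoring on the held-out 1000 images. A minimal sketch (assuming test_images and test_labels hold the test split):

from sklearn.metrics import accuracy_score, classification_report

# Refit the wrapper with the hidden sizes found by GridSearchCV, then evaluate.
net.set_params(module__n_hidden1=75, module__n_hidden2=20)
net.fit(train_X, train_Y)

test_X = torch.FloatTensor(test_images)
pred = net.predict(test_X)
print(accuracy_score(test_labels, pred))
print(classification_report(test_labels, pred))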

The final model achieves around 84% accuracy on the 1000-image test set. The per-class results are as follows.

       class  precision    recall  f1-score   support
           0       0.78      0.78      0.78       107
           1       0.93      0.95      0.94       105
           2       0.78      0.82      0.80       111
           3       0.78      0.75      0.77        93
           4       0.74      0.80      0.77       115
           5       0.93      0.90      0.91        87
           6       0.63      0.56      0.59        97
           7       0.91      0.95      0.93        95
           8       0.97      0.93      0.95        95
           9       0.94      0.93      0.93        95

    accuracy                           0.83      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.83      0.83      0.83      1000

As we can see, class 6 gets the lowest precision and recall.

Improve: build CNN net

As we know, CNNs perform well on images, so I tried a CNN model with two convolutional layers to improve performance.

import torch.nn as nn

class Cnn(torch.nn.Module):
    def __init__(self, dropout=0.4):
        super(Cnn, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.conv2_drop = nn.Dropout2d(p=dropout)
        self.fc1 = nn.Linear(1600, 800)  # 1600 = channels * width * height = 64 * 5 * 5
        self.fc2 = nn.Linear(800, 10)
        self.fc1_drop = nn.Dropout(p=dropout)

    def forward(self, x):
        x = torch.relu(F.max_pool2d(self.conv1(x), 2))
        x = torch.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        # flatten over channel, height and width = 1600
        x = x.view(-1, x.size(1) * x.size(2) * x.size(3))
        x = self.fc1_drop(torch.relu(self.fc1(x)))
        x = torch.softmax(self.fc2(x), dim=-1)
        return x

After many attempts, I chose max_epochs = 200, a learning rate of 0.0005, and the Adam optimizer. The internal parameters of the model were fixed after several attempts. In the end, 88.8% accuracy is obtained on the 1000 test images; the training log follows the configuration sketch below.
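A sketch of the skorch wrapper used to train the CNN with the settings described above. The criterion is an assumption (mirroring the earlier nets, since it is not stated here), and the flat 784-vectors are reshaped to 1×28×28 images because Conv2d expects NCHW input.

cnn_net = NeuralNetClassifier(
    Cnn(dropout=0.4),
    max_epochs=200,
    lr=0.0005,
    optimizer=torch.optim.Adam,
    criterion=torch.nn.CrossEntropyLoss,  # assumed, as in the earlier nets
    iterator_train__shuffle=True,
)
# Reshape the flat vectors into single-channel 28x28 images for the convolutions.
train_X_img = torch.FloatTensor(train_images).reshape(-1, 1, 28, 28)
cnn_net.fit(train_X_img, torch.LongTensor(train_labels))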
  epoch    train_loss    valid_acc    valid_loss      dur
-------  ------------  -----------  ------------  -------
      1        5.0852       0.7520        0.9469   0.3805
     50        0.1847       0.8795        0.3360   0.3779
    100        0.0694       0.8860        0.3733   0.3614
    150        0.0336       0.8855        0.4257   0.3691
    200        0.0175       0.8865        0.4704   0.3666
Accuracy on test: 0.888

Again I compute the per-class precision and recall.
       class  precision    recall  f1-score   support
           0       0.91      0.80      0.85       107
           1       0.99      0.99      0.99       105
           2       0.82      0.84      0.83       111
           3       0.91      0.85      0.88        93
           4       0.85      0.85      0.85       115
           5       0.93      0.95      0.94        87
           6       0.66      0.75      0.71        97
           7       0.92      0.97      0.94        95
           8       0.97      0.97      0.97        95
           9       0.98      0.93      0.95        95

    accuracy                           0.89      1000
   macro avg       0.89      0.89      0.89      1000
weighted avg       0.89      0.89      0.89      1000

As we can see, class 6 is still the hardest to classify, with only 66% precision.

Compare CNN and MLP

With TensorBoard, we can directly compare the loss curves of the two models. When I set the maximum number of epochs of both models to 50, the two-layer MLP's training loss decreased noticeably faster, but its validation loss first decreased and then increased, i.e. it overfitted. So the optimal number of epochs for the two-layer MLP is around 10.

For the CNN model, thanks to the dropout layers and the nature of the CNN itself, after 50 epochs the loss is still decreasing and the validation accuracy is still increasing.
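One way to log the curves from skorch is its TensorBoard callback; the following is only a sketch, assuming net and cnn_net are the two classifiers defined earlier and the run directories are arbitrary names.

from torch.utils.tensorboard import SummaryWriter
from skorch.callbacks import TensorBoard

# Attach a TensorBoard callback to each model so their train/valid losses can
# be compared side by side with `tensorboard --logdir runs`.
net.set_params(callbacks=[TensorBoard(SummaryWriter('runs/mlp'))])
cnn_net.set_params(callbacks=[TensorBoard(SummaryWriter('runs/cnn'))])
net.fit(train_X, train_Y)
cnn_net.fit(train_X_img, torch.LongTensor(train_labels))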

Error analysis and Optimization direction

To investigate the errors further, I printed out samples of prediction errors and chose the most representative group to explain.
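The error samples were collected roughly as follows (a sketch, assuming cnn_net is the fitted CNN wrapper and test_images / test_labels are the flat test arrays used earlier):

import numpy as np
import matplotlib.pyplot as plt

# Collect indices where the CNN prediction disagrees with the true label and
# show the first six of them with their true and predicted classes.
test_X_img = torch.FloatTensor(test_images).reshape(-1, 1, 28, 28)
pred = cnn_net.predict(test_X_img)
wrong = np.where(pred != test_labels)[0]

fig, axes = plt.subplots(1, 6, figsize=(12, 2))
for ax, i in zip(axes, wrong[:6]):
    ax.imshow(test_images[i].reshape(28, 28), cmap='gray')
    ax.set_title('true {} / pred {}'.format(test_labels[i], pred[i]))
    ax.axis('off')
plt.show()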

In the picture, we can see that these six images are all shoes, and their true labels all belong to [5, 7, 9].

The predictions are also within [5, 7, 9], but the model cannot tell the shoes apart. To be honest, it is hard even for the naked eye to see an obvious difference between classes 5, 7 and 9, so it is understandable that the model confuses them.

For this problem, I think a hierarchical prediction method could be used: first train a model to distinguish shoes from clothes, and then train a dedicated model for shoes to capture the subtle differences between them. In this way, I think the accuracy can be further improved; the idea is sketched below.
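This sketch is purely illustrative: the grouping of [5, 7, 9] as "shoes" follows the error analysis above, and the three models (coarse, shoe, generic) are hypothetical and were not trained in this assignment.

import numpy as np

SHOE_CLASSES = [5, 7, 9]  # assumed shoe-like classes, per the error analysis above

def hierarchical_predict(x, coarse_model, shoe_model, generic_model):
    """Two-stage prediction: decide shoe vs. non-shoe first, then specialise.
    x is assumed to be a numpy batch of flattened test images."""
    pred = generic_model.predict(x)          # baseline prediction for every sample
    is_shoe = coarse_model.predict(x) == 1   # coarse_model outputs 1 for shoes
    if is_shoe.any():
        pred[is_shoe] = shoe_model.predict(x[is_shoe])  # refine only the shoes
    return pred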
