FanduTech - Text Classification with Deep Learning in Keras

Post date March 30, 2017

Author: gopalsharma2001

Description:

This blog post explains how the Keras library can be used to classify email texts as spam or ham.

Keras is a high-level neural networks API for deep learning in Python that can run on top of Theano or TensorFlow. Its minimalistic and modular approach makes it easy to get deep neural networks up and running. It runs on Python 2.7 or 3.5 and can execute seamlessly on GPUs and CPUs, depending on the underlying framework.

Deep Learning refers to neural networks with multiple hidden layers that can learn increasingly abstract representations of the input data. Deep neural networks contain multiple non-linear hidden layers and this makes them very expressive models that can learn very complicated relationships between their inputs and outputs. Deep Learning is not very useful for small datasets.

Theano and TensorFlow are two of the top numerical platforms in Python that provide the basis for deep learning research and development. Both are powerful libraries, but they are difficult to use directly for creating deep learning models. Keras provides an easy and convenient way to create deep learning models on top of Theano or TensorFlow.

Installation

Since Keras runs on top of either TensorFlow or Theano, one of these numerical platforms needs to be installed along with Keras.

Install TensorFlow

Refer to the official TensorFlow website for the TensorFlow version and binary for your platform. As mentioned there, TensorFlow can be installed in several ways. We will use pip, which you need to install first if it is not already present. For a CPU-only install on 64-bit Ubuntu with Python 2.7:

$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.12.1-cp27-none-linux_x86_64.whl
$ sudo pip install --upgrade $TF_BINARY_URL

Install Keras

First we need to install the Python dependencies that Keras requires:

$ sudo pip install numpy scipy
$ sudo pip install scikit-learn
$ sudo pip install pillow
$ sudo pip install h5py

Then install Keras:

$ sudo pip install keras

In my case this also tried to install Theano. If you want to use Theano, it needs to be installed cleanly to avoid any issues. In our case, the Keras configuration needs to point to TensorFlow. To check whether Keras is using TensorFlow as its backend, inspect the file keras.json, which can be found at ~/.keras/keras.json. Specifically, verify that the following properties are set for TensorFlow:

"image_dim_ordering": "tf"
"backend": "tensorflow"

"tf" indicates that TensorFlow image dimension ordering is used. "th" is used for theano image dimension ordering.

Verify Keras Installation

$ python
>>> import keras
This should not raise an import error, and it should display "Using TensorFlow backend."

Note: You can also install Keras before installing Theano or TensorFlow, and then configure the backend (keras.json) to point to the intended numerical platform.
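The keras.json check described above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name read_keras_backend is made up for this illustration and is not part of Keras. It runs against a throwaway file so that it works even where Keras is not installed.

```python
import json
import os
import tempfile


def read_keras_backend(config_path):
    """Return the (backend, image_dim_ordering) pair stored in a keras.json file.

    Hypothetical helper for illustration; not part of the Keras API.
    """
    with open(config_path) as fh:
        config = json.load(fh)
    # Missing keys come back as None so the caller can spot an incomplete file.
    return config.get("backend"), config.get("image_dim_ordering")


# Demonstrate on a temporary file that mimics the two properties shown above.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "keras.json")
    with open(demo_path, "w") as fh:
        json.dump({"image_dim_ordering": "tf", "backend": "tensorflow"}, fh)
    demo_backend, demo_ordering = read_keras_backend(demo_path)

print(demo_backend, demo_ordering)
```

On a real installation, the same function could be called as read_keras_backend(os.path.expanduser("~/.keras/keras.json")).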

Build Models with Keras

Keras is built around the idea of a model. The main type of model is called Sequential, which is a linear stack of layers. You create a Sequential model and add layers to it in the order in which you want the computation to be performed. Once the model is defined, you compile it, which makes use of the underlying framework to optimize the computation to be performed by your model. At this step you can specify the loss function and the optimizer to be used. Once compiled, the model must be fit to data. This can be done one batch of data at a time or by firing off the entire model training regime. This is where all the compute happens. Once trained, the model can be used to make predictions on new data.

During the training phase, it might fail with the error "AttributeError: 'module' object has no attribute 'global_variables'". This is a TensorFlow error, which means that the installed TensorFlow version is incompatible and should be 0.12.

Dropout: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.
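To make the idea concrete, here is a small sketch of dropout in plain Python. It is an illustration only, not the Keras implementation. It assumes the common "inverted" variant, in which the units that survive are rescaled during training so that nothing needs to change at prediction time. The function name and the sample activations are made up for the example.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected value of each
    activation is unchanged. Outside training the input passes through as is.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so repeated runs drop the same units
hidden = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]  # made-up hidden-layer outputs
dropped = dropout(hidden, rate=0.5, rng=rng)
unchanged = dropout(hidden, rate=0.5, rng=rng, training=False)
print(dropped)
print(unchanged)
```

With a rate of 0.5, roughly half of the units are zeroed on each call and the rest are doubled, so a unit cannot rely on any particular neighbour being present.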

Steps:

Load Data: To start with, we create a function that loads the labeled training data, that is, the email texts. We split the data into training and test sets; the test set is used for model evaluation.

Build the Model: We create a function that first vectorizes the texts using the Keras Tokenizer. It finds the 700 most frequent words in the texts and turns them into binary features. We then create a baseline neural network: a simple fully connected network with 700 inputs and one hidden layer of 512 neurons (it could be any number of neurons). We use Dense to create a regular densely-connected NN layer. The hidden layer uses a rectifier activation function. Next comes a dropout layer that drops a fraction (0.5, in our case) of the input units at each update during training. The output layer has 2 output values (equal to the number of classes), one for each class: spam and ham. The class with the largest output value is taken as the model's prediction. We use a sigmoid activation function in the output layer so that the output values lie between 0 and 1 and may be used as predicted probabilities. Finally, the network is compiled to configure the learning process before training. This step uses the efficient Adam gradient descent optimization algorithm with a logarithmic loss function, which is called categorical_crossentropy in Keras. The loss function is the objective that the model will try to minimize.

Train the Model: The model is then trained on the training email texts and evaluated on the test email texts. The trained model may be saved as an HDF5 file on disk and later loaded any number of times to make predictions, as in production.
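The vectorization step in "Build the Model" can be sketched without Keras. The plain-Python version below is a simplified stand-in for what the Tokenizer does in binary mode. It splits on whitespace only, whereas the real Tokenizer also handles punctuation and index conventions differently. The function names and sample emails are invented for the illustration, and max_words=3 stands in for the 700 used in this post.

```python
from collections import Counter


def build_vocabulary(texts, max_words):
    """Map the most frequent words to column indices 0..max_words-1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}


def to_binary_matrix(texts, vocabulary):
    """One row per text; a 1 marks that the vocabulary word occurs in the text."""
    matrix = []
    for text in texts:
        row = [0] * len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:
                row[vocabulary[word]] = 1
        matrix.append(row)
    return matrix


# Made-up sample emails.
emails = [
    "win a free prize now",
    "free free money now",
    "meeting agenda for monday",
]
vocab = build_vocabulary(emails, max_words=3)
features = to_binary_matrix(emails, vocab)
print(vocab)
print(features)
```

Each row has one column per vocabulary word, so every email becomes a fixed-length vector regardless of its length. This is what allows the rows to be fed to a Dense layer with a fixed input_shape.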

Code:

 
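import os
import pickle

import numpy as np
import pandas
import keras
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential, load_model
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

MAX_WORDS = 700  # number of most frequent words kept as features
MODEL_FILE = 'email_spam_detector.h5'
# File name chosen here for illustration: the fitted tokenizer must be
# reused at prediction time so that words map to the same columns.
TOKENIZER_FILE = 'email_tokenizer.pkl'

def load_data():
    messages = pandas.read_csv('email.csv', sep=',', usecols=['type', 'text'], encoding="ISO-8859-1")
    messages = messages.dropna()
    # Hold out 20% of the labeled emails for evaluation.
    msg_train, msg_test, label_train, label_test = train_test_split(messages['text'], messages['type'], test_size=0.2)
    return msg_train, msg_test, label_train, label_test

def deeplearning_model():
    batch_size = 32
    epochs = 5
    msg_train, msg_test, label_train, label_test = load_data()
    # Assumes the 'type' column holds integer class ids (e.g. 0 = ham, 1 = spam).
    num_classes = np.max(label_train) + 1
    print(num_classes, 'classes')
    print('Vectorizing sequence data...')
    tokenizer = Tokenizer(num_words=MAX_WORDS)
    tokenizer.fit_on_texts(msg_train)  # learn the vocabulary from the training texts only
    x_train = tokenizer.texts_to_matrix(msg_train, mode='binary')
    x_test = tokenizer.texts_to_matrix(msg_test, mode='binary')

    print('Convert class vector to binary class matrix')
    y_train = keras.utils.to_categorical(label_train, num_classes)
    y_test = keras.utils.to_categorical(label_test, num_classes)

    print('Building model...')
    model = Sequential()
    model.add(Dense(512, input_shape=(MAX_WORDS,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes))
    model.add(Activation('sigmoid'))

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              verbose=1,
              validation_split=0.1)
    score = model.evaluate(x_test, y_test,
                           batch_size=batch_size, verbose=1)
    print('Test score:', score[0])
    print('Test accuracy:', score[1])

    model.save(MODEL_FILE)  # creates the HDF5 file 'email_spam_detector.h5'
    with open(TOKENIZER_FILE, 'wb') as handle:
        pickle.dump(tokenizer, handle)  # keep the fitted vocabulary alongside the model
    del model  # deletes the existing model

# Use the model to predict unlabeled, new data.
def test_model():
    # Train first if the saved model or tokenizer is missing.
    if not (os.path.isfile(MODEL_FILE) and os.path.isfile(TOKENIZER_FILE)):
        deeplearning_model()
    model = load_model(MODEL_FILE)
    with open(TOKENIZER_FILE, 'rb') as handle:
        tokenizer = pickle.load(handle)
    test_messages = pandas.read_csv('email_test.csv', sep=',', usecols=['text'], encoding="ISO-8859-1")
    test_messages = test_messages.dropna()
    print("rows", test_messages.shape[0])
    x_test = tokenizer.texts_to_matrix(test_messages['text'], mode='binary')
    # The predicted class is the output with the largest value.
    pred = np.argmax(model.predict(x_test), axis=-1)
    print(pred)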

Summary

This post explains how to install Keras and use it for a basic deep learning classification exercise. Of course, this does not give the best result. A few more steps are needed to arrive at a better model. For example, instead of changing the emails into matrices of binary features, it is possible to change the words into numbers using the words’ frequency ranking, and the numbers themselves will be converted into vectors which represent the ‘idea’ of each word. So we can alter the feature representation to help the neural net out. The Keras module that converts the text into matrices has several options besides making a binary matrix: matrices with word counts, frequencies, or tf-idf values. It is also very easy to alter the number of words kept in the matrices as features. Also, while tokenizing we need to take care of stop words (like "is", "to", "by") so that more relevant and meaningful words are included in the feature matrix. Moreover, we can play around with variables and parameters to optimize the model. More importantly, the training data needs to be huge for a deep learning model to perform better than older approaches such as n-grams + tf-idf + SVM.
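As a rough illustration of two of these suggestions, the sketch below drops stop words and produces word counts instead of binary flags. It is plain Python, not the Keras API, and the stop-word list is a deliberately tiny, made-up example; a real project would use a fuller list.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"is", "to", "by", "the", "a"})


def count_features(text, stop_words=STOP_WORDS):
    """Count the words in a text, ignoring stop words."""
    words = text.lower().split()
    return Counter(word for word in words if word not in stop_words)


sample = "The prize is sent to the winner by the prize team"
sample_counts = count_features(sample)
print(sample_counts)
```

Filtering before counting means that the limited number of feature columns is spent on words such as "prize" and "winner" rather than on "the" or "is".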
