Train on the Kaggle dataset using a pretrained model

Install the Kaggle API client

In [7]:
!pip install -q kaggle

Import Packages

In [8]:
from google.colab import drive, files
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import zipfile

from sklearn import model_selection

drive.mount('/content/gdrive')
Mounted at /content/gdrive

Upload the Kaggle API token (secret key) so we can download the data from Kaggle

In [9]:
import os
import pathlib

# Upload the API token (~/.kaggle/kaggle.json) if it isn't already configured.
def get_kaggle():
  try:
    import kaggle
    return kaggle
  except OSError:
    # The kaggle package raises OSError when no API token is found.
    pass

  token_file = pathlib.Path("~/.kaggle/kaggle.json").expanduser()
  token_file.parent.mkdir(exist_ok=True, parents=True)

  try:
    from google.colab import files
  except ImportError:
    raise ValueError("Could not find kaggle token.")

  uploaded = files.upload()
  token_content = uploaded.get('kaggle.json', None)
  if token_content:
    token_file.write_bytes(token_content)
    token_file.chmod(0o600)
  else:
    raise ValueError('Need a file named "kaggle.json"')

  import kaggle
  return kaggle


kaggle = get_kaggle()
Saving kaggle.json to kaggle.json

Download the data from Kaggle and load it as DataFrame objects

  • The data we download from Kaggle needs to be unzipped.
  • We will split the training data into two parts (train and validation).
In [10]:
# Download data from Kaggle and create a DataFrame.
def load_data_from_zip(path):
  with zipfile.ZipFile(path, "r") as zip_ref:
    name = zip_ref.namelist()[0]
    with zip_ref.open(name) as zf:
      return pd.read_csv(zf)

# The data does not come with a validation set, so we'll hold one out from the
# training set.
def get_data(competition, train_file, test_file, validation_set_ratio=0.3):
  data_path = pathlib.Path("data")
  kaggle.api.competition_download_files(competition, data_path)
  competition_path = (data_path/competition)
  competition_path.mkdir(exist_ok=True, parents=True)
  competition_zip_path = competition_path.with_suffix(".zip")

  with zipfile.ZipFile(competition_zip_path, "r") as zip_ref:
    zip_ref.extractall(competition_path)

  train_df = load_data_from_zip(competition_path/train_file)
  test_df = load_data_from_zip(competition_path/test_file)
  
  # Hold out part of the training data as a validation set.
  train_df, validation_df = model_selection.train_test_split(
      train_df,
      test_size=validation_set_ratio,
      random_state=0)
  
  print("Split the training data into %d training and %d validation examples." %
        (len(train_df), len(validation_df)))

  return train_df, validation_df, test_df


train_df, validation_df, test_df = get_data(
    "jigsaw-toxic-comment-classification-challenge",
    "train.csv.zip", "test.csv.zip")
Split the training data into 111699 training and 47872 validation examples.
In [11]:
test_df.head(5)
Out[11]:
id comment_text
0 00001cee341fdb12 Yo bitch Ja Rule is more succesful then you'll...
1 0000247867823ef7 == From RfC == \n\n The title is fine as it is...
2 00013b17ad220c46 " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3 00017563c3f7919a :If you have a look back at the source, the in...
4 00017695ad8997eb I don't anonymously edit articles at all.
In [12]:
train_df.head(5)
Out[12]:
id comment_text toxic severe_toxic obscene threat insult identity_hate
104158 2d59e577d6d081da Actually you know what, I'm not using this acc... 0 0 0 0 0 0
81146 d90eed2c03efb2b0 "\n\nThe problem with this disambig is, Michae... 0 0 0 0 0 0
6248 10af179019d6d9b8 "\n\nLegalleft, I just guess sarcasm is not on... 0 0 0 0 0 0
36126 6088014ca4f31017 Please note that you have no right to free spe... 0 0 0 0 0 0
70143 bb9fc7e55e51f62f "\n\nBoba Phat at AFD again\nAn AFD you partic... 0 0 0 0 0 0

Show the distribution of the labels

In [13]:
# Count the 0/1 occurrences of each label column and plot them side by side.
label_columns = train_df.columns[2:]
graph_df = train_df[label_columns].apply(pd.Series.value_counts)

graph_df.plot(kind='bar', figsize=(12, 6))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff8b610fa10>

Extract labels from the data

  • x -> the comment text we train on
  • y -> the labels we want to predict
In [14]:
x_train = train_df['comment_text']
y_train = train_df.iloc[:, 2:]

x_val = validation_df['comment_text']
y_val = validation_df.iloc[:, 2:]
In [15]:
x_train.head()
Out[15]:
104158    Actually you know what, I'm not using this acc...
81146     "\n\nThe problem with this disambig is, Michae...
6248      "\n\nLegalleft, I just guess sarcasm is not on...
36126     Please note that you have no right to free spe...
70143     "\n\nBoba Phat at AFD again\nAn AFD you partic...
Name: comment_text, dtype: object
In [16]:
y_train.head()
Out[16]:
toxic severe_toxic obscene threat insult identity_hate
104158 0 0 0 0 0 0
81146 0 0 0 0 0 0
6248 0 0 0 0 0 0
36126 0 0 0 0 0 0
70143 0 0 0 0 0 0

Use a pretrained model from TensorFlow Hub

  • Build our model on top of a pretrained text embedding (the code below uses the NNLM embedding; the Universal Sentence Encoder described next is a drop-in alternative).
  • The output layer must match y_train's shape (six labels).
  • Because the output is multi-label, we use tf.keras.losses.BinaryCrossentropy as the loss function.

What is the Universal Sentence Encoder?

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The pretrained Universal Sentence Encoder is publicly available on TensorFlow Hub. A minimal usage sketch follows the list below.

Applications:

  1. Skip-thought: use the current sentence to predict the previous and next sentence.
  2. Response Prediction
  3. Natural Language Inference
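
As a quick illustration, here is a minimal sketch of loading the Universal Sentence Encoder from TensorFlow Hub and embedding two sentences. The handle below is the public USE v4 URL; the model cell that follows uses the smaller NNLM embedding instead, but any TF Hub text embedding plugs in the same way.

In [ ]:
# A minimal sketch: embed two sentences with the Universal Sentence Encoder.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = use(["The comment was fine.", "This is another sentence."])
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence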
In [17]:
class MyModel(tf.keras.Model):
  def __init__(self, hub_url):
    super().__init__()
    self.hub_url = hub_url
    # Pretrained text embedding loaded from TensorFlow Hub.
    self.embed = hub.load(self.hub_url)
    # Classification head: two hidden layers (no activation, i.e. linear),
    # then one sigmoid output per label.
    self.sequential = tf.keras.Sequential([
      tf.keras.layers.Dense(500),
      tf.keras.layers.Dense(100),
      tf.keras.layers.Dense(6),
      tf.keras.layers.Activation('sigmoid')
    ])

  def call(self, inputs):
    # Flatten to a 1-D batch of strings before embedding.
    inputs = tf.reshape(inputs, shape=[-1])
    # The factor of 6 is an empirical scaling of the embedding.
    embedding = 6 * self.embed(inputs)
    return self.sequential(embedding)

  def get_config(self):
    return {"hub_url": self.hub_url}
In [18]:
model = MyModel("https://tfhub.dev/google/nnlm-en-dim128/2")
model.compile(
    # The model ends in a sigmoid, so the loss receives probabilities,
    # not logits.
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.optimizers.Adam(),
    metrics=[tf.keras.metrics.BinaryCrossentropy()])
In [19]:
history = model.fit(x=x_train, y=y_train,
          validation_data=(x_val, y_val),
          epochs = 25)
Epoch 1/25
3491/3491 [==============================] - 20s 5ms/step - loss: 0.1097 - binary_crossentropy: 0.1097 - val_loss: 0.1006 - val_binary_crossentropy: 0.1006
Epoch 2/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0969 - binary_crossentropy: 0.0969 - val_loss: 0.0937 - val_binary_crossentropy: 0.0937
Epoch 3/25
3491/3491 [==============================] - 15s 4ms/step - loss: 0.0953 - binary_crossentropy: 0.0953 - val_loss: 0.0940 - val_binary_crossentropy: 0.0940
Epoch 4/25
3491/3491 [==============================] - 15s 4ms/step - loss: 0.0940 - binary_crossentropy: 0.0940 - val_loss: 0.0959 - val_binary_crossentropy: 0.0959
Epoch 5/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0935 - binary_crossentropy: 0.0935 - val_loss: 0.0936 - val_binary_crossentropy: 0.0936
Epoch 6/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0931 - binary_crossentropy: 0.0931 - val_loss: 0.0958 - val_binary_crossentropy: 0.0958
Epoch 7/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0931 - binary_crossentropy: 0.0931 - val_loss: 0.0931 - val_binary_crossentropy: 0.0931
Epoch 8/25
3491/3491 [==============================] - 15s 4ms/step - loss: 0.0926 - binary_crossentropy: 0.0926 - val_loss: 0.0957 - val_binary_crossentropy: 0.0957
Epoch 9/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0923 - binary_crossentropy: 0.0923 - val_loss: 0.0955 - val_binary_crossentropy: 0.0955
Epoch 10/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0921 - binary_crossentropy: 0.0921 - val_loss: 0.0923 - val_binary_crossentropy: 0.0923
Epoch 11/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0919 - binary_crossentropy: 0.0919 - val_loss: 0.0923 - val_binary_crossentropy: 0.0923
Epoch 12/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0914 - binary_crossentropy: 0.0914 - val_loss: 0.0938 - val_binary_crossentropy: 0.0938
Epoch 13/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0915 - binary_crossentropy: 0.0915 - val_loss: 0.0950 - val_binary_crossentropy: 0.0950
Epoch 14/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0919 - binary_crossentropy: 0.0919 - val_loss: 0.0931 - val_binary_crossentropy: 0.0931
Epoch 15/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0912 - binary_crossentropy: 0.0912 - val_loss: 0.0954 - val_binary_crossentropy: 0.0954
Epoch 16/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0915 - binary_crossentropy: 0.0915 - val_loss: 0.0925 - val_binary_crossentropy: 0.0925
Epoch 17/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0916 - binary_crossentropy: 0.0916 - val_loss: 0.0942 - val_binary_crossentropy: 0.0942
Epoch 18/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0912 - binary_crossentropy: 0.0912 - val_loss: 0.0959 - val_binary_crossentropy: 0.0959
Epoch 19/25
3491/3491 [==============================] - 17s 5ms/step - loss: 0.0915 - binary_crossentropy: 0.0915 - val_loss: 0.0932 - val_binary_crossentropy: 0.0932
Epoch 20/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0911 - binary_crossentropy: 0.0911 - val_loss: 0.0926 - val_binary_crossentropy: 0.0926
Epoch 21/25
3491/3491 [==============================] - 16s 4ms/step - loss: 0.0918 - binary_crossentropy: 0.0918 - val_loss: 0.0938 - val_binary_crossentropy: 0.0938
Epoch 22/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0910 - binary_crossentropy: 0.0910 - val_loss: 0.0941 - val_binary_crossentropy: 0.0941
Epoch 23/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0909 - binary_crossentropy: 0.0909 - val_loss: 0.0927 - val_binary_crossentropy: 0.0927
Epoch 24/25
3491/3491 [==============================] - 15s 4ms/step - loss: 0.0908 - binary_crossentropy: 0.0908 - val_loss: 0.0918 - val_binary_crossentropy: 0.0918
Epoch 25/25
3491/3491 [==============================] - 14s 4ms/step - loss: 0.0913 - binary_crossentropy: 0.0913 - val_loss: 0.0939 - val_binary_crossentropy: 0.0939
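
The fit call above returns a History object; a quick optional check is to plot the train and validation loss curves (a sketch using the matplotlib import from above):

In [ ]:
# Plot train vs. validation loss recorded by fit().
plt.figure(figsize=(8, 4))
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy')
plt.legend()
plt.show()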

Use the trained model to predict on the test data.

  • Predict a probability for each of the six types of comment toxicity (toxic, severe_toxic, obscene, threat, insult, identity_hate).
  • Export the result to a CSV file.
In [25]:
test_df['comment_text'][1]
Out[25]:
'== From RfC == \n\n The title is fine as it is, IMO.'
In [22]:
test_df['comment_text']
Out[22]:
0         Yo bitch Ja Rule is more succesful then you'll...
1         == From RfC == \n\n The title is fine as it is...
2         " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3         :If you have a look back at the source, the in...
4                 I don't anonymously edit articles at all.
                                ...                        
153159    . \n i totally agree, this stuff is nothing bu...
153160    == Throw from out field to home plate. == \n\n...
153161    " \n\n == Okinotorishima categories == \n\n I ...
153162    " \n\n == ""One of the founding nations of the...
153163    " \n :::Stop already. Your bullshit is not wel...
Name: comment_text, Length: 153164, dtype: object
In [23]:
y_test = model.predict(test_df['comment_text'])
In [24]:
y_test
Out[24]:
array([[9.9932635e-01, 2.3970285e-01, 9.8839921e-01, 2.2686779e-02,
        9.8349077e-01, 3.5265735e-01],
       [4.0203094e-02, 2.4605960e-02, 4.9449682e-02, 8.7434649e-03,
        4.0368348e-02, 6.6496730e-03],
       [1.6446471e-02, 8.3044469e-03, 1.6644359e-02, 4.2854548e-03,
        9.1138184e-03, 2.0401180e-03],
       ...,
       [1.5958250e-03, 3.2114983e-03, 3.7794709e-03, 5.8686733e-04,
        1.5451610e-03, 9.0268254e-04],
       [1.8376350e-02, 8.9213252e-04, 6.2696338e-03, 1.0089576e-03,
        8.1562102e-03, 9.9056959e-03],
       [3.3477962e-02, 1.8678904e-03, 1.2961894e-02, 1.1110604e-03,
        1.1900723e-02, 2.8145313e-03]], dtype=float32)
In [28]:
result_df = pd.DataFrame(y_test, columns=y_train.columns)
result_df
Out[28]:
toxic severe_toxic obscene threat insult identity_hate
0 0.999326 0.239703 0.988399 0.022687 0.983491 0.352657
1 0.040203 0.024606 0.049450 0.008743 0.040368 0.006650
2 0.016446 0.008304 0.016644 0.004285 0.009114 0.002040
3 0.011524 0.001230 0.006594 0.000870 0.008534 0.002780
4 0.077224 0.004752 0.028164 0.001783 0.020256 0.003974
... ... ... ... ... ... ...
153159 0.108599 0.011909 0.064952 0.001921 0.049012 0.005252
153160 0.070937 0.006661 0.038952 0.005681 0.060483 0.016679
153161 0.001596 0.003211 0.003779 0.000587 0.001545 0.000903
153162 0.018376 0.000892 0.006270 0.001009 0.008156 0.009906
153163 0.033478 0.001868 0.012962 0.001111 0.011901 0.002815

153164 rows × 6 columns

In [26]:
test_df['id']
Out[26]:
0         00001cee341fdb12
1         0000247867823ef7
2         00013b17ad220c46
3         00017563c3f7919a
4         00017695ad8997eb
                ...       
153159    fffcd0960ee309b5
153160    fffd7a9a6eb32c16
153161    fffda9e8d6fafa9e
153162    fffe8f1340a79fc2
153163    ffffce3fb183ee80
Name: id, Length: 153164, dtype: object
In [29]:
result_df.insert(0, 'id', test_df['id'])
In [30]:
result_df
Out[30]:
id toxic severe_toxic obscene threat insult identity_hate
0 00001cee341fdb12 0.999326 0.239703 0.988399 0.022687 0.983491 0.352657
1 0000247867823ef7 0.040203 0.024606 0.049450 0.008743 0.040368 0.006650
2 00013b17ad220c46 0.016446 0.008304 0.016644 0.004285 0.009114 0.002040
3 00017563c3f7919a 0.011524 0.001230 0.006594 0.000870 0.008534 0.002780
4 00017695ad8997eb 0.077224 0.004752 0.028164 0.001783 0.020256 0.003974
... ... ... ... ... ... ... ...
153159 fffcd0960ee309b5 0.108599 0.011909 0.064952 0.001921 0.049012 0.005252
153160 fffd7a9a6eb32c16 0.070937 0.006661 0.038952 0.005681 0.060483 0.016679
153161 fffda9e8d6fafa9e 0.001596 0.003211 0.003779 0.000587 0.001545 0.000903
153162 fffe8f1340a79fc2 0.018376 0.000892 0.006270 0.001009 0.008156 0.009906
153163 ffffce3fb183ee80 0.033478 0.001868 0.012962 0.001111 0.011901 0.002815

153164 rows × 7 columns

In [31]:
result_df.to_csv("submission.csv", index=False)
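
Optionally, the Kaggle CLI installed earlier can submit the file straight to the competition (a sketch; the competition may no longer accept submissions):

In [ ]:
!kaggle competitions submit -c jigsaw-toxic-comment-classification-challenge -f submission.csv -m "NNLM embedding + dense head"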
In [32]:
len(y_test)
Out[32]:
153164
In [33]:
model.predict(["Mother fucker"])
Out[33]:
array([[0.79559195, 0.10643551, 0.5876882 , 0.0098615 , 0.52826744,
        0.03464001]], dtype=float32)
In [34]:
model.predict(["Starting a new AI project there are lots of things to consider, and getting a proof-of-concept going with whatever-bunch-of-tools-you-come-across is, probably, the way to go. But once this phase of the project is over you’ll need to think engineering!"])
Out[34]:
array([[0.00917289, 0.00102201, 0.00524017, 0.0005126 , 0.00504491,
        0.00389126]], dtype=float32)
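
To make a single prediction easier to read, this sketch pairs each probability with its label name; the 0.5 decision threshold is an assumption, not something tuned above.

In [ ]:
# Pair each predicted probability with its label name. The 0.5 cut-off
# is an assumed threshold, not a tuned one.
probs = model.predict(["Mother fucker"])[0]
for label, p in zip(y_train.columns, probs):
  print(f"{label:15s} {p:.3f}" + ("  <- flagged" if p > 0.5 else ""))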

Save the model and load it again next time.

If you want to use your own model in a GraphQL server, you can convert the trained model to TensorFlow.js (see the TensorFlow.js converter documentation).

In [40]:
model.save_weights('my_model.h5')
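To reload the weights later, a sketch (assuming the same MyModel class and hub URL are available in the new session):

In [ ]:
# A sketch of restoring the saved weights in a fresh session. The subclassed
# model must be built (called once) before the HDF5 weights can be loaded.
restored = MyModel("https://tfhub.dev/google/nnlm-en-dim128/2")
restored.predict(["build the model once"])  # builds the layers' variables
restored.load_weights('my_model.h5')
restored.predict(["Mother fucker"])  # should match the earlier prediction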