-
One of the great appeals of deep learning is how much it democratizes modeling for practitioners. Whereas in the past statisticians or econometricians needed a fair amount of domain knowledge to build models, deep learning and ensemble methods relax that requirement by automatically exploring combinations of interactions, making it unnecessary to hand-engineer specific interaction terms. That said, there are still times when feature engineering, aided by domain knowledge, would benefit model building, and combining those engineered features with different data types, such as text/NLP alongside regular numeric data, strikes me as really interesting. There is a short blog post I found here that introduces the idea, although I haven't really seen any classes or tutorials do it with deep learning.

I'm actually attempting it myself, but am running into a bit of trouble. I know this goes beyond the scope of the course, but I'm curious if anyone can spot how to fix this. I'm using the wine reviews dataset from Kaggle and turned the points column, which ranges from 0 to 100, into a dichotomous target variable so it's a more clear-cut classification problem. There is a description column with the actual review text, along with other variables, but for the sake of this proof of concept the only additional numeric features I'm using are a scaled word count of the description and a scaled price (there could be many more useful ones, but I'm keeping it simple for now). I tried to follow the blog post as best I could, but it's pretty scant. After loading the data and doing the feature engineering and preprocessing, the workflow and error message are below.
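Roughly, the kind of two-input workflow I'm attempting looks like the sketch below, using the Keras functional API. This is a minimal, illustrative sketch rather than my exact code: the file path, the 90-point cutoff for the target, the vocabulary size, and the layer sizes are all placeholder choices.

```python
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.preprocessing import StandardScaler

# Kaggle wine reviews file (adjust the path/filename to your copy)
df = pd.read_csv("winemag-data-130k-v2.csv")
df = df.dropna(subset=["description", "price", "points"])

# Dichotomous target: "good" wine if points >= 90 (arbitrary cutoff)
y = (df["points"] >= 90).astype(int).values

# Engineered numeric features: scaled word count of the description and scaled price
df["n_words"] = df["description"].str.split().str.len()
X_num = StandardScaler().fit_transform(df[["n_words", "price"]])

# Text features: tokenize the description and pad to a fixed length
max_words, max_len = 20_000, 100
tok = tf.keras.preprocessing.text.Tokenizer(num_words=max_words)
tok.fit_on_texts(df["description"])
X_text = tf.keras.preprocessing.sequence.pad_sequences(
    tok.texts_to_sequences(df["description"]), maxlen=max_len)

# Two inputs: one branch embeds the text, the other takes the numeric features
text_in = layers.Input(shape=(max_len,), name="text")
num_in = layers.Input(shape=(X_num.shape[1],), name="numeric")

x = layers.Embedding(max_words, 32)(text_in)
x = layers.GlobalAveragePooling1D()(x)
x = layers.concatenate([x, num_in])  # combine the text and numeric branches
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = Model(inputs=[text_in, num_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With multiple inputs, the data is passed as a list (or a dict keyed by input name)
model.fit([X_text, X_num], y, validation_split=0.2, epochs=3, batch_size=64)
```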
And the error message:

Any ideas on how to solve that?
-
Just as an update to this (sorry, but I do find this topic interesting): I found the answer on Stack Overflow, and it's pretty simple.
What's also super interesting is that after changing the model to include both the text and the extracted features, its performance improved DRAMATICALLY. I wasn't necessarily surprised by this, but it was gratifying to see.
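For context on the comparison, what I mean is training a text-only baseline with the same embedding branch and comparing validation accuracy against the two-input model. A rough sketch, reusing the variable names from the first post; the baseline architecture here is just illustrative, not my exact evaluation code:

```python
from tensorflow.keras import layers, Model

# Text-only baseline: same embedding branch as above, but no numeric input.
# Reuses max_len, max_words, X_text, and y from the sketch in the first post.
text_only_in = layers.Input(shape=(max_len,), name="text")
x = layers.Embedding(max_words, 32)(text_only_in)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)

baseline = Model(inputs=text_only_in, outputs=out)
baseline.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = baseline.fit(X_text, y, validation_split=0.2, epochs=3, batch_size=64)

# Compare max(history.history["val_accuracy"]) against the two-input model's run
```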