I'm having some trouble mentally visualizing how a 1-dimensional convolutional layer feeds into a max pooling layer. I'm using Python 3.6.3, Keras 2.1.2, and the TensorFlow 1.4.0 backend.
In [1]: # Build model
...: # ===========
...: from keras.models import Model
...: from keras.layers import Input, Conv1D, MaxPooling1D, SimpleRNN, Flatten, Dense, Activation
...:
...: NUMBER_OF_POSITIONS = 500
...: # Input layer
...: input_layer = Input(shape=(NUMBER_OF_POSITIONS, 4))
...: # Hidden layers
...: _h = Conv1D(320, 16, strides=1, activation="relu")(input_layer)
...: _h = MaxPooling1D(pool_size=8, strides=8)(_h)
...: _h = SimpleRNN(128, return_sequences=True, activation="tanh")(_h)
...: _h = Flatten()(_h)
...: _h = Dense(1)(_h)
...: # Output layer
...: output_layer = Activation("sigmoid")(_h)
...:
...: model = Model(input_layer, output_layer)
...: # Single sigmoid output, so binary (not categorical) crossentropy
...: model.compile(optimizer="sgd", loss="binary_crossentropy")
...: model.summary()
...:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 500, 4)            0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 485, 320)          20800
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 60, 320)           0
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 60, 128)           57472
_________________________________________________________________
flatten_1 (Flatten)          (None, 7680)              0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7681
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0
=================================================================
Total params: 85,953
Trainable params: 85,953
Non-trainable params: 0
_________________________________________________________________
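For what it's worth, every number in that summary can be reproduced with a few lines of arithmetic (a sketch, assuming Keras's default bias terms for the Conv1D, SimpleRNN, and Dense layers):

```python
# Reproduce the output lengths and parameter counts from model.summary()
# by hand -- pure arithmetic, no Keras needed.
n_positions, n_channels = 500, 4
n_filters, kernel_size = 320, 16
pool_size = pool_stride = 8
rnn_units = 128

conv_len = n_positions - kernel_size + 1                  # 485 valid positions
conv_params = n_filters * (kernel_size * n_channels + 1)  # weights + 1 bias per filter
pool_len = (conv_len - pool_size) // pool_stride + 1      # 60 pooled positions
rnn_params = rnn_units * (n_filters + rnn_units + 1)      # input + recurrent + bias
dense_params = pool_len * rnn_units + 1                   # flattened inputs + bias

print(conv_len, conv_params, pool_len, rnn_params, dense_params)
```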
My understanding of a 1-dimensional convolution with kernel_size=16 and stride=1 is that it essentially steps through a vector, taking overlapping 16-element subsets. For example, below:
# Generate a 20-element sequence, then step through it with a window of size 16, stride 1
# https://pastebin.com/k7tzHvYN
sequence = "ATCATTTTCTCGATGAAAGC"
filter_size = 16
for i in range(len(sequence) - filter_size + 1):
    print(sequence[i:i + filter_size])
# ATCATTTTCTCGATGA
# TCATTTTCTCGATGAA
# CATTTTCTCGATGAAA
# ATTTTCTCGATGAAAG
# TTTTCTCGATGAAAGC
(1) What is the difference between convolutions, feature maps, and filters?
(2) Is my pooling layer reducing the dimensionality from 16 to 8 by taking the maximum value for every 2 positions? I'm also not sure how strides apply to this part of the pipeline.
(3) How does max pooling work with one-hot encoded categorical variables? For example, [[1 0 0 0], [0 0 0 1], [0 1 0 0]].
Response to Alex R's answer below:
Terminology is wishy-washy here, but in this case the feature maps refer to the outputs of each convolution (filter). In this case you are applying 1x16 convolutions, stride 1, to your input of size 500x4, which gives you 500-16+1=485 positions to apply the convolution. Note that since your image depth is 4, then each convolution has 1x16x4 weights total.
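A minimal NumPy sketch of that arithmetic, sliding one (16 x 4) filter along a (500, 4) input (the weights here are random placeholders, not anything Keras would actually learn):

```python
# One Conv1D filter: slide a (kernel_size x channels) weight block along
# the sequence and take a dot product at each of 500 - 16 + 1 = 485 positions.
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(500, 4)   # one input: 500 positions, 4 channels
w = rng.rand(16, 4)    # one filter: 16 x 4 = 64 weights (plus a bias in Keras)

out = np.array([np.sum(x[i:i + 16] * w) for i in range(500 - 16 + 1)])
print(out.shape)       # one feature map; 320 such filters stack into (485, 320)
```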
So in this case, each (1 x 16) convolution is a filter that has 4 channels. Is the feature map going to be the connections between all of the (1 x 16) pixels in the input vector and their maximum value as determined in the pooling layer? Or would these be the weights? Also, did you mean the output of the convolution is (485, 320), or were you alluding to the 5 positions that were dropped?
You are then applying a maxpool of size 8, with stride 8, meaning that each 8x8 cell will condense into a single cell whose value is the maximum. With a stride of 8, you will do 60 maxpools (up to pixel 480). I believe the last 5 pixels are just thrown out.
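That pooling step can be sketched in NumPy along a single feature map (a counting sequence stands in for one of the 320 conv output channels, so it's easy to see which positions survive):

```python
# Max pooling with pool_size=8, stride=8 over one 485-long feature map:
# take the max of each non-overlapping block of 8; the last 5 values
# (positions 480-484) don't fill a complete block and are dropped.
import numpy as np

feature_map = np.arange(485, dtype=float)  # stand-in for one conv output channel
pool_size = stride = 8
pooled = np.array([feature_map[i:i + pool_size].max()
                   for i in range(0, len(feature_map) - pool_size + 1, stride)])
print(pooled.shape)
```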
This is where I'm lost a little bit. If the initial input was (500, 4), what does the (8, 8) cell represent? Would this be grouping 8 filters together? I understand that 8 * 60 is 480, but I'm having trouble understanding what the strides are in this case, since the inputs are convolution outputs. Would this be taking a set of 8 positions (pool_size), applying the maximum, and then moving past those 8 (stride) to the next 8?
Your question about one-hot encoded vector maxpooling is strange. Max-pool would operate no differently in that case, taking the maximum value within the max-pool region.
Sorry, most of the examples I've seen use RGB values, which are usually not in {0, 1}, so the maximum makes more sense to me there. Would the maximum be taken per channel in that filter?
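To make the question concrete, here is a quick NumPy check of what a per-channel max over the three one-hot rows from my example would give (note this pools the raw input; in the actual model, pooling happens on the conv activations, not on the one-hot encoding itself):

```python
# Pooling is applied independently per channel, so pooling these three
# one-hot rows just takes the column-wise (per-channel) maximum.
import numpy as np

x = np.array([[1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])
print(x.max(axis=0))
```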