“MixConv: Mixed Depthwise Convolutional Kernels”

Abstract

다양한 kernel size의 효과
- 다양한 kernel size의 조합은 모델의 accuracy 및 efficient를 향상 시킬 수 있다.
위를 바탕으로 depth-wise convolution(MixConv)제안
MixConv의 효과를 증명하기 위해서 AutoML을 결합

Introduction

Figure 1에서 볼 수 있듯이, kernel size가 클수록 모델의 성능이 올라간다고 할 수 없다. 올라가는 추세를 보이다가 k9*9를 넘어서면 accuracy가 떨어지는 것을 확인할 수 있다.

ConvNets 연구에서는 하나의 kernel size의 한계를 말한다. 결국 high-resolustion pattern을 위해서는 큰 kernel size가 필요하고 local-resolution pattern을 위해서는 작은 kernel-size가 필요하다.

아래는 해당 논문의 MixConv의 개략적인 구조이다.

Efficient ConvNets
Multi-Scale Networks and Features
- 이전의 연구들은 모델 구조 자체를 변화시킴
- MixConv는 모델의 구조는 그대로 유지한채로 kernel size를 변화시키는 것이 목표
Neural Arichitecture Search

MixConv

MixConv는 하나의 depth wise convolution op에서 multiple kernel을 섞는 것이다. 기대효과는 다양한 타입의 pattern을 수집하는 것이다.

3.1 MixConv Feature Map

def mixconv(x, filters, **args):
  # x: input features with shape [N,H,W,C]
  # filters: a list of filters with shape [K_i, K_i, C_i, M_i] for i−th group.
  G = len(filters) # number of groups.
  y = []
  for xi, fi in zip(tf.split(x, G, axis=−1), filters):
  y.append(tf.nn.depthwise_conv2d(xi, fi, ∗∗args))
  return tf.concat(y, axis=−1)

$X^{(h, w, c)}$: input tensor with shape (h, w, c)
- h: height, w: width, c: channel
$W^{(k, k, c, m)}$ : depth wise convolution kernel
- $k \times k$ : kernel size
- $c$: input channel size
- $m$: output channel size
$Y^{(h, w, c*m)}$: same spatial shape (h, w), multiplied output channel size $m \cdot c$

$Y_{x, y, z} = \sum_{-\frac{k}{2} \le i \le \frac{k}{2}, -\frac{k}{2} \le j \le \frac{k}{2}} X_{x+i, y+j, \frac{z}{m}} \cdot W_{i, j, z} \ \ \ \forall z=1, \cdots, m \cdot c$

위의 식은 MixConv과 적용되는 과정을 수식화한 것이다. Input tensor를 g개의 그룹으로 분리하면 아래와 같다.

$<\hat{X}^{(h,w,c_1)}, \cdots, \hat{X}^{(h,w,c_g)}>$

각 tensor의 spatial height와 width는 모두 동일하며, 각 channel을 모두 더하면 다음과 같다.

$c_1 + c_2 + \cdots + c_g = c$

또한 g개의 kernel group을 구성할 수 있다.

$<\hat{W}^{(k_1,k_1,c_1, m)}, \cdots, \hat{W}^{(k_g,k_g,c_g, m)}>$

t-th group output은 다음과 같이 구성된다.

$\hat{Y^t}_{x, y, z} = \sum_{-\frac{k_t}{2} \le i \le \frac{k_t}{2}, -\frac{k_t}{2} \le j \le \frac{k_t}{2}} \hat{X^t}_{x+i, y+i, \frac{z}{m}} \cdot\hat{W^t}_{i, j, z} \ \ \ \forall z=1, \cdots, m \cdot c_t$

그리고 final output tensor는 아래와 같다.

$Y_{x, y, z_0}=Concat(\hat{Y^1}_{x,y,z_1}, \cdots,\hat{Y^g}_{x,y,z_g})$ `

3.2 MixConv Design Choices

Group size

하나의 input tensor에 얼마나 다양한 size의 kernel을 사용할 것인지 정해야 한다.

group이 하나라면 vanilla depth wise convolution과 같고, 해당 논문은 실험을 통해서 MobileNets에서는 $g=4$ 가 안정적으로 적용됨을 확인하였다. 하지만, neural architecture search를 통해서 group size가 1에서 5까지의 범위에서 변하면 더 좋은 성능을 낼 수 있음을 확인하였다.
Kenel size per group

다른 그룹간의 같은 kernel size을 사용한다면, 그룹을 분리하는 의미가 없다. 따라서 해당 논문에서는 다른 그룹은 다른 kernel size를 가지도록 제한하였다.
channel size per group

해당 논문에서는 두 가지 channel 분리 방법을 사용하였다.
1. equal parition: (8, 8, 8, 8)
2. exponential parition: (16, 8, 4, 4)
dilated convolution

large size의 kernel은 많은 parameter와 computation을 필요로 한다. 따라서 이를 대처하기 위해서 dilated convolution을 사용하였다.

[Dilated convolution]

Dilated Convolution은 필터 내부에 zero padding을 추가해 강제로 receptive field를 늘리는 방법이다. 위 그림은 파란색이 인풋, 초록색이 아웃풋인데, 진한 파랑 부분에만 weight가 있고 나머지 부분은 0으로 채워진다.

출처: https://3months.tistory.com/213 [Deep Play]

3.3 MixConv Performance on Mobile Nets

3. 4 Ablation Study

ablation study

모델이나 알고리즘의 특징들을 제거하면서 그게 퍼포먼스에 어떤 영향을 줄지 연구하는 거

MixConv for Single Layer

하나의 layer씩 변화시켜본 결과는 위의 Figure 5와 같다. 2(s2)-(with large kernel size + stride2)를 보면 accuracy가 상승한 것을 확인할 수 있으며, 대부분의 레이어에서 비슷하거나 조금의 성능 향상이 있었다.

Channel Partition Method/ Dilated Convolution

Figure 6에서 확인할 수 있듯이, dilated convolution은 parameter size가 커질 때 급격한 성능 하락이 있다. 해당 논문은 이는 local information을 잃어버리기 때문이라고 주장한다.

FLOPS

플롭스(FLOPS, FLoating point OPerations per Second)는 컴퓨터의 성능을 수치로 나타낼 때 주로 사용되는 단위이다.

https://ko.wikipedia.org/wiki/%ED%94%8C%EB%A1%AD%EC%8A%A4

MixNet

4.1 Architecture Search

이전의 Architecture Search 연구에서는 kernel size, expansion ratio, channel size등을 고려했다면, 해당 논문에서는 vanilla depth wise convolution을 baseline으로 삼고 MixConv를 search option으로 지정하였다.

MixConv 옵션으로는 $g=1, \cdots, 5$ 가 있다.
search option을 단순화 시키기 위해서 exponential partion은 제외

4.2 MixNet Performance on ImageNet

4.3 MixNet Architectures

accuracy와 efficiency 향상의 이유를 알기 위해서 network architecture를 분석하였다.

전반적으로 MixNet모델들은 다양한 크기의 kernel을 사용하였다.

small kernel은 앞단의 stage에서 computation cost를 줄이기 위하여 사용함
large kernel은 뒷단의 stage에서 더 좋은 accuracy를 위해서 많이 나타남.

또한, 큰 MixNet model일수록 parameter와 FLOPS를 비용으로 지불하면서 더 큰 size의 kernel을 많이 사용하고 더 많은 수의 layer를 사용하였다.

Figure 1을 보면 vanilla depthwise convolution은 일정크기 이상의 kernel을 사용하면 급격한 성능하락이 있었지만, MixConv는 large size의 kernel을 사용할 수 있었다.

4.4 Transfer Learning Performance

Figure 9는 transfer learning을 MixNet에 적용하였을 때, 성능향상을 나타내는 이미지이다. MixNet-M은 97.92%의 성능향상을 했으며 이는 ResNet-50보다 1% 높은 수치이다.

실험은 imagenet으로 pretrained된 모델을 CIFAR10, 100에 적용한 것이다.

Reference

https://arxiv.org/pdf/1907.09595.pdf