(GAN)U-Net: Convolutional Networks for Biomedical Image Segmentation Translation

CV Paper List

$\mathbf{U-Net:\;Convolutional\;Networks\;for\;Biomedical\;Image\;Segmentation}$

$\mathbf{Abstract}$

There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.

심층 네트워크의 성공적인 훈련에는 수천 개의 주석이 달린 훈련 샘플이 필요하다는 데 큰 동의가 있다. 본 논문에서는 주석이 달린 사용 가능한 샘플을 보다 효율적으로 사용하기 위해 데이터 확장의 강력한 사용에 의존하는 네트워크 및 훈련 전략을 제시한다. 아키텍처는 컨텍스트를 캡처하기 위한 수축 경로와 정확한 현지화를 가능하게 하는 대칭 확장 경로로 구성된다. 우리는 그러한 네트워크가 극소수의 이미지에서 종단 간으로 훈련될 수 있으며 전자 현미경 스택에서 신경 구조의 분할을 위한 ISBI 과제에서 이전의 최선의 방법(슬라이딩 윈도우 컨볼루션 네트워크)을 능가한다는 것을 보여준다. 전송된 광 현미경 이미지(위상 대비 및 DIC)에 대해 훈련된 동일한 네트워크를 사용하여 이러한 범주에서 ISBI 세포 추적 챌린지 2015에서 큰 차이로 우승했다. 게다가, 네트워크는 빠르다. 최근 GPU에서 512x480 이미지를 분할하는 데 1초도 걸리지 않는다. 전체 구현(Caffe 기반)과 훈련된 네트워크는 http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net))에서 사용할 수 있다.

$1\;\mathbf{Introduction}$

In the last two years, deep convolutional networks have outperformed the state of the art in many visual recognition tasks, e.g. [7,3]. While convolutional networks have already existed for a long time [8], their success was limited due to the size of the available training sets and the size of the considered networks. The breakthrough by Krizhevsky et al. [7] was due to supervised training of a large network with 8 layers and millions of parameters on the ImageNet dataset with 1 million training images. Since then, even larger and deeper networks have been trained [12].

지난 2년 동안, 심층 컨볼루션 네트워크는 많은 시각적 인식 과제(예: [7,3])에서 최첨단 기술을 능가했다. 컨볼루션 네트워크는 이미 오랫동안 존재했지만[8] 사용 가능한 훈련 세트의 크기와 고려된 네트워크의 크기 때문에 성공이 제한되었다. Krizhevsky et al.의 돌파구. [7] 100만 개의 교육 이미지가 있는 ImageNet 데이터 세트에서 8개 계층과 수백만 개의 매개 변수가 있는 대규모 네트워크의 감독된 교육 때문이었다. 그 이후로, 더 크고 더 깊은 네트워크가 훈련되었다[12].

The typical use of convolutional networks is on classification tasks, where the output to an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel. Moreover, thousands of training images are usually beyond reach in biomedical tasks. Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict the class label of each pixel by providing a local region (patch) around that pixel as input. First, this network can localize. Secondly, the training data in terms of patches is much larger than the number of training images. The resulting network won the EM segmentation challenge at ISBI 2012 by a large margin.

컨볼루션 네트워크의 일반적인 사용은 이미지에 대한 출력이 단일 클래스 레이블인 분류 작업에 있다. 그러나 많은 시각적 작업, 특히 생물의학 이미지 처리에서 원하는 출력에는 지역화가 포함되어야 한다. 즉, 클래스 레이블이 각 픽셀에 할당되어야 한다. 더욱이 수천 개의 훈련 이미지는 보통 생물의학 작업에서 도달할 수 없다. 따라서, Ciresan 외 [1] 슬라이딩 픽셀 설정에서 네트워크를 훈련시켜 해당 픽셀 주변의 로컬 영역(픽셀)을 입력으로 제공하여 각 픽셀의 클래스 레이블을 예측한다. 첫째, 이 네트워크는 지역화할 수 있습니다. 둘째, 패치 측면에서 훈련 데이터는 훈련 이미지 수보다 훨씬 크다. 그 결과 네트워크는 ISBI 2012에서 EM 세분화 과제를 큰 차이로 이겼다.

Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.

그림 1. U-net 아키텍처(예: 가장 낮은 해상도의 32x32 픽셀) 각 파란색 상자는 다중 채널 피쳐 맵에 해당합니다. 채널 수는 상자 위에 표시됩니다. x-y-size는 상자의 왼쪽 하단 모서리에 제공됩니다. 흰색 상자는 복사된 피쳐 맵을 나타냅니다. 화살표는 서로 다른 작업을 나타냅니다.

Obviously, the strategy in Ciresan et al. [1] has two drawbacks. First, it is quite slow because the network must be run separately for each patch, and there is a lot of redundancy due to overlapping patches. Secondly, there is a trade-off between localization accuracy and the use of context. Larger patches require more max-pooling layers that reduce the localization accuracy, while small patches allow the network to see only little context. More recent approaches [11,4] proposed a classifier output that takes into account the features from multiple layers. Good localization and the use of context are possible at the same time.

분명히, Ciresan 등의 전략입니다. [1] 두 가지 단점이 있습니다. 우선 패치별로 네트워크를 별도로 운영해야 하기 때문에 속도가 상당히 느리고, 패치가 중복돼 중복되는 부분이 많다. 둘째, 현지화 정확도와 컨텍스트 사용 사이에는 균형이 있다. 패치가 클수록 로컬리제이션 정확도가 떨어지는 최대 풀링 계층이 더 많이 필요한 반면, 패치가 작을수록 네트워크에서 컨텍스트를 거의 볼 수 없습니다. 더 최근의 접근 방식[11,4]은 여러 계층의 특징을 고려한 분류기 출력을 제안했다. 좋은 현지화와 콘텍스트 사용이 동시에 가능하다.

In this paper, we build upon a more elegant architecture, the so-called “fully convolutional network” [9]. We modify and extend this architecture such that it works with very few training images and yields more precise segmentations; see Figure 1. The main idea in [9] is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. Hence, these layers increase the resolution of the output. In order to localize, high resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.

본 논문에서, 우리는 소위 “완전한 컨볼루션 네트워크”라는 보다 우아한 아키텍처를 기반으로 한다[9]. 우리는 이 아키텍처를 수정하고 확장하여 매우 적은 수의 교육 이미지로 작동하고 보다 정확한 세분화를 산출한다. 그림 1을 참조하라. [9]의 주요 아이디어는 풀링 연산자가 업샘플링 연산자로 대체되는 연속적인 계층으로 일반적인 수축 네트워크를 보완하는 것이다. 따라서, 이러한 계층들은 출력의 분해능을 증가시킨다. 로컬리제이션하기 위해 수축 경로의 고해상도 기능이 업샘플링 출력과 결합된다. 연속적인 컨볼루션 레이어는 이 정보를 기반으로 더 정확한 출력을 조립하는 것을 배울 수 있다.

Fig. 2. Overlap-tile strategy for seamless segmentation of arbitrary large images (here segmentation of neuronal structures in EM stacks). Prediction of the segmentation in the yellow area, requires image data within the blue area as input. Missing input data is extrapolated by mirroring

그림 2. 임의의 큰 이미지의 매끄러운 분할을 위한 오버랩 타일 전략(여기서는 전자파 스택의 신경 구조 분할) 노란색 영역의 분할을 예측하려면 파란색 영역 내의 영상 데이터가 입력으로 필요합니다. 누락된 입력 데이터는 미러링을 통해 추정됩니다.

One important modification in our architecture is that in the upsampling part we have also a large number of feature channels, which allow the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path, and yields a u-shaped architecture. The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image. This strategy allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy (see Figure 2). To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory.

우리 아키텍처의 한 가지 중요한 수정은 업샘플링 부분에는 네트워크가 컨텍스트 정보를 고해상도 계층으로 전파할 수 있는 많은 기능 채널도 있다는 것이다. 결과적으로, 확장 경로는 수축 경로와 다소 대칭적이며, u자형 구조를 산출한다. 네트워크는 완전히 연결된 레이어가 없으며 각 컨볼루션의 유효한 부분만 사용한다. 즉, 분할 맵은 입력 영상에서 전체 컨텍스트를 사용할 수 있는 픽셀만 포함한다. 이 전략을 사용하면 오버랩 타일 전략을 통해 임의의 큰 이미지를 원활하게 분할할 수 있습니다(그림 2 참조). 이미지의 경계 영역에 있는 픽셀을 예측하기 위해 입력 이미지를 미러링하여 누락된 컨텍스트를 추정합니다. 이 타일링 전략은 네트워크를 큰 이미지에 적용하는 데 중요하다. 그렇지 않으면 GPU 메모리에 의해 해상도가 제한되기 때문이다.

As for our tasks there is very little training data available, we use excessive data augmentation by applying elastic deformations to the available training images. This allows the network to learn invariance to such deformations, without the need to see these transformations in the annotated image corpus. This is particularly important in biomedical segmentation, since deformation used to be the most common variation in tissue and realistic deformations can be simulated efficiently. The value of data augmentation for learning invariance has been shown in Dosovitskiy et al. [2] in the scope of unsupervised feature learning.

우리의 과제에는 사용 가능한 훈련 데이터가 거의 없으며, 사용 가능한 훈련 이미지에 탄력적인 변형을 적용하여 과도한 데이터 증강을 사용한다. 이를 통해 네트워크는 주석이 달린 이미지 말뭉치에서 이러한 변환을 볼 필요 없이 이러한 변형에 대한 불변성을 학습할 수 있다. 이것은 생체 의학 분할에서 특히 중요한데, 변형은 조직에서 가장 일반적인 변형이었고 현실적인 변형은 효율적으로 시뮬레이션될 수 있기 때문이다. 학습 불변성에 대한 데이터 증강 가치는 Dosovitskyy 등에 나타났다. [2] 비지도 기능 학습의 범위에서

Another challenge in many cell segmentation tasks is the separation of touching objects of the same class; see Figure 3. To this end, we propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.

많은 세포 분할 작업에서 또 다른 과제는 동일한 클래스의 접촉 물체를 분리하는 것이다. 그림 3을 참조하라. 이를 위해, 우리는 터치 셀 간의 분리 배경 레이블이 손실 함수의 큰 가중치를 얻는 가중 손실의 사용을 제안한다.

The resulting network is applicable to various biomedical segmentation problems. In this paper, we show results on the segmentation of neuronal structures in EM stacks (an ongoing competition started at ISBI 2012), where we out-performed the network of Ciresan et al. [1]. Furthermore, we show results for cell segmentation in light microscopy images from the ISBI cell tracking challenge 2015. Here we won with a large margin on the two most challenging 2D transmitted light datasets.

결과 네트워크는 다양한 생물의학 분할 문제에 적용할 수 있다. 본 논문에서는 Ciresan 등의 네트워크를 능가한 EM 스택(ISBI 2012에서 시작된 지속적인 경쟁)의 신경 구조 분할에 대한 결과를 보여준다[1]. 또한 ISBI 세포 추적 챌린지 2015의 광현미경 이미지에서 세포 분할 결과를 보여준다. 여기서 우리는 가장 어려운 2D 전송 광 데이터 세트 두 개를 큰 차이로 이겼다.

$2\;\mathbf{Network\;Architecture}$

The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.

네트워크 아키텍처는 그림 1에 나와 있습니다. 수축 경로(왼쪽)와 확장 경로(오른쪽)로 구성됩니다. 수축 경로는 컨볼루션 네트워크의 전형적인 아키텍처를 따른다. 그것은 두 개의 3x3 컨볼루션(추가되지 않은 컨볼루션)의 반복 적용으로 구성되며, 각각은 다운샘플링을 위한 스트라이드 2를 가진 정류 선형 단위(ReLU)와 2x2 최대 풀링 연산을 따른다. 각 다운샘플링 단계에서 피처 채널 수를 두 배로 늘린다. 확장 경로의 모든 단계는 형상 맵의 업샘플링에 이어 형상 채널의 수를 절반으로 줄이는 2x2 컨볼루션(“업 컨볼루션”)과 수축 경로에서 해당 잘린 형상 맵과의 연결, 그리고 각각 ReLU가 이어지는 2개의 3x3 컨볼루션으로 구성된다. 모든 컨볼루션에서 테두리 픽셀의 손실로 인해 자르는 것이 필요하다. 마지막 계층에서 1x1 컨볼루션은 각 64-구성 요소 피처 벡터를 원하는 클래스 수에 매핑하는 데 사용된다. 네트워크에는 총 23개의 컨볼루션 레이어가 있다.

To allow a seamless tiling of the output segmentation map (see Figure 2), it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even $x$- and $y$-size.

출력 분할 맵의 원활한 타일링을 허용하려면(그림 2 참조), 모든 2x2 최대 풀링 연산이 $x$와 $y$ 크기가 동일한 레이어에 적용되도록 입력 타일 크기를 선택하는 것이 중요합니다.

$\mathbf{3\;Training}$

The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6]. Due to the unpadded convolutions, the output image is smaller than the input by a constant border width. To minimize the overhead and make maximum use of the GPU memory, we favor large input tiles over a large batch size and hence reduce the batch to a single image. Accordingly we use a high momentum (0.99) such that a large number of the previously seen training samples determine the update in the current optimization step.

입력 이미지와 그에 해당하는 분할 맵은 Caffe의 확률적 경사 하강 구현으로 네트워크를 훈련시키는 데 사용된다[6]. 추가되지 않은 컨볼루션으로 인해 출력 이미지는 일정한 테두리 너비만큼 입력보다 작다. 오버헤드를 최소화하고 GPU 메모리를 최대한 사용하기 위해 큰 배치 크기보다 큰 입력 타일을 선호하여 배치를 단일 이미지로 줄인다. 따라서 이전에 본 많은 훈련 샘플이 현재 최적화 단계에서 업데이트를 결정하도록 높은 모멘텀(0.99)을 사용한다.

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function. The soft-max is defined as $p_{k}(x) = \exp(a_{k}(x))/(\sum_{K’=1}^{K}\exp(a_{k’}(x)))$ where $a_{k}(x)$ denotes the activation in feature channel $k$ at the pixel position $x\in\Omega$ with $\Omega⊂Z 2$ . $K$ is the number of classes and $p_{k}(x)$ is the approximated maximum-function. I.e. $p_{k}(x) ≈ 1$ for the k that has the maximum activation $a_{k}(x)$ and $p_{k}(x) ≈ 0$ for all other $k$. The cross entropy then penalizes at each position the deviation of $p_{l(x)}(x)$ from 1 using

에너지 함수는 교차 엔트로피 손실 함수와 결합된 최종 특징 맵에 대한 픽셀 단위 소프트맥스에 의해 계산된다. 소프트맥스는 $p_{k}(x) = \exp(a_{k}(x))/(\sum_{K’=1}^{K}\exp(a_{k’}(x)))$로 정의되며, 여기서 $a_{k}(x)$는 $\Omega⊂Z 2$로 픽셀 위치 $x\in\Omega$에서 피처 채널 $k$의 활성화를 나타낸다. $K$는 클래스 수이고 $p_{k}(x)$는 근사 최대 함수이다. 즉, 다른 모든 $k$에 대해 최대 활성화 $a_{k}(x)$와 $p_{k}(x) ≈ 0$를 갖는 k에 대한 $p_{k}(x) ≈ 1$. 교차 엔트로피는 각 위치에서 다음과 같이 1로부터 $p_{l(x)}(x)$의 편차를 벌한다.

\[E=\sum_{x\in\Omega}w(x)\log{p_{l(x)}(x))}\]

Fig. 3. HeLa cells on glass recorded with DIC (differential interference contrast) microscopy. (a) raw image. (b) overlay with ground truth segmentation. Different colors indicate different instances of the HeLa cells. (c) generated segmentation mask (white: foreground, black: background). (d) map with a pixel-wise loss weight to force the network to learn the border pixels.

그림 3. DIC(Differentience Contrast) 현미경으로 기록된 유리 위의 HeLa 세포. (a) 원시 이미지. (b) 실측 자료 분할이 있는 오버레이. 다른 색상은 HeLa 셀의 다른 인스턴스를 나타냅니다. (c) 생성된 분할 마스크(흰색: 전경, 검은색: 배경) (d) 네트워크가 테두리 픽셀을 학습하도록 강제하기 위해 픽셀 단위 손실 가중치로 맵합니다.

where $l:\Omega\to{(1, . . . , K)}$ is the true label of each pixel and $w:\Omega\to{R}$ is a weight map that we introduced to give some pixels more importance in the training.

여기서 $l:\Omega\to{(1, . . . , K)}$는 각 픽셀의 실제 레이블이고 $w:\Omega\to{R}$는 훈련에서 일부 픽셀을 더 중요시하기 위해 도입한 가중치 맵이다.

We pre-compute the weight map for each ground truth segmentation to compensate the different frequency of pixels from a certain class in the training data set, and to force the network to learn the small separation borders that we introduce between touching cells (See Figure 3c and d).

우리는 훈련 데이터 세트의 특정 클래스의 픽셀의 다른 주파수를 보상하고, 터치 셀 사이에 도입하는 작은 분리 경계를 네트워크가 학습하도록 하기 위해 각 실측 자료 분할에 대한 가중치 맵을 사전 계산한다(그림 3c와 d 참조).

The separation border is computed using morphological operations. The weight map is then computed as

분리 경계는 형태학적 연산을 사용하여 계산된다. 가중치 맵은 다음과 같이 계산됩니다.

\[w(x)=w_{c}(x)+w_{0}\cdot\exp(-\frac{(d_{1}(x)+d_{2}(x))^{2}}{2\sigma^{2}})\]

where $w_{c}:\Omega\to{R}$ is the weight map to balance the class frequencies, $d_{1}:\Omega\to{R}$ denotes the distance to the border of the nearest cell and $d_{2}:\Omega\to{R}$ the distance to the border of the second nearest cell. In our experiments we set $w_{0} = 10$ and $\sigma ≈ 5$ pixels.

여기서 $w_{c}:\Omega\to{R}$는 클래스 주파수의 균형을 맞추기 위한 가중치 맵이고, $d_{1}:\Omega\to{R}$는 가장 가까운 셀의 경계까지의 거리, $d_{2}:\Omega\to{R}$는 두 번째로 가까운 셀의 경계까지의 거리를 나타냅니다. 우리의 실험에서 우리는 $w_{0} = 10$ and $\sigma ≈ 5$ 픽셀을 설정했다.

In deep networks with many convolutional layers and different paths through the network, a good initialization of the weights is extremely important. Otherwise, parts of the network might give excessive activations, while other parts never contribute. Ideally the initial weights should be adapted such that each feature map in the network has approximately unit variance. For a network with our architecture (alternating convolution and ReLU layers) this can be achieved by drawing the initial weights from a Gaussian distribution with a standard deviation of $\sqrt{2/N}$, where $N$ denotes the number of incoming nodes of one neuron [5]. E.g. for a 3x3 convolution and 64 feature channels in the previous layer $N = 9\cdot 64 = 576$.

네트워크를 통해 많은 컨볼루션 레이어와 다른 경로를 가진 심층 네트워크에서 가중치의 좋은 초기화가 매우 중요하다. 그렇지 않으면 네트워크의 일부가 과도한 활성화를 제공하는 반면 다른 부분은 기여하지 않을 수 있습니다. 이상적으로는 네트워크의 각 형상 맵이 대략적인 단위 분산을 갖도록 초기 가중치를 조정해야 한다. 아키텍처(컨볼루션 및 ReLU 계층)를 사용하는 네트워크의 경우 표준 편차가 $\sqrt{2/N}$인 가우스 분포에서 초기 가중치를 끌어와 이를 달성할 수 있다. 여기서 $N$은 한 뉴런의 들어오는 노드 수를 나타낸다[5]. 예: 이전 계층 $N = 9\cdot 64 = 576$의 3x3 컨볼루션 및 64개의 피쳐 채널의 경우.

$\mathbf{3.1\;Data\;Augmentation}$

Data augmentation is essential to teach the network the desired invariance and robustness properties, when only few training samples are available. In case of microscopical images we primarily need shift and rotation invariance as well as robustness to deformations and gray value variations. Especially random elastic deformations of the training samples seem to be the key concept to train a segmentation network with very few annotated images. We generate smooth deformations using random displacement vectors on a coarse 3 by 3 grid. The displacements are sampled from a Gaussian distribution with 10 pixels standard deviation. Per-pixel displacements are then computed using bicubic interpolation. Drop-out layers at the end of the contracting path perform further implicit data augmentation.

훈련 샘플이 거의 없을 때 원하는 불변성 및 견고성 속성을 네트워크에 가르치려면 데이터 확장이 필수적이다. 현미경 이미지의 경우 변형 및 그레이 값 변동에 대한 견고성뿐만 아니라 주로 이동 및 회전 불변성이 필요하다. 특히 훈련 샘플의 무작위 탄성 변형은 주석이 달린 이미지가 거의 없는 분할 네트워크를 훈련시키는 핵심 개념인 것으로 보인다. 우리는 거친 3×3 그리드에서 무작위 변위 벡터를 사용하여 매끄러운 변형을 생성한다. 변위는 10픽셀 표준 편차를 가진 가우스 분포에서 샘플링됩니다. 그런 다음 픽셀당 변위는 바이큐빅 보간법을 사용하여 계산된다. 수축 경로의 끝에 있는 드롭아웃 계층은 추가적인 암묵적 데이터 확대를 수행한다.

$\mathbf{4\;Experiments}$

We demonstrate the application of the u-net to three different segmentation tasks. The first task is the segmentation of neuronal structures in electron microscopic recordings. An example of the data set and our obtained segmentation is displayed in Figure 2. We provide the full result as Supplementary Material. The data set is provided by the EM segmentation challenge [14] that was started at ISBI 2012 and is still open for new contributions. The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with a corresponding fully annotated ground truth segmentation map for cells (white) and membranes (black). The test set is publicly available, but its segmentation maps are kept secret. An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computation of the “warping error”, the “Rand error” and the “pixel error” [14].

우리는 세 가지 다른 세분화 작업에 u-net의 적용을 보여준다. 첫 번째 작업은 전자 현미경 기록에서 뉴런 구조의 분할이다. 데이터 세트와 얻은 분할의 예가 그림 2에 나와 있습니다. 우리는 전체 결과를 보충 자료로 제공합니다. 데이터 세트는 ISBI 2012에서 시작된 전자파 세분화 과제[14]에 의해 제공되며, 여전히 새로운 기여에 대해 열려 있다. 훈련 데이터는 Drosophila의 직렬 단면 투과 전자 현미경에서 얻은 30개의 이미지(512x512 픽셀) 세트이다. 각 이미지에는 셀(흰색)과 멤브레인(검은색)에 대한 해당 전체 주석이 달린 지상 진실 분할 맵이 함께 제공된다. 테스트 세트는 공개적으로 사용할 수 있지만 분할 맵은 비밀로 유지됩니다. 평가는 예측된 막 확률 맵을 주최자에게 전송하여 얻을 수 있다. 평가는 10개의 다른 레벨에서 지도를 임계값으로 설정하고 “왜곡 오류”, “Rand 오류” 및 “픽셀 오류”를 계산하여 수행됩니다 [14].

The u-net (averaged over 7 rotated versions of the input data) achieves without any further pre- or ostprocessing a warping error of 0.0003529 (the new best score, see Table 1) and a rand-error of 0.0382.

u-net (입력 데이터의 7개 이상의 회전된 버전 평균)은 0.0003529의 뒤틀림 오류(새로운 최고 점수, 표 1 참조)와 0.0382의 랜드 오류를 더 이상 사전 처리하지 않고 달성한다.

This is significantly better than the sliding-window convolutional network result by Ciresan et al. [1], whose best submission had a warping error of 0.000420 and a rand error of 0.0504. In terms of rand error the only better performing algorithms on this data set use highly data set specific post-processing methods1 applied to the probability map of Ciresan et al. [1].

이것은 Ciresan 등의 슬라이딩 윈도우 컨볼루션 네트워크 결과보다 훨씬 좋다. 1 랜덤 오류 측면에서 이 데이터 세트에서 가장 성능이 좋은 알고리듬은 Ciresan 등의 확률 맵에 적용된 높은 데이터 세트 특정 후 처리 방법1을 사용한다. [1]

Table 1. Ranking on the EM segmentation challenge [14] (march 6th, 2015), sorted by warping error.

표 1. 뒤틀림 오류별로 정렬된 전자파 세분화 과제 14의 순위.

Table 1

Fig. 4. Result on the ISBI cell tracking challenge. (a) part of an input image of the “PhC-U373” data set. (b) Segmentation result (cyan mask) with manual ground truth (yellow border) (c) input image of the “DIC-HeLa” data set. (d) Segmentation result (random colored masks) with manual ground truth (yellow border).

그림 4. ISBI 세포 추적 도전 결과 (a) “PhC-U373” 데이터 세트의 입력 영상의 일부. (b) 수동 접지 측(노란색 테두리)이 있는 분할 결과(시안 마스크) (c) “DIC-HeLa” 데이터 세트의 입력 영상. (d) 수동 접지 측(노란색 테두리)이 있는 분할 결과(랜덤 컬러 마스크)

Table 2. Segmentation results (IOU) on the ISBI cell tracking challenge 2015.

표 2. ISBI 세포 추적 챌린지 2015에 대한 분할 결과(IOU).

Table 2

We also applied the u-net to a cell segmentation task in light microscopic images. This segmenation task is part of the ISBI cell tracking challenge 2014 and 2015 [10,13]. The first data set “PhC-U373”2 contains Glioblastoma-astrocytoma U373 cells on a polyacrylimide substrate recorded by phase contrast microscopy (see Figure 4a,b and Supp. Material). It contains 35 partially annotated training images. Here we achieve an average IOU (“intersection over union”) of 92%, which is significantly better than the second best algorithm with 83% (see Table 2). The second data set “DIC-HeLa”3 are HeLa cells on a flat glass recorded by differential interference contrast (DIC) microscopy (see Figure 3, Figure 4c,d and Supp. Material). It contains 20 partially annotated training images. Here we achieve an average IOU of 77.5% which is significantly better than the second best algorithm with 46%.

우리는 또한 u-net을 가벼운 현미경 이미지의 세포 분할 작업에 적용했다. 이 분할 작업은 ISBI 세포 추적 과제 2014 및 2015의 일부이다[10,13]. 첫 번째 데이터 세트 “PhC-U373”2는 위상 대비 현미경으로 기록된 폴리아크릴이미드 기질에 있는 교아세포종-천문세포종 U373 세포를 포함하고 있다(그림 4a, b, Sup 참조). 재료). 여기에는 부분적으로 주석이 달린 35개의 교육 이미지가 포함되어 있다. 여기서 우리는 평균 92%의 IOU(“조합에 대한 교차”)를 달성하는데, 이는 83%로 두 번째로 좋은 알고리듬보다 훨씬 낫다(표 2 참조). 두 번째 데이터 세트 “DIC-HeLa”3는 차동 간섭 조영(DIC) 현미경으로 기록된 평평한 유리의 HeLa 세포입니다(그림 3, 그림 4c, d 및 Supp 참조). 재료). 여기에는 부분적으로 주석이 달린 20개의 교육 이미지가 포함되어 있다. 여기서 우리는 평균 77.5%의 IOU를 달성하는데, 이는 46%로 두 번째로 좋은 알고리듬보다 훨씬 낫다.

$\mathbf{5.\;Conclusion}$

The u-net architecture achieves very good performance on very different biomedical segmentation applications. Thanks to data augmentation with elastic deformations, it only needs very few annotated images and has a very reasonable training time of only 10 hours on a NVidia Titan GPU (6 GB). We provide the full Caffe[6]-based implementation and the trained networks4 . We are sure that the u-net architecture can be applied easily to many more tasks.

u-net 아키텍처는 매우 다른 생물의학 세분화 애플리케이션에서 매우 우수한 성능을 달성한다. 탄성 변형을 통한 데이터 증강 덕분에 주석이 달린 이미지만 거의 필요하지 않으며 NVidia Titan GPU(6GB)에서 10시간이라는 매우 합리적인 교육 시간을 갖는다. 우리는 완전한 Caffe[6] 기반 구현과 훈련된 네트워크4를 제공한다. 우리는 u-net 아키텍처를 더 많은 작업에 쉽게 적용할 수 있다고 확신한다.

$\mathbf{Acknowledgments}$

This study was supported by the Excellence Initiative of the German Federal and State governments (EXC 294) and by the BMBF (Fkz 0316185B).

이 연구는 독일 연방 및 주 정부의 Excellence Initiative(EXC 294)와 BMBF(Fkz 0316185B)의 지원을 받았다.

(GAN)U-Net: Convolutional Networks for Biomedical Image Segmentation Translation

CV Paper List

\(\mathbf{U-Net:\;Convolutional\;Networks\;for\;Biomedical\;Image\;Segmentation}\)

\(\mathbf{Abstract}\)

$1\;\mathbf{Introduction}$

$2\;\mathbf{Network\;Architecture}$

$\mathbf{3\;Training}$

$\mathbf{3.1\;Data\;Augmentation}$

$\mathbf{4\;Experiments}$

$\mathbf{5.\;Conclusion}$

$\mathbf{Acknowledgments}$