Noise2Atom: unsupervised denoising for scanning transmission electron microscopy images

We propose an effective deep learning model to denoise scanning transmission electron microscopy (STEM) image series, named Noise2Atom, to map images from a source domain S to a target domain C, where S is for our noisy experimental dataset, and C is for the desired clear atomic images. Noise2Atom uses two external networks to apply additional constraints from the domain knowledge. This model requires no signal prior, no noise model estimation, and no paired training images. The only assumption is that the inputs are acquired with identical experimental configurations. To evaluate the restoration performance of our model, as it is impossible to obtain ground truth for our experimental dataset, we propose consecutive structural similarity (CSS) for image quality assessment, based on the fact that the structures remain much the same as in the previous frame(s) within small scan intervals.
We demonstrate the superiority of our model by providing evaluation in terms of CSS and visual quality on different experimental datasets.


Introduction
Deep neural network denoising techniques have drawn a lot of attention (Kokkinos and Lefkimmiatis 2019; Chang et al. 2019; Song et al. 2019; Lin et al. 2019; Lehtinen et al. 2018; Buchholz et al. 2019b; Guo et al. 2018; Kadimesetty et al. 2018; Liu et al. 2018; Mildenhall et al. 2018; Ran et al. 2019; Su et al. 2019; Xie et al. 2018), as they address several drawbacks of conventional analytical methods (Lucas et al. 2018), such as (1) the computational burden in the testing phase, i.e., an analytical method needs to solve an optimization problem for every input, which is computationally inefficient, and (2) the difficulty of setting up hyper-parameters to incorporate prior or domain knowledge. Deep Convolutional Neural Networks (DCNNs) are the default models of choice when working with highly structured datasets such as images and videos, as DCNNs (1) are more computationally efficient than multilayer perceptron models, featuring fewer parameters, and (2) take advantage of the structure of these datasets, such as translation invariance and locality.

*Correspondence: Feng.Wang@empa.ch. Electron Microscopy Center, Empa, Swiss Federal Laboratories for Materials Science and Technology, Überlandstr. 129, CH-8600 Dübendorf, Switzerland.
While most DCNN models are trained using pairs of noisy and clean images, some recent methods, such as Noise2Void (Krull et al. 2019a) and Noise2Self (Batson and Royer 2019), can be unsupervised, but at the price of degraded performance. Noisier2Noise (Moran et al. 2020), probabilistic Noise2Void and parametric probabilistic Noise2Void (PPN2V) (Prakash et al. 2020) improve the performance by introducing estimated noise models.
With a typical dwell time of 10^-7 s and down to fewer than 10 electrons per pixel, a modern scanning transmission electron microscope (STEM) optimized for low-dose fast dynamic imaging produces very noisy images, often containing more noise than signal, as a result of the high frame rate (frames per second, fps) and the need to limit the radiation dose. No ground truth images exist for training on such data. As modern electron microscopy experiments often target complex dynamic systems of moving atoms (Cao et al. 2018), it can be difficult to generate simulated images suitable as ground truth. Therefore, it is not feasible to denoise these images directly with supervised models. Furthermore, because of the inner complexity of the data degradation, i.e., a simple additive white noise model does not apply (Wang et al. 2020), noise-model-based approaches are difficult.
Our approach assumes an underlying relationship between the clean atomic images and the noisy high-angle annular dark-field (HAADF) STEM images for our model to learn: for a bright area, there is a high probability of the presence of atom(s), and for a dim area, there might only be background. This relationship holds in our case of studying small metal atoms and clusters on lighter support films. Although we lack paired noisy-clean images, we can still train our model using Cycle-Consistent Adversarial Networks (Zhu et al. 2017). Moreover, we can apply an additional constraint to improve the restoration quality: a good model should give Gaussian-like shapes for atomic peaks (Dwyer et al. 2010). Our main contributions are: (1) demonstrating how to integrate domain-specific information with a Generative Adversarial Network (GAN) and a customized convolutional network extracting low-frequency features in a denoising application, (2) showing how to restore images using a cycle training strategy without knowing the signal prior, and thus free of noise models, and (3) proposing a quantitative metric for image time series restoration where no ground truth exists.

Methods
Our goal is to train a deep convolutional neural network translating images from the domain S to another domain C, where S is for the experimental noisy STEM images with training samples {s_i}_{i=1}^N ∈ S, and C is for the expected atomic images composed of pure Gaussian peaks with simulated samples {c_i}_{i=1}^M ∈ C. Our model includes two mappings, M_s2c : S → C and M_c2s : C → S. In addition, we introduce an adversarial discriminator D to distinguish between images {c} and translated images M_s2c(s), and an additional mapping M_s2b : S → B, where B is for the slowly varying background of the noisy images.
The objective of our model contains three terms: (1) an adversarial loss for matching the distribution of the denoised images to the data distribution of the simulated atomic images, (2) a cycle consistency loss to prevent the learned composed mapping M_s2c • M_c2s from contradicting the original, which is inspired by CycleGAN (Zhu et al. 2017), DualGAN (Yi et al. 2017) and DiscoGAN (Kim et al. 2017), and (3) a low-frequency cycle loss to keep the learned composed mapping M_s2b • M_c2s • M_s2c consistent with the low-frequency components.

Adversarial loss
We apply an adversarial loss to the mapping M_s2c. For this mapping and its adversarial critic D, with a batch of training samples s ∈ S and c ∈ C, the objective contains two adversarial losses and a gradient penalty loss,

    L_adv = W(D(c), 1) + W(D(M_s2c(s)), −1) + G(s),

in which W is the Wasserstein loss function (Arjovsky et al. 2017; Wu et al. 2018), and in our implementation

    W(a, t) = E(a ⊙ t),

where G is the gradient penalty loss function. For a single sample s_i ∈ s, with the prediction x_i = M_s2c(s_i) and the random weighted sample y_i = R(s_i, x_i), the penalty loss for these samples is

    G(s) = (λ / b_s) Σ_i (‖∇_{y_i} D(y_i)‖_2 − 1)²,

in which R(s_i, x_i) = u ⊙ s_i + (1 − u) ⊙ x_i is the randomized weighting function with a uniform random tensor u in the range [0, 1], the sum over i gives an average over the training batch size b_s, λ = 10 controls the gradient penalty strength, ⊙ denotes elementwise multiplication, and E denotes the mean of the elements.
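As a concrete sketch of the two building blocks named above, the random weighted sampling R and the gradient penalty term can be written in NumPy as follows; `critic_input_grad` is a hypothetical callable standing in for the gradient ∇_y D(y) that a deep learning framework would supply via automatic differentiation:

```python
import numpy as np

def random_weighted_sample(s, x, rng):
    # R(s, x) = u ⊙ s + (1 − u) ⊙ x, with u drawn uniformly from [0, 1]
    u = rng.uniform(0.0, 1.0, size=s.shape)
    return u * s + (1.0 - u) * x

def gradient_penalty(critic_input_grad, y, lam=10.0):
    # λ · mean over the batch of (‖∇_y D(y)‖₂ − 1)², for a batch y of shape (b, h, w).
    # critic_input_grad(y) must return ∇_y D(y) with the same shape as y.
    grads = critic_input_grad(y)
    norms = np.sqrt((grads ** 2).sum(axis=(1, 2)))
    return lam * np.mean((norms - 1.0) ** 2)
```

A critic whose input gradient has unit norm everywhere (the WGAN-GP target) incurs zero penalty, which is what the λ-weighted term pushes the learned critic towards.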

Cycle consistency loss
We expect that clean images translated to domain S with M_c2s can be translated back to domain C with M_s2c without changing any of the content. To apply this constraint, we use the cycle consistency loss

    L_cycle = E( |M_s2c(M_c2s(c)) − c| ).
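A minimal NumPy sketch of this loss, with the two mappings passed in as plain callables standing in for the trained networks:

```python
import numpy as np

def cycle_consistency_loss(m_s2c, m_c2s, c):
    # Mean absolute error between a clean batch c and its round trip
    # through M_c2s followed by M_s2c.
    return np.mean(np.abs(m_s2c(m_c2s(c)) - c))
```

The loss is zero exactly when the composed mapping reproduces the input, and grows with any content the round trip alters.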

Low-Frequency cycle loss
We realize that the noise in the experimental images typically consists of random discrete bright and dark pixels. We therefore relax M_c2s by comparing the low-frequency features of its outputs with those of the inputs, instead of enforcing an exact pixel-wise match. We express this objective as

    L_lfc = E( |M_s2b(M_c2s(M_s2c(s))) − M_s2b(s)| ),

in which M_s2b is manually designed with precalculated Gaussian filters.
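In the spirit of the description above, the precalculated Gaussian filter and the resulting loss can be sketched in NumPy; the 33 × 33 size and σ = 15 follow the implementation details given later, and the slow pixel-loop convolution is only for illustration:

```python
import numpy as np

def gaussian_kernel(size=33, sigma=15.0):
    # Normalized 2D Gaussian, precalculated once and never trained
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def m_s2b(image, kernel):
    # Single-channel "valid" convolution (no padding, no bias)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def low_frequency_cycle_loss(s, s_cycled, kernel):
    # Compare low-frequency (blurred) features instead of raw pixels
    return np.mean(np.abs(m_s2b(s_cycled, kernel) - m_s2b(s, kernel)))
```

Because isolated bright or dark noise pixels are averaged away by the wide Gaussian, the loss tolerates pixel-level noise while still penalizing changes to the slowly varying background.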

Full objective
The full objective is

    L = L_adv + α L_cycle + β L_lfc,

in which the constants α = 5 and β = 1 control the relative weights of the three losses. Finally, we aim to solve

    M*_s2c, M*_c2s = arg min_{M_s2c, M_c2s} max_D L.
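The weighting can be written as a one-line helper, with the per-batch loss values passed in as scalars:

```python
def full_objective(l_adv, l_cycle, l_lfc, alpha=5.0, beta=1.0):
    # L = L_adv + α · L_cycle + β · L_lfc, with α = 5 and β = 1 as above
    return l_adv + alpha * l_cycle + beta * l_lfc
```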

Implementation details
Datasets. There are two types of images: (1) Simulated clean atomic images: the images in domain C. We simulated 32768 clean images. First, we randomly sampled 75-150 atomic positions in a 2D 256 × 256 pixel lattice. Then, we randomly assigned 1-4 atoms to each of the positions. Afterward, we generated a 33 × 33 pixel 2D Gaussian kernel with a random variance in the range [1.0, 10.0]. This kernel was convolved with the 2D lattice, and the 128 × 128 pixels at the center of the image were cropped as our simulated clean image. (2) Experimental STEM images: the images in domain S. We tested our approach on three experimental movies of dynamic atomic clusters of Pt on a carbon film. This data was recorded at 150 fps with 128 × 128 pixels, 15 fps with 512 × 512 pixels and 5 fps with 1024 × 1024 pixels, with an electron dose in the range [10^5, 10^6] e Å^-2 s^-1, using an FEI Titan Themis operated at 300 kV.
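The simulation recipe for the clean images can be sketched as follows (NumPy; FFT-based circular convolution is used here as a stand-in for whichever convolution routine the authors used, which is an assumption):

```python
import numpy as np

def simulate_clean_image(rng):
    # 75–150 random atomic positions on a 256 × 256 lattice, 1–4 atoms each
    lattice = np.zeros((256, 256))
    n_sites = rng.integers(75, 151)
    ys = rng.integers(0, 256, size=n_sites)
    xs = rng.integers(0, 256, size=n_sites)
    lattice[ys, xs] = rng.integers(1, 5, size=n_sites)

    # 33 × 33 Gaussian kernel with a random variance in [1.0, 10.0]
    var = rng.uniform(1.0, 10.0)
    ax = np.arange(33) - 16
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * var))

    # Convolve the lattice with the kernel, then crop the central 128 × 128 pixels
    blurred = np.real(np.fft.ifft2(np.fft.fft2(lattice) *
                                   np.fft.fft2(kernel, s=(256, 256))))
    return blurred[64:192, 64:192]
```

Each call produces one sample of domain C: a field of Gaussian peaks whose heights encode the number of stacked atoms.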
Network Architectures. We design M_s2c and M_c2s as two identical U-Nets (Ronneberger et al. 2015) of depth 3 with Xception modules (Chollet 2017) using kernel sizes {1, 3, 5, 7}, 16 deep residual blocks (He et al. 2016), and instance normalization (Ulyanov et al. 2016) followed by leaky ReLU activation, except for a tanh activation function at the last layer. For upsampling, we use a transposed convolution with a stride of 2 and a kernel size of 4 × 4, followed by a convolution with a kernel size of 3 × 3. For downsampling, we use a transposed convolution with a kernel size of 3 × 3, followed by a convolution with a stride of 2 and a kernel size of 3 × 3. No zero-padding is applied to the convolution and transposed convolution operations. These two models aim at translating noisy images to clean images and clean images to noisy images, respectively. We design M_s2b as a one-layered network using a single filter of size 33 × 33, without padding and bias. We precalculate its weights as a normalized 2D Gaussian distribution with a variance of σ = 15.0. This model aims to match the slowly varying low-frequency features of two noisy images. The critic model D is composed of 4 downsampling modules and a fully connected layer. The downsampling modules contain a transposed convolution layer with a kernel size of 3 × 3, followed by a convolution layer of stride 2 with a kernel size of 3 × 3 and then a dropout layer of 25%. This model aims to classify whether an input image contains only 2D Gaussian-like peaks or not.

Fig. 1 Typical loss curves for L_cycle and L_lfc
Training Details. Our model contains hundreds of layers; to fit all the data into the 12 GB memory of an Nvidia GTX 1080 Ti GPU, we crop our experimental images to 128 × 128 pixels and train using the RMSProp algorithm (Krizhevsky et al. 2012) with a batch size of 6, a learning rate of 5 × 10^-5 and a momentum of 0.9. A typical aberration-corrected HAADF STEM dataset can contain thousands of images. To save computation time, for each dataset, 120 images are randomly sampled to train our model. Our model usually takes 100 epochs to reach a good prediction, when the cycle consistency loss is less than 0.05 and the low-frequency cycle loss is less than 0.1. Typical convergence curves for L_cycle and L_lfc are presented in Fig. 1.
Explicitly, Noise2Atom is composed of 4 sub-models: (1) a critic model, D, that predicts True for clean images and False for noisy images, as demonstrated in Fig. 2a, (2) a noisy-to-clean model, M_s2c, that translates noisy images to clean images, as demonstrated in Fig. 2d, (3) a clean-to-noisy model, M_c2s, that translates clean images to noisy images, as demonstrated in Fig. 2c, and

Fig. 2 Submodels in Noise2Atom
(4) a low-frequency feature extraction model, M_s2b, that translates noisy images to blurry images, as demonstrated in Fig. 2b.
With the weights for sub-model ➃ already hand-crafted, we train sub-models ➀-➂ in the following fashion (within a single batch): (1) training sub-model ➀ 5 times by mapping noisy images to False and clean images to True, (2) training the composed model ➁•➀ once by mapping noisy images to True, with sub-model ➀ fixed, (3) training the composed model ➁•➂•➃ once by mapping noisy images to their low-frequency features, with sub-model ➃ fixed, and (4) training the composed model ➂•➁ once by mapping clean images to themselves.
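The per-batch schedule above can be sketched in plain Python, with the four update steps passed in as callables (hypothetical names; in practice each would run one optimizer step in a deep learning framework):

```python
def train_one_batch(train_critic, train_adversarial, train_low_freq, train_cycle):
    # Sub-model ➃ (the Gaussian filter) is fixed and never trained.
    for _ in range(5):
        train_critic()       # ➀: noisy → False, clean → True
    train_adversarial()      # ➁•➀: noisy → True, with the critic ➀ frozen
    train_low_freq()         # ➁•➂•➃: noisy → its low-frequency features, ➃ frozen
    train_cycle()            # ➂•➁: clean → clean (cycle consistency)
```

Training the critic five times per generator update follows common WGAN practice of keeping the critic close to optimal.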

Results
No noise-free images exist as ground truth for our experimental datasets. Therefore, to evaluate the restoration quality, we design a consecutive structural similarity (CSS) metric. For this, we assume that most of the content (except for noise) of consecutive frames remains the same from frame to frame, due to the short frame time (typically 10^-1 s). The CSS metric for the image I_n at frame n and the image I_{n+1} at frame n + 1 is given by

    CSS(I_n, I_{n+1}) = ((2 μ_{I_n} μ_{I_{n+1}} + C_1)(2 σ_{I_n,I_{n+1}} + C_2)) / ((μ²_{I_n} + μ²_{I_{n+1}} + C_1)(σ²_{I_n} + σ²_{I_{n+1}} + C_2)),

in which μ_{I_n} and μ_{I_{n+1}} are the means, σ²_{I_n} and σ²_{I_{n+1}} are the variances, σ_{I_n,I_{n+1}} is the covariance of I_n and I_{n+1}, and C_1 = 10^-4 and C_2 = 9 × 10^-4 are two constants to stabilize the division. The CSS metric is a variation of the structural similarity index measure (SSIM), which is widely used to predict reconstruction quality by measuring the similarity between the ground truth image and the predicted image. As we do not have ground truth in this domain-specific problem, we predict the denoising quality by measuring the similarity between two denoised consecutive images.
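A global (single-window) NumPy sketch of the CSS metric, assuming images normalized to [0, 1] so that the SSIM-style constants C_1 and C_2 apply; windowed SSIM implementations instead average the statistic over local patches:

```python
import numpy as np

def css(frame_a, frame_b, c1=1e-4, c2=9e-4):
    # Consecutive structural similarity between two denoised frames,
    # computed globally over the whole image.
    mu_a, mu_b = frame_a.mean(), frame_b.mean()
    var_a, var_b = frame_a.var(), frame_b.var()
    cov = ((frame_a - mu_a) * (frame_b - mu_b)).mean()
    return ((2.0 * mu_a * mu_b + c1) * (2.0 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

Identical consecutive frames score 1, and the score decreases as the denoised structures drift apart from frame to frame.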
We compared our approach against the recent analytical Poisson-Gaussian Unbiased Risk Estimator for Singular Value Thresholding (PGURE-SVT) method (Furnival et al. 2017) and the unsupervised deep learning methods Noise2Self (Batson and Royer 2019) and Noise2Void (Krull et al. 2019a). We also tried PPN2V (Prakash et al. 2020) but did not obtain a satisfying result, as it is challenging to estimate a good enough parametric noise model.
In this benchmark, we used the semi-supervised Multiscale Convolutional Neural Network (MCNN) method (Wang et al. 2020) as the baseline. When testing with datasets acquired at 150 fps with 128 × 128 pixels down to 5 fps with 1024 × 1024 pixels, as shown in Fig. 3, Noise2Atom gives visually clear (Gaussian-like) and consistent (high CSS score) results, and in some cases it even outperforms MCNN. Our approach yields predictions reflecting almost only atomic peaks, while removing the vast majority of the background. This result, as our approach consistently detects atoms, agrees well with our physical model. We present more denoising results in the noise2atom repository, https://github.com/fengwang/Noise2Atom.

Fig. 3 Benchmarking PGURE-SVT, Noise2Self, Noise2Void, Noise2Atom and MCNN on heavily-noised HAADF images. The CSS metrics are presented in the top-left corners. Noise2Atom gives much better predictions than the other unsupervised methods: visually more Gaussian-like and background corrected, with an increased quantitative contrast. In the 512 × 512 case, Noise2Atom even beats the baseline MCNN, which has been trained in a semi-supervised manner

It is increasingly common for a fast modern STEM to produce a dataset comprising thousands of images in a few minutes. Such a large dataset poses particular computational pressure on Noise2Atom. A direct solution is to sample a fixed number of images, rather than including them all. To find a lower bound as a compromise between computation speed and denoising quality, we trained four models including 1, 10, 50, and 100 experimental images, respectively. Their performances are demonstrated in Fig. 4: the model trained with 100 images predicts four atoms from the first and the second frames, while the models trained with small numbers of images tend to overfit atomic peaks onto clusters of bright noisy pixels. As shown from the second to the fourth row, different numbers of atoms are predicted.
We therefore suggest a training set of around 100 images in favour of both computation speed and denoising quality.

Failure cases. The contrast of STEM depends on the atomic number (Kirkland et al. 1987): individual atoms of platinum (Z = 78) are reliably detectable against a background signal given by ca 20 nm of carbon (Z = 6), as is the case in the results of Fig. 3. This Z-dependence gives an estimated upper detection limit of maximum ca 40-60 nm of carbon (and other neighboring light elements), before there is too much background noise to reliably detect individual Pt atoms. When we image atoms of Pt in droplets of ionic liquids (up to ca 50 nm thickness) on the carbon film (Keller et al. 2019), we get closer to the detection limit, and we see two failure cases in Figs. 5 and 6. From dark-field STEM, we assume that bright areas match atomic peaks, and darker areas are due to the background noise. However, we did not apply an additional constraint to reflect this relationship. In our numerical experiments, occasionally (once in every ten trials), there were cases of an inverted mapping: the bright areas go to the background, and the dark areas go to the atomic peaks, as shown in Fig. 5. A second failure case is due to gradients in the background noise, as shown in Fig. 6. Such a gradient causes areas with higher background noise to be overfit, resulting in many false atoms. Therefore, it is important to have a largely homogeneous background intensity across the image.

Fig. 6 Noise2Atom is sensitive to gradients in background noise. In this image, at the edge of a nanodroplet with a low concentration of Pt atoms, there is an increase of background noise from left to right (shown in the plot below), as the nanodroplet gets thicker. On the left of the image, a small number of atoms are correctly fitted. However, further to the right of the image, as the background noise increases, clusters of pixels from the background noise increasingly get overfit as false atoms

Discussion and conclusion
Analytical methods stem from domain knowledge but are computationally inefficient. Supervised learning methods are fast but require paired training sets. Our approach takes advantage of domain knowledge and is not restricted by the absence of paired training data. We understand the critic model and the low-frequency feature extraction model as domain knowledge embedding. The critic model gives high priority to 2D Gaussian peaks, which is the expected physical pattern. The low-frequency feature extraction model focuses on the slowly varying features by cutting down the influence of the noise, as the noise takes the form of very bright or dark pixels. Our approach shows that even when it is difficult to generate suitable simulated ground truth/noisy image pairs or to estimate a proper noise model, it is still possible to train neural networks to restore noisy images and extract interpretable and quantitative information, by using domain knowledge. Hence, the denoising approach of Noise2Atom is especially useful for time-resolved microscopy, but is likely also useful for many other applications. We also make the dataset and source code publicly available.