Integral Imaging (II) is a technique used to capture and reconstruct three-dimensional (3D) scenes. Figure 1 illustrates the Photon-Counted Integral Imaging (PCII) setup. PCII operates in two primary stages.
Figure 1: Pickup process of Photon-Counted 3D Integral Imaging. The illustration shows the rectangular translation of a single camera across the pickup plane to capture multiple 2D elemental images.
Step 1: Pickup. A 3D object is captured from multiple perspectives by moving a camera in both vertical and horizontal directions. This process records four-dimensional (4D) light-field data, encompassing two spatial dimensions \((x,y)\) and two angular dimensions \((\theta_{x},\theta_{y})\). The resulting images are known as Elemental Images (EIs) or collectively as an Elemental Image Array (EIA).
Step 2: Reconstruction. The recorded 2D EIs are combined to reconstruct the 3D scene. Reconstruction is the reverse of the pickup process and employs ray back-propagation. During reconstruction, objects at the chosen depth plane are sharply focused, while objects at other depths appear blurred or out of focus. More details on Bayer-pattern-based EIA capture and reconstruction can be found in prior research.
Image reconstruction is feasible even when only a few scattered photons are available. This can be achieved with physical cameras such as EMCCD and sCMOS sensors in low-light conditions, or via computational approaches. This study adopts a computational approach, using a Poisson distribution to estimate the photon counts in an image. If \(n_{p}\) is the total number of photons in the scene, \(\widehat{EI}\) is the normalized elemental image, and \(\lambda\) is the Poisson parameter, the photon-counted image can be estimated as:
$$\begin{aligned} P(C_{x}\,|\,I_{x}) \sim \text{Poisson}\left( \lambda = n_{p}\times \widehat{EI}(x,y)\right) \end{aligned}$$
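As an illustration, this photon-counting step can be simulated in a few lines of NumPy. The sketch below assumes the normalization makes the elemental image sum to one, so that the expected total photon count equals \(n_{p}\); the function and variable names are placeholders, not the paper's implementation.

```python
import numpy as np

def photon_count(ei, n_p=5000, seed=0):
    """Simulate a photon-counted elemental image.

    ei  : 2D array of non-negative intensities (one elemental image)
    n_p : expected total number of photons in the scene
    """
    rng = np.random.default_rng(seed)
    ei_hat = ei / ei.sum()              # normalized elemental image (sums to one)
    lam = n_p * ei_hat                  # per-pixel Poisson parameter lambda
    return rng.poisson(lam)             # photon counts C(x, y)

# Example on a synthetic elemental image
ei = np.random.rand(64, 64)
counts = photon_count(ei, n_p=5000)
```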
Subsequently, the parametric maximum likelihood estimator (MLE) is applied to the photon-counted elemental images to reconstruct 3D sectional images:
$$\begin{aligned} I_{p}^{z} = \dfrac{1}{n_{p}RS} \sum _{r=1}^{R} \sum _{s=1}^{S} C_{rs}\left( x+r\left( \dfrac{shf_{x}}{MF} \right),\, y + s\left( \dfrac{shf_{y}}{MF}\right) \right) \end{aligned}$$
Here, MF is the magnification factor, calculated as \(MF = z/d\), where d is the distance between the pickup grid and the image plane (refer to Figure 1), and \(shf_{x}\) and \(shf_{y}\) are the shifts of the pickup position in the horizontal and vertical directions. Subscripts r and s indicate the pickup location of the elemental image, and p denotes photon-counted images. \(C_{rs}(\cdot)\) represents the photon-counted pixel value of the \((r,s)\)th elemental image, corresponding to the voxel value \(I_{p}^{z}\).
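A simplified computational sketch of this reconstruction is given below. It is not the paper's code: circular shifts (np.roll) stand in for zero-padded shifts, the sign convention depends on the pickup geometry, and all names are illustrative.

```python
import numpy as np

def mle_reconstruct(C, z, d, shf_x, shf_y, n_p):
    """MLE reconstruction of one depth plane from photon-counted elemental images.

    C            : array of shape (R, S, H, W) with photon counts C_rs(x, y)
    z, d         : reconstruction depth and pickup-grid-to-image-plane distance
    shf_x, shf_y : pickup shifts (in pixels) between adjacent elemental images
    n_p          : photon budget used when generating the counts
    """
    R, S, H, W = C.shape
    MF = z / d                                    # magnification factor
    recon = np.zeros((H, W))
    for r in range(R):
        for s in range(S):
            dx = int(round(r * shf_x / MF))       # shift of the (r, s)th image along x
            dy = int(round(s * shf_y / MF))       # shift along y
            recon += np.roll(C[r, s], shift=(-dy, -dx), axis=(0, 1))
    return recon / (n_p * R * S)                  # average as in the MLE formula above
```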
Unsupervised Denoising Network Architecture for 3D Images
This section details the deep learning denoising architecture employed: the U-Net, as depicted in Figure 2. This is an end-to-end unsupervised 3D denoising machine learning approach. Noisy photon-counted 3D sectional images are directly input into the network. The network uses symmetrically arranged encoder and decoder layers to extract denoised images, requiring minimal training data.
Figure 2: Architecture of the unsupervised denoising network, highlighting Encoder Blocks (EB), Decoder Blocks (DB), and Skip Blocks (SB).
Let x represent a clean 3D sectional image, n the noise from the photon counting process, and y the resulting noisy photon-counted 3D sectional image. This relationship is mathematically expressed as:
$$\begin{aligned} y = x+n \end{aligned}$$
The goal of image denoising is to recover x from y by reducing noise n. This process is represented by:
$$\begin{aligned} \hat{x} = \mathcal{H}(y;\Theta ) \end{aligned}$$
where \(\hat{x}\) is the estimated denoised photon-counted image, \(\mathcal{H}(\cdot\,;\Theta)\) is a parametric function, and \(\Theta\) represents the trainable parameters. The U-Net architecture incorporates encoder and decoder blocks with skip connection layers. Skip blocks (SB) are integrated into the skip connection strategy to mitigate the vanishing gradient problem. These blocks and connections are specifically designed within the encoder and decoder to suit the characteristics of photon-counted 3D images, preserving crucial image information. Skip connections retain low-level abstract features extracted by the encoder that might otherwise be lost during training.
During training, 3D input images are fed into the network as patches. Patching increases the number of training samples, enabling the network to learn image features more effectively. The patch window is moved horizontally and vertically across the image. Patched input images are converted to 1D vectors and input into the network. After denoising, the 1D vector is unpatched and reshaped to the original input data size.
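A minimal patching/unpatching sketch in NumPy is shown below, assuming non-overlapping 8\(\times\)8 windows; the stride and any overlap handling in the actual implementation may differ.

```python
import numpy as np

def extract_patches(img, patch=8, stride=8):
    """Slide a patch window over the image and flatten each patch to a 1D vector."""
    H, W = img.shape
    patches = [img[i:i + patch, j:j + patch].ravel()
               for i in range(0, H - patch + 1, stride)
               for j in range(0, W - patch + 1, stride)]
    return np.stack(patches)                       # shape: (num_patches, patch*patch)

def assemble_patches(patches, shape, patch=8, stride=8):
    """Reshape denoised 1D patch vectors and place them back at their positions."""
    H, W = shape
    out = np.zeros(shape)
    k = 0
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            out[i:i + patch, j:j + patch] = patches[k].reshape(patch, patch)
            k += 1
    return out
```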
The encoder block includes two fully connected layers, two batch normalization layers, and two activation layers. Encoding reduces dimensionality, extracting relevant image content from the noisy image y. The encoding operation is:
$$\begin{aligned} d_{e1}=W_{e1} P_{m}+b_{e1} \end{aligned}$$
where \(d_{e1}\) is the encoder output, \(P_{m}\) is the patched input, and \(W_{e1}\) and \(b_{e1}\) are the weights and biases of the mth encoding layer. Batch normalization layers prevent internal covariate shift, which can reduce network accuracy through error accumulation. They also accelerate training and prevent vanishing gradients. The output of the batch normalization layer is passed to an Exponential Linear Unit (ELU) activation layer:
$$\begin{aligned} ELU(x)= \begin{cases} x & \text{if } x\ge 0\\ a(e^{x}-1) & \text{otherwise,} \end{cases} \end{aligned}$$
where a is a hyper-parameter \((a \ge 0)\). ELU accelerates the network by driving mean activations closer to zero. The output of each encoder block is:
$$\begin{aligned} \hat{d}_{e1}=ELU(W_{e1}P_{m}+b_{e1}) \end{aligned}$$
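The framework is not specified in the text; as a sketch in TensorFlow/Keras (an assumption), one encoder block of this form could be written as follows, where `units` is the block's neuron count.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, units):
    """One encoder block: two (Dense -> BatchNorm -> ELU) stages."""
    for _ in range(2):
        x = layers.Dense(units)(x)           # fully connected layer (W_e, b_e)
        x = layers.BatchNormalization()(x)   # counters internal covariate shift
        x = layers.ELU(alpha=1.0)(x)         # ELU activation; alpha plays the role of a
    return x
```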
The decoder reconstructs photon-counted 3D images from the encoder’s abstract features. Decoder and encoder blocks are symmetrical, each with two dense layers, batch normalization layers, and activation function layers. The decoder block output is:
$$\begin{aligned} \hat{r}_{dm}=ELU(W_{dm}K_{m}+b_{dm}) \end{aligned}$$
where \(W_{dm}\) and \(b_{dm}\) are the weights and biases of the mth decoder layer, and \(K_{m}\) is its input. The number of neurons decreases through the encoder (128, 64, 32, 16, 8) and increases symmetrically in the decoder. The network’s final fully connected layer reconstructs the output patches to the size of the input photon-counted image.
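Continuing the Keras sketch above (still an assumption about the framework), the symmetric encoder/decoder with these neuron counts, the skip blocks, and a final patch-sized dense layer might be assembled roughly as follows; the exact wiring of the skip blocks in the paper may differ.

```python
def skip_block(x, units):
    """Skip block (SB): Dense -> BatchNorm -> ELU applied to an encoder feature."""
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    return layers.ELU()(x)

def build_denoiser(patch_dim=64, widths=(128, 64, 32, 16, 8)):
    """Symmetric dense encoder/decoder; patch_dim = 8*8 for 8x8 flattened patches."""
    inp = layers.Input(shape=(patch_dim,))
    x, feats = inp, []
    for w in widths:                                 # encoder: 128 -> 64 -> 32 -> 16 -> 8
        x = encoder_block(x, w)                      # encoder_block from the sketch above
        feats.append(x)
    for w, f in zip(reversed(widths[:-1]), reversed(feats[:-1])):
        x = encoder_block(x, w)                      # decoder block: same Dense/BN/ELU layout
        x = layers.Concatenate()([x, skip_block(f, w)])  # SB output joins the decoder output
    out = layers.Dense(patch_dim)(x)                 # final FC layer restores the patch size
    return tf.keras.Model(inp, out)
```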
The ADAM optimizer is used to update the parameters \(\Theta\) due to its ease of implementation, computational efficiency, and low memory requirement. The parameter update rule is:
$$\begin{aligned} \rho _{t+1}=\rho _{t}-\frac{\eta }{\sqrt{\hat{v}(t) +\epsilon }}\,\hat{n}(t), \end{aligned}$$
where \(\hat{n}(t)\) and \(\hat{v}(t)\) are the bias-corrected first and second moment estimates, respectively, and \(\eta\) is the learning rate. MSE is used as the cost function, minimizing the difference between the noisy input patches P and the denoised output patches \(\psi(P;\Theta)\):
$$\begin{aligned} C(\Theta )= \min \Vert \psi (P;\Theta )-P\Vert ^{2} \end{aligned}$$
Early stopping is employed during training: if the validation set's cost function does not decrease for four epochs, training stops and the best parameters are saved. SB blocks consist of a fully connected layer, a batch normalization layer, and an ELU layer. The output of the mth decoder layer is combined with the SB block output \(\hat{s}_{dm}\), so the final output after each decoder block is \(\hat{r}_{edl}=\{\hat{d}_{e1}, \hat{s}_{dm}\}\).
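Putting the earlier sketches together, a hedged training setup (TensorFlow/Keras assumed; `noisy_patches` is the flattened patch array from the patching sketch, and the split and patience values follow the description in the text) could look like:

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model = build_denoiser(patch_dim=64)
model.compile(optimizer=Adam(learning_rate=1e-3),      # ADAM with a 0.001 learning rate
              loss="mse")                              # MSE between output and noisy patches

early_stop = EarlyStopping(monitor="val_loss",         # watch the validation cost
                           patience=4,                 # stop if it fails to decrease for 4 epochs
                           restore_best_weights=True)  # keep the best parameters

# noisy_patches: flattened 8x8 patches of the photon-counted sectional image (see above)
history = model.fit(noisy_patches, noisy_patches,      # unsupervised: the target is the noisy input
                    validation_split=0.2,
                    epochs=15,
                    callbacks=[early_stop])
```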
Experimental Validation and Results
To evaluate the denoising network’s performance, experiments were conducted using \(10 \times 10\) elemental images, captured by shifting a CCD camera in 5 mm increments horizontally and vertically (Figure 1). Two 3D objects were used: a tri-colored ball (Object 1) and a toy bird (Object 2) (Figure 3a). Object 1 and Object 2 were placed 370 mm and 500 mm from the pickup grid, respectively. The imaging system had a 50 mm focal length and a pixel size of \(7.4\,\mu m \times 7.4\,\mu m\). Elemental images, initially \(1668(H) \times 1668(V)\) pixels, were resized to \(422(H) \times 539(V)\) pixels before processing.
Figure 3: Three-dimensional (3D) objects used: (a) EI of the 3D scene in Bayer format, (b) 3D reconstructed sectional image focusing on the Tri-coloured ball, and (c) 3D reconstructed sectional image focusing on the Toy bird.
Figure 3a shows the 3D objects. Figures 3b and 3c display clean sectional images reconstructed computationally without photon counting, illustrating that only one object is in focus at a time, depending on depth. The original images were Bayer patterned (GRBG) and can be converted to RGB color images via interpolation; this conversion is not detailed here but is discussed in prior works. The network was also tested on a single-photon-detector-based 2D dataset from a Quanta Image Sensor (QIS) (Figure 5). Photon shot noise is inherent in standard imaging systems, making denoising crucial, especially in low-light conditions where noise is difficult to distinguish from the scene.
Simulations were run on an Intel® Xeon® Silver 4216 CPU @ 2.10 GHz with 256 GB RAM, with the code executed in Spyder (Anaconda); a run took approximately 68 seconds. For photon-counted 3D sectional image (PCSI) synthesis, \(n_{p} = 5000\) photons/scene was used (Eq. 1). The network was trained on a single PCSI, which was patched to enlarge the dataset and thereby reduce the need for large training sets. Patch-size optimization showed that 8\(\times\)8 patches yielded the best PSNR. 20% of the PCSI patches were used for validation and 60% for training. Validation loss (val_loss) was monitored to optimize the model, and early stopping was applied if val_loss did not converge within 5 epochs. Training used 15 epochs with a 0.001 learning rate. A computational complexity analysis showed that the classical TV denoising algorithm took 15.09018 s, while the proposed DL network took 2110 s (training and testing).
Figure 4: Denoising results: (a1, b1, c1) Noisy Photon counted 3D sectional image, TV denoised image, and proposed denoising method result for Object 1 in focus. (a2, b2, c2) Corresponding images for Object 2 in focus.
Figure 5: Denoising results for QIS image: (a1) Noisy QIS image, (b1) TV denoised image, and (c1) proposed denoising method result, with PSNR values.
Quantitative evaluation used Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). PSNR is defined as:
$$\begin{aligned} PSNR = 10\log _{10} \dfrac{I_{max}^{2}}{MSE} \end{aligned}$$
where \(I_{max}\) is the maximum pixel intensity and MSE is the mean squared error between the reference (clean) image and the image being evaluated. SSIM is calculated as:
$$\begin{aligned} \mathrm{SSIM}(x,y) = \frac{(2\mu _x\mu _y + C_1)(2 \sigma _{xy} + C_2)}{(\mu _x^2 + \mu _y^2+C_1)(\sigma _x^2 + \sigma _y^2+C_2)} \end{aligned}$$
where \(\mu _x, \mu _y, \sigma _x, \sigma _y,\) and \(\sigma _{xy}\) are the means, standard deviations, and cross-covariance of x and y, and \(C_1, C_2\) are constants. For Object 1 in focus, the proposed method achieved an SSIM of 0.8540, compared to 0.7218 for TV denoising. For Object 2, the SSIM values were 0.6445 and 0.5913, respectively. For the QIS dataset, the SSIM was 0.3222 for the proposed method and 0.3205 for TV denoising.
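For reference, both metrics can be computed with scikit-image; this is an assumed tooling choice, not necessarily what was used in the study.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(clean, denoised):
    """PSNR and SSIM of a denoised sectional image against the clean reconstruction."""
    data_range = clean.max() - clean.min()          # dynamic range of the reference image
    psnr = peak_signal_noise_ratio(clean, denoised, data_range=data_range)
    ssim = structural_similarity(clean, denoised, data_range=data_range)
    return psnr, ssim
```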
The results demonstrate that the proposed 3D denoising machine learning method outperforms classical TV denoising, achieving PSNR improvements of up to 1.91 dB for 3D data and 1.38 dB for QIS data, and SSIM improvements of up to 0.1322 for 3D data and 0.0017 for QIS data.