Multi-view Cardiac Image Segmentation via Trans-Dimensional Priors (2024)

Abbas Khan, Muhammad Asad, Martin Benning, Caroline Roney, Gregory Slabaugh

Abstract

We propose a novel multi-stage trans-dimensional architecture for multi-view cardiac image segmentation. Our method exploits the relationship between long-axis (2D) and short-axis (3D) magnetic resonance (MR) images to perform a sequential 3D-to-2D-to-3D segmentation of both the long-axis and short-axis images. In the first stage, a 3D segmentation is performed on the short-axis image, and the prediction is transformed to the long-axis view and used as a segmentation prior in the next stage. In the second stage, the heart region is localized and cropped around the segmentation prior using a Heart Localization and Cropping (HLC) module, which focuses the subsequent model on the heart region of the image, where a 2D segmentation is performed. In the third stage, we similarly transform the long-axis prediction to the short-axis view, localize and crop the heart region, and again perform a 3D segmentation to refine the initial short-axis segmentation. We evaluate our proposed method on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) dataset, where our method outperforms state-of-the-art methods in segmenting cardiac regions of interest in both short-axis and long-axis images. The pre-trained models, source code, and implementation details will be publicly available.

keywords:

Cardiac MRI, Image Segmentation, Short-Axis, Long-Axis, Transformation Priors, Sequential Segmentation

Affiliations

[inst1] School of Electronic Engineering and Computer Science, Queen Mary University of London, UK
[inst2] Queen Mary’s Digital Environment Research Institute (DERI), London, UK
[inst3] School of Biomedical Engineering and Imaging Sciences, King’s College London, UK
[inst4] Department of Computer Science, University College London, UK
[inst5] School of Engineering and Materials Science, Queen Mary University of London, UK


Highlights

We propose a sequential 3D-to-2D-to-3D approach for multi-view cardiac image segmentation by effectively utilizing the trans-dimensional segmentation priors (TDSP), which transform a segmentation from one view into another and serve as guidance.

The TDSP provides a robust anatomical reference at the network’s input and encourages the network to produce anatomically plausible segmentation maps.

We also introduce a Heart Localization and Cropping (HLC) module to focus the segmentation on the heart region only. This strategy reduces the computation for the second and third-stage segmentation network and eliminates false positive predictions.

Extensive experiments are conducted to showcase the efficacy of the proposed pipeline utilizing the HLC module and TDSP, where our proposed method outperforms the state-of-the-art as well as methods on the M&Ms-2 challenge leaderboard.

1 Introduction

Cardiovascular disease is the leading cause of death, with a yearly toll of 23.6 million lives due to heart disease and stroke globally [1]. This underscores the need to identify and treat cardiac disorders. Cardiologists have focused on early diagnosis as part of the clinical workflow [2]. Deep learning architectures have achieved a wide range of competencies for computational cardiac imaging [3],[4],[5] including segmentation [6],[7],[8].

Modern non-invasive medical imaging techniques, such as ultrasound, magnetic resonance imaging (MRI), and computed tomography, are widely used to capture detailed images of the structure and function of the heart and its associated vessels [9]. However, detection of disease and quantification often requires a laborious process of manual segmentation to identify the areas of the anatomy in these scans. Recent advances in artificial intelligence [10] are improving automation to segment a medical image into meaningful areas of interest. In the context of cardiac imaging, areas of interest include the left atrium, right atrium, left ventricle, right ventricle, and myocardium to diagnose different cardiac pathologies. Many image segmentation methods have been proposed, including active shape models [11], active appearance models [12], atlas-based methods [13], convolutional neural network (CNN)-based approaches [14],[15] including those with self-attention-based architectures [16],[17].

Among the successful cardiac image segmentation methods, most rely on a single view, i.e., short-axis (SA) or long-axis (LA), where the segmentation is performed. However, capturing both SA and LA MR images is considered standard practice [18], [19], and segmentation of one view can be utilized to improve the segmentation of the other. Here, we propose a novel framework that performs accurate cardiac image segmentation by transferring the segmentation of one view to guide the segmentation of the other. Our proposed method sequentially utilizes the multi-view images. Despite being based on single encoder-decoder segmentation networks, the proposed pipeline still benefits from multi-view data.

Fig. 1(d) depicts the overall architecture of the proposed pipeline. TriggerNet, which functions as a 3D segmentation model, generates the initial segmentation for the short axis, denoted $S_{SA1}$. Subsequently, using the transformation parameters for the given volume, $S_{SA1}$ is transformed to the LA view to produce an SA-to-LA map (SA2LAmap), a trans-dimensional segmentation prior. The SA2LAmap and the LA image ($I_{LA}$) are fed as input to the Heart Localization and Cropping (HLC) module, resulting in a cropped $I_{LA}$ and SA2LAmap containing only the heart region. The cropped SA2LAmap is input along with the cropped $I_{LA}$ to the LA-SegNet model, which generates the long-axis segmentation $S_{LA}$.

We next refine the short-axis segmentation $S_{SA1}$. $S_{LA}$ is transformed to the SA view, resulting in the LA-to-SA map (LA2SAmap), another trans-dimensional segmentation prior. Here, we again use the HLC module to obtain the cropped LA2SAmap, SA image ($I_{SA}$), and TriggerNet output ($S_{SA1}$). Finally, SA-SegNet uses these cropped outputs of the HLC module to generate the final short-axis segmentation $S_{SA2}$.

In our proposed framework, the segmentation transferred from the alternate view (SA to LA and LA to SA) acts as a segmentation prior and guides the HLC module in removing the surrounding background regions, improving overall segmentation accuracy for the respective views. This framework enables LA-SegNet and SA-SegNet to outperform the existing state-of-the-art methods on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) challenge dataset [20]. The proposed framework efficiently exploits the multi-view aspect of the M&Ms-2 dataset, which, in contrast to the earlier M&Ms [21] and ACDC [22] datasets, provides the images and labels of both views (LA and SA) for each instance.

2 Related Work

This section lists some of the most well-known deep learning-based segmentation architectures, including those unifying the power of CNN and self-attention-based mechanisms. We also detail existing cardiac image segmentation approaches, specifically from the M&Ms-2 dataset’s challenge [20] leaderboard and subsequent publications leveraging the dataset. We note that our proposed pipeline can be implemented with any segmentation backbone, provided that the architecture can segment both 2D and 3D views.

UNet [23] revolutionized deep learning-based medical image segmentation by proposing a symmetric encoder-decoder architecture. The encoder extracts features from the image, and the decoder reconstructs the segmentation map, while skip-connections help to propagate information across different stages. The no-new-Net (nnUNet) [24] builds on UNet and proposes an automatically configurable segmentation architecture. It can configure data pre-processing, network design, and post-processing for many medical image segmentation datasets. An overview is provided in Section 3.1. ResUNet [25] is an encoder-decoder architecture based on the UNet model that also incorporates residual connections [26], atrous convolutions [27], and pyramid scene parsing (PSP) pooling [28]. Each convolution block is replaced with a residual block to achieve consistent training with increased network depth, atrous convolutions help increase the receptive field, and PSP pooling enhances the network’s performance by including background context information.

Inspired by the emergence of vision transformers [29] in computer vision regimes [30], many hybrid architectures that utilize multi-head self-attention (MHSA) [31] have been proposed for medical image segmentation. TransUNet [32] is a UNet architecture that utilizes both CNN and self-attention. This includes a transformer-based encoder that extracts features from images and a CNN-based decoder that upsamples the encoded features. UTNet [33] is also a hybrid architecture integrating transformer and CNN for medical image segmentation. It proposes a revised MHSA mechanism to reduce the complexity of the model. In addition, a hybrid layer utilizing CNN and revised MHSA is incorporated into the encoder and decoder stages.

The Multi-Compound Transformer (MCTrans) [34] aims to combine rich features and semantic structures into multi-scale convolutional features using self-attention. MCTrans transforms convolutional features into a sequence of tokens to perform intra- and inter-scale self-attention across multiple scales.

A multi-view and transformer-based architecture named Transfusion was proposed by [35] to correlate and fuse data coming from SA/LA views. It proposed Divergent Fusion Attention (DiFA), which combines features from different views using multi-scale self-attention.

Al Khalil et al. [36] proposed a three-stage approach: firstly, the region of the heart is detected using a regression model; secondly, a GAN-based augmentation technique is used for image synthesis to increase the diversity of the training data for segmentation tasks. More specifically, their approach generates more examples of pathologies to balance instances of pathological and normal cases. Lastly, a late-fusion segmentation approach combined with intensity transformations is utilized to generate the final segmentation map.

Sun et al. [37] utilized labels from the end-diastolic and end-systolic phases through an intensity-based image registration approach. These registered labels increase the size of the training set. Arega et al. [38] relied on MRI-specific, intensity-based, and spatial data augmentation techniques to improve the generalization and robustness of their segmentation models.

In [39], a multi-view SA-LA Network was proposed to simultaneously segment the RV blood pools in both the SA and LA views. It merged the bottleneck features from both the SA and LA views and combined the labels of the left ventricle (LV) and myocardium (MYO) to generate a label that provides contextual information to better segment the right ventricle (RV). Another multi-encoder-decoder network (xUnet) was proposed by [40] to simultaneously process the SA and LA views. It utilizes a pre-processing step where both views are centered and rotated to match their axes. A spatial transformer multi-pass feature pyramid (Tempera) [41] segments the RV in both SA and LA cardiac MR images. Tempera is based on the multi-scale feature pyramid network from [42] and transforms the SA features to LA via a geometric target spatial transformer. InfoTrans [43] proposed an nnUNet-based architecture, where the first 2D-nnUNet segments the LA views and then utilizes the LA prediction to crop the region of interest (ROI) from SA views.

The Refined Deep Layer Aggregation (RDLA) [44] proposed a two-stage 2D architecture, using the DLA-34 stride-2 network [45] as the backbone. The LA and SA images are segmented independently, followed by a refinement step that utilizes the complementary information of the other view along with the images.

3 Proposed Framework

Fig. 1(d) depicts our proposed framework. The pipeline starts with the trigger network (TriggerNet), followed by the transformation of its prediction $S_{SA1}$ to the LA view, resulting in the SA2LAmap. As a pre-processing step for the SA-to-LA transformation, the header information of the original SA image ($I_{SA}$) is copied to $S_{SA1}$ so that all properties of $I_{SA}$ and $S_{SA1}$ match. The SA2LAmap is used in two ways: by the HLC module to remove the unrelated non-cardiac areas of the original LA image ($I_{LA}$), and as a segmentation prior that is concatenated with the cropped $I_{LA}$ at the input of LA-SegNet. The output of LA-SegNet is restored to its original size, and all metadata from $I_{LA}$ is copied to it so that the header information of $I_{LA}$ is preserved in $S_{LA}$. The final LA segmentation ($S_{LA}$) is transformed to the SA view using $I_{SA}$ to obtain the LA2SAmap. Following the same process, the HLC module uses the LA2SAmap and $I_{SA}$ to localize and crop the heart in the full-scale image. Here, we also crop and concatenate $S_{SA1}$ from TriggerNet (further details are provided in the ablation in Section 5). Finally, $S_{SA2}$ is restored to its original size.

All three networks (TriggerNet, LA-SegNet, and SA-SegNet) are trained independently. The downstream steps, such as the LA-to-SA transformation and vice versa, and the HLC module are applied once the outputs of the previous network are available. At inference, the full framework is run sequentially to perform 3D-to-2D-to-3D segmentation, producing results for both the LA and SA images.
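The concatenation of a segmentation prior with an intensity image can be pictured as stacking them along a channel axis. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the helper name `stack_inputs`, the channel-first layout, the float32 cast, and the use of the raw label map (rather than, say, a one-hot encoding) are all illustrative choices that the text does not specify. The array sizes are the cropped input sizes reported in Section 3.1.

```python
import numpy as np

def stack_inputs(*arrays):
    """Stack same-shaped arrays along a new leading channel axis (hypothetical helper)."""
    shapes = {a.shape for a in arrays}
    if len(shapes) != 1:
        raise ValueError(f"all inputs must share one spatial shape, got {shapes}")
    return np.stack([a.astype(np.float32) for a in arrays], axis=0)

# LA-SegNet input: cropped LA image + cropped SA2LAmap (2 channels, 2D).
i_la = np.random.rand(128, 128)
sa2la_map = np.zeros((128, 128), dtype=np.uint8)
sa2la_map[40:90, 50:100] = 1  # toy prior, for illustration only
x_la = stack_inputs(i_la, sa2la_map)

# SA-SegNet input: cropped SA image + LA2SAmap + S_SA1 (3 channels, 3D).
i_sa = np.random.rand(112, 128, 112)
la2sa_map = np.zeros((112, 128, 112), dtype=np.uint8)
s_sa1 = np.zeros((112, 128, 112), dtype=np.uint8)
x_sa = stack_inputs(i_sa, la2sa_map, s_sa1)

print(x_la.shape, x_sa.shape)
```

Under this layout the first convolution of each network sees the prior at every spatial location alongside the image intensity.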

The following subsections detail each component of the proposed pipeline shown in Fig. 1: the segmentation networks, the HLC module, and the transformation process.

3.1 Segmentation Networks

Segmentation networks used within the proposed framework, i.e., TriggerNet, LA-SegNet, and SA-SegNet, are implemented using nnUNet [24]. The nnUNet is built upon the original U-Net architecture with modifications and improvements adopted for medical image segmentation tasks. It stands out from UNet due to its ability to configure the architecture and hyperparameters automatically during training. Moreover, nnUNet achieved the best results in the challenge cohort of M&Ms-2 [20], and the proposed pipeline is built using its architecture as a baseline. The nnUNet architecture has three major components: (i) Encoder, (ii) Decoder, and (iii) Skip-Connections.


The encoder extracts features from the input data by gradually increasing the number of features while reducing the spatial dimensions as it goes deeper. Each encoder block consists of two consecutive convolutions with a kernel size 3 (3 $\times$ 3 for 2D/LA, and 3 $\times$ 3 $\times$ 3 for 3D/SA network). To reduce the spatial resolution, the features are again convolved with a kernel size of 3 and stride of 2. Each convolutional layer is followed by LeakyReLU activation and instance normalization.

The decoder reconstructs the segmentation map by progressively increasing the spatial dimensions and reducing the number of features from the bottleneck layer. Each decoder block has two consecutive convolutions with a kernel size of 3, followed by a transposed convolution layer with a kernel size of 2 and a stride of 2. As in the encoder, each convolution layer is followed by LeakyReLU activation and instance normalization. The final convolution layer uses a sigmoid activation function with four kernels of size 1, one per output class, i.e., MYO, LV, RV, and background.

Skip-connections have been shown to improve medical image segmentation methods [46], and hence we use them in our proposed architecture. They copy the features from the contracting path in the encoder and concatenate them to the expanding path in the decoder, which improves gradient flow during backpropagation and helps recover lost spatial information.

The number of encoder-decoder blocks can differ for each segmentation network shown in Fig. 1(d), as can the corresponding computational complexity shown in Fig. 2, depending on the spatial dimensions of the input data. TriggerNet receives images of spatial size 64 $\times$ 192 $\times$ 192 and has five downsampling (encoder) blocks and five upsampling (decoder) blocks. LA-SegNet receives cropped input of spatial size 128 $\times$ 128 and has four encoder blocks and four corresponding decoder blocks. The input to SA-SegNet has spatial size 112 $\times$ 128 $\times$ 112, and this network also has four blocks each for downsampling and upsampling. For the ablation studies in Section 5, we also trained LA-SegNet on LA images without the HLC module. For this network, nnUNet configures six encoder-decoder stages because of the larger spatial size of 384 $\times$ 384.
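To make the effect of the stride-2 convolutions concrete, the sketch below traces the spatial size of a 2D feature map through the encoder using the standard convolution output-size formula. It is an illustration under assumptions: a padding of 1 is assumed (the text gives only kernel size 3 and stride 2), and every axis is assumed to be downsampled at every stage, which need not hold for the 3D networks, so only the two 2D configurations are shown. The function names are hypothetical.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (padding=1 is an assumption)."""
    return (n + 2 * padding - kernel) // stride + 1

def trace_encoder(shape, n_stages):
    """Return the spatial shape after each of n_stages stride-2 downsampling convolutions."""
    shapes = [tuple(shape)]
    for _ in range(n_stages):
        shape = tuple(conv_out(n) for n in shape)
        shapes.append(shape)
    return shapes

# LA-SegNet with HLC cropping: 128 x 128 input, four downsampling stages.
cropped = trace_encoder((128, 128), 4)
# LA network without HLC (ablation): 384 x 384 input, six stages.
full = trace_encoder((384, 384), 6)

print(cropped)  # [(128, 128), (64, 64), (32, 32), (16, 16), (8, 8)]
print(full[-1])  # (6, 6)
```

Each feature map in the full-image trace has nine times as many pixels as the corresponding map in the cropped trace, which gives a rough sense of where the computational saving from cropping comes from.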

3.2 Heart Localization and Cropping (HLC) Module

Foreground-background pixel imbalance is a fundamental issue for accurate medical image segmentation [47]: the foreground occupies a much smaller proportion of the image than the background. This can significantly degrade segmentation performance by forcing the model to focus on the background pixels, which form the majority [48]. A common remedy is to design segmentation loss functions that weigh foreground pixels more heavily than background pixels [49], [50]. In the proposed framework, however, this problem is mitigated intrinsically to some extent, as we crop the original full-scale image using the segmentation prior, which reduces the background search space.


Segmentation networks operating on full-scale images frequently confuse the background with cardiac tissue, resulting in considerable false positives. With the proposed HLC module, the network focuses only on a smaller area containing heart tissue. Moreover, as the HLC module reduces the spatial size of the input, the next-stage network runs on a smaller image, leading to reduced computational complexity.
The HLC module is implemented by extracting the heart region containing the LV blood pool (LV), RV blood pool (RV), and left ventricular MYO from the original full-scale images. This can be achieved using the SA2LAmap for $I_{LA}$ and either LA2SAmap or $S_{SA1}$ for $I_{SA}$ .

To find the bounding box enclosing the three labeled regions (LV blood pool, RV blood pool, and left ventricular MYO) within a segmentation prior, we treat the prior (SA2LAmap, LA2SAmap, or $S_{SA1}$) as a binary mask in which nonzero values represent the regions of interest. To ensure that the crop covers the entire heart and does not miss any foreground pixels, we define a parameter named margin that pads the bounding box and safeguards the edges of the heart, as shown in Fig. 3.

We further confirmed that the region cropped by the HLC module safely encloses the entire heart by cropping the ground truth, restoring it to the original size, and computing the Dice score between the original and the restored ground truth. We ensured a Dice score of 1 and a Hausdorff Distance (HD) of 0 for the restored ground truth segmentation against the original-size ground truth. Empirically, we found that a margin of 15 pixels is sufficient for this purpose for all images in the training, validation, and test sets. We applied the same protocol to crop the intensity image and to bring the prediction from the cropped image back to the original spatial dimensions.
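The cropping, restoration, and Dice check described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, the margin is applied equally along every axis (the text does not say whether the slice axis of a 3D volume is treated differently), and the restored image is zero-filled outside the crop.

```python
import numpy as np

def bounding_box(prior, margin=15):
    """Slices enclosing all nonzero voxels of the prior, padded by margin and clipped."""
    coords = np.argwhere(prior > 0)
    if coords.size == 0:
        raise ValueError("prior contains no foreground")
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, prior.shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))

def restore(cropped, box, full_shape):
    """Paste a cropped array back into a zero-filled array of the original size."""
    out = np.zeros(full_shape, dtype=cropped.dtype)
    out[box] = cropped
    return out

def dice(a, b):
    """Dice overlap of two label maps, treating every nonzero label as foreground."""
    a, b = a > 0, b > 0
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

# Toy example: the prior slightly under-covers the ground truth,
# and the margin makes up the difference.
prior = np.zeros((200, 200), dtype=np.uint8)
prior[80:120, 90:140] = 1
ground_truth = np.zeros((200, 200), dtype=np.uint8)
ground_truth[75:125, 85:145] = 2

box = bounding_box(prior, margin=15)
restored = restore(ground_truth[box], box, ground_truth.shape)
print(ground_truth[box].shape, dice(ground_truth, restored))
```

If the margin were too small for a given prior, part of the ground truth would fall outside the box and the Dice score of the restored label map would drop below 1, which is the failure the check is designed to catch.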

3.3 The Transformation Process

The M&Ms-2 dataset is novel in providing the images and labels of the LA view along with the SA view, which gives detailed information for the apical and basal slices of the short-axis views [20]. We exploit this information in the proposed framework by generating the transformations for each axis. More specifically, we transform one view’s physical coordinates into the other view’s image coordinate system and vice versa.

The trans-dimensional segmentation prior SA2LAmap is obtained using the pseudo-code shown in Algorithm 1. The prediction from TriggerNet, $S_{SA1}$, has different metadata than the original $I_{SA}$. This metadata is essential for the conversion between physical coordinates and image coordinate systems and includes additional information such as image orientation, voxel size, and origin. We used the CopyInformation function from SimpleITK to copy all the relevant metadata from the original SA image to $S_{SA1}$. This ensured that $S_{SA1}$ was

Algorithm 1
Input: $I_{LA}, I_{SA}, S_{SA1}, T_{SA\rightarrow LA}$
Output: $SA2LAmap$
Initialize the output to zero, $SA2LAmap=0$
for each point $\mathbf{p}$ in $I_{LA}$ do
    Use $T_{SA\rightarrow LA}$ to transform $\mathbf{p}$ into $S_{SA1}$, producing $\mathbf{q}$
    $SA2LAmap(\mathbf{p})=S_{SA1}(\mathbf{q})$
end for
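Algorithm 1 can be sketched in NumPy as a nearest-neighbour lookup through physical space. The sketch rests on several assumptions that the text does not state: each image is described by a 4 $\times$ 4 index-to-physical affine matrix (built from origin, voxel size, and orientation), the composite of the LA affine with the inverse SA affine plays the role of $T_{SA\rightarrow LA}$, the LA image is stored as a one-slice 3D array, and points falling outside the SA volume keep the initial value of zero. The function and argument names are hypothetical; in practice a library resampling routine would typically be used instead.

```python
import numpy as np

def sa_to_la_map(s_sa1, affine_sa, affine_la, la_shape):
    """Resample an SA label volume onto the LA grid with nearest-neighbour lookup."""
    # All LA voxel indices p as homogeneous columns, shape (4, N).
    grid = np.indices(la_shape).reshape(len(la_shape), -1)
    p = np.vstack([grid, np.ones((1, grid.shape[1]))])
    # LA index -> physical space -> SA index.
    q = np.linalg.inv(affine_sa) @ affine_la @ p
    q = np.rint(q[:3]).astype(int)
    # Only points landing inside the SA volume are looked up; the rest stay zero.
    bounds = np.array(s_sa1.shape).reshape(3, 1)
    inside = np.all((q >= 0) & (q < bounds), axis=0)
    out = np.zeros(grid.shape[1], dtype=s_sa1.dtype)
    out[inside] = s_sa1[q[0, inside], q[1, inside], q[2, inside]]
    return out.reshape(la_shape)

# Toy geometry: the LA plane coincides with SA slice index 5.
rng = np.random.default_rng(0)
s_sa1 = rng.integers(0, 4, size=(10, 8, 8)).astype(np.uint8)
affine_sa = np.eye(4)
affine_la = np.array([[0.0, 0.0, 1.0, 5.0],
                      [1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0, 1.0]])
sa2la_map = sa_to_la_map(s_sa1, affine_sa, affine_la, la_shape=(8, 8, 1))
print(np.array_equal(sa2la_map[:, :, 0], s_sa1[5]))  # True
```

Nearest-neighbour lookup is used because the prior is a label map; interpolating label values would produce meaningless intermediate classes.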

3.4 Implementation Details

The proposed architecture is implemented on a single NVIDIA A100 GPU with 40 GB of memory. The SA and LA MRI scans are resampled to a voxel size of 1 $\times$ 1 $\times$ 1 $mm^{3}$. Dice loss and cross-entropy loss are used as loss functions to train the segmentation networks. Stochastic gradient descent is used as the optimizer with an initial learning rate of 0.01, a Nesterov momentum of 0.99, and a weight decay of 0.0005. We use a polynomial learning rate scheduler [51] to decrease the learning rate after each training epoch. All networks, i.e., TriggerNet, LA-SegNet, SA-SegNet, and the nnUNet baseline, are trained independently for 1000 epochs (the nnUNet default), where each epoch has 250 training iterations. The pre- and post-processing steps, such as the LA-to-SA transformation and vice versa, and the heart localization and cropping are performed in succession once the predictions of the previous network are available. All three segmentation networks and their respective pre- and post-processing steps are carried out sequentially in the inference phase.
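A polynomial learning rate schedule of the kind described above can be written as a one-line function of the epoch. In the sketch below, the initial learning rate (0.01) and epoch count (1000) come from the text, whereas the exponent of 0.9 is an assumption (it is a commonly used value for this schedule and is not stated here), and the function name is hypothetical.

```python
def poly_lr(epoch, max_epochs=1000, initial_lr=0.01, exponent=0.9):
    """Polynomial decay of the learning rate; exponent=0.9 is an assumed value."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent

# Learning rate at a few points during training.
schedule = [poly_lr(e) for e in (0, 250, 500, 750, 999)]
for epoch, lr in zip((0, 250, 500, 750, 999), schedule):
    print(f"epoch {epoch:4d}: lr = {lr:.6f}")
```

The schedule starts at the initial learning rate and decays smoothly towards zero at the final epoch, with most of the decay concentrated in the later epochs when the exponent is below 1.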

Different data augmentation techniques are applied during training to allow the networks to see a stream of distinct examples. Spatial transformations, including random rotation, scaling, and mirroring, provide distinct spatial perspectives from which the model can learn. Intensity adjustments, such as random brightness, contrast, and gamma variations, ensure the model’s adaptability to varying acquisition settings. Additionally, additive zero-mean Gaussian noise is utilized to enhance stochasticity, and blurring techniques, such as Gaussian blur, contribute to the model’s robustness against variations in image quality.

4 Dataset

The Multi-Disease, Multi-View, and Multi-Center Right Ventricular Segmentation challenge (M&Ms-2) was introduced at MICCAI 2021. The challenge focused on segmenting RV blood pools in cardiac images from multiple views and centers [21],[20]. The data includes diverse images from three clinical centers in Spain, acquired on nine scanners from three vendors: Siemens, General Electric, and Philips. It includes subjects with various LV and RV pathologies as well as healthy subjects. Labels are provided for three regions of interest: (i) LV blood pool, (ii) RV blood pool, and (iii) left ventricular MYO. The dataset contains 360 instances at two phases of the cardiac cycle, specifically the end-diastolic and end-systolic phases. The subjects are divided sequentially into 160 for training, 40 for validation, and 160 for testing, such that each split contains different patients. The validation and test sets also include patients with pathologies not present in the training set. For each individual, both SA and LA (4-chamber view) MR images are provided.

5 Ablation Studies

We study the effectiveness of our algorithmic design via different ablation studies. In particular, we evaluate the effect of utilizing the SA2LAmap and the LA2SAmap/$S_{SA1}$ as segmentation priors and the HLC module under different settings.

The SA2LAmap can be used in two ways to boost the network’s performance: (i) as a localization and cropping guide for the HLC module, to localize and crop the heart in the original full-scale $I_{LA}$, and (ii) as a segmentation prior concatenated to $I_{LA}$. Table 5 lists the results of these experiments.

HLC module	SA2LAmap as segmentation prior	Dice score LA $\uparrow$			HD (mm) LA $\downarrow$
		LV	RV	MYO	LV	RV	MYO
✗	✗	0.94	0.90	0.86	5.91	6.61	5.98
✓	✗	0.95	0.91	0.87	3.81	4.90	3.08
✗	✓	0.95	0.92	0.86	3.26	4.35	2.73
✓	✓	0.96	0.93	0.88	2.81	3.80	2.51

HLC module	$S_{SA1}$ as segmentation prior	LA2SAmap as segmentation prior	Dice score SA $\uparrow$			HD (mm) SA $\downarrow$
			LV	RV	MYO	LV	RV	MYO
✗	✗	✗	0.924	0.888	0.842	9.137	8.662	6.181
✓	✓	✗	0.938	0.912	0.864	3.809	5.027	2.671
✓	✓	✓	0.939	0.918	0.863	3.616	4.496	2.666

Methods	Dice (%) short-axis $\uparrow$				HD (mm) short-axis $\downarrow$
	LV	RV	MYO	Avg	LV	RV	MYO	Avg
UNet	87.02	88.85	79.07	84.98	13.78	12.10	12.23	12.70
ResUNet	87.98	89.63	79.28	85.63	13.80	11.61	12.09	12.50
DLA	87.27	89.88	80.23	86.12	13.25	10.84	12.31	12.13
InfoTrans*	88.24	90.41	80.25	86.30	12.41	10.98	12.83	12.07
rDLA*	88.64	90.28	80.78	86.57	12.74	10.31	12.49	11.85
TransUNet	87.91	88.69	78.67	85.09	13.80	10.29	13.43	12.51
MCTrans	88.52	89.90	80.08	86.17	12.29	9.92	13.28	11.83
MCTrans*	87.79	89.22	79.37	85.46	11.28	9.39	13.84	11.49
UTNet	87.52	90.57	80.20	86.10	12.03	9.78	13.72	11.84
UTNet*	87.74	90.82	80.71	86.42	11.79	9.11	13.41	11.44
TransFusion*	89.52	91.75	81.46	87.58	11.31	9.18	11.96	10.82
Proposed*	93.94	91.87	86.34	91.71	3.61	4.49	2.66	3.58
Methods	Dice (%) long-axis $\uparrow$				HD (mm) long-axis $\downarrow$
	LV	RV	MYO	Avg	LV	RV	MYO	Avg
UNet	87.26	88.20	79.96	85.14	13.04	8.76	12.24	11.35
ResUNet	87.61	88.41	80.12	85.38	12.72	8.39	11.28	10.80
DLA	88.37	89.38	80.35	86.03	11.74	7.04	10.79	9.86
InfoTrans*	88.21	89.11	80.55	85.96	12.47	7.23	10.21	9.97
rDLA*	88.71	89.71	81.05	86.49	11.12	6.83	10.42	9.46
TransUNet	87.91	88.23	79.05	85.06	12.02	8.14	11.21	10.46
MCTrans	88.42	88.19	79.47	85.36	11.78	7.65	10.76	10.06
MCTrans*	88.81	88.61	79.94	85.79	11.52	7.02	10.07	9.54
UTNet	86.93	89.07	80.48	85.49	11.47	6.35	10.02	9.28
UTNet*	87.36	90.42	81.02	86.27	11.13	5.91	9.81	8.95
TransFusion*	89.78	91.52	81.79	87.70	10.25	5.12	8.69	8.02
Proposed*	95.80	93.07	87.71	92.19	2.81	3.80	2.51	3.04

Quantitative comparison on validation set
Methods	Dice Score LA $\uparrow$	HD-LA(mm) $\downarrow$	Dice Score SA $\uparrow$	HD-SA(mm) $\downarrow$
[37]	0.922	5.35	0.925	8.90
[52]	0.922	5.59	0.924	8.85
[43]	0.920	5.34	0.922	9.47
Proposed	0.926	3.49	0.928	3.72
Quantitative comparison on test set
Methods	Dice Score LA $\uparrow$	HD-LA(mm) $\downarrow$	Dice Score SA $\uparrow$	HD-SA(mm) $\downarrow$
[37]	0.919	6.04	0.925	10.58
[52]	0.919	6.10	0.920	9.94
[43]	0.916	6.17	0.920	10.30
Proposed	0.928	3.91	0.927	4.01

Methods	Dice score short-axis $\uparrow$			HD (mm) short-axis $\downarrow$
	LV	RV	MYO	LV	RV	MYO
[36]	0.959	0.938	0.907	6.42	8.62	9.37
Proposed	0.963	0.928	0.870	3.43	3.87	3.62
Methods	Dice score long-axis $\uparrow$			HD (mm) long-axis $\downarrow$
	LV	RV	MYO	LV	RV	MYO
[36]	0.958	0.924	0.901	4.07	5.81	5.27
Proposed	0.961	0.927	0.878	2.83	3.70	2.56