Liver Segmentation in Ultrasound Images Using Self-Supervised Learning with Physics-inspired Augmentation and Global-Local Refinement

Shear Wave Elastography (SWE) is a non-invasive ultrasound method that evaluates changes in liver stiffness, serving as a useful biomarker for liver fibrosis. The proper placement of a region of interest (ROI) on the liver in the B-mode image is essential for obtaining accurate and dependable results in SWE. In developing an automated system for liver fibrosis measurement using SWE, the initial crucial step is the segmentation of the liver capsule. This paper presents a novel approach for liver segmentation in ultrasound images based on contrastive self-supervised learning. The proposed method leverages a large dataset of unannotated abdominal ultrasound images to learn feature representations, which are then fine-tuned on the downstream task of liver segmentation. The algorithm is trained in two stages: in the first stage, a SimCLR model learns feature representations from unlabeled data; in the second stage, these representations are fine-tuned on a smaller annotated dataset of liver segmentation masks. Finally, a refinement step using CascadePSP is applied. The study also investigates the use of physics-inspired augmentations, such as sector angle and penetration, to improve the performance of the deep learning model on ultrasound images. The proposed SimCLR+ENet approach was compared against the state-of-the-art U-Net. SimCLR+ENet outperformed U-Net in average Dice similarity (90.58% vs. 89.77%) and in average Hausdorff distance (21.71 vs. 29.53). This highlights the effectiveness of the proposed approach, with performance improvements of 0.9% and 26.5% for the average Dice coefficient and average Hausdorff distance, respectively.
The study provides insights into the use of physics-inspired augmentations in the medical ultrasound imaging field and highlights the potential for self-supervised learning in improving segmentation results.


Introduction
Non-alcoholic fatty liver disease (NAFLD), a prevalent cause of chronic liver disease, is characterized by the accumulation of excess fat in the liver, leading to damage and inflammation. It is predicted that the rate of NAFLD in Canada will rise from 20.8% to 22.9% between 2019 and 2030 [1], while the overall prevalence of NAFLD in the U.S. is estimated at 24% [2]. This upward trend in the incidence of NAFLD is likely to lead to a corresponding increase in economic burden, as seen in the U.S., where direct annual medical costs have been estimated at $103 billion [3]. Recent studies [4] have shown that the stage of liver fibrosis at the time of diagnosis is the best indicator of negative outcomes for patients with excess liver fat. Liver biopsy, currently the standard for detecting excess fat and staging fibrosis, is invasive, costly, and subject to sampling error and interpretative variability. Due to these limitations, non-invasive alternatives have been developed, with medical ultrasound being the most practical option, as it is widely accessible, portable, low-cost, non-toxic, and could realistically be used to risk-stratify NAFLD. As liver fibrosis progresses, liver stiffness increases, making it a useful biomarker. Shear wave elastography (SWE) is a non-invasive ultrasound method that can measure these changes in stiffness. It has been shown that SWE is highly effective in diagnosing cirrhosis and moderately accurate for intermediate fibrosis stages in adults with NAFLD [5]. Proper placement of a region of interest (ROI) on the liver in the B-mode image is crucial for obtaining accurate and reliable results in SWE. Therefore, when creating an automated system for measuring liver fibrosis using SWE, the initial step is to segment the liver capsule. However, segmenting the liver in ultrasound images is a challenging task due to several factors, such as low resolution and the presence of noise and artifacts. Moreover, the cost and time
required for annotating such images can be prohibitively expensive, particularly when training supervised deep learning algorithms. In this paper, we propose a contrastive self-supervised learning approach for image segmentation that utilizes a large dataset of unannotated abdominal ultrasound images to learn image representations that are then applied to the downstream task of liver segmentation, an approach that can achieve state-of-the-art (SOTA) performance with minimal annotated data. The remainder of the paper is organized as follows: Section 2 provides a survey of relevant literature. The proposed method is described in Section 3. The results of the proposed approach are analyzed and compared with a SOTA method in Section 4. Finally, the conclusions of the study are given in Section 5.

Related Work
This section presents a review of previous studies that have employed deep learning approaches for the segmentation of the liver in ultrasound images; most studies in the literature used U-Net [6] as the deep learning approach. Garcia et al. [7] trained U-Net on B-mode images from 20 patients (500 frames) and tested the algorithm on 5 patients (125 frames). The architecture showed high accuracy in segmenting the images, with test accuracy surpassing 86%, sensitivity of 80%, and specificity of 95%. The results suggest that U-Net can greatly improve the early screening of liver diseases and improve outcomes for at-risk patients. Wu et al. [8] used data collected from a public medical center to train, validate, and test U-Net, consisting of 215, 30, and 76 ultrasound images, respectively. The aim of the study was to perform texture analysis on only a portion of the renal cortex and liver, not on the entire renal cortex or liver. Evaluation of the image segmentation model was thus performed by dividing the number of pixels that match between the ground truth and prediction masks by the total number of pixels in the prediction mask, a metric termed intersection over pixels (IoP), which evaluated to 0.95 ± 0.118. However, the liver intersection over union (IoU) evaluated to 0.643 ± 0.186. Ibrahim [9] used U-Net to segment liver tissue in ultrasound images with Dice scores ranging between 78% and 89%.
Rhyou and Yoo [10] proposed a cascaded deep learning neural network to estimate the level of liver steatosis from ultrasound images. The network includes three parts: (a) liver and kidney (L-K) detection, which uses DeepLabv3+ [11] for semantic segmentation and transfer learning for improved performance; (b) ring detection for areas difficult to detect with L-K detection; and (c) SteatosisNet, which grades the severity of fatty liver disease using transfer learning with Inception v3 [12] and a dataset of cropped L-K areas. Training, validation, and testing data comprised 1590, 530, and 530 images, respectively. The neural networks used for segmentation were trained using cross-entropy loss, and the results were evaluated using mean accuracy, mean IoU, and boundary F-1 (BF1) score, which evaluated to 94.9%, 78.6%, and 52.3% for liver segmentation, respectively.

Methodology
The methodology outlined in this study is depicted in Fig. 1 and is elaborated upon in the following sections. The approach is trained in two stages. In the first stage, we trained a contrastive self-supervised learning approach (i.e., SimCLR [13]) on unannotated abdominal ultrasound images. We employed the SimCLR model to acquire feature representations of the ultrasound images, which were subsequently fine-tuned on the downstream segmentation task involving ENet (second stage) [14]. In the second stage, the downstream task was trained through a supervised learning approach, using a significantly smaller dataset (compared with the first stage) of images with annotated liver segmentation masks. To attain the optimized final result, we refined the output of the ENet model using CascadePSP [15]. SimCLR (simple framework for contrastive learning of visual representations) is a method for training visual representations using a contrastive loss function, where a neural network is trained to distinguish between different views of the same image rather than to classify images into a set of predefined categories (i.e., without the need for large labeled datasets). The contrastive learning framework in SimCLR involves four main components: (i) stochastic data augmentation: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur are applied sequentially to create two correlated views of the same example (i.e., a positive pair), allowing the network to learn to distinguish between different views of the same image; (ii) feature extractor: a convolutional neural network (CNN) based on the ResNet-50 architecture [16], which maps the input image to a feature space and produces a feature representation of the image; the network is then fine-tuned on the target dataset using the contrastive loss function [17]; (iii) contrastive loss function: compares the feature representations of different views of the same image, aiming to minimize the distance between representations of the same image while maximizing the distance between representations of different images; the authors used the normalized temperature-scaled cross-entropy (NT-Xent) loss function; (iv) projection head: a multilayer perceptron (MLP) with one hidden layer that projects the feature representation of the image (i.e., produced by ResNet-50) to a lower-dimensional space, which is used as the final representation of the image and fed to the contrastive loss function. This is done by applying a linear transformation followed by a non-linear activation function (i.e., ReLU) to the feature representation. The authors used the L2-normalized feature representation of the image as input to the MLP. In our work, we used ENet as the backbone (base model) of SimCLR (instead of ResNet), since we are interested in liver segmentation.
During training, a minibatch of N examples is randomly selected, and the contrastive prediction task is performed on pairs of augmented examples, resulting in 2N data points. Instead of explicitly sampling negative examples, all other 2(N − 1) augmented examples within the minibatch are treated as negative examples for a given positive pair. The final loss is computed across all positive pairs within the minibatch; for a positive pair (i, j), it is defined as:

ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k ≠ i] exp(sim(z_i, z_k)/τ) ]

where sim(·,·) is cosine similarity and 1[k ≠ i] is an indicator function used to determine whether a sample is a positive or negative sample relative to the anchor image (i.e., it evaluates to 1 iff k ≠ i). The temperature parameter, τ, controls the degree of similarity between the representations and can be used to regulate the difficulty of the contrastive task (i.e., increasing the temperature results in a softer, less discriminative similarity function, while decreasing it results in a harder, more discriminative one).
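The NT-Xent computation above can be expressed in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the pairing convention (rows 2k and 2k+1 form a positive pair) and the default τ are assumptions:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss over 2N projections; rows 2k and 2k+1 are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via dot products
    sim = z @ z.T / tau                                # (2N, 2N) scaled similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # indicator 1[k != i]: drop self-similarity
    pos = np.arange(len(z)) ^ 1                        # index of each row's positive partner
    m = sim.max(axis=1, keepdims=True)                 # stabilize the log-sum-exp
    log_denom = m.squeeze(1) + np.log(np.exp(sim - m).sum(axis=1))
    return float((log_denom - sim[np.arange(len(z)), pos]).mean())
```

With τ low, well-separated positive pairs drive the loss toward zero, matching the intuition that a lower temperature makes the similarity function more discriminative.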
The training batch size was varied from 256 to 8192, and the LARS optimizer [18] was used to stabilize training, as training with large batch sizes and standard SGD/momentum with linear learning-rate scaling can be unstable. The model was trained using Cloud TPUs with 32 to 128 cores, depending on the batch size. To prevent the model from exploiting local information leakage to improve prediction accuracy without improving representations, the mean and variance of batch normalization were aggregated over all devices during training. The authors found that the combination of data augmentations, a learnable nonlinear transformation, larger batch sizes, and more training steps is critical to learning effective representations. Results showed that SimCLR outperforms previous methods in self-supervised and semi-supervised learning on ImageNet, achieving a 7% relative improvement in top-1 accuracy over the previous SOTA. When fine-tuned with only 1% of labeled data, SimCLR was able to outperform AlexNet.

ENet
ENet (Efficient Neural Network) is a deep neural network architecture for real-time semantic segmentation that achieves high segmentation accuracy while being computationally efficient, allowing real-time performance on resource-constrained devices such as medical devices. ENet's architecture is based on ResNet and is divided into several stages (see Table 1 in [14]). The first three stages form the encoder part of the network, while the last two stages form the decoder part. Each stage is made up of several bottleneck modules, which consist of three convolutional layers: a 1x1 projection layer to reduce dimensionality, a main convolutional layer, and a 1x1 expansion layer. Batch normalization and PReLU [19] activation are used between all convolutional layers. In the case of downsampling, a max pooling layer is added to the main branch. The network uses spatial dropout as a regularization technique, with different dropout rates in different stages. In the decoder, max pooling is replaced with max unpooling, and padding is replaced with spatial convolution without bias. The final layer of the network is a full convolution. Optimizations have been made to improve performance, such as omitting bias terms in projections to reduce memory usage, while batch normalization is used between every convolutional layer and non-linearity to improve accuracy.
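To see why the bottleneck design is cheap, one can compare the multiply-accumulate (MAC) count of a plain 3x3 convolution with that of the 1x1 projection / 3x3 / 1x1 expansion stack. The feature-map dimensions and the quarter-channel projection ratio below are illustrative, not taken from the ENet paper:

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a k x k convolution with 'same' padding, stride 1."""
    return h * w * c_in * c_out * k * k

H, W, C = 64, 64, 128                      # hypothetical feature-map dimensions
plain = conv_macs(H, W, C, C, 3)           # one full 3x3 convolution

r = C // 4                                 # project to a quarter of the channels
bottleneck = (conv_macs(H, W, C, r, 1)     # 1x1 projection
              + conv_macs(H, W, r, r, 3)   # 3x3 main convolution
              + conv_macs(H, W, r, C, 1))  # 1x1 expansion

print(plain / bottleneck)                  # ~8.5x fewer MACs for the bottleneck stack
```

Savings of this kind, compounded over many stages, are what allow ENet's large speed and parameter reductions relative to heavier encoder-decoder designs.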
ENet was evaluated on the CamVid, Cityscapes, and SUN datasets and was found to be up to 18x faster, require 75x fewer FLOPs, and have 79x fewer parameters, while providing similar or better accuracy compared to existing models. It can be deployed on mobile devices as well as high-end GPUs, which can lead to significant savings in data-center applications by allowing faster and more efficient large-scale computations.

CascadePSP
The CascadePSP method is designed to obtain high-resolution, class-agnostic segmentation. The refinement module of CascadePSP takes an image and multiple imperfect segmentation masks at different scales to produce a refined segmentation. It captures different levels of structural and boundary information by concatenating the image and segmentation masks and extracting features from the inputs using a Pyramid Scene Parsing network (PSPNet) [20] with ResNet-50 as the backbone. The module also generates intermediate segmentations at different strides, employs skip connections to reconstruct lost pixel-level details, and outputs a final segmentation through a 2-layer 1 × 1 convolution followed by sigmoid activation. The refinement module uses a combination of loss functions to produce the best result: cross-entropy loss for the coarse output, L1 + L2 loss for the fine output, and the average of both for the intermediate output. An L1 loss on gradient magnitude is also used to improve boundary refinement, with the gradient loss weighted by α. The final loss can be written as:

L = L^8_CE + (1/2)(L^4_CE + L^4_{L1+L2}) + L^1_{L1+L2} + α L^1_grad

where L^s_CE, L^s_{L1+L2}, and L^s_grad represent the cross-entropy loss, L1 + L2 loss, and gradient loss for stride s, respectively.
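A minimal NumPy sketch of how this composite loss could be assembled follows. The stride-to-loss assignment mirrors the description above (cross-entropy for the coarse output, L1 + L2 for the fine output, the average for the intermediate output), but the function names, the default α, and the use of same-resolution masks for every stride are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def bce(p, t, eps=1e-7):
    """Pixel-wise binary cross-entropy between prediction p and target t."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())

def l1_l2(p, t):
    """Combined L1 + L2 loss."""
    d = p - t
    return float(np.abs(d).mean() + (d ** 2).mean())

def grad_l1(p, t):
    """L1 loss on gradient magnitudes, encouraging sharp boundaries."""
    gmag = lambda x: np.hypot(*np.gradient(x))
    return float(np.abs(gmag(p) - gmag(t)).mean())

def cascadepsp_loss(preds, target, alpha=5.0):
    """preds maps stride -> predicted mask (kept at target resolution here).
    Coarse (stride 8): cross-entropy; intermediate (stride 4): average of both;
    fine (stride 1): L1 + L2 plus alpha-weighted gradient loss."""
    return (bce(preds[8], target)
            + 0.5 * (bce(preds[4], target) + l1_l2(preds[4], target))
            + l1_l2(preds[1], target)
            + alpha * grad_l1(preds[1], target))
```

The gradient term is what distinguishes this objective from a plain pixel-wise loss: two masks with the same area but different boundary sharpness receive different penalties.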
High-resolution segmentation refinement is performed using the trained refinement module in two steps: a global step and a local step. The global step refines the whole resized image using a 3-level cascade to fix segmentation errors progressively while preserving details. The local step refines high-resolution images by taking crops and processing them with a 2-level cascade, resolving any disagreement between crops by averaging their outputs. The same refinement module can be applied recursively for higher-resolution refinement. The authors found that CascadePSP performs similarly to or better than the SOTA in terms of mean intersection over union (mIoU) and F1-score, and excels at segmenting small objects and fine details with high accuracy and resolution.
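The crop-disagreement resolution in the local step amounts to averaging overlapping outputs when stitching refined crops back into the full-resolution mask, roughly as in this simplified sketch (not the authors' implementation; crop coordinates are assumed given):

```python
import numpy as np

def blend_crops(crops, coords, shape):
    """Paste refined crops back into a full-size mask, averaging where crops overlap."""
    acc = np.zeros(shape)   # running sum of crop values per pixel
    cnt = np.zeros(shape)   # how many crops cover each pixel
    for crop, (y, x) in zip(crops, coords):
        h, w = crop.shape
        acc[y:y + h, x:x + w] += crop
        cnt[y:y + h, x:x + w] += 1
    return acc / np.maximum(cnt, 1)   # pixels covered by no crop stay zero
```

Averaging keeps the stitched mask smooth across crop seams, where independently refined crops would otherwise disagree.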

Physics-inspired Augmentation
Data augmentation improves a deep learning model's ability to generalize by applying transformations to the data, increasing both the amount and diversity of the data and reducing overfitting during training. In the imaging domain, common traditional augmentations include flipping, rotating, and translating the image. Data augmentation is crucial when the amount of training data is limited. However, its benefits may be reduced when the augmented images do not accurately represent the object in real-world scenarios. This is relevant for medical ultrasound images, as traditional augmentations used in the imaging field may create unrealistic images that do not enhance the performance of the deep learning model, especially since these augmentations are typically based on how optical cameras operate. We hypothesize that using ultrasound-specific augmentations rather than domain-agnostic augmentations could improve the model's performance on ultrasound images. In this paper, we investigate this hypothesis by introducing different physics-inspired augmentations for ultrasound imaging, namely: sector angle, penetration, Gaussian blur, force control, and zoom. In sector angle, we produce B-mode images with a narrow sector angle, which reflects the number of lines per frame in real-world scenarios. Penetration augmentation simulates how deeper areas appear darker, particularly when using high-frequency probes. Gaussian blur represents a form of distortion in ultrasound images that acts as a low-pass filter attenuating higher spatial frequencies. In contrast to Gaussian blur, force control, which focuses the ultrasound energy, produces sharper images. Zooming in/out mimics the zooming that can occur during ultrasound scanning, thereby affecting the textural features of the image. The proposed augmentations were utilized in training both SimCLR+ENet (our proposed approach) and U-Net (a method for comparison) for liver segmentation in ultrasound images. Fig. 2 shows some examples of the proposed physics-inspired augmentations and their associated masks.
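Two of these augmentations can be sketched as simple array transforms. The functional forms and parameters below (linear attenuation, a linearly widening sector) are illustrative stand-ins, not the exact transforms used in this work:

```python
import numpy as np

def penetration_augment(img, max_attenuation=0.7):
    """Darken deeper rows to mimic depth-dependent attenuation.
    img: (H, W) float array in [0, 1]; row 0 is the transducer face."""
    depth_gain = 1.0 - max_attenuation * np.linspace(0.0, 1.0, img.shape[0])
    return img * depth_gain[:, None]

def sector_angle_augment(img, angle_frac=0.6):
    """Keep only a narrower sector fanning out from the top centre of the image."""
    h, w = img.shape
    out = img.copy()
    cols = np.arange(w) - (w - 1) / 2.0        # signed distance from centre column
    half_width = angle_frac * (w / 2.0)
    for r in range(h):
        limit = half_width * (r + 1) / h       # sector widens linearly with depth
        out[r, np.abs(cols) > limit] = 0.0
    return out
```

Note that geometry-changing augmentations such as sector angle must be applied identically to the segmentation mask so that image and mask stay aligned, whereas intensity-only augmentations such as penetration leave the mask unchanged.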

Results and Discussion
Liver ultrasound DICOM exams were collected from the electronic medical record (EMR) of the Massachusetts General Hospital (MGH). In particular, training, validation, and testing data were sampled from 903 abdominal ultrasound exams (each exam contained a sequence of DICOM files); the sampled images were annotated by MGH radiologists for use in the supervised part of the proposed approach (i.e., ENet). The self-supervised learning part (i.e., SimCLR), on the other hand, was trained on 17,441 unlabeled images, and the trained model was subsequently fine-tuned on the downstream task of liver segmentation in ultrasound images using ENet. The labeled/annotated training and validation B-mode sets consisted of 495 and 51 images, respectively. Experiments were performed on a test set of 90 images to evaluate the proposed approach and compare results with the U-Net model. Fig. 3 displays examples of the training data and corresponding ground truth masks. Table 1 lists the various experiments performed using SimCLR+ENet and U-Net, including the application of CascadePSP as a refinement step and the use of augmented data. In our research, we utilized a pre-trained CascadePSP model that was trained on a merged dataset of MSRA-10K [21], DUT-OMRON [22], ECSSD [23], and FSS-1000 [24], following the recommendation of the authors of CascadePSP, who indicated that the pre-trained model is sufficient for refinement and that retraining on a custom dataset is not necessary. An ablation study was conducted to assess the impact of refinement on the original segmentation approach, using both the original and the refined versions of the approach. Both the ENet and U-Net models were trained for 500 epochs on an NVIDIA GeForce RTX 2080-Ti GPU, while the SimCLR model was trained for 200 epochs on the same GPU. The training data was augmented to 3880 images using the physics-inspired augmentations (Section 3.4). In evaluating the performance of the proposed segmentation method, a set of commonly used metrics was employed, including the Dice coefficient, Jaccard index, accuracy, sensitivity, specificity, and Hausdorff distance. We can observe from Table 1 (and Fig. 4) that the proposed SimCLR+ENet approach demonstrated the highest segmentation performance when trained on the physics-augmented dataset and refined using CascadePSP. The addition of the refinement step resulted in outputs that closely resemble the shape and structure of the ground truth, demonstrating its effectiveness in improving segmentation accuracy. On the other hand, the U-Net model did not perform well when trained on the augmented dataset, which could be attributed to both its inadequate capacity to model complex relationships in ultrasound images and possible overfitting; the limited capacity of the U-Net architecture hindered its ability to learn the patterns in the data needed for good performance. Although the metrics in Table 1 are commonly used for evaluating segmentation performance, they fail to account for factors such as the shape and smoothness of the segmented regions. To address this limitation, the Hausdorff distance was employed as an additional metric, providing a more comprehensive evaluation of the segmentation experiments (i.e., one that considers the shape of the segmented regions). Using the Hausdorff distance involved extracting the contours of the segmentation masks and measuring the maximum distance between the contours of the predicted segmentation mask and the ground truth mask to determine the similarity of their shapes. The results of the proposed approach (involving training on a physics-inspired augmented dataset and refinement using CascadePSP) reveal a remarkable resemblance to the structure of the liver as annotated in the ground truth. This approach effectively captures the complexities and intricacies of the liver's anatomy and is robust to the inherent noise in ultrasound images, resulting in a highly accurate representation of the organ's structure.
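The shape comparison described above reduces to the symmetric Hausdorff distance between two contour point sets. For modest contour sizes it can be computed directly by brute force (a sketch; contour extraction from the masks is assumed to have been done beforehand, e.g. with an edge or contour detector):

```python
import numpy as np

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between point sets a (n, 2) and b (m, 2)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m) pairwise distances
    return max(d.min(axis=1).max(),   # farthest point of a from its nearest point in b
               d.min(axis=0).max())   # farthest point of b from its nearest point in a
```

Because it is a maximum over nearest-neighbour distances, a single stray contour point can dominate the metric, which is exactly why it penalizes shape outliers that area-overlap metrics such as Dice ignore.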

Conclusion
Building an automated system for measuring liver fibrosis using shear wave elastography (SWE) has been a challenge due to factors such as low resolution and the presence of noise and artifacts in ultrasound images. This paper proposes a novel contrastive self-supervised learning approach for image segmentation (considered the first crucial step in such an automated system) that leverages a dataset of unannotated abdominal ultrasound images to learn feature representations, which are then fine-tuned on the downstream task of liver segmentation. The proposed method is trained in two stages: the first stage uses the SimCLR model to learn the feature representations, while the second stage fine-tunes the model on the downstream task involving ENet. The proposed approach was compared to a state-of-the-art method and found to produce improved results. In addition, the study investigates the hypothesis that physics-specific augmentations could improve the model's performance on ultrasound images, introducing different physics-inspired augmentations for ultrasound imaging, including sector angle, penetration, Gaussian blur, force control, and zoom. The proposed augmentations were utilized in training both SimCLR+ENet and U-Net for liver segmentation in ultrasound images, and the experiments showed that they improved the performance of the deep learning model. The results of the study indicate that the proposed contrastive self-supervised learning approach for image segmentation can effectively be applied within an automated system for measuring liver fibrosis using SWE, and that physics-specific augmentations are crucial in improving the performance of deep learning models on ultrasound images. These findings are promising and could lead to the development of more effective automated systems for measuring liver fibrosis using SWE. As future work, we aim to improve the quality of the data used for training the downstream task by combining self-supervised learning with a data-centric AI approach, helping us achieve better segmentation performance with a minimal amount of labeled data. We also plan to investigate the use of a vision transformer (ViT) as the backbone of SimCLR.

Figure 2 .
Figure 2. Physics-inspired augmentations and associated masks: (a) original B-mode image (b) sector angle (c) penetration (d) sector angle and penetration (e) Gaussian blur (f) force control (g) zoom-in (h) zoom-out

Figure 3 .
Figure 3. Samples of training B-mode images and corresponding masks

Fig. 5
Fig. 5 illustrates the comparison between the contours of the ground truth shape and the segmentations generated by ES_phy_refine and UN, showcasing their level of similarity.

Figure 5 .
Figure 5. Liver segmentation contours (ground truth, ES_phy_refine, UN): (a) B-mode image (b) contours of the ground truth shape and the segmentations generated by ES_phy_refine and UN overlaid on the B-mode image (c) contours overlaid on a blank image

Table 2 .
Table 2 presents the Hausdorff distance values for the different experiments. Table 2. Shape similarity evaluation using the Hausdorff distance for the different experiments, with the best, second best, and third best results highlighted.