SANGRIA: Surgical Video Scene Graph Optimization for Surgical Workflow Prediction

Read original: arXiv:2407.20214 - Published 7/30/2024 by Çağhan Koksal, Ghazal Ghazaei, Felix Holm, Azade Farshad, Nassir Navab

Overview

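The paper proposes SANGRIA, an end-to-end framework that generates surgical scene graphs from video and optimizes them for a downstream workflow task.
It aims to improve surgical workflow recognition by using a scene graph representation of surgical videos, even when densely annotated scene data is scarce.
The key contributions are unsupervised scene graph generation based on graph-based spectral clustering and foundation models, sparse temporal connections between consecutive frames, and joint optimization of the graph with surgical phase segmentation using only weak phase labels.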

Plain English Explanation

The paper presents a new system called SANGRIA that uses scene graphs to better understand and predict the workflow during surgical procedures. Scene graphs are a way of representing the objects, actions, and relationships in a visual scene.

The researchers developed a method that builds these scene graphs without dense annotations. It groups regions of each video frame into nodes, such as instruments and tissues, and links matching regions across consecutive frames so the graph stays consistent over time.
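
To make the idea of a scene graph concrete, here is a minimal sketch of one as a data structure: labeled nodes plus (subject, relation, object) edges. The class name, the node labels, and the relation names are all hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    """Minimal scene graph: labeled nodes plus (subject, relation, object) edges."""

    nodes: Dict[int, str] = field(default_factory=dict)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, node_id: int, label: str) -> None:
        self.nodes[node_id] = label

    def add_edge(self, subject: int, relation: str, obj: int) -> None:
        # Reject edges that point at nodes the graph does not contain.
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((subject, relation, obj))

    def triples(self) -> List[Tuple[str, str, str]]:
        """Return edges as readable (subject, relation, object) label triples."""
        return [(self.nodes[s], r, self.nodes[o]) for s, r, o in self.edges]


# Hypothetical single frame; labels and relations are illustrative only.
graph = SceneGraph()
graph.add_node(0, "forceps")
graph.add_node(1, "iris")
graph.add_node(2, "cornea")
graph.add_edge(0, "touches", 1)
graph.add_edge(2, "adjacent_to", 1)

print(graph.triples())
```

In this toy version the labels are written by hand. In an unsupervised setting the nodes would come from discovered image regions, and the labels would not be available up front.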

Finally, the team used the scene graph representations to train a model that can predict the overall surgical workflow - i.e., the sequence of steps that make up the procedure. This could help surgeons better understand and plan for the different phases of an operation.

Technical Explanation

The key technical components of SANGRIA are:

Unsupervised Scene Graph Generation: SANGRIA builds scene graphs without dense annotations by combining graph-based spectral clustering with the generalization capability of foundation models. The resulting graphs have learnable properties, so they can be refined during training rather than fixed in advance.
Sparse Temporal Connections: The initial spatial graph is reinforced with sparse temporal connections, obtained from local matches between consecutive frames. This lets the method predict clusters that stay consistent across a temporal neighborhood.
Joint Optimization with Phase Segmentation: The spatiotemporal relations and node features of the dynamic scene graph are optimized jointly with the downstream task of surgical phase segmentation. Only weak surgical phase labels are used as supervision, which avoids costly scene-level annotation.

By using this scene graph-based approach, SANGRIA models the spatial and temporal dependencies within surgical procedures. According to the abstract, it outperforms the prior state of the art on the CATARACTS dataset by 8% accuracy and 10% F1 score in surgical workflow recognition.
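
The paper's abstract says the unsupervised scene graphs are generated with graph-based spectral clustering. As a rough illustration of that general technique, and not the authors' implementation, the sketch below splits a handful of feature vectors into two groups using the second eigenvector (the Fiedler vector) of a normalized graph Laplacian. The function name, the cosine-similarity affinity, and the toy 2-D features standing in for foundation-model features are all assumptions.

```python
import numpy as np


def spectral_bipartition(features: np.ndarray) -> np.ndarray:
    """Split items into two groups via the Fiedler vector of the normalized Laplacian."""
    # Affinity graph: non-negative cosine similarity, no self-loops (an assumed choice).
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(affinity, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    degree = np.maximum(affinity.sum(axis=1), 1e-12)  # guard against isolated nodes
    inv_sqrt_deg = 1.0 / np.sqrt(degree)
    normalized = inv_sqrt_deg[:, None] * affinity * inv_sqrt_deg[None, :]
    laplacian = np.eye(len(features)) - normalized

    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1] * inv_sqrt_deg

    # The sign of each entry assigns the item to one of the two groups.
    return (fiedler > 0).astype(int)


# Toy features: three items point mostly along x, three mostly along y.
features = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0],
])
labels = spectral_bipartition(features)
print(labels)
```

Which group receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful. Splitting into more than two clusters would typically use several eigenvectors followed by a clustering step such as k-means.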

Critical Analysis

The work has several limitations. First, the unsupervised scene graph generation step may not generalize well to all surgical procedures. Incorporating more domain knowledge or training data could improve its robustness.

Additionally, the temporal connections are sparse and built from local matches between consecutive frames, so longer-range dependencies in the surgical process may not be captured directly. Richer dynamic scene graph techniques could further enhance the system's understanding of the surgical workflow.

Finally, the headline results are reported on a single dataset, CATARACTS, so performance on other surgical procedures is unclear. Broader testing and validation would be needed to assess the generalizability of the approach.

Overall, SANGRIA represents a promising step towards leveraging scene graph representations for improved surgical workflow analysis. However, there are still opportunities to refine the techniques and expand their applicability to real-world surgical settings.

Conclusion

This paper introduces SANGRIA, a system that uses scene graph representations of surgical videos to better predict the overall surgical workflow. By generating scene graphs without dense annotations, connecting them across frames, and optimizing them jointly with phase segmentation, SANGRIA demonstrates the potential of this structured representation for understanding and anticipating the complex steps involved in surgical procedures.

While the system has some limitations, the core ideas behind SANGRIA could lead to meaningful advancements in surgical video analysis and computer-assisted surgery. By providing surgeons with better situational awareness and workflow prediction capabilities, SANGRIA and similar techniques could ultimately help improve surgical outcomes and efficiency.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SANGRIA: Surgical Video Scene Graph Optimization for Surgical Workflow Prediction

Çağhan Koksal, Ghazal Ghazaei, Felix Holm, Azade Farshad, Nassir Navab

Graph-based holistic scene representations facilitate surgical workflow understanding and have recently demonstrated significant success. However, this task is often hindered by the limited availability of densely annotated surgical scene data. In this work, we introduce an end-to-end framework for the generation and optimization of surgical scene graphs on a downstream task. Our approach leverages the flexibility of graph-based spectral clustering and the generalization capability of foundation models to generate unsupervised scene graphs with learnable properties. We reinforce the initial spatial graph with sparse temporal connections using local matches between consecutive frames to predict temporally consistent clusters across a temporal neighborhood. By jointly optimizing the spatiotemporal relations and node features of the dynamic scene graph with the downstream task of phase segmentation, we address the costly and annotation-burdensome task of semantic scene comprehension and scene graph generation in surgical videos using only weak surgical phase labels. Further, by incorporating effective intermediate scene representation disentanglement steps within the pipeline, our solution outperforms the SOTA on the CATARACTS dataset by 8% accuracy and 10% F1 score in surgical workflow recognition.

7/30/2024

Advancing Surgical VQA with Scene Graph Knowledge

Kun Yuan, Manasi Kattel, Joel L. Lavanchy, Nassir Navab, Vinkle Srivastav, Nicolas Padoy

The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with language capabilities is emerging as a necessity. Our work aims to advance Visual Question Answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question-condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. First, we propose a Surgical Scene Graph-based dataset, SSG-QA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. Our SSG-QA dataset provides a more complex, diverse, geometrically grounded, unbiased, and surgical action-oriented dataset compared to existing surgical VQA datasets. We then propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM), which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Our comprehensive analysis of the SSG-QA dataset shows that SSG-QA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation in the current surgical VQA systems is the lack of scene knowledge to answer complex queries. We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-QA

6/26/2024

SURGIVID: Annotation-Efficient Surgical Video Object Discovery

Çağhan Koksal, Ghazal Ghazaei, Nassir Navab

Surgical scenes convey crucial information about the quality of surgery. Pixel-wise localization of tools and anatomical structures is the first task towards deeper surgical analysis for microscopic or endoscopic surgical views. This is typically done via fully-supervised methods which are annotation greedy and in several cases, demanding medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation-efficient framework for the semantic segmentation of surgical scenes. We employ image-based self-supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine-tuning step. Our unsupervised setup reinforced with only 36 annotation labels indicates comparable localization performance with fully-supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to $\sim 2\%$ improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.

9/14/2024

Revisiting Surgical Instrument Segmentation Without Human Intervention: A Graph Partitioning View

Mingyu Sheng, Jianan Fan, Dongnan Liu, Ron Kikinis, Weidong Cai

Surgical instrument segmentation (SIS) on endoscopic images stands as a long-standing and essential task in the context of computer-assisted interventions for boosting minimally invasive surgery. Given the recent surge of deep learning methodologies and their data-hungry nature, training a neural predictive model based on massive expert-curated annotations has been dominating and served as an off-the-shelf approach in the field, which could, however, impose a prohibitive burden on clinicians for preparing fine-grained pixel-wise labels corresponding to the collected surgical video frames. In this work, we propose an unsupervised method by reframing the video frame segmentation as a graph partitioning problem and regarding image pixels as graph nodes, which is significantly different from the previous efforts. A self-supervised pre-trained model is first leveraged as a feature extractor to capture high-level semantic features. Then, Laplacian matrices are computed from the features and are eigendecomposed for graph partitioning. On the deep eigenvectors, a surgical video frame is meaningfully segmented into different modules such as tools and tissues, providing distinguishable semantic information like locations, classes, and relations. The segmentation problem can then be naturally tackled by applying clustering or threshold on the eigenvectors. Extensive experiments are conducted on various datasets (e.g., EndoVis2017, EndoVis2018, UCL, etc.) for different clinical endpoints. Across all the challenging scenarios, our method demonstrates outstanding performance and robustness higher than unsupervised state-of-the-art (SOTA) methods. The code is released at https://github.com/MingyuShengSMY/GraphClusteringSIS.git.

8/28/2024