
Prediction of Lung Metastasis from Hepatocellular Carcinoma using the SEER Database

Jeff J. H. Kim, George R. Nahass, Yang Dai, Theja Tulabandhula

2025-01-20 | Imputation | Epidemiology | Feature Importance

Abstract

Hepatocellular carcinoma (HCC) is a leading cause of cancer-related mortality, with the lung being the most common site of distant metastasis, which significantly worsens prognosis. Despite the growing availability of clinical and demographic data, predictive models for lung metastasis in HCC remain limited in scope and clinical applicability. In this study, we developed and validated an end-to-end machine learning pipeline using data from the Surveillance, Epidemiology, and End Results (SEER) database. We evaluated three machine learning models (Random Forest, XGBoost, and Logistic Regression) alongside a multilayer perceptron (MLP) neural network. Our models achieved high AUROC values and recall, with the Random Forest and MLP models demonstrating the best overall performance (AUROC = 0.82). However, the low precision across models highlights the difficulty of accurately predicting positive cases. To address this limitation, we developed a custom loss function incorporating recall optimization, enabling the MLP model to achieve the highest sensitivity. An ensemble approach further improved overall recall by leveraging the strengths of the individual models. Feature importance analysis revealed key predictors such as surgery status, tumor staging, and follow-up duration, emphasizing the relevance of clinical interventions and disease progression in metastasis prediction. While this study demonstrates the potential of machine learning for identifying high-risk patients, its limitations include reliance on imbalanced datasets, incomplete feature annotations, and the low precision of predictions. Future work should leverage the expanding SEER dataset, improve data imputation techniques, and explore advanced pre-trained models to enhance predictive accuracy and clinical utility.
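
The abstract does not specify the form of the recall-oriented loss or the ensemble rule, so the following is only a minimal illustrative sketch under stated assumptions, not the authors' implementation. It assumes the loss is a binary cross-entropy in which positive cases are up-weighted (the `pos_weight` parameter is hypothetical), and that the ensemble flags a patient as positive when any individual model does (a union rule). All function names and the toy labels are invented for illustration.

```python
import math


def recall_weighted_bce(y_true, y_prob, pos_weight=5.0, eps=1e-12):
    """Binary cross-entropy with positives up-weighted by pos_weight.

    Assumption: up-weighting positives is one simple way to push a model
    toward higher recall; the paper's actual loss may differ.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def union_ensemble(*model_preds):
    """Flag a case as positive if any model flags it (hypothetical rule)."""
    return [int(any(votes)) for votes in zip(*model_preds)]


def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels and hard predictions from two hypothetical models.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 0, 0, 1, 0, 0, 0, 0]
model_b = [0, 1, 0, 0, 1, 0, 0, 0]
combined = union_ensemble(model_a, model_b)

print("recall A:", recall(y_true, model_a))
print("recall B:", recall(y_true, model_b))
print("recall union:", recall(y_true, combined))
```

Under a union rule, the ensemble's recall can never fall below that of its best member, because every positive caught by any model is kept. The cost is that false positives accumulate as well, which is consistent with the low precision the abstract reports as a limitation.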
