Background/aim: Single-cell transcriptomics (scRNA-Seq) explores cellular diversity at the gene expression level. Because scRNA-Seq data are inherently sparse and noisy, and the types of the sequenced cells are uncertain, effective clustering and cell type annotation are essential. Graph-based clustering of scRNA-Seq data is a simple yet powerful approach that represents the data as a “shared nearest neighbour” graph and clusters the cells using graph clustering algorithms. These algorithms depend on several user-defined parameters. Here we present SUMA, a lightweight tool that uses a random forest model to predict the number of neighbours that yields the optimum clustering results. We also integrated our method with other commonly used methods in an RShiny application. SUMA can be used in a local environment (https://github.com/hkarakurt8742/SUMA) or as a browser tool (https://hkarakurt.shinyapps.io/suma/).

Materials and methods: Publicly available scRNA-Seq datasets and 3 different graph-based clustering algorithms were used to develop SUMA. A wide range of values for the number of neighbours and variant genes was taken into consideration. The quality of clustering was assessed using the Adjusted Rand Index (ARI) and the true labels of each dataset. The data were split into training and test datasets, and the model was built and optimized using the Scikit-learn (Python) and RandomForest (R) libraries.

Results: The accuracy of our machine learning model is 0.96, and the area under the ROC curve (AUC) is 0.98. The model indicated that the number of cells in the scRNA-Seq data is the most important feature when deciding the number of neighbours.

Conclusion: We developed and evaluated the SUMA model and implemented the method in the SUMAShiny app, which integrates SUMA with different clustering methods and enables non-bioinformatician users to cluster and visualize their scRNA-Seq data easily. The SUMAShiny app is available for both desktop and browser use.
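As an illustration of the evaluation step described in the methods, the following minimal Python sketch computes the Adjusted Rand Index from its standard pair-counting definition and picks the candidate number of neighbours whose clustering agrees best with the true labels. It is not taken from the SUMA code base: the function name `adjusted_rand_index`, the cell type labels, the candidate neighbour counts, and the cluster assignments are all hypothetical toy values, and the published tool may well rely on a library implementation of ARI instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label vectors must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs of cells to compare
    # Count co-clustered pairs in the contingency table and in each labeling.
    pairs_joint = sum(
        comb(c, 2) for c in Counter(zip(true_labels, predicted_labels)).values()
    )
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted_labels).values())
    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (pairs_joint - expected) / (maximum - expected)


# Hypothetical true cell types for eight cells (toy data, not from the paper).
true_types = ["T", "T", "T", "B", "B", "B", "NK", "NK"]

# Hypothetical cluster assignments for three candidate numbers of neighbours.
clusterings_by_k = {
    5: [0, 1, 1, 2, 2, 3, 4, 4],   # over-split clusters
    15: [0, 0, 0, 1, 1, 1, 2, 2],  # matches the true types
    50: [0, 0, 0, 0, 0, 0, 1, 1],  # merged clusters
}

# Score every candidate and keep the one with the highest ARI.
scores = {k: adjusted_rand_index(true_types, labels)
          for k, labels in clusterings_by_k.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In a workflow like the one described in the abstract, a table of such scores across many datasets would provide the target for a model that predicts a suitable number of neighbours from dataset features such as the number of cells.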
Clustering, Machine Learning, Random Forest, RShiny, scRNA-Seq
Karakurt, Hamza Umut and Pir, Pınar
"SUMA: A Lightweight Machine Learning Model Powered Shared Nearest Neighbour Based Clustering Application Interface of scRNA-Seq,"
Turkish Journal of Biology: Vol. 47: No. 6, Article 8.
Available at: https://journals.tubitak.gov.tr/biology/vol47/iss6/8