Turkish Journal of Biology

DOI

10.55730/1300-0152.2676

Abstract

Background/aim: Single-cell transcriptomics (scRNA-Seq) explores cellular diversity at the gene expression level. Because scRNA-Seq data are inherently sparse and noisy, and the types of the sequenced cells are uncertain, effective clustering and cell type annotation are essential. Graph-based clustering of scRNA-Seq data is a simple yet powerful approach that represents the data as a “shared nearest neighbour” graph and clusters the cells using graph clustering algorithms. These algorithms depend on several user-defined parameters. Here we present SUMA, a lightweight tool that uses a random forest model to predict the number of neighbours that yields the best clustering results. We also integrated our method with other commonly used methods in an RShiny application. SUMA can be used in a local environment (https://github.com/hkarakurt8742/SUMA) or as a browser tool (https://hkarakurt.shinyapps.io/suma/).

Materials and methods: Publicly available scRNA-Seq datasets and 3 different graph-based clustering algorithms were used to develop SUMA, and a wide range of values for the number of neighbours and variant genes was taken into consideration. The quality of clustering was assessed using the Adjusted Rand Index (ARI), computed against the true labels of each dataset. The data were split into training and test datasets, and the model was built and optimized using the Scikit-learn (Python) and RandomForest (R) libraries.

Results: The accuracy of our machine learning model is 0.96, and the area under the ROC curve (AUC) is 0.98. The model indicated that the number of cells in the scRNA-Seq data is the most important feature when deciding the number of neighbours.

Conclusion: We developed and evaluated the SUMA model and implemented the method in the SUMAShiny app, which integrates SUMA with different clustering methods and enables non-bioinformatician users to cluster and visualize their scRNA-Seq data easily. The SUMAShiny app is available for both desktop and browser use.
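
The evaluation step described in the abstract (scanning a range of neighbour counts and scoring each clustering with the ARI against true labels) can be illustrated with a minimal Python sketch. This is not SUMA's code: the data are synthetic (`make_blobs`), the function name `ari_by_neighbours` and the candidate values are hypothetical, and scikit-learn's kNN-graph-constrained agglomerative clustering stands in for the shared nearest neighbour graph clustering algorithms used in the study.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import kneighbors_graph


def ari_by_neighbours(X, true_labels, candidate_ks, n_clusters):
    """Return {k: ARI} for clusterings built on k-nearest-neighbour graphs."""
    scores = {}
    for k in candidate_ks:
        # Sparse kNN connectivity graph (assumption: a simple stand-in
        # for the shared nearest neighbour graph).
        graph = kneighbors_graph(X, n_neighbors=k, include_self=False)
        # Agglomerative clustering restricted to merges along graph edges.
        model = AgglomerativeClustering(
            n_clusters=n_clusters, connectivity=graph, linkage="ward"
        )
        predicted = model.fit_predict(X)
        # ARI compares the predicted partition with the known labels.
        scores[k] = adjusted_rand_score(true_labels, predicted)
    return scores


# Synthetic stand-in for an expression matrix:
# 300 "cells", 20 "genes", 4 "cell types".
X, y = make_blobs(n_samples=300, n_features=20, centers=4, random_state=0)

scores = ari_by_neighbours(X, y, candidate_ks=[5, 10, 20, 40], n_clusters=4)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In a setting like the one the abstract describes, the neighbour count with the highest ARI for each dataset could serve as the training target, with dataset properties such as the number of cells as input features for the random forest; the exact feature set and labelling scheme are not given here and would need to be taken from the full paper.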

Keywords

Clustering, Machine Learning, Random Forest, Rshiny, scRNA-Seq

First Page

413

Last Page

422

Included in

Biology Commons
