Multi-view Clustering on Single-cell Data with Community Detection

Dayu Hu, Zhibin Dong, Ke Liang, Jun Wang, Siwei Wang, and Xinwang Liu, of the National University of Defense Technology in China, reported a method for identifying clusters in single-cell data.

Single-cell data, such as single-cell RNA sequencing (scRNA) and the single-cell Assay for Transposase-Accessible Chromatin (scATAC), contained valuable information about individual cells, but analyzing them across different views posed difficulties. One challenge was the discrepancy in data richness between views, which could degrade overall performance when traditional clustering methods were used. Another was that the number of clusters had to be specified manually, a daunting task for biologists dealing with single-cell data.

To address these challenges, the study proposed a novel approach called scUNC. Its main objective was to cluster single-cell data from different views accurately without a predefined number of clusters. It used a cross-view fusion network to combine information from the views, automatically allocating weights according to the information richness of each view. Additionally, it used community detection to generate initial clusters and a dip-test to merge them iteratively until convergence, eliminating the need for manual cluster specification.

The study evaluated scUNC using three single-cell datasets, demonstrating its superior performance compared to baseline methods. These datasets included BMNC, SMAGE-10K, and SMAGE-3K, which contained varying sample sizes and numbers of clusters. The evaluation metrics used included the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Purity (PUR), and Accuracy (ACC).
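To make two of these metrics concrete, here is a minimal sketch of Purity and ARI using only the Python standard library. The function names and the toy labels are illustrative assumptions, not code from the paper; ACC and NMI are left out because ACC additionally needs an optimal label matching step and NMI needs entropy estimates.

```python
from collections import Counter
from math import comb

def purity(true_labels, pred_labels):
    """Fraction of cells carrying the majority true label of their predicted cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in clusters.values())
    return majority_total / len(true_labels)

def adjusted_rand_index(true_labels, pred_labels):
    """Pair-counting agreement between two labelings, corrected for chance."""
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells that share both the true label and the predicted label.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy example (made-up labels): one cell of type 0 lands in the wrong cluster.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]
pur = purity(true_labels, pred_labels)
ari = adjusted_rand_index(true_labels, pred_labels)
```

Purity rewards clusters dominated by a single cell type but does not penalize over-splitting, whereas ARI compares all pairs of cells, so the two metrics can disagree on the same clustering.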

How did they build their algorithm?

The authors described their proposed scUNC framework for integrating single-cell RNA sequencing (scRNA) and single-cell ATAC sequencing (scATAC) data. The framework aimed to assign optimal weights to each view and effectively fuse the information from both views. One key advantage of the framework was that it eliminated the need for manual specification of the number of clusters, which was beneficial for biologists conducting cell cluster analysis.

The framework started by excluding outlier cells and then used multiple autoencoders to transform the original feature matrices into low-dimensional representations. These representations were then concatenated to form a shared embedding. Because the scRNA and scATAC views differed in information richness, the authors proposed a Cross-View Fusion Network (CVFN) that corrected this imbalance by assigning each view a weight, based on its information richness, during fusion.
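The idea of weighted fusion can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say how the CVFN computes its weights, so the per-view `scores` here are hypothetical inputs turned into weights with a softmax, and random arrays stand in for the autoencoder outputs.

```python
import numpy as np

def fuse_views(embeddings, scores):
    """Scale each view's embedding by a softmax weight, then concatenate.

    `scores` is a hypothetical per-view "information richness" score;
    in the actual model the weights would be learned by the network.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    fused = np.concatenate([w * z for w, z in zip(weights, embeddings)], axis=1)
    return fused, weights

# Stand-ins for low-dimensional autoencoder outputs (5 cells per view).
rng = np.random.default_rng(0)
z_rna = rng.normal(size=(5, 4))   # assumed scRNA embedding
z_atac = rng.normal(size=(5, 3))  # assumed scATAC embedding
fused, weights = fuse_views([z_rna, z_atac], scores=[1.0, 0.0])
```

With these made-up scores the scRNA view receives the larger weight, so its dimensions dominate distances computed in the shared embedding.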

Instead of using the traditional k-means algorithm, the authors employed community detection to form initial clusters. Community detection assigns the nodes of a graph to communities based on their neighbor relationships, which made it suitable for analyzing single-cell data. The authors then proposed an iterative process, inspired by the dip-test (a statistical test of unimodality), that merged clusters based on their structural similarity.
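A minimal sketch of graph-based initial clustering is shown below. It builds a k-nearest-neighbor graph over cell embeddings and runs a simple label propagation pass as the community detection step. Label propagation is a stand-in chosen for brevity; the summary does not name the algorithm the authors used, and the helper names and toy points are assumptions.

```python
import numpy as np
from collections import Counter

def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph, returned as a list of neighbor sets."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    neighbors = [set() for _ in range(len(points))]
    for i, row in enumerate(nearest):
        for j in row:
            neighbors[i].add(int(j))
            neighbors[int(j)].add(i)
    return neighbors

def label_propagation(neighbors, max_iters=100):
    """Each node repeatedly adopts the most common label among its neighbors."""
    labels = list(range(len(neighbors)))  # start with one community per node
    for _ in range(max_iters):
        changed = False
        for node, nbrs in enumerate(neighbors):
            if not nbrs:
                continue
            counts = Counter(labels[j] for j in nbrs)
            top = max(counts.values())
            best = min(lab for lab, c in counts.items() if c == top)  # deterministic tie-break
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Two well-separated toy groups of four "cells" each.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
labels = label_propagation(knn_graph(points, k=3))
```

Note that the number of communities is never passed in: it emerges from the graph structure, which is the property that lets the workflow avoid a predefined cluster count.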

The CVFN network and community detection process were combined into an overall optimization module, which included a reconstruction loss that measured the difference between the reconstructed data and the input data. The framework aimed to generate high-quality representations and clusters from the integrated scRNA and scATAC data.

Relying solely on the reconstruction loss was not sufficient to impose enough constraints on the cell representations. Therefore, the authors introduced a clustering loss to facilitate joint optimization. In essence, their model refined the embeddings by minimizing the distance between each cell representation and its assigned cluster center. Consequently, throughout the optimization procedure, clusters with high dip-scores progressively moved closer together. This behavior matched the design principle of their workflow, which iteratively merged similar clusters. Furthermore, they incorporated a Dc-based standard deviation term, intended to preserve the scale of the embedding while pushing distinct clusters farther apart.
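The core of such a clustering loss can be sketched as the mean squared distance between each cell embedding and the center of its assigned cluster. This is an assumption-laden simplification: centers are computed here as plain cluster means, and the Dc-based standard deviation term is omitted because the summary does not define it.

```python
import numpy as np

def clustering_loss(embeddings, labels):
    """Mean squared distance from each embedding to its assigned cluster center."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for lab in np.unique(labels):
        members = embeddings[labels == lab]
        center = members.mean(axis=0)  # assumed: center is the cluster mean
        total += ((members - center) ** 2).sum()
    return total / len(embeddings)

# Toy embeddings: two pairs of nearby cells.
z = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
tight = clustering_loss(z, [0, 0, 1, 1])  # assignments match the geometry
loose = clustering_loss(z, [0, 1, 0, 1])  # assignments cut across it
```

Minimizing this quantity with respect to the embeddings pulls cells toward their centers, which is why it constrains the representation more than reconstruction alone.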

The final loss function was a combination of the clustering loss and the reconstruction loss, with hyperparameters λ1 and λ2 used to balance the two losses. The complete clustering procedure involved the collaboration of the optimization module and the automated merging module. After obtaining the fused cell representation, initial clusters were generated using a community detection algorithm. These clusters were then evaluated based on the dip-test. Highly correlated clusters were merged together. The optimization process and merging process operated alternately and mutually reinforced each other until no further merging could be done. This automated clustering algorithm eliminated the need for manual parameter configuration and produced high-quality clusters by bringing similar clusters closer and merging them.
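The merging half of this procedure can be sketched as a loop that keeps merging the best-scoring pair of clusters until no pair passes a threshold. Everything specific here is an assumption for illustration: `toy_merge_score` is a simple center-distance stand-in and not the dip-test, the threshold is arbitrary, and the re-optimization of embeddings between merges is left out.

```python
import numpy as np

def toy_merge_score(a, b):
    """Stand-in for a dip-test score: higher means the clusters look more like one group."""
    gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    return 1.0 / (1.0 + gap)

def merge_until_stable(clusters, score_fn, threshold):
    """Repeatedly merge the highest-scoring pair until no pair reaches the threshold."""
    clusters = [np.asarray(c, dtype=float) for c in clusters]
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score_fn(clusters[i], clusters[j])
                if s > best_score:
                    best_score, best_pair = s, (i, j)
        if best_score < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        # In the full framework, embeddings would be re-optimized here before re-scoring.
    return clusters

# Three toy initial clusters: the first two are close, the third is far away.
initial = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[2.0, 0.0], [3.0, 0.0]],
    [[20.0, 0.0], [21.0, 0.0]],
]
result = merge_until_stable(initial, toy_merge_score, threshold=0.2)
```

The loop ends on its own once the remaining clusters are mutually dissimilar, so the final number of clusters is an output of the procedure rather than an input.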

How is the performance?

The authors presented the performance comparison between the proposed scUNC method and other baseline methods. The results showed that scUNC consistently outperformed the other methods in various evaluation metrics, achieving first place in 8 out of 12 evaluations and ranking within the top two in 11 of them. The slight decrease in the PUR metric on the SMAGE-3K dataset was attributed to potential class imbalance issues. Their paper also included visualizations of embeddings generated by scUNC and models with removed modules, highlighting the superior dispersion and cluster separation achieved by scUNC.

To validate the effectiveness of the proposed modules, the paper conducted ablation experiments on two sets of model variants. The results showed that all three modules (CVFN network, clustering loss, and reconstruction loss) contributed significantly to the overall performance of scUNC. Removing any of these modules led to a decrease in performance, indicating their importance in optimizing the model. Additionally, another set of ablation experiments verified the performance enhancement provided by the automatic merging module. The results demonstrated that the merging module greatly improved clustering performance, emphasizing its critical role in the scUNC model.

Furthermore, the paper evaluated the generalization capabilities of scUNC on a non-cellular multi-view dataset and compared it with other competing methods designed for single-cell data. The results showed that scUNC achieved excellent performance on the non-cellular dataset, demonstrating its strong generalization capabilities and potential for extension to various scenarios.

In conclusion, the scUNC model presented in the paper was a multi-view clustering (MVC) framework tailored for single-cell data that did not require the number of clusters K to be specified in advance (a "No-K" framework). It effectively addressed the disparity in information richness between cell views and incorporated automatic clustering and merging modules. Extensive experimental results supported the superiority and generalization capabilities of scUNC on both single-cell and non-cellular data. The paper also analyzed the hyperparameters, convergence, and stability of the scUNC model, providing further insight into its performance and effectiveness.