CASCAT is a tree-shaped structural causal model that uses the local Markovian property between clusters and conditional independences to infer a unique cell differentiation trajectory, overcoming Markov equivalence in high-dimensional, non-linear data. CASCAT eliminates redundant links between spatially close but independent cells, creating a causal cell graph that enhances the accuracy of existing spatial clustering algorithms.
Installation can be completed within a few minutes.
- Install Miniconda if not already available.
- Create a new cascat environment, activate it, and install the basic packages.
conda create -n cascat python==3.10 -y
conda activate cascat
- Install PyTorch and PyG. To select the appropriate versions, refer to the official PyTorch and PyG websites. The following commands are for CUDA 11.8.
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
pip install torch_geometric==2.6.1 pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install scanpy==1.10.1 matplotlib networkx scikit-misc pydot pot numpy==1.26.4 scikit-learn==1.5.2
pip install numba==0.60.0 numba-scipy==0.4.0 pandas==2.2.3 scipy==1.11.0 pyyaml==6.0.3
- (Optional) Install CuPy on Linux and Windows. On macOS, NumPy will be used by default, but performance will be slower.
pip install cupy-cuda11x
- (Optional) Install R to generate simulated data.
conda create -n r_env r-essentials r-base -y;
conda activate r_env
conda install r-mclust
export R_HOME='/home/yourname/miniconda3/envs/r_env/lib/R'
export rScript='/home/yourname/miniconda3/envs/r_env/bin/Rscript'
We provide the example dataset tree1 in ./data/tree1/. Other simulated and real datasets are hosted on Figshare.
python main.py --YAML ./config/tree1.yml --mode train --verbose True
The output of CASCAT is a new AnnData object, data_processed.h5ad, under ./result, with the following information stored within it:
- adata.obs['cascat_clusters']: the predicted cluster labels.
- adata.obsm['cascat_embedding']: the generated low-dimensional cell embeddings.
- adata.uns['cascat_connectivities']: the inferred trajectory topology connectivities.
- adata.uns['CMI']: the inferred conditional mutual information matrix for each cluster.
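As a quick check after a run, the stored fields can be summarised with a small helper. This is a minimal sketch, not part of CASCAT: `summarize_cascat_output` is a hypothetical name, and it only assumes the four keys listed above on an AnnData-like object exposing `obs`, `obsm`, and `uns`.

```python
from collections import Counter


def summarize_cascat_output(adata):
    """Summarise the fields CASCAT is documented to write.

    `adata` is any AnnData-like object exposing `obs`, `obsm`, and `uns`
    (hypothetical helper, not part of the CASCAT code base).
    """
    labels = list(adata.obs["cascat_clusters"])
    embedding = adata.obsm["cascat_embedding"]
    return {
        # Number of cells assigned to each predicted cluster.
        "cluster_sizes": dict(Counter(labels)),
        # Number of columns in the low-dimensional embedding.
        "embedding_dim": len(embedding[0]),
        # Whether the trajectory topology and CMI entries are present.
        "has_connectivities": "cascat_connectivities" in adata.uns,
        "has_cmi": "CMI" in adata.uns,
    }
```

With anndata installed, the object would typically be loaded with `anndata.read_h5ad("./result/data_processed.h5ad")` before being passed to the helper; the exact path depends on the directories set in the YAML file.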
The YAML files for all datasets are stored in the config/yaml/CMI folder, and the scripts for the comparison methods are located in the submodules folder.
To run CASCAT, follow the steps below:
CASCAT takes AnnData formatted input, stored in .h5ad files, where obs contains cell/spot information and var holds
gene annotations.
To use the data, place it in a folder, then update the adata_file field in the tree1.yml configuration to reflect
the relative path to the data.
- Update the parameters in ./config/tree1.yml:
  - CMI_dir: the directory for storing the causal cell graph outputs.
    - We have accelerated the computation using GPUs, completing the analysis of 2000 cells within 3 minutes.
    - We have provided the pre-calculated CMI values between cells on Google Drive.
  - percent: the percentage of the causal cell graph to be removed.
    - The default is 0.1 for scRNA-seq datasets and 0.15 for ST datasets.
- To run CASCAT and obtain the clustering result, execute the following command:
  python main.py --YAML ./config/tree1.yml --mode train --verbose True
  - Storing the ground-truth or predicted cluster labels in adata.obs['cluster'] is recommended.
  - Note: To access the clustering metrics, set verbose=True and store the ground-truth cluster labels in adata.obs['cluster'].
- Store the obs_index of the root cell in adata.uns['start_id'].
  - Note: If the root cell is unknown, follow run_cellrank2.py to set a predicted root cell.
- Update the parameters in ./config/tree1.yml:
  - emb_path: the path to the clustering embedding.
  - job_dir: the directory for storing the clustering output.
  - output_dir: the directory for storing the trajectory output.
  - root: the cluster_id of start_id.
  - nclass: the number of clusters.
- To run CASCAT and obtain only the trajectory result, execute the following command:
  python main.py --YAML ./config/tree1.yml --mode infer
  - Note: To access the TI metrics, store the true pseudo-time labels in adata.uns['timecourse'] and the trajectory topology in adata.uns['milestone_network'].
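The `root` and `nclass` values for the YAML file can be derived from the cluster labels and the root cell instead of being looked up by hand. The sketch below assumes the labels are available as a plain mapping from obs_index to cluster_id (for example, a pandas column converted with `.to_dict()`); `derive_root_and_nclass` and the toy cell names are illustrative and do not come from the CASCAT code base.

```python
def derive_root_and_nclass(cluster_labels, start_id):
    """Return (root, nclass) from per-cell cluster labels and a root cell.

    cluster_labels: mapping from obs_index to cluster_id.
    start_id: obs_index of the root cell, as stored in adata.uns['start_id'].
    """
    if start_id not in cluster_labels:
        raise KeyError(f"start_id {start_id!r} is not a known obs_index")
    # root is the cluster that contains the root cell.
    root = cluster_labels[start_id]
    # nclass is the number of distinct clusters.
    nclass = len(set(cluster_labels.values()))
    return root, nclass


# Toy example with made-up cell names and labels.
labels = {"cell_0": 2, "cell_1": 0, "cell_2": 2, "cell_3": 1}
root, nclass = derive_root_and_nclass(labels, "cell_0")
print(root, nclass)  # prints "2 3"
```

The two returned values correspond to the `root` and `nclass` fields in ./config/tree1.yml.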
To visualize the results, refer to the Visualization.ipynb notebook.
We've implemented a Python version of InformationMeasures.jl, enhanced with a kernel function.
Consult the InfoMeasure.ipynb for usage details.
In addition, we also provide a GPU version implemented with CuPy, as well as a parallel version implemented with Numba to accelerate the computation of conditional mutual information.
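For reference, the conditional mutual information of discrete variables is I(X;Y|Z) = Σ p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]. The sketch below is a plain-Python plug-in (frequency-count) estimator of that quantity. It is not the kernel-enhanced estimator or the CuPy/Numba implementations in this repository, and the function name `conditional_mutual_information` is chosen here for illustration only.

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, zs, base=2.0):
    """Plug-in estimate of I(X; Y | Z) from paired discrete samples."""
    if not (len(xs) == len(ys) == len(zs)):
        raise ValueError("xs, ys, and zs must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Joint and marginal frequency counts.
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)

    cmi = 0.0
    for (x, y, z), count in n_xyz.items():
        # p(z) p(x,y,z) / (p(x,z) p(y,z)); the 1/n factors cancel.
        ratio = (n_z[z] * count) / (n_xz[(x, z)] * n_yz[(y, z)])
        cmi += (count / n) * math.log(ratio, base)
    return cmi
```

Two sanity checks follow from the definition: when Y copies a uniform binary X and Z is constant, the estimate equals one bit; when X, Y, and Z are all identical, conditioning on Z leaves nothing to share and the estimate is zero.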

