10X single-cell and spatial data analysis: characterizing cell states and ecotypes (EcoTyper example code)

2024-12-27 05:14

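Day four of quarantine. Days spent alone are always hard to get through, yet solitude keeps finding its way to me. In this post we share the EcoTyper example code, which helps us find cell type-specific transcriptional states and their co-association patterns. The software was published in Cell; the paper itself was covered in the earlier post "10X single-cell and spatial data analysis: characterizing cell states and ecotypes (EcoTyper)". Here we look at how the software is actually used, with most of our attention on the single-cell and spatial data parts.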

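EcoTyper is a machine learning framework for the large-scale identification of cell type-specific transcriptional states and their co-association patterns from bulk and single-cell (scRNA-seq) expression data.

The software ships with cell states and ecotypes already defined in carcinoma (Luca/Steen et al., Cell 2021) and in diffuse large B cell lymphoma (DLBCL) (Steen/Luca et al., Cancer Cell 2021). The current version of EcoTyper allows users to recover the cell states and ecotypes for these two tumor categories in their own data. In addition, it allows users to discover and recover cell states and ecotypes in their system of interest, including directly from scRNA-seq data.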

Installing EcoTyper

Basic resources

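The R packages listed below are required for running EcoTyper. The version numbers indicate the package versions used for developing and testing the EcoTyper code. Other R versions might work too: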

R (v3.6.0 and v4.1.0).
R packages: ComplexHeatmap (v2.2.0 and v2.8.0), NMF (v0.21.0 and v0.23.0), RColorBrewer (v1.1.2), cluster (v2.1.0 and v2.1.2), circlize (v0.4.10 and v0.4.12), cowplot (v1.1.0 and v1.1.1), data.table (base package R v3.6.0 and v4.1.0), doParallel (v1.0.15 and v1.0.16), ggplot2 (v3.3.2 and v3.3.3), grid (base package R v3.6.0 and v4.1.0), reshape2 (v1.4.4), viridis (v0.5.1 and v0.6.1), config (v0.3.1), argparse (v2.0.3), colorspace (v1.4.1 and v2.0.1), plyr (v1.8.6).

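These packages, together with the other resources pre-stored in the EcoTyper folder, allow you to:

- perform the recovery of previously defined cell states and ecotypes in your own bulk RNA-seq, microarray and scRNA-seq data
- perform cell state and ecotype discovery in scRNA-seq and pre-sorted cell type-specific profiles

Here, of course, we focus on the analysis of single-cell and spatial data.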

When the input is scRNA-seq or bulk-sorted cell type-specific profiles (e.g., FACS-purified), EcoTyper performs the following major steps for discovering cell states and ecotypes:

Gene filtering: This step filters out genes that do not show cell type specificity.
Cell state discovery: This step enables identification and quantitation of cell type-specific transcriptional states.
Ecotype discovery: This step enables co-assignment of cell states into multicellular communities (ecotypes).

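Regardless of the input type used for deriving cell states and ecotypes, EcoTyper can perform cell state and ecotype recovery in external expression datasets. Recovery can be performed in bulk, scRNA-seq and spatial transcriptomics data.

Let's look at the first example: recovering cell states and ecotypes in user-provided scRNA-seq data.

EcoTyper comes pre-loaded with the resources needed for the reference-guided recovery, in user-provided scRNA-seq data, of the cell states and ecotypes previously defined in carcinoma and lymphoma.

First, let's look at the main script, EcoTyper_recovery_scRNA.R

Run it:

Arguments: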

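-d/--discovery: The name of the discovery dataset used for defining cell states. By default, the only accepted values are Carcinoma and Lymphoma (case sensitive), which will recover the cell states that we already defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data, the name of the discovery dataset is the value provided in the Discovery dataset name field of the configuration file used for running cell state discovery. For this tutorial, we set the name of the discovery dataset to Carcinoma.
-m/--matrix: Path to the input scRNA-seq matrix. The scRNA-seq expression matrix should be a tab-delimited file, with gene symbols on the first column and cells on the next columns. It should have cell identifiers (e.g. barcodes) as column names, and should be in TPM, CPM, FPKM or any other suitable count format. Gene symbols and cell identifiers should be unique. Also, we recommend that the column names do not contain special characters that are modified by the R function make.names, e.g. having digits at the beginning of the name or containing characters such as space, tab or -. The CRC scRNA-seq data used in this tutorial looks as follows:

-a/--annotation: Path to a tab-delimited annotation file. This file should contain at least two columns: ID, with the same values as the columns of the expression matrix, and CellType (case sensitive), which contains the cell type for each cell. The values in column CellType should indicate the cell type of each cell and are limited to the set of cell types analyzed in the discovery dataset. If the argument -d/--discovery is set to Carcinoma, the accepted values for column CellType are: 'B.cells', 'CD4.T.cells', 'CD8.T.cells', 'Dendritic.cells', 'Endothelial.cells', 'Epithelial.cells', 'Fibroblasts', 'Mast.cells', 'Monocytes.and.Macrophages', 'NK.cells', 'PCs' and 'PMNs'. If the argument -d/--discovery is set to Lymphoma, the accepted values for column CellType are: 'B.cells', 'Plasma.cells', 'T.cells.CD8', 'T.cells.CD4', 'T.cells.follicular.helper', 'Tregs', 'NK.cells', 'Monocytes.and.Macrophages', 'Dendritic.cells', 'Mast.cells', 'Neutrophils', 'Fibroblasts' and 'Endothelial.cells'. In both cases, all other values are ignored. The annotation file can contain a column called Sample. If this column is present, ecotype recovery is performed in addition to cell state recovery. The file can also contain any number of additional columns, which can be used for plotting color bars in the output heatmaps (see argument -c/--columns).

-c/--columns: A comma-separated list of column names from the annotation file (see argument -a/--annotation) to be plotted as color bars in the output heatmaps. By default, the output heatmaps contain as color bar the cell state label each cell is assigned to. The column names indicated by this argument are added to that color bar.
-z/--z-score: A flag indicating whether the significance quantification procedure should be run (default is FALSE). This procedure allows users to determine whether cell states are significantly recovered in a given dataset. Please note that this procedure can be very slow, as the NMF model is applied 30 times on the same dataset.
-s/--subsample: An integer specifying the number of cells each cell type will be downsampled to. For values <50, no downsampling will be performed. Default: -1 (no downsampling).
-t/--threads: Number of threads. Default: 10.
-o/--output: Output folder. The output folder will be created if it does not exist.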
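
Before launching a run, the requirements on the matrix and annotation inputs can be checked with a few lines of code. The sketch below is a Python illustration and not part of EcoTyper: the function names are hypothetical, and the regular expression is only an approximation of the names that R's make.names leaves unchanged (reserved words are ignored).

```python
import re

# Approximation (an assumption of this sketch) of a name that R's make.names
# would leave unchanged: only letters, digits, '.' and '_', starting with a
# letter or with a '.' that is not followed by a digit.
_R_SAFE_NAME = re.compile(r"^(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*$")


def check_cell_ids(cell_ids):
    """Return (duplicated IDs, IDs that make.names would probably alter)."""
    seen, duplicated = set(), []
    for cell_id in cell_ids:
        if cell_id in seen:
            duplicated.append(cell_id)
        seen.add(cell_id)
    unsafe = [cell_id for cell_id in cell_ids if not _R_SAFE_NAME.match(cell_id)]
    return duplicated, unsafe


def check_annotation(rows, matrix_ids, accepted_cell_types):
    """rows: list of dicts with at least the keys 'ID' and 'CellType'.

    Returns (IDs absent from the matrix, IDs whose CellType is not accepted).
    """
    known_ids = set(matrix_ids)
    missing = [row["ID"] for row in rows if row["ID"] not in known_ids]
    ignored = [row["ID"] for row in rows
               if row["CellType"] not in accepted_cell_types]
    return missing, ignored


if __name__ == "__main__":
    # Hypothetical cell identifiers and annotation rows, for illustration only.
    ids = ["cell_1", "1_cell", "cell-3", "cell_1"]
    print(check_cell_ids(ids))
    rows = [{"ID": "cell_1", "CellType": "B.cells"},
            {"ID": "cell_9", "CellType": "Unknown"}]
    print(check_annotation(rows, ids, {"B.cells", "CD4.T.cells", "PCs"}))
```

Cells reported in the second list of check_annotation correspond to the values that, according to the description of -a/--annotation, are ignored during recovery.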

Running it on the single-cell data:

The outputs of this script include the following files, for each cell type provided:

The assignment of single cells to states:

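Two heatmaps: one showing the expression of the cell state marker genes in the discovery dataset, and one showing the expression of the same marker genes in the scRNA-seq dataset, smoothed to mitigate the impact of scRNA-seq dropout:

If the statistical significance quantification method is applied, the resulting z-scores for each cell state are written to the same directory: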

The output for ecotypes includes:

The abundance (fraction) of each ecotype in each sample:

The assignment of samples to the carcinoma ecotype with the highest abundance. If the cell state fractions from the dominant ecotype are not significantly higher than the other cell state fractions in a given sample, the sample is considered unassigned and filtered out from this table:

A heatmap of cell state abundances across the samples assigned to ecotypes. Rows correspond to the cell states forming ecotypes, while columns correspond to the samples assigned to ecotypes:

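Example 2: Recovery of Lymphoma Cell States and Ecotypes in scRNA-seq Data

Output: the assignment of single cells to states:

Heatmaps:

Recovery of Cell States and Ecotypes in Spatial Transcriptomics data (the spatial transcriptomics example), which here mainly means 10x Genomics Visium data

For EcoTyper to perform cell state and ecotype recovery in Visium data, the following resources need to be available: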

- the filtered feature-barcode matrix files, in the format provided by 10x Genomics, and the file produced by the run summary images pipeline that contains the spatial position of barcodes.

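If the major cell populations expected in the system being analyzed are recapitulated by the cell populations analyzed in the EcoTyper carcinoma paper (B cells, CD4 T cells, CD8 T cells, dendritic cells, endothelial cells, epithelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils) or in the EcoTyper lymphoma paper (B cells, CD4 T cells, CD8 T cells, follicular helper T cells, Tregs, dendritic cells, endothelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), then the following are needed: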

Docker
Docker containers for CIBERSORTx Fractions and CIBERSORTx HiRes modules, both of which can be obtained from the CIBERSORTx website. Please follow the instructions from the website to install them.
A token required for running the docker containers, which can also be obtained from the CIBERSORTx website.

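If the major cell populations expected in the system being analyzed are not recapitulated by the cell populations analyzed in the EcoTyper carcinoma paper (B cells, CD4 T cells, CD8 T cells, dendritic cells, endothelial cells, epithelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils) or in the EcoTyper lymphoma paper (B cells, CD4 T cells, CD8 T cells, follicular helper T cells, Tregs, dendritic cells, endothelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), then the user needs to provide their own cell type proportion estimates for these populations.

First, the main script:

Run it: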

The configuration file

This script takes as input file a configuration file in YAML format. The configuration file for this tutorial is available in :

The configuration file has three sections, Input, Pipeline settings, and Output. We next will describe the expected content in each of these sections, and instruct the user how to set the appropriate settings in their applications.

Input section

The Input section contains settings regarding the input data.

Discovery dataset name

Discovery dataset name should contain the name of the discovery dataset used for defining cell states. By default, the only accepted values are Carcinoma and Lymphoma (case sensitive), which will recover the cell states that we defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data (Tutorials 4-6), the name of the discovery dataset is the value provided in the Discovery dataset name field of the configuration file used for running discovery. For this tutorial, we set the name of the discovery dataset to Carcinoma.

Recovery dataset name

Recovery dataset name is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes abundances. Any value that contains alphanumeric characters and ’_’ is accepted for this field.

Input Visium directory

There are 4 input files needed for recovery on the Visium data:

The filtered feature-barcode matrix files, in the format provided by 10x Genomics, and the file produced by the run summary images pipeline that contains the spatial position of barcodes.

Recovery cell type fractions

Recovery cell type fractions should contain the path to a file containing the cell type fraction estimations for each spot on the visium array. This field is ignored when the discovery dataset is Carcinoma or Lymphoma or when the discovery has been performed as described in Tutorial 4, using Carcinoma_Fractions or Lymphoma_Fractions. It is only used when users provided their own cell type fractions for deriving cell states and ecotypes in Tutorial 4. In this case, the user needs to provide a path to a tab-delimited file for this field. The file should contain in the first column the same sample names used as column names in the input expression matrix, and in the next columns, the cell type fractions for the same cell populations used for discovering cell states and ecotypes. These fractions should sum up to 1 for each row. An example of such a file is provided in:
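
A user-provided fractions file can be sanity-checked against the two requirements stated above (sample names in the first column, fractions summing to 1 in each row). The following is a minimal Python sketch, not EcoTyper code; the function name, the header label Mixture, the spot names and the tolerance are assumptions made for illustration, and the inline text is a made-up example rather than the example file shipped with EcoTyper.

```python
import csv
import io
import math


def check_fractions(tsv_text, tolerance=1e-6):
    """Return (cell type columns, names of rows whose fractions do not sum to 1)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)  # first column: spot/sample name; the rest: cell types
    bad_rows = []
    for row in reader:
        if not row:  # skip empty lines
            continue
        total = sum(float(value) for value in row[1:])
        if not math.isclose(total, 1.0, abs_tol=tolerance):
            bad_rows.append(row[0])
    return header[1:], bad_rows


if __name__ == "__main__":
    # Hypothetical content: two spots, two cell populations.
    example = (
        "Mixture\tFibroblasts\tB.cells\n"
        "spot_1\t0.7\t0.3\n"
        "spot_2\t0.5\t0.4\n"
    )
    cell_types, bad_rows = check_fractions(example)
    print(cell_types, bad_rows)
```

In practice the text would come from reading the tab-delimited file given in the Recovery cell type fractions field; any row listed in the second return value needs to be renormalized or corrected.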

Since in this tutorial we use the Carcinoma dataset as the discovery dataset, this field is not required. However, if it needs to be provided, it can be set as follows:

Malignant cell of origin

The cell of origin population for the cancer type being analyzed, amongst the cell types used for discovery. This field is used for plotting a gray background in the resulting output plot, with the intensity of gray depicting the abundance of the cell of origin population in each spot. It is not used when the discovery dataset is Carcinoma or Lymphoma or when the discovery has been performed as described in Tutorials 4-6, using Carcinoma_Fractions or Lymphoma_Fractions. In these cases, the malignant cells are automatically considered to be originating from Epithelial.cells or B.cells, respectively. Otherwise, this field needs to contain a column name in the file provided in Recovery cell type fractions field, corresponding to the appropriate cell type of origin.

CIBERSORTx username and token

The fields CIBERSORTx username and CIBERSORTx token should contain the username on the CIBERSORTx website and the token necessary to run the CIBERSORTx source code. The token can be obtained from the CIBERSORTx website.

The output section

The Output section contains a single field, Output folder, which specifies the path where the final output will be saved. This folder will be created if it does not exist.

Number of threads

The last section, Pipeline settings, contains only one argument, the number of threads used for performing recovery:

The command line

After editing the configuration file (), the command line for recovering the cell states and ecotypes in Visium Spatial Gene Expression data looks as illustrated below. Please note that this script might take up to two hours to run on 10 threads. Also, since CIBERSORTx is run on each spot, the memory requirements might exceed the memory available on a typical laptop. We recommend that this tutorial is run on a server with >32GB of RAM.

The output format

EcoTyper generates for each cell type the following outputs:

Cell state abundances:

Plots illustrating the cell state abundance across state from each cell type. The intensity of charcoal represents the cell state abundance. The intensity of gray represents the fraction of the cancer cell of origin population:

Fibroblasts_spatial_heatmaps.png

Ecotype abundances:

Plots illustrating the ecotype abundances. The intensity of charcoal represents the cell state abundance. The intensity of gray represents the fraction of the cancer cell of origin population:
VisiumOutput/VisiumBreast/Ecotype_spatial_heatmaps.png

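Example 3: De novo Discovery of Cell States and Ecotypes in scRNA-seq Data

This tutorial illustrates how to perform de novo identification of cell states and ecotypes starting from a scRNA-seq expression matrix. For illustration purposes, we use as discovery dataset a downsampled version of a colorectal cancer scRNA-seq dataset, available in example_data/scRNA_CRC_data.txt, together with the example annotation file example_data/scRNA_CRC_annotation.txt.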

Overview of the EcoTyper workflow for discovering cell states in scRNA-seq data

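EcoTyper derives cell states and ecotypes from scRNA-seq data in a sequence of steps:

1. Extraction of cell type-specific genes: Removing genes that are not specifically expressed in a given cell type is an important consideration for reducing the likelihood of identifying spurious cell states. By default, EcoTyper applies a filter for non-cell-type-specific genes before performing cell state discovery in scRNA-seq data. Specifically, it performs differential expression between the cells from a given cell type and the cells from all other cell types combined. For computational efficiency and a balanced representation of cell types, only up to 500 randomly selected cells per cell type are used in this step. Genes with a Q-value > 0.05 (two-sided Wilcoxon test, with Benjamini-Hochberg correction for multiple hypothesis testing) are filtered out from each cell type.
2. Cell state discovery on correlation matrices: EcoTyper leverages nonnegative matrix factorization (NMF) to identify transcriptionally defined cell states from single-cell expression profiles. However, NMF applied directly to scRNA-seq expression matrices may perform poorly, because scRNA-seq data are often sparse. Therefore, EcoTyper first applies NMF to the correlation matrix between each pair of cells from a given cell type. For computational efficiency, EcoTyper uses at most 2,500 randomly selected cells in this step. To satisfy the non-negativity requirement of NMF, the correlation matrices are individually processed using the posneg transformation. This function converts a correlation matrix Vi into two matrices, one containing only the positive values and the other containing only the negative values with the sign inverted. The two matrices are subsequently concatenated to produce Vi*.

For each cell type, EcoTyper applies NMF across a range of ranks (numbers of cell states), by default 2-20 states. For each rank, the NMF algorithm is applied multiple times (we recommend at least 50) with different starting seeds, for robustness.

3. Choosing the number of cell states: Cluster (state) number selection is an important consideration in NMF applications. We found that previous approaches that rely on minimizing error measures (e.g., RMSE, KL divergence) or optimizing information-theoretic metrics either failed to converge or were dependent on the number of imputed genes. In contrast, the cophenetic coefficient quantifies the classification stability for a given rank (i.e., the number of clusters) and ranges from 0 to 1, with 1 being maximally stable. While the rank at which the cophenetic coefficient starts decreasing is typically selected, this approach is hard to apply when the cophenetic coefficient exhibits a multi-modal shape across ranks, as we found for some cell types. Therefore, we developed a heuristic approach more suitable for such settings. In each case, the rank was automatically chosen based on the cophenetic coefficient evaluated in the range 2-20 clusters (by default). Specifically, we determined the first occurrence in the 2-20 interval for which the cophenetic coefficient dropped below 0.95 (by default), for at least two consecutive ranks above this level. We then selected the rank immediately adjacent to this crossing point that was closest to 0.95 (by default).
4. Extracting cell state information: The NMF output resulting from step 2 is parsed and the cell state information is extracted for downstream analyses.
5. Cell state re-discovery in expression matrices: After identifying cell states on the correlation matrices, EcoTyper performs differential expression to identify the genes most highly associated with each cell state. The resulting markers are ranked by fold change within each state, and the top 1000 genes ranked highest across cell states are selected for a new round of NMF. If fewer than 1000 genes are available, all genes are selected. Prior to NMF, each gene is scaled to mean 0 and unit variance. To satisfy the non-negativity requirement of NMF, the cell type-specific expression matrices are individually processed using the posneg transformation. This function converts an input expression matrix Vi into two matrices, one containing only the positive values and the other containing only the negative values with the sign inverted. The two matrices are subsequently concatenated to produce Vi*.
For each cell type, EcoTyper applies NMF only for the rank selected in step 3. As before, the NMF algorithm is applied multiple times (we recommend at least 50) with different starting seeds, for robustness.
6. Extracting cell state information: The NMF output resulting from step 5 is parsed and the cell state information is extracted for downstream analyses.
7. Cell state QC filter: Although the posneg transformation is required to satisfy the non-negativity constraint of NMF after standardization, it can lead to the identification of spurious cell states driven by features with more negative values than positive ones. To address this, we devised an adaptive false positive index (AFI), a new index defined as the ratio between the sum of the weights in the W matrix corresponding to the negative features and the sum of the weights corresponding to the positive features. EcoTyper automatically filters out states with AFI >= 1.
8. Ecotype (cellular community) discovery: Ecotypes, or cellular communities, are derived by identifying patterns of co-occurrence of cell states across samples. First, EcoTyper leverages the Jaccard index to quantify the degree of overlap between each pair of cell states across the samples in the discovery cohort. To this end, it discretizes each cell state q into a binary vector a of length l, where l is the number of samples in the discovery cohort. Collectively, these vectors form a binary matrix A, with as many rows as there are cell states across cell types and with l columns (samples). Given sample s, if state q is the most abundant state among all states in cell type i, EcoTyper sets A(q,s) to 1; otherwise A(q,s) ← 0. It then computes all pairwise Jaccard indices on the rows (states) of matrix A, yielding matrix J. Using a hypergeometric test, it evaluates the null hypothesis that any given pair of cell states q and k have no overlap. When the hypergeometric p-value is > 0.01, the Jaccard index J(q,k) is set to 0 (i.e., no overlap). To identify communities while accommodating outliers, the updated Jaccard matrix J' is hierarchically clustered using average linkage with Euclidean distance (hclust in the R stats package). The optimal number of clusters is then determined by silhouette width maximization. Clusters with fewer than 3 cell states are excluded from further analysis.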
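
The posneg transformation used in steps 2 and 5 and the AFI filter of step 7 can be illustrated on toy matrices. This is a plain-Python sketch under stated assumptions, not EcoTyper's R implementation: the function names are hypothetical, matrices are lists of rows with features in rows, and the positive block is assumed to be stacked on top of the negative block.

```python
def posneg(matrix):
    """Stack the positive part of a matrix on top of its sign-inverted negative part."""
    positive = [[max(0.0, value) for value in row] for row in matrix]
    negative = [[max(0.0, -value) for value in row] for row in matrix]
    return positive + negative  # twice as many rows, all values >= 0


def adaptive_false_positive_index(w, n_features):
    """AFI per state: weights on negative features divided by weights on positive features.

    w has 2 * n_features rows (positive block first, as returned by posneg)
    and one column per cell state.
    """
    afi = []
    for state in range(len(w[0])):
        positive_sum = sum(w[i][state] for i in range(n_features))
        negative_sum = sum(w[i][state] for i in range(n_features, 2 * n_features))
        afi.append(negative_sum / positive_sum if positive_sum > 0 else float("inf"))
    return afi


if __name__ == "__main__":
    # Toy standardized matrix: 2 features x 2 cells.
    v = [[1.5, -0.5],
         [-2.0, 0.0]]
    print(posneg(v))

    # Toy W matrix: 4 rows (2 positive + 2 negative features) x 2 states.
    w = [[0.8, 0.1],
         [0.6, 0.2],
         [0.1, 0.5],
         [0.1, 0.3]]
    afi = adaptive_false_positive_index(w, n_features=2)
    kept_states = [state for state, value in enumerate(afi) if value < 1]
    print(afi, kept_states)
```

In the toy W matrix the second state draws most of its weight from the negative block, so its AFI exceeds 1 and it would be dropped by a filter of the form AFI >= 1.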

Checklist before performing cell states and ecotypes discovery in scRNA-seq data

a user-provided scRNA-seq expression matrix, on which the discovery will be performed (a discovery cohort). For this tutorial, we will use the example data in example_data/scRNA_CRC_data.txt.
a sample annotation file, such as the one provided in example_data/scRNA_CRC_annotation.txt, with at least three columns: ID, CellType and Sample.

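Cell states and ecotypes discovery in scRNA-seq data. First, let's look at the main script, EcoTyper_discovery_scRNA.R

This script takes as input a configuration file in YAML format. The configuration file for this tutorial is available in config_discovery_scRNA.yml:

The configuration file has three sections: Input, Output and Pipeline settings. Next, we describe the expected content of each of these three sections and explain how to set the appropriate values for your own application.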

Input section

The Input section contains settings regarding the input data.

Discovery dataset name

Discovery dataset name is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes defined on this discovery dataset. It is also the name to be provided to the -d/--discovery argument of the recovery scripts when performing cell state/ecotype recovery. Any value that contains alphanumeric characters and ’_’ is accepted for this field.

Expression matrix

Expression matrix field should contain the path to a tab-delimited file containing the expression data, with genes as rows and cells as columns. The expression matrix should be in the TPM, CPM or other suitable normalized space. It should have gene symbols on the first column and gene counts for each cell on the next columns. Column (cells) names should be unique. Also, we recommend that the column names do not contain special characters that are modified by the R function make.names, e.g. having digits at the beginning of the name or containing characters such as space, tab or -:

The expected format for the expression matrix is:

Annotation file

A path to an annotation file should be provided in the field Annotation file. This file should contain a column called ID with the same names (e.g. cell barcodes) as the columns of the expression matrix, a column called CellType indicating cell type for each cell, and a column called Sample indicating the sample identifier for each cell. The latter is used for ecotype discovery. This file can contain any number of additional columns. The additional columns can be used for defining sample batches (see Section Annotation file column to scale by below) and for plotting color bars in the heatmaps output (see Section Annotation file column(s) to plot below). For the current example, the annotation file has the following format:

Annotation file column to scale by

In order to discover pan-carcinoma cell states and ecotypes in the EcoTyper carcinoma paper, we standardized genes to mean 0 and unit variance within each tumor type (histology). The field Annotation file column to scale by allows users to specify a column name in the annotation file, by which the cells will be grouped when performing standardization. However, this is an analytical choice that depends on the purpose of the analysis. If users are interested in defining cell states and ecotypes regardless of tumor type-specificity, this argument can be set to NULL. In this case, the standardization will be applied across all samples in the discovery cohort. The same will happen if the annotation file is not provided.

In the current example, this field is not used and therefore set to NULL.
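
The effect of grouping cells before standardization can be seen on a single gene. The sketch below is a Python illustration with a hypothetical function name and made-up values; it assumes the sample standard deviation (n - 1 denominator, as in R's scale) and maps constant groups to 0, neither of which is confirmed by the text above.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardize_by_group(values, groups=None):
    """Scale one gene's values to mean 0 and unit variance within each group.

    groups=None plays the role of setting the field to NULL: all cells are
    standardized together.
    """
    if groups is None:
        groups = ["all"] * len(values)
    indices_by_group = defaultdict(list)
    for index, group in enumerate(groups):
        indices_by_group[group].append(index)

    scaled = [0.0] * len(values)
    for indices in indices_by_group.values():
        group_values = [values[i] for i in indices]
        center = mean(group_values)
        spread = stdev(group_values) if len(group_values) > 1 else 0.0
        for i in indices:
            scaled[i] = (values[i] - center) / spread if spread > 0 else 0.0
    return scaled


if __name__ == "__main__":
    expression = [1.0, 3.0, 10.0, 30.0]   # one gene across four cells
    histology = ["A", "A", "B", "B"]      # hypothetical grouping column
    print(standardize_by_group(expression, histology))
    print(standardize_by_group(expression))
```

With grouping, the large baseline difference between the two groups disappears and only within-group variation remains; without grouping, that difference dominates the scaled values.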

Annotation file column(s) to plot

Annotation file column(s) to plot field specifies which columns in the annotation file will be used as color bar in the output heatmaps, in addition to the cell state label column, plotted by default.

The output section

The Output section contains a single field, Output folder, which specifies the path where the final output will be saved. This folder will be created if it does not exist.

Pipeline settings

The last section, Pipeline settings, contains settings controlling how EcoTyper is run.

Pipeline steps to skip

The Pipeline steps to skip option allows the user to skip some of the steps outlined in section Overview of the EcoTyper workflow for discovering cell states. Please note that this option is only intended for cases when the pipeline has already been run once and small adjustments are made to the parameters. For example, if the Cophenetic coefficient cutoff used in step 3 needs adjusting, the user might want to skip steps 1-2 and re-run from step 3 onwards.

Filter non cell type specific genes

Flag indicating whether to apply the filter for cell type specific genes in step 1, outlined in section Overview of the EcoTyper workflow for discovering cell states. For best results, we recommend applying this filter.
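
The multiple-testing part of this filter (Benjamini-Hochberg correction, followed by the Q-value > 0.05 cutoff from step 1) can be sketched as follows. This is a Python illustration of the standard Benjamini-Hochberg procedure, not EcoTyper's code; the gene names and p-values are made up, and the Wilcoxon test that would produce the p-values is not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (Q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


if __name__ == "__main__":
    # Hypothetical per-gene p-values for one cell type versus all others.
    genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
    p_values = [0.01, 0.04, 0.03, 0.20]
    q_values = benjamini_hochberg(p_values)
    kept = [gene for gene, q in zip(genes, q_values) if q <= 0.05]
    print(q_values, kept)
```

Note that GENE_B and GENE_C have raw p-values below 0.05 but Q-values just above it, so only GENE_A survives in this toy example.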

Number of threads

The number of threads EcoTyper will be run on.

Number of NMF restarts

The NMF approach used by EcoTyper (Brunet et al.) can give slightly different results, depending on the random initialization of the algorithm. To obtain a stable solution, NMF is generally run multiple times with different seeds, and the solution that best explains the discovery data is chosen. Additionally, the variation of NMF solutions across restarts with different seeds is quantified using Cophenetic coefficients and used in step 3 of EcoTyper for selecting the number of states. The parameter Number of NMF restarts specifies how many restarts with different seeds EcoTyper should perform for each rank, in each cell type. Since this is a very time-consuming process, in this example we only use 5. However, for publication-quality results, we recommend at least 50 restarts.

Maximum number of states per cell type

Maximum number of states per cell type specifies the upper end of the range for the number of states possible for each cell type. The lower end is 2.

Cophenetic coefficient cutoff

This field indicates the Cophenetic coefficient cutoff, in the range [0, 1], used for automatically determining the number of states in step 3. Lower values generally lead to more clusters being identified. In this particular example, we set it to 0.975.
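
How the cutoff drives the automatic choice can be made concrete with a small sketch. The Python code below implements one possible reading of the heuristic described in the workflow overview (first rank falling below the cutoff after at least two consecutive ranks at or above it, then whichever of the two ranks around the crossing lies closest to the cutoff). The function name, the coefficient values and the choice to return None when no crossing is found are assumptions, not EcoTyper behavior.

```python
def select_rank(cophenetic, cutoff=0.95):
    """cophenetic: dict mapping rank (number of states) -> cophenetic coefficient."""
    ranks = sorted(cophenetic)
    for pos in range(2, len(ranks)):
        below = cophenetic[ranks[pos]] < cutoff
        two_above = all(cophenetic[ranks[p]] >= cutoff for p in (pos - 2, pos - 1))
        if below and two_above:
            # Crossing point found: compare the ranks on either side of it.
            candidates = (ranks[pos - 1], ranks[pos])
            return min(candidates, key=lambda rank: abs(cophenetic[rank] - cutoff))
    return None  # no crossing found (behavior in this case is an assumption)


if __name__ == "__main__":
    # Made-up cophenetic coefficients for ranks 2-6.
    coefficients = {2: 0.99, 3: 0.98, 4: 0.96, 5: 0.92, 6: 0.90}
    print(select_rank(coefficients, cutoff=0.95))
    print(select_rank(coefficients, cutoff=0.975))
```

On these made-up values the lower cutoff selects the larger rank, in line with the statement that lower cutoffs generally lead to more clusters.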

Jaccard matrix p-value cutoff

Ecotype identification in step 8 is performed by clustering a Jaccard matrix that quantifies the sample overlap between each pair of states. Prior to performing ecotype identification, the Jaccard matrix values corresponding to pairs of states for which the sample overlap is not significant are set to 0, in order to mitigate the noise introduced by spurious overlaps. The value provided in this field specifies the p-value cutoff above which the overlaps are considered non-significant. When the number of samples in the scRNA-seq dataset is small, such as in the current example, we recommend that this filter be disabled (p-value cutoff = 1), to avoid over-filtering the Jaccard matrix. However, we encourage users to set this cutoff to lower values (e.g. 0.05) if the discovery scRNA-seq dataset contains a number of samples large enough to reliably evaluate the significance of overlaps.
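
The interplay between sample count and this cutoff is easy to reproduce on toy data. The sketch below is a Python illustration, not EcoTyper's implementation: function names are hypothetical, each state is a binary vector over samples (1 where the state is the most abundant one of its cell type), the hypergeometric test is assumed to be an upper-tail test on the overlap, and the diagonal is simply set to 1.

```python
from math import comb


def jaccard(a, b):
    """Jaccard index between two binary vectors of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def overlap_p_value(a, b):
    """P(overlap >= observed) if the 1s of b were placed at random among the samples."""
    n_samples, n_a, n_b = len(a), sum(a), sum(b)
    observed = sum(1 for x, y in zip(a, b) if x and y)
    tail = sum(comb(n_a, k) * comb(n_samples - n_a, n_b - k)
               for k in range(observed, min(n_a, n_b) + 1))
    return tail / comb(n_samples, n_b)


def filtered_jaccard_matrix(states, p_cutoff=0.01):
    """Pairwise Jaccard matrix with non-significant overlaps set to 0."""
    n = len(states)
    matrix = [[0.0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q == k:
                matrix[q][k] = 1.0  # assumption: self-overlap kept as 1
            elif overlap_p_value(states[q], states[k]) <= p_cutoff:
                matrix[q][k] = jaccard(states[q], states[k])
    return matrix


if __name__ == "__main__":
    # Three toy states across 8 samples.
    state_a = [1, 1, 1, 1, 0, 0, 0, 0]
    state_b = [1, 1, 1, 1, 0, 0, 0, 0]
    state_c = [1, 0, 0, 0, 1, 1, 1, 0]
    for cutoff in (1.0, 0.05, 0.01):
        print(cutoff, filtered_jaccard_matrix([state_a, state_b, state_c], cutoff)[0])
```

With only 8 samples, even two states that co-occur perfectly reach a p-value of just 1/70, so a cutoff of 0.01 removes their overlap while 0.05 keeps it. This is the over-filtering effect that motivates disabling the filter for small datasets.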

The command line

After editing the configuration file (config_discovery_scRNA.yml), the de novo discovery of cell states and ecotypes can be run as illustrated below. Please note that this script might take 24-48 hours to run on 10 threads. Also, EcoTyper cannot be run on the example data from this tutorial using a typical laptop (16GB memory). We recommend that it is run on a server with >50-100GB of RAM or a high performance cluster.

The output format

EcoTyper generates for each cell type the following outputs:

Plots displaying the Cophenetic coefficient used in step 3. The horizontal dotted line indicates the Cophenetic coefficient cutoff provided in the configuration file field Cophenetic coefficient cutoff. The vertical dotted red line indicates the number of states automatically selected based on that cutoff. We recommend that users inspect this file to make sure that the automatic selection provides sensible results. If the user wants to adjust the Cophenetic coefficient cutoff after inspecting this plot, they can rerun the discovery procedure skipping steps 1-2. Please note that these plots indicate the number of states obtained before applying the filters for low-quality states in steps 6 and 7. Therefore, the final results will probably contain fewer states.

For each cell type, the following outputs are also produced, illustrated here for fibroblasts:

The abundances of the cell states remaining after the QC filters in steps 6 and 7 (if run), across the samples in the discovery dataset:

Assignment of samples in the discovery dataset to the cell state with the highest abundance. Only samples assigned to the cell states remaining after the QC filters in steps 6 and 7 (if run) are included. The remaining ones are considered unassigned and removed from this table:

A heatmap illustrating the expression of genes used for cell state discovery, that have the highest fold-change in one of the cell states remaining after the QC filters in steps 6 and 7 (if run). In the current example, the heatmap includes in the top color bar two rows corresponding to Tissue and Histology, that have been provided in configuration file field Annotation file column(s) to plot, in addition to cell state labels always plotted:

The ecotype output files include:

The cell state composition of each ecotype (the set of cell states making up each ecotype):

The number of initial clusters obtained by clustering the Jaccard index matrix, selected using the average silhouette:

A heatmap of the Jaccard index matrix, after filtering ecotypes with less than 3 cell states:

The abundance of each ecotype in each sample in the discovery dataset:

The assignment of samples in the discovery dataset to ecotypes. The samples not assigned to any ecotype are filtered out from this file:

A heatmap of cell state fractions across the samples assigned to ecotypes:

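All in all, this is a very nice method. If you have matched single-cell and spatial data, using it to find cell states and ecotypes is really powerful. For reference, see the EcoTyper website.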