2 Basic usage of scstruc
This part will present how to use the package for the inference of the network and evaluation. Basically, if you have the object storing (normalized) expression matrix, the primary function scstruc
can perform inference. We will use the mock data generated by scran::mockSCE
function here.
library(scran)
library(scstruc)
library(ggraph)
library(bnlearn)
library(tibble)
set.seed(0)
sce <- mockSCE()
sce <- logNormCounts(sce)
sce
#> class: SingleCellExperiment
#> dim: 2000 200
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(2000): Gene_0001 Gene_0002 ... Gene_1999
#> Gene_2000
#> rowData names(0):
#> colnames(200): Cell_001 Cell_002 ... Cell_199
#> Cell_200
#> colData names(4): Mutation_Status Cell_Cycle
#> Treatment sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(1): Spikes
We obtain simulated data by mockSCE
function in scuttle, and log-normalize the count with the default parameter.
2.1 Bayesian network
Bayesian networks are probabilistic graphical models that represent a set of random variables \(X = \{X_1, \dots, X_N\}\), and their conditional dependencies using a directed acyclic graph (DAG). They are widely used in various domains, such as economics, medicine, decision-making, and for modeling complex relationships among variables such as transcriptomics data. In the package, we infer the relationships among genes, representing the transcriptional regulation between them.
The graphical separation between nodes in G implies the conditional independence of the corresponding variables and leads to factorization:
\[ P(X \mid G; \theta) = \prod_{i=1}^{N} P(X_i \mid \pi_{X_i}; \theta_{X_i}) \quad \text{where } \pi_{X_i} = \{ \text{parents of } X_i \text{ in } G \} \]
In the package, various algorithms are implemented tailored to account for the nature of SCT data, such as dropouts.
2.2 Preprocessing of single-cell transcriptomics data and preparation
To reduce computational burden and to obtain interpretable results, some preprocessing steps and the preparation is suggested.
2.2.1 Coarse-graining expression data (metacell expression)
The package can use metacell approach to reduce the computational burden required to perform structure learning from SCT data. The package features superCellMat
function using SuperCell
package available in CRAN. It employs a cell aggregation approach to reduce dataset complexity while preserving key biological variability (Bilous et al. 2022). This step is optional and one can use their untransformed data directly.
library(SuperCell)
dim(sce)
#> [1] 2000 200
rowData(sce)["ID"] <- row.names(sce)
sce <- superCellMat(sce, ID="ID")
#> SuperCell dimension: 2000 20
dim(sce)
#> [1] 2000 20
The matrix (p x n, unlike the other functions in the package) can be supplied to the function as well.
eco <- readRDS("../ecoli70.rds")
rawmat <- rbn(eco, 1000)
cgmat <- superCellMat(t(rawmat), pca=FALSE)
#> SuperCell dimension: 46 100
dim(cgmat)
#> [1] 46 100
2.2.2 Obtaining interesting gene identifiers
For inference, we first subset to genes in the interesting biological pathway (such as ECM receptor interaction in the cells annotated as vascular endothelial cells). For this purpose, two functions are prepared. One is fetching gene identifiers from Gene Ontology, and another is from KEGG PATHWAY.
2.2.2.1 getGOGenes
We can obtain genes involved in gene ontology by using the identifier.
library(scstruc)
library(org.Mm.eg.db)
genes <- getGOGenes("GO:0030198", orgDb=org.Mm.eg.db)
genes[1:5]
#> [1] "Abl1" "Abl1" "Abl2" "Adamts1" "Adamts1"
2.2.2.2 getKEGGPathwayGenes
We can obtain genes involved in KEGG PATHWAY by using the KEGG identifier with organism identifier. The below is obtaining genes involved in mTOR signaling pathway in Mus musculus. You need to specify organism database in order to correcrly convert the obtained identifiers.
library(KEGGREST)
genes <- getKEGGPathwayGenes("mmu04150", orgDb=org.Mm.eg.db)
genes[1:5]
#> [1] "Prkaa1" "Prkaa2" "Atp6v1h" "Prr5" "Braf"
You can pass these genes for the inference by scstruc
function. Also, we published a package that estimates GRNs from omics data based on enrichment analysis results in the past, and the approach could be useful (Sato et al. 2022).
2.3 Actual structure learning
scstruc
function needs SingleCellExperiment
object or those storing gene expression data. Also, gene list must be specified. In this analysis, randomly subset gene identifiers are used as input. changeSymbol
argument here refers to the changing of the gene names of subset matrix to those specified in symbolColumn
argument (default to Symbol
). In this example, we do not have corresponding symbol and the argument is set to FALSE. Also, algorithm is set to mmhc
by default, which performs max-min hill climbing approach.
included_genes <- sample(row.names(sce), 100)
gs <- scstruc(sce, included_genes, changeSymbol=FALSE)
names(gs)
#> [1] "net" "data"
gs$net
#>
#> Bayesian network learned via Hybrid methods
#>
#> model:
#> [Gene_0018][Gene_0021][Gene_0037][Gene_0047][Gene_0074]
#> [Gene_0090][Gene_0092][Gene_0110][Gene_0124][Gene_0145]
#> [Gene_0176][Gene_0232][Gene_0244][Gene_0249][Gene_0253]
#> [Gene_0267][Gene_0346][Gene_0432][Gene_0453][Gene_0468]
#> [Gene_0471][Gene_0512][Gene_0537][Gene_0543][Gene_0544]
#> [Gene_0554][Gene_0574][Gene_0617][Gene_0647][Gene_0725]
#> [Gene_0900][Gene_0946][Gene_1002][Gene_1095][Gene_1400]
#> [Gene_1401][Gene_1428][Gene_1473][Gene_1614][Gene_1673]
#> [Gene_1738][Gene_1770][Gene_1803][Gene_1807][Gene_1815]
#> [Gene_1860][Gene_1923][Gene_1972][Gene_0324|Gene_0018]
#> [Gene_0356|Gene_0092][Gene_0458|Gene_0176]
#> [Gene_0621|Gene_0512][Gene_0817|Gene_0647]
#> [Gene_0867|Gene_0725][Gene_0908|Gene_0037]
#> [Gene_0940|Gene_0617][Gene_0978|Gene_0554]
#> [Gene_1027|Gene_0074][Gene_1031|Gene_0471]
#> [Gene_1070|Gene_0468:Gene_0544:Gene_1400]
#> [Gene_1087|Gene_0090][Gene_1101|Gene_0725]
#> [Gene_1170|Gene_0047][Gene_1244|Gene_0253]
#> [Gene_1347|Gene_0267][Gene_1365|Gene_0110]
#> [Gene_1379|Gene_0512][Gene_1541|Gene_0453]
#> [Gene_1544|Gene_1401][Gene_1594|Gene_0574]
#> [Gene_1604|Gene_0018][Gene_1634|Gene_0453]
#> [Gene_1757|Gene_1095][Gene_1784|Gene_0074]
#> [Gene_1788|Gene_0543][Gene_1849|Gene_0090]
#> [Gene_1878|Gene_0574][Gene_1887|Gene_0544]
#> [Gene_1890|Gene_1473][Gene_1956|Gene_0176]
#> [Gene_0056|Gene_1788][Gene_0396|Gene_0940:Gene_1770]
#> [Gene_0455|Gene_1170:Gene_1972]
#> [Gene_0741|Gene_0232:Gene_1604][Gene_0879|Gene_1878]
#> [Gene_0920|Gene_1849][Gene_0955|Gene_1031]
#> [Gene_1034|Gene_1544][Gene_1414|Gene_0249:Gene_1379]
#> [Gene_1436|Gene_1347][Gene_1443|Gene_0021:Gene_0978]
#> [Gene_1668|Gene_1594][Gene_1689|Gene_1101]
#> [Gene_1822|Gene_0458][Gene_0076|Gene_1034]
#> [Gene_0311|Gene_0879:Gene_1087][Gene_0363|Gene_1668]
#> [Gene_1496|Gene_0455][Gene_1591|Gene_0920]
#> [Gene_0051|Gene_0076:Gene_1860]
#> nodes: 100
#> arcs: 61
#> undirected arcs: 0
#> directed arcs: 61
#> average markov blanket size: 1.42
#> average neighbourhood size: 1.22
#> average branching factor: 0.61
#>
#> learning algorithm:
#> Max-Min Hill-Climbing
#> constraint-based method:
#> Max-Min Parent Children
#> conditional independence test:
#> Pearson's Correlation
#> score-based method: Hill-Climbing
#> score: BIC (Gauss.)
#> alpha threshold: 0.05
#> penalization coefficient: 1.497866
#> tests used in the learning procedure: 21744
#> optimized: TRUE
By default, it returns the bn
object from bnlearn
and the actual data used for the inference. If returnBn
is set to FALSE
, the function returns the tidygraph
object. These can be used for various downstream tasks, such as visualization, probabilistic reasoning, evaluation based on biological pathway information. The details including the use of various algorithms for learning is described in the following sections.