2 Basic usage of scstruc

This part presents how to use the package for network inference and evaluation. If you have an object storing a (normalized) expression matrix, the primary function scstruc can perform the inference. We will use mock data generated by the scran::mockSCE function here.

library(scran)
library(scstruc)
library(ggraph)
library(bnlearn)
library(tibble)
set.seed(0)
sce <- mockSCE()
sce <- logNormCounts(sce)
sce
#> class: SingleCellExperiment 
#> dim: 2000 200 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(2000): Gene_0001 Gene_0002 ... Gene_1999
#>   Gene_2000
#> rowData names(0):
#> colnames(200): Cell_001 Cell_002 ... Cell_199
#>   Cell_200
#> colData names(4): Mutation_Status Cell_Cycle
#>   Treatment sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(1): Spikes

We obtain simulated data with the mockSCE function from scuttle and log-normalize the counts with the default parameters.

2.1 Bayesian network

Bayesian networks are probabilistic graphical models that represent a set of random variables \(X = \{X_1, \dots, X_N\}\) and their conditional dependencies using a directed acyclic graph (DAG). They are widely used in domains such as economics, medicine, and decision-making, and for modeling complex relationships among variables such as those in transcriptomics data. In this package, we infer relationships among genes, representing the transcriptional regulation between them.

The graphical separation between nodes in the DAG G implies the conditional independence of the corresponding variables and leads to the factorization:

\[ P(X \mid G; \theta) = \prod_{i=1}^{N} P(X_i \mid \pi_{X_i}; \theta_{X_i}) \quad \text{where } \pi_{X_i} = \{ \text{parents of } X_i \text{ in } G \} \]

In the package, various algorithms are implemented that are tailored to the nature of single-cell transcriptomics (SCT) data, such as dropouts.
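As a concrete illustration of the factorization above, consider a toy three-gene DAG built with bnlearn (loaded earlier). This is a minimal sketch, not part of the package's own workflow; the gene names g1, g2, g3 are placeholders.

```r
library(bnlearn)

# Toy DAG: g1 -> g2, g1 -> g3, g2 -> g3, so the joint distribution factorizes as
# P(g1, g2, g3) = P(g1) * P(g2 | g1) * P(g3 | g1, g2).
dag <- model2network("[g1][g2|g1][g3|g1:g2]")

# The parent set pi_{g3} contains g1 and g2.
parents(dag, "g3")
```

Each term in the product conditions a node only on its parents in the DAG, which is what makes scoring and inference tractable.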

2.2 Preprocessing of single-cell transcriptomics data and preparation

To reduce the computational burden and to obtain interpretable results, some preprocessing and preparation steps are suggested.

2.2.1 Coarse-graining expression data (metacell expression)

The package can use the metacell approach to reduce the computational burden of structure learning from SCT data. It features the superCellMat function, which uses the SuperCell package available on CRAN. SuperCell employs a cell aggregation approach that reduces dataset complexity while preserving key biological variability (Bilous et al. 2022). This step is optional; you can use the untransformed data directly.

library(SuperCell)
dim(sce)
#> [1] 2000  200
rowData(sce)["ID"] <- row.names(sce)
sce <- superCellMat(sce, ID="ID")
#>   SuperCell dimension: 2000 20
dim(sce)
#> [1] 2000   20

A matrix can also be supplied to the function; note that it expects a p x n (genes x cells) matrix, unlike the other functions in the package.

eco <- readRDS("../ecoli70.rds")
rawmat <- rbn(eco, 1000)
cgmat <- superCellMat(t(rawmat), pca=FALSE)
#>   SuperCell dimension: 46 100
dim(cgmat)
#> [1]  46 100

2.2.2 Obtaining interesting gene identifiers

For inference, we first subset the data to genes in a biological pathway of interest (such as ECM-receptor interaction in cells annotated as vascular endothelial cells). For this purpose, two functions are provided: one fetches gene identifiers from Gene Ontology and the other from KEGG PATHWAY.

2.2.2.1 getGOGenes

We can obtain genes annotated to a Gene Ontology term by using its identifier.

library(scstruc)
library(org.Mm.eg.db)
genes <- getGOGenes("GO:0030198", orgDb=org.Mm.eg.db)
genes[1:5]
#> [1] "Abl1"    "Abl1"    "Abl2"    "Adamts1" "Adamts1"

2.2.2.2 getKEGGPathwayGenes

We can obtain genes involved in a KEGG pathway by using the KEGG identifier together with an organism identifier. The example below obtains genes involved in the mTOR signaling pathway in Mus musculus. You need to specify the organism database to correctly convert the obtained identifiers.

library(KEGGREST)
genes <- getKEGGPathwayGenes("mmu04150", orgDb=org.Mm.eg.db)
genes[1:5]
#> [1] "Prkaa1"  "Prkaa2"  "Atp6v1h" "Prr5"    "Braf"

You can pass these genes to the scstruc function for inference. We previously published a package that estimates GRNs from omics data based on enrichment analysis results, and that approach could also be useful (Sato et al. 2022).

2.3 Actual structure learning

The scstruc function needs a SingleCellExperiment object or another object storing gene expression data. A gene list must also be specified. In this analysis, a randomly subset list of gene identifiers is used as input. The changeSymbol argument controls whether the gene names of the subset matrix are changed to those in the column specified by the symbolColumn argument (default: Symbol). In this example, we do not have corresponding symbols, so the argument is set to FALSE. The algorithm argument is set to mmhc by default, which performs the max-min hill-climbing approach.

included_genes <- sample(row.names(sce), 100)
gs <- scstruc(sce, included_genes, changeSymbol=FALSE)
names(gs)
#> [1] "net"  "data"
gs$net
#> 
#>   Bayesian network learned via Hybrid methods
#> 
#>   model:
#>    [Gene_0018][Gene_0021][Gene_0037][Gene_0047][Gene_0074]
#>    [Gene_0090][Gene_0092][Gene_0110][Gene_0124][Gene_0145]
#>    [Gene_0176][Gene_0232][Gene_0244][Gene_0249][Gene_0253]
#>    [Gene_0267][Gene_0346][Gene_0432][Gene_0453][Gene_0468]
#>    [Gene_0471][Gene_0512][Gene_0537][Gene_0543][Gene_0544]
#>    [Gene_0554][Gene_0574][Gene_0617][Gene_0647][Gene_0725]
#>    [Gene_0900][Gene_0946][Gene_1002][Gene_1095][Gene_1400]
#>    [Gene_1401][Gene_1428][Gene_1473][Gene_1614][Gene_1673]
#>    [Gene_1738][Gene_1770][Gene_1803][Gene_1807][Gene_1815]
#>    [Gene_1860][Gene_1923][Gene_1972][Gene_0324|Gene_0018]
#>    [Gene_0356|Gene_0092][Gene_0458|Gene_0176]
#>    [Gene_0621|Gene_0512][Gene_0817|Gene_0647]
#>    [Gene_0867|Gene_0725][Gene_0908|Gene_0037]
#>    [Gene_0940|Gene_0617][Gene_0978|Gene_0554]
#>    [Gene_1027|Gene_0074][Gene_1031|Gene_0471]
#>    [Gene_1070|Gene_0468:Gene_0544:Gene_1400]
#>    [Gene_1087|Gene_0090][Gene_1101|Gene_0725]
#>    [Gene_1170|Gene_0047][Gene_1244|Gene_0253]
#>    [Gene_1347|Gene_0267][Gene_1365|Gene_0110]
#>    [Gene_1379|Gene_0512][Gene_1541|Gene_0453]
#>    [Gene_1544|Gene_1401][Gene_1594|Gene_0574]
#>    [Gene_1604|Gene_0018][Gene_1634|Gene_0453]
#>    [Gene_1757|Gene_1095][Gene_1784|Gene_0074]
#>    [Gene_1788|Gene_0543][Gene_1849|Gene_0090]
#>    [Gene_1878|Gene_0574][Gene_1887|Gene_0544]
#>    [Gene_1890|Gene_1473][Gene_1956|Gene_0176]
#>    [Gene_0056|Gene_1788][Gene_0396|Gene_0940:Gene_1770]
#>    [Gene_0455|Gene_1170:Gene_1972]
#>    [Gene_0741|Gene_0232:Gene_1604][Gene_0879|Gene_1878]
#>    [Gene_0920|Gene_1849][Gene_0955|Gene_1031]
#>    [Gene_1034|Gene_1544][Gene_1414|Gene_0249:Gene_1379]
#>    [Gene_1436|Gene_1347][Gene_1443|Gene_0021:Gene_0978]
#>    [Gene_1668|Gene_1594][Gene_1689|Gene_1101]
#>    [Gene_1822|Gene_0458][Gene_0076|Gene_1034]
#>    [Gene_0311|Gene_0879:Gene_1087][Gene_0363|Gene_1668]
#>    [Gene_1496|Gene_0455][Gene_1591|Gene_0920]
#>    [Gene_0051|Gene_0076:Gene_1860]
#>   nodes:                                 100 
#>   arcs:                                  61 
#>     undirected arcs:                     0 
#>     directed arcs:                       61 
#>   average markov blanket size:           1.42 
#>   average neighbourhood size:            1.22 
#>   average branching factor:              0.61 
#> 
#>   learning algorithm:                    
#>                                         Max-Min Hill-Climbing 
#>   constraint-based method:               
#>                                       Max-Min Parent Children 
#>   conditional independence test:         
#>                                         Pearson's Correlation 
#>   score-based method:                    Hill-Climbing 
#>   score:                                 BIC (Gauss.) 
#>   alpha threshold:                       0.05 
#>   penalization coefficient:              1.497866 
#>   tests used in the learning procedure:  21744 
#>   optimized:                             TRUE

By default, the function returns the bn object from bnlearn together with the actual data used for the inference. If returnBn is set to FALSE, the function returns a tidygraph object. These can be used for various downstream tasks, such as visualization, probabilistic reasoning, and evaluation based on biological pathway information. The details, including the use of various learning algorithms, are described in the following sections.
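As a sketch of one such downstream task, a learned structure can be converted to a tidygraph object via its arc list and plotted with ggraph (loaded earlier). This is an illustration only, assuming a toy DAG in place of a learned network such as gs$net; the conversion shown here is not the package's own returnBn=FALSE output.

```r
library(bnlearn)
library(tidygraph)
library(ggraph)

# Toy DAG standing in for a learned network such as gs$net.
net <- model2network("[g1][g2|g1][g3|g1:g2]")

# arcs() returns the edge list as a matrix with "from"/"to" columns;
# convert it to a tidygraph object.
arc_df <- as.data.frame(arcs(net))
tg <- as_tbl_graph(arc_df)

# Plot with ggraph; the sugiyama layout suits DAGs.
p <- ggraph(tg, layout = "sugiyama") +
  geom_edge_link(arrow = arrow(length = unit(2, "mm"))) +
  geom_node_label(aes(label = name)) +
  theme_graph()
```

The same pattern applies to a network learned by scstruc by substituting the toy DAG with the returned object.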