4 Hurdle model

For handling the zero-inflated nature of the single-cell transcriptomics data, the use of the hurdle model is presented. The hurdle model is a two-part model that models whether the observation is zero or not, and non-zero part separately. For this function, Hurdle algorithm should be specified in algorithm argument in scstruc.

Based on the inferred network in HurdleNormal, which proposed the multivariate Hurdle model and grouped lasso to learn the undirected network (McDavid et al. 2019), the directed acyclic graph is inferred based on score maximization using the constraints. By default, score will be BIC in bnlearn. The customized score is also introduced in this section. The first undirected network is selected from multiple lambdas based on the BIC criteria implemented in HurdleNormal.

library(HurdleNormal)
library(scstruc)

sce <- mockSCE()
sce <- logNormCounts(sce)
included_genes <- sample(row.names(sce), 20)
gs <- scstruc(sce, included_genes,
    changeSymbol=FALSE, algorithm="Hurdle")
gs$net
#> 
#>   Bayesian network learned via Score-based methods
#> 
#>   model:
#>    [Gene_0182][Gene_0419][Gene_0475][Gene_0708][Gene_0742]
#>    [Gene_0780][Gene_0795][Gene_0848][Gene_0921][Gene_0940]
#>    [Gene_1003][Gene_1110][Gene_1194][Gene_1291][Gene_1555]
#>    [Gene_1557][Gene_1609][Gene_1663][Gene_1808][Gene_1844]
#>   nodes:                                 20 
#>   arcs:                                  0 
#>     undirected arcs:                     0 
#>     directed arcs:                       0 
#>   average markov blanket size:           0.00 
#>   average neighbourhood size:            0.00 
#>   average branching factor:              0.00 
#> 
#>   learning algorithm:                    Hill-Climbing 
#>   score:                                 BIC (Gauss.) 
#>   penalization coefficient:              2.649159 
#>   tests used in the learning procedure:  0 
#>   optimized:                             TRUE

The score maximization function can be set arbitrarily (maximizeFun), set to hc by default. Greedy Equivalence Search can be performed via setting maximize argument to ges.

gs.tabu <- scstruc(sce, included_genes,
    changeSymbol=FALSE, algorithm="Hurdle",
    algorithm.args=list(maximizeFun=bnlearn::tabu))
gs.tabu$net
#> 
#>   Bayesian network learned via Score-based methods
#> 
#>   model:
#>    [Gene_0182][Gene_0419][Gene_0475][Gene_0708][Gene_0742]
#>    [Gene_0780][Gene_0795][Gene_0848][Gene_0921][Gene_0940]
#>    [Gene_1003][Gene_1110][Gene_1194][Gene_1291][Gene_1555]
#>    [Gene_1557][Gene_1609][Gene_1663][Gene_1808][Gene_1844]
#>   nodes:                                 20 
#>   arcs:                                  0 
#>     undirected arcs:                     0 
#>     directed arcs:                       0 
#>   average markov blanket size:           0.00 
#>   average neighbourhood size:            0.00 
#>   average branching factor:              0.00 
#> 
#>   learning algorithm:                    Tabu Search 
#>   score:                                 BIC (Gauss.) 
#>   penalization coefficient:              2.649159 
#>   tests used in the learning procedure:  0 
#>   optimized:                             TRUE

## This performs GES as score-based learning
gs.ges <- scstruc(sce, included_genes,
    changeSymbol=FALSE, algorithm="Hurdle",
    algorithm.args=list(maximize="ges"))
gs.ges$net
#> 
#>   Random/Generated Bayesian network
#> 
#>   model:
#>    [Gene_0182][Gene_0419][Gene_0475][Gene_0708][Gene_0742]
#>    [Gene_0780][Gene_0795][Gene_0848][Gene_0921][Gene_0940]
#>    [Gene_1003][Gene_1110][Gene_1194][Gene_1291][Gene_1555]
#>    [Gene_1557][Gene_1609][Gene_1663][Gene_1808][Gene_1844]
#>   nodes:                                 20 
#>   arcs:                                  0 
#>     undirected arcs:                     0 
#>     directed arcs:                       0 
#>   average markov blanket size:           0.00 
#>   average neighbourhood size:            0.00 
#>   average branching factor:              0.00 
#> 
#>   generation algorithm:                  Empty

4.1 Customized score for hurdle model

Additional score can be specified by using hurdle.bic function in algorithm.args argument. The score is defined as the sum of BIC values from continuous and logistic regression part of the hurdle model. Let \(Y = [y_{ij}]\) denote the log-normalized expression value of the gene i in subset cell j and \(Z = [z_{ij}]\) a 0-1 indicator value of whether the gene expression is zero. The score is defined as:

\[ \text{logit}\left(\Pr(Z_{ij} = 1)\right) \sim \beta_i G_j, \]

\[ \Pr(Y_{ij} = y \mid Z_{ij} = 1) \sim N(\beta_i G_j, \sigma^2). \]

gs2 <- scstruc(sce, included_genes,
    changeSymbol=FALSE, algorithm="Hurdle",
    algorithm.args=list("score"=hurdle.bic))
gs2$net
#> 
#>   Bayesian network learned via Score-based methods
#> 
#>   model:
#>    [Gene_0182][Gene_0419][Gene_0475][Gene_0708][Gene_0742]
#>    [Gene_0780][Gene_0795][Gene_0848][Gene_0921][Gene_0940]
#>    [Gene_1003][Gene_1110][Gene_1194][Gene_1291][Gene_1555]
#>    [Gene_1557][Gene_1609][Gene_1663][Gene_1808][Gene_1844]
#>   nodes:                                 20 
#>   arcs:                                  0 
#>     undirected arcs:                     0 
#>     directed arcs:                       0 
#>   average markov blanket size:           0.00 
#>   average neighbourhood size:            0.00 
#>   average branching factor:              0.00 
#> 
#>   learning algorithm:                    Hill-Climbing 
#>   score:                                 
#>                                  User-Provided Score Function 
#>   tests used in the learning procedure:  0 
#>   optimized:                             TRUE

As proposed in the MAST library, the cellular detection rate adjustment (CDR) can be performed in the scoring phase by cdrAdjuetment to TRUE. This applies inclusion of CDR term in the hurdle modeling and score maximizing phase.

4.2 `add.dropout`

Like in the other SCT data simulator, the excessive zero in the matrix can be simulated by add.dropout. The function takes absolute expression values if any of the expression in the matrix is negative.

net <- readRDS("../arth150.rds")
sim <- rbn(net, 100)
table(sim == 0)
#> 
#> FALSE 
#> 10700
sim.do <- sim * add.dropout(sim)
table(sim.do == 0)
#> 
#> FALSE  TRUE 
#>  4361  6339

3 Penalized regressions

5 Other algorithms and software

4 Hurdle model

4.1 Customized score for hurdle model

4.2 add.dropout

4.2 `add.dropout`