3 Penalized regressions
Penalized regression is a type of regression analysis that includes a penalty term to prevent overfitting and improve the generalizability of the model. It is particularly useful when dealing with high-dimensional data where the number of predictors is large compared to the number of observations. In terms of inference of GRN and BN structure learning, the penalized regression has been used successfully (Schmidt, Niculescu-Mizil, and Murphy 2007). The package implments the core algorithms for the use of penalization in the BN structure learning described below.
3.1 Algorithms
3.1.1 glmnet_BIC
and glmnet_CV
For L1 regularization (LASSO), the package uses the popular R library glmnet
. The glmnet_BIC
specified in the argument algo
will select the lambda by the minimum BIC criteria, while the glmnet_CV
choose the lambda based on cross validation with the specified fold numbers. glmnetBICpath
returns the BIC path given the data.frame
and the node name to be modeled. plot
contains the BIC path plot, and fit
contains the fitted object and BIC
the lambda and BIC data.frame
. The maximize function can be chosen by maximize
argument in algorithm.args
list. Greedy Equivalence Search can be performed via setting this argument to ges
.
library(scran)
library(scstruc)
library(bnlearn)
sce <- mockSCE()
sce <- logNormCounts(sce)
included_genes <- sample(row.names(sce), 30)
## Inference based on glmnet_CV and maximization by GES
gs <- scstruc(sce, included_genes,
algorithm="glmnet_CV",
algorithm.args=list("maximize"="ges"),
changeSymbol=FALSE)
gs$net
#>
#> Random/Generated Bayesian network
#>
#> model:
#> [Gene_0023][Gene_0145][Gene_0156][Gene_0425][Gene_0544]
#> [Gene_0725][Gene_0814][Gene_0993][Gene_1085][Gene_1214]
#> [Gene_1267][Gene_1292][Gene_1330][Gene_1332][Gene_1375]
#> [Gene_1569][Gene_1683][Gene_1700][Gene_1744][Gene_1922]
#> [Gene_2000][Gene_0348|Gene_1085:Gene_1700]
#> [Gene_0863|Gene_0425][Gene_1091|Gene_0814]
#> [Gene_1520|Gene_0425][Gene_0118|Gene_0348]
#> [Gene_0887|Gene_0348][Gene_1921|Gene_0863:Gene_1700]
#> [Gene_0986|Gene_0887][Gene_1541|Gene_1921:Gene_2000]
#> nodes: 30
#> arcs: 12
#> undirected arcs: 0
#> directed arcs: 12
#> average markov blanket size: 1.00
#> average neighbourhood size: 0.80
#> average branching factor: 0.40
#>
#> generation algorithm: Empty
## Visualization of glmnet_BIC criteria
## Just to obtain data to be used in the inference
gs <- scstruc(sce, included_genes,
changeSymbol=FALSE, returnData=TRUE)
set.seed(10)
glmnetBICpath(gs$data, sample(colnames(gs$data), 1))[["plot"]]

3.1.2 MCP_CV
and SCAD_CV
Same as glmnet_CV
, the library performs penalized regression baed on MCP and SCAD using ncvreg
library.
library(ncvreg)
mcp.net <- scstruc(sce, included_genes,
algorithm="MCP_CV", returnData=FALSE,
changeSymbol=FALSE)
scad.net <- scstruc(sce, included_genes,
algorithm="SCAD_CV", returnData=FALSE,
changeSymbol=FALSE)
## Using the bnlearn function to compare two networks
bnlearn::compare(mcp.net, scad.net)
#> $tp
#> [1] 6
#>
#> $fp
#> [1] 4
#>
#> $fn
#> [1] 0
3.1.3 L0-regularized regression
Based on the L0Learn
, the package performs structure learning based on L0-regularized regression. L0 regularization, also known as best subset selection, is a technique used to build simpler and more interpretable models by selecting only a small number of important variables. L0L1 and L0L2 regularization is a combination of L0, L1, and L2 regularization. For the details of L0-, L0L1-, or L0L2-regularized regression, please consult the original paper (Hazimeh, Mazumder, and Nonet 2023).
l0l2.net <- scstruc(sce, included_genes,
algorithm="L0L2_CV", returnData=FALSE,
changeSymbol=FALSE)
plotNet(l0l2.net)

3.1.4 CCDr
algorithm
Generally, the CCDr algorithm is the fastest algorithm and learns the network for multiple lambdas. By default, the function chooses the network with the best BIC value among multiple lambdas. To supress this effect and obtain all networks, set bestScore
to FALSE
in the algorithm.args
.
library(ccdrAlgorithm);library(sparsebnUtils)
ccdr.res <- scstruc(sce, included_genes,
algorithm="ccdr", changeSymbol=FALSE,
algorithm.args=list(bestScore=FALSE))
#> Setting `lambdas.length` to 10
#> Returning the bn per lambda from result of ccdr.run
names(ccdr.res$net)
#> [1] "14.1421356237309" "8.47798757230114"
#> [3] "5.08242002399425" "3.04683075789017"
#> [5] "1.82652705274248" "1.09497420090059"
#> [7] "0.656419787945471" "0.393513324471009"
ccdr.res$net[[4]]
#>
#> Random/Generated Bayesian network
#>
#> model:
#> [Gene_0023][Gene_0118][Gene_0145][Gene_0156][Gene_0425]
#> [Gene_0544][Gene_0725][Gene_0814][Gene_0887][Gene_0986]
#> [Gene_0993][Gene_1085][Gene_1091][Gene_1214][Gene_1267]
#> [Gene_1292][Gene_1330][Gene_1332][Gene_1375][Gene_1520]
#> [Gene_1541][Gene_1569][Gene_1683][Gene_1700][Gene_1744]
#> [Gene_1921][Gene_1922][Gene_2000][Gene_0348|Gene_0118]
#> [Gene_0863|Gene_0425]
#> nodes: 30
#> arcs: 2
#> undirected arcs: 0
#> directed arcs: 2
#> average markov blanket size: 0.13
#> average neighbourhood size: 0.13
#> average branching factor: 0.067
#>
#> generation algorithm: Empty
Therefore, the bootstrap-based inference can be performed very fast using CCDr algorithm. If bootstrapping is specified, the function performs learning the bootstrapped network for the multiple lambdas across all the replicates.
3.1.5 Precision lasso
Precision Lasso combines the LASSO and the precision matrix into the regularization process. This approach is particularly useful in biological and genomics studies, where highly-correlated features often appear. We implmented precision lasso in R and the feature is provided by plasso.fit
and plasso.fit.single
using RcppArmadillo
. The fixed lambda value should be specified.
pl.res <- scstruc(sce, included_genes, algorithm="plasso", changeSymbol=FALSE)