6 Evaluating the inferred networks

Assessing the inferred networks is crucial for interpreting the results: it determines which network should be used for downstream analysis and how closely the inferred causal relationships resemble those of biologically validated networks. One of the core features of scstruc is evaluating the inferred networks and selecting the optimal algorithm. This section describes how to evaluate the inferred networks using various metrics. The implemented metrics include:

  • True positive arcs
  • False positive arcs
  • False negative arcs
  • Precision
  • Recall
  • F1-score
  • Bayesian Information Criterion (if the data to be fitted is provided)
  • Structural Hamming Distance
  • Structural Intervention Distance (with or without symmetrization)
  • Kullback–Leibler divergence
  • AUPRC (for bootstrapped networks only)

6.1 Evaluation functions

6.1.1 metrics function

This function accepts the learned networks and the reference network (both should be bn objects) and outputs a data.frame consisting of various metrics.

library(scstruc)
library(bnlearn)
net <- readRDS("ecoli70.rds")
data.inference <- rbn(net, 50)
infer <- hc(data.inference)
metrics(bn.net(net), list("inferred"=infer))
#>       algo referenceNode InferenceNode s0 edges SHD TP FP
#> 1 inferred            46            46 70    99 107 28 42
#>   FN TPR Precision    Recall        F1 SID KL BIC
#> 1 71 0.4       0.4 0.2828283 0.3313609 958 NA  NA

The sid_sym argument controls whether the SID is symmetrized, and SID.cran controls whether the SID implementation in the CRAN package SID is used.
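
For example, a minimal call requesting the symmetrized SID computed with the CRAN implementation could look as follows (both arguments are assumed to take logical values, and the objects from the chunk above are reused):

metrics(bn.net(net), list("inferred"=infer), sid_sym=TRUE, SID.cran=TRUE)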

6.1.2 metricsFromFitted function

This function accepts a parameter-fitted network and a sampling number, as well as the algorithms to be used in the inference. Using the rbn function in bnlearn, logic sampling is performed from the fitted network. Here, we use the ECOLI70 network from the GeneNet R package, sampling 50 observations from the network. The algorithms to be tested can be specified by the algos argument. For the special algorithms, arguments with the same name are provided. The function has return_data and return_net arguments, which return the data used in the inference and the inferred networks. By default, only the metrics are returned.

mf <- metricsFromFitted(net, 50, algos=c("glmnet_CV", "glmnet_BIC", "L0_CV"))
#> glmnet_CV 5.95810103416443
#> glmnet_BIC 0.471029996871948
#> L0_CV 5.5213348865509
#> MMHC 0.001 0.0360720157623291
#> MMHC 0.005 0.0354738235473633
#> MMHC 0.01 0.0376400947570801
#> MMHC 0.05 0.0452709197998047
#> Network computing finished
head(mf$metrics)
#>         algo s0 edges        KL       BIC SHD TP FP FN
#> 1  glmnet_CV 70    68  5.145568 -2347.568  69 31 39 37
#> 2 glmnet_BIC 70    38 14.594425 -2508.020  55 23 47 15
#> 3      L0_CV 70    30 64.437907 -2893.720  67 13 57 17
#> 4 mmhc_0.001 70    26 54.579934 -2797.366  59 15 55 11
#> 5 mmhc_0.005 70    28 44.462021 -2747.934  60 16 54 12
#> 6  mmhc_0.01 70    29 43.507040 -2728.172  59 17 53 12
#>         TPR Precision    Recall        F1 SID PPI
#> 1 0.4428571 0.4428571 0.4558824 0.4492754  NA  NA
#> 2 0.3285714 0.3285714 0.6052632 0.4259259  NA  NA
#> 3 0.1857143 0.1857143 0.4333333 0.2600000  NA  NA
#> 4 0.2142857 0.2142857 0.5769231 0.3125000  NA  NA
#> 5 0.2285714 0.2285714 0.5714286 0.3265306  NA  NA
#> 6 0.2428571 0.2428571 0.5862069 0.3434343  NA  NA
#>         time   BICnorm  N  p
#> 1 5.95810103 1.0000000 50 46
#> 2 0.47103000 0.8938548 50 46
#> 3 5.52133489 0.6387000 50 46
#> 4 0.03607202 0.7024417 50 46
#> 5 0.03547382 0.7351427 50 46
#> 6 0.03764009 0.7482163 50 46
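
As noted above, the sampled data and the inferred networks themselves can also be retained. A minimal sketch, assuming return_data and return_net take logical values:

## Keep the sampled data and the inferred networks alongside the metrics
mf.full <- metricsFromFitted(net, 50, algos=c("glmnet_CV"),
    return_data=TRUE, return_net=TRUE)
names(mf.full)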

The results can be visualized in the usual way using libraries such as ggplot2. Here, we use the plotthis library for visualization.

library(plotthis)
library(ggplot2)
library(ggrepel)
ScatterPlot(mf$metrics, x="SHD", y="F1", color_by="algo", legend.position="none") +
    geom_text_repel(aes(label=algo), bg.colour="white")
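
The same figure can also be drawn with ggplot2 alone; a minimal sketch based on the columns shown above:

## Scatter plot of SHD versus F1, labelled by algorithm
ggplot(mf$metrics, aes(x=SHD, y=F1, color=algo)) +
    geom_point() +
    geom_text_repel(aes(label=algo)) +
    theme(legend.position="none")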

6.2 Evaluating the causal validity

The primary objective of the package is evaluating the causal validity of the inferred networks. Two approaches can be used, depending on whether a reference directed network is available. In most cases, a reference network is not readily available.

6.2.1 Obtaining the directed acyclic graphs (DAGs) for the evaluation

For a biological pathway of interest, one can obtain a DAG from the KEGG PATHWAY database. The getKEGGEdges function accepts a pathway identifier and returns the DAG, though it will not always succeed. The function first parses the pathway information using ggkegg and identifies the largest component to be evaluated. If removeCycle is TRUE, the function identifies the minimum feedback arc set using an igraph function and removes these edges. The function returns a bn object.

Suppose you are interested in inferring gene regulatory networks in the mTOR signaling pathway from your dataset; you should first load the DAG from the KEGG API.

library(scstruc)
dags <- getKEGGEdges("mmu04150", removeCycle=TRUE)
#> Removing Pik3ca|Mtor
graphviz.plot(dags)

Using the genes in this candidate pathway, inference can be performed on your dataset, and the performance metrics can then be obtained against the reference DAG.

## `net` here denotes a network inferred on the genes of this pathway
mymet <- metrics(dags, list("Algo1"=net))
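
If your own expression data is available, a typical workflow is to subset it to the pathway genes before the inference. A minimal sketch, where expr is a hypothetical cell-by-gene data.frame containing all genes in the DAG:

## Subset the (hypothetical) expression data to the pathway genes
pathway.data <- expr[, nodes(dags)]
## Infer the network with any algorithm, here hill-climbing
pathway.net <- hc(pathway.data)
## Compare the inferred network against the KEGG-derived reference DAG
pathway.met <- metrics(dags, list("HC"=pathway.net))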

6.2.2 Intersection-Validation approach

When no reference network is available, we can use the Intersection-Validation approach proposed by Viinikka et al. (Viinikka, Eggeling, and Koivisto 2018) to evaluate which algorithm is optimal in terms of SHD and SID. For this purpose, the interVal function is provided. The function accepts the input data, multiple algorithms to be tested, and parameters related to Intersection-Validation. The implemented metrics are SHD and SID.

The user should provide the data and the algorithms to be tested:

test.data <- head(gaussian.test, 50)
test <- interVal(test.data, algos=c("hc","mmhc","tabu"), ss=30)
test
#> $A0
#> 
#>   Random/Generated Bayesian network
#> 
#>   model:
#>    [A][B][C][D][E][G][F|E:G] 
#>   nodes:                                 7 
#>   arcs:                                  2 
#>     undirected arcs:                     0 
#>     directed arcs:                       2 
#>   average markov blanket size:           0.86 
#>   average neighbourhood size:            0.57 
#>   average branching factor:              0.29 
#> 
#>   generation algorithm:                  Empty 
#> 
#> 
#> $stat
#> # A tibble: 3 × 4
#>   AlgoNum SHD.stat SID.stat    en
#>     <dbl>    <dbl>    <dbl> <dbl>
#> 1       1      5.2      0     7.2
#> 2       2      2.7      1.5   2.3
#> 3       3      5.4      0     7.4
#> 
#> $raw.stat
#>     R AlgoNum SHD SID EdgeNumber
#> 1   1       1   6   0          8
#> 2   1       2   1   0          3
#> 3   1       3   6   0          8
#> 4   2       1   6   0          8
#> 5   2       2   3   3          2
#> 6   2       3   6   0          8
#> 7   3       1   4   0          6
#> 8   3       2   3   1          2
#> 9   3       3   5   0          7
#> 10  4       1   7   0          9
#> 11  4       2   3   1          2
#> 12  4       3   7   0          9
#> 13  5       1   5   0          7
#> 14  5       2   3   1          2
#> 15  5       3   5   0          7
#> 16  6       1   6   0          8
#> 17  6       2   3   3          2
#> 18  6       3   6   0          8
#> 19  7       1   4   0          6
#> 20  7       2   3   1          3
#> 21  7       3   5   0          7
#> 22  8       1   4   0          6
#> 23  8       2   3   3          2
#> 24  8       3   4   0          6
#> 25  9       1   5   0          7
#> 26  9       2   1   0          3
#> 27  9       3   5   0          7
#> 28 10       1   5   0          7
#> 29 10       2   4   2          2
#> 30 10       3   5   0          7

The r argument specifies the number of iterations, and the ss argument specifies the sub-sampling size. The function leaves a message if the number of connected node pairs, defined as the number of edges in the agreement graph, is below 15. The returnA0 option can be used to return only the intersection of the inferred networks from the first stage. The output option can be specified to return the relevant data (the data used for the inference, A0, and all the bn objects).
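
For instance, a call that sets the iteration number and returns only the first-stage agreement graph could look like this (only the argument names are taken from the description above; the values are assumptions):

## Return only A0, the intersection of the first-stage networks
A0 <- interVal(test.data, algos=c("hc","mmhc","tabu"), r=10, ss=30, returnA0=TRUE)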

6.3 AUPRC

Although the package focuses on Bayesian network evaluation, the commonly used metric of the area under the precision-recall curve (AUPRC) can also be calculated. calc.auprc accepts a reference bn object and a bn.strength object obtained by bootstrapping, and returns the AUPRC value. The function uses yardstick to calculate the value. The target argument specifies which column is to be used as the weight, which is useful for output from software such as GENIE3. The example below uses a bootstrapped GES network for inference and calculates the AUPRC.

net <- readRDS("ecoli70.rds")
data.inference <- rbn(net, 50)
infer <- pcalg.boot(data.inference, R=30)
calc.auprc(bn.net(net), infer)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary         0.407

The accompanying prc.plot function is also provided, which accepts the reference bn object and a named list of strength objects and plots the PRC. Note that the weight column (corresponding to edge confidence) should be named strength in this function.
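
For example, continuing from the bootstrapped GES network above:

prc.plot(bn.net(net), list("GES"=infer))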

By combining these methods, it is possible to determine which inferred network is the most suitable for downstream analysis.

6.4 Evaluation based on SERGIO

6.4.1 Load and plot

SERGIO, a simulator of single-cell gene expression data, models single-cell gene expression based on a user-specified GRN. After cloning the SERGIO repository, first load the ground-truth GRN included in the dataset, and then plot the loaded network using plotNet.

library(scstruc)
library(dplyr);library(igraph);library(bnlearn)

## De-noised_100G_9T_300cPerT_4_DS1
gt <- read.csv("De-noised_100G_9T_300cPerT_4_DS1/gt_GRN.csv", header=FALSE)
gt <- gt %>% `colnames<-`(c("from","to"))
gt$from <- paste0("G",gt$from)
gt$to <- paste0("G",gt$to)
g <- graph_from_data_frame(gt)

### Consider the GRN as DAG
is_dag(g)
#> [1] TRUE
ref.bn.ds1 <- bnlearn::as.bn(g)
ds1 <- plotNet(ref.bn.ds1, showText=FALSE)

## De-noised_400G_9T_300cPerT_5_DS2
gt <- read.csv("De-noised_400G_9T_300cPerT_5_DS2/gt_GRN.csv", header=FALSE)
gt <- gt %>% `colnames<-`(c("from","to"))
gt$from <- paste0("G",gt$from)
gt$to <- paste0("G",gt$to)
g <- graph_from_data_frame(gt)

### Consider the GRN as DAG
is_dag(g)
#> [1] TRUE
ref.bn.ds2 <- bnlearn::as.bn(g)
ds2 <- plotNet(ref.bn.ds2, showText=FALSE)

library(patchwork)
ds1 + ds2

6.4.2 Inference and evaluation

Load the expression data; a bootstrapped network is then obtained using the GES algorithm. The expression matrix is coarse-grained beforehand using superCellMat. The performance metrics are calculated using the functions in scstruc.

## DS1
df <- read.csv("De-noised_100G_9T_300cPerT_4_DS1/simulated_noNoise_0.csv", row.names=1)
dim(df)
#> [1]  100 2700
row.names(df) <- paste0("G",row.names(df))
df <- scstruc::superCellMat(as.matrix(df), pca=FALSE)
#>   SuperCell dimension: 100 270
input <- data.frame(t(as.matrix(df)))
ges.res <- pcalg.boot(input, R=25)

calc.auprc(ref.bn.ds1, ges.res)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary        0.0728
prc.plot(ref.bn.ds1, list("GES"=ges.res))+
    cowplot::theme_cowplot()

6.5 Adding dropout to GBN

As another method to reproduce dropout in non-SCT data, the add.dropout function is provided. In this example, the function is applied to data sampled from a Gaussian Bayesian network, and the estimation accuracy is compared using the same inference method.

net <- readRDS("ecoli70.rds")

dat <- rbn(net, 100)
dat2 <- dat * add.dropout(dat, q=0.2)

raw <- pcalg.boot(dat, R=25)
raw.2 <- pcalg.boot(dat2, R=25)

prc.plot(bn.net(net),
    list("Raw"=raw, "DO"=raw.2))+
    cowplot::theme_cowplot()
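
The corresponding AUPRC values can be compared in the same way:

## AUPRC with and without dropout
calc.auprc(bn.net(net), raw)
calc.auprc(bn.net(net), raw.2)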