3 Statistial analysis

stana provides functions to perform statistical analysis on the metagenotyping results based on loaded data.

library(stana)
library(phangorn)
library(ggtree)
library(ggstar)
## Examine sample object
load(system.file("extdata", "sysdata.rda", package = "stana"))

3.1 Consensus sequence calling

The consensus sequence calling can be performed using SNV MAF matrix (or the matrix converted from the allelic counts). It can filter the confident and user-defined positions and output the multiple sequence alignment in fastaList slot. The MIDAS and MIDAS2 output provides various statistics of SNVs that can be used for the filtering. The implementation is similar to that in MIDAS. Here we use pair-wise distances of sequences calculated by dist.ml in the default amino acid model (JC69), and performs neighbor-joining tree estimation by NJ.


## The consensusSeq* is prepared, but consensusSeq function can automatically 
## choose which functions to use
stana <- consensusSeqMIDAS2(stana, species="100003", verbose=FALSE)
#> # Beginning calling for 100003
#> # Original Site number: 5019
#> #  Profiled samples: 11
#> #  Included samples: 11

## Tree estimation and visualization by `phangorn` and `ggtree`
dm <- dist.ml(getSlot(stana, "fastaList")[["100003"]])
tre <- NJ(dm)
tre <- groupOTU(tre, getSlot(stana, "cl"))
tp <- ggtree(tre, aes(color=.data$group),
             layout='circular') +
        geom_tippoint(size=3) +
        ggtree::scale_color_manual(values=getSlot(stana, "colors"))
tp

The sequences are stored as phyDat class object in fastaList slot, the list with the species ID as name.

stana <- stana |>
    consensusSeq(argList=list(site_prev=0.8))
#> # Beginning calling for 100003
#> # Original Site number: 5019
#> #  Profiled samples: 11
#> #  Included samples: 11
getFasta(stana)[[1]]
#> 11 sequences with 4214 character and 2782 different site patterns.
#> The states are a c g t

Matrix of characters can be returned by return_mat=TRUE. This does not return stana object.

mat <- stana |>
    consensusSeq(argList=list(site_prev=0.8, return_mat=TRUE))
#> # Beginning calling for 100003
#> # Original Site number: 5019
#> #  Profiled samples: 11
#> #  Included samples: 11
mat |> dim()
#> [1]   11 4214

You can use the MSA stored in fastaList slot to infer the phylogenetic tree of your choices. The inferAndPlotTree function can be used to internally infer and plot the tree based on grouping. dist_method is set to dist.ml by default, and you can pass arguments to the function by tree_args.

library(phangorn)
stana <- inferAndPlotTree(stana, dist_method="dist.hamming", tree_args=list(exclude="all"))
getTree(stana)[[1]]
#> 
#> Phylogenetic tree with 11 tips and 10 internal nodes.
#> 
#> Tip labels:
#>   ERR1711593, ERR1711594, ERR1711596, ERR1711598, ERR1711603, ERR1711605, ...
#> 
#> Rooted; includes branch lengths.

Owning to the powerful functions in ggtree and ggtreeExtra, you can visualize the tree based on the metadata. You should set data.frame containing covariates to stana object by setMetadata function. And specify the covariates to meta argument in plotTree.

## Make example metadata
samples <- getSlot(stana, "snps")[["100003"]] |> colnames()
metadata <- data.frame(
    row.names=samples,
    treatment=factor(sample(1:3, length(samples), replace=TRUE)),
    marker=runif(length(samples))
)

## Set metadata
stana <- setMetadata(stana, metadata)

## Call consensus sequence
## Infer and plot tree based on metadata
stana <- stana |>
  consensusSeq(argList=list(site_prev=0.95)) |>
  inferAndPlotTree(meta=c("treatment","marker"))
#> # Beginning calling for 100003
#> # Original Site number: 5019
#> #  Profiled samples: 11
#> #  Included samples: 11
getFasta(stana)[[1]]
#> 11 sequences with 896 character and 625 different site patterns.
#> The states are a c g t
getTree(stana)[[1]]
#> 
#> Phylogenetic tree with 11 tips and 10 internal nodes.
#> 
#> Tip labels:
#>   ERR1711593, ERR1711594, ERR1711596, ERR1711598, ERR1711603, ERR1711605, ...
#> 
#> Rooted; includes branch lengths.
getSlot(stana, "treePlotList")[[1]]

The site_list is supported for the usecase such as calling limited to the certain gene regions. The below example calls the MSA in specific gene ID and output plot by ggmsa.

## get snpsInfo slot and filter to variants in specific gene IDs
cand_ids <- stana::getSlot(stana, "snpsInfo")[["100003"]] %>%
    dplyr::filter(gene_id=="UHGG000008_01913") %>%
    row.names()

## Call sequence
stana <- consensusSeqMIDAS2(stana, "100003", site_list=cand_ids)
#> # Beginning calling for 100003
#> # Original Site number: 5019
#> #  Profiled samples: 11
#> #  Included samples: 11
#> # site_list specified: 22

## Plot
if (requireNamespace("ggmsa")) {
    library(ggmsa)
    phangorn::write.phyDat(getFasta(stana)[["100003"]],
                           "test.fasta", format="fasta")
    ggmsa::ggmsa("test.fasta",seq_name = TRUE)+ ggmsa::geom_seqlogo()
}

3.2 Nonnegative matrix factorization

The loaded or calculated matrix can be used for the nonnegative matrix factorization (NMF) for the unsupervised identifications of factors within species. This calculates factor x sample and sample to feature matrix, and possibly finds the pattern for the within-species diversity. The results can be summarized by the functions such as plotAbundanceWithinSpecies.

The input can be SNV, gene, or gene family (KO) copy number table. The matrix can contain NA or -1 (zero depth at the position), so filtering should be performed. The NMF::nmf function or NNLM::nnmf function can be used for this purpose. By default, estimate is set to FALSE but if set to TRUE, it performs the estimation of rank within estimate_range. This assumes that multiple subspecies are in the samples and is not applicable where only one subspecies should be present. It chooses the rank based on the cophenetic correlation coefficient when the function uses the R package NMF, but the package implements a variety of algorithms and the rank selection method. Also, if NNLM, the rank selection based on cross-validation by randomly assigning the NA in the cell in the matrix, is performed.

The function outputs the related statistics like the proportion of NA or zero value, the relative profiles of factors after the estimation, or the features presented in each factor.

The method is set to snmf/r by default in NMF. For the larger matrix, the NNML function can be used for the faster computation by setting nnlm_flag to TRUE.

library(NMF)
stana <- NMF(stana, "100003", estimate=TRUE)[[1]]
#> # NMF started 100003, target: kos, method: snmf/r
#> # Original features: 20
#> # Original samples: 16
#> # Original matrix NA: NA
#> # Original matrix zero: 0.781
#> # Selecting KL loss
#> # Filtered features: 20
#> # Filtered samples: 16
#> # Chosen rank:3
#> # Rank 3
#> Mean relative abundances: 0.4506259 0.3696346 0.1797395 
#> Present feature per factor: 18 14 13
getSlot(stana, "nmf")
#> NULL

The users can specify the rank. The NMF slot stores the list of NMF results (or NNLM results) per species.

stana <- NMF(stana, "100003", rank=3, beta=0.001)
#> # NMF started 100003, target: kos, method: snmf/r
#> # Original features: 20
#> # Original samples: 16
#> # Original matrix NA: NA
#> # Original matrix zero: 0.781
#> # Selecting KL loss
#> # Filtered features: 20
#> # Filtered samples: 16
#> # Rank 3
#> Mean relative abundances: 0.4391153 0.3652978 0.1955869 
#> Present feature per factor: 15 11 12
getSlot(stana, "NMF")
#> $`100003`
#> <Object of class: NMFfit>
#>  # Model:
#>   <Object of class:NMFstd>
#>   features: 20 
#>   basis/rank: 3 
#>   samples: 16 
#>  # Details:
#>   algorithm:  snmf/r 
#>   seed:  random 
#>   RNG: 10403L, 624L, ..., -1119848976L [a0b56536ecb759f07f21b4b252fb5db8]
#>   distance metric:  <function> 
#>   residuals:  5.628987 
#>   parameters: beta=0.001 
#>   Iterations: 65 
#>   Timing:
#>      user  system elapsed 
#>      0.05    0.00    0.05

The resulting stana object can be used with the other function. plotAbundanceWithinSpecies plots the (relative) sample profiles per sample using the grouping criteria in stana object.

plotAbundanceWithinSpecies(stana, "100003", tss=TRUE)

The data can be obtained using return_data.

plotAbundanceWithinSpecies(stana, "100003", tss=TRUE, return_data=TRUE) %>% head()
#>                     1          2         3  group
#> ERR1711593 0.98582990 0.01417010 0.0000000 Group1
#> ERR1711594 0.85247014 0.14752986 0.0000000 Group1
#> ERR1711596 0.19498362 0.04848823 0.7565282 Group1
#> ERR1711598 0.17024848 0.22117519 0.6085763 Group1
#> ERR1711599 0.00000000 1.00000000 0.0000000 Group1
#> ERR1711602 0.01749079 0.98250921 0.0000000 Group1

The basis corresponds to the factor to feature matrix. This can represent functional profiles if the KO or gene copy number tables are used for the NMF. This information can be summarized to matrix of KEGG PATHWAY information using pathwayWithFactor.

library(pheatmap)
pheatmap(pathwayWithFactor(stana, "100003"))

These information can be further combined with the other functions in stana, like plotting factor profiles along with the tree inferred from consensus sequence alignment, linking allele frequency information and gene copy number data.

ab <- plotAbundanceWithinSpecies(stana, "100003", tss=TRUE, return_data=TRUE)
stana <- setMetadata(stana, ab)
stana <- inferAndPlotTree(stana, "100003", meta=colnames(ab))
getTreePlot(stana)
#> $`100003`

3.2.1 Scaling function for NNLM

As same as scale function in NMF, the helper function for NNLM resulting object is prepared (scaler.NNLM).

3.3 PERMANOVA

Using adonis2 function in vegan, one can compare distance matrix based on SNV frequency or gene (gene family) copy numbers, or tree-based distance between the specified group. When the target="tree" is specified, tree shuold be in treeList, with the species name as the key. The ape::cophenetic.phylo() is used to calculate distance between tips based on branch length. Distance method can be chosen from dist function in stats, and the default is set to manhattan. You can specify distArg to pass the arguments to dist. Also, the distance calculated directly from sequences can be used. In this case, target='fasta' should be chosen, and the function to calculate distance should be provided to AAfunc argument.

stana <- setTree(stana, "100003", tre)
stana <- doAdonis(stana, specs = "100003", target="tree")
#> # Performing adonis in 100003 target is tree
#> #  F: 0.719649945825046, R2: 0.0740407267582885, Pr: 0.695
getAdonis(stana)[["100003"]]
#> Permutation test for adonis under reduced model
#> Permutation: free
#> Number of permutations: 999
#> 
#> adonis2(formula = d ~ ., data = structure(list(group = c("Group1", "Group1", "Group1", "Group1", "Group2", "Group2", "Group2", "Group2", "Group2", "Group2", "Group2")), row.names = c("ERR1711593", "ERR1711594", "ERR1711596", "ERR1711598", "ERR1711603", "ERR1711605", "ERR1711606", "ERR1711609", "ERR1711611", "ERR1711612", "ERR1711618"), class = "data.frame"))
#>          Df SumOfSqs      R2      F Pr(>F)
#> Model     1  0.15557 0.07404 0.7196  0.695
#> Residual  9  1.94558 0.92596              
#> Total    10  2.10115 1.00000

The corresponding principal coordinate analysis plot using distance matrix can be drawn by specifying pcoa. The relative eigenvalues are plotted.

stana <- doAdonis(stana, specs = "100003",
    target="genes", pcoa=TRUE)
#> # Performing adonis in 100003 target is genes
#> #  F: 0.950009752773493, R2: 0.0635457614064265, Pr: 0.578

By default, the function uses grouping variable in cl slot. For more complex modeling, the slot meta can be populated by the data frame of samples by setMetadata like the example below. To use metadata, useMeta option should be specified with the formula argument.

stana <- doAdonis(stana, specs = "100003",
    target="genes", useMeta=TRUE, formula = "d ~ .")
#> # Performing adonis in 100003 target is genes
#> # Printing raw adonis results ...
#> No residual component
#> 
#> adonis2(formula = d ~ ., data = structure(list(`1` = c(0.985829902295573, 0.852470142666385, 0.194983622925863, 0.1702484777323, 0, 0.01749079304296, 0.169551265957738, 1, 0.774981822424855, 0, 0, 1, 0.145056521442649, 0.894711680596164, 0.725364565076885, 0.0951555820116268), `2` = c(0.0141700977044264, 0.147529857333615, 0.0484882264088995, 0.221175192027691, 1, 0.98250920695704, 0.830448734042262, 0, 0, 0.911065650363313, 0, 0, 0.693043447598221, 0.0456195965920271, 0.274635434923115, 0.67607988474988
#>          Df  SumOfSqs R2 F Pr(>F)
#> Model    15 501125509  1         
#> Residual  0         0  0         
#> Total    15 501125509  1
getAdonis(stana)[["100003"]]
#> No residual component
#> 
#> adonis2(formula = d ~ ., data = structure(list(`1` = c(0.985829902295573, 0.852470142666385, 0.194983622925863, 0.1702484777323, 0, 0.01749079304296, 0.169551265957738, 1, 0.774981822424855, 0, 0, 1, 0.145056521442649, 0.894711680596164, 0.725364565076885, 0.0951555820116268), `2` = c(0.0141700977044264, 0.147529857333615, 0.0484882264088995, 0.221175192027691, 1, 0.98250920695704, 0.830448734042262, 0, 0, 0.911065650363313, 0, 0, 0.693043447598221, 0.0456195965920271, 0.274635434923115, 0.67607988474988
#>          Df  SumOfSqs R2 F Pr(>F)
#> Model    15 501125509  1         
#> Residual  0         0  0         
#> Total    15 501125509  1

3.4 Comparing gene copy numbers

If you have genes slot filled in the stana object, gene copy numbers can be compared one by one using exact Wilcoxon rank-sum test using wilcox.exact in exactRankTests computing exact conditional p-values. Note that p-values are not adjusted for multiple comparisons made.

res <- compareGenes(stana, "100003")
#> # Testing total of 21806
res[["UHGG000008_01733"]]
#> 
#>  Exact Wilcoxon rank sum test
#> 
#> data:  c(1.154444, 2.404241, 0, 1.421386, 1.50773, 0) and c(0.535732, 1.709442, 1.31675, 3.44086, 2.712423, 1.923076, 1.062853, c(1.154444, 2.404241, 0, 1.421386, 1.50773, 0) and 1.21147, 0, 1.509217)
#> W = 22, p-value = 0.4256
#> alternative hypothesis: true mu is not equal to 0

3.5 Aggregating gene copy numbers

The gene copy numbers across multiple stana object can be aggregated. This returns the new stana object where the gene copy numbers are combined.

## This returns new stana object
stanacomb <- combineGenes(list(stana, stana), species="100003")
#> Common genes: 21806
#> Duplicate label found in group
dim(getSlot(stanacomb, "genes")[["100003"]])
#> [1] 21806    32

3.6 Performing Boruta or feature selection

Boruta algorithm can be run on matrices to obtain important marker (SNV position or gene) for distinguishing between group by doBoruta function. The function performs Boruta algorithm on specified data and returns the Boruta class result. By default, the function performs fixes to tentative input. To disable this, specify doFix=FALSE.

library(Boruta)
brres <- doBoruta(stana, "100003")
#> # Using grouping from the slot: Group1/Group2
#> # If needed, please provide preprocessed matrix to `mat`
#> # Feature number: 21806
#> # Performing Boruta
brres
#> $boruta
#> Boruta performed 99 iterations in 27.66723 secs.
#> Tentatives roughfixed over the last 99 iterations.
#>  15 attributes confirmed important: UHGG000008_01798,
#> UHGG004375_00184, UHGG025024_01181, UHGG061776_01339,
#> UHGG155301_01986 and 10 more;
#>  21791 attributes confirmed unimportant:
#> UHGG000008_00008, UHGG000008_00009, UHGG000008_00010,
#> UHGG000008_00012, UHGG000008_00015 and 21786 more;

Further, we visualize the copy numbers of important genes confirmed between the group.

confirmed_genes <- brres$boruta$finalDecision[brres$boruta$finalDecision=="Confirmed"] %>% names()
plotGenes(stana, "100003", confirmed_genes)+
  ggplot2::facet_wrap(.~geneID,scales="free_y")+
  cowplot::theme_cowplot() +
  cowplot::panel_border()