Tutorial on using BNfinder software

Bartek Wilczyński and Norbert Dojer

This short tutorial presents the most common possible uses of the BNfinder software. The first part of this tutorial is devoted to presenting possible options of the software an the input files on simplistic, synthetic examples. In the second part, we provide more realistic examples taken from published studies of data for inferring dynamic and static networks.

In this tutorial, we will assume that you are using the standalone BNfinder application as downloaded from http://bioputer.mimuw.edu.pl/software/bnf, however if you want, you can also try out these examples with our webserver at http://bioputer.mimuw.edu.pl/BIAS/BNFinder.

If you have any questions regarding this document or the described software, please contact us: bartek@mimuw.edu.pl or dojer@mimuw.edu.pl

Synthetic examples

This section shows on several simple networks, how to prepare datasets and set the parameters for network reconstruction with BNfinder. The examples include a simple static network, dynamic network and a network requiriing setting prior probabilities.

Simple static network

**Figure 1:** Very simple network consisting of 2 regulators and 4 observable regulatees.

The first example shows how to use BNfinder to learn a simple static Bayesian network. Let us imagine that we are analysisng cells under two conditions and and that we are interested whether any of the four genes: are responding to these conditions. We assume that the true network is depicted in Fig 1, i.e.

is not dependent on or ,
is more likely to be expressed under condition ,
is less likely to be expressed under condition ,
is more likely to be expressed under any of the conditions or .

We have collected 100 datapoints from this network, each consisting of both the state of conditions and the discrete state of expression of the genes. You can download the input file here data/input1.txt.

If you open the file in a text editor, please note that the first line contains the information on the assumed structure:

#regulators P1 P2

This represents the fact, that we assume that genes (

) can depend on conditions (

) and not the other way around.

You can try to run BNfinder on this file:

bnf -e input1.txt -n output1.sif -v

and you will see, that the network topology is reconstructed properly. Also the orientation of the regulatory interactions is inferred properly as you can see in the output file data/output1.sif.

You can also try to see whether the optimal network is representative for a larger set of possible suboptimal networks:

bnf -e input1.txt -n output1w.sif -v -i 5 -txt output1.txt

This time, in the output file (data/output1w.sif), the edge labels represent the relative weights of different edges. In the file data/output1.txt, we can find the originally computed weights for all considered suboptimal sets of parents for all genes.

We can also try to analyze the data for this network without discretization: data/input2.txt. In this case we need another directive to indicate that some of the dataseries are continuous:

#continuous G1 G2 G3 G4

Again if we run BNfinder on this data,

bnf -e input2.txt -n output2.sif -v

we can verify, that the output file contains correct information data/output2.sif.

Simple dynamic network

BNfinder can be used also to infer dynamic Bayesian networks from time series data. In this case it is not necessary to specify the regulators sets, because DBNs, unlike static networks do not need to be acyclic.

In the first dataset: data/input3.txt, we have 1 serie of 20 consecutive measurements of gene expression from gene network depicted in Fig. 2.

**Figure 2:** Very simple dynamic network consisting of 5 observables

If we run BNfinder on this data:

bnf -e input3.txt -n output3.sif -v

We can see that the program was unable to correctly reconstruct all the edges. Again, if we look at the statistics of edge occurences in suboptimal networks,

bnf -e input3.txt -n output3.sif -v -i 10 -txt output3.txt

we can see that the correct edges are the most commonly occuring ones, but they score lower than empty parent sets.

In this case we can show how perturbational data can be integrated into this framework. We have collected gene expression from 5 time-series containing one single gene knockout for each of the genes: data/input4.txt. The perturbations are noted by including the following lines in the preamble of the data file:

#perturbed EXP1 G1
#perturbed EXP2 G2
#perturbed EXP3 G3
#perturbed EXP4 G4
#perturbed EXP5 G0

If we run BNfinder on the perurbed data, we can see that all the edges are reconstructed with high confidence.

bnf -e input4.txt -n output4.sif -v -i 10 -txt output4.txt

Setting priors

**Figure 3:** Exemplary network containing dependencies of different strength

In some cases it might be useful to include some prior information on the network structure into the process of inference. We will illustrate this on an example of a simple network similar to the one described in section 1.1. This time it is an even simpler network , with 2 conditions and 2 genes as depicted in Fig. 3. Even though topology of the network is very simple, the problem lies in the fact that the dependence of

is weaker than the dependence of

. This is why if we run our software on the unmodified dataset data/input5.txt,

bnf -e input5.txt -n output5.sif -v  -i 10 -t output5.txt

we can see that the program is unable to recover the $P2\rightarrow G2$ edge. However, if we expect that

is responding weakly to its regulators, we can increase the prior probability of

being regulated by any of the factors

via decreasing its weight:

#prioredge G2 0.33 P2 P1

we can see the edge appearing in the result as expected (see data/input6a.txt):

bnf -e input6a.txt -n output6a.sif -v  -i 10 -t output6a.txt

Similarly, if we expect, that in general gene response to condition is weaker, we may modify the prior probability of the condition to be a regulator:

#priorvert 0.33 P2

The result of running BNFinder with this input (data/input6b.txt) is very similar to the previous one:

bnf -e input6b.txt -n output6b.sif -v  -i 10 -t output6b.txt

Examples of published datasets

In this section, we present two more realistic examples of published datasets used for inference of Bayesian networks. The first one consists of measurements of states of protein signalling network under different perturbations [1]. It's been used to infer causal relationships in the form of static Bayesian network.

The second dataset comes from documentation of the Banjo package [2] which can be downloaded from (1]. We took the data from the article, and transformed it into the format suitable for BNfinder. We also needed to specify several properties of the data in the preamble of the file data/sachs.inp

**Figure 4:** Reconstruction of the protein signalling network. Dark blue arrows represent dependencies found in literature. Light blue arrows represent dependencies found by BNfinder but not expected by Sachs et al. [1]

Firstly, we needed to specify that the data are continuous measurements:

#continuous praf pmek plcg PIP2 PIP3 p44/42 pakts473 PKA PKC P38 pjnk

Then, we needed to specify the expected layer structure of the signalling pathway we are studying:

#regulators plcg
#regulators PIP3
#regulators PIP2
#regulators PKC
#regulators PKA
#regulators praf
#regulators pjnk pmek P38 
#regulators p44/42
#regulators pakts473

Then we needed to specify which of the proteins are affected by different perturbations.

#perturbed cd3cd28psitect_0 PIP2
#perturbed cd3cd28psitect_1 PIP2
#perturbed cd3cd28psitect_2 PIP2
...
#perturbed cd3cd28g0076_0 PKC
#perturbed cd3cd28g0076_1 PKC
#perturbed cd3cd28g0076_2 PKC
...

When we finally run the BNfinder:

bnf -e sachs.inp -n sach.sif -v

We obtain the network presented in Fig. 4. As we can see, the topology is quite consistent with the literature data. Out of 17 expected edges, BNfinder recovers 11 correctly.

Dynamic Bayesian network

This is a dataset of substantial size which is used [3] to assess the performance of our inference algorithm. The input dataset (data/input7.txt) consists of 2000 measurments of 20 variables nd it takes approximately 3 hours to compute it on a modern PC (2.4Ghz Intel Core 2 duo).

We can run BNfinder with the following command:

bnf -e input7.txt -n output7.sif -v -l 5

In Fig. 5 we can see part of the network reconstructed by BNfinder. All the edges reported by Banjo are also present in the optimal network (dark blue). The optimal network contains a number of additional edges, not reported by Banjo.

**Figure 5:** The optimal network reconstructed by BNfinder from the dynamic benchmark dataset. The edges reported also by Banjo are shown in dark blue.

If you want to see how much faster the MDL algorithm is, you can also run BNfinder with the following command:

bnf -s MDL -e input7.txt -n output7mdl.sif -v -l 5

Bibliography

1: K. Sachs, O. Perez, D. Pe'er, D.A. Lauffenburger, and G.P. Nolan.
Causal protein-signaling networks derived from multiparameter single-cell data.
Science, 308:523-529, Apr 2005.
2: V. Anne Smith, Jing Yu, Tom V Smulders, Alexander J Hartemink, and Erich D Jarvis.
Computational inference of neural information flow networks.
PLoS Comput Biol, 2(11):e161, Nov 2006.
3: Bartek Wilczynski and Norbert Dojer.
BNfinder: Exact and efficient method for learning Bayesian networks.
submitted.