BNfinder: Exact and efficient method for learning Bayesian networks
Supplementary Materials

1  Polynomial-time exact algorithm

In this section we give a brief exposition of the algorithm implemented in BNfinder and of its computational cost for two commonly used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence. For a fuller treatment, including detailed proofs, we refer the reader to [2].

1.1  Algorithm

A Bayesian network (BN) N is a representation of a joint distribution of a set of discrete random variables X={X1,…,Xn}. The representation consists of two components: a directed acyclic graph G=(X,E) encoding conditional (in)dependencies between the variables, and a family of conditional distributions P(Xi|Pai), where Pai denotes the set of parents of Xi in G.

The joint distribution of X is given by

P(X)=
n
i=1
 P(Xi|Pai)     (1)

The problem of learning a BN is understood as follows: given a multiset of X-instances D={x1,…,xN}, find a network graph G that best matches D. The notion of a good match is formalized by means of a scoring function S(G:D), which takes positive values and is minimized for the best matching network. Thus the point is to find a directed acyclic graph G with the set of vertices X minimizing S(G:D).

The BNfinder program is devoted to the case when there is no need to examine the acyclicity of the graph, for example when learning dynamic Bayesian networks [4], where edges always point forward in time, or when prior knowledge in the form of structural constraints already forces acyclicity.

In the sequel we consider some assumptions on the form of the scoring function. The first one states that S(G:D) decomposes into a sum, over the set of random variables, of local scores that depend only on the values of a variable and of its parents in the graph.

Assumption 1 (additivity)   S(G:D) = ∑i=1…n s(Xi, Pai : D|{Xi}∪Pai), where D|Y denotes the restriction of D to the values of the members of Y ⊆ X.

When there is no need to examine the acyclicity of the graph, this assumption allows the parents set of each variable to be computed independently. Thus the point is to find Pai minimizing s(Xi, Pai : D|{Xi}∪Pai) for each i.

Let us fix a dataset D and a random variable X. We denote by X′ the set of potential parents of X (possibly smaller than X due to given constraints on the structure of the network). To simplify the notation we continue to write s(Pa) for s(X,Pa:D|{X}∪Pa).

The following assumption expresses the fact that scoring functions decompose into two components: g, penalizing the complexity of a network, and d, evaluating how well the network explains the data.

Assumption 2 (splitting)   s(Pa) = g(Pa) + d(Pa) for some functions g, d : P(X) → ℝ+ satisfying Pa ⊆ Pa′ ⟹ g(Pa) ≤ g(Pa′).

This assumption is used in the following algorithm to avoid considering networks with an excessively large component g.

Algorithm 1   
  1. Pa := ∅
  2. for each P ⊆ X′ chosen according to g(P)
    1. if s(P) < s(Pa) then Pa := P
    2. if g(P) ≥ s(Pa) then return Pa; stop

In the above algorithm, choosing according to g(P) means visiting the subsets in order of increasing values of the component g of the local score.

Theorem 1   Suppose that the scoring function satisfies Assumptions 1-2. Then Algorithm 1 applied to each random variable finds an optimal network.
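
To make the procedure concrete, below is a minimal Python sketch of Algorithm 1 for a single variable. The function name and signature are illustrative, not part of BNfinder; g and s are assumed to be callables returning g(P) and s(P) for a frozenset of candidate parents, and the ⊆-monotonicity of g (Assumption 2) guarantees that the best-first traversal visits subsets in order of nondecreasing g.

import heapq
from itertools import count

def learn_parents_alg1(candidates, g, s):
    # Illustrative sketch of Algorithm 1; not BNfinder's actual code.
    tie = count()                                   # tie-breaker for equal g values
    heap = [(g(frozenset()), next(tie), frozenset())]
    seen = {frozenset()}
    best, best_score = frozenset(), float("inf")
    while heap:
        g_p, _, p = heapq.heappop(heap)             # subsets arrive in nondecreasing g
        score = s(p)
        if score < best_score:                      # step 2.1 of Algorithm 1
            best, best_score = p, score
        if g_p >= best_score:                       # step 2.2: no further set can win
            return best
        for x in candidates:                        # schedule the immediate supersets
            q = p | {x}
            if q not in seen:
                seen.add(q)
                heapq.heappush(heap, (g(q), next(tie), q))
    return best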

A disadvantage of the above algorithm is that finding a proper subset P ⊆ X′ involves computing g(P′) for all ⊆-successors P′ of previously chosen subsets. This can be avoided when a further assumption is imposed.

Assumption 3 (uniformity)   |Pa| = |Pa′| ⟹ g(Pa) = g(Pa′).

The above assumption suggests the notation ĝ(|Pa|)=g(Pa). The following algorithm uses the uniformity of g to reduce the number of computations of the component g.

Algorithm 2   
  1. Pa := ∅
  2. for p = 1 to n
    1. if ĝ(p) ≥ s(Pa) then return Pa; stop
    2. P := arg min{Y ⊆ X′ : |Y|=p} s(Y)
    3. if s(P) < s(Pa) then Pa := P

Theorem 2   Suppose that the scoring function satisfies Assumptions 1-3. Then Algorithm 2 applied to each random variable finds an optimal network.
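
Below is an analogous Python sketch of Algorithm 2 (again an illustration, not BNfinder's actual code). Here g_hat is assumed to be a callable returning ĝ(p) for a parent-set size p, and s a callable returning the local score of a frozenset of parents.

from itertools import combinations

def learn_parents_alg2(candidates, g_hat, s):
    # Illustrative sketch of Algorithm 2; not BNfinder's actual code.
    best = frozenset()
    best_score = s(best)
    for p in range(1, len(candidates) + 1):
        if g_hat(p) >= best_score:                  # step 2.1: larger sets cannot win
            return best
        # step 2.2: best-scoring set of exactly p potential parents
        subset = min((frozenset(c) for c in combinations(candidates, p)), key=s)
        score = s(subset)
        if score < best_score:                      # step 2.3
            best, best_score = subset, score
    return best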

1.2  Minimal Description Length

The Minimal Description Length (MDL) scoring criterion originates from information theory [6]. A network N is viewed here as a model of compression of a dataset D. The optimal model minimizes the total length of the description, i.e. the sum of the description length of the model and of the compressed data.

Let us fix a dataset D={x1,…,xN} and a random variable X. Recall the decomposition s(Pa)=g(Pa)+d(Pa) of the local score for X. In the MDL score, g(Pa) stands for the length of the description of the local part of the network (i.e. the edges incoming to X and the conditional distribution P(X|Pa)) and d(Pa) is the length of the compressed version of the X-values in D.

Let kY denote the cardinality of the set VY of possible values of the random variable Y ∈ X. Thus we have

g(Pa) = |Pa|·log n + ((log N)/2)·(kX − 1)·∏Y∈Pa kY
where (log N)/2 is the number of bits we use for each numeric parameter of the conditional distribution. This formula satisfies Assumption 2 but fails to satisfy Assumption 3. Therefore Algorithm 1 can be used to learn an optimal network, but Algorithm 2 cannot.

However, for many applications we may assume that all the random variables attain values from the same set V of cardinality k. In this case we obtain the formula

g(Pa) = |Pa|·log n + ((log N)/2)·(k − 1)·k^|Pa|

which satisfies Assumption 3. For simplicity, we continue to work under this assumption. The general case may be handled in much the same way.

Compression with respect to the network model is understood as follows: when encoding the X-values, the values of Pa-instances are assumed to be known. Thus the optimal encoding length is given by

d(Pa)=N· H(X|Pa)

where H(X|Pa) = −∑v∈V ∑v⃗∈VPa P(v, v⃗)·log P(v | v⃗) is the conditional entropy of X given Pa; here v⃗ ranges over the joint value configurations of Pa, and the distributions are estimated from D.
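
For concreteness, the following Python sketch shows how the two MDL components could be computed from count data under the uniform-value assumption. It only illustrates the formulas above; the function name, the data layout and the use of base-2 logarithms are assumptions, not BNfinder's actual implementation.

from collections import Counter
from math import log2

def mdl_local_score(child_values, parent_columns, n_vars, k):
    # child_values: the N observed values of X; parent_columns: one list of N
    # observed values per parent; n_vars: n; k: common number of possible values.
    N = len(child_values)
    num_parents = len(parent_columns)
    # g(Pa): description length of the local part of the network
    g = num_parents * log2(n_vars) + (log2(N) / 2) * (k - 1) * k ** num_parents
    # d(Pa) = N * H(X|Pa), with the distributions estimated from the data
    joint = Counter(zip(child_values, *parent_columns))
    parents_only = Counter(zip(*parent_columns)) if parent_columns else Counter({(): N})
    d = 0.0
    for (x, *pa), n_joint in joint.items():
        d -= N * (n_joint / N) * log2(n_joint / parents_only[tuple(pa)])
    return g + d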

Since all the assumptions from the previous section are satisfied, Algorithm 2 may be applied to learn the optimal network. Let us turn to the analysis of its complexity.

Theorem 3   The worst-case time complexity of Algorithm 2 for the MDL score is O(n^(log_k N) · N·log_k N).

1.3  Bayesian-Dirichlet equivalence

The Bayesian-Dirichlet equivalence (BDe) scoring criterion originates from Bayesian statistics [1]. Given a dataset D the optimal network structure G maximizes the posterior conditional probability P(G|D). We have

P(G|D) ∝ P(G)·P(D|G) = P(G)·∫ P(D|G,θ)·P(θ|G) dθ

where P(G) and P(θ|G) are prior probability distributions on graph structures and on the parameters of the conditional distributions, respectively, and P(D|G,θ) is evaluated according to (1).

Heckerman et al. [5], following Cooper and Herskovits [1], identified a set of independence assumptions that make it possible to decompose the integral in the above formula into a product over X. Under this condition, together with a similar one regarding the decomposition of P(G), the scoring criterion

S(G:D)=−logP(G)−logP(D|G)

obtained by taking −log of the above term satisfies Assumption 1. Moreover, the decomposition s(Pa)=g(Pa)+d(Pa) of the local scores appears as well, with the components g and d derived from −logP(G) and −logP(D|G), respectively.

The distribution P(G) ∝ α^|E|, where G=(X,E) and 0<α<1 is a penalty parameter, is generally used as a prior over the network structures. This choice results in the function

g(Pa) = |Pa|·log α⁻¹

satisfying Assumptions 2 and 3.

However, it should be noted that priors are also in use that satisfy neither Assumption 2 nor Assumption 3, e.g. P(G) ∝ α^Δ(G,G0), where Δ(G,G0) is the cardinality of the symmetric difference between the sets of edges of G and of a prior network G0.

The Dirichlet distribution is generally used as a prior over the conditional distributions’ parameters. It yields

d(Pa) = log ∏v⃗∈V^|Pa| [ Γ(∑v∈V (Hv,v⃗ + Nv,v⃗)) / Γ(∑v∈V Hv,v⃗) · ∏v∈V Γ(Hv,v⃗) / Γ(Hv,v⃗ + Nv,v⃗) ]

where Γ is the Gamma function, Nv,v⃗ denotes the number of samples in D with X=v and Pa=v⃗, and Hv,v⃗ is the corresponding hyperparameter of the Dirichlet distribution.

Setting all the hyperparameters to 1 yields

d(Pa) = log ∏v⃗∈V^|Pa| [ ((k − 1 + ∑v∈V Nv,v⃗)! / (k − 1)!) · ∏v∈V 1/Nv,v⃗! ]
      = ∑v⃗∈V^|Pa| [ log(k − 1 + ∑v∈V Nv,v⃗)! − log(k − 1)! − ∑v∈V log Nv,v⃗! ]
where k=|V|. For simplicity, we continue to work under this assumption (following Cooper and Herskovits [1]). The general case may be handled in a similar way.
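
The following Python sketch evaluates d(Pa) with all hyperparameters equal to 1, directly from the last formula above. It is an illustration only (the helper names and the logarithm base are assumptions, not BNfinder's actual code); parent configurations that do not occur in the data contribute zero and are therefore skipped.

from collections import Counter, defaultdict
from math import lgamma

def log_fact(m):
    return lgamma(m + 1)                  # log(m!)

def bde_d(child_values, parent_columns, k):
    # counts[pa_config][v] = N_{v,pa_config}: samples with X = v and Pa = pa_config
    counts = defaultdict(Counter)
    for i, v in enumerate(child_values):
        pa_config = tuple(col[i] for col in parent_columns)
        counts[pa_config][v] += 1
    d = 0.0
    for per_value in counts.values():
        n_config = sum(per_value.values())
        d += log_fact(k - 1 + n_config) - log_fact(k - 1)
        d -= sum(log_fact(n) for n in per_value.values())
    return d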

The following result makes it possible to refine the decomposition of the local score into the sum of the components g and d.

Proposition 1   Define dmin = ∑v∈V (log(k−1+Nv)! − log(k−1)! − log Nv!), where Nv denotes the number of samples in D with X=v. Then d(Pa) ≥ dmin for each Pa ⊆ X′.

By the above proposition, the decomposition of the local score given by s(Pa)=g′(Pa)+d′(Pa) with the components g′(Pa)=g(Pa)+dmin and d′(Pa)=d(Pa)−dmin satisfies all the assumptions required by Algorithm 2. Let us turn to the analysis of its complexity.
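
Continuing the sketch above, dmin of Proposition 1 can be computed from the marginal counts of X alone (again a hypothetical helper; values of X absent from D contribute zero and can be skipped):

from collections import Counter
from math import lgamma

def bde_d_min(child_values, k):
    log_fact = lambda m: lgamma(m + 1)    # same helper as in the sketch above
    marginal = Counter(child_values)      # N_v: number of samples with X = v
    return sum(log_fact(k - 1 + n) - log_fact(k - 1) - log_fact(n)
               for n in marginal.values())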

Theorem 4   The worst-case time complexity of Algorithm 2 for the BDe score with the decomposition of the local score given by s(Pa)=g′(Pa)+d′(Pa) is O(n^(N·log_α⁻¹ k) · N²·log_α⁻¹ k).

1.4  Continuous variables

Both MDL and BDe scores are designed for discrete variables. In order to avoid arbitrary discretization of continuous data we adapted these scores to deal with continuous variables directly.

The distribution of each continuous variable is assumed to be a mixture of two normal distributions. The mixture parameters are estimated from data clustered with the k-means algorithm (k=2; the initial clusters are obtained by splitting the set of variable values at the median). The parameters are then used to transform continuous values into probability distributions over the mixture components. These components are treated as the two possible values (low and high) of a related hidden discrete variable.
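
The following Python sketch illustrates the soft discretization described above: it fits a two-component normal mixture (k-means with k=2, initialized by a median split) and returns, for each observation, the probability of the high component. It is a rough illustration of the idea under the stated assumptions, not BNfinder's actual implementation, and it assumes both clusters stay non-empty.

import numpy as np

def soft_discretize(values, n_iter=10):
    x = np.asarray(values, dtype=float)
    labels = (x > np.median(x)).astype(int)          # initial clusters: median split
    for _ in range(n_iter):                          # plain one-dimensional k-means
        centers = np.array([x[labels == c].mean() for c in (0, 1)])
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    weights = np.array([np.mean(labels == c) for c in (0, 1)])
    means = np.array([x[labels == c].mean() for c in (0, 1)])
    stds = np.array([x[labels == c].std() + 1e-9 for c in (0, 1)])
    # posterior probability of each mixture component given a value
    dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / stds
    resp = dens / dens.sum(axis=1, keepdims=True)
    return resp[:, 1]                                # P('high') for every sample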

In the learning procedure BNfinder analyses regulatory relationships between these hidden variables (and possibly between variables that were originally discrete). Thus the scoring functions must be able to handle distributions over variable values rather than single determined values. In such cases BNfinder uses the expected values of the original scores.

2  Manual

2.1  Installation

The BNF software uses setuptools, the standard library for packaging Python software. After downloading the archive containing the current version of BNF, you should extract it to a directory of your choice. On Unix-like systems you can do this by typing

tar -xzf bnf-0.1.tgz

Once you have the sources extracted, the installation is performed by a single command

python setup.py install

in the source directory (this may require administrator privileges).

This installs the BNfinder library to an appropriate location for your Python interpreter, together with a bnf script that can be run from the command line.

2.2  Usage

BNfinder can be executed by typing

> bnf <options>

The following options are available:

-h, --help
 
print out a brief summary of options and exit
-e, --expr <file>
 
load learning data from <file> (this option is obligatory)
-s, --score <name>
 
learn with the <name> scoring criterion; possible names are BDE (default) and MDL
-l, --limit <number>
 
limit the search space to networks with at most <number> parents for each vertex
-i, --suboptimal <number>
 
compute <number> best scored parents sets for each vertex (default 1)
-d, --data-factor <number>
 
multiply (each item of) the dataset <number> times (default 1); this option may be used to change the proportion between d and g components of the scoring function (see the definition of the splitting assumption)
-v, --verbose
 
print out progress messages on standard output
-p, --prior-pseudocount <number>
 
set the pseudocounts of data items with specified values of a vertex and its parents set to <number>/|V|^(|Pa|+1) (resulting in a total pseudocount equal to <number>) – this method follows Heckerman et al. [5]; when the option is unspecified, all pseudocounts are set to 1, following Cooper and Herskovits [1]; pseudocounts are used as hyperparameters of the Dirichlet priors of the BDE scoring criterion and also in the estimation of the conditional probability distributions (CPDs) of the learned network
-n, --net <file>
 
write the learned network graph to <file> in the SIF format
-t, --txt <file>
 
write the learned suboptimal parents sets to <file>
-b, --bif <file>
 
write the learned Bayesian network to <file> in the BIF format
-c, --cpd <file>
 
write the learned Bayesian network to <file> as a Python dictionary
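
As an illustration, the following invocation (all file names here are placeholders) learns a network with the MDL scoring criterion, allows at most 3 parents per vertex, prints progress messages, and writes both the network graph and the suboptimal parents sets:

> bnf -e input.txt -s MDL -l 3 -v -n network.sif -t parents.txt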

2.3  Input format

The learning data must be passed to BNfinder in a text file split into three parts: preamble, experiment specification and experiment data. The preamble allows the user to specify some features of the data and/or the network, while the next two parts contain the learning data, essentially formatted as a table with space- or tab-separated values.

2.3.1  Preamble

The preamble allows specifying experiment perturbations, structural constraints, vertex value types, vertex CPD types and edge weights. Each line in the preamble has the following form:

#<command> <arguments>

Experiments with perturbed values of some vertices carry no information regarding their regulatory mechanism. Thus including the data from such experiments when learning the parents of the perturbed vertices biases the result (see [3] for a detailed treatment). The following command handles perturbations:

#perturbed <experiment/serie> <vertex list>
 
omit data from experiment (series of experiments in the case of dynamic networks) <experiment/serie> when learning the parents of vertices from <vertex list>

One possible way of specifying structural constraints with BNfinder is to list the potential parents of particular vertices. An easier method is available for constraints of the cascade form, where the vertex set is split into a sequence of groups and each parent of a vertex must belong to one of the previous groups (a simple but extremely useful example is a cascade with two groups: regulators and regulatees). There are two commands specifying structural constraints:

#parents <vertex> <vertex list>
 
restrict the set of potential parents of <vertex> to <vertex list>.
#regulators <vertex list>
 
restrict the set of potential parents of all vertices (except those specified with a #parents command or listed in a previous or the present #regulators command) to the vertices included in the <vertex list> of a previous or the present #regulators command.

Note that structural constraints forcing the network's acyclicity are necessary for learning a static Bayesian network with BNfinder.

Vertex value types may be specified with the following commands:

#discrete <vertex> <value list>
 
let <value list> be possible values of <vertex>
#continuous <vertex list>
 
let float numbers be possible values of all vertices in <vertex list>
#default <value list>
 
let <value list> be the possible values of all vertices except those specified with a #discrete or #continuous command (when <value list> is FLOAT, float numbers are the possible values)

Values in <value list> may be integers or words (strings without whitespace). When some vertices are left unspecified, BNfinder tries to recognize their possible value sets. However, it may fail, in particular when some float numbers are written in integer format or when some possible values are not represented in the dataset (note that the size of the set of possible values affects the score).

The space of possible CPDs of some vertices given their parents may be restricted to noisy-and or noisy-or distributions. In this case, the sets of possible values of these vertices and their potential parents must be either {0,1} or float numbers. Moreover, BNfinder should be executed with the MDL scoring criterion. The following commands specify vertices with noisy CPDs:

#and <vertex list>
 
restrict the space of possible CPDs of vertices from <vertex list> to noisy-and distributions
#or <vertex list>
 
restrict the space of possible CPDs of vertices from <vertex list> to noisy-or distributions

The following commands set prior weights on network edges:

#prioredge <vertex> <weight> <vertex list>
 
set the prior weights of all edges originating from vertices in <vertex list> and pointing at <vertex> to <weight>
#priorvert <weight> <vertex list>
 
set the prior weights of all edges originating from vertices in <vertex list> (except those specified in a #prioredge command) to <weight>

Weights must be positive float numbers. Edges with greater weights are penalized more heavily. The default weight is 1.

2.3.2  Experiment specification

The experiment specification has the following form:

<name> <experiment list>

where <name> is a word starting with a symbol other than #. The form of the experiment names depends on the data type and, consequently, on the type of the learned network (static or dynamic).

2.3.3  Experiment data

Each line of the experiment data part has the following form:

<vertex> <value list>

where <vertex> is a word and values are listed in the order corresponding to <experiment list>.
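
As an illustration, a small input file might look as follows (the vertex names, experiment names and values are purely hypothetical):

#regulators G1 G2
#default 0 1
#continuous G3
conditions exp1 exp2 exp3
G1 1 0 1
G2 0 1 1
G3 0.3 1.7 2.1

Here G1 and G2 form the group of potential regulators, G3 is declared continuous, the remaining vertices take the values 0 and 1, and three experiments are provided.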

2.4  Output formats

2.4.1  SIF format

The SIF (Simple Interaction File) format, usually contained in files with the .sif extension, is the simplest of the supported formats and carries only information on the topology of the network. Each line represents a single interaction, which in our case means that one variable depends on another. Each line contains three values: the parent (regulator) vertex, the sign of the interaction (+ or -), and the dependent vertex.

For example, the file:

A + B
B - C

describes a network of the following shape:

A →(+) B →(−) C

The main advantage of this format is that it can be read by the Cytoscape (http://cytoscape.org) software allowing for quick visualization. It is also trivial to use such data in one’s own software.
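
For instance, such a file can be read with a few lines of Python (the file name is a placeholder, and one interaction per line is assumed, as in the example above):

edges = []
with open("network.sif") as f:
    for line in f:
        parent, sign, child = line.split()
        edges.append((parent, child, sign))
# for the example file above: [('A', 'B', '+'), ('B', 'C', '-')]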

2.4.2  Suboptimal parents sets

Suboptimal parents sets are written to a file in a simple text format, split into sections representing the parents sets of each vertex. Each section starts with a line containing the vertex name, followed by lines representing its consecutive suboptimal parents sets. Each of these lines has the form:

 <relative probability> <vertex list>

where <relative probability> is the ratio of the set's posterior probability to the posterior probability of the empty parents set and <vertex list> contains the elements of the set. Lines are ordered decreasingly according to <relative probability>.

For example, the section:

C
 2.333333  B
 1.000000 
 0.592593  B A

reports the three most probable parents sets of the vertex C: {B}, ∅, {B,A}. Moreover, it states that {B} is 2.333333 times more probable than the empty set and that the corresponding ratio for {B,A} equals 0.592593.

2.4.3  BIF format

Bayesian Interchange Format (BIF) is a simple text format dedicated to Bayesian networks. It is supported in some BN applications (e.g. JavaBayes, Bayes Networks Editor) and may be easily converted with available tools to other popular formats (including XML formats and BNT format of K. Murphy’s Bayes Net Toolbox). BNfinder writes learned networks in BIF version 0.15.

2.4.4  Python dictionary

A network saved in <file> as a dictionary may be loaded into your Python environment by

eval(open(<file>).read())

The dictionary consists of items corresponding to all of the network's vertices. Each item has the following form:

<vertex name> : <vertex dictionary>

Vertex dictionaries have the following items:

’vals’ : <value list>
’pars’ : <parent list>
’type’ : <CPD type>
 
(only for vertices with noisy CPDs, possible values of <CPD type> are ’and’ and ’or’)
’cpds’ : <CPD dictionary>

The form of the vertex CPD dictionary depends on the vertex type. In the case of a noisy CPD, the dictionary items have the following form:

<vertex name> : <probability>
 
which means that, in the case of a noisy-and (respectively noisy-or) distribution, the considered vertex is assigned the value 1 (respectively 0) with <probability>, given that all its parents other than <vertex name> are equal to 1 (respectively 0)

In the case of a general CPD, the dictionary has items of the following form:

<value vector> : <distribution dictionary>
 
where <value vector> is a tuple of parents' values and the distribution of the considered vertex given <parent list> = <value vector> is defined in <distribution dictionary>, whose items have the following meaning:
<value> : <probability>
 
means that the vertex is assigned <value> with <probability>
None : <probability>
 
means that the vertex is assigned, with <probability>, each of its possible values not specified in a separate item

In addition, the CPD dictionary itself may contain a default item:
None : <probability>
 
meaning that, given <parent list> equal to a value vector not specified in a separate item, the vertex is assigned each of its possible values with <probability>
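
For example, assuming the network was written to a file named output.cpd with the -c option and contains a vertex C (both names are placeholders), the learned structure can be inspected as follows:

net = eval(open("output.cpd").read())
print(net["C"]["pars"])   # parents of C
print(net["C"]["vals"])   # possible values of C
print(net["C"]["cpds"])   # CPD dictionary of C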

2.4.5  Standard output

When BNfinder is executed from the command line with the option -v, it prints out messages describing its current action: loading data, learning the regulators of consecutive vertices and writing the output files. Moreover, after finishing the computations for a vertex, its predicted best parents sets and their scores are reported, and after finishing the computations for all vertices, BNfinder reports the score and structure of the optimal network.

References

[1]
Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[2]
Norbert Dojer. Learning Bayesian Networks Does Not Have to Be NP-Hard. In Rastislav Královic and Paweł Urzyczyn, editors, Proceedings of Mathematical Foundations of Computer Science 2006, pages 305–314. Springer-Verlag, 2006. LNCS 4162.
[3]
Norbert Dojer, Anna Gambin, Bartek Wilczyński, and Jerzy Tiuryn. Applying dynamic Bayesian networks to perturbed gene expression data. BMC Bioinformatics, 7:249, 2006.
[4]
N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. In G. F. Cooper and S. Moral, editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 139–147, 1998.
[5]
David Heckerman, Dan Geiger, and David Maxwell Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, Sep. 1995.
[6]
W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(3):269–293, 1994.
