BNfinder: Exact and efficient method for learning Bayesian networks
Supplementary Materials
In this section we give a brief exposition of the algorithm implemented in BNFinder and of its computational cost for two commonly used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence.
For a fuller treatment, including detailed proofs, we refer the reader to [2].
1 Polynomial-time exact algorithm
A Bayesian network (BN) N is a representation of a joint distribution of a set of discrete random variables X={X1,…,Xn}.
The representation consists of two components:

- a directed acyclic graph G=(X,E) encoding conditional (in-)dependencies,
- a family θ of conditional distributions P(X_i|Pa_i), where Pa_i = {Y∈X | (Y,X_i)∈E}.
The joint distribution of X is given by

P(X_1,…,X_n) = ∏_{i=1}^{n} P(X_i|Pa_i).     (1)
The problem of learning a BN is understood as follows:
given a multiset of X-instances D={x1,…,xN} find a network graph G that best matches D.
The notion of a good match is formalized by means of a scoring function S(G:D), which takes positive values and is minimized by the best matching network.
Thus the point is to find a directed acyclic graph G with the set of vertices X minimizing S(G:D).
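To make the factorization (1) and the role of the conditional distributions concrete, here is a minimal sketch in Python (a hypothetical toy network, not BNFinder code): a BN is stored as a parent map plus conditional probability tables, and the joint probability of a complete assignment is the product of the local terms.

```python
# Minimal sketch (hypothetical example, not BNFinder code): a BN as a parent
# map plus conditional probability tables (CPTs); the joint probability of a
# complete assignment is computed via the factorization (1).

parents = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]}

# CPTs: cpt[var][parent-value tuple][value] = P(var = value | parents = tuple)
cpt = {
    "X1": {(): {0: 0.6, 1: 0.4}},
    "X2": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "X3": {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
           (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}},
}

def joint_probability(assignment):
    """P(X1=x1,...,Xn=xn) = prod_i P(Xi = xi | Pai), cf. equation (1)."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[y] for y in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p

print(joint_probability({"X1": 1, "X2": 0, "X3": 1}))  # 0.4 * 0.2 * 0.6
```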
BNFinder handles the cases in which the acyclicity of the graph need not be verified, for example:

- When dealing with dynamic Bayesian networks. A dynamic BN describes the stochastic evolution of a set of random variables over discretized time, so the conditional distributions refer to random variables in neighboring time points. The acyclicity constraint is relaxed, because the "unrolled" graph (with a copy of each variable in each time point) is always acyclic (see [3] for more details). The following considerations apply to dynamic BNs as well.
- In the case of static Bayesian networks, the user has to supply the algorithm with a partial ordering of the vertices, restricting the set of possible edges to the ones consistent with the ordering. BNFinder lets the user divide the set of variables into an ordered set of disjoint subsets, where edges may only connect variables from different subsets and have to be consistent with the ordering (see the sketch below). If such an ordering is not known beforehand, one can run BNFinder with different orderings and choose the network with the best overall score.
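The ordering constraint for static networks can be expressed as a simple admissibility check. The following sketch (hypothetical helper names, not BNFinder's actual API or input format) allows an edge only when its source lies in a strictly earlier subset than its target, and derives the set X' of potential parents from that.

```python
# Hypothetical sketch of the structural constraint used for static networks:
# variables are divided into an ordered list of disjoint subsets, and an edge
# Y -> X is admissible only if Y's subset precedes X's subset.

ordering = [{"A", "B"}, {"C"}, {"D", "E"}]  # ordered disjoint subsets

def tier(variable):
    """Index of the subset containing the variable."""
    for i, subset in enumerate(ordering):
        if variable in subset:
            return i
    raise ValueError(f"unknown variable: {variable}")

def edge_allowed(source, target):
    """Edges may only go from an earlier subset to a strictly later one."""
    return tier(source) < tier(target)

def potential_parents(variable, variables):
    """X' = the set of admissible parents of a variable."""
    return {y for y in variables if edge_allowed(y, variable)}

all_vars = {"A", "B", "C", "D", "E"}
print(potential_parents("C", all_vars))  # {'A', 'B'}
print(edge_allowed("C", "A"))            # False: inconsistent with ordering
```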
In the sequel we state some assumptions on the form of the scoring function.
The first one says that S(G:D) decomposes into a sum, over the set of random variables, of local scores that depend only on the values of a variable and of its parents in the graph.
Assumption 1
S(G:D) = Σ_{i=1}^{n} s(X_i, Pa_i : D|_{{X_i}∪Pa_i}), where D|_Y denotes the restriction of D to the values of the members of Y ⊆ X.
When there is no need to verify the acyclicity of the graph, this assumption allows the parent set of each variable to be computed independently.
Thus the task is to find, for each i, the set Pa_i minimizing s(X_i, Pa_i : D|_{{X_i}∪Pa_i}).
Let us fix a dataset D and a random variable X.
We denote by X' the set of potential parents of X (possibly a proper subset of X due to the given constraints on the structure of the network).
To simplify the notation we write s(Pa) for s(X, Pa : D|_{{X}∪Pa}).
The following assumption expresses the fact that the scoring function decomposes into two components: g, penalizing the complexity of a network, and d, evaluating how well the network explains the data.
Assumption 2
s(Pa)=g(Pa)+d(Pa) for some functions g,d:P(X)→R+ satisfying Pa⊆Pa'⇒ g(Pa)≤ g(Pa').
This assumption is used in the following algorithm to avoid considering networks with inadequately large component g.
Algorithm 1
- Pa := ∅
- for each P ⊆ X' chosen according to g(P):
  - if s(P) < s(Pa) then Pa := P
  - if g(P) ≥ s(Pa) then return Pa; stop
In the above algorithm choosing according to g(P) means choosing the sets P in order of non-decreasing values of the component g of the local score.
Theorem 1
Suppose that the scoring function satisfies Assumptions 1-2. Then Algorithm 1 applied to each random variable finds an optimal network.
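A direct transcription of Algorithm 1 into Python might look as follows; the enumeration of candidate sets in order of non-decreasing g is left abstract, and the function names are illustrative rather than BNFinder's.

```python
# Sketch of Algorithm 1 for a single variable X. `candidates_by_g` must yield
# the subsets P of the potential parents X' in order of non-decreasing g(P);
# g and d are the two components of the local score, s(P) = g(P) + d(P).

def find_parents_alg1(candidates_by_g, g, d):
    def s(P):
        return g(P) + d(P)

    best = frozenset()           # Pa := empty set
    best_score = s(best)
    for P in candidates_by_g:    # P subset of X', in order of increasing g(P)
        score = s(P)
        if score < best_score:   # if s(P) < s(Pa) then Pa := P
            best, best_score = P, score
        if g(P) >= best_score:   # if g(P) >= s(Pa) then return Pa; stop
            return best
    return best
```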
A disadvantage of the above algorithm is that selecting the next subset P ⊆ X' involves computing g(P') for all ⊆-successors P' of the previously chosen subsets.
This may be avoided when a further assumption is imposed.
Assumption 3
|Pa|=|Pa'|⇒ g(Pa)=g(Pa').
The above assumption suggests the notation g(|Pa|)=g(Pa).
The following algorithm uses the uniformity of g to reduce the number of computations of the component g.
Algorithm 2
- Pa := ∅
- for p = 1 to n:
  - if g(p) ≥ s(Pa) then return Pa; stop
  - P := argmin_{Y⊆X', |Y|=p} s(Y)
  - if s(P) < s(Pa) then Pa := P
Theorem 2
Suppose that the scoring function satisfies Assumptions 1-3. Then Algorithm 2 applied to each random variable finds an optimal network.
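Under Assumption 3 the candidates can simply be scanned by cardinality. The sketch below is a straightforward transcription of Algorithm 2 (itertools.combinations stands in for the actual enumeration; names are illustrative, not BNFinder's).

```python
from itertools import combinations

def find_parents_alg2(potential_parents, g_of_size, d):
    """Algorithm 2 for one variable: g depends only on |Pa| (Assumption 3),
    so candidate sets are scanned by increasing cardinality p."""
    def s(P):
        return g_of_size(len(P)) + d(P)

    best = frozenset()                        # Pa := empty set
    best_score = s(best)
    for p in range(1, len(potential_parents) + 1):
        if g_of_size(p) >= best_score:        # if g(p) >= s(Pa): stop
            return best
        # P := argmin over subsets Y of X' with |Y| = p of s(Y)
        P = min((frozenset(c) for c in combinations(potential_parents, p)),
                key=s)
        if s(P) < best_score:                 # if s(P) < s(Pa) then Pa := P
            best, best_score = P, s(P)
    return best
```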
2 Minimal Description Length
The Minimal Description Length (MDL) scoring criterion originates from information theory [5].
A network N is viewed here as a model of compression of a dataset D.
The optimal model minimizes the total length of the description, i.e. the sum of the description length of the model and of the compressed data.
MDL is effectively equivalent to Bayesian Information Criterion (BIC)
(see [6]),
which approximates Bayesian scores (see next section)
and is also applicable to continuous data.
Let us fix a dataset D={x1,…,xN} and a random variable X.
Recall the decomposition s(Pa)=g(Pa)+d(Pa) of the local score for X.
In the MDL score g(Pa) stands for the length of the description of the local part of the network (i.e. the edges incoming to X and the conditional distribution P(X|Pa)) and d(Pa) is the length of the compressed version of the X-values in D.
Let kY denote the cardinality of the set VY of possible values of the random variable Y∈X.
Thus we have
g(Pa) = |Pa| log n + (log N / 2) (k_X − 1) ∏_{Y∈Pa} k_Y

where (log N)/2 is the number of bits we use for each numeric parameter of the conditional distribution.
This formula satisfies Assumption 2 but fails to satisfy Assumption 3.
Therefore Algorithm 1 can be used to learn an optimal network, but Algorithm 2 cannot.
However, for many applications we may assume that all the random variables attain values from the same set V of cardinality k.
In this case we obtain the formula
g(Pa) = |Pa| log n + (log N / 2) (k − 1) k^{|Pa|}
which satisfies Assumption 3.
For simplicity, we continue to work under this assumption.
The general case may be handled in much the same way.
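Under the common value-set assumption the structure component g depends on |Pa| alone, so it reduces to a one-line function. A sketch, assuming base-2 logarithms (description lengths in bits):

```python
from math import log2

def mdl_g(p, n, k, N):
    """MDL structure penalty for a parent set of size p (Assumption 3 form):
    g(p) = p*log n + (log N / 2) * (k - 1) * k**p, with base-2 logarithms
    (description lengths in bits)."""
    return p * log2(n) + (log2(N) / 2) * (k - 1) * k ** p
```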
Compression with respect to the network model is understood as follows: when encoding the X-values, the values of Pa-instances are assumed to be known.
Thus the optimal encoding length is given by
d(Pa)=N⋅ H(X|Pa)
where H(X|Pa) = −Σ_{v∈V} Σ_{v⃗∈V^{Pa}} P(v,v⃗) log P(v|v⃗) is the conditional entropy of X given Pa (the distributions are estimated from D).
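The data component can be estimated directly from counts. The following sketch (illustrative helper names; plug-in empirical estimates of the distributions) computes d(Pa) = N·H(X|Pa) from a list of samples given as dictionaries mapping variables to values.

```python
from collections import Counter
from math import log2

def mdl_d(samples, x_var, parent_vars):
    """d(Pa) = N * H(X | Pa), with the joint and conditional distributions
    estimated empirically from the samples (a list of dicts: variable -> value)."""
    joint = Counter()       # N_{v, vec v}: counts of (X value, parent value tuple)
    pa_counts = Counter()   # N_{vec v}: counts of each parent value tuple
    for sample in samples:
        pa_values = tuple(sample[y] for y in parent_vars)
        joint[(sample[x_var], pa_values)] += 1
        pa_counts[pa_values] += 1
    # N * H(X|Pa) = -sum over (v, vec v) of N_{v,vec v} * log P(v | vec v)
    return -sum(n_vv * log2(n_vv / pa_counts[pa])
                for (v, pa), n_vv in joint.items())
```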
Since all the assumptions from the previous section are satisfied, Algorithm 2 may be applied to learn the optimal network.
Let us turn to the analysis of its complexity.
Theorem 3
The worst-case time complexity of Algorithm 2 for the MDL score is O(n^{log_k N} · N log_k N).
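Roughly, the bound can be traced back to the stopping condition of Algorithm 2 (a hedged sketch of the argument in [2], not a full proof): since d(∅) = N·H(X) ≤ N log k, only parent-set sizes p with g(p) < s(∅) are ever examined, which bounds p logarithmically.

```latex
% Sketch, following [2]: the sizes p examined by Algorithm 2 satisfy g(p) < s(\emptyset), i.e.
p\log n + \tfrac{\log N}{2}(k-1)k^{p} \;<\; \tfrac{\log N}{2}(k-1) + N\log k
\;\Longrightarrow\; \tfrac{\log N}{2}(k-1)\bigl(k^{p}-1\bigr) < N\log k
\;\Longrightarrow\; p = O(\log_{k} N).
% Hence there are O(n^{\log_k N}) candidate parent sets, each scored in O(N \log_k N) time,
% giving O(n^{\log_k N} N \log_k N) per variable.
```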
3 Bayesian-Dirichlet equivalence
The Bayesian-Dirichlet equivalence (BDe) scoring criterion originates from Bayesian statistics [1].
Given a dataset D the optimal network structure G maximizes the posterior conditional probability P(G|D).
We have
P(G|D) ∝ P(G) P(D|G) = P(G) ∫ P(D|G,θ) P(θ|G) dθ
where P(G) and P(θ|G) are prior probability distributions over graph structures and over the parameters of the conditional distributions, respectively, and P(D|G,θ) is evaluated according to (1).
Heckerman et al. [4], following Cooper and Herskovits [1], identified a set of independence assumptions that make it possible to decompose the integral in the above formula into a product over X.
Under these assumptions, together with a similar one regarding the decomposition of P(G), the scoring criterion
S(G:D)=−logP(G)−logP(D|G)
obtained by taking −log of the above term satisfies Assumption 1.
Moreover, the decomposition s(Pa)=g(Pa)+d(Pa) of the local scores appears as well, with the components g and d derived from −logP(G) and −logP(D|G), respectively.
A prior over network structures of the form P(G) ∝ α^{|E|}, where G=(X,E) and 0<α<1 is a penalty parameter, is commonly used.
This choice results in the function
g(|Pa|) = |Pa| log α^{−1}
satisfying Assumptions 2 and 3.
However, it should be noted that priors are also in use which satisfy neither Assumption 2 nor Assumption 3, e.g. P(G) ∝ α^{Δ(G,G_0)}, where Δ(G,G_0) is the cardinality of the symmetric difference between the sets of edges in G and in the prior network G_0.
The Dirichlet distribution is generally used as a prior over the conditional distributions' parameters.
It yields
d(Pa) = log ∏_{v⃗∈V^{Pa}} ( Γ(Σ_{v∈V}(H_{v,v⃗} + N_{v,v⃗})) / Γ(Σ_{v∈V} H_{v,v⃗}) · ∏_{v∈V} Γ(H_{v,v⃗}) / Γ(H_{v,v⃗} + N_{v,v⃗}) )

where Γ is the Gamma function, N_{v,v⃗} denotes the number of samples in D with X=v and Pa=v⃗, and H_{v,v⃗} is the corresponding hyperparameter of the Dirichlet distribution.
Setting all the hyperparameters to 1 yields
d(Pa) = log ∏_{v⃗∈V^{Pa}} ( (k−1+Σ_{v∈V} N_{v,v⃗})! / (k−1)! · ∏_{v∈V} 1/N_{v,v⃗}! ) =

      = Σ_{v⃗∈V^{Pa}} ( log(k−1+Σ_{v∈V} N_{v,v⃗})! − log(k−1)! − Σ_{v∈V} log N_{v,v⃗}! )

where k = |V|.
For simplicity, we continue to work under this assumption (following Cooper and Herskovits [1]).
The general case may be handled in a similar way.
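With all hyperparameters equal to 1 the local data term reduces to factorials, which can be evaluated through the log-Gamma function. A sketch with illustrative names (not BNFinder code); samples are dictionaries mapping variables to values, and natural logarithms are used (the base only rescales the score, provided g uses the same base):

```python
from collections import Counter, defaultdict
from math import lgamma

def bde_d(samples, x_var, parent_vars, k):
    """BDe data term d(Pa) with all Dirichlet hyperparameters equal to 1:
    d(Pa) = sum over parent configurations vec v of
            log(k-1 + sum_v N_{v,vec v})! - log(k-1)! - sum_v log N_{v,vec v}!.
    log x! is computed as lgamma(x + 1); parent configurations absent from the
    data contribute zero, so it suffices to sum over the observed ones."""
    counts = defaultdict(Counter)   # counts[vec v][v] = N_{v, vec v}
    for sample in samples:
        pa_values = tuple(sample[y] for y in parent_vars)
        counts[pa_values][sample[x_var]] += 1
    total = 0.0
    for pa_values, x_counts in counts.items():
        n_pa = sum(x_counts.values())
        total += lgamma(k + n_pa) - lgamma(k)    # log(k-1+sum N)! - log(k-1)!
        total -= sum(lgamma(n + 1) for n in x_counts.values())  # - sum log N!
    return total
```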
The following result allows us to refine the decomposition of the local score into the sum of the components g and d.
Proposition 1
Define d_min = Σ_{v∈V} ( log(k−1+N_v)! − log(k−1)! − log N_v! ), where N_v denotes the number of samples in D with X=v.
Then d(Pa) ≥ d_min for each Pa ⊆ X'.
By the above proposition, the decomposition of the local score given by s(Pa)=g'(Pa)+d'(Pa) with the components g'(Pa)=g(Pa)+dmin and d'(Pa)=d(Pa)−dmin satisfies all the assumptions required by Algorithm 2.
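The refined decomposition only shifts the constant d_min between the two components. A short sketch (illustrative names, reusing the bde_d function from the previous sketch; α is assumed to be the structure-penalty parameter from the prior above):

```python
from collections import Counter
from math import lgamma, log

def d_min(samples, x_var, k):
    """d_min = sum over v of log(k-1+N_v)! - log(k-1)! - log N_v!,
    where N_v is the number of samples with X = v (natural logs, as above)."""
    n_v = Counter(sample[x_var] for sample in samples)
    return sum(lgamma(k + n) - lgamma(k) - lgamma(n + 1) for n in n_v.values())

def refined_score(samples, x_var, parent_vars, k, alpha):
    """s(Pa) = g'(Pa) + d'(Pa), where g'(|Pa|) = |Pa|*log(1/alpha) + d_min and
    d'(Pa) = d(Pa) - d_min; bde_d is the sketch given above."""
    dmin = d_min(samples, x_var, k)
    g_prime = len(parent_vars) * log(1.0 / alpha) + dmin
    d_prime = bde_d(samples, x_var, parent_vars, k) - dmin
    return g_prime + d_prime
```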
Let us turn to the analysis of its complexity.
Theorem 4
The worst-case time complexity of Algorithm 2 for the BDe score with the decomposition of the local score given by s(Pa)=g'(Pa)+d'(Pa) is O(n^{N log_{α⁻¹} k} · N² log_{α⁻¹} k).
4 Continuous variables
Both MDL and BDe scores were originally designed for discrete
variables.
In order to avoid arbitrary discretization of continuous data we
adapted them to deal with continuous variables
directly. In the case of MDL our approach essentially coincides with BIC, but
because we choose a specific mixture of Gaussian distributions, the
method is also applicable to BDe. Moreover, it can be applied to
heterogeneous data sets joining together discrete and continuous
variables.
The distribution of each continuous variable X is assumed to be
a mixture of two normal distributions.
The mixture components are considered to be the two possible values
(low and high) of a related hidden discrete variable X'
and X is viewed as its observable reflection.
Therefore regulatory relationships are learned
for discrete variables rather than continuous ones.
This approach yields the following form of conditional distributions on continuous variables:
P(X|Pa) = Σ_{v} Σ_{v⃗} P(X|X'=v) P(X'=v|Pa'=v⃗) P(Pa'=v⃗|Pa) =

        = Σ_{v} ( Σ_{v⃗} P(X'=v|Pa'=v⃗) P(Pa'=v⃗|Pa) ) P(X|X'=v)

where v ranges over the values of X' and v⃗ over the joint values of Pa'.
Since distributions P(X|X'=v) are Gaussian,
P(X|Pa) is a Gaussian mixture with parameters dependent on values of Pa.
In a preprocessing step parameters of P(X|X') are estimated separately
for each variable X.
Estimation is based on clustering the data with the k-means algorithm
(k=2; initial clusters are obtained by splitting the set of variable values at the median).
The parameters enable us to calculate P(X|X') as well as P(X'|X),
and consequently P(Pa'|Pa).
Thus the space of possible conditional distributions on continuous variables
forms a family of Gaussian mixtures, parameterized by P(X'=v|Pa'=v),
the only free parameters in P(X|Pa) and at the same time the parameters
of conditional distributions on corresponding discrete variables.
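A minimal preprocessing sketch along these lines (illustrative names, not BNFinder's actual implementation): the values of one continuous variable are split at the median to initialize two clusters, refined with two-means, each cluster is summarized by a Gaussian, and the resulting mixture gives the posterior P(X'|X) used to treat the hidden variable.

```python
import statistics
from math import exp, pi, sqrt

def fit_two_component_mixture(values, iterations=100):
    """Estimate parameters of P(X | X') for one continuous variable:
    initial clusters come from splitting at the median, refined with
    k-means (k = 2); each cluster is summarized by a Gaussian."""
    median = statistics.median(values)
    low = [x for x in values if x <= median] or [median]
    high = [x for x in values if x > median] or [median]
    for _ in range(iterations):                       # plain two-means updates
        m_low, m_high = statistics.fmean(low), statistics.fmean(high)
        new_low = [x for x in values if abs(x - m_low) <= abs(x - m_high)]
        new_high = [x for x in values if abs(x - m_low) > abs(x - m_high)]
        if new_low == low:
            break
        low, high = new_low or low, new_high or high
    def gaussian(cluster):
        return statistics.fmean(cluster), statistics.pstdev(cluster) or 1e-6
    weights = (len(low) / len(values), len(high) / len(values))
    return gaussian(low), gaussian(high), weights

def posterior_hidden_state(x, params):
    """P(X' = low | X = x) and P(X' = high | X = x) for the fitted mixture."""
    (m0, s0), (m1, s1), (w0, w1) = params
    def density(x, m, s):
        return exp(-0.5 * ((x - m) / s) ** 2) / (s * sqrt(2 * pi))
    p0, p1 = w0 * density(x, m0, s0), w1 * density(x, m1, s1)
    total = p0 + p1
    if total == 0.0:          # x is far from both components; fall back to 50/50
        return 0.5, 0.5
    return p0 / total, p1 / total
```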
From a technical point of view,
BNFinder learns optimal network structures for these discrete variables,
using scoring functions adapted to handle
distributions over variable values instead of their determined values
(expected values of the original scores are computed).
For continuous variables this yields optimal Bayesian networks from among
all networks with conditional probability distributions
belonging to the family of Gaussian mixtures defined above.
References
[1] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.

[2] Norbert Dojer. Learning Bayesian Networks Does Not Have to Be NP-Hard. In Rastislav Královič and Paweł Urzyczyn, editors, Proceedings of Mathematical Foundations of Computer Science 2006, pages 305–314. Springer-Verlag, 2006. LNCS 4162.

[3] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. In G. F. Cooper and S. Moral, editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 139–147, 1998.

[4] David Heckerman, Dan Geiger, and David Maxwell Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, Sep. 1995.

[5] W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(3):269–293, 1994.

[6] Richard E. Neapolitan. Learning Bayesian Networks. Prentice Hall, 2003.