BNfinder: Exact and efficient method for learning Bayesian networks
Supplementary Materials 
In the present section we give a brief exposition of the algorithm implemented in BNFinder and its computational cost for two generally used scoring criteria: Minimal Description Length and BayesianDirichlet equivalence.
For a fuller treatment, including detailed proofs, we refer the reader to [2].
1 Polynomialtime exact algorithm
A
Bayesian network (BN)
N is a representation of a joint distribution of a set of discrete random variables
X={
X_{1},…,
X_{n}}.
The representation consists of two components:

a directed acyclic graph G=(X,E) encoding conditional (in)dependencies
 a family θ of conditional distributions P(X_{i}Pa_{i}), where
Pa_{i}={Y∈X(Y,X_{i})∈E}
The joint distribution of
X is given by
P(X 
)= 

P(X_{i}Pa_{i})
(1) 
The problem of learning a BN is understood as follows:
given a multiset of
Xinstances
D={
x_{1},…,
x_{N}} find a network graph
G that best matches
D.
The notion of a good match is formalized by means of a
scoring function S(
G:
D) having positive values and minimized for the best matching network.
Thus the point is to find a directed acyclic graph
G with the set of vertices
X minimizing
S(
G:
D).
The BNFinder program is devoted to the case when there is no need to examine the acyclicity of the graph, for example:

When dealing with dynamic Bayesian networks. A dynamic BN describes stochastic evolution of a set of random variables over discretized time. Therefore conditional distributions refer to random variables in neighboring time points. The acyclicity constraint is relaxed, because the ”unrolled” graph (with a copy of each variable in each time point) is always acyclic (see [3] for more details). The following considerations apply to dynamic BNs as well.
 In case of static Bayesian Networks, the user has to supply the
algorithm with a partial ordering of the vertices, restricting the
set of possible edges only to the ones consistent with the
ordering. BNFinder lets the user to divide the set of variables into an
ordered set of disjoint subsets of variables, where edges can only
exist between variables from different subsets and they have to be
consistent with the ordering. If such ordering is not known
beforehand, one can try to run BNFinder with different orderings and
choose a network with the best overall score.
In the sequel we consider some assumptions on the form of a scoring function.
The first one states that
S(
G:
D) decomposes into a sum over the set of random variables of
local scores, depending on the values of a variable and its parents in the graph only.
Assumption 1
S(G:D) = Σ_{i=1}^{n} s(X_{i},Pa_{i}:D_{{Xi}∪Pai}), where D_{Y} denotes the restriction of D to the values of the members of Y⊆X.
When there is no need to examine the acyclicity of the graph, this assumption allows to compute the parents set of each variable independently.
Thus the point is to find
Pa_{i} minimizing
s(
X_{i},
Pa_{i}:
D
_{{Xi}∪Pai}) for each
i.
Let us fix a dataset
D and a random variable
X.
We denote by
X' the set of potential parents of
X (possibly smaller than
X due to given constraints on the structure of the network).
To simplify the notation we continue to write
s(
Pa) for
s(
X,
Pa:
D
_{{X}∪Pa}).
The following assumption expresses the fact that scoring functions decompose into 2 components:
g penalizing the complexity of a network and
d evaluating the possibility of explaining data by a network.
Assumption 2
s(Pa)=g(Pa)+d(Pa) for some functions g,d:P(X)→R^{+} satisfying Pa⊆Pa'⇒ g(Pa)≤ g(Pa').
This assumption is used in the following algorithm to avoid considering networks with inadequately large component
g.
Algorithm 1

Pa:=∅
 for each P⊆X' chosen according to g(P)

if s(P)<s(Pa) then Pa:=P
 if g(P)≥ s(Pa) then return Pa; stop

In the above algorithm
choosing according to g(P) means choosing increasingly with respect to the value of the component
g of the local score.
Theorem 1
Suppose that the scoring function satisfies Assumptions 12. Then Algorithm 1 applied to each random variable finds an optimal network.
A disadvantage of the above algorithm is that finding a proper subset
P⊆
X' involves computing
g(
P') for all ⊆successors
P' of previously chosen subsets.
It may be avoided when a further assumption is imposed.
Assumption 3
Pa=Pa'⇒ g(Pa)=g(Pa').
The above assumption suggests the notation
g(
Pa)=
g(
Pa).
The following algorithm uses the uniformity of
g to reduce the number of computations of the component
g.
Algorithm 2

Pa:=∅
 for p=1 to n

if g(p)≥ s(Pa) then return Pa; stop
 P=arg min_{{Y⊆X' : Y=p}}s(Y)
 if s(P)<s(Pa) then Pa:=P

Theorem 2
Suppose that the scoring function satisfies Assumptions 13. Then Algorithm 2 applied to each random variable finds an optimal network.
2 Minimal Description Length
The Minimal Description Length (MDL) scoring criterion originates from information theory [
5].
A network
N is viewed here as a model of compression of a dataset
D.
The optimal model minimizes the total length of the description, i.e. the sum of the description length of the model and of the compressed data.
MDL is effectively equivalent to Bayesian Information Criterion (BIC)
(see [
6]),
which approximates Bayesian scores (see next section)
and is also applicable to continuous data.
Let us fix a dataset
D={
x_{1},…,
x_{N}} and a random variable
X.
Recall the decomposition
s(
Pa)=
g(
Pa)+
d(
Pa) of the local score for
X.
In the MDL score
g(
Pa) stands for the length of the description of the local part of the network (i.e. the edges ingoing to
X and the conditional distribution
P(
X
Pa)) and
d(
Pa) is the length of the compressed version of
Xvalues in
D.
Let
k_{Y} denote the cardinality of the set
V_{Y} of possible values of the random variable
Y∈
X.
Thus we have
g(Pa)=Palogn+ 

(k_{X}−1) 

k_{Y} 
where log
N/2 is the number of bits we use for each numeric parameter of the conditional distribution.
This formula satisfies Assumption
2 but fails to satisfy Assumption
3.
Therefore Algorithm
1 can be used to learn an optimal network, but Algorithm
2 cannot.
However, for many applications we may assume that all the random variables attain values from the same set
V of cardinality
k.
In this case we obtain the formula
g(Pa)=Palogn+ 

(k−1)k^{Pa} 
which satisfies Assumption
3.
For simplicity, we continue to work under this assumption.
The general case may be handled in much the same way.
Compression with respect to the network model is understood as follows: when encoding the
Xvalues, the values of
Painstances are assumed to be known.
Thus the optimal encoding length is given by
d(Pa)=N⋅ H(XPa)
where
H(
X
Pa)=−Σ
_{v∈V}Σ
_{v∈VPa}P(
v,
v)log
P(
v
v) is the conditional entropy of
X given
Pa (the distributions are estimated from
D).
Since all the assumptions from the previous section are satisfied, Algorithm
2 may be applied to learn the optimal network.
Let us turn to the analysis of its complexity.
Theorem 3
The worstcase time complexity of Algorithm 2 for the MDL score is O(n^{logk N}Nlog_{k} N).
3 BayesianDirichlet equivalence
The BayesianDirichlet equivalence (BDe) scoring criterion originates from Bayesian statistics [
1].
Given a dataset
D the optimal network structure
G maximizes the
posterior conditional probability
P(
G
D).
We have
P(GD)∝ P(G)P(DG)=P(G 
) 
∫P(DG,θ)P(θG)dθ 
where
P(
G) and
P(θ
G) are
prior probability distributions on graph structures and conditional distributions' parameters, respectively, and
P(
D
G,θ) is evaluated due to (
1).
Heckerman et al. [
4], following Cooper and Herskovits [
1], identified a set of independence assumptions making possible decomposition of the integral in the above formula into a product over
X.
Under this condition, together with a similar one regarding decomposition of
P(
G), the scoring criterion
S(G:D)=−logP(G)−logP(DG)
obtained by taking −log of the above term satisfies Assumption
1.
Moreover, the decomposition
s(
Pa)=
g(
Pa)+
d(
Pa) of the local scores appears as well, with the components
g and
d derived from −log
P(
G) and −log
P(
D
G), respectively.
The distribution
P((
X,
E))∝α
^{E} with a penalty parameter 0<α<1 in general is used as a prior over the network structures.
This choice results in the function
g(Pa)=Palogα^{−1}
satisfying Assumptions
2 and
3.
However, it should be noticed that there are also used priors which satisfy neither Assumption
2 nor
3, e.g.
P(
G)∝α
^{Δ(G,G0)}, where Δ(
G,
G_{0}) is the cardinality of the symmetric difference between the sets of edges in
G and in the prior network
G_{0}.
The
Dirichlet distribution is generally used as a prior over the conditional distributions' parameters.
It yields
d(Pa)=log 
⎛
⎜
⎜
⎜
⎜
⎜
⎝ 





Γ(H_{v,v}) 

Γ(H_{v,v}+N_{v,v}) 

⎞
⎟
⎟
⎟
⎟
⎟
⎠ 
where Γ is the
Gamma function,
N_{v,v} denotes the number of samples in
D with
X=
v and
Pa=
v, and
H_{v,v} is the corresponding
hyperparameter of the Dirichlet distribution.
Setting all the hyperparameters to 1 yields
d(Pa)=log 
⎛
⎜
⎜
⎝ 






⎞
⎟
⎟
⎠ 
= 
= 

⎛
⎜
⎜
⎝ 
log( k−1+v∈VΣN_{v,v})!−log 
(k−1)!
− 

logN_{v,v}! 
⎞
⎟
⎟
⎠ 
where
k=
V.
For simplicity, we continue to work under this assumption (following Cooper and Herskovits [
1]).
The general case may be handled in a similar way.
The following result allows to refine the decomposition of the local score into the sum of the components
g and
d.
Proposition 1
Define d_{min}=Σ_{v∈V}(log(k−1+N_{v})!−log(k−1)!−logN_{v}!), where N_{v} denotes the number of samples in D with X=v.
Then d(Pa)≥ d_{min} for each Pa∈X.
By the above proposition, the decomposition of the local score given by
s(
Pa)=
g'(
Pa)+
d'(
Pa) with the components
g'(
Pa)=
g(
Pa)+
d_{min} and
d'(
Pa)=
d(
Pa)−
d_{min} satisfies all the assumptions required by Algorithm
2.
Let us turn to the analysis of its complexity.
Theorem 4
The worstcase time complexity of Algorithm 2 for the BDe score with the decomposition of the local score given by s(Pa)=g'(Pa)+d'(Pa) is O(n^{Nlogα−1k}N^{2}log_{α−1}k).
4 Continuous variables
Both MDL and BDe scores were originally designed for discrete
variables.
In order to avoid arbitrary discretization of continuous data we
adapted them to deal with continuous variables
directly. In case of MDL our approach is essentially congruent to BIC,
but because we choose specific mixture of Gaussian distributions, the
method is also applicable to BDe. Moreover, it can be applied to
heterogenous data sets joining together discrete and continuous
variables.
The distribution of each continuous variable
X is assumed to be
a mixture of two normal distributions.
The mixture components are considered to be the two possible values
(
low and
high) of a related hidden discrete variable
X'
and
X is viewed as its observable reflection.
Therefore regulatory relationships are learned
for discrete variables rather than continuous ones.
This approach yields the following form of conditional distributions on continuous variables:
P(XPa 
)=



P(XX'=v)P(X'=vPa'=v)P(Pa'=vPa)= 
= 


⎛
⎜
⎜
⎝ 

P(X'=vPa'=v)P(Pa'=vPa) 
⎞
⎟
⎟
⎠ 
P(XX'=v)

Since distributions
P(
X
X'=
v) are Gaussian,
P(
X
Pa) is a Gaussian mixture with parameters dependent on values of
Pa.
In a preprocessing step parameters of
P(
X
X') are estimated separately
for each variable
X.
Estimation is based on data clustering with the
kmeans algorithm
(
k=2, cutting the set of variable values in the median yields initial clusters).
The parameters enable us to calculate
P(
X
X') as well as
P(
X'
X),
and consequently
P(
Pa'
Pa).
Thus the space of possible conditional distributions on continuous variables
forms a family of Gaussian mixtures, parameterized by
P(
X'=
v
Pa'=
v),
the only free parameters in
P(
X
Pa) and at the same time the parameters
of conditional distributions on corresponding discrete variables.
From a technical point of view,
BNFinder learns optimal network structures for these discrete variables,
using scoring functions adapted to handle
distributions on variable values instead of their determined values
(expected values of original scores are computed).
For continuous variables it gives optimal Bayesian networks from among
all networks with conditional probability distributions
belonging to the family of Gaussian mixtures defined above.
References
 [1]

Gregory F. Cooper and Edward Herskovits.
A Bayesian method for the induction of probabilistic networks from
data.
Machine Learning, 9:309–347, 1992.
 [2]

Norbert Dojer.
Learning Bayesian Networks Does Not Have to Be NPHard.
In Rastislav Královic and PaweUrzyczyn, editors, Proceedings of Mathematical Foundations of Computer Science 2006, pages
305–314. SpringerVerlag, 2006.
LNCS 4162.
 [3]

N Friedman, K Murphy, and S Russell.
Learning the structure of dynamic probabilistic networks.
In G F Cooper and S Moral, editors, Proceedings of the
Fourteenth Conference on Uncertainty in Artificial Inteligence, pages
139–147, 1998.
 [4]

David Heckerman, Dan Geiger, and David Maxwell Chickering.
Learning Bayesian networks: The combination of knowledge and
statistical data.
Machine Learning, 20(3):197–243, Sep. 1995.
 [5]

W. Lam and F. Bacchus.
Learning Bayesian belief networks: An approach based on the MDL
principle.
Computational Intelligence, 10(3):269–293, 1994.
 [6]

Richard E. Neapolitan.
Learning Bayesian Networks.
Prentice Hall, 2003.