BNfinder: Exact and efficient method for learning Bayesian networks
In the present section we give a brief exposition of the algorithm implemented in BNfinder and its computational cost for two generally used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence. For a fuller treatment, including detailed proofs, we refer the reader to [2].
A Bayesian network (BN) N is a representation of a joint distribution of a set of discrete random variables X={X1,…,Xn}. The representation consists of two components: a directed acyclic graph G=(X,E) (the network structure) and a family of conditional distributions P(Xi|Pai), where

Pai={Y∈X|(Y,Xi)∈E}
The joint distribution of X is given by

P(X) = ∏i=1…n P(Xi|Pai)     (1)
The problem of learning a BN is understood as follows: given a multiset of X-instances D={x1,…,xN}, find a network graph G that best matches D. The notion of a good match is formalized by means of a scoring function S(G:D), which takes positive values and is minimized for the best matching network. Thus the point is to find a directed acyclic graph G with the set of vertices X minimizing S(G:D).
The BNfinder program is devoted to the case when there is no need to examine the acyclicity of the graph, for example when learning dynamic Bayesian networks, in which edges always point from one time step to the next, or when user-specified structural constraints already guarantee acyclicity.
In the sequel we consider some assumptions on the form of a scoring function. The first one states that S(G:D) decomposes into a sum of local scores over the set of random variables, where each local score depends only on the values of one variable and of its parents in the graph.
When there is no need to examine the acyclicity of the graph, this assumption makes it possible to compute the parents set of each variable independently. Thus the point is to find Pai minimizing s(Xi,Pai:D|{Xi}∪Pai) for each i.
Let us fix a dataset D and a random variable X. We denote by X' the set of potential parents of X (possibly smaller than X due to given constraints on the structure of the network). To simplify the notation we continue to write s(Pa) for s(X,Pa:D|{X}∪Pa).
The following assumption expresses the fact that scoring functions decompose into two components: g, penalizing the complexity of a network, and d, evaluating how well the network explains the data.
This assumption is used in the following algorithm to avoid considering networks with an excessively large component g.
In the above algorithm, choosing according to g(P) means considering the subsets P in order of increasing value of the component g of the local score.
A disadvantage of the above algorithm is that finding a proper subset P⊆X' involves computing g(P') for all ⊆-successors P' of previously chosen subsets. It may be avoided when a further assumption is imposed.
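The pruning at the heart of Algorithm 1 can be sketched in Python. This is an illustrative rendering, not the BNfinder implementation: g and d are supplied as functions, candidates plays the role of X', and we assume (as in the decomposition above) that d is non-negative, so g(P) is a lower bound on the full score s(P).

```python
from itertools import combinations

def learn_parents(candidates, g, d):
    """Sketch of the pruned parent-set search: visit subsets in order
    of increasing g(P); once g(P) alone reaches the best full score
    found so far, no later subset can improve on it (since d >= 0
    implies s(P) = g(P) + d(P) >= g(P)), so the search stops."""
    subsets = [set(c) for r in range(len(candidates) + 1)
               for c in combinations(candidates, r)]
    subsets.sort(key=g)               # order of increasing complexity penalty

    best, best_score = set(), g(set()) + d(set())
    for P in subsets:
        if g(P) >= best_score:        # pruning step of Algorithm 1
            break
        score = g(P) + d(P)
        if score < best_score:
            best, best_score = P, score
    return best
```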
The above assumption suggests the notation ĝ(|Pa|)=g(Pa). The following algorithm uses the uniformity of g to reduce the number of computations of the component g.
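With a uniform penalty the search can proceed by cardinality alone, which is the essence of Algorithm 2. Again an illustrative Python sketch (not BNfinder code), with g_hat playing the role of ĝ:

```python
from itertools import combinations

def learn_parents_uniform(candidates, g_hat, d):
    """Sketch of the search when g is uniform, i.e. g(Pa) = g_hat(|Pa|).

    Subsets are scanned by increasing cardinality p; as soon as the
    penalty g_hat(p) reaches the best score found so far, every parent
    set of size >= p can be discarded without evaluating d at all."""
    best, best_score = set(), g_hat(0) + d(set())
    for p in range(1, len(candidates) + 1):
        if g_hat(p) >= best_score:    # no set of size >= p can improve
            break
        for c in combinations(candidates, p):
            score = g_hat(p) + d(set(c))
            if score < best_score:
                best, best_score = set(c), score
    return best
```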
The Minimal Description Length (MDL) scoring criterion originates from information theory [6]. A network N is viewed here as a model of compression of a dataset D. The optimal model minimizes the total length of the description, i.e. the sum of the description length of the model and of the compressed data.
Let us fix a dataset D={x1,…,xN} and a random variable X. Recall the decomposition s(Pa)=g(Pa)+d(Pa) of the local score for X. In the MDL score g(Pa) stands for the length of the description of the local part of the network (i.e. the edges ingoing to X and the conditional distribution P(X|Pa)) and d(Pa) is the length of the compressed version of X-values in D.
Let kY denote the cardinality of the set VY of possible values of the random variable Y∈X. Thus we have
g(Pa) = |Pa| log n + (log N / 2) (kX − 1) ∏Y∈Pa kY

where (log N)/2 is the number of bits we use for each numeric parameter of the conditional distribution. This formula satisfies Assumption 2 but fails to satisfy Assumption 3. Therefore Algorithm 1 can be used to learn an optimal network, but Algorithm 2 cannot.
However, for many applications we may assume that all the random variables attain values from the same set V of cardinality k. In this case we obtain the formula
g(Pa) = |Pa| log n + (log N / 2) (k − 1) k^|Pa|
which satisfies Assumption 3. For simplicity, we continue to work under this assumption. The general case may be handled in much the same way.
Compression with respect to the network model is understood as follows: when encoding the X-values, the values of Pa-instances are assumed to be known. Thus the optimal encoding length is given by
d(Pa) = N · H(X|Pa)

where H(X|Pa) = −∑v∈V ∑w∈VPa P(v,w) log P(v|w) is the conditional entropy of X given Pa (the distributions are estimated from D), with VPa denoting the set of possible Pa-instances.
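Both components of the MDL local score can be computed directly from a dataset. The following sketch is our own illustration (the function and argument names are not from BNfinder); it works under the simplifying assumption of a common k-element value set and uses base-2 logarithms:

```python
from collections import Counter
from math import log2

def mdl_local_score(child_vals, parent_cols, n_vars, k):
    """MDL local score s(Pa) = g(Pa) + d(Pa) for one variable X.

    child_vals: list of observed X-values; parent_cols: one value list
    per parent in Pa; n_vars: number n of variables in the network;
    k: common cardinality of the value sets."""
    N = len(child_vals)
    p = len(parent_cols)
    # g(Pa): description length of the local part of the network
    g = p * log2(n_vars) + (log2(N) / 2) * (k - 1) * k ** p
    # d(Pa) = N * H(X|Pa): optimal encoding length of the X-values,
    # computed from joint counts N_{v,w} and marginal counts N_w
    joint = Counter(zip(child_vals, *parent_cols))
    marg = Counter(zip(*parent_cols)) if p else Counter({(): N})
    d = -sum(c * log2(c / marg[vw[1:]]) for vw, c in joint.items())
    return g + d
```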
Since all the assumptions from the previous section are satisfied, Algorithm 2 may be applied to learn the optimal network. Let us turn to the analysis of its complexity.
The Bayesian-Dirichlet equivalence (BDe) scoring criterion originates from Bayesian statistics [1]. Given a dataset D the optimal network structure G maximizes the posterior conditional probability P(G|D). We have
P(G|D) ∝ P(G) P(D|G) = P(G) ∫ P(D|G,θ) P(θ|G) dθ
where P(G) and P(θ|G) are prior probability distributions on graph structures and conditional distributions' parameters, respectively, and P(D|G,θ) is evaluated due to (1).
Heckerman et al. [5], following Cooper and Herskovits [1], identified a set of independence assumptions that make it possible to decompose the integral in the above formula into a product over X. Under this condition, together with a similar one regarding the decomposition of P(G), the scoring criterion
S(G:D)=−logP(G)−logP(D|G) |
obtained by taking −log of the above term satisfies Assumption 1. Moreover, the decomposition s(Pa)=g(Pa)+d(Pa) of the local scores appears as well, with the components g and d derived from −logP(G) and −logP(D|G), respectively.
The distribution P((X,E)) ∝ α^|E| with a penalty parameter 0<α<1 is commonly used as a prior over the network structures. This choice results in the function

g(Pa) = |Pa| log α^−1
satisfying Assumptions 2 and 3.
However, it should be noted that some priors in use satisfy neither Assumption 2 nor 3, e.g. P(G) ∝ α^Δ(G,G0), where Δ(G,G0) is the cardinality of the symmetric difference between the sets of edges in G and in the prior network G0.
The Dirichlet distribution is generally used as a prior over the conditional distributions' parameters. It yields
d(Pa) = log ∏w∈VPa ( Γ(∑v∈V (Nv,w + Hv,w)) / Γ(∑v∈V Hv,w) · ∏v∈V Γ(Hv,w) / Γ(Nv,w + Hv,w) )

where Γ is the Gamma function, VPa denotes the set of possible Pa-instances, Nv,w denotes the number of samples in D with X=v and Pa=w, and Hv,w is the corresponding hyperparameter of the Dirichlet distribution.
Setting all the hyperparameters to 1 yields
d(Pa) = log ∏w∈VPa ( (k − 1 + ∑v∈V Nv,w)! / ( (k−1)! ∏v∈V Nv,w! ) )
      = ∑w∈VPa ( log(k − 1 + ∑v∈V Nv,w)! − log(k−1)! − ∑v∈V log Nv,w! )
where k=|V|. For simplicity, we continue to work under this assumption (following Cooper and Herskovits [1]). The general case may be handled in a similar way.
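The factorial form above is best evaluated in log-space to avoid overflow. The following sketch (our own illustration, not BNfinder code) uses the identity log m! = lgamma(m+1) with natural logarithms:

```python
from collections import Counter
from math import lgamma

def bde_local_d(child_vals, parent_cols, k):
    """BDe data-fit component d(Pa) with all Dirichlet hyperparameters
    set to 1 (the Cooper-Herskovits variant), in natural logarithms.

    child_vals: observed X-values; parent_cols: one value list per
    parent in Pa; k: common cardinality of the value sets."""
    rows = list(zip(*parent_cols)) if parent_cols else [()] * len(child_vals)
    N_vw = Counter(zip(rows, child_vals))   # joint counts N_{v,w}
    N_w = Counter(rows)                     # marginal counts sum_v N_{v,w}
    logfact = lambda m: lgamma(m + 1)       # log m!
    d = 0.0
    for w, Nw in N_w.items():
        d += logfact(k - 1 + Nw) - logfact(k - 1)
        d -= sum(c for (w2, _), c2 in [] ) if False else \
             sum(logfact(c) for (w2, _v), c in N_vw.items() if w2 == w)
    return d
```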
The following result allows us to refine the decomposition of the local score into the sum of the components g and d.
By the above proposition, the decomposition of the local score given by s(Pa)=g'(Pa)+d'(Pa) with the components g'(Pa)=g(Pa)+dmin and d'(Pa)=d(Pa)−dmin satisfies all the assumptions required by Algorithm 2. Let us turn to the analysis of its complexity.
The BNF software uses setuptools, the standard library for packaging Python software. After downloading the archive containing the current version of BNF, extract it to a directory of your choice. On Unix-like systems you can do it by typing
tar -xzf bnf-0.1.tgz
Once you have the sources extracted, the installation is performed by a single command
python setup.py install
in the source directory (it may require administrator privileges).
This installs the BNfinder library to an appropriate location for your Python interpreter, together with a bnf script which may be accessed from the command line.
BNfinder can be executed by typing
> bnf <options>
The following options are available:
The learning data must be passed to BNfinder in a text file split into three parts: a preamble, an experiment specification and experiment data. The preamble allows the user to specify features of the data and/or network, while the next two parts contain the learning data, essentially formatted as a table with space- or tab-separated values.
The preamble allows specifying experiment perturbations, structural constraints, vertex value types, vertex CPD types and edge weights. Each line in the preamble has the following form:
#<command> <arguments>
Experiments with perturbed values of some vertices carry no information regarding their regulatory mechanism. Thus, including such experiments' data when learning the parents of their perturbed vertices biases the result (see [3] for a detailed treatment). The following command handles perturbations:
One possible way of specifying structural constraints with BNfinder is to list potential parents of particular vertices. An easier method is available for constraints of the cascade form, where the vertex set is split into a sequence of groups and each parent of a vertex must belong to one of the previous groups (a simple but extremely useful example is a cascade with two groups: regulators and regulatees). There are two commands specifying structural constraints:
Note that structural constraints forcing the network's acyclicity are necessary for learning a static Bayesian network with BNfinder.
Vertex value types may be specified with the following commands:
Values in <value list> may be integers or words (strings without whitespace). When some vertices are left unspecified, BNfinder tries to recognize their possible value sets. However, it may guess wrongly, in particular when some float numbers are written in integer format or when some possible values are not represented in the dataset (note that the size of the set of possible values affects the score).
The space of possible CPDs of some vertices given their parents may be restricted to noisy-and or noisy-or distributions. In this case, the sets of possible values of these vertices and their potential parents must be either {0,1} or float numbers. Moreover, BNfinder should be executed with the MDL scoring criterion. The following commands specify vertices with noisy CPDs:
The following commands set prior weights on network edges:
Weights must be positive float numbers. Edges with greater weights are penalized more heavily. The default weight is 1.
The experiment specification has the following form:
<name> <experiment list>
where <name> is a word starting with a symbol other than #. The form of experiment names depends on the data type and, consequently, on the type of learned network:
Each line of the experiment data part has the following form:
<vertex> <value list>
where <vertex> is a word and values are listed in the order corresponding to <experiment list>.
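A complete input file combining the three parts might look as follows. This example is illustrative only: the vertex names, experiment names and values are made up, and the #regulators constraint (restricting the potential parents of all remaining vertices to A and B) should be checked against the list of preamble commands supported by your BNfinder version.

```
#regulators A B
conditions exp1 exp2 exp3 exp4
A 0 0 1 1
B 0 1 0 1
C 0 1 1 1
```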
The SIF (Simple Interaction File) format, usually contained in files with the .sif extension, is the simplest of the supported formats and carries only information on the topology of the network. In this format, each line represents a single interaction; in our case such an interaction represents the fact that one variable depends on another. Each line contains three values:
To show it by example, the file:

A + B
B - C

describes a network of the following shape:

A →+ B →− C.
The main advantage of this format is that it can be read by the Cytoscape (http://cytoscape.org) software allowing for quick visualization. It is also trivial to use such data in one's own software.
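As an illustration of that last point, a few lines of Python suffice to read the format (a minimal sketch handling exactly the three-column lines described here):

```python
def parse_sif(text):
    """Parse SIF text where each line is '<source> <interaction> <target>';
    returns the interactions as a list of (source, sign, target) triples.
    Lines that do not have exactly three whitespace-separated fields
    are skipped."""
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            src, sign, dst = parts
            edges.append((src, sign, dst))
    return edges
```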
Bayesian Interchange Format (BIF) is a simple text format. It is supported by some BN applications (e.g. JavaBayes, Bayes Networks Editor), and may be easily converted to other popular formats (including XML formats and the BNT format of K. Murphy's Bayes Net Toolbox). BNfinder writes learned networks in BIF version 0.15.
A network saved in <file> as a dictionary may be loaded to your Python environment by
eval(open(<file>).read())
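The one-liner above can be wrapped in a small helper. Note that eval executes whatever the file contains, so apply it only to files you trust:

```python
def load_network(path):
    """Load a network written in BNfinder's Python-dictionary output
    format: the file holds a single dict literal, evaluated as-is.
    Caution: eval runs arbitrary code, so only use trusted files."""
    with open(path) as f:
        return eval(f.read())
```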
The dictionary consists of items corresponding to all of the network's vertices. Each item has the following form:
<vertex name> : <vertex dictionary>
Vertex dictionaries have the following items:
The form of the vertex CPD dictionary depends on the vertex type. In the case of noisy CPD, the dictionary items have the following form:
In the case of general CPD, the dictionary has items of the following form: