2 Manual

2.1 Installation

The BNF software uses setuptools, a standard library for packaging Python software. After downloading the archive containing the current version of BNF, extract it to a directory of your choice. On Unix-like systems you can do this by typing

tar -xzf bnf-0.1.tgz

Once you have the sources extracted, the installation is performed by a single command

python setup.py install

in the source directory (this may require administrator privileges).

This installs the BNfinder library in the appropriate location for your Python interpreter, together with a bnf script that can be run from the command line.

2.2 Usage

BNfinder can be executed by typing

> bnf <options>

The following options are available:

-h, --help: print out a brief summary of options and exit
-e, --expr <file>: load learning data from <file> (this option is obligatory)
-s, --score <name>: learn with <name> scoring criterion; possible names are BDE (default) and MDL
-l, --limit <number>: limit the search space to networks with at most <number> parents for each vertex
-i, --suboptimal <number>: compute <number> best scored parents sets for each vertex (default 1)
-d, --data-factor <number>: multiply (each item of) the dataset <number> times (default 1); this option may be used to change the proportion between d and g components of the scoring function (see the definition of the splitting assumption)
-v, --verbose: print progress messages to standard output
-p, --prior-pseudocount <number>: set the pseudocounts of data items with specified values of a vertex and its parents set to <number>/|V|^(|Pa|+1) (resulting in a total pseudocount equal to <number>); this method follows Heckerman et al. [5]. When the option is unspecified, all pseudocounts are set to 1, following Cooper and Herskovits [1]. Pseudocounts are used as hyperparameters of the Dirichlet priors of the BDE scoring criterion and also in the estimation of the conditional probability distributions (CPDs) of the learned network
-n, --net <file>: write the learned network graph to <file> in the SIF format
-t, --txt <file>: write the learned suboptimal parents sets to <file>
-b, --bif <file>: write the learned Bayesian network to <file> in the BIF format
-c, --cpd <file>: write the learned Bayesian network to <file> as a Python dictionary
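
To illustrate how the options combine, the following Python sketch assembles a bnf command line from some of the flags listed above. The helper build_bnf_command and the file names (input.txt, net.sif, parents.txt) are hypothetical; only the option letters come from the list above. The sketch builds the argument list and prints it; it does not run BNfinder.

```python
def build_bnf_command(expr, score="BDE", limit=None, suboptimal=1,
                      net=None, txt=None, verbose=False):
    """Assemble an argument list for the bnf script (hypothetical helper)."""
    if score not in ("BDE", "MDL"):
        raise ValueError("score must be 'BDE' or 'MDL'")
    # -e is obligatory, so it is always present.
    cmd = ["bnf", "-e", expr, "-s", score]
    if limit is not None:
        cmd += ["-l", str(limit)]        # at most <limit> parents per vertex
    if suboptimal != 1:
        cmd += ["-i", str(suboptimal)]   # number of best parents sets per vertex
    if net is not None:
        cmd += ["-n", net]               # SIF output
    if txt is not None:
        cmd += ["-t", txt]               # suboptimal parents sets output
    if verbose:
        cmd.append("-v")
    return cmd


# Hypothetical file names, used only for illustration.
cmd = build_bnf_command("input.txt", score="MDL", limit=3, suboptimal=5,
                        net="net.sif", txt="parents.txt", verbose=True)
print(" ".join(cmd))
```

Once bnf is installed, a list built this way could be passed to subprocess.run(cmd, check=True) from the Python standard library.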

2.3 Input format

The learning data must be passed to BNfinder in a text file split into 3 parts: preamble, experiment specification and experiment data. The preamble allows the user to specify some features of the data and/or network, while the next two parts contain the learning data, essentially formatted as a table with space- or tab-separated values.

2.3.1 Preamble

The preamble allows specifying experiment perturbations, structural constraints, vertex value types, vertex CPD types and edge weights. Each line in the preamble has the following form:

#<command> <arguments>

Experiments in which the values of some vertices were perturbed carry no information about the regulatory mechanism of those vertices. Including the data from such experiments when learning the parents of the perturbed vertices therefore biases the result (see [3] for a detailed treatment). The following command handles perturbations:

#perturbed <experiment/serie> <vertex list>: omit data from the experiment (or series of experiments, in the case of dynamic networks) <experiment/serie> when learning the parents of vertices from <vertex list>

One possible way of specifying structural constraints with BNfinder is to list the potential parents of particular vertices. An easier method is available for constraints of the cascade form, where the vertex set is split into a sequence of groups and each parent of a vertex must belong to one of the previous groups (a simple but extremely useful example is a cascade with 2 groups: regulators and regulatees). There are 2 commands specifying structural constraints:

#parents <vertex> <vertex list>: restrict the set of potential parents of <vertex> to <vertex list>.
#regulators <vertex list>: restrict the set of potential parents of all vertices, except those specified with a #parents command or listed in this or a previous #regulators command, to the vertices listed in <vertex list> of this or a previous #regulators command.

Note that structural constraints forcing the network's acyclicity are necessary for learning a static Bayesian network with BNfinder.
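
The cascade form described above can be sketched in Python as follows. This illustrates only the stated rule (each parent of a vertex must belong to an earlier group); it is not BNfinder's own handling of the #regulators command, and the function name and vertex names are hypothetical.

```python
def cascade_potential_parents(groups):
    """Map each vertex to the vertices allowed as its parents.

    groups is a list of vertex lists in cascade order; a vertex may only
    have parents from groups that come strictly before its own group.
    """
    allowed = {}
    earlier = []  # all vertices from the groups processed so far
    for group in groups:
        for vertex in group:
            allowed[vertex] = list(earlier)
        # Vertices of the same group cannot be parents of each other,
        # so the group is added only after all its members are handled.
        earlier.extend(group)
    return allowed


# Hypothetical two-group cascade: regulators R1, R2 and regulatees G1, G2.
parents = cascade_potential_parents([["R1", "R2"], ["G1", "G2"]])
for vertex in sorted(parents):
    print(vertex, parents[vertex])
```

Because every allowed edge points from an earlier group to a later one, any network that respects such a constraint is acyclic.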

Vertex value types may be specified with the following commands:

#discrete <vertex> <value list>: let <value list> be possible values of <vertex>
#continuous <vertex list>: let float numbers be possible values of all vertices in <vertex list>
#default <value list>: let <value list> be the possible values of all vertices except those specified with a #discrete or #continuous command (when <value list> is FLOAT, float numbers are the possible values)

Values in <value list> may be integers or words (strings without whitespace). When some vertices are left unspecified, BNfinder tries to recognize their possible value sets. However, it may guess incorrectly, in particular when some float numbers are written in integer format or when some possible values are not represented in the dataset (note that the size of the set of possible values affects the score).

The space of possible CPDs of some vertices given their parents may be restricted to noisy-and or noisy-or distributions. In this case, the sets of possible values of these vertices and their potential parents must be either {0,1} or float numbers. Moreover, BNfinder should be executed with the MDL scoring criterion. The following commands specify vertices with noisy CPDs:

#and <vertex list>: restrict the space of possible CPDs of vertices from <vertex list> to noisy-and distributions
#or <vertex list>: restrict the space of possible CPDs of vertices from <vertex list> to noisy-or distributions

The following commands set prior weights on network edges:

#prioredge <vertex> <weight> <vertex list>: set the prior weights of all edges originating from vertices in <vertex list> and pointing to <vertex> to <weight>
#priorvert <weight> <vertex list>: set the prior weights of all edges originating from vertices in <vertex list> (except those specified in a #prioredge command) to <weight>

Weights must be positive float numbers. Edges with greater weights are penalized more heavily. The default weight is 1.

2.3.2 Experiment specification

The experiment specification has the following form:

<name> <experiment list>

where <name> is a word starting with a symbol other than #. The form of the experiment names depends on the data type and, consequently, on the type of the learned network:

When the dataset consists of results of independent experiments and a static Bayesian network is to be learned, experiment names are words without the symbol ’:’.
When the dataset consists of results of time series experiments and a dynamic Bayesian network is to be learned, experiment names have the form <serie>:<condition>. Each series must be ordered according to the condition times and must not be interrupted by experiments from other series.
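
The two naming conventions can be checked with a short Python sketch. The function classify_experiments is hypothetical and not part of BNfinder. It treats a mixture of names with and without ':' as an error, which is an assumption, and it checks only that series are contiguous, not that conditions are ordered in time.

```python
def classify_experiments(names):
    """Return 'static' or 'dynamic' for a list of experiment names."""
    has_colon = [":" in name for name in names]
    if not any(has_colon):
        return "static"
    if not all(has_colon):
        # Assumption: the two naming styles are not meant to be mixed.
        raise ValueError("mixed static and dynamic experiment names")
    seen = set()
    previous = None
    for name in names:
        serie = name.split(":", 1)[0]
        if serie != previous:
            if serie in seen:
                raise ValueError("series %r is interrupted" % serie)
            seen.add(serie)
            previous = serie
    return "dynamic"


# Hypothetical experiment names.
print(classify_experiments(["e1", "e2", "e3"]))
print(classify_experiments(["s1:0", "s1:10", "s2:0", "s2:10"]))
```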

2.3.3 Experiment data

Each line of the experiment data part has the following form:

<vertex> <value list>

where <vertex> is a word and values are listed in the order corresponding to <experiment list>.
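
Putting the three parts together, the following Python sketch produces a small input file in the layout described in this section. The helper format_input, the vertex names A and B, the experiment names and the values are all hypothetical; the preamble commands are taken from section 2.3.1.

```python
def format_input(preamble, spec_name, experiments, data):
    """Build input text: preamble, experiment specification, experiment data."""
    lines = ["#" + command for command in preamble]
    lines.append(" ".join([spec_name] + experiments))
    for vertex, values in data:
        # One value per experiment, in the order of the experiment list.
        if len(values) != len(experiments):
            raise ValueError("wrong number of values for " + vertex)
        lines.append(" ".join([vertex] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


text = format_input(
    preamble=["regulators A", "default 0 1"],
    spec_name="conditions",
    experiments=["e1", "e2", "e3", "e4"],
    data=[("A", [0, 1, 1, 0]), ("B", [1, 0, 0, 1])],
)
print(text)
```

Here the #regulators line makes A the only potential parent of B, which also satisfies the acyclicity requirement for static networks.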

2.4 Output formats

2.4.1 SIF format

SIF (Simple Interaction Format), usually stored in files with the .sif extension, is the simplest of the supported formats and carries only information on the topology of the network. In this format, each line represents a single interaction. In our case, an interaction means that one variable depends on another. Each line contains three values:

parent variable identifier,
type of interaction (currently + or - is reported when a positive or negative correlation between the variables is found; if BNfinder is run with the -i option and reports more than 1 suboptimal network, the edge labels represent the sum of the posterior probabilities of the parent sets that include that edge),
child variable identifier.

To show it by example, the file:

A + B
B - C

describes a network of the following shape:

A →⁺ B →⁻ C.

The main advantage of this format is that it can be read by the Cytoscape (http://cytoscape.org) software allowing for quick visualization. It is also trivial to use such data in one’s own software.
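
As a minimal sketch of reading such a file in one's own software, the following Python function splits each line into the three values described above. It assumes exactly three whitespace-separated fields per line, as in the output described here; the function name parse_sif is hypothetical.

```python
def parse_sif(text):
    """Parse SIF text into a list of (parent, label, child) triples."""
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError("expected 3 fields: %r" % line)
        parent, label, child = fields
        edges.append((parent, label, child))
    return edges


edges = parse_sif("A + B\nB - C\n")

# Collect the parents of each child vertex.
parents = {}
for parent, label, child in edges:
    parents.setdefault(child, []).append(parent)
print(parents)
```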

2.4.2 Suboptimal parents sets

Suboptimal parents sets are written to a file in a simple text format divided into sections, one for each vertex. Each section contains a leading line with the vertex name followed by lines representing its consecutive suboptimal parents sets. Each of these lines has the form:

 <relative probability> <vertex list>

where <relative probability> is the ratio of the set's posterior probability to the posterior probability of the empty parents set and <vertex list> contains the elements of the set. Lines are sorted in decreasing order of <relative probability>.

To show it by example, the section:

C
 2.333333  B
 1.000000 
 0.592593  B A

reports the 3 most probable parents sets of the vertex C: {B}, ∅, {B,A}. Moreover, it states that {B} is 2.333333 times more probable than the empty set and that the corresponding ratio for {B,A} equals 0.592593.
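
A file in this format can be read with a few lines of Python. The sketch below assumes, based on the example above, that lines describing parents sets start with whitespace while vertex name lines do not; the function name parse_parent_sets is hypothetical.

```python
def parse_parent_sets(text):
    """Map each vertex to a list of (relative probability, parents) pairs."""
    result = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: a parents set of the current vertex.
            if current is None:
                raise ValueError("set line before any vertex name")
            fields = line.split()
            result[current].append((float(fields[0]), fields[1:]))
        else:
            # Unindented line: the name of the next vertex.
            current = line.strip()
            result[current] = []
    return result


section = "C\n 2.333333  B\n 1.000000 \n 0.592593  B A\n"
sets = parse_parent_sets(section)
print(sets["C"])
```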

2.4.3 BIF format

Bayesian Interchange Format (BIF) is a simple text format dedicated to Bayesian networks. It is supported in some BN applications (e.g. JavaBayes, Bayes Networks Editor) and may be easily converted with available tools to other popular formats (including XML formats and BNT format of K. Murphy’s Bayes Net Toolbox). BNfinder writes learned networks in BIF version 0.15.

2.4.4 Python dictionary

A network saved in <file> as a dictionary may be loaded into your Python environment by

eval(open(<file>).read())
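
Since eval executes whatever code the file contains, a more cautious alternative is ast.literal_eval from the Python standard library, which accepts only literals. The sketch below assumes that the saved file consists solely of Python literals (dictionaries, tuples, lists, strings, numbers and None); the function name load_network and the one-vertex example content are hypothetical.

```python
import ast
import os
import tempfile


def load_network(path):
    """Read a network dictionary without executing arbitrary code."""
    with open(path) as handle:
        return ast.literal_eval(handle.read())


# Hypothetical file content: a single vertex B with one parent A
# (the entry for A itself is omitted for brevity).
example = ("{'B': {'vals': [0, 1], 'pars': ['A'], "
           "'cpds': {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}}}}")
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(example)
    path = handle.name

network = load_network(path)
os.remove(path)
print(network["B"]["pars"])
```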

The dictionary consists of items corresponding to all network’s vertices. Each item has the following form:

<vertex name> : <vertex dictionary>

Vertex dictionaries have the following items:

'vals' : <value list>
'pars' : <parent list>
'type' : <CPD type> (only for vertices with noisy CPDs; possible values of <CPD type> are 'and' and 'or')
'cpds' : <CPD dictionary>

The form of the vertex CPD dictionary depends on the vertex type. In the case of a noisy CPD, the dictionary items have the following form:

<vertex name> : <probability>: for a noisy-and (respectively noisy-or) distribution, the considered vertex is assigned the value 1 (respectively 0) with <probability>, given that all its parents except <vertex name> are equal to 1 (respectively 0)

In the case of a general CPD, the dictionary has items of the following form:

<value vector> : <distribution dictionary>

where <value vector> is a tuple of parents’ values and the distribution of the considered vertex given <parent list> = <value vector> is defined in <distribution dictionary> in the following way:

<value> : <probability>: means that the vertex is assigned <value> with <probability>
None : <probability>: means that each possible value of the vertex not listed in a separate item is assigned with <probability>

The CPD dictionary may also contain the item

None : <probability>

which means that, when <parent list> equals a value vector not listed in a separate item, the vertex is assigned each of its possible values with <probability>.
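
The lookup rules for a general CPD, including both uses of None, can be sketched in Python as follows. The function probability and the example vertex dictionary are hypothetical and serve only to illustrate the rules stated above.

```python
def probability(vertex_dict, parent_values, value):
    """Return P(vertex = value | parents = parent_values) for a general CPD."""
    cpds = vertex_dict["cpds"]
    key = tuple(parent_values)
    if key in cpds:
        distribution = cpds[key]
        if value in distribution:
            return distribution[value]
        # Value not listed separately: use the shared None entry.
        return distribution[None]
    # Value vector not listed separately: every value has the same probability.
    return cpds[None]


# Hypothetical vertex with three possible values and a single parent A.
vertex = {
    "vals": [0, 1, 2],
    "pars": ["A"],
    "cpds": {
        (0,): {0: 0.8, None: 0.1},  # values 1 and 2 each have probability 0.1
        None: 1.0 / 3,              # used when A takes any value other than 0
    },
}

print(probability(vertex, (0,), 0))
print(probability(vertex, (0,), 2))
print(probability(vertex, (1,), 1))
```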

2.4.5 Standard output

When BNfinder is executed from the command line with the -v option, it prints messages describing its current action: loading data, learning the regulators of consecutive vertices and writing output files. Moreover, after finishing the computations for a vertex, it reports the predicted best parents sets and their scores; after finishing the computations for all vertices, it reports the score and structure of the optimal network.