Learning NAT-Modeled Bayesian Network Structures with Bayesian Approach

We study a Bayesian approach for learning structures of Bayesian networks (BNs) with local models. The local structures we focus on are Non-impeding noisy-AND Tree (NAT) models due to their multiple merits. We extend meta-nets to allow encoding of prior knowledge on NAT local structures and parameters. From the extended meta-nets, we develop a Bayesian Dirichlet (BD) scoring function for evaluating alternative NAT-modeled BN structures. A heuristic algorithm is presented for searching through the structure space that is significantly more complex than that of BN structures without local models. We experimentally demonstrate learning of NAT-modeled BNs, whose inference produces sufficiently accurate posterior marginals and is significantly more efficient.


Introduction
Learning BNs from data is an important task in probabilistic reasoning. BNs avoid combinatorial explosion in the number of variables by encoding conditional independence in graphical structures, but space and inference time grow exponentially in the number of causes per effect due to tabular conditional probability distributions (CPDs). To overcome this limitation of tabular BNs, local models have been applied, such as noisy-OR, noisy-MAX [1], context-specific independence (CSI) [2], DeMorgan [3], tensor decomposition [4], and cancellation [5]. The merits of local models motivate learning BNs with local structures.
We focus on NAT local models [6] due to several merits: simple causal interactions (reinforcement/undermining), expressiveness (recursive mixture of causal interactions; multivalued, ordinal or nominal variables [7]), generality (generalizing noisy-OR, noisy-MAX, and DeMorgan), and orthogonality to CSI. While tabular BN inference is exponential in treewidth, inference is tractable with NAT-modeled BNs of high treewidth and low density. In particular, the space of a tabular BN (measured by the total number of CPD parameters) is O(Ks^n), where K is the number of variables, s bounds the domain sizes of variables, and n bounds the number of causes (parents) per variable. In fully NAT-modeled BNs (see Section 2.2), dependencies of variables on their parents are quantified by NAT models instead of tabular CPDs, resulting in O(Ksn) space. This efficiency extends to inference when NAT-modeled BNs have structures of high treewidth (lower-bounded by n) and low density (measured by the percentage of arcs beyond being singly connected).
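As a rough illustration of the gap between the two bounds, the following sketch (our own hypothetical helper, not from the paper; the NAT count is shown only up to the stated O(sn) order, as the exact constant depends on the NAT encoding) compares total parameter counts:

```java
/** Rough illustration of the space bounds O(Ks^n) vs O(Ksn) discussed above. */
public final class SpaceBounds {

    /** A tabular CPD family stores s^n parent instantiations times s values. */
    static long tabularFamilyParams(int s, int n) {
        return (long) Math.pow(s, n) * s;
    }

    /** A NAT-modeled family stores O(s * n) single-causal parameters
        (the exact constant depends on the NAT encoding). */
    static long natFamilyParams(int s, int n) {
        return (long) s * n;
    }

    public static void main(String[] args) {
        int K = 80, s = 2, n = 12; // sizes matching the paper's experiments
        System.out.println("tabular: " + K * tabularFamilyParams(s, n)); // 655,360
        System.out.println("NAT:     " + K * natFamilyParams(s, n));     // 1,920
    }
}
```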
A large literature exists on learning tabular BNs, e.g., [8-12]. A common method is to combine heuristic search with a scoring function, where MDL [9] and BD [8,10,11] scores are often applied. This work focuses on learning NAT-modeled BNs, due to the merits above. A recent work [13] enables learning NAT-modeled BNs based on MDL scores. The contribution of this work is a novel framework for learning structures of NAT-modeled BNs from data based on (extended) BD scores.
In the remainder, Section 2 reviews background on BD scores for learning tabular BN structures and on NAT-modeled BNs. Section 3 introduces the task of learning NAT-modeled BN structures with BD scores. Sections 4 through 6 present the component BD subscores on likelihood, local structure prior, and global structure prior. Section 7 describes the heuristic search algorithm and analyzes its complexity. The experimental study is reported in Section 8.

BD Scores for Learning Tabular BN Structures

A tabular BN over a set V of variables has a structure G and a collection Θ of parameter sets. G is a directed acyclic graph (DAG) whose nodes are labeled by variables in V. Each x ∈ V and its parents π in G form a family. Dependency of x on π is specified by a set of CPDs, with one CPD Pr(x|π = τ) per instantiation τ of π. Each parameter set θ_{x|τ} ∈ Θ specifies a CPD Pr(x|π = τ) (as domain knowledge).

The Bayesian approach to structure learning integrates prior knowledge (denoted by P()) on G and Θ with data D: P() expresses subjective probabilistic knowledge about the domain knowledge expressed through Pr(). We assume that D has N = |D| records on V, is complete (no missing values), and is exchangeable [10]. Given a candidate structure G, the probability P(G, D) = P(G) P(D|G) is evaluated, where the structure prior P(G) encodes prior knowledge on G.

Likelihood P(D|G) can be evaluated using a meta-net Φ, derived from the base-net G and data D, which integrates prior knowledge on Θ with data D. For each θ_{x|τ} ∈ Θ, there is a variable in Φ, which we denote also by θ_{x|τ}. Prior knowledge on Pr(x|π = τ) is encoded by a probability density function (pdf) ρ(θ_{x|τ}); a Dirichlet prior has the form

  ρ(θ_{x|τ}) = η ∏_{i=1}^{s} (θ_{x_i|τ})^{ψ_{x_i|τ} − 1},

where η is a normalizing constant. The sum ψ_{x|τ} = Σ_{i=1}^{s} ψ_{x_i|τ} is the equivalent sample size. For each record d^i ∈ D (superscripts index records), meta-net Φ contains an instance G^i of the base-net G. For each x ∈ V, besides π from G, x in G^i has extra parents θ_{x|τ}, one per instantiation of π.
Given a base-net G, Dirichlet priors on the parameters in Θ, and complete data D, using the meta-net with D as evidence, likelihood P(D|G) can be evaluated [8,11] as

  P(D|G) = ∏_{x∈V} ∏_{τ} ( Γ(ψ_{x|τ}) / Γ(ψ_{x|τ} + N_{x|τ}) ) ∏_{i=1}^{s} ( Γ(ψ_{x_i|τ} + N_{x_i|τ}) / Γ(ψ_{x_i|τ}) ),

where N_{x_i|τ} counts the records of D with x = x_i and π = τ, and N_{x|τ} = Σ_{i=1}^{s} N_{x_i|τ}. We denote the Bayes estimate of Pr(x = χ | π = τ) given D by θ^be_{χ|τ} = (ψ_{χ|τ} + N_{χ|τ}) / (ψ_{x|τ} + N_{x|τ}), and write θ^be_{x|τ} = {θ^be_{χ|τ}} and θ^be = {θ^be_{x|τ}}.
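Because the hyperparameters enter only through Gamma-function ratios with integer count offsets, each factor can be computed in log space without a log-gamma routine, via Γ(a + N)/Γ(a) = ∏_{j=0}^{N−1} (a + j). A minimal sketch (our hypothetical helper; array index i stands for the value x_i):

```java
final class BdLikelihood {
    /** Log of one family-instantiation factor of P(D|G):
        log [ Gamma(psi)/Gamma(psi+N) * prod_i Gamma(psi_i+N_i)/Gamma(psi_i) ],
        where psi[i] are Dirichlet hyperparameters and counts[i] are data counts. */
    static double logFamilyTauSubscore(double[] psi, int[] counts) {
        double psiSum = 0;
        int nSum = 0;
        for (int i = 0; i < psi.length; i++) { psiSum += psi[i]; nSum += counts[i]; }
        double log = 0;
        for (int j = 0; j < nSum; j++)          // log Gamma(psi)/Gamma(psi+N)
            log -= Math.log(psiSum + j);
        for (int i = 0; i < psi.length; i++)    // log prod_i Gamma(psi_i+N_i)/Gamma(psi_i)
            for (int j = 0; j < counts[i]; j++)
                log += Math.log(psi[i] + j);
        return log;
    }
}
```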
NAT-Modeled BNs

For simplicity, we denote domain knowledge by P() in this subsection (rather than by Pr()). A causal event is a success or failure depending on whether e is active up to a given value, is single- or multi-causal depending on the number of active causes, and is simple or congregate depending on the value range of e. P(e_k ← c_{ij}) = P(e_k | c_{ij}, c_{z0} : ∀z ≠ i) (j > 0) is the probability of a simple single-causal success, and

  P(e ≥ e_k ← c_{1 j_1}, ..., c_{q j_q}) = P(e ≥ e_k | c_{1 j_1}, ..., c_{q j_q}, c_{z0} : ∀c_z ∈ C \ X)

is the probability of a congregate multi-causal success, where j_1, ..., j_q > 0 and X = {c_1, ..., c_q} (q > 1). The latter may be denoted as P(e ≥ e_k ← x^+). Interactions among causes may be reinforcing or undermining, as defined below.

Definition 1. Let e_k be an active effect value, R = {W_1, ..., W_m} (m ≥ 2) be a given partition of a set X ⊆ C of causes, S ⊂ R, and Y = ∪_{W_i ∈ S} W_i. Sets of causes in R reinforce each other relative to e_k, iff ∀S P(e ≥ e_k ← y^+) ≤ P(e ≥ e_k ← x^+). They undermine each other iff ∀S P(e ≥ e_k ← y^+) > P(e ≥ e_k ← x^+).
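As a concrete check of Def. 1 in its simplest case, consider the two-group partition R = {{c_1}, {c_2}} of X = {c_1, c_2}. The sketch below (our hypothetical helper, not from the paper) tests the two conditions given the congregate success probabilities:

```java
final class CausalInteraction {
    /** Def. 1 for the two-cause partition R = {{c1}, {c2}}: each argument is a
        congregate success probability P(e >= ek <- y+) for the named cause set. */
    static boolean reinforce(double pC1, double pC2, double pC1C2) {
        // every proper subset S of R must satisfy P(... <- y+) <= P(... <- x+)
        return pC1 <= pC1C2 && pC2 <= pC1C2;
    }

    static boolean undermine(double pC1, double pC2, double pC1C2) {
        return pC1 > pC1C2 && pC2 > pC1C2;
    }

    // Example: a noisy-OR combination 1 - (1 - 0.6)(1 - 0.7) = 0.88 satisfies
    // reinforce(0.6, 0.7, 0.88), consistent with noisy-OR modeling reinforcement.
}
```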
A NAT consists of one or more Non-Impeding Noisy-AND (NIN-AND) gates. A direct gate involves disjoint sets of causes W_1, ..., W_m. Each input event is a success e ≥ e_k ← w_i^+ (i = 1, ..., m), and the output event is e ≥ e_k ← w_1^+, ..., w_m^+, e.g., Fig. 2 (a); causes connected by a direct gate undermine each other. A dual gate has failure events as its inputs and output, e.g., Fig. 2 (b); causes connected by a dual gate reinforce each other. Fig. 2 (c) shows a NAT, where causes h_1 and h_2 reinforce each other, and so do b_1 and b_2; however, the two groups undermine each other. That is, for gate g_1, each W_i (as in Def. 1) is a general set. See [6] for a formal definition of NATs. From the NAT and the probabilities of its input events, in the general form P(e_k ← c_{ij}) (j, k > 0), called single-causals, P(e ≥ e_1 ← h_{11}, h_{21}, b_{11}, b_{21}) can be obtained.

The leaky cause for an effect e represents all causes of e not explicitly named. A leaky cause may or may not be persistent [1]. A non-persistent leaky cause can be modeled the same way as other causes. A persistent leaky cause is always active and leads to special issues [7].
A BN where the dependencies of some families are specified as NAT models (instead of tabular CPDs) is a NAT-modeled BN. If all families with more than one parent are NAT-modeled, the BN is fully NAT-modeled. A tabular BN has O(Ks^n) space, while a fully NAT-modeled BN has O(Ksn) space. The CPDs of a BN family can be approximated by a NAT model through compression [7]. Hence, a tabular BN can be approximated by a fully NAT-modeled BN.
Inference methods for tabular BNs can be applied to NAT-modeled BNs by converting them into efficient tabular BNs, e.g., by trans-causalization [14]. The inference is tractable when NAT-modeled BNs have high treewidth and low density.

Learning NAT-modeled BN Structures with BD Scores
Learning structures of tabular BNs from data has been actively researched since the 1990s. A widely applied approach is to evaluate each candidate structure by a scoring function, such as MDL or BD, and to limit the exponential structure space by heuristic search. To overcome the limitation of tabular BNs considered in Section 1, learning BNs with local models has also been pursued [15-17]. Work in [18,19] explored local equality conditions such as CSI, with decision trees or decision graphs as local structures, based on MDL or BD scores. Inequality conditions such as those in Def. 1 were explored in learning NAT-modeled BNs based on the MDL score [13].
In this work, we present the first study of structure learning of NAT-modeled BNs based on the Bayesian approach and the BD score. A NAT-modeled BN consists of a global DAG structure G, a local NAT structure L (including NAT topologies for all NAT-modeled families), single-causal parameters for all NAT families, and CPD parameters of the remaining tabular families. Given data D, we evaluate a candidate structure (G, L) by the BD score

  P(G, L | D) = α P(D | G, L) P(L | G) P(G),

where α is the normalizing constant 1/P(D). In the following sections, we consider each of the three components, P(D|G, L), P(L|G), and P(G), which we refer to as the likelihood, the local structure prior, and the global structure prior.
Note that the learned BN is not necessarily fully NAT-modeled: Whether a family in the outcome structure is NAT-modeled or tabular depends on the score and search.

Likelihood
To define the likelihood P(D|G, L) for a NAT-modeled structure (G, L), we extend the meta-net for learning tabular BNs to learning NAT-modeled BNs by representing local NAT models and single-causal parameters. We do so with an example first and then generalize.
[NAT-modeled meta-nets] Consider the base-net G in Fig. 3 (a), where V = {a, b, c, d}, all variables are binary, and data D has size N = 2. Since c has 2 parents in G, L may specify its family to be tabular (tab) or NAT-modeled. If NAT-modeled, it may be a direct NIN-AND gate (di) or a dual gate (du). This local model type is represented in the meta-net by the variable ω_c ∈ {tab, di, du} in Fig. 3 (b). Since (G, L) is given, P(ω_c) consists of extreme values. For instance, if L specifies the c family as a direct gate, then P(ω_c = di) = 1 and P(ω_c = tab) = P(ω_c = du) = 0. If the family of c is tabular, it has 4 CPDs, and the meta-net has 4 corresponding θ nodes (Fig. 3 (b) bottom). If the family of c is NAT-modeled, only two θ nodes are well-defined: θ_{c←a1} = θ_{c|a1,b0} and θ_{c←b1} = θ_{c|a0,b1}. Making ω_c a parent of c allows all such cases to be handled correctly through the CPD P(c | a, b, ω_c, θ_{c|a0,b0}, θ_{c|a0,b1}, θ_{c|a1,b0}, θ_{c|a1,b1}).
In general, for each variable x with 2 or more parents in G, the meta-net has an ω_x variable for its local model type. The domain of ω_x includes the value tab for tabular, and the possible NATs for the x family. For instance, if x has 5 parents in G, its family has 472 possible NAT models, and ω_x has domain size 473. Given (G, L), we have P(ω_x) ∈ {0, 1}.
For each x ∈ V, the meta-net contains as many θ nodes as the meta-net for tabular BNs, one per CPD Pr(x|π = τ). When the x family is NAT-modeled, each necessary θ node maps to a single-causal distribution Pr(x ← c_{ij}), where c_{ij} is an active value of cause c_i, and the remaining θ nodes are superfluous. Since Pr(x ← c_{ij}) = Pr(x|τ), where τ has exactly one active value, we denote the θ node by θ_{x|τ} (instead of θ_{x←c_{ij}}) for consistency with the tabular case. Hence, when the local structure L asserts the x family to be NAT-modeled, only single-causal θ nodes are relevant. This dynamic dependency is effected through the CPD P(x|π, ω_x, ...) in the meta-net. Each θ node has a Dirichlet prior pdf.
[Properties of NAT-modeled meta-nets] We refer to meta-nets for learning tabular BNs as T-meta-nets, and term meta-nets for learning NAT-modeled BNs (defined above) N-meta-nets. N-meta-nets have the following properties.
Theorem 1. Every NAT-modeled BN structure (G, L) over V has a well-defined N-meta-net given complete data D.
Proof: Let (G, L) be a NAT-modeled BN structure and D be the data over V. Initialize the N-meta-net as the T-meta-net for a tabular BN of structure G and data D, which consists of one instance of G for each record of D, a θ node for each CPD of the tabular BN, and corresponding arcs. For each variable v ∈ V, denote the set of θ nodes, one per CPD of v, by Θ_v.
For each x ∈ V with parents π in G, where |π| ≥ 2, add a variable ω_x to the N-meta-net. The domain of ω_x consists of the value tab and one value for each possible NAT topology over the x family. For each copy of x (one per record of D) in the N-meta-net, add ω_x as a parent, set the CPDs P(x|π, ω_x, Θ_x), and set P(ω_x) to be deterministic according to the local model type of x specified by L. The N-meta-net for (G, L) and D is now constructed. [End]

Note that the existence of ω variables allows an N-meta-net to easily switch among all alternative NAT topologies for each NAT family. In other words, the N-meta-net can switch among all local structures L for a given global structure G by modifying the prior distributions of the ω variables. Furthermore, N-meta-nets are used to derive BD scores for NAT-modeled BNs, but are not directly computed during structure learning, as seen below.
Next, we show that parameter independence of T-meta-nets (see, e.g., [11]) also applies to N-meta-nets.The proof utilizes the well-known d-separation [20].
Theorem 2. In an N-meta-net, any two disjoint subsets of θ variables are independent.
Proof: It suffices to show that any two θ nodes are d-separated. Each θ node has only outgoing arcs in the N-meta-net. Hence, any path between two θ nodes u and v cannot be directed. There must be a node x that is head-to-head on the path (→ x ←), which blocks the path, e.g., the path (θ_a, a^1, c^1, b^1, θ_b) in Fig. 3 (b). Since every such path is blocked, u and v are d-separated. [End]

Parameter independence of N-meta-nets also holds conditioned on data D:

Theorem 3. In an N-meta-net, any two disjoint subsets of θ variables are independent conditioned on a complete data set D.
Proof: It suffices to show that any two θ nodes are d-separated conditioned on D. Consider a path between θ nodes u and v, with the remaining nodes on the path being Z ⊂ V. Since D is complete, every z ∈ Z is observed. If there is a node z that is head-to-tail (→ z →) or tail-to-tail (← z →) on the path, then the path is blocked, e.g., (θ_a, a^1, c^1, b^1, d^1, θ_{d|b0}) in Fig. 3 (b).
If no node z ∈ Z on the path is head-to-tail or tail-to-tail, then the path must be u → x ← v, where x is head-to-head. It must be the case that u denotes θ_{x|τ}, v denotes θ_{x|τ′}, and the instantiations τ and τ′ of the parents π of x (in G) differ, e.g., (θ_{c|a0,b0}, c^1, θ_{c|a0,b1}) in Fig. 3 (b). Since at most one of τ and τ′ is consistent with D, at least one arc, u → x or x ← v, can be equivalently removed, and the path disappears.
Since every path between u and v is either blocked or can be equivalently removed, u and v are d-separated. [End]

Note that Theorem 2 still holds if the subsets include ω variables. However, that is not the case for Theorem 3: conditioned on D, a θ node and a related ω node are not d-separated. For instance, the path (ω_c, c^1, θ_{c|a0,b1}) in Fig. 3 (b) is not blocked.

By parameter independence in T-meta-nets, the tabular likelihood P(D|G) decomposes by family:

  P(D|G) = ∏_{x∈V} SS(D, x),

where the subscore for family (x, π) with π = τ is

  SS(D, x, τ) = ( Γ(ψ_{x|τ}) / Γ(ψ_{x|τ} + N_{x|τ}) ) ∏_{i=1}^{s} ( Γ(ψ_{x_i|τ} + N_{x_i|τ}) / Γ(ψ_{x_i|τ}) ),   (4.1)

and the subscore for family (x, π) is

  SS(D, x) = ∏_{τ} SS(D, x, τ).   (4.2)

The decomposability is a direct consequence of parameter independence in T-meta-nets. By Theorem 3 on parameter independence in N-meta-nets, we have decomposability for the NAT-modeled likelihood P(D|G, L):

  P(D|G, L) = ∏_{x∈V} SS(D, x).

As N-meta-nets represent tabular families equivalently to T-meta-nets, the subscore SS(D, x) for a tabular (x, π) family can be evaluated by Eqns. (4.2) and (4.1).
[Subscore of NAT family] From the chain rule,

  P(D|G, L) = ∏_{i=1}^{N} P(d^i | D^{i−1}, G, L) = ∏_{i=1}^{N} Pr_{θ^be_{i−1}}(d^i),

where θ^be_{i−1} denotes the Bayes estimates given D^{i−1}. By the chain rule of BNs, Pr_{θ^be_{i−1}}(d^i) = ∏_{χ,τ ∼ d^i} θ^be_{χ|τ,i−1}, where χ, τ ∼ d^i selects, for each family (x, π), the values (χ, τ) consistent with d^i. From the above, we have

  P(D|G, L) = ∏_{i=1}^{N} ∏_{χ,τ ∼ d^i} θ^be_{χ|τ,i−1} = ∏_{x∈V} ∏_{i=1}^{N} θ^be_{χ|τ,i−1} = ∏_{x∈V} SS(D, x), where SS(D, x) = ∏_{i=1}^{N} θ^be_{χ|τ,i−1} with χ, τ ∼ d^i for the x family.

Since D is exchangeable, the order of data records in the 2nd expression does not matter. The 4th expression means that the factors for each x family are independent of the others. Hence, in the 4th expression, data records relative to each x may be ordered differently. Therefore, for each NAT x family, we order the records in the 4th expression from 1 to N by the type of τ: Type 0 records (no active cause in τ) first, followed by Type 1 (exactly one active cause), and then Type 2 (multiple active causes), breaking ties arbitrarily. We analyze the subscore SS(D, x) for a NAT family below, using this order.

Consider the contribution of record d^i to SS(D, x), where χ, τ ∼ d^i and τ is Type 0. If χ is active, (χ, τ) is impossible for a NAT x family. Hence, θ^be_{χ|τ,i−1} = 0, SS(D, x) = 0, and P(D|G, L) = 0. It signifies that (x, π) being NAT contradicts the data D; hence, either (x, π) is tabular, or (x, π) is NAT with an (extra) persistent leaky cause. On the other hand, if χ is inactive, then θ^be_{χ|τ,i−1} = 1, without visible impact on SS(D, x). In summary, if D has any d^i where χ is active and τ is Type 0, then P(D|G, L) = 0; if χ is inactive, d^i can be ignored when processing SS(D, x).

If τ is Type 2, the contribution of record d^i to SS(D, x) is θ^be_{χ|τ,i−1}, where χ, τ ∼ d^i, and it is determined by the x family NAT and the relevant Type 1 Bayes estimates θ^be_{χ′|τ′,i−1}, each according to Eqn. (4.7). Due to the above type-based record ordering, all Type 1 records are indexed lower than d^i. Hence, θ^be_{χ′|τ′,i−1} equals the final estimate θ^be_{χ′|τ′} given all of D. It then follows that the contribution of d^i to SS(D, x) is independent of the index i: θ^be_{χ|τ,i−1} = θ^be_{χ|τ}. By Eqn. (4.8), if (χ|τ) occurs m times in D, these records contribute (θ^be_{χ|τ})^m to SS(D, x). This reveals that the type-based record ordering is only convenient for justifying soundness, but is not algorithmically necessary.
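Putting the analysis together, SS(D, x) for a NAT family can be computed from per-type counts alone. Below is a minimal sketch under our own assumptions about the argument layout (not the paper's implementation): Type-1 records are scored like tabular single-causal θ nodes (reusing logFamilyTauSubscore from Section 4), and each distinct Type-2 configuration contributes m · log θ^be_{χ|τ}, where θ^be_{χ|τ} is evaluated from the family NAT and the final single-causal estimates.

```java
final class NatSubscore {
    /** Sketch of log SS(D, x) for a NAT family, assuming the caller supplies:
        - type0EffectActive: true iff some Type-0 record has an active effect value,
        - psi1[c], n1[c]: Dirichlet hyperparameters and data counts of the
          single-causal theta node for cause c (Type-1 records),
        - natProb[k], m2[k]: the NAT-derived Bayes estimate theta^be_{chi|tau} and
          occurrence count m of the k-th distinct Type-2 configuration (chi, tau). */
    static double logNatFamilySubscore(boolean type0EffectActive,
                                       double[][] psi1, int[][] n1,
                                       double[] natProb, int[] m2) {
        if (type0EffectActive)                   // NAT contradicts D
            return Double.NEGATIVE_INFINITY;     // i.e., P(D|G,L) = 0
        double log = 0;
        for (int c = 0; c < psi1.length; c++)    // Type 1: standard BD factors
            log += BdLikelihood.logFamilyTauSubscore(psi1[c], n1[c]);
        for (int k = 0; k < natProb.length; k++) // Type 2: (theta^be_{chi|tau})^m
            log += m2[k] * Math.log(natProb[k]);
        return log;
    }
}
```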

Local Structure Prior
We next consider the local structure prior P(L|G), where L specifies the local structure for every x family in G. Learning BNs with local decision trees was studied in [18] based on MDL scores, and an alternative BD score was proposed with the prior

  P(L|G) = α 2^{−DL(L|G)},

where DL() is the description length under the MDL principle, and α is a normalizing constant.
Applying the idea to NAT-modeled BNs, we specify DL(x, L_x) for each x family, where L_x extracts the local structure for the x family from L, and

  DL(L|G) = Σ_{x∈V} DL(x, L_x).

If the x family is tabular by L, DL(x, L_x) is the description length of its tabular CPDs (see [13]). If the x family is structured as a NAT T_x, we have

  DL(x, L_x) = DL(T_x) + DL(SC_x),

where DL(T_x) and DL(SC_x) are the description lengths of the NAT and of the single-causals [13].
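Since α cancels when candidate structures are compared, only the exponent matters. A one-line sketch in log2 space (assuming per-family description lengths are already available from [13]-style encodings):

```java
final class LocalPrior {
    /** log2 of the unnormalized local structure prior P(L|G) = alpha * 2^(-DL(L|G)),
        given DL(x, L_x) for every family; alpha cancels in structure comparison. */
    static double log2LocalPrior(double[] familyDescriptionLengths) {
        double dl = 0;
        for (double dlx : familyDescriptionLengths) dl += dlx; // DL(L|G) = sum_x DL(x, L_x)
        return -dl;
    }
}
```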

Global Structure Prior
We consider the global structure prior P(G). A preference for simpler DAGs G is suggested in [8,15] by assuming (1) independent parent sets:

  P(G) = ∏_{x∈V} P(π_x),

and (2) independent individual parents: P(y → x | z → x) = P(y → x), where y → x is an arc in G. Since no specific form of P(y → x) is suggested in [8,15], we develop the following. We assign

  P(π_x) = η k^{|π_x|},

where k ∈ (0, 1) and η is a constant. When x is a root in G, P(π_x = ∅) = η. It can be shown that this assignment satisfies the following properties, which favor simpler structures:

(1) If x has w parents and v has q > w parents, then P(π_x) > P(π_v).
(2) If G has n nodes, then P(G) = η^n k^m, where m is the number of arcs in G.
From the 2nd property, the constant η can be ignored when comparing two DAGs. This is desirable, as the number of alternative DAGs G is intractably large. As an example, for G in Fig. 3 (a), assuming k = 0.5, the global structure prior without η is 1 × 1 × 0.25 × 0.5 = 0.125: the roots a and b each contribute a factor of 1, c with two parents contributes k^2 = 0.25, and d with one parent contributes k = 0.5.
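A minimal sketch of this comparison (our hypothetical helper; η^n is dropped since it is shared by all DAGs over the same n nodes):

```java
final class GlobalPrior {
    /** Unnormalized global structure prior k^m, where m is the arc count of G;
        the shared factor eta^n is dropped for structure comparison. */
    static double globalPriorWithoutEta(int arcCount, double k) {
        return Math.pow(k, arcCount);
    }

    // Fig. 3 (a) example: 3 arcs and k = 0.5 give 0.5^3 = 0.125, matching the text.
}
```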
We conclude the BD scores for learning NAT-modeled BNs with their decomposability:

Theorem 4. The likelihood, local structure prior, and global structure prior defined above for learning NAT-modeled BNs are each decomposable by variable family, and so is the BD score specified by them.
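In log space, Theorem 4 lets the three components be assembled as sums of per-family terms. A sketch combining the earlier helpers (the 2^(−DL) local prior converts to natural log via ln 2; α and η^n are dropped since they cancel in comparison):

```java
final class BdScore {
    /** Log BD score, up to constants that cancel in structure comparison:
        log P(G) + log P(L|G) + log P(D|G,L), each decomposed by family. */
    static double logBdScore(int arcCount, double k,
                             double[] familyDLs, double[] familyLogLikelihoods) {
        double log = arcCount * Math.log(k);                 // global structure prior
        for (double dl : familyDLs) log -= dl * Math.log(2); // local prior: -DL * ln 2
        for (double ll : familyLogLikelihoods) log += ll;    // likelihood subscores
        return log;
    }
}
```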

Algorithm and Complexity
Learning BNs is NP-complete, and learning NAT-modeled BNs involves an even larger (G, L) space. For instance, a single NAT family of 8 causes has 1,320,064 alternative NAT structures (each encoding a unique set of causal interactions among the causes). A given G with exactly two such NAT families would have about 1.7 × 10^12 alternative L structures. To improve the efficiency of learning, we apply heuristic search, as presented below. The presentation focuses on structure learning, although our implementation (Section 8) also learns parameters (tabular CPDs and NAT single-causals).
The top-level algorithm LearnNatBnByBD takes data D over V as input. It learns a NAT-modeled BN structure (G, L), where G is possibly over a superset of V (due to persistent leaky causes). It uses heuristic search to find a best structure over the intractable (G, L) space.
It adopts a sequence of (G, L) structures, from an empty G to the final structure. Each (G, L) is computed by OneRoundSearch and improves the BD score. It differs from the previous (G, L) by one arc (through an arc operation: add, delete, or reverse), and may change the local structure of up to two families.
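The outer loop is thus a standard score-improving hill climb. A generic, self-contained sketch (the structure type S and both function arguments are placeholders; the internals of OneRoundSearch, which enumerate arc operations and re-optimize the up-to-two affected families, are not specified by the text):

```java
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

public final class HillClimbSketch {
    /** Keeps applying the best single-arc move while the BD score improves;
        returns the last structure once no improving move exists. */
    static <S> S learnByBd(S empty, UnaryOperator<S> oneRoundSearch,
                           ToDoubleFunction<S> bdScore) {
        S best = empty;
        while (true) {
            S next = oneRoundSearch.apply(best); // best add/delete/reverse arc move
            if (next == null || bdScore.applyAsDouble(next) <= bdScore.applyAsDouble(best))
                return best;                     // no score-improving move remains
            best = next;
        }
    }
}
```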

Experimental Study
A preliminary experiment is conducted to evaluate the above BD score and structure learning algorithm, with the following objective. Suppose that the data-generating environment can be expressed as a NAT-modeled BN B_1. Then an equivalent tabular BN B_2 exists, with its joint probability distribution (JPD) identical to that of B_1. We want to answer the question: can our algorithm learn a NAT-modeled BN B_3 such that posterior marginals by inference with B_3 approximate those by B_2 well, while at the same time inference with B_3 is significantly more efficient than with B_2?
We simulated 30 fully NAT-modeled source BNs (B_1), each of 80 binary or ternary variables. Each variable has a maximum of 12 parents. The DAG of each source BN has 5% extra arcs beyond being singly connected. Hence, each source BN is multiply connected with a high treewidth (at least 12) and low density. For each source BN, the equivalent tabular peer BN (B_2) is derived, from which a data set of size 5000 is sampled. LearnNatBnByBD is implemented in Java and run on a desktop with an AMD Ryzen 7 5800X 8-core processor at 3.8 GHz using single-thread computation, to learn a NAT-modeled BN (B_3) from each data set.
For each pair of peer BN and learned BN, we performed 10 runs of inference, each with random observations on 2 to 8 variables (up to 10% of all variables), for a total of 600 runs. Posterior marginal errors of the learned BNs, relative to the peer BNs, averaged over the 10 runs, are shown in Fig. 4 (a). Posterior marginals from the learned BNs approximate those from the peer BNs sufficiently well (average error at about 0.025). As shown in Fig. 4 (b) (runtime in msec, log10 scale), the learned BNs have an average inference runtime of 25 msec, while the peer BNs take about 900 msec. Hence, the learned BNs are significantly more efficient than the peer BNs (36 times faster on average).

Conclusion
We presented the first Bayesian framework for learning structures of BNs with NAT local models, where NAT models are chosen due to their multiple merits. In particular, we extended meta-nets for learning tabular BNs to N-meta-nets to enable the representation of NAT-modeled families and single-causal parameters. We showed formally that N-meta-nets are expressive and satisfy parameter independence. Using N-meta-nets, we developed BD scores for learning NAT-modeled BN structures, consisting of the likelihood, the local structure prior, and the global structure prior. The BD scores were shown to be decomposable. A heuristic algorithm for learning NAT-modeled BN structures with BD scores was presented, searching through a structure space that is significantly more complex than the space in learning tabular BNs. Our experiment showed that when the data-generating environment can be expressed as a NAT-modeled BN, a NAT-modeled BN can be learned whose inference is sufficiently accurate while being significantly more efficient than the tabular BN alternative.
