This note illustrates how to apply the generic formula [2] for the Kullback-Leibler divergence between two densities belonging to two different exponential families.
This column is also available as the file KLPoissonGeometricDistributions.pdf.
It is well known that the Kullback-Leibler divergence between two densities Pθ1 and Pθ2 of the same exponential family amounts to a reverse Bregman divergence between the corresponding natural parameters, where the Bregman generator is the cumulant function F(θ) [1]:
$$D_{\mathrm{KL}}[P_{\theta_1} : P_{\theta_2}] = B_F^*(\theta_1 : \theta_2) = B_F(\theta_2 : \theta_1) := F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)\cdot\nabla F(\theta_1).$$
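As a quick sanity check of this identity within a single family, here is a minimal Maxima sketch (the names KLPoissonSeries and KLPoissonBregman are mine, not from the note) comparing a truncated series evaluation of the KLD between two Poisson pmfs with the reverse Bregman divergence induced by the Poisson cumulant function F(θ) = exp(θ), for which ∇F(θ(λ)) = λ:

/* Truncated-series KLD between two Poisson pmfs versus the reverse Bregman
   divergence B_F(theta2 : theta1) with F(theta) = exp(theta), theta = log(lambda). */
Poisson(x, lambda) := (lambda**x) * exp(-lambda) / x!;
nbterms: 50;
KLPoissonSeries(l1, l2) := sum(Poisson(x, l1) * log(Poisson(x, l1) / Poisson(x, l2)), x, 0, nbterms);
/* F(theta(l2)) - F(theta(l1)) - (theta(l2) - theta(l1)) * gradF(theta(l1)) = l2 - l1 - (log(l2) - log(l1)) * l1 */
KLPoissonBregman(l1, l2) := l2 - l1 - (log(l2) - log(l1)) * l1;
float(KLPoissonSeries(2.5, 4.0));
float(KLPoissonBregman(2.5, 4.0));  /* the two values should agree up to the series truncation */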
The following formula for the Kullback-Leibler divergence (KLD) between two densities Pθ and Qθ′ belonging to two different exponential families P (with cumulant function FP) and Q (with cumulant function FQ) was reported in [2] (Proposition 5):
$$D_{\mathrm{KL}}[P_\theta : Q_{\theta'}] = F_Q(\theta') + F_P^*(\eta) - E_{P_\theta}[t_Q(x)]\cdot\theta' + E_{P_\theta}[k_P(x) - k_Q(x)]. \qquad (1)$$
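For self-containedness (this recap is added here; it spells out the quantities that appear in Eq. (1) and in Table 1 below), recall that a pmf of an exponential family is written in the canonical form

$$p_\theta(x) = \exp\big(\theta\cdot t(x) + k(x) - F(\theta)\big),$$

where t(x) denotes the sufficient statistic, k(x) the auxiliary measure term, and F(θ) the cumulant function; the moment parameter of Pθ appearing in Eq. (1) is η = ∇FP(θ) = EPθ[tP(x)].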
When P = Q (and F = FP = FQ), we recover the reverse Fenchel-Young divergence, which corresponds to the reverse Bregman divergence:
$$D_{\mathrm{KL}}[P_\theta : P_{\theta'}] = F(\theta') + F^*(\eta) - \eta\cdot\theta' =: Y_{F,F^*}(\theta' : \eta) = Y_{F^*,F}(\eta : \theta').$$
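For completeness, here is a one-line check (added here) that this Fenchel-Young expression is indeed the reverse Bregman divergence: since η = ∇F(θ) and F*(η) = θ⋅η − F(θ), we have

$$Y_{F,F^*}(\theta' : \eta) = F(\theta') + \theta\cdot\eta - F(\theta) - \eta\cdot\theta' = F(\theta') - F(\theta) - (\theta' - \theta)\cdot\nabla F(\theta) = B_F(\theta' : \theta).$$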
Consider the KLD between a Poisson probability mass function (pmf) and a geometric pmf. The canonical decompositions of the Poisson and geometric pmfs are summarized in Table 1.
Table 1: Canonical decompositions of the Poisson and geometric exponential families.

| | Poisson family | Geometric family |
|---|---|---|
| support | ℕ ∪ {0} | ℕ ∪ {0} |
| base measure | counting measure | counting measure |
| ordinary parameter | rate λ > 0 | success probability p ∈ (0,1) |
| pmf | (λ^x / x!) exp(-λ) | (1 - p)^x p |
| sufficient statistic | t(x) = x | t(x) = x |
| natural parameter | θ(λ) = log λ | θ(p) = log(1 - p) |
| cumulant function | F(θ) = exp(θ), i.e., F(λ) = λ | F(θ) = -log(1 - exp(θ)), i.e., F(p) = -log p |
| auxiliary measure term | k(x) = -log x! | k(x) = 0 |
| moment parameter η = E[t(x)] | η = λ | η = (1 - p)/p = 1/p - 1 |
| negentropy (convex conjugate) F*(η) = θ ⋅ η - F(θ) | F*(θ(λ)) = λ log λ - λ | F*(θ(p)) = (1/p - 1) log(1 - p) + log p |
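The moment parameters listed in Table 1 can be recovered by differentiating the cumulant functions, since η = ∇F(θ). The following minimal Maxima sketch (the symbol names FP, FQ, etaP, and etaQ are mine) performs this check symbolically:

/* Moment parameters as gradients of the cumulant functions of Table 1. */
FP(theta) := exp(theta);              /* Poisson cumulant function */
FQ(theta) := -log(1 - exp(theta));    /* geometric cumulant function */
etaP: subst(theta = log(lambda), diff(FP(theta), theta));          /* expected: lambda */
etaQ: ratsimp(subst(theta = log(1 - p), diff(FQ(theta), theta)));  /* expected: (1-p)/p */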
Thus we calculate the KLD between two geometric distributions Qp1 and Qp2 as
$$D_{\mathrm{KL}}[Q_{p_1} : Q_{p_2}] = B_{F_Q}(\theta(p_2) : \theta(p_1)) = F_Q(\theta(p_2)) - F_Q(\theta(p_1)) - (\theta(p_2) - \theta(p_1))\,\eta(p_1).$$
Plugging in FQ(θ(p)) = -log p, θ(p) = log(1 - p), and η(p1) = (1 - p1)/p1 = 1/p1 - 1, we obtain

$$D_{\mathrm{KL}}[Q_{p_1} : Q_{p_2}] = \log\frac{p_1}{p_2} - \left(\frac{1}{p_1} - 1\right)\log\frac{1-p_2}{1-p_1}.$$
The following Maxima (https://maxima.sourceforge.io/) code checks the above formula against a truncated series evaluation of the KLD.
/* pmf of the geometric distribution on {0, 1, 2, ...} */
Geometric(x, p) := ((1-p)**x) * p;
nbterms: 50;
/* KLD evaluated by truncating the series at x = nbterms */
KLGeometricSeries(p1, p2) := sum(Geometric(x, p1) * log(Geometric(x, p1) / Geometric(x, p2)), x, 0, nbterms);
/* closed-form KLD between two geometric pmfs */
KLGeometricFormula(p1, p2) := log(p1/p2) - log((1-p2)/(1-p1)) * ((1/p1) - 1);
p1: 0.2;
p2: 0.6;
float(KLGeometricSeries(p1, p2));
float(KLGeometricFormula(p1, p2));
Evaluating the above code, we get:
(%o7) 1.673553688712277
(%o8) 1.673976433571672
The small gap between the truncated series (%o7) and the closed form (%o8) stems from truncating the series at x = nbterms = 50.
Using the generic formula of Eq. (1) with the quantities of Table 1, the KLD between a Poisson pmf Pλ and a geometric pmf Qp is equal to
$$\begin{aligned}
D_{\mathrm{KL}}[P_\lambda : Q_p] &= F_Q(\theta') + F_P^*(\eta) - E_{P_\theta}[t_Q(x)]\cdot\theta' + E_{P_\theta}[k_P(x) - k_Q(x)] && (2)\\
&= -\log p + \lambda\log\lambda - \lambda\,\big(1 + \log(1-p)\big) - E_{P_\lambda}[\log x!]. && (3)
\end{aligned}$$
Since $E_{P_\lambda}[-\log x!] = -\sum_{k=0}^{\infty} e^{-\lambda}\,\frac{\lambda^k}{k!}\,\log(k!)$, we have
$$D_{\mathrm{KL}}[P_\lambda : Q_p] = -\log p + \lambda\log\lambda - \lambda - \lambda\log(1-p) - \sum_{k=0}^{\infty} e^{-\lambda}\,\frac{\lambda^k}{k!}\,\log(k!).$$
We check the above formula in Maxima:
/* pmf of the Poisson distribution */
Poisson(x, lambda) := (lambda**x) * exp(-lambda) / x!;
/* KLD evaluated by truncating the series at x = nbterms */
KLseries(lambda, p) := sum(Poisson(x, lambda) * log(Poisson(x, lambda) / Geometric(x, p)), x, 0, nbterms);
/* closed-form KLD with the E[log x!] series also truncated at nbterms */
KLformula(lambda, p) := -log(p) + lambda*log(lambda) - lambda - lambda*log(1-p)
    - sum(exp(-lambda) * (lambda**x) * log(x!) / x!, x, 0, nbterms);
lambda: 5.6;
p: 0.3;
float(KLseries(lambda, p));
float(KLformula(lambda, p));
Evaluating the above code, we get
(%o14) 0.9378529269681795 (%o15) 0.9378529269681785
[1] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh, and John Lafferty. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10), 2005.
[2] Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy, 23(4):464, 2021.