The Kullback-Leibler divergence between a Poisson distribution and a geometric distribution

Frank Nielsen
Frank.Nielsen@acm.org

October 4, 2021

Abstract

This note illustrates how to apply the generic formula of the Kullback-Leibler divergence between two densities of two different exponential families [2].

This column is also available as the file KLPoissonGeometricDistributions.pdf.

It is well-known that the Kullback-Leibler divergence between two densities Pθ1 and Pθ2 of the same exponential family amounts to a reverse Bregman divergence between the corresponding natural parameters, for the Bregman generator set to the cumulant function F(θ) [1]:

\[
D_{\mathrm{KL}}[P_{\theta_1} : P_{\theta_2}] = B_F^*(\theta_1 : \theta_2) = B_F(\theta_2 : \theta_1) := F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)\cdot\nabla F(\theta_1).
\]
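For instance, within the Poisson family (F(θ) = exp(θ), θ = log λ, so that ∇F(θ) = exp(θ) = λ), this Bregman divergence yields the familiar closed form DKL[Pλ1 : Pλ2] = λ2 - λ1 + λ1 log(λ1/λ2). The following minimal Maxima sketch (the pmf definition, truncation level, and rate values are chosen here purely for illustration) checks this closed form against a truncated series:

/* Illustrative check: KLD between two Poisson pmfs via the Bregman closed form. */
Poisson(x,lambda):=(lambda**x)*exp(-lambda)/x!;
/* Truncated series definition of the KLD */
KLPoissonSeries(l1,l2):=sum(Poisson(x,l1)*log(Poisson(x,l1)/Poisson(x,l2)),x,0,50);
/* Closed form B_F(theta2:theta1) with F(theta)=exp(theta), theta=log(lambda) */
KLPoissonBregman(l1,l2):=l2-l1+l1*log(l1/l2);
float(KLPoissonSeries(3.0,5.0));
float(KLPoissonBregman(3.0,5.0));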

The following formula for the Kullback-Leibler divergence (KLD) between two densities Pθ and Qθ′ of two different exponential families P (with cumulant function FP) and Q (with cumulant function FQ) was reported in [2] (Proposition 5):

\[
D_{\mathrm{KL}}[P_\theta : Q_{\theta'}] = F_Q(\theta') + F_P^*(\eta_\theta) - E_{P_\theta}[t_Q(x)]\cdot\theta' + E_{P_\theta}[k_P(x) - k_Q(x)]. \qquad (1)
\]

When P = Q (and F = FP = FQ), we recover the reverse Fenchel-Young divergence which corresponds to the reverse Bregman divergence:

\[
D_{\mathrm{KL}}[P_\theta : P_{\theta'}] = F(\theta') + F^*(\eta) - \eta\cdot\theta' =: Y^*_{F,F^*}(\theta' : \eta) = Y_{F^*,F}(\eta : \theta').
\]

Consider the KLD between a Poisson probability mass function (pmf) and a geometric pmf. The canonical decomposition of the Poisson and geometric pmfs are summarized in Table 1.


                                   Poisson family P                      Geometric family Q
support                            ℕ ∪ {0}                               ℕ ∪ {0}
base measure                       counting measure                      counting measure
ordinary parameter                 rate λ > 0                            success probability p ∈ (0,1)
pmf                                (λ^x / x!) exp(-λ)                    (1 - p)^x p
sufficient statistic               tP(x) = x                             tQ(x) = x
natural parameter                  θ(λ) = log λ                          θ(p) = log(1 - p)
cumulant function                  FP(θ) = exp(θ)                        FQ(θ) = -log(1 - exp(θ))
                                   FP(λ) = λ                             FQ(p) = -log(p)
auxiliary measure term             kP(x) = -log x!                       kQ(x) = 0
moment parameter η = E[t(x)]       η = λ                                 η = exp(θ)/(1 - exp(θ)) = 1/p - 1
negentropy (convex conjugate)      FP*(θ(λ)) = λ log λ - λ               FQ*(θ(p)) = (1/p - 1) log(1 - p) + log p
  (F*(η) = θ·η - F(θ))
Table 1: Canonical decomposition of the Poisson and the geometric discrete exponential families.
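Since log pθ(x) = θ·t(x) + k(x) - F(θ), taking the expectation under pθ gives F*(η) = E[log pθ(x)] - E[k(x)], which explains the negentropy label of the last row of Table 1. The following Maxima sketch (the truncation level and the parameter values λ = 4.2 and p = 0.3 are chosen arbitrarily for illustration) checks the two conjugate expressions of that row numerically:

/* Illustrative check of the convex-conjugate (negentropy) row of Table 1:
   F*(eta) = E[log pmf(x)] - E[k(x)], with series truncated at nbterms terms. */
Poisson(x,lambda):=(lambda**x)*exp(-lambda)/x!;
Geometric(x,p):=((1-p)**x)*p;
nbterms:100;
/* Poisson: k_P(x) = -log(x!), so F_P*(eta) = E[log pmf] + E[log x!] = lambda*log(lambda) - lambda */
l:4.2;
float(sum(Poisson(x,l)*(log(Poisson(x,l))+log(x!)),x,0,nbterms));
float(l*log(l)-l);
/* Geometric: k_Q(x) = 0, so F_Q*(eta) = E[log pmf] = (1/p - 1)*log(1-p) + log(p) */
q:0.3;
float(sum(Geometric(x,q)*log(Geometric(x,q)),x,0,nbterms));
float((1/q-1)*log(1-q)+log(q));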

Thus we calculate the KLD between two geometric distributions Qp1 and Qp2 as

\begin{align*}
D_{\mathrm{KL}}[Q_{p_1} : Q_{p_2}] &= B_{F_Q}(\theta(p_2) : \theta(p_1)) \\
 &= F_Q(\theta(p_2)) - F_Q(\theta(p_1)) - (\theta(p_2) - \theta(p_1))\cdot\eta(p_1).
\end{align*}

That is, plugging in FQ(θ(p)) = -log p, θ(p) = log(1 - p), and η(p1) = 1/p1 - 1 from Table 1, we have

\[
\boxed{\;D_{\mathrm{KL}}[Q_{p_1} : Q_{p_2}] = \log\frac{p_1}{p_2} - \left(\frac{1}{p_1} - 1\right)\log\frac{1-p_2}{1-p_1}\;}
\]

The following code in Maxima (https://maxima.sourceforge.io/) checks the above formula:

Geometric(x,p):=((1-p)**x)*p;
nbterms:50;  /* truncation level of the series */
/* Truncated series definition of the KLD between two geometric pmfs */
KLGeometricSeries(p1,p2):=sum((Geometric(x,p1)*log(Geometric(x,p1)/Geometric(x,p2))),x,0,nbterms);
/* Closed-form formula derived above */
KLGeometricFormula(p1,p2):=log(p1/p2)-log((1-p2)/(1-p1))*((1/p1)-1);
p1:0.2;
p2:0.6;
float(KLGeometricSeries(p1,p2));
float(KLGeometricFormula(p1,p2));

Evaluating the above code, we get:

(%o7) 1.673553688712277
(%o8) 1.673976433571672

The small gap between the truncated series value (%o7) and the closed-form value (%o8) is due to truncating the series at nbterms = 50 terms.

Thus, applying Eq. (1), the KLD between a Poisson pmf pλ and a geometric pmf qp is equal to

\begin{align*}
D_{\mathrm{KL}}[P_\lambda : Q_p] &= F_Q(\theta') + F_P^*(\eta) - E_{P_\lambda}[t_Q(x)]\cdot\theta' + E_{P_\lambda}[k_P(x) - k_Q(x)] && (2)\\
 &= -\log p + \lambda\log\lambda - \lambda\bigl(1 + \log(1-p)\bigr) - E_{P_\lambda}[\log x!], && (3)
\end{align*}
where we used θ′ = log(1 - p), FQ(θ′) = -log p, FP*(η) = λ log λ - λ, EPλ[tQ(x)] = λ, kP(x) = -log x!, and kQ(x) = 0 (see Table 1).

Since $E_{P_\lambda}[-\log x!] = -\sum_{k=0}^{\infty} e^{-\lambda}\frac{\lambda^k}{k!}\log(k!)$, we have

\[
\boxed{\;D_{\mathrm{KL}}[P_\lambda : Q_p] = -\log p + \lambda\log\lambda - \lambda - \lambda\log(1-p) - \sum_{k=0}^{\infty} e^{-\lambda}\frac{\lambda^k}{k!}\log(k!)\;}
\]

We check the above formula in Maxima:

Poisson(x,lambda):=(lambda**x)*exp(-lambda)/x!;
/* Truncated series definition of the KLD between a Poisson pmf and a geometric pmf */
KLseries(lambda,p):=sum((Poisson(x,lambda)*log(Poisson(x,lambda)/Geometric(x,p))),x,0,nbterms);
/* Closed-form expression, with E[log x!] evaluated by a series truncated at nbterms terms */
KLformula(lambda,p):=-log(p)+lambda*log(lambda)-lambda-lambda*log(1-p)
 -sum(exp(-lambda)*(lambda**x)*log(x!)/x!,x,0,nbterms);
lambda:5.6;
p:0.3;
float(KLseries(lambda,p));
float(KLformula(lambda,p));

Evaluating the above code, we get

(%o14) 0.9378529269681795
(%o15) 0.9378529269681785

References

[1]   Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh, and John Lafferty. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10), 2005.

[2]   Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy, 23(4):464, 2021.