Jensen-Shannon divergence and diversity index:
Origins and some extensions

Frank Nielsen
Sony Computer Science Laboratories Inc.
Tokyo, Japan

April 2021


Abstract

Lin coined the skewed Jensen-Shannon divergence between two distributions in 1991, and further extended it to the Jensen-Shannon diversity index of a finite set of distributions. Sibson proposed the information radius based on Rényi α-entropies in 1969, and recovered the Jensen-Shannon diversity index as the special case α = 1. In this note, we summarize how the Jensen-Shannon divergence and diversity index have been extended, either by considering skewing vectors or by using mixtures induced by generic means.

1 Origins

Let $(\mathcal{X},\mathcal{F})$ be a measure space, and $(w_1,P_1),\ldots,(w_n,P_n)$ be $n$ weighted probability measures dominated by a measure $\mu$ (with $w_i>0$ and $\sum_{i=1}^n w_i=1$). Denote by $P := \{(w_1,p_1),\ldots,(w_n,p_n)\}$ the set of their weighted Radon-Nikodym densities $p_i = \frac{\mathrm{d}P_i}{\mathrm{d}\mu}$ with respect to $\mu$.

A statistical divergence $D[p:q]$ is a measure of dissimilarity between two densities $p$ and $q$ (i.e., a 2-point distance) such that $D[p:q]\geq 0$ with equality if and only if $p(x)=q(x)$ $\mu$-almost everywhere. A statistical diversity index $D(P)$ is a measure of variation of the weighted densities in $P$ related to a measure of centrality, i.e., an $n$-point distance which generalizes the notion of 2-point distance when $P_2(p,q) := \{(\frac{1}{2},p),(\frac{1}{2},q)\}$:

\[
D[p:q] := D(P_2(p,q)).
\]

The fundamental measure of dissimilarity in information theory is the I-divergence (also called the Kullback-Leibler divergence, KLD, see Equation (2.5) page 5 of [5]):

\[
D_{\mathrm{KL}}[p:q] := \int_{\mathcal{X}} p(x)\,\log\left(\frac{p(x)}{q(x)}\right)\mathrm{d}\mu(x).
\]
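For concreteness, the following minimal Python sketch (not part of the original note) evaluates the KLD between two discrete distributions, taking $\mu$ to be the counting measure; the helper name `kl` and the toy distributions are illustrative only.

```python
# Sketch: discrete Kullback-Leibler divergence (counting measure as mu).
import numpy as np

def kl(p, q):
    """D_KL[p:q] = sum_x p(x) log(p(x)/q(x)), with the convention 0 log 0 = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0                      # terms with p(x) = 0 contribute 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))                # the two values differ: the KLD is asymmetric
```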

The KLD is asymmetric (hence the delimiter notation “:” instead of ‘,’) but can be symmetrized by defining the Jeffreys J-divergence (denoted by $I_2$ in Equation (1) of Jeffreys’ 1946 paper [4]):

\[
D_J[p,q] := D_{\mathrm{KL}}[p:q] + D_{\mathrm{KL}}[q:p] = \int_{\mathcal{X}} (p(x)-q(x))\,\log\left(\frac{p(x)}{q(x)}\right)\mathrm{d}\mu(x).
\]

Although symmetric, no positive power of the Jeffreys divergence satisfies the triangle inequality: that is, $D_J^{\alpha}$ is never a metric distance for any $\alpha>0$; furthermore, $D_J^{\alpha}$ cannot be upper bounded.
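A quick numerical illustration of this unboundedness (a sketch under the same discrete-distribution assumption as above, with illustrative names): as $q$ places vanishing mass on a point where $p$ does not, the Jeffreys divergence grows without bound.

```python
# Sketch: the Jeffreys divergence D_J[p,q] = KL(p:q) + KL(q:p) is unbounded.
import numpy as np

def kl(p, q):
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def jeffreys(p, q):
    return kl(p, q) + kl(q, p)

p = np.array([0.5, 0.5])
for eps in (1e-1, 1e-3, 1e-6):
    q = np.array([1.0 - eps, eps])
    print(eps, jeffreys(p, q))           # grows roughly like 0.5 * log(1/eps)
```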

In 1991, Lin proposed the asymmetric K-divergence (Equation (3.2) in [7]):

\[
D_K[p:q] := D_{\mathrm{KL}}\!\left[p : \frac{p+q}{2}\right],
\]

and defined the L-divergence by analogy to Jeffreys’s symmetrization of the KLD (Equation (3.4) in [7]):

\[
D_L[p,q] = D_K[p:q] + D_K[q:p].
\]
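In the discrete setting, these two divergences can be sketched in a few lines of Python (again an illustration with hypothetical function names, not code from [7]):

```python
# Sketch: Lin's K-divergence (KL to the mid-point mixture) and L-divergence.
import numpy as np

def kl(p, q):
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def k_div(p, q):
    return kl(p, 0.5 * (p + q))          # D_K[p:q] = KL(p : (p+q)/2)

def l_div(p, q):
    return k_div(p, q) + k_div(q, p)     # D_L[p,q] = D_K[p:q] + D_K[q:p]

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print(k_div(p, q), k_div(q, p), l_div(p, q))
```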

By noticing that

\[
D_L[p,q] = 2\,h\!\left[\frac{p+q}{2}\right] - (h[p] + h[q]),
\]

where $h$ denotes the Shannon entropy (Equation (3.14) in [7]), Lin coined the (skewed) Jensen-Shannon divergence between two weighted densities $(1-\alpha,p)$ and $(\alpha,q)$ for $\alpha\in(0,1)$ as follows (Equation (4.1) in [7]):

\[
D_{\mathrm{JS},\alpha}[p,q] = h[(1-\alpha)p + \alpha q] - (1-\alpha)\,h[p] - \alpha\, h[q]. \tag{1}
\]
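At $\alpha=\frac{1}{2}$, Eq. 1 gives $D_{\mathrm{JS},1/2}[p,q] = h[\frac{p+q}{2}] - \frac{1}{2}(h[p]+h[q])$, which is half of $D_L[p,q]$. The following Python sketch (discrete case, illustrative names only) verifies this identity numerically.

```python
# Sketch: skewed Jensen-Shannon divergence of Eq. (1) and the check D_L = 2 * D_{JS,1/2}.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def js_alpha(p, q, alpha):
    return (entropy((1.0 - alpha) * p + alpha * q)
            - (1.0 - alpha) * entropy(p) - alpha * entropy(q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
l_div = kl(p, 0.5 * (p + q)) + kl(q, 0.5 * (p + q))
print(np.isclose(l_div, 2.0 * js_alpha(p, q, 0.5)))   # True
```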

Finally, Lin defined the generalized Jensen-Shannon divergence (Equation (5.1) in [7]) for a finite weighted set of densities:

\[
D_{\mathrm{JS}}[P] = h\!\left[\sum_i w_i p_i\right] - \sum_i w_i\, h[p_i].
\]

This generalized Jensen-Shannon divergence is nowadays called the Jensen-Shannon diversity index.
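As a concrete sketch (assuming discrete densities on a common finite support; the names `js_diversity`, `w`, and `densities` are illustrative), the diversity index is the entropy of the mixture minus the mixture of the entropies:

```python
# Sketch: Jensen-Shannon diversity index of a finite weighted set of discrete densities.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def js_diversity(weights, densities):
    """h[sum_i w_i p_i] - sum_i w_i h[p_i]."""
    w = np.asarray(weights, dtype=float)
    P = np.asarray(densities, dtype=float)      # shape (n, d): one density per row
    mixture = w @ P                             # sum_i w_i p_i
    return entropy(mixture) - float(np.dot(w, [entropy(p_i) for p_i in P]))

w = [0.2, 0.5, 0.3]
densities = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
print(js_diversity(w, densities))               # >= 0 since h is concave (Jensen's inequality)
```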

In contrast with the Jeffreys divergence, the Jensen-Shannon divergence (JSD) $D_{\mathrm{JS}} := D_{\mathrm{JS},1/2}$ is upper bounded by $\log 2$ (and does not require the densities to have the same support), and $\sqrt{D_{\mathrm{JS}}}$ is a metric distance [2, 3]. Lin cited precursor work [17, 8] yielding the definition of the Jensen-Shannon divergence: the Jensen-Shannon divergence of Eq. 1 is the so-called “increments of entropy” defined in Equations (19) and (20) of [17].
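These two properties are easy to probe numerically; the sketch below is a sanity check on random discrete distributions, not a proof, and the sampling setup is illustrative.

```python
# Sketch: check D_JS <= log 2 and the triangle inequality for sqrt(D_JS)
# on randomly drawn discrete distributions.
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def jsd(p, q):
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)

def sqrt_jsd(p, q):
    return np.sqrt(max(jsd(p, q), 0.0))     # clip tiny negative rounding errors

for _ in range(1000):
    p, q, r = rng.dirichlet(np.ones(4), size=3)
    assert jsd(p, q) <= np.log(2.0) + 1e-12
    assert sqrt_jsd(p, r) <= sqrt_jsd(p, q) + sqrt_jsd(q, r) + 1e-12
print("all checks passed")
```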

The Jensen-Shannon diversity index was also obtained very differently by Sibson in 1969 when he defined the information radius [16] of order α using Rényi α-means and Rényi α-entropies [15]. In particular, the information radius IR1 of order 1 of a weighted set P of densities is a diversity index obtained by solving the following variational optimization problem:

\[
\mathrm{IR}_1[P] := \min_{c} \sum_{i=1}^n w_i\, D_{\mathrm{KL}}[p_i : c]. \tag{2}
\]

Sibson solved a more general optimization problem, and obtained the following expression (term K1 in Corollary 2.3 [16]):

\[
\mathrm{IR}_1[P] = h\!\left[\sum_i w_i p_i\right] - \sum_i w_i\, h[p_i] = D_{\mathrm{JS}}[P].
\]

Thus Eq. 2 yields a variational definition of the Jensen-Shannon diversity index, and in particular of the Jensen-Shannon divergence.
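This variational characterization can be illustrated numerically in the discrete case: among candidate centers $c$, the mixture $\sum_i w_i p_i$ attains the minimum of the objective of Eq. 2, where the objective equals the Jensen-Shannon diversity index. The sketch below uses illustrative names and randomly drawn densities.

```python
# Sketch: the objective of Eq. (2) is minimized at the mixture sum_i w_i p_i,
# where it equals the Jensen-Shannon diversity index (discrete case).
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

w = np.array([0.2, 0.5, 0.3])
P = rng.dirichlet(np.ones(4), size=3)       # three random densities p_i on 4 atoms
mixture = w @ P                             # sum_i w_i p_i

def objective(c):
    return float(sum(w_i * kl(p_i, c) for w_i, p_i in zip(w, P)))

js_div = entropy(mixture) - float(sum(w_i * entropy(p_i) for w_i, p_i in zip(w, P)))
assert np.isclose(objective(mixture), js_div)
assert all(objective(c) >= js_div - 1e-12 for c in rng.dirichlet(np.ones(4), size=1000))
print("variational check passed")
```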

2 Some extensions

References

[1]    Jacob Deasy, Nikola Simidjievski, and Pietro Liò. Constraining Variational Inference with Geometric Jensen-Shannon Divergence. In Advances in Neural Information Processing Systems, 2020.

[2]    Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.

[3]    Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT 2004), page 31. IEEE, 2004.

[4]    Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.

[5]    Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.

[6]    Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Artificial Intelligence and Statistics (AISTATS), pages 65–72, 2001.

[7]    Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.

[8]    Jianhua Lin and SKM Wong. Approximation of discrete probability distributions based on a new divergence measure. Congressus Numerantium (Winnipeg), 61:75–80, 1988.

[9]    Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv preprint arXiv:1009.4004, 2010. URL https://arxiv.org/abs/1009.4004.

[10]    Frank Nielsen. On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21(5), 2019. ISSN 1099-4300. doi: 10.3390/e21050485. URL https://www.mdpi.com/1099-4300/21/5/485.

[11]    Frank Nielsen. On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid. Entropy, 22(2), 2020. ISSN 1099-4300. doi: 10.3390/e22020221. URL https://www.mdpi.com/1099-4300/22/2/221.

[12]    Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy, 23(4), 2021. ISSN 1099-4300. doi: 10.3390/e23040464. URL https://www.mdpi.com/1099-4300/23/4/464.

[13]    Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.

[14]    Frank Nielsen and Kazuki Okamura. On f-divergences between Cauchy distributions. arXiv preprint arXiv:2101.12459, 2021.

[15]    Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.

[16]    Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(2):149–160, 1969.

[17]    Andrew KC Wong and Manlai You. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (5):599–609, 1985.