Lin coined the skewed Jensen-Shannon divergence between two distributions in 1991, and further extended it to the Jensen-Shannon diversity index of a set of weighted distributions. Sibson proposed the information radius based on Rényi α-entropies in 1969, and recovered the Jensen-Shannon diversity index for the special case α = 1. In this note, we summarize how the Jensen-Shannon divergence and the Jensen-Shannon diversity index were extended by either considering skewing vectors or using mixtures induced by generic means.
Let (𝒳, 𝒜, μ) be a measure space, and let (w1,P1),…,(wn,Pn) be n weighted probability measures all dominated by the measure μ (with wi > 0 and ∑i wi = 1). Denote by 𝒫 := {(w1,p1),…,(wn,pn)} the set of their weighted Radon-Nikodym densities pi = dPi/dμ with respect to μ.
A statistical divergence D[p : q] is a measure of dissimilarity between two densities p and q (i.e., a 2-point distance) such that D[p : q] ≥ 0, with equality if and only if p(x) = q(x) μ-almost everywhere. A statistical diversity index D(𝒫) is a measure of the variation of the weighted densities of 𝒫 relative to a measure of centrality, i.e., an n-point distance which generalizes the notion of a 2-point distance when 𝒫2(p,q) := {(½,p),(½,q)}.
The fundamental measure of dissimilarity in information theory is the I-divergence (also called the Kullback-Leibler divergence, KLD; see Equation (2.5), page 5 of [5]):
$$ D_{\mathrm{KL}}[p:q] := \int p(x)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x). $$
The KLD is asymmetric (hence the delimiter notation “:” instead of “,”) but can be symmetrized by defining the Jeffreys J-divergence (Jeffreys divergence, denoted by I2 in Equation (1) of Jeffreys' 1946 paper [4]):
$$ D_{J}[p:q] := D_{\mathrm{KL}}[p:q] + D_{\mathrm{KL}}[q:p] = \int \big(p(x)-q(x)\big)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x). $$
Although symmetric, no positive power of the Jeffreys divergence satisfies the triangle inequality: that is, $D_J^\alpha$ is never a metric distance for any α > 0; furthermore, $D_J^\alpha$ cannot be upper bounded.
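To make these definitions concrete, here is a minimal numerical sketch (assuming discrete distributions with full support on a common finite alphabet and natural logarithms); the helper names are illustrative only.

```python
# Minimal sketch (discrete distributions on a common finite alphabet,
# natural logarithm); function names are illustrative, not from any paper.
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL[p:q] = sum_x p(x) log(p(x)/q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # convention: 0 log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    """Jeffreys divergence D_J[p:q] = D_KL[p:q] + D_KL[q:p]."""
    return kl(p, q) + kl(q, p)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(kl(p, q), kl(q, p))             # asymmetric in general
print(jeffreys(p, q))                 # symmetric, but unbounded
```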
In 1991, Lin proposed the asymmetric K-divergence (Equation (3.2) in [7]):
$$ D_{K}[p:q] := D_{\mathrm{KL}}\!\left[p : \frac{p+q}{2}\right], $$
and defined the L-divergence by analogy with Jeffreys' symmetrization of the KLD (Equation (3.4) in [7]):
$$ D_{L}[p:q] := D_{K}[p:q] + D_{K}[q:p]. $$
By noticing that
$$ D_{L}[p:q] = 2\,h\!\left[\frac{p+q}{2}\right] - h[p] - h[q], $$
where h denotes Shannon entropy (Equation (3.14) in [7]), Lin coined the (skewed) Jensen-Shannon divergence between two weighted densities (1 - α,p) and (α,q) for α ∈ (0,1) as follows (Equation (4.1) in [7]):
$$ D_{\mathrm{JS}}^{\alpha}[p:q] := h\big[(1-\alpha)p+\alpha q\big] - (1-\alpha)\,h[p] - \alpha\,h[q] = (1-\alpha)\,D_{\mathrm{KL}}\big[p:(1-\alpha)p+\alpha q\big] + \alpha\,D_{\mathrm{KL}}\big[q:(1-\alpha)p+\alpha q\big]. \tag{1} $$
Finally, Lin defined the generalized Jensen-Shannon divergence (Equation (5.1) in [7]) for a finite weighted set 𝒫 of densities:
$$ D_{\mathrm{JS}}(\mathcal{P}) := h\Big[\sum_{i=1}^n w_i p_i\Big] - \sum_{i=1}^n w_i\, h[p_i] = \sum_{i=1}^n w_i\, D_{\mathrm{KL}}\Big[p_i : \sum_{j=1}^n w_j p_j\Big]. $$
This generalized Jensen-Shannon divergence is nowadays called the Jensen-Shannon diversity index.
In contrast with the Jeffreys divergence, the Jensen-Shannon divergence (JSD) $D_{\mathrm{JS}} := D_{\mathrm{JS}}^{1/2}$ is upper bounded by log 2 (without requiring the densities to have the same support), and its square root $\sqrt{D_{\mathrm{JS}}}$ is a metric distance [2, 3]. Lin cited precursor work [17, 8] yielding the definition of the Jensen-Shannon divergence: the Jensen-Shannon divergence of Eq. 1 is the so-called “increments of entropy” defined in (19) and (20) of [17].
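The equality between the two expressions of the JSD (average of the two K-divergences to the mixture, and the entropy form of Eq. 1 at α = ½), as well as the log 2 bound, can be checked numerically; the following sketch assumes discrete distributions with full support.

```python
# Check (discrete case) that the two forms of the JSD coincide and that
# the JSD is bounded by log 2; helper names are illustrative.
import numpy as np

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def entropy(p):
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.1, 0.8])
mix = 0.5 * (p + q)                   # arithmetic mixture (p+q)/2

js_from_kl = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
js_from_entropy = entropy(mix) - 0.5 * entropy(p) - 0.5 * entropy(q)
assert np.isclose(js_from_kl, js_from_entropy)
assert js_from_kl <= np.log(2)        # upper bound of the JSD
print(js_from_kl)
```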
The Jensen-Shannon diversity index was also obtained very differently by Sibson in 1969 when he defined the information radius [16] of order α using Rényi α-means and Rényi α-entropies [15]. In particular, the information radius IR1 of order 1 of a weighted set 𝒫 of densities is a diversity index obtained by solving the following variational optimization problem:
$$ \mathrm{IR}_1(\mathcal{P}) := \min_{c}\ \sum_{i=1}^n w_i\, D_{\mathrm{KL}}[p_i : c]. \tag{2} $$
Sibson solved a more general optimization problem, and obtained the following expression (term K1 in Corollary 2.3 of [16]):
$$ \mathrm{IR}_1(\mathcal{P}) = \sum_{i=1}^n w_i\, D_{\mathrm{KL}}\Big[p_i : \sum_{j=1}^n w_j p_j\Big] = h\Big[\sum_{i=1}^n w_i p_i\Big] - \sum_{i=1}^n w_i\, h[p_i] = D_{\mathrm{JS}}(\mathcal{P}). $$
Thus Eq. 2 provides a variational definition of the Jensen-Shannon diversity index (and of the Jensen-Shannon divergence when n = 2 and w = (½,½)).
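The variational characterization of Eq. 2 can be illustrated numerically in the discrete case: no candidate centroid c improves on the mixture ∑i wi pi, and the optimal value coincides with the Jensen-Shannon diversity index. The sketch below uses random candidates as a crude check; all names are illustrative.

```python
# Discrete illustration of the variational problem of Eq. 2: the mixture
# sum_i w_i p_i is never beaten by random candidate centroids, and the
# optimal value equals the Jensen-Shannon diversity index.
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def entropy(p):
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

P = rng.dirichlet(np.ones(4), size=3)       # densities p_1, p_2, p_3
w = np.array([0.5, 0.3, 0.2])               # positive weights summing to 1

def objective(c):
    return sum(wi * kl(pi, c) for wi, pi in zip(w, P))

mixture = w @ P                             # candidate centroid c = sum_i w_i p_i
best = objective(mixture)
for _ in range(5000):                       # random candidates never do better
    assert objective(rng.dirichlet(np.ones(4))) >= best - 1e-12

diversity = entropy(mixture) - sum(wi * entropy(pi) for wi, pi in zip(w, P))
assert np.isclose(best, diversity)
print(best)
```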
The K-divergence of Lin can be skewed with a scalar parameter α ∈ (0,1) to give
$$ D_{K}^{\alpha}[p:q] := D_{\mathrm{KL}}\big[p : (1-\alpha)p+\alpha q\big]. \tag{3} $$
The skewing parameter α was first studied in [6] (2001; see Table 2 of [6]). We proposed to unify the Jeffreys divergence with the Jensen-Shannon divergence as follows (Equation 19 in [9]):
$$ D_{K,J}^{\alpha}[p:q] := \frac{1}{2}\Big(D_{K}^{\alpha}[p:q] + D_{K}^{\alpha}[q:p]\Big) = \frac{1}{2}\Big(D_{\mathrm{KL}}\big[p:(1-\alpha)p+\alpha q\big] + D_{\mathrm{KL}}\big[q:(1-\alpha)q+\alpha p\big]\Big). \tag{4} $$
When α = ½, we have $D_{K,J}^{1/2} = D_{\mathrm{JS}}$, and when α = 1, we get $D_{K,J}^{1} = \tfrac{1}{2} D_J$.
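The two limiting cases of Eq. 4 can be checked numerically on discrete distributions; the sketch below assumes the displayed form of Eq. 4 (with its ½ factor) and uses illustrative helper names.

```python
# Numerical check of the two limiting cases of Eq. 4 (discrete case):
# alpha = 1/2 gives the JSD, alpha = 1 gives half of the Jeffreys divergence.
import numpy as np

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def unified(p, q, alpha):
    """0.5 * (D_KL[p:(1-a)p+aq] + D_KL[q:(1-a)q+ap])  (Eq. 4)."""
    return 0.5 * (kl(p, (1 - alpha) * p + alpha * q)
                  + kl(q, (1 - alpha) * q + alpha * p))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
mix = 0.5 * (p + q)

jsd = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
jeffreys = kl(p, q) + kl(q, p)
assert np.isclose(unified(p, q, 0.5), jsd)
assert np.isclose(unified(p, q, 1.0), 0.5 * jeffreys)
```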
Notice that computing the skewed divergence
$$ (1-\beta)\, D_{\mathrm{KL}}\big[p : (1-\alpha)p+\alpha q\big] + \beta\, D_{\mathrm{KL}}\big[q : (1-\alpha)p+\alpha q\big] $$
amounts to calculating
$$ h^{\times}\big[(1-\beta)p+\beta q : (1-\alpha)p+\alpha q\big] - (1-\beta)\,h[p] - \beta\,h[q], $$
where
$$ h^{\times}[p:q] := -\int p(x)\,\log q(x)\,\mathrm{d}\mu(x) $$
denotes the cross-entropy. By choosing α = β, we have h×[(1-β)p+βq : (1-α)p+αq] = h[(1-α)p+αq], and thus recover the skewed Jensen-Shannon divergence of Eq. 1.
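The cross-entropy identity above and the skew K-divergence of Eq. 3 can be checked numerically in the discrete case; the sketch below verifies that averaging the two sided KL terms with weights (1 − α, α) against the α-mixture reproduces the entropy form of Eq. 1.

```python
# Discrete check of the alpha = beta case: the (1-alpha, alpha)-weighted
# average of KL terms against the alpha-mixture equals the entropy form
# of the skewed JSD of Eq. 1; skew_k implements Eq. 3.
import numpy as np

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def entropy(p):
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

def skew_k(p, q, alpha):
    """Skew K-divergence D_K^alpha[p:q] = D_KL[p:(1-alpha)p + alpha q]."""
    return kl(p, (1 - alpha) * p + alpha * q)

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.2, 0.2, 0.6])
alpha = 0.3
m_alpha = (1 - alpha) * p + alpha * q

lhs = (1 - alpha) * kl(p, m_alpha) + alpha * kl(q, m_alpha)
rhs = entropy(m_alpha) - (1 - alpha) * entropy(p) - alpha * entropy(q)
assert np.isclose(lhs, rhs)
print(skew_k(p, q, 0.5), lhs)
```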
In [11] (2020), we considered a positive skewing vector α ∈ [0,1]^k and a positive unit weight vector w belonging to the standard simplex Δk, and defined the following vector-skewed Jensen-Shannon divergence:
$$ D_{\mathrm{JS}}^{\boldsymbol{\alpha},w}[p:q] := \sum_{i=1}^k w_i\, D_{\mathrm{KL}}\big[(1-\alpha_i)p+\alpha_i q : (1-\bar{\alpha})p+\bar{\alpha} q\big], \qquad \bar{\alpha} := \sum_{i=1}^k w_i \alpha_i. $$
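Assuming the vector-skew formula displayed above, the following discrete sketch checks the equivalent entropy form h[m_ᾱ] − ∑i wi h[m_αi]; all variable names are illustrative.

```python
# Discrete sketch of the vector-skew JSD displayed above: an average of
# KL divergences between skewed mixtures, equal to
# h[m_abar] - sum_i w_i h[m_alpha_i].
import numpy as np

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def entropy(p):
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.1, 0.4, 0.5])
alphas = np.array([0.2, 0.5, 0.9])          # skewing vector alpha
w = np.array([0.3, 0.3, 0.4])               # weight vector on the simplex

mixtures = [(1 - a) * p + a * q for a in alphas]
abar = float(w @ alphas)
m_bar = (1 - abar) * p + abar * q

vjs = sum(wi * kl(mi, m_bar) for wi, mi in zip(w, mixtures))
alt = entropy(m_bar) - sum(wi * entropy(mi) for wi, mi in zip(w, mixtures))
assert np.isclose(vjs, alt)
print(vjs)
```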
Unfortunately, the JSD between two Gaussian densities is not known in closed form because of the definite integral of a log-sum term (i.e., the K-divergence between a density and the arithmetic mixture density (p+q)/2). For the special case of the Cauchy family, a closed-form formula for the JSD between two Cauchy densities was obtained [14]. Thus we may instead choose a geometric mixture distribution [10] in place of the ordinary arithmetic mixture (p+q)/2. More generally, we can choose any weighted mean Mα (say, the geometric mean, the harmonic mean, or any other power mean) and define a generalization of the K-divergence of Equation 3:
$$ D_{K}^{M_\alpha}[p:q] := D_{\mathrm{KL}}\big[p : (pq)_{M_\alpha}\big], \tag{7} $$
where
$$ (pq)_{M_\alpha}(x) := \frac{M_\alpha\big(p(x),q(x)\big)}{Z_{M_\alpha}(p,q)} $$
is a statistical M-mixture, with ZMα(p,q) denoting the normalizing coefficient
$$ Z_{M_\alpha}(p,q) := \int M_\alpha\big(p(x),q(x)\big)\,\mathrm{d}\mu(x), $$
so that $\int (pq)_{M_\alpha}(x)\,\mathrm{d}\mu(x) = 1$. These M-mixtures are well-defined provided that the corresponding definite integrals converge.
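As an illustration, here is a discrete sketch of a normalized geometric M-mixture and of the corresponding generalized K-divergence of Eq. 7 (the geometric mean Gα(a,b) = a^(1−α) b^α is one admissible choice of weighted mean Mα); names are illustrative.

```python
# Discrete sketch of a normalized geometric mixture and of the induced
# generalized K-divergence of Eq. 7 (geometric weighted mean M_alpha).
import numpy as np

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def geometric_mixture(p, q, alpha):
    unnorm = p ** (1 - alpha) * q ** alpha   # geometric mean, coordinate-wise
    Z = unnorm.sum()                         # normalizer Z_{M_alpha}(p, q)
    return unnorm / Z

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.2, 0.3, 0.5])
gmix = geometric_mixture(p, q, 0.5)
assert np.isclose(gmix.sum(), 1.0)           # the M-mixture is a density
print(kl(p, gmix))                           # geometric K-divergence D_K^{G_alpha}[p:q]
```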
Then we define a generalization of the JSD [10], termed the (Mα,Nβ)-Jensen-Shannon divergence, as follows:
$$ D_{\mathrm{JS}}^{M_\alpha,N_\beta}[p:q] := N_\beta\Big(D_{K}^{M_\alpha}[p:q],\, D_{K}^{M_\alpha}[q:p]\Big), \tag{8} $$
where Nβ is yet another weighted mean used to average the two Mα-K-divergences. We have $D_{\mathrm{JS}} = D_{\mathrm{JS}}^{A,A}$, where A(a,b) = (a+b)/2 denotes the arithmetic mean. The geometric JSD yields a closed-form formula between two multivariate Gaussians, and has been used in deep learning [1]. More generally, we may consider the Jensen-Shannon symmetrization of an arbitrary distance D as
$$ D^{\mathrm{JS}}[p:q] := \min_{c}\ \frac{1}{2}\big(D[p:c]+D[q:c]\big), \tag{9} $$
or, using a weighted mean Sw to aggregate the two sided terms,
$$ D^{\mathrm{JS}}_{S_w}[p:q] := \min_{c}\ S_w\big(D[p:c],\,D[q:c]\big). \tag{10} $$
When Sw = Aw (with Aw(a1,…,an) = ∑_{i=1}^n wi ai the weighted arithmetic mean) and D = DKL, we recover the ordinary Jensen-Shannon divergence and, for the n weighted densities of 𝒫, the ordinary Jensen-Shannon diversity index of Eq. 2. More generally, we define the S-Jensen-Shannon index of an arbitrary distance D as [12]:
$$ D^{\mathrm{JS}}_{S_w}(\mathcal{P}) := \min_{c}\ S_w\big(D[p_1:c],\ldots,D[p_n:c]\big). \tag{11} $$
When n = 2, this yields a Jensen-Shannon symmetrization of the distance D.
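Assuming the form of Eq. 11 above, the S-Jensen-Shannon index can be approximated numerically; the sketch below takes Sw to be a weighted power mean, D the KL divergence, and searches the centroid c by crude random sampling on the simplex in place of a proper optimizer (all names are illustrative).

```python
# Crude numerical sketch of the S-Jensen-Shannon index (Eq. 11): S_w is a
# weighted power mean, D is the KL divergence, and the centroid c is
# searched by random sampling on the probability simplex.
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def power_mean(values, w, r=2.0):
    values = np.asarray(values, float)
    return float((w @ values ** r) ** (1.0 / r))

P = rng.dirichlet(np.ones(5), size=3)        # three weighted densities
w = np.array([0.2, 0.3, 0.5])

def objective(c):
    return power_mean([kl(pi, c) for pi in P], w)

best = min(objective(rng.dirichlet(np.ones(5))) for _ in range(20000))
print(best)                                  # approximate S-Jensen-Shannon index
```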
The variational optimization defining the JSD can also be constrained to a (parametric) family of densities 𝒬, thus defining the (S,𝒬)-relative Jensen-Shannon diversity index [12]:
$$ D^{\mathrm{JS}}_{S_w,\mathcal{Q}}(\mathcal{P}) := \min_{c\in\mathcal{Q}}\ S_w\big(D_{\mathrm{KL}}[p_1:c],\ldots,D_{\mathrm{KL}}[p_n:c]\big). \tag{12} $$
The relative Jensen-Shannon divergences are useful for clustering applications: let pθ1 and pθ2 be two densities of an exponential family ℰ with cumulant function F(θ). Then the ℰ-relative Jensen-Shannon divergence is the Bregman information of 𝒫2(pθ1, pθ2) for the conjugate function F*(η) = −h[pθ] (with η = ∇F(θ)). That is, the ℰ-relative JSD amounts to a Jensen divergence for F*:
$$ D^{\mathrm{JS}}_{\mathcal{E}}\big[p_{\theta_1}:p_{\theta_2}\big] = \frac{F^*(\eta_1)+F^*(\eta_2)}{2} - F^*\!\left(\frac{\eta_1+\eta_2}{2}\right), \qquad \eta_i := \nabla F(\theta_i). $$
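The last statement can be checked concretely for the categorical exponential family (which is carrier-free, so that F*(η) = −h[pθ] with the mean parameter η identified with the probability vector): the family-relative JSD is then the ordinary JSD, and it coincides with the Jensen divergence induced by F*. A minimal sketch:

```python
# Check for the categorical exponential family: the Jensen divergence for
# F*(eta) = -h[p_theta] (negative Shannon entropy, eta identified with the
# probability vector) equals the ordinary Jensen-Shannon divergence.
import numpy as np

def entropy(p):
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def jensen_negentropy(e1, e2):
    Fstar = lambda eta: -entropy(eta)        # F*(eta) = -h[p_theta]
    return 0.5 * Fstar(e1) + 0.5 * Fstar(e2) - Fstar(0.5 * (e1 + e2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
mix = 0.5 * (p + q)
jsd = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
assert np.isclose(jsd, jensen_negentropy(p, q))
print(jsd)
```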
[1] Jacob Deasy, Nikola Simidjievski, and Pietro Liò. Constraining Variational Inference with Geometric Jensen-Shannon Divergence. In Advances in Neural Information Processing Systems, 2020.
[2] Dominik Maria Endres and Johannes E. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.
[3] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT 2004), page 31. IEEE, 2004.
[4] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
[5] Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.
[6] Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Artificial Intelligence and Statistics (AISTATS), pages 65–72, 2001.
[7] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[8] Jianhua Lin and SKM Wong. Approximation of discrete probability distributions based on a new divergence measure. Congressus Numerantium (Winnipeg), 61:75–80, 1988.
[9] Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv preprint arXiv:1009.4004, 2010. URL https://arxiv.org/abs/1009.4004.
[10] Frank Nielsen. On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21(5), 2019. ISSN 1099-4300. doi: 10.3390/e21050485. URL https://www.mdpi.com/1099-4300/21/5/485.
[11] Frank Nielsen. On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid. Entropy, 22(2), 2020. ISSN 1099-4300. doi: 10.3390/e22020221. URL https://www.mdpi.com/1099-4300/22/2/221.
[12] Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy, 23(4), 2021. ISSN 1099-4300. doi: 10.3390/e23040464. URL https://www.mdpi.com/1099-4300/23/4/464.
[13] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.
[14] Frank Nielsen and Kazuki Okamura. On f-divergences between Cauchy distributions. arXiv preprint arXiv:2101.12459, 2021.
[15] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
[16] Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(2):149–160, 1969.
[17] Andrew KC Wong and Manlai You. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (5):599–609, 1985.