Discrepancies, dissimilarities, divergences, and distances

Frank Nielsen
Sony Computer Science Laboratories Inc.
Tokyo, Japan

13th August 2021, updated August 16, 2021

This is a working document which will be frequently updated with materials concerning the discrepancy between two distributions.

This document is also available as a PDF: Distance.pdf

Many other terms are used in the literature to refer to a dissimilarity; for example, the following five D's: discrepancies, deviations, dissimilarities, divergences, and distances.

Contents

 1 Statistical distances between densities with computationally intractable normalizers
 2 Statistical distances between empirical distributions and densities with computationally intractable normalizers
 3 The Jensen-Shannon divergence and some generalizations
  3.1 Origins of the Jensen-Shannon divergence
  3.2 Some extensions of the Jensen-Shannon divergence
 4 Statistical distances between mixtures
  4.1 Approximating and/or fast statistical distances between mixtures
  4.2 Bounding statistical distances between mixtures
  4.3 Newly designed statistical distances yielding closed-form formula for mixtures

1 Statistical distances between densities with computationally intractable normalizers

Consider a density $p(x) = \frac{\tilde{p}(x)}{Z_p}$, where $\tilde{p}(x)$ is an unnormalized computable density and $Z_p = \int \tilde{p}(x)\,\mathrm{d}\mu(x)$ is the computationally intractable normalizer (also called the partition function or free energy in statistical physics). A statistical distance $D[p_1 : p_2]$ between two densities $p_1(x) = \frac{\tilde{p}_1(x)}{Z_{p_1}}$ and $p_2(x) = \frac{\tilde{p}_2(x)}{Z_{p_2}}$ with computationally intractable normalizers $Z_{p_1}$ and $Z_{p_2}$ is said to be projective (or two-sided homogeneous) if and only if

\[
\forall \lambda_1 > 0, \lambda_2 > 0, \qquad D[p_1 : p_2] = D[\lambda_1 p_1 : \lambda_2 p_2].
\]

In particular, letting $\lambda_1 = Z_{p_1}$ and $\lambda_2 = Z_{p_2}$, we have

\[
D[p_1 : p_2] = D[\tilde{p}_1 : \tilde{p}_2].
\]

Notice that the right-hand side does not rely on the computationally intractable normalizers. These projective distances are useful in statistical inference based on minimum distance estimators [2] (see the next section).
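For instance, the Cauchy-Schwarz divergence [9] is projective, since rescaling either argument multiplies the numerator and the denominator of its inner ratio by the same factor. The following minimal numerical sketch (plain NumPy quadrature on a grid; the unnormalized densities are arbitrary illustrative choices, not taken from the text) checks this scale invariance:

import numpy as np

# Grid for crude numerical integration on [-10, 10].
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]

def cauchy_schwarz_divergence(p, q):
    """D_CS[p:q] = -log( <p,q> / (||p||_2 ||q||_2) ), projective by construction."""
    num = np.trapz(p * q, dx=dx)
    den = np.sqrt(np.trapz(p * p, dx=dx) * np.trapz(q * q, dx=dx))
    return -np.log(num / den)

# Two unnormalized computable densities (a Gaussian shape and a Laplace shape).
p1_tilde = np.exp(-0.5 * (x - 1.0) ** 2)
p2_tilde = np.exp(-np.abs(x + 0.5))

# Rescaling either argument leaves the divergence unchanged (projectivity).
d0 = cauchy_schwarz_divergence(p1_tilde, p2_tilde)
d1 = cauchy_schwarz_divergence(3.7 * p1_tilde, 0.2 * p2_tilde)
assert np.isclose(d0, d1)
print(d0, d1)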

Here are a few statistical projective distances:

2 Statistical distances between empirical distributions and densities with computationally intractable normalizers

When estimating the parameter $\hat{\theta}$ for a parametric family of distributions $\{p_\theta\}$ from i.i.d. observations $S = \{x_1, \ldots, x_n\}$, we can define a minimum distance estimator (MDE):

\[
\hat{\theta} = \arg\min_{\theta}\, D[p_S : p_\theta],
\]

where $p_S = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ is the (normalized) empirical distribution. Thus we only need a right-sided projective divergence to estimate models with computationally intractable normalizers. For example, the Maximum Likelihood Estimator (MLE) is an MDE with respect to the KLD:

\[
\hat{\theta}_{\mathrm{MLE}} = \arg\min_{\theta}\, D_{\mathrm{KL}}[p_S : p_\theta].
\]

It is thus interesting to study the impact of the choice of the distance $D$ on the properties of the corresponding estimator (e.g., γ-divergences yield provably robust estimators [6]).
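Since $D_{\mathrm{KL}}[p_S : p_\theta]$ equals the average negative log-likelihood up to an additive constant (the entropy of $p_S$, which does not depend on $\theta$), the MLE can be obtained by plain numerical minimization. A minimal sketch for a univariate Gaussian family, assuming SciPy is available for the optimization (the data are simulated here purely for illustration):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=5000)   # i.i.d. observations S

def neg_log_likelihood(theta):
    """Average negative log-likelihood, i.e., D_KL[p_S : p_theta] up to a constant."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                          # enforce sigma > 0
    z = (samples - mu) / sigma
    return np.mean(0.5 * z ** 2 + np.log(sigma) + 0.5 * np.log(2.0 * np.pi))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                  # close to the sample mean and standard deviation
print(samples.mean(), samples.std())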

3 The Jensen-Shannon divergence and some generalizations

3.1 Origins of the Jensen-Shannon divergence

Let $(\mathcal{X},\mathcal{F})$ be a measurable space, and let $(w_1,P_1),\ldots,(w_n,P_n)$ be $n$ weighted probability measures dominated by a measure $\mu$ (with $w_i > 0$ and $\sum_i w_i = 1$). Denote by $\mathcal{P} := \{(w_1,p_1),\ldots,(w_n,p_n)\}$ the set of their weighted Radon-Nikodym densities $p_i = \frac{\mathrm{d}P_i}{\mathrm{d}\mu}$ with respect to $\mu$.

A statistical divergence $D[p:q]$ is a measure of dissimilarity between two densities $p$ and $q$ (i.e., a 2-point distance) such that $D[p:q] \geq 0$, with equality if and only if $p(x) = q(x)$ $\mu$-almost everywhere. A statistical diversity index $D(\mathcal{P})$ is a measure of variation of the weighted densities in $\mathcal{P}$ related to a measure of centrality, i.e., an $n$-point distance which generalizes the notion of a 2-point distance when $\mathcal{P}_2(p,q) := \{(\frac{1}{2},p),(\frac{1}{2},q)\}$:

\[
D[p:q] := D(\mathcal{P}_2(p,q)).
\]

The fundamental measure of dissimilarity in information theory is the I-divergence (also called the Kullback-Leibler divergence, KLD; see Equation (2.5), page 5 of [12]):

\[
D_{\mathrm{KL}}[p:q] := \int_{\mathcal{X}} p(x)\, \log\left(\frac{p(x)}{q(x)}\right) \mathrm{d}\mu(x).
\]

The KLD is asymmetric (hence the delimiter notation ":" instead of ",") but can be symmetrized by defining the Jeffreys J-divergence (Jeffreys divergence, denoted by $I_2$ in Equation (1) of the 1946 paper [8]):

\[
D_J[p,q] := D_{\mathrm{KL}}[p:q] + D_{\mathrm{KL}}[q:p] = \int_{\mathcal{X}} \left(p(x) - q(x)\right) \log\left(\frac{p(x)}{q(x)}\right) \mathrm{d}\mu(x).
\]

Although symmetric, any positive power of the Jeffreys divergence fails to satisfy the triangle inequality: that is, $D_J^\alpha$ is never a metric distance for any $\alpha > 0$, and furthermore $D_J^\alpha$ cannot be upper bounded.

In 1991, Lin proposed the asymmetric K-divergence (Equation (3.2) in [14]):

\[
D_K[p:q] := D_{\mathrm{KL}}\left[p : \frac{p+q}{2}\right],
\]

and defined the L-divergence by analogy to Jeffreys’s symmetrization of the KLD (Equation (3.4) in [14]):

\[
D_L[p,q] = D_K[p:q] + D_K[q:p].
\]

By noticing that

\[
D_L[p,q] = 2\, h\!\left[\frac{p+q}{2}\right] - (h[p] + h[q]),
\]

where $h$ denotes the Shannon entropy (Equation (3.14) in [14]), Lin coined the (skewed) Jensen-Shannon divergence between two weighted densities $(1-\alpha, p)$ and $(\alpha, q)$ for $\alpha \in (0,1)$ as follows (Equation (4.1) in [14]):

\[
D_{\mathrm{JS},\alpha}[p,q] = h[(1-\alpha)p + \alpha q] - (1-\alpha)h[p] - \alpha h[q]. \tag{1}
\]
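As a quick numerical sanity check of Lin's identity $D_L[p,q] = 2h[\frac{p+q}{2}] - (h[p]+h[q]) = 2\, D_{\mathrm{JS},\frac{1}{2}}[p,q]$, the sketch below evaluates the three expressions on two arbitrary discrete distributions (illustrative values, not taken from [14]):

import numpy as np

def entropy(p):
    """Shannon entropy h[p] = -sum p log p (natural logarithm)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
m = 0.5 * (p + q)

d_L = kl(p, m) + kl(q, m)                                  # Lin's L-divergence
identity = 2.0 * entropy(m) - (entropy(p) + entropy(q))    # 2 h[(p+q)/2] - h[p] - h[q]
d_JS = entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)    # Eq. (1) with alpha = 1/2

assert np.isclose(d_L, identity) and np.isclose(d_L, 2.0 * d_JS)
print(d_L, identity, 2.0 * d_JS)                           # all three coincide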

Finally, Lin defined the generalized Jensen-Shannon divergence (Equation (5.1) in [14]) for a finite weighted set of densities:

\[
D_{\mathrm{JS}}[\mathcal{P}] = h\!\left[\sum_i w_i p_i\right] - \sum_i w_i h[p_i].
\]

This generalized Jensen-Shannon divergence is nowadays called the Jensen-Shannon diversity index.

In contrast with the Jeffreys divergence, the Jensen-Shannon divergence (JSD) $D_{\mathrm{JS}} := D_{\mathrm{JS},\frac{1}{2}}$ is upper bounded by $\log 2$ (and does not require the densities to have the same support), and $\sqrt{D_{\mathrm{JS}}}$ is a metric distance [4, 5]. Lin cited precursor works [42, 15] yielding a definition of the Jensen-Shannon divergence: the Jensen-Shannon divergence of Eq. 1 is the so-called "increments of entropy" defined in Equations (19) and (20) of [42].
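These two properties are easy to probe empirically on random discrete distributions; the following sketch checks the $\log 2$ bound and spot-checks the triangle inequality of $\sqrt{D_{\mathrm{JS}}}$ on sampled triples (a numerical illustration only; the proofs are in [4, 5]):

import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def jsd(p, q):
    """Jensen-Shannon divergence (alpha = 1/2, natural logarithm)."""
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)

for _ in range(10000):
    p, q, r = rng.dirichlet(np.ones(4), size=3)      # three random distributions on 4 atoms
    assert jsd(p, q) <= np.log(2.0) + 1e-12          # upper bounded by log 2
    # sqrt(JSD) satisfies the triangle inequality on this sampled triple:
    assert np.sqrt(jsd(p, r)) <= np.sqrt(jsd(p, q)) + np.sqrt(jsd(q, r)) + 1e-12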

The Jensen-Shannon diversity index was also obtained, very differently, by Sibson in 1969 when he defined the information radius [40] of order $\alpha$ using Rényi $\alpha$-means and Rényi $\alpha$-entropies [38]. In particular, the information radius $\mathrm{IR}_1$ of order 1 of a weighted set $\mathcal{P}$ of densities is a diversity index obtained by solving the following variational optimization problem:

\[
\mathrm{IR}_1[\mathcal{P}] := \min_{c} \sum_{i=1}^n w_i\, D_{\mathrm{KL}}[p_i : c]. \tag{2}
\]

Sibson solved a more general optimization problem, and obtained the following expression (term $K_1$ in Corollary 2.3 of [40]):

\[
\mathrm{IR}_1[\mathcal{P}] = h\!\left[\sum_i w_i p_i\right] - \sum_i w_i h[p_i] = D_{\mathrm{JS}}[\mathcal{P}].
\]

Thus Eq. 2 is a variational definition of the Jensen-Shannon divergence.
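Numerically, the optimal center $c$ in Eq. 2 is the mixture $\sum_i w_i p_i$, and the minimum value coincides with the Jensen-Shannon diversity index. The sketch below spot-checks this on random discrete densities by comparing the objective at the mixture against random candidate centers (a sanity check, not a proof):

import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

n, d = 3, 5
densities = rng.dirichlet(np.ones(d), size=n)     # discrete densities p_1, ..., p_n
w = rng.dirichlet(np.ones(n))                     # positive weights summing to 1

def objective(c):
    """Right-hand side of Eq. 2 for a candidate center c."""
    return sum(w[i] * kl(densities[i], c) for i in range(n))

mixture = w @ densities                           # candidate optimum: sum_i w_i p_i
js_diversity = entropy(mixture) - sum(w[i] * entropy(densities[i]) for i in range(n))

assert np.isclose(objective(mixture), js_diversity)   # value at the mixture = JS diversity
for _ in range(1000):                                  # the mixture beats random candidates
    c = rng.dirichlet(np.ones(d))
    assert objective(c) >= objective(mixture) - 1e-12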

3.2 Some extensions of the Jensen-Shannon divergence

4 Statistical distances between mixtures

Pearson [36] first considered a unimodal Gaussian mixture of two components for modeling the distribution of crab measurements in 1894. Statistical mixtures [16] like Gaussian mixture models (GMMs) are often met in information sciences, and it is therefore important to assess their dissimilarities. Let $m(x) = \sum_{i=1}^{k} w_i p_i(x)$ and $m'(x) = \sum_{i=1}^{k'} w_i' p_i'(x)$ be two finite statistical mixtures. The KLD between two GMMs $m$ and $m'$ is not analytic [41] because of the log-sum terms:

\[
D_{\mathrm{KL}}[m : m'] = \int m(x)\, \log\frac{m(x)}{m'(x)}\, \mathrm{d}x.
\]

However, the KLD between two GMMs with the same prescribed components $p_i(x) = p_i'(x) = p_{\mu_i,\Sigma_i}(x)$ (i.e., $k = k'$, and only the normalized positive weights may differ) is provably a Bregman divergence [28] for the differential negentropy $F(\theta)$:

\[
D_{\mathrm{KL}}[m(\theta) : m(\theta')] = B_F(\theta, \theta'),
\]

where $m(\theta) = \sum_{i=1}^{k-1} w_i p_i(x) + \left(1 - \sum_{i=1}^{k-1} w_i\right) p_k(x)$ with $\theta = (w_1,\ldots,w_{k-1})$, and $F(\theta) = \int m(\theta)(x) \log m(\theta)(x)\, \mathrm{d}x$. The family $\{m_\theta \,:\, \theta \in \Delta_{k-1}\}$ is called a mixture family in information geometry, where $\Delta_{k-1}$ denotes the $(k-1)$-dimensional open standard simplex. However, $F(\theta)$ is usually not available in closed form because of the log-sum integral. In some special cases, like the mixture of two prescribed Cauchy distributions, we get closed-form formulas for the KLD, JSD, etc. [30, 25]. Thus, when dealing with mixtures (like GMMs), we either need efficient techniques for approximating (§4.1) or bounding (§4.2) the KLD, or new distances (§4.3) that yield closed-form formulas between mixture densities.
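Since $D_{\mathrm{KL}}[m : m']$ has no closed form in general, a common baseline is the plain Monte Carlo estimate $\frac{1}{s}\sum_{j=1}^{s} \log\frac{m(x_j)}{m'(x_j)}$ with $x_j \sim m$. A minimal sketch for two univariate GMMs (the mixture parameters below are arbitrary illustrative values):

import numpy as np

rng = np.random.default_rng(3)

def gmm_pdf(x, weights, means, stds):
    """Density of a univariate Gaussian mixture evaluated at the points x."""
    comps = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return comps @ weights

def gmm_sample(size, weights, means, stds):
    """Ancestral sampling: pick a component index, then sample from that Gaussian."""
    idx = rng.choice(len(weights), size=size, p=weights)
    return rng.normal(means[idx], stds[idx])

# Two univariate GMMs m and m'.
w1, mu1, s1 = np.array([0.3, 0.7]), np.array([-1.0, 2.0]), np.array([0.5, 1.0])
w2, mu2, s2 = np.array([0.6, 0.4]), np.array([0.0, 3.0]), np.array([1.0, 0.8])

# Monte Carlo estimate of D_KL[m : m'] = E_m[log m(X) - log m'(X)].
x = gmm_sample(100000, w1, mu1, s1)
kl_mc = np.mean(np.log(gmm_pdf(x, w1, mu1, s1)) - np.log(gmm_pdf(x, w2, mu2, s2)))
print(kl_mc)   # stochastic estimate; converges as the sample size grows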

4.1 Approximating and/or fast statistical distances between mixtures

4.2 Bounding statistical distances between mixtures

4.3 Newly designed statistical distances yielding closed-form formula for mixtures

Initially created 13th August 2021 (last updated August 16, 2021).

References

[1]   Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.

[2]   Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park. Statistical inference: the minimum distance approach. Chapman and Hall/CRC, 2019.

[3]   Jacob Deasy, Nikola Simidjievski, and Pietro Liò. Constraining Variational Inference with Geometric Jensen-Shannon Divergence. In Advances in Neural Information Processing Systems, 2020.

[4]   Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.

[5]   Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings, page 31. IEEE, 2004.

[6]   Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.

[7]   Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

[8]   Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.

[9]   Robert Jenssen, Jose C Principe, Deniz Erdogmus, and Torbjørn Eltoft. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6):614–629, 2006.

[10]   MC Jones, Nils Lid Hjort, Ian R Harris, and Ayanendranath Basu. A comparison of related density-based minimum divergence estimators. Biometrika, 88(3):865–873, 2001.

[11]   Kittipat Kampa, Erion Hasanbelliu, and Jose C Principe. Closed-form Cauchy-Schwarz PDF divergence for mixture of Gaussians. In The 2011 International Joint Conference on Neural Networks, pages 2578–2585. IEEE, 2011.

[12]   Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.

[13]   Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Artificial Intelligence and Statistics (AISTATS), pages 65–72, 2001.

[14]   Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.

[15]   Jianhua Lin and SKM Wong. Approximation of discrete probability distributions based on a new divergence measure. Congressus Numerantium (Winnipeg), 61:75–80, 1988.

[16]   Geoffrey J McLachlan and Kaye E Basford. Mixture models: Inference and applications to clustering, volume 38. M. Dekker New York, 1988.

[17]   Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv preprint arXiv:1009.4004, 2010.

[18]   Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pages 1723–1726. IEEE, 2012.

[19]   Frank Nielsen. On the Jensen-Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21(5), 2019.

[20]   Frank Nielsen. The statistical Minkowski distances: Closed-form formula for Gaussian mixture models. In International Conference on Geometric Science of Information, pages 359–367. Springer, 2019.

[21]   Frank Nielsen. On a Generalization of the Jensen-Shannon Divergence and the Jensen-Shannon Centroid. Entropy, 22(2), 2020.

[22]   Frank Nielsen. Fast approximations of the Jeffreys divergence between univariate Gaussian mixture models via exponential polynomial densities. arXiv preprint arXiv:2107.05901, 2021.

[23]   Frank Nielsen. Fast approximations of the Jeffreys divergence between univariate Gaussian mixture models via exponential polynomial densities. arXiv preprint arXiv:2107.05901, 2021.

[24]   Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy, 23(4), 2021.

[25]   Frank Nielsen. The dually flat information geometry of the mixture family of two prescribed Cauchy components. arXiv preprint arXiv:2104.13801, 2021.

[26]   Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.

[27]   Frank Nielsen and Richard Nock. Patch matching with polynomial exponential families and projective divergences. In International Conference on Similarity Search and Applications, pages 109–116. Springer, 2016.

[28]   Frank Nielsen and Richard Nock. On the geometry of mixtures of prescribed distributions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2861–2865. IEEE, 2018.

[29]   Frank Nielsen and Kazuki Okamura. On f-divergences between Cauchy distributions. arXiv preprint arXiv:2101.12459, 2021.

[30]   Frank Nielsen and Kazuki Okamura. On f-divergences between Cauchy distributions. arXiv preprint arXiv:2101.12459, 2021.

[31]   Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.

[32]   Frank Nielsen and Ke Sun. Combinatorial bounds on the α-divergence of univariate mixture models. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pages 4476–4480. IEEE, 2017.

[33]   Frank Nielsen and Ke Sun. Guaranteed Deterministic Bounds on the total variation distance between univariate mixtures. In 28th IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2018, Aalborg, Denmark, September 17-20, 2018, pages 1–6. IEEE, 2018.

[34]   Frank Nielsen and Ke Sun. Clustering in Hilbert’s projective geometry: The case studies of the probability simplex and the elliptope of correlation matrices. In Geometric Structures of Information, pages 297–331. Springer, 2019.

[35]   Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences. Entropy, 19(3):122, 2017.

[36]   Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.

[37]   Souvik Ray, Subrata Pal, Sumit Kumar Kar, and Ayanendranath Basu. Characterizing the functional density power divergence class. arXiv preprint arXiv:2105.06094, 2021.

[38]   Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.

[39]   Olivier Schwander, Stéphane Marchand-Maillet, and Frank Nielsen. Comix: Joint estimation and lightspeed comparison of mixture models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pages 2449–2453. IEEE, 2016.

[40]   Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(2):149–160, 1969.

[41]   Sumio Watanabe, Keisuke Yamazaki, and Miki Aoyagi. Kullback information of normal mixture is not an analytic function. IEICE technical report. Neurocomputing, 104(225):41–46, 2004.

[42]   Andrew KC Wong and Manlai You. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (5):599–609, 1985.