## 2014

Taborda, Camilo G.; Perez-Cruz, Fernando; Guo, Dongning: "New Information-Estimation Results for Poisson, Binomial and Negative Binomial Models." In: 2014 IEEE International Symposium on Information Theory, pp. 2207–2211, IEEE, Honolulu, 2014. ISBN: 978-1-4799-5186-4. DOI: 10.1109/ISIT.2014.6875225. URL: http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=6875225

Abstract: In recent years, a number of mathematical relationships have been established between information measures and estimation measures for various models, including Gaussian, Poisson and binomial models. In this paper, it is shown that the second derivative of the input-output mutual information with respect to the input scaling can be expressed as the expectation of a certain Bregman divergence pertaining to the conditional expectations of the input and the input power. This result parallels the one known for the Gaussian model, where the Bregman divergence is the square distance. In addition, the Poisson, binomial and negative binomial models are shown to be similar in the small-scaling regime, in the sense that the derivative of the mutual information and the derivative of the relative entropy converge to the same value.
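For context, the standard objects behind these statements can be written out. The Bregman divergence definition below is textbook material, and the Gaussian benchmark is the first-derivative I-MMSE relation of Guo, Shamai and Verdú; the paper's second-derivative identity itself is not reproduced here.

```latex
% Bregman divergence induced by a convex function \phi; for \phi(x) = x^2
% it reduces to the square distance, the Gaussian case the abstract mentions.
\[
  D_\phi(x, y) \;=\; \phi(x) - \phi(y) - \phi'(y)\,(x - y),
  \qquad D_{x^2}(x, y) = (x - y)^2 .
\]
% Gaussian benchmark (I-MMSE relation): the first derivative of the mutual
% information with respect to the SNR is half the minimum mean-square error.
\[
  \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}}\,
  I\bigl(X;\sqrt{\mathrm{snr}}\,X + N\bigr)
  \;=\; \tfrac{1}{2}\,\mathbb{E}\bigl[(X - \mathbb{E}[X \mid Y])^2\bigr].
\]
```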

## 2012

Leiva-Murillo, Jose M.; Artés-Rodríguez, Antonio: "Information-Theoretic Linear Feature Extraction Based on Kernel Density Estimators: A Review." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6), pp. 1180–1189, 2012. ISSN: 1094-6977. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6185689

Abstract: In this paper, we provide a unified study of the application of kernel density estimators to supervised linear feature extraction by means of criteria inspired by information and detection theory. We enrich this study by incorporating two novel criteria, the mutual information and the likelihood ratio test, and perform both a theoretical and an experimental comparison between the new methods and others previously described in the literature. The impact of the bandwidth selection of the density estimator on classification performance is discussed. Some theoretical results that bound classification performance as a function of mutual information are also compiled. A set of experiments on different real-world datasets allows us to perform an empirical comparison of the methods in terms of both accuracy and computational complexity. We show the suitability of these methods for determining the dimension of the subspace that contains the discriminative information.
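To make the kind of criterion the review covers concrete, here is a minimal plug-in estimate of the mutual information between a one-dimensional projection and the class label, built on SciPy's Gaussian kernel density estimator. This is a generic sketch under assumed conventions (resubstitution entropy estimate, Scott's-rule bandwidth, the function names), not the paper's exact estimator.

```python
import numpy as np
from scipy.stats import gaussian_kde

def entropy_kde(samples):
    """Resubstitution estimate of differential entropy from a 1-D sample,
    using a Gaussian KDE with its default (Scott's-rule) bandwidth."""
    kde = gaussian_kde(samples)
    return -np.mean(np.log(kde(samples)))

def mutual_information_1d(projection, labels):
    """Plug-in estimate of I(Y; C) = h(Y) - sum_c P(c) h(Y | C = c)
    for a scalar feature Y and discrete class label C."""
    h_y = entropy_kde(projection)
    h_y_given_c = 0.0
    for c in np.unique(labels):
        yc = projection[labels == c]
        h_y_given_c += (len(yc) / len(projection)) * entropy_kde(yc)
    return h_y - h_y_given_c

# Toy usage: score one candidate projection direction w on 2-D data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.repeat([0, 1], 200)
w = np.array([1.0, 0.0])
print(mutual_information_1d(X @ w, y))
```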

## 2010

Koch, Tobias; Lapidoth, Amos: "Gaussian Fading Is the Worst Fading." IEEE Transactions on Information Theory, 56(3), pp. 1158–1165, 2010. ISSN: 0018-9448. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5429105

Abstract: The capacity of peak-power limited, single-antenna, noncoherent, flat-fading channels with memory is considered. The emphasis is on the capacity pre-log, i.e., on the limiting ratio of channel capacity to the logarithm of the signal-to-noise ratio (SNR), as the SNR tends to infinity. It is shown that, among all stationary and ergodic fading processes of a given spectral distribution function and whose law has no mass point at zero, the Gaussian process gives rise to the smallest pre-log. The assumption that the law of the fading process has no mass point at zero is essential in the sense that there exist stationary and ergodic fading processes whose law has a mass point at zero and that give rise to a smaller pre-log than the Gaussian process of equal spectral distribution function. An extension of these results to multiple-input single-output (MISO) fading channels with memory is also presented.
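In symbols, the abstract's claims read as follows; the pre-log definition restates the abstract, and the inequality is just the main theorem in compact form (F denotes the spectral distribution function, and the law of the process is assumed to have no mass point at zero).

```latex
% Capacity pre-log: limiting ratio of capacity to log-SNR.
\[
  \Pi \;\triangleq\; \lim_{\mathrm{SNR}\to\infty}
      \frac{C(\mathrm{SNR})}{\log \mathrm{SNR}} .
\]
% Main theorem, compactly: among all stationary ergodic fading processes
% \{H_k\} of spectral distribution function F with no mass point at zero,
% the Gaussian one achieves the smallest pre-log:
\[
  \Pi_{\{H_k\}} \;\ge\; \Pi_{\mathrm{Gaussian},\,F} .
\]
```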

## 2008

Perez-Cruz, Fernando: "Kullback-Leibler Divergence Estimation of Continuous Distributions." In: 2008 IEEE International Symposium on Information Theory, pp. 1666–1670, IEEE, Toronto, 2008. ISBN: 978-1-4244-2256-2. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4595271

Abstract: We present a method for estimating the KL divergence between continuous densities and we prove that it converges almost surely. Divergence estimation is typically solved by estimating the densities first. Our main result shows that this intermediate step is unnecessary and that the divergence can be estimated using either the empirical cdf or k-nearest-neighbour density estimation, which does not converge to the true measure for finite k. The convergence proof is based on describing the statistics of our estimator using waiting-times distributions, such as the exponential or Erlang. We illustrate the proposed estimators, show how they compare to existing methods based on density estimation, and outline how our divergence estimators can be used to solve the two-sample problem.
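The k-nearest-neighbour estimator referred to in the abstract admits a compact implementation. The sketch below follows the common form of this estimator, built on SciPy's k-d tree; conventions such as the bias-correcting constants may differ slightly from the paper's.

```python
import numpy as np
from scipy.spatial import cKDTree

def kl_divergence_knn(x, y, k=1):
    """k-NN estimate of D(P || Q) from samples x ~ P (shape (n, d))
    and y ~ Q (shape (m, d)), in nats."""
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    n, d = x.shape
    m = y.shape[0]
    # rho: distance from each x_i to its k-th nearest neighbour within x;
    # query k+1 neighbours because the nearest one is x_i itself.
    rho = cKDTree(x).query(x, k + 1)[0][:, -1]
    # nu: distance from each x_i to its k-th nearest neighbour in y.
    nu = cKDTree(y).query(x, k)[0]
    if k > 1:
        nu = nu[:, -1]
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))

# Toy check against the closed form for N(0,1) vs N(1,1):
# D = (mu_p - mu_q)^2 / 2 = 0.5 nats.
rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, (5000, 1))
q = rng.normal(1.0, 1.0, (5000, 1))
print(kl_divergence_knn(p, q, k=5))  # roughly 0.5
```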

Leiva-Murillo, Jose M.; Salcedo-Sanz, Sancho; Gallardo-Antolín, Ascensión; Artés-Rodríguez, Antonio: "A Simulated Annealing Approach to Speaker Segmentation in Audio Databases." Engineering Applications of Artificial Intelligence, 21(4), pp. 499–508, 2008. URL: http://www.sciencedirect.com/science/article/pii/S0952197607000954

Abstract: In this paper we present a novel approach to the problem of speaker segmentation, an unavoidable preliminary step in audio indexing. Mutual information is used to evaluate the accuracy of the segmentation, as the function to be maximized by a simulated annealing (SA) algorithm. We introduce a novel mutation operator for the SA, the Consecutive Bits Mutation operator, which improves the performance of the SA on this problem. We also use the so-called Compaction Factor, which allows the SA to operate in a reduced search space. Our algorithm has been tested on the segmentation of real audio databases and compared to several existing algorithms for speaker segmentation, obtaining very good results on the test problems considered.
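To make the search loop concrete, here is a generic simulated-annealing skeleton over boundary bit strings (1 = speaker-change boundary). Everything here is an assumption for illustration: `score` stands in for the paper's MI-based segmentation criterion, `consecutive_bits_mutation` only guesses at the spirit of the paper's operator from its name, and the Compaction Factor is not modelled.

```python
import math
import random

def consecutive_bits_mutation(state, max_run=8):
    """Flip a random run of consecutive bits -- a guess at the spirit of the
    Consecutive Bits Mutation operator; the paper's exact definition may differ."""
    i = random.randrange(len(state))
    run = random.randint(1, max_run)
    out = state[:]
    for j in range(i, min(i + run, len(state))):
        out[j] ^= 1
    return out

def simulated_annealing(score, init, t0=1.0, alpha=0.995, steps=10000):
    """Generic SA maximising `score` over bit strings with geometric cooling."""
    state, state_s = init[:], score(init)
    best, best_s = state[:], state_s
    t = t0
    for _ in range(steps):
        cand = consecutive_bits_mutation(state)
        cand_s = score(cand)
        # Accept improvements always; accept worsenings with Boltzmann probability.
        if cand_s >= state_s or random.random() < math.exp((cand_s - state_s) / t):
            state, state_s = cand, cand_s
            if state_s > best_s:
                best, best_s = state[:], state_s
        t *= alpha
    return best
```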

## 2007

Leiva-Murillo, Jose M.; Artés-Rodríguez, Antonio: "Maximization of Mutual Information for Supervised Linear Feature Extraction." IEEE Transactions on Neural Networks, 18(5), pp. 1433–1441, 2007. ISSN: 1045-9227. URL: http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=4298118

Abstract: In this paper, we present a novel scheme for linear feature extraction in classification. The method is based on the maximization of the mutual information (MI) between the extracted features and the classes. The sum of the MI corresponding to each of the features is taken as a heuristic that approximates the MI of the whole output vector. Then, a component-by-component gradient-ascent method is proposed for the maximization of the MI, similar to the gradient-based entropy optimization used in independent component analysis (ICA). The simulation results show that the method is not only competitive with existing supervised feature extraction methods in all cases studied, but also remarkably outperforms them when the data are characterized by strongly nonlinear boundaries between classes.
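A sketch of the component-by-component scheme the abstract describes: maximise a scalar MI estimate for one direction at a time, deflating learnt directions as ICA-style algorithms do. The finite-difference gradient and the `mi_fn` interface are assumptions made for a self-contained example; the paper derives an analytic gradient of its estimator instead.

```python
import numpy as np

def mi_gradient_ascent(X, y, mi_fn, n_components=2, lr=0.1, iters=100, eps=1e-4):
    """Component-by-component MI maximisation (sketch). `mi_fn(z, y)` is any
    estimator of I(Z; C) for a scalar feature z, e.g. the KDE plug-in
    sketched under the 2012 entry above."""
    rng = np.random.default_rng(0)
    d = X.shape[1]
    W = []
    for _ in range(n_components):
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        for _ in range(iters):
            # Finite-difference gradient of the scalar-MI objective;
            # the paper uses an analytic gradient instead.
            g = np.array([(mi_fn(X @ (w + eps * e), y) -
                           mi_fn(X @ (w - eps * e), y)) / (2 * eps)
                          for e in np.eye(d)])
            w += lr * g
            for u in W:                  # deflation keeps components distinct
                w -= (w @ u) * u
            w /= np.linalg.norm(w)
        W.append(w)
    return np.array(W)                   # rows are the extracted directions
```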