Sumy State University
Binomial methods for binary message compression
Borysenko Oleksiy
doctor of technical sciences, professor
Ukraine
Summary
Two lossless numerical compression methods for binary messages using binomial coefficients are proposed. The methods are based on the decomposition of a Bernoulli information source into a combinatorial source A and a probabilistic source B; accordingly, the entropy of the source is also decomposed into two entropies. The entropy of source A solves the problem of enumerative encoding of the compressible messages, while the entropy of source B, in the first method, solves the problem of optimal coding. The disadvantage of this method is the need for statistical tests. In the second method, the optimal encoding is replaced by the calculation of the maximum value of the entropy of source B. This replacement avoids statistical tests, although it leads to a relatively small redundancy of the compressed messages, which decreases as their length increases. The operation of the methods is illustrated with specific examples, which makes it easier to understand both the idea of the methods and their practical implementation.
Keywords: entropy, message, binomial coefficient, source of information, probabilities.
Introduction
In this paper, we study the lossless compression of binary messages of increased length n. In practice, this task arises quite often: when compressing transmitted or stored binary messages, when protecting them from unauthorized access, and when plotting graphs and images. There are also other tasks where binary message compression is very useful, for example, control tasks. The transmission of binary messages is a special case of the transmission of symbolic messages consisting of letters, multidigit numbers, and other similar symbols, but these, too, are ultimately encoded as binary sequences. Therefore, compression of binary sequences can be regarded as a universal operation for messages in any form of representation.
In this paper, the compression problem is solved under the condition that the probability distribution is binomial. This means that the probability of occurrence of the symbols 1 and 0 in the studied binary message remains constant and does not depend on the values of the previous bits. In real conditions these requirements are usually violated, but to varying degrees, so the violations can often be neglected. In addition, the solution of the compression problem proposed in this paper can, with further development, be adapted to requirements related to the dependence of the generated symbols on their previous values.
The effect of message compression is determined by the difference between the maximum and the real entropy, which depends primarily on the entropy of the messages being compressed. When the messages are equiprobable, the entropy is maximal and compression is impossible; when they are not equiprobable, the entropy decreases and redundant information appears, which is removed by compression. The compression effect is greatest when the entropy of the information source becomes 0, which is possible when only one message is transmitted with probability 1. In all other cases, the lossless compression effect takes an intermediate value between the maximum entropy and zero. The case when the probabilities of all messages are equal to 0 is not considered, since it means that there is no communication system.
To solve the problem of binary message compression for small values of n, the efficient Shannon-Fano and Huffman optimal coding algorithms were proposed. They set the benchmark for compression efficiency and are widely used in practice to this day. On their basis, a number of modified compression algorithms have been developed that improve the efficiency of optimal coding in general. However, all these algorithms have a significant drawback: the need to conduct statistical tests to determine the probabilities of the messages. For this reason, many algorithms that do not require preliminary statistical procedures have been developed, such as dictionary compression. However, these algorithms lack the theoretical justification of the Shannon-Fano and Huffman compression methods, which prevents them from being fully evaluated and applied in practice.
However, it was difficult to apply the Shannon-Fano and Huffman methods to compressible binary messages of increased length n, since this increased, on the one hand, the number of statistical tests and, on the other hand, the number of optimal coding operations. Indeed, if the length of the messages is n, then their number is N = 2^n, so N grows exponentially with n. If, for example, n = 8, then N = 256, and if n = 10, then N = 1024. Accordingly, the collection of statistical data and their use for compressing binary messages becomes difficult as n grows.
Formulation of the problem. In this work, the task is to develop methods for compressing binary messages using the theory of optimal coding, while reducing or eliminating the set of statistical data for compressible messages.
The solution of the problem
At the heart of modern information theory is Shannon's formula for the entropy of a probabilistic message source

H(A*) = - Σ_{j=1}^{N} p_j log2 p_j , (1)

where N is the number of messages generated by the source A*, and p_j is the probability of the j-th message of length n. If binary messages of length n are generated, then N = 2^n.
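For readers who prefer to see formula (1) in executable form, the following minimal Python sketch computes the entropy of an arbitrary message distribution; the function name shannon_entropy is ours, not the paper's.

```python
import math

def shannon_entropy(probs):
    """H = -sum(p_j * log2 p_j) over the message probabilities, formula (1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Eight equiprobable messages of length n = 3 give the maximum entropy of 3 bits.
print(shannon_entropy([1 / 8] * 8))   # 3.0
```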
The idea of the proposed solution for compressing binary messages of length n is based on a theorem proved in [1]. It states that the entropy of the probabilistic source A* of binary messages decomposes as

H(A*) = H(A) + H(B), (2)

where

H(A) = Σ_{k=0}^{n} P_k log2 C_n^k (3)

is the entropy of the combinatorial source A, which generates equiprobable messages of length n within each group with the same number of units k;

H(B) = - Σ_{k=0}^{n} P_k log2 P_k (4)

is the entropy of the probabilistic source B, which generates the groups of messages with probabilities P_k;

C_n^k = n! / (k! (n - k)!) (5)

is the binomial coefficient;

P_k = p_k C_n^k (6)

is the probability that the source A* generates a message containing k units, k = 0, 1, ..., n;

p_k = P^k (1 - P)^(n - k) (7)

is the probability of an individual message j, j = 1, 2, ..., 2^n, containing k units, generated by the source A*;

P is the probability of occurrence of a unit in the message.
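As an illustration of the decomposition (2)-(7), the following Python sketch (our own addition, not part of the original method description) computes H(A) and H(B) from the bit probability P and checks that their sum equals the maximum entropy in the equiprobable case.

```python
import math
from math import comb

def decomposed_entropies(n, P):
    """H(A) and H(B) of formulas (3) and (4) for a Bernoulli source with bit probability P.

    P_k = C(n, k) * P**k * (1 - P)**(n - k)  -- group probabilities, formulas (6) and (7).
    """
    Pk = [comb(n, k) * P**k * (1 - P)**(n - k) for k in range(n + 1)]
    H_A = sum(p * math.log2(comb(n, k)) for k, p in enumerate(Pk) if p > 0)
    H_B = -sum(p * math.log2(p) for p in Pk if p > 0)
    return H_A, H_B

# For n = 3 and P = 0.5 the sum H(A) + H(B) equals the maximum entropy of 3 bits,
# in agreement with the decomposition (2).
H_A, H_B = decomposed_entropies(3, 0.5)
print(round(H_A, 3), round(H_B, 3), round(H_A + H_B, 3))   # 1.189 1.811 3.0
```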
The meaning of the entropy H(A) is that the binary messages generated by the source A can be replaced by binary numbers whose length is equal to log2 C_n^k rounded up, which is enough to specify the index of a message within the set of messages with the same number of units k. Each such message is then encoded with its own binary number; the number of such messages is obviously equal to C_n^k [2]. This means that the messages are compressed by numerical coding, which consists in mapping the compressed messages into a set of numbers, in this case binary ones.
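A possible realization of this numerical (enumerative) coding is sketched below in Python. The lexicographic ordering of the messages within a group is our assumption, introduced only for illustration; the paper requires only that each message receive its own index among the C_n^k messages with the same number of units.

```python
from math import comb

def rank(bits):
    """Index of a message among all messages of the same length with the same number
    of units, in lexicographic order (our choice of ordering, for illustration)."""
    n, k, r = len(bits), sum(bits), 0
    for i, b in enumerate(bits):
        if b == 1:
            r += comb(n - i - 1, k)   # messages sharing the prefix but having 0 here come first
            k -= 1
    return r

def unrank(n, k, r):
    """Inverse mapping: rebuild the length-n message with k units from its index r."""
    bits = []
    for i in range(n):
        c = comb(n - i - 1, k)        # number of completions that put a 0 in this position
        if r < c:
            bits.append(0)
        else:
            bits.append(1)
            r -= c
            k -= 1
    return bits

msg = [0, 1, 1]                        # n = 3, k = 2, so the index fits in ceil(log2 3) = 2 bits
r = rank(msg)
print(r, unrank(3, 2, r) == msg)       # 0 True
```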
The methods of binomial compression will be shown directly in the examples.
1. Binomial compression method with optimal coding
Example 1. If n = 3, then N = 8 and k = 0, 1, 2, 3. The corresponding binomial coefficients are C_3^0 = 1, C_3^1 = 3, C_3^2 = 3, C_3^3 = 1. This means that the original 8 compressible messages are divided into 4 sets: one message 000 with k = 0, three messages 001, 010, 100 with k = 1, three messages 011, 101, 110 with k = 2, and one message 111 with k = 3. Since log2 C_3^0 and log2 C_3^3 are equal to 0, the messages 000 and 111 are not encoded at all. The messages 001, 010, 100 and 011, 101, 110 are encoded with two-bit binary numbers. These can be the numbers 00, 01, 11, but other choices are possible, for example 00, 10, 11. Obviously, the number of possible variants of such numerical coding for each group of messages in this example is C_4^3 = 4.
Since formula (3) for H(A) contains, in addition to the binomial coefficients, the distribution of the probabilities Pk of generating the groups of messages with different values of k, this entropy also determines the efficiency of compression of the messages generated by source A. This makes it possible to abandon compression when it is inefficient.
This capability allows the corresponding compression algorithms to adapt to the distribution of the message probabilities Pk. As a result, by slightly reducing the compression ratio, it is possible to significantly increase the compression speed and thereby increase its overall efficiency.
Example 2. Let us assume that the probabilities Pk from Example 1 are equal. This means that each of the 4 groups of messages appears with probability 0.25. Then the entropy

H(A) = 0.25 log2 1 + 0.25 log2 3 + 0.25 log2 3 + 0.25 log2 1 ≈ 0.8 bits.
Example 3. Let us now set the probabilities of the 4 groups of messages to Pk = 0.5, 0.25, 0.125, 0.125. Then the entropy

H(A) = 0.5 log2 1 + 0.25 log2 3 + 0.125 log2 3 + 0.125 log2 1 ≈ 0.6 bits.
As can be seen, with the considered unequal probabilities Pk the entropy H(A) is 0.8 - 0.6 = 0.2 bits less than with the equal probabilities considered above.
Compared with the 3 bits occupied by the uncompressed messages, the entropy H(A) = 0.8 bits obtained at equal probabilities Pk corresponds to a compression of 2.2 bits.
But this is only an apparent compression, since it does not take into account the entropy H(B), which in these examples has a significant impact on the compression efficiency.
The entropy H(B) of source B uses the probability distribution Pk to determine the information missing from the message numbers, so that they can be converted back into the compressed sequence when it is restored.
This information must be transmitted to the receiver along with the message number during compression, which reduces the compression efficiency of source A.
But since the length of the corresponding codeword, added to the length of the number, is equal to n in accordance with expression (2), the compression efficiency of messages for a given probability distribution Pk reaches the possible maximum determined by entropy (1). To obtain it, it is necessary to perform optimal coding of source B, for example with the Shannon-Fano code.
Example 4. Assume, as in Example 2, that the probabilities Pk of the 4 groups of messages are equal, each being 0.25. Then

H(B) = -4 · 0.25 log2 0.25 = 2 bits.

Respectively,

H(A*) = H(A) + H(B) = 0.8 + 2 = 2.8 bits.

This value is less than the 3 bits it should theoretically equal, because the logarithms of the binomial coefficients were rounded in the calculation. The influence of such rounding is much smaller for large lengths of compressed messages.
Example 5. Let us calculate the entropy H(B) for the probability distribution Pk = 0.5, 0.25, 0.125, 0.125:

H(B) = -(0.5 log2 0.5 + 0.25 log2 0.25 + 0.125 log2 0.125 + 0.125 log2 0.125) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 bits.

Then

H(A*) = H(A) + H(B) = 0.6 + 1.75 = 2.35 bits.

That is, the information is compressed by 3 - 2.35 = 0.65 bits.
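The figures of Examples 4 and 5 can be checked with a short computation. The sketch below (the helper name group_entropies is ours) takes the group probabilities Pk directly, as the examples do, and prints H(A), H(B) and their sum.

```python
import math
from math import comb

def group_entropies(n, Pk):
    """H(A) and H(B) for an explicitly given group distribution P_k, k = 0, ..., n."""
    H_A = sum(p * math.log2(comb(n, k)) for k, p in enumerate(Pk) if p > 0)
    H_B = -sum(p * math.log2(p) for p in Pk if p > 0)
    return H_A, H_B

for Pk in ([0.25, 0.25, 0.25, 0.25], [0.5, 0.25, 0.125, 0.125]):
    H_A, H_B = group_entropies(3, Pk)
    print(round(H_A, 2), round(H_B, 2), round(H_A + H_B, 2))
# 0.79 2.0 2.79    -- Example 4 (0.8 and 2.8 in the text are rounded values)
# 0.59 1.75 2.34   -- Example 5 (with H(A) rounded to 0.6 the text obtains 2.35)
```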
Example 6. It is required to construct an optimal code for the probabilities Pk = 0.5, 0.25, 0.125, 0.125 and to estimate its average length. In accordance with the well-known rules for constructing optimal codes, it has the form 0, 10, 110, 111. Its average length is 0.5 · 1 + 0.25 · 2 + 0.125 · 3 + 0.125 · 3 = 1.75 bits.
As can be seen, this coincides with the entropy H(B) calculated in Example 5.
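The "well-known rules" used in Example 6 are commonly realized by the Huffman procedure. A compact Python sketch of that standard construction is given below; it is not specific to this paper, and the resulting codewords may differ from 0, 10, 110, 111 by a relabeling of 0 and 1 at some node, while the average length of 1.75 bits stays the same.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Standard Huffman construction for a list of probabilities (illustrative sketch)."""
    tiebreak = count()                       # avoids comparing symbol groups when probabilities tie
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = [""] * len(probs)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)  # merge the two least probable subtrees
        p1, _, group1 = heapq.heappop(heap)
        for i in group0:
            codes[i] = "0" + codes[i]
        for i in group1:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (p0 + p1, next(tiebreak), group0 + group1))
    return codes

probs = [0.5, 0.25, 0.125, 0.125]
code = huffman_code(probs)
print(code, sum(p * len(c) for p, c in zip(probs, code)))
# ['0', '10', '110', '111'] 1.75  (any swap of 0 and 1 at a node is equally optimal)
```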
Example 7. Let us build the optimal code for the source A* as a whole. To do this, the codewords of the optimal code for source B are concatenated with the corresponding codewords (numbers) of source A.
As a result, the code obtained after compression for the 4 values of k has the form
0; 1000, 1001, 1010; 11000, 11001, 11010; 111.
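The code table of Example 7 can be reproduced mechanically by concatenating the group codewords of Example 6 with fixed-length message indices, as the following sketch (again ours, purely for illustration) shows.

```python
from math import comb

# Group codewords taken from Example 6; the index inside a group is written
# on ceil(log2 C(n, k)) bits, and on zero bits when the group holds a single message.
group_code = {0: "0", 1: "10", 2: "110", 3: "111"}
n = 3
for k in range(n + 1):
    rank_bits = max(comb(n, k) - 1, 0).bit_length()   # ceil(log2 C(n, k))
    for r in range(comb(n, k)):
        suffix = format(r, "0{}b".format(rank_bits)) if rank_bits else ""
        print(k, r, group_code[k] + suffix)
# Prints the codewords 0; 1000, 1001, 1010; 11000, 11001, 11010; 111.
```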
The disadvantage of such coding is the need to know the probabilities Pk , which can be obtained through statistical tests. This shortcoming can be eliminated by uniform encoding of source B messages.
2. Binomial compression method with redundant information
This method uses the same operations as the previous one, except for the optimal encoding of source B. Instead, the source A codewords are supplemented with uniform source B codewords of length log2(n + 1) rounded up, one codeword for each of the n + 1 possible values of k. In this case additional redundancy appears, but it plays a noticeable role only for small lengths of compressible messages; as the message length grows, it decreases significantly and, accordingly, the compression efficiency increases. The absence of optimal coding makes it possible to abandon statistical tests, which greatly simplifies the compression method and makes it more accurate and efficient. In addition, the scope of this method is wider than that of the methods based on optimal coding.
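To illustrate how the overhead of the uniform prefix behaves as n grows, the following Python sketch estimates the expected compressed length of the second method for a Bernoulli source and compares it with the source entropy. The function name and the choice P = 0.3 are ours, chosen only for illustration.

```python
import math
from math import comb

def method2_avg_length(n, P):
    """Expected length of the second method: a uniform ceil(log2(n + 1))-bit prefix for k
    plus ceil(log2 C(n, k)) bits for the message index inside its group."""
    k_prefix = math.ceil(math.log2(n + 1))
    avg = 0.0
    for k in range(n + 1):
        Pk = comb(n, k) * P**k * (1 - P)**(n - k)                  # formula (6)
        avg += Pk * (k_prefix + max(comb(n, k) - 1, 0).bit_length())
    return avg

# Compare the expected compressed length with the Bernoulli source entropy n * H(P):
# for small n the prefix overhead can even outweigh the saving, for large n it becomes negligible.
h_bit = -(0.3 * math.log2(0.3) + 0.7 * math.log2(0.7))
for n in (8, 64, 512):
    print(n, round(method2_avg_length(n, 0.3), 1), round(n * h_bit, 1))
```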
Conclusions
The paper proposes two methods for compressing binary messages based on binomial numbers. The first of them uses optimal coding, while the second eliminates it at the cost of additional redundancy in the compressed messages; in return, it does not require statistical tests, which expands its scope of application.
References
[1] Borisenko O. A. (1995). On the decomposition of Bernoulli sources of information. Bulletin of SSU, (3), 57-59.
[2] Borisenko O. A. (2019). Discrete Math. Sumy: University Book.