Entropy, Redundancy, and Communication

Glenn Lawyer

SpråkTeknologi Höst 2001

 

A comparison of different text types in terms of their information-theoretical redundancy measure.

Entropy and Redundancy

Information theory seeks to understand how much information is contained in a given source. One of the basic measures in information theory is entropy. Roughly speaking, the entropy of a source sets the limit on how compactly an optimal lossless algorithm can encode it: the lower the entropy, the more the source can be compressed. It is a measure of information content.

The entropy H of a given source X is defined as

H(X) = - ∑ p(x) log(p(x))

where the source X is modeled as a random variable, and p(x) is the probability of an element x ∈ X occurring.

The entropy of a given source is affected by the number of elements in X. Thus a normalized measure, redundancy, is better for comparing multiple sources. Redundancy compares the actual entropy of a source to its theoretical maximum entropy.

R(X) = 1 - H_actual(X) / H_max(X)

Here H_max(X) is the maximum possible entropy of the source, log |X|, which is reached when every element is equally likely. The higher the redundancy, the more a given source can be compressed. See the course text, p. 224, or the lecture notes for more information.
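As a concrete illustration, here is a small Python sketch of how these quantities can be computed from word-frequency counts. It is only a sketch: the whitespace tokenization, the base-2 logarithm, and the choice of words as the counting unit are assumptions on my part, so its numbers need not match the measurements reported below.

import math
from collections import Counter

def entropy_and_redundancy(text):
    """Estimate word-level entropy, maximum entropy, and redundancy.

    Probabilities are estimated by relative frequency counts; the
    maximum entropy is log2 of the vocabulary size, i.e. the entropy
    the text would have if every word were equally likely.
    """
    words = text.lower().split()              # naive whitespace tokenization
    counts = Counter(words)
    total = len(words)

    # H(X) = - sum over x of p(x) * log2 p(x)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())

    # H_max(X) = log2 |X|
    h_max = math.log2(len(counts))

    # R(X) = 1 - H(X) / H_max(X)
    return h, h_max, 1 - h / h_max

# Example usage (the file name is hypothetical):
# h, h_max, r = entropy_and_redundancy(open("genesis.txt").read())
# print("entropy: %.2f  max entropy: %.2f  redundancy: %.2f" % (h, h_max, r))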

 

Communication

I was curious about redundancy in human communication. I selected texts from several different genres and measured the redundancy of each. I was especially interested in whether texts considered redundant from a semantic viewpoint were also redundant from an information-theoretical perspective.

Redundancy has a real use: it helps prevent information loss when transmitting through a noisy channel. One obvious such channel is oral presentation; a text that is to be read aloud uses redundancy to ensure that the listeners understand what is said. A less obvious noisy channel is time: holy scriptures need to be kept intact across millennia, and redundancy provides a strong error-checking mechanism.

I examine four text collections. The first includes the first five books of the Old Testament, the Psalms, and Proverbs. These are texts designed to be presented aloud and also to be preserved across millennia. They contain a great deal of semantic redundancy, as one would expect.

The second collection comes from Shakespeare: As You Like It, All's Well That Ends Well, and Macbeth. These texts are also designed for oral delivery. Shakespeare, however, made no attempt to preserve them word for word from one generation to the next; he constantly rewrote his texts, often making significant changes.

My third collection is the first two books of Paradise Lost. This work is not intended to be read aloud to an audience, nor did the author need to include redundancy to ensure accurate reproduction. So one would expect less redundancy here than in the two collections above.

Finally, I include some modern texts: 5 news articles from the CNN website. These are meant to be read, not heard, so one would expect them to be the least redundant. On the other hand, the articles themselves are not very long, which can skew the statistics significantly.

Results

The results were a surprise. All of the documents had approximately the same redundancy! The measurements are summarized below; redundancy is given as a fraction, so 0.74 means 74%.

Text                          Entropy   Max entropy   Redundancy
genesis                       2.33      8.94          0.74
exodus                        2.26      8.64          0.74
numbers                       2.37      8.62          0.72
deut                          2.11      8.48          0.75
levit                         2.19      8.29          0.74

Shakespeare/allswell          1.94      8.55          0.77
Shakespeare/asYouLikeIt       1.90      8.42          0.77
Shakespeare/macbeth           2.07      8.32          0.75

Milton/paradise               1.77      7.86          0.78

news/ALS                      1.76      4.91          0.64
news/GLOB                     2.07      5.11          0.60
news/ATTA                     2.03      4.81          0.58
news/AIR                      2.44      4.64          0.47

Discussion

So we have a set of texts that I judge to have very different degrees of semantic redundancy, yet they have almost the same redundancy measure. Why?

First we should look at what our metric actually measures. If a few words have a high frequency in a text, then the text will have a high redundancy. If all words appear with equal frequency, then the redundancy will be 0.
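These two extreme cases can be checked with a small sketch; the toy word lists are, of course, invented purely for the illustration.

import math
from collections import Counter

def word_redundancy(words):
    counts = Counter(words)
    total = len(words)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 1 - h / math.log2(len(counts))

# Four words, all equally frequent: redundancy is 0.
print(word_redundancy(["a", "b", "c", "d"] * 25))          # 0.0

# The same vocabulary, but one word dominates: redundancy rises sharply.
print(word_redundancy(["a"] * 90 + ["b", "c", "d"] * 10))  # roughly 0.4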

In natural language, a few words account for a large share of all tokens. For example, this document contains the word "the" 51 times, out of 832 words in total (excluding the results). Semantically, this should not make the document more redundant; from an information-theoretical perspective, it does.

Secondly, different words with the same meaning are counted as separate words. The word "heard" appears only once in the Communication section, yet a significant part of that discussion involves hearing. The word "token" has yet to appear in this document, yet it is a central concept in the first section. We have not explicitly discussed a computer's ability to understand as a human does, even though that can be seen as the subject. How redundant is this paragraph? (37%)

Third, the probabilities are estimated by a frequency count. This gives an acceptable estimate for a large sample of a frequent event (i.e., counts of the word "the" in a large text). It is not at all accurate for infrequent events, nor is it especially good for a document that covers several subjects. The word "witch" is very common in some sections of Macbeth and non-existent in others; its occurrences are clustered, not spread evenly throughout the text.

It would be very interesting to see a study of redundancy that took these factors into account. I suggest the following (a rough sketch combining the three ideas follows the list):

1) Only take counts of verbs and nouns.

2) Group words with similar meanings.

3) Create a more accurate probability model.
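The sketch below shows one way these suggestions could be combined. It uses NLTK's part-of-speech tagger and WordNet, neither of which was used for the measurements above; mapping each word to its first WordNet synset and using add-one smoothing are simplifying assumptions, not the only (or best) way to realize the suggestions.

# Rough sketch of suggestions 1-3 using NLTK and WordNet (not part of the
# original measurements). Requires the NLTK data packages 'punkt',
# 'averaged_perceptron_tagger', and 'wordnet'.
import math
from collections import Counter

import nltk
from nltk.corpus import wordnet as wn

def refined_redundancy(text):
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)

    # 1) Only take counts of verbs and nouns.
    content_words = [w for w, tag in tagged if tag.startswith(("NN", "VB"))]

    # 2) Group words with similar meanings: map each word to the name of
    #    its first WordNet synset, so that inflected forms and (some)
    #    synonyms land in the same bucket. This is a crude approximation.
    def bucket(word):
        synsets = wn.synsets(word)
        return synsets[0].name() if synsets else word

    buckets = [bucket(w) for w in content_words]

    # 3) A (slightly) more accurate probability model: add-one smoothing
    #    over the observed vocabulary instead of raw relative frequencies.
    counts = Counter(buckets)
    vocab_size = len(counts)
    total = sum(counts.values()) + vocab_size
    probs = [(c + 1) / total for c in counts.values()]

    h = -sum(p * math.log2(p) for p in probs)
    h_max = math.log2(vocab_size)
    return 1 - h / h_max

Grouping by the first synset is crude, since it ignores word sense, but it shows how grouping shrinks the vocabulary and therefore changes both the entropy and the maximum entropy.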

 

EXTRA (note: the word statistics above do not include this section)

I wrote this using Word '97, which includes an auto-summary feature. I set it to compress the document to 25% of its original length. Here is the result, verbatim, including its formatting:

 

Summary

Entropy, Redundancy, and Communication

A comparison of different text types in terms of their information-theoretical redundancy measure.

Entropy and Redundancy

Information theory seeks to understand how much information is contained in a given source. Redundancy compares the actual entropy of a source to its theoretical maximum entropy.

Communication

I was curious about redundancy in human communication. I selected texts from several different genres and measured the redundancy in each genre. Redundancy does have a real use. Redundancy provides a strong error-checking mechanism.

I examine four text collections. Finally, I include some modern texts. Results

entropy: 2.33

max entropy: 8.94

redundance: 0.74 %

entropy: 2.26

max entropy: 8.64

redundance: 0.74 %

entropy: 2.37

max entropy: 8.62

redundance: 0.72 %

entropy: 2.11

max entropy: 8.48

redundance: 0.75 %

entropy: 2.19

max entropy: 8.29

redundance: 0.74 %

 

------- Shakespeare/allswell -------

entropy: 1.94

max entropy: 8.55

redundance: 0.77 %

entropy: 1.9

max entropy: 8.42

redundance: 0.77 %

------- Shakespeare/macbeth -------

entropy: 2.07

max entropy: 8.32

redundance: 0.75 %

 

entropy: 1.77

max entropy: 7.86

redundance: 0.78 %

------- news/ALS -------

entropy: 1.76

max entrop: 4.91

redundance: 0.64 %

------- news/GLOB -------

entropy: 2.07

max entrop: 5.11

redundance: 0.6 %

------- news/ATTA -------

entropy: 2.03

max entrop: 4.81

redundance: 0.58 %

entropy: 2.44

Discussion

If a few words have a high frequency in a text, then the text will have a high redundancy. If all words appear with equal frequency, then redundancy will be 0.

In language, many words are very frequent. For example, this document contains the word "the" 51 times. The document itself contains 832 words (excluding the results). Secondly, different words with the same meaning are counted as separate words. 2) group words with similar meanings.