A corpus is a technical term for a collection of texts used to analyze a language and verify its linguistic properties. The first modern, computer-readable corpus was the Brown Corpus of Standard American English, compiled by Henry Kucera and W. Nelson Francis of Brown University. The Brown Corpus draws from American English texts printed in 1961 and was for many years a widely cited resource in computational linguistics.
The five most frequently occurring words in the Brown Corpus are the, of, and, to, and a. Consider a data set consisting of all occurrences of these words in the Corpus. The values of the variable named Word are and, to, of, the, and a, so Word is a nominal variable with five categories.
Frequency and relative frequency distributions are constructed to summarize the data. They are shown in the table that follows.
Table 1

Word | Frequency (thousands of occurrences) | Relative Frequency
---|---|---
and | 28.9 | 0.1566
to | 26.1 | 0.1415
of | 36.4 | 0.1973
the | 70.0 | 0.3794
a | 23.1 | 0.1252
Total | 184.5 | 1.0000
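As a quick check on how a relative frequency distribution is built, the sketch below divides each frequency in the table above by the total. The variable names (`frequencies`, `relative`) are illustrative choices, not part of the original problem.

```python
# Frequencies in thousands of occurrences, copied from the table above.
frequencies = {"and": 28.9, "to": 26.1, "of": 36.4, "the": 70.0, "a": 23.1}

# The total is the sum over all five categories.
total = sum(frequencies.values())

# Relative frequency = category frequency / total, rounded to 4 decimals
# to match the precision used in the table.
relative = {word: round(count / total, 4) for word, count in frequencies.items()}

for word, count in frequencies.items():
    print(f"{word:>5}  {count:6.1f}  {relative[word]:.4f}")
print(f"Total  {total:6.1f}")
```

The same calculation applies to any nominal variable: replace the dictionary with the category counts of interest.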
A census is an enumeration of a population. The U.S. Census Bureau conducts a census every 10 years, but in addition, the Population Estimates Program of the bureau publishes population estimates for incorporated places every year. According to 2007 estimates, the five largest U.S. cities (by population) are New York City, Los Angeles, Chicago, Houston, and Phoenix.
Consider a data set consisting of all the residents of these five cities. The values of the variable named City are Phoenix, Chicago, Houston, Los Angeles, and New York City, so City is a nominal variable with five categories. Frequency and relative frequency distributions are provided in the table below.
Table 1

City | Frequency (millions of people) | Relative Frequency
---|---|---
Phoenix | 1.55 | 0.0829
Chicago | 2.84 | 0.1519
Houston | 2.21 | 0.1182
Los Angeles | 3.83 | 0.2048
New York City | 8.27 | 0.4422
Total | 18.70 | 1.0000
In 1935, Harvard linguist George Zipf pointed out that the frequency of the kth most frequent word in a language is roughly proportional to 1/k. This implies that the second most frequent word in a language has a frequency one-half that of the most frequent word, the third most frequent word has a frequency one-third that of the most frequent word, and so on. A distribution that follows this rule is said to obey Zipf’s Law.
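The rule can be written compactly. In the sketch below, f(k) denotes the frequency of the kth most frequent word and C is a constant of proportionality; both symbols are introduced here for illustration and do not appear in the original problem.

```latex
f(k) \approx \frac{C}{k}
\quad\Longrightarrow\quad
\frac{f(k)}{f(1)} \approx \frac{1}{k},
\qquad\text{so}\qquad
\frac{f(2)}{f(1)} \approx 50\%,
\quad
\frac{f(4)}{f(1)} \approx 25\%.
```

Under an exact Zipf distribution, then, the second-ranked item would sit at 50% of the top item and the fourth-ranked item at 25%.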
Zipf’s Law has been observed not only in word distributions, but in other phenomena as well, such as the populations of cities. Fill in the blanks below with percentages.
The frequency of the second most frequent word in the Brown Corpus is ________ that of the most frequent word. The population of the second largest city in the United States is ________ that of the largest city.
The frequency of the fourth most frequent word in the Brown Corpus is ________ that of the most frequent word. The population of the fourth largest city in the United States is ________ that of the largest city.
Answers are rounded to the nearest integer.
The frequency of the second most frequent word in the Brown Corpus is 52% that of the most frequent word.
Explanation
Frequency of most frequent word = 70.0
Frequency of 2nd most frequent word = 36.4
Percentage = (36.4/70)*100 = 52%
The population of the second largest city in the United States is 46% that of the largest city.
Explanation
Population of largest city = 8.27
Population of the second largest city = 3.83
Percentage = (3.83/8.27)*100 = 46.3%
The frequency of the fourth most frequent word in the Brown Corpus is 37% that of the most frequent word.
Explanation
Frequency of most frequent word = 70.0
Frequency of 4th most frequent word = 26.1
Percentage = (26.1/70)*100 = 37.3%
The population of the fourth largest city in the United States is 27% that of the largest city.
Explanation
Population of largest city = 8.27
Population of the fourth largest city = 2.21
Percentage = (2.21/8.27)*100 = 26.7%
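The four calculations above follow one pattern: sort the values, then divide the value at a given rank by the largest one. A minimal sketch of that pattern follows, with the Zipf prediction of 100/k percent shown alongside for comparison. The helper names `percent_of_top` and `zipf_percent` are hypothetical, chosen only for this illustration.

```python
def percent_of_top(values, rank):
    """Return the rank-th largest value as a percentage of the largest value."""
    ordered = sorted(values, reverse=True)
    return 100 * ordered[rank - 1] / ordered[0]


def zipf_percent(rank):
    """Percentage predicted by an exact 1/k rule for the given rank."""
    return 100 / rank


# Data copied from the two tables in the question.
word_freq = [28.9, 26.1, 36.4, 70.0, 23.1]   # thousands of occurrences
city_pop = [1.55, 2.84, 2.21, 3.83, 8.27]    # millions of people

for rank in (2, 4):
    words = percent_of_top(word_freq, rank)
    cities = percent_of_top(city_pop, rank)
    print(
        f"rank {rank}: words {words:.1f}%, cities {cities:.1f}%, "
        f"Zipf prediction {zipf_percent(rank):.1f}%"
    )
```

Rounded to the nearest integer, the observed percentages agree with the answers above, and they can be compared directly with the 50% and 25% that an exact 1/k rule would give.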