Programming in C (not C++)
The high-level goal of this project is to write a program called "wordfreak" that takes "some files" as input, counts how many times each word occurs across all of them (treating all letters as lower case), and writes those words and their counts to an output file in alphabetical order.
We provide you some example book text files to test your program on. For example, if you ran
$ ./wordfreak aladdin.txt
Then the contents of the output file would be:
$ cat output.txt
a            :  49
aback        :   1
able         :   1
...
required     :   1
respectfully :   1
retraced     :   1
...
that         :  11
the          : 126
their        :   2
...
you          :  20
young        :   1
your         :   7
The words from all the input files will be counted together. If a word appears 3 times in one input file and 4 times in another, it will be counted 7 times between the two.
Input
wordfreak needs to be able to read input from 3 sources: standard input, files given in argv, and a file named by an environment variable. It should read words from all of these that are applicable (always standard input, sometimes the other 2).
A working implementation must be able to accept input entered directly into the terminal, with the end of such input signified by the EOF character (^D (control+d)):
$ ./wordfreak
I can write words here,
and end the file with control plus d
$ cat output.txt
and     : 1
can     : 1
control : 1
d       : 1
end     : 1
file    : 1
here    : 1
i       : 1
plus    : 1
the     : 1
with    : 1
words   : 1
write   : 1
It should also be able to accept a file piped into standard input with the shell's pipe operator:
$ cat aladdin.txt | ./wordfreak
Note that your program has no real way to tell which of these two situations is occurring; it just sees data arriving on standard input. However, by simply treating standard input like a file, you get both of these behaviours.
A working implementation must also accept files as command line arguments:
$ ./wordfreak aladdin.txt iliad.txt odyssey.txt
Finally, a working implementation must also accept an environment variable called WORD_FREAK, set on the command line to the name of a single file to be analyzed:
$ WORD_FREAK=aladdin.txt ./wordfreak
And of course, it should be able to do all of these at once:
$ cat newton.txt | WORD_FREAK=aladdin.txt ./wordfreak iliad.txt odyssey.txt
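One way to visit all three input sources is sketched below. It is only an outline under stated assumptions: the names chunk_fn, count_bytes, drain_fd, drain_path and drain_all_inputs are made up for illustration, and the callback simply totals bytes where a real program would feed them to the word splitter.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical callback type: receives each chunk of bytes as it is read. */
typedef void (*chunk_fn)(const char *buf, size_t len, void *ctx);

/* Example callback that only totals the bytes seen; a real one would
 * pass the bytes on to the word splitter. */
void count_bytes(const char *buf, size_t len, void *ctx)
{
    (void)buf;
    *(size_t *)ctx += len;
}

/* Read a descriptor to end-of-file. Returns 0 on success, -1 on error. */
int drain_fd(int fd, chunk_fn fn, void *ctx)
{
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        fn(buf, (size_t)n, ctx);
    return n < 0 ? -1 : 0;
}

/* Open a named file read-only, drain it, and close it again. */
int drain_path(const char *path, chunk_fn fn, void *ctx)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = drain_fd(fd, fn, ctx);
    close(fd);
    return rc;
}

/* Visit stdin, then argv[1..], then $WORD_FREAK if it is set.
 * Returns the number of sources that could not be read. */
int drain_all_inputs(int argc, char **argv, chunk_fn fn, void *ctx)
{
    int failures = 0;
    if (drain_fd(STDIN_FILENO, fn, ctx) < 0)
        failures++;
    for (int i = 1; i < argc; i++)
        if (drain_path(argv[i], fn, ctx) < 0)
            failures++;
    const char *env_path = getenv("WORD_FREAK");
    if (env_path != NULL && drain_path(env_path, fn, ctx) < 0)
        failures++;
    return failures;
}
```

Because read() hands back arbitrary chunks, a word can be cut in half at a chunk boundary, so a real callback would need to carry a partial word over from one call to the next.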
Words
Words should consist of only alphabetic characters, and all alphabetic characters should be converted to lower case.
For example "POT4TO???" would give the words "pot" and "to". And the word "isn’t" would be read as "isn" and "t". While this isn't necessarily intuitively correct, this is what your code is expected to do:
$ echo "Isn’t that a POT4TO???" | ./wordfreak
$ cat output.txt
a    : 1
isn  : 1
pot  : 1
t    : 1
that : 1
to   : 1
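The splitting rule above could be sketched roughly as follows. The function name next_word and its parameters are illustrative, not part of the assignment, and the sketch assumes the default "C" locale, where isalpha() accepts only the ASCII letters. Words longer than cap - 1 characters are truncated, which is a simplification.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy the next word at or after *pos into out, lower-cased and
 * NUL-terminated, and advance *pos past it. Returns the number of
 * characters stored, or 0 when no word is left. cap should be >= 2. */
size_t next_word(const char *text, size_t len, size_t *pos,
                 char *out, size_t cap)
{
    size_t i = *pos;
    size_t n = 0;

    /* Skip anything that is not a letter (digits, punctuation, spaces). */
    while (i < len && !isalpha((unsigned char)text[i]))
        i++;

    /* Collect letters until the first non-letter. */
    while (i < len && isalpha((unsigned char)text[i])) {
        if (n + 1 < cap)
            out[n++] = (char)tolower((unsigned char)text[i]);
        i++;
    }

    if (cap > 0)
        out[n] = '\0';
    *pos = i;
    return n;
}
```

Calling it repeatedly on "Isn't that a POT4TO???" yields isn, t, that, a, pot, to, and then 0 to signal the end.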
You are required to store the words in a specific data structure. You should have a binary search tree for each letter 'a' to 'z' that stores the words starting with that letter (and their counts). This can be thought of as a hash table whose buckets are binary search trees, where the hash function is just first_letter - 'a'. Note that these BSTs are unlikely to be balanced; that is fine.
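A minimal sketch of that structure, assuming words arrive already lower-cased: the names struct node, trees, bst_add and add_word are illustrative choices, not something the assignment prescribes.

```c
#include <stdlib.h>
#include <string.h>

/* One BST node: a word, how often it was seen, and two children. */
struct node {
    char *word;
    int count;
    struct node *left, *right;
};

/* One tree per first letter; globals start out as NULL pointers. */
struct node *trees[26];

/* Increment the word's count if present, otherwise insert it with
 * count 1. Returns 0 on success, -1 if memory runs out. */
int bst_add(struct node **root, const char *word)
{
    while (*root != NULL) {
        int cmp = strcmp(word, (*root)->word);
        if (cmp == 0) {
            (*root)->count++;
            return 0;
        }
        root = cmp < 0 ? &(*root)->left : &(*root)->right;
    }

    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    size_t len = strlen(word) + 1;
    n->word = malloc(len);
    if (n->word == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->word, word, len);
    n->count = 1;
    n->left = n->right = NULL;
    *root = n;
    return 0;
}

/* Pick the tree with first_letter - 'a' and add the word to it.
 * Returns -1 for words that do not start with 'a'..'z'. */
int add_word(const char *word)
{
    if (word[0] < 'a' || word[0] > 'z')
        return -1;
    return bst_add(&trees[word[0] - 'a'], word);
}
```

Walking the 26 trees in index order, each one in-order, then visits every word alphabetically.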
Output
The words should be written to the file alphabetically (the BSTs make this fairly trivial). Each word will give a line of the form "[word][additional space] : [additional space][number]\n". The caveat is that all the colons need to line up. The words are left-aligned and the longest will have a single space between its end and the colon (note "respectfully" in the example below); the numbers are right-aligned and the longest will have a single space between the colon and its beginning (note 126 in the example below).
$ ./wordfreak aladdin.txt
$ cat output.txt
a            :  49
...
respectfully :   1
...
the          : 126
...
your         :   7
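The alignment can be produced with sprintf()'s "*" width specifier once the longest word length and the widest count are known. The helper names digit_width and format_line below are hypothetical, and the sketch assumes the word is no longer than word_w and the count has no more than num_w digits.

```c
#include <stdio.h>

/* Number of decimal digits in a non-negative count. */
int digit_width(int n)
{
    int w = 1;
    while (n >= 10) {
        n /= 10;
        w++;
    }
    return w;
}

/* Format one line: the word left-aligned in word_w columns, " : ",
 * then the count right-aligned in num_w columns, then a newline.
 * buf must hold at least word_w + num_w + 5 bytes.
 * Returns the number of characters written, excluding the NUL. */
int format_line(char *buf, const char *word, int count,
                int word_w, int num_w)
{
    return sprintf(buf, "%-*s : %*d\n", word_w, word, num_w, count);
}
```

For aladdin.txt, where the spec says the longest word is "respectfully" (12 letters) and the largest count is 126 (3 digits), every line would be formatted with word_w = 12 and num_w = 3.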
The output file should be named output.txt. Note that when opening the file for writing, you will need to either create the file or remove all of its existing contents, so make use of open()'s O_CREAT and O_TRUNC flags. Moreover, you will want the file's permissions to be set so that it can be read. open()'s third argument determines the permissions of a newly created file; something like 0644 will make it readable.
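As a small sketch of that call, with open_output being a made-up helper name:

```c
#include <fcntl.h>
#include <unistd.h>

/* Open path for writing, creating it if it does not exist and
 * discarding any old contents if it does. The mode 0644 only applies
 * when the file is created (and is further masked by the umask).
 * Returns a file descriptor, or -1 on failure. */
int open_output(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}
```

A program following the spec would call open_output("output.txt") once, write every line to the returned descriptor, and close() it at the end.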
You are restricted to using only the following system calls for performing I/O: open(), close(), read(), write(), and lseek(). You are allowed to use other C library calls (e.g., malloc(), free()). However, all I/O is restricted to the Linux kernel's direct API support for I/O. You are also allowed to use sprintf() to make formatting easier.
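Under these restrictions, printing a tree comes down to formatting each line into a buffer with sprintf() and pushing it out with write(). The following is only a sketch: struct node, write_all and emit_inorder are illustrative names, and it assumes no word is longer than word_w characters.

```c
#include <stdio.h>
#include <unistd.h>

/* Assumed node layout for this sketch. */
struct node {
    const char *word;
    int count;
    struct node *left, *right;
};

/* write() may accept fewer bytes than requested, so loop until the
 * whole buffer is out. Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* In-order walk (left subtree, node, right subtree), which visits the
 * words of one BST in alphabetical order. */
int emit_inorder(int fd, const struct node *t, int word_w, int num_w)
{
    char line[256];

    if (t == NULL)
        return 0;
    /* Refuse widths that would not fit in the line buffer. */
    if (word_w + num_w + 5 > (int)sizeof line)
        return -1;
    if (emit_inorder(fd, t->left, word_w, num_w) < 0)
        return -1;
    int len = sprintf(line, "%-*s : %*d\n",
                      word_w, t->word, num_w, t->count);
    if (write_all(fd, line, (size_t)len) < 0)
        return -1;
    return emit_inorder(fd, t->right, word_w, num_w);
}
```

Since the trees may be unbalanced, the recursion can get deep for long sorted runs of words; an explicit stack would avoid that at the cost of more code.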
I wrote code that counts the words from one input at a time, but I couldn't get it working for multiple files because it kept producing errors. Sorry.
#include <stdio.h>
#include <ctype.h>

/* Scanner states: before any input, inside a word, between words. */
enum { INITIAL, WORD, SPACE };

int main(void)
{
    int c;
    int state = INITIAL;
    int wcount = 0;   /* total number of words seen on standard input */

    while ((c = getchar()) != EOF)
    {
        /* Letters and apostrophes are treated as word characters here. */
        int in_word = isalpha(c) || c == '\'';

        switch (state)
        {
        case INITIAL:
        case SPACE:
            if (in_word)
            {
                wcount++;        /* a new word starts */
                state = WORD;
            }
            else
                state = SPACE;
            break;
        case WORD:
            if (!in_word)
                state = SPACE;   /* the current word has ended */
            break;
        }
    }
    printf("%d words\n", wcount);
    return 0;
}