A lot of the concepts and techniques that we have seen so far in the book come together in this little project. Its aim is to read a text file, remove all the characters that are not used in words, and count the frequencies of the words in the remaining text. This can be useful, for example, for counting the word density on a web page, the frequency of DNA sequences, or the number of hits on a website that came from various IP addresses. All of this can be done in about 10 lines of code. For example, when words1.txt contains the sentence "to be, or not to be, that is the question!", this is the output of the program:
Word : frequency
be : 2
is : 1
not : 1
or : 1
question : 1
that : 1
the : 1
to : 2
Here is the code with comments:
# code in chapter 5\word_frequency.jl:
# 1- read in text file:
str = read("words1.txt", String) # readall() was removed in later Julia versions
# 2- replace non-alphabetic characters in the text with a space:
nonalpha = r"(\W\s?)" # define a regular expression
str = replace(str, nonalpha, ...
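To make the steps above concrete, here is a minimal sketch of the complete pipeline in modern Julia (1.x) syntax, where `replace` takes a `pattern => replacement` pair rather than the older two-argument form. It uses the sample sentence directly instead of reading words1.txt, so it is self-contained; the dictionary-tallying and sorted printing shown here follow the approach the paragraph describes, not necessarily the chapter's exact continuation:

```julia
# the sample sentence from words1.txt, inlined for a self-contained example:
str = "to be, or not to be, that is the question!"
# replace every non-word character (and an optional trailing space) with a space:
str = replace(str, r"(\W\s?)" => " ")
# split on whitespace and tally each word in a dictionary:
counts = Dict{String,Int}()
for word in split(str)
    counts[word] = get(counts, word, 0) + 1
end
# print the frequency table, sorted alphabetically by word:
println("Word : frequency")
for word in sort(collect(keys(counts)))
    println("$word : $(counts[word])")
end
```

Running this prints the same table as shown earlier, with be and to counted twice and every other word once.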