The data for the first part of this chapter is a set of books from Project Gutenberg (www.gutenberg.org), a repository of public-domain literary works. The books I used for these experiments come from a variety of authors:
- Booth Tarkington (22 titles)
- Charles Dickens (44 titles)
- Edith Nesbit (10 titles)
- Arthur Conan Doyle (51 titles)
- Mark Twain (29 titles)
- Sir Richard Francis Burton (11 titles)
- Emile Gaboriau (10 titles)
Overall, there are 177 documents from 7 authors, giving a significant amount of text to work with. A full list of the titles, along with download links and a script to automatically fetch them, is given in the code bundle as getdata.py. If running the script yields significantly fewer books than listed above, the mirror may be down. See https://www.gutenberg.org/MIRRORS.ALL for more mirror URLs to try in the script.
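One way to tell whether a download came up short is to count the files found for each author and compare against the title counts listed above. The following is a minimal sketch and is not part of getdata.py. The function name find_shortfalls, the 80 percent tolerance, the data/books path, and the folder layout (one subfolder per author, one .txt file per book) are all assumptions made for illustration; adjust them to match wherever your copy of the script actually saves its files.

```python
from pathlib import Path

# Title counts per author, taken from the list above.
EXPECTED_TITLES = {
    "Booth Tarkington": 22,
    "Charles Dickens": 44,
    "Edith Nesbit": 10,
    "Arthur Conan Doyle": 51,
    "Mark Twain": 29,
    "Sir Richard Francis Burton": 11,
    "Emile Gaboriau": 10,
}


def find_shortfalls(data_folder, expected=EXPECTED_TITLES, tolerance=0.8):
    """Return {author: (found, expected)} for authors with too few files.

    Assumes (hypothetically) one subfolder per author, named after the
    author, containing one .txt file per book.
    """
    shortfalls = {}
    for author, n_expected in expected.items():
        author_folder = Path(data_folder) / author
        # A missing folder simply counts as zero downloaded books.
        if author_folder.is_dir():
            n_found = len(list(author_folder.glob("*.txt")))
        else:
            n_found = 0
        # Flag the author if fewer than `tolerance` of the titles arrived.
        if n_found < tolerance * n_expected:
            shortfalls[author] = (n_found, n_expected)
    return shortfalls


if __name__ == "__main__":
    # Sanity check on the list itself: the counts add up to 177.
    print(sum(EXPECTED_TITLES.values()))
    # "data/books" is a placeholder path, not the one used by getdata.py.
    for author, (found, wanted) in find_shortfalls("data/books").items():
        print(f"{author}: only {found} of {wanted} titles found")
```

If any author is reported here, that is a hint to try a different mirror in the script before moving on.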
To download these books, we use the requests library to save the files into our data directory.
First, in a new...