Luckily for us, the team behind StackOverflow provides most of the data behind the StackExchange universe, to which StackOverflow belongs, under a cc-wiki license. At the time of writing this book, the latest data dump can be found at https://archive.org/details/stackexchange. It contains data dumps of all Q&A sites of the StackExchange family. For StackOverflow, you will find multiple files, of which we only need the stackoverflow.com-Posts.7z file, which is 5.2 GB.
After downloading and extracting it, we have around 26 GB of data in XML format, containing all questions and answers as individual row tags within the root tag posts:
<?xml version="1.0" encoding="utf-8"?>
<posts>
...
<row Id="4572748" PostTypeId="2" ParentId="4568987" CreationDate="2011-01-01T00:01:03.387" Score="4" ViewCount="" Body="&lt;p&gt;IANAL, but &lt;a href=&quot;http://support.apple.com/kb/HT2931&quot; rel=&quot;nofollow&quot;&gt;this&lt;...
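With 26 GB of XML, loading the whole file into memory at once is not an option on most machines, so the rows have to be read one at a time. The following is a minimal sketch of how that could look, using iterparse from Python's standard xml.etree.ElementTree module. The function name iter_posts and the two sample rows are made up for illustration and are not part of the dump. We also assume that every post is a row element directly below posts, as in the excerpt above.

```python
import io
import xml.etree.ElementTree as ET


def iter_posts(source):
    """Yield the attributes of each <row> element as a dict, one post at a time.

    `source` is a filename or a binary file object. (Hypothetical helper,
    not part of the data dump or of any library.)
    """
    # iterparse streams the input instead of building the full tree up front.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the <posts> root
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # forget rows we have already handed out


# Two made-up rows in the same shape as the real dump, for demonstration only.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="3" Body="&lt;p&gt;A question&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="4" Body="&lt;p&gt;An answer&lt;/p&gt;" />
</posts>"""

posts = list(iter_posts(io.BytesIO(sample)))
for post in posts:
    print(post["Id"], post["PostTypeId"], post.get("ParentId"), post["Body"])
```

For the real dump, we would pass the path of the extracted XML file instead of the in-memory sample and process each post inside the loop rather than collecting them all in a list. Note that the parser already decodes the escaped HTML in the Body attribute, so we get the markup back as a normal string.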