Mastering Python High Performance

The initial code base


Let's now list all of the code that we'll optimize later, based on the earlier description.

The first piece is quite simple: a single-file script that takes care of scraping the questions and saving them in JSON format, as we discussed earlier. The flow is as follows:

  1. It will query the list of questions page by page.

  2. For each page, it will gather the questions' links.

  3. Then, for each link, it will gather the information listed in the earlier description.

  4. It will move on to the next page and start over again.

  5. It will finally save all of the data into a JSON file.
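Before the full listing, the five steps above can be sketched as a bare control-flow skeleton. This is only a minimal sketch, not the book's script: the names `scrape`, `save_results`, `fetch_question_links`, and `fetch_question` are hypothetical, the network calls are replaced by injected callables so the loop can run offline, and stopping early on an empty page is an assumption rather than something the description states.

```python
import json


def scrape(fetch_question_links, fetch_question, first_page=1, max_pages=20):
    """Collect one record per question, walking the list page by page.

    Both callables are hypothetical stand-ins for the real HTTP and
    HTML-parsing code.
    """
    results = []
    for page in range(first_page, first_page + max_pages):
        # Steps 1 and 2: query one page of the list and gather its links.
        links = fetch_question_links(page)
        if not links:
            # Assumption: an empty page means there is nothing left to scrape.
            break
        # Step 3: gather the information for each question link.
        for link in links:
            results.append(fetch_question(link))
        # Step 4: the loop moves on to the next page.
    return results


def save_results(results, path):
    """Step 5: save all of the collected data into a JSON file."""
    with open(path, "w") as handle:
        json.dump(results, handle)


# Offline demonstration with made-up data in place of real pages.
fake_pages = {1: ["/questions/1", "/questions/2"], 2: ["/questions/3"]}
demo = scrape(
    lambda page: fake_pages.get(page, []),
    lambda link: {"url": link},
)
```

Passing the fetching functions in as arguments keeps the paging logic separate from the HTTP and parsing work, which is where most of the running time of a scraper like this is likely to go.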

The code is as follows:

from bs4 import BeautifulSoup
import requests
import json


SO_URL = "http://scifi.stackexchange.com"
QUESTION_LIST_URL = SO_URL + "/questions"
MAX_PAGE_COUNT = 20

global_results = []
initial_page = 1  # the first page is page 1

def get_author_name(body):
  link_name = body.select(".user-details a")
  if len(link_name) == 0:
    text_name = body.select(".user-details")
...