Python Web Scraping
To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:
| Operating system | File system | Invalid filename characters | Maximum filename length |
|---|---|---|---|
| Linux | Ext3/Ext4 | `/` and `\0` | 255 bytes |
| OS X | HFS Plus | `:` and `\0` | 255 UTF-16 code units |
| Windows | NTFS | `\`, `/`, `?`, `:`, `*`, `"`, `>`, `<`, and `\|` | 255 characters |
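
The three length limits in the table are counted in different units, which matters once a filename contains non-ASCII characters. The following is a minimal sketch of that difference; the helper names `utf8_bytes` and `utf16_units` are hypothetical, and it assumes the Linux filename is stored as UTF-8:

```python
def utf8_bytes(name):
    """Length in bytes, assuming the filename is encoded as UTF-8."""
    return len(name.encode('utf-8'))


def utf16_units(name):
    """Length in UTF-16 code units (each unit is 2 bytes)."""
    return len(name.encode('utf-16-le')) // 2


# A filename containing one character outside the Basic Multilingual Plane
name = 'page_\U0001F600.html'
print(len(name), utf8_bytes(name), utf16_units(name))
# 11 14 12
```

The same name measures 11 characters, 14 bytes, and 12 UTF-16 code units. Restricting filenames to ASCII makes all three counts equal, so a single limit of 255 can be applied.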
To keep our file path safe across these filesystems, we restrict it to numbers, letters, and basic punctuation, and replace every other character with an underscore, as shown in the following code:
>>> import re
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
>>> re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
'http_//example.webscraping.com/default/view/Australia-1'

Additionally, the filename and the parent directories...
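
The substitution above can be wrapped in a small reusable function. The sketch below is one possible approach, not necessarily the one the book goes on to use: the name `safe_path` and the constant `MAX_SEGMENT_LENGTH` are hypothetical, and truncating each path segment to 255 characters is an assumption drawn from the limits in the table.

```python
import re

# Assumption: 255 is the smallest common limit listed in the table
MAX_SEGMENT_LENGTH = 255


def safe_path(url):
    """Map a URL to a path that uses only filesystem-safe characters."""
    # Replace anything outside the allowed set with an underscore
    cleaned = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
    # Truncate every directory and file name between slashes
    return '/'.join(segment[:MAX_SEGMENT_LENGTH]
                    for segment in cleaned.split('/'))


print(safe_path('http://example.webscraping.com/default/view/Australia-1'))
# http_//example.webscraping.com/default/view/Australia-1
```

Because the result contains only ASCII characters, 255 characters, 255 bytes, and 255 UTF-16 code units are the same length here. Note that truncation can make two different long URLs map to the same path, which this sketch does not handle.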