In the earlier recipe, we saw how to make HTTP requests, and you also learned how to parse a web response. It's time to move ahead and download content from the Web. The WWW is not just about HTML pages; it contains other resources, such as text files, documents, and images, among many other formats. In this recipe, you'll learn ways to download images in Python with an example.
To download images, we will need two Python modules, namely BeautifulSoup and urllib2. We could use the requests module instead of urllib2, but using urllib2 here will teach you an alternative way of making HTTP requests, so you can boast about it.
Before starting this recipe, we need to answer two questions: what kind of images would we like to download, and from which location on the Web do we download them? In this recipe, we download Avatar movie images from Google (https://google.com) image search. We download the top five images that match the search criteria. To do this, let's import the Python modules and define the variables we'll need:
from bs4 import BeautifulSoup
import re
import urllib2
import os

## Download parameters
image_type = "Project"
movie = "Avatar"
url = "https://www.google.com/search?q=" + movie + "&source=lnms&tbm=isch"
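As a side note, concatenating the movie name straight into the URL only works while the query contains no spaces or special characters. A minimal sketch of building the same search URL with proper escaping, using Python 3's urllib.parse (the recipe itself targets Python 2, so this is an alternative, not the author's code):

```python
from urllib.parse import urlencode

movie = "Avatar"
# Build the same Google Images search URL, letting urlencode handle
# escaping of the query parameters instead of raw string concatenation.
params = {"q": movie, "source": "lnms", "tbm": "isch"}
url = "https://www.google.com/search?" + urlencode(params)
print(url)
```

With a multi-word title such as "The Matrix", urlencode would emit `q=The+Matrix` automatically, which the concatenation approach would not.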
OK then, let's now create a BeautifulSoup object with the URL parameters and appropriate headers. Note the use of User-Agent while making HTTP calls with Python's urllib2 module; the requests module uses its own User-Agent while making HTTP calls:

header = {'User-Agent': 'Mozilla/5.0'}
soup = BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)))
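In Python 3, urllib2 was folded into urllib.request, but the header mechanics are the same. A small sketch showing that the custom User-Agent really is attached to the request object (no network call happens until urlopen() is invoked, so this runs offline):

```python
from urllib.request import Request

url = "https://www.google.com/search?q=Avatar&source=lnms&tbm=isch"
header = {"User-Agent": "Mozilla/5.0"}

# Request() only prepares the request; urlopen() would actually send it.
req = Request(url, headers=header)

# urllib stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))
```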
Google images are hosted as static content under the domain name http://www.gstatic.com/. So, using the BeautifulSoup object, we now try to find all the images whose source URL contains gstatic.com. The following code does exactly that:

images = [a['src'] for a in soup.find_all("img", {"src": re.compile("gstatic.com")})][:5]
for img in images:
    print "Image Source:", img
The output of the preceding code snippet can be seen in the following screenshot. Note how we get the image source URL on the Web for the top five images:
Now that we have the source URL of all the images, let's download them. The following Python code uses the urlopen() method to read() the image and downloads it onto the local file system:

for img in images:
    raw_img = urllib2.urlopen(img).read()
    cntr = len([i for i in os.listdir(".") if image_type in i]) + 1
    f = open(image_type + "_" + str(cntr) + ".jpg", 'wb')
    f.write(raw_img)
    f.close()
Once the images are downloaded, we can see them in our editor. The following snapshot shows the top five images we downloaded; Project_3.jpg looks as follows:
So, in this recipe, we looked at downloading content from the Web. First, we defined the parameters for the download. Parameters are like configurations that define the location where the downloadable resource is available and what kind of content is to be downloaded. In our example, we defined that we have to download Avatar movie images, specifically from Google.
Then we created the BeautifulSoup object, which makes the URL request using the urllib2 module. Actually, urllib2.Request() prepares the request with the configuration, such as headers and the URL itself, and urllib2.urlopen() actually makes the request. We wrapped the HTML response of the urlopen() method in a BeautifulSoup object so that we could parse the HTML response.
Next, we used the soup object to search for the top five images present in the HTML response. We searched for images based on the img tag with the find_all() method. Note that find_all() returns the matching img tags; the list comprehension then pulls out each tag's src attribute, giving us a list of image URLs where the pictures are available on Google.
Finally, we iterated through all the URLs and again used the urlopen() method on each URL to read() the image. read() returns the image in raw format, as binary data. We then wrote this raw image to a file on our local file system. We also added logic to name the images with an auto-incrementing counter so that they're uniquely identified in the local file system.
That's nice! Exactly what we wanted to achieve! Now let's up the ante a bit and see what else we can explore in the next recipe.