Learning Python Network Programming

By: Dr. M. O. Faruque Sarker, Samuel B Washington, Sam Washington

HTML and screen scraping


More and more services offer their data through APIs, but when a service doesn't, the only way to get the data programmatically is to download its web pages and parse the HTML source code. This technique is called screen scraping.
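
To make the idea concrete, here is a minimal sketch of the parsing half of screen scraping, using only the standard library's `html.parser` module. The sample markup, the `TitleAndLinkParser` class, and the `scrape` function are all hypothetical names invented for this illustration; in a real script the HTML would be the body of a downloaded page rather than a hard-coded string, and a more forgiving third-party parser may well be a better fit.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the page title and every link target from HTML source."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside <title>...</title>
        if self.in_title:
            self.title += data


# Hypothetical page source (an assumption for this sketch); in practice
# this would be the text of a page downloaded from the service.
SAMPLE_HTML = """
<html>
  <head><title>Example listing</title></head>
  <body>
    <a href="/items/1">First item</a>
    <a href="/items/2">Second item</a>
  </body>
</html>
"""


def scrape(html):
    """Return (title, links) extracted from an HTML string."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


title, links = scrape(SAMPLE_HTML)
print(title)
print(links)
```

Notice that the extraction depends entirely on the page's tag structure: if the site moved its links out of `a` elements or dropped the `title` tag, this code would quietly return nothing useful.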

Though it sounds simple enough in principle, screen scraping should be approached as a last resort. Unlike XML, where the syntax is strictly enforced and data structures are usually reasonably stable and sometimes even documented, the world of web page source code is a messy one. It is a fluid place, where the markup can change without notice, in ways that completely break your script and force you to rework the parsing logic from scratch.

Still, it is sometimes the only way to get essential data, so we're going to take a brief look at developing an approach toward scraping. We will discuss ways to reduce the impact when the HTML code does change.

You should always check a site's terms and conditions before...