While Python’s Requests module can emulate many of the actions of a full-blown web browser, arguably its most common use is downloading web content into a Python application. Some of the most efficient uses of this functionality involve downloading XML or JSON data into an application, but another use involves more “old-fashioned” text scraping of human-readable web content. In this continuation of our tutorial series on Python network development, we will discuss how to work with the Requests module, HTTPS, and networking clients.
Python Requests Module
There are a lot of things that a web browser does that end-users take for granted, which must be factored into any Web-enabled Python application. The three big things are:
- Timeouts, or else the application will block forever.
- Redirects, which need to be limited or disabled so that the code cannot be bounced from page to page indefinitely.
- An up-to-date operating system and Python installation, as those are responsible for ensuring that current SSL ciphers are supported.
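The first two items can be handled directly in code. The sketch below is one possible approach rather than the only one: it builds a requests.Session with a lowered redirect cap (Requests stops following redirects once max_redirects is exceeded and raises TooManyRedirects) and passes an explicit (connect, read) timeout on every call. The helper names make_session and fetch, and the specific limits, are assumptions chosen for illustration.

```python
import requests

# Hypothetical limits chosen for illustration; tune them for the target site.
CONNECT_TIMEOUT = 3.0   # seconds allowed to establish the connection
READ_TIMEOUT = 10.0     # seconds allowed between bytes of the response
MAX_REDIRECTS = 5       # redirect hops tolerated before giving up


def make_session(max_redirects=MAX_REDIRECTS):
    """Return a Session that gives up after a small number of redirects."""
    session = requests.Session()
    session.max_redirects = max_redirects
    return session


def fetch(session, url):
    """Fetch url, returning the response or None if the request failed."""
    try:
        response = session.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
        response.raise_for_status()
        return response
    except requests.exceptions.TooManyRedirects:
        print("Gave up: too many redirects.")
    except requests.exceptions.Timeout:
        print("Gave up: the server took too long to respond.")
    except requests.exceptions.RequestException as err:
        print("Request failed [" + str(err) + "]")
    return None
```

Calling fetch(make_session(), "https://httpbin.org/get") would then either return a response or print the reason it gave up, without blocking indefinitely.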
The examples in this Python tutorial will make use of the Requests module, with an example that downloads conventional content (although this content could be in the form of structured data), as well as an example that downloads a file through an HTTPS connection.
Requests is a third-party package rather than part of the Python standard library, so it may not be present in a given Python installation. In this case, it can be installed with the command:
$ pip3 install requests
In Windows, this gives output similar to what is shown below:
Figure 1 – Installing the Requests module in Windows
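If it is unclear whether Requests is already present, the interpreter can be asked directly. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("requests"):
        print("Requests is available.")
    else:
        print("Requests is missing; install it with: pip3 install requests")
```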
Downloading Content with the Python Requests Module
The website, The Unix Time Now, displays the current Unix Timestamp. It is a handy reference for those instances (more common than most programmers would like to admit) where it is necessary to know the current Unix Timestamp but the programming environment is not terribly conducive to providing it, as is the case with .NET-based application development. This website can also serve as a gentle introduction to reading a value, in this case the time, from the source code of a site.
The image below shows the section of the source code of the above link, in which the Unix Timestamp is displayed. Note that, unlike the dynamically updated value shown when browsing to the site in a traditional web browser, this will be a static value that only gets updated when the page is loaded once again:
Figure 2 – The text to look for.
The snippet above may look like XML, but it is actually HTML5. While HTML5 “looks like” XML, it is not the same thing, and an XML parser will generally fail on HTML5 features such as void elements that are never closed.
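To illustrate the difference, the sketch below feeds the same markup to the standard library's XML parser and to its HTML parser. The SAMPLE string is a made-up fragment, not the actual source of the site, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment: valid HTML5, but <br> is never closed, so it is not XML.
SAMPLE = "<p>The time is <br> 1700000000</p>"


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def xml_parses(markup):
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_text(markup):
    """Return the concatenated text content found by the HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    print("XML parser accepts it:", xml_parses(SAMPLE))
    print("HTML parser text:", html_text(SAMPLE))
```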
The Python code example below will connect to this website and parse out the Unix Timestamp:
# demo-http-1.py
import requests
import sys

def main(argv):
    try:
        # Specify a half-second timeout and no redirects.
        webContent = requests.get("https://www.unixtimenow.com", timeout=0.5, allow_redirects=False)
        # Keep the page text, then release the connection.
        pageText = webContent.text
        webContent.close()
        # Uncomment below to print the source code of the page.
        #print(pageText)

        # Now do some good old-fashioned text-scraping to get the value.
        marker = "The Unix Time Now is "
        try:
            # Add the marker length because the value starts right after the marker.
            startIndex = pageText.index(marker) + len(marker)
            print("Found starting Text at [" + str(startIndex) + "]")
        except ValueError:
            # Without a starting point there is nothing to extract.
            print("The starting text was not found.")
            return

        stringToSearch = pageText[startIndex:]
        try:
            endIndex = stringToSearch.index(" ")
            print("Found ending Text at [" + str(endIndex) + "]")
        except ValueError:
            print("The ending text was not found.")
            return

        timeStr = stringToSearch[:endIndex]
        print("Time String is [" + timeStr + "]")
    except requests.exceptions.ConnectionError as err:
        print("Can't connect due to connection error [" + str(err) + "]")
    except requests.exceptions.Timeout as err:
        print("Can't connect because timeout was exceeded.")
    except requests.exceptions.RequestException as err:
        print("Can't connect due to other Request Error [" + str(err) + "]")

if __name__ == "__main__":
    main(sys.argv[1:])
The code above gives the following output:
Figure 3 – Extracting the Unix Timestamp
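The index arithmetic in the listing can also be expressed as a single regular expression. This is an alternative sketch, not the method used in the listing: it assumes the digits follow the marker text immediately, and the sample string is made up rather than copied from the site. The helper name extract_timestamp is hypothetical.

```python
import re

MARKER = "The Unix Time Now is "


def extract_timestamp(page_text, marker=MARKER):
    """Return the integer that directly follows marker, or None if absent."""
    match = re.search(re.escape(marker) + r"(\d+)", page_text)
    if match is None:
        return None
    return int(match.group(1))


if __name__ == "__main__":
    # Hypothetical page fragment used only to exercise the function.
    sample = "<h1>The Unix Time Now is 1700000000 seconds</h1>"
    print(extract_timestamp(sample))
```

Returning None instead of printing a message leaves it to the caller to decide how to handle a page whose layout has changed.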
Downloading Files with the Python Requests Module
The website, www.httpbin.org, provides a plethora of testing tools for web development. In this example, the Requests module will be used to download an image from this site, located at https://httpbin.org/image/jpeg. No filename is specified for the image; however, if one were specified, it would appear in the response headers, typically in the “Content-Disposition” header.
The Python code below will display the content headers and save the file locally:
# demo-http-2.py
import requests
import sys

def main(argv):
    try:
        # Specify a half-second timeout and no redirects.
        webContent = requests.get("https://httpbin.org/image/jpeg", timeout=0.5, allow_redirects=False)
        # Treat HTTP error statuses as failures so an error page is not saved as an image.
        webContent.raise_for_status()

        # This code "knows" that the sample file being downloaded is a JPEG image. If the file
        # format is not known, then look at the headers to determine the file type.
        print(webContent.headers)

        # Even if you use Linux this should be written as a binary file.
        with open("image.jpg", "wb") as fp:
            fp.write(webContent.content)
        webContent.close()
    except requests.exceptions.ConnectionError as err:
        print("Can't connect due to connection error [" + str(err) + "]")
    except requests.exceptions.Timeout as err:
        print("Can't connect because timeout was exceeded.")
    except requests.exceptions.RequestException as err:
        print("Can't connect due to other Request Error [" + str(err) + "]")

if __name__ == "__main__":
    main(sys.argv[1:])
Running this code in your integrated development environment (IDE) gives the following output. Note the change in the directory listing:
Figure 4 – The file data downloaded and saved, with HTTP headers highlighted.
Unlike this example, most file or image downloads have a filename attached to the content. If that were the case here, the name would have appeared in the headers above, which are highlighted in red. Additionally, the “Content-Type” header can be used to infer a file extension.
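A filename could be chosen from the headers along the following lines. This is a minimal sketch using only the standard library: it assumes the headers are available as a dictionary-like object (as with webContent.headers), and the helper name choose_filename and the fallback stem "download" are hypothetical.

```python
import mimetypes
import os
from email.message import Message


def choose_filename(headers, default_stem="download"):
    """Pick a local filename from Content-Disposition, else from Content-Type."""
    disposition = headers.get("Content-Disposition")
    if disposition:
        # Let the email package parse the header's filename parameter.
        parsed = Message()
        parsed["Content-Disposition"] = disposition
        name = parsed.get_filename()
        if name:
            # Drop any directory part so the server cannot choose the location.
            return os.path.basename(name)

    # Fall back to an extension inferred from the media type, if it is known.
    content_type = headers.get("Content-Type", "").split(";")[0].strip()
    extension = mimetypes.guess_extension(content_type) or ""
    return default_stem + extension


if __name__ == "__main__":
    print(choose_filename({"Content-Disposition": 'attachment; filename="photo.jpg"'}))
    print(choose_filename({"Content-Type": "image/jpeg"}))
```

The exact extension returned for a given media type depends on the Python version and the platform's MIME tables, so it is worth checking the result before relying on it.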
The downloaded and saved image matches what was found on the website:
Figure 5 – The original image.
Figure 6 – The saved image.
Other HTTPS and Python Considerations
The examples included here barely scratch the surface of what the Requests module can do. The Requests 2.28.0 documentation, beginning with its Quickstart page, shows how this code can be extended into far more complex web-client applications.
Lastly, HTTPS is heavily dependent on both the operating system and the Python installation being kept up to date. HTTPS ciphers, along with the certificates used internally to verify website authenticity, are changing at a rapid clip. If the ciphers supported by the local computer’s operating system are no longer supported by a remote web server, then HTTPS communications will not be possible.
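One way to see what the local installation supports is to query Python's standard ssl module. The sketch below reports the TLS library version the interpreter is using and the cipher suites enabled in a default client context; the helper name describe_tls_support is hypothetical, and the output will differ from machine to machine.

```python
import ssl


def describe_tls_support():
    """Summarize the TLS capabilities of the running Python interpreter."""
    context = ssl.create_default_context()
    return {
        # Version string of the TLS library Python is using.
        "openssl_version": ssl.OPENSSL_VERSION,
        # Whether TLS 1.3 is available in this build.
        "tls_1_3_available": ssl.HAS_TLSv1_3,
        # Names of the cipher suites enabled for a default client context.
        "cipher_names": [cipher["name"] for cipher in context.get_ciphers()],
    }


if __name__ == "__main__":
    info = describe_tls_support()
    print("TLS library:", info["openssl_version"])
    print("TLS 1.3 available:", info["tls_1_3_available"])
    print("Enabled ciphers:", len(info["cipher_names"]))
```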
Python Socket Module and Network Programming
The Python socket module features an “easier” create_server() function that takes care of most of the typical assumptions one would make when running a server. Because the module also implements nearly all of the corresponding C/C++ Linux library functions, it is easy for a developer coming from that background to make the move into Python.
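As a rough illustration, the sketch below uses socket.create_server() (available in Python 3.8 and later) to start a one-shot server on the loopback interface and then talks to it from the same process. The uppercase-echo behavior, the helper names, and the timeouts are assumptions made for the demonstration.

```python
import socket
import threading


def serve_once(server):
    """Accept one client, reply with its message in uppercase, then return."""
    conn, _addr = server.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def demo():
    """Start a loopback server on a free port and exchange one message."""
    # Port 0 asks the operating system for any free port.
    with socket.create_server(("127.0.0.1", 0)) as server:
        server.settimeout(5)  # do not wait forever for a client
        port = server.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port), timeout=2) as client:
            client.sendall(b"hello")
            reply = client.recv(1024)
        worker.join()
    return reply


if __name__ == "__main__":
    print(demo())
```

create_server() performs the socket creation, binding, and listening steps that would otherwise be written out one at a time in C-style socket code.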
Python’s server functionality is robust enough that a full-fledged web server can be implemented right in the code, absent many of the configuration hassles and complications that come with “traditional” server daemons, such as Microsoft Internet Information Services (IIS) or Apache httpd. This functionality can be extended into robust web applications as well.
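A minimal sketch of such an in-code web server, using the standard library's http.server module, is shown below. It answers a single GET request on the loopback interface and fetches it from the same process; the handler name, the response text, and the timeouts are assumptions for illustration, and a real application would call serve_forever() instead of handling one request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a short plain-text body."""

    def do_GET(self):
        body = b"Hello from Python"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress the default per-request logging to stderr.
        pass


def fetch_once():
    """Serve one request on a free loopback port and return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    server.timeout = 5  # handle_request() gives up if no client arrives
    port = server.server_address[1]
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
        conn.request("GET", "/")
        response = conn.getresponse()
        result = (response.status, response.read())
        conn.close()
        return result
    finally:
        worker.join()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once())
```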