There are times when it really helps to know where someone who is browsing your site is located. You may have no particular need for this information, but suppose you are chatting with someone who sounds like, or could possibly be, a scammer, and you want to know where they are located as part of your personal “threat analysis.” Of course, the mere fact that someone might be browsing your site from behind a VPN, or from a different country than you expected, is not reason to conclude malicious intent. On the other hand, if someone you are chatting with claims to be from a certain part of, say, the United States, but a lookup of their IP address places them in a different part of the world, there might be reason to be suspicious.
You may have noticed that many photo sharing sites offer the ability to determine which country someone is browsing from. This programming tutorial demonstrates one way to obtain this information for yourself.
What is IP Address Geolocation?
IP Address Geolocation refers either to a physical location associated with an IP address, or to the act of obtaining that information. Even from the very beginnings of the Internet, IP addresses had some sort of geolocation data associated with them. In the broadest sense, you could look up the continent with which an IP address is associated via the IANA IPv4 Address Space Registry, although in that case you would need to query the whois server specified for the particular region of the world that manages the address.
Fast forward a few decades, and we now live in a world where most computers, mobile devices, and pretty much everything else have some sort of location-determining technology and an Internet connection built in. It was inevitable that near-precise determination of a particular IP address’s geolocation would become possible.
Scope and Limitations of IP Address Geolocation
IP Address Geolocation, as the name implies, refers to locations associated only with IP addresses. This may or may not correspond to the precise physical location of an individual computer, mobile device, or other Internet-connected device. IP Address Geolocation also does not return any meaningful information about non-routable or private IP addresses (e.g., 192.168.xxx.xxx or 10.xxx.xxx.xxx IPv4 addresses, or IPv6 addresses which start with fc or fd). One major reason for this imprecision is that many computers may share a single public IP address, as is the case with most mobile devices.
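Before spending an API call on an address, it is worth checking that it is publicly routable at all. Here is a minimal sketch using Python's standard ipaddress module; the function name is my own, not from the code shown later in this article:

```python
import ipaddress

def is_geolocatable(ip_string):
    """Return True only for publicly routable addresses worth looking up."""
    try:
        addr = ipaddress.ip_address(ip_string)
    except ValueError:
        return False  # not a valid IPv4 or IPv6 address at all
    # is_global is False for the RFC 1918 ranges (10.x, 172.16-31.x,
    # 192.168.x), loopback, link-local, and the IPv6 fc00::/7
    # unique-local block (addresses starting with fc or fd).
    return addr.is_global

print(is_geolocatable("192.168.1.10"))  # False
print(is_geolocatable("fd00::1"))       # False
print(is_geolocatable("8.8.8.8"))       # True
```

Filtering addresses this way avoids wasting rate-limited API calls on lookups that can never return meaningful results.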
IP Address Geolocation is also highly subjective. There is no singular authority that records this information “in stone,” although there are many services which record such information. There are many different and potentially conflicting sources of geolocation information for a particular IP address as well, such as:
- The location provided by the Internet Provider which owns the address in question.
- The location-service-determined location of one or more devices which use or share an IP address.
- A VPN being used by a user to mask his or her physical location.
So at best, IP Address Geolocation can give you a ballpark estimate of where a user may be located. With that being said, there are still a great many things this information can be used for, so let’s jump right in.
How to Find IP Addresses
Of course, we will need some source material to begin our work. Say we have set up a website that hosts the following image:
The image of this beautiful cat is in the Public Domain, and is attributed as follows: “Cat” by Salvatore Gerace is marked with Public Domain Mark 1.0. The original image can be downloaded from https://www.flickr.com/photos/45215772@N02/18223540618.
On this particular example server, this image will be saved in the web root as me-medium.jpg. Most web servers, including the one which hosts this particular site, use log files to track the IP addresses which browse the site. This particular site, which is running on Apache httpd within a Docker Container, has the following log entries, including one that was unexpected:
Figure 2 – Example Access Log Entries
This web server being implemented as a Docker Container has no bearing on it having log files. All properly configured web servers, whether they run within a Docker Container, in a fully virtualized environment, or on actual physical hardware, will have log files somewhere. For Apache httpd, the log file location is usually under the /var/log/apache2 or /var/log/httpd directory. The Apache httpd configuration files will specify the exact location. No matter where the log files are stored, some sort of console access, either via a direct login or an SSH session, will be needed to access the files. In most Apache httpd installations, root access is also required.
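To make the log format concrete, the snippet below pulls the two fields this article cares about (the client IP and the bracketed timestamp) out of a single made-up entry in Apache's common log format; the 203.0.113.x address comes from a reserved documentation range:

```python
import re

# A made-up example of one entry in Apache's common/combined log format:
sample = ('203.0.113.25 - - [12/Mar/2023:14:07:31 +0000] '
          '"GET /me-medium.jpg HTTP/1.1" 200 48213')

# The client IP is the first whitespace-delimited field, and the
# timestamp is the text between the square brackets.
ip_address = sample.split(' ')[0]
timestamp = re.search(r"\[(.*)\]", sample).group(1)

print(ip_address)  # 203.0.113.25
print(timestamp)   # 12/Mar/2023:14:07:31 +0000
```

These same two extractions are all a log analysis script needs before handing the address off to a geolocation service.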
In the case of this particular site, a Docker Container was used because it:
- Allows for free usage of root in a restricted environment, in a way that cannot harm the Docker host.
- Makes it easy to start up or take down the site without having to make configuration changes directly to the server itself.
- When run in interactive mode, it is much easier to edit configuration files and experiment with various settings than running as a server daemon directly.
There is, of course, one major downside. The cron daemon and Docker Containers really do not play well together, especially when attempting to run Apache httpd. While the cron daemon and Apache httpd daemons can be run from the command line in interactive mode, running them both together in the background is complex and problematic.
The Apache httpd instance within this particular Docker Container stores its access logs in the file /var/log/apache2/basic-https-access.log within the Container’s filesystem.
IP Address Geolocation Services
Geolocation cannot happen without a service that can provide such information. A simple Google search will turn up multiple IP Address Geolocation services. Two which are free for limited usage are AbstractAPI and IpGeolocation API. Both of these services require a user account and issue API keys for programmatic usage. In the listing in Figure 2, I decided to try these APIs on the IP address 184.108.40.206, as it happened to “randomly” hit my web server with a failed attempt at an exploit. As the APIs for both AbstractAPI and IpGeolocation API are web-based, I was able to use the following URLs to geolocate this IP address:
- AbstractAPI: https://ipgeolocation.abstractapi.com/v1/?api_key=your-api-key&ip_address=220.127.116.11
- Ip Geolocation API: https://api.ipgeolocation.io/ipgeo?apiKey=your-api-key&ip=18.104.22.168
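In script form, the first of those URLs can be assembled and fetched with the requests module. This is a sketch with illustrative function names (build_lookup_url, geolocate); you would substitute your real AbstractAPI key, and the actual network call is left commented out:

```python
import requests

def build_lookup_url(ip_address, api_key):
    """Assemble the AbstractAPI geolocation URL shown above."""
    return ("https://ipgeolocation.abstractapi.com/v1/"
            "?api_key=" + api_key + "&ip_address=" + ip_address)

def geolocate(ip_address, api_key):
    """Fetch and parse the JSON geolocation record for one IP address."""
    response = requests.get(build_lookup_url(ip_address, api_key), timeout=10)
    response.raise_for_status()  # surface HTTP errors (bad key, rate limit)
    return response.json()

# With a real key this returns a dictionary; "country" and "city" are
# among the possible fields, though their presence varies by address:
# info = geolocate("203.0.113.25", "your-api-key")
# print(info.get("country"), info.get("city"))
```

Passing the key and address as part of the URL mirrors the browser-based lookups above; the JSON the browser renders as a table is exactly what `response.json()` parses here.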
AbstractAPI gives the following information:
Ip Geolocation API has a somewhat different take on this IP address:
Both services deliver data via JSON, and the Firefox browser automatically formats this information into an easy-to-read tabular format. Other browsers may show all of this information on a single line.
As for the IP Address 22.214.171.124 in particular, we can see that it is associated with the nation of Belize. Unfortunately, no further information about this IP address is available. Contrast this to another entry on this list, 126.96.36.199:
There is definitely a lot more information here. Not only do we know that this IP address is associated with the United States, but we also know which city and state within the US we are dealing with, namely Trenton, New Jersey. We even get the ZIP Code, which further nails down this particular location.
Beyond the country information, there is no rhyme or reason to what else may be provided; the level of detail varies by service and by address.
Now with the basic manual process outlined, we can move on to automating it. The next section will explain how to use a Python script to parse the log file and get the information related to each IP address.
How to Collect IP Geolocation with Python
The Python code below performs a basic analysis of the log file /var/log/apache2/basic-https-access.log and makes use of the AbstractAPI tool to look up the geolocation information for each IP in the log file that has browsed the me-medium.jpg file:
```python
# parser.py
import json
import os
import re
import sys

import requests

# Suit to taste. Remember that using the root home directory is only
# acceptable when running as a Docker container.
pathToCache = "/root/ip-cache/"
pathToLogFile = "/var/log/apache2/basic-https-access.log"
pathToOutputFile = "/var/www/basic-https-webroot/findings.html"
matchingFilename = "me-medium.jpg"
myApiKey = "my-api-key-code"

def main(argv):
    records = ""
    try:
        # Open the Apache httpd log file for reading:
        with open(pathToLogFile) as input_file:
            for x, line in enumerate(input_file):
                # Strip newlines from the right (trailing newlines).
                currentLine = line.rstrip()
                ipInfo = ""
                dateTimeInfo = ""
                if matchingFilename in currentLine:
                    # The client IP is the first field of each log entry.
                    lineParts = currentLine.split(' ')
                    ipAddress = lineParts[0]
                    cacheFileName = pathToCache + ipAddress + ".json"
                    if not os.path.exists(cacheFileName):
                        response = requests.get(
                            "https://ipgeolocation.abstractapi.com/v1/?api_key=" +
                            myApiKey + "&ip_address=" + ipAddress)
                        fp = open(cacheFileName, "w")
                        rawContent = str(response.content.decode("utf-8"))
                        fp.write(rawContent)
                        fp.close()
                    fp = open(cacheFileName)
                    ipInfo = fp.read()
                    fp.close()
                    # Get the country and city from the JSON text.
                    ipData = json.loads(ipInfo)
                    # If a field is null or not specified, an exception will be
                    # raised. Also, the values returned by a JSON object may not
                    # always be strings. Forcibly cast them as such!
                    country = ""
                    try:
                        country = str(ipData["country"])
                    except Exception:
                        country = "Not Specified"
                    city = ""
                    try:
                        city = str(ipData["city"])
                    except Exception:
                        city = "Not Specified"
                    # Get the date/time of the visit. This will just crudely parse
                    # the date and time out of the log: the regular expression
                    # matches a group containing all the text between the brackets
                    # in a given line, and we want the first group match.
                    match = re.search(r"\[(.*)\]", currentLine)
                    dateTimeInfo = match.group(1)
                    # Put the record together as an HTML table row. Don't forget
                    # the use of parentheses should the code lines need to wrap.
                    records = (records + "<tr><td>" + str(dateTimeInfo) + "</td><td>" +
                               ipAddress + "</td><td>" + country + "</td><td>" +
                               city + "</td></tr>")
        fileOutput = ""
        if "" == records:
            fileOutput = "<p>No log records found. Wait till someone browses the site.</p>"
        else:
            fileOutput = ("<table><tr><th>Date/Time</th><th>IP Address</th>" +
                          "<th>Country</th><th>City</th></tr>" +
                          records + "</table>")
        finalOutputFP = open(pathToOutputFile, "w")
        finalOutputFP.write(fileOutput)
        finalOutputFP.close()
    except Exception as err:
        print("Generic exception [" + str(err) + "] occurred.")

if __name__ == "__main__":
    main(sys.argv[1:])
```
Note: this script will not run unless the requests module has been installed via pip3.
This file has three notable features:
- It focuses on just one file being downloaded.
- It caches the results of each API call.
- It saves its output to another file, findings.html, which can be browsed on the site.
Most API-delivered services, even paid ones, impose some sort of limit on the number of times they can be accessed, mainly because they do not want their own servers to be overburdened. As a typical hit to a web page can generate dozens, if not hundreds, of lines in an access log, it becomes an operational necessity to cache one call to the API for each IP address. As with any sort of caching, a scheduled task should be used to delete these files after a certain amount of time.
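As a sketch of that cleanup task, the hypothetical prune_cache function below deletes any cached .json lookup older than a chosen age. The cache directory mirrors the path used by parser.py, and the one-week threshold is an arbitrary choice:

```python
import os
import time

pathToCache = "/root/ip-cache/"      # same cache directory parser.py writes to
MAX_AGE_SECONDS = 7 * 24 * 60 * 60   # keep cached lookups for one week

def prune_cache(cache_dir, max_age_seconds):
    """Delete cached .json lookups older than max_age_seconds.

    Returns the list of file names that were removed.
    """
    now = time.time()
    removed = []
    for name in os.listdir(cache_dir):
        full_path = os.path.join(cache_dir, name)
        # Only touch the JSON cache files; age is judged by modification time.
        if name.endswith(".json") and now - os.path.getmtime(full_path) > max_age_seconds:
            os.remove(full_path)
            removed.append(name)
    return removed

# Typical usage, run from the same scheduled task that runs parser.py:
# prune_cache(pathToCache, MAX_AGE_SECONDS)
```

Running this from the same scheduled task that runs the parser keeps stale geolocation data from lingering indefinitely.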
Note that a single web page often requires the downloading of not just the HTML code, but also any images on the page, along with any script files and stylesheet files. Each of these items results in another line in the log file from a given IP address.
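Because of this, it pays to reduce the log to distinct client addresses before doing any lookups. A small sketch, using made-up log lines with documentation-range addresses:

```python
# Each page view can generate many log lines; collect the distinct client
# addresses first so each one triggers at most one API lookup.
log_lines = [
    '203.0.113.25 - - [12/Mar/2023:14:07:31 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '203.0.113.25 - - [12/Mar/2023:14:07:32 +0000] "GET /me-medium.jpg HTTP/1.1" 200 48213',
    '198.51.100.7 - - [12/Mar/2023:14:09:02 +0000] "GET /me-medium.jpg HTTP/1.1" 200 48213',
]

# A set comprehension over the first field of each line deduplicates the IPs.
unique_ips = {line.split(' ')[0] for line in log_lines}
print(sorted(unique_ips))  # ['198.51.100.7', '203.0.113.25']
```

The per-IP cache files in parser.py achieve the same effect implicitly, but deduplicating up front makes the API usage explicit and easy to count.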
This code is run via the command line:
$ python3 parser.py
Running this code produces the following initial output:
Figure 6 – Initial output of parser.py
Note: parser.py must be executed with sufficient privileges so that it can read the Apache httpd log files and also write to the webroot directory.
After allowing for a few hits from all over the world to access this image, and running this script once again, we see the following output:
Figure 7 – Updated output of parser.py with a few hits
It is critical to note that these results are not calculated in real time; the output is only updated on each successive run of parser.py. With that in mind, the best way to run this sort of analysis would be to schedule the task to run via crontab.
In addition to the results page in Figure 7, the following cache files were also created, and each contains the JSON output downloaded from the API:
Figure 8 – Additional output of parser.py
Armed with all of this new knowledge, how could we use it to figure out where a potential user is from? Simply giving a user a URL from this server with a photo could do the trick, assuming they browse to it. It is important to note that this site was temporarily hosted on a local broadband connection (notice the high numbered port?) so giving an unknown user something that points directly to your personal IP address is definitely not a good idea! But, if you have hosted server space that you can run this on, you will definitely be able to get more information about who you are talking to.
Final Thoughts on Python Geolocation
Geolocation has certainly come a long way from only being able to tell with which continent a particular IP address is associated. As you can see, there is quite a significant amount of data that can be harvested from these logs. While simple flat files work well from a proof-of-concept standpoint, you might consider extending this logic to use a database to manage this information instead. In addition to storing the processed results, a database can also store the cached geolocation lookup results.
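For instance, the cached lookups could live in SQLite, which ships with Python's standard library. The table layout and helper function below are purely illustrative, not part of the article's code:

```python
import json
import sqlite3

# An in-memory database keeps this sketch self-contained; a real
# deployment would pass a file path to sqlite3.connect instead.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE IF NOT EXISTS ip_lookups ("
    "  ip TEXT PRIMARY KEY,"
    "  raw_json TEXT,"
    "  country TEXT,"
    "  city TEXT"
    ")"
)

def cache_lookup(conn, ip, raw_json):
    """Store one API response, extracting country/city for easy querying."""
    data = json.loads(raw_json)
    conn.execute(
        "INSERT OR REPLACE INTO ip_lookups (ip, raw_json, country, city) "
        "VALUES (?, ?, ?, ?)",
        (ip, raw_json,
         str(data.get("country", "Not Specified")),
         str(data.get("city", "Not Specified"))),
    )
    conn.commit()

# Documentation-range address with a minimal fake API response:
cache_lookup(connection, "203.0.113.25", '{"country": "Belize"}')
for row in connection.execute("SELECT ip, country, city FROM ip_lookups"):
    print(row)  # ('203.0.113.25', 'Belize', 'Not Specified')
```

With the lookups in a table, the "which regions browse the most" questions below become simple GROUP BY queries instead of ad hoc file parsing.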
As many databases provide robust analysis tools, website administrators may be able to better gauge various metrics such as which states or regions browse their sites the most or least, or how often given IP addresses may “move around” from one location to another. No doubt that this information can be leveraged to customize or improve the delivery of service to end users, and much, much more.