
Text Scraping in Python

In this second part of our series on Python text processing, developers will continue learning how to scrape text, building upon the information in our previous article, Python: Extracting Text from Unfriendly File Formats. If you have yet to do so, we encourage you to take a moment to read part one, as it serves as a building block for this Python programming tutorial. In this part, we will examine the code we can use to begin extracting text from files.

Parsing Text from Files with Python

Most programming languages process text input files by using an iterative – or looping – line-by-line method. In the example here, all of the information related to a single record can be found within the 4 lines that generally comprise each record, with the change from one SSN to another delimiting an individual record. This means that we need a loop that will grab all of the information in the table in our previous article before the loop finds a new SSN. We also need to keep track of the previous record’s values so that we know when we have found a new record. The easiest way to start is to write a basic Python script that can recognize when a new SSN begins. The Python code example below shows how to parse out the SSN for each record:

Extractor.py

# Extractor.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""
            # Handle the file as an enumerable object, split by newlines 
            for x, line in enumerate(input_file):
                # Strip newlines from right (trailing newlines)
                currentLine = line.rstrip()
                
                # For this example, a single record is composed of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to complete the substring function.
                currentSSN = currentLine[12:23]
                #print("Current SSN is ["+currentSSN+"]")

                if (previousSSN != currentSSN):
                    # We are at a new record, and hopefully the completed
                    # record's information is stored in the "previous"
                    # versions of all these variables.  Note that on the
                    # first iteration of this loop, the previous versions
                    # of these variables will all be blank.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"]")

                    # Reset for the next record.
                    previousSSN = currentSSN
            # Note that at the end of the loop, the last record's information
            # will be in the previous versions of the variables.  We need to
            # manually run this logic to get them.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])

The sample data file is in the listing below:

42594       001-00-0837 Z000019    UZ3  5H2K 000000006518G    2022              
            001-00-0837      HPZ000000000000000000000000082725                  
2022     87 001-00-0837      NMR SMITH,ADAM           
            001-00-0837      VBPYT8923FZ00000000000000000                       
42594       020-01-0000 Z000019    UZ3  5H2K 000000011025Q    2022              
            020-01-0000      HPZ000000000000000000000000091442                  
2022     87 020-01-0000      NMR WILLIAMS,JAMES           
            020-01-0000      VBPYT8923FZ00000000000000000                       
            020-33-0000      HPZ000000000000000000000000000000                  
            020-33-0000      SW        A00000000000000000                       
42594       200-00-0111 Z000019    UZ3  5H2K 000000003717H    2022              
            200-00-0111      HPZ000000000000000000000000061551                  
2022     87 200-00-0111      NMR MARLEY,RICHARD           
            200-00-0111      VBPYT8923FZ00000000000000000                       
42594       817-22-0000 Z000019    UZ3  5H2K 000000004235G    2022              
            817-22-0000      HPZ000000000000000000000000033258                  
2022     33 817-22-0000      NMR DOUGH,JOHN           
            817-22-0000      VBPYT8923FZ00000000000000000                       
42594       300-00-0001 Z000019    UZ3  5H2K 000000003096H    2022              
            300-00-0001      HPZ000000000000000000000000066889                  
2022     87 300-00-0001      NMR WILLIST,DOUGLAS           
            300-00-0001      VBPYT8923FZ00000000000000000                       

The above is the Sample Data file, saved as Sample Data.txt

Note: Python uses 0-based indexing for strings, and slice notation takes an ending position within the string, not the length of the substring. Because of this, the start of the SSN is one less than the column number shown in a text editor.

For the ending position, the string length of 11 is added to the 0-based index of the string start. As 12 is the 0-based index of the string start, the slice ends at 23. That ending index is exclusive: the last character of the SSN sits at index 22.
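The arithmetic above can be checked in isolation. This is a minimal sketch: the constant names SSN_START and SSN_LENGTH are illustrative and do not appear in Extractor.py, and the line is built to match the column layout of the sample data.

```python
# A line laid out like the sample data: 12 leading characters, then the SSN.
line = " " * 12 + "001-00-0837" + " " * 6 + "NMR SMITH,ADAM"

SSN_START = 12   # 0-based index of the 13th character
SSN_LENGTH = 11  # length of "001-00-0837"

# The slice end is exclusive, so this returns the characters at 12 through 22.
ssn = line[SSN_START:SSN_START + SSN_LENGTH]
print(ssn)
```

Writing the end as start plus length keeps the two numbers that matter visible, rather than a precomputed 23.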

If you run this code in Windows, you will get the following output:

[Screenshot: console output of Extractor.py on Windows]

Note also that, in this example, the full path to the Python interpreter is specified. Depending on your setup, you may not need to be so explicit. However, on some systems, both Python 2 and Python 3 may be installed, and the “default” Python interpreter that runs when a path is not specified may not be the correct one. It is also assumed that the Sample Data.txt file is in the same directory as the Extractor.py file. Because this file name contains a space, it must be enclosed in quotation marks to be recognized as a single parameter. This applies to both Windows and Linux systems.

Before going any further, make sure all of the SSNs from the sample file are displayed in the output. A common mistake in these implementations is to ignore the manual processing of the last record.
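One way to make the last record harder to forget is to let the standard library do the grouping. This is an alternative sketch, not the method used in Extractor.py: itertools.groupby starts a new group whenever the key changes and always yields the final group, so no manual step is needed after the loop. The helper names ssn_of and records_by_ssn are hypothetical.

```python
from itertools import groupby

def ssn_of(line):
    """Return the SSN field of a fixed-width line (0-based slice 12:23)."""
    return line[12:23]

def records_by_ssn(lines):
    """Yield (ssn, lines) for each run of consecutive lines sharing an SSN."""
    for ssn, group in groupby(lines, key=ssn_of):
        yield ssn, list(group)

# Illustrative lines that only mimic the SSN position of the sample data.
sample_lines = [
    " " * 12 + "001-00-0837" + " first line of record 1",
    " " * 12 + "001-00-0837" + " second line of record 1",
    " " * 12 + "020-01-0000" + " first line of record 2",
    " " * 12 + "020-01-0000" + " second line of record 2",
]

for ssn, lines in records_by_ssn(sample_lines):
    print("Found record with SSN [" + ssn + "] (" + str(len(lines)) + " lines)")
```

Like the loop in Extractor.py, this only groups consecutive lines, so it relies on all lines for one SSN appearing together in the file.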


Extracting Text from Files with Python

Now that the SSN is properly parsed out, the remaining items can be extracted by adding suitable logic:

Full-Extractor.py

# Full-Extractor.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""

            currentName = ""

            currentMonthlyAmount = ""

            currentYearlyAmount = ""

            # This time, we need to know if we are processing the first record.  If we don't keep track of this, the
            # first record will process incorrectly and each subsequent record will be wrong.
            firstRecord = True
            
            # Handle the file as an enumerable object, split by newlines 
            for x, line in enumerate(input_file):
                # Strip newlines from right (trailing newlines)
                currentLine = line.rstrip()
                
                # For this example, a single record is composed of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to complete the substring function.
                currentSSN = currentLine[12:23]
                if (True == firstRecord):
                    previousSSN = currentSSN
                    firstRecord = False

                # For the first record, previousSSN would be blank and currentSSN would have a value, and this condition would be true.
                # We do not want this, so we need the logic above to set the values to be the same for the first record.
                if (previousSSN != currentSSN):
                    # We are at a new record, and hopefully the completed
                    # record's information is stored in the "previous"
                    # versions of all these variables.  Note that on the
                    # first iteration of this loop, the previous versions
                    # of these variables will all be blank.

                    # Also note the "Disconnect" between the previous and current notation.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"], name ["+currentName+"], monthly amount [" + currentMonthlyAmount+
                               "] yearly amount [" + currentYearlyAmount + "]")

                    # Reset for the next record.  This logic needs to come before the remaining data extractions, or you will have
                    # "off by one" errors.
                    previousSSN = currentSSN

                    # Blank out the "current" versions of the variables (except the SSN!) so the conditions above will be true again.
                    currentName = ""
                    currentMonthlyAmount = ""
                    currentYearlyAmount = ""

                # Get the name if we do not already have it.  This condition prevents us from overwriting the name.  Note that if the
                # data was structured in a way that there was more than one piece of information at this position in the file, you would
                # need additional logic to determine what it is you are parsing out.  In this example, the simplistic logic of checking if
                # the first character is present and that a comma is in the substring is the "test".

                if ("" == currentName) and (not currentLine[33:].startswith(' ')) and (',' in currentLine[33:]):
                    # Also note that the name can go to the end of the line, so
                    # no ending position is included here.
                    currentName = currentLine[33:]

                # Follow the same logic for extracting the other information.  In this case, make sure the string contains only
                # numeric values.  In the case of the monthly amount, we only want to process lines that end in "2022" as these are
                # the only lines which contain this information.
                if ("" == currentMonthlyAmount) and (currentLine.endswith("2022")):
                    currentMonthlyAmount = currentLine[51:57]

                if ("" == currentYearlyAmount) and currentLine[57:62].isdigit():
                    currentYearlyAmount = currentLine[57:62]

            # Note that at the end of the loop, the last record's information
            # will be in the previous versions of the variables.  We need to
            # manually run this logic to get them.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"], name [" + currentName +"], monthly amount [" + currentMonthlyAmount+
                       "] yearly amount [" + currentYearlyAmount + "]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])

                 

The placement of the logic that determines the record boundary, in this case going from one SSN to another, is critical, because the other data will be “off by one” if that logic remains at the bottom of the loop. It is also equally critical that, for the first record, the previous SSN “matches” the current one, specifically so that this logic will not be executed on the first iteration.

Running this code gives the following output:

[Screenshot: console output of Full-Extractor.py, with one record highlighted]

Note the highlighted record and how it has no name or amounts associated with it. This “error,” due to the missing data in the original text file, is rendering correctly. The extraction process should not add or take away from the data. Instead, it should represent the data exactly as-is from the original source, or at least indicate that there is some sort of error with the data. This way, a user can look back at the original data source in the ERP to figure out why this information is missing.


Exporting Data into a CSV File with Python

Now that we can extract the data programmatically, it is time to write it out to a friendly format. In this case, it will be a simple CSV file. First, we need a CSV file to write to, and, in this case, it will be the same name as the input file, with the extension changed to “.csv”. The full code with the output to .CSV is below:

Full-Extractor-Export.py

 
# Full-Extractor-Export.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")

        fileNameParts = sys.argv[1].split(".")
        fileNameParts[-1] = "csv"
        outputFileName = ".".join(fileNameParts)
        outputLines = ""
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""

            currentName = ""

            currentMonthlyAmount = ""

            currentYearlyAmount = ""

            # This time, we need to know if we are processing the first record.  If we don't keep track of this, the
            # first record will process incorrectly and each subsequent record will be wrong.
            firstRecord = True
            
            # Handle the file as an enumerable object, split by newlines 
            for x, line in enumerate(input_file):
                # Strip newlines from right (trailing newlines)
                currentLine = line.rstrip()
                
                # For this example, a single record is composed of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to complete the substring function.
                currentSSN = currentLine[12:23]
                if (True == firstRecord):
                    previousSSN = currentSSN
                    firstRecord = False

                # For the first record, previousSSN would be blank and currentSSN would have a value, and this condition would be true.
                # We do not want this, so we need the logic above to set the values to be the same for the first record.
                if (previousSSN != currentSSN):
                    # We are at a new record, and hopefully the completed
                    # record's information is stored in the "previous"
                    # versions of all these variables.  Note that on the
                    # first iteration of this loop, the previous versions
                    # of these variables will all be blank.

                    # Also note the "Disconnect" between the previous and current notation.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"], name ["+currentName+"], monthly amount [" + currentMonthlyAmount+
                               "] yearly amount [" + currentYearlyAmount + "]")
                        # Because CSV is a trivial format to write to, string processing can be used.  Note that we also need to split
                        # the name into the first and last names.
                        nameParts = currentName.split(",")
                        # This is trivial error checking.  Ideally a more robust error response system should be here.
                        firstName = "Error"
                        lastName = "Error"
                        if (2 == len(nameParts)):
                            firstName = nameParts[1]
                            lastName = nameParts[0]
                        # Should there be any quotation marks in these strings, they need to be escaped in the CSV file by using
                        # double quotation marks.  Strings in CSV files should always be delimited with quotation marks.
                        outputLines += ("\"" + previousSSN.replace("\"", "\"\"") + "\",\"" + lastName.replace("\"", "\"\"") + "\",\"" +
                            firstName.replace("\"", "\"\"") + "\",\"" + currentMonthlyAmount.replace("\"", "\"\"") + "\",\"" +
                            currentYearlyAmount.replace("\"", "\"\"") +"\"\r\n")
                    # Reset for the next record.  This logic needs to come before the remaining data extractions, or you will have
                    # "off by one" errors.
                    previousSSN = currentSSN

                    # Blank out the "current" versions of the variables (except the SSN!) so the conditions above will be true again.
                    currentName = ""
                    currentMonthlyAmount = ""
                    currentYearlyAmount = ""

                # Get the name if we do not already have it.  This condition prevents us from overwriting the name.  Note that if the
                # data was structured in a way that there was more than one piece of information at this position in the file, you would
                # need additional logic to determine what it is you are parsing out.  In this example, the simplistic logic of checking if
                # the first character is present and that a comma is in the substring is the "test".

                if ("" == currentName) and (not currentLine[33:].startswith(' ')) and (',' in currentLine[33:]):
                    # Also note that the name can go to the end of the line, so
                    # no ending position is included here.
                    currentName = currentLine[33:]

                # Follow the same logic for extracting the other information.  In this case, make sure the string contains only
                # numeric values.  In the case of the monthly amount, we only want to process lines that end in "2022" as these are
                # the only lines which contain this information.
                if ("" == currentMonthlyAmount) and (currentLine.endswith("2022")):
                    currentMonthlyAmount = currentLine[51:57]

                if ("" == currentYearlyAmount) and currentLine[57:62].isdigit():
                    currentYearlyAmount = currentLine[57:62]

            # Note that at the end of the loop, the last record's information
            # will be in the previous versions of the variables.  We need to
            # manually run this logic to get them.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"], name [" + currentName +"], monthly amount [" + currentMonthlyAmount+
                       "] yearly amount [" + currentYearlyAmount + "]")
                nameParts = currentName.split(",")
                firstName = "Error"
                lastName = "Error"
                if (2 == len(nameParts)):
                    firstName = nameParts[1]
                    lastName = nameParts[0]
                # Write the row even if the name could not be split, as the loop above does.
                outputLines += ("\"" + previousSSN.replace("\"", "\"\"") + "\",\"" + lastName.replace("\"", "\"\"") + "\",\"" +
                    firstName.replace("\"", "\"\"") + "\",\"" + currentMonthlyAmount.replace("\"", "\"\"") + "\",\"" +
                    currentYearlyAmount.replace("\"", "\"\"") +"\"\r\n")
            # The output lines already end in "\r\n", so pass newline="" to stop
            # Python from translating the line endings a second time on write.
            with open(outputFileName, "w", newline="") as outputFile:
                outputFile.write(outputLines)
            print ("Wrote to [" + outputFileName + "]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])
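Full-Extractor-Export.py derives the output name by splitting the input name on periods. That works for Sample Data.txt, but an input name with no extension, such as data, would become just csv. As a hedged alternative sketch, Path.with_suffix from the standard pathlib module handles both cases. The helper name csv_name_for is hypothetical and not part of the script.

```python
from pathlib import Path

def csv_name_for(input_name):
    """Return input_name with its extension replaced (or added) as .csv."""
    return str(Path(input_name).with_suffix(".csv"))

print(csv_name_for("Sample Data.txt"))
print(csv_name_for("data"))
```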

For maximum compatibility, all data elements in a CSV file should be enclosed in quotation marks, with any quotation marks within those strings being escaped by doubling them. The Full-Extractor-Export.py listing reflects this.
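The same quoting rules can also be delegated to Python's built-in csv module instead of being assembled by hand. The following is a sketch under assumptions, not a drop-in replacement for the script: rows_to_csv is a hypothetical helper, and the rows are illustrative values shaped like the extracted fields. With quoting=csv.QUOTE_ALL every field is quoted, and embedded quotation marks are doubled by default.

```python
import csv
import io

def rows_to_csv(rows):
    """Return CSV text with every field quoted and rows ending in CRLF."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerows(rows)
    return buffer.getvalue()

# Illustrative rows: SSN, last name, first name, monthly amount, yearly amount.
rows = [
    ("001-00-0837", "SMITH", "ADAM", "006518", "82725"),
    ("000-00-0000", 'O"NEIL', "PAT", "", ""),
]

print(rows_to_csv(rows))
```

To write straight to disk, the same csv.writer call can be given a file opened with newline="" in place of the StringIO buffer.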

Running Full-Extractor-Export.py gives this output:

[Screenshot: console output of Full-Extractor-Export.py]

The above command uses the caret (^) convention to split the command between lines for better readability. The command could be entered on a single line if you prefer.

The output as displayed in Notepad++:

[Screenshot: the generated CSV file open in Notepad++]

Of course, the whole point of this exercise is to see the output in Excel, so now let’s open this file there:

[Screenshot: the generated CSV file open in Excel]

Now that the data is in Excel, all of the tools that Excel brings to the table can be applied to it. Any errors that may have been present in the ERP-generated data can now be properly investigated and corrected, without having to worry about sending bad data out before it can be verified.

Note that the “Error” values in the cells above are intentional, as they are the result of the original data being blank. At a glance, this shows a reviewer that there is a problem.

Another way this could be done is to write out a more informative error message into a neighboring cell in the same row.
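A sketch of that idea follows. It assumes an extra column would be appended to each CSV row, which the scripts in this article do not do, and the function name describe_problems and the wording of its messages are hypothetical.

```python
def describe_problems(name, monthly_amount, yearly_amount):
    """Return a note listing what is missing from a record, or "" if complete."""
    problems = []
    # The export script expects exactly one comma: LAST,FIRST.
    if len(name.split(",")) != 2:
        problems.append("name missing or not in LAST,FIRST form")
    if not monthly_amount:
        problems.append("monthly amount missing")
    if not yearly_amount:
        problems.append("yearly amount missing")
    return "; ".join(problems)

print(describe_problems("SMITH,ADAM", "006518", "82725"))
print(describe_problems("", "", "00000"))
```

An empty note means the row needs no attention, so a reviewer can filter the sheet on that column.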


Final Thoughts on Text Scraping in Python

The beauty of this solution is that, while the code looks complex, it is less complex than an equivalent regular expression, and it can be more readily reused and adapted for similarly structured files. This kind of solution can become an indispensable tool in the arsenal of someone who is tasked with verifying this kind of information from a “black box” data source like an ERP, or beyond.
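One possible way to make that reuse easier, offered as a sketch rather than as part of the scripts above, is to keep the column positions in a single table so that adapting to another file means editing data instead of logic. The names FIELD_LAYOUT and read_field are hypothetical. The positions are the ones used in Full-Extractor.py, and the sample lines are built to match that layout.

```python
# 0-based slices for each field, taken from the positions used in the article.
FIELD_LAYOUT = {
    "ssn": slice(12, 23),
    "name": slice(33, None),          # the name runs to the end of the line
    "monthly_amount": slice(51, 57),
    "yearly_amount": slice(57, 62),
}

def read_field(line, field_name, layout=FIELD_LAYOUT):
    """Return the raw text found at the position configured for field_name."""
    return line[layout[field_name]]

# Lines constructed to follow the layout of the sample data.
name_line = "2022" + " " * 5 + "87 " + "001-00-0837" + " " * 6 + "NMR " + "SMITH,ADAM"
yearly_line = " " * 12 + "001-00-0837" + " " * 6 + "HPZ" + "0" * 25 + "82725"

print(read_field(name_line, "ssn"))
print(read_field(name_line, "name"))
print(read_field(yearly_line, "yearly_amount"))
```

The checks that decide which line carries which field would still be needed on top of this, as in Full-Extractor.py.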
