Accessing Files Using Python
1. Introduction to Accessing Files Using Python
This article illustrates how files in the file system can be accessed through Python programming. The file APIs that Python provides are largely operating system independent, although a few functions in the 'os' module are available only on some platforms. Python's approach to file access is analogous to the Unix model, in which files exist in the file system as node data structures of several types (pipe, socket, FIFO, and regular file), and each process maintains a file table entry, identified by a file descriptor, for every file it opens. I/O performed directly on the file descriptor is unbuffered and makes a kernel file system call on every invocation, whereas the file object that wraps a file descriptor provides a buffered, stream-oriented API that calls into the kernel less often.
Finally, various path actions on files are also discussed through Python programming, such as moving files across directories, renaming a file, changing ownership of the file, and deleting a file.
2. Python API Call Flow
Figure 1: The Python call flow
All file access APIs (descriptor mode or file object mode) are compiled to byte code and executed by the Python interpreter. The interpreter, in turn, calls into library code that makes system calls, which transfer control to kernel code and thereby reach the privileged file system code.
3. Python Modules
Accessing files in the file system uses various library routines (APIs) provided through Python. These libraries are known as modules in Python. Each module lives in its own namespace, and that namespace becomes visible from the current namespace when the module is loaded with the "import" statement.
Figure 2: A Python module
Here the main module includes moduleA and has its own attribute, 'x'. So, the main module contains moduleA and x. moduleA, being a separate namespace that has a function func as its attribute, is dereferenced with a dot [.]. In C terminology, moduleA is a library that contains function func and is included in the current program as 'import moduleA'.
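The namespace relationship described above can be sketched in a few lines. This is only an illustration: the article's moduleA is not a real module, so the sketch builds a hypothetical stand-in with types.ModuleType instead of importing it from disk.

```python
import os      # binds the name 'os' in the current (main) namespace
import types

x = 10         # an attribute of the main module itself

# Attributes of a module are reached with the dot operator.
print(os.path.basename("/tmp/test.txt"))

# Hypothetical stand-in for the article's moduleA (an assumption, not a
# real library): a module object that carries a function attribute 'func'.
module_a = types.ModuleType("moduleA")
module_a.func = lambda: "called moduleA.func"

print(module_a.func())
print(isinstance(os, types.ModuleType))
```

In a real program, moduleA would be a file named moduleA.py and the last lines would simply read 'import moduleA' followed by 'moduleA.func()'.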
The module used for file I/O is 'os', which provides the following functions:
- stat(): For file statistics
- open(), read(), write(), lseek(), fdopen(): File descriptor functions
- chmod(), chown(), rename(), remove(), unlink(): File tool functions
- path submodule
Module 'stat' contains the following functions:
- S_ISDIR(), S_ISREG(): Check os.stat() attributes
Module 'sys' contains the built-in standard input, standard output, and standard error file objects:
- stdin, stdout, stderr
4. File Descriptor I/O
Descriptor-level file I/O is built around the file descriptor. As mentioned earlier, file descriptor-based file access APIs are exported through module 'os'. The file descriptor is an integer identifier that is process specific and corresponds to one opened file. Each process maintains a table of file descriptors, and each descriptor points to a 'file table' entry that holds the file access permissions, the file offset, and a v-node pointer. The v-node table, in turn, points to the i-node of the file in the file system.
Figure 3: The File Descriptor I/O
File I/O that works on the descriptor has no buffering mechanism in user space: a write hands the bytes to the kernel buffer before the call returns. No Unicode encoding or decoding of str strings takes place at this level, so these functions accept only byte strings as arguments and return byte strings.
- Module involved - os: Primary APIs exported through the 'os' module
- os.open(file, flags, mode=0o777,*,dir_fd=None):int: Creates an entry in the process file table and returns a non-inheritable file descriptor
- file: Name of the file
- flags: Bitwise OR of access and creation flags such as os.O_RDONLY, os.O_WRONLY, os.O_RDWR, os.O_APPEND, os.O_CREAT, and os.O_TRUNC
- mode: Permission bits; they are used only at the time of file creation
- dir_fd: If given, a relative file path is interpreted relative to the directory that the dir_fd descriptor refers to
- os.read(fd, N):byte string: Reads at most N bytes from file descriptor fd. Returns bytestring.
- fd: File descriptor
- N: Read at most N bytes
- Return: Bytes object string. If end of file reached, an empty bytes object is returned.
- os.write(fd, str):int: Writes the bytestring in str to file descriptor fd.
- fd: File descriptor
- str: Byte string to be written to fd
- Return: Number of bytes written
- os.lseek(fd, pos, how):int: Sets the current position of file descriptor fd to position pos, modified by how. SEEK_SET: relative to the beginning, SEEK_CUR: relative to the current position, and SEEK_END: relative to the end of file.
- fd: File descriptor
- pos: New position in file
- how: SEEK_SET(0), SEEK_CUR(1), SEEK_END(2)
- Return: New cursor position
- os.fdopen(fd,mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None):fileobject: Returns file object connected to file descriptor. This is an alias to the built-in open().
- fd: File descriptor
- mode through opener: Same as in the built-in open() function; refer to the file object section
    ActivePython 188.8.131.52 (ActiveState Software Inc.) is based on
    Python 3.4.3 (default, Aug 21 2015, 12:27:26) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> fdfile = os.open(r'c:\temp\test.txt',(os.O_RDWR|os.O_CREAT))
    >>> os.write(fdfile,r'HELLO')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' does not support the buffer interface
    >>> os.write(fdfile,u'HELLO')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' does not support the buffer interface
    >>> os.write(fdfile,b'HELLO')
    5
    >>> os.lseek(fdfile,0,0)
    0
    >>> byteread=os.read(fdfile,5)
    >>> print(byteread)
    b'HELLO'
    >>> objfile = os.fdopen(fdfile, 'rb')
    >>> os.lseek(fdfile,0,0)
    0
    >>> objfile.read()
    b'HELLO'
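The same descriptor round trip can be written as a small script that runs on any platform. The following is a minimal sketch, assuming Python 3; the helper name descriptor_round_trip and the temporary file location are choices made for this illustration only.

```python
import os
import tempfile


def descriptor_round_trip(payload=b"HELLO"):
    """Write payload through a raw descriptor, seek back, and read it again."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "test.txt")
    # The mode bits (0o644) matter only because O_CREAT creates the file.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, payload)            # number of bytes written
        position = os.lseek(fd, 0, os.SEEK_SET)    # rewind to the start
        data = os.read(fd, len(payload))           # read back as bytes
    finally:
        os.close(fd)        # descriptors are not closed automatically
        os.remove(path)
        os.rmdir(tmpdir)
    return written, position, data


print(descriptor_round_trip())
```

Unlike a file object, a raw descriptor has no close-on-garbage-collection behavior, which is why the sketch releases it in a finally block.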
5. File Statistics
File statistics are provided through the stat() function in the 'os' module, which is analogous to the Unix stat() system call. By default, stat() follows symbolic links; to get the status of the link itself, follow_symlinks=False has to be used.
- os.stat(path,*,dir_fd=None,follow_symlinks=True): Gets the status of a file or a file descriptor
- path: File name or file descriptor
- Return: stat_result object
The stat_result object behaves like a tuple whose fields are also available as named attributes, such as st_mode and st_size. The stat module provides functions that interpret the st_mode value of the stat_result object, for example, stat.S_ISDIR(), stat.S_ISREG(), and so forth. Similar functionality is also provided through the os.path module, as in os.path.isdir(), os.path.isfile(), and so on.
    Python 2.7.5 (default, Feb 11 2014, 07:46:25)
    [GCC 4.8.2 20140120 (Red Hat 4.8.2-13)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.mkfifo('fifo.txt')
    >>> info=os.stat('fifo.txt')
    >>> print(info)
    posix.stat_result(st_mode=4532, st_ino=3177528, st_dev=64768L, st_nlink=1, st_uid=1000, st_gid=1000, st_size=0, st_atime=1474259050, st_mtime=1474259050, st_ctime=1474259050)
    >>> import stat
    >>> stat.S_ISFIFO(info.st_mode)
    True
    >>> stat.S_ISREG(info.st_mode)
    False
    >>>

    Python 2.7.5 (default, Feb 11 2014, 07:46:25)
    [GCC 4.8.2 20140120 (Red Hat 4.8.2-13)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> infof=os.stat('file.txt')
    >>> print(infof)
    posix.stat_result(st_mode=33204, st_ino=4609839, st_dev=64768L, st_nlink=1, st_uid=1000, st_gid=1000, st_size=0, st_atime=1474259370, st_mtime=1474259370, st_ctime=1474259370)
    >>> import stat
    >>> stat.S_ISREG(infof.st_mode)
    True
    >>> stat.S_ISFIFO(infof.st_mode)
    False
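For readers without a FIFO at hand, the st_mode checks can be tried on a temporary directory and a regular file. This is a minimal sketch, assuming Python 3; the classify() helper is a name invented for this illustration.

```python
import os
import stat
import tempfile


def classify(path):
    """Map the st_mode of path to a short description using the stat module."""
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISFIFO(mode):
        return "fifo"
    return "other"


with tempfile.TemporaryDirectory() as tmpdir:
    file_path = os.path.join(tmpdir, "file.txt")
    with open(file_path, "w") as handle:
        handle.write("data")
    kinds = (classify(tmpdir), classify(file_path))
    size = os.stat(file_path).st_size    # named attribute of stat_result

print(kinds, size)
```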
6. File Object I/O
A file object is an object exposing a file-oriented API to an underlying resource. Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example, standard input/output, in-memory buffers, sockets, pipes, and the like). File objects provide file access in binary mode, which accepts and returns byte strings, and in text mode, which accepts and returns str strings and encodes or decodes them on the way. Binary mode can be unbuffered or buffered, whereas text mode is always buffered (line or file buffering).
As in Unix stream I/O, a file object caches the data and calls descriptor-level file I/O to pass it to the kernel only when the cache is full, a new line is encountered (line buffering), or the cache is explicitly flushed.
Figure 4: Passing the data to the kernel
A Python FileObject uses io.BufferedReader and io.BufferedWriter when reading and writing data to and from a stream/cache.
Figure 5: Using io.BufferedReader and io.BufferedWriter
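Which class stands behind a file object can be checked directly. The sketch below, assuming Python 3, opens the same temporary file in several modes and records the type that open() returns in each case.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.txt")

    with open(path, "wb") as writer:               # buffered binary writer
        writer_type = type(writer)
        writer.write(b"hello\n")

    with open(path, "rb") as reader:               # buffered binary reader
        reader_type = type(reader)

    with open(path, "rb", buffering=0) as raw:     # unbuffered: raw file
        raw_type = type(raw)

    with open(path, "r") as text:                  # text layer on a buffer
        text_type = type(text)
        inner_type = type(text.buffer)

print(writer_type, reader_type, raw_type, text_type, inner_type)
```

In text mode the str-oriented wrapper sits on top of a buffered binary object, which is reachable through its buffer attribute.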
A FileObject is created once the built-in function open() is called.
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
The preceding call opens a file and returns a corresponding file object.
- file: String or bytes object giving the pathname, or the integer file descriptor of an already opened file to be wrapped.
- mode: 'r'(read), 'w'(write), 'x'(exclusive creation), 'a'(append), 'b'(binary), 't'(text), '+'(reading and writing)
- buffering: 0 switches buffering off (binary mode only), 1 selects line buffering (text mode only), and an integer greater than 1 sets the size in bytes of a fixed-size buffer. The default, -1, lets Python choose the buffering policy.
- encoding: Encoding used to encode while writing and decode while reading (text mode only). Encodings include 'latin1', 'utf8', 'utf16', and so forth.
- errors: How encoding and decoding errors are handled: 'strict' (raise a ValueError exception), 'ignore', 'replace' (insert a replacement marker), 'surrogateescape', 'xmlcharrefreplace', 'backslashreplace'
- newline: Controls how line endings are translated in text mode. It can be None, '', '\n', '\r', or '\r\n'. With None (default), any line ending is converted to '\n' while reading, and '\n' is converted to the OS default line separator while writing.
- closefd: If the file argument is a file descriptor, closing the file object also closes the file descriptor when closefd is true (default).
- opener: A custom opener subroutine is used to open the file, which returns a file descriptor.
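Several of these arguments are easiest to understand side by side. The following sketch, assuming Python 3, exercises encoding, errors, newline, and opener on a temporary file; the function name demo_open_arguments and the trivial opener are illustrative choices, not part of any library.

```python
import os
import tempfile


def demo_open_arguments():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt")

        # newline='\n' writes '\n' unchanged on every platform.
        with open(path, "w", encoding="utf8", newline="\n") as out:
            out.write("d\xe4m\n")

        # errors='strict' raises on bytes the codec cannot decode.
        try:
            with open(path, "r", encoding="ascii", errors="strict") as src:
                src.read()
            strict_failed = False
        except UnicodeDecodeError:      # a subclass of ValueError
            strict_failed = True

        # errors='replace' substitutes U+FFFD for each undecodable byte.
        with open(path, "r", encoding="ascii", errors="replace") as src:
            replaced = src.read()

        # A custom opener receives (path, flags) and returns a descriptor.
        def opener(name, flags):
            return os.open(name, flags)

        with open(path, "rb", opener=opener) as src:
            raw = src.read()

    return strict_failed, replaced, raw


print(demo_open_arguments())
```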
6.1 Output File
An output file can be generated by the open() function. Various file object functions used in output file manipulations are:
- write(str): Writes the str string to the file. In text mode, str is encoded with the encoding passed to the open() built-in function, or with the platform default encoding if none was given. In binary mode, a byte string must be passed, for example one produced by the str encode method.
- writelines(L): Writes all strings in list L to the file. No line separators are added.
For example, opening a file in binary (and unbuffered) mode and writing a byte string:
**example.py**
    import glob

    file = open('test.txt', 'wb+', buffering=0)
    # file.write(u'd\xe4m')  # gives an error, as binary mode does not accept a str string
    file.write(b'd\xe4m')
    # no buffering, so the file data is available as soon as write() returns;
    # glob.glob() returns a list, so its first entry is passed to open()
    print(open(glob.glob('test.txt')[0]).readlines())
    file.seek(0)
    data = file.read()
    print(data)
    file.close()

C:\temp>python example.py
['d\xe4m']
b'd\xe4m'
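The two output methods listed above can be compared in a short text-mode sketch, assuming Python 3. The helper name write_lines is invented for the illustration; note that writelines() does not append line separators, so the strings carry their own '\n'.

```python
import os
import tempfile


def write_lines(lines):
    """Write one string with write() and a list of strings with writelines()."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt")
        with open(path, "w", encoding="utf8", newline="\n") as out:
            count = out.write("first\n")   # returns the number of characters
            out.writelines(lines)          # no separators are added
        with open(path, "r", encoding="utf8") as src:
            return count, src.read()


print(write_lines(["second\n", "third\n"]))
```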
6.2 Input File
An input file can be opened through the open() function. The file object methods related to input files are as follows:
- read(): Reads the entire file into a single string. In text mode, it decodes the file contents into a str string; in binary mode, it returns the unaltered contents as bytes. In text mode, an end-of-line sequence, for example '\r\n' on Windows, is converted to '\n'.
- read(N): Reads at most N more characters (text mode) or bytes (binary mode); empty at end of file.
- readline(): Reads the next line (through the end-of-line marker); empty at end of file
- readlines(): Reads entire file into a list of line strings.
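The input methods listed above can be combined on one open file, because each call continues from the current position. This is a minimal sketch, assuming Python 3 and a temporary three-line file created just for the illustration.

```python
import os
import tempfile


def read_variants():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt")
        with open(path, "w", encoding="utf8") as out:
            out.write("one\ntwo\nthree\n")

        with open(path, "r", encoding="utf8") as src:
            first_two = src.read(2)         # at most two characters
            rest_of_line = src.readline()   # up to and including '\n'
            remaining = src.readlines()     # list of the remaining lines
            at_eof = src.read()             # empty string at end of file

    return first_two, rest_of_line, remaining, at_eof


print(read_variants())
```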
The following example shows how binary and text mode writing are different.
**example.py**
    file = open('test.txt', 'wb+')  # binary mode
    file.write(b'd\xe4m\n')
    file.seek(0)
    print(file.read())
    file.close()

    file = open('test.txt', 'w+')
    file.write('d\xe4m\n')  # written in the default encoding
    file.close()
    print(open('test.txt', 'rb').read())  # read in binary

C:\temp>python example.py
b'd\xe4m\n'
b'd\xe4m\r\n'
The next example shows how different encodings produce different bytes when the same string is written to a file.
    import glob

    file = open('test.txt', 'w+')
    file.write(u'd\xe4m\n')
    file.flush()
    # glob.glob() returns a list, so its first entry is passed to open()
    print(open(glob.glob('test.txt')[0], 'rb').readlines())
    file.close()

    file = open('test.txt', 'w+', encoding='utf8')
    file.write(u'd\xe4m\n')
    file.flush()
    print(open(glob.glob('test.txt')[0], 'rb').readlines())
    file.close()

    file = open('test.txt', 'w+', encoding='latin1')
    file.write(u'd\xe4m\n')
    file.flush()
    print(open(glob.glob('test.txt')[0], 'rb').readlines())
    file.close()

C:\temp>python example.py
[b'd\xe4m\r\n']
[b'd\xc3\xa4m\r\n']
[b'd\xe4m\r\n']
The file object method seek() provides random access to a file.
- seek(dist, how=0): Sets the file's current position
- dist: Distance to seek
- how: 0(seek from start), 1(seek from current position), 2(seek relative to the end of the file)
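The three values of how can be tried on a small binary file. The sketch below assumes Python 3 and uses binary mode on purpose, since text-mode files do not accept nonzero seeks relative to the current position or the end.

```python
import os
import tempfile


def seek_demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.bin")
        with open(path, "wb+") as f:
            f.write(b"0123456789")

            f.seek(0, os.SEEK_SET)     # how=0: from the start
            start = f.read(3)

            f.seek(2, os.SEEK_CUR)     # how=1: skip two bytes from here
            middle = f.read(2)

            f.seek(-3, os.SEEK_END)    # how=2: three bytes before the end
            tail = f.read()

            position = f.tell()        # current offset after the last read

    return start, middle, tail, position


print(seek_demo())
```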
If the file contains multiple lines, it usually has to be processed line by line. One way to read the lines is to load everything into a list and then loop through it.
    for line in file.readlines():
        print(line, end='')
readlines() reads all the data into a list of strings, which is an issue with very large files. Another solution is readline(), which loads one line at a time. This logic is built into the __next__() method of the file object, which the 'for' loop calls through the iterator protocol. It raises a StopIteration exception when EOF is met, which ends the loop.
    for line in open('test.txt'):
        print(line, end='')
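What the 'for' loop does behind the scenes can be made visible by calling next() by hand. This is a minimal sketch, assuming Python 3 and a two-line temporary file; the function name iterate_lines is chosen for the illustration.

```python
import os
import tempfile


def iterate_lines():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt")
        with open(path, "w") as out:
            out.write("a\nb\n")

        # The usual form: the loop calls __next__() until StopIteration.
        with open(path) as src:
            via_loop = [line for line in src]

        # The same protocol, driven manually with the built-in next().
        with open(path) as src:
            first = next(src)
            second = next(src)
            try:
                next(src)
                exhausted = False
            except StopIteration:
                exhausted = True

    return via_loop, first, second, exhausted


print(iterate_lines())
```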
6.5 Closing Files
- close(): A file object can be explicitly closed by calling the close() method. Alternatively, when Python garbage collection reclaims an unreferenced file object, the file is closed as well.
    open('test.txt','w').write("Hello")  # write to temporary object
    open('test.txt','r').read()          # read from temporary object
In both cases, the created file object is no longer referenced once the call returns, and it is deleted immediately afterward. However, an exception thrown while a file is in use can leave it open longer than intended.
- The try: and finally: pair makes sure that the file is closed even if an exception is thrown.
    try:
        for line in file:
            print(line, end='')
    finally:
        file.close()
- The third way is to use the file object as a context manager, which closes the file as soon as the 'with' block is left:
    with open('test.txt') as file:
        for line in file:
            print(line, end='')
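Both protective forms can be verified through the closed attribute of the file object. The following sketch, assuming Python 3, provokes an exception in each form on purpose; the failures (a ValueError from int() and a RuntimeError raised by hand) are artificial and exist only for the demonstration.

```python
import os
import tempfile


def closing_demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt")
        with open(path, "w") as out:
            out.write("Hello\n")

        # try/finally: close() runs even though int() raises ValueError.
        handle = open(path)
        try:
            int(handle.read())
        except ValueError:
            pass
        finally:
            handle.close()
        closed_by_finally = handle.closed

        # with: the context manager closes the file when the block is left.
        try:
            with open(path) as managed:
                raise RuntimeError("simulated failure")
        except RuntimeError:
            closed_by_with = managed.closed

    return closed_by_finally, closed_by_with


print(closing_demo())
```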
6.6 stdin, stdout, stderr Built-in Sys Object
The 'sys' module contains the standard input, output, and error file objects. The fileno() method of a file object returns the file descriptor associated with it.
fileno():int: Returns the underlying file descriptor (an integer) of the stream or file object. OSError is raised if the I/O object does not use a file descriptor.
    >>> import sys
    >>> for stream in [sys.stdin,sys.stdout,sys.stderr]:
    ...     print(stream.fileno())
    ...
    0
    1
    2
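The link between a file object and its descriptor, and the OSError case mentioned above, can be sketched as follows, assuming Python 3. The in-memory io.StringIO stream is used here as an example of an I/O object that has no descriptor.

```python
import io
import os
import tempfile


def fileno_demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt")
        with open(path, "w") as out:
            fd = out.fileno()               # descriptor behind the object
            is_int = isinstance(fd, int)
            os.write(fd, b"raw")            # descriptor-level write
        with open(path, "rb") as src:
            content = src.read()

    # An in-memory stream has no descriptor; fileno() raises
    # io.UnsupportedOperation, which is a subclass of OSError.
    try:
        io.StringIO("in memory").fileno()
        has_descriptor = True
    except OSError:
        has_descriptor = False

    return is_int, content, has_descriptor


print(fileno_demo())
```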
One of the arguments of the open() built-in function is buffering. File object I/O supports three buffering categories:
- No buffering: When value is 0, data is transmitted to the kernel upon function return. This is supported only in BINARY mode.
- Line buffering: When value is 1, data is transmitted when an end-of-line is written, when a seek() or read() operation is called, or when the cache is flushed explicitly. This is supported only in text mode.
- Full/File buffering: Any other positive value sets full buffering with a buffer of that size; full buffering is also the default for regular files.
The default buffer size is platform dependent and is exposed as io.DEFAULT_BUFFER_SIZE. Data is flushed only when the buffer is full, a seek() or read() operation is called, or flush() is called explicitly.
With line buffering, data goes to the kernel only when a new line is written.
**example.py**
    import glob

    file = open('test.txt', 'w+', buffering=1)
    file.write('hello')
    # nothing has reached the file yet, as no new line was written
    print(open(glob.glob('test.txt')[0]).readlines())
    file.write('\n')
    print(open(glob.glob('test.txt')[0]).readlines())
    file.close()

C:\temp>python example.py
[]
['hello\n']
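The three buffering categories can also be compared by watching the file size grow. This is a sketch under stated assumptions: Python 3, and a platform (such as Linux) where os.path.getsize() reflects writes to a file that is still open; the dictionary keys are labels made up for this illustration.

```python
import os
import tempfile


def buffering_demo():
    sizes = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt")

        # Line buffering: data reaches the kernel once '\n' is written.
        with open(path, "w", buffering=1, encoding="utf8", newline="\n") as out:
            out.write("hello")
            sizes["line: before newline"] = os.path.getsize(path)
            out.write("\n")
            sizes["line: after newline"] = os.path.getsize(path)

        # Full buffering (default): data stays cached until flush().
        with open(path, "wb") as out:
            out.write(b"hello")
            sizes["full: before flush"] = os.path.getsize(path)
            out.flush()
            sizes["full: after flush"] = os.path.getsize(path)

        # No buffering: every write() goes straight to the kernel.
        with open(path, "wb", buffering=0) as out:
            out.write(b"hello")
            sizes["none: after write"] = os.path.getsize(path)

    return sizes


print(buffering_demo())
```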
7. 'os' File Tools
File tools are functions that accept a file pathname string and accomplish file-related tasks such as renaming, deleting, and changing the file's owner and permission settings. The functions involved are:
- chdir(path): Changes the current directory
- path: New directory path
- chmod(path,mode): Changes the mode of path to the numeric mode
- path: File path whose mode needs to be changed
- mode: New mode
- chown(path,uid,gid): Changes the owner and group id of path to the numeric uid and gid
- path: File path for which ownership needs to be changed
- uid: New uid
- gid: New gid
- remove(path), unlink(path): Remove or unlink a file
- path: File path whose entry would be removed from file system
- rename(srcpath, dstpath): Rename a source path to destination path
- srcpath: Source file path
- dstpath: Destination file path
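A few of these tools can be chained on a temporary file, as in the sketch below (Python 3 assumed). The file names old.txt and new.txt are arbitrary, and chown() is left out because changing ownership normally needs elevated privileges and is not available on Windows.

```python
import os
import stat
import tempfile


def file_tools_demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "old.txt")
        dst = os.path.join(tmpdir, "new.txt")
        with open(src, "w") as out:
            out.write("data")

        os.rename(src, dst)                          # move/rename the entry
        renamed = os.path.exists(dst) and not os.path.exists(src)

        os.chmod(dst, stat.S_IRUSR | stat.S_IWUSR)   # owner read/write
        mode = stat.S_IMODE(os.stat(dst).st_mode)    # permission bits only

        os.remove(dst)                               # same effect as unlink()
        removed = not os.path.exists(dst)

    return renamed, mode, removed


renamed, mode, removed = file_tools_demo()
print(renamed, oct(mode), removed)
```

On POSIX systems the reported mode is 0o600; on Windows, chmod() can only toggle the read-only flag, so the value differs there.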
This article discussed various ways to access files in a file system. It draws a parallel between the system calls provided through Unix and the Python functions and classes that internally call them. The article separates file access into unbuffered (descriptor-based) and buffered (stream-based) I/O. Binary and text mode output and input are discussed, along with various encodings. The article does not provide a whole list of APIs exported through the file object and the low-level file descriptor functions provided through the os module; instead, it focuses only on input, output, and iterator-related methods. Examples are executed on the Active Python 3.4 win32 distribution and on Python 2.7 on the Linux RHEL 7.0 64-bit distribution.