Mastering UNIX pipes, Part 1

By Kamil Rytarowski

November 26, 2020 - 9 minutes read - 1774 words


A pipe is a first-in-first-out interprocess communication channel. The pipe as it is known today was invented by the American computer scientist Douglas McIlroy and incorporated into Version 3 AT&T UNIX in 1973 by Ken Thompson.

It was inspired by the observation that frequently the output of one application is used as an input for another. This concept can be reused to connect a chain of processes. This is frequently observed in UNIX shell constructs that utilize the | operator.

$ find lib -name '*.c' | awk -F '/' '{print $NF}' | sort -u | tail
yp_maplist.c
yp_master.c
yp_match.c
yp_order.c
yperr_string.c
yplib.c
ypprot_err.c
yyerror.c
zdump.c
zic.c

This can be illustrated as a sequence of processes and pipes connecting the programs.

[Figure: pipe chain]

This concept of connecting UNIX tools has been extended to tools that are specifically designed to be used in pipelines, such as the troff formatting system. The troff format and the associated toolkit are still used in the NetBSD Operating System. For example, the build rules that produce the PostScript (.ps) file for kernmalloc (the kernel allocator documentation) look like this:

#	$NetBSD: Makefile,v 1.4 2003/07/10 10:34:26 lukem Exp $
#
#	@(#)Makefile	1.8 (Berkeley) 6/8/93

DIR=	papers/kernmalloc
SRCS=	kernmalloc.t appendix.t
MACROS=	-ms

paper.ps: ${SRCS} alloc.fig usage.tbl
	${TOOL_SOELIM} ${SRCS} | ${TOOL_TBL} | ${TOOL_PIC} | \
	    ${TOOL_EQN} | \
	    ${TOOL_VGRIND} | ${TOOL_ROFF_PS} ${MACROS} > ${.TARGET}

.include <bsd.doc.mk>

Source: src/share/doc/papers/kernmalloc/Makefile.

The C interface for pipes

The POSIX specification declares the pipe function with the following signature:

int pipe(int fildes[2]);

inside the <unistd.h> header.

The pipe function takes an array of two integers, and writes file descriptors of the read and write end of the pipe into it upon successful return. The fildes[0] file descriptor is opened for reading and fildes[1] for writing. Some implementations of UNIX allow using the fildes[0] end for writing too and fildes[1] for reading (the full duplex mode), but this behavior is unspecified by POSIX and it is only safe to assume that they are unidirectional (half duplex mode).

The pipe call can fail and return -1, setting errno appropriately, if the process (EMFILE) or the system (ENFILE) has exhausted the allowed number of open file descriptors.

[Figure: pipe in a process]

This interface, as it stands, is appropriate only for processes that have a shared ancestor (usually the direct parent), and it is usually combined with fork(2)/vfork(2)/posix_spawn(3) or an equivalent interface (otherwise the pipe would be a futile feature). To work around the requirement of a shared ancestor, FIFO special files or UNIX domain sockets can be used.

In the UNIX system, file descriptors are inherited by children by default (with some exceptions in modern APIs) and thus the created pipe, referenced by the array of two file descriptors, connects the child and the parent.

[Figure: pipe after fork]

In order to make the pipe effective, the user has to decide the direction of the data flow and close the other ends. If the intention is to send data from process A to process B, then we need to close the fildes[0] (reading) end in process A and fildes[1] (writing) end in process B.

[Figure: pipe after fork, with the unused ends closed]

Now, the processes can transmit data over the pipe channel.

This algorithm is coded as follows:

/* CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication */
#include <sys/types.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char c;
	int status;
	pid_t child;
	int fildes[2];

	if (pipe(fildes) == -1)
		err(EXIT_FAILURE, "pipe");

	if ((child = fork()) == -1)
		err(EXIT_FAILURE, "fork");

	if (child == 0) {
		/* child */
		if (close(fildes[1]) == -1)
			err(EXIT_FAILURE, "close");
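		/* a short read or an error leaves c uninitialized, so bail out */
		if (read(fildes[0], &c, 1) != 1)
			err(EXIT_FAILURE, "read");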
		printf("Received: %c\n", c);
		/* force the buffer to be printed on the output (screen) */
		fflush(stdout);
		_exit(0);
	}
	/* parent */
	if (close(fildes[0]) == -1)
		err(EXIT_FAILURE, "close");
	if (write(fildes[1], "x", 1) == -1)
		err(EXIT_FAILURE, "write");

	/* wait for the child process termination */
	if (wait(&status) == -1)
		err(EXIT_FAILURE, "wait");

	return EXIT_SUCCESS;
}

NB. For the sake of simplicity, certain code paths such as handling interrupts (EINTR) were omitted.

Executing this program prints:

$ ./a.out
Received: x

The UNIX designers put the following constraints on the pipes (assuming O_NONBLOCK not set):

- Once the readable end of the pipe is closed, any attempt to write raises SIGPIPE in the writing process. By default the signal kills the process; a process that catches or ignores it must handle the error itself (write(2) returns -1 with errno set to EPIPE).
- Once the writable end of the pipe is closed, an attempt to read from the pipe returns 0 once any buffered data has been consumed, which signals EOF on the file descriptor.

Additionally:

- The amount of free space inside the pipe (kernel buffering) is limited and implementation specific.
- When the child process starts, stdio streams attached to a pipe default to the fully buffered mode. The three basic approaches to work around this are:
  - using fflush(3) explicitly,
  - changing the buffering mode (setvbuf(3)) or
  - using pseudo terminals if the child process is not modifiable.

The kernel pipe buffer size

The kernel buffer that stores the pipe data has a limited size. Once it is full, further write(2) calls block until a read(2) on the other end frees some space. POSIX does not mandate a particular capacity; it only requires that PIPE_BUF, the largest write guaranteed to be atomic, is at least 512 bytes.

To check the maximum number of bytes that can be written atomically to a pipe, a programmer can use the compile-time constant PIPE_BUF or pass the _PC_PIPE_BUF name to pathconf(2) or fpathconf(2) to query the value at run time. pathconf(2) and fpathconf(2) can be applied to:

- directories that can contain FIFO files,
- FIFO files.

Additionally, fpathconf(2) can be applied on the pipe file descriptor.

/* CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication */
#include <err.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int fildes[2];

	if (pipe(fildes) == -1)
		err(EXIT_FAILURE, "pipe");

	printf("_PC_PIPE_BUF: %ld\n", fpathconf(fildes[1], _PC_PIPE_BUF));
	printf("PIPE_BUF: %d\n", PIPE_BUF);

	return EXIT_SUCCESS;
}

However, the actual capacity of the kernel buffer is usually larger than PIPE_BUF. On NetBSD it can be retrieved with ioctl(FIONSPACE). This feature is unavailable for pipes on FreeBSD, OpenBSD and Linux; FreeBSD implements FIONSPACE for sockets, but not for pipes.

/* CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication */
#include <sys/types.h>
#include <sys/ioctl.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int fildes[2];
	int n;

	if (pipe(fildes) == -1)
		err(EXIT_FAILURE, "pipe");

	if (ioctl(fildes[1], FIONSPACE, &n) == -1)
		err(EXIT_FAILURE, "ioctl");
	printf("FIONSPACE fildes[1]: %d\n", n);

	return EXIT_SUCCESS;
}

An alternative approach to check the maximum buffer size of the pipe is to write bytes into it one by one, counting them, until write(2) blocks. The hang can be interrupted, for example, with the alarm(3) call.

/* CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication */
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

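/* updated by main(), read by the signal handler */
static volatile sig_atomic_t n;

static void
sighand(int s)
{

	/*
	 * NB. printf(3) and exit(3) are not async-signal-safe; they are
	 * tolerated here only because main() is blocked in write(2).
	 */
	printf("bytes written into the pipe: %d\n", (int)n);
	exit(EXIT_SUCCESS);
}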

int
main(int argc, char **argv)
{
	int fildes[2];

	if (signal(SIGALRM, sighand) == SIG_ERR)
		err(EXIT_FAILURE, "signal");

	if (pipe(fildes) == -1)
		err(EXIT_FAILURE, "pipe");

	alarm(5); /* arm the alarm to 5 seconds */

	while (write(fildes[1], "x", 1) != -1)
		++n;

	/* if we ended up here, there was an error */
	err(EXIT_FAILURE, "write");
}

Alternatively, one could set the pipe end in the non-blocking mode. This can be achieved with the fcntl(2) call and the F_SETFL + O_NONBLOCK arguments.

The O_NONBLOCK mode on pipes causes the following change:

Writing into a full pipe buffer returns with -1 and errno EAGAIN, instead of blocking.

/* CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication */
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int fildes[2];
	int n = 0;

	if (pipe(fildes) == -1)
		err(EXIT_FAILURE, "pipe");

	if (fcntl(fildes[1], F_SETFL, O_NONBLOCK) == -1)
		err(EXIT_FAILURE, "fcntl");

	while (write(fildes[1], "x", 1) != -1)
		++n;

	/* tell real errors apart from a full pipe buffer (EAGAIN) */
	if (errno != EAGAIN)
		err(EXIT_FAILURE, "write");

	printf("bytes written into the pipe: %d\n", n);

	return EXIT_SUCCESS;
}

There are a few other kernel-specific approaches to guess the maximum amount of data that can be stored inside the kernel. One of them is to read PIPE_SIZE from <sys/pipe.h> on BSD systems. It is 16384 on FreeBSD, NetBSD and OpenBSD, but <sys/pipe.h> is an internal, implementation-specific header, so this value should not be relied upon.

To complete the picture, the FreeBSD and NetBSD kernels allow tuning of the pipe behavior and inspecting the kernel virtual address space (KVA) spent on the buffers.

FreeBSD provides the following sysctl knobs:

kern.ipc.piperesizeallowed: Pipe resizing allowed
kern.ipc.piperesizefail: Pipe resize failures
kern.ipc.pipeallocfail: Pipe allocation failures
kern.ipc.pipefragretry: Pipe allocation retries due to fragmentation
kern.ipc.pipekva: Pipe KVA usage
kern.ipc.maxpipekva: Pipe KVA limit

NetBSD:

kern.pipe.maxbigpipes: Maximum number of “big” pipes
kern.pipe.nbigpipes: Number of “big” pipes
kern.pipe.kvasize: Amount of kernel memory consumed by pipe buffers

OpenBSD does not provide any similar sysctl functionality for pipes.

What are “big” pipes in NetBSD? They are special-case pipes whose buffer grows to four times PIPE_SIZE (65536 bytes) on atomic writes. The maximum number of “big” pipes is set to 32 by default, but can be tuned dynamically at run time.

Summary of the pipe buffer size limits [in bytes]

Limit                                FreeBSD 12.0   NetBSD 9.0   OpenBSD 6.6   Linux 5.6.14
_PC_PIPE_BUF                         512            512          512           4096
PIPE_BUF                             512            512          512           4096
PIPE_SIZE (implementation detail)    16384          16384        16384         N/A
ioctl(FIONSPACE)                     N/A            16384        N/A           N/A
write(2) + alarm(3)                  65536          16384        16384         65536
write(2) + O_NONBLOCK                98303          16384        49023         65536
"big" pipe on atomic write           N/A            65536        N/A           N/A

As we can see, these limits depend heavily on the operating system. The portable approach to picking a buffer size with guaranteed atomic writes is to use the POSIX limits represented by PIPE_BUF and _PC_PIPE_BUF, or to fall back to the bare minimum allowed by POSIX, 512 bytes.

In practice, it is often unimportant whether an operation blocks: the kernel services the channel as a sequence of write and read operations and blocks the appropriate end when the internal buffer limit is reached. Properly designed software should not depend on the buffer sizes and should leave them to the kernel designers, who tuned the mechanism for maximal efficiency.

Why not raise the limits to very large sizes like 32 megabytes? Because the kernel would be prone to Denial of Service attacks, running out of available kernel virtual memory more easily.

[Figure: bufferbloat]

Furthermore, the whole mechanism could lead to an undesirable waste of kernel memory and, in some corner cases, even to latencies similar to bufferbloat.

Summary

We have introduced the reader to the UNIX pipe concept and presented the basic characteristics of this interprocess communication channel. In the next part, we will dig into the examples of combining two processes and managing the byte transfers.