The infamous tar-pipe

Bulk copying

Copying files on or between Linux/Unix machines is considerably nicer than on their Windows counterparts.  Recursive copy with the built-in command (cp) is available by addition of a single flag, where Windows often requires a separate program (xcopy) and additional flags to achieve the same task.  Copying over networks is a breeze with SCP, where Windows would complain about the shell not supporting UNCs.  Alternatively, you would have to map network shares to drive letters first, and keep track of which letters are what shares.

Of course, on Windows you can drag/drop the files in a GUI much like on Linux, but the moment you go away for a coffee will be the moment that Windows freezes the operation and pops up an annoying dialog asking you if you’re sure that you want to copy some of the files.  Then half an hour later when you return, the copy is still only ten seconds in…

On the Linux front, sometimes we want to customize things a bit:

  • error handling (fail or continue?)
  • symbolic link handling (reference or duplicate?)
  • hard link handling (reference or duplicate?)
  • metadata (copy or ignore?)
  • permissions (copy or ignore?)
  • sparse files (preserve or fill?)
  • filesystem boundaries (recurse or skip?)

Additionally, copying many small files over SCP can take a very long time; SCP performs best with large files. Rather than re-invent the wheel with a whole new file copy & networking program, we can do much better with the tools that we already have, thanks to the modular and interoperable nature of software built upon the Unix philosophy.

Most (or maybe all) of these problems can be solved with rsync, but rsync is not available in all environments (e.g. managed servers, overpriced Microsoft crap).

Tar examples

A simple and highly customizable way to read a load of files is provided by the tape backup utility tar. You can tell it how to handle the various intricacies listed above and it will then recursively read a load of files and write them in a single stream to its output or to a file.

Common tar options:
-c combine files into an archive
-x extract files from archive
-f <file> set archive filename (default is standard input/output)
-t list names of files in archive
-z, -j, -J use gzip / bzip2 / lzma (de)compression
-v list names of files processed
-C <path> set current working directory to this path before proceeding
[code language=”bash”]tar -cf output.tar file1 file2 …[/code]
[code language=”bash”]tar -xf input.tar[/code]

By writing to the standard output, we can pass this archive through a stream compressor, e.g. gzip, bzip2.

[code language=”bash”]
tar -c file1 file2 … | gzip -c > output.tar.gz

As this is a common use of tar, the most common compressors can also be specified as flags to tar rather than via a pipeline:

Archive and compress:

[code language=”bash”]
tar -czf output.tar.bz2 file1 file2 …
tar -cjf output.tar.bz2 file1 file2 …
tar -cJf output.tar.xz file1 file2 …

Decompress and extract

[code language=”bash”]
tar -xzf input.tar.bz2
tar -xjf input.tar.bz2
tar -xJf input.tar.xz

Tar streams can be transferred over networks to a destination computer, where a second tar instance is run. This second one receives the archive stream from the first tar instance and extracts the files onto the destination computer.  This usage of two tar instances over a pipeline has resulted in the technique being nicknamed the “tar-pipe”.

Where network speed is the bottleneck, tar can be instructed to (de)compress the streams on the fly, and offers a choice of codecs.  Note that due to the pipelined nature of this operation, any other streaming (de)compressors can also be used even if not supported by tar.

Tar-pipe examples

In its simplest form, to copy one folder tree to another:

[code language=”bash”]tar -C source/ -c . | tar -C dest/ -x[/code]

One could specify the -h parameter for the left-side tar, to have it follow symbolic links and build a link-free copy of the source in the destination, e.g. for sharing the tree with Windows users.

To copy the files over a network, simply wrap the second tar in an SSH call:

[code language=”bash”]tar -C source/ -c . | ssh user@host ‘tar -C dest/ -x'[/code]

To copy from a remote machine, put the first tar in an SSH call instead:

[code language=”bash”]ssh user@host ‘tar -C source/ -c .’ | tar -C dest/ -x[/code]

SSH provides authentication and encryption, so this form can be used over insecure networks such as the internet.  The SCP utility uses SSH internally. SSH can also provide transparent compression, but the options provided by tar will generally be more useful.

Fast but insecure alternative: netcat

A lightweight and insecure alternative would be to use netcat, which should only be used on secure private networks:

[code language=”bash”]
# On the source machine
tar -C source/ -c *.log | nc host port
[code language=”bash”]
# On the target machine
nc -l -p port | tar -C dest/ -x

This lightweight form is useful on ultra-low-end hardware such as the Raspberry Pi. It is considerably less robust than the SSH tar-pipe, and is also very insecure.

Compressed tar-pipe

If the network is slow then data compression can easily be used with the tar-pipe:

[code language=”bash”]
# z = gzip (high speed)
# j = bzip2 (compromise)
# J = xz (high compression)

# example, using bzip2 (why would anyone use bzip2 vs choice of xz/gzip nowadays?)
tar -C source/ -cj . | ssh user@host ‘tar -C dest/ -xj’

To use a (de)compressor of your choice, provided it is installed on both machines:

[code language=”bash”]
tar -C source/ -c . | my-compressor | ssh user@host ‘my-decompressor | tar -C dest/ -x’

You could, for example, use a parallel implementation of a common compressor such as pigz / pbzip2 / pxz, in order to speed things up a bit.

Tar also has a command-line parameter for specifying the compressor/decompresser, provided it follows a certain set of rules.

The choice of (de)compressor and compressor settings depends on the available processing power, RAM, and network bandwidth. Copying between two modern i7 desktops over 1gig ethernet, gzip compression should suffice. On a fast connection, heavy compression (e.g. xz -9e) will create a bottleneck. For a 100mbit ethernet connection or a USB2 connection, bzip2 or xz (levels 1-3) might give better performance. On a Raspberry Pi, a bzip2 tar-pipe might end up being slower (due to CPU bottleneck) than an uncompressed tar-pipe (limited by network bandwidth).

A niche use example of tar+compression

I originally wrote this while solving a somewhat unrelated problem. From in Estonia I can remotely power on my home PC in the UK via a Raspberry Pi Wake-On-LAN server with Dynamic DNS, then I can use port backwarding to access the UK PC. In order to transfer a large amount of data (~1TB) from the UK PC to Estonia, the fastest method (by far) was to use Sneakernet: i.e. copy the data to a USB disk, then have that disk posted to Estonia.

A friend back home plugged the USB disk in, which contained a couple of hundred gigabytes of his own files (which he wanted to send me), but the disk was formatted using Microsoft’s crappy FAT32 file system. After copying a load of small files to the disk, it became very slow to use then while trying to copy “large” files (only >4GB), it failed completely.  I recall Bill Gates once said that we’d never need more than 640kB or RAM – well apparently, he thought that a 4GB file-size limit would also be futureproof…  FAT32 also didn’t support symbolic links, and although Microsoft’s recent versions of NTFS do, their own programs still often fail miserably when handing symbolic links to executable files.

To solve this I wanted to reformat the disk as Ext4, but keep the files that were on it. The only disk in my home PC with enough free space for the files already on the USB disk was very slow (damn Eco-friendly 5400rpm disk!), so moving the files from the USB disk to this one would take a long time. Hence, I used the left half of a tar-pipe with a parallel-gzip (pigz) compressor to copy the data from the USB disk to the very slow disk.

By compressing the data on the fly before storing it, I could fit somewhat more source data into the measly 20MB/s write speed of the slow disk, getting an effective write speed of around 50MB/s – saturating the link from the USB disk, which was one bottleneck that couldn’t be avoided.

After that was complete, I blitzed and re-formatted the USB disk as Ext4, then ran the right-half of the tar-pipe to extract the data from the slow disk back to the USB disk and resumed copying “To Estonia” files to the disk.