Text Processing

An important part of working with the Unix shell is processing text directly from the command line. We've done a bit of this, but Chapter 20 introduces a few more interesting tools.

The text tools can be used to create reports, generate HTML, or even send email messages, frequently using temp files to hold intermediate results.

  1. cat (again)
    1. The -A option to show unprintable characters.
    2. -n prints line numbers.
    3. -s squeezes each run of blank lines down to a single blank line.
  2. sort offers much control over how sorting is performed.
    1. -f ignore case.
    2. -r reverse.
    3. -t to specify a character (other than spaces) to separate fields.
    4. -k to specify a sort key.
      1. Specify a range of field numbers which form the key. One-based.
      2. May specify multiple -k's to sort on multiple keys.
    5. -u remove duplicate lines.
  3. uniq filters out adjacent duplicate lines, so it is normally fed a sorted stream; the effect is then much like the -u option to sort.
  4. cut extracts selected parts of each line.
    1. Use -c to select a list of character ranges.
    2. Use -f to select a list of fields.
    3. Fields separated by tab, or use -d.
    4. Often chained in a pipeline to apply several cuts of different types.
    5. Example: last | egrep -v '^wtmp begins|^reboot' | cut -c 1-9,23-37 | sort -u
  5. paste will combine lines from two files in pairs.
  6. join will combine lines from two files based on a common field, modeled on a database join.
    1. Fields specified in ways similar to sort and cut.
    2. Joins on first field by default.
    3. Files must be sorted on the join field (so maybe it's more of a merge).
  7. tr (translate) performs character substitution.
    1. Lowercase letters: tr A-Z a-z < file.txt
    2. -d option deletes characters instead of replacing them.
    3. -s option squeezes each run of a repeated character down to a single occurrence.
    last | egrep -v '^wtmp begins|^reboot' | cut -c 1-9,23-37 | sort -u > tmp1
    sort -t : -k 1 /etc/passwd | cut -d : -f 1,5 | tr : ' ' > tmp2
    join tmp1 tmp2
  8. comm compares two sorted files line by line. Prints three columns: lines only in the first file, lines only in the second, and lines in both.
  9. diff compares two (usually similar) files and summarizes the differences.
    1. Most often used with the -u option to display unified format (-c gives context format).
    2. Often used to inspect two different versions of a program.
  10. patch applies a diff to one version to reproduce the other.
    1. A way to store or transmit changes.
    2. Software source updates are sometimes distributed as diffs.
  11. sed is a stream editor with a variety of editing commands.
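
The sort, cut, tr, uniq, join, and comm items above can be tried together on throwaway data. The following is only a sketch: the file names (users.txt, logins.txt, and so on) and the records in them are invented for illustration, and LC_ALL=C is set on the assumption that byte-order collation is acceptable, so that sort and join agree on ordering.

```shell
#!/bin/sh
# Sketch only: every file name and record below is made up for illustration.
export LC_ALL=C                 # byte-order collation, so sort and join agree
d=$(mktemp -d) || exit 1        # scratch directory for the temp files

# login:uid:full name, loosely in the spirit of /etc/passwd
printf '%s\n' 'carol:1003:Carol Diaz' 'alice:1001:Alice Ng' 'bob:1002:Bob Roy' > "$d/users.txt"
# login and terminal, with one repeated line
printf '%s\n' 'bob tty1' 'alice pts/0' 'bob tty1' > "$d/logins.txt"

# sort -t/-k: the key is field 2 only, compared numerically (n) in reverse (r);
# paste -s then puts the resulting logins on one line
by_uid=$(sort -t : -k 2,2nr "$d/users.txt" | cut -d : -f 1 | paste -s -d ' ' -)

# cut -f with -d, then tr to turn the remaining colon into a space
sort -t : -k 1,1 "$d/users.txt" | cut -d : -f 1,3 | tr ':' ' ' > "$d/names.txt"

# sort -u drops the repeated line and leaves the file ordered for join
sort -u "$d/logins.txt" > "$d/seen.txt"

# uniq -d prints one copy of each line that occurs more than once
repeated=$(sort "$d/logins.txt" | uniq -d)

# join pairs up lines whose first field matches in both sorted files
joined=$(join "$d/seen.txt" "$d/names.txt")

# comm -13 hides columns 1 and 3, leaving lines found only in the second file
cut -d ' ' -f 1 "$d/seen.txt" > "$d/seen_logins.txt"
cut -d : -f 1 "$d/users.txt" | sort > "$d/all_logins.txt"
never_seen=$(comm -13 "$d/seen_logins.txt" "$d/all_logins.txt")

printf '%s\n' "$by_uid" "$repeated" "$joined" "$never_seen"
rm -rf "$d"
```

Writing the key as -k 2,2 rather than -k 2 keeps it to a single field; a bare -k 2 would run from field 2 to the end of the line.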
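
The diff and patch items above describe a round trip: record the changes between two versions, then replay them onto the old version. A minimal sketch, assuming made-up file names (names.old, names.new, names.diff) and using sed only to manufacture the second version; the patch step is skipped if patch is not installed.

```shell
#!/bin/sh
# Sketch only: the file names and contents are made up for illustration.
d=$(mktemp -d) || exit 1

printf '%s\n' 'alice Alice Ng' 'bob Bob Roy' 'carol Carol Diaz' > "$d/names.old"
# sed produces the "new version" by editing one line of the old one
sed 's/Bob Roy/Robert Roy/' "$d/names.old" > "$d/names.new"

# diff exits with status 1 when the files differ, which is not an error here
diff -u "$d/names.old" "$d/names.new" > "$d/names.diff"
diff_status=$?

# Count added lines: they start with "+", unlike the "+++" header line
added=$(grep -c '^+[^+]' "$d/names.diff")

# Apply the diff to a copy of the old version, then check it matches the new one
patched=skipped
if command -v patch >/dev/null 2>&1; then
    cp "$d/names.old" "$d/names.copy"
    if patch -s "$d/names.copy" < "$d/names.diff" &&
       cmp -s "$d/names.copy" "$d/names.new"; then
        patched=yes
    else
        patched=no
    fi
fi

echo "diff status: $diff_status, added lines: $added, patched: $patched"
rm -rf "$d"
```

Naming the target file on the patch command line applies the diff to that file regardless of the paths recorded in the diff header.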