Why UNIX filters are cool


Date: 2022-07-25

Filters are a topic that comes up often when working on Linux. But what are they?

Put simply, filters are small programs that take input from STDIN, do some processing on that input, and output the result to STDOUT.

Sometimes a file argument is also supported in addition to STDIN, but the basic principle is the same.
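
Many familiar commands (grep, sort, uniq, head, tr) are filters, and the shell pipe lets us chain them into little pipelines. As a quick illustration, assuming a plain-text file called input.txt, this standard one-liner prints the five most common words in it:

tr -s ' ' '\n' < input.txt \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n5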

What makes filters so cool? Their ingenuity lies precisely in their small size and versatility. They are a great example of the UNIX philosophy principle

“Do one thing and do it well.”

which is often cited as a definition of the UNIX philosophy itself.

Converting TSV to JSON(L)

In my work, I frequently end up with TSV data that would be easier to work with as JSON(L). I finally found an easy conversion tool: visidata.

Using visidata, we can write the following filter that takes input from STDIN (or an optional file argument) and outputs one compact JSON object per line:

#!/bin/sh
# tsv2jsonl: convert TSV (from an optional file argument or STDIN) to JSON Lines
visidata \
    -b --save-filetype "jsonl" \
    < "${1:-/dev/stdin}" 2>/dev/null \
    | jq -c

We can save this as tsv2jsonl and place it somewhere in our $PATH.
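
For the shell to pick it up as a command, the script also needs to be executable. Assuming we keep personal scripts in ~/bin and that ~/bin is on our $PATH, the setup is just:

# make the filter executable so the shell can run it
chmod +x ~/bin/tsv2jsonl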

If we have a TSV file like

x   y
foo bar
faa ber
fuu box

saved as my_file.tsv, then we can convert it to JSONL by doing:

> cat my_file.tsv | tsv2jsonl

{"x":"foo","y":"bar"}
{"x":"faa","y":"ber"}
{"x":"fuu","y":"box"}

That’s it! Just one filter to pipe TSV data through. No options or complex arguments to remember.

Tokenizing the fields

Another task that often comes up is tokenization, which consists of converting strings of sentences into lists of substrings called tokens.

Often tokenization is carried out using specialized libraries (e.g. Moses). However, sensible default tokens can be obtained by crudely splitting on the space character or general whitespace. Sometimes we may also want to get a string – representing a sentence, a paragraph, or whatever – split into all of its individual characters.
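
Before reaching for any tooling, the crudest version of this can already be done with standard filters. For example, splitting the UNIX philosophy motto from above on spaces, one token per line (note how the period stays glued to the last word, an issue we will come back to below):

> echo "Do one thing and do it well." | tr -s ' ' '\n'

Do
one
thing
and
do
it
well.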

To facilitate these default behaviors, I made another filter called jtok:

Usage: jtok [OPTIONS]

Options:
  -t, --token-type [space|chars|whitespace|custom]
  -a, --tokenize-all-fields
  -f, --field-to-tokenize TEXT
  -c, --custom-tokenizer-command TEXT
  --help                          Show this message and exit.

Out of the box, jtok offers (crude) tokenization based on spaces, general whitespace, or individual characters. For example, the JSONL output from above

> cat my_file.tsv | tsv2jsonl

{"x":"foo","y":"bar"}
{"x":"faa","y":"ber"}
{"x":"fuu","y":"box"}

can be tokenized into characters using the following:

> cat my_file.tsv | tsv2jsonl | jtok -a

{"x":["f","o","o"],"y":["b","a","r"]}
{"x":["f","a","a"],"y":["b","e","r"]}
{"x":["f","u","u"],"y":["b","o","x"]}

Here we use jtok -a to indicate that we want all fields tokenized. Tokenization into characters does not need to be specified explicitly, since it is the default; the behavior is controlled by the --token-type flag.
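
Based on the help text above, an equivalent invocation with the token type spelled out explicitly should look something like this, with output identical to the one above:

> cat my_file.tsv | tsv2jsonl | jtok -a -t chars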

We can also tokenize entire sentences if we have longer, non-toy data. As an example, consider some Finnish-English translation data from the FLoRes corpus, stored as JSONL in flores_fin_eng.jsonl.

Let’s tokenize these sentences into whitespace-separated tokens like this:

> head -n1 flores_fin_eng.jsonl | jtok -a -t whitespace | jq -c

{
  "fin": ["Stanfordin", "yliopiston", "lääketieteen", "laitoksen", ... ],
  "eng": ["On", "Monday,", "scientists", "from", "the", "Stanford", ... ]
}
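
Since the result is itself a stream of JSON objects, we can of course keep piping. For instance, a rough per-language token count for that first sentence could be computed with a jq expression along these lines:

> head -n1 flores_fin_eng.jsonl | jtok -a -t whitespace | jq -c '{fin: (.fin | length), eng: (.eng | length)}'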

Messing with the tokenizer

This seems pretty good, but plain whitespace splitting does not work for everything. Consider, for example, the sentence:

> cat walk_to_school.json

{
    "fin": "Ai, hänkinkö kävelee kouluun?",
    "eng": "Oh, he is walking to school as well?"
}

which will not get correctly tokenized using just whitespace:

> cat walk_to_school.json | jtok -a -t whitespace

{
  "fin": ["Ai,", "hänkinkö", "kävelee", "kouluun?"],
  "eng": ["Oh,", "he", "is", "walking", "to", "school", "as", "well?" ]
}

As we can see, some of the tokens have punctuation attached to them.

To get around this, we’ll need an external tokenizer, such as Sacremoses, which by itself also works like a filter:

> echo "Hello, world." | sacremoses tokenize

Hello , world .

We can use sacremoses tokenize as a custom tokenizer command inside jtok:

> cat walk_to_school.json | jtok -a -t custom -c "sacremoses tokenize"

{
  "fin": ["Ai", ",", "hänkinkö", "kävelee", "kouluun", "?"],
  "eng": ["Oh", ",", "he", "is", "walking", "to", "school", "as", "well", "?"]
}

Now sacremoses handles the tokenization and the punctuation comes out correctly.

Conclusion & Vim usage tip

Finally, a tip for Vim users: filters can be used directly inside Vim, e.g. ":.!your_filter.sh" pipes the current line through a filter and replaces it with the output. I learned this trick from a YouTube video by rwxrob. Check out his work for more info on how to integrate UNIX filters into your Vim workflow.
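
For example, with my_file.tsv open in a buffer, the whole file can be converted in place by filtering the entire buffer (%) instead of just the current line (.) through the script:

:%!tsv2jsonl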

That’s it, thanks a lot for reading! 😁 Feel free to take a look at the filter repository on Codeberg if you want to learn more, and don’t hesitate to get in touch over email.

