A Month of Haskell, Day 10 - String Types

Posted on May 19, 2017 by Chris Lumens in month-of-haskell.

Strings are one of the most basic types we deal with when programming, and yet languages typically do a very poor job with them. Haskell doesn’t exactly excel at string handling, but it does a pretty good job. Unfortunately, it also has at least three different string types for you to choose among.

String

The most basic and most obvious string type is simply named String. It’s just a list of Chars:

type String = [Char]

This is a very convenient type. Because a String is just a list, you can manipulate it with any of the many Data.List functions. That module also contains functions to split a string into multiple strings, either on newlines or on whitespace.

Type signatures:

lines :: String -> [String]
words :: String -> [String]
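As a quick illustration of those two functions, here is a minimal sketch that splits a String into lines and then splits each line into words. The names sample and wordsPerLine are made up for this example.

```haskell
module Main where

-- A small multi-line String to experiment with (hypothetical sample data).
sample :: String
sample = "hello brave world\nsecond line"

-- Break a String into lines, then break each line into words.
-- Both 'lines' and 'words' come from the Prelude (and Data.List).
wordsPerLine :: String -> [[String]]
wordsPerLine = map words . lines

main :: IO ()
main = print (wordsPerLine sample)
-- prints [["hello","brave","world"],["second","line"]]
```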

Strings support Unicode, and the Data.Char module gives you lots of ways to work with the individual elements of a String. And as I previously covered, you can use OverloadedStrings to make more things look and act like the built-in type.

So, why not just use String for everything? A lot of the time it’s fine and you will never notice any problem. But a String is a linked list with one node per character, so if you are doing a lot of text-intensive processing you will discover that String can be very, very slow.

That’s where the other types come in.

Text

The text package provides Data.Text, a string representation designed to be fast and efficient. It gives you strict and lazy versions of all the types and functions with the same API. It supports Unicode, too.

It also stomps all over Prelude and Data.List by providing lots of functions with the same names as all that base stuff. This is on purpose - they function the same, just on a different type. It means that unless you are only using one or two functions from Data.Text, you should always import it qualified:

import qualified Data.Text as T

You convert a regular String into Text with the pack function, and you convert the other way with the unpack function:

Type signatures:

pack :: String -> Text
unpack :: Text -> String
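Here is a small sketch of going back and forth between the two types. The helper names countWords and roundTrip are invented for the example; pack, unpack, and words are real Data.Text functions.

```haskell
module Main where

import qualified Data.Text as T

-- Count the words in a String by doing the work on Text.
-- T.words splits on whitespace, like the Prelude's 'words'.
countWords :: String -> Int
countWords = length . T.words . T.pack

-- Converting to Text and back returns the original String
-- for ordinary Unicode text.
roundTrip :: String -> String
roundTrip = T.unpack . T.pack

main :: IO ()
main = do
  print (countWords "the quick  brown fox")
  putStrLn (roundTrip "plain old String")
```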

In some ways, the Data.Text module is more useful than a regular String. It provides functions for converting a whole string to upper or lowercase, for instance, where with String you have to map the Data.Char functions over the list yourself. If you need to do any IO with Text, there’s a module for that too. It should also be imported qualified because it uses many of the same names as the Prelude:

import qualified Data.Text.IO as TIO
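Putting the two imports together, a minimal sketch might look like this. The banner function is hypothetical; toUpper, strip, and TIO.putStrLn are part of the text package.

```haskell
module Main where

import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Trim surrounding whitespace and upper-case the whole Text in one go.
banner :: T.Text -> T.Text
banner = T.toUpper . T.strip

main :: IO ()
main = TIO.putStrLn (banner (T.pack "  hello from text  "))
-- prints HELLO FROM TEXT
```

Note that TIO.putStrLn takes a Text directly, so there is no need to unpack before printing.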

Slowly, third-party Haskell modules are converting from String to Text. It is used almost exclusively in the bindings provided by haskell-gi, so if you are doing anything that involves the Gtk stack you will need to work in terms of Text. One frustrating holdout is the FilePath type in the base library, which is just another name for String. If you do much work with file and directory names, you’ll be stuck using String there.
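In practice that means converting at the boundary. The sketch below assumes the filepath package (System.FilePath) is available, which the post does not otherwise mention, and textExtension is a made-up helper name.

```haskell
module Main where

import qualified Data.Text as T
import System.FilePath (takeExtension)

-- FilePath is a String, so a Text file name has to be unpacked
-- before it can be handed to the FilePath functions.
textExtension :: T.Text -> T.Text
textExtension = T.pack . takeExtension . T.unpack

main :: IO ()
main = putStrLn (T.unpack (textExtension (T.pack "notes.txt")))
-- prints .txt
```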

There’s one other thing to know about this type: by default it is strict, which means the whole thing must live in memory at once. If you are working with very large pieces of Text, you may end up using quite a bit of memory. This is when you’ll want to use the lazy API. It looks just like the default strict API, just with a different import at the beginning:

import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
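If you need both flavors in one module, one option is to give them different qualified names and convert between them explicitly. This is only a sketch: the TL alias and the countLines helper are my own choices, while fromStrict and toStrict are real functions in Data.Text.Lazy.

```haskell
module Main where

import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- The lazy module has the same function names as the strict one;
-- here TL.lines is the lazy counterpart of T.lines.
countLines :: TL.Text -> Int
countLines = length . TL.lines

main :: IO ()
main = do
  let strictText = T.pack "one\ntwo\nthree"
      lazyText   = TL.fromStrict strictText  -- strict to lazy
  print (countLines lazyText)                -- prints 3
  print (TL.toStrict lazyText == strictText) -- prints True
```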

Text is an instance of the IsString type class, which means you can use OverloadedStrings to write literals in your program and have them be interpreted as the right type. This means you don’t have to wrap every string literal in a call to pack, which gets very annoying to look at.
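A minimal sketch of what that looks like, with greeting as an invented name:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- With OverloadedStrings on, this literal is a Text, no pack required.
greeting :: Text
greeting = "hello"

main :: IO ()
main = TIO.putStrLn (T.append greeting ", world")
-- prints hello, world
```

The second literal gets its type from T.append, so it needs no annotation either.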

ByteString

The third type is provided by Data.ByteString, in the bytestring package. It’s really not supposed to be used as a string like the other two, despite the name. It’s more intended to be used as an interchange format for getting data between Haskell and C, or sending data over a network, or streaming binary data to and from disk.

A ByteString is, essentially, a packed array of Word8 values that you work with much like a list. Those are kind of like a Char, but are only eight bits in size and therefore can’t hold most Unicode characters. The point of this is that while it’s possible to convert between String and ByteString, it’s not really a lossless operation and you shouldn’t do it unless you know what you are doing.
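To make the lossy part concrete, here is a sketch that compares two ways of getting a non-ASCII String into a ByteString. It uses Data.ByteString.Char8 and Data.Text.Encoding, neither of which is covered above, so treat it as one possible approach rather than the recommended one. The names original, truncated, and viaUtf8 are made up.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A one-character String: U+03BB, the Greek letter lambda.
original :: String
original = "\955"

-- Char8.pack keeps only the low eight bits of each Char,
-- so the round trip does not give the lambda back.
truncated :: String
truncated = BC.unpack (BC.pack original)

-- Going through Text with an explicit UTF-8 encoding preserves it.
viaUtf8 :: String
viaUtf8 = T.unpack (TE.decodeUtf8 (TE.encodeUtf8 (T.pack original)))

main :: IO ()
main = do
  print (truncated == original)                       -- prints False
  print (viaUtf8 == original)                         -- prints True
  print (BS.length (TE.encodeUtf8 (T.pack original))) -- prints 2
```

Note that decodeUtf8 throws an exception on bytes that are not valid UTF-8, so it is only safe here because the bytes came from encodeUtf8.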

Just like with Data.Text, it also exports a whole bunch of conflicting function names on purpose so you need to import it qualified, too:

import qualified Data.ByteString as BS

You can then use the typical list-like API to manipulate a ByteString. There are familiar IO functions for reading and writing them in a variety of ways. The default type is strict, just like with Text. If you need to manipulate very large items, you should use the lazy API:

import qualified Data.ByteString.Lazy as BS

Again, it provides the same API as the strict version.
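Here is a small sketch showing the two side by side. The BL alias and the payload bytes are my own choices for the example; pack, unpack, length, reverse, and toStrict are real functions in the bytestring package.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Three arbitrary bytes to work with (hypothetical sample data).
payload :: [Word8]
payload = [0x48, 0x69, 0x21]

strictBytes :: BS.ByteString
strictBytes = BS.pack payload

lazyBytes :: BL.ByteString
lazyBytes = BL.pack payload

main :: IO ()
main = do
  print (BS.length strictBytes)                -- prints 3
  print (BL.length lazyBytes)                  -- prints 3
  print (BS.unpack (BS.reverse strictBytes))   -- prints [33,105,72]
  print (BL.toStrict lazyBytes == strictBytes) -- prints True
```

One small difference to watch for: the strict length returns an Int, while the lazy one returns an Int64.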

Conveniently, ByteString is also an instance of the IsString class, allowing you to use the OverloadedStrings language extension for it too. Just keep those literals to ASCII, because each character is truncated to eight bits.
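A minimal sketch of a ByteString literal, along with what happens to a character that does not fit in a byte. The names requestLine and lambdaBytes are invented for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString as BS

-- An ASCII literal becomes a ByteString with one byte per character.
requestLine :: BS.ByteString
requestLine = "GET / HTTP/1.1\r\n"

-- U+03BB (955) does not fit in a Word8, so only its low byte survives.
lambdaBytes :: BS.ByteString
lambdaBytes = "\955"

main :: IO ()
main = do
  print (BS.length requestLine) -- prints 16
  print (BS.unpack lambdaBytes) -- prints [187]
```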

Summary

In summary, here’s when you should use each type:

String - You are using small strings, or do not care about performance, or are doing a lot of file and path name manipulations.
Text - You are doing lots of text manipulations, or are writing a Gtk-based UI program (perhaps others too; I’m not really familiar with other toolkits), or just want to be prepared for the future when most modules have switched.
ByteString - You are reading or writing data from disks or networks, or are using FFI to shuttle data back and forth to a C library, or are manipulating binary data.