A Month of Haskell, Day 10 - String Types
Strings are one of the most basic types we deal with when programming, and yet languages typically do a very poor job with them. Haskell doesn’t exactly excel at string handling, but it does a pretty good job. Unfortunately, it also has at least three different string types for you to choose from.
String
The most basic and most obvious string type is simply named String. It’s just a list of Chars:
type String = [Char]
This is a very convenient type. You can use any of the many Data.List functions to manipulate them. That module also contains functions to split up strings containing newlines or strings containing spaces into multiple strings.
Type signatures:
lines :: String -> [String]
words :: String -> [String]
Strings support Unicode, and the Data.Char
module gives you lots of ways to work with individual elements of a String. And as I
previously covered, you can use OverloadedStrings
to make more things look and act like the built-in type.
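As a small sketch of the functions mentioned above, here is lines, words, and a couple of Data.Char predicates applied to plain Strings. The helper names (wordsPerLine, shout, digitsOnly) are made up for illustration and are not part of any library.

```haskell
import Data.Char (isDigit, toUpper)
import Data.List (intercalate)

-- Split a block of text into lines, then each line into words.
wordsPerLine :: String -> [[String]]
wordsPerLine = map words . lines

-- Upper-case a String one Char at a time using Data.Char.
shout :: String -> String
shout = map toUpper

-- Keep only the decimal digits of a String.
digitsOnly :: String -> String
digitsOnly = filter isDigit

main :: IO ()
main = do
  print (wordsPerLine "hello world\nday 10")
  putStrLn (shout "string types")
  putStrLn (digitsOnly "Day 10 of 30")
  -- Ordinary list functions work too, since a String is just [Char].
  putStrLn (intercalate ", " (words "one two three"))
```

Because String is only a list, nothing here needs a string-specific library: map and filter do the work.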
So, why not just use String for everything? A lot of the time it’s fine and you will never
notice any problem, but if you are doing a lot of text-intensive processing you will discover
that String can be very, very slow, because every character is stored in its own cell of a linked list.
That’s where the other types come in.
Text
The text package provides Data.Text, a string representation
designed to be fast and efficient. It gives you strict and lazy versions of all the types and functions
with the same API. It supports Unicode, too.
It also stomps all over Prelude and Data.List by providing lots of functions with the same names as
all that base stuff. This is on purpose - they function the same, just on a different type. It means
that unless you are only using one or two functions from Data.Text, you should always import it
qualified:
import qualified Data.Text as T
You convert a regular String into Text with the pack function, and you convert the other way
with the unpack function:
Type signatures:
pack :: String -> Text
unpack :: Text -> String
In some ways, the Data.Text module is more useful than a regular String. It provides functions
for converting a whole value between upper and lowercase, for instance, where with String you
have to map toUpper or toLower from Data.Char over each character. If you need
to do any IO with Text, there’s a module for that too. It should also be imported qualified due to
using many of the same names as the Prelude:
import qualified Data.Text.IO as TIO
Slowly, third-party Haskell modules are converting from String to Text. It is used almost exclusively
in all the bindings provided by haskell-gi, so if you are
doing anything that involves the Gtk stack you will need to work in terms of Text. One frustrating
holdout is the FilePath type in the base library, which is just a synonym for String. If you do much work with file and directory
names, you’ll be stuck using String there.
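To make the conversions concrete, here is a minimal sketch that moves between String and strict Text with pack and unpack, uses a couple of Data.Text functions, and drops back to String for a FilePath. The helpers normalise and logPath, and the "logs/" directory, are hypothetical names chosen only for this example.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical helper: tidy up a name that arrives as a String.
normalise :: String -> T.Text
normalise = T.toUpper . T.strip . T.pack

-- Hypothetical helper: FilePath is a String, so unpack the Text first.
logPath :: T.Text -> FilePath
logPath name = "logs/" ++ T.unpack name ++ ".txt"

main :: IO ()
main = do
  -- Data.Text.IO prints Text directly, with no unpack needed.
  TIO.putStrLn (normalise "  haskell  ")
  putStrLn (logPath (T.pack "day10"))
```

The pattern is to convert once at the edges (user input, file names) and keep everything in between as Text.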
There’s one other thing to know about this type: By default, it is strict, which means the whole thing
must live in memory at once. If you are working with very large pieces of Text, you may end up
using quite a bit of memory. This is when you’ll want to use the lazy API. It looks just like the
default strict API, just with a different import at the beginning:
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
Text is an instance of the IsString type class, which means you can use OverloadedStrings to write
literals in your program and have them be interpreted as the right type. This means you don’t have to
call pack on string literals, which is very annoying to look at.
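A short sketch of what that looks like in practice, for both the strict and lazy flavours. The names greeting and lazyGreeting are illustrative only; the TL alias for Data.Text.Lazy is my own choice so that both modules can be imported side by side.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.Text.Lazy as TL

-- With OverloadedStrings, this literal is a strict Text; no pack needed.
greeting :: T.Text
greeting = "Hello, Text"

-- The same literal syntax works for lazy Text.
lazyGreeting :: TL.Text
lazyGreeting = "Hello, lazy Text"

main :: IO ()
main = do
  -- The "!" literal is inferred to be Text from T.append's signature.
  TIO.putStrLn (T.append greeting "!")
  -- toStrict collapses the lazy value into a single strict Text.
  TIO.putStrLn (TL.toStrict lazyGreeting)
```

The type signature (or the function a literal is passed to) decides which type the literal becomes, so it helps to keep top-level signatures explicit when the extension is on.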
ByteString
The third type is provided by Data.ByteString, in the bytestring package. Despite the name, it’s really not supposed to be used as a string like the other two. It’s intended more as an interchange format for getting data between Haskell and C, sending data over a network, or streaming binary data to and from disk.
A ByteString is, essentially, a packed sequence of Word8 values.
Those are kind of like a Char, but are only eight bits in size and therefore can’t hold an arbitrary Unicode character.
The point of this is that while it’s possible to convert between String and ByteString, it’s not
really a lossless operation and you shouldn’t do it unless you know what you are doing.
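Here is a minimal sketch of why the conversion is lossy. It uses Data.ByteString.Char8 for the naive conversion and, as one possible alternative that this post does not otherwise cover, Data.Text.Encoding from the text package to encode explicitly as UTF-8. The names truncated, encoded, and roundTrip are illustrative.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- "\955" is the Greek letter lambda (U+03BB).
-- Char8.pack keeps only the low 8 bits of each Char, so the character
-- collapses to the single byte 0xBB and the original is lost.
truncated :: [Word8]
truncated = BS.unpack (BC.pack "\955")

-- Encoding through Text as UTF-8 keeps the character, as two bytes.
encoded :: [Word8]
encoded = BS.unpack (TE.encodeUtf8 (T.pack "\955"))

-- Decoding those UTF-8 bytes gives back the original String.
roundTrip :: String
roundTrip = T.unpack (TE.decodeUtf8 (TE.encodeUtf8 (T.pack "\955")))

main :: IO ()
main = do
  print truncated            -- [187]
  print encoded              -- [206,187]
  print (roundTrip == "\955") -- True
```

Note that decodeUtf8 throws an exception on bytes that are not valid UTF-8, so it is only safe here because the input was produced by encodeUtf8.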
Just like with Data.Text, it also exports a whole bunch of conflicting function names on purpose so you
need to import it qualified, too:
import qualified Data.ByteString as BS
You can then use the typical list-like API to manipulate a ByteString. There are familiar IO functions
for reading and writing them in a variety of ways. The default type is strict, just like with Text. If
you need to manipulate very large items, you should use the lazy API:
import qualified Data.ByteString.Lazy as BS
Again, it provides the same API as the strict version.
Conveniently, ByteString is also an instance of the IsString class allowing you to use the
OverloadedStrings language extension for it too.
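A quick sketch of the list-like API together with an OverloadedStrings literal, using the strict type. The names header, headerBytes, and hasHeader are hypothetical. As far as I know the IsString instance keeps only eight bits per character, so it seems safest to stick to ASCII in ByteString literals.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS

-- With OverloadedStrings this literal is a strict ByteString.
header :: BS.ByteString
header = "HTTP"

-- The same bytes built explicitly from Word8 values (ASCII codes).
headerBytes :: BS.ByteString
headerBytes = BS.pack [72, 84, 84, 80]

-- Hypothetical helper: does a payload start with the header bytes?
hasHeader :: BS.ByteString -> Bool
hasHeader = BS.isPrefixOf header

main :: IO ()
main = do
  print (header == headerBytes)
  print (BS.length header)
  -- unpack exposes the raw Word8 values rather than characters.
  print (BS.unpack (BS.take 2 header))
  print (hasHeader (BS.append header "/1.1"))
```

Everything here works on bytes, not characters, which is a good reminder that this type is meant for binary data.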
Summary
In summary, here’s when you should use each type:
String - You are using small strings, or do not care about performance, or are doing a lot of file and path name manipulations.
Text - You are doing lots of text manipulations, or are writing a Gtk-based UI program (perhaps others too; I’m not really familiar with other toolkits), or just want to be prepared for the future when most modules have switched.
ByteString - You are reading or writing data from disks or networks, or are using FFI to shuttle data back and forth to a C library, or are manipulating binary data.