A Month of Haskell, Day 10 - String Types
Strings are one of the most basic types we deal with when programming, and yet languages typically do a very poor job with them. Haskell doesn’t exactly excel at string handling, but it does a pretty good job. Unfortunately, it also has at least three different string types for you to choose among.
String
The most basic and most obvious string type is simply named String. It’s just a list of Chars:
type String = [Char]
This is a very convenient type. You can use any of the many Data.List functions to manipulate them. That module also contains functions to split up strings containing newlines or strings containing spaces into multiple strings.
Type signatures:
lines :: String -> [String]
words :: String -> [String]
Strings support Unicode and the Data.Char module gives you lots of ways to work with individual elements of a String. And as I previously covered, you can use OverloadedStrings to make more things look and act like the built-in type.
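Here’s a small sketch of what that list-based convenience looks like in practice. The function names (wordsPerLine, shout, digitsOnly) are made up for illustration; only lines, words, and the Data.Char functions come from base.

```haskell
import Data.Char (isDigit, toUpper)

-- Count the words on each line of a block of text.
-- lines splits on newlines, words splits on whitespace.
wordsPerLine :: String -> [Int]
wordsPerLine = map (length . words) . lines

-- Uppercase a String one Char at a time with Data.Char.toUpper.
shout :: String -> String
shout = map toUpper

-- Because a String is just a list, ordinary list functions like
-- filter work on it directly.
digitsOnly :: String -> String
digitsOnly = filter isDigit
```

Nothing here is string-specific machinery. It’s all plain list processing, which is exactly why String is so pleasant for small jobs.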
So, why not just use String for everything? A lot of the time it’s fine and you will never notice any problem, but if you are doing a lot of text-intensive processing you will discover that String can be very, very slow, because every character is a separate cell in a linked list.
That’s where the other types come in.
Text
The text package provides Data.Text, a string representation designed to be fast and efficient. It gives you strict and lazy versions of all the types and functions with the same API. It supports Unicode, too.
It also stomps all over Prelude and Data.List by providing lots of functions with the same names as all that base stuff. This is on purpose - they function the same, just on a different type. It means that unless you are only using one or two functions from Data.Text, you should always import it qualified:
import qualified Data.Text as T
You convert a regular String into Text with the pack function, and you convert the other way with the unpack function.
Type signatures:
pack :: String -> Text
unpack :: Text -> String
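As a rough sketch of how the two conversions fit together, here is a hypothetical helper (trimString is my name, not something from the text package) that borrows T.strip to trim a String by going through Text and back:

```haskell
import qualified Data.Text as T

-- Hypothetical helper: trim leading and trailing whitespace from a
-- String by converting to Text, stripping, and converting back.
trimString :: String -> String
trimString = T.unpack . T.strip . T.pack

-- Text has its own length function; it counts characters.
textLength :: String -> Int
textLength = T.length . T.pack
```

Both pack and unpack walk the whole string, so in real code you’d probably want to convert once at the edges of your program instead of bouncing back and forth like this.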
In some ways, the Data.Text module is more useful than using a regular String. It provides functions for converting a whole value between upper and lowercase, for instance, whereas with String you have to map the Data.Char functions over the list yourself. If you need to do any IO with Text, there’s a module for that too. It should also be imported qualified, since it uses many of the same names as the Prelude:
import qualified Data.Text.IO as TIO
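To give a feel for how the two modules work together, here’s a minimal sketch. The names headline and printHeadlines are invented for this example; T.toUpper, T.strip, T.lines, and TIO.putStrLn are the library functions.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Trim the value and uppercase it in one go; no mapping over Chars.
headline :: T.Text -> T.Text
headline = T.toUpper . T.strip

-- Print every line of the input as a headline, using the Text
-- version of putStrLn so nothing is converted back to String.
printHeadlines :: T.Text -> IO ()
printHeadlines = mapM_ (TIO.putStrLn . headline) . T.lines
```

You could feed it a file with something like TIO.readFile "notes.txt" >>= printHeadlines, where notes.txt is just a placeholder file name.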
Slowly, third-party Haskell modules are converting from String to Text. It is used almost exclusively in all the bindings provided by haskell-gi, so if you are doing anything that involves the Gtk stack you will need to work in terms of Text. One frustrating holdout is the FilePath type in the base library, which is just a synonym for String. If you do much work with file and directory names, you’ll be stuck using String there.
There’s one other thing to know about this type: by default it is strict, which means the whole thing must live in memory at once. If you are working with very large pieces of Text, you may end up using quite a bit of memory. This is when you’ll want to use the lazy API. It looks just like the default strict API, just with different imports at the beginning:
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
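One thing worth knowing is that strict and lazy Text are separate types even though the API looks the same, so you have to convert explicitly when you mix them. A small sketch, with the lazy module imported as TL here so both can appear side by side (countLines, toLazy, and toStrictText are illustrative names):

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Count lines in a lazy Text. The input can be consumed chunk by
-- chunk instead of being held in memory all at once.
countLines :: TL.Text -> Int
countLines = length . TL.lines

-- Convert a strict Text to a lazy one.
toLazy :: T.Text -> TL.Text
toLazy = TL.fromStrict

-- Convert a lazy Text to a strict one; this forces the whole value
-- into memory.
toStrictText :: TL.Text -> T.Text
toStrictText = TL.toStrict
```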
Text is an instance of the IsString type class, which means you can use OverloadedStrings to write literals in your program and have them be interpreted as the right type. This means you don’t have to call pack on every string literal, which gets very annoying to look at.
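A minimal sketch of what that buys you, assuming the extension is switched on at the top of the file (greeting and greet are names I made up for the example):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

-- The type signature says Text, so the literal becomes a Text.
-- No call to pack is needed.
greeting :: T.Text
greeting = "Hello, world"

-- Literals inside an expression are inferred as Text as well.
greet :: T.Text -> T.Text
greet name = T.concat ["Hello, ", name, "!"]
```

Without the extension, each of those literals would need to be wrapped in T.pack.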
ByteString
The third type is provided by Data.ByteString, in the bytestring package. Despite the name, it’s really not supposed to be used as a string like the other two. It’s intended more as an interchange format for getting data between Haskell and C, sending data over a network, or streaming binary data to and from disk.
A ByteString is, essentially, a sequence of Word8 values. Those are kind of like a Char, but are only eight bits in size and therefore can’t hold an arbitrary Unicode character.
The point of this is that while it’s possible to convert between String and ByteString, it’s not really a lossless operation and you shouldn’t do it unless you know what you are doing.
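To make the lossiness concrete, here’s a sketch comparing two ways of turning a String into bytes. The names truncated, encoded, and decoded are mine. Data.ByteString.Char8.pack keeps only the low eight bits of each Char, while going through Text and Data.Text.Encoding gives you real UTF-8.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Lossy: each Char is truncated to its low eight bits.
truncated :: String -> BS.ByteString
truncated = BC.pack

-- Lossless: encode the characters as UTF-8 bytes.
encoded :: String -> BS.ByteString
encoded = TE.encodeUtf8 . T.pack

-- Decode UTF-8 bytes back into a String, reporting invalid input
-- as a Left instead of throwing.
decoded :: BS.ByteString -> Either String String
decoded = either (Left . show) (Right . T.unpack) . TE.decodeUtf8'
```

For the single character "\955" (a Greek lambda, U+03BB), truncated produces the one byte 0xBB, while encoded produces the two bytes 0xCE 0xBB, and only the second can be turned back into the original string.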
Just like with Data.Text, it also exports a whole bunch of conflicting function names on purpose so you need to import it qualified, too:
import qualified Data.ByteString as BS
You can then use the typical list-like API to manipulate a ByteString. There are familiar IO functions for reading and writing them in a variety of ways. The default type is strict, just like with Text. If you need to manipulate very large items, you should use the lazy API:
import qualified Data.ByteString.Lazy as BS
Again, it provides the same API as the strict version.
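As an illustration of the kind of binary work ByteString is meant for, here’s a sketch using the strict API that checks whether some bytes start like a PNG file. The helper names (pngMagic, looksLikePng, fileLooksLikePng) are hypothetical; BS.pack, BS.isPrefixOf, and BS.readFile are from Data.ByteString.

```haskell
import qualified Data.ByteString as BS

-- The first four bytes of the PNG signature: 0x89 followed by "PNG".
pngMagic :: BS.ByteString
pngMagic = BS.pack [0x89, 0x50, 0x4E, 0x47]

-- A list-like prefix test, just on bytes instead of Chars.
looksLikePng :: BS.ByteString -> Bool
looksLikePng = BS.isPrefixOf pngMagic

-- Read a whole file strictly and run the same check on it.
fileLooksLikePng :: FilePath -> IO Bool
fileLooksLikePng path = looksLikePng <$> BS.readFile path
```

This only looks at the start of the signature, so treat it as a quick sniff test and not a real validator.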
Conveniently, ByteString is also an instance of the IsString class, allowing you to use the OverloadedStrings language extension for it too.
Summary
In summary, here’s when you should use each type:
String - You are using small strings, or do not care about performance, or are doing a lot of file and path name manipulations.
Text - You are doing lots of text manipulations, or are writing a Gtk-based UI program (perhaps others too; I’m not really familiar with other toolkits), or just want to be prepared for the future when most modules have switched.
ByteString - You are reading or writing data from disks or networks, or are using FFI to shuttle data back and forth to a C library, or are manipulating binary data.