How languages work #1: String interpolation

As I’m writing my own language Cara, I am forced to revisit (or learn for the first time) how programming languages pull off implementing their features.

This includes things like lexing, do-notation, type inference, pattern matching and destructuring, operator precedence and so on.

Some ideas will be quite profound while other might feel quite simple. Let’s start with the latter!

String interpolation

Let’s get our problem statement out first:

`Hello ${name}! Your name is ${name.length} letters long.`
// assuming name == "Martin", will result in:
"Hello Martin! Your name is 6 letters long."

That is, whenever you see the string interpolation delimiters (${...}), evaluate the expression inside them, convert the result to string and concatenate it with the rest of the string.

We might even take inspiration from Python and add a debugging variant:

`${name=} and ${name.length=}`
// assuming name == "Martin", will result in:
"name=Martin and name.length=6"

Overview

How (and where) to implement this? It seems we’ll need to parse the expression inside the string at some point, so that we can run it.

In the debugging variant, we’ll also need access to the verbatim source code (before parsing).

We could do this in the interpreter. That would be a lot of repeated work though—all this parsing would have to be repeated everytime you encountered a string.

Skipping the parser for the moment, we could do this in the lexer, but that would breach separation of concerns: the lexer would have to know about AST nodes, how appending strings works etc.

⚠️ EDIT 2023-07-28: Hayleigh Thompson made me aware of some lexer-only approaches:

Denis Defreyne solves this with modal lexers: instead of a StringToken "one plus two is ${1+2}." you’d have these:
[ StringStart
, StringPartLit "one plus two is "
, StringInterpStart
, Number 1
, Plus
, Number 2
, StringInterpEnd
, StringPartLit "."
, StringEnd
]
More on modal lexers also on the Oil Shell blog.

So, solving this in a lexer does make a lot of sense! (I’ll keep the rest of the article focused on the parser approach though, having already written it 😅.)

It feels like the parser (or some compiler stage right after parsing) is the right place to do things in: we know about details of AST at this point, and it’s a preprocessing job to be done once, before any runtime looping.

Assuming our lexer has already returned a token like:

StringToken "Hello ${name}! Length: ${name.length}"

Our parser can take this token and do some post-processing on it: if there are no ${...} substrings, it will just return a string expression with the contents of the token ("Hello...").

But if ${...} is found, the parser will return something equivalent to:

"Hello " 
 ++ toString name
 ++ "! Length: "
 ++ toString name.length

And in the debugging case, a string token like:

StringToken "The ${name=} and ${name.length=}"`

should be post-processed to something equivalent to:

"The name=" 
 ++ toString name
 ++ " and name.length="
 ++ toString name.length

Disclaimers

The above is simple to understand, and as we’ll see shortly, it’s just a bit fiddly with recursion and nesting.

For simplicity I’m choosing a solution that will chomp things one character at a time. There are other variants (string-splitting the contents on the ${ delimiter, r̸̘̔e̴̩͊g̸̯̾ȇ̴̼͇̗̈́̽x̵̺͑̆͑̈́̑͂̔̊̚è̸̺s̴̘̃ etc.); in a production-ready compiler you’ll likely want to splurge on a full-fledged parser!

There are also many edge cases to handle:

unbalanced delimiters: "Hello ${name"
empty expression: "Hello ${}" and "Hello ${=}"
escaping: "Hello \${name}" and "Hello $\{name}"
expressions containing the delimiters: "Hello ${{a:1}}"
nested strings: "Hello ${"World"}"
- …with string interpolation: "Hello ${"World ${num}"}"

And I won’t be dealing with those in this post. High-level idea only, strictly on the happy path!

Test suite

Let’s first make a quick test suite for all the interesting (happy path) cases.

It will be a bit less unit- and a bit more end-to-end-. In it, we’ll match the string input to an expected interpolated string output, after interpreting with a hardcoded environment.

For this purpose I’ve made a little toy language that only has expressions (that’s right, you can’t even bind an expression to a name).

type Expr
    = Str String
    | Int Int
    | Append Expr Expr
    | ToString Expr
    | Var String

The interpreter will take these and the env and turn them into a Value:

type Value
    = VStr String
    | VInt Int
    | VError

interpret : Dict String Value -> Expr -> Value

I’m cheating here, normally you’d return Result Error Value instead of having VError be one of the values. That would only slow us down though.

And finally, the star of the show, postprocessString : String -> Expr.

In the above link, as our starting point, it “does nothing”: it wraps the string in the Str Expr constructor:

postprocessString : String -> Expr
postprocessString string =
    Str string

And the test suite rightfully complains:

The algorithm

Let’s start fleshing the postprocessString function out (Ellie).

We’ll create a loop, looking at the next character, keeping track of the expression we accumulated so far (see above for the expected shape of the final product!) and of the string we need to add to that expression.

If we see ${, we need to update the accumulated expression and start chomping the expression to be interpolated, until we see the final }.
If we see any other character, we remember it and continue.
If we run out of characters, we update and return the accumulated expression.

Now this will look differently in various languages; Elm has neither mutation nor loops, so we do things with recursion:

postprocessString : String -> Expr
postprocessString string =
    let
        append : Maybe Expr -> Expr -> Expr
        append soFar expr =
            case soFar of
                Nothing     -> expr
                Just soFar_ -> Append soFar_ expr

        go : Maybe Expr -> String -> List Char -> Expr
        go accExpr accString todos =
            case todos of
                -- base case
                [] ->
                    append accExpr (Str accString)

                '$' :: '{' :: rest ->
                    Debug.todo "handle ${"

                whatever :: rest ->
                    go
                        accExpr
                        (accString ++ String.fromChar whatever)
                        rest
    in
    go Nothing "" (String.toList string)

In the middle case, we’ll need to switch modes and chomp the expression to interpolate. Let’s move that out to its own function:

'$' :: '{' :: rest ->
    goInner
        (Just (append accExpr (Str accString)))
        ""
        rest

And flesh goInner out in a very similar manner (Ellie):

goInner accExpr accSource todos =
    case todos of
        -- no } found!
        [] ->
            append accExpr (Str ("${" ++ accSource))

        '}' :: rest ->
            let
                parsed : Expr
                parsed = Debug.todo "parse accSource"
            in
            go
                (Just (append accExpr (ToString parsed)))
                ""
                rest

        whatever :: rest ->
            goInner
                accExpr
                (accSource ++ String.fromChar whatever)
                rest

You can see that in this inner loop we’re looking for the } ending delimiter. If we don’t find one, it’s potentially an error state, or we could make the ${ inert, which is what I did.

If we do find one, we now have a source code to lex and parse! (We’ll deal with that in a minute.) After parsing the inner expression we’ll wrap it in a call to toString().

Otherwise we just continue searching for the } and noting the characters found along the way.

Parsing

We now need to take the string inside accSource and parse it into an expression.

To not make the blogpost longer than it already is, I’m cheating again: I’m skipping the lexing phase, my parse function only ever parses Vars, and it never fails:

parse : String -> Expr
parse source =
    Var source

In a real language this would not fly: you’d plug in the whole machinery, thus being able to have arbitrary expressions inside ("hello ${1 + 3}"), and there would inevitably be some error handling to do (ie. String -> Result ParseError Expr).

Nevertheless, this will do for now. (Ellie)

Some tests passing

Debug mode

We still haven’t touched the last part: ie. the "${name=}" equal-sign mode.

Let’s quickly add that in. We’ll add a new case to goInner:

[]                 -> ...
'=' :: '}' :: rest -> Debug.todo "debug mode"
'}' :: rest        -> ...
whatever :: rest   -> ...

I’m indicating the position in which to add this case, because the order matters. Some care needs to be taken to make sure the = doesn’t get chomped prematurely (making this newly added case never fire). Anyways, the above should be fine.

What to do inside the new case? The same parsing we did in the '}' :: rest case, just with a different new appended expression.

Recall the difference between the debug and non-debug interpolation in our analysis above:

"Hello {name}"
-->
"Hello " ++ toString name

"Hello {name=}"
-->
"Hello name=" ++ toString name

You can think of the last expression as:

"Hello " ++ "name=" ++ toString name

which should illustrate how to implement this. We do have the source string ("name") in our accSource variable, so instead of

append accExpr (ToString parsed)

we’ll return:

append accExpr
    (Append
        (Str (accSource ++ "="))
        (ToString parsed)
    )

And would you look at that, the whole test suite is passing now! (Ellie)

All tests passing

Conclusion

String interpolation is relatively simple: parse the string inside your string expression, distinguishing between inert string content and the interpolation delimiters (here, "${...}").

Finding the delimiters, parse the source code inside them (presumably using your expression parser instead of a top-level declaration one).

Build up the final expression by appending the inert strings around the delimiters with the toString(parsedExpression) function calls, one for each interpolation found.

In functional terms, this could be written as:

stringContent -- "Hello ${name}! ${age=}"
    |> parseDelimiters
       {- [ Inert "Hello "
          , Interpolation "name"
          , Inert "! "
          , DebugInterpolation "age"
          ]
       -}
    |> List.concatMap parseInterpolation
       {- [ Str "Hello "
          , ToString (Var "name")
          , Str "! "
          , Str "age="
          , ToString (Var "age")
          ]
       -}
    |> List.foldr Append (Str "")
       {- Append Str "Hello "
           (Append (ToString (Var "name"))
            (Append (Str "! ")
             (Append (Str "age=")
              (Append (ToString (Var "age"))
                      (Str "")))))
       -}

where parseInterpolation would look like

parseInterpolation : StringContent -> List Expr
parseInterpolation content =
    case content of
        Inert str ->
            [ Str str ]

        Interpolation source ->
            [ ToString (parse source) ]

        DebugInterpolation source ->
            [ Str (source ++ "=")
            , ToString (parse source)
            ]

Happy interpolating!

Demystifying Pratt Parsers

Notes from Elm Camp 2024

Martin Janiczek