Google Go – Golang, next generation language?

For those unfamiliar, Google Go, or Golang, has just celebrated its 4th birthday. It’s still relatively new, although critics are already hailing it as a next generation language with arguably the right community platform to push it into mainstream use. It was first announced in Nov 2009 by three Google engineers: Robert Griesemer, Rob Pike, and Ken Thompson. What follows is from the Golang blog, written from Google’s Sydney office by Andrew Gerrand and colleagues. Check out the first Go program, written in Feb 2008.

Also, here’s a great YouTube clip introducing the Google Go team. – Leo, 29th Nov 2013

 

The Go Blog

Text normalization in Go

26 November 2013

Introduction

An earlier post talked about strings, bytes and characters in Go. I’ve been working on various packages for multilingual text processing for the go.text repository. Several of these packages deserve a separate blog post, but today I want to focus on go.text/unicode/norm, which handles normalization, a topic touched upon in the strings article and the subject of this post. Normalization works at a higher level of abstraction than raw bytes.

To learn pretty much everything you ever wanted to know about normalization (and then some), Annex 15 of the Unicode Standard is a good read. A more approachable article is the corresponding Wikipedia page. Here we focus on how normalization relates to Go.

What is normalization?

There are often several ways to represent the same string. For example, an é (e-acute) can be represented in a string as a single rune (“\u00e9”) or an ‘e’ followed by an acute accent (“e\u0301”). According to the Unicode standard, these two are “canonically equivalent” and should be treated as equal.

Using a byte-to-byte comparison to determine equality would clearly not give the right result for these two strings. Unicode defines a set of normal forms such that if two strings are canonically equivalent and are normalized to the same normal form, their byte representations are the same.
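Since go.text isn’t needed to see the problem, a minimal stdlib-only sketch can show that the two encodings of é compare unequal byte for byte, even though they render identically:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	composed := "\u00e9"    // é as a single precomposed code point (NFC)
	decomposed := "e\u0301" // 'e' followed by a combining acute accent (NFD)

	// The strings render identically, but their bytes differ.
	fmt.Println(composed == decomposed)          // false
	fmt.Printf("% x | % x\n", composed, decomposed) // c3 a9 | 65 cc 81

	// They even differ in length, both in bytes and in runes.
	fmt.Println(len(composed), len(decomposed)) // 2 3
	fmt.Println(utf8.RuneCountInString(composed), utf8.RuneCountInString(decomposed)) // 1 2
}
```

Equality only holds after both strings are normalized to the same form.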

Unicode also defines a “compatibility equivalence” to equate characters that represent the same characters, but may have a different visual appearance. For example, the superscript digit ‘⁹’ and the regular digit ‘9’ are equivalent in this form.

For each of these two equivalence forms, Unicode defines a composing and decomposing form. The former replaces runes that can combine into a single rune with this single rune. The latter breaks runes apart into their components. This table shows the names, all starting with NF, by which the Unicode Consortium identifies these forms:

                            Composing   Decomposing
Canonical equivalence       NFC         NFD
Compatibility equivalence   NFKC        NFKD

Go’s approach to normalization

As mentioned in the strings blog post, Go does not guarantee that characters in a string are normalized. However, the go.text packages can compensate. For example, the collate package, which can sort strings in a language-specific way, works correctly even with unnormalized strings. The packages in go.text do not always require normalized input, but in general normalization may be necessary for consistent results.

Normalization isn’t free but it is fast, particularly for collation and searching or if a string is either in NFD or in NFC and can be converted to NFD by decomposing without reordering its bytes. In practice, 99.98% of the web’s HTML page content is in NFC form (not counting markup, in which case it would be more). By far most NFC can be decomposed to NFD without the need for reordering (which requires allocation). Also, it is efficient to detect when reordering is necessary, so we can save time by doing it only for the rare segments that need it.

To make things even better, the collation package typically does not use the norm package directly, but instead uses the norm package to interleave normalization information with its own tables. Interleaving the two problems allows for reordering and normalization on the fly with almost no impact on performance. The cost of on-the-fly normalization is compensated by not having to normalize text beforehand and ensuring that the normal form is maintained upon edits. The latter can be tricky. For instance, the result of concatenating two NFC-normalized strings is not guaranteed to be in NFC.
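The concatenation pitfall can be seen with a stdlib-only sketch. Verifying the normal form itself would require the norm package, so here we only compare the concatenation against the precomposed (NFC) spelling:

```go
package main

import "fmt"

func main() {
	// Two strings that are each fine NFC on their own.
	s1 := "cafe"
	s2 := "\u0301" // a combining acute accent

	got := s1 + s2      // renders as "café", but the é is decomposed
	want := "caf\u00e9" // "café" with a precomposed é, i.e. the NFC form

	fmt.Println(got, want)   // both display as café
	fmt.Println(got == want) // false: the concatenation is not in NFC
}
```

So code that maintains a normal form across edits must re-normalize around the edit point, which is exactly what the interleaved tables avoid paying for on every operation.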

Of course, we can also avoid the overhead outright if we know in advance that a string is already normalized, which is often the case.

Why bother?

After all this discussion about avoiding normalization, you might ask why it’s worth worrying about at all. The reason is that there are cases where normalization is required and it is important to understand what those are, and in turn how to do it correctly.

Before discussing those, we must first clarify the concept of ‘character’.

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes. For example, an ‘e’ and ‘◌́’ (acute “\u0301”) can combine to form ‘é’ (“e\u0301” in NFD). Together these two runes are one character. The definition of a character may vary depending on the application. For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents). The normalization algorithm processes one character at a time.

Theoretically, there is no bound to the number of runes that can make up a Unicode character. In fact, there are no restrictions on the number of modifiers that can follow a character and a modifier may be repeated, or stacked. Ever seen an ‘e’ with three acutes? Here you go: ‘é́́’. That is a perfectly valid 4-rune character according to the standard.
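A quick stdlib-only check (using unicode/utf8, not the norm package) confirms the rune and byte counts for such a stacked character:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// 'e' followed by three combining acute accents: one "character",
	// four runes, seven bytes (each U+0301 is two bytes in UTF-8).
	c := "e\u0301\u0301\u0301"
	fmt.Println(utf8.RuneCountInString(c)) // 4
	fmt.Println(len(c))                    // 7
}
```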

As a consequence, even at the lowest level, text needs to be processed in increments of unbounded chunk sizes. This is especially awkward with a streaming approach to text processing, as used by Go’s standard Reader and Writer interfaces, as that model potentially requires any intermediate buffers to have unbounded size as well. Also, a straightforward implementation of normalization will have an O(n²) running time.

There are really no meaningful interpretations for such large sequences of modifiers for practical applications. Unicode defines a Stream-Safe Text format, which allows capping the number of modifiers (non-starters) to at most 30, more than enough for any practical purpose. Subsequent modifiers will be placed after a freshly inserted Combining Grapheme Joiner (CGJ or U+034F). Go adopts this approach for all normalization algorithms. This decision gives up a little conformance but gains a little safety.

Writing in normal form

Even if you don’t need to normalize text within your Go code, you might still want to do so when communicating to the outside world. For example, normalizing to NFC might compact your text, making it cheaper to send down a wire. For some languages, like Korean, the savings can be substantial. Also, some external APIs might expect text in a certain normal form. Or you might just want to fit in and output your text as NFC like the rest of the world.

To write your text as NFC, use the unicode/norm package to wrap your io.Writer of choice:

wc := norm.NFC.Writer(w)
defer wc.Close()
// write as before...

If you have a small string and want to do a quick conversion, you can use this simpler form:

norm.NFC.Bytes(b)

Package norm provides various other methods for normalizing text. Pick the one that suits your needs best.

Catching look-alikes

Can you tell the difference between ‘K’ (“\u004B”) and ‘K’ (Kelvin sign “\u212A”) or ‘Ω’ (“\u03a9”) and ‘Ω’ (Ohm sign “\u2126”)? It is easy to overlook the sometimes minute differences between variants of the same underlying character. It is generally a good idea to disallow such variants in identifiers or anything where deceiving users with such look-alikes can pose a security hazard.

The compatibility normal forms, NFKC and NFKD, will map many visually nearly identical forms to a single value. Note that it will not do so when two symbols look alike, but are really from two different alphabets. For example the Latin ‘o’, Greek ‘ο’, and Cyrillic ‘о’ are still different characters as defined by these forms.
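Folding these variants together requires the NFKC/NFKD tables in go.text, but a stdlib-only sketch at least makes the raw difference visible:

```go
package main

import "fmt"

func main() {
	latinK := "K"      // U+004B LATIN CAPITAL LETTER K
	kelvin := "\u212a" // U+212A KELVIN SIGN

	fmt.Println(latinK, kelvin)             // visually identical in most fonts
	fmt.Println(latinK == kelvin)           // false: different code points
	fmt.Printf("%+q %+q\n", latinK, kelvin) // "K" "\u212a"
}
```

Normalizing with NFKC would map the Kelvin sign to the plain ‘K’, making the two compare equal.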

Correct text modifications

The norm package might also come to the rescue when one needs to modify text. Consider a case where you want to search and replace the word “cafe” with its plural form “cafes”.  A code snippet could look like this.

s := "We went to eat at multiple cafe"
cafe := "cafe"
if p := strings.Index(s, cafe); p != -1 {
    p += len(cafe)
    s = s[:p] + "s" + s[p:]
}
fmt.Println(s)

This prints “We went to eat at multiple cafes” as desired and expected. Now consider our text contains the French spelling “café” in NFD form:

s := "We went to eat at multiple cafe\u0301"

Using the same code from above, the plural “s” would still be inserted after the ‘e’, but before the acute, resulting in “We went to eat at multiple cafeś”. This behavior is undesirable.

The problem is that the code does not respect the boundaries between multi-rune characters and inserts a rune in the middle of a character. Using the norm package, we can rewrite this piece of code as follows:

s := "We went to eat at multiple cafe\u0301"
cafe := "cafe"
if p := strings.Index(s, cafe); p != -1 {
    p += len(cafe)
    if bp := norm.FirstBoundary(s[p:]); bp > 0 {
        p += bp
    }
    s = s[:p] + "s" + s[p:]
}
fmt.Println(s)

This may be a contrived example, but the gist should be clear. Be mindful of the fact that characters can span multiple runes. Generally these kinds of problems can be avoided by using search functionality that respects character boundaries (such as the planned go.text/search package).

Iteration

Another tool provided by the norm package that may help dealing with character boundaries is its iterator, norm.Iter. It iterates over characters one at a time in the normal form of choice.

Performing magic

As mentioned earlier, most text is in NFC form, where base characters and modifiers are combined into a single rune whenever possible. For the purpose of analyzing characters, it is often easier to handle runes after decomposition into their smallest components. This is where the NFD form comes in handy. For example, the following piece of code creates a transform.Transformer that decomposes text into its smallest parts, removes all accents, and then recomposes the text into NFC:

import (
    "unicode"

    "code.google.com/p/go.text/transform"
    "code.google.com/p/go.text/unicode/norm"
)

isMn := func(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)

The resulting Transformer can be used to remove accents from an io.Reader of choice as follows:

r = transform.NewReader(r, t)
// read as before ...

This will, for example, convert any mention of “cafés” in the text to “cafes”, regardless of the normal form in which the original text was encoded.

Normalization info

As mentioned earlier, some packages precompute normalizations into their tables to minimize the need for normalization at run time. The type norm.Properties provides access to the per-rune information needed by these packages, most notably the Canonical Combining Class and decomposition information. Read the documentation for this type if you want to dig deeper.

Performance

To give an idea of the performance of normalization, we compare it against the performance of strings.ToLower. The sample in the first row is both lowercase and NFC and can in every case be returned as is. The second sample is neither and requires writing a new version.

Input                 ToLower   NFC Append   NFC Transform   NFC Iter
nörmalization         199 ns    137 ns       133 ns          251 ns (621 ns)
No\u0308rmalization   427 ns    836 ns       845 ns          573 ns (948 ns)

The column with the results for the iterator shows measurements both with and without initialization of the iterator; the iterator contains buffers that do not need to be reinitialized upon reuse.

As you can see, detecting whether a string is normalized can be quite efficient. A lot of the cost of normalizing in the second row is for the initialization of buffers, the cost of which is amortized when one is processing larger strings. As it turns out, these buffers are rarely needed, so we may change the implementation at some point to speed up the common case for small strings even further.

Conclusion

If you’re dealing with text inside Go, you generally do not have to use the unicode/norm package to normalize your text. The package may still be useful for things like ensuring that strings are normalized before sending them out or to do advanced text manipulation.

This article briefly mentioned the existence of other go.text packages as well as multilingual text processing and it may have raised more questions than it has given answers. The discussion of these topics, however, will have to wait until another day.

By Marcel van Lohuizen

Four years of Go

10 November 2013

Today marks the fourth anniversary of Go as an open source project.

Rather than talk about our technical progress (there’ll be much to talk about when we release Go 1.2 in a couple of weeks) we thought we would instead take this occasion to look at how the Go community has grown.

Let’s start with a chart:

This chart shows the growth of Google searches for the term “golang” over the past four years. Notice the knee in the curve around March 2012, when Go 1.0 was released. If these searches are a decent proxy for interest, then it’s clear that interest in Go has grown remarkably since launch, and particularly so in the last 2 years.

But where is the interest coming from?

The open source community has embraced Go, with our community wiki listing hundreds of Go projects. Some popular ones:

  • Docker is a tool for packaging and running applications in lightweight containers. Docker makes it easy to isolate, package, and deploy applications, and is beloved by system administrators. Its creator Solomon Hykes cited Go’s standard library, concurrency primitives, and ease of deployment as key factors, and said “To put it simply, if Docker had not been written in Go, it would not have been as successful.”
  • Packer is a tool for automating the creation of machine images for deployment to virtual machines or cloud services. Its author, Mitchell Hashimoto, is now working on another Go project, serf, a decentralized discovery service. Like Docker, these projects help with management of large-scale, cluster-based services.
  • Bitly’s NSQ is a realtime distributed messaging platform designed for fault-tolerance and high-availability, and is used in production at bitly and a bunch of other companies.
  • Canonical’s JuJu infrastructure automation system was rewritten in Go. Project lead Gustavo Niemeyer said “It’s not a single aspect of Go that makes it a compelling choice, but rather the careful organization of well-crafted small pieces.”
  • The raft package provides an implementation of the Raft distributed consensus protocol. It is the basis of Go projects like etcd and SkyDNS.

But this is just the tip of the iceberg. The number of high-quality open source Go projects is phenomenal. Prolific Go hacker Keith Rarick put it well: “The state of the Go ecosystem after only four years is astounding. Compare Go in 2013 to Python in 1995 or Java in 1999. Or C++ in 1987!”

Businesses are enjoying Go, too. The Go Users wiki page lists dozens of success stories (and if you use Go, please add yourself to it). Some examples:

  • CloudFlare built their distributed DNS service entirely with Go, and are in the process of migrating their gigabytes-per-minute logging infrastructure to the language. Programmer John Graham-Cumming said “We’ve found Go to be the perfect match for our needs: the combination of familiar syntax, a powerful type system, a strong network library and built-in concurrency means that more and more projects are being built here in Go.”
  • SoundCloud is an audio distribution service that has “dozens of systems in Go, touching almost every part of the site, and in many cases powering features from top to bottom.” Engineer Peter Bourgon said “Go demonstrates that the cruft that burdens other languages and ecosystems—stuff that developers have learned to deal with, often in anger—is simply not a necessary part of modern programming. With Go, I have a straightforward and non-adversarial relationship with my tools, from development to production.”
  • The ngrok service allows web developers to provide remote access to their development environments. Its author Alan Shreve said that “ngrok’s success as a project is due in no small part to choosing Go as the implementation language,” citing Go’s HTTP libraries, efficiency, cross-platform compatibility, and ease of deployment as the major benefits.
  • Poptip provides social analytics services, and product engineer Andy Bonventre said “What started as an experiment in writing a single service in Go turned into moving almost our entire infrastructure over to it. What I love about Go the most is not necessarily the features of the language, but the focus on tooling, testing, and other elements that make writing large applications much more manageable.”
  • Music collaboration startup Splice chose to build their service with Go. Co-founder Matt Aimonetti said “We seriously studied and considered many programming languages, but Go’s simplicity, efficiency, philosophy and community won us over.”
  • And, of course, engineering teams across Google are moving to Go. Engineer Matt Welsh recently shared his experience rewriting a large production service in Go. Other notable public examples include YouTube’s vitess project and dl.google.com. We hope to share more stories like these soon.

In September 2012, Apcera CEO Derek Collison predicted that “Go will become the dominant language for systems work in [Infrastructure-as-a-Service], Orchestration, and [Platform-as-a-Service] in 24 months.” Looking at the list above, it’s easy to believe that prediction.

So how can you get involved? Whether you’re a seasoned Go programmer or just Go-curious, there are many ways to get started in the Go community:

  • Join your nearest Go User Group, where your local gophers meet to share their knowledge and experience. These groups are popping up all over the world. I have personally spoken at Go groups in Amsterdam, Berlin, Gothenburg, London, Moscow, Munich, New York City, Paris, San Francisco, Seoul, Stockholm, Sydney, Tokyo, and Warsaw; but there are many more!
  • Create or contribute to an open source Go project (or to Go itself). (And if you’re building something, we’d love to hear from you on the Go mailing list.)

The Go team has been amazed by the growth of the Go community over the past four years. We are thrilled to see so many great things being built with Go, and deeply grateful to work with our wonderful and dedicated contributors. Thank you, everyone.

Here’s to four more years!

By Andrew Gerrand

Strings, bytes, runes and characters in Go

23 October 2013

Introduction

The previous blog post explained how slices work in Go, using a number of examples to illustrate the mechanism behind their implementation. Building on that background, this post discusses strings in Go. At first, strings might seem too simple a topic for a blog post, but to use them well requires understanding not only how they work, but also the difference between a byte, a character, and a rune, the difference between Unicode and UTF-8, the difference between a string and a string literal, and other even more subtle distinctions.

One way to approach this topic is to think of it as an answer to the frequently asked question, “When I index a Go string at position n, why don’t I get the nth character?” As you’ll see, this question leads us to many details about how text works in the modern world.

An excellent introduction to some of these issues, independent of Go, is Joel Spolsky’s famous blog post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Many of the points he raises will be echoed here.

What is a string?

Let’s start with some basics.

In Go, a string is in effect a read-only slice of bytes. If you’re at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we’ll assume here that you have.

It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

Printing strings

Because some of the bytes in our sample string are not valid ASCII, not even valid UTF-8, printing the string directly will produce ugly output. The simple print statement

    fmt.Println(sample)

produces this mess (whose exact appearance varies with the environment):

��=� ⌘

To find out what that string really holds, we need to take it apart and examine the pieces. There are several ways to do this. The most obvious is to loop over its contents and pull out the bytes individually, as in this for loop:

    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }

As implied up front, indexing a string accesses individual bytes, not characters. We’ll return to that topic in detail below. For now, let’s stick with just the bytes. This is the output from the byte-by-byte loop:

bd b2 3d bc 20 e2 8c 98

Notice how the individual bytes match the hexadecimal escapes that defined the string.

A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf. It just dumps out the sequential bytes of the string as hexadecimal digits, two per byte.

    fmt.Printf("%x\n", sample)

Compare its output to that above:

bdb23dbc20e28c98

A nice trick is to use the “space” flag in that format, putting a space between the % and the x. Compare the format string used here to the one above,

    fmt.Printf("% x\n", sample)

and notice how the bytes come out with spaces between, making the result a little less imposing:

bd b2 3d bc 20 e2 8c 98

There’s more. The %q (quoted) verb will escape any non-printable byte sequences in a string so the output is unambiguous.

    fmt.Printf("%q\n", sample)

This technique is handy when much of the string is intelligible as text but there are peculiarities to root out; it produces:

"\xbd\xb2=\xbc ⌘"

If we squint at that, we can see that buried in the noise is one ASCII equals sign, along with a regular space, and at the end appears the well-known Swedish “Place of Interest” symbol. That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes after the space (hex value 20): e2 8c 98.
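One way to reproduce those three bytes programmatically is to encode the rune ourselves with the standard unicode/utf8 package (a small sketch, not from the original post):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Encode U+2318 into its UTF-8 byte sequence.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, '⌘')
	fmt.Printf("% x\n", buf[:n]) // e2 8c 98
	fmt.Println(n)               // 3 bytes
}
```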

If we are unfamiliar or confused by strange values in the string, we can use the “plus” flag to the %q verb. This flag causes the output to escape not only non-printable sequences, but also any non-ASCII bytes, all while interpreting UTF-8. The result is that it exposes the Unicode values of properly formatted UTF-8 that represents non-ASCII data in the string:

    fmt.Printf("%+q\n", sample)

With that format, the Unicode value of the Swedish symbol shows up as a \u escape:

"\xbd\xb2=\xbc \u2318"

These printing techniques are good to know when debugging the contents of strings, and will be handy in the discussion that follows. It’s worth pointing out as well that all these methods behave exactly the same for byte slices as they do for strings.

Here’s the full set of printing options we’ve listed, presented as a complete program you can run (and edit) right in the browser:

package main

import "fmt"

func main() {
    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

    fmt.Println("Println:")
    fmt.Println(sample)

    fmt.Println("Byte loop:")
    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }
    fmt.Printf("\n")

    fmt.Println("Printf with %x:")
    fmt.Printf("%x\n", sample)

    fmt.Println("Printf with % x:")
    fmt.Printf("% x\n", sample)

    fmt.Println("Printf with %q:")
    fmt.Printf("%q\n", sample)

    fmt.Println("Printf with %+q:")
    fmt.Printf("%+q\n", sample)
}

[Exercise: Modify the examples above to use a slice of bytes instead of a string. Hint: Use a conversion to create the slice.]

[Exercise: Loop over the string using the %q format on each byte. What does the output tell you?]

UTF-8 and string literals

As we saw, indexing a string yields its bytes, not its characters: a string is just a bunch of bytes. That means that when we store a character value in a string, we store its byte-at-a-time representation. Let’s look at a more controlled example to see how that happens.

Here’s a simple program that prints a string constant with a single character three different ways, once as a plain string, once as an ASCII-only quoted string, and once as individual bytes in hexadecimal. To avoid any confusion, we create a “raw string”, enclosed by back quotes, so it can contain only literal text. (Regular strings, enclosed by double quotes, can contain escape sequences as we showed above.)

func main() {
    const placeOfInterest = `⌘`

    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("hex bytes: ")
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    fmt.Printf("\n")
}

The output is:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98

which reminds us that the Unicode character value U+2318, the “Place of Interest” symbol ⌘, is represented by the bytes e2 8c 98, and that those bytes are the UTF-8 encoding of the hexadecimal value 2318.

It may be obvious or it may be subtle, depending on your familiarity with UTF-8, but it’s worth taking a moment to explain how the UTF-8 representation of the string was created. The simple fact is: it was created when the source code was written.

Source code in Go is defined to be UTF-8 text; no other representation is allowed. That implies that when, in the source code, we write the text

`⌘`

the text editor used to create the program places the UTF-8 encoding of the symbol ⌘ into the source text. When we print out the hexadecimal bytes, we’re just dumping the data the editor placed in the file.

In short, Go source code is UTF-8, so the source code for the string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.

Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. As we showed in the previous section, string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes.

To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.

Code points, characters, and runes

We’ve been very careful so far in how we use the words “byte” and “character”. That’s partly because strings hold bytes, and partly because the idea of “character” is a little hard to define. The Unicode standard uses the term “code point” to refer to the item represented by a single value. The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘. (For lots more information about that code point, see its Unicode page.)

To pick a more prosaic example, the Unicode code point U+0061 is the lower case Latin letter ‘A’: a.

But what about the lower case grave-accented letter ‘A’, à? That’s a character, and it’s also a code point (U+00E0), but it has other representations. For example we can use the “combining” grave accent code point, U+0300, and attach it to the lower case letter a, U+0061, to create the same character à. In general, a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.
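A small stdlib-only sketch contrasting the two representations of à (this example is not from the original post):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	composed := "\u00e0"  // 'à' as a single code point U+00E0
	combined := "a\u0300" // 'a' plus combining grave accent U+0300

	fmt.Println(composed, combined) // both display as à
	fmt.Println(utf8.RuneCountInString(composed), utf8.RuneCountInString(combined)) // 1 2
	fmt.Printf("%+q %+q\n", composed, combined) // "\u00e0" "a\u0300"
}
```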

The concept of character in computing is therefore ambiguous, or at least confusing, so we use it with care. To make things dependable, there are normalization techniques that guarantee that a given character is always represented by the same code points, but that subject takes us too far off the topic for now. A later blog post will explain how the Go libraries address normalization.

“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition.

The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the expression

'⌘'

is rune with integer value 0x2318.
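A short sketch confirming the type and value (a variable initialized from a rune constant defaults to type rune, the alias for int32):

```go
package main

import "fmt"

func main() {
	r := '⌘' // a rune constant
	fmt.Printf("type %T, value %#x\n", r, r) // type int32, value 0x2318
	fmt.Println(r == 0x2318)                 // true
}
```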

To summarize, here are the salient points:

  • Go source code is always UTF-8.
  • A string holds arbitrary bytes.
  • A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
  • Those sequences represent Unicode code points, called runes.
  • No guarantee is made in Go that characters in strings are normalized.

Range loops

Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

We’ve seen what happens with a regular for loop. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. Here’s an example using yet another handy Printf format, %#U, which shows the code point’s Unicode value and its printed representation:

    const nihongo = "日本語"
    for index, runeValue := range nihongo {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }

The output shows how each code point occupies multiple bytes:

U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6

[Exercise: Put an invalid UTF-8 byte sequence into the string. (How?) What happens to the iterations of the loop?]

Libraries

Go’s standard library provides strong support for interpreting UTF-8 text. If a for range loop isn’t sufficient for your purposes, chances are the facility you need is provided by a package in the library.

The most important such package is unicode/utf8, which contains helper routines to validate, disassemble, and reassemble UTF-8 strings. Here is a program equivalent to the for range example above, but using the DecodeRuneInString function from that package to do the work. The return values from the function are the rune and its width in UTF-8-encoded bytes.

    const nihongo = "日本語"
    for i, w := 0, 0; i < len(nihongo); i += w {
        runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
        fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
        w = width
    }

Run it to see that it performs the same. The for range loop and DecodeRuneInString are defined to produce exactly the same iteration sequence.

Look at the documentation for the unicode/utf8 package to see what other facilities it provides.

Conclusion

To answer the question posed at the beginning: Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of “character” is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.

There’s much more to say about Unicode, UTF-8, and the world of multilingual text processing, but it can wait for another post. For now, we hope you have a better understanding of how Go strings behave and that, although they may contain arbitrary bytes, UTF-8 is a central part of their design.

By Rob Pike

Arrays, slices (and strings): The mechanics of ‘append’

26 September 2013

Introduction

One of the most common features of procedural programming languages is the concept of an array. Arrays seem like simple things but there are many questions that must be answered when adding them to a language, such as:

  • fixed-size or variable-size?
  • is the size part of the type?
  • what do multidimensional arrays look like?
  • does the empty array have meaning?

The answers to these questions affect whether arrays are just a feature of the language or a core part of its design.

In the early development of Go, it took about a year to decide the answers to these questions before the design felt right. The key step was the introduction of slices, which built on fixed-size arrays to give a flexible, extensible data structure. To this day, however, programmers new to Go often stumble over the way slices work, perhaps because experience from other languages has colored their thinking.

In this post we’ll attempt to clear up the confusion. We’ll do so by building up the pieces to explain how the append built-in function works, and why it works the way it does.

Arrays

Arrays are an important building block in Go, but like the foundation of a building they are often hidden below more visible components. We must talk about them briefly before we move on to the more interesting, powerful, and prominent idea of slices.

Arrays are not often seen in Go programs because the size of an array is part of its type, which limits its expressive power.

The declaration

var buffer [256]byte

declares the variable buffer, which holds 256 bytes. The type of buffer includes its size, [256]byte. An array with 512 bytes would be of the distinct type [512]byte.

The data associated with an array is just that: an array of elements. Schematically, our buffer looks like this in memory,

buffer: byte byte byte ... 256 times ... byte byte byte

That is, the variable holds 256 bytes of data and nothing else. We can access its elements with the familiar indexing syntax, buffer[0], buffer[1], and so on through buffer[255]. (The index range 0 through 255 covers 256 elements.) Attempting to index buffer with a value outside this range will crash the program.

There is a built-in function called len that returns the number of elements of an array or slice and also of a few other data types. For arrays, it’s obvious what len returns. In our example, len(buffer) returns the fixed value 256.

Arrays have their place—they are a good representation of a transformation matrix for instance—but their most common purpose in Go is to hold storage for a slice.

Slices: The slice header

Slices are where the action is, but to use them well one must understand exactly what they are and what they do.

A slice is a data structure describing a contiguous section of an array stored separately from the slice variable itself. A slice is not an array. A slice describes a piece of an array.

Given our buffer array variable from the previous section, we could create a slice that describes elements 100 through 150 (to be precise, 100 through 149, inclusive) by slicing the array:

var slice []byte = buffer[100:150]

In that snippet we used the full variable declaration to be explicit. The variable slice has type []byte, pronounced “slice of bytes”, and is initialized from the array, called buffer, by slicing elements 100 (inclusive) through 150 (exclusive). The more idiomatic syntax would drop the type, which is set by the initializing expression:

var slice = buffer[100:150]

Inside a function we could use the short declaration form,

slice := buffer[100:150]

What exactly is this slice variable? It’s not quite the full story, but for now think of a slice as a little data structure with two elements: a length and a pointer to an element of an array. You can think of it as being built like this behind the scenes:

type sliceHeader struct {
    Length        int
    ZerothElement *byte
}

slice := sliceHeader{
    Length:        50,
    ZerothElement: &buffer[100],
}

Of course, this is just an illustration. Despite what this snippet says, the sliceHeader struct is not visible to the programmer, and the type of the element pointer depends on the type of the elements, but it gives the general idea of the mechanics.

So far we’ve used a slice operation on an array, but we can also slice a slice, like this:

slice2 := slice[5:10]

Just as before, this operation creates a new slice, in this case with elements 5 through 9 (inclusive) of the original slice, which means elements 105 through 109 of the original array. The underlying sliceHeader struct for the slice2 variable looks like this:

slice2 := sliceHeader{
    Length:        5,
    ZerothElement: &buffer[105],
}

Notice that this header still points to the same underlying array, stored in the buffer variable.

We can also reslice, which is to say slice a slice and store the result back in the original slice structure. After

slice = slice[5:10]

the sliceHeader structure for the slice variable looks just like it did for the slice2 variable. You’ll see reslicing used often, for example to truncate a slice. This statement drops the first and last elements of our slice:

slice = slice[1:len(slice)-1]

[Exercise: Write out what the sliceHeader struct looks like after this assignment.]

You’ll often hear experienced Go programmers talk about the “slice header” because that really is what’s stored in a slice variable. For instance, when you call a function that takes a slice as an argument, such as bytes.IndexRune, that header is what gets passed to the function. In this call,

slashPos := bytes.IndexRune(slice, '/')

the slice argument that is passed to the IndexRune function is, in fact, a “slice header”.

There’s one more data item in the slice header, which we talk about below, but first let’s see what the existence of the slice header means when you program with slices.

Passing slices to functions

It’s important to understand that even though a slice contains a pointer, it is itself a value. Under the covers, it is a struct value holding a pointer and a length. It is not a pointer to a struct.

This matters.

When we called IndexRune in the previous example, it was passed a copy of the slice header. That behavior has important ramifications.

Consider this simple function:

func AddOneToEachElement(slice []byte) {
    for i := range slice {
        slice[i]++
    }
}

It does just what its name implies, iterating over the indices of a slice (using a for range loop), incrementing its elements.

Try it:

func main() {
    slice := buffer[10:20]
    for i := 0; i < len(slice); i++ {
        slice[i] = byte(i)
    }
    fmt.Println("before", slice)
    AddOneToEachElement(slice)
    fmt.Println("after", slice)
}

(You can edit and re-execute these runnable snippets if you want to explore.)

Even though the slice header is passed by value, the header includes a pointer to elements of an array, so both the original slice header and the copy of the header passed to the function describe the same array. Therefore, when the function returns, the modified elements can be seen through the original slice variable.

The argument to the function really is a copy, as this example shows:

func SubtractOneFromLength(slice []byte) []byte {
    slice = slice[0 : len(slice)-1]
    return slice
}

func main() {
    fmt.Println("Before: len(slice) =", len(slice))
    newSlice := SubtractOneFromLength(slice)
    fmt.Println("After:  len(slice) =", len(slice))
    fmt.Println("After:  len(newSlice) =", len(newSlice))
}

Here we see that the contents of a slice argument can be modified by a function, but its header cannot. The length stored in the slice variable is not modified by the call to the function, since the function is passed a copy of the slice header, not the original. Thus if we want to write a function that modifies the header, we must return it as a result parameter, just as we have done here. The slice variable is unchanged, but the returned value has the new length, which is then stored in newSlice.

Pointers to slices: Method receivers

Another way to have a function modify the slice header is to pass a pointer to it. Here’s a variant of our previous example that does this:

func PtrSubtractOneFromLength(slicePtr *[]byte) {
    slice := *slicePtr
    *slicePtr = slice[0 : len(slice)-1]
}

func main() {
    fmt.Println("Before: len(slice) =", len(slice))
    PtrSubtractOneFromLength(&slice)
    fmt.Println("After:  len(slice) =", len(slice))
}

It seems clumsy in that example, especially dealing with the extra level of indirection (a temporary variable helps), but there is one common case where you see pointers to slices. It is idiomatic to use a pointer receiver for a method that modifies a slice.

Let’s say we wanted to have a method on a slice that truncates it at the final slash. We could write it like this:

type path []byte

func (p *path) TruncateAtFinalSlash() {
    i := bytes.LastIndex(*p, []byte("/"))
    if i >= 0 {
        *p = (*p)[0:i]
    }
}

func main() {
    pathName := path("/usr/bin/tso") // Conversion from string to path.
    pathName.TruncateAtFinalSlash()
    fmt.Printf("%s\n", pathName)
}

If you run this example you’ll see that it works properly, updating the slice in the caller.

[Exercise: Change the type of the receiver to be a value rather than a pointer and run it again. Explain what happens.]

On the other hand, if we wanted to write a method for path that upper-cases the ASCII letters in the path (parochially ignoring non-English names), the method could take a value receiver, because a value receiver still points to the same underlying array.

type path []byte

func (p path) ToUpper() {
    for i, b := range p {
        if 'a' <= b && b <= 'z' {
            p[i] = b + 'A' - 'a'
        }
    }
}

func main() {
    pathName := path("/usr/bin/tso")
    pathName.ToUpper()
    fmt.Printf("%s\n", pathName)
}

Here the ToUpper method uses two variables in the for range construct to capture the index and slice element. This form of loop avoids writing p[i] multiple times in the body.

[Exercise: Convert the ToUpper method to use a pointer receiver and see if its behavior changes.]

[Advanced exercise: Convert the ToUpper method to handle Unicode letters, not just ASCII.]

Capacity

Look at the following function that extends its argument slice of ints by one element:

func Extend(slice []int, element int) []int {
    n := len(slice)
    slice = slice[0 : n+1]
    slice[n] = element
    return slice
}

(Why does it need to return the modified slice?) Now run it:

func main() {
    var iBuffer [10]int
    slice := iBuffer[0:0]
    for i := 0; i < 20; i++ {
        slice = Extend(slice, i)
        fmt.Println(slice)
    }
}

See how the slice grows until… it doesn’t.

It’s time to talk about the third component of the slice header: its capacity. Besides the array pointer and length, the slice header also stores its capacity:

type sliceHeader struct {
    Length        int
    Capacity      int
    ZerothElement *byte
}

The Capacity field records how much space the underlying array actually has; it is the maximum value the Length can reach. Trying to grow the slice beyond its capacity will step beyond the limits of the array and will trigger a panic.

After our example slice is created by

slice := iBuffer[0:0]

its header looks like this:

slice := sliceHeader{
    Length:        0,
    Capacity:      10,
    ZerothElement: &iBuffer[0],
}

The Capacity field is equal to the length of the underlying array, minus the index in the array of the first element of the slice (zero in this case). If you want to inquire what the capacity is for a slice, use the built-in function cap:

if cap(slice) == len(slice) {
    fmt.Println("slice is full!")
}

Make

What if we want to grow the slice beyond its capacity? You can’t! By definition, the capacity is the limit to growth. But you can achieve an equivalent result by allocating a new array, copying the data over, and modifying the slice to describe the new array.

Let’s start with allocation. We could use the new built-in function to allocate a bigger array and then slice the result, but it is simpler to use the make built-in function instead. It allocates a new array and creates a slice header to describe it, all at once. The make function takes three arguments: the type of the slice, its initial length, and its capacity, which is the length of the array that make allocates to hold the slice data. This call creates a slice of length 10 with room for 5 more (15-10), as you can see by running it:

    slice := make([]int, 10, 15)
    fmt.Printf("len: %d, cap: %d\n", len(slice), cap(slice))

This snippet doubles the capacity of our int slice but keeps its length the same:

    slice := make([]int, 10, 15)
    fmt.Printf("len: %d, cap: %d\n", len(slice), cap(slice))
    newSlice := make([]int, len(slice), 2*cap(slice))
    for i := range slice {
        newSlice[i] = slice[i]
    }
    slice = newSlice
    fmt.Printf("len: %d, cap: %d\n", len(slice), cap(slice))

After running this code the slice has much more room to grow before needing another reallocation.

When creating slices, it’s often true that the length and capacity will be the same. The make built-in has a shorthand for this common case. The length argument defaults to the capacity, so you can leave it out to set them both to the same value. After

gophers := make([]Gopher, 10)

the gophers slice has both its length and capacity set to 10.

Copy

When we doubled the capacity of our slice in the previous section, we wrote a loop to copy the old data to the new slice. Go has a built-in function, copy, to make this easier. Its arguments are two slices, and it copies the data from the right-hand argument to the left-hand argument. Here’s our example rewritten to use copy:

    newSlice := make([]int, len(slice), 2*cap(slice))
    copy(newSlice, slice)

The copy function is smart. It only copies what it can, paying attention to the lengths of both arguments. In other words, the number of elements it copies is the minimum of the lengths of the two slices. This can save a little bookkeeping. Also, copy returns an integer value, the number of elements it copied, although it’s not always worth checking.

The copy function also gets things right when source and destination overlap, which means it can be used to shift items around in a single slice. Here’s how to use copy to insert a value into the middle of a slice.

// Insert inserts the value into the slice at the specified index,
// which must be in range.
// The slice must have room for the new element.
func Insert(slice []int, index, value int) []int {
    // Grow the slice by one element.
    slice = slice[0 : len(slice)+1]
    // Use copy to move the upper part of the slice out of the way and open a hole.
    copy(slice[index+1:], slice[index:])
    // Store the new value.
    slice[index] = value
    // Return the result.
    return slice
}

There are a couple of things to notice in this function. First, of course, it must return the updated slice because its length has changed. Second, it uses a convenient shorthand. The expression

slice[i:]

means exactly the same as

slice[i:len(slice)]

Also, although we haven’t used the trick yet, we can leave out the first element of a slice expression too; it defaults to zero. Thus

slice[:]

just means the slice itself, which is useful when slicing an array. This expression is the shortest way to say “a slice describing all the elements of the array”:

array[:]

Now that’s out of the way, let’s run our Insert function.

    slice := make([]int, 10, 20) // Note capacity > length: room to add element.
    for i := range slice {
        slice[i] = i
    }
    fmt.Println(slice)
    slice = Insert(slice, 5, 99)
    fmt.Println(slice)

Append: An example

A few sections back, we wrote an Extend function that extends a slice by one element. It was buggy, though, because if the slice’s capacity was too small, the function would crash. (Our Insert example has the same problem.) Now we have the pieces in place to fix that, so let’s write a robust implementation of Extend for integer slices.

func Extend(slice []int, element int) []int {
    n := len(slice)
    if n == cap(slice) {
        // Slice is full; must grow.
        // We double its size and add 1, so if the size is zero we still grow.
        newSlice := make([]int, len(slice), 2*len(slice)+1)
        copy(newSlice, slice)
        slice = newSlice
    }
    slice = slice[0 : n+1]
    slice[n] = element
    return slice
}

In this case it’s especially important to return the slice, since when it reallocates the resulting slice describes a completely different array. Here’s a little snippet to demonstrate what happens as the slice fills up:

    slice := make([]int, 0, 5)
    for i := 0; i < 10; i++ {
        slice = Extend(slice, i)
        fmt.Printf("len=%d cap=%d slice=%v\n", len(slice), cap(slice), slice)
        fmt.Println("address of 0th element:", &slice[0])
    }

Notice the reallocation when the initial array of size 5 is filled up. Both the capacity and the address of the zeroth element change when the new array is allocated.

With the robust Extend function as a guide we can write an even nicer function that lets us extend the slice by multiple elements. To do this, we use Go’s ability to turn a list of function arguments into a slice when the function is called. That is, we use Go’s variadic function facility.

Let’s call the function Append. For the first version, we can just call Extend repeatedly so the mechanism of the variadic function is clear. The signature of Append is this:

func Append(slice []int, items ...int) []int

What that says is that Append takes one argument, a slice, followed by zero or more int arguments. Those arguments are exactly a slice of int as far as the implementation of Append is concerned, as you can see:

// Append appends the items to the slice.
// First version: just loop calling Extend.
func Append(slice []int, items ...int) []int {
    for _, item := range items {
        slice = Extend(slice, item)
    }
    return slice
}

Notice the for range loop iterating over the elements of the items argument, which has implied type []int. Also notice the use of the blank identifier _ to discard the index in the loop, which we don’t need in this case.

Try it:

    slice := []int{0, 1, 2, 3, 4}
    fmt.Println(slice)
    slice = Append(slice, 5, 6, 7, 8)
    fmt.Println(slice)

Another new technique in this example is that we initialize the slice by writing a composite literal, which consists of the type of the slice followed by its elements in braces:

    slice := []int{0, 1, 2, 3, 4}

The Append function is interesting for another reason. Not only can we append elements, we can append a whole second slice by “exploding” the slice into arguments using the ... notation at the call site:

    slice1 := []int{0, 1, 2, 3, 4}
    slice2 := []int{55, 66, 77}
    fmt.Println(slice1)
    slice1 = Append(slice1, slice2...) // The '...' is essential!
    fmt.Println(slice1)

Of course, we can make Append more efficient by allocating no more than once, building on the innards of Extend:

// Append appends the elements to the slice.
// Efficient version.
func Append(slice []int, elements ...int) []int {
    n := len(slice)
    total := len(slice) + len(elements)
    if total > cap(slice) {
        // Reallocate. Grow to 1.5 times the new size, so we can still grow.
        newSize := total*3/2 + 1
        newSlice := make([]int, total, newSize)
        copy(newSlice, slice)
        slice = newSlice
    }
    slice = slice[:total]
    copy(slice[n:], elements)
    return slice
}

Here, notice how we use copy twice, once to move the slice data to the newly allocated memory, and then to copy the appending items to the end of the old data.

Try it; the behavior is the same as before:

    slice1 := []int{0, 1, 2, 3, 4}
    slice2 := []int{55, 66, 77}
    fmt.Println(slice1)
    slice1 = Append(slice1, slice2...) // The '...' is essential!
    fmt.Println(slice1)

Append: The built-in function

And so we arrive at the motivation for the design of the append built-in function. It does exactly what our Append example does, with equivalent efficiency, but it works for any slice type.

A weakness of Go is that any generic-type operations must be provided by the run-time. Some day that may change, but for now, to make working with slices easier, Go provides a built-in generic append function. It works the same as our int slice version, but for any slice type.

Remember, since the slice header is always updated by a call to append, you need to save the returned slice after the call. In fact, the compiler won’t let you call append without saving the result.

Here are some one-liners intermingled with print statements. Try them, edit them and explore:

    // Create a couple of starter slices.
    slice := []int{1, 2, 3}
    slice2 := []int{55, 66, 77}
    fmt.Println("Start slice: ", slice)
    fmt.Println("Start slice2:", slice2)

    // Add an item to a slice.
    slice = append(slice, 4)
    fmt.Println("Add one item:", slice)

    // Add one slice to another.
    slice = append(slice, slice2...)
    fmt.Println("Add one slice:", slice)

    // Make a copy of a slice (of int).
    slice3 := append([]int(nil), slice...)
    fmt.Println("Copy a slice:", slice3)

    // Copy a slice to the end of itself.
    fmt.Println("Before append to self:", slice)
    slice = append(slice, slice...)
    fmt.Println("After append to self:", slice)

It’s worth taking a moment to think about the final one-liner of that example in detail to understand how the design of slices makes it possible for this simple call to work correctly.

There are lots more examples of append, copy, and other ways to use slices on the community-built “Slice Tricks” Wiki page.

Nil

As an aside, with our newfound knowledge we can see what the representation of a nil slice is. Naturally, it is the zero value of the slice header:

sliceHeader{
    Length:        0,
    Capacity:      0,
    ZerothElement: nil,
}

or just

sliceHeader{}

The key detail is that the element pointer is nil too. The slice created by

array[0:0]

has length zero (and maybe even capacity zero) but its pointer is not nil, so it is not a nil slice.

As should be clear, an empty slice can grow (assuming it has non-zero capacity), but a nil slice has no array to put values in and can never grow to hold even one element.

That said, a nil slice is functionally equivalent to a zero-length slice, even though it points to nothing. It has length zero and can be appended to, with allocation. As an example, look at the one-liner above that copies a slice by appending to a nil slice.

Strings

Now a brief section about strings in Go in the context of slices.

Strings are actually very simple: they are just read-only slices of bytes with a bit of extra syntactic support from the language.

Because they are read-only, there is no need for a capacity (you can’t grow them), but otherwise for most purposes you can treat them just like read-only slices of bytes.

For starters, we can index them to access individual bytes:

slash := "/usr/ken"[0] // yields the byte value '/'.

We can slice a string to grab a substring:

usr := "/usr/ken"[0:4] // yields the string "/usr"

It should be obvious now what’s going on behind the scenes when we slice a string.

We can also take a normal slice of bytes and create a string from it with the simple conversion:

str := string(slice)

and go in the reverse direction as well:

slice := []byte(usr)

The array underlying a string is hidden from view; there is no way to access its contents except through the string. That means that when we do either of these conversions, a copy of the array must be made. Go takes care of this, of course, so you don’t have to. After either of these conversions, modifications to the array underlying the byte slice don’t affect the corresponding string.

An important consequence of this slice-like design for strings is that creating a substring is very efficient. All that needs to happen is the creation of a two-word string header. Since the string is read-only, the original string and the string resulting from the slice operation can share the same array safely.

A historical note: The earliest implementation of strings always allocated, but when slices were added to the language, they provided a model for efficient string handling. Some of the benchmarks saw huge speedups as a result.

There’s much more to strings, of course, but they are a topic for another post.

Conclusion

To understand how slices work, it helps to understand how they are implemented. There is a little data structure, the slice header, that is the item associated with the slice variable, and that header describes a section of a separately allocated array. When we pass slice values around, the header gets copied but the array it points to is always shared.

Once you appreciate how they work, slices become not only easy to use, but powerful and expressive, especially with the help of the copy and append built-in functions.

More reading

There’s lots to find around the intertubes about slices in Go. As mentioned earlier, the “Slice Tricks” Wiki page has many examples. The Go Slices blog post describes the memory layout details with clear diagrams. Russ Cox’s Go Data Structuresarticle includes a discussion of slices along with some of Go’s other internal data structures.

There is much more material available, but the best way to learn about slices is to use them.

By Rob Pike

The first Go program

18 July 2013

Brad Fitzpatrick and I (Andrew Gerrand) recently started restructuring godoc, and it occurred to me that it is one of the oldest Go programs. Robert Griesemer started writing it back in early 2009, and we’re still using it today.

When I tweeted about this, Dave Cheney replied with an interesting question: what is the oldest Go program? Rob Pike dug into his mail and found it in an old message to Robert and Ken Thompson.

What follows is the first Go program. It was written by Rob in February 2008, when the team was just Rob, Robert, and Ken. They had a solid feature list (mentioned in this blog post) and a rough language specification. Ken had just finished the first working version of a Go compiler (it didn’t produce native code, but rather transliterated Go code to C for fast prototyping) and it was time to try writing a program with it.

Rob sent mail to the “Go team”:

From: Rob 'Commander' Pike
Date: Wed, Feb 6, 2008 at 3:42 PM
To: Ken Thompson, Robert Griesemer
Subject: slist

it works now.

roro=% a.out
(defn foo (add 12 34))
return: icounter = 4440
roro=%

here's the code.
some ugly hackery to get around the lack of strings.

(The icounter line in the program output is the number of executed statements, printed for debugging.)

package main

// fake stuff
type char uint8;

// const char TESTSTRING[] = "(defn foo (add 'a 'b))\n";

type Atom struct {
        string  *[100]char;
        integer int;
        next    *Slist;  /* in hash bucket */
}

type List struct {
        car     *Slist;
        cdr     *Slist;
}

type Slist struct {
        isatom          bool;
        isstring        bool;
        //union {
        atom    Atom;
        list    List;
        //} u;

        Free method();
        Print method();
        PrintOne method(doparen bool);
        String method(*char <-);
        Integer method(int <-);
        Car method(*Slist <-);
        Cdr method(*Slist <-);
}

method (this *Slist) Car(*Slist <-) {
        return this.list.car;
}

method (this *Slist) Cdr(*Slist <-) {
        return this.list.cdr;
}

method (this *Slist) String(*[100]char <-) {
        return this.atom.string;
}

method (this *Slist) Integer(int <-) {
        return this.atom.integer;
}

function OpenFile();
function Parse(*Slist <-);

//Slist* atom(char *s, int i);

var token int;
var peekc int = -1;
var lineno int32 = 1;

var input [100*1000]char;
var inputindex int = 0;
var tokenbuf [100]char;

var EOF int = -1;  // BUG should be const

function main(int32 <-) {
        var list *Slist;

        OpenFile();
        for ;; {
                list = Parse();
                if list == nil {
                        break;
                }
                list.Print();
                list.Free();
                break;
        }

        return 0;
}

method (slist *Slist) Free(<-) {
        if slist == nil {
                return;
        }
        if slist.isatom {
//              free(slist.String());
        } else {
                slist.Car().Free();
                slist.Cdr().Free();
        }
//      free(slist);
}

method (slist *Slist) PrintOne(<- doparen bool) {
        if slist == nil {
                return;
        }
        if slist.isatom {
                if slist.isstring {
                        print(slist.String());
                } else {
                        print(slist.Integer());
                }
        } else {
                if doparen {
                        print("(");
                }
                slist.Car().PrintOne(true);
                if slist.Cdr() != nil {
                        print(" ");
                        slist.Cdr().PrintOne(false);
                }
                if doparen {
                        print(")");
                }
        }
}

method (slist *Slist) Print() {
        slist.PrintOne(true);
        print "\n";
}

function Get(int <-) {
        var c int;

        if peekc >= 0 {
                c = peekc;
                peekc = -1;
        } else {
                c = convert(int, input[inputindex]);
                inputindex = inputindex + 1; // BUG should be incr one expr
                if c == '\n' {
                        lineno = lineno + 1;
                }
                if c == '\0' {
                        inputindex = inputindex - 1;
                        c = EOF;
                }
        }
        return c;
}

function WhiteSpace(bool <- c int) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

function NextToken() {
        var i, c int;
        var backslash bool;

        tokenbuf[0] = '\0';     // clear previous token
        c = Get();
        while WhiteSpace(c)  {
                c = Get();
        }
        switch c {
                case EOF:
                        token = EOF;
                case '(':
                case ')':
                        token = c;
                        break;
                case:
                        for i = 0; i < 100 - 1; {  // sizeof tokenbuf - 1
                                tokenbuf[i] = convert(char, c);
                                i = i + 1;
                                c = Get();
                                if c == EOF {
                                        break;
                                }
                                if WhiteSpace(c) || c == ')' {
                                        peekc = c;
                                        break;
                                }
                        }
                        if i >= 100 - 1 {  // sizeof tokenbuf - 1
                                panic "atom too long\n";
                        }
                        tokenbuf[i] = '\0';
                        if '0' <= tokenbuf[0] && tokenbuf[0] <= '9' {
                                token = '0';
                        } else {
                                token = 'A';
                        }
        }
}

function Expect(<- c int) {
        if token != c {
                print "parse error: expected ", c, "\n";
                panic "parse";
        }
        NextToken();
}

// Parse a non-parenthesized list up to a closing paren or EOF
function ParseList(*Slist <-) {
        var slist, retval *Slist;

        slist = new(Slist);
        slist.list.car = nil;
        slist.list.cdr = nil;
        slist.isatom = false;
        slist.isstring = false;

        retval = slist;
        for ;; {
                slist.list.car = Parse();
                if token == ')' {       // empty cdr
                        break;
                }
                if token == EOF {       // empty cdr  BUG SHOULD USE ||
                        break;
                }
                slist.list.cdr = new(Slist);
                slist = slist.list.cdr;
        }
        return retval;
}

function atom(*Slist <- i int) {  // BUG: uses tokenbuf; should take argument
        var h, length int;
        var slist, tail *Slist;

        slist = new(Slist);
        if token == '0' {
                slist.atom.integer = i;
                slist.isstring = false;
        } else {
                slist.atom.string = new([100]char);
                var i int;
                for i = 0; ; i = i + 1 {
                        (*slist.atom.string)[i] = tokenbuf[i];
                        if tokenbuf[i] == '\0' {
                                break;
                        }
                }
                //slist.atom.string = "hello"; // BUG! s; //= strdup(s);
                slist.isstring = true;
        }
        slist.isatom = true;
        return slist;
}

function atoi(int <-) {  // BUG: uses tokenbuf; should take argument
        var v int = 0;
        for i := 0; '0' <= tokenbuf[i] && tokenbuf[i] <= '9'; i = i + 1 {
                v = 10 * v + convert(int, tokenbuf[i] - '0');
        }
        return v;
}

function Parse(*Slist <-) {
        var slist *Slist;

        if token == EOF || token == ')' {
                return nil;
        }
        if token == '(' {
                NextToken();
                slist = ParseList();
                Expect(')');
                return slist;
        } else {
                // Atom
                switch token {
                        case EOF:
                                return nil;
                        case '0':
                                slist = atom(atoi());
                        case '"':
                        case 'A':
                                slist = atom(0);
                        case:
                                slist = nil;
                                print "unknown token"; //, token, tokenbuf;
                }
                NextToken();
                return slist;
        }
        return nil;
}

function OpenFile() {
        //strcpy(input, TESTSTRING);
        //inputindex = 0;
        // (defn foo (add 12 34))\n
        inputindex = 0;
        peekc = -1;  // BUG
        EOF = -1;  // BUG
        i := 0;
        input[i] = '('; i = i + 1;
        input[i] = 'd'; i = i + 1;
        input[i] = 'e'; i = i + 1;
        input[i] = 'f'; i = i + 1;
        input[i] = 'n'; i = i + 1;
        input[i] = ' '; i = i + 1;
        input[i] = 'f'; i = i + 1;
        input[i] = 'o'; i = i + 1;
        input[i] = 'o'; i = i + 1;
        input[i] = ' '; i = i + 1;
        input[i] = '('; i = i + 1;
        input[i] = 'a'; i = i + 1;
        input[i] = 'd'; i = i + 1;
        input[i] = 'd'; i = i + 1;
        input[i] = ' '; i = i + 1;
        input[i] = '1'; i = i + 1;
        input[i] = '2'; i = i + 1;
        input[i] = ' '; i = i + 1;
        input[i] = '3'; i = i + 1;
        input[i] = '4'; i = i + 1;
        input[i] = ')'; i = i + 1;
        input[i] = ')'; i = i + 1;
        input[i] = '\n'; i = i + 1;
        NextToken();
}

The program parses and prints an S-expression. It takes no user input and has no imports, relying only on the built-in print facility for output. It was written literally the first day there was a working but rudimentary compiler. Much of the language wasn’t implemented and some of it wasn’t even specified.

Still, the basic flavor of the language today is recognizable in this program. Type and variable declarations, control flow, and package statements haven’t changed much.

But there are many differences and absences. Most significant are the lack of concurrency and interfaces—both considered essential since day 1 but not yet designed.

The func keyword was then spelled function, and a signature specified the return values before the arguments, separating them with <-, which we now use as the channel send/receive operator. For example, the WhiteSpace function takes the integer c and returns a boolean.

function WhiteSpace(bool <- c int)

This arrow was a stop-gap measure until a better syntax arose for declaring multiple return values.
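
For comparison, here is a minimal sketch of how the same function reads in modern Go, where the result type follows the parameter list (the name isWhiteSpace and the int parameter type are choices made for this example):

```go
package main

import "fmt"

// isWhiteSpace reports whether c is a space, tab, carriage return, or
// newline. The result type comes after the parameters; multiple results
// are written as a parenthesized list, the "better syntax" that
// eventually replaced the arrow.
func isWhiteSpace(c int) bool {
	return c == ' ' || c == '\t' || c == '\r' || c == '\n'
}

func main() {
	fmt.Println(isWhiteSpace(' '), isWhiteSpace('x')) // true false
}
```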

Methods were distinct from functions and had their own keyword.

method (this *Slist) Car(*Slist <-) {
    return this.list.car;
}

And methods were pre-declared in the struct definition, although that changed soon.

type Slist struct {
    ...
    Car method(*Slist <-);
}
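
In today's Go the method keyword is gone: a method is an ordinary func with a receiver section, and nothing is pre-declared in the struct. A minimal sketch (the simplified field layout here is an assumption for illustration, not the original declaration):

```go
package main

import "fmt"

// Slist is a pared-down version of the node type in the listing above.
type Slist struct {
	car, cdr *Slist
	integer  int
}

// Car is the modern form: func plus a parenthesized receiver, with the
// result type after the parameter list and no method keyword.
func (slist *Slist) Car() *Slist {
	return slist.car
}

func main() {
	head := &Slist{car: &Slist{integer: 42}}
	fmt.Println(head.Car().integer) // 42
}
```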

There were no strings, although they were in the spec. To work around this, Rob had to build the input string as a uint8 array with a clumsy construction. (Arrays were rudimentary and slices hadn’t been designed yet, let alone implemented, although there was the unimplemented concept of an “open array”.)

input[i] = '('; i = i + 1;
input[i] = 'd'; i = i + 1;
input[i] = 'e'; i = i + 1;
input[i] = 'f'; i = i + 1;
input[i] = 'n'; i = i + 1;
input[i] = ' '; i = i + 1;
...
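
With strings and slices now in the language, the same input can be written directly; a minimal sketch:

```go
package main

import "fmt"

func main() {
	// One string literal replaces the byte-by-byte assignments above.
	input := "(defn foo (add 12 34))\n"

	// Converting to []byte recovers the uint8-array view that the
	// 2008 program had to construct by hand.
	buf := []byte(input)
	fmt.Println(len(buf)) // 23
}
```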

Both panic and print were built-in keywords, not pre-declared functions.

print "parse error: expected ", c, "\n";
panic "parse";
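
In modern Go both are pre-declared functions, not keywords: panic takes a value in parentheses, and formatted output goes through the fmt package. A minimal sketch of the Expect function in today's style (the lowercase name expect is a choice for this example):

```go
package main

import "fmt"

// expect mirrors the old Expect: panic is now called as a function
// with an argument, and fmt replaces the print statement.
func expect(token, c int) {
	if token != c {
		fmt.Printf("parse error: expected %c\n", c)
		panic("parse")
	}
}

func main() {
	// recover converts the panic back into a value we can inspect,
	// something the 2008 language did not offer.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	expect('(', ')')
}
```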

And there are many other little differences; see if you can identify some others.

Less than two years after this program was written, Go was released as an open source project. Looking back, it is striking how much the language has grown and matured. (The last thing to change between this proto-Go and the Go we know today was the elimination of semicolons.)

But even more striking is how much we have learned about writing Go code. For instance, Rob called his method receivers this, but now we use shorter, context-specific names. There are hundreds of more significant examples, and to this day we’re still discovering better ways to write Go code. (Check out the glog package’s clever trick for handling verbosity levels.)

I wonder what we’ll learn tomorrow.

By Andrew Gerrand
