Fast S3 Updates with Golang and Goroutines

I recently wrote a Go program to optimize my S3 uploads for this blog. The idea was to upload only the files that have changed (rather like rsync does), instead of bulk uploading the whole site each time. Although there are third-party tools that do this, I wanted to learn more about the golang AWS API while also getting a chance to run some goroutine benchmarks. To see the code and the benchmarks I did, read on!

Is This Trip Necessary?

After writing this article, I discovered that a simpler approach is to use the AWS CLI, which has a "sync" command for syncing files and directories with S3. I'm leaving this article up, however, as it still shows some interesting techniques in Go.
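
If the sync itself is all you need, a single command along the following lines does the job (shown here with the local directory and bucket from the configuration later in this article; sync also takes a --delete flag if you want removals mirrored as well):

$ aws s3 sync /home/john/source/codesolid/public/ s3://codesolid.com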

Some Preliminaries

The AWS SDK for Go exposes an S3 service that we'll use both for querying the files already in the Amazon cloud and for uploading the new or changed files. Let's see that, and quickly look at all the imports we'll be using:

// main.go

import (
    "bytes"
    "crypto/md5"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "path/filepath"
    "strings"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/credentials"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

/* getS3Session returns the AWS session with our configuration.
   For our purposes this could have gone in getS3Service,
   but in case we end up relying on other AWS clients
   it's handy to have the session available separately.
*/ 
func getS3Session() *session.Session {
    // Use shared credentials ($HOME/.aws/credentials),
    // but set the region
    return session.Must(session.NewSession(
        &aws.Config{
            Region:      aws.String(getConfiguration().region),
            Credentials: credentials.NewSharedCredentials("", "default"),
        }))
}

/*
    getS3Service returns a service we use to query our bucket 
    and to put new / changed files.
*/
func getS3Service() *s3.S3 {
    return s3.New(getS3Session())
}

The function getS3Session relies on a configuration function to set up our program. The configuration is simply a struct that holds whatever settings we need; you'd change the values to match your local path and S3 bucket name. Since this was a personal project, we chose a configuration object over a set of command-line parameters. Here's the code for our configuration type and function.

type config struct {
    region         string /* AWS region to use for session */
    bucket         string /* S3 bucket where our files are stored */
    localDirectory string /* Local files to sync with S3 bucket */
}

func getConfiguration() *config {
    configuration := config{
        region:         "us-east-2",
        bucket:         "codesolid.com",
        localDirectory: "/home/john/source/codesolid/public/"}
    return &configuration
}

Program Design

Our strategy will be simple. If we can build a map of remote S3 object (file) names to file checksums, then any local file that either isn't in the map, or is in the map but has a different checksum, needs to be uploaded. We're only concerned with new or updated files; any deletes are ignored, and we can handle them separately.

So our first function builds the map of remote files to checksums.

/**
getAwsS3ItemMap constructs and returns a map of keys (relative filenames)
to checksums for the given bucket-configured S3 service.
It is assumed that the objects have not been multipart-uploaded,
since a multipart upload produces an ETag that is no longer a plain
MD5 checksum.
*/
func getAwsS3ItemMap(s3Service *s3.S3) (map[string]string, error) {

    var loi s3.ListObjectsInput
    loi.SetBucket(getConfiguration().bucket)

    obj, err := s3Service.ListObjects(&loi)

    // Uncomment this to see what AWS returns for ListObjects
    // fmt.Printf("%s\n", obj)

    if err != nil {
        return nil, err
    }

    // Based on the response from the AWS API, map relative filenames to
    // checksums. The "Key" in S3 is the filename, while the ETag contains
    // the checksum.
    var items = make(map[string]string)
    for _, s3obj := range obj.Contents {
        eTag := strings.Trim(*(s3obj.ETag), "\"")
        items[*(s3obj.Key)] = eTag
    }

    return items, nil
}
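
One caveat: ListObjects returns at most 1,000 keys per response, which is plenty for this blog but not for a large bucket. As a sketch (not part of the original program), the same map could be built with the SDK's ListObjectsPages, which follows the pagination markers for us:

/*
getAwsS3ItemMapPaged is a sketch showing how the map could be built
for buckets with more than 1,000 objects. ListObjectsPages calls the
callback once per page of results until we return false or run out.
*/
func getAwsS3ItemMapPaged(s3Service *s3.S3) (map[string]string, error) {
    var loi s3.ListObjectsInput
    loi.SetBucket(getConfiguration().bucket)

    items := make(map[string]string)
    err := s3Service.ListObjectsPages(&loi,
        func(page *s3.ListObjectsOutput, lastPage bool) bool {
            for _, s3obj := range page.Contents {
                items[*(s3obj.Key)] = strings.Trim(*(s3obj.ETag), "\"")
            }
            return true // keep fetching pages
        })
    if err != nil {
        return nil, err
    }
    return items, nil
}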

Next, we need to walk the local directory that contains the corresponding files we'll want to check for updates. Since we'll be walking the directory tree and opening each file in turn to run a checksum on it, there's a lot of file input going on, which makes the task a good fit for optimizing with goroutines.

I confess that I did it wrong in an early draft of this approach. By using shared memory that had to be synchronized in order to add the files to a list, I made the program slower than the version without goroutines (as my benchmarks showed)! Once I used channels and moved the building of the list to the main goroutine, the benchmarks improved significantly for the goroutine version compared to the same code without goroutines and channels.

I paid the price for ignoring the Go Blog’s advice:

***”Do not communicate by sharing memory; instead, share memory by communicating.”***
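
To make the mistake concrete, here's a rough reconstruction of that kind of shared-memory approach (a sketch of the anti-pattern, not my original draft; it uses the testFileNoGoroutine helper shown further down and would need "sync" added to the imports). Every goroutine takes a lock to append to a shared slice, so much of the work ends up serialized behind the mutex:

func fileWalkSharedMemory(awsItems map[string]string) []string {
    blogDir := getConfiguration().localDirectory
    var mu sync.Mutex
    var wg sync.WaitGroup
    var filesToUpload []string

    filepath.Walk(blogDir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if !info.IsDir() {
            relativeFile := strings.Replace(path, blogDir, "", 1)
            checksumRemote := awsItems[relativeFile]
            wg.Add(1)
            go func(path, checksumRemote string) {
                defer wg.Done()
                if filename := testFileNoGoroutine(path, checksumRemote); filename != "" {
                    // A shared slice guarded by a mutex is exactly what
                    // the channel-based version below avoids.
                    mu.Lock()
                    filesToUpload = append(filesToUpload, filename)
                    mu.Unlock()
                }
            }(path, checksumRemote)
        }
        return nil
    })

    wg.Wait()
    return filesToUpload
}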

Before we get to the benchmarks, let's look at the final version of the code, starting with the traditional, synchronous version without goroutines or channels.

/*
fileWalkWithoutGoroutine walks the local directory using golang's
filepath.Walk.  Ignoring directories, for each file we get 
the checksum of the relative file path from the awsItems map.
We then pass the full filename and checksum to testFileNoGoroutine,
which returns either an empty string if the file doesn't need to be 
uploaded, or a non-empty string if there's no checksum (file not on the remote) 
or if the checksum doesn't match the local file (file has changed).

Return the list of files to be uploaded as a list.
*/
func fileWalkWithoutGoroutine(awsItems map[string]string) []string {
    var files int
    var filesToUpload []string
    blogDir := getConfiguration().localDirectory
    filepath.Walk(blogDir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if !info.IsDir() {
            files++
            relativeFile := strings.Replace(path, blogDir, "", 1)
            checksumRemote := awsItems[relativeFile]

            filename := testFileNoGoroutine(path, checksumRemote)
            if len(filename) > 0 {
                filesToUpload = append(filesToUpload, filename)
            }
        }
        return nil
    })

    return filesToUpload
}

/**
    testFileNoGoroutine returns an empty string if the given file does
    not need to be uploaded (its checksum matches the remote checksum),
    or the filename if the remote checksum is blank (the file isn't on
    the remote) or doesn't match the local file (the file has changed).
*/
func testFileNoGoroutine(filename string, checksumRemote string) string {
    if checksumRemote == "" {
        return filename
    }

    contents, err := ioutil.ReadFile(filename)
    if err != nil {
        // If we can't read the file, skip it rather than mark it for upload
        return ""
    }
    sum := md5.Sum(contents)
    sumString := fmt.Sprintf("%x", sum)
    // checksums don't match, mark for upload
    if sumString != checksumRemote {
        return filename
    }
    return ""
}

Now, let's look at the same code using goroutines and channels. This time, as we walk the local directory, we launch testFile as a goroutine for each file. In testFile, instead of returning a string as in our earlier version, we send the filename (or a blank string) back on the channel that fileWalk passed in.

In fileWalk, we keep a count of the files we walked in numfiles, and when we're done walking the files, we read from the channel we passed to testFile numfiles times to build the list of files to return.

func fileWalk(awsItems map[string]string) []string {
    blogDir := getConfiguration().localDirectory
    ch := make(chan string)
    var numfiles int
    filepath.Walk(blogDir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if !info.IsDir() {
            numfiles++
            relativeFile := strings.Replace(path, blogDir, "", 1)
            checksumRemote := awsItems[relativeFile]

            go testFile(path, checksumRemote, ch)
            //fmt.Println(path, info.Size())
        }
        return nil
    })

    var filesToUpload []string

    var f string
    for i := 0; i < numfiles; i++ {
        f = <-ch
        if len(f) > 0 {
            filesToUpload = append(filesToUpload, f)
        }
    }

    return filesToUpload
}

func testFile(filename string, checksumRemote string, ch chan<- string) {

    if checksumRemote == "" {
        ch <- filename
        return
    }

    contents, err := ioutil.ReadFile(filename)
    if err != nil {
        // We couldn't read the file, so skip it, but still send an empty
        // string so that fileWalk receives exactly one message per file.
        ch <- ""
        return
    }

    sum := md5.Sum(contents)
    sumString := fmt.Sprintf("%x", sum)
    if sumString != checksumRemote {
        // checksums don't match, mark for upload
        ch <- filename
        return
    }

    // Files matched, nothing to upload
    ch <- ""
}

Benchmarking the Two Versions

We’ll leave aside the main method and the code to upload the files for the time being, but at this point we have code that tells us just what files need to be uploaded.
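
Before that, though, here's a minimal sketch, purely for orientation, of a main that wires the pieces together without the upload step; it just prints the files that would be uploaded:

/*
   A minimal sketch of main, showing how the pieces fit together.
   The upload step itself is omitted; this version only prints the
   files that would need to be uploaded.
*/
func main() {
    s3Service := getS3Service()
    awsItems, err := getAwsS3ItemMap(s3Service)
    if err != nil {
        log.Fatal(err)
    }
    for _, filename := range fileWalk(awsItems) {
        fmt.Println("Needs upload:", filename)
    }
}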

To see how much of a performance boost we get from the goroutine version, we write some simple benchmarks in main_test.go. Here's that file, with some non-benchmark tests we wrote excluded.

// main_test.go
package main

import (
	"log"
	"os"
	"testing"
)

var awsItems map[string]string

/* Set up the awsItems map that we'll need */
func TestMain(m *testing.M) {

	s3Service := getS3Service()
	var err error
	awsItems, err = getAwsS3ItemMap(s3Service)
	if err != nil {
		log.Fatal(err)
	}

	// call flag.Parse() here if TestMain uses flags
	os.Exit(m.Run())
}

func BenchmarkFilewalkGoroutine(b *testing.B) {
	for n := 0; n < b.N; n++ {
		fileWalk(awsItems)
	}
}

func BenchmarkFilewalkWithoutGoroutine(b *testing.B) {
	for n := 0; n < b.N; n++ {
		fileWalkWithoutGoroutine(awsItems)
	}
}

I stumbled on the prettybench utility a while back, and I like to use it to improve the output of Go benchmark tests. Using that tool to format the output gives:

$ go test -v -bench=. | prettybench
goos: linux
goarch: amd64
pkg: aws
PASS
benchmark                             iter     time/iter
---------                             ----     ---------
BenchmarkFilewalkGoroutine-4            50   27.26 ms/op
BenchmarkFilewalkWithoutGoroutine-4     50   32.82 ms/op
ok  	aws	3.543s

Of course, we're running on an SSD here, and Go is hecka fast to begin with, but even so, the goroutine version improves performance by about 20%. I'll publish an update to this article, or a link to the GitHub version with the upload code and the rest of the program, soon. For now, though, it's late, so I think I'll use my Go file upload utility one last time to push this article to the cloud! Cheers.
