Fixing Harvey's bugs Automagically

27 June 2016

How we fix Harvey's bugs automagically

How it works

Both gcc and clang, but clang more notably, come with very useful diagnostic messages:

main.c:153:6: warning: use of GNU 'missing =' extension in designator
      [-Wgnu-designator]
[16]    "r5g6b5",
        ^
        = 

but clang also exposes these diagnostic messages to programs via libclang. As I wrote in a previous post, the most important thing I wanted to do with build was not just building stuff, but rather creating a library that can be used from a CLI too. In fact, build was inspired by that libclang philosophy.

Clang exposes the information needed to fix the source code in the form of FixIt messages. Applying these requires the compiler flags and the actual files too, but since build exposes that information, we can programmatically fix the files that need fixing.
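
Clang itself can apply these hints from the command line; as a minimal illustration (main.c standing in for any file with fixable diagnostics):

clang -fixit main.c

build drives the same machinery programmatically through libclang, using the flags and file lists it already tracks for each target.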

You can view the code here.

By Sevki

Learn you some build

10 January 2016

Build targets are:

  • Concurrent
  • Declarative
  • Nodes on a directed acyclic graph (no import cycles, one-way communication)

So what does this mean for build targets?

Every target you create must exist in a vacuum, which means that your target may only make assumptions about what is in its vacuum, and nothing more. If it's not there, it doesn't exist.

You are free to make assumptions about what is installed from your dependencies, a.k.a. your vacuum. At first glance this may seem a bit limiting, but it can compose incredibly versatile workflows.

For instance, imagine you have two targets, foo and foo_Clang. Both targets have one source file, foo.c; foo_Clang has one dependency, clang, while foo has none.
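
A minimal sketch of what those two targets might look like in a BUILD file; the cc_binary rule name here is hypothetical, standing in for whatever target type compiles a C binary:

cc_binary(
    name = "foo",
    srcs = ["foo.c"]
)

cc_binary(
    name = "foo_Clang",
    srcs = ["foo.c"],
    deps = ["clang"]
)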

As mentioned before, your children, a.k.a. your dependencies, are how you manipulate the vacuum. In this case, clang installs a binary under the bin directory in the foo_Clang vacuum. In the foo_Clang vacuum, calling CC will now invoke the clang binary installed by that dependency, instead of whatever the host machine's OS had defined in its environment variables; the mechanics of how this works are explained in the Installs() section.

This gives you one place to make assumptions: dependencies. Imagine creating an os_package type that installs the binaries your target needs into its vacuum. And given that these are vacuums, you can have multiple versions of the same compiler running in parallel. Because vacuums.
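
As a sketch, such an os_package might look like this (the type and its fields are made up for illustration; the dst -> src convention is the one Installs() uses, described further down):

// Hypothetical os_package target type.
type OsPackage struct {
    Name string `os_package:"name"`
}

// Installs exposes this package's binary under bin in the dependent
// target's vacuum; the key is the destination, the value the source.
func (p *OsPackage) Installs() map[string]string {
    bin := filepath.Join("bin", p.Name)
    return map[string]string{bin: bin}
}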

OK, enough chatter; time to create a target type.

First things first: let's get you familiar with the target interface.

// Target defines the interface that rules must implement for becoming build targets.
type Target interface {
    GetName() string
    GetDependencies() []string

    Hash() []byte
    Build(*Context) error
    Installs() map[string]string // map[dst]src
}

There is nothing in this interface that isn't absolutely necessary; if it's not needed, it will be removed.

Convenience methods (how's that for irony after that last sentence)

GetName() returns the name of the target; that's it.

GetDependencies() returns the list of dependencies for a target.

Let's build the first bits

package js

import (
    "crypto/sha1"
    "fmt"
    "log"

    "io"
    "path/filepath"
    "strings"
    "time"

    "sevki.org/build"
    "sevki.org/build/ast"
)

func init() {
    if err := ast.Register("npm_package", NpmPackage{}); err != nil {
        log.Fatal(err)
    }
}

type NpmPackage struct {
    Name         string   `npm_package:"name"`
    Version      string   `npm_package:"version"`
    Dependencies []string `npm_package:"deps"`
}

func (npm *NpmPackage) GetName() string {
    return npm.Name
}

func (npm *NpmPackage) GetDependencies() []string {
    return npm.Dependencies
}

These are pretty straightforward. And now that we know what the structure of the target will be, we can write a BUILD file.

npm_package(
    name = "express",
    version = "4.13.3"
)

Hash() []byte

This is where your target should produce a hash. There are a few things to remember when you are writing the hash function:

  • You have two hash functions: Build() and Hash(). If you think of your compilation step as a hash function, then for the file foo.c your C compiler should always produce the same foo.o. So your Hash() function should be as deterministic as your Build() function.
  • Whenever build is producing wrong builds, or you have to blow away the /tmp/build directory before a build, it means there is a discrepancy between your two hash functions. Fix the Hash(); never blow the directory away.
  • After this step is complete, you shouldn't do anything in the build step to change the deterministic nature of the hash produced by this function; the correctness of your target relies on it.

While build does know where your files are, because it has to make their paths absolute, it leaves you the choice of how to extract a meaningful hash from them. For instance, cc_library hashes the --version of $CC. Build assumes the author of the target type knows more about how it should be built than build does. So if you have to get the current weather report for Hell like so

func Hash() []byte {
    // hash the weather report for Hell, Norway
    res, err := http.Get("http://www.yr.no/place/Norway/Nord-Tr%C3%B8ndelag/Stj%C3%B8rdal/Hell/")
    if err != nil {
        return nil
    }
    defer res.Body.Close()
    h := sha1.New()
    io.Copy(h, res.Body)
    return h.Sum(nil)
}

you are more than welcome to do it.

Let's build the hash function for npm:

func (npm *NpmPackage) Hash() []byte {
    h := sha1.New()
    io.WriteString(h, npm.Name)
    if npm.Version != "" {
        io.WriteString(h, npm.Version)
    } else {
        fmt.Fprintf(h, "%d", time.Now().UnixNano())
    }
    return h.Sum(nil)
}

As you can see, we are hashing the version and the name of the package, and if the package doesn't have a version, we make sure it's never cached. USE A VERSION, YOU HIPPIES!

Build(*Context) error

This is the only step of our target that is cached. So this is where it all gets interesting.

So this is the most basic example I could find on npmjs.org

$ npm install express

Let's start by implementing this.

func (npm *NpmPackage) Build(c *build.Context) error {
    if npm.Version == "" {
        return fmt.Errorf("NPM package %s failed to install, no version string")
    }
    params := []string{}
    params = append(params, "install")
    params = append(params, fmt.Sprintf("%s@%s", npm.Name, npm.Version))

First we check for the version and fail if it isn't there. The second thing we do is create a params array and append install and express@4.13.3 to it. Make sure that all the flags and arguments you pass to Exec are their own string items; do not try to be smart and join all the params with \s and pass them to exec as a single-item string array. The params array essentially ends up being the os.Args array of the executed program.
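
To make that concrete (a made-up illustration, not part of the npm target):

// Right: each flag and argument is its own element; this slice
// effectively becomes os.Args for the executed program.
params := []string{"install", "express@4.13.3"}

// Wrong: npm would see one single argument, "install express@4.13.3".
// params := []string{"install express@4.13.3"}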

Context

So far, everything we have done has been pretty much plain Go; now comes the part I've been going on about in the introduction: vacuums.

We want to isolate each target's build from the others' (unless otherwise specified), and we do this by using contexts. Each build function gets a context object (its vacuum).

Contexts provide a bunch of functions that are bound to that node; for instance, c.Println() will write to that node's output stream. Since these usually run in parallel, if all the compilations wrote to STDOUT at the same time, the output would be indecipherable garbage.

Let's start by writing to that output:

    c.Println(strings.Join(append([]string{"npm"}, params...), " "))

While that's not entirely necessary, it's good for debugging later on.

The same problem as with STDOUT arises when build needs to shell out: if all the workers called os.Chdir() before they started working, there would be no predicting what would happen. So you need to use Exec(), which can be used in different directories in parallel.

So the last bit of our npm build function will be

    if err := c.Exec("npm", nil, params); err != nil {
        c.Println(err.Error())
        return err
    }
    return nil
}

The first argument Exec takes is the name of the application to exec, which in this case is npm; pretty straightforward. The second argument is the string array of environment variables. For the C/C++ target types, for instance, this involves adding the lib and include directories to the LIBRARY_PATH and C_INCLUDE_PATH environment variables respectively; for target types that expect binaries to be installed, adding bin to the PATH environment variable would be the way to go.
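
A rough sketch of what that might look like for a C target; the directory names and the wd variable (the vacuum's working directory) are assumptions, only the Exec(name, env, params) shape comes from the npm example above:

// Expose a dependency's lib and include directories to the compiler,
// then shell out in this node's own directory.
env := []string{
    "LIBRARY_PATH=" + filepath.Join(wd, "lib"),
    "C_INCLUDE_PATH=" + filepath.Join(wd, "include"),
}
if err := c.Exec("cc", env, params); err != nil {
    return err
}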

Installs() map[string]string

This function returns a map of things to be installed from dependencies into the dependent target. The key of the map is the destination; the value is the source of the installation.

func (npm *NpmPackage) Installs() map[string]string {
    installs := make(map[string]string)
    path := filepath.Join("node_modules", npm.Name)
    installs[path] = path
    return installs
}

The order of setting the key and the value is not very prominent in this example, so let's give an example where it is.

The destination is always the key and the source is always the value, for a couple of reasons: firstly, while you can create two symlinks from one source, you can't create one symlink with two sources (at least in Unix); secondly, having the destination as the key makes life easier when looking up sources while making the paths absolute.

So for cc_library this would look like this:

func (cl *CLib) Installs() map[string]string {
    exports := make(map[string]string)
    libName := fmt.Sprintf("%s.a", cl.Name)

    exports[filepath.Join("lib", libName)] = libName

    return exports
}

Our finished npm example looks like this:

package js

import (
    "crypto/sha1"
    "fmt"
    "log"

    "io"
    "path/filepath"
    "strings"
    "time"

    "sevki.org/build"
    "sevki.org/build/ast"
)

func init() {
    if err := ast.Register("npm_package", NpmPackage{}); err != nil {
        log.Fatal(err)
    }
}

type NpmPackage struct {
    Name         string   `npm_package:"name"`
    Version      string   `npm_package:"version"`
    Dependencies []string `npm_package:"deps"`
}

func (npm *NpmPackage) GetName() string {
    return npm.Name
}

func (npm *NpmPackage) GetDependencies() []string {
    return npm.Dependencies
}


func (npm *NpmPackage) Hash() []byte {
    h := sha1.New()
    io.WriteString(h, npm.Name)
    if npm.Version != "" {
        io.WriteString(h, npm.Version)
    } else {
        fmt.Fprintf(h, "%d", time.Now().UnixNano())
    }
    return h.Sum(nil)
}

func (npm *NpmPackage) Build(c *build.Context) error {
    if npm.Version == "" {
        return fmt.Errorf("NPM package %s failed to install, no version string")
    }
    params := []string{}
    params = append(params, "install")
    params = append(params, fmt.Sprintf("%s@%s", npm.Name, npm.Version))
    c.Println(strings.Join(append([]string{"npm"}, params...), " "))
    if err := c.Exec("npm", nil, params); err != nil {
        c.Println(err.Error())
        return err
    }
    return nil
}

func (npm *NpmPackage) Installs() map[string]string {
    installs := make(map[string]string)
    path := filepath.Join("node_modules", npm.Name)
    installs[path] = path
    return installs
}

By Sevki

Build internals

9 January 2016

Overview

Build is composed of a bunch of parts: a lexer, parser, preprocessor, processor, and post-processor, to be exact. All of these are part of what I call the builder (as in graph builder), which, you guessed it, builds the build graph.

Lexer

The lexer is not really that sophisticated in terms of what I do with it; it is practically Rob Pike's lexical scanning in Go.

Parser

The parser gets all these tokens and turns them into functions and variables: a file's AST representation, which looks like this

type File struct {
    Path  string
    Funcs []*Func
    Vars  map[string]interface{}
}
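
For instance, a BUILD file that starts with load("//sys/src/FLAGS", "CFLAGS") might parse into something like this (the values are made up; only fields used elsewhere in this post appear):

f := &ast.File{
    Path: "sys/src/libc",
    Funcs: []*ast.Func{
        {Name: "load", AnonParams: []interface{}{"//sys/src/FLAGS", "CFLAGS"}, Line: 1},
        // ... one *ast.Func per function call in the file
    },
    Vars: map[string]interface{}{},
}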

PreProcessor

At this point in the lifecycle of a build file we have a rudimentary AST, so we can do things like check whether a file has duplicate load functions.

func processDupeLoad(f *ast.File) error {
    seenFile := make(map[string]*ast.Func)
    for _, function := range f.Funcs {
        if function.Name != "load" {
            continue
        }
        var fileName string
        switch param := function.AnonParams[0].(type) {
        case string:
            fileName = param
        default:
            errorMessage := `load must always be in this form 'load("//foo/bar/FILE", "EXPORTED_VALUE_A", "EXPORTED_VALUE_B")'`
            log.Fatal(errorMessage)
        }

        if before, seen := seenFile[fileName]; seen {
            return fmt.Errorf("'load' function in file %s, loads from same file %s twice. try merging load functions on line %d and %d.",
                filepath.Join(f.Path, function.File),
                fileName,
                function.Line,
                before.Line,
            )
        } else {
            seenFile[fileName] = function
        }
    }
    return nil
}

and fail the compiler with a helpful message like so:

2016/01/09 17:34:18 error processing document: 'load' function in file /Users/sevki/Code/harvey/sys/src/libthread/BUILD, loads from same file //sys/src/FLAGS twice. try merging load functions on line 2 and 1.

We can also check for other stuff, like duplicate functions that describe the same target, and so on.

Processor

At this stage we are still dealing with an AST; the processor is what takes all the ast.Func objects and returns their respective graph objects.

There are essentially two types of functions: those that return values and those that become build targets. Functions like glob, load, and version are processed right inside the processor.

Caveat: files that are added through the glob function are an exception to the path-absoluting mechanism in the post-processor.

All the non-returning functions are then valmorphanized into their respective target interfaces, and that's how they will spend the rest of their lives.

PostProcessor

At this stage all the targets are in their Go struct forms. The last two things build does are processing the dependencies (which is pretty straightforward: targets that start with : get the path attached to them) and processing the paths, which involves more work.

For instance, the syn target in harvey installs an x.tab.c file, which rc lists in its srcs field; but because that file is also in the map of things installed from the syn target, the post-processor doesn't absolute its path.

How is this OK? Doesn't build take a hash of the files?

Yes, but a node's hash is determined by its dependencies, so if the file produced by the target has changed, so will the hash of the dependency node and of every other node that depends on it.

By Sevki

Build as a Library

20 December 2015

I looked up my first commit

commit 2701e0c5126574a41e8abcbbe07ea81a74e51d7b
Author: Sevki Hasirci <s@sevki.org>
Date:   Thu Aug 13 20:57:34 2015 +0300

    first commit

to see how long it has been since I started working on this, and it has been a while: 4 months and some change, to be vague. 4 months is a long time to be working on something in the open-source world, but I have not been able to work on this a lot, since I was jumping through one job interview after another. I ended up getting a job @ Spotify in Stockholm. During bootcamp, talking to fellow engineers, I told them about my new project and people were interested, so I decided to start working on this more.

Objectives

Build's only purpose in life is to provide you with build semantics and abstractions. That's it. So I call it Build as a Library. The fact that it was written in Go is an implementation detail; it very well could have been written in C, Erlang, or Ruby, and sure enough it was previously actually written in Python and Java (x2).

  • Making a lightweight, zero-dependency build system, so we can use it in plan9, harvey, or freebsd, and run it on ARM, PowerPC, or Kubernetes
  • MapReduce'd distributed compilations and tests
  • Better output

Before we go on, I can't stress this enough: this is ALPHA proof-of-concept software (even though the concept has been proven before).

First results and impressions

(excerpts from harvey mail list)

When I started testing this against the old harvey build system I got

CC=x86_64-elf-gcc build libs.json  17.75s user 7.66s system 90% cpu 28.087 total

compilation with build was

CC=x86_64-elf-gcc build //sys/src:libs  26.79s user 12.97s system 221% cpu 17.972 total

(I freaked out at first: how could my version be twice as slow as the old one when I'm doing things in parallel? Then I figured out I'm an idiot and was reading the wrong column; mine is twice as fast as the old one. #winning)

Build introduces a lot of overhead to the compilation process that wasn't there in the original build.go: it hashes all the files, reads the stdout and stderr of processes, and does a bunch of other things. I'm pretty sure some optimisations could yield major perf gains, but I'm not too fussed about those.

So here is what it looks like when you run the old build.go and build.

And even if we had just cut the compilation time in half by parallelising the compilation, that would be a good day in my book; but now minor edits take significantly less time. For instance:

after a minor comment change in any file, the original build.go will still compile in the same time

CC=x86_64-elf-gcc build libs.json  19.96s user 10.71s system 90% cpu 33.994 total

but build will compile it in a second and change

CC=x86_64-elf-gcc build //sys/src:libs  1.01s user 0.43s system 136% cpu 1.056 total

That's where the hashing and caching yield massive improvements.

Hashing Mechanics

Before build builds a target, said target must provide a hash of its input. In the case of a C library, build assumes that gcc, or whatever compiler you are using, is essentially a deterministic hash: for a given input lib, it should always compile the same binary. So in the build target definition it tries to hash as many things as it can, to make the calculated hash as correct as possible; this includes hashing all the files, all the includes, all the parameters, and as many of the copts and link opts as we can.
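
A rough sketch of what that could look like for a C library target; the Sources and CompilerOptions field names are hypothetical, only the idea of hashing the files, the options, and $CC --version comes from build:

// Hash folds the compiler version, every source file, and every
// compiler option into a single sha1.
func (cl *CLib) Hash() []byte {
    h := sha1.New()
    out, _ := exec.Command(os.Getenv("CC"), "--version").Output()
    h.Write(out)
    for _, src := range cl.Sources {
        if f, err := os.Open(src); err == nil {
            io.Copy(h, f)
            f.Close()
        }
    }
    for _, opt := range cl.CompilerOptions {
        io.WriteString(h, opt)
    }
    return h.Sum(nil)
}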

Given that this is a build graph and every target is a node on that graph, we also produce a hash for each node: the target's own hash combined with all of its child dependencies' hashes. So if one of the dependencies is invalidated, every parent of that node will also be invalidated.

so given the situation

a --> b --> c --> d  --> e

if 'e' is invalidated, all of its parents will also be invalidated.
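
A minimal sketch of that propagation (the node type and its fields are hypothetical, not build's actual internals):

// hashNode combines a target's own hash with its dependencies' node
// hashes, so a change in any child changes every ancestor's hash.
func hashNode(n *node) []byte {
    h := sha1.New()
    h.Write(n.target.Hash())
    for _, dep := range n.deps {
        h.Write(hashNode(dep))
    }
    return h.Sum(nil)
}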

However, it's not all roses: at the moment build doesn't hash std libraries or system headers. That's not a problem for harvey, since it doesn't use any of the std stuff, but it is an implementation glitch that can easily be fixed by hashing those too. For instance, when we first run build, cc_library hashes the version of $CC; that sort of technique can also be used to hash the std includes and libs.

So anything that requires a clean build is something build failed to include in its hash function. I am also playing around with libclang, which will absolutely make sure that C/C++ stuff is properly hashed, using the actual hash function to produce the hash as opposed to guessing its validity; but that's not a priority, and it introduces a big dependency. That may just be for the server version.

Notes on go

While I was discussing this project, the subject of lines of code popped up in the context of maintainability:

Project              files       blank      comment           code
bazel                2159        45111      103108            256772
buck                 2836        55583      88187             281625
build                28          424        151               2307

While my code base is not as feature-complete as its Java counterparts, I don't think it will ever reach 100 times its current size.

By Sevki

Why not make?

8 September 2015

This is my evaluation of build systems like Bazel, Buck, and Pants, which I think could improve harvey's current build system, build.go.

Package manager

Whilst building a package manager, the need for a build system arose (because the package manager is going to be a collection of internet-aware, higher-level tools around a build tool). While searching for such a tool, a new thread popped up in harvey-os titled Why not make?. TL;DR: the harvey team ditched mk -- which is what you had to use to build plan9 originally -- for a custom build script written in go, mostly because go and plan9 share ancestry and most harvey contributors are well versed in go.

Why not make?

I don't think I can elaborate on this further than it already has been in that thread, but I'll try anyway. Something Aki said really struck a chord:

Today, Go is the build system for Harvey, with some stuff pushed to json for convenience. For all I care, build.go should be seen as malleable as the json it reads is. Or maybe the json should be viewed as struct literals split out from build.go. Either way, I prefer to think of go as the single fully general underlying system, with build.go being the equivalent of a Makefile.

and with that, and what Ron had said:

Build is fast and can be faster also, build can let us do some interesting things not yet done. For example, we could gather all the .c files for the kernel into one file, compile that, and see if we get benefit from whole program optimization. Easy to write this in Go, hard to do in shell scripts (I've tried) since we want to correctly order the inclusion of files.

(Indeed, on my Lenovo X201, when I time a build I get

106.39s user 19.74s system 92% cpu 2:15.88 total

so there is still room for improvement; in fact, that was my main motivation for doing this.)

{Java, Java} - Choose two

So I started looking at build systems. The state of the art seems to be Buck and Bazel, which were built by ex or current Googlers. Installing Buck and Bazel is a nightmare, since you have to install Java -- you have to go to Oracle's website and download a version of the JDK like it's 1995 -- plus Ant and a bunch of other dependencies. There is no `brew install` for Buck or Bazel; there is an AUR package for Bazel, but that seems to be it from my perspective, since I only use OS X and Arch Linux. (I don't know what the story is on Ubuntu or Debian, and I couldn't care less, because all I care about as a dependency is go, and all harvey devs will have go installed.)

As Álvaro said,

Before build.go we thought in other methods. One of them recent opensourced by Google, but it requieres java. Java is very expensive for now if we want build harvey from harvey before Christmas.

I also wanted to be able to make this extensible,

as the section Adding Custom Rules to Buck states:

As of the writing of this document, the only official way to add rules to Buck is to fork the project and modify the source. We will, at some point, construct a beautiful and elegant extensions API. Until then....

Meanwhile, Bazel at least has an extension mechanism called Skylark, but the main code is still Java, so who cares. OK, I probably shouldn't give Buck so much grief for the extensibility story, because forking is how you would extend build as well, although the extension-writing process is immensely easier, as shall become apparent later in this post.

Enter build

So I started hacking on this thing called build, incidentally the same name given to the harvey build script.

Installing it is as simple as go getting sevki.org/build. If for no other reason, a go-gettable Bazel or Buck is plenty good reason to be doing this.
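
That is:

go get sevki.org/build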

But there are other motivations behind it. For starters, I wanted the build tool to have a web interface; not that it isn't a CLI or requires being used via a browser, but build-bots especially would benefit a lot from being able to know what the dependency graph looks like.

For instance, building harvey is as simple as

build //:harvey

One thing build assumes is that you are always in a git repo, so // is where your .git lives, and everything is relative to that. If you are in a submodule, we assume that you are doing stuff related to that module, and that the submodule also has BUILD files with targets and source file mappings relative to that folder's //.
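
So, from anywhere inside the repo, a target in another directory is addressed from that root; for example, the libs target used earlier lives in the BUILD file under sys/src:

build //sys/src:libs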

To run as a server you add server to the command like so

build server

(I haven't nailed down the server functionality yet -- how the build-bot server should be configured, how timeouts should work at the target-compilation level, and so on and so forth -- which is why this isn't public yet.) In server mode, things start to become interesting. There are two view modes as of this writing: depGraph mode and buildGraph mode. depGraph hashes the target name and does some cutesy coloring to differentiate targets; buildGraph is what is produced after a build, and that's when things start to get interesting. For instance, take the harvey travis build log; in fact, I'll spare you the navigating from here.

I know this is a cheap shot, but you can't really tell which of the targets are acting up, or for what reason, from looking at that image, even if it were full sized; and other, more popular projects like go and docker aren't doing much better either. On the other hand, if you even clumsily stare at the graph below, you'll see that syscallheader caused a chain reaction that failed all the targets that depend on it.

Extending

Let's take mksys, for instance. The entire application hasn't really changed that much, but I'll quickly go through the changed bits. The previous build file looked like

{
    "Library": "klibc.a",
    "Name": "KernelLibc",
    "Pre": [
        "mksys -o sys.h -mode=sys.h $HARVEY/sys/src/sysconf.json",
        "mksys -o . -mode=syscallfiles $HARVEY/sys/src/sysconf.json"
    ]
}

we moved the mksys flags to the build definition by declaring a struct like so

type MkSys struct {
    Name         string   `mk_sys:"name"`
    Mode         string   `mk_sys:"mode"`
    ARCH         string   `mk_sys:"arch"`
    OutPath      string   `mk_sys:"out" build:"path"`
    SysConf      string   `mk_sys:"sysconf" build:"path"`
    Dependencies []string `mk_sys:"deps"`
    Out          []byte
    Source       string
}

Simple enough. The end result is not extremely different from its JSON counterpart, but it looks bazely.

mk_sys(
    name = "syscallheader",
    mode = "sys.h",
    arch = "amd64",
    sysconf = "//sys/src/sysconf.json",
    out = "//sys/src/libc/9syscall/sys.h"
)

mk_sys(
    name = "syscallfiles",
    mode = "syscallfiles",
    arch = "amd64",
    sysconf = "//sys/src/sysconf.json",
    out = "//sys/src/libc/9syscall",
    deps = [
        "syscallheader"
    ]
)

In order to be a target type, MkSys needs to implement the build target interface, which means that it should have a couple of functions:

// Target is for implementing different build targets.
type Target interface {
    Build() error
    GetName() string
    GetDependencies() []string
    GetSource() string
    Reader() io.Reader
    Hash() []byte
}

Build(), obviously, builds the target; everything else that starts with Get is a convenience method. Reader() returns the output log reader (it could be a log file, it could be a byte buffer, whatever you like), and Hash() is for caching.

Caching

Harvey builds really fast, no question about that. Will harvey benefit from caching? Probably not to the extent that it would become a problem for a very long while, but if something is worth porting, it's worth overdoing. And while Ron has his reservations about how gnu make handles it (which, as far as I can gather, is by file modtimes), I think this is not a particularly hard problem to solve. build tries to fix it by hashing everything under the sun: files, arguments for targets, hashes of dependencies, and even the CC version. While it isn't as cheap as stat()ing the file, it is the most effective way I could think of to assure the correctness of builds whilst still increasing the speed dramatically.

build //:harvey > /dev/null  0.10s user 0.07s system 111% cpu 0.149 total

This is the time it takes to hash the entire file tree, the variables, and even the "CC --version".

Concurrency

All target builds are executed in their own goroutines, so one should not assume serial execution of dependencies; they will almost always run in randomium ("mium" is Latin for "not really") order. If a target has to be executed before another target, then by definition it's a dependency, hence it goes in its dependency pile.

Clearly the mechanism for concurrency should be bound by the amount of CPU power you have. There is also no reason the workers should only be distributed across your machine's CPUs; they could also be distributed across a cluster of machines in a data center, and that is something I'm looking to implement in build.
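
A minimal sketch of that bounding, not build's actual scheduler (the node type is hypothetical, and its dependencies are assumed to be built already; runtime and sync are the obvious imports):

// buildAll runs every ready node in its own goroutine, using a
// semaphore channel to cap concurrency at the number of CPUs.
func buildAll(ready []*node) {
    sem := make(chan struct{}, runtime.NumCPU())
    var wg sync.WaitGroup
    for _, n := range ready {
        wg.Add(1)
        sem <- struct{}{} // blocks while NumCPU builds are in flight
        go func(n *node) {
            defer wg.Done()
            defer func() { <-sem }()
            n.target.Build(n.ctx)
        }(n)
    }
    wg.Wait()
}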

Beyond harvey

Caching and parallelism are thrown into this project not because there is a real need for them in harvey, but because I think everyone can benefit from build. There is probably a case to be made for using build to build docker images, vendoring go packages, and so on and so forth. Of Bazel, Buck, and Pants, only Pants has support for go packages, and I feel uneasy about that; building go with python doesn't feel right. Meanwhile, it would be trivial to integrate already great tools like gb, gt, or godep into build; I think any go developer could do it in less than a day.

By Sevki
