I’ve always wanted a type system that would keep me from exposing
sensitive data with minimal developer effort, so I took a crack at
designing it for Go. It requires two new container types
(inproc[T]
and xproc[T]
) and a global function
that are all removed at compile-time, but results in a nearly zero-cost
data privacy setup. Besides reflection-based checks, no runtime changes
are required.
This is somewhere between a thought experiment and a proper proposal for inclusion in a language. It’s certainly not ready for an RFC anywhere, but there’s perhaps a kernel of value here.
There’s a class of data in many applications that’s sensitive and/or secret, but must still be accessible in plain-text for the application to work. The space is unbounded so a few examples help:
In any scenario, this information is required for the application to function. Clearly, users can’t log in if we make password hashes unreadable on the server, and we can’t display a user’s shipping address if their addresses aren’t readable. Users can’t perform any actions after logging in if we make auth tokens unreadable on the client. And we can’t decrypt the user’s password store without accepting their passphrase.
The trouble comes in with the handling of these values. It’s incredibly easy to include one in a log line that gets shipped off to a separate machine, where it’s indexed, replicated, and archived. Whether that machine is in your control or not is irrelevant: sensitive data has just leaked.
Just how easy that is depends on the languages and frameworks
involved, but printf
debugging remains relevant because
it’s so dang effective. And sometimes you just need to log values in
production to see why a bug doesn’t reproduce in a pre-prod
environment.
For a contrived example, consider a user session object in Go:
type UserSession struct {
string
Username string
pwHash .Time
expires time}
By default, you’re one fmt.(Sp|P)rintf("%v\n", session)
away from including a user’s password hash in your logs, writing it to a
file, or returning it in a server response.
Type systems as they exist today are unable to prevent this kind of
bug. A string
containing a user’s name is indistinguishable
from the string
containing their salted+hashed password at
compile-time. While some tools and techniques exist to mitigate these
issues, they’re at-best difficult to set up and at-worst (and most
commonly) not attempted at all.
Languages have recently taken to adding compile-time guarantees for concurrency-related bugs. Like the concurrency-awareness baked into Rust and Swift (among others), I’d argue there’s significant benefits to be realized if type systems were aware of process boundaries and could prevent values from being exfiltrated from processes.
First, a brief detour for justification.
The “Internet commenter”-obvious solution is to tell developers to
just… not do these kinds of things. Don’t log structures that contain
sensitive values; don’t log values you don’t fully understand; don’t
rely on printf
debugging; don’t deploy code without reading
through responses.
And they’re right! Those are of course valid solutions. That same advice applies to other forms of safety that we as developers have come to rely on:
_
”At some point in time, developers have agreed that a subset of these “Just don’t (…)” techniques weren’t enough. We wrote tools to protect our users from our own forgetfulness and made those tools available by default in many cases. We’ve been able to spend our newfound attention on more complicated problems as a result.
Protecting sensitive values — both “our” values and our users’ data — is due for this kind of tooling.
This is by no means an unsolvable problem using current techniques, but each of those have at least one significant drawback.
I recognize that I’m likely missing some approaches here. Please let me know.
By far the most common approach is to structure an application such
that it’s harder to accidentally leak this kind of information. Access
modifiers can allow only the database module to access database
credentials and DB session info. This, naturally, limits the space for
accidental disclosures of those credentials to the database module as
well. Wrong code certainly looks “more wrong” during review, but a
revision that logs credentials still builds, still passes test cases,
etc. This especially breaks down when dealing with credentials passed
via environment variable. Once read via os.Getenv
,
sensitive environment variables are still just loggable strings.
A similar option is to control how structures are serialized. Go’s
MarshalJSON
function can be implemented for any struct
that’s serialized with encoding/json
, allowing developers
to exclude fields that shouldn’t be logged. Ditto adding
toJSON()
for a JavaScript class, implementing the
ToString
and Debug
traits in Rust, etc.
Logger-specific functions, like MarshalZerologObject
from
Go’s github.com/rs/zerolog
, allow fields to be excluded
with specific logging libraries as well. And for more general
data-dumping purposes, the fmt.Formatter
and
fmt.GoStringer
interfaces can interface can be implemented
on a struct for additional control. These unfortunately must be
implemented manually for each type containing sensitive data, and they
have no influence over non-logging contexts.
Combined, implementing MarshalJSON
,
MarshalZerologObject
, Format
,
ToString
, and GoStringer
(or their equivalents
in other languages) cover most cases. Go’s
fmt.Sprintf("%s", some.StructField)
remains vulnerable
however, and the resulting string can be freely logged, written to
files, combined with other values, etc. Developers must manually know
not to call that. We’ve shrunk the vulnerable surface, but it’s still
possible to leak data.
Following access modifiers and custom serialization is the approach
of building domain-specific types. In our UserLogin
example
above, we’d replace pwHash: string
with
pwHash: PasswordHash
— that is, we’d use a type that wraps
a string
serializes to a static string (or nothing):
class PasswordHash {
private readonly hash: string;
constructor(hash: string) {
this.hash = hash;
}
toJSON(): string {
return "[redacted]";
} }
This combines both the benefits and limitations of its source techniques though, so it’s vulnerable to the same issues.
IO
MonadHaving read about Haskell’s IO
Monad for the first time
just a few days ago, I’m quite clearly an expert when it comes to its
limitations. As such an monadic sage — with minutes of research
and zero experience — I can confidently say:
The
IO
Monad appears to be solving a different yet mildly related problem and probably wouldn’t make an impact. I think. Probably.
An IO String
in Haskell is needed to solve compile-time
and lazy-evaluation constraints with externally-provided bits of text,
but doesn’t limit what actions can be taken with such a value once it’s
been obtained. IO String
s can still be printed, logged,
sent in network requests, etc.
There may be another related technique from Haskell and its relatives, but I haven’t found it yet. Do let me know if you find one :)
A naive linter can pretty easily detect calls to
fmt.Print*
(and its wrappers) that reference variables and
fields of known name, e.g.:
.Printf("Auth token: %q", sess.AuthToken)
fmt// ~~~~~~~~~~~~~~
// Lint error: cannot print field named "AuthToken"
This is quite easy to accidentally bypass. If that auth token is used
for Bearer authentication, it’s reasonable to prefix it with
"Bearer "
early:
:= "Bearer " + sess.AuthToken
bt .Printf("Auth token: %q", bt)
fmt// ~~
// No error. "bt" isn't in the denylist, and string concatenation
// hides the presence of sess.AuthToken.
Besides the list of names being unbounded and easily gamed, a denylist linter is unlikely to be effective beyond a trivial program. And once folks are used to relying on it to catch this case, bypassing the denylist becomes trivial. In other words, a “the linter didn’t catch this, it’s probably fine” mentality.
Taint tracking is not a new concept in programming and often comes up when I ask about this topic. Bugs for Perl’s perlsec pod date back to at least September 2000, Ruby had a taint-tracking mechanism until version 2.7, and some Googlers released the experimental (and unfortunately seemingly-defunct) go-flow-levee. This seems to be the closest I’ve found to process-aware types, but it’s unfortunately not flexible enough for our “don’t log sensitive values” use-case.
Taint tracking tools — whether provided by default in a language or in external libraries/binaries — are all security-focused. They deal in the domain of trusted and untrusted values, and strive to prevent the use of externally-provided values in privileged ways. Attacks like SQL Injection, poison-pilled requests, and missing input validation are all well within-scope for a taint tracker. None of the available tools prevent the exposure of values before or after sanitization, possibly because printing or writing to files are considered low-risk. In fact many values we’d like to avoid exposing have known and trusted sources. For example, hashed passwords in our databases or customer fraud risk scores are inherently trustworthy but should absolutely not be logged anywhere. Taint tracking as it exists today is largely unable to help. Where they’re flexible enough to be helpful (e.g. Semgrep’s pattern-based taint tracking), they tend to be complicated to set up, easy to ignore (due to being run in a separate build pass), or difficult to tune as desired.
inproc
and xproc
type modifiersOn a POSIX system, there’s surprisingly few ways to leak data outside the bounds of a process:
stdout
/stderr
, etc.)connect
)mmap
;
overlaps with 1, but looks like raw memory manipulation)shmat
or
similar)At any higher level beyond C(ish), developers typically see:
Go has io.Writer
, node
has
fs.WriteStream
and net.Stream
, etc. that cover
almost everything.
What if we could limit Go’s fmt.Printf
to only values we
know at compile time are safe to write to stdout
?
How would we approach MarshalJSON
, given that it produces
new values (a byte slice) based on existing ones?
Let’s consider a pair of new type modifiers, inproc
and
xproc
, and feel out the impacts they’d have.
inproc[T]
T
that should not be allowed to cross
process boundaries without manual intervention.
xproc[T]
T
that can cross process boundaries with no
additional effort.
For now, we’ll model these as Go generics. As will become clear below, these don’t qualify as proper Go generics and have some special compile-time properties. The syntax is convenient for demonstration purposes though.
T
,
inproc[T]
, and xproc[T]
The proposed inproc
and xproc
container
types are generic, with no notable additional contraints. They can be
thought of as cost-free containers (costs consisdered below) or simply
tagged subtypes of T
.
They do however have unique variance rules, to ensure an
inproc[T]
can’t easily be smuggled out via a regular
T
. To ensure backwards-compatibility with existing Go code,
T
is equivalent to xproc[T]
. I do appreciate
the clarity that an explicit xproc[T]
provides, so it
remains in this proposal.
Binary expressions between T
, inproc[T]
,
and xproc[T]
should be allowed as a convenience for
developers, but the result must preserve the most specific and
restrictive of the two types.
Consider the type of c
for the statement
c := a + b
, given certain types for a
and
b
:
a | b | c |
---|---|---|
inproc[T] |
inproc[T] |
inproc[T] |
inproc[T] |
T |
inproc[T] |
inproc[T] |
xproc[T] |
inproc[T] |
xproc[T] |
T[T] |
xproc[T] |
Similar to taint tracking
above, the inproc
container type “poisons” the result of
any binary expression its involved it. The validity of
inproc[T] + inproc[U]
(that is, the addition of two
different types wrapped in inproc
) is unchanged here. If
they were +
able before, they still are; if they weren’t
before, they still aren’t.
Since inproc[T]
is more specific than
xproc[T]
, binary expressions have covariant arguments of
type T
. The same applies for xproc[T]
and bare
T
.
Basically the same rules apply to assignments. For example,
x = y
for various declarations of x
and types
of y
:
Declaration | Type of y |
x = y Allowed? |
---|---|---|
var x inproc[T] |
inproc[T] |
Yes |
var x inproc[T] |
xproc[T] |
Yes |
var x inproc[T] |
T |
Yes |
var x xproc[T] |
inproc[T] |
No - leaks y |
var x xproc[T] |
xproc[T] |
Yes |
var x xproc[T] |
T |
Yes |
var x T |
inproc[T] |
No - leaks y |
var x T |
xproc[T] |
Yes |
var x T |
T |
Yes |
And the same for function arguments. For various signatures of a
function f
and the types of input parameters passed to it,
we get roughly the same table. For simplicity, the placeholder type
T
is reified to string
in this example.
Signature | Type of b |
f(b) Allowed? |
---|---|---|
func f(a inproc[string]) |
inproc[string] |
Yes |
func f(a inproc[string]) |
xproc[string] |
Yes |
func f(a inproc[string]) |
string |
Yes |
func f(a xproc[string]) |
inproc[string] |
No - f can leak b |
func f(a xproc[string]) |
xproc[string] |
Yes |
func f(a xproc[string]) |
string |
Yes |
func f(a string) |
inproc[string] |
No - f can leak b |
func f(a string) |
xproc[string] |
Yes |
func f(a string) |
string |
Yes |
Already, the power of this pattern is beginning to show through.
Function f
can declare that it won’t cause an
argument to leak outside the current process. This is spiritually
similar to using an argument of type const char* a
in C to
declare that the char
array pointed to by a
won’t be modified.
Likely unsurprising, we see these same rules in a function’s return
type. A function that claims it returns inproc[string]
can
return x
where x
has the type
xproc[string]
, but not the inverse. The truth table here is
left as an exercise to the reader.
There’s nothing special about structs in this arrangement. Structs
can be marked inproc
or xproc
separately from
their fields.
So far, these are all pretty trivial rules. Where taint tracking
tends to get complicated is when collection types are introduced.
Without careful static analysis, a map[string]string
could
be used to “juggle” a sensitive value and erase its taint marker.
Go already rejects slices of heterogeneous types without an interface
wrapping them. It follows then that
[](inproc[string] | xproc[string])
would also be invalid.
To optimize for developer simplicity — and in the interests of
side-stepping a Known Hard Problem — this restriction should likely be
maintained outside of Go. There’s few cases where sensitive and
non-sensitive values of the same type should be colocated in a single
slice or map. Cases that do so are already at risk of accidental
disclosure, but are unchanged from a backwards-compatibility
perspective.
There’s one notable restriction applied to the inproc
and xproc
containers: collection types cannot
themselves be marked inproc
or
xproc
.
inproc
/xproc
CollectionsThis seems initially restrictive, but the motivation is minimizing
confusion and false-negatives. There are, in-effect, two ways to handle
inproc[[]T]
and xproc[[]T]
:
The inproc
modifier propagates to its
children. An inproc
slice implicitly contains
inproc
elements. For example, inproc[[]T]
is
implicitly inproc[[]inproc[T]]
. An explicit
inproc[[]xproc[T]]
(an inproc
slice of
xproc
elements) would also be valid.
The inproc
modifier never propagates to its
children. In this approach, inproc[[]T]
from the
previous example is explicitly an inproc
slice of bare
T
elements. This forces developers to read and write
inproc[[]inproc[T]]
to represent an inproc
slice of inproc
values. That’s difficult to parse visually,
so broken down:
inproc[T] - the element type
[]inproc[T] - a slice of the element type
inproc[ []inproc[T] ] - an inproc slice of the element type
In practice, a collection type itself isn’t
sensitive. The length of a slice is rarely sensitive without
also having the contents. Since []inproc[T]
(e.g. []inproc[byte]
for a slice of inproc bytes) already
covers the contents of such a slice, wrapping the collection provides
near-zero value. Disallowing inproc
on collections is a bit
of an arbitrary decision, but it does serve to simplify things
significantly.
It must still be possible to convert from an inproc[T]
to an xproc[T]
. A generated auth token needs to be shown to
users once for it to be useful, and a customer’s shipping address must
be displayed on invoices. It’s recommended that these both be
inproc[T]
values to prevent accidental logging, but these
at some level do need to be serialized and sent over the
wire.
Like Go’s len
or new
, a new global function
larry()
is included here as well.
This function returns whatever it’s provided. It’s a no-op, implemented roughly as:
func larry[V any](sensitive inproc[V]) xproc[V] {
return any(sensitive)
}
This provides the necessary escape-hatch to allow sensitive values to
be intentionally exposed for inclusion in user-facing function calls.
(Assume fmt.Printf
has been modified to accept only
xproc[T]
values).
type UserSession struct {
[string]
Username xproc[string]
pwHash inproc.Time
expires time}
// ...
.Printf("Session created: %q/%q", sess.Username, sess.pwHash)
fmt// ~~~~~~~~~~~
// Compile error: 'inproc[string]' is not assignable to 'xproc[string]'.
// To print this value, use larry().
// Debugging incident 717
.Printf("Session created: %q/%q", sess.Username, larry(sess.pwHash))
fmt// ~~~~~~~~~~~~~~~~~~
// No compile error.
This helps to highlight where sensitive values are intentionally being exposed, and makes wrong code look wrong.
Remember, inproc[T]
and xproc[T]
are not
actually generic containers, even though they’re typed like generics. To
avoid unnecessary allocations or pointer dereferences, these container
types are erased during Go’s IR
construction (“noding”) phase. This is the first possible point
after type checking, and ensures no changes are needed to the Go IR or
any emit phases.
Similarly, the larry
function above also gets completely
optimized away at compile-time.
The result is a nearly zero-cost data privacy setup.
Go’s encoding/json.Marshal
function would need some
small modifications to ensure that inproc[T]
values can be
marshalled properly. The critical behavior change is that
Marshal(v inproc[any])
must produce a slice of inproc bytes
([]inproc[byte]
), to ensure JSON marshalling can’t sneak
past these restrictions.
The default implementation of encoding/json.Marshal
would continue to require reflection. There is some runtime cost to
checking field tags, but this is in-line with the current implementation
of encoding/json
.
Unfortunately, the erasure of inproc[T]
and
xproc[T]
at compile-time render the reflect
package unable to determine the inproc
ness of struct fields
automatically. Struct tags are accessible via the
reflect
package, so this may be solveable via a matching
json
struct tag. This would require extra work for
third-party serializers to integrate, as well a compile-time check
guaranteeing that the struct tag and type agree on
inproc
-ness.
By putting this all together into a reasonably realistic(ish)
example, we can get a better sense of what it feels like to build
something with inproc
available.
// InvoiceBuilder provides a builder pattern for serialized,
// externally-facing invoices.
type InvoiceBuilder struct {
// ...
string
destAddr string
memo }
// WithDestAddr adds the provided destination address to the invoice.
func (ib *InvoiceBuilder) WithDestAddr(a *Address) *InvoiceBuilder {
if a == nil {
return ib
}
.destAddr = "Ships to:\n" + a.String()
ibreturn ib
}
func (ib *InvoiceBuilder) WithMemo(m string) *InvoiceBuilder {
if m == nil || m == "" {
return ib
}
}
// -----
type AddressBook interface {
// Lookup retrieves an address from the database by its ID.
(id inproc[string]) inproc[Address];
Lookup}
type Order struct {
bool
isPhysical [string]
destAddrId inprocstring
customerMemo // The ID of the corresponding transaction in Stripe.
[string]
stripeTxId inproc}
// GetInvoice builds a plain-text representation of a purchase
// invoice for the provided order, looking up additional order
// metadata as needed.
func GetInvoice(ctx context.Contextx, order *Order) string {
:= tracing.Span(ctx, "GetInvoice")
span ("orderID", order.id).
Str("stripeTxId", larry(order.stripeTxId))
Str// ~~~~~
// ① Converts inproc[string] to xproc[string], and allows
// the Stripe transaction ID to be included in a span's
// metadata. We know it's safe to send to our controlled
// tracing infrastructure, but it shouldn't ever be sent
// to users.
defer span.End()
:= new(InvoiceBuilder).
ib (order.seqNum).
WithNumber(order.stripeTxId)
WithMemo// ~~~~~~~~~~~~~~~~
// ② Compile error: 'inproc[string]' is not assignable to 'string'.
//
// Without inproc support, this would accidentally include the
// Stripe transaction ID instead of the customer's order memo.
/* ... */
if (order.isPhysical && order.destAddrId) {
if dst, err := addrBook.Lookup(destAddrId); err != nil {
.WithDestAddr(larry(dst))
ib// ~~~~~
// ③ Converts inproc[Address] to xproc[Address], and
// allows the destination address to be passed to
// InvoiceBuilder.WithDestAddr
}
}
return ib.String()
}
This is of course a contrived example, but there’s three locations worth calling out:
larry
makes
the intention clear, and requires no runtime overhead.inproc
value in a cross-process context. The compile-time
error is a lightweight prompt to the author: should that variable be
exposed?Goodness I sure wish I had a highly-motivating closing section! To reiterate, this is somewhere between a thought experiment and an actual proposal.
I’d love to hear your thoughts — consider this a pre-RFC request for
comments. Perhaps I’m over-indexing on the risks, and this just isn’t
that big of a deal. Maybe I’ve overlooked something major that makes
this approach a non-starter. Am I a big dumdum? (Generally, yeah). Am I
a big dumdum who should have realized this is already built into
$popular_language
?
Feel free to get in touch by: