Every now and again people ask whether CEPL will support compute, and I’ve always said that it wouldn’t happen for years. The reason is that, because compute is not part of GL’s draw pipeline, I thought there would be a tonne of compute-specific changes that would need to be made. Turns out I was wrong.
Wednesday I was triaging some tickets in the CEPL repo and saw the one for compute. I was about to ignore it when I thought that it would be nice to read through the GL wiki to see just how horrendous a job it would be. 5 minutes later I was rather disconcerted as it looked easy.. temptingly easy.
‘Luckily’ however I really want SSBOs for writing out of compute and that will be hard ..
looks at gl wiki again .. Shit.
Ok so SSBOs turned out to be a much smaller feature than expected as well. So I gave myself 24 hours, from late Friday to late Saturday to implement as many new features (sanely) as I could.
Here are the results:
GL only has one kind of sync object, the fence. CEPL now has support for this too.
You make a fence with
(setf some-fence (make-gpu-fence))
and then you can wait on the fence
optionally with a timeout
(wait-on-gpu-fence some-fence 10000)
also optionally flushing
(wait-on-gpu-fence some-fence 10000 t)
Or you can simply check if the fence has signalled
GL has a range of queries you can use, we have exposed them as structs you can create as follows:
(make-timestamp-query)
(make-samples-passed-query)
(make-any-samples-passed-query)
(make-any-samples-passed-conservative-query)
(make-primitives-generated-query)
(make-transform-feedback-primitives-written-query)
(make-time-elapsed-query)
To begin querying into the object you need to make the query active. This is done with
(with-gpu-query-bound (some-query) ..)
After the scope of with-gpu-query-bound the message to stop querying is in the gpu’s queue, however the results are not available immediately. To check if the results are ready you can use gpu-query-result-available-p or you can use some of the options to pull-gpu-query-result, let’s look at that function now.
To get the results to lisp we use pull-gpu-query-result. When called with just a query object it will block until the results are ready:
We can also say not to wait and CEPL will try to pull the results immediately, if they are not ready it will return nil as the second return value
(pull-gpu-query-result some-query nil) ;; the nil here means don't wait
To use compute you simply make a gpu function which takes no non-uniform arguments and always returns (values) (a void function in C nomenclature) and then make a gpu pipeline that only uses that function.
(defstruct-g bah
  (data (:int 100)))

(defun-g yay-compute (&uniform (woop bah :ssbo))
  (declare (local-size :x 1 :y 1 :z 1))
  (setf (aref (bah-data woop) (int (x gl-work-group-id)))
        (int (x gl-work-group-id)))
  (values))

(defpipeline-g test-compute ()
  :compute yay-compute)
You can then map-g over this like any other pipeline..
(map-g #'test-compute (make-compute-space 10) :woop *ssbo*)
..with one little difference. Instead of taking a stream of vertices we now take a compute space. This specifies the number of ‘groups’ that will be working on the problem. The value has up to 3 dimensions so (make-compute-space 10 10 10) is valid.
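A rough way to reason about how much work a compute space represents: GL runs one invocation per cell of the local size in every group, so the total is the product of the two. The helper below is plain Common Lisp rather than anything in CEPL, and total-invocations is a made-up name used only for illustration.

```lisp
;; Hypothetical helper, not a CEPL function.
(defun total-invocations (groups local-size)
  "GROUPS and LOCAL-SIZE are lists of up to 3 positive integers (x y z).
Returns how many shader invocations GL would run in total."
  (* (reduce #'* groups)
     (reduce #'* local-size)))

;; 10 groups with a 1x1x1 local size, like the map-g call above
(total-invocations '(10) '(1 1 1))        ; => 10

;; a full 3 dimensional compute space
(total-invocations '(10 10 10) '(1 1 1))  ; => 1000
```

With a local size of 1x1x1 each group is a single invocation, which is why gl-work-group-id alone is enough to index the array in the example.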
We also soften the requirements around gpu-function names for the compute stage. Usually you have to specify the full name of a gpu function due to possible overloading e.g. (saturate :vec3) however as compute shaders can only take uniforms, and we don’t offer overloading based on uniforms, there can only be one with a given name. Because of this we allow yay-compute instead of
The eagle eyed of you will have noticed the :ssbo qualifier in the woop uniform argument. SSBOs give you storage you can write into from a compute shader. Their api is almost identical to that of UBOs so I copied-pasted that code in CEPL and got SSBOs working. This code will most likely be unified again once I have fixed some details with binding however for now we have something that works.
This means we can take our struct definition from before:
(defstruct-g bah (data (:int 100)))
and make a gpu-array
(setf *data* (make-gpu-array nil :dimensions 1 :element-type 'bah))
and then make an SSBO from that
(setf *ssbo* (make-ssbo *data*))
And that’s it, ready to pass to our compute shader.
Yeah. So all of that was awesome, I’m really glad to have a feature land that I wasn’t expecting to add for a couple more years. Of course there are bugs, the most immediately obvious is that when I tried the example above I was getting odd gaps in the data in my SSBO
TEST> (pull-g *data*) (((0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 0 0 0 7 0 0 0 8 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)))
The reason for this is that I didn’t understand layout in GL properly. This also means CEPL doesn’t handle it properly and that I have a bug :) See this bug: https://github.com/cbaggers/cepl/issues/193
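For the curious, the gaps are consistent with the std140 rule that the stride of an array gets rounded up to the size of a vec4 (16 bytes), while the lisp side reads the buffer back as tightly packed 4-byte ints. The sketch below is plain Common Lisp that simulates that mismatch under that assumption. It is not CEPL code and all the function names are made up for illustration.

```lisp
;; Hypothetical helpers, not part of CEPL.
(defun std140-int-array-stride ()
  "Byte stride of an int array under std140: the 4 byte element size
rounded up to a multiple of 16."
  (* 16 (ceiling 4 16)))

(defun std430-int-array-stride ()
  "Under std430 an array of scalars keeps its natural 4 byte stride."
  4)

(defun simulate-pull (stride-in-bytes &key (groups 10) (len 100))
  "Pretend each work group I wrote I at byte offset (* I STRIDE-IN-BYTES),
then read the buffer back as LEN tightly packed 4-byte ints."
  (let ((ints (make-list len :initial-element 0)))
    (dotimes (i groups ints)
      (setf (nth (/ (* i stride-in-bytes) 4) ints) i))))

;; What the GPU appears to have written, as seen from lisp
(subseq (simulate-pull (std140-int-array-stride)) 0 12)
;; => (0 0 0 0 1 0 0 0 2 0 0 0)

;; What I was expecting to see
(subseq (simulate-pull (std430-int-array-stride)) 0 12)
;; => (0 1 2 3 4 5 6 7 8 9 0 0)
```

The first result has the same shape as the output of pull-g above, a value every fourth int followed by zeros.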
Fixing this is interesting as it means that, unless we force you to make a different type for each layout e.g.
(defstruct-g bah140 (:layout :std140) (data (:int 100)))
(defstruct-g bah430 (:layout :std430) (data (:int 100)))
..which feels ugly, we would need to support multiple layouts for each type. Which means the accessor functions in lisp would need to switch on this fact dynamically. That sounds slow to me when trying to process a load of foreign data quickly.
I also have a nagging feeling that the current way we marshal struct elements from c-arrays is not ideal.
These things together make me think I will be making some very breaking changes to CEPL’s data marshaling for the start of 2018.
This stuff needs to be done so it’s better we rip the band-aid off whilst we have very few known users.
News on that as it progresses.
All in all though, a great 24 hours. I’m currently learning how to extract std140 layout info from varjo types so that will likely be what I work on next Saturday.
Hi again! This last week has gone pretty well. Shared contexts have landed in CEPL master, the only host that supports them right now is SDL2 although I want to make a PR to CEPL.GLFW as it should be easy to support there also. Glop is proving a little harder as we really need to update the OSX support, I started poking at it but it’s gonna take a while and I’ve got a bunch of stuff on my plate right now.
I started looking at multi-draw & indirect rendering in GL as I think these are the features I want to implement next. However I immediately ran into existing CEPL bugs in defstruct so I think the next few weeks are going to be spent cleaning that up and fixing the struct related issues from github.
AAAAges back I promised beginner tutorials for common lisp and I totally failed to deliver. I had been hoping the Atom support would get good enough that we could use that in the videos. Alas that project seems to have slowed down recently and my guilt finally reached the point that I had to put out something. To that end I have started making a whole bunch of little videos on random bits of common lisp. Although it doesn’t achieve what I’d really like to do with proper tutorials, I hope it will help enough people on their journey into the language.
That’s all for now, Peace.
 although I’m still praying it gets finished
Ah it feels good to have some meat for this week’s writeup. In short transform feedback has landed in CEPL and will ship in the next quicklisp release.
What is it?
Transform feedback is a feature that allows you to write data out from one of the vertex stages into a VBO as well as passing it on to the next stage. It also opens up the possibility of not having a fragment shader at all and just using your vertex shader as if it were a gpu based map function. For a good example of its use check out this great tutorial by the little grasshopper.
How is it exposed in CEPL?
In your gpu-function you simply add an additional qualifier to one or more of your outputs, like this:
(defun-g mtri-vert ((position :vec4) &uniform (pos :vec2))
  (values (:feedback (+ position (v! pos 0 0)))
          (v! 0.1 0 1)))
Here we see that the gl_position from this stage will be captured, now for the cpu side. First we make a gpu-array to write the data into.
(setf *feedback-vec4* (make-gpu-array nil :element-type :vec4 :dimensions 100))
And then we make a transform feedback stream and attach our array. (transform feedback streams can have multiple arrays attached as we will see soon)
(setf *tfs* (make-transform-feedback-stream *feedback-vec4*))
And finally we can use it. Assuming the gpu function above was used as the vertex stage in a pipeline called prog-1 then the code will look like this:
(with-transform-feedback (*tfs*)
  (map-g #'prog-1 *vertex-stream* :pos (v! -0.1 0)))
And that’s it! Now the first result from mtri-vert will be written into the gpu-array in *feedback-vec4* and you can pull back the values like this:
If you add the feedback modifier to multiple outputs then they will all be interleaved into the gpu-array. However you might want to write them into separate arrays, this can be done by providing a ‘group number’ to the
(defun-g mtri-vert ((position :vec4) &uniform (pos :vec2))
  (values ((:feedback 1) (+ position (v! pos 0 0)))
          ((:feedback 0) (v! 0.1 0 1))))
making another gpu-array
(setf *feedback-vec3* (make-gpu-array nil :element-type :vec3 :dimensions 10))
and binding both arrays to a transform feedback stream
(setf *tfs* (make-transform-feedback-stream *feedback-vec3* *feedback-vec4*))
You can also use the same pipeline multiple times within the scope of
(with-transform-feedback (*tfs*)
  (map-g #'prog-1 *vertex-stream* :pos (v! -0.1 0))
  (map-g #'prog-1 *vertex-stream* :pos (v! 0.3 0.28)))
CEPL is pretty good at catching and explaining cases where GL will throw an error such as: not enough vbos (gpu-arrays) bound for the number of feedback targets -or- 2 different pipelines called within the scope of
During this I ran into some aggravating issues relating to transform feedback and recompilation of pipelines, it was annoying to the point that I rewrote a lot of the code behind the defpipeline-g macro. The short version of this is that the code emitted is no longer a top-level closure and also that CEPL now has ways of avoiding recompilation when it can be proved that the gpu-functions in use haven’t changed.
I also found out that in some cases defvar with type declarations is faster than the captured values from a top level closure, even when they are typed. See here for a test you can run on your machine to see if you get the same kind of results.
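If you want to poke at this kind of comparison yourself, something along these lines should do as a starting point. To be clear, this is not the test linked above, just a minimal sketch of the same idea in portable Common Lisp. All the names are made up, and the timings will vary by implementation, machine and optimization settings, so I’m deliberately not quoting any numbers.

```lisp
;; Variant 1: a typed global
(declaim (type fixnum *counter*))
(defvar *counter* 0)

(defun bump-defvar ()
  (declare (optimize (speed 3) (safety 1)))
  (setf *counter* (logand (1+ *counter*) #xFFFF)))

;; Variant 2: a typed value captured by a top level closure
(let ((counter 0))
  (declare (type fixnum counter))
  (defun bump-closure ()
    (declare (optimize (speed 3) (safety 1)))
    (setf counter (logand (1+ counter) #xFFFF))))

(defun time-calls (fn n)
  "Call FN N times and return the elapsed wall clock time in seconds."
  (let ((start (get-internal-real-time)))
    (dotimes (i n)
      (funcall fn))
    (/ (- (get-internal-real-time) start)
       internal-time-units-per-second)))

;; Example usage, try a large N and compare the two:
;; (float (time-calls #'bump-defvar 100000000))
;; (float (time-calls #'bump-closure 100000000))
```

Both functions do the same tiny bit of work so any difference you see should mostly come down to how the variable is accessed.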
Like I said this code is in the branch to be picked up by the next quicklisp release. This feature will certainly have its corner cases and bugs but I’m happy to see this out and to have one less missing feature from CEPL.
Future work on transform feedback includes using the transform feedback objects introduced in GLv4 to allow for more advanced interleaving options and also nesting of the
Next on the list for this month is shared contexts. More on that next week!
 late november or early december
 testing on my mac has given different results: using defvar and packing data in a struct is faster than multiple defvars but slower than the top level closure. Luckily it’s still on the order of microseconds a frame (assuming 5000 calls per frame) but measurable. It’s interesting to see as packed in defvar was faster on my linux desktop ¯\_(ツ)_/¯ I’ll show more data when I have it.
It’s now November which means it’s NaNoWriMo time again, each year I like to participate in spirit by picking a couple of features for projects I’m working on and hammering them out. This is well timed as I haven’t had much time for adding new things to CEPL recently.
The features I want to have by the end of the month are decent multi-context support and single stage pipelines.
This stuff we have already talked about but I have bit the bullet and got coding at last. I have support for non-shared contexts now but not shared ones yet, the various hosts have mildly-annoyingly different approaches to abstracting this so I’m working on finding the sweet spot right now.
Regarding the defpipeline threading issues from last week I did in the end opt for a small array of program-ids per pipeline indexed by the cepl-context id. I can make this fast enough and the cost is constant so that feels like the right call for now. CEPL & Varjo were never written with thread safety in mind so it’s going to be a while before I can do a real review and work out what the approach should be, for now it’s a very ‘throw some mutexes in and hope’ situation, but it’s fine…we’ll get there :p
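To make the ‘small array of program-ids’ idea a bit more concrete, here is a toy version in plain Common Lisp. None of this is CEPL’s actual code: the pipeline struct, *max-contexts* and fake-compile-program are all hypothetical stand-ins, and the ‘compile’ step just hands out a fresh number where the real thing would ask GL for a program object.

```lisp
;; Everything here is a made-up stand-in, not CEPL internals.
(defparameter *max-contexts* 8
  "Assumed upper bound on the number of live contexts.")

(defvar *next-fake-program-id* 0)

(defun fake-compile-program ()
  "Stand-in for compiling and linking a GL program in the current context."
  (incf *next-fake-program-id*))

(defstruct pipeline
  ;; one slot per context id, NIL until that context first uses the pipeline
  (program-ids (make-array *max-contexts* :initial-element nil)
               :type simple-vector))

(defun program-id-for (pipeline context-id)
  "Constant time lookup of the program for CONTEXT-ID, compiling lazily."
  (let ((ids (pipeline-program-ids pipeline)))
    (or (svref ids context-id)
        (setf (svref ids context-id) (fake-compile-program)))))
```

The lookup is a single svref on the hot path, which is what keeps the per-call cost constant. The sketch does nothing about thread safety, in this toy that is fine as long as each context id is only ever used from its own thread.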
One side note is that all lambda-pipelines in CEPL are tied to their thread so don’t need the indirection mentioned above :)
Single Stage Pipelines
This is going to be fun. Up until now if you wanted to render a fullscreen quad you needed to:
- make a gpu-array holding the quad data
- make a stream for that gpu-array
- make a pipeline with:
  - a vertex shader to put the points in clip space
  - a fragment shader
The annoying thing is that the fragment shader was the only bit you really cared about. Luckily it turns out there is a way to do this with geometry shaders and no gpu-array and it should be portable for all GL versions CEPL supports. So I’m going to prototype this out on the stream on Wednesday and, assuming it works, I’ll make this into a CEPL feature soon.
That covers making pipelines with only a fragment shader of course but what about with only a vertex stage? Well transform feedback buffers are something I’ve wanted for a while and so I’m going to look into supporting those. This is cool as you can then use the vertex stage for data processing kinda like a big pmap. This could be handy when your data is already in a gpu-array.
With transform feedback support a few possibilities open up. The first is that with some dark magic we could run a number of variants of the pipeline using transform feedback to ‘log’ values from the shaders, this gives us opportunities for debugging that weren’t there before.
Another tempting idea (which is also easier) is to allow users to call a gpu-function directly. This will
- make a temporary pipeline
- make a temporary gpu-array and stream with just the arguments given (1 element)
- run the code on the gpu capturing the result with a temporary transform-feedback buffer or fbo
- convert the values to lisp values
- dispose of all the temporaries
The effect is to allow people to run gpu code from the repl for the purposes of rapid prototyping. It obviously is useless in production because of all the overhead but being able to iterate in the repl with stuff like this could really be great.
That’ll do pig
Right, time to go.
This last weekend I put a little time into multi-context support in CEPL.
CEPL has its own context (cepl-context) class that holds both the gl-context handle and also state that is cached to improve performance. cepl-contexts are passed implicitly down the stack and are tied to a single thread.
Most of the work was just finding simple errors in my code and shoring them up, but I did find one tricky case and that was in pipelines. So a pipeline is usually defined in a top level declaration like so:
(defpipeline my-pipeline () :vertex some-gpu-function :fragment some-other-gpu-function)
And this generates all the bootstrapping to compile the gpu functions, get the GL program-id, etc. However that program-id is a GL resource and belongs to a single GL context. As it is right now it’ll be the context that calls this pipeline first..ew.
So how to tackle this? We could create one program-id per context, however this means either looking up the program-id based on the context per call in a pipeline local cache..or looking up the program-id in a context local cache based on the pipeline. Neither is great, as extra lookups per call are something we should be avoiding.
Another option is to have shared GL contexts. This is nice anyway as it means we can share textures/buffers/etc between threads which I think is a nice default behavior. However even with this solution there are still issues with pipelines.
The state of a gl program object is naturally shared between the two threads too, that state includes which uniforms are bound, so if two threads try to use the same pipeline with different uniforms then we are in a fun data-racey land again.
This seems to lead back to the ‘gl program per gl context’ thing again. I’ll ponder this some more but I think it’s the only real option.
Happy to hear suggestions too,
I think that’s all for now
 in the future I expect I will allow multiple GL contexts per CEPL context
 or explicitly if you prefer
This last week hasn’t seen much exciting code so there isn’t too much to write up.
I’m still dreaming up some way to wrangle data in my games in a way that maximizes performance whilst keeping live redefinition intact, however this isn’t even fully formed in my head yet so there is no code to show or even speak of. However I’ve been increasingly interested in relational databases recently. The fact that you only define the layout of your table data and queries, and that the system just works out what other passes and intermediate data-structures it needs to work best is pretty sweet. You can get a free book on mssql query optimizer here.
CppCon is also out, here are a few good talks I’ve been watching so far:
- Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”
- Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler’s Lid”
- P. McKenney, M. Michael & M. Wong “Is Parallel Programming still hard?” (spoiler..yes)
- Olivier Giroux “Designing (New) C++ Hardware”
I’ve also just had a book on Garbage Collection delivered. YAY! It’s another one of those amazing computer systems where you get to directly impact people, but without having to deal with horrible human factors (like unicode & dates & BLEEEEGHH). I’m pretty stoked to work through this book.
Other than this researchy stuff I’ve still been streaming. Last week we played with a physics engine and tonight we are going to implement chromatic aberration :) I’m pretty happy with where the streaming has been going, the nerve wracking part of the process these days is finding things I can do in the two hours rather than the stream itself.
That’ll do for now, seeya next week
My lack of focus over the weekend was disappointing so I haven’t got much to report. The one thing I did get done however was to finish adding types to my WIP lisp bindings for the newton-dynamics physics engine. This was motivated by the fact that although I had got the basics working a while back, I had seen some overhead from the lisp code; that should be minimized now.
I think I might try using the physics bindings on this week’s stream. Could be fun.
Other than that I’ve been reading and procrastinating. This book is now in my ‘to read’ list, I have no desire to make a proper database but I’m super interested in how their query planner/optimizers work.
That’s all for now, seeya!
This weekend I put a bit of time into Sketch which I, to my shame, have not worked on in a while. Sketch is a lovely project by Vydd which looks to sit in a similar place to processing, but in the lisp world.
A while back I was approached to look into porting it to CEPL so we could have the shader development process of CEPL in Sketch. We started by monkey-patching CEPL in, which provided a fantastic test case for performance and resulted in some big refactoring and wins back in July.
Sketch was previously built on the excellent sdl2kit but there aren’t enough hooks in the projects to have them work together yet so I’m currently replacing the bootstrapping. I stripped down a bunch of code and have a test which shows things are rendering so that’s a start. However CEPL’s support for multiple contexts is untested so this project is really gonna force me to implement that well which is AWESOME. Incidentally sketch was the project that forced me to add CEPL’s multi window support (which will also get more robust as I port this).
Other than that I’m busy with other projects and ideas that may become stuff in the future, I’ve got so much to learn :) This last week has seen me binging on xerox parc related research talks (mainly smalltalk stuff) which has been building up a nice healthy level of dissatisfaction. I have proto-ideas rocking around with big ol’ gaps in their narratives, so I’m just pushing a load of chunks of software dna into my head in the hope that some aberrant collision will result in some useful mental genesis. TLDR feed brain hope to shit ideas.
That’ll do for this post.
Writing shaders (in lisp or otherwise) is fun, however debugging them is not. Where on the CPU we get exceptions or error codes, on the gpu we get silence and undefined behavior. I really felt this when trying (and failing) to implement procedural terrain generation on the livestream. I tried to add additional outputs so that I could inspect the values but it was very easy to make a mistake and change the behavior of the shader..or worse to forget it was there and waste time debugging a side effect from the instrumentation. I need a more reliable way to get values back to the CPU. Luckily CEPL has some great places we can hide this logic.
Quick recap, in CEPL we define GPU functions and then compose them into a pipeline using
(defpipeline-g some-pipeline () (vertex-stage :vec4) (fragment-stage :vec2))
This is a macro that generates a function called some-pipeline that does all the wrangling to make the gl draw call. You then use it by using
(map-g #'some-pipeline vertex-data)
This is another macro that expands into some plumbing and (ultimately) a call to the
Putting aside other details what we have here is 2 places we can inject code, one in the function body and one at the function call-site. This gives us tonnes of leverage.
My goal is to take some gpu-function like this:
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2))
  (+ (* (texture tex (- tc offset)) 0.3125)
     (* (texture tex tc) 0.375)
     (* (texture tex (+ tc offset)) 0.3125)))
And add calls to some function we will call
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2))
  (+ (peek (* (texture tex (peek (- tc offset))) 0.3125))
     (* (texture tex tc) 0.375)
     (* (texture tex (+ tc offset)) 0.3125)))
Peek will capture the value at that point and make it available for inspection from the CPU side of your program.
The way we can do it is to:
- compile the shader normally (we need to do this anyway)
- inspect the AST for calls to peek and the types of the argument
- create a new version of the shader with peek replaced with the instrumenting code
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2))
  (let (((dbg-0 :vec2))
        ((dbg-1 :vec4)))
    (+ (setf dbg-1 (* (texture tex (setf dbg-0 (- tc offset))) 0.3125))
       (* (texture tex tc) 0.375)
       (* (texture tex (+ tc offset)) 0.3125))
    (values dbg-0 dbg-1)))
This code will work mostly the same way except that it will be returning the captured values instead of the original one. I say ‘mostly’ as now the code that doesn’t contribute to the captured values is essentially dead code and it is likely that the GLSL compiler will strip chunks of it.
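The rewriting step can be sketched in a few lines of plain Common Lisp. This is only a toy: it walks the raw s-expression rather than Varjo’s AST, so it knows nothing about the types of the peeked values (which the real thing needs for the let bindings), and instrument-peeks is a name I’ve made up for this post.

```lisp
;; Hypothetical sketch, this is not how CEPL or Varjo implement it.
(defun instrument-peeks (form)
  "Replace every (peek x) in FORM with (setf dbg-N x), numbering from the
innermost peek outwards. Returns the new form and the list of dbg names."
  (let ((names '()))
    (labels ((walk (x)
               (cond
                 ((atom x) x)
                 ((eq (first x) 'peek)
                  ;; walk the argument first so inner peeks get lower numbers
                  (let* ((inner (walk (second x)))
                         (name (intern (format nil "DBG-~a" (length names)))))
                    (push name names)
                    `(setf ,name ,inner)))
                 (t (mapcar #'walk x)))))
      (let ((new-form (walk form)))
        (values new-form (reverse names))))))

;; Example usage with the body of qkern:
;; (instrument-peeks
;;  '(+ (peek (* (texture tex (peek (- tc offset))) 0.3125))
;;      (* (texture tex tc) 0.375)
;;      (* (texture tex (+ tc offset)) 0.3125)))
```

The second return value is what you would use to build the let bindings and the final (values ..) form, once you have the type for each name.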
So now we have an augmented shader stage as well as the original, defpipeline-g can generate, compile and store these and on each map-g it can make 2 draw calls. First the debug one capturing the results using transform-feedback (for the vertex stages) and FBOs for the fragment stage. Because map-g is also a macro we use it to implicitly pass the thread-local ‘CEPL Context’ object to the pipeline function. This lets us write debug values into a ‘scratch’ buffer stored on the context making the whole process transparent.
With this data available we can then come up with nice ways to visualize it. Just dumping it to the REPL will usually be a bad move as a single peek in a fragment shader is going to result in a value for every fragment, which (at best) means 2073600 values for a 1920x1080 render target.
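One cheap alternative to dumping everything is to boil the captured values down to a few numbers before printing. A minimal sketch in plain Common Lisp, assuming the peeked values have already been pulled back into a lisp sequence of reals (summarize-peek is a made-up name, nothing like it exists in CEPL yet):

```lisp
;; Hypothetical helper, not part of CEPL.
(defun summarize-peek (samples)
  "SAMPLES is a non-empty sequence of reals.
Returns the minimum, maximum and mean as three values."
  (values (reduce #'min samples)
          (reduce #'max samples)
          (/ (reduce #'+ samples) (length samples))))

;; How many values a single peek produces for a 1920x1080 target
(* 1920 1080)                   ; => 2073600

;; Three numbers are a lot easier to read than two million
(summarize-peek '(1 2 3 6))     ; => 1, 6, 3
```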
There are a lot of details to work out to get this feature to work well, however it could be a real boost in getting real data back from these pipelines and can work on all GL versions CEPL supports.
Seeya next week, Peace.
: transform feedback only works from the last implemented vertex stage, so if you have vertex, tessellation & geom stages, only geom can write to the transform feedback buffer.
: Another option was to compile the lisp like shader language to regular lisp. However implementing the GLSL standard library exactly is hard and it’s impossible to capture all the gpu/manufacturer specific quirks.
Over the weekend I got a little lisping done and was working on something that has been rolling around my head for a couple of years.
During the standardization process of lisp as well as agreeing on what would go in, there were also things cut. Some of those things have become de facto standards as all the implementations ship them, however some seem rather fundamental.
One of the more fundamental ones that didn’t make it was the idea of introspectable (and extensible) environment objects.
The high level view goes something like this: An environment object is a set of lexical bindings, having access to this (and any metadata about those bindings) would allow you to do more semantic analysis of the code. Given that any macro is allowed access to the environment object when evaluated this would allow a macro to expand differently depending on the data in the environment.
For example let’s say that we use the environment to store static type information; we could then potentially optimize certain function calls within that scope using this information (like using static dispatch on a generic function call).
Now a while back stassats kindly shared a snippet which allows you to essentially define a variable whose value is available during macro expansion. Over the weekend I’ve been playing with this to provide a way to allow the following.
(defmacro your-macro-here (&environment env x y)
  (with-ext-env (env)
    (augment-environment env :foo 10)
    `(+ ,x ,y)))
So you can wrap the code in your macro in with-ext-env and this lets you get access to a user-extensible environment object. We would then provide functions (like augment-environment) to modify the environment, in the above code to store the value 10 against the key :foo however we could use this for type info.
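To give a feel for how data can ride along in the macro environment using nothing but standard CL, here is one well known trick: stash the value as the expansion of a symbol-macrolet and fetch it again with macroexpand-1 and the &environment object. I am not claiming this is the snippet mentioned above or how with-ext-env works, it is just a minimal sketch of the general idea and every name in it (with-static-info, static-info, add, %number-type) is made up.

```lisp
(defmacro with-static-info ((key value) &body body)
  "Record VALUE (unevaluated) under the symbol KEY for any macro
expanded inside BODY."
  `(symbol-macrolet ((,key ,value))
     ,@body))

(defmacro static-info (key &environment env)
  "Expand to the quoted value stored under KEY, or NIL if there is none."
  (multiple-value-bind (expansion found) (macroexpand-1 key env)
    (if found `',expansion nil)))

(defmacro add (x y &environment env)
  "A macro that expands differently depending on what the environment says."
  (if (eq (macroexpand-1 '%number-type env) 'fixnum)
      `(the fixnum (+ (the fixnum ,x) (the fixnum ,y)))
      `(+ ,x ,y)))

;; Inside the scope ADD sees the type info and emits the declared version
(defun add-with-info (a b)
  (with-static-info (%number-type fixnum)
    (add a b)))

;; Outside of it we get the plain expansion
(macroexpand-1 '(add 1 2))  ; => (+ 1 2)
```

The obvious limitation is the one described below: this only carries what we put in ourselves, it knows nothing about the standard declarations made in the surrounding code.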
The downsides are that we don’t get all the data that was potentially available in the proposed feature. I’d really like to have access to all the standard declarations made as well as our additional ones.
Luckily it’s possible to make a new CL package with modified versions of
labels, etc and in those to capture the metadata and make it available to our extensible environment.
With this we may be able to make something that convincingly does the job of an extensible macro environment. I have made a prototype of the meat (the passing of the environment itself) and so next is to wrap the other things into a package and then see if it is useful.
Other than this I’ve been poking around a little with Unity. It’s fun to see how it’s done in the big leagues and to see where one’s approach aligns and diverges from a larger player’s philosophy.
That’s all for now,