From the Burrow
It's been a long time
.. how have you been?
- glados
I fell out of the habit of writing these over Christmas so it’s really time for me to start this up again.
Currently I am reading ____ in preparation for my new job but I have also had some time for lisp.
On the lisp side I have been giving a bit more time to Varjo as there are a lot of completeness & stability things I need to work on over there. I’ve hacked in a few extra sanity checks for qualifiers, however the checks are scattered around and it feels sucky so I really need to do a cleanup pass.
In good news I’ve also made a sweep through the CL spec looking for things that are worth adding to Vari. This resulted in this list and I’ve been able to add a whole bunch of things. This really helps Vari feel closer to Common Lisp.
One interesting issue though is around clashes in the spec; let’s take round for example. In CL round rounds to the nearest integer and resolves exact ties towards the even one; in GLSL round also goes to the nearest integer, but the direction for a fraction of exactly 0.5 is implementation-defined. GLSL does provide roundEven which more closely matches CL’s behavior. So the conundrum is, do we define Vari’s round using roundEven or round? One way we may trip up CL programmers, the other way we trip up GLSL programmers, and also anyone porting code from GLSL to Vari without being aware of those rules. We can of course provide a fast-round function, but this doesn’t necessarily help with the issue of discoverability or ‘least surprise’.
I’m leaning towards keeping the GLSL meaning however, simply as we are already not writing true CL and our dialect makes sacrifices in the name of performance already. Maybe this is ok.
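To make the difference concrete, here is a small plain Common Lisp sketch. round is the standard function; round-half-away is a hypothetical helper that models just one of the tie behaviors a GLSL implementation is allowed to pick, so treat it as an illustration rather than what any particular driver does.

```lisp
;; Standard CL: ROUND picks the even neighbour when the value is exactly
;; halfway between two integers.
(defun round-half-away (x)
  "Hypothetical helper: nearest integer, with exact ties going away from zero."
  (if (minusp x)
      (- (floor (+ (- x) 1/2)))
      (floor (+ x 1/2))))

(defparameter *ties* '(1/2 3/2 5/2 -5/2)
  "Values that sit exactly halfway between two integers.")

(defun compare-rounding (&optional (xs *ties*))
  "Return a list of (x cl-round half-away) triples."
  (mapcar (lambda (x) (list x (round x) (round-half-away x))) xs))
```

Calling (compare-rounding) shows the two rules only disagree on the ties whose lower neighbour is even, which is exactly the kind of thing that is easy to miss when porting code.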
That’s all I’ve got for now, seeya all next week.
p.s. No stream this week as ferris is giving a rust talk here in Oslo.
p.p.s I also feel the ‘I haven’t been making lil-bits-of-lisp videos in ages’ guilt so I need to get back to that soon.
Raining Features
Every now and again people ask whether CEPL will support compute and I’ve always said that it wouldn’t happen for years. The reason is that, because compute is not part of GL’s draw pipeline, I thought there would be a tonne of compute-specific changes that would need to be made. Turns out I was wrong.
Wednesday I was triaging some tickets in the CEPL repo and saw the one for compute. I was about to ignore it when I thought that it would be nice to read through the GL wiki to see just how horrendous a job it would be. 5 minutes later I’m rather disconcerted as it looked easy.. temptingly easy.
‘Luckily’ however I really want SSBOs for writing out of compute and that will be hard .. looks at gl wiki again
.. Shit.
Ok so SSBOs turned out to be a much smaller feature than expected as well. So I gave myself 24 hours, from late Friday to late Saturday to implement as many new features (sanely) as I could.
Here are the results:
Sync Objects
GL only has one of these, the fence. CEPL now has support for this too.
You make a fence with make-gpu-fence
(setf some-fence (make-gpu-fence))
and then you can wait on the fence
(wait-on-gpu-fence some-fence)
optionally with a timeout
(wait-on-gpu-fence some-fence 10000)
also optionally flushing
(wait-on-gpu-fence some-fence 10000 t)
Or you can simply check if the fence has signalled
(gpu-fence-signalled-p some-fence)
Query Objects
GL has a range of queries you can use, we have exposed them as structs you can create as follows:
(make-timestamp-query)
(make-samples-passed-query)
(make-any-samples-passed-query)
(make-any-samples-passed-conservative-query)
(make-primitives-generated-query)
(make-transform-feedback-primitives-written-query)
(make-time-elapsed-query)
To begin querying into the object you need to make the query active. This is done with with-gpu-query-bound
(with-gpu-query-bound (some-query)
  ..)
After the scope of with-gpu-query-bound the message to stop querying is in the gpu’s queue, however the results are not available immediately. To check if the results are ready you can use gpu-query-result-available-p or you can use some of the options to pull-gpu-query-result, let’s look at that function now.
To get the results to lisp we use pull-gpu-query-result. When called with just a query object it will block until the results are ready:
(pull-gpu-query-result some-query)
We can also say not to wait and CEPL will try to pull the results immediately, if they are not ready it will return nil as the second return value
(pull-gpu-query-result some-query nil) ;; the nil here means don't wait
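The non-blocking form lends itself to polling between other bits of work. Below is a minimal sketch of that pattern in plain Common Lisp. poll-for-result and its arguments are hypothetical names, not part of CEPL; the only thing it assumes is a function that returns the result as its first value and a 'ready' flag as its second, which is how the call above is described.

```lisp
(defun poll-for-result (pull-fn &key (max-tries 100) (between-tries (lambda ())))
  "Call PULL-FN until its second return value is true or MAX-TRIES is used up.
Returns (values result t) on success and (values nil nil) otherwise.
BETWEEN-TRIES is called after every unsuccessful attempt."
  (loop repeat max-tries
        do (multiple-value-bind (result ready-p) (funcall pull-fn)
             (when ready-p
               (return-from poll-for-result (values result t)))
             (funcall between-tries))
        finally (return (values nil nil))))
```

With CEPL loaded this would be called along the lines of (poll-for-result (lambda () (pull-gpu-query-result some-query nil))), with between-tries doing a slice of whatever else the frame needs.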
Compute
To use compute you simply make a gpu function which takes no non-uniform arguments and always returns (values) (a void function in C nomenclature) and then make a gpu pipeline that only uses that function.
(defstruct-g bah
  (data (:int 100)))

(defun-g yay-compute (&uniform (woop bah :ssbo))
  (declare (local-size :x 1 :y 1 :z 1))
  (setf (aref (bah-data woop) (int (x gl-work-group-id)))
        (int (x gl-work-group-id)))
  (values))

(defpipeline-g test-compute ()
  :compute yay-compute)
You can then map-g over this like any other pipeline..
(map-g #'test-compute (make-compute-space 10)
       :woop *ssbo*)
..with one little difference. Instead of taking a stream of vertices we now take a compute space. This specifies the number of ‘groups’ that will be working on the problem. The value has up to 3 dimensions so (make-compute-space 10 10 10) is valid.
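For a sense of scale: GL runs one invocation per element of the local size for every group, so the total amount of work is the product of the two. The helper below is a hypothetical, plain Common Lisp sketch of that arithmetic and is not a CEPL function.

```lisp
(defun total-invocations (groups local-size)
  "GROUPS and LOCAL-SIZE are lists of up to three positive integers.
Missing dimensions count as 1, since (reduce #'* nil) is 1."
  (* (reduce #'* groups) (reduce #'* local-size)))

;; The example above: a compute space of 10 with a local size of 1x1x1.
(defparameter *example-count* (total-invocations '(10) '(1 1 1)))
```

In the example the x component of gl-work-group-id therefore runs from 0 to 9, which is why the shader touches ten elements of the array.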
We also soften the requirements around gpu-function names for the compute stage. Usually you have to specify the full name of a gpu function due to possible overloading e.g. (saturate :vec3) however as compute shaders can only take uniforms, and we don’t offer overloading based on uniforms, there can only be one with a given name. Because of this we allow yay-compute instead of (yay-compute).
SSBOs
The eagle-eyed among you will have noticed the :ssbo qualifier in the woop uniform argument. SSBOs give you storage you can write into from a compute shader. Their api is almost identical to that of UBOs so I copy-pasted that code in CEPL and got SSBOs working. This code will most likely be unified again once I have fixed some details with binding, however for now we have something that works.
This means we can take our struct definition from before:
(defstruct-g bah
  (data (:int 100)))
and make a gpu-array
(setf *data* (make-gpu-array nil :dimensions 1 :element-type 'bah))
and then make an SSBO from that
(setf *ssbo* (make-ssbo *data*))
And that’s it, ready to pass to our compute shader.
.Phew.
Yeah. So all of that was awesome, I’m really glad to have a feature land that I wasn’t expecting to add for a couple more years. Of course there are bugs, the most immediately obvious is that when I tried the example above I was getting odd gaps in the data in my SSBO:
TEST> (pull-g *data*)
(((0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6
0 0 0 7 0 0 0 8 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)))
The reason for this is that I didn’t understand layout in GL properly. This also means CEPL doesn’t handle it properly and that I have a bug :) (this bug: https://github.com/cbaggers/cepl/issues/193).
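The pattern in that dump is at least consistent with one of the std140 rules: every element of an array is padded out to a 16-byte stride, so an array of 4-byte ints only uses one slot in every four. The sketch below is plain Common Lisp with a hypothetical function name, not CEPL’s marshaling code, and it models only that one rule: values are written with a stride of four ints and then read back as if they were tightly packed.

```lisp
(defun tight-view-of-strided-ints (values &key (stride 4) (size 100))
  "Write VALUES into a SIZE-element int buffer, one every STRIDE slots,
then return the buffer as a reader expecting tight packing would see it."
  (let ((buffer (make-array size :initial-element 0)))
    (loop for v in values
          for slot from 0 by stride
          when (< slot size)
            do (setf (aref buffer slot) v))
    (coerce buffer 'list)))

;; Ten work groups each writing their own index, as in the compute example.
(defparameter *seen-from-lisp*
  (tight-view-of-strided-ints (loop for i below 10 collect i)))
```

The first few elements of *seen-from-lisp* are 0 0 0 0 1 0 0 0 2, the same shape as the gaps above, whereas a stride of 1 gives the ten values back to back.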
Fixing this is interesting as it means that, unless we force you to make a different type for each layout e.g.
(defstruct-g bah140 (:layout :std140)
  (data (:int 100)))

(defstruct-g bah430 (:layout :std430)
  (data (:int 100)))
..which feels ugly, we would need to support multiple layouts for each type. Which means the accessor functions in lisp would need to switch on this fact dynamically. That sounds slow to me when trying to process a load of foreign data quickly.
I also have a nagging feeling that the current way we marshal struct elements from c-arrays is not ideal.
These things together make me think I will be making some very breaking changes to CEPL’s data marshaling for the start of 2018.
This stuff needs to be done so it’s better we rip the band-aid off whilst we have very few known users.
News on that as it progresses.
Yay
All in all though, a great 24 hours. I’m currently learning how to extract std140 layout info from varjo types so that will likely be what I work on next Saturday.
Peace
Little bits of progress
Hi again! This last week has gone pretty well. Shared contexts have landed in CEPL master, the only host that supports them right now is SDL2 although I want to make a PR to CEPL.GLFW as it should be easy to support there also. Glop is proving a little harder as we really need to update the OSX support, I started poking at it but it’s gonna take a while and I’ve got a bunch of stuff on my plate right now.
I started looking at multi-draw & indirect rendering in GL as I think these are the features I want to implement next. However I immediately ran into existing CEPL bugs in defstruct so I think the next few weeks are going to be spent cleaning that up and fixing the struct related issues from github.
AAAAges back I promised beginner tutorials for Common Lisp and I totally failed to deliver. I had been hoping the Atom support would get good enough that we could use that in the videos. Alas that project seems to have slowed down recently[0] and my guilt finally reached the point that I had to put out something. To that end I have started making a whole bunch of little videos on random bits of Common Lisp. Although it doesn’t achieve what I’d really like to do with proper tutorials, I hope it will help enough people on their journey into the language.
That’s all for now, Peace.
[0] although I’m still praying it gets finished
Transform Feedback
Ah it feels good to have some meat for this week’s writeup. In short transform feedback has landed in CEPL and will ship in the next[0] quicklisp release.
What is it?
Transform feedback is a feature that allows you to write data out from one of the vertex stages into a VBO as well as passing it on to the next stage. It also opens up the possibility of not having a fragment shader at all and just using your vertex shader like the function in a gpu based map function. For a good example of its use check out this great tutorial by the little grasshopper.
How is it exposed in CEPL?
In your gpu-function you simply add an additional qualifier to one or more of your outputs, like this:
(defun-g mtri-vert ((position :vec4) &uniform (pos :vec2))
  (values (:feedback (+ position (v! pos 0 0)))
          (v! 0.1 0 1)))
Here we see that the gl_position from this stage will be captured, now for the cpu side. First we make a gpu-array to write the data into.
(setf *feedback-vec4*
      (make-gpu-array nil :element-type :vec4 :dimensions 100))
And then we make a transform feedback stream and attach our array. (transform feedback streams can have multiple arrays attached as we will see soon)
(setf *tfs*
      (make-transform-feedback-stream *feedback-vec4*))
And finally we can use it. Assuming the gpu function above was used as the vertex stage in a pipeline called prog-1 then the code will look like this:
(with-transform-feedback (*tfs*)
  (map-g #'prog-1 *vertex-stream* :pos (v! -0.1 0)))
And that’s it! Now the first result from mtri-vert will be written into the gpu-array in *feedback-vec4* and you can pull back the values like this:
(pull-g *feedback-vec4*)
If you add the feedback modifier to multiple outputs then they will all be interleaved into the gpu-array. However you might want to write them into separate arrays, this can be done by providing a ‘group number’ to the :feedback qualifier.
(defun-g mtri-vert ((position :vec4) &uniform (pos :vec2))
  (values ((:feedback 1) (+ position (v! pos 0 0)))
          ((:feedback 0) (v! 0.1 0 1))))
making another gpu-array
(setf *feedback-vec3*
      (make-gpu-array nil :element-type :vec3 :dimensions 10))
and binding both arrays to a transform feedback stream
(setf *tfs*
      (make-transform-feedback-stream *feedback-vec3*
                                      *feedback-vec4*))
You can also use the same pipeline multiple times within the scope of with-transform-feedback
(with-transform-feedback (*tfs*)
  (map-g #'prog-1 *vertex-stream* :pos (v! -0.1 0))
  (map-g #'prog-1 *vertex-stream* :pos (v! 0.3 0.28)))
CEPL is pretty good at catching and explaining cases where GL will throw an error such as: not enough vbos (gpu-arrays) bound for the number of feedback targets -or- 2 different pipelines called within the scope of with-transform-feedback
More stuff
During this I ran into some aggravating issues relating to transform feedback and recompilation of pipelines, it was annoying to the point that I rewrote a lot of the code behind the defpipeline-g macro. The short version of this is that the code emitted is no longer a top-level closure and also that CEPL now has ways of avoiding recompilation when it can be proved that the gpu-functions in use haven’t changed.
I also found out that in some cases defvar with type declarations is faster than the captured values from a top level closure, even when they are typed. See here for a test you can run on your machine to see if you get the same kind of results.[1]
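For anyone who wants to poke at this themselves, here is a minimal sketch of the kind of comparison being described. It is not the linked test and all the names are made up for illustration; it just sets up a typed defvar and a typed closed-over variable doing the same work, plus a crude timer. What numbers you get will depend on your implementation and machine.

```lisp
(declaim (type fixnum *global-count*))
(defvar *global-count* 0)

(defun bump-global ()
  "Increment the typed special variable, wrapping to stay a fixnum."
  (setf *global-count* (logand (1+ *global-count*) most-positive-fixnum)))

(let ((n-calls 0))
  (declare (type fixnum n-calls))
  (defun bump-closure ()
    "Increment the typed variable captured by this top level closure."
    (setf n-calls (logand (1+ n-calls) most-positive-fixnum))))

(defun time-calls (fn n)
  "Return the internal-time units spent calling FN N times."
  (let ((start (get-internal-real-time)))
    (dotimes (i n) (funcall fn))
    (- (get-internal-real-time) start)))
```

Comparing (time-calls #'bump-global 10000000) against (time-calls #'bump-closure 10000000) a few times each gives a rough feel for which way it goes on a given setup.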
Shipping it
Like I said this code is in the branch to be picked up by the next quicklisp release. This feature will certainly have its corner cases and bugs but I’m happy to see this out and to have one less missing feature from CEPL.
Future work on transform feedback includes using the transform feedback objects introduced in GLv4 to allow for more advanced interleaving options and also nesting of the with-transform-feedback forms.
Next on the list for this month is shared contexts. More on that next week!
Ciao
[0] late november or early december
[1] testing on my mac has given different results: using defvar and packing data in a struct is faster than multiple defvars but slower than the top level closure. Luckily it’s still on the order of microseconds a frame (assuming 5000 calls per frame) but measurable. It’s interesting to see as packed in defvar was faster on my linux desktop ¯\_(ツ)_/¯
I’ll show more data when I have it.
Multiple Contexts are ALIVE..kinda
Allo again!
It’s now November which means it’s NaNoWriMo time again. Each year I like to participate in spirit by picking a couple of features for projects I’m working on and hammering them out. This is well timed as I haven’t had much time for adding new things to CEPL recently.
The features I want to have by the end of the month are decent multi-context support and single stage pipelines.
Multi-Context
This stuff we have already talked about but I have bitten the bullet and got coding at last. I have support for non-shared contexts now but not shared ones yet. The various hosts have mildly-annoyingly different approaches to abstracting this so I’m working on finding the sweet spot right now.
Regarding the defpipeline threading issues from last week I did in the end opt for a small array of program-ids per pipeline indexed by the cepl-context id. I can make this fast enough and the cost is constant so that feels like the right call for now. CEPL & Varjo were never written with thread safety in mind so it’s going to be a while before I can do a real review and work out what the approach should be, for now it’s a very ‘throw some mutexes in and hope’ situation, but it’s fine…we’ll get there :p
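As a rough picture of what ‘a small array of program-ids per pipeline indexed by the cepl-context id’ could look like, here is a plain Common Lisp sketch. Everything in it is hypothetical (the struct, the function names, the limit of 8 contexts) and it is not CEPL’s actual code; it only shows why the lookup is a constant-cost aref with the compile happening once per context.

```lisp
(defparameter *max-contexts* 8
  "Assumed upper bound on live contexts; not a real CEPL limit.")

(defstruct pipeline-state
  ;; One slot per context id; NIL means 'not compiled for that context yet'.
  (program-ids (make-array *max-contexts* :initial-element nil)))

(defun program-id-for (state context-id make-program)
  "Return the program id cached for CONTEXT-ID, calling MAKE-PROGRAM the
first time this context uses the pipeline."
  (let ((ids (pipeline-state-program-ids state)))
    (or (aref ids context-id)
        (setf (aref ids context-id)
              (funcall make-program context-id)))))
```

Since each context is tied to a single thread, each slot is only ever touched by its own thread, though as noted above the rest of the system still needs a proper thread safety review.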
One side note is that all lambda-pipelines in CEPL are tied to their thread so don’t need the indirection mentioned above :)
Single Stage Pipelines
This is going to be fun. Up until now if you wanted to render a fullscreen quad you needed to:
- make a gpu-array holding the quad data
- make a stream for that gpu-array
- make a pipeline with:
- a vertex shader to put the points in clip space
- a fragment shader
The annoying thing is that the fragment shader was the only bit you really cared about. Luckily it turns out there is a way to do this with geometry shaders and no gpu-array and it should be portable for all GL versions CEPL supports. So I’m going to prototype this out on the stream on wednesday and, assuming it works, I’ll make this into a CEPL feature soon.
That covers making pipelines with only a fragment shader of course but what about with only a vertex stage? Well transform feedback buffers are something I’ve wanted for a while and so I’m going to look into supporting those. This is cool as you can then use the vertex stage for data processing kinda like a big pmap. This could be handy when your data is already in a gpu-array.
Future things
With transform feedback support a few possibilities open up. The first is that with some dark magic we could run a number of variants of the pipeline using transform feedback to ‘log’ values from the shaders, this gives us opportunities for debugging that weren’t there before.
Another tempting idea (which is also easier) is to allow users to call a gpu-function directly. This will
- make a temporary pipeline
- make a temporary gpu-array and stream with just the arguments given (1 element)
- run the code on the gpu capturing the result with a temporary transform-feedback buffer or fbo
- convert the values to lisp values
- dispose of all the temporaries
The effect is to allow people to run gpu code from the repl for the purposes of rapid prototyping. It obviously is useless in production because of all the overhead but being able to iterate in the repl with stuff like this could really be great.
That’ll do pig
Right, time to go.
Seeya soon
Multiple Contexts
This last weekend I put a little time into multi-context support in CEPL.
CEPL has its own context (cepl-context) class that holds both the gl-context handle[0] and also state that is cached to improve performance. cepl-contexts are passed implicitly[1] down the stack and are tied to a single thread.
Most of the work was just finding simple errors in my code and shoring them up, but I did find one tricky case and that was in pipelines. So a pipeline is usually defined in a top level declaration like so:
(defpipeline my-pipeline ()
  :vertex some-gpu-function
  :fragment some-other-gpu-function)
And this generates all the bootstrapping to compile the gpu functions, get the GL program-id, etc. However that program-id is a GL resource and belongs to a single GL context. As it is right now it’ll be the context that calls this pipeline first..ew.
So how to tackle this? We could create one program-id per context, however this means either looking up the program-id based on the context per call in a pipeline local cache..or looking up the program-id in a context local cache based on the pipeline. Neither is great, as extra lookups per call are something we should be avoiding.
Another option is to have shared GL contexts. This is nice anyway as it means we can share textures/buffers/etc between threads which I think is a nice default behavior. However even with this solution there are still issues with pipelines.
The state of a gl program object is naturally shared between the two threads too, that state includes which uniforms are bound, so if two threads try to use the same pipeline with different uniforms then we are in a fun data-racey land again.
This seems to lead back to the ‘gl program per gl context’ thing again. I’ll ponder this some more but I think it’s the only real option.
Happy to hear suggestions too,
I think that’s all for now
Peace
[0] in the future I expect I will allow multiple GL contexts per CEPL context
[1] or explicitly if you prefer
Small things
This last week hasn’t seen much exciting code so there isn’t too much to write up.
I’m still dreaming up some way to wrangle data in my games in a way that maximizes performance whilst keeping live redefinition intact, however this isn’t even fully formed in my head yet so there is no code to show or even speak of. However I’ve been increasingly interested in relational databases recently. The fact that you only define the layout of your table data and queries, and that the system just works out what other passes and intermediate data-structures it needs to work best is pretty sweet. You can get a free book on mssql query optimizer here.
CppCon is also out, here are a few good talks I’ve been watching so far:
- Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”
- Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler’s Lid”
- P. McKenney, M. Michael & M. Wong “Is Parallel Programming still hard?” (spoiler..yes)
- Olivier Giroux “Designing (New) C++ Hardware”
I’ve also just had a book on Garbage Collection delivered. YAY! It’s another one of those amazing computer systems where you get to directly impact people, but without having to deal with horrible human factors (like unicode & dates & BLEEEEGHH). I’m pretty stoked to work through this book.
Other than this researchy stuff I’ve still been streaming. Last week we played with a physics engine and tonight we are going to implement chromatic aberration :) I’m pretty happy with where the streaming has been going, the nerve wracking part of the process these days is finding things I can do in the two hours rather than the stream itself.
That’ll do for now, seeya next week
Not much this time
My lack of focus over the weekend was disappointing so I haven’t got much to report. The one thing I did get done however was to finish adding types to my WIP lisp bindings for the newton-dynamics physics engine. This was motivated by the fact that although I had got the basics working a while back, I had seen some overhead from the lisp code; that should be minimized now.
I think I might try using the physics bindings on this week’s stream. Could be fun.
Other than that I’ve been reading and procrastinating. This book is now in my ‘to read’ list, I have no desire to make a proper database but I’m super interested in how their query planner/optimizers work.
That’s all for now, seeya!
Sketch
This weekend I put a bit of time into Sketch which I, to my shame, have not worked on in a while. Sketch is a lovely project by Vydd which looks to sit in a similar place to processing, but in the lisp world.
A while back I was approached to look into porting it to CEPL so we could have the shader development process of CEPL in Sketch. We started by monkey-patching CEPL in, which provided a fantastic test case for performance and resulted in some big refactoring and wins back in July.
Sketch was previously built on the excellent sdl2kit but there aren’t enough hooks in the projects to have them work together yet so I’m currently replacing the bootstrapping. I stripped down a bunch of code and have a test which shows things are rendering so that’s a start. However CEPL’s support for multiple contexts is untested so this project is really gonna force me to implement that well which is AWESOME. Incidentally sketch was the project that forced me to add CEPL’s multi window support (which will also get more robust as I port this).
Other than that I’m busy with other projects and ideas that may become stuff in the future, I’ve got so much to learn :) This last week has seen me binging on xerox parc related research talks (mainly smalltalk stuff) which has been building up a nice healthy level of dissatisfaction. I have proto-ideas rocking around with big ol’ gaps in their narratives, so I’m just pushing a load of chunks of software dna into my head in the hope that some aberrant collision will result in some useful mental genesis. TLDR feed brain hope to shit ideas.
That’ll do for this post.
Seeya!
The long path to shader debugging
Writing shaders (in lisp or otherwise) is fun, however debugging them is not. Where on the CPU we get exceptions or error codes, on the gpu we get silence and undefined behavior. I really felt this when trying (and failing) to implement procedural terrain generation on the livestream. I tried to add additional outputs so that I could inspect the values but it was very easy to make a mistake and change the behavior of the shader..or worse to forget it was there and waste time debugging a side effect from the instrumentation. I need a more reliable way to get values back to the CPU. Luckily CEPL has some great places we can hide this logic.
Quick recap, in CEPL we define GPU functions and then compose them into a pipeline using defpipeline-g:
(defpipeline-g some-pipeline ()
  (vertex-stage :vec4)
  (fragment-stage :vec2))
This is a macro that generates a function called some-pipeline that does all the wrangling to make the gl draw call. You then use it by using map-g
(map-g #'some-pipeline vertex-data)
This is another macro that expands into some plumbing and (ultimately) a call to the some-pipeline function.
Putting aside other details what we have here is 2 places we can inject code, one in the function body and one at the function call-site. This gives us tonnes of leverage.
My goal is to take some gpu-function like this:
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2))
  (+ (* (texture tex (- tc offset)) 0.3125)
     (* (texture tex tc) 0.375)
     (* (texture tex (+ tc offset)) 0.3125)))
And add calls to some function we will call peek.
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2))
  (+ (peek (* (texture tex (peek (- tc offset))) 0.3125))
     (* (texture tex tc) 0.375)
     (* (texture tex (+ tc offset)) 0.3125)))
Peek will capture the value at that point and make it available for inspection from the CPU side of your program.
The way we can do it is to:
- compile the shader normally (we need to do this anyway)
- inspect the AST for calls to peek and the types of the argument
- create a new version of the shader with peek replaced with the instrumenting code
For example:
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2))
  (let (((dbg-0 :vec2))
        ((dbg-1 :vec4)))
    (+ (setf dbg-1 (* (texture tex (setf dbg-0 (- tc offset))) 0.3125))
       (* (texture tex tc) 0.375)
       (* (texture tex (+ tc offset)) 0.3125))
    (values dbg-0 dbg-1)))
This code will work mostly the same way except that it will be returning the captured values instead of the original one. I say ‘mostly’ as now the code that doesn’t contribute to the captured values is essentially dead code and it is likely that the GLSL compiler will strip chunks of it.
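The rewriting step on its own is a plain tree walk, and a minimal sketch of it fits in a few lines of Common Lisp. This is not how Varjo does it (the real thing would work from the compiled AST so it knows the type of each peeked form); instrument-peeks is a hypothetical name and the sketch ignores types entirely, it only swaps each (peek x) for (setf dbg-N x) and reports the variables it introduced.

```lisp
(defun instrument-peeks (form)
  "Return (values new-form debug-vars). Every (peek x) in FORM becomes
(setf dbg-N x); inner peeks are numbered before the peeks that contain them."
  (let ((vars '())
        (counter 0))
    (labels ((walk (f)
               (cond ((atom f) f)
                     ((eq (first f) 'peek)
                      ;; Walk the argument first so nested peeks get
                      ;; the lower numbers.
                      (let* ((inner (walk (second f)))
                             (var (intern (format nil "DBG-~a" counter))))
                        (incf counter)
                        (push var vars)
                        `(setf ,var ,inner)))
                     (t (mapcar #'walk f)))))
      (let ((new-form (walk form)))
        (values new-form (reverse vars))))))
```

Run on the body of the peek version of qkern this produces the same setf forms as the hand-written example, with dbg-0 for the inner peek and dbg-1 for the outer one; wrapping the result in the let and the final values form is then mechanical once the types are known.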
So now we have an augmented shader stage as well as the original. defpipeline-g can generate, compile and store these, and on each map-g it can make 2 draw calls: first the debug one, capturing the results using transform feedback for the vertex stages and FBOs for the fragment stage, and then the regular one. Because map-g is also a macro we use it to implicitly pass the thread-local ‘CEPL Context’ object to the pipeline function. This lets us write debug values into a ‘scratch’ buffer stored on the context, making the whole process transparent.
With this data available we can then come up with nice ways to visualize it. Just dumping it to the REPL will usually be a bad move as a single peek in a fragment shader is going to result in a value for every fragment, which (at best) means 2073600 values for a 1920x1080 render target.
There are a lot of details to work out to get this feature to work well[0], however it could be a real boost in getting real data[1] back from these pipelines and can work on all GL versions CEPL supports.
Seeya next week, Peace.
[0]: transform feedback only works from the last implemented vertex stage, so if you have vertex, tessellation & geom stages, only geom can write to the transform feedback buffer.
[1]: Another option was to compile the lisp like shader language to regular lisp. However implementing the GLSL standard library exactly is hard and it’s impossible to capture all the gpu/manufacturer specific quirks.