Its now November which means it’s NanoWrimo time again, each year I like to participate in spirit by picking a couple of features for projects I’m working on and hammer them out. This is well timed as I haven’t had much time for adding new things to CEPL recently.
The features I want to have by the end of the month are decent multi-context support and single stage pipelines.
This stuff we have already talked about but I have bit the bullet and got coding at last. I have support for non-shared contexts now but not shared ones yet, the various hosts have mildly-annoyingly different approaches to abstracting this so I’m working on finding the sweet spot right now.
Regarding the defpipeline threading issues from last week I did in the end opt for a small array of program-ids per pipeline indexed by the cepl-context id. I can make this fast enough and the cost is constant so that feels like the right call for now. CEPL & Varjo were never written with thread safety in mind so it’s going to be a while before I can do a real review and work out what the approach should be, for now its a very ‘throw some mutexes in and hope’ situation, but it’s fine…we’ll get there :p
One side note is that all lambda-pipelines in CEPL are tied to their thread so don’t need the indirection mentioned above :)
Single Stage Pipelines
This is going to be fun. Up until now if you wanted to render a fullscreen quad you needed to:
- make a gpu-array holding the quad data
- make a stream for that gpu-array
- make a pipeline with:
- a vertex shader to put the points in clip space
- a fragment shader
The annoying thing is that the fragment shader was the only bit you really cared about. Luckily it turns out there is a way to do this with geometry shaders and no gpu-array and it should be portable for all GL versions CEPL supports. So I’m going to prototype this out on the stream on wednesday and, assuming it works, I’ll make this into a CEPL feature soon.
That covers making pipelines with only a fragment shader of course but what about with only a vertex stage? Well transform feedback buffers are something I’ve wanted for a while and so I’m going to look into supporting those. This is cool as you can then use the vertex stage for data processing kinda like a big pmap. This could be handy when you data is already in a gpu-array.
With transform feedback support a few possibilities open up. The first is that with some dark magic we could run a number of variants of the pipeline using transform feedback to ‘log’ values from the shaders, this gives us opportunities for debugging that weren’t there before.
Another tempting idea (which is also easier) is to allow users to call a gpu-function directly. This will
- make a temporary pipeline
- make a temporary gpu-array and stream with just the arguments given (1 element)
- run the code on the gpu capturing the result with a temporary transform-feedback buffer or fbo
- convert the values to lisp values
- dispose of are the temporaries
The effect is to allow people to run gpu code from the repl for the purposes of rapid prototyping. It obviously is useless in production because of all the overhead but being able to iterate in the repl with stuff like this could really be great.
That’ll do pig
Right, time to go.
This last weekend I put a little time into multi-context support in CEPL.
CEPL has it’s own context (
cepl-context) class that holds both the gl-context handle and also state that is cached to improve performance.
cepl-contexts are passed implicitly down the stack and are tied to a single thread.
Most of the work was just finding simple errors in my code an shoring them up, but I did find one tricky case and that was in pipelines. So a pipeline is usually defined in a top level declaration like so:
(defpipeline my-pipeline () :vertex some-gpu-function :fragment some-other-gpu-function)
And this generates all the bootstrapping to compile the gpu functions, get the GL program-id, etc. However that program-id is a GL resource and belongs to a single GL context. As it is right now it’ll be the context that calls this pipeline first..ew.
So how to tackle this? We could create one program-id per context, however this means either looking up the program-id based on the context per call in a pipeline local cache..or looking up the program-id in a context local cache based on the pipeline. Neither is great, as extra lookups per call are something we should be avoiding.
Another option is to have shared GL contexts. This is nice anyway as it means we can share textures/buffers/etc between threads which I think is a nice default behavior. However even with this solution there are still issues with pipelines.
The state of a gl program object is naturally shared between the two threads too, that state includes which uniforms are bound, so if two threads try to use the same pipeline with different uniforms then we are in a fun data-racey land again.
This seem to lead back to the ‘gl program per gl context’ thing again. I’ll ponder this some more but I think it’s the only real option.
Happy to hear suggestions too,
I think that’s all for now
 in the future I expect I will allow multiple GL contexts per CEPL context  or explicitly if you prefer
This last week hasn’t seen much exciting code so there isn’t too much to write up.
I’m still dreaming up some way to wrangle data in my games in a way that maximizes performance whilst keeping live redefinition in tact, however this isn’t even fully formed in my head yet so there is no code to show or even speak of. However I’ve been increasingly interested in relational databases recently. The fact that you only define the layout of your table data and queries, and that the system just works out what other passes as intermediate data-structures it needs to work best is pretty sweet. You can get a free book on mssql query optimizer here.
CppCon is also out, here are a few good talks I’ve been watching so far:
- Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”
- Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler’s Lid”
- P. McKenney, M. Michael & M. Wong “Is Parallel Programming still hard?” (spoiler..yes)
- Olivier Giroux “Designing (New) C++ Hardware”
I’ve also just had a book on Garbage Collection delivered. YAY! It’s another one of those amazing computer systems where you get to directly impact people, but without having the deal with horrible human factors (like unicode & dates & BLEEEEGHH). I’m pretty stoked to work through this book.
Other than this researchy stuff I’ve still been streaming. Last week we played with a physics engine and tonight we are going to implement chromatic aberration :) I’m pretty happy with where the streaming has been going, the nerve wracking part of the process these days is finding things I can do in the two hours rather than the stream itself.
That’ll do for now, seeya next week
My lack of focus over the weekend was disappointing so I haven’t got much to report. The one thing I did get done however was to finish adding types to my WIP lisp bindings for the newton-dynamics physics engine. This was motivated by the fact that although I had got the basics working a while back, I had seen some overhead from the lisp code; that should be minimized now.
I think I might try using the physics bindings on this week’s stream. Could be fun.
Other than that I’ve been reading and procrastinating. This book is now in my ‘to read’ list, I have no desire to make a proper database but I’m super interested in how their query planner/optimizers work.
That’s all for now, seeya!
This weekend I put a bit of time into Sketch which I, to my shame, have not worked on in a while. Sketch is a lovely project by Vydd which looks to sit in a similar place to processing, but in the lisp world.
A while back I was approach to look into porting it to CEPL so we could have the shader development process of CEPL in Sketch. We started by monkey-patching CEPL in which provided a fantastic test case for performance and resulted in some big refactoring and wins back in July.
Sketch was previously built on the excellent sdl2kit but there aren’t enough hooks in the projects to have them work together yet so I’m currently replacing the bootstrapping. I stripped down a bunch of code and have a test which shows things are rendering so that’s a start. However CEPL’s support for multiple contexts is untested so this project is really gonna force me to implement that well which is AWESOME. Incidentally sketch was the project that forced me to add CEPL’s multi window support (which will also get more robust as I port this).
Other than that I’m busy with other projects and ideas that may become stuff in the future, I’ve got so much to learn :) This last week has seen me binging on xerox parc related research talks (mainly smalltalk stuff) which has been building up a nice healthy level of dissatisfaction. I have proto-ideas rocking around with big ol’ gaps in their narratives, so I’m just pushing a load of chunks of software dna into my head in the hope of some aberrant collision will result in some useful mental genesis will occur. TLDR feed brain hope to shit ideas.
That’ll do for this post.
Writing shaders (in lisp or otherwise) is fun, however debugging them is not. Where on the CPU we get exceptions or error codes, on the gpu we get silence and undefined behavior. I really felt this when trying (and failing) to implement procedural terrain generation on the livestream. I tried to add additional outputs so that I could inspect the values but it was very easy to make a mistake and change the behavior of the shader..or worse to forget it was there and waste time debugging a side effect from the instrumentation. I need a more reliable way to get values back to the CPU. Luckily CEPL has some great places we can hide this logic.
Quick recap, in CEPL we define GPU functions and then compose them into a pipeline using
(defpipeline-g some-pipeline () (vertex-stage :vec4) (fragment-stage :vec2))
This is a macro that generates a function called
some-pipeline that does all the wrangler to make the gl draw call. You then use it by using
(map-g #'some-pipeline vertex-data)
This is another macro that expands into some plumbing and (ultimately) a call to the
Putting aside other details what we have here is 2 places we can inject code, one in the function body and one at the function call-site. This gives us tonnes of leverage.
My goal is to take some gpu-function like this:
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2)) (+ (* (texture tex (- tc offset)) 0.3125) (* (texture tex tc) 0.375) (* (texture tex (+ tc offset)) 0.3125)))
And add calls to some function we will call
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2)) (+ (peek (* (texture tex (peek (- tc offset))) 0.3125)) (* (texture tex tc) 0.375) (* (texture tex (+ tc offset)) 0.3125)))
Peek will capture the value at that point and make it available for inspection from the CPU side of your program.
The way we can do it is to:
- compile the shader normally (we need to do this anyway)
- inspect the AST for calls to peek and the types of the argument
- create a new version of the shader with peek replaced with the instrumenting code
(defun-g qkern ((tc :vec2) &uniform (tex :sampler-2d) (offset :vec2)) (let (((dbg-0 :vec2)) ((dbg-1 :vec4))) (+ (setf dbg-1 (* (texture tex (setf dbg-0 (- tc offset))) 0.3125)) (* (texture tex tc) 0.375) (* (texture tex (+ tc offset)) 0.3125)) (values dbg-0 dbg-1)))
This code will work mostly the same way except that it will be returning the captured values instead of the original one. I say ‘mostly’ as now the code that doesnt contribute to the captured values is essentially dead code and it is likely that the GLSL compiler will strip chunks of it.
So now we have an augmented shader stage as well as the original,
defpipeline-g can generate, compile and store these and on each
map-g it can make 2 draw calls. First the debug one capturing the results using transform-feedback (for the vertex stages) and FBOs for the fragment stage. Because
map-g is also a macro we use it to implicitly pass the thread-local ‘CEPL Context’ object to the pipeline function. This lets us write debug values into a ‘scratch’ buffer stored on the context making the whole process transparent.
With this data available we can then come up with nice ways to visualize it. Just dumping it to the REPL will usually be a bad move as a single
peek in a fragment shader is going to result in a value for every fragment, which (at best) means 2073600 values for a 1920x1080 render target.
There are a lot of details to work out to get this feature to work well, however it could be a real boost in getting real data back from these pipelines and can work on all GL versions CEPL supports.
Seeya next week, Peace.
: transform feedback only works from the last implemented vertex stage, so if you have vertex, tessellation & geom stages, only geom can write to the transform feedback buffer.
: Another option was to compile the lisp like shader language to regular lisp. However implementing the GLSL standard library exactly is hard and it’s impossible to capture all the gpu/manufacturer specific quirks.
Over the weekend I got a little lisping done and was working on something that has been rolling around my head for a couple of years.
During the standardization process of lisp as well as agreeing on what would go in, there were also things cut. Some of those things have become de facto standards as all the implementations ship them, however some seem rather fundamental.
One of the more fundamental ones that didn’t make it was the idea of introspectable (and extensible) environment objects.
The high level view goes something like this: An environment object is a set of lexical bindings, having access to this (and any metadata about those bindings) would allow you to do more semantic analysis of the code. Given that any macro is allowed access to the environment object when evaluated this would allow a macro to expand differently depending on the data in the environment.
For example let’s say that we use the environment to store static type information in the environment; we could then potentially optimize certain function calls within that scope using this information (like using static dispatch on a generic function call).
Now a while back stassats kindly shared a snippet which allows you to essentially define a variable who’s value is available during macro expansion. Over the weekend I’ve been playing with this to provide a way to allow the following.
(defmacro your-macro-here (&environment env x y) (with-ext-env (env) (augment-environment env :foo 10) `(+ x y)))
So you can wrap the code in your macro in
with-ext-env and this lets you get access to a user-extensible environment object. We would then provide functions (like
augment-environment) to modify the environment, in the above code to store the value
10 against the key
:foo however we could use this for type info.
The downsides are that we don’t get all the data that was potentially available in the proposed feature. I’d really like to have access to all the standard declarations made as well as our additional ones.
Luckily it’s possible to make a new CL package with modified versions of
labels, etc and in those to capture the metadata and make it available to our extensible environment.
With this we may be able to make something that convincingly does the job of a extensible macro environment. I have made a prototype of the meat (the passing of the environment itself) and so next is it wrap the other things into a package and then see if it is useful.
Other than this I’ve been poking around a little with Unity. It’s fun to see how it’s done in the big leagues and to see where ones approach aligns and diverges from a larger player’s philosophy.
That’s all for now,
Once again I don’t have much to report but things are going ok.
The streams are still going and still going well, this week was using a cute approach from http://nullprogram.com/blog/2014/06/01/ to make voronoi diagrams. You can see that stream here
I also revisited my bindings to the Newton Dynamics physics engine. Last time I had tested it seem they were abysmally slow. Luckily a quick review showed me that the ‘max fps’ for the simulation was set too low and that to something sensible made a world of difference. Some profiling also revealed that the bindings have a non-trivial amount of overhead which I believe I can remove by declaring the types and turning up the performance optimizations.
That’s all for now Ciao
2 days back I started getting into coding in the evenings again. For a while my brain has not been into it but it looks like I might be back at last.
So far I havent done much but tonight I’ll be doing another stream where I will try to implement boids.
That’s all for now, except that if you are an old fan of Command & Conquer you can get the whole game here for free http://nyerguds.arsaneus-design.com/cnc95upd/cc95p106/ . It looks sketchy but it’s legit and DAMN it’s fun :D
I fucked up for a few weeks by not doing these. Sorry about that.
Update will be easy as I have been in ‘absorb phase’ for a few weeks and so I’ve not been coding much in my free time.
Aside from watching a few films and playing a few games my intake has mainly been around the ‘data oriented design’ space. I’ve been rewatching these:
code::dive conference 2014 - Scott Meyers: Cpu Caches and Why You Care CppCon 2014: Chandler Carruth - Efficiency with Algorithms, Performance with Data Structures CppCon 2014: Mike Acton - Data-Oriented Design and C++
code::dive 2016 conference – Chandler Carruth – Making C++ easier, faster and safer (part 1) code::dive 2016 conference – Chandler Carruth –Making C++ easier, faster, safer (part 2) On “simple” Optimizations - Chandler Carruth - Secret Lightning Talks - Meeting C++ 2016 2013 Keynote: Chandler Carruth: Optimizing the Emergent Structures of C++ Things that Matter - Scott Meyers | DConf2017
I’ve also been reading the wonderful What every programmer should know about memory which is poor in name but outstanding in content. It’s really for programmers who need to get everything out of the machine and folks who need to understand why they aren’t. I’m not finished yet but its already been very helpful.
From that my interest in assembly programming was growing again. The primary reason is that lisp has a disassemble function that lets you see what your function got compiled to. Here is a destructive ‘multiply a vector3 by a float’ function:
CL-USER> (disassemble #'v3-n:*s) ; disassembly for RTG-MATH.VECTOR3.NON-CONSING:*S ; Size: 51 bytes. Origin: #x22966965 ; 65: F30F105201 MOVSS XMM2, [RDX+1] ; no-arg-parsing entry point ; 6A: F30F59D1 MULSS XMM2, XMM1 ; 6E: F30F115201 MOVSS [RDX+1], XMM2 ; 73: F30F105205 MOVSS XMM2, [RDX+5] ; 78: F30F59D1 MULSS XMM2, XMM1 ; 7C: F30F115205 MOVSS [RDX+5], XMM2 ; 81: F30F105209 MOVSS XMM2, [RDX+9] ; 86: F30F59CA MULSS XMM1, XMM2 ; 8A: F30F114A09 MOVSS [RDX+9], XMM1 ; 8F: 488BE5 MOV RSP, RBP ; 92: F8 CLC ; 93: 5D POP RBP ; 94: C3 RET ; 95: 0F0B10 BREAK 16 ; Invalid argument count trap
Cool, but naturally not much good if I can’t read it. So that is my driving factor: Understanding this ↑↑↑↑.
I have been repeatedly advised to start with something simpler that x64 asm, but i just cant find any motivation, when it comes down to it I dont want to program for the c64 any more, and arm/mips holds no appeal until I have a need for it. So against better judgement I picked up the AMD64 Architecture Programmer’s Manual Volume 1 and started reading.. and it’s nice. I quickly run into things I dont get of course but then it’s off to youtube again where kupula’s series of x86_64 Linux Assembly tutorials gave me a nice soft intro.
This is a clearly the start of a looooong road, but I’m in no rush, and everything I learn is directly helping me grok that disassembly above.
|I think that’s it. I’m still streaming and still struggling with the procedural erosion stuff..turns out there are more bugs in the paper than expected :|
But that’s for another day