Performance Issues (Lecture 1)
------------------------------

Q: How do you measure performance in OpenGL?
   * Does it run?
   * How fast does it run?   <---- This is what is usually meant by "performance"
   * How good does it look?

Q: What could cause OpenGL to run slowly (or "slower")?
Q: (Alternately, what causes CPU-based programs to run slowly?  Are there
    similarities?)
Q: (Think: what are the stages data must go through on the way to the screen?)
Q: (What causes slowdowns in a general 'pipeline' or assembly line?)

Thoughts:
   * Things become significantly slower if any one stage starts taking more
     than its fair share of the time, causing other stages to stall.
   * Data specification might be slow (Why might this be?)
     - Lots of it
     - Redundantly repeated
     - Each command requires overhead (operating system, memory, etc.)
     - Requires lots of processing (e.g., custom CPU transforms, dynamic
       computation)
     - Accesses slow storage (anyone load models from disk every frame?)
     - Your code does other expensive things ( "for(i=0;i<100000000;i++);" )
   * Communications are slow (transferring data to the graphics card)
     - Just so much data it can't be transferred fast enough to be processed
     - Not enough resources/memory to store data on the graphics card, so it
       must be transferred repeatedly.
     - Data is going "the wrong way" (i.e., in a direction not designed for
       large data transport)
   * Vertex transformations are the bottleneck
     - Often related to sizable geometry
   * Rasterization is the bottleneck
     - Rarely the case, but it's possible...  Not much you can do except
       reduce the amount of rasterization required.
   * Fragment/pixel operations become the bottleneck.
     - Lots of large geometry occluding the scene
     - Very complex per-fragment or per-pixel operations
     - Use of very large framebuffers

Often, these are split into 3 categories:
   * "CPU bound" - CPU-based limits and communications to/from the GPU
   * "Geometry limited" - Limited by the power of the vertex transform/lighting
     processor
   * "Fill limited" - Limited by the number of fragments or the complexity of
     fragment operations.

Q: How do you determine what your bottleneck is?

Q: (How would you determine if you are fill limited?)
   * Shrink geometry so it covers a small part of the screen
   * Turn off complex lighting (i.e., render everything as white)
   * Try your program at a small resolution

Q: (How would you determine if you are geometry limited?)
   * Try the same scene with simpler geometry (the 10k vs. 250k vs. 850k
     dragon)
   * Eliminate computations from the vertex shader / vertex processor.
   * If you cut # polygons in half and the scene was perfectly geometry
     limited, what would be the speedup?

Q: (How would you determine if you are CPU bound?)
   * If neither of the others, CPU bound is likely.
   * Run on a faster/slower machine (with the same GPU!)
   * Render a static scene (no costly computations for animation or dynamic
     geometry from frame to frame)

---------------------------------------------------------------------------------

Notes:
------

You can look at these limits as an opportunity.  If you determine your
application is geometry limited, you can add more complex fragment shaders to
increase the realism _without slowing your application_ (until fragment work
becomes the new bottleneck)!  Similarly, a fill-limited program can increase
polygon count at no cost to performance (assuming you fill the same area --
i.e., use higher complexity models).

Different applications have different limitations.  For instance, CAD/CAM
applications typically are geometry limited, as they display very complex
models with simple shading.
Games typically use tens of thousands (maybe low 100,000s for really complex
games) of visible polygons, but use very complex shaders to add realism, so
they are typically fill limited.

This explains the two tiers of graphics cards: the workstation models versus
the standard models.  Professional and entertainment applications have
different usage patterns and accuracy requirements.

---------------------------------------------------------------------------------

Q: If you're CPU bound, what do you do?  (Have your apps ever been CPU bound?)
   * Bunny with 70k polygons was probably CPU bound.
   * Speed up geometry specification (we'll be focusing on this shortly)
     - Don't really need 140,000-280,000 system calls for the bunny.  (This
       is probably an overestimation, as GL probably optimizes to not make a
       system call for each GL call, but you're still making 140,000-280,000
       function calls.)  This method of specifying geometry is called
       "immediate mode" and is the slowest technique available in OpenGL.
   * Use fewer polygons.  If that dragon covers 300 pixels, do we *really*
     need 10,000 polygons (or 850k)?  Could we use a model with less detail?
   * Push some computations to the vertex processor.  If you're doing
     computations FOR EACH vertex on the CPU, it's likely your program would
     be faster with the GPU doing the computations.
   * Avoid changing OpenGL state!  State changes are slow, both on and off
     the GPU.  Clump geometry with similar or identical state together, so
     you don't have to turn on and off lighting, depth testing, or your
     favorite shaders multiple times.  Obviously, this can't be avoided in
     some situations, but at least use some common sense about which changes
     will be most expensive (e.g., shaders come with lots of state, whereas a
     matrix multiply is probably a cheaper change).
   * You might try using simpler data types.
     OpenGL rarely uses double types in its pipeline, so there's really no
     need to use doubles to store geometry data (unless you're interfacing
     with code/libraries that need the accuracy).  Doubles take extra memory
     in your application, and GL just has to convert them anyway to send them
     to the graphics card.
   * If your code doesn't really need fancy 32-bit-per-pixel textures, don't
     use them.  Similarly, if you don't need RGBA textures, just use RGB.  (I
     break this a lot, but if you're scrounging for speed, transferring less
     data over the AGP/PCIe bus might save you time.)  You might even get
     away with a texture with only a red (or intensity) channel!

Q: If you're geometry limited, what do you do?  (Have your apps ever been
    geometry limited?)
   * Bunny with 70k polygons was probably NOT geometry limited.
   * Use fewer vertices!  (That 850k dragon covering 300 pixels can probably
     be simplified.)
   * Transfer operations to the CPU or fragment shaders.
     - If you do an operation in vertex shaders that could be done once on
       the CPU (e.g., multiply 2 constant matrices together), move those
       operations to the CPU.  "Parallelism" doesn't mean duplicating
       operations is necessarily efficient!
     - If the CPU can't handle more, consider transferring the work to the
       fragment shader.  Yes, you'd be repeating operations, but if those
       fragment cycles would otherwise be wasted because there were no
       fragments to process, duplication becomes efficient!
   * Simplify vertex operations.

Q: If you're fill limited, what do you do?  (Have your apps ever been fill
    limited?)
   * Probably all your apps (except maybe those using the bunny & other
     complex geometry)
   * Enable techniques like face culling.
     - If back faces are invisible, why draw them (only to overwrite them)?
   * Move computation to the CPU or the vertex shader.  Often computations
     can be done once per vertex (or per frame) rather than per-pixel.
     Sometimes computations that must be done per-fragment can be
     approximated once per vertex or per frame.
     (e.g., Gouraud shading instead of Phong shading)
   * Be careful with z-culling.
     - Z-buffering allows you to eliminate fragments that fall behind other,
       previously drawn geometry.
     - If you can cull pixels using a depth test before expensive shading
       operations (like long fragment shaders), you save fragment cycles.
       Drivers and hardware have been optimized to do this, but various
       OpenGL features (for example, a fragment shader that writes its own
       depth value) automatically turn this optimization off!
   * Only render complex shaders on the *final* geometry.
     - Scenes might have depth complexity of 5 or 10 or higher.  You don't
       want to shade 10 complex pixels and only keep the closest one.
       Instead, you can do a form of "deferred shading" (often called a depth
       pre-pass) -- do one render pass where you only render a depth buffer,
       then a second pass with the appropriate shaders.  Since the depth
       buffer already contains the frontmost surfaces, all the others are
       automatically eliminated (if early z-culling is on!)
   * Avoid complex operations (loops, reading one texture based upon the
     result in a different texture, based upon another texture read)
   * Memory reads (texture) always slow things down, so be careful with
     multiple textures, especially if the reads aren't predictable -- then
     the GPU's cache memory won't help.
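
Sketch: to get a feel for why the depth-only first pass above can pay off,
here is a small CPU-side model of a single pixel.  It is not OpenGL code and
does not touch a real depth buffer; the function names are made up for
illustration, smaller depth is assumed to mean "closer", and the depth buffer
is assumed to be cleared to 1.0.  It only counts how many times the expensive
shader would run, assuming the depth test happens before shading.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: count expensive shader runs at ONE pixel.
   depth[] holds the fragments landing on that pixel, in draw order. */

/* Single pass: every fragment that is closer than what has been drawn so
   far passes the depth test and is shaded, even if it is hidden later. */
size_t shaded_single_pass(const float *depth, size_t n)
{
    assert(depth != NULL || n == 0);
    float nearest = 1.0f;              /* depth buffer cleared to far plane */
    size_t shaded = 0;
    for (size_t i = 0; i < n; i++) {
        if (depth[i] < nearest) {      /* "less than" depth test */
            nearest = depth[i];
            shaded++;                  /* expensive shader runs here */
        }
    }
    return shaded;
}

/* Two passes: pass 1 fills in depth only (no shading); pass 2 shades only
   fragments whose depth matches the stored frontmost value. */
size_t shaded_with_prepass(const float *depth, size_t n)
{
    assert(depth != NULL || n == 0);
    float nearest = 1.0f;
    for (size_t i = 0; i < n; i++) {   /* pass 1: depth only */
        if (depth[i] < nearest)
            nearest = depth[i];
    }
    size_t shaded = 0;
    for (size_t i = 0; i < n; i++) {   /* pass 2: "equal" depth test */
        if (depth[i] == nearest)
            shaded++;                  /* expensive shader runs here */
    }
    return shaded;
}
```

With five fragments drawn back to front (depths 0.9, 0.7, 0.5, 0.3, 0.1), the
single pass shades all five, while the two-pass version shades one.  Drawn
front to back, both shade one, so rough front-to-back sorting gets much of the
same benefit.  The model ignores the cost of submitting the geometry twice,
which matters if you are geometry limited or CPU bound rather than fill
limited.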