Chapter 10 Mathematical Topics
from 3D Graphics
I don't think there's anything wrong with pretty graphics.
— Shigeru Miyamoto (1952–)

This chapter discusses a number of mathematical issues that arise when creating 3D graphics on a computer. Of course, we cannot hope to cover the vast subject of computer graphics in any amount of detail in a single chapter. Entire books are written that merely survey the topic. This chapter is to graphics what this entire book is to interactive 3D applications: it presents an extremely brief and high-level overview of the subject matter, focusing on topics for which mathematics plays a critical role. Just like the rest of this book, we try to pay special attention to those topics that, from our experience, are glossed over in other sources or are a source of confusion for beginners.

To be a bit more direct: this chapter alone is not enough to teach you how to get some pretty pictures on the screen. However, it should be used in parallel with (or preceding!) some other course, book, or self-study on graphics, and we hope that it will help you breeze past a few traditional sticky points. Although we present some example snippets in High Level Shading Language (HLSL) at the end of this chapter, you will not find much else to help you figure out which DirectX or OpenGL function calls to make to achieve some desired effect. These issues are certainly of supreme practical importance, but alas, they are also in the category of knowledge that Robert Maynard Hutchins dubbed “rapidly aging facts,” and we have tried to avoid writing a book that requires an update every other year when ATI releases a new card or Microsoft a new version of DirectX. Luckily, up-to-date API references and examples abound on the Internet, which is a much more appropriate place to get that sort of thing. (API stands for application programming interface. In this chapter, API will mean the software that we use to communicate with the rendering subsystem.)

One final caveat is that since this is a book on math for video games, we will have a real-time bias. This is not to say that the book cannot be used if you are interested in learning how to write a raytracer; only that our expertise and focus are in real-time graphics.

This chapter proceeds roughly in order from ivory tower theory to down-and-dirty code snippets.

• Section 10.1 gives a very high-level (and high-brow) theoretical approach to graphics, culminating in the rendering equation.
• We then lower our brows somewhat to focus attention on matters of more direct practical application, while still maintaining our platform independence and attempting to be relevant ten years from now.
  • Section 10.2 discusses some basic mathematics related to viewing in 3D.
  • Section 10.3 introduces some important coordinate spaces and transformations.
  • Section 10.4 looks at how to represent the surfaces of the geometry in our scene using a polygon mesh.
  • Section 10.5 shows how to control material properties (such as the “color” of the object) using texture maps.
• The next sections are about lighting.
  • Section 10.6 defines the ubiquitous Blinn-Phong lighting model.
  • Section 10.7 discusses some common methods for representing light sources.
• With a little nudge further away from timeless theory, the next sections discuss two issues of particular contemporary interest.
• The last third of this chapter is the most in danger of becoming irrelevant in coming years, because it is the most immediately practical.
  • Section 10.10 gives an overview of a simple real-time graphics pipeline, and then descends that pipeline and talks about some mathematical issues along the way.
  • Section 10.11 concludes the chapter squarely in the “rapidly aging facts” territory with several HLSL examples demonstrating some of the techniques covered earlier.

# 10.1 How Graphics Works

We begin our discussion of graphics by telling you how things really work, or perhaps more accurately, how they really should work, if we had enough knowledge and processing power to make things work the right way. The beginning student is to be warned that much introductory material (especially tutorials on the Internet) and API documentation suffers from a great lack of perspective. You might get the impression from reading these sources that diffuse maps, Blinn-Phong shading, and ambient occlusion are “The way images in the real world work,” when in fact you are probably reading a description of how one particular lighting model was implemented in one particular language on one particular piece of hardware through one particular API. Ultimately, any down-to-the-details tutorial must choose a lighting model, language, platform, color representation, performance goals, etc., as we will have to do later in this chapter. (This lack of perspective is usually purposeful and warranted.) However, we think it's important to know which are the fundamental and timeless principles, and which are arbitrary choices based on approximations and trade-offs, guided by technological limitations that might be applicable only to real-time rendering, or are likely to change in the near future. So before we get too far into the details of the particular type of rendering most useful for introductory real-time graphics, we want to take our stab at describing how rendering really works.

We also hasten to add that this discussion assumes that the goal is photorealism, simulating how things work in nature. In fact, this is often not the goal, and it certainly is never the only goal. Understanding how nature works is a very important starting place, but artistic and practical factors often dictate a different strategy than just simulating nature.

## 10.1.1 The Two Major Approaches to Rendering

We begin with the end in mind. The end goal of rendering is a bitmap, or perhaps a sequence of bitmaps if we are producing an animation. You almost certainly already know that a bitmap is a rectangular array of colors, and each grid entry is known as a pixel, which is short for “picture element.” At the time we are producing the image, this bitmap is also known as the frame buffer, and often there is additional post-processing or conversion that happens when we copy the frame buffer to the final bitmap output.

How do we determine the color of each pixel? That is the fundamental question of rendering. Like so many challenges in computer science, a great place to start is by investigating how nature works.

We see light. The image that we perceive is the result of light that bounces around the environment and finally enters the eye. This process is complicated, to say the least. Not only is the physics1 of the light bouncing around very complicated, but so are the physiology of the sensing equipment in our eyes2 and the interpreting mechanisms in our minds. Thus, ignoring a great number of details and variations (as any introductory book must do), the basic question that any rendering system must answer for each pixel is “What color of light is approaching the camera from the direction corresponding to this pixel?”

There are basically two cases to consider. Either we are looking directly at a light source and light traveled directly from the light source to our eye, or (more commonly) light departed from a light source in some other direction, bounced one or more times, and then entered our eye. We can decompose the key question asked previously into two tasks. This book calls these two tasks the rendering algorithm, although these two highly abstracted procedures obviously conceal a great deal of complexity about the actual algorithms used in practice to implement it.

The rendering algorithm
• Visible surface determination. Find the surface that is closest to the eye, in the direction corresponding to the current pixel.
• Lighting. Determine what light is emitted and/or reflected off this surface in the direction of the eye.

At this point it appears that we have made some gross simplifications, and many of you no doubt are raising your metaphorical hands to ask “What about translucency?” “What about reflections?” “What about refraction?” “What about atmospheric effects?” Please hold all questions until the end of the presentation.

The first step in the rendering algorithm is known as visible surface determination. There are two common solutions to this problem. The first is known as raytracing. Rather than following light rays in the direction that they travel from the emissive surfaces, we trace the rays backward, so that we can deal only with the light rays that matter: the ones that enter our eye from the given direction. We send a ray out from the eye in the direction through the center of each pixel3 to see the first object in the scene this ray strikes. Then we compute the color that is being emitted or reflected from that surface back in the direction of the ray. A highly simplified summary of this algorithm is illustrated by Listing 10.1.

for (each x,y screen pixel) {

    // Select a ray for this pixel
    Ray ray = getRayForPixel(x,y);

    // Intersect the ray against the geometry.  This will
    // not just return the point of intersection, but also
    // a surface normal and some other information needed
    // to shade the point, such as an object reference,
    // material information, local S,T coordinates, etc.
    // Don't take this pseudocode too literally.
    Vector3 pos, normal;
    Object *obj; Material *mtl;
    if (rayIntersectScene(ray, pos, normal, obj, mtl)) {

        // Shade the intersection point.  (What light is
        // emitted/reflected from this point towards the camera?)
        Color c = shadePoint(ray, pos, normal, obj, mtl);

        // Put it into the frame buffer
        writeFrameBuffer(x,y, c);

    } else {

        // Ray missed the entire scene.  Just use a generic
        // background color at this pixel
        writeFrameBuffer(x,y, backgroundColor);
    }
}
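As a hedged illustration of the raytracing loop, the sketch below fills in two of its pieces with the simplest choices available. It is C++ rather than pseudocode, and everything in it is an assumption made for illustration, not part of the listing above: the `Vec3`, `Ray`, and `Sphere` types, a camera at the origin looking down +z with the image plane at z = 1, and a scene that contains only spheres. `getRayForPixel` aims a ray through the center of a pixel, and `findClosestSphere` performs visible surface determination by keeping the nearest hit.

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 normalize(Vec3 v) {
    double len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

struct Ray { Vec3 origin, dir; };            // dir is assumed to be unit length
struct Sphere { Vec3 center; double radius; };

// Ray from the eye (assumed at the origin, looking down +z) through the
// center of pixel (px, py). The image plane is assumed to sit at z = 1
// and to span [-1, 1] in both x and y.
Ray getRayForPixel(int px, int py, int width, int height) {
    double u = ((px + 0.5) / width) * 2.0 - 1.0;    // left to right
    double v = 1.0 - ((py + 0.5) / height) * 2.0;   // top to bottom
    return {{0.0, 0.0, 0.0}, normalize({u, v, 1.0})};
}

// Distance along the ray to the nearer intersection with the sphere.
// The result is negative when the ray misses or when the nearer
// intersection lies behind the ray origin (a ray that starts inside
// a sphere is therefore reported as a miss in this sketch).
double intersectSphere(const Ray& ray, const Sphere& s) {
    Vec3 oc = sub(ray.origin, s.center);
    double b = dot(oc, ray.dir);
    double c = dot(oc, oc) - s.radius * s.radius;
    double disc = b * b - c;
    if (disc < 0.0) return -1.0;     // no real roots: the ray misses
    return -b - std::sqrt(disc);     // nearer of the two roots
}

// Visible surface determination: index of the closest sphere hit by
// the ray, or -1 if the ray misses the entire scene.
int findClosestSphere(const Ray& ray, const std::vector<Sphere>& scene) {
    int closest = -1;
    double closestT = std::numeric_limits<double>::infinity();
    for (int i = 0; i < static_cast<int>(scene.size()); ++i) {
        double t = intersectSphere(ray, scene[i]);
        if (t > 0.0 && t < closestT) {
            closestT = t;
            closest = i;
        }
    }
    return closest;
}
```

A caller would loop over every pixel, call `getRayForPixel`, and shade whatever `findClosestSphere` returns, writing a background color when the result is -1. The shading step is deliberately left out here, since it is the subject of the lighting discussion.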


The other major strategy for visible surface determination, the one used for real-time rendering at the time of this writing, is known as depth buffering. The basic plan is that at each pixel we store not only a color value, but also a depth value. This depth buffer value records the distance from the eye to the surface that is reflecting or emitting the light used to determine the color for that pixel. As illustrated in Listing 10.1, the “outer loop” of a raytracer is the screen-space pixels, but in real-time graphics, the “outer loop” is the geometric elements that make up the surface of the scene.

The different methods for describing surfaces are not important here. What is important is that we can project the surfaces onto screen space and map them to screen-space pixels through a process known as rasterization. For each pixel of the surface, known as the source fragment, we compute the depth of the surface at that pixel and compare it to the existing value in the depth buffer, sometimes known as the destination fragment. If the source fragment we are currently rendering is farther away from the camera than the existing value in the buffer, then whatever we rendered before this is obscuring the surface we are now rendering (at least at this one pixel), and we move on to the next pixel. However, if our depth value is closer than the existing value in the depth buffer, then we know this is the closest surface to the eye (at least of those rendered so far) and so we update the depth buffer with this new, closer depth value. At this point we might also proceed to step 2 of the rendering algorithm (at least for this pixel) and update the frame buffer with the color of the light being emitted or reflected from the surface at that point. This is known as forward rendering, and the basic idea is illustrated by Listing 10.2.

// Clear the frame and depth buffers
fillFrameBuffer(backgroundColor);
fillDepthBuffer(infinity);

// Outer loop iterates over all the primitives (usually triangles)
for (each geometric primitive) {

    // Rasterize the primitive
    for (each pixel x,y in the projection of the primitive) {

        // Test the depth buffer, to see if a closer pixel has
        // already been written.  (readDepthBuffer is assumed
        // to be the counterpart of writeDepthBuffer.)
        float primDepth = getDepthOfPrimitiveAtPixel(x,y);
        if (primDepth > readDepthBuffer(x,y)) {

            // Pixel of this primitive is obscured, discard it
            continue;
        }

        // Determine primitive color at this pixel.
        Color c = getColorOfPrimitiveAtPixel(x,y);

        // Update the color and depth buffers
        writeFrameBuffer(x,y, c);
        writeDepthBuffer(x,y, primDepth);
    }
}
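The depth test at the heart of forward rendering can be sketched in a few lines of C++. This is a minimal sketch under stated assumptions, not how any particular API or GPU implements it: the `Color` and `FrameBuffers` types and the `writeFragment` method are hypothetical names, depth is stored as a plain `float` distance, and a larger value means farther from the eye.

```cpp
#include <limits>
#include <vector>

struct Color { float r, g, b; };

// A frame buffer paired with a depth buffer of the same dimensions.
struct FrameBuffers {
    int width, height;
    std::vector<Color> color;
    std::vector<float> depth;

    // Clear the color buffer to the background and the depth buffer
    // to infinity, so that the first fragment at any pixel always wins.
    FrameBuffers(int w, int h, Color background)
        : width(w), height(h),
          color(w * h, background),
          depth(w * h, std::numeric_limits<float>::infinity()) {}

    // Apply the depth test to one source fragment. Returns true if the
    // fragment was at least as close as the destination and was written;
    // returns false if it was obscured and discarded.
    bool writeFragment(int x, int y, float srcDepth, Color srcColor) {
        int i = y * width + x;
        if (srcDepth > depth[i]) {
            return false;            // something closer is already there
        }
        depth[i] = srcDepth;         // record the new closest depth
        color[i] = srcColor;         // and the color seen at that depth
        return true;
    }
};
```

Because each pixel remembers only the closest depth seen so far, primitives can be submitted in any order and the final image is the same, which is the property that makes the "outer loop over primitives" workable.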


Opposed to forward rendering is deferred rendering, an old technique that is becoming popular again due to the current location of bottlenecks in the types of images we are producing and the hardware we are using to produce them. A deferred renderer uses, in addition to the frame buffer and the depth buffer, additional buffers, collectively known as the G-buffer (short for “geometry” buffer), which holds extra information about the surface closest to the eye at that location, such as the 3D location of the surface, the surface normal, and material properties needed for lighting calculations, such as the “color” of the object and how “shiny” it is at that particular location. (Later, we see how those intuitive terms in quotes are a bit too vague for rendering purposes.) Compared to a forward renderer, a deferred renderer follows our two-step rendering algorithm a bit more literally. First we “render” the scene into the G-buffer, essentially performing only visibility determination—fetching the material properties of the point that is “seen” by each pixel but not yet performing lighting calculations. The second pass actually performs the lighting calculations. Listing 10.3 explains deferred rendering in pseudocode.

// Clear the geometry and depth buffers
clearGeometryBuffer();
fillDepthBuffer(infinity);

// Rasterize all primitives into the G-buffer
for (each geometric primitive) {
    for (each pixel x,y in the projection of the primitive) {

        // Test the depth buffer, to see if a closer pixel has
        // already been written.  (readDepthBuffer is assumed
        // to be the counterpart of writeDepthBuffer.)
        float primDepth = getDepthOfPrimitiveAtPixel(x,y);
        if (primDepth > readDepthBuffer(x,y)) {

            // Pixel of this primitive is obscured, discard it
            continue;
        }

        // Fetch information needed for shading in the next pass.
        // (getPrimitiveShadingInfo is a placeholder name.)
        MaterialInfo mtlInfo;
        Vector3 pos, normal;
        getPrimitiveShadingInfo(x,y, mtlInfo, pos, normal);

        // Save it off into the G-buffer and depth buffer
        writeGeometryBuffer(x,y, mtlInfo, pos, normal);
        writeDepthBuffer(x,y, primDepth);
    }
}

// Now perform shading in a 2nd pass, in screen space
for (each x,y screen pixel) {

    if (readDepthBuffer(x,y) == infinity) {

        // No geometry here.  Just write a background color
        writeFrameBuffer(x,y, backgroundColor);

    } else {

        // Fetch shading info back from the geometry buffer
        // (readGeometryBuffer is a placeholder name.)
        MaterialInfo mtlInfo;
        Vector3 pos, normal;
        readGeometryBuffer(x,y, mtlInfo, pos, normal);

        Color c = shadePoint(pos, normal, mtlInfo);

        // Put it into the frame buffer
        writeFrameBuffer(x,y, c);
    }
}

Listing 10.3 Pseudocode for deferred rendering using the depth buffer

Before moving on, we must mention one important point about why deferred rendering is popular. When multiple light sources illuminate the same surface point, hardware limitations or performance factors may prevent us from computing the final color of a pixel in a single calculation, as was shown in the pseudocode listings for both forward and deferred rendering. Instead, we must use multiple passes, one pass for each light, and accumulate the reflected light from each light source into the frame buffer. In forward rendering, these extra passes involve rerendering the primitives. Under deferred rendering, however, extra passes are in image space, and thus depend on the 2D size of the light in screen space, not on the complexity of the scene! It is in this situation that deferred rendering really begins to have large performance advantages over forward rendering.

## 10.1.2 Describing Surface Properties: The BRDF

Now let's talk about the second step in the rendering algorithm: lighting. Once we have located the surface closest to the eye, we must determine the amount of light emitted directly from that surface, or emitted from some other source and reflected off the surface in the direction of the eye. The light directly transmitted from a surface to the eye—for example, when looking directly at a light bulb or the sun—is the simplest case. These emissive surfaces are a small minority in most scenes; most surfaces do not emit their own light, but rather they only reflect light that was emitted from somewhere else. We will focus the bulk of our attention on the nonemissive surfaces.

Although we often speak informally about the “color” of an object, we know that the perceived color of an object is actually the light that is entering our eye, and thus can depend on many different factors. Important questions to ask are: What colors of light are incident on the surface, and from what directions? From which direction are we viewing the surface? How “shiny” is the object?4 So a description of a surface suitable for use in rendering doesn't answer the question “What color is this surface?” This question is sometimes meaningless—what color is a mirror, for example? Instead, the salient question is a bit more complicated, and it goes something like, “When light of a given color strikes the surface from a given incident direction, how much of that light is reflected in some other particular direction?” The answer to this question is given by the bidirectional reflectance distribution function, or BRDF for short. So rather than “What color is the object?” we ask, “What is the distribution of reflected light?”

Symbolically, we write the BRDF as the function $f(\mathbf{x}, \hat{\boldsymbol{\omega}}_{\mathrm{in}}, \hat{\boldsymbol{\omega}}_{\mathrm{out}}, \lambda)$.5 The value of this function is a scalar that describes the relative likelihood that light incident at the point $\mathbf{x}$ from direction $\hat{\boldsymbol{\omega}}_{\mathrm{in}}$ will be reflected in the outgoing direction $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$ rather than some other outgoing direction. As indicated by the boldface type and hat, $\hat{\boldsymbol{\omega}}$ might be a unit vector, but more generally it can be any way of specifying a direction; polar angles are another obvious choice and are commonly used. Different colors of light are usually reflected differently; hence the dependence on $\lambda$, which is the color (actually, the wavelength) of the light.

Although we are particularly interested in the incident directions that come from emissive surfaces and the outgoing directions that point towards our eye, in general, the entire distribution is relevant. First of all, lights, eyes, and surfaces can move around, so in the context of creating a surface description (for example, “red leather”), we don't know which directions will be important. But even in a particular scene with all the surfaces, lights, and eyes fixed, light can bounce around multiple times, so we need to measure light reflections for arbitrary pairs of directions.

Before moving on, it's highly instructive to see how the two intuitive material properties that were earlier disparaged, color and shininess, can be expressed precisely in the framework of a BRDF. Consider a green ball. A green object is green and not blue because it reflects incident light that is green more strongly than incident light of any other color.6 For example, perhaps green light is almost all reflected, with only a small fraction absorbed, while 95% of the blue and red light is absorbed and only 5% of light at those wavelengths is reflected in various directions. White light actually consists of all the different colors of light, so a green object essentially filters out colors other than green. If a different object responded to green and red light in the same manner as our green ball, but absorbed 50% of the blue light and reflected the other 50%, we might perceive the object as teal. Or if most of the light at all wavelengths was absorbed, except for a small amount of green light, then we would perceive it as a dark shade of green. To summarize, a BRDF accounts for the difference in color between two objects through the dependence on $\lambda$: any given wavelength of light has its own reflectance distribution.
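The "filtering" idea can be sketched with three color channels standing in for wavelengths. The following C++ is only an illustration under stated assumptions: the `RGB` type and `reflectedLight` function are hypothetical, direction is ignored entirely, and the 0.90 green reflectance is our stand-in for "almost all reflected." The 5% and 50% figures come from the paragraph above.

```cpp
// One value per color channel; used both for light and for reflectance.
struct RGB { double r, g, b; };

// Light leaving the surface: each channel of the incident light is
// scaled by the fraction of that channel the surface reflects.
RGB reflectedLight(RGB incident, RGB reflectance) {
    return {incident.r * reflectance.r,
            incident.g * reflectance.g,
            incident.b * reflectance.b};
}

// Green ball: 5% of red and blue reflected, 90% of green (assumed value).
const RGB kGreenBall = {0.05, 0.90, 0.05};

// Same response to red and green, but 50% of blue reflected: teal.
const RGB kTealObject = {0.05, 0.90, 0.50};
```

Under white light `{1, 1, 1}` the green ball returns mostly green, as expected. Under pure red light `{1, 0, 0}` it returns only a dim red, which is one way to see why "What color is this surface?" is not a sufficient description: the answer depends on the light that arrives.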

Next, consider the difference between shiny red plastic and diffuse red construction paper. A shiny surface reflects incident light much more strongly in one particular direction compared to others, whereas a diffuse surface scatters light more evenly across all outgoing directions. A perfect reflector, such as a mirror, would reflect all the light from one incoming direction in a single outgoing direction, whereas a perfectly diffuse surface would reflect light equally in all outgoing directions, regardless of the direction of incidence. In summary, a BRDF accounts for the difference in “shininess” of two objects through its dependence on $\hat{\boldsymbol{\omega}}_{\mathrm{in}}$ and $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$.

More complicated phenomena can be expressed by generalizing the BRDF. Translucence and light refraction can be easily incorporated by allowing the direction vectors to point back into the surface. We might call this mathematical generalization a bidirectional surface scattering distribution function (BSSDF). Sometimes light strikes an object, bounces around inside of it, and then exits at a different point. This phenomenon is known as subsurface scattering and is an important aspect of the appearances of many common substances, such as skin and milk. This requires splitting the single reflection point $\mathbf{x}$ into $\mathbf{x}_{\mathrm{in}}$ and $\mathbf{x}_{\mathrm{out}}$, which is used by the bidirectional surface scattering distribution function (BSSDF). Even volumetric effects, such as fog and subsurface scattering, can be expressed by dropping the word “surface” and defining a bidirectional scattering distribution function (BSDF) at any point in space, not just on the “surfaces.” Taken at face value, these might seem like impractical abstractions, but they can be useful in understanding how to design practical tools.

By the way, there are certain criteria that a BRDF must satisfy in order to be physically plausible. First, it doesn't make sense for a negative amount of light to be reflected in any direction. Second, it's not possible for the total reflected light to be more than the light that was incident, although the surface may absorb some energy so the reflected light can be less than the incident light. This rule is usually called the normalization constraint. A final, less obvious principle obeyed by physical surfaces is Helmholtz reciprocity: if we pick two arbitrary directions, the same fraction of light should be reflected, no matter which is the incident direction and which is the outgoing direction. In other words,

Helmholtz reciprocity
$f(\mathbf{x}, \hat{\boldsymbol{\omega}}_{1}, \hat{\boldsymbol{\omega}}_{2}, \lambda) = f(\mathbf{x}, \hat{\boldsymbol{\omega}}_{2}, \hat{\boldsymbol{\omega}}_{1}, \lambda).$

Due to Helmholtz reciprocity, some authors don't label the two directions in the BRDF as “in” and “out” because to be physically plausible the computation must be symmetric.
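To make the three plausibility criteria concrete, here is a hedged C++ sketch that checks them numerically for the simplest BRDF there is: the perfectly diffuse (Lambertian) one, whose value is the constant albedo divided by $\pi$. The names `Dir`, `lambertBrdf`, `reflectedFraction`, and `isReciprocal` are ours, directions are given as polar angles, and a single color channel is used. The cosine factor in the integral is the projected-area term that this chapter takes up under Lambert's law; we assume it here without derivation.

```cpp
#include <cmath>

const double kPi = 3.14159265358979323846;

// A direction in the hemisphere above the surface:
// theta is the polar angle from the normal, phi is the azimuth.
struct Dir { double theta, phi; };

// Lambertian BRDF for one color channel: the same value for every pair
// of directions. albedo is the fraction of light reflected, in [0, 1].
double lambertBrdf(Dir /*in*/, Dir /*out*/, double albedo) {
    return albedo / kPi;
}

// Fraction of the light arriving from direction `in` that is reflected
// at all: the integral of f * cos(theta_out) over the outgoing
// hemisphere, estimated with a midpoint rule on a theta/phi grid.
template <typename Brdf>
double reflectedFraction(Brdf f, Dir in, int steps = 256) {
    double dTheta = (kPi / 2.0) / steps;
    double dPhi = (2.0 * kPi) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) {
        double theta = (i + 0.5) * dTheta;
        for (int j = 0; j < steps; ++j) {
            double phi = (j + 0.5) * dPhi;
            // sin(theta) * dTheta * dPhi is the solid-angle element.
            double dOmega = std::sin(theta) * dTheta * dPhi;
            sum += f(in, Dir{theta, phi}) * std::cos(theta) * dOmega;
        }
    }
    return sum;
}

// Helmholtz reciprocity for one pair of directions: swapping the
// incident and outgoing directions must not change the BRDF value.
template <typename Brdf>
bool isReciprocal(Brdf f, Dir a, Dir b, double tol = 1e-12) {
    return std::fabs(f(a, b) - f(b, a)) <= tol;
}
```

For a Lambertian surface with albedo 0.8, `reflectedFraction` comes out very close to 0.8 for any incident direction, so the normalization constraint holds exactly when the albedo does not exceed 1. Reciprocity is trivially satisfied because the function ignores both directions; a more interesting BRDF could be passed to the same two checks.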

The BRDF contains the complete description of an object's appearance at a given point, since it describes how the surface will reflect light at that point. Clearly, a great deal of thought must be put into the design of this function. Numerous lighting models have been proposed over the last several decades, and what is surprising is that one of the earliest models, Blinn-Phong, is still in widespread use in real-time graphics today. Although it is not physically accurate (nor plausible: it violates the normalization constraint), we study it because it is a good educational stepping stone and an important bit of graphics history. Actually, describing Blinn-Phong as “history” is wishful thinking; perhaps the most important reason to study this model is that it still is in such widespread use! In fact, it's the best example of the phenomenon we mentioned at the start of this chapter: particular methods being presented as if they are “the way graphics work.”

Different lighting models have different goals. Some are better at simulating rough surfaces, others at surfaces with multiple strata. Some focus on providing intuitive “dials” for artists to control, without concern for whether those dials have any physical significance at all. Others are based on taking real-world surfaces and measuring them with special cameras called goniophotometers, essentially sampling the BRDF and then using interpolation to reconstruct the function from the tabulated data. The notable Blinn-Phong model discussed in Section 10.6 is useful because it is simple, inexpensive, and well understood by artists. Consult the sources in the suggested reading for a survey of lighting models.

## 10.1.3 A Very Brief Introduction to Colorimetry and Radiometry

Graphics is all about measuring light, and you should be aware of some important subtleties, even though we won't have time to go into complete detail here. The first is how to measure the color of light, and the second is how to measure its brightness.

In your middle school science classes you might have learned that every color of light is some mixture of red, green, and blue (RGB) light. This is the popular conception of light, but it's not quite correct. Light can take on any single frequency in the visible band, or it might be a combination of any number of frequencies. Color is a phenomenon of human perception and is not quite the same thing as frequency. Indeed, different combinations of frequencies of light can be perceived as the same color; these are known as metamers. The infinite combinations of frequencies of light are sort of like all the different chords that can be played on a piano (and also tones between the keys). In this metaphor our color perception is unable to pick out all the different individual notes, but instead, any given chord sounds to us like some combination of middle C, F, and G. Three color channels is not a magic number as far as physics is concerned; it's peculiar to human vision. Most other mammals have only two different types of receptors (we would call them “color blind”), and fish, reptiles, and birds have four types of color receptors (they would call us color blind).

However, even very advanced rendering systems project the continuous spectrum of visible light onto some discrete basis, most commonly, the RGB basis. This is a ubiquitous simplification, but we still wanted to let you know that it is a simplification, as it doesn't account for certain phenomena. The RGB basis is not the only color space, nor is it necessarily the best one for many purposes, but it is a very convenient basis because it is the one used by most display devices. In turn, the reason that this basis is used by so many display devices is due to the similarity to our own visual system. Hall  does a good job of describing the shortcomings of the RGB system.

Since the visible portion of the electromagnetic spectrum is continuous, an expression such as $f(\mathbf{x}, \hat{\boldsymbol{\omega}}_{\mathrm{in}}, \hat{\boldsymbol{\omega}}_{\mathrm{out}}, \lambda)$ is continuous in terms of $\lambda$. At least it should be in theory. In practice, because we are producing images for human consumption, we reduce the infinite number of different values of $\lambda$ down to three particular wavelengths. Usually, we choose the three wavelengths to be those perceived as the colors red, green, and blue. In practice, you can think of the presence of $\lambda$ in an equation as an integer that selects which of the three discrete “color channels” is being operated on.

• To describe the spectral distribution of light requires a continuous function, not just three numbers. However, to describe the human perception of that light, three numbers are essentially sufficient.
• The RGB system is a convenient color space, but it's not the only one, and not even the best one for many practical purposes. In practice, we usually treat light as being a combination of red, green, and blue because we are making images for human consumption.

You should also be aware of the different ways that we can measure the intensity of light. If we take a viewpoint from physics, we consider light as energy in the form of electromagnetic radiation, and we use units of measurement from the field of radiometry. The most basic quantity is radiant energy, which in the SI system is measured in the standard unit of energy, the joule (J). Just like any other type of energy, we are often interested in the rate of energy flow per unit time, which is known as power. In the SI system power is measured using the watt (W), which is one joule per second (1 W = 1 J/s). Power in the form of electromagnetic radiation is called radiant power or radiant flux. The term “flux,” which comes from the Latin fluxus for “flow,” refers to some quantity flowing across some cross-sectional area. Thus, radiant flux measures the total amount of energy that is arriving, leaving, or flowing across some area per unit time.

Imagine that a certain amount of radiant flux is emitted from a surface, while that same amount of power is emitted from a different surface that is larger. Clearly, the smaller surface is “brighter” than the larger surface; more precisely, it has a greater flux per unit area, also known as flux density. The radiometric term for flux density, the radiant flux per unit area, is radiosity, and in the SI system it is measured in watts per square meter (W/m²). The relationship between flux and radiosity is analogous to the relationship between force and pressure; confusing the two will lead to similar sorts of conceptual errors.
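The relationship between energy, flux, and flux density is plain arithmetic, sketched below in C++. The function names and the sample numbers in the note that follows are ours, chosen only to make the units tangible.

```cpp
// Radiant flux in watts: radiant energy (joules) per unit time (seconds).
double radiantFlux(double energyJoules, double seconds) {
    return energyJoules / seconds;
}

// Flux density (radiosity) in watts per square meter:
// radiant flux (watts) per unit area (square meters).
double fluxDensity(double fluxWatts, double areaSquareMeters) {
    return fluxWatts / areaSquareMeters;
}
```

For instance, 10 J delivered over 2 s is a flux of 5 W. Spread over 1 m², that is a flux density of 5 W/m²; spread over 4 m², the same 5 W gives only 1.25 W/m², so the smaller surface is the "brighter" one even though the total power is identical.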

Several equivalent terms exist for radiosity. First, note that we can measure the flux density (or total flux, for that matter) across any cross-sectional area. We might be measuring the radiant power emitted from some surface with a finite area, or the surface through which the light flows might be an imaginary boundary that exists only mathematically (for example, the surface of some imaginary sphere that surrounds a light source). Although in all cases we are measuring flux density, and thus the term “radiosity” is perfectly valid, we might also use more specific terms, depending on whether the light being measured is coming or going. If the area is a surface and the light is arriving on the surface, then the term irradiance is used. If light is being emitted from a surface, the term radiant exitance or radiant emittance is used. In digital image synthesis, the word “radiosity” is most often used to refer to light that is leaving a surface, having been either reflected or emitted.

When we are talking about the brightness at a particular point, we cannot use plain old radiant power because the area of that point is infinitesimal (essentially zero). We can speak of the flux density at a single point, but to measure flux, we need a finite area over which to measure. For a surface of finite area, if we have a single number that characterizes the total for the entire surface area, it will be measured in flux, but to capture the fact that different locations within that area might be brighter than others, we use a function that varies over the surface that will measure the flux density.

Now we are ready to consider what is perhaps the most central quantity we need to measure in graphics: the intensity of a “ray” of light. We can see why the radiosity is not the unit for the job by an extension of the ideas from the previous paragraph. Imagine a surface point surrounded by an emissive dome and receiving a certain amount of irradiance coming from all directions in the hemisphere centered on the local surface normal. Now imagine a second surface point experiencing the same amount of irradiance, only all of the illumination is coming from a single direction, in a very thin beam. Intuitively, we can see that a ray along this beam is somehow “brighter” than any one ray that is illuminating the first surface point. The irradiance is somehow “denser.” It is denser per unit solid angle.

The idea of a solid angle is probably new to some readers, but we can easily understand the idea by comparing it to angles in the plane. A “regular” angle is measured (in radians) based on the length of its projection onto the unit circle. In the same way, a solid angle measures the area as projected onto the unit sphere surrounding the point. The SI unit for solid angle is the steradian, abbreviated “sr.” The complete sphere has $4\pi$ sr; a hemisphere encompasses $2\pi$ sr.

Figure 10.1 The two surfaces are receiving identical bundles of light, but the surface on the bottom has a larger area, and thus has a lower irradiance.
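To make the steradian a bit more concrete, here is a minimal Python sketch. It relies on the standard formula for the solid angle of a cone of directions with half-angle $\theta$, namely $2\pi(1-\cos\theta)$; the function name `cap_solid_angle` is our own invention for illustration.

```python
import math

def cap_solid_angle(half_angle_rad):
    """Solid angle (in steradians) subtended by a cone of directions
    with the given half-angle, measured from the cone's axis."""
    return 2.0 * math.pi * (1.0 - math.cos(half_angle_rad))

# A half-angle of 90 degrees covers a hemisphere; 180 degrees covers the full sphere.
hemisphere = cap_solid_angle(math.pi / 2.0)
sphere = cap_solid_angle(math.pi)

# A thin beam with a 1-degree half-angle covers only a tiny fraction of a steradian.
thin_beam = cap_solid_angle(math.radians(1.0))
```

The hemisphere and sphere values agree with the $2\pi$ and $4\pi$ figures quoted above, and the thin beam shows how little solid angle a narrow pencil of directions actually occupies.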

By measuring the irradiance per unit solid angle, we can express the intensity of light at a certain point as a function that varies based upon the direction of incidence. We are very close to having the unit of measurement that describes the intensity of a ray. There is just one slight catch, illustrated by Figure 10.1, which is a close-up of a very thin pencil of light rays striking a surface. On the top, the rays strike the surface perpendicularly, and on the bottom, light rays of the same strength strike a different surface at an angle. The key point is that the area of the top surface is smaller than the area of the bottom surface; therefore, the irradiance on the top surface is larger than the irradiance on the bottom surface, despite the fact that the two surfaces are being illuminated by the “same number” of identical light rays. This basic phenomenon, that the angle of the surface causes incident light rays to be spread out and thus contribute less irradiance, is known as Lambert's law. We have more to say about Lambert's law in Section 10.6.3, but for now, the key idea is that the contribution of a bundle of light to the irradiance at a surface depends on the angle of that surface.

Due to Lambert's law, the unit we use in graphics to measure the strength of a ray, radiance, is defined as the radiant flux per unit projected area, per unit solid angle. To measure a projected area, we take the actual surface area and project it onto the plane perpendicular to the ray. (In Figure 10.1, imagine taking the bottom surface and projecting it upwards onto the top surface). Essentially this counteracts Lambert's law.
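The situation in Figure 10.1 can be sketched numerically. The following Python fragment is only an illustration under simple assumptions: a thin parallel beam carrying a fixed flux, a flat surface, and an incidence angle measured from the surface normal. The function name and the sample numbers are ours.

```python
import math

def irradiance_from_beam(flux_watts, beam_cross_section_m2, incidence_angle_rad):
    """Irradiance (W/m^2) where a thin parallel beam lands on a flat surface.

    incidence_angle_rad is the angle between the beam and the surface normal,
    so 0 means the beam strikes the surface perpendicularly.
    """
    # Tilting the surface spreads the same beam over a larger lit area.
    lit_area = beam_cross_section_m2 / math.cos(incidence_angle_rad)
    return flux_watts / lit_area

# The same beam (1 W through a 0.01 m^2 cross section) on two surfaces.
perpendicular = irradiance_from_beam(1.0, 0.01, 0.0)
tilted = irradiance_from_beam(1.0, 0.01, math.radians(60.0))

# Dividing by the projected area instead of the lit area undoes the spreading,
# which is why radiance is defined per unit projected area.
tilted_per_projected_area = tilted / math.cos(math.radians(60.0))
```

The tilted surface receives half the irradiance of the perpendicular one, yet the value per unit projected area is the same for both, so it describes the beam rather than the surface.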

Table 10.1 summarizes the most important radiometric terms.

| Quantity | Units | SI unit | Rough translation |
|---|---|---|---|
| Radiant energy | Energy | $\mathrm{J}$ | Total illumination during an interval of time |
| Radiant flux | Power | $\mathrm{W}$ | Brightness of a finite area from all directions |
| Radiant flux density | Power per unit area | $\mathrm{W}/\mathrm{m}^{2}$ | Brightness of a single point from all directions |
| Irradiance | Power per unit area | $\mathrm{W}/\mathrm{m}^{2}$ | Radiant flux density of incident light |
| Radiant exitance | Power per unit area | $\mathrm{W}/\mathrm{m}^{2}$ | Radiant flux density of emitted light |
| Radiosity | Power per unit area | $\mathrm{W}/\mathrm{m}^{2}$ | Radiant flux density of emitted or reflected light |
| Radiance | Power per unit projected area, per unit solid angle | $\mathrm{W}/(\mathrm{m}^{2}\cdot \mathrm{sr})$ | Brightness of a ray |

Whereas radiometry takes the perspective of physics by measuring the raw energy of the light, the field of photometry weighs that same light using the human eye. For each of the corresponding radiometric terms, there is a similar term from photometry (Table 10.2). The only real difference is a nonlinear conversion from raw energy to perceived brightness.

| Radiometric term | Photometric term | SI photometric unit |
|---|---|---|
| Radiant energy | Luminous energy | talbot, or lumen second ($\mathrm{lm}\cdot \mathrm{s}$) |
| Radiant flux | Luminous flux, luminous power | lumen ($\mathrm{lm}$) |
| Irradiance | Illuminance | lux ($\mathrm{lx}=\mathrm{lm}/\mathrm{m}^{2}$) |
| Radiant exitance | Luminous emittance | lux ($\mathrm{lx}=\mathrm{lm}/\mathrm{m}^{2}$) |
| Radiance | Luminance | $\mathrm{lm}/(\mathrm{m}^{2}\cdot \mathrm{sr})$ |

Table 10.2 Units of measurement from radiometry and photometry

Throughout the remainder of this chapter, we try to use the proper radiometric units when possible. However, the practical realities of graphics make using proper units confusing, for two particular reasons. It is common in graphics to need to take some integral over a “signal” (for example, the color of some surface). In practice we cannot do the integral analytically, and so we must integrate numerically, which boils down to taking a weighted average of many samples. Although mathematically we are taking a weighted average (which ordinarily would not cause the units to change), in fact what we are doing is integrating, and that means each sample is really being multiplied by some differential quantity, such as a differential area or differential solid angle, which causes the physical units to change. A second cause of confusion is that, although many signals have a finite nonzero domain in the real world, they are represented in a computer by signals that are nonzero at a single point. (Mathematically, we say that the signal is a multiple of a Dirac delta; see Section 12.4.3.) For example, a real-world light source has a finite area, and we would be interested in the radiance of the light at a given point on the emissive surface, in a given direction. In practice, we imagine shrinking the area of this light down to zero while holding the radiant flux constant. The flux density becomes infinite in theory. Thus, for a real area light we would need a signal to describe the flux density, whereas for a point light, the flux density becomes infinite and we instead describe the brightness of the light by its total flux. We'll repeat this information when we talk about point lights.

• Vague words such as “intensity” and “brightness” are best avoided when the more specific radiometric terms can be used. The scale of our numbers is not that important and we don't need to use real world SI units, but it is helpful to understand what the different radiometric quantities measure to avoid mixing quantities together inappropriately.
• Use radiant flux to measure the total brightness of a finite area, in all directions.
• Use radiant flux density to measure the brightness at a single point, in all directions. Irradiance and radiant exitance refer to radiant flux density of light that is incident and emitted, respectively. Radiosity is the radiant flux density of light that is leaving a surface, whether the light was reflected or emitted.
• Due to Lambert's law, a given ray contributes more differential irradiance when it strikes a surface at a perpendicular angle compared to a glancing angle.
• Use radiance to measure the brightness of a ray. More specifically, radiance is the flux per unit projected area, per unit solid angle. We use projected area so that the value for a given ray is a property of the ray alone and does not depend on the orientation of the surface used to measure the flux density.
• Practical realities thwart our best intentions of doing things “the right way” when it comes to using proper units. Numerical integration is a lot like taking a weighted average, which hides the change of units that really occurs. Point lights and other Dirac deltas add further confusion.

## 10.1.4 The Rendering Equation

Now let's fit the BRDF into the rendering algorithm. In step 2 of our rendering algorithm (Section 10.1), we're trying to determine the radiance leaving a particular surface in the direction of our eye. The only way this can happen is for light to arrive from some direction onto the surface and get reflected in our direction. With the BRDF, we now have a way to measure this. Consider all the potential directions that light might be incident upon the surface, which form a hemisphere centered on $\mathbf{x}$, oriented according to the local surface normal $\hat{\mathbf{n}}$. For each potential direction $\hat{\boldsymbol{\omega}}_{\mathrm{in}}$, we measure the color of light incident from that direction. The BRDF tells us how much of the radiance from $\hat{\boldsymbol{\omega}}_{\mathrm{in}}$ is reflected in the direction $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$ towards our eye (as opposed to scattered in some other direction or absorbed). By summing up the radiance reflected towards $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$ over all possible incident directions, we obtain the total radiance reflected along $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$ into our eye. We add the reflected light to any light that is being emitted from the surface in our direction (which is zero for most surfaces), and voila, we have the total radiance. Writing this in math notation, we have the rendering equation.

The Rendering Equation
$\begin{array}{}\text{(10.1)}& \begin{aligned}L_{\mathrm{out}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{out}},\lambda\right) &= L_{\mathrm{emis}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{out}},\lambda\right)\\ &\quad +\int_{\Omega}L_{\mathrm{in}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{in}},\lambda\right)\,f\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{in}},\hat{\boldsymbol{\omega}}_{\mathrm{out}},\lambda\right)\left(-\hat{\boldsymbol{\omega}}_{\mathrm{in}}\cdot \hat{\mathbf{n}}\right)\,d\hat{\boldsymbol{\omega}}_{\mathrm{in}}.\end{aligned}\end{array}$

As fundamental as Equation (10.1) may be, its development is relatively recent, having been published in SIGGRAPH in 1986 by Kajiya. Furthermore, it was the result of, rather than the cause of, numerous strategies for producing realistic images. Graphics researchers pursued the creation of images through different techniques that seemed to make sense to them before having a framework to describe the problem they were trying to solve. And for many years after that, most of us in the video game industry were unaware that the problem we were trying to solve had finally been clearly defined. (Many still are.)

Now let's convert this equation into English and see what the heck it means. First of all, notice that $\mathbf{x}$ and $\lambda$ appear in each function. The whole equation governs a balance of radiance at a single surface point $\mathbf{x}$ for a single wavelength (“color channel”) $\lambda$ . So this balance equation applies to each color channel individually, at all surface points simultaneously.

The term $L_{\mathrm{out}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{out}},\lambda\right)$ on the left side of the equals sign is simply “the radiance leaving the point in the direction $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$.” Of course, if $\mathbf{x}$ is the visible surface at a given pixel, and $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$ is the direction from $\mathbf{x}$ to the eye, then this quantity is exactly what we need to determine the pixel color. But note that the equation is more general, allowing us to compute the outgoing radiance in any arbitrary direction $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$ and for any given point $\mathbf{x}$, whether or not $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$ points towards our eye.

On the right-hand side, we have a sum. The first term in the sum, $L_{\mathrm{emis}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{out}},\lambda\right)$, is “the radiance emitted from $\mathbf{x}$ in the direction $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$” and will be nonzero only for special emissive surfaces. The second term, the integral, is “the light reflected from $\mathbf{x}$ in the direction of $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$.” Thus, from a high level the rendering equation would seem to state the rather obvious relation

$\left(\text{Total radiance towards } \hat{\boldsymbol{\omega}}_{\mathrm{out}}\right)=\left(\text{Radiance emitted towards } \hat{\boldsymbol{\omega}}_{\mathrm{out}}\right)+\left(\text{Radiance reflected towards } \hat{\boldsymbol{\omega}}_{\mathrm{out}}\right).$

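Now let's dig into that intimidating integral. (By the way, if you haven't had calculus and haven't read Chapter 11 yet, just replace the word “integral” with “sum,” and you won't miss any of the main point of this section.) We've actually already discussed how it works when we talked about the BRDF, but let's repeat it with different words. We might rewrite the integral as

$\int_{\Omega}\left(\text{Radiance incident from } \hat{\boldsymbol{\omega}}_{\mathrm{in}} \text{ and reflected towards } \hat{\boldsymbol{\omega}}_{\mathrm{out}}\right)\,d\hat{\boldsymbol{\omega}}_{\mathrm{in}}.$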

Note that the symbol $\Omega$ (uppercase Greek omega) appears where we normally would write the limits of integration. This is intended to mean “sum over the hemisphere of possible incoming directions.” For each incoming direction $\hat{\boldsymbol{\omega}}_{\mathrm{in}}$, we determine how much radiance was incident in this incoming direction and got scattered in the outgoing direction $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$. The sum of all these contributions from all the different incident directions gives the total radiance reflected in the direction $\hat{\boldsymbol{\omega}}_{\mathrm{out}}$. Of course, there are an infinite number of incident directions, which is why this is an integral. In practice, we cannot evaluate the integral analytically, and we must sample a discrete number of directions, turning the “$\int$” into a “$\sum$.”
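Turning the integral into a sum can be sketched in a few lines of Python. This is only an illustration under assumptions that are ours, not part of the discussion above: incident directions are sampled uniformly over the hemisphere (so each sample stands for $2\pi/N$ steradians), the surface normal is fixed at $+z$, the incident radiance is the same from every direction, and the BRDF is a constant $\mathrm{albedo}/\pi$ (a perfectly diffuse surface). All function names are hypothetical.

```python
import math
import random

NORMAL = (0.0, 0.0, 1.0)  # assumed surface normal; the sampler below is built around +z

def sample_direction_to_light(rng):
    """Uniformly sample a unit vector on the hemisphere around +z."""
    z = rng.random()                       # z is uniform for uniform sphere sampling
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)

def reflected_radiance(l_in, brdf, num_samples, rng):
    """Estimate the reflected-radiance integral by a finite sum over sampled directions."""
    total = 0.0
    for _ in range(num_samples):
        to_light = sample_direction_to_light(rng)
        omega_in = tuple(-c for c in to_light)   # incident direction points at the surface
        lambert = -sum(a * b for a, b in zip(omega_in, NORMAL))
        total += l_in(omega_in) * brdf(omega_in) * lambert
    # Each sample represents an equal slice (2*pi / N steradians) of the hemisphere.
    return total * (2.0 * math.pi / num_samples)

albedo = 0.5
incident = 3.0
estimate = reflected_radiance(
    l_in=lambda w: incident,           # same radiance arriving from every direction
    brdf=lambda w: albedo / math.pi,   # constant (diffuse) BRDF, an assumption
    num_samples=20000,
    rng=random.Random(1234),
)
```

For this particular setup the integral can be done by hand and equals `albedo * incident`, so the estimate should land close to that value, with some noise that shrinks as the number of samples grows.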

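Now all that is left is to dissect the integrand. It's a product of three factors:

$\left(\text{Radiance incident from } \hat{\boldsymbol{\omega}}_{\mathrm{in}} \text{ and reflected towards } \hat{\boldsymbol{\omega}}_{\mathrm{out}}\right)=L_{\mathrm{in}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{in}},\lambda\right)\,f\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{in}},\hat{\boldsymbol{\omega}}_{\mathrm{out}},\lambda\right)\left(-\hat{\boldsymbol{\omega}}_{\mathrm{in}}\cdot \hat{\mathbf{n}}\right).$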

The first factor denotes the radiance incident from the direction of $\hat{\boldsymbol{\omega}}_{\mathrm{in}}$. The next factor is simply the BRDF, which tells us how much of the radiance incident from this particular direction will be reflected in the outgoing direction we care about. Finally, we have the Lambert factor. As discussed in Section 10.1.2, this accounts for the fact that more incident light is available to be reflected, per unit surface area, when $\hat{\boldsymbol{\omega}}_{\mathrm{in}}$ is perpendicular to the surface than when at a glancing angle to the surface. The vector $\hat{\mathbf{n}}$ is the outward-facing surface normal; the dot product $-\hat{\boldsymbol{\omega}}_{\mathrm{in}}\cdot \hat{\mathbf{n}}$ peaks at 1 in the perpendicular direction and trails off to zero as the angle of incidence becomes more glancing. We discuss the Lambert factor once more in Section 10.6.3.

In purely mathematical terms, the rendering equation is an integral equation: it states a relationship between some unknown function $L_{\mathrm{out}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{out}},\lambda\right)$, the distribution of light on the surfaces in the scene, and its own integral. It might not be apparent that the rendering equation is recursive, but $L_{\mathrm{out}}$ actually appears on both sides of the equals sign. It appears in the evaluation of $L_{\mathrm{in}}\left(\mathbf{x},\hat{\boldsymbol{\omega}}_{\mathrm{in}},\lambda\right)$, which is precisely the expression we set out to solve for each pixel: what is the radiance incident on a point from a given direction? Thus to find the radiance exiting a point $\mathbf{x}$, we need to know all the radiance incident at $\mathbf{x}$ from all directions. But the radiance incident on $\mathbf{x}$ is the same as the radiance leaving from all other surfaces visible to $\mathbf{x}$, in the direction pointing from the other surface towards $\mathbf{x}$.

To render a scene realistically, we must solve the rendering equation, which requires us to know (in theory) not only the radiance arriving at the camera, but also the entire distribution of radiance in the scene in every direction at every point. Clearly, this is too much to ask for a finite, digital computer, since both the set of surface locations and the set of potential incident/exiting directions are infinite. The real art in creating software for digital image synthesis is to allocate the limited processor time and memory most efficiently, to make the best possible approximation.

The simple rendering pipeline we present in Section 10.10 accounts only for direct light. It doesn't account for indirect light that bounced off of one surface and arrived at another. In other words, it only does “one recursion level” in the rendering equation. A huge component of realistic images is accounting for the indirect light—solving the rendering equation more completely. The various methods for accomplishing this are known as global illumination techniques.

This concludes our high-level presentation of how graphics works. Although we admit we have not yet presented a single practical idea, we believe it's very important to understand what you are trying to approximate before you start to approximate it. Even though the compromises we are forced to make for the sake of real-time are quite severe, the available computing power is growing. A video game programmer whose only exposure to graphics has been OpenGL tutorials or demos made by video card manufacturers or books that focused exclusively on real-time rendering will have a much more difficult time understanding even the global illumination techniques of today, much less those of tomorrow.

# 10.2 Viewing in 3D

Before we render a scene, we must pick a camera and a window. That is, we must decide where to render it from (the view position, orientation, and zoom) and where to render it to (the rectangle on the screen). The output window is the simpler of the two, and so we will discuss it first.

Section 10.2.1 describes how to specify the output window. Section 10.2.2 discusses the pixel aspect ratio. Section 10.2.3 introduces the view frustum. Section 10.2.4 describes field of view angles and zoom.

Figure 10.2 Specifying the output window

## 10.2.1 Specifying the Output Window

We don't have to render our image to the entire screen. For example, in split-screen multiplayer games, each player is given a portion of the screen. The output window refers to the portion of the output device where our image will be rendered. This is shown in Figure 10.2.

The position of the window is specified by the coordinates of the upper left-hand pixel $\left({\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{o}\mathrm{s}}_{x},{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{o}\mathrm{s}}_{y}\right)$ . The integers ${\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{x}$ and ${\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{y}$ are the dimensions of the window in pixels. Defining it this way, using the size of the window rather than the coordinates of the lower right-hand corner, avoids some sticky issues caused by integer pixel coordinates. We are also careful to distinguish between the size of the window in pixels, and the physical size of the window. This distinction will become important in Section 10.2.2.

With that said, it is important to realize that we do not necessarily have to be rendering to the screen at all. We could be rendering into a buffer to be saved into a .TGA file or as a frame in an .AVI, or we may be rendering into a texture as a subprocess of the “main” render, to produce a shadow map, or a reflection, or the image on a monitor in the virtual world. For these reasons, the term render target is often used to refer to the current destination of rendering output.

## 10.2.2 Pixel Aspect Ratio

Regardless of whether we are rendering to the screen or an off-screen buffer, we must know the aspect ratio of the pixels, which is the ratio of a pixel's width to its height. This ratio is often 1:1 (that is, we have “square” pixels), but this is not always the case! We give some examples below. It is common for the square-pixel assumption to go unquestioned and to become the source of complicated kludges, applied in the wrong place, to fix up stretched or squashed images.

The formula for computing the aspect ratio is

Computing the pixel aspect ratio
$\begin{array}{}\text{(10.2)}& \frac{{\mathrm{p}\mathrm{i}\mathrm{x}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{x}}{{\mathrm{p}\mathrm{i}\mathrm{x}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{y}}=\frac{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{x}}{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{y}}\cdot \frac{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{R}\mathrm{e}\mathrm{s}}_{y}}{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{R}\mathrm{e}\mathrm{s}}_{x}}.\end{array}$

The notation $\mathrm{pixPhys}$ refers to the physical size of a pixel, and $\mathrm{devPhys}$ refers to the physical width and height of the device on which the image is displayed. For both quantities, the individual measurements may be unknown, but that's OK because the ratio is all we need, and this usually is known. For example, standard desktop monitors come in all different sizes, but the viewable area on many older monitors has a ratio of 4:3, meaning it is 33% wider than it is tall. Another common ratio is 16:9 or wider on high-definition televisions. The integers $\mathrm{devRes}_{x}$ and $\mathrm{devRes}_{y}$ are the number of pixels in the $x$ and $y$ dimensions. For example, a resolution of $1280\times 720$ means that $\mathrm{devRes}_{x}=1280$ and $\mathrm{devRes}_{y}=720$.

But, as mentioned already, we often deal with square pixels with an aspect ratio of 1:1. For example, on a desktop monitor with a physical width:height ratio of 4:3, some common resolutions resulting in square pixel ratios are $640\times 480$, $800\times 600$, $1024\times 768$, and $1600\times 1200$. On 16:9 monitors, common resolutions are $1280\times 720$, $1600\times 900$, and $1920\times 1080$. The aspect ratio 8:5 (more commonly known as 16:10) is also very common for desktop monitors and televisions. Some common display resolutions that are 16:10 are $1152\times 720$, $1280\times 800$, $1440\times 900$, $1680\times 1050$, and $1920\times 1200$. In fact, on the PC, it's common to just assume a 1:1 pixel ratio, since obtaining the dimensions of the display device might be impossible. Console games have it easier in this respect.
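Equation (10.2) is easy to check against these examples. The Python sketch below uses exact fractions so that “square” shows up as exactly 1; the function name is ours, and the $1280\times 1024$ case is an extra example we chose to show a resolution that does not give square pixels on a 4:3 device.

```python
from fractions import Fraction

def pixel_aspect_ratio(dev_phys_x, dev_phys_y, dev_res_x, dev_res_y):
    """Pixel width:height ratio from Equation (10.2).

    dev_phys_x, dev_phys_y: physical proportions of the device (only the ratio matters).
    dev_res_x, dev_res_y:   device resolution in pixels.
    """
    return Fraction(dev_phys_x, dev_phys_y) * Fraction(dev_res_y, dev_res_x)

square_4_3 = pixel_aspect_ratio(4, 3, 1024, 768)      # 4:3 device at 1024x768
square_16_9 = pixel_aspect_ratio(16, 9, 1280, 720)    # 16:9 device at 1280x720
square_16_10 = pixel_aspect_ratio(16, 10, 1152, 720)  # 16:10 device at 1152x720
non_square = pixel_aspect_ratio(4, 3, 1280, 1024)     # 4:3 device at 1280x1024
```

The first three ratios come out to exactly 1, while the last is 16/15: each pixel is slightly wider than it is tall, which is the kind of case that gets overlooked when 1:1 is assumed everywhere.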

Notice that nowhere in these calculations is the size or location of the window used; the location and size of the rendering window have no bearing on the physical proportions of a pixel. However, the size of the window will become important when we discuss field of view in Section 10.2.4, and the position is important when we map from camera space to screen space in Section 10.3.5.

At this point, some readers may be wondering how this discussion makes sense in the context of rendering to a bitmap, where the word “physical” implied by the variable names $\mathrm{p}\mathrm{i}\mathrm{x}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}$ and $\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}$ doesn't apply. In most of these situations, it's appropriate simply to act as if the pixel aspect ratio is 1:1. In some special circumstances, however, you may wish to render anamorphically, producing a squashed image in the bitmap that will later be stretched out when the bitmap is used.

## 10.2.3 The View Frustum

The view frustum is the volume of space that is potentially visible to the camera. It is shaped like a pyramid with the tip snipped off. An example of a view frustum is shown in Figure 10.3.

Figure 10.3 The 3D view frustum

The view frustum is bounded by six planes, known as the clip planes. The first four of the planes form the sides of the pyramid and are called the top, left, bottom, and right planes, for obvious reasons. They correspond to the sides of the output window. The near and far clip planes, which correspond to certain camera-space values of $z$ , require a bit more explanation.

The reason for the far clip plane is perhaps easier to understand. It prevents rendering of objects beyond a certain distance. There are two practical reasons why a far clip plane is needed. The first is relatively easy to understand: a far clip plane can limit the number of objects that need to be rendered in an outdoor environment. The second reason is slightly more complicated, but essentially it has to do with how the depth buffer values are assigned. As an example, if the depth buffer entries are 16-bit fixed point, then the largest depth value that can be stored is 65,535. The far clip establishes what (floating point) $z$ value in camera space will correspond to the maximum value that can be stored in the depth buffer. The motivation for the near clip plane will have to wait until we discuss clip space in Section 10.3.2.

Notice that the clip planes really are planes, with emphasis on the fact that they extend infinitely. The view volume is the intersection of the six half-spaces defined by the clip planes.

## 10.2.4 Field of View and Zoom

A camera has position and orientation, just like any other object in the world. However, it also has an additional property known as field of view. Another term you probably know is zoom. Intuitively, you already know what it means to “zoom in” and “zoom out.” When you zoom in, the object you are looking at appears bigger on screen, and when you zoom out, the apparent size of the object is smaller. Let's see if we can develop this intuition into a more precise definition.

The field of view (FOV) is the angle that is intercepted by the view frustum. We actually need two angles: a horizontal field of view, and a vertical field of view. Let's drop back to 2D briefly and consider just one of these angles. Figure 10.4 shows the view frustum from above, illustrating precisely the angle that the horizontal field of view measures. The labeling of the axes is illustrative of camera space, which is discussed in Section 10.3.

Figure 10.4 Horizontal field of view

Figure 10.5 Geometric interpretation of zoom

Zoom measures the ratio of the apparent size of the object relative to a $90^{\circ}$ field of view. For example, a zoom of 2.0 means that an object will appear twice as big on screen as it would if we were using a $90^{\circ}$ field of view. So larger zoom values cause the image on screen to become larger (“zoom in”), and smaller values for zoom cause the images on screen to become smaller (“zoom out”).

Zoom can be interpreted geometrically as shown in Figure 10.5. Using some basic trig, we can derive the conversion between zoom and field of view:

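Converting between zoom and field of view
$\begin{array}{}\text{(10.3)}& \mathrm{zoom}=\frac{1}{\tan \left(\mathrm{fov}/2\right)},\qquad \mathrm{fov}=2\arctan \left(1/\mathrm{zoom}\right).\end{array}$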

Notice the inverse relationship between zoom and field of view. As zoom gets larger, the field of view gets smaller, causing the view frustum to narrow. It might not seem intuitive at first, but when the view frustum gets more narrow, the perceived size of visible objects increases.
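The conversion is short enough to sketch in Python. We assume the relationship $\mathrm{zoom}=1/\tan(\mathrm{fov}/2)$, which follows from defining a zoom of 1 to be a $90^{\circ}$ field of view; the function names are ours, and angles are in radians.

```python
import math

def fov_to_zoom(fov_rad):
    """Zoom value corresponding to a field of view angle (in radians)."""
    return 1.0 / math.tan(fov_rad / 2.0)

def zoom_to_fov(zoom):
    """Field of view angle (in radians) corresponding to a zoom value."""
    return 2.0 * math.atan(1.0 / zoom)

zoom_at_90 = fov_to_zoom(math.radians(90.0))    # the reference field of view
fov_at_zoom_2 = math.degrees(zoom_to_fov(2.0))  # zooming in narrows the frustum
round_trip = fov_to_zoom(zoom_to_fov(3.5))      # the two functions are inverses
```

A $90^{\circ}$ field of view gives a zoom of 1, and doubling the zoom narrows the field of view to roughly $53^{\circ}$, not $45^{\circ}$: the relationship is inverse but not linear.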

Field of view is a convenient measurement for humans to use, but as we discover in Section 10.3.4, zoom is the measurement that we need to feed into the graphics pipeline.

We need two different field of view angles (or zoom values), one horizontal and one vertical. We are certainly free to choose any two arbitrary values we fancy, but if we do not maintain a proper relationship between these values, then the rendered image will appear stretched. If you've ever watched a movie intended for the wide screen that was simply squashed anamorphically to fit on a regular TV, or watched content with a 4:3 aspect on a 16:9 TV in “full” mode, then you have seen this effect.

In order to maintain proper proportions, the zoom values must be inversely proportional to the physical dimensions of the output window:

The usual relationship between vertical and horizontal zoom
$\begin{array}{}\text{(10.4)}& \frac{{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}}{{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}}=\frac{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{x}}{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{y}}=\text{window aspect ratio}.\end{array}$

The variable $\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}$ refers to the physical size of the output window. As indicated in Equation (10.4), even though we don't usually know the actual size of the render window, we can determine its aspect ratio. But how do we do this? Usually, all we know is the resolution (number of pixels) of the output window. Here's where the pixel aspect ratio calculations from Section 10.2.2 come in:

$\begin{array}{}\text{(10.5)}& \begin{array}{rl}\frac{{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}}{{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}}=\frac{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{x}}{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{y}}& =\frac{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{x}}{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{y}}\cdot \frac{{\mathrm{p}\mathrm{i}\mathrm{x}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{x}}{{\mathrm{p}\mathrm{i}\mathrm{x}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{y}}\\ & =\frac{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{x}}{{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{y}}\cdot \frac{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{x}}{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}}_{y}}\cdot \frac{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{R}\mathrm{e}\mathrm{s}}_{y}}{{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{R}\mathrm{e}\mathrm{s}}_{x}}.\end{array}\end{array}$

In this formula,

• $\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}$ refers to the camera's zoom values,
• $\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}$ refers to the physical window size,
• $\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}$ refers to the resolution of the window, in pixels,
• $\mathrm{p}\mathrm{i}\mathrm{x}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}$ refers to the physical dimensions of a pixel,
• $\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{P}\mathrm{h}\mathrm{y}\mathrm{s}$ refers to the physical dimensions of the output device. Remember that we usually don't know the individual sizes, but we do know the ratio,
• $\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{R}\mathrm{e}\mathrm{s}$ refers to the resolution of the output device.

Many rendering packages allow you to specify only one field of view angle (or zoom value). When you do this, they automatically compute the other value for you, assuming you want uniform display proportions. For example, you may specify the horizontal field of view, and they compute the vertical field of view for you.
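Here is a rough Python sketch of what such a package might do, based on Equation (10.5). The function name and the `pixel_aspect` parameter (pixel width divided by pixel height, defaulting to square pixels) are our own assumptions for illustration, as is the $\mathrm{fov}=2\arctan(1/\mathrm{zoom})$ conversion used to report the vertical angle.

```python
import math

def vertical_zoom(zoom_x, win_res_x, win_res_y, pixel_aspect=1.0):
    """Vertical zoom that keeps proportions correct, per Equation (10.5).

    pixel_aspect is pixPhys_x / pixPhys_y; 1.0 means square pixels.
    """
    window_aspect = (win_res_x / win_res_y) * pixel_aspect
    return zoom_x * window_aspect

# A 90-degree horizontal field of view (zoom_x = 1) in a 1280x720 window.
zoom_x = 1.0
zoom_y = vertical_zoom(zoom_x, 1280, 720)

# Convert back to an angle to see the vertical field of view in degrees.
fov_y_degrees = math.degrees(2.0 * math.atan(1.0 / zoom_y))
```

Because the window is wider than it is tall, the vertical zoom is larger than the horizontal zoom, and the vertical field of view comes out narrower than the horizontal one (a little under $59^{\circ}$ here).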

Now that we know how to describe zoom in a manner suitable for consumption by a computer, what do we do with these zoom values? They go into the clip matrix, which is described in Section 10.3.4.

## 10.2.5 Orthographic Projection

The discussion so far has centered on perspective projection, which is the most commonly used type of projection, since that's how our eyes perceive the world. However, in many situations orthographic projection is also useful. We introduced orthographic projection in Section 5.3; to briefly review, in orthographic projection, the lines of projection (the lines that connect all the points in space that project onto the same screen coordinates) are parallel, rather than intersecting at a single point. There is no perspective foreshortening in orthographic projection; an object will appear the same size on the screen no matter how far away it is, and moving the camera forward or backward along the viewing direction has no apparent effect so long as the objects remain in front of the near clip plane.

Figure 10.6 shows a scene rendered from the same position and orientation, comparing perspective and orthographic projection. On the left, notice that with perspective projection, parallel lines do not remain parallel, and the closer grid squares are larger than the ones in the distance. Under orthographic projection, the grid squares are all the same size and the grid lines remain parallel.

Figure 10.6 Perspective versus orthographic projection (left: perspective projection; right: orthographic projection)

Orthographic views are very useful for “schematic” views and other situations where distances and angles need to be measured precisely. Every modeling tool will support such a view. In a video game, you might use an orthographic view to render a map or some other HUD element.

For an orthographic projection, it makes no sense to speak of the “field of view” as an angle, since the view frustum is shaped like a box, not a pyramid. Rather than defining the $x$ and $y$ dimensions of the view frustum in terms of two angles, we give two sizes: the physical width and height of the box.

The zoom value has a different meaning in orthographic projection compared to perspective. It is related to the physical size of the frustum box:

Converting between zoom and frustum size in orthographic projection
$\begin{array}{rlr}\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}=2/\mathrm{s}\mathrm{i}\mathrm{z}\mathrm{e},& & \mathrm{s}\mathrm{i}\mathrm{z}\mathrm{e}=2/\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}.\end{array}$

As with perspective projections, there are two different zoom values, one for $x$ and one for $y$ , and their ratio must be coordinated with the aspect ratio of the rendering window in order to avoid producing a “squashed” image. We developed Equation (10.5) with perspective projection in mind, but this formula also governs the proper relationship for orthographic projection.
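As a quick worked example (the numbers are chosen purely for illustration), suppose the orthographic view box is 20 units wide and 15 units tall. Then

$\mathrm{zoom}_x = 2/20 = 0.1, \qquad \mathrm{zoom}_y = 2/15 \approx 0.133, \qquad \mathrm{zoom}_y/\mathrm{zoom}_x = 20/15 = 4/3.$

Scaling by these values maps the right edge of the box, $x = 10$, to $0.1 \cdot 10 = 1$, and the top edge, $y = 7.5$, to $(2/15) \cdot 7.5 = 1$. The ratio $4/3$ is exactly what Equation (10.5) requires for a window whose physical proportions are 4:3.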

# 10.3 Coordinate Spaces

This section reviews several important coordinate spaces related to 3D viewing. Unfortunately, terminology is not consistent in the literature on the subject, even though the concepts are. Here, we discuss the coordinate spaces in the order they are encountered as geometry flows through the graphics pipeline.

## 10.3.1 Model, World, and Camera Space

The geometry of an object is initially described in object space, which is a coordinate space local to the object being described (see Section 3.2.2). The information described usually consists of vertex positions and surface normals. Object space is also known as local space and, especially in the context of graphics, model space.

From model space, the vertices are transformed into world space (see Section 3.2.1). The transformation from model space to world space is often called the model transform. Typically, lighting for the scene is specified in world space, although, as we see in Section 10.11, it doesn't really matter what coordinate space is used to perform the lighting calculations provided that the geometry and the lights can be expressed in the same space.

From world space, vertices are transformed by the view transform into camera space (see Section 3.2.3), also known as eye space and view space (not to be confused with canonical view volume space, discussed later). Camera space is a 3D coordinate space in which the origin is at the center of projection, one axis is parallel to the direction the camera is facing (perpendicular to the projection plane), one axis is the intersection of the top and bottom clip planes, and the other axis is the intersection of the left and right clip planes. If we assume the perspective of the camera, then one axis will be “horizontal” and one will be “vertical.”

In a left-handed world, the most common convention is to point $+z$ in the direction that the camera is facing, with $+x$ and $+y$ pointing “right” and “up” (again, from the perspective of the camera). This is fairly intuitive, as shown in Figure 10.7. The typical right-handed convention is to have $-z$ point in the direction that the camera is facing. We assume the left-handed conventions for the remainder of this chapter.

Figure 10.7 Typical camera-space conventions for left-handed coordinate systems

## 10.3.2 Clip Space and the Clip Matrix

From camera space, vertices are transformed once again into clip space, also known as the canonical view volume space. The matrix that transforms vertices from camera space into clip space is called the clip matrix, also known as the projection matrix.

Up until now, our vertex positions have been “pure” 3D vectors; that is, they had only three coordinates, or if they had a fourth coordinate, then $w$ was always equal to 1 for position vectors and 0 for direction vectors such as surface normals. (In some special situations, we might use more exotic transforms, but most basic transforms are 3D affine transformations.) The clip matrix, however, puts meaningful information into $w$. The clip matrix serves two primary functions:

• Prepare for projection. We put the proper value into $w$ so that the homogeneous division produces the desired projection. For the typical perspective projection, this means we copy $z$ into $w$ . We talk about this in Section 10.3.3.
• Apply zoom and prepare for clipping. We scale $x$ , $y$ , and $z$ so that they can be compared against $w$ for clipping. This scaling takes the camera's zoom values into consideration, since those zoom values affect the shape of the view frustum against which clipping occurs. This is discussed in Section 10.3.4.

## 10.3.3 The Clip Matrix: Preparing for Projection

Recall from Section 6.4.1 that a 4D homogeneous vector is mapped to the corresponding physical 3D vector by dividing by $w$ :

Converting 4D homogeneous coordinates to 3D
$\left[\begin{array}{c}x\\ y\\ z\\ w\end{array}\right]⟹\left[\begin{array}{c}x/w\\ y/w\\ z/w\end{array}\right].$

The first goal of the clip matrix is to get the correct value into $w$ such that this division causes the desired projection (perspective or orthographic). That's the reason this matrix is sometimes called the projection matrix, although this term is a bit misleading—the projection doesn't take place during the multiplication by this matrix, it happens when we divide $x$ , $y$ , and $z$ by $w$ .

If placing the correct value into $w$ were the only purpose of the clip matrix, the clip matrix for perspective projection would simply be

A trivial matrix for setting $\mathbit{w}\mathbf{=}\mathbit{z}$ , for perspective projection
$\left[\begin{array}{cccc}1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 1\\ 0& 0& 0& 0\end{array}\right].$

Multiplying a vector of the form $\left[x,y,z,1\right]$ by this matrix, and then performing the homogeneous division by $w$ , we get

$\left[\begin{array}{cccc}x& y& z& 1\end{array}\right]\left[\begin{array}{cccc}1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 1\\ 0& 0& 0& 0\end{array}\right]=\left[\begin{array}{cccc}x& y& z& z\end{array}\right]\phantom{\rule{1em}{0ex}}⟹\phantom{\rule{1em}{0ex}}\left[\begin{array}{ccc}x/z& y/z& 1\end{array}\right].$
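The multiplication and division above can be sketched in a few lines of C++. The `Vec3`, `Vec4`, and `Mat4` types and the function names here are minimal stand-ins invented for this example (they are not the vector classes used elsewhere in this book), and the sketch follows our row-vector convention.

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };
struct Vec4 { double x, y, z, w; };
struct Mat4 { double m[4][4]; };

// Row vector times matrix: out[col] = sum over rows of v[row] * M[row][col].
Vec4 mul(const Vec4& v, const Mat4& M) {
    const double in[4] = {v.x, v.y, v.z, v.w};
    double out[4] = {0.0, 0.0, 0.0, 0.0};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[col] += in[row] * M.m[row][col];
    return {out[0], out[1], out[2], out[3]};
}

// Homogeneous division: this is where the projection actually happens.
Vec3 homogeneousDivide(const Vec4& v) {
    assert(v.w != 0.0);  // w == 0 has no corresponding 3D point
    return {v.x / v.w, v.y / v.w, v.z / v.w};
}

// The trivial matrix shown above: its right-hand column copies z into w.
Mat4 trivialPerspectiveMatrix() {
    return {{{1, 0, 0, 0},
             {0, 1, 0, 0},
             {0, 0, 1, 1},
             {0, 0, 0, 0}}};
}
```

For example, the point $[2, 4, 8, 1]$ becomes $[2, 4, 8, 8]$ after the multiplication, and $[0.25, 0.5, 1]$ after the division.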

At this point, many readers might very reasonably ask two questions. The first question might be, “Why is this so complicated? This seems like a lot of work to accomplish what basically amounts to just dividing by $z$.” You're right. In many old-school software rasterizers, where the projection math was hand-coded, $w$ didn't appear anywhere, and there was just an explicit divide by $z$. So why do we tolerate all this complication?

One reason for homogeneous coordinates is that they can represent a wider range of camera specifications naturally. At the end of this section we'll see how orthographic projections can be handled easily, without the “if statement” that was necessary in the old hand-coded systems. But there are other types of projections that are also useful and are handled naturally in this framework. For example, the frustum planes do not need to be symmetric about the viewing direction, which corresponds to the situation where your view direction does not look through the center of the window. This is useful, for example, when rendering a very high resolution image in smaller blocks, or for seamless dynamic splitting and merging of split screen views.

Another advantage of using homogeneous coordinates is that they make $z$-clipping (against the near and far clipping planes) identical to $x$- and $y$-clipping. This similarity makes things nice and tidy, but, more important, on some hardware the vector unit can be exploited to perform clipping comparison tests in parallel. In general, the use of homogeneous coordinates and $4×4$ matrices makes things more compact and general purpose, and (in some people's minds) more elegant. But regardless of whether the use of $4×4$ matrices improves the process, it's the way most APIs want things delivered, so that's the way it works, for better or worse.

The second question a reader might have is, “What happened to $d$ ?” Remember that $d$ is the focal distance, the distance from the projection plane to the center of projection (the “focal point”). Our discussion of perspective projection via homogeneous division in Section 6.5 described how to project onto a plane perpendicular to the $z$ -axis and $d$ units away from the origin. (The plane is of the form $z=d$ .) But we didn't use $d$ anywhere in the above discussion. As it turns out, the value we use for $d$ isn't important, and so we choose the most convenient value possible for $d$ , which is 1.

To understand why $d$ doesn't matter, let's compare the projection that occurs in a computer to the projection that occurs in a physical camera. Inside a real camera, increasing this distance causes the camera to zoom in (objects appear bigger), and decreasing it zooms out (objects appear smaller). This is shown in Figure 10.8.

Figure 10.8 In a physical camera, increasing the focal distance $d$ while keeping the size of the “film” the same has the effect of zooming in.

The vertical line on the left side of each diagram represents the film (or, for modern cameras, the sensing element), which lies in the infinite plane of projection. Importantly, notice that the film is the same height in each diagram. As we increase $d$, the film moves further away from the focal point, and the field of view angle intercepted by the view frustum decreases. As the view frustum gets smaller, an object inside this frustum takes a larger proportion of the visible volume, and thus appears larger in the projected image. The perceived result is that we are zooming in. The key point here is that changing the focal length causes an object to appear bigger because the projected image is larger relative to the size of the film.

Now let's look at what happens inside a computer. The “film” inside a computer is the rectangular portion of the projection plane that intersects the view frustum.9 Notice that if we increase the focal distance, the size of the projected image increases, just like it did in a real camera. However, inside a computer, the film actually increases by this same proportion, rather than the view frustum changing in size. Because the projected image and the film increase by the same proportion, there is no change to the rendered image or the apparent size of objects within this image.

In summary, zoom is always accomplished by changing the shape of the view frustum, whether we're talking about a real camera or inside a computer. In a real camera, changing the focal length changes the shape of the view frustum because the film stays the same size. However, in a computer, adjusting the focal distance $d$ does not affect the rendered image, since the “film” increases in size and the shape of the view frustum does not change.
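A short worked derivation may make this cancellation concrete. The symbols $h$ (the half-height of the film) and $\theta$ (the vertical field of view angle) are notation introduced only for this sketch. Projecting a camera-space point at height $y$ and depth $z$ onto the plane $z = d$ gives a projected height of $y_p = dy/z$. Inside a computer, the film is whatever portion of that plane the view frustum cuts out, so $h = d\tan(\theta/2)$, and the fraction of the film that the projected point spans is

$\frac{y_p}{h} = \frac{dy/z}{d\tan(\theta/2)} = \frac{y}{z\tan(\theta/2)}.$

The $d$ cancels, so the rendered image depends only on the shape of the frustum. In a physical camera, by contrast, $h$ is fixed, and the same fraction is $y_p/h = dy/(zh)$, which grows in proportion to $d$.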

Some software allows the user to specify the field of view by giving a focal length measured in millimeters. These numbers are in reference to some standard film size, almost always 35 mm film.

What about orthographic projection? In this case, we do not want to divide by $z$, so our clip matrix will have a right-hand column of $\left[0,0,0,1{\right]}^{\mathrm{T}}$, the same as the identity matrix. When multiplied by a vector of the form $\left[x,y,z,1\right]$, this will result in a vector with $w=1$, rather than $w=z$. The homogeneous division still occurs, but this time we are dividing by 1:

$\left[\begin{array}{cccc}x& y& z& 1\end{array}\right]\left[\begin{array}{cccc}1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 0\\ 0& 0& 0& 1\end{array}\right]=\left[\begin{array}{cccc}x& y& z& 1\end{array}\right]\phantom{\rule{1em}{0ex}}⟹\phantom{\rule{1em}{0ex}}\left[\begin{array}{ccc}x& y& z\end{array}\right].$

The next section fills in the rest of the clip matrix. But for now, the key point is that a perspective projection matrix will always have a right-hand column of $\left[0,0,1,0{\right]}^{\mathrm{T}}$, and an orthographic projection matrix will always have a right-hand column of $\left[0,0,0,1{\right]}^{\mathrm{T}}$. Here, the word “always” means “we've never seen anything else.” You might come across some obscure case on some particular hardware for which other values are needed, and it is important to understand that 1 isn't a magic number here, it is just the simplest number. Since the homogeneous conversion is a division, what is important is the ratio of the coordinates, not their magnitude.

Notice that multiplying the entire matrix by a constant factor doesn't have any effect on the projected values $x/w$, $y/w$, and $z/w$, but it will adjust the value of $w$, which is used for perspective-correct rasterization. So a different value might be necessary for some reason. Then again, certain hardware (such as the Wii) assumes that these are the only two cases, and no other right-hand column is allowed.
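To check the claim about constant factors, let $k$ be an arbitrary nonzero constant (a symbol introduced only for this worked example), let $\mathbf{M}$ be any clip matrix, and write $\left[x, y, z, 1\right]\mathbf{M} = \left[x', y', z', w'\right]$. Then

$\left[x, y, z, 1\right](k\mathbf{M}) = \left[kx', ky', kz', kw'\right] \Longrightarrow \left[\frac{kx'}{kw'}, \frac{ky'}{kw'}, \frac{kz'}{kw'}\right] = \left[\frac{x'}{w'}, \frac{y'}{w'}, \frac{z'}{w'}\right].$

The projected point is unchanged, but the value left in $w$ is now $kw'$ rather than $w'$.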

## 10.3.4 The Clip Matrix: Applying Zoom and Preparing for Clipping

The second goal of the clip matrix is to scale the $x$ , $y$ , and $z$ components such that the six clip planes have a trivial form. Points are outside the view frustum if they satisfy at least one of the inequalities:

The six planes of the view frustum in clip space
$\begin{array}{ll}\mathrm{B}\mathrm{o}\mathrm{t}\mathrm{t}\mathrm{o}\mathrm{m}& y<-w,\\ \mathrm{T}\mathrm{o}\mathrm{p}& y>w,\\ \mathrm{L}\mathrm{e}\mathrm{f}\mathrm{t}& x<-w,\\ \mathrm{R}\mathrm{i}\mathrm{g}\mathrm{h}\mathrm{t}& x>w,\\ \mathrm{N}\mathrm{e}\mathrm{a}\mathrm{r}& z<-w,\\ \mathrm{F}\mathrm{a}\mathrm{r}& z>w.\end{array}$

So the points inside the view volume satisfy

$\begin{array}{cccc}-w& \le & x& \le w,\\ -w& \le & y& \le w,\\ -w& \le & z& \le w.\end{array}$

Any geometry that does not satisfy these inequalities must be clipped to the view frustum. Clipping is discussed in Section 10.10.4.
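The six inequalities translate almost directly into code. The sketch below computes one bit per clip plane for a point in clip space. The bit names and function names are hypothetical, and the bitmask bookkeeping (often called an outcode) is a common technique that we assume here for illustration, not something this section prescribes.

```cpp
#include <cassert>

struct Vec4 { double x, y, z, w; };

// One bit per frustum plane (illustrative names).
constexpr unsigned kOutLeft   = 1u << 0;
constexpr unsigned kOutRight  = 1u << 1;
constexpr unsigned kOutBottom = 1u << 2;
constexpr unsigned kOutTop    = 1u << 3;
constexpr unsigned kOutNear   = 1u << 4;
constexpr unsigned kOutFar    = 1u << 5;

// Returns a bitmask of the planes that p lies outside of.
// A result of 0 means the point is inside the view frustum.
unsigned computeOutcode(const Vec4& p) {
    unsigned code = 0;
    if (p.x < -p.w) code |= kOutLeft;
    if (p.x >  p.w) code |= kOutRight;
    if (p.y < -p.w) code |= kOutBottom;
    if (p.y >  p.w) code |= kOutTop;
    if (p.z < -p.w) code |= kOutNear;
    if (p.z >  p.w) code |= kOutFar;
    return code;
}

// If all three vertices of a triangle are outside the same plane,
// the triangle is entirely off screen and needs no clipping at all.
bool triviallyRejected(unsigned a, unsigned b, unsigned c) {
    return (a & b & c) != 0;
}
```

Because each test compares a coordinate against $\pm w$, the same six comparisons work regardless of the zoom values or the near and far distances, which is precisely the point of transforming into clip space.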

To stretch things to put the top, left, right, and bottom clip planes in place, we scale the $x$ and $y$ values by the zoom values of the camera. We discussed how to compute these values in Section 10.2.4. For the near and far clip planes, the $z$ -coordinate is biased and scaled such that at the near clip plane, $z/w=-1$ , and at the far clip plane, $z/w=1$ .

Let ${\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}$ and ${\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}$ be the horizontal and vertical zoom values, and let $n$ and $f$ be the distances to the near and far clipping planes. Then the matrix that scales $x$ , $y$ , and $z$ appropriately, while simultaneously outputting the $z$ -coordinate into $w$ , is

Clip matrix for perspective projection with $\mathbit{z}\mathbf{=}\mathbf{-}\mathbit{w}$ at the near clip plane
$\begin{array}{}\text{(10.6)}& \left[\begin{array}{cccc}{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}& 0& 0& 0\\ 0& {\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}& 0& 0\\ 0& 0& \frac{f+n}{f-n}& 1\\ 0& 0& \frac{-2nf}{f-n}& 0\end{array}\right].\end{array}$

This clip matrix assumes a coordinate system with $z$ pointing into the screen (the usual left-handed convention), row vectors on the left, and $z$ values in the range $\left[-w,w\right]$ from the near to far clip plane. This last detail is yet another place where conventions can vary. Other APIs (notably DirectX) want the projection matrix such that $z$ is in the range $\left[0,w\right]$. In other words, a point in clip space is outside the near or far clip plane if

Near and far clip planes in DirectX-style clip space
$\begin{array}{ll}\mathrm{n}\mathrm{e}\mathrm{a}\mathrm{r}& z<0,\\ \mathrm{f}\mathrm{a}\mathrm{r}& z>w.\end{array}$

Under these DirectX-style conventions, the points inside the view frustum satisfy the inequality $0\le z\le w$ . A slightly different clip matrix is used in this case:

Clip matrix for perspective projection with $\mathbit{z}\mathbf{=}\mathbf{0}$ at the near clip plane
$\begin{array}{}\text{(10.7)}& \left[\begin{array}{cccc}{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}& 0& 0& 0\\ 0& {\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}& 0& 0\\ 0& 0& \frac{f}{f-n}& 1\\ 0& 0& \frac{-nf}{f-n}& 0\end{array}\right].\end{array}$

We can easily tell that the two matrices in Equations (10.6) and (10.7) are perspective projection matrices because the right-hand column is $\left[0,0,1,0{\right]}^{\mathrm{T}}$ . (OK, the caption in the margin is a bit of a hint, too.)
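As a sanity check on Equations (10.6) and (10.7), the following sketch builds either matrix and lets us confirm where the near and far planes land. The type and function names (`Mat4`, `ClipZRange`, `perspectiveClipMatrix`, and so on) are hypothetical names chosen for this example, and the sketch assumes the left-handed, row-vector conventions used so far.

```cpp
#include <cassert>

struct Vec4 { double x, y, z, w; };
struct Mat4 { double m[4][4]; };

// Row vector times matrix.
Vec4 mul(const Vec4& v, const Mat4& M) {
    const double in[4] = {v.x, v.y, v.z, v.w};
    double out[4] = {0.0, 0.0, 0.0, 0.0};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[col] += in[row] * M.m[row][col];
    return {out[0], out[1], out[2], out[3]};
}

enum class ClipZRange {
    MinusWToW,  // Equation (10.6): z = -w at the near plane
    ZeroToW     // Equation (10.7): z = 0 at the near plane
};

Mat4 perspectiveClipMatrix(double zoomX, double zoomY,
                           double n, double f, ClipZRange range) {
    assert(n > 0.0 && f > n);
    const bool symmetric = (range == ClipZRange::MinusWToW);
    const double scale = symmetric ? (f + n) / (f - n) : f / (f - n);
    const double bias  = symmetric ? -2.0 * n * f / (f - n)
                                   : -n * f / (f - n);
    return {{{zoomX, 0, 0, 0},
             {0, zoomY, 0, 0},
             {0, 0, scale, 1},    // right-hand column copies z into w
             {0, 0, bias, 0}}};
}
```

With $n = 1$ and $f = 3$, a point on the near plane, $[0, 0, 1, 1]$, comes out with $z/w = -1$ under the first convention and $z/w = 0$ under the second, while a point on the far plane, $[0, 0, 3, 1]$, comes out with $z/w = 1$ in both.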

What about orthographic projection? The first and second columns of the projection matrix don't change, and we know the fourth column will become $\left[0,0,0,1{\right]}^{\mathrm{T}}$. The third column, which controls the output $z$ value, must change. We start by assuming the first set of conventions for $z$; that is, the output $z$ value will be scaled such that $z/w$ takes on the values $-1$ and $+1$ at the near and far clip planes, respectively. The matrix that does this is

Clip matrix for orthographic projection with $\mathbit{z}\mathbf{=}\mathbf{-}\mathbit{w}$ at the near clip plane
$\left[\begin{array}{cccc}{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}& 0& 0& 0\\ 0& {\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}& 0& 0\\ 0& 0& \frac{2}{f-n}& 0\\ 0& 0& -\frac{f+n}{f-n}& 1\end{array}\right].$

Alternatively, if we are using a DirectX-style range for the clip space $z$ values, then the matrix we use is

Clip matrix for orthographic projection with $\mathbit{z}\mathbf{=}\mathbf{0}$ at the near clip plane
$\left[\begin{array}{cccc}{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}& 0& 0& 0\\ 0& {\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}& 0& 0\\ 0& 0& \frac{1}{f-n}& 0\\ 0& 0& \frac{n}{n-f}& 1\end{array}\right].$
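The two orthographic matrices can be checked the same way. In this sketch the zoom values are derived from the size of the view box using $\mathrm{zoom} = 2/\mathrm{size}$ from Section 10.2.5. The function name `orthographicClipMatrix` and its `zeroToOne` flag are hypothetical names for this example only.

```cpp
#include <cassert>

struct Vec4 { double x, y, z, w; };
struct Mat4 { double m[4][4]; };

// Row vector times matrix.
Vec4 mul(const Vec4& v, const Mat4& M) {
    const double in[4] = {v.x, v.y, v.z, v.w};
    double out[4] = {0.0, 0.0, 0.0, 0.0};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[col] += in[row] * M.m[row][col];
    return {out[0], out[1], out[2], out[3]};
}

// width and height are the physical dimensions of the view box.
// zeroToOne selects the DirectX-style range for clip-space z.
Mat4 orthographicClipMatrix(double width, double height,
                            double n, double f, bool zeroToOne) {
    assert(width > 0.0 && height > 0.0 && f > n);
    const double zoomX = 2.0 / width;
    const double zoomY = 2.0 / height;
    const double scale = zeroToOne ? 1.0 / (f - n) : 2.0 / (f - n);
    const double bias  = zeroToOne ? n / (n - f) : -(f + n) / (f - n);
    return {{{zoomX, 0, 0, 0},
             {0, zoomY, 0, 0},
             {0, 0, scale, 0},
             {0, 0, bias, 1}}};   // right-hand column leaves w = 1
}
```

For a box 20 units wide and 10 units tall with $n = 1$ and $f = 5$, the corner point $[10, -5, 1, 1]$ maps to approximately $[1, -1, -1, 1]$ under the first convention, so it sits on the right, bottom, and near planes simultaneously.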

In this book, we prefer a left-handed convention and row vectors on the left, and all the projection matrices so far assume those conventions. However, both of these choices differ from the OpenGL convention, and we know that many readers may be working in environments that are similar to OpenGL. Since this can be very confusing, let's repeat these matrices, but with the right-handed, column-vector OpenGL conventions. We'll only discuss the $\left[-1,+1\right]$ range for clip-space $z$ values, because that's what OpenGL uses.

It's instructive to consider how to convert these matrices from one set of conventions to the other. Because OpenGL uses column vectors, the first thing we need to do is transpose our matrix. Second, the right-handed conventions have $-z$ pointing into the screen in camera space (“eye space” in the OpenGL vocabulary), but the clip-space $+z$ axis points into the screen just like the left-handed conventions assumed earlier. (In OpenGL, clip space is actually a left-handed coordinate space!) This means we need to negate our incoming $z$ values, or alternatively, negate the third column (after we've transposed the matrix), which is the column that is multiplied by $z$ .

The above procedure results in the following perspective projection matrix

Clip matrix for perspective projection assuming OpenGL conventions
$\begin{array}{}\text{(10.3.4)}& \left[\begin{array}{cccc}{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}& 0& 0& 0\\ 0& {\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}& 0& 0\\ 0& 0& -\frac{f+n}{f-n}& \frac{-2nf}{f-n}\\ 0& 0& -1& 0\end{array}\right],\end{array}$

and the orthographic projection matrix is

Clip matrix for orthographic projection assuming OpenGL conventions
$\left[\begin{array}{cccc}{\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{x}& 0& 0& 0\\ 0& {\mathrm{z}\mathrm{o}\mathrm{o}\mathrm{m}}_{y}& 0& 0\\ 0& 0& \frac{-2}{f-n}& -\frac{f+n}{f-n}\\ 0& 0& 0& 1\end{array}\right].$

So, for OpenGL conventions, you can tell whether a projection matrix is perspective or orthographic based on the bottom row. It will be $\left[0,0,-1,0\right]$ for perspective, and $\left[0,0,0,1\right]$ for orthographic.
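The conversion procedure described above (transpose, then negate the third column) is mechanical enough to write down. The sketch below applies it to the matrix of Equation (10.6). All of the names are hypothetical, and no actual OpenGL calls are involved; the code only manipulates a plain array of numbers.

```cpp
#include <cassert>

struct Vec4 { double x, y, z, w; };
struct Mat4 { double m[4][4]; };

// Equation (10.6): left-handed, row-vector perspective clip matrix.
Mat4 perspectiveRowVectorLH(double zoomX, double zoomY, double n, double f) {
    assert(n > 0.0 && f > n);
    return {{{zoomX, 0, 0, 0},
             {0, zoomY, 0, 0},
             {0, 0, (f + n) / (f - n), 1},
             {0, 0, -2.0 * n * f / (f - n), 0}}};
}

// Step 1: transpose (row vectors become column vectors).
// Step 2: negate the third column, which multiplies the incoming z,
//         because the camera looks down -z in right-handed eye space.
Mat4 toOpenGLConventions(const Mat4& rowLH) {
    Mat4 out;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out.m[r][c] = rowLH.m[c][r];
    for (int r = 0; r < 4; ++r)
        out.m[r][2] = -out.m[r][2];
    return out;
}

// Matrix times column vector.
Vec4 mulColumn(const Mat4& M, const Vec4& v) {
    const double in[4] = {v.x, v.y, v.z, v.w};
    double out[4] = {0.0, 0.0, 0.0, 0.0};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += M.m[r][c] * in[c];
    return {out[0], out[1], out[2], out[3]};
}
```

With $n = 1$ and $f = 3$, the converted matrix has a bottom row of $[0, 0, -1, 0]$, and the eye-space points $[0, 0, -1, 1]^{\mathrm{T}}$ and $[0, 0, -3, 1]^{\mathrm{T}}$ on the near and far planes come out with $z/w = -1$ and $z/w = +1$, respectively.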

Now that we know a bit about clip space, we can understand the need for the near clip plane. Obviously, there is a singularity precisely at the origin, where a perspective projection is not defined. (This corresponds to a perspective division by zero.) In practice, this singularity would be extremely rare, and however we wanted to handle it—say, by arbitrarily projecting the point to the center of the screen—would be OK, since putting the camera directly in a polygon isn't often needed in practice.

But projecting polygons onto pixels isn't the only issue. Allowing for arbitrarily small (but positive) values of $z$ will result in arbitrarily large values for $1/w$. Depending on the hardware, this can cause problems with perspective-correct rasterization. Another potential problem area is depth buffering. Suffice it to say that for practical reasons it is often necessary to restrict the range of the $z$ values so that there is a known minimum value, and we must accept the rather unpleasant necessity of a near clip plane. We say “unpleasant” because the near clip plane is an artifact of implementation, not an inherent part of a 3D world. (Raytracers don't necessarily have this issue.) It cuts off objects when you get too close to them, when in reality you should be able to get arbitrarily close. Many readers are probably familiar with the phenomenon where a camera is placed in the middle of a very large ground polygon, just a small distance above it, and a gap opens up at the bottom of the screen, allowing the camera to see through the ground. A similar situation exists if you get very close to practically any object other than a wall. A hole will appear in the middle of the object, and this hole will expand as you move closer.

## 10.3.5 Screen Space

Once we have clipped the geometry to the view frustum, it is projected into screen space, which corresponds to actual pixels in the frame buffer. Remember that we are rendering into an output window that does not necessarily occupy the entire display device. However, we usually want our screen-space coordinates to be specified using coordinates that are absolute to the rendering device (Figure 10.9).

Figure 10.9 The output window in screen space

Screen space is a 2D space, of course. Thus, we must project the points from clip space to screen space to generate the correct 2D coordinates. The first thing that happens is the standard homogeneous division by $w$ . (OpenGL calls the result of this division the normalized device coordinates.) Then, the $x$ - and $y$ -coordinates must be scaled to map into the output window. This is summarized by

Projecting and mapping to screen space
$\begin{array}{lrl}\text{(10.8)}& {\mathrm{s}\mathrm{c}\mathrm{r}\mathrm{e}\mathrm{e}\mathrm{n}}_{x}& =\frac{{\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{x}\cdot {\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{x}}{2\cdot {\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}}+{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{C}\mathrm{e}\mathrm{n}\mathrm{t}\mathrm{e}\mathrm{r}}_{x},\\ \text{(10.9)}& {\mathrm{s}\mathrm{c}\mathrm{r}\mathrm{e}\mathrm{e}\mathrm{n}}_{y}& =-\frac{{\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{y}\cdot {\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{R}\mathrm{e}\mathrm{s}}_{y}}{2\cdot {\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}}+{\mathrm{w}\mathrm{i}\mathrm{n}\mathrm{C}\mathrm{e}\mathrm{n}\mathrm{t}\mathrm{e}\mathrm{r}}_{y}.\end{array}$

A quick comment is warranted about the negation of the $y$ component in the math above. This reflects DirectX-style coordinate conventions where (0,0) is in the upper-left corner. Under these conventions, $+y$ points up in clip space, but down in screen space. In fact, if we continue to think about $+z$ pointing into the screen, then screen space actually becomes a right-handed coordinate space, even though it's left-handed everywhere else in DirectX. In OpenGL, the origin is in the lower-left corner, and the negation of the $y$-coordinate does not occur. (As already discussed, in OpenGL, they choose a different place to introduce confusion, by flipping the $z$-axis between eye space, where $-z$ points into the screen, and clip space, where $+z$ points into the screen.)
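Equations (10.8) and (10.9) can be sketched in code as follows, assuming the DirectX-style convention with the screen-space origin in the upper-left corner. The `OutputWindow` struct and the `clipToScreen` function are hypothetical names used only for this example.

```cpp
#include <cassert>

struct Vec2 { double x, y; };
struct Vec4 { double x, y, z, w; };

// Hypothetical description of the output window, in device pixels.
struct OutputWindow {
    double resX, resY;        // winRes: size of the window
    double centerX, centerY;  // winCenter: center of the window
};

// Homogeneous division followed by the mapping into the window.
Vec2 clipToScreen(const Vec4& clip, const OutputWindow& win) {
    assert(clip.w != 0.0);
    Vec2 screen;
    screen.x =  clip.x * win.resX / (2.0 * clip.w) + win.centerX;
    screen.y = -clip.y * win.resY / (2.0 * clip.w) + win.centerY;  // +y is down
    return screen;
}
```

For an 800×600 window whose upper-left corner is at the device origin (so its center is at (400, 300)), the clip-space point with $x = y = w$ maps to the upper-right corner (800, 0), and the point with $x = y = -w$ maps to the lower-left corner (0, 600).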

Speaking of $z$ , what happens to ${\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{z}$ ? In general it's used in some way for depth buffering. A traditional method is to take the normalized depth value ${\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{z}/{\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}$ and store this value in the depth buffer. The precise details depend on exactly what sort of clip values are used for clipping, and what sort of depth values go into the depth buffer. For example, in OpenGL, the conceptual convention is for the view frustum to contain $-1\le {\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{z}/{\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}\le +1$ , but this might not be optimal for depth buffering. Driver vendors must convert from the API's conceptual conventions to whatever is optimal for the hardware.

An alternative strategy, known as $w$ -buffering, is to use ${\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}$ as the depth value. In most situations ${\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}$ is simply a scaled version of the camera-space $z$ value; thus by using ${\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}$ in the depth buffer, each value has a linear relationship to the viewing depth of the corresponding pixel. This method can be attractive, especially if the depth buffer is fixed-point with limited precision, because it spreads out the available precision more evenly. The traditional method of storing ${\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{z}/{\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}$ in the depth buffer results in greatly increased precision up close, but at the expense of (sometimes drastically) reduced precision near the far clip plane. If the depth buffer values are stored in floating-point, this issue is much less important. Also note that $w$ -buffering doesn't work for orthographic projection, since an orthographic projection matrix always outputs $w=1$ .
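To get a feel for how unevenly the traditional method distributes precision, the sketch below evaluates $\mathrm{clip}_z/\mathrm{clip}_w$ for the matrix of Equation (10.7) and compares it against a linear depth value. The function names are hypothetical, and dividing by $f$ to normalize the linear value is an assumption made only so the two results can be compared on the same $[0, 1]$ scale.

```cpp
#include <cassert>

// Depth stored by the traditional method, using Equation (10.7):
// clip_z = z * f/(f-n) - n*f/(f-n), and clip_w = z.
double normalizedDepth(double z, double n, double f) {
    assert(n > 0.0 && f > n && z > 0.0);
    const double clipZ = z * f / (f - n) - n * f / (f - n);
    const double clipW = z;
    return clipZ / clipW;
}

// Depth in the spirit of w-buffering: linear in camera-space z.
// (Normalizing by f is an assumption made for this comparison.)
double linearDepth(double z, double f) {
    assert(f > 0.0);
    return z / f;
}
```

With $n = 1$ and $f = 1000$, a point at $z = 10$ (only 1% of the way to the far clip plane) already has a normalized depth of about 0.90, leaving roughly a tenth of the value range for the remaining 99% of the distance. The linear value at that same point is 0.01.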

The ${\mathrm{c}\mathrm{l}\mathrm{i}\mathrm{p}}_{w}$ value is also not discarded. As we've said, it serves an important purpose as the denominator in the homogeneous division to normalized device coordinates. But this value is also usually needed for proper perspective-correct interpolation of texture coordinates, colors, and other vertex-level values during rasterization.
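As a brief illustration of why $\mathrm{clip}_w$ is needed during rasterization, here is a sketch of the standard formulation of perspective-correct interpolation between two vertices. This section does not spell out the formula, so treat the details as an assumption: the idea is that $a/w$ and $1/w$ both vary linearly in screen space, even though the attribute $a$ itself does not. The function name is hypothetical.

```cpp
#include <cassert>

// Interpolates a vertex attribute (a texture coordinate, say) between
// two projected vertices. a0, a1 are the attribute values; w0, w1 are
// the clip-space w values; t is the fraction measured in screen space.
double perspectiveCorrectLerp(double a0, double w0,
                              double a1, double w1, double t) {
    assert(w0 > 0.0 && w1 > 0.0);
    const double numer = (1.0 - t) * (a0 / w0) + t * (a1 / w1);
    const double denom = (1.0 - t) * (1.0 / w0) + t * (1.0 / w1);
    return numer / denom;
}
```

For an edge running from $w = 1$ to $w = 3$ with the attribute going from 0 to 1, the screen-space midpoint receives the value 0.25 rather than the 0.5 that naive linear interpolation would give, because the half of the edge nearer the camera covers more of the screen.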

On modern graphics APIs at the time of this writing, the conversion of vertex coordinates from clip space to screen space is done for you. Your vertex shader outputs coordinates in clip space. The API clips the triangles to the view frustum and then projects the coordinates to screen space. But that doesn't mean that you will never use the equations in this section in your code. Quite often, we need to perform these calculations in software for visibility testing, level-of-detail selection, and so forth.

## 10.3.6 Summary of Coordinate Spaces

Figure 10.10 summarizes the coordinate spaces and matrices discussed in this section, showing the data flow from object space to screen space.

Figure 10.10 Conversion of vertex coordinates through the graphics pipeline

The coordinate spaces we've mentioned are the most important and common ones, but other coordinate spaces are used in computer graphics. For example, a projected light might have its own space, which is essentially the same as camera space, only it is from the perspective that the light “looks” onto the scene. This space is important when the light projects an image (sometimes called a gobo) and also for shadow mapping to determine whether a light can “see” a given point.

Another space that has become very important is tangent space, which is a local space on the surface of an object. One basis vector is the surface normal and the other two basis vectors are locally tangent to the surface, essentially establishing a 2D coordinate space that is “flat” on the surface at that spot. There are many different ways we could determine these basis vectors, but by far the most common reason to establish such a coordinate space is for bump mapping and related techniques. A more complete discussion of tangent space will need to wait until after we discuss texture mapping in Section 10.5, so we'll come back to this subject in Section 10.9.1. Tangent space is also sometimes called surface-local space.

# 10.4 Polygon Meshes

To render a scene, we need a mathematical description of the geometry in that scene. Several different methods are available to us. This section focuses on the one most important for real-time rendering: the triangle mesh. But first, let's mention a few alternatives to get some context. Constructive solid geometry (CSG) is a system for describing an object's shape using Boolean operators (union, intersection, subtraction) on primitives. Within video games, CSG can be especially useful for rapid prototyping tools, with the Unreal engine being a notable example. Another technique that works by modeling volumes rather than their surfaces is metaballs, sometimes used to model organic shapes and fluids, as was discussed in Section 9.1. CSG, metaballs, and other volumetric descriptions are very useful in particular realms, but for rendering (especially real-time rendering) we are interested in a description of the surface of the object, and seldom need to determine whether a given point is inside or outside this surface. Indeed, the surface need not be closed or even define a coherent volume.

The most common surface description is the polygon mesh, of which you are probably already aware. In certain circumstances, it's useful to allow the polygons that form the surface of the object to have an arbitrary number of vertices; this is often the case in importing and editing tools. For real-time rendering, however, modern hardware is optimized for triangle meshes, which are polygon meshes in which every polygon is a triangle. Any given polygon mesh can be converted into an equivalent triangle mesh by decomposing each polygon into triangles individually, as was discussed briefly in Section 9.7.3. Please note that many important concepts introduced in the context of a single triangle or polygon were covered in Section 9.6 and Section 9.7, respectively. Here, our focus is on how more than one triangle can be connected in a mesh.

One very straightforward way to store a triangle mesh would be to use an array of triangles, as shown in Listing 10.4.

struct Triangle {
    Vector3 vertPos[3]; // vertex positions
};

struct TriangleMesh {
    int      triCount; // number of triangles
    Triangle *triList; // array of triangles
};


For some applications this trivial representation might be adequate. However, the term “mesh” implies a degree of connectivity between adjacent triangles, and this connectivity is not expressed in our trivial representation. There are three basic types of information in a triangle mesh:

• Vertices. Each triangle has exactly three vertices. Each vertex may be shared by multiple triangles. The valence of a vertex refers to how many faces are connected to the vertex.
• Edges. An edge connects two vertices. Each triangle has three edges. In many cases, each edge is shared by exactly two faces, but there are certainly exceptions. If the object is not closed, an open edge with only one neighboring face can exist.
• Faces. These are the surfaces of the triangles. We can store a face as either a list of three vertices, or a list of three edges.

A variety of methods exist to represent this information efficiently, depending on the operations to be performed most often on the mesh. Here we will focus on a standard storage format known as an indexed triangle mesh.

## 10.4.1 Indexed Triangle Mesh

An indexed triangle mesh consists of two lists: a list of vertices, and a list of triangles.

• Each vertex contains a position in 3D. We may also store other information at the vertex level, such as texture-mapping coordinates, surface normals, or lighting values.
• A triangle is represented by three integers that index into the vertex list. Usually, the order in which these vertices are listed is significant, since we may consider faces to have “front” and “back” sides. We adopt the left-handed convention that the vertices are listed in clockwise order when viewed from the front side. Other information may also be stored at the triangle level, such as a precomputed normal of the plane containing the triangle, surface properties (such as a texture map), and so forth.

Listing 10.5 shows a highly simplified example of how an indexed triangle mesh might be stored in C.

// struct Vertex is the information we store at the vertex level
struct Vertex {

    // 3D position of the vertex
    Vector3 pos;

    // Other information could include
    // texture mapping coordinates, a
    // surface normal, lighting values, etc.
};

// struct Triangle is the information we store at the triangle level
struct Triangle {

    // Indices into the vertex list.  In practice, 16-bit indices are
    // almost always used rather than 32-bit, to save memory and bandwidth.
    int vertexIndex[3];

    // Other information could include
    // a normal, material information, etc
};

// struct TriangleMesh stores an indexed triangle mesh
struct TriangleMesh {

    // The vertices
    int    vertexCount;
    Vertex *vertexList;

    // The triangles
    int      triangleCount;
    Triangle *triangleList;
};


Figure 10.11 shows how a cube and a pyramid might be represented as a polygon mesh or a triangle mesh. Note that both objects are part of a single mesh with 13 vertices. The lighter, thicker wires show the outlines of polygons, and the thinner, dark green wires show one way to add edges to triangulate the polygon mesh.

Figure 10.11 A simple mesh containing a cube and a pyramid

Assuming the origin is on the “ground” directly between the two objects, the vertex coordinates might be as shown in Table 10.3.

| # | Position | # | Position | # | Position | # | Position |
|---|---|---|---|---|---|---|---|
| 0 | $(-3,2,1)$ | 4 | $(-3,0,1)$ | 8 | $(2,2,0)$ | 12 | $(1,0,-1)$ |
| 1 | $(-1,2,1)$ | 5 | $(-1,0,1)$ | 9 | $(1,0,1)$ | | |
| 2 | $(-1,2,-1)$ | 6 | $(-1,0,-1)$ | 10 | $(3,0,1)$ | | |
| 3 | $(-3,2,-1)$ | 7 | $(-3,0,-1)$ | 11 | $(3,0,-1)$ | | |

Table 10.3 Vertex positions in our sample mesh

Table 10.4 shows the vertex indices that would form faces of this mesh, either as a polygon mesh or as a triangle mesh. Remember that the order of the vertices is significant; they are listed in clockwise order when viewed from the outside. You should study these figures until you are sure you understand them.

| Description | Vertex indices (Polygon mesh) | Vertex indices (Triangle mesh) |
|---|---|---|
| Cube top | $\{0,1,2,3\}$ | $\{1,2,3\}$, $\{1,3,0\}$ |
| Cube front | $\{2,6,7,3\}$ | $\{2,6,7\}$, $\{2,7,3\}$ |
| Cube right | $\{2,1,5,6\}$ | $\{2,1,5\}$, $\{2,5,6\}$ |
| Cube left | $\{0,3,7,4\}$ | $\{0,3,7\}$, $\{0,7,4\}$ |
| Cube back | $\{0,4,5,1\}$ | $\{0,4,5\}$, $\{0,5,1\}$ |
| Cube bottom | $\{4,7,6,5\}$ | $\{4,7,6\}$, $\{4,6,5\}$ |
| Pyramid front | $\{12,8,11\}$ | $\{12,8,11\}$ |
| Pyramid left | $\{9,8,12\}$ | $\{9,8,12\}$ |
| Pyramid right | $\{8,10,11\}$ | $\{8,10,11\}$ |
| Pyramid back | $\{8,9,10\}$ | $\{8,9,10\}$ |
| Pyramid bottom | $\{9,12,11,10\}$ | $\{9,12,11\}$, $\{9,11,10\}$ |

Table 10.4 The vertex indices that form the faces of our sample mesh, either as a polygon mesh or a triangle mesh

The vertices must be listed in clockwise order around a face, but it doesn't matter which one is considered the “first” vertex; they can be cycled without changing the logical structure of the mesh. For example, the quad forming the cube top could equivalently have been given as $\{1,2,3,0\}$, $\{2,3,0,1\}$, or $\{3,0,1,2\}$.

As indicated by the comments in Listing 10.5, additional data are almost always stored per vertex, such as texture coordinates, surface normals, basis vectors, colors, skinning data, and so on. Each of these is discussed in later sections in the context of the techniques that make use of the data. Additional data can also be stored at the triangle level, such as an index that tells which material to use for that face, or the plane equation (part of which is the surface normal—see Section 9.5) for the face. This is highly useful for editing purposes or in other tools that perform mesh manipulations in software. For real-time rendering, however, we seldom store data at the triangle level beyond the three vertex indices. In fact, the most common method is to not have a struct Triangle at all, and to represent the entire list of triangles simply as an array (e.g. unsigned short triList[] ), where the length of the array is the number of triangles times 3. Triangles with identical properties are grouped into batches so that an entire batch can be fed to the GPU in this optimal format. After we review many of the concepts that give rise to the need to store additional data per vertex, Section 10.10.2 looks at several more specific examples of how we might feed that data to the graphics API. By the way, as a general rule, things are a lot easier if you do not try to use the same mesh class for both rendering and editing. The requirements are very different, and a bulkier data structure with more flexibility is best for use in tools, importers, and the like.

Note that in an indexed triangle mesh, the edges are not stored explicitly, but rather the adjacency information contained in an indexed triangle list is stored implicitly: to locate shared edges between triangles, we must search the triangle list. Our original trivial “array of triangles” format in Listing 10.4 did not have any logical connectivity information (although we could have attempted to detect whether the vertices on an edge were identical by comparing the vertex positions or other properties). What's surprising is that the “extra” connectivity information contained in the indexed representation actually results in a reduction of memory usage in most cases, compared to the flat method. The reason for this is that the information stored at the vertex level, which is duplicated in the trivial flat format, is relatively large compared to a single integer index. (At a minimum, we must store a 3D vector position.) In meshes that arise in practice, a typical vertex has a valence of around 3–6, which means that the flat format duplicates quite a lot of data.
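To put rough numbers on the memory comparison, here is a minimal sketch that counts vertex valence from a flat index list and estimates the storage of both formats. The function names, the constant `kCubeIndices`, and the byte sizes (a 12-byte vertex holding only a position as three 32-bit floats, and 16-bit indices) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count how many triangles use each vertex (the vertex valence), given a
// flat index list holding three entries per triangle.
std::vector<int> computeValence(const std::vector<std::uint16_t> &indices,
                                std::size_t vertexCount) {
    assert(indices.size() % 3 == 0);
    std::vector<int> valence(vertexCount, 0);
    for (std::uint16_t index : indices) {
        assert(index < vertexCount);
        ++valence[index];
    }
    return valence;
}

// Bytes needed by the flat "array of triangles" format: every triangle
// stores its own copy of all three vertices.
std::size_t flatBytes(std::size_t triangleCount, std::size_t bytesPerVertex) {
    return triangleCount * 3 * bytesPerVertex;
}

// Bytes needed by the indexed format: each vertex is stored once, plus
// three indices per triangle.
std::size_t indexedBytes(std::size_t vertexCount, std::size_t triangleCount,
                         std::size_t bytesPerVertex,
                         std::size_t bytesPerIndex) {
    return vertexCount * bytesPerVertex + triangleCount * 3 * bytesPerIndex;
}

// The twelve cube triangles from Table 10.4 as a flat index list.
const std::vector<std::uint16_t> kCubeIndices = {
    1, 2, 3,  1, 3, 0,   // top
    2, 6, 7,  2, 7, 3,   // front
    2, 1, 5,  2, 5, 6,   // right
    0, 3, 7,  0, 7, 4,   // left
    0, 4, 5,  0, 5, 1,   // back
    4, 7, 6,  4, 6, 5,   // bottom
};
```

For the cube alone (8 vertices, 12 triangles), `flatBytes(12, 12)` gives 432 bytes while `indexedBytes(8, 12, 12, 2)` gives 168 bytes, and the valences sum to 36, an average of 4.5 per vertex. The gap widens as more data (normals, UVs, and so on) is stored per vertex.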

The simple indexed triangle mesh scheme is appropriate for many applications, including the very important one of rendering. However, some operations on triangle meshes require a more advanced data structure in order to be implemented more efficiently. The basic problem is that the adjacency between triangles is not expressed explicitly and must be extracted by searching the triangle list. Other representation techniques exist that make this information available in constant time. One idea is to maintain an edge list explicitly. Each edge is defined by listing the two vertices on the ends. We also maintain a list of triangles that share the edge. Then the triangles can be viewed as a list of three edges rather than a list of three vertices, so they are stored as three indices into the edge list rather than the vertex list. An extension of this idea is known as the winged-edge model, which also stores, for each vertex, a reference to one edge that uses the vertex. The edges and triangles may be traversed intelligently to quickly locate all edges and triangles that use the vertex.
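As a rough illustration of the explicit edge list idea, the sketch below builds a table from each edge to the triangles that share it, starting from a flat index list. The names `EdgeKey`, `EdgeToTriangles`, and `buildEdgeList` are hypothetical. Note that `std::map` gives logarithmic rather than constant-time lookup; a hash table or precomputed per-triangle edge indices would be needed for the constant-time access described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// An undirected edge, stored with the smaller vertex index first so that
// both triangles sharing the edge produce the same key.
using EdgeKey = std::pair<int, int>;

// For every edge, the indices of the triangles that use it.
using EdgeToTriangles = std::map<EdgeKey, std::vector<int>>;

// Build the edge table from a flat index list (three entries per triangle).
EdgeToTriangles buildEdgeList(const std::vector<int> &indices) {
    assert(indices.size() % 3 == 0);
    EdgeToTriangles edges;
    const int triangleCount = static_cast<int>(indices.size() / 3);
    for (int t = 0; t < triangleCount; ++t) {
        for (int j = 0; j < 3; ++j) {
            const int a = indices[3 * t + j];
            const int b = indices[3 * t + (j + 1) % 3];
            edges[EdgeKey(std::min(a, b), std::max(a, b))].push_back(t);
        }
    }
    return edges;
}
```

An edge whose list holds a single triangle is an open edge; in a closed mesh such as the cube of Figure 10.11, every edge should end up with exactly two triangles.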

## 10.4.2 Surface Normals

Surface normals are used for several different purposes in graphics; for example, to compute proper lighting (Section 10.6), and for backface culling (Section 10.10.5). In general, a surface normal is a unit vector that is perpendicular to a surface. We might be interested in the normal of a given face, in which case the surface of interest is the plane that contains the face. The surface normals for polygons can be computed easily by using the techniques from Section 9.5.

Vertex-level normals are a bit trickier. First, it should be noted that, strictly speaking, there is not a true surface normal at a vertex (or an edge for that matter), since these locations mark discontinuities in the surface of the polygon mesh. Rather, for rendering purposes, we typically interpret a polygon mesh as an approximation to some smooth surface. So we don't want a normal to the piecewise linear surface defined by the polygon mesh; rather, we want (an approximation of) the surface normal of the smooth surface.

The primary purpose of vertex normals is lighting. Practically every lighting model takes a surface normal at the spot being lit as an input. Indeed, the surface normal is part of the rendering equation itself (in the Lambert factor), so it is always an input, even if the BRDF does not depend on it. We have normals available only at the vertices, yet we need to compute lighting values over the entire surface. What to do? If hardware resources permit (as they usually do nowadays), then we can approximate the normal of the continuous surface corresponding to any point on a given face by interpolating vertex normals and renormalizing the result. This technique is illustrated in Figure 10.12, which shows a cross section of a cylinder (black circle) that is being approximated by a hexagonal prism (blue outline). Black normals at the vertices are the true surface normals, whereas the interior normals are being approximated through interpolation. (The actual normals used would be the result of stretching these out to unit length.)

Figure 10.12 A cylinder approximated with a hexagonal prism.
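The interpolate-then-renormalize step might look like the following sketch. It assumes the point on the face is described by three barycentric weights that sum to 1; `Vec3` and the helper functions are minimal hypothetical stand-ins for the `Vector3` class used in the listings, not its actual interface.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for a 3D vector class (hypothetical).
struct Vec3 { float x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

float length(Vec3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

// Scale a vector to unit length. The input must not be the zero vector.
Vec3 normalized(Vec3 v) {
    const float len = length(v);
    assert(len > 0.0f);
    return v * (1.0f / len);
}

// Blend the three vertex normals with the barycentric weights b0, b1, b2
// of the point being shaded, then stretch the result back to unit length.
Vec3 interpolateNormal(Vec3 n0, Vec3 n1, Vec3 n2,
                       float b0, float b1, float b2) {
    return normalized(n0 * b0 + n1 * b1 + n2 * b2);
}
```

Halfway along an edge whose endpoint normals are (1,0,0) and (0,1,0), the blended vector (0.5, 0.5, 0) has length of only about 0.707, which is why the renormalization is needed before the normal is used for lighting.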

Once we have a normal at a given point, we can perform the full lighting equation per pixel. This is known as per-pixel shading. An alternative strategy to per-pixel shading, known as Gouraud shading, is to perform lighting calculations only at the vertex level, and then interpolate the results themselves, rather than the normal, across the face. This requires less computation, and is still done on some systems, such as the Nintendo Wii.

Figure 10.13 Approximating cylinders with prisms of varying number of sides.

Figure 10.13 shows per-pixel lighting of cylinders with a different number of sides. Although the illusion breaks down on the ends of the cylinder, where the silhouette edge gives away the low-poly nature of the geometry, this method of approximating a smooth surface can indeed make even a very low-resolution mesh look “smooth.” Cover up the ends of the cylinder, and even the 5-sided cylinder is remarkably convincing.
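To make the difference between per-pixel shading and Gouraud shading concrete, the sketch below evaluates both along a single edge between two vertices. The lighting function here is a deliberately simple stand-in (a clamped cosine between the normal and the direction to the light), chosen only for illustration; `Vec3`, `mix`, `simpleLighting`, and the two shading functions are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Minimal stand-in for a 3D vector class (hypothetical).
struct Vec3 { float x, y, z; };

float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Linear interpolation between a (t = 0) and b (t = 1).
Vec3 mix(Vec3 a, Vec3 b, float t) {
    return {a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t,
            a.z + (b.z - a.z) * t};
}

Vec3 normalized(Vec3 v) {
    const float len = std::sqrt(dot(v, v));
    assert(len > 0.0f);
    return {v.x / len, v.y / len, v.z / len};
}

// Stand-in lighting function: clamped cosine of the angle between the
// unit normal and the unit direction toward the light.
float simpleLighting(Vec3 n, Vec3 toLight) {
    return std::max(dot(n, toLight), 0.0f);
}

// Per-pixel shading: interpolate the normal, renormalize, then light.
float shadePerPixel(Vec3 n0, Vec3 n1, float t, Vec3 toLight) {
    return simpleLighting(normalized(mix(n0, n1, t)), toLight);
}

// Gouraud shading: light at the vertices, then interpolate the results.
float shadeGouraud(Vec3 n0, Vec3 n1, float t, Vec3 toLight) {
    const float c0 = simpleLighting(n0, toLight);
    const float c1 = simpleLighting(n1, toLight);
    return c0 + (c1 - c0) * t;
}
```

With vertex normals (1,0,0) and (0,1,0) and a light direction halfway between them, per-pixel shading at the midpoint gives approximately 1.0, whereas Gouraud shading gives about 0.707 everywhere along the edge: the bright spot in the middle is lost because it never occurs at a vertex.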

Now that we understand how normals are interpolated in order to approximately reconstruct a curved surface, let's talk about how to obtain vertex normals. This information may not be readily available, depending on how the triangle mesh was generated. If the mesh is generated procedurally, for example, from a parametric curved surface, then the vertex normals can be supplied at that time. Or you may simply be handed the vertex normals from the modeling package as part of the mesh. However, sometimes the surface normals are not provided, and we must approximate them by interpreting the only information available to us: the vertex positions and the triangles. One trick that works is to average the normals of the adjacent triangles, and then renormalize the result. This classic technique is demonstrated in Listing 10.6.

struct Vertex {
    Vector3 pos;
    Vector3 normal;
};
struct Triangle {
    int     vertexIndex[3];
    Vector3 normal;
};
struct TriangleMesh {
    int      vertexCount;
    Vertex   *vertexList;
    int      triangleCount;
    Triangle *triangleList;

    void computeVertexNormals() {

        // First clear out the vertex normals
        for (int i = 0 ; i < vertexCount ; ++i) {
            vertexList[i].normal.zero();
        }

        // Now add in the face normals into the
        // normals of the adjacent vertices
        for (int i = 0 ; i < triangleCount ; ++i) {

            // Get shortcut
            Triangle &tri = triangleList[i];

            // Compute triangle normal.
            Vector3 v0 = vertexList[tri.vertexIndex[0]].pos;
            Vector3 v1 = vertexList[tri.vertexIndex[1]].pos;
            Vector3 v2 = vertexList[tri.vertexIndex[2]].pos;
            tri.normal = cross(v1-v0, v2-v1);
            tri.normal.normalize();

            // Sum it into the adjacent vertices
            for (int j = 0 ; j < 3 ; ++j) {
                vertexList[tri.vertexIndex[j]].normal += tri.normal;
            }
        }

        // Finally, average and normalize the results.
        // Note that this can blow up if a vertex is isolated
        // (not used by any triangles), and in some other cases.
        for (int i = 0 ; i < vertexCount ; ++i) {
            vertexList[i].normal.normalize();
        }
    }
};


Averaging face normals to compute vertex normals is a tried-and-true technique that works well in most cases. However, there are a few things to watch out for. The first is that sometimes the mesh is supposed to have a discontinuity, and if we're not careful, this discontinuity will get “smoothed out.” Take the very simple example of a box. There should be a sharp lighting discontinuity at its edges. However, if we use vertex normals computed from the average of the surface normals, then there is no lighting discontinuity, as shown in Figure 10.14.

Figure 10.14 On the right, the box edges are not visible because there is only one normal at each corner

Vertices

| # | Position | Normal |
|---|---|---|
| 0 | $(-1,+1,+1)$ | $[-0.577,+0.577,+0.577]$ |
| 1 | $(+1,+1,+1)$ | $[+0.577,+0.577,+0.577]$ |
| 2 | $(+1,+1,-1)$ | $[+0.577,+0.577,-0.577]$ |
| 3 | $(-1,+1,-1)$ | $[-0.577,+0.577,-0.577]$ |
| 4 | $(-1,-1,+1)$ | $[-0.577,-0.577,+0.577]$ |
| 5 | $(+1,-1,+1)$ | $[+0.577,-0.577,+0.577]$ |
| 6 | $(+1,-1,-1)$ | $[+0.577,-0.577,-0.577]$ |
| 7 | $(-1,-1,-1)$ | $[-0.577,-0.577,-0.577]$ |

Faces

| Description | Indices |
|---|---|
| Top | $\{0,1,2,3\}$ |
| Front | $\{2,6,7,3\}$ |
| Right | $\{2,1,5,6\}$ |
| Left | $\{0,3,7,4\}$ |
| Back | $\{0,4,5,1\}$ |
| Bottom | $\{4,7,6,5\}$ |

Table 10.5 Polygon mesh of a box with welded vertices and smoothed edges

The basic problem is that the surface discontinuity at the box edges cannot be properly represented because there is only one normal stored per vertex. The solution to this problem is to “detach” the faces; in other words, duplicate the vertices along the edge where there is a true geometric discontinuity, creating a topological discontinuity to prevent the vertex normals from being averaged. After doing so, the faces are no longer logically connected, but this seam in the topology of the mesh doesn't cause a problem for many important tasks, such as rendering and raytracing. Table 10.5 shows a smoothed box mesh with eight vertices. Compare that mesh to the one in Table 10.6, in which the faces have been detached, resulting in 24 vertices.

Vertices

| # | Position | Normal |
|---|---|---|
| 0 | $(-1,+1,+1)$ | $[0,+1,0]$ |
| 1 | $(+1,+1,+1)$ | $[0,+1,0]$ |
| 2 | $(+1,+1,-1)$ | $[0,+1,0]$ |
| 3 | $(-1,+1,-1)$ | $[0,+1,0]$ |
| 4 | $(-1,+1,-1)$ | $[0,0,-1]$ |
| 5 | $(+1,+1,-1)$ | $[0,0,-1]$ |
| 6 | $(+1,-1,-1)$ | $[0,0,-1]$ |
| 7 | $(-1,-1,-1)$ | $[0,0,-1]$ |
| 8 | $(+1,+1,-1)$ | $[+1,0,0]$ |
| 9 | $(+1,+1,+1)$ | $[+1,0,0]$ |
| 10 | $(+1,-1,+1)$ | $[+1,0,0]$ |
| 11 | $(+1,-1,-1)$ | $[+1,0,0]$ |
| 12 | $(-1,+1,+1)$ | $[-1,0,0]$ |
| 13 | $(-1,+1,-1)$ | $[-1,0,0]$ |
| 14 | $(-1,-1,-1)$ | $[-1,0,0]$ |
| 15 | $(-1,-1,+1)$ | $[-1,0,0]$ |
| 16 | $(+1,+1,+1)$ | $[0,0,+1]$ |
| 17 | $(-1,+1,+1)$ | $[0,0,+1]$ |
| 18 | $(-1,-1,+1)$ | $[0,0,+1]$ |
| 19 | $(+1,-1,+1)$ | $[0,0,+1]$ |
| 20 | $(+1,-1,-1)$ | $[0,-1,0]$ |
| 21 | $(-1,-1,-1)$ | $[0,-1,0]$ |
| 22 | $(-1,-1,+1)$ | $[0,-1,0]$ |
| 23 | $(+1,-1,+1)$ | $[0,-1,0]$ |

Faces

| Description | Indices |
|---|---|
| Top | $\{0,1,2,3\}$ |
| Front | $\{4,5,6,7\}$ |
| Right | $\{8,9,10,11\}$ |
| Left | $\{12,13,14,15\}$ |
| Back | $\{16,17,18,19\}$ |
| Bottom | $\{20,21,22,23\}$ |

Table 10.6 Polygon mesh of a box with detached faces and lighting discontinuities at the edges

An extreme version of this situation occurs when two faces are placed back-to-back. Such infinitely thin double-sided geometry can arise with foliage, cloth, billboards, and the like. In this case, since the normals are exactly opposite, averaging them produces the zero vector, which cannot be normalized. The simplest solution is to detach the faces so that the vertex normals will not average together. Or if the front and back sides are mirror images, the two “single-sided” polygons can be replaced by one “double-sided” one. This requires special treatment during rendering: backface culling (Section 10.10.5) must be disabled, and the normal must be handled intelligently in the lighting equation.

A more subtle problem is that the averaging is biased towards large numbers of triangles with the same normal. For example, consider the vertex at index 1 in Figure 10.11. This vertex is adjacent to two triangles on the top of the cube, but only one triangle on the right side and one triangle on the back side. The vertex normal computed by averaging the triangle normals is biased because the top face normal essentially gets twice as many “votes” as each of the side face normals. But this topology is the result of an arbitrary decision as to where to draw the edges to triangulate the faces of the cube. For example, if we were to triangulate the top face by drawing an edge between vertices 0 and 2 (this is known as “turning” the edge), all of the normals on the top face would change.

Techniques exist to deal with this problem, such as weighting the contribution from each adjacent face based on the interior angle adjacent to the vertex, but it's often ignored in practice. Most of the really terrible examples are contrived ones like this, where the faces should be detached anyway. Furthermore, the normals are an approximation to begin with, and a slightly perturbed normal is often difficult to detect visually.
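A minimal sketch of the angle-weighting idea follows. Each face normal is scaled by the interior angle of the triangle at the vertex before being summed, so splitting a face into more triangles does not give it more "votes." The type `Vec3`, the helpers, and the function `angleWeightedNormals` are hypothetical stand-ins written for this example; they are not the `Vector3` interface used in Listing 10.6.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal stand-in for a 3D vector class (hypothetical).
struct Vec3 { float x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Scale to unit length; a zero vector is returned unchanged.
Vec3 normalized(Vec3 v) {
    const float len = std::sqrt(dot(v, v));
    return (len > 0.0f) ? v * (1.0f / len) : v;
}

// Interior angle (radians) at 'corner' between the edges to 'next' and 'prev'.
float cornerAngle(Vec3 corner, Vec3 next, Vec3 prev) {
    const Vec3 a = normalized(next - corner);
    const Vec3 b = normalized(prev - corner);
    return std::acos(std::min(1.0f, std::max(-1.0f, dot(a, b))));
}

// Vertex normals as the angle-weighted sum of adjacent face normals.
// 'indices' holds three vertex indices per triangle.
std::vector<Vec3> angleWeightedNormals(const std::vector<Vec3> &positions,
                                       const std::vector<int> &indices) {
    assert(indices.size() % 3 == 0);
    std::vector<Vec3> normals(positions.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t t = 0; t < indices.size(); t += 3) {
        const int i[3] = {indices[t], indices[t + 1], indices[t + 2]};
        const Vec3 faceNormal = normalized(
            cross(positions[i[1]] - positions[i[0]],
                  positions[i[2]] - positions[i[1]]));
        for (int j = 0; j < 3; ++j) {
            const float angle = cornerAngle(positions[i[j]],
                                            positions[i[(j + 1) % 3]],
                                            positions[i[(j + 2) % 3]]);
            normals[i[j]] = normals[i[j]] + faceNormal * angle;
        }
    }
    for (Vec3 &n : normals) n = normalized(n);
    return normals;
}
```

For vertex 1 of the cube in Figure 10.11, the two top triangles each contribute a 45-degree corner, so the top face receives the same total weight (90 degrees) as the right and back faces. The result is the symmetric normal of roughly [0.577, 0.577, 0.577], independent of how the top face was triangulated.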

Although some modeling packages can deliver vertex normals for you, fewer provide the basis vectors needed for bump mapping. As we see in Section 10.9, techniques used to synthesize vertex basis vectors are similar to those described here.

Before we go on, there is one very important fact about surface normals that we must mention. In certain circumstances, they cannot be transformed by the same matrix that is used to transform positions. (This is an entirely separate issue from the fact that normals should not be translated like positions.) The reason for this is that normals are covariant vectors. “Regular” vectors, such as position and velocity, are said to be contravariant: if we scale the coordinate space used to describe the vector, the coordinates will respond in the opposite direction. If we use a coordinate space with a larger scale (for example, using meters instead of feet) the coordinates of a contravariant vector respond to the contrary, by becoming smaller. Notice that this is all about scale; translation and rotation are not part of the discussion. Normals and other types of gradients, known as dual vectors, do not behave like this.

Imagine that we stretch a 2D object, such as a circle, horizontally, as shown in Figure 10.15. Notice that the normals (shown in light blue in the right figure) begin to turn to point more vertically: the horizontal coordinates of the normals are decreasing in absolute value while the horizontal coordinates of the positions are increasing. A stretching of the object (object getting bigger while coordinate space stays the same) has the same effect as scaling down the coordinate space while holding the object at the same size. The coordinates of the normal change in the same direction as the scale of the coordinate space, which is why they are called covariant vectors.

Figure 10.15 Transforming normals with nonuniform scale. The light red vectors show the normals multiplied by the same transform matrix used to transform the object; the dark red vectors are their normalized versions. The light blue vectors show the correct normals.

To properly transform surface normals, we must use the inverse transpose of the matrix used to transform positions; that is, the result of transposing and inverting the matrix. This is sometimes denoted $\mathbf{M}^{-\mathrm{T}}$, since it doesn't matter if we transpose first, or invert first: $(\mathbf{M}^{-1})^{\mathrm{T}} = (\mathbf{M}^{\mathrm{T}})^{-1}$. If the transform matrix doesn't contain any scale (or skew), then the matrix is orthonormal, and thus the inverse transpose is simply the same as the original matrix, and we can safely transform normals with this transform. If the matrix contains uniform scale, then we can still ignore this, but we must renormalize the normals after transforming them. If the matrix contains nonuniform scale (or skew, which is indistinguishable from nonuniform scale combined with rotation), then to properly transform the normals, we must use the inverse transpose transform matrix, and then renormalize the resulting transformed normals.

In general, normals must be transformed with the inverse transpose of the matrix used to transform positions. This can safely be ignored if the transform matrix is without scale. If the matrix contains uniform scale, then all that is required is to renormalize the normals after transformation. If the matrix contains nonuniform scale, then we must use the inverse transpose transform and renormalize after transforming.
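The rule can be sketched for the 3×3 (linear) part of a transform as follows. This example assumes the row-vector convention, in which a position is transformed as $\mathbf{v}' = \mathbf{v}\mathbf{M}$, so a normal is transformed as $\mathbf{n}' = \mathbf{n}\mathbf{M}^{-\mathrm{T}}$; with column vectors the matrix would be applied on the other side. The types `Vec3` and `Mat3` and the function names are hypothetical. The inverse transpose is computed from cofactors, using the fact that the cofactor matrix divided by the determinant equals $\mathbf{M}^{-\mathrm{T}}$.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-ins for vector and matrix classes (hypothetical).
struct Vec3 { float x, y, z; };
struct Mat3 { float m[3][3]; };

// Inverse transpose of a 3x3 matrix: cofactor matrix divided by determinant.
Mat3 inverseTranspose(const Mat3 &a) {
    const float (*m)[3] = a.m;
    Mat3 c;
    c.m[0][0] = m[1][1] * m[2][2] - m[1][2] * m[2][1];
    c.m[0][1] = m[1][2] * m[2][0] - m[1][0] * m[2][2];
    c.m[0][2] = m[1][0] * m[2][1] - m[1][1] * m[2][0];
    c.m[1][0] = m[2][1] * m[0][2] - m[2][2] * m[0][1];
    c.m[1][1] = m[2][2] * m[0][0] - m[2][0] * m[0][2];
    c.m[1][2] = m[2][0] * m[0][1] - m[2][1] * m[0][0];
    c.m[2][0] = m[0][1] * m[1][2] - m[0][2] * m[1][1];
    c.m[2][1] = m[0][2] * m[1][0] - m[0][0] * m[1][2];
    c.m[2][2] = m[0][0] * m[1][1] - m[0][1] * m[1][0];
    const float det =
        m[0][0] * c.m[0][0] + m[0][1] * c.m[0][1] + m[0][2] * c.m[0][2];
    assert(det != 0.0f);  // the matrix must be invertible
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            c.m[i][j] /= det;
    return c;
}

// Row vector times matrix.
Vec3 mul(Vec3 v, const Mat3 &a) {
    return {v.x * a.m[0][0] + v.y * a.m[1][0] + v.z * a.m[2][0],
            v.x * a.m[0][1] + v.y * a.m[1][1] + v.z * a.m[2][1],
            v.x * a.m[0][2] + v.y * a.m[1][2] + v.z * a.m[2][2]};
}

// Transform a normal by the inverse transpose of 'a', then renormalize.
Vec3 transformNormal(Vec3 n, const Mat3 &a) {
    const Vec3 r = mul(n, inverseTranspose(a));
    const float len = std::sqrt(r.x * r.x + r.y * r.y + r.z * r.z);
    assert(len > 0.0f);
    return {r.x / len, r.y / len, r.z / len};
}
```

For a horizontal stretch by a factor of 2 (scale of 2 in $x$, 1 in $y$ and $z$), the normal $[0.707, 0.707, 0]$ becomes approximately $[0.447, 0.894, 0]$: it turns to point more vertically, matching the behavior described for Figure 10.15. In practice the inverse transpose would be computed once per transform rather than once per normal.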

# 10.5 Texture Mapping

There is much more to the appearance of an object than its shape. Different objects are different colors and have different patterns on their surface. One simple yet powerful way to capture these qualities is through texture mapping. A texture map is a bitmap image that is “pasted” to the surface of an object. Rather than controlling the color of an object per triangle or per vertex, with texture mapping we can control the color at a much finer level—per texel. (A texel is a single pixel in a texture map. This is a handy word to know, since in graphics contexts, there are lots of different bitmaps being accessed, and it's nice to have a short way to differentiate between a pixel in the frame buffer and a pixel in a texture.)

So a texture map is just a regular bitmap that is applied onto the surface of a model. Exactly how does this work? Actually, there are many different ways to apply a texture map onto a mesh. Planar mapping projects the texture orthographically onto the mesh. Spherical, cylindrical, and cubic mapping are various methods of “wrapping” the texture around the object. The details of each of these techniques are not important to us at the moment, since modeling packages such as 3DS Max deal with these user interface issues. The key idea is that, at each point on the surface of the mesh, we can obtain texture-mapping coordinates, which define the 2D location in the texture map that corresponds to this 3D location. Traditionally, these coordinates are assigned the variables $(u,v)$, where $u$ is the horizontal coordinate and $v$ is the vertical coordinate; thus, texture-mapping coordinates are often called UV coordinates or simply UVs.

Although bitmaps come in different sizes, UV coordinates are normalized such that the mapping space ranges from 0 to 1 over the entire width ($u$) or height ($v$) of the image, rather than depending on the image dimensions. The origin of this space is either in the upper left-hand corner of the image, which is the DirectX-style convention, or in the lower left-hand corner, the OpenGL convention. We use the DirectX convention in this book. Figure 10.16 shows the texture map that we use in several examples and the DirectX-style coordinate conventions.

Figure 10.16 An example texture map, with labeled UV coordinates according to the DirectX convention, which places the origin in the upper-left corner.

In principle, it doesn't matter how we determine the UV coordinates for a given point on the surface. However, even when UV coordinates are calculated dynamically, rather than edited by an artist, we typically compute or assign UV coordinates only at the vertex level, and the UV coordinates at an arbitrary interior position on a face are obtained through interpolation. If you imagine the texture map as a stretchy cloth, then when we assign texture-mapping coordinates to a vertex, it's like sticking a pin through the cloth at those UV coordinates, and then pinning the cloth onto the surface at that vertex. There is one pin per vertex, so the whole surface is covered.

Let's look at some examples. Figure 10.17 shows a single texture-mapped quad, with different UV values assigned to the vertices. The bottom of each diagram shows the UV space of the texture. You should study these examples until you are sure you understand them.

Figure 10.17 A texture-mapped quad, with different UV coordinates assigned to the vertices

UV coordinates outside of the range $\left[0,1\right]$ are allowed, and in fact are quite useful. Such coordinates are interpreted in a variety of ways. The most common addressing modes are repeat (also known as tile or wrap) and clamp. When repeating is used, the integer portion is discarded and only the fractional portion is used, causing the texture to repeat, as shown in the left side of Figure 10.18. Under clamping, when a coordinate outside the range $\left[0,1\right]$ is used to access a bitmap, it is clamped in range. This has the effect of streaking the edge pixels of the bitmap outwards, as depicted on the right side of Figure 10.18. The mesh in both cases is identical: a single polygon with four vertices. And the meshes have identical UV coordinates. The only difference is how coordinates outside the $\left[0,1\right]$ range are interpreted.

Figure 10.18 Comparing repeating (left) and clamping (right) texture addressing modes

There are other options supported on some hardware, such as mirror, which is similar to repeat except that every other tile is mirrored. (This can be beneficial because it guarantees that no “seam” will exist between adjacent tiles.) On most hardware, the addressing mode can be set for the $u$- and $v$-coordinates independently. It's important to understand that these rules are applied at the last moment, when the coordinates are used to index into the texture. The coordinates at the vertex are not limited or processed in any way; otherwise, they could not be interpolated properly across the face.
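The three addressing modes can be sketched as small functions applied to a single coordinate just before the texture lookup. This is an illustrative software version, not how any particular API or GPU implements it; the function names are hypothetical, and `texelIndex` assumes a simplified nearest-texel lookup with no filtering.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Repeat (tile, wrap): discard the integer portion, keep the fraction.
float addressRepeat(float u) { return u - std::floor(u); }

// Clamp: force the coordinate into the range [0, 1].
float addressClamp(float u) { return std::min(std::max(u, 0.0f), 1.0f); }

// Mirror: like repeat, but every other tile runs backwards.
float addressMirror(float u) {
    const float t = u - 2.0f * std::floor(u * 0.5f);  // position in [0, 2)
    return (t <= 1.0f) ? t : 2.0f - t;
}

// Convert an already-addressed coordinate in [0, 1] to a texel index
// along one axis of a texture that is 'size' texels wide.
int texelIndex(float wrapped, int size) {
    assert(size > 0);
    assert(wrapped >= 0.0f && wrapped <= 1.0f);
    const int i = static_cast<int>(std::floor(wrapped * size));
    return std::min(i, size - 1);  // a coordinate of exactly 1 maps to the last texel
}
```

For the coordinate 1.25, repeat yields 0.25, clamp yields 1.0 (the edge texel, which is what produces the streaking), and mirror yields 0.75. Because each function takes one coordinate, a different mode can be chosen for $u$ and for $v$.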

Figure 10.19 shows one last instructive example: the same mesh is texture mapped two different ways.

Figure 10.19 Texture mapping works on stuff that's not just a single quad

# 10.6 The Standard Local Lighting Model

In the rendering equation, the BRDF describes the scattering distribution for light of a given frequency and direction of incidence. The differences in distributions between different surfaces are precisely what cause those surfaces (or even different surface points on the same object) to look different from one another. Most BRDFs are expressed in a computer by some sort of formula, where certain numbers in the formula are adjusted to match the desired material properties. The formula itself is often called a lighting model, and the particular values going into the formula come from the material assigned to the surface. It is common for a game engine to use only a handful of lighting models, even though the materials in the scene may be quite diverse and there may be thousands of different BRDFs. Indeed, just a few years ago, almost all real-time rendering was done with a single lighting model. In fact, the practice is not uncommon today.

This lighting model was so ubiquitous that it was hardwired into the very rendering APIs of OpenGL and DirectX. Although these older parts of the API have effectively become legacy features on hardware with programmable shaders, the standard model is still commonly used in the more general framework of shaders and generic constants and interpolants. The great diversity and flexibility available is usually used to determine the best way to feed the parameters into the model (for example, by doing multiple lights at once, or doing all the lighting at the end with deferred shading), rather than using different models. But even ignoring programmable shaders, at the time of this writing, the most popular video game console is the Nintendo Wii, which has hardwired support for this standard model.

The venerable standard lighting model is the subject of this section. Since its development precedes the framework of the BRDF and the rendering equation by at least a decade, we first present this model in the simplified context that surrounded its creation. This notation and perspective are still predominant in the literature today, which is why we think we should present the idea in its own terms. Along the way, we show how one component of the model (the diffuse component) is modeled as a BRDF. The standard model is important in the present, but you must understand the rendering equation if you want to be prepared for the future.

## 10.6.1 The Standard Lighting Equation: Overview

Bui Tuong Phong introduced the basic concepts behind the standard lighting model in 1975. Back then, the focus was on a fast way to model direct reflection. While researchers certainly understood the importance of indirect light, it was a luxury that could not yet be afforded. Thus while the rendering equation (which, as we noted previously, came into focus a decade or so after the proposal of the standard model) is an equation for the radiance outgoing from a point in any particular direction, the only outgoing directions that mattered in those days were the ones that pointed to the eye. Similarly, while the rendering equation considers incident light from the entire hemisphere surrounding the surface normal, if we ignore indirect light, then we need not cast about in all incident directions. We need to consider only those directions that aim at a light source. We examine some different ways that light sources are modeled in real-time graphics in more detail in Section 10.7, but for now an important point is that the light sources are not emissive surfaces in the scene, as they are in the rendering equation and in the real world. Instead, lights are special entities without any corresponding geometry, and are simulated as if the light were emitting from a single point. Thus, rather than including a solid angle of directions corresponding to the projection of the emissive surface of each light source onto the hemisphere surrounding $\mathbf{x}$, we only care about a single incident direction for the light. To summarize, the original goal of the standard model was to determine the light reflected back in the direction of the camera, only considering direct reflections, incident from a finite number of directions, one direction for each light source.

Now for the model. The basic idea is to classify light coming into the eye into four distinct categories, each of which has a unique method for calculating its contribution. The four categories are

• The emissive contribution, denoted ${\mathbf{c}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}$ , is the same as in the rendering equation. It tells the amount of radiance emitted directly from the surface in the given direction. Note that without global illumination techniques, these surfaces do not actually light up anything (except themselves).
• The specular contribution, denoted ${\mathbf{c}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ , accounts for light incident directly from a light source that is scattered preferentially in the direction of a perfect “mirror bounce.”
• The diffuse contribution, denoted ${\mathbf{c}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ , accounts for light incident directly from a light source that is scattered in every direction evenly.
• The ambient contribution, denoted ${\mathbf{c}}_{\mathrm{a}\mathrm{m}\mathrm{b}}$ , is a fudge factor to account for all indirect light.

The letter $\mathbf{c}$ is intended to be short for “contribution.” Note the bold typeface, indicating that these contributions are not scalar quantities representing the amount of light of a particular wavelength, but rather they are vectors representing colors in some basis with a discrete number of components (“channels”). As stated before, due to the tri-stimulus human vision system, the number of channels is almost always chosen to be three. A less fundamental choice is which three basis functions to use, but in real-time graphics, by far the most common choice is to make one channel for red, one channel for green, and one channel for blue. These details are surprisingly irrelevant to a high-level discussion (they will not appear anywhere in the equations), but, of course, they are important practical considerations.

The emissive term is the same as in the rendering equation, so there's not much more detail to say about it. In practice, the emissive contribution is simply a constant color at any given surface point $\mathbf{x}$ . The specular, diffuse, and ambient terms are more involved, so we discuss each in more detail in the next three sections.

## 10.6.2 The Specular Component

The specular component of the standard lighting model accounts for the light that is reflected (mostly) in a “perfect mirror bounce” off the surface. The specular component is what gives surfaces a “shiny” appearance. Rougher surfaces tend to scatter the light in a much broader pattern of directions, which is modeled by the diffuse component described in Section 10.6.3.

Now let's see how the standard model calculates the specular contribution. The important vectors are labeled in Figure 10.20.

• $\mathbf{n}$ is the local outward-pointing surface normal.
• $\mathbf{v}$ points towards the viewer. (The symbol $\mathbf{e}$ , for “eye,” is also sometimes used to name this vector.)
• $\mathbf{l}$ points towards the light source.
• $\mathbf{r}$ is the reflection vector, which is the direction of a “perfect mirror bounce.” It's the result of reflecting $\mathbf{l}$ about $\mathbf{n}$ .
• $\theta$ is the angle between $\mathbf{r}$ and $\mathbf{v}$ .

Figure 10.20 Phong model for specular reflection

For convenience, we assume that all of these vectors are unit vectors. Our convention in this book is to denote unit vectors with hats, but we'll drop the hats to avoid decorating the equations excessively. Many texts on the subject use these standard variable names and, especially in the video game community, they are effectively part of the vernacular. It is not uncommon for job interview questions to be posed in such a way that assumes the applicant is familiar with this framework.

One note about the $\mathbf{l}$ vector before we continue. Since lights are abstract entities, they need not necessarily have a “position.” Directional lights and Doom-style volumetric lights (see Section 10.7) are examples for which the position of the light might not be obvious. The key point is that the position of the light isn't important, but the abstraction being used for the light must facilitate the computation of a direction of incidence at any given shading point. (It must also provide the color and intensity of incident light.)

Of the four vectors, the first three are inherent degrees of freedom of the problem, and the reflection vector $\mathbf{r}$ is a derived quantity and must be computed. The geometry is shown in Figure 10.21.

Figure 10.21 Constructing the reflection vector $\mathbf{r}$

As you can see, the reflection vector can be computed by

Computing the reflection vector is a popular job interview question
$\begin{array}{}\text{(10.10)}& \mathbf{r}=2\left(\mathbf{n}\cdot \mathbf{l}\right)\mathbf{n}-\mathbf{l}.\end{array}$

There are many interviewers for whom this equation is a favorite topic, which is why we have displayed it on a line by itself, despite the fact that it would have fit perfectly fine inline in the paragraph. A reader seeking a job in the video game industry is advised to fully digest Figure 10.21, to be able to produce Equation (10.10) under pressure. Notice that if we assume $\mathbf{n}$ and $\mathbf{l}$ are unit vectors, then $\mathbf{r}$ will be as well.
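Equation (10.10) translates almost directly into code. The following is a minimal Python sketch, not taken from the book; the helper names `dot`, `normalize`, and `reflect` are our own, and vectors are assumed to be plain 3-tuples with both $\mathbf{l}$ and $\mathbf{n}$ pointing away from the surface.

```python
import math

def dot(a, b):
    """Dot product of two vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    """Return a unit-length copy of a (assumes a is not the zero vector)."""
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def reflect(l, n):
    """Reflect l about n, as in Equation (10.10): r = 2 (n . l) n - l."""
    k = 2.0 * dot(n, l)
    return tuple(k * nc - lc for nc, lc in zip(n, l))

# Light arriving at 45 degrees onto a surface whose normal is +y.
n = (0.0, 1.0, 0.0)
l = normalize((1.0, 1.0, 0.0))
r = reflect(l, n)
print(r)          # the mirror direction, leaning toward -x
print(dot(r, r))  # squared length; stays 1 (up to rounding) for unit inputs
```

Be careful with sign conventions when moving to a shading language. The HLSL intrinsic `reflect(i, n)` computes $\mathbf{i}-2\left(\mathbf{i}\cdot \mathbf{n}\right)\mathbf{n}$ for an incident vector $\mathbf{i}$ that points toward the surface, so with the conventions of this section you would pass $-\mathbf{l}$ .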

Now that we know $\mathbf{r}$ , we can compute the specular contribution by using the Phong model for specular reflection (Equation (10.11)).

The Phong Model for Specular Reflection
$\begin{array}{}\text{(10.11)}& {\mathbf{c}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}& =\left({\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right)\left(\mathrm{cos}\theta {\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}=\left({\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right)\left(\mathbf{v}\cdot \mathbf{r}{\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}.\end{array}$

In this formula and elsewhere in this book, the symbol $\otimes$ denotes componentwise multiplication of colors. Let's look at the inputs to this formula in more detail.

First, let's consider ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ , which is the glossiness of the material, also known as the Phong exponent, specular exponent, or just as the material shininess. This controls how wide the “hotspot” is—a smaller ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ produces a larger, more gradual falloff from the hotspot, and a larger ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ produces a very tight hotspot with sharp falloff. (Here we are talking about the hotspot of a reflection, not to be confused with the hotspot of a spot light.) Perfectly reflective surfaces, such as chrome, would have an extremely high value for ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ . When rays of light strike the surface from the incident direction $\mathbf{l}$ , there is very little variation in the reflected directions. They are reflected in a very narrow solid angle (“cone”) surrounding the direction described by $\mathbf{r}$ , with very little scattering. Shiny surfaces that are not perfect reflectors—for example, the surface of an apple—have lower specular exponents, resulting in a larger hotspot. Lower specular exponents model a less perfect reflection of light rays. When rays of light strike the surface at the same incident direction given by $\mathbf{l}$ , there is more variation in the reflected directions. The distribution clusters about the bounce direction $\mathbf{r}$ , but the falloff in intensity as we move away from $\mathbf{r}$ is more gradual. We'll show this difference visually in just a moment.
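To get a feel for how the exponent shapes the hotspot, it can help to tabulate the scalar factor $\left(\mathrm{cos}\theta {\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}$ for a few angles. The Python sketch below is only an illustration; the function name `phong_falloff` and the sample exponents are arbitrary choices of ours, and the clamp to zero is a guard for angles beyond 90 degrees.

```python
import math

def phong_falloff(theta_degrees, m_gls):
    """Scalar part of the Phong specular term: (cos theta) ** m_gls."""
    cos_theta = max(math.cos(math.radians(theta_degrees)), 0.0)  # guard: no negative cosines
    return cos_theta ** m_gls

# Each row shows the falloff as the viewer moves away from the mirror direction r.
angles = (0, 5, 10, 20, 40)
for m_gls in (4, 32, 256):
    row = ["%.3f" % phong_falloff(theta, m_gls) for theta in angles]
    print("m_gls = %3d:" % m_gls, "  ".join(row))
```

Every row starts at 1 when $\theta =0$ ; the larger the exponent, the faster the values drop toward zero, which is the tight hotspot with sharp falloff described above.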

Like all of the material properties that are input to the lighting equation, the value for ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ can vary over the surface, and the specific value for any given location on that surface may be determined in any way you wish, for example with a texture map (see Section 10.5). However, compared to the other material properties, this is relatively rare; in fact it is quite common in real-time graphics for the glossiness value to be a constant for an entire material and not vary over the surface.

Another value in Equation (10.11) related to “shininess” is the material's specular color, denoted ${\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ . While ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ controls the size of the hotspot, ${\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ controls its intensity and color. Highly reflective surfaces will have a higher value for ${\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ , and more matte surfaces will have a lower value. If desired, a specular map$^{14}$ may be used to control the color of the hotspot using a bitmap, much as a texture map controls the color of an object.

The light specular color, denoted ${\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ , is essentially the “color” of the light, which contains both its color and intensity. Although many lights will have a single constant color, the strength of this color will attenuate with distance (Section 10.7.2), and this attenuation is contained in ${\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ in our formulation. Furthermore, even ignoring attenuation, the same light source may shine light of different colors in different directions. For rectangular spot lights, we might determine the color from a gobo, which is a projected bitmap image. A colored gobo might be used to simulate a light shining through a stained glass window, or an animated gobo could be used to fake shadows of spinning ceiling fans or trees blowing in the wind. We use the letter $\mathbf{s}$ to stand for “source.” The subscript “ $\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}$ ” indicates that this color is used for specular calculations. A different light color can be used for diffuse calculations; this is a feature of the lighting model used to achieve special effects in certain circumstances, but it doesn't have any real-world meaning. In practice, ${\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ is almost always equal to the light color used for diffuse lighting, which, not surprisingly, is denoted in this book as ${\mathbf{s}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ .

Figure 10.22 Different values for ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ and ${\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$

Figure 10.22 shows how different values of ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ and ${\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ affect the appearance of an object with specular reflection. The material specular color ${\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ goes from black on the leftmost column to white on the right-most column. The specular exponent ${m}_{\mathrm{g}\mathrm{l}\mathrm{s}}$ is large on the top row and decreases with each subsequent row. Notice that the heads in the left-most column all look the same; since the specular strength is zero, the specular exponent is irrelevant and there is no specular contribution in any case. (The lighting comes from the diffuse and ambient components, which are discussed in Sections 10.6.3 and 10.6.4, respectively.)

Blinn  popularized a slight modification to the Phong model that produces very similar visual results, but at the time was a significant optimization. In many cases, it is still faster to compute today, but beware that vector operations (which are reduced with this model) are not always the performance bottleneck. The basic idea is this: if the distance to the viewer is large relative to the size of an object, then $\mathbf{v}$ may be computed once and then considered constant for an entire object. Likewise for a light source and the vector $\mathbf{l}$ . (In fact, for directional lights, $\mathbf{l}$ is always constant.) However, since the surface normal $\mathbf{n}$ is not constant, we must still compute the reflection vector $\mathbf{r}$ , a computation that we would like to avoid, if possible. The Blinn model introduces a new vector $\mathbf{h}$ , which stands for “halfway” vector and is the result of averaging $\mathbf{v}$ and $\mathbf{l}$ and then normalizing the result:

The halfway vector $\mathbf{h}$ , used in the Blinn specular model
$\mathbf{h}=\frac{\mathbf{v}+\mathbf{l}}{\parallel \mathbf{v}+\mathbf{l}\parallel }.$

Then, rather than using the angle between $\mathbf{v}$ and $\mathbf{r}$ , as the Phong model does, the cosine of the angle between $\mathbf{n}$ and $\mathbf{h}$ is used. The situation is shown in Figure 10.23.

Figure 10.23 Blinn model for specular reflection

The formula for the Blinn model is quite similar to the original Phong model. Only the dot product portion is changed.

The Blinn Model for Specular Reflection
${\mathbf{c}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}=\left({\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right)\left(\mathrm{cos}\theta {\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}=\left({\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right)\left(\mathbf{n}\cdot \mathbf{h}{\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}.$

The Blinn model can be faster to implement in hardware than the Phong model, if the viewer and light source are far enough away from the object to be considered a constant, since then $\mathbf{h}$ is a constant and only needs to be computed once. But when $\mathbf{v}$ or $\mathbf{l}$ may not be considered constant, the Phong calculation might be faster. As we've said, the two models produce similar, but not identical, results (see Fisher and Woo  for a comparison). Both are empirical models, and the Blinn model should not be considered an “approximation” to the “correct” Phong model. In fact, Ngan et al.  have demonstrated that the Blinn model has some objective advantages and more closely matches experimental data for certain surfaces.

One detail we have omitted is that in either model, $\mathrm{cos}\theta$ may be less than zero. In this case, we usually clamp the specular contribution to zero.
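Putting the halfway vector, the dot product, and the clamp together gives a compact routine. This is a hedged Python sketch rather than production shader code; the function names are ours, colors and vectors are plain 3-tuples, and we assume $\mathbf{v}+\mathbf{l}$ is never the zero vector (which would happen only if the viewer and light were in exactly opposite directions).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def halfway(v, l):
    """Halfway vector h = (v + l) / ||v + l||. Assumes v + l is nonzero."""
    return normalize(tuple(vc + lc for vc, lc in zip(v, l)))

def blinn_specular(s_spec, m_spec, n, v, l, m_gls):
    """Blinn specular term: (s_spec (x) m_spec) * max(n . h, 0) ** m_gls."""
    cos_theta = max(dot(n, halfway(v, l)), 0.0)  # clamp negative cosines to zero
    scale = cos_theta ** m_gls
    return tuple(s * m * scale for s, m in zip(s_spec, m_spec))

n = (0.0, 1.0, 0.0)
l = normalize((1.0, 1.0, 0.0))    # toward the light
v = normalize((-1.0, 1.0, 0.0))   # toward the viewer, exactly on the mirror direction
white = (1.0, 1.0, 1.0)
m_spec = (0.5, 0.25, 1.0)

peak = blinn_specular(white, m_spec, n, v, l, 32.0)
# Viewer and light both below the surface: the clamp removes the contribution.
behind = blinn_specular(white, m_spec, n, (0.0, -1.0, 0.0), (0.0, -1.0, 0.0), 32.0)
print(peak, behind)
```

When $\mathbf{v}$ and $\mathbf{l}$ can be treated as constant for an object, `halfway` would be hoisted out and evaluated once, which is the saving the Blinn model was designed to provide.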

## 10.6.3 The Diffuse Component

The next component in the standard lighting model is the diffuse component. Like the specular component, the diffuse component also models light that traveled directly from the light source to the shading point. However, whereas specular light accounts for light that reflects preferentially in a particular direction, diffuse light models light that is reflected randomly in all directions due to the rough nature of the surface material. Figure 10.24 compares how rays of light reflect on a perfectly reflective surface and on a rough surface.

Figure 10.24 Diffuse lighting models scattered reflections

To compute specular lighting, we needed to know the location of the viewer, to see how close the eye is to the direction of the perfect mirror bounce. For diffuse lighting, in contrast, the location of the viewer is not relevant, since the reflections are scattered randomly, and no matter where we position the camera, it is equally likely that a ray will be sent our way. However, the direction of incidence $\mathbf{l}$ , which is dictated by the position of the light source relative to the surface, is important. We've mentioned Lambert's law previously, but let's review it here, since the diffuse portion of Blinn-Phong is the most important place in real-time graphics that it comes into play. If we imagine counting the photons that hit the surface of the object and have a chance of reflecting into the eye, a surface that is perpendicular to the rays of light receives more photons per unit area than a surface oriented at a more glancing angle, as shown in Figure 10.25.

Figure 10.25 Surfaces more perpendicular to the light rays receive more light per unit area

Notice that, in both cases, the perpendicular distance between the rays is the same. (Due to an optical illusion in the diagram, the rays on the right may appear to be farther apart, but they are not.) On the right side of Figure 10.25, however, the rays strike the object at points that are farther apart. The surface on the left receives nine light rays, and the surface on the right receives only six, even though the “area” of both surfaces is the same. Thus the number of photons per unit area$^{15}$ is higher on the left, and it will appear brighter, all other factors being equal. This same phenomenon is responsible for the fact that the climate near the equator is warmer than near the poles. Since Earth is round, the light from the sun strikes Earth at a more perpendicular angle near the equator.

Diffuse lighting obeys Lambert's law: the intensity of the reflected light is proportional to the cosine of the angle between the surface normal and the rays of light. We will compute this cosine with the dot product.

Calculating the Diffuse Component according to Lambert's Law
$\begin{array}{}\text{(10.12)}& {\mathbf{c}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}=\left({\mathbf{s}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\right)\left(\mathbf{n}\cdot \mathbf{l}\right).\end{array}$

As before, $\mathbf{n}$ is the surface normal and $\mathbf{l}$ is a unit vector that points towards the light source. The factor ${\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ is the material's diffuse color, which is the value that most people think of when they think of the “color” of an object. The diffuse material color often comes from a texture map. The diffuse color of the light source is ${\mathbf{s}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ ; this is usually equal to the light's specular color, ${\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ .

Just as with specular lighting, we must prevent the dot product from going negative by clamping it to zero. This prevents objects from being lit from behind.
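Equation (10.12) with the clamp applied is only a few lines of code. The sketch below is an illustrative Python version under our own naming (`lambert_diffuse`), with colors and unit vectors as 3-tuples; it is not the book's implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lambert_diffuse(s_diff, m_diff, n, l):
    """Diffuse term: (s_diff (x) m_diff) * max(n . l, 0)."""
    lambert = max(dot(n, l), 0.0)  # clamp so surfaces facing away receive no light
    return tuple(s * m * lambert for s, m in zip(s_diff, m_diff))

n = (0.0, 0.0, 1.0)
white = (1.0, 1.0, 1.0)
orange = (1.0, 0.5, 0.0)

head_on = lambert_diffuse(white, orange, n, (0.0, 0.0, 1.0))
# Light 60 degrees off the normal: cos(60 degrees) is one half.
sixty = math.radians(60.0)
glancing = lambert_diffuse(white, orange, n, (math.sin(sixty), 0.0, math.cos(sixty)))
from_behind = lambert_diffuse(white, orange, n, (0.0, 0.0, -1.0))
print(head_on, glancing, from_behind)
```

Note that $\mathbf{v}$ does not appear anywhere in the function, which reflects the fact that diffuse lighting is independent of the viewer's location.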

It's very instructive to see how diffuse surfaces are implemented in the framework of the rendering equation.

Diffuse reflection models light that is scattered completely randomly, and any given outgoing direction is equally likely, no matter what the incoming light direction. Thus, the BRDF for a perfectly diffuse surface is a constant.

Note the similarity of Equation (10.12) with the contents of the integral from the rendering equation,

${L}_{\mathrm{in}}\left(\mathbf{x},{\hat{\boldsymbol{\omega}}}_{\mathrm{in}},\lambda \right)\,f\left(\mathbf{x},{\hat{\boldsymbol{\omega}}}_{\mathrm{in}},{\hat{\boldsymbol{\omega}}}_{\mathrm{out}},\lambda \right)\left(-{\hat{\boldsymbol{\omega}}}_{\mathrm{in}}\cdot \hat{\mathbf{n}}\right).$

The first factor is the incident light color. The material color ${\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ is the constant value of the BRDF, which comes next. Finally, we have the Lambert factor.

## 10.6.4 The Ambient and Emissive Components

Specular and diffuse lighting both account for light rays that travel directly from the light source to the surface of the object, “bounce” one time, and then arrive in the eye. However, in the real world, light often bounces off one or more intermediate objects before hitting an object and reflecting to the eye. When you open the refrigerator door in the middle of the night, the entire kitchen will get just a bit brighter, even though the refrigerator door blocks most of the direct light.

To model light that is reflected more than one time before it enters the eye, we can use a very crude approximation known as “ambient light.” The ambient portion of the lighting equation depends only on the properties of the material and an ambient lighting value, which is often a global value used for the entire scene. None of the light sources are involved in the computation. (In fact, a light source is not even necessary.) Equation (10.13) is used to compute the ambient component:

Ambient contribution to the lighting equation
$\begin{array}{}\text{(10.13)}& {\mathbf{c}}_{\mathrm{a}\mathrm{m}\mathrm{b}}={\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\otimes {\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}.\end{array}$

The factor ${\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}$ is the material's “ambient color.” This is almost always the same as the diffuse color (which is often defined using a texture map). The other factor, ${\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}$ , is the ambient light value. We use the notation $\mathbf{g}$ for “global,” because often one global ambient value is used for the entire scene. However, some techniques, such as lighting probes, attempt to provide more localized and direction-dependent indirect lighting.

Sometimes a ray of light travels directly from the light source to the eye, without striking any surface in between. The standard lighting equation accounts for such rays by assigning a material an emissive color. For example, when we render the surface of a light bulb, this surface will probably appear very bright, even if there are no other light sources in the scene, because the light bulb is emitting light.

In many situations, the emissive contribution doesn't depend on environmental factors; it is simply the emissive color of the material:

The emissive contribution depends only on the material
${\mathbf{c}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}={\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}.$

Most surfaces don't emit light, so their emissive component is $\mathbf{0}$ . Surfaces that have a nonzero emissive component are called “self-illuminated.”

It's important to understand that in real-time graphics, a self-illuminated surface does not light the other surfaces—you need a light source for that. In other words, we don't actually render light sources, we only render the effects that those light sources have on the surfaces in the scene. We do render self-illuminated surfaces, but those surfaces don't interact with the other surfaces in the scene. When using the rendering equation properly, however, emissive surfaces do light up their surroundings.

We may choose to attenuate the emissive contribution due to atmospheric conditions, such as fog, and of course there may be performance reasons to have objects fade out and disappear in the distance. However, as explained in Section 10.7.2, in general the emissive contribution should not be attenuated due to distance in the same way that light sources are.

## 10.6.5 The Lighting Equation: Putting It All Together

We have discussed the individual components of the lighting equation in detail. Now it's time to give the complete equation for the standard lighting model.

The standard lighting equation for one light source
${\mathbf{c}}_{\mathrm{l}\mathrm{i}\mathrm{t}}\phantom{\rule{1em}{0ex}}=\phantom{\rule{1em}{0ex}}\begin{array}{rl}& {\mathbf{c}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\\ +& {\mathbf{c}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\\ +& {\mathbf{c}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\\ +& {\mathbf{c}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}\end{array}\phantom{\rule{1em}{0ex}}=\phantom{\rule{1em}{0ex}}\begin{array}{rl}& \left({\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right){\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot \mathbf{h},0\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}\\ +& \left({\mathbf{s}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\right)\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot \mathbf{l},0\right)\\ +& {\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\otimes {\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\\ +& {\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}\end{array}$

Figure 10.26 The visual contribution of each of the components of the lighting equation

Figure 10.26 shows what the ambient, diffuse, and specular lighting components actually look like in isolation from the others. (We are ignoring the emissive component, assuming that this particular floating head doesn't emit light.) There are several interesting points to be noted:

• The ear is lit just as brightly as the nose, even though it is actually in the shadow of the head. For shadows, we must determine whether the light can actually “see” the point being shaded, using techniques such as shadow mapping.
• In the first two images, without ambient light, the side of the head that is facing away from the light is completely black. In order to light the “back side” of objects, you must use ambient light. Placing enough lights in your scene so that every surface is lit directly is the best situation, but it's not always possible. One common hack, which Mitchell et al.  dubbed “Half Lambert” lighting, is to bias the Lambert term, allowing diffuse lighting to “wrap around” to the back side of the model to prevent it from ever being flattened out and lit only by ambient light. This can easily be done by replacing the standard $\mathbf{n}\cdot \mathbf{l}$ term with $\alpha +\left(1-\alpha \right)\left(\mathbf{n}\cdot \mathbf{l}\right)$ , where $\alpha$ is a tunable parameter that specifies the extra wraparound effect. (Mitchell et al. suggest using $\alpha =1/2$ , and they also square the result.) Although this adjustment has little physical basis, it has a very high perceptual benefit, especially considering the small computational cost.
• With only ambient lighting, just the silhouette is visible. Lighting is an extremely powerful visual cue that makes the object appear “3D.” The solution to this “cartoon” effect is to place a sufficient number of lights in the scene so that every surface is lit directly.
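The wraparound bias mentioned in the second point above is easy to experiment with. Here is a small Python sketch of the term $\alpha +\left(1-\alpha \right)\left(\mathbf{n}\cdot \mathbf{l}\right)$ with the optional squaring; the function name `wrapped_lambert`, its parameters, and the extra clamp (which only matters when $\alpha <1/2$ ) are our own assumptions, not part of the published technique.

```python
def wrapped_lambert(n_dot_l, alpha=0.5, square=True):
    """Biased Lambert term: alpha + (1 - alpha) * (n . l), optionally squared.

    n_dot_l is the unclamped dot product in [-1, 1]. The max() guard is our
    own addition; for alpha >= 1/2 the biased value is never negative.
    """
    wrapped = max(alpha + (1.0 - alpha) * n_dot_l, 0.0)
    return wrapped * wrapped if square else wrapped

# Compare the standard clamped term with the wrapped one at a few orientations.
for n_dot_l in (1.0, 0.5, 0.0, -0.5, -1.0):
    standard = max(n_dot_l, 0.0)
    print(n_dot_l, standard, wrapped_lambert(n_dot_l))
```

With $\alpha =1/2$ , a surface edge-on to the light ( $\mathbf{n}\cdot \mathbf{l}=0$ ) still receives some diffuse light, and the term reaches zero only for surfaces facing directly away from the light.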

Speaking of multiple lights, how do multiple light sources work with the lighting equation? We must sum up the lighting values for all the lights. To simplify the notation, we'll go ahead and make the almost universal assumption that ${\mathbf{s}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}={\mathbf{s}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ . Then we can let ${\mathbf{s}}_{j}$ denote the color of the $j$ th light source, including the attenuation factor. The index $j$ goes from $1$ to $n$ , where $n$ is the number of lights. Now the lighting equation becomes

The standard lighting equation for multiple lights
$\begin{array}{rl}{\mathbf{c}}_{\mathrm{l}\mathrm{i}\mathrm{t}}& =\sum _{j=1}^{n}\left[\left({\mathbf{s}}_{j}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right){\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{h}}_{j},0\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}+\left({\mathbf{s}}_{j}\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\right)\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{l}}_{j},0\right)\right]\\ \text{(10.14)}& & \phantom{\rule{1em}{0ex}}+{\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\otimes {\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}+{\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}.\end{array}$

Since there is only one ambient light value and one emissive component for any given surface, these components are not summed per light source.
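As a concrete reading of Equation (10.14), the following Python sketch evaluates the sum at one shading point. It is a minimal illustration under stated assumptions: the `Material` class, the `standard_lighting` function, and the representation of each light as a pair (already attenuated color ${\mathbf{s}}_{j}$ , unit direction ${\mathbf{l}}_{j}$ ) are our own inventions for the example.

```python
import math
from dataclasses import dataclass

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

@dataclass
class Material:
    m_spec: tuple   # specular color
    m_diff: tuple   # diffuse color
    m_amb: tuple    # ambient color (usually the same as m_diff)
    m_emis: tuple   # emissive color
    m_gls: float    # glossiness (specular exponent)

def standard_lighting(material, lights, n, v, g_amb):
    """Evaluate Equation (10.14). lights is a list of (s_j, l_j) pairs."""
    # Ambient and emissive terms are added once, not once per light.
    total = [g * a + e for g, a, e in zip(g_amb, material.m_amb, material.m_emis)]
    for s_j, l_j in lights:
        h_j = normalize(tuple(vc + lc for vc, lc in zip(v, l_j)))
        spec = max(dot(n, h_j), 0.0) ** material.m_gls
        diff = max(dot(n, l_j), 0.0)
        for i in range(3):
            total[i] += s_j[i] * (material.m_spec[i] * spec + material.m_diff[i] * diff)
    return tuple(total)

gray = Material(m_spec=(0.25,) * 3, m_diff=(0.5,) * 3, m_amb=(0.5,) * 3,
                m_emis=(0.0,) * 3, m_gls=32.0)
up = (0.0, 0.0, 1.0)
white = (1.0, 1.0, 1.0)
g_amb = (0.5, 0.5, 0.5)

no_lights = standard_lighting(gray, [], up, up, g_amb)
one_light = standard_lighting(gray, [(white, up)], up, up, g_amb)
two_lights = standard_lighting(gray, [(white, up), (white, up)], up, up, g_amb)
print(no_lights, one_light, two_lights)
```

The sketch makes no attempt to keep the result within a displayable range; with enough lights the sum can exceed 1 in any channel.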

## 10.6.6 Limitations of the Standard Model

If you have read the OpenGL or DirectX documentation for setting material parameters, you are forgiven for thinking that ambient, diffuse, and specular are “how light works” (remember our warning at the beginning of this chapter) as opposed to being arbitrary practical constructs peculiar to a particular lighting model. The dichotomy between diffuse and specular is not an inherent physical reality; rather, it arose (and continues to be used) due to practical considerations. These are descriptive terms for two extreme scattering patterns, and by taking arbitrary combinations of these two patterns, many phenomena can be approximated to a decent degree.

Because of the near unanimous adoption of this model, it is often used without giving it a name, and in fact there is still some confusion as to exactly what to call it. You might call it the Phong lighting model, because Phong introduced the basic idea of modeling reflection as the sum of diffuse and specular contributions, and also provided a useful empirically based calculation for specular reflection. (The Lambert model for diffuse reflection was already known.) We saw that Blinn's computation for specular reflection is similar but sometimes faster. Because this is the specific calculation most often used, perhaps we should call it the Blinn model? But Blinn's name is also attached to a different microfacet model in which diffuse and specular are at different ends of a continuous spectrum, rather than independent “orthogonal” components being mixed together. Since most implementations use Blinn's optimization for Phong's basic idea, the name Blinn-Phong is the one most often used for this model, and that's the name we use.

A huge part of realistic lighting is, of course, realistic shadows. Although the techniques for producing shadows are interesting and important, alas we will not have time to discuss them here. In the theory of the rendering equation, shadows are accounted for when we determine the radiance incident in a given direction. If a light (more accurately, an emissive surface) exists in a particular direction, and the point can “see” that surface, then its light will be incident upon the point. If, however, there is some other surface that obscures the light source when looking in that direction, then the point is in shadow with respect to that light source. More generally, shadows can be cast not just due to the light from emissive surfaces; the light bouncing off reflective surfaces can cause shadows. In all cases, shadows are an issue of light visibility, not reflectance model.

Finally, we would like to mention several important physical phenomena not properly captured by the Blinn-Phong model. The first is Fresnel$^{16}$ reflectance, which predicts that the reflectance of nonmetals is strongest when the light is incident at a glancing angle, and least when incident from the normal angle. Some surfaces, such as velvet, exhibit retroreflection; you might guess this means that the surface looks like Madonna's earrings, but it actually means that the primary direction of reflection is not the “mirror bounce” as predicted by Blinn-Phong, but rather back towards the light source. Finally, Blinn-Phong is isotropic, which means that if we rotate the surface while keeping the viewer and light source stationary, the reflectance will not change. Some surfaces have anisotropic reflection, due to grooves or other patterns in the surface. This means that the strength of the reflection varies, based on the direction of incidence relative to the direction of the grooves, which is sometimes called the scratch direction. Classic examples of anisotropic materials are brushed metal, hair, and those little Christmas ornaments made of shiny fibers.

On modern shader-based hardware, lighting calculations are usually done on a per-pixel basis. By this we mean that for each pixel, we determine a surface normal (whether by interpolating the vertex normal across the face or by fetching it from a bump map), and then we perform the full lighting equation using this surface normal. This is per-pixel lighting, and the technique of interpolating vertex normals across the face is sometimes called Phong shading, not to be confused with the Phong calculation for specular reflection. The alternative to Phong shading is to perform the lighting equation less frequently (per face, or per vertex). These two techniques are known as flat shading and Gouraud shading, respectively. Flat shading is almost never used in practice except in software rendering. This is because most modern methods of sending geometry efficiently to the hardware do not provide any face-level data whatsoever. Gouraud shading, in contrast, still has some limited use on some platforms. Some important general principles can be gleaned from studying these methods, so let's examine their results. When using flat shading, we compute a single lighting value for the entire triangle. Usually the “position” used in lighting computations is the centroid of the triangle, and the surface normal is the normal of the triangle. As you can see in Figure 10.27, when an object is lit using flat shading, the faceted nature of the object becomes painfully apparent, and any illusion of smoothness is lost.

Gouraud shading, also known as vertex shading, vertex lighting, or interpolated shading, is a trick whereby values for lighting, fog, and so forth are computed at the vertex level. These values are then linearly interpolated across the face of the polygon. Figure 10.28 shows the same teapot rendered with Gouraud shading. As you can see, Gouraud shading does a relatively good job at restoring the smooth nature of the object. When the values being approximated are basically linear across the triangle, then, of course, the linear interpolation used by Gouraud shading works well. Gouraud shading breaks down when the values are not linear, as in the case of specular highlights. Compare the specular highlights in the Gouraud shaded teapot with the highlights in a Phong (per-pixel) shaded teapot, shown in Figure 10.29. Notice how much smoother the highlights are. Except for the silhouette and areas of extreme geometric discontinuities, such as the handle and spout, the illusion of smoothness is very convincing. With Gouraud shading, the individual facets are detectable due to the specular highlights.

The basic problem with interpolated shading is that no value in the middle of the triangle can be larger than the largest value at a vertex; highlights can occur only at a vertex. Sufficient tessellation can overcome this problem. Despite its limitations, Gouraud shading is still in use on some limited hardware, such as hand-held platforms and the Nintendo Wii.
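To make this limitation concrete, here is a minimal Python sketch that compares the two approaches at the midpoint of an edge. The helper names, the 30-degree normals, and the glossiness value are all assumptions chosen for illustration; only the Blinn-Phong specular factor itself comes from the text.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lerp(a, b, t):
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def specular(n, h, gloss):
    """Blinn-Phong specular factor max(n.h, 0)^gloss for unit vectors n and h."""
    return max(dot(n, h), 0.0) ** gloss

# Hypothetical setup: two vertex normals tilted 30 degrees to either side
# of the halfway vector h, so the highlight peak lies between the vertices.
tilt = math.radians(30.0)
h = (0.0, 0.0, 1.0)
n0 = normalize((-math.sin(tilt), 0.0, math.cos(tilt)))
n1 = normalize((math.sin(tilt), 0.0, math.cos(tilt)))
gloss = 50.0

# Gouraud: evaluate lighting at the vertices, then interpolate the results.
gouraud_mid = 0.5 * specular(n0, h, gloss) + 0.5 * specular(n1, h, gloss)

# Phong: interpolate the normal, renormalize, then evaluate lighting.
phong_mid = specular(normalize(lerp(n0, n1, 0.5)), h, gloss)

print(gouraud_mid, phong_mid)
```

In this setup the per-pixel result at the midpoint is essentially full intensity, while the interpolated result stays close to zero because neither vertex sees the highlight.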

One question that you should be asking is how the lighting can be computed at the vertex level if any maps are used to control inputs to the lighting equation. We can't use the lighting equation as given in Equation (10.14) directly. Most notably, the diffuse color ${\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ is not usually a vertex-level material property; this value is typically defined by a texture map. In order to make Equation (10.14) more suitable for use in an interpolated lighting scheme, it must be manipulated to isolate ${\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ . We first split the sum and move the constant material colors outside:

$\begin{array}{rl}{\mathbf{c}}_{\mathrm{l}\mathrm{i}\mathrm{t}}& =\sum _{j=1}^{n}\left[\left({\mathbf{s}}_{j}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right){\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{h}}_{j},0\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}+\left({\mathbf{s}}_{j}\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\right)\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{l}}_{j},0\right)\right]\\ & \phantom{\rule{1em}{0ex}}+{\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\otimes {\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}+{\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}\\ & =\sum _{j=1}^{n}\left({\mathbf{s}}_{j}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\right){\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{h}}_{j},0\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}+\sum _{j=1}^{n}\left({\mathbf{s}}_{j}\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\right)\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{l}}_{j},0\right)\\ & \phantom{\rule{1em}{0ex}}+{\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\otimes {\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}+{\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}\end{array}$
$\begin{array}{rl}& =\left[\sum _{j=1}^{n}{\mathbf{s}}_{j}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{h}}_{j},0\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}\right]\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}+\left[\sum _{j=1}^{n}{\mathbf{s}}_{j}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{l}}_{j},0\right)\right]\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\\ & \phantom{\rule{1em}{0ex}}+{\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}\otimes {\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}+{\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}.\end{array}$

Finally, we make the very reasonable assumption that ${\mathbf{m}}_{\mathrm{a}\mathrm{m}\mathrm{b}}={\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ :

A version of the standard lighting equation more suitable for vertex-level lighting computations
$\begin{array}{}\text{(10.15)}& \begin{array}{rl}{\mathbf{c}}_{\mathrm{l}\mathrm{i}\mathrm{t}}& =\left[\sum _{j=1}^{n}{\mathbf{s}}_{j}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{h}}_{j},0\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}\right]\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\\ & \phantom{\rule{1em}{0ex}}+\left[{\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}+\sum _{j=1}^{n}{\mathbf{s}}_{j}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{l}}_{j},0\right)\right]\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\\ & \phantom{\rule{1em}{0ex}}+{\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}.\end{array}\end{array}$

With the lighting equation in the format of Equation (10.15), we can see how to use interpolated lighting values computed at the vertex level. At each vertex, we will compute two values: ${\mathbf{v}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ contains the specular portion of Equation (10.15) and ${\mathbf{v}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}$ contains the ambient and diffuse terms:

Vertex-level diffuse and specular lighting values
$\begin{array}{rlrl}{\mathbf{v}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}& =\sum _{j=1}^{n}{\mathbf{s}}_{j}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{h}}_{j},0\right)}^{{m}_{\mathrm{g}\mathrm{l}\mathrm{s}}}& {\mathbf{v}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}& ={\mathbf{g}}_{\mathrm{a}\mathrm{m}\mathrm{b}}+\sum _{j=1}^{n}{\mathbf{s}}_{j}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{a}\mathrm{x}\left(\mathbf{n}\cdot {\mathbf{l}}_{j},0\right).\end{array}$

Each of these values is computed per vertex and interpolated across the face of the triangle. Then, per pixel, the light contributions are multiplied by the corresponding material colors and summed:

Shading pixels using interpolated lighting values
$\begin{array}{rl}{\mathbf{c}}_{\mathrm{l}\mathrm{i}\mathrm{t}}& ={\mathbf{v}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\otimes {\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}\phantom{\rule{thinmathspace}{0ex}}+\phantom{\rule{thinmathspace}{0ex}}{\mathbf{v}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\otimes {\mathbf{m}}_{\mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}}\phantom{\rule{thinmathspace}{0ex}}+\phantom{\rule{thinmathspace}{0ex}}{\mathbf{m}}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}}.\end{array}$

As mentioned earlier, ${\mathbf{m}}_{\mathrm{s}\mathrm{p}\mathrm{e}\mathrm{c}}$ is sometimes a constant color, in which case we could move this multiplication into the vertex shader. But it also can come from a specular map.
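The split between vertex-level and pixel-level work can be sketched in a few lines of Python. This is only an illustration under stated assumptions: colors are RGB tuples, each light is a hypothetical `(s, l, h)` triple holding the light color, the unit vector toward the light, and the unit halfway vector, and the function names are ours rather than those of any particular API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(c, k):
    return tuple(x * k for x in c)

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def modulate(a, b):
    """Component-wise color product (the circled-times operator in the text)."""
    return tuple(x * y for x, y in zip(a, b))

def vertex_lighting(n, lights, g_amb, m_gls):
    """Return (v_spec, v_diff) for one vertex with unit normal n."""
    v_spec = (0.0, 0.0, 0.0)
    v_diff = g_amb  # the ambient term is folded into the diffuse value
    for s, l, h in lights:
        v_spec = add(v_spec, scale(s, max(dot(n, h), 0.0) ** m_gls))
        v_diff = add(v_diff, scale(s, max(dot(n, l), 0.0)))
    return v_spec, v_diff

def shade_pixel(v_spec, v_diff, m_spec, m_diff, m_emis):
    """Combine interpolated lighting values with per-pixel material colors."""
    return add(add(modulate(v_spec, m_spec), modulate(v_diff, m_diff)), m_emis)

# One white light shining straight down the surface normal.
n = (0.0, 0.0, 1.0)
lights = [((1.0, 1.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))]
v_spec, v_diff = vertex_lighting(n, lights, g_amb=(0.1, 0.1, 0.1), m_gls=32.0)
c_lit = shade_pixel(v_spec, v_diff,
                    m_spec=(0.5, 0.5, 0.5),
                    m_diff=(0.2, 0.4, 0.6),  # would typically come from a texture
                    m_emis=(0.0, 0.0, 0.0))
print(c_lit)
```

In a real pipeline, `vertex_lighting` would run once per vertex, the rasterizer would interpolate its two outputs, and `shade_pixel` would run once per pixel.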

What coordinate space should be used for lighting computations? We could perform the lighting computations in world space. Vertex positions and normals would be transformed into world space, lighting would be performed, and then the vertex positions would be transformed into clip space. Or we may transform the lights into modeling space, and perform lighting computations in modeling space. Since there are usually fewer lights than there are vertices, this results in fewer overall vector-matrix multiplications. A third possibility is to perform the lighting computations in camera space.
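The reason any of these spaces can work is that a dot product is unchanged when both vectors are rotated together. The short Python sketch below checks this for the diffuse factor. It assumes a pure rotation for the model-to-world transform (so the inverse is the transpose) and uses row vectors on the left; the matrix, normal, and light direction are made-up values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec_mat(v, m):
    """Row vector times 3x3 matrix."""
    return tuple(sum(v[r] * m[r][c] for r in range(3)) for c in range(3))

def transpose(m):
    return tuple(tuple(m[c][r] for c in range(3)) for r in range(3))

# Hypothetical model-to-world rotation of 40 degrees about the z-axis.
angle = math.radians(40.0)
model_to_world = (
    (math.cos(angle), math.sin(angle), 0.0),
    (-math.sin(angle), math.cos(angle), 0.0),
    (0.0, 0.0, 1.0),
)
world_to_model = transpose(model_to_world)  # valid only for pure rotations

n_model = (1.0, 0.0, 0.0)  # vertex normal in modeling space
l_world = (0.6, 0.0, 0.8)  # unit vector toward the light in world space

# Option 1: transform the normal of every vertex into world space.
n_dot_l_world = dot(vec_mat(n_model, model_to_world), l_world)

# Option 2: transform the light once into modeling space.
n_dot_l_model = dot(n_model, vec_mat(l_world, world_to_model))

print(n_dot_l_world, n_dot_l_model)
```

Option 2 transforms one vector per light per object rather than one normal per vertex, which is where the savings mentioned above come from. If the transform includes nonuniform scale, normals need separate treatment and this simple check no longer applies.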

# 10.7 Light Sources

In the rendering equation, light sources produce their effect when we factor in the emissive component of a surface. As mentioned earlier, in real-time graphics, doing this “properly” with emissive surfaces is usually a luxury we cannot afford. Even in offline situations where it can be afforded, we might have reasons to just emit light out of nowhere, to make it easier to get control of the look of the scene for dramatic lighting, or to simulate the light that would be reflecting from a surface for which we're not wasting time to model geometry since it's off camera. Thus we usually have light sources that are abstract entities within the rendering framework with no surface geometry to call their own. This section discusses some of the most common types of light sources.

Section 10.7.1 covers the classic point, directional, and spot lights. Section 10.7.2 considers how light attenuates in the real world and how deviations from this reality are common for practical reasons. The next two sections move away from the theoretically pure territory and into the messy domain of ad-hoc lighting techniques in use in real-time graphics today. Section 10.7.3 presents the subject of Doom-style volumetric lights. Finally, Section 10.7.4 discusses how lighting calculations can be done offline and then used at runtime, especially for the purpose of incorporating indirect lighting effects.

## 10.7.1 Standard Abstract Light Types

This section lists some of the most basic light types that are supported by most rendering systems, even older or limited platforms, such as the OpenGL and DirectX fixed-function lighting pipelines or the Nintendo Wii. Of course, systems with programmable shaders often use these light types, too. Even when completely different methods, such as spherical harmonics, are used at runtime, standard light types are usually used as an offline editing interface.

A point light source represents light that emanates from a single point outward in all directions. Point lights are also called omni lights (short for “omnidirectional”) or spherical lights. A point light has a position and color, which controls not only the hue of the light, but also its intensity. Figure 10.30 shows how 3DS Max represents point lights visually.

Figure 10.30 A point light

As Figure 10.30 illustrates, a point light may have a falloff radius, which controls the size of the sphere that is illuminated by the light. The intensity of the light usually decreases the farther away we are from the center of the light. Although not realistic, it is desirable for many reasons that the intensity drop to zero at the falloff distance, so that the volume of the effect of the light can be bounded. Section 10.7.2 compares real-world attenuation with the simplified models commonly used. Point lights can be used to represent many common light sources, such as light bulbs, lamps, fires, and so forth.

A spot light is used to represent light from a specific location in a specific direction. These are used for lights such as flashlights, headlights, and of course, spot lights! A spot light has a position and an orientation, and optionally a falloff distance. The shape of the lit area is either a cone or a pyramid.

A conical spot light has a circular “bottom.” The width of the cone is defined by a falloff angle (not to be confused with the falloff distance). Also, there is an inner angle that measures the size of the hotspot. A conical spot light is shown in Figure 10.31.

Figure 10.31 A conical spot light

A rectangular spot light forms a pyramid rather than a cone. Rectangular spot lights are especially interesting because they are used to project an image. For example, imagine walking in front of a movie screen while a movie is being shown. This projected image goes by many names, including projected light map, gobo, and even cookie.17 The term gobo originated from the world of theater, where it refers to a mask or filter placed over a spot light used to create a colored light or special effect, and it's the term we use in this book. Gobos are very useful for faking shadows and other lighting effects. If conical spot lights are not directly supported, they can be implemented with an appropriately designed circular gobo.

A directional light represents light emanating from a point in space sufficiently far away that all the rays of light involved in lighting the scene (or at least the object we are currently considering) can be considered as parallel. The sun and moon are the most obvious examples of directional lights, and certainly we wouldn't try to specify the actual position of the sun in world space in order to properly light the scene. Thus directional lights usually do not have a position, at least as far as lighting calculations are concerned, and they usually do not attenuate. For editing purposes, however, it's often useful to create a “box” of directional light that can be moved around and placed strategically, and we might include additional attenuation factors to cause the light to drop off at the edge of the box. Directional lights are sometimes called parallel lights. We might also use a gobo on a directional light, in which case the projection of the image is orthographic rather than perspective, as it is with rectangular spot lights.

As we've said, in the rendering equation and in the real world, lights are emissive surfaces with finite surface areas. Abstract light types do not have any surface area, and thus require special handling during integration. Typically in a Monte Carlo integrator, a sample is specifically chosen to be in the direction of the light source, and the multiplication by $d{\hat{\boldsymbol{\omega}}}_{\mathrm{in}}$ is ignored. Imagine if, rather than the light coming from a single point, it comes instead from a disk of some nonzero surface area that is facing the point being illuminated. Now imagine that we shrink the area of the disk down to zero, all the while increasing the radiosity (energy flow per unit area) from the disk such that radiant flux (total energy flow) remains constant. An abstract light can be considered the result of this limiting process in a manner very similar to a Dirac delta (see Section 12.4.3). The radiosity is infinite, but the flux is finite.

While the light types discussed so far are the classic ones supported by fixed-function real-time pipelines, we certainly are free to define light volumes in any way we find useful. The volumetric lights discussed in Section 10.7.3 are an alternative system that is flexible and also amenable to real-time rendering. Warn and Barzel discuss more flexible systems for shaping lights in greater detail.

## 10.7.2 Light Attenuation

Light attenuates with distance. That is, objects receive less illumination from a light as the distance between the light and the object increases. In the real world, the intensity of a light is inversely proportional to the square of the distance between the light and the object, as

Real-world light attenuation
$\begin{array}{}\text{(10.16)}& \frac{{i}_{1}}{{i}_{2}}={\left(\frac{{d}_{2}}{{d}_{1}}\right)}^{2},\end{array}$

where $i$ is the irradiance (the incident radiant power per unit area) and $d$ is the distance. To understand the squaring in real-world attenuation, consider the sphere formed by all the photons emitted from a point light at the same instant. As these photons move outward, a larger and larger sphere is formed by the same number of photons. The density of this photon flow per unit area (the radiant flux density) is inversely proportional to the surface area of the sphere, which is proportional to the square of the radius (see Section 9.3).
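As a quick worked example with made-up distances, suppose an object moves from ${d}_{1}=2$ units away from a point light to ${d}_{2}=6$ units away. Tripling the distance cuts the illumination it receives to one ninth:

$\frac{{i}_{1}}{{i}_{2}}={\left(\frac{{d}_{2}}{{d}_{1}}\right)}^{2}={\left(\frac{6}{2}\right)}^{2}=9.$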

Let's pause here to discuss a finer point: the perceived brightness of an object (or light source) does not decrease with increased distance from the viewer, ignoring atmospheric effects. As a light or object recedes from the viewer, the irradiance on our eye decreases for the reasons just described. However, perceived brightness is related to radiance, not irradiance. Remember that radiance measures power per unit projected area per unit solid angle, and as the object recedes from view, the decrease in irradiance is compensated for by the decrease in solid angle subtended by the object. It's particularly educational to understand how the rendering equation naturally accounts for light attenuation. Inside the integral, for each direction on the hemisphere surrounding the shading point $\mathbf{x}$ , we measure the incident radiance from an emissive surface in that direction. We've just said that this radiance does not attenuate with distance. However, as the light source moves away from $\mathbf{x}$ , it occupies a smaller solid angle on this hemisphere. Thus, attenuation happens automatically in the rendering equation if our light sources have finite area. However, for abstract light sources emanating from a single point (Dirac delta), attenuation must be manually factored in. Because this is a bit confusing, let's summarize the general rule for real-time rendering. Emissive surfaces, which are rendered and have finite area, typically are not attenuated due to distance—but they might be affected by atmospheric effects such as fog. For purposes of calculating the effective light color when shading a particular spot, the standard abstract light types are attenuated.

In practice, Equation (10.16) can be unwieldy for two reasons. First, the light intensity theoretically increases to infinity at $d=0$. (This is a result of the light being a Dirac delta, as mentioned previously.) Barzel describes a simple adjustment that smoothly transitions away from the inverse-square curve near the light origin in order to limit the maximum intensity near the center. Second, the light intensity never falls off completely to zero.

Instead of the real-world model, a simpler model based on falloff distance is often used. Section 10.7.1 mentioned that the falloff distance controls the distance beyond which the light has no effect. It's common to use a simple linear interpolation formula such that the light gradually fades with the distance $d$ :

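Typical linear attenuation model
$\begin{array}{}\text{(10.17)}& i\left(d\right)=\left\{\begin{array}{ll}1& \text{if }d\le {d}_{\mathrm{min}},\\ \frac{{d}_{\mathrm{max}}-d}{{d}_{\mathrm{max}}-{d}_{\mathrm{min}}}& \text{if }{d}_{\mathrm{min}}<d<{d}_{\mathrm{max}},\\ 0& \text{if }d\ge {d}_{\mathrm{max}}.\end{array}\right.\end{array}$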

As Equation (10.17) shows, there are actually two distances used to control the attenuation. Within ${d}_{\mathrm{min}}$ , the light is at full intensity (100%). As the distance goes from ${d}_{\mathrm{min}}$ to ${d}_{\mathrm{max}}$ , the intensity varies linearly from 100% down to 0%. At ${d}_{\mathrm{max}}$ and beyond, the light intensity is 0%. So basically, ${d}_{\mathrm{min}}$ controls the distance at which the light begins to fall off; it is frequently zero, which means that the light begins falling off immediately. The quantity ${d}_{\mathrm{max}}$ is the actual falloff distance, that is, the distance where the light has fallen off completely and no longer has any effect. Figure 10.32 compares real-world light attenuation to the simple linear attenuation model.

Figure 10.32 Real-world light attenuation vs. simple linear attenuation
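The two models are easy to compare in code. The following Python sketch implements the linear model as the text describes it (full intensity inside the minimum distance, zero at and beyond the maximum distance, and a linear ramp in between) alongside the inverse-square relationship of Equation (10.16). The function names and sample distances are our own choices.

```python
def linear_attenuation(d, d_min, d_max):
    """Intensity scale in [0, 1] at distance d for the linear falloff model."""
    if d <= d_min:
        return 1.0
    if d >= d_max:
        return 0.0
    return (d_max - d) / (d_max - d_min)

def inverse_square_ratio(d, d_ref):
    """Intensity at distance d relative to the intensity at distance d_ref."""
    return (d_ref / d) ** 2

# Sample both curves for a light with d_min = 2 and d_max = 10.
for d in (1.0, 2.0, 6.0, 10.0, 12.0):
    print(d, linear_attenuation(d, 2.0, 10.0), inverse_square_ratio(d, 1.0))
```

Note that the inverse-square curve is still nonzero at every sampled distance, whereas the linear model reaches exactly zero at the maximum distance, which is what lets us bound the light's volume of effect.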

Distance attenuation can be applied to point and spot lights; directional lights are usually not attenuated. An additional attenuation factor is used for spot lights. Hotspot falloff attenuates light as we move closer to the edge of the cone.
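The text does not commit to a particular formula for hotspot falloff, so the following Python sketch is just one plausible choice: a linear ramp in the cosine of the angle between the spot axis and the direction to the shaded point. We assume the inner (hotspot) and outer (falloff) angles are given as half-angles measured from the axis; the function name and sample angles are hypothetical.

```python
import math

def hotspot_falloff(cos_angle, inner_half_angle, outer_half_angle):
    """Spot attenuation in [0, 1], given the cosine of the angle off the axis.

    Assumed model: full intensity inside the hotspot, zero outside the
    falloff cone, and a linear ramp in cosine space in between.
    """
    cos_inner = math.cos(inner_half_angle)
    cos_outer = math.cos(outer_half_angle)
    if cos_angle >= cos_inner:
        return 1.0
    if cos_angle <= cos_outer:
        return 0.0
    return (cos_angle - cos_outer) / (cos_inner - cos_outer)

inner = math.radians(15.0)
outer = math.radians(30.0)
on_axis = hotspot_falloff(1.0, inner, outer)
outside = hotspot_falloff(math.cos(math.radians(40.0)), inner, outer)
halfway = hotspot_falloff(0.5 * (math.cos(inner) + math.cos(outer)), inner, outer)
print(on_axis, halfway, outside)
```

Working with cosines avoids an inverse trigonometric call per pixel, since the cosine of the angle is available directly from a dot product of unit vectors. The final light intensity would be this factor multiplied by the distance attenuation.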

## 10.7.3 Doom-style Volumetric Lights

In the theoretical framework of the rendering equation as well as HLSL shaders doing lighting equations using the standard Blinn-Phong model, all that is required of a light source for it to be used in shading calculations at a particular point $\mathbf{x}$ is a light color (intensity) and direction of incidence. This section discusses a type of volumetric light, popularized by the Doom 3 engine (also known as id Tech 4) around 2003, which specifies these values in a novel way. Not only are these types of lights interesting to understand from a practical standpoint (they are still useful today), they are interesting from a theoretical perspective because they illustrate an elegant, fast approximation. Such approximations are the essence of the art of real-time rendering.

The most creative aspect of Doom-style volumetric lights is how they determine the intensity at a given point. It is controlled through two texture maps. One map is essentially a gobo, which can be projected by either orthographic or perspective projection, similar to a spot or directional light. The other map is a one-dimensional map, known as the falloff map, which controls the falloff. The procedure for determining the light intensity at point $\mathbf{x}$ is as follows: $\mathbf{x}$ is multiplied by a $4\times 4$ matrix, and the resulting coordinates are used to index into the two maps. The 2D gobo is indexed using $\left(x/w,y/w\right)$ , and the 1D falloff map is indexed with $z$ . The product of these two texels defines the light intensity at $\mathbf{x}$ .

Figure 10.33 Examples of Doom-style volumetric lights. The figure has three columns (omni, spot, and fake spot) and shows, for each light, an example, its gobo, its falloff map, its projection (orthographic, perspective, and orthographic, respectively), and its bounding box.

The examples in Figure 10.33 will make this clear. Let's look at each of the examples in more detail. The omni light projects the circular gobo orthographically across the box, and places the “position” of the light (which is used to compute the $\mathbf{l}$ vector) in the center of the box. The $4\times 4$ matrix used to generate the texture coordinates in this case is

$\left[\begin{array}{cccc}1/{s}_{x}& 0& 0& 0\\ 0& -1/{s}_{y}& 0& 0\\ 0& 0& 1/{s}_{z}& 0\\ 1/2& 1/2& 1/2& 1\end{array}\right],$
Texture coordinate generation matrix for a Doom-style omni light

where ${s}_{x}$ , ${s}_{y}$ , and ${s}_{z}$ are the dimensions of the box on each axis. This matrix operates on points in the object space of the light, where the position of the light is in the center of the box, so for the matrix that operates on world-space coordinates, we would need to multiply this matrix by a $4\times 4$ world-to-object matrix on the left. Note the right-most column is ${\left[0,0,0,1\right]}^{\mathrm{T}}$ , since we use an orthographic projection onto the gobo. The translation of 1/2 is to adjust the coordinates from the $\left[-1/2,+1/2\right]$ range into the $\left[0,1\right]$ range of the texture. Also, note the flipping of the $y$ -axis, since $+y$ points up in our 3D conventions, but $+v$ points down in the texture.
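To check our understanding of the omni matrix, we can push a couple of points through it. The Python sketch below treats a point as a row vector on the left of the matrix; the function names and the box dimensions are assumptions made for this example.

```python
def vec_mat4(v, m):
    """Row vector times 4x4 matrix."""
    return tuple(sum(v[r] * m[r][c] for r in range(4)) for c in range(4))

def omni_texgen_matrix(sx, sy, sz):
    """Texture coordinate generation matrix for an omni light in a box of size (sx, sy, sz)."""
    return (
        (1.0 / sx, 0.0, 0.0, 0.0),
        (0.0, -1.0 / sy, 0.0, 0.0),
        (0.0, 0.0, 1.0 / sz, 0.0),
        (0.5, 0.5, 0.5, 1.0),
    )

def light_tex_coords(p, m):
    """Return ((u, v), t): the 2D gobo coordinates and the 1D falloff coordinate."""
    x, y, z, w = vec_mat4((p[0], p[1], p[2], 1.0), m)
    return (x / w, y / w), z

# Hypothetical box measuring 4 x 2 x 8 in the object space of the light.
m = omni_texgen_matrix(4.0, 2.0, 8.0)
center = light_tex_coords((0.0, 0.0, 0.0), m)
corner = light_tex_coords((2.0, 1.0, 4.0), m)
print(center, corner)
```

The center of the box maps to the middle of both textures, and the corner at maximum x, y, and z maps to the right edge and top row of the gobo and to the far end of the falloff map. The light intensity would then be the product of the two texels fetched at these coordinates.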

Next, let's look at the spot light. It uses a perspective projection, where the center of projection is at one end of the box. The position of the light used for calculating the $\mathbf{l}$ vector is at this same location, but that isn't always the case! Note that the same circular gobo is used as for the omni, but due to the perspective projection, it forms a cone shape. The falloff map is brightest at the end of the box nearest the center of projection and falls off linearly along the $+z$ axis, which is the direction of projection of the gobo in all cases. Notice that the very first pixel of the spot light falloff map is black, to prevent objects “behind” the light from getting lit; in fact, all of the gobos and falloff maps have black pixels at their edges, since these pixels will be used for any geometry outside the box. (The addressing mode must be set to clamp to avoid the gobo and falloff map tiling across 3D space.) The texture generation matrix for perspective spots is

Texture coordinate generation matrix for a Doom-style spot light
$\left[\begin{array}{cccc}{s}_{z}/{s}_{x}& 0& 0& 0\\ 0& -{s}_{z}/{s}_{y}& 0& 0\\ 1/2& 1/2& 1/{s}_{z}& 1\\ 0& 0& 0& 0\end{array}\right].$
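The same kind of check works for the perspective case. In this Python sketch (again with row vectors on the left, and with hypothetical function names and box dimensions), the divide by $w$ is what turns the box-shaped volume into a cone: every point along a ray through the center of projection lands on the same gobo texel.

```python
def vec_mat4(v, m):
    """Row vector times 4x4 matrix."""
    return tuple(sum(v[r] * m[r][c] for r in range(4)) for c in range(4))

def spot_texgen_matrix(sx, sy, sz):
    """Texture coordinate generation matrix for a perspective spot light."""
    return (
        (sz / sx, 0.0, 0.0, 0.0),
        (0.0, -sz / sy, 0.0, 0.0),
        (0.5, 0.5, 1.0 / sz, 1.0),
        (0.0, 0.0, 0.0, 0.0),
    )

def light_tex_coords(p, m):
    """Return ((u, v), t). Points with z = 0 give w = 0 and must be handled separately."""
    x, y, z, w = vec_mat4((p[0], p[1], p[2], 1.0), m)
    return (x / w, y / w), z

# Hypothetical box measuring 4 x 2 at its far end and 8 deep.
m = spot_texgen_matrix(4.0, 2.0, 8.0)
far_center = light_tex_coords((0.0, 0.0, 8.0), m)
far_corner = light_tex_coords((2.0, 1.0, 8.0), m)
mid_on_same_ray = light_tex_coords((1.0, 0.5, 4.0), m)
print(far_center, far_corner, mid_on_same_ray)
```

The last two points lie on the same ray from the center of projection, so they share gobo coordinates and differ only in the falloff coordinate.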

The “fake spot” on the right is perhaps the most interesting. Here, projection is orthographic, and it is sideways. The conical nature of the light as well as its falloff (what we ordinarily think of as the falloff, that is) are both encoded in the gobo. The falloff map used for this light is the same as for the omni light: it is brightest in the center of the box, and causes the light to fade out as we approach the $-z$ and $+z$ faces of the box. The texture coordinate matrix in this case is actually the same as that for the omni. The entire change comes from using a different gobo, and orienting the light properly!

You should study these examples until you are sure you know how they work.

Doom-style volumetric lights can be attractive for real-time graphics for several reasons:

• They are simple and efficient, requiring only the basic functionality of texture coordinate generation, and two texture lookups. These are flexible operations that are easily hardwired into fixed-function hardware such as the Nintendo Wii.
• Many different light types and effects can be represented in the same framework. This can be helpful to limit the number of different shaders that are needed. Lighting models, light types, material properties, and lighting passes can all be dimensions in the matrix of shaders, and the size of this matrix can grow quite quickly. It can also be useful to reduce the amount of switching of render states.
• Arbitrary falloff curves can be encoded in the gobo and falloff maps. We are not restricted to linear or real-world inverse squared attenuation.
• Due to the ability to control the falloff, the bounding box that contains the lighting volume can usually be relatively tight compared to traditional spot and omni lights. In other words, a large percentage of the volume within the box is receiving significant lighting, and the light falls off more rapidly than for traditional models, so the volume is as small and as tight as possible. Looking at the bottom row of Figure 10.33, compare the size of the box needed to contain the true spot light, versus the fake spot light.
This is perhaps the most important feature behind the introduction of these sorts of lights in Doom 3, which used an accumulated rendering technique with no lightmaps or precomputed lighting; every object was fully lit in real time. Each light was added into the scene by rerendering the geometry within the volume of the light and adding the light's contribution into the frame buffer. Limiting the amount of geometry that had to be redrawn (as well as the geometry that had to be processed for purposes of the stencil shadows that were used) was a huge performance win.

## 10.7.4 Precalculated Lighting

One of the greatest sources of error in the images produced in real time (those positive thinkers among you might say the greatest opportunity for improvement) is indirect lighting: light that has “bounced” at least one time before illuminating the pixel being rendered. This is an extremely difficult problem. A first important step to making it tractable is to break up the surfaces in the scene into discrete patches or sample points. But even with a relatively modest number of patches, we still have to determine which patches can “see” each other and have a conduit of radiance, and which cannot see each other and do not exchange radiance. Then we must solve for the balance of light in the rendering equation. Furthermore, when any object moves, it can potentially alter which patches can see which. In other words, practically any change will alter the distribution of light in the entire scene.

However, it is usually the case that certain lights and geometry in the scene are not moving. In this case, we can perform more detailed lighting calculations (solve the rendering equation more fully), and then use those results, ignoring any error that results due to the difference in the current lighting configuration and the one that was used during the offline calculations. Let's consider several examples of this basic principle.

One technique is lightmapping. In this case, an extra UV channel is used to arrange the polygons of the scene into a special texture map that contains precalculated lighting information. This process of finding a good way to arrange the polygons within the texture map is often called atlasing. In this case, the discrete “patches” that we mentioned earlier are the lightmap texels. Lightmapping works well on large flat surfaces, such as floors and ceilings, which are relatively easy to arrange within the lightmap effectively. But more dense meshes, such as staircases, statues, machinery, and trees, which have much more complicated topology, are not so easily atlased. Luckily, we can just as easily store precomputed lighting values in the vertices, which often works better for relatively dense meshes.

What exactly is the precomputed information that is stored in lightmaps (or vertices)? Essentially, we store incident illumination, but there are many options. One option is the number of samples per patch. If we have only a single lightmap or vertex color, then we cannot account for the directional distribution of this incident illumination and must simply use the sum over the entire hemisphere. (As we have shown in Section 10.1.3, this “directionless” quantity, the incident radiant power per unit area, is properly known as radiosity, and for historical reasons algorithms for calculating lightmaps are sometimes confusingly known as radiosity techniques, even if the lightmaps include a directional component.) If we can afford more than one lightmap or vertex color, then we can more accurately capture the distribution. This directional information is then projected onto a particular basis. We might have each basis function correspond to a single direction. A technique known as spherical harmonics uses sinusoidal basis functions similar to 2D Fourier techniques. The point in any case is that the directional distribution of incident light does matter, but when saving precomputed incident light information, we are usually forced to discard or compress this information.

Another option is whether the precalculated illumination includes direct lighting, indirect light, or both. This decision can often be made on a per-light basis. The earliest examples of lightmapping simply calculated the direct light from each light in the scene for each patch. The primary advantage of this was that it allowed for shadows, which at the time were prohibitively expensive to produce in real time. (The same basic idea is still useful today, only now the goal is usually to reduce the total number of real-time shadows that must be generated.) Then the view could be moved around in real time, but obviously, any lights that were burned into the lightmaps could not move, and if any geometry moved, the shadows would be “stuck” to them and the illusion would break down. An identical runtime system can be used to render lightmaps that also include indirect lighting, although the offline calculations require much more finesse. It is possible for certain lights to have both their direct and indirect lighting baked into the lightmaps, while other lights have just the indirect portion included in the precalculated lighting and direct lighting done at runtime. This might offer advantages, such as shadows with higher precision than the lightmap texel density, improved specular highlights due to the correct modeling of the direction of incidence (which is lost when the light is burned into the lightmaps), or some limited ability to dynamically adjust the intensity of the light or turn it off or change its position. Of course, the presence of precalculated lighting for some lights doesn't preclude the use of completely dynamic techniques for other lights.

The lightmapping techniques just discussed work fine for static geometry, but what about dynamic objects such as characters, vehicles, platforms, and items? These must be lit dynamically, which makes the inclusion of indirect lighting challenging. One technique, popularized by Valve's Half-Life 2, is to strategically place light probes at various locations in the scene. At each probe, we render a cubic environment map offline. When rendering a dynamic object, we locate the nearest probe and use this probe to get localized indirect lighting. There are many variations on this technique. For example, we might use one environment map for diffuse reflection of indirect light, where each sample is prefiltered to contain the entire cosine-weighted hemisphere surrounding this direction, and a different cubic map for specular reflection of indirect light, which does not have this filtering.
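The runtime side of the probe lookup can be sketched very simply. The Python below is a brute-force illustration only: the probe names and positions are invented, each probe is a hypothetical `(name, position)` pair standing in for a prerendered environment map, and a production system would more likely use a spatial data structure and might blend several probes.

```python
def distance_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_probe(position, probes):
    """Return the (name, position) probe closest to the given position."""
    return min(probes, key=lambda probe: distance_squared(position, probe[1]))

# Hypothetical probes placed by a level designer.
probes = [
    ("hallway", (0.0, 1.5, 0.0)),
    ("atrium", (12.0, 3.0, 4.0)),
    ("stairwell", (5.0, 1.5, -8.0)),
]

character_position = (10.0, 1.0, 3.0)
chosen = nearest_probe(character_position, probes)
print(chosen[0])
```

Comparing squared distances avoids a square root per probe without changing which probe wins.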

# 10.8 Skeletal Animation

The animation of human creatures is certainly of great importance in video games and in computer graphics in general. One of the most important techniques for animating characters is skeletal animation, although it is certainly not limited to this purpose. The easiest way to appreciate skeletal animation is to compare it to other alternatives, so let's review those first.

Let's say we have created a model of a humanoid creature such as a robot. How do we animate it? Certainly, we could treat it like a chess piece and move it around just like a box of microwavable herring sandwiches or any other solid object, but this is obviously not very convincing. Creatures are articulated, meaning they are composed of connected, movable parts. The simplest method of animating an articulated creature is to break the model up into a hierarchy of connected parts (left forearm, left upper arm, left thigh, left shin, left foot, torso, head, and so on) and animate this hierarchy. An early example of this was Dire Straits' Money for Nothing music video. Newer examples include practically every game for the original PlayStation, such as the first Tomb Raider. The common feature here is that each part is still rigid; it does not be