Compute performance
Now comes the biggest debate: compute performance versus gaming performance. Allow us to clear the air on this matter a bit. First we will discuss what compute performance is and how it works, and then we will do the same for gaming performance and look at the differences between the two, if any.
What is Compute Performance in a GPU?
GPUs were initially made for playing computer games, and after two decades their main purpose remains the same. They are special-purpose processing units built to handle a large number of pixels in parallel by performing two main tasks: Pixel Shading and Vertex Shading. After the unified shader architecture was adopted in GPU designs, starting with the Nvidia GeForce 8000 and ATI/AMD HD 2000 series, the GPU business began to show many different aspects apart from gaming. Before this, a GPU used to consist of several separate Pixel Shader and Vertex Shader units; the former performed the various pixel operations whereas the latter performed the coordinate-related calculations.
But in the unified shader model there is a single kind of unit, which we know as the Stream Processor, capable of performing both vertex and pixel operations, and a GPU has many of them, resulting in hugely parallel processing of vertices and pixels. Each of those units has its own resources such as execution units, registers, cache, etc. The GPU can now be viewed as a SIMD or Single Instruction, Multiple Data machine, where a single instruction can be performed over a huge number of data elements in parallel. This resembles the traditional design of vector processors, which have simple but numerous execution units to perform a single operation over a large dataset.
Consider the following Pseudo code:
int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int b[10] = {11, 12, 13, 14, 15, 16, 17, 18, 19, 20};
int c[10];
for (int i = 0; i < 10; i++)
{ c[i] = a[i] + b[i]; }
In the above example,
- a and b are two one-dimensional arrays, and each of them holds 10 elements.
- For an array, its values are stored in contiguous memory locations. Say the values of a are stored in main memory, or RAM, at locations 101 to 110 (RAM is divided into a huge number of memory blocks, each of them having the same capacity), and the values of b at locations 201 to 210.
- For c, memory space is allocated starting at 301 and ending at 310. Those locations are merely allocated and do not contain any values yet.
Now if you look at the for loop in the code, you will see that there are 10 ADD instructions, and each instruction operates over a different set of values: different elements of a and b.
This is exactly how a conventional CPU sees it.
So here is how it will be executed on our CPU:
Set Counter to 0
LOOP 10 times
    Fetch the instruction c[i] = a[i] + b[i];
    Decode the instruction;
    Fetch the value of a[Counter] from memory location 101 + Counter (101 is the base address of a) into Register A;
    Fetch the value of b[Counter] from memory location 201 + Counter (201 is the base address of b) into Register B;
    Add Register A and Register B and store the result into Register C;
    Write the result to memory location 301 + Counter (301 is the base address of c);
    Counter = Counter + 1;
End Loop
So the CPU is basically looping 10 times, i.e., the size of the array: fetching the addition instruction each time, fetching the values of a and b for that iteration, adding them, and storing the result back into the memory location allocated for c in that iteration. But if you look closely, you will find that we actually have a single instruction that operates over 10 sets of data, not 10 separate instructions. This can be sped up by pipelining, i.e., overlapping the instructions, but there is no true parallel processing here. A multicore CPU is not going to help much either, since this loop does not create multiple threads; the whole thing will be allocated to a single core of the CPU.
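To make those steps concrete, here is a minimal, runnable C sketch of that scalar execution. The explicit register variables and base-address arithmetic are our own illustration of the steps above; a real compiler handles register allocation itself.

#include <stdio.h>

int main(void)
{
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int b[10] = {11, 12, 13, 14, 15, 16, 17, 18, 19, 20};
    int c[10];
    int *baseA = a, *baseB = b, *baseC = c;  /* base addresses of a, b and c */

    for (int counter = 0; counter < 10; counter++)
    {
        int regA = *(baseA + counter);  /* fetch a[counter] into Register A */
        int regB = *(baseB + counter);  /* fetch b[counter] into Register B */
        int regC = regA + regB;         /* ADD: one instruction, one data set */
        *(baseC + counter) = regC;      /* write the result back to c's memory */
    }

    for (int i = 0; i < 10; i++)
        printf("%d ", c[i]);            /* prints 12 14 16 ... 30 */
    printf("\n");
    return 0;
}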
Now let us see how a Vector Processor would do it.
Unlike a normal CPU, a Vector Processor has multiple execution units, and unlike a normal CPU register, which can hold only a single data item, a vector register can hold multiple data items. So consider our hypothetical Vector Processor, which has the following components:
Instruction Queue: Holds the instructions to be executed sequentially. But unlike in a scalar CPU, where each instruction operates over a single dataset, here each instruction operates on multiple data items.
Vector Execution Units: Here we have 10 execution units, each of which can perform a single operation over a single set of data. Hence the 10 of them together can perform a single operation over 10 datasets.
Vector Fetch and Decode Unit: This unit fetches and decodes the data required for a vector operation. But unlike a traditional CPU, which fetches a single data item at a time, the vector fetch unit fetches a whole vector, i.e., the whole set of data over which the vector operation needs to be performed.
Scalar Registers: A vector processor also has a set of scalar registers (in our case, three), and each of them can hold a single data item. Here we are using those registers to hold the start addresses of the arrays a, b and c respectively.
Program Counter: Holds the count of the data items to be fetched. Here it is holding the length of the vector, which is 10 in our case.
Scalar ALU (Arithmetic Logic Unit): A basic execution unit that can execute one arithmetic instruction over a single set of data at a time, like a conventional CPU execution unit. It executes the instructions that only need to operate on a single data item, saving the precious vector execution units, and it also helps evaluate conditional statements. Here we are using it to generate the end addresses of array a and array b by adding the vector length (10 in our case) to the start address of each array (101 for a, 201 for b).
Vector Write Unit: Writes a whole vector to memory in one operation, unlike a normal CPU write, where each of the elements needs to be written sequentially. In our case, it takes the starting location of vector C, derives the end location from the Program Counter, and performs the write operation, storing all the values of vector C into the memory locations allocated for array c.
Here is the algorithm for vector execution:
- Get the starting addresses and the length of array a and array b.
- Set the Program Counter to 10, i.e., the length of those arrays.
- Fetch memory locations 101 to 110 into Vector Register A. Now A holds all the values of array a.
- Fetch memory locations 201 to 210 into Vector Register B. Now B holds all the values of array b.
- Perform Vector C = Vector A + Vector B in a single step.
- Save the results back to memory locations 301 to 310, i.e., the memory allocated to array c.
You can see here that, instead of looping 10 times, a Vector Processor performs the addition operation over all 10 data elements simultaneously.
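To tie the components and the algorithm together, here is a toy C simulation of this hypothetical vector processor. Everything here (the flat memory array, the VectorReg type, the function names) is our own illustration; the point is only that fetch, add and write each handle the whole 10-element vector as a single operation.

#include <stdio.h>

#define VLEN 10                    /* vector length = number of execution units */

int memory[512];                   /* flat "RAM"; we use cells 101-110, 201-210, 301-310 */

typedef struct { int data[VLEN]; } VectorReg;

/* Vector Fetch Unit: brings a whole vector in from memory in one operation. */
VectorReg vec_fetch(int base)
{
    VectorReg r;
    for (int i = 0; i < VLEN; i++)
        r.data[i] = memory[base + i];
    return r;
}

/* The 10 execution units: conceptually all additions happen at the same time. */
VectorReg vec_add(VectorReg x, VectorReg y)
{
    VectorReg r;
    for (int i = 0; i < VLEN; i++)
        r.data[i] = x.data[i] + y.data[i];
    return r;
}

/* Vector Write Unit: stores a whole vector back to memory in one operation. */
void vec_write(int base, VectorReg v)
{
    for (int i = 0; i < VLEN; i++)
        memory[base + i] = v.data[i];
}

int main(void)
{
    int baseA = 101, baseB = 201, baseC = 301;  /* the three scalar registers */

    for (int i = 0; i < VLEN; i++)              /* fill arrays a and b in "RAM" */
    {
        memory[baseA + i] = i + 1;              /* a = 1..10  */
        memory[baseB + i] = i + 11;             /* b = 11..20 */
    }

    VectorReg A = vec_fetch(baseA);             /* Vector Register A = all of a */
    VectorReg B = vec_fetch(baseB);             /* Vector Register B = all of b */
    vec_write(baseC, vec_add(A, B));            /* C = A + B, then one vector write */

    for (int i = 0; i < VLEN; i++)
        printf("%d ", memory[baseC + i]);       /* prints 12 14 ... 30 */
    printf("\n");
    return 0;
}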
This model is similar to the SIMD model discussed earlier, and that is the reason modern-day GPUs can be programmed to operate as vector processors:
- Each Stream Processor of a GPU is capable of performing a single operation independently, so many of them together can serve as the multiple execution units of a vector processor.
- Stream Processors are programmable through software and can therefore emulate the SIMD design.
- A GPU has very fast memory and a very wide memory bus (384-bit or 256-bit on high-end cards) and can provide the huge memory bandwidth that is essential for vector operations, since they need the whole dataset for a vector at once, unlike a single data item per iteration.
But the problem is that, unlike a CPU, a GPU supports only very specific types of tasks in hardware. A CPU, on the other hand, supports a huge generic instruction set, which gives programmers enormous flexibility while coding. A CPU supports complex instructions, or macro instructions, which are automatically divided into smaller operations at execution time without the programmer having to worry about it. That is not the case with a GPU, where the programmer has to perform several smaller operations to implement a complex instruction, and those operations differ from one architecture to another.
But the scenario is changing, as multiple APIs and program libraries are coming into the picture to take this burden off the developers; CUDA, OpenCL and DirectCompute are examples of widely used ones. These APIs target the SIMD model of the GPU, and apart from CUDA, which is Nvidia-only, the others can implement a generic code path for all the different GPU designs. Even CUDA's limitation is more superficial than technical, as it shares a lot with OpenCL.
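For a flavour of what this looks like in practice, here is a minimal CUDA sketch of the same ten-element addition (our own illustration; the kernel name vecAdd and the one-block launch are arbitrary choices, and error checking is omitted for brevity). Each GPU thread, running on a stream processor, adds exactly one pair of elements, so all ten additions proceed in parallel:

#include <stdio.h>

/* Kernel: every thread computes one element of c. */
__global__ void vecAdd(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's index */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int b[10] = {11, 12, 13, 14, 15, 16, 17, 18, 19, 20};
    int c[10];
    int *da, *db, *dc;

    /* Copy the inputs into the GPU's own memory. */
    cudaMalloc((void **)&da, sizeof a);
    cudaMalloc((void **)&db, sizeof b);
    cudaMalloc((void **)&dc, sizeof c);
    cudaMemcpy(da, a, sizeof a, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, sizeof b, cudaMemcpyHostToDevice);

    vecAdd<<<1, 10>>>(da, db, dc, 10);   /* launch 10 threads, one per element */

    cudaMemcpy(c, dc, sizeof c, cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);

    for (int i = 0; i < 10; i++)
        printf("%d ", c[i]);             /* prints 12 14 ... 30 */
    printf("\n");
    return 0;
}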
Now let's read up a few points about gaming performance. Next page, please.