This project is a cubic panorama mapping converter implemented with CUDA. The code is available on GitHub: PanoramaToCubeX

It is also a course project for this year's Parallel course.

# Introduction

A panorama is any wide-angle view or representation of a physical space, whether in painting, drawing, photography, film, seismic images or a three-dimensional model. A panoramic view is also purposed for multi-media, cross-scale applications to an outline overview (from a distance) along and across repositories.

Today, panorama images are widely used in computer graphics systems and products like Google Street View and EveryScape. There are four typical types of panorama structures: the cylindrical, skydome, spherical and cubic panorama image. In this article, we focus on the two most commonly used structures, the spherical panorama and the cubic panorama.

Spherical panoramas are the most widely used today in panorama photography and computer graphics, thanks to their ease of construction and efficient use of memory. However, rendering a *real* sphere is not an easy job. We must use a large number of faces to make the spherical panorama mapping correct, or we get significant distortion. On the other hand, rendering too many faces can be a heavy task for mobile devices, which often lack a sufficiently powerful processor. Mapping spherical panoramas to cubic ones therefore becomes an economical way to balance rendering quality and rendering performance.

(typical panorama image structure)

# Spherical and cubic panoramas

Let's start with cylindrical panoramas, which are commonly used because of their ease of construction. To build this kind of panorama, a sequence of images is taken by a camera mounted on a leveled tripod. If the focal length or field of view (fov) of the camera is known, each perspective image can be warped into cylindrical coordinates. To save a cylindrical panorama image, we map world coordinates (X, Y, Z) to 2D cylindrical screen coordinates (theta, vartheta) using
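
The cylindrical formula is not reproduced in this copy of the article. A common form, assuming the Z-up axis convention used by the code later in this article and writing vartheta for the height normalized by the distance to the vertical axis, would be:

$$
\theta = \operatorname{atan2}(Y, X), \qquad \vartheta = \frac{Z}{\sqrt{X^2 + Y^2}}
$$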

Similarly, spherical panoramas can be constructed as 2D spherical coordinates (theta, phi) using
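
The spherical formula is also missing here. A reconstruction consistent with the `atan2` and `hypot` calls in the (theta, phi, r) stage later in this article, offered as a sketch and not as the original equation, is:

$$
\theta = \operatorname{atan2}(Y, X), \qquad \phi = \operatorname{atan2}\!\left(Z, \sqrt{X^2 + Y^2}\right)
$$

Here theta is the azimuth in [-pi, pi] and phi is the elevation in [-pi/2, pi/2].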

Cubic panoramas split the image into 6 squares. Each square corresponds to one face of a perspective-mapped sky box. Taking the front face as an example, we apply a perspective mapping that sends the world coordinates p = (X, Y, Z) to p' = (1, Y/X, Z/X) on the plane X = 1.
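
As a small illustration of that front-face mapping, the sketch below projects a world point onto the plane X = 1 by dividing by X. It assumes the front face is the square with Y and Z in [-1, 1] on that plane, which matches the face tables later in this article; `FacePoint` and `project_to_front_face` are hypothetical names introduced only for this example.

```c
#include <stdbool.h>

/* Coordinates of a projected point on the front face (plane X = 1). */
typedef struct { float y, z; } FacePoint;

/* Perspective-project (x, y, z) onto the plane X = 1.
 * Returns false when the point does not land on the front face. */
bool project_to_front_face(float x, float y, float z, FacePoint *out)
{
    if (x <= 0.0f) {
        return false;               /* point is not in front of the viewer */
    }
    float py = y / x;               /* perspective divide */
    float pz = z / x;
    if (py < -1.0f || py > 1.0f || pz < -1.0f || pz > 1.0f) {
        return false;               /* falls on one of the other five faces */
    }
    out->y = py;
    out->z = pz;
    return true;
}
```

The other five faces would follow the same pattern with a different dominant axis.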

(pixel mapping between spherical panoramas and cubic panoramas)

# Pipeline

To calculate the pixel mapping from spherical to cubic, the converter runs a four-stage pipeline:

(four stage pipeline)

## In-face Coordinates Mapping

In this stage, the converter maps (X, Y), the 2D pixel coordinates on the output cubic panorama image, into (face, inFaceX, inFaceY), where *face* identifies a face of the cube, and *inFaceX* and *inFaceY* locate the pixel on that face. To reduce branching between different functions for different faces, we designed the *faceOffset* matrix so that

```
X = inFaceX + faceOffset[face][0];
Y = inFaceY + faceOffset[face][1];
```

The *faceOffset* matrix is given by

```
// face corner offset
int faceOffset[6][2] = {
    {0, outBlockWidth},                   // 0 top
    {outBlockWidth, 0},                   // 1 left
    {outBlockWidth, outBlockWidth},       // 2 front
    {outBlockWidth, outBlockWidth << 1},  // 3 right
    {outBlockWidth, outBlockWidth * 3},   // 4 back
    {outBlockWidth << 1, outBlockWidth}   // 5 down
};
```
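
Going the other way, from an output pixel back to its face, can be sketched as a search over the same offsets. This is only an illustration: `locate_face` is a hypothetical helper, and it assumes that the first coordinate is the row and the second the column of a cross layout that is 3 blocks tall and 4 blocks wide, which is how the offsets above read.

```c
/* Find the face containing output pixel (x, y) and its in-face position.
 * Returns the face index 0..5, or -1 if the pixel lies in the unused
 * padding of the cross layout. */
int locate_face(int x, int y, int outBlockWidth, int *inFaceX, int *inFaceY)
{
    const int w = outBlockWidth;
    /* Same values as the faceOffset matrix in the article. */
    const int faceOffset[6][2] = {
        {0, w}, {w, 0}, {w, w}, {w, 2 * w}, {w, 3 * w}, {2 * w, w}
    };
    for (int face = 0; face < 6; ++face) {
        int dx = x - faceOffset[face][0];
        int dy = y - faceOffset[face][1];
        if (dx >= 0 && dx < w && dy >= 0 && dy < w) {
            *inFaceX = dx;
            *inFaceY = dy;
            return face;
        }
    }
    return -1;
}
```

In the converter itself this search is not needed when each thread is already assigned a face and an in-face position; the sketch only shows that the offsets define a unique face for every non-padding pixel.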

## World Coordinates Reconstruction

In this stage, (face, inFaceX, inFaceY) is mapped to world coordinates (X, Y, Z) on the cube surface. As before, we use lookup matrices to reduce branching.

```
// 2D to 3D coords factor
// used to convert 2D coords
// [0..outBlockWidth - 1] to
// 3D coords[-1..1] on a cube
int face3DCoordsFactor[6][3][2] = {
{ // 0 top
{1, 0}, {0, 1}, {0, 0},
},
{ // 1 left
{0, 1}, {0, 0}, {-1, 0},
},
{ // 2 front
{0, 0}, {0, 1}, {-1, 0},
},
{ // 3 right
{0, -1}, {0, 0}, {-1, 0},
},
{ // 4 back
{0, 0}, {0, -1}, {-1, 0},
},
{ // 5 down
{-1, 0}, {0, 1}, {0, 0},
},
};
// 2D to 3D coords constant
// used to convert 2D coords
// [0..outBlockWidth - 1] to
// 3D coords[-1..1] on a cube
int face3DCoordsConstant[6][3] = {
{-1, -1, 1}, // 0 top
{-1, -1, 1}, // 1 left
{1, -1, 1}, // 2 front
{1, 1, 1}, // 3 right
{-1, 1, 1}, // 4 back
{1, -1, -1} // 5 down
};
```

Then we can calculate the world coordinates by

```
// 2D coords in this face within [0..2]
float inFaceX_2 =
(2.f * inFaceX) / outBlockWidth;
float inFaceY_2 =
(2.f * inFaceY) / outBlockWidth;
// 3D coords on the cube
// constant term plus the per-axis factors for this face
float cubeX = face3DCoordsConstant[face][0] +
    face3DCoordsFactor[face][0][0] * inFaceX_2 +
    face3DCoordsFactor[face][0][1] * inFaceY_2;
float cubeY = face3DCoordsConstant[face][1] +
    face3DCoordsFactor[face][1][0] * inFaceX_2 +
    face3DCoordsFactor[face][1][1] * inFaceY_2;
float cubeZ = face3DCoordsConstant[face][2] +
    face3DCoordsFactor[face][2][0] * inFaceX_2 +
    face3DCoordsFactor[face][2][1] * inFaceY_2;
```
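
To check the tables by hand, the following self-contained CPU sketch wraps the same arithmetic in a function. `Vec3` and `face_pixel_to_cube` are hypothetical names for this example; the table values are copied from the matrices above, and each coordinate is the face constant plus the two factors times the scaled in-face position.

```c
typedef struct { float x, y, z; } Vec3;

static const int face3DCoordsFactor[6][3][2] = {
    {{1, 0},  {0, 1},  {0, 0}},    /* 0 top   */
    {{0, 1},  {0, 0},  {-1, 0}},   /* 1 left  */
    {{0, 0},  {0, 1},  {-1, 0}},   /* 2 front */
    {{0, -1}, {0, 0},  {-1, 0}},   /* 3 right */
    {{0, 0},  {0, -1}, {-1, 0}},   /* 4 back  */
    {{-1, 0}, {0, 1},  {0, 0}}     /* 5 down  */
};

static const int face3DCoordsConstant[6][3] = {
    {-1, -1, 1}, {-1, -1, 1}, {1, -1, 1},
    {1, 1, 1},   {-1, 1, 1},  {1, -1, -1}
};

/* Map an in-face pixel to a point on the cube surface in [-1, 1]^3. */
Vec3 face_pixel_to_cube(int face, int inFaceX, int inFaceY, int outBlockWidth)
{
    /* Scale pixel indices into [0..2]. */
    float u = (2.f * inFaceX) / outBlockWidth;
    float v = (2.f * inFaceY) / outBlockWidth;
    float c[3];
    for (int axis = 0; axis < 3; ++axis) {
        c[axis] = face3DCoordsConstant[face][axis]
                + face3DCoordsFactor[face][axis][0] * u
                + face3DCoordsFactor[face][axis][1] * v;
    }
    Vec3 p = { c[0], c[1], c[2] };
    return p;
}
```

For instance, with an outBlockWidth of 4, the center pixel (2, 2) of the front face lands at (1, 0, 0), and pixel (0, 0) of the top face lands at the corner (-1, -1, 1).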

## (theta, phi, r) Coordinates Mapping

In this stage, the converter maps the Cartesian coordinates (X, Y, Z) on the cube to (theta, phi, r) polar coordinates, which is the representation used by the spherical panorama.

```
float theta = atan2(cubeY, cubeX);
float r = hypot(cubeX, cubeY);
float phi = atan2(cubeZ, r);
```

## Panorama Four Corner Mapping

In the final stage, the converter calculates the final pixel position in the spherical panorama image. Note that the pixel mapping is not one-to-one, so to make the result look smoother we use a trick here: we find the four surrounding corner pixels and their weights, and take the weighted average of their colors.

```
float uf =
(2.f * outBlockWidth * (theta + M_PI) / M_PI);
float vf =
(2.f * outBlockWidth * (M_PI_2 - phi) / M_PI);
int ui = floor(uf);
int vi = floor(vf);
int ui_2 = ui + 1;
int vi_2 = vi + 1;
float mu = uf - ui;
float nu = vf - vi;
#define cornerA_x \
(_clip(vi, 0, inImageHeight - 1))
#define cornerA_y (ui % inImageWidth)
#define cornerB_x \
(_clip(vi, 0, inImageHeight - 1))
#define cornerB_y (ui_2 % inImageWidth)
#define cornerC_x \
(_clip(vi_2, 0, inImageHeight - 1))
#define cornerC_y (ui % inImageWidth)
#define cornerD_x \
(_clip(vi_2, 0, inImageHeight - 1))
#define cornerD_y (ui_2 % inImageWidth)
```
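
The weighting step itself is not shown above. One common choice for combining four corners with the fractions `mu` and `nu` is bilinear interpolation; the sketch below assumes that scheme and is not necessarily the exact weighting used by the converter. `blend_corners` is a hypothetical helper that handles one color channel.

```c
/* Blend one color channel from the four corner samples.
 * a, b are the samples on row vi (columns ui and ui_2),
 * c, d are the samples on row vi_2 (columns ui and ui_2).
 * mu = uf - ui is the horizontal fraction, nu = vf - vi the vertical one. */
float blend_corners(float a, float b, float c, float d, float mu, float nu)
{
    float upper = a * (1.f - mu) + b * mu;   /* interpolate along row vi   */
    float lower = c * (1.f - mu) + d * mu;   /* interpolate along row vi_2 */
    return upper * (1.f - nu) + lower * nu;  /* interpolate between rows   */
}
```

When `mu` and `nu` are both zero the result is exactly corner A, so pixels that map onto an input pixel center are copied unchanged.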

# CUDA Support

## Memory

In CUDA, device memory must be allocated and filled before the kernel function is called. The following device-side variables therefore replace the corresponding variables in the code mentioned above.

```
unsigned char *cudaInImageArray;
unsigned char *cudaOutImageArray;
int *cudaFaceOffset;
int *cudaFace3DCoordsFactor;
int *cudaFace3DCoordsConstant;
// Malloc memory space on device
cudaStatus = cudaMalloc(
(void**)&cudaInImageArray,
sizeof(unsigned char) * inImageWidth
* inImageHeight * 3);
cudaStatus = cudaMalloc(
(void**)&cudaOutImageArray,
sizeof(unsigned char) * outImageWidth
* outImageHeight * 3);
...
// Copy data to the device
cudaStatus = cudaMemcpy(
cudaInImageArray,
inImageArray,
sizeof(unsigned char) * inImageWidth
* inImageHeight * 3,
cudaMemcpyHostToDevice);
cudaStatus = cudaMemcpy(
cudaOutImageArray,
outImageArray,
sizeof(unsigned char) * outImageWidth
* outImageHeight * 3,
cudaMemcpyHostToDevice);
...
```
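
As a rough sizing check for these buffers, the sketch below computes the byte counts from outBlockWidth alone. It assumes the input is an equirectangular image of 4W x 2W pixels and the output is a cross layout of 4W x 3W pixels (W = outBlockWidth), with 3 bytes per pixel as in the `cudaMalloc` calls above; the function names are hypothetical.

```c
#include <stddef.h>

/* Bytes needed for the spherical input image (assumed 4W x 2W, RGB). */
size_t in_image_bytes(int outBlockWidth)
{
    size_t w = (size_t)outBlockWidth;
    return (4 * w) * (2 * w) * 3;
}

/* Bytes needed for the cubic output image (assumed 4W x 3W, RGB). */
size_t out_image_bytes(int outBlockWidth)
{
    size_t w = (size_t)outBlockWidth;
    return (4 * w) * (3 * w) * 3;
}
```

Under these assumptions an outBlockWidth of 4096 needs about 384 MiB for the input and 576 MiB for the output, so both buffers fit in 4 GB of device memory. Doing the arithmetic in `size_t` avoids overflowing `int` for larger images.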

## Blocks and Threads

To compare performance across different thread layouts, we use the following code to assign tasks:

```
#define thread_per_block 512
// Define CUDA grid layout arrangement
dim3 dimGrid(6, outBlockWidth * outBlockWidth
    / thread_per_block);
dim3 dimBlock(thread_per_block);
// Run kernel function
pixelWarpKernel<<<dimGrid, dimBlock>>>(
cudaInImageArray,
cudaOutImageArray,
(int (*) [2])cudaFaceOffset,
(int (*) [3][2])cudaFace3DCoordsFactor,
(int (*) [3])cudaFace3DCoordsConstant,
outBlockWidth);
// cudaDeviceSynchronize waits for the kernel
// to finish, and returns
// any errors encountered during the launch.
// (It replaces the deprecated cudaThreadSynchronize.)
cudaStatus = cudaDeviceSynchronize();
```
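
The kernel body is not shown here, so the sketch below only illustrates how a thread could turn its indices into (face, inFaceX, inFaceY) under this grid layout. It is written as plain C so it can be checked on the CPU: `blockX`, `blockY` and `threadX` stand in for `blockIdx.x`, `blockIdx.y` and `threadIdx.x`, and both function names are hypothetical.

```c
/* Number of blocks needed per face, rounded up so that a partial last
 * block is still launched when the pixel count is not a multiple of
 * threadsPerBlock. */
int blocks_per_face(int outBlockWidth, int threadsPerBlock)
{
    int pixels = outBlockWidth * outBlockWidth;
    return (pixels + threadsPerBlock - 1) / threadsPerBlock;
}

/* Decode grid indices into a face and an in-face pixel.
 * Returns 0 when the thread falls past the last pixel of the face. */
int decode_thread(int blockX, int blockY, int threadX,
                  int threadsPerBlock, int outBlockWidth,
                  int *face, int *inFaceX, int *inFaceY)
{
    int linear = blockY * threadsPerBlock + threadX;  /* pixel index in face */
    if (linear >= outBlockWidth * outBlockWidth) {
        return 0;
    }
    *face = blockX;                       /* grid x dimension selects the face */
    *inFaceX = linear / outBlockWidth;
    *inFaceY = linear % outBlockWidth;
    return 1;
}
```

The launch code above uses plain integer division, which is fine as long as outBlockWidth * outBlockWidth is a multiple of thread_per_block; the rounded-up count together with the bounds check covers the other cases.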

# Results

## Visual Effects Comparison

We tested 10 of the 84 spherical panoramas from the Visual Campus Project. Comparing the rendered input and output panoramas at the same outBlockWidth (1/2 the height of a spherical panorama, or 1/3 the height of a cubic panorama), we find that the spherical panoramas look sharper than the cubic ones. This is expected, since we use a weighted average to compute the pixel colors.

Increasing the outBlockWidth of the cubic panoramas makes this kind of distortion disappear. The reason is that pixels in a spherical panorama are not uniformly distributed over the sphere. For spherical panoramas, outBlockWidth is only a lower bound on the equator length, while for cubic panoramas it is exactly 1/4 of the perimeter length.

(visual effects comparison between spherical panoramas (left) and cubic panoramas (right), rendered in two different test cases)

## Converting Speed Comparison

We ran two kinds of comparison: across different input sizes and across different thread_per_block configurations.

### Input Size

In this comparison, we run the CPU program and the CUDA-optimized program on 4 test cases: 2k, 4k, 8k and 16k spherical panorama inputs produced from the same original picture. The converter runs on a machine with a 2.7 GHz Intel Core i5 CPU, 8 GB of 1867 MHz DDR3 memory, and an external Nvidia GTX970 graphics card with 4 GB of memory.

The running time is recorded using the clock() function from the C standard library. Only the time after the input has been read into an array and before the output is saved is measured (that is, file input and output are not included in the running time).
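
A minimal sketch of this kind of measurement is shown below. `busy_work` is a hypothetical stand-in for the conversion step, and `time_call` and `elapsed_seconds` are helper names introduced for this example. Note that `clock()` reports processor time used by the program, not wall-clock time, and that for the GPU version the host has to wait for the kernel to finish before taking the second reading.

```c
#include <time.h>

/* Convert a pair of clock() readings to seconds. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in for the conversion step being measured. */
void busy_work(void *arg)
{
    volatile long sum = 0;
    long n = *(long *)arg;
    for (long i = 0; i < n; ++i) {
        sum += i;
    }
}

/* Time a single call; any file reading or writing stays outside fn. */
double time_call(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return elapsed_seconds(start, end);
}
```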

(Running Time Comparison of Input Size)

### thread_per_block

To measure how the thread layout affects performance, we chose 5 different thread_per_block configurations (32, 64, 128, 256 and 512) and ran the 16k input conversion on the same machine described above. The measured running times show no evidence that the thread_per_block configuration influences performance.

(Running Time Comparison of thread_per_block)

# Conclusion

The spherical panorama is widely used in photography and computer graphics because it is easy to generate. However, the cubic panorama renders more efficiently on mobile devices. Conversion between these two kinds of panorama image can be accelerated using CUDA. Given the good space usage of our algorithm, the CUDA-optimized parallel program runs several times faster than the CPU version, and the advantage grows as the input image size increases. This algorithm can also be extended to converting a large number of spherical panoramas of the same image size, in which case performance can be improved further by caching the pixel mapping pairs.