Singular Value Decomposition using a Field Programmable Gate Array

I was honoured to have been selected to be a recipient of the Chancellor’s Undergraduate Research Scholar award for
the 2013 summer and autumn quarters at UWT. The research focused on implementing SVD in a FPGA.
Sponsor: Jie Sheng, Ph.D.

Problem Statement

The research is based on the masters work: "
AN FPGA-BASED SINGULAR VALUE DECOMPOSITION PROCESSOR" by Weiwei Ma.
Singular Value Decomposition (SVD) is a particular factorization of an n x m matrix into three
matrices into the form M = U(SIGMA)V*. Applications that take advantage of the SVD include signal processing,
bioinformatics, and digital image processing. In implementing the SVD on a Field Programmable Gate
Array (FPGA), the n x m matrix must be organized as a mesh-connected network of processors. Each one
of those processors represents a 2 x 2 matrix, such that the proposed n x m matrix is diagonalized by
a sequence of 2 x 2 SVDs.

Implement the algorithm using an Altera DE2 board featuring Altera Cyclone II 2C35 FPGA. Additionally,
simulate the mesh-connected network of processors in Matlab and Java.

Where to begin?

What is SVD?

The first step in approaching the research was to understand precisely what the problem was that
needed to be solved. While the mathematical definition of the SVD listed above is complete, it doesn't
necessarily give a lot of insight into what is being accomplished.

An image, like the one on the right can be represented by a matrix of grey-scale values 0-255. The picture
of the mathematician Gauss is a 340 x 280 pixel image with 256 shades of grey. When the singular value
decomposition of the image is calculated, three resultant matrices are produced. One of which, is the
matrix containing the singular values on the diagonal and all other values are zero. The diagonal singular
values are listed in order of greatest to least. It turns out that the smaller singular values in the SVD come
from the "boring" sections of the image, and they may be ignored. [3]

So, what does this mean? Imagine transmitting the data for this image. What if only a fraction of the
data could be transmitted while preserving the integrity of the original? The image where k = 2 represents
the image as reproduced from 2 singular values. It can be seen that when k = 32, there is almost no noticeable
difference between the original image and the one reproduced from only 32 singular values.

The applications of SVD include signal processing, bioinformatics, and digital image processing. However, this
example provided a little context to me on what is trying to be achieved and how it can be applied.

If you're interested in finding out more about SVD, try this link to a video that takes a light-hearted approach to
explaining what it is and how it can be used:
“It had to be U” – the SVD Song on YouTube.com
(First played at the 9th Tartu Conference on Multivariable Statistics and 20th International Workshop on Matrices
and Statistics in Tartu, Estonia 28 June, 2011.)

What is an FPGA?

A field programmable gate array is an integrated circuit designed to be programmed by the end user for very specific
applications. FPGAs are composed of large sections of logic gates and RAM blocks. (see FPGA on wiki)
The advantage of using a FPGA over using a standard microprocessor is that an FPGA is programmed for very specific operations
without having to compete with numerous background tasks that always run on a GPOS. Additionally, while concurrency can be simulated
on microprocessors through multi-threading, FPGAs can achieve true concurrency by running various processes simultaneously through
the numerous logic gates and memory components programmed into the design.

Phase I - Implementing SVD on a 2 x 2 matrix

Brent, Luk, Van Loan [4] proposed a mesh-connected network of processors for handling SVD on a larger scale. However, to
achieve the SVD through a network of processors, the SVD of a single 2 x 2 matrix must be first developed.
They proposed an iterative process, based on multiple processors concurrently providing an SVD for their corresponding
2 x 2 matrix. Forsythe and Henrici [6] proposed the diagonalization of a n x n matrix by a sequence of 2 x 2 SVD’s.

The math involved is fairly straightforward on the handy TI-83. However, how can this be implemented on an FPGA that is based
on mathematical operations through logic gates? Three trig functions needed to be developed for sin, cos, and atan.

To further break down the problem being solved, the left and right rotation angles must be written in terms of the
input values {(a, b), (c, d)} that compose the 2 x 2 matrix input.

Verilog Library of Programmable Modules (LPM’s) are extremely accurate but utilize a large percentage of the system resources.
Which is to say that there are built in Libraries within the FPGA's development software that already facilitate
trigonometric functions. However, it is noted that the single 2 x 2 matrix will take multiple trig functions to
return a desired SVD. Additionally, concurrency is the goal, which means that multiple calculations should be occurring
simultaneously. A single ATAN function in the LPM took up 20% of the system's resources alone. Therefore,
COordinate Rotation DIgital Computer (CORDIC) modules needed to be developed for this specific application. The CORDIC
algorithm works generally by a series of successive approximations that iteratively bring you closer to the true value through
vector rotations. The number of iterations and the resolution of the binary value being used determine the margin of error produced from the
calculations. The image below shows the generic CORDIC algorithm.

The CORDIC functions were developed in the FPGA for sin, cos, and atan. Input values were hard-coded into the FPGA to begin with
to test the algorithm. Using the input values of {(1, 2), (3, 4)}, the input values were scaled by multiplying them
by 2^16, which amounts to simply shifting the values to the left 16 places in binary. Scaled values were taken advantage
of in all levels of the algorithm to speed up calculation time. The singular values returned from the FPGA were:
[ A = -0.365859985 B = 0.000061035 ]
[ C = 0.0000610351 D = 5.463378906 ]
The results produced by Matlab were:
[ A = 5.4650 B = 0.0000 ]
[ C = 0.0000 D = 0.3660 ]
The results produced by the FPGA were not normalized. The results should be all positive, and arranged in order of greatest -> least.
This is fairly easily corrected by shifting around and inverting the rotation angles applied to the original matrix and running
the algorithm again. The state machine for the 2x2 SVD algorithm can be seen below.

Phase II - Implementing SVD in a mesh-connected network of processors

With the functions in place to handle the SVD of a single 2 x 2 matrix, the next step is to begin integrating
numerous instances of those processors and link them together. They are be linked in a mesh-connected network
of processors. The general flow of information works iteratively again where the diagonal elements first calculate
the left and right rotation angles and calculate the SVD locally. Those angles are propagated outward to neighbouring
processors. The SVD for those neighbouring elements are calculated while the rotation angles are propagated further
outwards until all elements have completed one iteration of the SVD. The flow of signals is represented below for a
8 x 8 matrix of values.

The next step is to trade the resultant data between neighbouring processors and repeat. To first understand
the processes, I evaluated and adapted a Matlab script from the masters work this research was based on: "
AN FPGA-BASED SINGULAR VALUE DECOMPOSITION PROCESSOR" by Weiwei Ma. Furthermore, I wrote a JAVA program
that accomplished much the same effect, only it provided me a little more flexibility in altering various aspects
of the algorithm to evaluate the results.

Unfortunately, what was not known when this research began was that the thesis it was based on did not provide
a normalized output for the singular value matrix. After an additional quarter trying to further understand the theory
behind the algorithm and produce a normalized output, I was unsuccessful. The singular value matrix produced could
easily be normalized at the end by making all values positive and restructuring the matrix to be organized in order
of greatest -> least. However, the matrices U and V* must also be produced in the process of arriving at the singular
value matrix. As such, the singular values cannot be arbitrarily altered without also altering the two matrices U and V*.

Conclusion

Using a counter on the FPGA, the number of clock cycles to complete processes within the state machines that composed
the 2 x 2 SVD algorithm was recorded. The maximum clock frequency of the FPGA is 50 MHz, which amounts to 0.02 micro-seconds
per clock cycle. The process to produce the SVD of the 2 x 2 matrix took 227 clock cycles or 4.54 micro-seconds. I am confident
that given more time to research the algorithm to interface the mesh-connected network of processors, I could produce the desired
results. Although, I would have quickly run into issues with the resources taken up on a single FPGA. The 2 x 2 SVD algorithm, in the end,
took up 20% of the system's resources. This percentage could have been decreased at the cost of accuracy of results. The resources
used could also be decreased by sharing resources like the state machines that implement the CORDIC functions.
That comes at a cost of slower run-time because true concurrency would not be taken advantage of.

References

(1) "AN FPGA-BASED SINGULAR VALUE DECOMPOSITION PROCESSOR" by Weiwei Ma
(2) Eigenvalues and Singular Values (Mathworks.com)
(3) Linear Algebra – A Modern Introduction 3rd Ed. (David Poole)
(4) R.P. Brent, F. T. Luk, F. T. and C. Van Loan, Computation of the Singular Value Decomposition using mesh-connected processors,
Journal of VLSI and computer systems, vol. 1, no. 3, pp. 242-270, 1985.
(5) J. Volder, “Binary Computation Algorithm for Coordinate Rotation and function Generation”, Convair Report IAT-1 148
Aeroelectrics Group, June 1956.
(6) G. E. Forsythe and P. Henrici, “The Cyclic Jacobi Method for Computing the Principal Values of a complex Matrix”,
Trans. Of the American Mathematical Society, 94(l):l-23, January, 1960.
(7) “It had to be U” – the SVD Song on YouTube.com (First played at the 9th Tartu Conference on Multivariable Statistics and
20th International Workshop on Matrices and Statistics in Tartu, Estonia 28 June, 2011.)