Chapter 2 - Kohonen Data Analysis

This chapter contains some brief information about Kohonen's SOFM (Self-Organizing Feature Map) and the choices that were made to get it working satisfactorily. One obvious alternative to Kohonen's SOFM for viewing multidimensional data sets is Principal Component Analysis, which is explained very briefly. There is also a short description of the source code of the program that was written to meet the demands.

2.1 Introduction to Kohonen's SOFM

Kohonen's self-organizing feature map is an extended competitive learning network. A competitive learning network is a feed-forward network that consists of one layer of neurons. Each neuron has one input for each dimension that is going to be analyzed and one output that is supposed to represent a certain pattern in the input data. To map the patterns of the input data to a certain output neuron, the network has to be trained to recognize these patterns. That means that when an element of the data is analyzed there is one winning output neuron, the one that comes closest to recognizing the pattern. The difference between Kohonen's SOFM and a plain competitive learning network is that in Kohonen's SOFM there are connections between the neurons. Because of this, a certain pattern in the input data will tend to get a certain part of the output neurons as its winning neurons, and there will be some clustering of the outputs that represents the classes of input.
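To make the winner selection concrete, here is a minimal sketch in C (not the project's actual code): the winning neuron is the one whose weight vector lies closest to the input. The names and sizes are assumptions, and the standard squared-Euclidean distance is used here; section 2.3 describes the cheaper measure that the program actually uses.

#include <float.h>

#define N_NEURONS 13   /* number of output neurons; the real map size is discussed in 2.3 */
#define N_INPUTS  4    /* one input per data dimension (assumed value) */

static float weights[N_NEURONS][N_INPUTS];   /* one weight vector per neuron */

/* Squared Euclidean distance between an input row and a neuron's weights. */
static float distance2(const float *input, const float *w)
{
    float sum = 0.0f;
    for (int i = 0; i < N_INPUTS; i++) {
        float d = input[i] - w[i];
        sum += d * d;
    }
    return sum;
}

/* The winner is the neuron whose weight vector is closest to the input. */
static int find_winner(const float *input)
{
    int best = 0;
    float bestDist = FLT_MAX;
    for (int n = 0; n < N_NEURONS; n++) {
        float d = distance2(input, weights[n]);
        if (d < bestDist) {
            bestDist = d;
            best = n;
        }
    }
    return best;
}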

2.2 Introduction to Principal Component Analysis

Principal component analysis is a classical statistical method. The goal of the method is to decrease the number of dimensions in a multidimensional data set. Basically, only the axes with the most variation are kept and the others are ignored; which axes these are can be calculated with statistical methods. By doing this a data set with many dimensions can be reduced to two or three dimensions and plotted in a graph, which easily shows whether the data is easy or hard to classify.
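As a hedged illustration (this is not part of the project's code), the axis with the most variation is the dominant eigenvector of the data's covariance matrix, and a simple way to approximate it is power iteration; the sizes below are placeholder assumptions.

#include <math.h>

#define DIM   4     /* number of input dimensions (assumed) */
#define ROWS  100   /* number of data rows (assumed) */

/* Approximate the first principal component of the data set
   with power iteration on the covariance matrix. */
void first_component(const float data[ROWS][DIM], float pc[DIM])
{
    float mean[DIM] = {0}, cov[DIM][DIM] = {{0}};

    /* Centre the data: subtract the per-dimension mean. */
    for (int r = 0; r < ROWS; r++)
        for (int i = 0; i < DIM; i++)
            mean[i] += data[r][i] / ROWS;
    for (int r = 0; r < ROWS; r++)
        for (int i = 0; i < DIM; i++)
            for (int j = 0; j < DIM; j++)
                cov[i][j] += (data[r][i] - mean[i]) * (data[r][j] - mean[j]) / (ROWS - 1);

    /* Power iteration: repeatedly multiply a vector by the covariance
       matrix and normalize; it converges to the dominant eigenvector. */
    for (int i = 0; i < DIM; i++) pc[i] = 1.0f;
    for (int it = 0; it < 100; it++) {
        float next[DIM] = {0}, norm = 0.0f;
        for (int i = 0; i < DIM; i++)
            for (int j = 0; j < DIM; j++)
                next[i] += cov[i][j] * pc[j];
        for (int i = 0; i < DIM; i++) norm += next[i] * next[i];
        norm = sqrtf(norm);
        for (int i = 0; i < DIM; i++) pc[i] = next[i] / norm;
    }
}

Projecting each data row onto the first two or three such components gives the low-dimensional plot described above.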

2.3 The Kohonen Choices

At the beginning of the course Kohonen's SOFM seemed to be the better choice because, first of all, it did not involve any advanced mathematics. Second, it was more in line with the course and felt more interesting to explore. This made the choice quite easy, and it was decided at the start of the project to use Kohonen's SOFM instead of Principal Component Analysis.

When using Kohonen's SOFM there are some parameters that need to be addressed. The most obvious one is probably the size of the output layer. The size was chosen to be 13, because it seemed to be a well-balanced value with respect to computational time and map size. Computational time was also a big burden, because the distance measure involved a lot of squares and square roots (RMS calculation). This was reduced to a simple sum of absolute values instead; more about this is described in chapter 3. The learning rate and the sigma value were chosen to decrease over time so that acceptable convergence would be achieved.
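The speed-up mentioned above amounts to swapping the distance measure used when comparing an input row with a weight vector. A sketch of the two alternatives (function names are hypothetical):

#include <math.h>

/* RMS-style distance: one square per element plus a square root. */
float dist_rms(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sqrtf(sum / n);
}

/* Cheaper alternative: sum of absolute differences. It tends to rank
   candidate neurons similarly, so the winner search still works. */
float dist_abs(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += fabsf(a[i] - b[i]);
    return sum;
}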

2.4 Program Structure

The Kohonen program is divided into two parts and two files: one called readfile.c and one called kohonen.c. The first file reads the data files and is used both by the Kohonen SOFM and by the back-propagation program. The second file contains the Kohonen-specific functions.

Readfile.c

This file takes care of reading the input data files. Because all data files are a little different, there is one function for each data file. Such a function reads one row of data at a time and saves it as a list of inputs, with the output as the last element. These functions are called lRead<DataSetName> and have to be executed before any data can be retrieved. The last thing that happens is that the values are normalized to between -1 and 1. During the operation two variables are set, lWidth and lSize, which are the number of elements in one data row and the number of rows in the set. After the file has been read, the function lGetRow can be called to get one row of data, indexed from 0 to lSize-1. The elements in the returned array are floating-point numbers; the output is located last in the array and is an integer value between 0 and the number of outputs.
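Based on the description above, a typical use of the interface might look as follows. The exact signatures are assumptions (the real definitions live in readfile.c), and the tiny stand-in data set is purely illustrative:

#include <stdio.h>

/* Hypothetical stand-ins for the real readfile.c interface. */
static int   lWidth = 5, lSize = 2;
static float rows[2][5] = {
    { 0.1f, -0.3f,  0.7f, -1.0f, 0.0f },   /* inputs..., output class 0 */
    { 0.9f,  0.2f, -0.5f,  0.4f, 1.0f }    /* inputs..., output class 1 */
};
static void   lReadDemo(void) { /* the real lRead<DataSetName> parses a file */ }
static float *lGetRow(int r)  { return rows[r]; }

int main(void)
{
    lReadDemo();                             /* must run before fetching data */
    for (int r = 0; r < lSize; r++) {
        float *row = lGetRow(r);
        int output = (int)row[lWidth - 1];   /* class label is stored last */
        printf("row %d: first input %.2f, class %d\n", r, row[0], output);
    }
    return 0;
}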

Kohonen.c

This file contains all the Kohonen SOFM functions. First, all the values are initialized and all the data from the appropriate file is read. The weights of the network are set to random values between 0.2 and 0.8. Then, for each row of the set, a winning neuron is calculated, and the winning neuron's weights and its neighbouring neurons' weights are updated. Which neurons are close enough to be called neighbours is determined by the function lambda. When the whole set has been traversed, the network goes back to the first row and continues like this until it is decided that no more training is needed; the program is generally terminated after about 2000 iterations. Before every new iteration starts, the sigma value and the learning rate are recalculated. These values depend on time and are used in the lambda function and in the function that updates the weights. After a predefined number of iterations the program runs a function that calculates the winner for all input rows in the set and writes the distribution for each of the different outputs to files. These files can then be used to plot 3D graphs showing the distribution of the outputs; in this case the program gnuplot was used. The graphs can be seen in the data analysis section.
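Putting the pieces together, the training loop described above might look like the following sketch, reusing find_winner from section 2.1 and the lambda, sigma and learning-rate helpers sketched under "Changeable Parameters" below; none of this is the report's verbatim code.

/* One call trains the whole network; weights, N_NEURONS, lSize,
   lWidth, lGetRow and find_winner are as in the earlier sketches. */
void train(int iterations)
{
    for (int t = 0; t < iterations; t++) {      /* about 2000 in the report */
        float sigma = sigma_at(t);              /* recomputed every pass    */
        float eta   = eta_at(t);                /* learning rate            */
        for (int r = 0; r < lSize; r++) {
            float *row = lGetRow(r);
            int win = find_winner(row);
            /* lambda() says how strongly each neuron counts as a
               neighbour of the winner; 0 leaves it untouched. */
            for (int n = 0; n < N_NEURONS; n++) {
                float h = lambda(n, win, sigma);
                for (int i = 0; i < lWidth - 1; i++)   /* skip the class label */
                    weights[n][i] += eta * h * (row[i] - weights[n][i]);
            }
        }
    }
}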

Changeable Parameters

In a Kohonen network there are some parameters that can be changed. The first is the number of neurons in the network and their initial weights. Two other parameters that can be altered are the sigma value and the learning rate; they can be constant values or functions that change over time. The sigma value determines how many of the neurons surrounding the winning neuron are moved, and the learning rate determines how much they move each time. The number of inputs to the network can also be changed; normally it is set to match the number of elements in a data row. There is also the possibility of using the more time-consuming RMS calculation instead of the sum of absolute values.
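One common way to realize the time-dependent behaviour is exponential decay, with a Gaussian neighbourhood for lambda. The constants and the 1-D grid distance below are assumptions for illustration, not the project's exact formulas:

#include <math.h>

/* Exponentially decaying neighbourhood radius and learning rate;
   initial values and time constants are illustrative only. */
float sigma_at(int t) { return 3.0f * expf(-t / 1000.0f); }
float eta_at(int t)   { return 0.5f * expf(-t / 1000.0f); }

/* Gaussian neighbourhood: close to 1 near the winner, falling toward 0
   once the grid distance grows past sigma. A 2-D map would use the
   neurons' map coordinates instead of this 1-D index distance. */
float lambda(int n, int winner, float sigma)
{
    float d = (float)(n - winner);
    return expf(-(d * d) / (2.0f * sigma * sigma));
}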