What is a Self-Organizing Map?
Self-Organizing Maps (SOMs) are a tool for clustering and visualizing multi-dimensional data. Most common methods for visualizing data only allow you to see two or three dimensions at a time, but SOMs allow viewing dozens of dimensions simultaneously. SOMs were invented by Professor Teuvo Kohonen.
- The SOM algorithm is described on the Google Code page.
For more information on Self-Organizing Maps, see:
How to read a Self-Organizing Map
Let's say you want to organize a bunch of colored marbles according to their color, that is, by the amount of each primary color of light they contain (red, green, blue). For example, yellow is a mix of green and red, so it should go near those colors.
You might start with the bluish marbles all at the bottom, and then put in the red/yellows and finally put the greens in the empty corner.
Or you might start with green next to yellow and red and go from there.
Both of these are valid organizations. In these organizations, up, down, left and right are meaningless; all that matters is that similar colors end up next to each other.
Now if we display the same organization of the marbles in the three different colors (red, green, blue), we get:
We can now see for each marble how much red, green and blue it has. So for example, look at the marble on the top-right.
You can see that this marble is high in red and green, but low in blue. Red and green make yellow, so it is a yellow marble.
Now look at the marble on the top left.
This marble has high blue and green and a little red, which together make a sort of turquoise color.
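To make the idea of showing one organization "in three different colors" concrete, here is a small Python sketch. The 2x2 layout and the color values are made up for illustration; the point is only that each per-color view keeps every marble in the same position and shows a single channel.

```python
# Each marble is a (red, green, blue) triple with values from 0 to 255.
# The layout and the numbers are invented for this illustration.
marbles = [
    [(40, 200, 200), (255, 255, 0)],  # top row: turquoise, yellow
    [(0, 0, 255), (255, 0, 0)],       # bottom row: blue, red
]

def component_plane(grid, channel):
    """Return the same layout showing only one channel (0=red, 1=green, 2=blue)."""
    return [[cell[channel] for cell in row] for row in grid]

red_plane = component_plane(marbles, 0)
green_plane = component_plane(marbles, 1)
blue_plane = component_plane(marbles, 2)

for name, plane in [("red", red_plane), ("green", green_plane), ("blue", blue_plane)]:
    print(name, plane)
```

Reading the top-right position across the three planes gives high red, high green and no blue, which is the yellow marble.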
Now let's look at a Self-Organizing Map made by IFCSoft on some test data:
The test data used are from test beads run through a flow cytometer. It's not important to know anything about flow cytometry other than that in this example, thousands of test beads were passed through the machine and their colors were measured. There are 4 colors (FITC, PE, APC, PERCP) and each bead is either bright in one of the colors, or dark in all of them. There are then 5 types of beads: FITC, PE, APC, PERCP, and Unstained.
Once IFCSoft has finished calculating the SOM, it will display the resulting organization of the data (there is no more "organizing" once the map is displayed).
This works just like the color marble example, where the same picture (or organization) is shown under different dimensions. Let's look at the data points at the top-right corner.
You can see that this node is very high in PE-A and low in the others. When you place the cursor over a point, the program shows the raw values of the SOM in each dimension. In particular, the node under the cursor has values in FITC-A of about 960, PE-A of 22,000, APC-A of 430 and PERCP-A of 6,500. If you open the raw data set file you can find beads with approximately those values.
This node in the middle is fairly high in both PE-A and PERCP-A. This could be where two beads got stuck together, so they have both colors.
This whole section on the top-left represents beads that are high in APC-A and low in the other colors.
The outlined section on the bottom-right is low in all the colors, so these are the unstained beads.
Recall that in SOMs, up, down, left and right aren't meaningful other than similar things being placed next to each other. If we were to make the SOM again, even if we use the same settings, where each group ends up could be different due to randomization in the organization process. So the FITC-A beads might end up at the top-right corner instead of the bottom-left.
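The effect of randomization can be seen with a minimal, generic SOM training loop. This is a sketch only and is not IFCSoft's implementation: it assumes a rectangular grid instead of hexagonal nodes, a linearly shrinking learning rate and neighborhood, and made-up two-channel bead data. The function `train_som` and all of its parameters are names chosen for this illustration.

```python
import math
import random

def train_som(data, rows, cols, epochs=10, seed=0):
    """Train a small SOM on a rectangular grid and return its node vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Start each node at a randomly chosen data point (one common choice).
    nodes = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)  # present the data in a random order on each pass
        for x in order:
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac)  # learning rate shrinks over time
            radius = 0.5 + (max(rows, cols) / 2.0) * (1.0 - frac)  # so does the neighborhood
            # Find the node closest to this data point (the best matching unit).
            br, bc = min(
                ((r, c) for r in range(rows) for c in range(cols)),
                key=lambda rc: math.dist(nodes[rc[0]][rc[1]], x),
            )
            # Pull the winner and its grid neighbors toward the data point.
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
                    node = nodes[r][c]
                    for i in range(dim):
                        node[i] += pull * (x[i] - node[i])
            step += 1
    return nodes

# Made-up data: two groups of "beads", each bright in one of two channels.
data_rng = random.Random(42)
beads = [(1.0 + data_rng.uniform(-0.1, 0.1), data_rng.uniform(0.0, 0.1)) for _ in range(20)]
beads += [(data_rng.uniform(0.0, 0.1), 1.0 + data_rng.uniform(-0.1, 0.1)) for _ in range(20)]

som_a = train_som(beads, rows=3, cols=3, seed=1)
som_b = train_som(beads, rows=3, cols=3, seed=2)
print(som_a[0][0], som_b[0][0])  # the same grid position can hold different values
```

Both runs use the same data and settings and differ only in the random seed, yet a given grid position can end up representing a different group. Fixing the seed makes a run repeatable in this sketch.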
The Edge UMap (right panel below) shows the borders between groups in the data and can be useful for clustering. When there is a large change between two neighboring nodes (the hexagonal points on the maps) in the SOM, the area between them is drawn brighter to mark that change.
In this example, since the groups in the data are very different, the borders are fairly distinct. The program colors the strongest border in red and less clear ones are marked in yellow or green, though they are still pretty clear in this one. In most regular data sets, the natural groups in the data tend to overlap, so you get fuzzy and complicated borders which aren't as useful.
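One common way to compute this kind of border map is to average, for every node, the distance to its grid neighbors. The sketch below does that on a rectangular grid with four neighbors per node; this is a simplifying assumption, since the maps shown here use hexagonal nodes, and the exact calculation behind the Edge UMap may differ. The node values are made up.

```python
import math

def u_matrix(nodes):
    """Average distance from each node to its up/down/left/right neighbors."""
    rows, cols = len(nodes), len(nodes[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(nodes[r][c], nodes[nr][nc]))
            # A node with no neighbors (1x1 grid) gets 0.
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result

# Made-up 2x4 map: the left half is dark, the right half is bright in channel 0.
nodes = [
    [(0.0, 0.0), (0.0, 0.0), (10.0, 0.0), (10.0, 0.0)],
    [(0.0, 0.0), (0.0, 0.0), (10.0, 0.0), (10.0, 0.0)],
]
edges = u_matrix(nodes)
for row in edges:
    print([round(v, 2) for v in row])
```

The two middle columns, where the dark and bright halves meet, get the largest values, so they would be drawn as the border.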
The SOM that we made is only an approximation of the data. This SOM has only 200 nodes (hexagonal points) while the data set has 10,000 beads, meaning that on average each node represents 50 beads. The resolution (number of nodes) can be set as needed when making an SOM.
After the SOM is calculated, the program determines, for each data point (bead), which node best represents it. It then displays a "Density Map" (bottom-left plot) showing how many data points each node has. For example, you can see that the unstained bead nodes have more beads per node on average.
You can see that the nodes with almost no data points tend to line up with the boundaries in the UMap. This is because the SOM puts nodes between the groups which are a little high in both colors, even though there are rarely beads like that. Since no beads in this data set are a little high in two colors, the density map shows no data points placed on those nodes. You can think of this like trees on a topographic map: there are trees at the top of the cliff and at the bottom, but not many on the face of the cliff.
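The density map calculation can be sketched in a few lines of Python. The node and bead values below are made up, the nodes are kept in a flat list for brevity, and "best represents" is assumed to mean the smallest Euclidean distance.

```python
import math

def best_matching_node(nodes, point):
    """Index of the node closest to the given data point."""
    return min(range(len(nodes)), key=lambda i: math.dist(nodes[i], point))

def density_map(nodes, data):
    """Count how many data points are best represented by each node."""
    counts = [0] * len(nodes)
    for point in data:
        counts[best_matching_node(nodes, point)] += 1
    return counts

# Made-up example: two groups of beads and a node sitting between them.
nodes = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
beads = [(0.1, 0.2), (0.3, 0.0), (9.8, 10.1), (10.2, 9.9), (0.0, 0.4)]

counts = density_map(nodes, beads)
print(counts)  # [3, 0, 2]
```

The middle node lies between the two groups and receives no beads, which is the "cliff face" effect described above.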
You can use density maps to compare related data sets that have the same dimensions and scales. For example, you could combine flow cytometry data sets of blood samples from several patients. Be aware, however, that samples processed on different runs may look different due to calibration differences such as changes in channel brightness. If these differences are not adjusted for beforehand, the SOM will not give a meaningful comparison.
To compare patients, first we make an SOM of all the cells from all the patients combined (in this case we only have sample bead data sets), then we show the density map for each patient (or data set), showing which cell populations they each have.
Above is an SOM on 5 different data sets of bead samples. The top row shows the organization done on all the beads from all the samples combined, and the bottom row has the density maps from each sample. For example, Dataset 1 has 2 areas with beads: those on the bottom-left are the unstained beads, and those on the top-left are APC-A beads. Each sample has a group of unstained beads, but the only two with PE-A beads are Dataset 3 and Dataset 5.
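The comparison can be sketched as follows: one shared set of nodes (standing in for a SOM trained on all samples combined) and one density map per data set. Everything here is hypothetical, including the node values, the two channels and the bead measurements; only the overall procedure follows the description above.

```python
import math

def density_map(nodes, data):
    """Count how many data points fall on each node (nearest node wins)."""
    counts = [0] * len(nodes)
    for point in data:
        best = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], point))
        counts[best] += 1
    return counts

# Hypothetical nodes of a SOM trained on all samples combined.
# Dimensions are (APC-A, PE-A); the values are made up.
shared_nodes = [
    (0.0, 0.0),    # unstained
    (100.0, 0.0),  # bright in APC-A
    (0.0, 100.0),  # bright in PE-A
]

# Made-up beads for two of the data sets.
datasets = {
    "Dataset 1": [(1.0, 2.0), (98.0, 3.0), (2.0, 1.0)],
    "Dataset 3": [(0.0, 1.0), (3.0, 97.0)],
}

# Same nodes for every data set, so the maps can be compared position by position.
per_dataset = {name: density_map(shared_nodes, beads) for name, beads in datasets.items()}
for name, counts in per_dataset.items():
    print(name, counts)
```

Because every data set is counted against the same nodes, a zero in one map and a nonzero count at the same position in another shows directly which groups each sample has.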
You can read how the actual algorithm works on the Google Code page.