A useful application of SOM is to organize objects by similarity of features. Once objects are organized this way, we can make interpretations based on dimensions, distance, and display. While most applications focus on segmentation, results can also be interpreted in terms of dimensions and distance/location. We also illustrate how Python libraries may be used to present changes in segment membership as we approach a solution.
We will describe:
- different kinds of data used for illustration:
  - classic Iris data (3x3x4)
  - oil patch data (3x3x4)
  - wealth of nations (7x7)
- measures of distance and interpretation
- dimensions and interpretations of dimensions
- observing the changes in assignment of objects to segments
- a Python program which provides a basis for modifying data sources, measures of distance, and number of dimensions (a sketch of such an entry point follows this list)
- an interpretation of results
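As a hedged sketch of what such a configurable entry point might look like, the stream's data source, distance measure, and grid dimensions could be exposed as parameters. The function names, defaults, and distance helpers below are assumptions for illustration, not the actual program's API:

```python
import numpy as np

def euclidean(sample, cells):
    """Euclidean distance between one sample and every cell weight vector."""
    return np.sqrt(((cells - sample) ** 2).sum(axis=-1))

def manhattan(sample, cells):
    """City-block distance, as one alternative measure."""
    return np.abs(cells - sample).sum(axis=-1)

def run_stream(data, grid_shape=(3, 3, 4), distance=euclidean, iterations=2000):
    """Illustrative driver: build a weight lattice with grid_shape cells for
    training against `data`; the training loop itself is sketched later."""
    rng = np.random.default_rng(0)
    weights = rng.random(grid_shape + (data.shape[1],))
    # ... training would repeatedly use `distance` to find the best-matching cell ...
    return weights
```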
Below are three examples of outcomes from the use of SOM. It is worth examining these results and discussing them to see the utility of this kind of segmentation.
The first illustrates the results for the classic Iris data. The data is classic for two reasons. First, it is used to demonstrate how well an algorithm groups or segments objects in comparison to other algorithms. Second, the Iris genus has about 300 species, of which versicolor is an allopolyploid of setosa and virginica - multiple genomes from different species operating within one plant. The result is that any segmentation may reflect the ploidy found within Iris versicolor.
For this matrix, I provide background colors and textual displays within each cell as standard output for this Python stream - the stars and summaries on the left are additional decorations. The matrix has 3 dimensions measuring 3 by 3 by 4 cells. At the bottom of the matrix is the order of the four most important measures: petal width, petal length, sepal length, and sepal width. The RGB colors reflect the combination of the three most important measures. The colors are grayish, reflecting the fact that the top three measures are highly correlated with one another - after all, physical sizes are generally correlated with one another.
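A minimal sketch of how such cell colors might be produced, assuming the trained lattice is held in a NumPy array of cell weight vectors (the array shapes and scaling below are illustrative, not the program's exact code): each of the three most important measures is scaled to the 0-1 range across the grid and used as one RGB channel, so highly correlated measures give similar channel values and therefore grayish cells.

```python
import numpy as np

def cell_colors(weights, top3_idx):
    """weights: lattice of shape (3, 3, 4, n_features);
    top3_idx: indices of the three most important measures,
    e.g. petal width, petal length, and sepal length for the Iris run."""
    channels = weights[..., top3_idx]                     # (3, 3, 4, 3)
    lo = channels.min(axis=(0, 1, 2), keepdims=True)
    hi = channels.max(axis=(0, 1, 2), keepdims=True)
    rgb = (channels - lo) / (hi - lo + 1e-9)              # each channel scaled to 0..1
    return rgb                                            # usable as matplotlib facecolors
```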
In doing segmentation, I generally prefer SOM over G-means. In my experience, G-means has a tendency to produce a few groups with a very uneven distribution of objects: for example, 2% of objects in one group, 5% in a second, and 93% in a third. While 3 groups might be reasonable for the classic Iris data, the SOM display provides an interpretable basis for deciding whether any particular number of groups is really the best. Beyond this, using a multi-dimensional SOM allows for interpretation of the dimensions.
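As a small illustration of that concern, segment sizes can be tabulated directly from the assignments. Here `assignments` is assumed to be a one-dimensional array giving each object's segment (cell) index, whatever algorithm produced it:

```python
import numpy as np

def membership_table(assignments):
    """Print how many objects fall in each segment and each segment's share."""
    segments, counts = np.unique(assignments, return_counts=True)
    shares = counts / counts.sum()
    for seg, n, share in zip(segments, counts, shares):
        print(f"segment {seg}: {n:4d} objects ({share:6.1%})")
```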
The second segmentation illustrates results for oil patch data from around the world. In contrast to the previous table, the colors are more intense, reflecting the fact that the dimensions are more independent of one another. On the left is a summary of oil patches and their positions in the matrix. At the bottom are the four most important measures: downhole pump, pipeline, water injection, and development intensity.
Next, we observe the results for the classic SOM analysis of national wealth. For this we use a 7 by 7 matrix linked with a map of the world. What is unique about this solution is that we can observe the iterations to a solution on both the matrix and the map at the same time, using libraries available in Python. It is important to note that there are multiple solutions when grouping objects, and investigators may obtain insights by watching the convergence to a solution.
The convergence displayed below generally fits with other classifications of nations.
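A hedged sketch of how the matrix and the map could be drawn together at each iteration, in the spirit of the wealth-of-nations run: the geopandas sample dataset and plotting calls are the library's own, while `nation_segment` (a mapping from ISO-A3 country codes to the cell each nation currently occupies) and the figure layout are assumptions for illustration.

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Country outlines shipped with geopandas; includes an 'iso_a3' column.
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

def show_iteration(weights_2d, nation_segment, iteration):
    """weights_2d: (7, 7, n_features) lattice; nation_segment: {iso_a3: cell id}."""
    fig, (ax_map, ax_som) = plt.subplots(1, 2, figsize=(14, 5))
    world["segment"] = world["iso_a3"].map(nation_segment).fillna(-1)
    world.plot(column="segment", categorical=True, legend=True, ax=ax_map)
    ax_map.set_title(f"Nation segments, iteration {iteration}")
    # Collapse the feature axis to one intensity per cell just for display.
    ax_som.imshow(weights_2d.mean(axis=-1), cmap="viridis")
    ax_som.set_title("7 x 7 SOM matrix")
    plt.pause(0.5)   # brief pause so successive iterations can be watched
    plt.close(fig)
```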
Considerable effort is made to segment objects on the basis of features. These streams illustrate outcomes where the results have multiple dimensions and those dimensions may represent gradations of features.
- Jeff Heaton, Jeff Heaton's Deep Learning Course, available from https://www.heatonresearch.com/course/, accessed February 2020.
- Tory Walker, Histogram equalizer, available from https://github.com/torywalker/histogram-equalizer, accessed March 2020.
- Maps production, from https://mohammadimranhasan.com/geospatial-data-mapping-with-python/
- Maps using Python, from https://github.com/hasanbdimran/Mapping-in-python
- Using geoPandas, from https://medium.com/@james.moody/creating-html-choropleths-using-geopandas-2b8ced9f1632
- Choropleth generators, from https://github.com/jmsmdy/html-choropleth-generator
- description of issues identified and resolved within specified limitations
- code fragments illustrating the core of how an issue was resolved
- three Python programs which illustrate the use of SOMs in multidimensional arrays
- stream: refers to the overall process of streaming/moving data through input, algorithms, and output of data and its evaluation.
- convergence: since there is no unique solution when grouping objects, convergence is sufficient when further iterations produce no apparent improvement in the assignment of objects to segments.
- limited applicability: the methods described work for a limited set of data and SOM problems.
- bounds of model loss: there is an apparent relationship between mode collapse and model loss - when model loss is extreme (too high or too low) then there is mode collapse.
- Python version 3.7.3
- Numpy version 1.17.3
- Matplotlib version 3.0.3
- geopandas version
- os, sys, datetime, math, time
- urllib, webbrowser
- cartopy version
- Operating system used for development and testing: Windows 10
Creating three SOMs as illustrations, I provide limited working solutions to the following problems (a sketch of the core update rule follows the list):
- classifying the classic allopolyploid Iris dataset in a 3D array
- classifying oil patch data in a 3D array
- classifying nations' wealth and mapping the iterations
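The sketch below shows the kind of update rule such programs rely on, written for a three-dimensional lattice such as the 3 by 3 by 4 grid above. It is a generic illustration of the SOM algorithm (find the best-matching cell for a sample, then pull neighboring cells toward it with a shrinking Gaussian neighborhood), not the text of the three programs themselves.

```python
import numpy as np

def train_som(data, grid_shape=(3, 3, 4), iterations=2000,
              lr0=0.5, sigma0=1.5, seed=0):
    """Train a SOM on `data` (n_samples x n_features) over a 3-D lattice."""
    rng = np.random.default_rng(seed)
    weights = rng.random(grid_shape + (data.shape[1],))
    # Lattice coordinates of every cell, used for neighborhood distances.
    coords = np.argwhere(np.ones(grid_shape, dtype=bool))        # (n_cells, 3)
    flat = weights.reshape(-1, data.shape[1])                     # (n_cells, n_features)

    for t in range(iterations):
        lr = lr0 * (1.0 - t / iterations)                         # decaying learning rate
        sigma = sigma0 * (1.0 - t / iterations) + 1e-3            # shrinking neighborhood
        x = data[rng.integers(len(data))]                         # one random sample
        bmu = np.argmin(((flat - x) ** 2).sum(axis=1))            # best-matching unit
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)            # squared lattice distance
        h = np.exp(-d2 / (2.0 * sigma ** 2))                      # Gaussian neighborhood
        flat += lr * h[:, None] * (x - flat)                      # pull cells toward sample

    return flat.reshape(grid_shape + (data.shape[1],))

# Example usage (assumes scikit-learn is available for the Iris loader):
# from sklearn.datasets import load_iris
# weights = train_som(load_iris().data)
```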
The recommended folder structure looks as follows:
- embedding_derived_heatmaps-master (or any folder name)
- files (also contains three Python programs - program run from here)
- SOM map
- label_results (contains five .h5 generator model files)
- xray
- label_results (contains five .h5 generator model files)
- cgan (contains images from summary analysis of models)
- images (contains images for README file)