Group Members: Xinran Wang, Eason Zhang, Ruican Zhong, and Sophia Jannetty
This repo is deployed at: https://cse512-22sp.pages.cs.washington.edu/Sonar-Principal-Components/
Principal component analysis (PCA) is a powerful dimensionality reduction technique that is useful for extracting temporal patterns from ocean sonar time series data. Current visualization methods show either components in isolation or low-rank approximations of datasets. This makes it difficult for researchers to get a full picture of what is happening in the dataset, and for students to understand the meaning of individual components.
We wanted to create a visualization that helps students understand how principal components combine to make low-rank approximations of data, and why low-rank approximations built from the principal components that capture the most variance resemble the full dataset more closely than approximations built from principal components that capture less variance. We also wanted to let ocean researchers explore the principal components of daily data in Dr. Wu-Jung Lee and Dr. Valentina Staneva’s echosounder water column data, published in their 2020 paper in The Journal of the Acoustical Society of America: “Compact Representation of Temporal Processes in Echosounder Time Series via Matrix Decomposition.”
We used echosounder water column data preprocessed by Dr. Wu-Jung Lee and Dr. Valentina Staneva that can be found in this repo. These data were collected at 37 depths using 3 different frequencies over 62 days. The data are preprocessed so that noise is minimized (using the technique described in Lee and Staneva 2020), and there are 144 data points per depth per frequency per day. Each data point is the volume backscattering strength averaged over one 10-minute period of the day (these averaged values are called the "mean volume backscattering strength," or MVBS). There is an MVBS value for each 10-minute period of each day, at each depth and in each frequency, over the 62 days.
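To make the shape of the data concrete, here is a minimal sketch of the layout described above. The array name `mvbs`, the axis order, and the helper `bin_start_time` are assumptions made for illustration; the preprocessed files in the repo may be organized differently.

```python
import numpy as np

# Dimensions taken from the description above.
N_FREQ, N_DAYS, N_DEPTHS = 3, 62, 37
BIN_MINUTES = 10
N_BINS = 24 * 60 // BIN_MINUTES  # ten-minute bins per day

# Hypothetical placeholder array; real values come from the preprocessed files.
# The axis order (frequency, day, depth, time bin) is an assumption.
mvbs = np.zeros((N_FREQ, N_DAYS, N_DEPTHS, N_BINS))


def bin_start_time(bin_index):
    """Return the (hour, minute) at which a ten-minute bin starts."""
    return divmod(bin_index * BIN_MINUTES, 60)


print(mvbs.shape)           # one MVBS value per frequency, day, depth, and bin
print(bin_start_time(143))  # start time of the last bin of the day
```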
To use PCA to extract daily patterns, we restructured the data so that, for each frequency, the data collected on each day formed its own column. We then ran PCA on this restructured dataset (so that each day would be treated as a separate observation). For each frequency, we exported .txt files containing the data matrices that define all 62 principal components, along with the percent variance captured by each principal component.
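The restructuring and decomposition described above can be sketched roughly as follows. This is not the project's actual preprocessing script: it uses synthetic data, computes PCA through NumPy's SVD, and assumes each depth-time pixel is centered across days. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_depths, n_bins = 62, 37, 144

# Synthetic stand-in for one frequency's MVBS values: (day, depth, time bin).
mvbs_one_freq = rng.normal(size=(n_days, n_depths, n_bins))

# Flatten each day's depth-by-time image into one column: shape (5328, 62).
X = mvbs_one_freq.reshape(n_days, -1).T

# Assumption: center each depth-time pixel across days before decomposing.
mean_day = X.mean(axis=1, keepdims=True)
Xc = X - mean_day

# Thin SVD: the columns of U are the principal components (daily patterns).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Percent variance captured by each component, largest first.
percent_variance = 100 * s**2 / np.sum(s**2)

# Reshape each component back into a depth-by-time image for plotting.
components = U.T.reshape(n_days, n_depths, n_bins)
```

With this centering choice, the last of the 62 components carries essentially no variance, because subtracting the mean day removes one degree of freedom. Whether the exported files were produced with the same centering is not stated here.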
We wanted to show students that low-rank approximations built from principal components that capture more variance in the dataset resemble a full-rank approximation of the data more closely than low-rank approximations built from components that capture less variance. We designed the bottom three graphs of our visualization to show this.
One graph is a scatter plot that depicts the percent variance captured by each principal component. Hovering the mouse over a point causes the point to expand, so users can tell which point they will select if they click. Clicking a point changes its color to indicate that it has been selected. The total percentage of variance captured by all selected principal components is displayed on the graph.
The left echogram below the scatter plot depicts a low-rank approximation of the data using the principal components selected in the scatter plot. The right echogram depicts an approximation of the data using all of the unselected principal components. The percent variance captured by all incorporated principal components appears in each echogram's title, and the approximation rank and selected principal components are listed in a label below each echogram. To orient readers, depth is labeled in meters on the Y axis and time of day is labeled on the X axis. Color legends for the MVBS values are provided to the right of each echogram. We used a diverging color scheme to be consistent with previous echogram visualizations, and to help users identify shapes in the data so they can compare patterns across different low-rank approximations.
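As a rough sketch of how the two echograms relate to the selection, the snippet below sums the selected components (each scaled by its singular value) for the left image, and the unselected ones for the right image. The components and singular values here are synthetic, and the function `approximate` is a hypothetical helper, not code from the deployed visualization.

```python
import numpy as np

rng = np.random.default_rng(1)
n_components, n_depths, n_bins = 62, 37, 144

# Synthetic stand-ins: orthonormal components and decreasing singular values.
Q, _ = np.linalg.qr(rng.normal(size=(n_depths * n_bins, n_components)))
singular_values = np.sort(rng.uniform(1, 10, size=n_components))[::-1]
percent_variance = 100 * singular_values**2 / np.sum(singular_values**2)


def approximate(selected):
    """Sum the selected components, each scaled by its singular value.

    Returns the depth-by-time image and the total percent variance captured.
    """
    idx = np.asarray(sorted(selected), dtype=int)
    image = Q[:, idx] @ singular_values[idx]
    return image.reshape(n_depths, n_bins), percent_variance[idx].sum()


selected = {0, 1, 2}
unselected = set(range(n_components)) - selected

left_image, left_pct = approximate(selected)       # left echogram
right_image, right_pct = approximate(unselected)   # right echogram
full_image, _ = approximate(range(n_components))   # "Select All"

# Squared distance between the left echogram and the full-rank image.
residual = np.sum((full_image - left_image) ** 2)
```

Because the components are orthonormal, `residual` equals the sum of the squared singular values of the unselected components. Selecting the components that capture the most variance therefore leaves the smallest gap between the left echogram and the full-rank image.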
We wanted to let researchers explore all frequencies in the dataset, but we decided to prioritize comparison between different low-rank approximations over comparison between frequencies. We therefore chose to visualize one frequency at a time and to provide a drop-down selector so users can choose which frequency to explore.
Helpful feedback from Dr. Valentina Staneva led us to add several components to our visualization. Dr. Staneva suggested we give students a way to quickly view a full-rank approximation of the data (and to explore the data by removing principal components instead of adding them). In response, we added a "Select All" button and a complementary "Clear All" button. These buttons let users quickly select (or deselect) all principal components and see the resulting data approximations. We put all of the interactive components of the visualization (the scatter plot, buttons, and frequency drop-down selector) above the responding echograms to help users identify what is and isn't interactive. Dr. Staneva also found it unclear how the data we visualized related back to the full dataset. To address this, we added a static echogram of the full dataset to the top of the visualization, along with explanatory text to help orient readers to the data. The full-dataset echogram changes only when users change the selected frequency. The X axis is labeled with the day number and the Y axis is labeled with the depth of each data point in meters. A color legend for the MVBS values is provided to the right of the echogram. This echogram takes a little while to load, so we added a message that appears while it is loading. Finally, Dr. Staneva suggested we add more explanatory information. We added explanatory text above and below the visualizations to help users understand the data behind the visualization, the purpose of the visualization, and its interactive components.
There are several challenging concepts surrounding principal component analysis. We chose to focus on explaining why low-rank approximations that use principal components capturing more variance more closely resemble full-rank approximations of the data. We did this in the context of an actual use case of PCA (extracting daily patterns from large echosounder datasets). However, we do not then reconstruct the full dataset by multiplying the principal components by the loadings for each day. Essentially, our visualization assumes that readers understand enough about principal component analysis to see why we ran the PCA with each day as a different observation (to extract daily patterns) and, consequently, why the full-rank approximation of one day's worth of data does not have the same dimensions as the full dataset. Having both visualizations (the low-rank approximations we provided and low-rank approximations of the full dataset that incorporate the loadings for each day) would aid student understanding.
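For readers curious about the step our visualization leaves out, the sketch below shows one way a full 62-day dataset could be reconstructed from the components and the per-day loadings. It uses synthetic data and NumPy's SVD, and assumes each pixel is centered across days; the helper `reconstruct` is hypothetical and is not part of this project.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_pixels = 62, 37 * 144

# Synthetic stand-in for one frequency: one column per day.
X = rng.normal(size=(n_pixels, n_days))
mean_day = X.mean(axis=1, keepdims=True)  # assumed centering across days
U, s, Vt = np.linalg.svd(X - mean_day, full_matrices=False)


def reconstruct(rank):
    """Rebuild every day from the first `rank` components.

    Row k of Vt holds the per-day loadings of component k, so each day is a
    weighted sum of the singular-value-scaled components plus the mean day.
    """
    return mean_day + (U[:, :rank] * s[:rank]) @ Vt[:rank]


# Reconstruction error shrinks as more components are included.
errors = [np.linalg.norm(X - reconstruct(r)) for r in (1, 10, 30, 62)]
```

Unlike the echograms in our visualization, each reconstruction here has one column per day, so it has the same dimensions as the full dataset for that frequency.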
Additionally, our TA Philip Garrison made an interesting point: using a diverging color scale may obscure the fact that the pixel values of the principal components (scaled by their singular values) are summed when constructing a low-rank approximation of the data. A sequential color scale would have made this point clearer. However, a sequential color scale would have differed from what researchers in the field currently use. It would also have made it harder for students familiar with diurnal patterns of plankton in the water column to recognize that the patterns in principal components capturing high amounts of variance are familiar, while the patterns in less significant principal components are unfamiliar. Future efforts to explain principal component analysis through visualization tools should consider using sequential color scales.
Lee, Wu-Jung, and Valentina Staneva. 2020. “Compact Representation of Temporal Processes in Echosounder Time Series via Matrix Decomposition.” The Journal of the Acoustical Society of America 148 (6). United States: 3429–42. doi:10.1121/10.0002670. With the accompanying GitHub repo
All group members worked on all parts of the implementation. Ruican spent more time on the scatter plot, Eason on the echograms, Xinran on the interactions, and Sophia on the back-end calculations. However, all members contributed to all components of the project. Each group member spent about 20 hours working on this project.