Federated Principal Component Analysis (FPCA) is proposed to compute the principal components of a given data set in a distributed fashion, where there are several workers (distributed devices) and one master (a data center). This framework, called federated learning, allows a user to obtain the global model without the workers ever sharing their local data, which is essential for protecting data privacy.
An optimization algorithm called the alternating direction method of multipliers (ADMM) is used to achieve federated learning. To start with, the objective function of PCA can be formulated as:
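A sketch of this formulation, assuming the projection-approximation objective referenced later in this document, where $X_i$ is the data held by worker $i$, $w_i$ is its local first PC, and $z$ is the shared global model (the notation here is ours):

$$
\min_{\{w_i\},\, z}\ \sum_{i=1}^{N} \left\lVert X_i - X_i w_i w_i^\top \right\rVert_F^2
\qquad \text{s.t.}\quad w_i = z,\quad i = 1, \dots, N.
$$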
By ADMM we can then derive the worker's and master's algorithms, where z denotes the global model and u is the dual variable introduced by ADMM:
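Under standard scaled-form consensus ADMM with penalty parameter $\rho$, the updates would look like the sketch below (the exact subproblem solved by each worker may differ in the original implementation):

$$
\begin{aligned}
w_i^{k+1} &= \arg\min_{w_i}\ \left\lVert X_i - X_i w_i w_i^\top \right\rVert_F^2 + \frac{\rho}{2}\left\lVert w_i - z^k + u_i^k \right\rVert_2^2 && \text{(worker $i$, local)}\\
z^{k+1} &= \frac{1}{N}\sum_{i=1}^{N}\left( w_i^{k+1} + u_i^k \right) && \text{(master, aggregate and broadcast)}\\
u_i^{k+1} &= u_i^k + w_i^{k+1} - z^{k+1} && \text{(worker $i$, dual update)}
\end{aligned}
$$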
We can illustrate the idea of FPCA by the toy example below:
In the toy example, there are 10 workers, each holding its own share of the data. The scattered dots are the global data, which is assumed never visible in full to any single worker. The thick yellow line represents the first global PC, and the other 10 lines are the first PCs computed by each worker. At the beginning (iter = 0), the PC each worker initializes points in a different direction from the others. By iteration 10, however, the workers have already reached a consensus solution without ever seeing the whole data.
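The following MATLAB sketch reproduces the spirit of this toy example under the consensus-ADMM updates above; the penalty `rho`, the step size, the projected-gradient inner solver, and all variable names are our illustrative choices rather than the original code:

```matlab
% Minimal FPCA sketch: 10 workers, 2-D data, first PC via consensus ADMM.
rng(0);
N = 10; d = 2; m = 30;                    % 10 workers, 2-D local data
rho = 10; step = 1e-3; inner = 50;        % illustrative hyperparameters
X = cell(N,1); W = cell(N,1); U = cell(N,1);
for i = 1:N
    X{i} = randn(m,1) * [2 1] + 0.3*randn(m,d);  % correlated local data
    w = randn(d,1); W{i} = w / norm(w);          % random initial local PC
    U{i} = zeros(d,1);                           % scaled dual variable
end
z = randn(d,1); z = z / norm(z);                 % initial global model

for k = 1:10                                     % 10 FPCA rounds
    for i = 1:N                                  % --- worker update ---
        S = X{i}' * X{i};                        % local scatter matrix
        for t = 1:inner                          % projected gradient on
            g = -2*S*W{i} + rho*(W{i} - z + U{i});   % -w'Sw + (rho/2)||w-z+u||^2
            W{i} = W{i} - step*g;
            W{i} = W{i} / norm(W{i});            % keep unit norm
        end
        if W{i}' * z < 0, W{i} = -W{i}; end      % heuristic: fix PC sign
    end
    z = mean([W{:}] + [U{:}], 2);                % --- master: average ---
    z = z / norm(z);                             % broadcast unit-norm model
    for i = 1:N
        U{i} = U{i} + W{i} - z;                  % --- dual update ---
    end
end
```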
- macOS: 11.2.1
- CPU: Intel Core i5 at 2.9GHz
- RAM: 8 GB 2133 MHz LPDDR3
- Matlab: 2019b
- Iris: http://archive.ics.uci.edu/ml/datasets/Iris/
- Breast cancer: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- Bank: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
- Synthetic: https://archive.ics.uci.edu/ml/datasets/synthetic+control+chart+time+series
We compare our FPCA method against SVD, currently the most common way to compute PCs. The evaluation metric is the cosine similarity between the first PC derived by FPCA and the one given by SVD. The comparison is performed on three real datasets and one synthetic dataset. Each experiment is repeated 10 times; the results are shown in the convergence plots below:
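A hypothetical check of this metric, continuing from the toy-example sketch above (the cell array `X` and consensus vector `z` are taken from that code):

```matlab
% Cosine similarity between the FPCA consensus z and the first
% right singular vector of the pooled, centered data.
Xall = cat(1, X{:});                     % stack all workers' samples
Xc   = Xall - mean(Xall, 1);             % center, as PCA requires
[~, ~, V] = svd(Xc, 'econ');             % SVD of the global data
cosSim = abs(z' * V(:, 1));              % both vectors are unit norm
fprintf('cosine similarity to SVD PC = %.4f\n', cosSim);
```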
In our work, PCA is adapted to a federated learning setting, which not only solves the distributed machine learning problem but also ensures data privacy. Federated PCA is built by formulating the projection approximation method as a consensus optimization problem and solving it with ADMM. In the application scenario, the data is distributed across multiple workers, which compute their local models and submit them to a master; the master integrates the models and broadcasts the result back to the workers. Throughout training the data is always kept private and unshared, which guarantees data privacy. In addition, when a dataset is too large to be stored on a single device, the federated setting allows PCA to be carried out across multiple devices. Through a series of experiments we find that the first PC computed by FPCA is consistent with the one computed by SVD, which validates our method.