<?xml version="1.0" encoding="utf-8"?>
<search>
<entry>
<title>WWW21-Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data</title>
<link href="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/"/>
<url>/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/</url>
<content type="html"><![CDATA[<h1 id="WWW21-Characterizing-Impacts-of-Heterogeneity-in-Federated-Learning-upon-Large-Scale-Smartphone-Data"><a href="#WWW21-Characterizing-Impacts-of-Heterogeneity-in-Federated-Learning-upon-Large-Scale-Smartphone-Data" class="headerlink" title="WWW21-Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data"></a>WWW21-Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data</h1><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225204238693.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225204238693.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225204238693"></p><p>FedScale用的就是这篇文章的数据做了个 异构性的数据生成。</p><p>数据的Ethic consideration处理可以参考这篇。</p><p><strong>摘要:</strong></p><ul><li>异构性对FL的训练过程有影响,例如导致设备训练的时候unavailable或者不能够上传模型的更新。</li><li>本文:对FL中设备异构性做的第一个empirical study。收集了从136k手机收集到的能够真实的反应真实世界异构性的数据。建立了一个heterogeneity-aware的FL平台,这个平台符合标准的FL,但是考虑了异构性。</li><li>比较了当前sota的FL算法在有异构和无异构上的表现。结果显示异构性会导致FL中non-trivial的性能下降,9.2%的精度下降,2.32x长的训练时间,undermined fairness。</li><li>分析了潜在的影响银子发现device failure和participant bias是两个导致performance degradation的因子。</li></ul><p><strong>使用框架LEAF实现。</strong></p><h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p>当前的方法例如FedAvg、Structured Updates和q-FedAvg常常用模拟的方法去测试。然而,模拟的时候常常过度假设,所有的设备在训练的时候都是available的,硬件的配置都是一个样的。</p><p>异构性分为:</p><ul><li>硬件异构性。(不同的CPU、RAM和电池寿命)</li><li>状态异构性。(CPU的busy/free,服务器的稳定或者不稳定的网络连接)等等。</li></ul><p>主要做的事:</p><ul><li><p>开发了一个和当前主流的FL范式符合的holistic platform,第一次加速了在设备的异构性下的FL算法的开发。收集了136k手机用户在一个商业输入法app(IMA)上的额数据,然后把这个数据plug在开发的平台上模拟设备的状态和硬件异构性。</p></li><li><p>比较了当前sota的FL算法在有异构和无异构上的表现。采用的数据集是4个经典的FL任务下的四个数据集(三个常用的数据集和一个IMA的数据集)。</p></li><li><p>发现:对accuracy , training time和fairness都有影响。对FedAvg、压缩算法、聚合算法都有影响,例如异构性会导致q-FedAvg产生处理fairness的问题,压缩算法很难起作用,最坏的时候传输时间被提升了3.5x。</p></li><li><p>潜在因子的分析:(1)DevIce Failure,11.6%的设备不能成功上传模型更新,会导致模型收敛变慢,浪费宝贵的硬件资源。(2)Participants bias,模型收敛的时候,more than 30%的设备从没参加学习的过程,模型的收敛被active的devices支配。(30%的device占据了81%的计算)。</p></li></ul><h3 id="Background"><a href="#Background" class="headerlink" title="Background"></a>Background</h3><p>异构性对FL十分重要。很多算法处理异构性的时候并不严谨,例如:</p><ul><li>FedProx通过允许每个参与者perform一系列的工作,模拟硬件的异构性,但是每个硬件的capability 是随机设置的,硬件状态的改变也没有考虑。</li><li>FedCS通过基于设备的资源条件管理设备,允许服务器聚集许多的设备更新,但是假定网络是稳定的,不会产生拥塞,在5-500秒的范围内随机设置了训练的时间。</li></ul><p>总之,人家有问题</p><h2 id="THE-MEASUREMENT-APPROACH(度量的方法)"><a href="#THE-MEASUREMENT-APPROACH(度量的方法)" class="headerlink" title="THE MEASUREMENT APPROACH(度量的方法)"></a>THE MEASUREMENT APPROACH(度量的方法)</h2><h3 id="整个的实验流程:"><a href="#整个的实验流程:" class="headerlink" title="整个的实验流程:"></a>整个的实验流程:</h3><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225214007200.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225214007200.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225214007200"></p><h3 id="数据集"><a href="#数据集" class="headerlink" title="数据集"></a>数据集</h3><p><strong>Device state traces数据:</strong></p><p><img 
src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225214425651.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225214425651.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225214425651"></p><ul><li>2020年1月31日开始一周的数据。</li><li>136k的设备的轨迹,包含180million的state entries和111GB的存储。</li></ul><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225215655934.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225215655934.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225215655934"></p><ul><li>轨迹只有在available interval才有用。</li></ul><p><strong>算力数据:</strong></p><p>超过1000种设备,进行聚类,具体是通过mapping:</p><ul><li>(1) The total device models are first mapped to the device models profiled by AI-Benchmark, a comprehensive AI performance benchmark. For a few device models that AI-Benchmark does not cover, we make a random mapping. It reduces the number of device models to 296.</li><li>(2) The remaining device models are then mapped to what we afford to profile -> three representative and widely-used device models (Samsung Note 10, Redmi Note 8, and Nexus 6).</li><li>(3) To profile these devices, we run on-device training using the open-source ML library DL4J [15] and record their training time for each ML model used in our experiments.</li></ul><p><strong>通信数据(使用志愿者模拟):</strong></p><p>we recruit 30 volunteers and deploy a testing app on their devices to periodically obtain (i.e., every two hours) the downstream/upstream bandwidth between the devices and a cloud server.</p><p><strong>benchmark:</strong></p><ul><li><p>三个合成数据集(Reddit、Femnist和Celeba)。Femnist和Celeba图像分类,Reddit和MType是next-word预测任务,分别使用CNN和LSTM模型。 </p></li><li><p>IMA的真实输入数据集。</p></li></ul><h3 id="平台的模拟"><a href="#平台的模拟" class="headerlink" title="平台的模拟"></a>平台的模拟</h3><p>根据Google的报告配置了FL system:</p><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225221539154.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225221539154.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225221539154"></p><ul><li>具体的平台设置详见原文。</li></ul><h3 id="实验的设置"><a href="#实验的设置" class="headerlink" title="实验的设置"></a>实验的设置</h3><p><strong>算法的配置:</strong></p><ul><li>基本的算法:FedAvg</li><li>聚合算法:q-FedAvg、FedProx。</li><li>压缩算法:Structured Updates、Gradient Dropping (GDrop)、SignSGD。</li></ul><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225223002188.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225223002188.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225223002188"></p><p><strong>Metric:</strong></p><ul><li>Convergence accuracy</li><li>Training 
time/round</li><li>Compression ratio</li><li>Variance of accuracy</li></ul><p><strong>实验环境:</strong></p><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225223056029.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225223056029.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225223056029"></p><h2 id="实验结果"><a href="#实验结果" class="headerlink" title="实验结果"></a>实验结果</h2><h3 id="Impacts-on-Basic-Algorithm’s-Performance"><a href="#Impacts-on-Basic-Algorithm’s-Performance" class="headerlink" title="Impacts on Basic Algorithm’s Performance"></a>Impacts on Basic Algorithm’s Performance</h3><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225224747777.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225224747777.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225224747777"></p><ul><li>异构性对FL精度的下降影响挺大。</li><li>异构性会明显的降低FL训练过程,会增加<strong>训练时间</strong>和<strong>训练轮次</strong>。</li></ul><h3 id="Impacts-on-Advanced-Algorithms’-Performance"><a href="#Impacts-on-Advanced-Algorithms’-Performance" class="headerlink" title="Impacts on Advanced Algorithms’ Performance"></a>Impacts on Advanced Algorithms’ Performance</h3><p>FedProx允许设备依据系统资源做训练工作。也对local optimization增加了一个proximal term。由于q-FedAvg和FedProx的优化目标不同,分开比较,使用的baseline为FedAvg。</p><p><strong>q-FedAvg的结果</strong>如Table 3所示。</p><ul><li><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225225252617.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225225252617.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225225252617"></li></ul><p><strong>FedProx的结果</strong>如图Figure-5所示。</p><ul><li><p>发现:(1)q-FedAvg that is supposed to address fairness issues is less effective in ensuring fairness under heterogeneity-aware settings. 
</p><p> <img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225225616014.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225225616014.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225225616014"></p></li><li><p>发现:(2)FedProx is less effective in improving the training process with heterogeneity considered</p></li></ul><p><strong>梯度压缩算法:</strong></p><p>Structured Updates、Gradient Dropping (GDrop)、SignSGD。(具体的设置看原文)</p><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225230119319.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225230119319.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225230119319"></p><ul><li>Heterogeneity introduces a similar accuracy drop to compression algorithms as it does to the basic algorithm</li><li>Gradient compression algorithms can hardly speed up the model convergence under heterogeneity-aware settings</li></ul>
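<p>结合上文的平台与deadline设定,下面给出一个极简的Python示意(假设性代码,非原文实现;<code>trace</code>、<code>device_finishes_round</code>等名称均为说明而虚构),说明heterogeneity-aware的模拟平台可以如何判定一个round内的device failure:设备需要在deadline内完成download、train、upload,且全程处于available状态:</p><pre><code class="python"># 假设性的示意代码(非原文实现):判断设备能否在deadline内完成一个round
def device_finishes_round(trace, t_start, deadline,
                          download_s, train_s, upload_s):
    """trace: 设备的available时间区间列表[(begin, end), ...],来自state trace;
    三段耗时(秒)由硬件算力与带宽数据估计。"""
    t_end = t_start + download_s + train_s + upload_s
    if t_end - t_start &gt; deadline:
        return False  # 超时,对应interruption/training failure
    # 必须存在一个available区间完整覆盖本轮,否则视为device failure
    return any(b &lt;= t_start and t_end &lt;= e for (b, e) in trace)
</code></pre><p>在这样的判定下逐轮统计成功上传的设备比例,即可观察device failure随deadline设置变化的趋势。</p>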
data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225231747928.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225231747928"></p><ul><li>Hardware heterogeneity leads to more device failure than state heterogeneity.</li></ul><p><strong>Participant</strong></p><p>Participant bias refers to the phenomenon that devices do not participate in FL with the same probability. It can lead to different contributions to the global model, thus making some devices underrepresented.</p><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225232028818.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225232028818.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225232028818"></p><ul><li>The computation loads get more uneven under heterogeneityaware settings.</li><li>The number of inactive devices increases significantly under heterogeneity-aware settings.</li><li>Up to 30% devices have not participated in FL process when the global model reaches the target accuracy under heterogeneityaware settings.</li></ul><p><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225232124678.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225232124678.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225232124678"></p><ul><li><img src="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225232147128.png" class="lazyload placeholder" data-srcset="/2022/12/25/www21-characterizing-impacts-of-heterogeneity-in-federated-learning-upon-large-scale-smartphone-data/image-20221225232147128.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225232147128"></li><li>State heterogeneity is more responsible for participant bias</li></ul>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About Empirical Study </tag>
</tags>
</entry>
<entry>
<title>MobiSys22-FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients</title>
<link href="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/"/>
<url>/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/</url>
<content type="html"><![CDATA[<h1 id="MobiSys22-FedBalancer-Data-and-Pace-Control-for-Efficient-Federated-Learning-on-Heterogeneous-Clients"><a href="#MobiSys22-FedBalancer-Data-and-Pace-Control-for-Efficient-Federated-Learning-on-Heterogeneous-Clients" class="headerlink" title="MobiSys22-FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients"></a>MobiSys22-FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients</h1><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225144005529.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225144005529.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225144005529"></p><p><strong>收获:</strong></p><p>该优化方法精心设计了在客户端中进行数据选择的策略,Oort精心设计了进行device选择的策略,Hermes精心设计了在本地Model上进行的剪枝操作(相当于考虑到了本地的数据),FedNAS运用神经搜索的方法搜索最优的结构。</p><p><strong>疑问:</strong></p><ul><li>是否缺少对模型本身和FL的特点都进行优化的工作?</li><li>FL如何流水线化?当前的工作是否流水线化?client-server的结构是client等server还是server等client?</li><li>FL每个阶段的方法划分?</li><li>这些论文用的数据集和评测结果。</li></ul><p><strong>摘要:</strong></p><ul><li><strong>经典的FL训练方案</strong>:把所有数据同等对待,导致计算资源的浪费,降低了全局的学习过程。</li><li><strong>FedBalancer</strong>:主动选择clients’的训练样本,兼顾算力和隐私的同时prioritize more “informative” data。介绍了自适应的deadline control scheme去预测每一个轮次的optimal deadline。</li><li><strong>评测结果</strong>:五个数据集、三个不同的domain,FedBalancer提升了time-to-accuracy的性能1.20<del>4.48x,提升了模型的精度1.1</del>5.0%。FedBalancer可以和其它的FL approach集成到一块。</li></ul><p><strong>FedBalancer:</strong></p><ul><li><p>prioritize more “informative” samples去高效的利用他们的computational effort。这使得low-end devices在round-deadline内对global training产生贡献,因为low-end devices聚焦于更小的但是更重要的training samples。</p></li><li><p>为了获得更高的time-to-accuracy performance,the sample selection被设计为在FL rounds的sample utility measurement的时候operate without additional forward or backward pass。</p></li><li><p>FedBalancer能够和其它的FL approach一起使用。</p></li></ul><p><strong>解决的挑战:</strong></p><ul><li>简单的随机采样会导致模型精度的下降,因为训练数据 的statistical utility会下降,因此要根据数据的statistical utility measurement去选择samples。</li><li>sample selection时候收集sample-level的statistical utility会打破FL的privacy guarantee。提出了client-sever coordination的方法去维护一个loss threshold,这允许clients去高效的选择重要的samples同时只暴露他们differentially-private的数据。</li><li>当FL的round deadline固定的时候,光样本选择可能不会导致time-to-accuracy的性能提升。为了formulate the benefit of selecting different deadlines,我们提出了一个metric deadline efficiency (DDL-E)的方法,该方法计算每次计算完成round的clients的数量。</li></ul><p>在FLASH上实现(FLASH是一个基于LEAF实现的框架)。FLASH提供了从135k手机收集到的算力和真实世界轨迹的数据集,涉及千种设备,2062行代码实现,使用的aggregation算法是PROX。</p><h2 id="Background(当前有什么问题)"><a href="#Background(当前有什么问题)" class="headerlink" title="Background(当前有什么问题)"></a>Background(当前有什么问题)</h2><p><strong>优化time-to-accuracy的性能十分重要。</strong></p><ul><li>对用户来说:FL花费了边端用户设备的巨大的计算和网络资源。</li><li>对模型的开发者来说:在几千到几百万的设备上快速的收敛对高效的测试多个模型架构和超参很有用 。</li><li>对于服务的提供者来说:不断地模型地更新要求减小用户的overhead,同时带来更高的time-to-accuracy的性能。</li></ul><p><strong>异构性带来的问题:</strong></p><ul><li>centralized learning采用importance sampling去优化训练的过程,但是FL中还很少采用。</li><li>Preliminary study显示the ratio of informative samples随着FL训练过程的推进从93.2%减小到20%,硬件的异构性会导致这种状况进一步恶化,因为low-end的clients可能输送模型更新失败。</li></ul><h2 id="Motivation(Fedbalancer的灵感来源)"><a href="#Motivation(Fedbalancer的灵感来源)" class="headerlink" 
title="Motivation(Fedbalancer的灵感来源)"></a>Motivation(Fedbalancer的灵感来源)</h2><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225153112536.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225153112536.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225153112536"></p><ul><li>很少有人这么干。</li><li>样本的GN(Gradient Norm)在FL的开始都很高,但是在FL训练的后期,只有很少的样本有很高的GN。</li><li>这<strong>启发FedBalancer开始用所有的样本训练,之后不再使用模型已经学到的样本</strong>。</li></ul><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225153702160.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225153702160.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225153702160"></p><ul><li>尽管获得最短训练事件的optimal的deadline已被证明存在(见原文引文),为high time-to-accuracy控制deadline没有被研究过。</li><li>图显示:deadline对取得最快的收敛速度和更高的精度很重要。没有任何一个方法在各种任务上能够完胜别的方法。</li><li>找到一个最优的deadline十分重要。</li></ul><h2 id="FEDBALANCER(模型的具体的设计)"><a href="#FEDBALANCER(模型的具体的设计)" class="headerlink" title="FEDBALANCER(模型的具体的设计)"></a>FEDBALANCER(模型的具体的设计)</h2><p>FL的每一个round,FedBalancer自适应的选择client的训练数据,控制deadline,以取得更高的time-to-accuracy的性能。</p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225165926210.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225165926210.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225165926210"></p><ul><li>Fedbalancer主动控制**(1) loss threshold (lt)** 和**(2) deadline (ddl)**。</li><li>(1) loss threshold:决定每一个client的训练数据。</li><li>(2) deadline决定the round的终止事件。</li><li>①:服务器首先把当前的模型的权重$W_R$,loss的threshold $lt_R$,deadline $ddl_R$(R代表第R个轮次)传送到该轮次被选定的clients。</li><li>②:device上的sample selection模块选择在loss threshold的部分的训练数据。</li><li>③:device上训练received model。</li><li>④:client奖模型的更新和sample selection收集到的metadata输送给服务器。</li><li>⑤:服务器聚合来自所有clients的responses。</li><li>⑥:根据clients的metadata,loss threshold selection module和deadline selection module分别选择下一个round的$lt_{R+1}$和deadline $ddl_{R+1}$。</li></ul><h3 id="Sample-selection-module(样本选择模块的具体算法)"><a href="#Sample-selection-module(样本选择模块的具体算法)" class="headerlink" title="Sample selection module(样本选择模块的具体算法)"></a>Sample selection module(样本选择模块的具体算法)</h3><p>不想让clients暴露自己的隐私就需要client使用本地数据中对样本的重要性进行分类,而不暴露任何的信息,需要解决的难点就是clients很难在不知道global data的前提下确定什么样本重要什么样本不重要。于是FedBalancer就搞了个client-server coordination维护一个loss threshold $lt_R$。假设client i的样本选择模块工作在round R,有一个给定的loss threshold $lt_R$。</p><p><strong>流程:</strong></p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225172400845.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225172400845.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" 
alt="image-20221225172400845"></p><ul><li>首先module这个轮次是否需要sample selection,也即是判断这个client是否可以在给定的$ddl_R$内训练完他完整的数据集。给予client一定的自主权,如果这个client可以飞快的训完,那么就不管他的sample重要不重要了。</li><li>于是可以计算出client i在deadline之前的最大的样本数量S,验证其是否大于client dataset的大小$D^i$。(2-4行)。由于client的计算能力可能会根据device的runtime condition改变(见原文引文),FedBalancer收集batch训练的延迟,也就是$B^i$,<strong>使用平均延迟来估计他能够处理的最大的样本数</strong>。为了在第一个round计算平均的延迟,FedBalancer要求clients在FL开始前采样k次,这次k设置为10.</li><li>如果client i可以训练完整的数据集,sample selection模块就会使用the list of sample loss ($loss^i$)来决定用什么样本去训练。the loss list展示了当前模型所有样本的statistical utility,这个loss list可以通过推导所有样本在每个轮次的最新模型得到,这样的推导会带来额外的forward pass latency,可能会降低time-to-accuracy的性能。因此,FedBalancer的clients只做整个数据集的forward pass一次,也就是在这个client被第一次选中的时候做一次。每当他们训练数据的subset的时候,就会更新选中的sample的loss value。</li><li>然后FedBalancer根据the list of sample loss选择client sample。</li><li>如何选择client samples?首先将client i’s samples划分为两个groups,一个是Under-Threshold ($UT^i$),一个是Over-Threshold ($OT^i$),样本根据threshold分别放入$UT^i$和$OT^i$。总共从$OT^i$中采样$L\cdot p$个样本,从$UT^i$中采样$L\cdot (1-p)$个样本,$p$在$[35,78]$之间取值。要采样点的样本数$L$,也就是$len(OT^i)$. Loss的threshold会慢慢的增加。如果S大于$len(OT^i)$,改用S去最大化statistical utility within the deadline。</li><li>FedBalancer是在Prox的基础上做出来的,允许clients训练的epoch数量少一些,因此clients with S less than $len(OT^i)$任然可以对模型的更新起作用。</li></ul><h3 id="Loss-threshold-selection-module(Loss的阈值选择模块)"><a href="#Loss-threshold-selection-module(Loss的阈值选择模块)" class="headerlink" title="Loss threshold selection module(Loss的阈值选择模块)"></a>Loss threshold selection module(Loss的阈值选择模块)</h3><p>Loss值的分布会随着训练的变化而变化,module对当前分布知晓是十分重要的。为了了解这种分布的变化,服务器会在每个round结束后从loss list of clients收集一些metadata,也就是client i在第R个轮次的$LLow^i_R$和$LHigh^i_R$80%分位数(防止异常噪声)。在这个值上加上高斯噪声保护用户隐私。这些值在server上经过进一步的aggregation得到$LLow_R$和$LHigh$。</p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225192502196.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225192502196.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225192502196"></p><ul><li>loss threshold ratio (ltr),控制FedBalancer gradually increase the value by loss threshold step size (lss)。</li></ul><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225192922803.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225192922803.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225192922803"></p><ul><li>ltr初始化为0,然后慢慢的增加loss threshold step size (lss), deadline ratio (lss) 也是由ltr控制的。</li><li>具体的控制原理见原文。</li></ul><h3 id="Client-selection-with-sample-selection(结合sample-selection去选择client)"><a href="#Client-selection-with-sample-selection(结合sample-selection去选择client)" class="headerlink" title="Client selection with sample selection(结合sample selection去选择client)"></a>Client selection with sample selection(结合sample selection去选择client)</h3><p>提出一个新的formulation去计算statistical utility of a client i:</p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225193338918.png" class="lazyload placeholder" 
data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225193338918.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225193338918"></p><ul><li>为什么这么做,看原文。</li></ul><h3 id="Adaptive-Deadline-Control(自适应的deadline控制)"><a href="#Adaptive-Deadline-Control(自适应的deadline控制)" class="headerlink" title="Adaptive Deadline Control(自适应的deadline控制)"></a>Adaptive Deadline Control(自适应的deadline控制)</h3><p>详情看原文。</p><p><strong>Efficiency of a deadline</strong>:deadline的效率</p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225193439019.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225193439019.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225193439019"></p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225193511504.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225193511504.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225193511504"></p><p><strong>Deadline selection module</strong>:deadline的选择模块</p><h3 id="Collaboration-with-FL-Methods(和其它FL方法的结合)"><a href="#Collaboration-with-FL-Methods(和其它FL方法的结合)" class="headerlink" title="Collaboration with FL Methods(和其它FL方法的结合)"></a>Collaboration with FL Methods(和其它FL方法的结合)</h3><p>把FedBalancer可以通过简单的把sample selection和deadline control stategies加到其它的方法上即可。</p><p>在三个FL方法上做了实现。</p><p>其中<strong>Oort</strong>每个本地的epoch使用了一个batch而不是完整的数据集去训练,使得直接集成费了点劲,于是搞了个OortBalancer,FedBalancer做了简单的调整。(怎么调整的看原文)。</p><h2 id="EVALUATION(实验)"><a href="#EVALUATION(实验)" class="headerlink" title="EVALUATION(实验)"></a>EVALUATION(实验)</h2><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225194132127.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225194132127.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225194132127"></p><h3 id="Experimental-Setup(实验的设置)"><a href="#Experimental-Setup(实验的设置)" class="headerlink" title="Experimental Setup(实验的设置)"></a>Experimental Setup(实验的设置)</h3><p><strong>数据集:</strong></p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195030720.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195030720.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195030720"></p><p><strong>评价指标:</strong></p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195054907.png" class="lazyload placeholder" 
data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195054907.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195054907"></p><p><strong>Baseline:</strong></p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195220397.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195220397.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195220397"></p><p><strong>实验方法:</strong></p><p><strong>其它配置:</strong></p><h3 id="Speedup-and-Accuracy-on-Five-FL-Tasks(实验结果)"><a href="#Speedup-and-Accuracy-on-Five-FL-Tasks(实验结果)" class="headerlink" title="Speedup and Accuracy on Five FL Tasks(实验结果)"></a>Speedup and Accuracy on Five FL Tasks(实验结果)</h3><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195309266.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195309266.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195309266"></p><ul><li>实验的结果</li></ul><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195453798.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195453798.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195453798"></p><ul><li>实验所用的参数</li></ul><h3 id="Parameter-Sensitivity-Analysis"><a href="#Parameter-Sensitivity-Analysis" class="headerlink" title="Parameter Sensitivity Analysis"></a>Parameter Sensitivity Analysis</h3><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195511507.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195511507.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195511507"></p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195600594.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195600594.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195600594"></p><h3 id="Effect-of-FedBalancer-Components"><a href="#Effect-of-FedBalancer-Components" class="headerlink" title="Effect of FedBalancer Components"></a>Effect of FedBalancer Components</h3><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195618034.png" class="lazyload placeholder" 
data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195618034.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195618034"></p><h3 id="Collaboration-with-FL-Algorithms"><a href="#Collaboration-with-FL-Algorithms" class="headerlink" title="Collaboration with FL Algorithms"></a>Collaboration with FL Algorithms</h3><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195805163.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195805163.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195805163"></p><h3 id="Testbed-Experiments-with-Android-Clients"><a href="#Testbed-Experiments-with-Android-Clients" class="headerlink" title="Testbed Experiments with Android Clients"></a>Testbed Experiments with Android Clients</h3><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195719312.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195719312.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195719312"></p><p><img src="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195826714.png" class="lazyload placeholder" data-srcset="/2022/12/24/mobisys-fedbalancer-data-and-pace-control-for-efficient-federated-learning-on-heterogeneous-clients/image-20221225195826714.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221225195826714"></p>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About DataSelecting </tag>
</tags>
</entry>
<entry>
<title>arXiv22-FEDNAS: FEDERATED DEEP LEARNING VIA NEURAL ARCHITECTURE SEARCH</title>
<link href="/2022/12/24/arxiv22-fednas-federated-deep-learning-via-neural-architecture-search/"/>
<url>/2022/12/24/arxiv22-fednas-federated-deep-learning-via-neural-architecture-search/</url>
<content type="html"><![CDATA[<h1 id="arXiv22-FEDNAS-FEDERATED-DEEP-LEARNING-VIA-NEURAL-ARCHITECTURE-SEARCH"><a href="#arXiv22-FEDNAS-FEDERATED-DEEP-LEARNING-VIA-NEURAL-ARCHITECTURE-SEARCH" class="headerlink" title="arXiv22-FEDNAS: FEDERATED DEEP LEARNING VIA NEURAL ARCHITECTURE SEARCH"></a>arXiv22-FEDNAS: FEDERATED DEEP LEARNING VIA NEURAL ARCHITECTURE SEARCH</h1>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About NAS </tag>
</tags>
</entry>
<entry>
<title>SenSys21-FedMask: Joint Computation and Communication-Efficient Personalized Federated Learning via Heterogeneous Masking</title>
<link href="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/"/>
<url>/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/</url>
<content type="html"><![CDATA[<h1 id="SenSys21-FedMask-Joint-Computation-and-Communication-Efficient-Personalized-Federated-Learning-via-Heterogeneous-Masking"><a href="#SenSys21-FedMask-Joint-Computation-and-Communication-Efficient-Personalized-Federated-Learning-via-Heterogeneous-Masking" class="headerlink" title="SenSys21-FedMask: Joint Computation and Communication-Efficient Personalized Federated Learning via Heterogeneous Masking"></a>SenSys21-FedMask: Joint Computation and Communication-Efficient Personalized Federated Learning via Heterogeneous Masking</h1><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224150046850.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224150046850.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224150046850"></p><p>本篇工作和MobiCom21-Hermes: An Efficient Federated Learning Framework for Heterogeneous Mobile的思想本质上一致。</p><p>FedMask improves the inference accuracy by 28.47% and reduces the communication cost and the computation cost by 34.48× and 2.44×. FedMask also achieves 1.56× inference speedup and reduces the energy consumption by 1.78×.</p><table><thead><tr><th>Thesis</th><th>inference accuracy</th><th>communication cost</th><th>computation cost</th><th>inference speedup</th><th>energy consumption</th></tr></thead><tbody><tr><td>SenSys21</td><td>28.47%</td><td>34.48x</td><td>2.44x</td><td>1.56x</td><td>1.78x</td></tr><tr><td>MobiCom21</td><td>32.17%</td><td>3.48x</td><td></td><td>1.83x</td><td>1.8x</td></tr></tbody></table><p>每个设备学习一个稀疏的binary mask(每个网络参数1个bit),保持本地模型的参数不变,在服务器和设备之间只传输binary mask。</p><p>和经典的FL学习共享模型不同,每个device通过将学到的binary mask运用到本地模型固定的参数上去得到一个个性化的和结构稀疏的模型。</p><ul><li>只在device和server之间传输mask的话,本地得到的其它server的信息实质上只能是本地模型的参数哪些地方保留,哪些地方剪掉这样的信息,本地想取得不错的效果,要求本地能够训练出不错的模型,如果<strong>本地的数据如果太少</strong>,本地就无法训练出不错的模型,因此这个方法对本地数据很少的情况可能并不适用。</li><li>为什么不把这个方法和MobiCom21的方法结合起来去做一个优化?</li><li>剪枝的方法需要对每一个特定的模型做特定的剪枝策略设计和算法设计,并不是一个通用的方法。</li><li>该工作和MobiCom21那一篇的主要精力其实都集中于如何对手机上的模型进行剪枝,关于中央服务器聚合的部分,其实花的精力并不多。</li></ul><h2 id="Background(当前存在什么问题)"><a href="#Background(当前存在什么问题)" class="headerlink" title="Background(当前存在什么问题)"></a>Background(当前存在什么问题)</h2><p>和MobiCom21的那篇基本一致。</p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224152144195.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224152144195.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224152144195"></p><p>Background增加了计算成本论述的一小节。</p><h2 id="Challenges(当前的挑战)"><a href="#Challenges(当前的挑战)" class="headerlink" title="Challenges(当前的挑战)"></a>Challenges(当前的挑战)</h2><ul><li><strong>第一个挑战</strong>是如何联合优化通信和计算效率,Fedmask要去优化binary masks而不是去优化local model parameters只把optimized binary masks送到中央服务器。因此,需要设计一个优化binary mask的训练方法,当前的SGD方法不可用因为mask中的元素都binary value,0值会阻止梯度下降的反向传播,因此需要设计一个binary mask的训练方法。此外,优化binary mask的时候需要设计限制,否则就会变成非结构化稀疏方法,对硬件来说并不友好。</li><li><strong>第二个挑战</strong>是如何保留每个device的个性,当前的剪枝方法是个model weights这种浮点值设计的,这种方法不能被直接用到binary values上,所以也不能直接运用到FedMask的binary 
masks上,因此需要设计一个可以生成异构binary mask的方法。</li><li><strong>第三个挑战</strong>是如何在保留device个性化的前提下聚合异构的二进制mask。聚合有两个难点,一个是聚合是在binary mask上做的而不是在model parameters上做的,第二是这些binary mask是异构的而不是拥有同样的网络结构。</li></ul><h2 id="FEDMASK-DESIGN(FedMask实际的设计)"><a href="#FEDMASK-DESIGN(FedMask实际的设计)" class="headerlink" title="FEDMASK DESIGN(FedMask实际的设计)"></a>FEDMASK DESIGN(FedMask实际的设计)</h2><p><strong>全文和MobiCom21那篇不一样的地方应该主要是这里</strong>。</p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224152949878.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224152949878.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224152949878"></p><ul><li>each device learns a heterogeneous binary mask via the proposed one-shot pruning method ( 1 )</li><li>each device optimizes the binary mask with a structured sparsity regularization while freezing the parameters of local model ( 2 -a)</li><li>only the optimized binary masks are transmitted from the devices to the central server ( 2 -b)</li><li>The aggregation strategy is specifically designed such that only the elements that are overlapped across the binary masks of the devices are aggregated while keeping non-overlapping elements unchanged ( 3 -a)</li><li>the personalization of the binary masks is preserved and the updated binary masks will be sent back to each device ( 3 -b)</li><li>The above process ( 2 - 3 ) repeats until reaching a predefined number of communication rounds.</li><li>the binary mask will be elementwise applied to the frozen parameters to generate a personalized and structured sparse model ( 4 )</li><li><strong>和MobiCom21那篇不一样的地方是,那篇传送的是子网不是mask,这篇传送mask,且在本地只做mask的优化,其它的参数不变</strong></li></ul><h3 id="Binary-Mask-Optimization(第一个挑战如何解决)"><a href="#Binary-Mask-Optimization(第一个挑战如何解决)" class="headerlink" title="Binary Mask Optimization(第一个挑战如何解决)"></a>Binary Mask Optimization(第一个挑战如何解决)</h3><p>学习一个binary mask的同时冻结model parameters是FedMask技术的基础,下面论述。</p><p>不失一般性,以全连接为例(卷积层类似实现),bias项暂时忽略。</p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224160412547.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224160412547.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224160412547"></p><ul><li>全连接层:$y=W\cdot x$,$y\in R^m$代表输出,$x\in R^{n}$代表输入,$W\in R^{m\times n}$代表权重矩阵。</li><li>加上binary mask之后,也就是$m\in \{0,1\}^{m\times n}$,和$W$的shape一样,带有Mask的全连接层:<img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224160955140.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224160955140.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224160955140"></li><li>当前的优化算法(例如SGD)用在binary value上不可行,因此,引入一个real-valued mask即$m^r\in R^{m\times n}$来设计binary mask optimization,在feedforward step的时候,$m^r$用threshold二值化为$m$,如方程3所示:<img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224161349004.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224161349004.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224161349004"></li><li>在back-propagation step,$m$的梯度由方程4计算:<img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224161714822.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224161714822.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224161714822"></li><li>尽管这样的策略能够实现$m^r$和$m$的优化,但可能会导致巨大的梯度尺度的变化,这会影响$m^r$的优化(见原文引文),为了减少这种gradient variance,加入sigmoid函数:<img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224162103399.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224162103399.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224162103399"></li><li><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224162328426.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224162328426.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224162328426"></li></ul><p><strong>实质上:</strong>就是加了个sigmoid函数,训练的时候没有特殊的处理,在推理的时候就用方程3操作一下。</p>
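<p>对应方程2~5的一个PyTorch风格极简示意如下(假设性代码,非原文实现):前向用0.5作threshold得到binary mask,反向用straight-through estimator把梯度传回real-valued mask $m^r$,$W$保持冻结:</p><pre><code class="python">import torch
import torch.nn as nn

# 假设性的示意代码(非原文实现):带binary mask的全连接层,只训练m^r,W冻结
class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.m_r = nn.Parameter(torch.zeros(out_f, in_f))  # real-valued mask

    def forward(self, x):
        s = torch.sigmoid(self.m_r)           # 方程5:压缩到(0,1),减小gradient variance
        m_hard = (s &gt; 0.5).float()            # 方程3:threshold二值化
        m = m_hard.detach() - s.detach() + s  # straight-through:前向用m_hard,梯度流向m_r
        return x @ (self.weight * m).t()      # 方程2:y = (W ⊙ m) · x
</code></pre>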
<h3 id="One-Shot-Pruning-for-Mask-Initialization(第一个挑战如何解决)"><a href="#One-Shot-Pruning-for-Mask-Initialization(第一个挑战如何解决)" class="headerlink" title="One-Shot Pruning for Mask Initialization(第一个挑战如何解决)"></a>One-Shot Pruning for Mask Initialization(第一个挑战如何解决)</h3><p>经过Binary Mask Optimization我们知道了如何在binary matrix上进行反向传播,下面介绍如何进行剪枝,<strong>结构化剪枝而不是非结构化剪枝</strong>。</p><p>有各种各样的方法去确定要剪去什么样的参数,例如threshold、kernel sparsity、entropy、filter importance等等,然而当前的方法都是为real-valued parameters设计的,因此不能够被直接用于剪枝binary mask。</p><p>在Binary Mask Optimization已经介绍了real-valued mask,因此一个naive的方法是直接将当前的剪枝的方法用到real-valued mask上。然而,由方程4和7得知,real-valued的masks是直接由fixed weight来scale的,因此<strong>real-valued masks的值的大小不能作为pruning的依据</strong>,基于这样的观察,于是设计了one-shot pruning method,这个method基于real-valued的masks和the fixed weights,也就是$W\odot m^r$。</p><p>每个device在binary mask的top layers保存dense structure,在最后几层进行剪枝(也就是由分类的部分组成)。<strong>不用优化后的绝对值作为剪枝的依据</strong>,而<strong>使用$W\odot m^r$绝对值的变化作为剪枝的依据</strong>。定义剪枝率为$p_r$,剪枝的过程由两步组成:</p><ul><li>(1)每个device先用<strong>一个epoch</strong>更新他们的real-valued masks</li><li>(2)通过对$W_{ij}\cdot m^r_{ij}$的值进行排序,选择最大的$p_r$比例的元素,其余的元素设为0并且冻结。然后在本地做local training。</li></ul><p><strong>作者认为第一个epoch更新的参数就是比较重要的参数,和W做一个点积可以知道哪些位置的参数比较重要,哪些位置的参数不重要,把不重要的参数剪掉。</strong></p>
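<p>按上述两步,一个假设性的Python示意(非原文实现)如下:先用一个epoch更新$m^r$,再按$|W_{ij}\cdot m^r_{ij}|$排序,保留最大的$p_r$比例,其余位置置0并冻结:</p><pre><code class="python">import torch

# 假设性的示意代码(非原文实现):one-shot pruning,按|W ⊙ m^r|保留前p_r比例
def one_shot_prune(weight, m_r, p_r):
    score = (weight * m_r).abs().flatten()
    k = max(1, int(p_r * score.numel()))        # 要保留的元素个数
    thresh = torch.topk(score, k).values.min()  # 取第k大的score作为阈值
    keep = (weight * m_r).abs() &gt;= thresh       # True的位置继续训练
    m_r.data[~keep] = 0.0                       # 其余置0并冻结(不再更新)
    return keep
</code></pre><h3 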
id="Local-Binary-Mask-Optimization(第二个挑战如何解决)"><a href="#Local-Binary-Mask-Optimization(第二个挑战如何解决)" class="headerlink" title="Local Binary Mask Optimization(第二个挑战如何解决)"></a>Local Binary Mask Optimization(第二个挑战如何解决)</h3><p>在作为one-shot pruning之后,每个device就有了一个heterogeneous mask。训练的时候,冻结模型的参数,训练binary的mask,且只训练排名靠前的binary的mask。</p><p><strong>模型的Loss:</strong></p><p>为了提升设备上的计算的精度,在mask optimization的过程中使用<strong>结构化的稀疏正则化方法</strong>去学习binary masks with structured sparsity。目标是在卷积层获得channel-wise和filter-wise的sparsity,在全连接层获得row-wise和column-wise的sparsity。下面的公式和MobiCom21的基本一样:</p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224194914540.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224194914540.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224194914540"></p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224194935059.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224194935059.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224194935059"></p><p><strong>方案的简单测试:</strong></p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224195622039.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224195622039.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224195622039"></p><ul><li><p>LSTM的效果不好,为什么?</p></li><li><p>LSTM的公式如下所示:<img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224195653878.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224195653878.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224195653878"></p></li><li><p>加了mask的LSTM的公式如下所示:<img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224195738865.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224195738865.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224195738865"></p></li><li><p>原因:尽管masked based的CNN、MLP的performance drop都很小,但是mask based的CNN和MLP的表达能力确实受到了限制,这种表达能力的衰减在LSTM中表现得更厉害,因为这种表达能力的衰减会通过方程12中的nested mask structure展现出来,这种不断累加的表达能力的衰减会导致LSTM严重的performance drop。定义real-valued unit的表达能力为$\mu$,一个masked unit的表达衰减是$\epsilon<1$,在一次feedforward的过程中,mask based的LSTM会衰减到$(1-\epsilon)^3{\mu}^3$,正如图6中展示的那样,从$h_{t-1}$到$h_t$的时候,表达能力衰减的累加为$1-(1-\epsilon)^3$:<img 
src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224200539110.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224200539110.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224200539110"></p></li><li><p>为了减少这种衰减,就移除了$W_o$,$W_g$和$W_i$的binary mask,这样的话,表达能力的decay就只有$1-(1-\epsilon)^2$,只增加了25%的额外通信开销。</p></li></ul><h3 id="Aggregate-Heterogeneous-Binary-Masks(第三个挑战如何解决)"><a href="#Aggregate-Heterogeneous-Binary-Masks(第三个挑战如何解决)" class="headerlink" title="Aggregate Heterogeneous Binary Masks(第三个挑战如何解决)"></a>Aggregate Heterogeneous Binary Masks(第三个挑战如何解决)</h3><p>对至少出现两次的elements做averging,其它的不变。</p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201552518.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201552518.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201552518"></p><p>一个例子:</p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201607117.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201607117.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201607117"></p><ul><li>这张图和MobiCom那张基本一样。</li></ul><h3 id="完整算法"><a href="#完整算法" class="headerlink" title="完整算法"></a>完整算法</h3><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201656905.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201656905.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201656905"></p><h2 id="EVALUATION(评测)"><a href="#EVALUATION(评测)" class="headerlink" title="EVALUATION(评测)"></a>EVALUATION(评测)</h2><p>和MobiCom21的那篇实验设置很类似。</p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201820023.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201820023.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201820023"></p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201827422.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201827422.png" 
srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201827422"></p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201835911.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201835911.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201835911"></p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201846183.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201846183.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201846183"></p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201856172.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201856172.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201856172"></p><p><img src="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201905717.png" class="lazyload placeholder" data-srcset="/2022/12/24/sensys21-fedmask-joint-computation-and-communication-efficient-personalized-federated-learning-via-heterogeneous-masking/image-20221224201905717.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224201905717"></p>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About Prune </tag>
</tags>
</entry>
<entry>
<title>MobiCom21-Hermes: An Efficient Federated Learning Framework for Heterogeneous Mobile</title>
<link href="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/"/>
<url>/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/</url>
<content type="html"><![CDATA[<h1 id="MobiCom21-Hermes-An-Efficient-Federated-Learning-Framework-for-Heterogeneous-Mobile"><a href="#MobiCom21-Hermes-An-Efficient-Federated-Learning-Framework-for-Heterogeneous-Mobile" class="headerlink" title="MobiCom21-Hermes: An Efficient Federated Learning Framework for Heterogeneous Mobile"></a>MobiCom21-Hermes: An Efficient Federated Learning Framework for Heterogeneous Mobile</h1><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223140946531.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223140946531.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223140946531"></p><ul><li>这种剪枝方法自适应的根据设备数据学习到较好的剪枝方法,基于的假设是:相似的数据进行梯度下降时更新的参数应该也是类似的。个人人为这种方法可能不能很好的学习到设备的异构性。</li><li>作者没有在真实场景下进行测试,算法的测试规模较小。</li></ul><p>Hermes: a <strong>communication</strong> and <strong>inference</strong>-efficient FL framework <strong>under data heterogeneity</strong>.</p><p>不去像传统的FL那样去优化base neural network,每个设备运用<strong>结构化剪枝</strong>找到一个小的子网,这个子网在本地数据上泛化性良好,只有这些子网的更新才在服务器和设备之间传输。服务器只对每个子网的重合的参数做平均。</p><p>这样做的假设是:“lottery ticket hypothesis”,即原本网络的最优结构可以通过剪枝出一个子网达到类似的效果。拥有相似local数据的设备共享相似的子网架构和更多的相似参数,找到这些子网的话通信和推理效率都可以被提高,也可以进行个性化的定制。</p><p>Hermes achieves as high as <strong>32.17% increase in inference accuracy</strong>, <strong>3.48× reduction on the communication cost</strong>, <strong>1.83× speedup in inference efficiency</strong>, and <strong>1.8× savings on energy consumption</strong>.</p><ul><li>当前很少有考虑推理效率的工作,本篇工作可以提高推理效率。</li><li>减少通信成本。</li><li>学习到个性化的模型。</li></ul><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223174538450.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223174538450.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223174538450"></p><h2 id="Background(当前存在什么问题)"><a href="#Background(当前存在什么问题)" class="headerlink" title="Background(当前存在什么问题)"></a>Background(当前存在什么问题)</h2><p><strong>联邦学习</strong>:</p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223174215637.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223174215637.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223174215637"></p><p><strong>解决数据异构性的问题</strong>:FL personalization,多数都是two-step的方法,第一步学好一个global model,第二步fine tune globel model,这种两步方法不可避免的会导致额外的计算开销。<strong>数据异构性不应该是障碍,而应该是实际的个性化的需求。</strong></p><ul><li>fine-tuning</li><li>multi-task learning</li><li>contextualization</li></ul><p><strong>解决通信开销的问题</strong>:压缩设备和服务器之间的通信开销。CIFAR-10 500个通信轮次,每次20个参与设备,VGG16和Inception-v4,FedAvg分别产生了10.59TB和32.22TB的数据,LG-FedAvg减少到了8.28TB和2.45TB,还是太大了。</p><ul><li>量化</li><li>稀疏化</li><li>混合方法</li><li><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223184009626.png" class="lazyload placeholder" 
data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223184009626.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223184009626"></li></ul><p>有没有既考虑设备算力和通信能力(HeteroFL),又考虑设备上数据量(Hermes)做定制的工作</p><p>同时解决<strong>数据异构性</strong>和<strong>通信开销</strong>的方法很少:</p><ul><li><strong>LG-FedAvg</strong>:结合局部的表示学习和全局的联邦学习,将模型参数分为两个部分,local parameters和global parameters,分两步走,第一步设备使用FedAvg策略和更新和交换整个模型,第二步更新整个模型,但是每次只把部分参数发送给中央服务器。LG-FedAvg只能在第二部减小通信开销,模型的划分用的是启发式的方法而不是数据驱动的方法。LG-FedAvg的评测也不是在真实的FL setting下做的,他的评测每个设备上都具有充足的训练数据。</li><li><strong>HeteroFL</strong>:根据设备的算力和通信能力自适应的分配submodel,每个设备只训练和通信submodel,因此通信和计算效率都能提升。但是子模型是根据设备的能力设计的而不是依据本地数据设计的,使得子模型的分配很不灵活。</li><li><strong>SFSL</strong>:是为推荐系统设计的安全的联邦子模型推荐框架,子模型由用户在电商平台上的历史数据来确定。每一个设备只需要训练和传输子模型,因此通信和计算效率都能提升。然而这个特定的联邦学习策略是为推荐系统设计的,具有特殊的输入数据的结构,SFSL可能对其它场景并不适用,因为其它场景模型和输入数据的格式都和SFSL不一样。</li></ul><p><strong>当前还没有把计算效率考虑进来的研究工作</strong>。</p><h2 id="Challenges(Hermes设计的挑战和解决挑战的方法)"><a href="#Challenges(Hermes设计的挑战和解决挑战的方法)" class="headerlink" title="Challenges(Hermes设计的挑战和解决挑战的方法)"></a>Challenges(Hermes设计的挑战和解决挑战的方法)</h2><p>挑战:</p><ul><li><strong>第一个挑战</strong>是设计一local training方法,使得设备能够学习一个子网,这个子网能把个性化的本地信息数据嵌入,联合提升通信和推理效果:一个novel的本地训练方法,这个方法使用每一个client每一轮通信过程中的的基本模型寻找子网。每一device学习了一个结构化的稀疏子网,在通信阶段只和服务器通信参数。</li><li><strong>第二个挑战</strong>是为服务器设计一个特定的聚合策略去最大化保存每个子网的个性化属性:Hermes在服务器端设计了一个个性化的保存聚合策略。也就是该聚合方法只做subnetworks交叉部分的平均,这样的聚合策略可以阻止每个subnetwork的的个性化参数被其他的参数干扰。</li><li>还要处理部分设备的数据严重不足的问题。</li></ul><p>实验:</p><ul><li>在手机上三个代表性的深度学习应用上做了实验。</li><li>对比的方法有:FedAvg、Top-k,Per-FedAvg,LG-FedAvg。</li></ul><p>理论上的收敛性证明。</p><h2 id="DESIGN-OF-HERMES(Hermes实际的设计)"><a href="#DESIGN-OF-HERMES(Hermes实际的设计)" class="headerlink" title="DESIGN OF HERMES(Hermes实际的设计)"></a>DESIGN OF HERMES(Hermes实际的设计)</h2><h3 id="Overview(Hermes整体架构)"><a href="#Overview(Hermes整体架构)" class="headerlink" title="Overview(Hermes整体架构)"></a>Overview(Hermes整体架构)</h3><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223184032385.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223184032385.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223184032385"></p><ul><li><p>在每一个通信阶段,一系列的device参与FL的训练。</p><ul><li>①:Instead of 直接优化本地模型,每一个device在优化局部模型时incorporates a structured sparsity regularization学习一个子网。</li><li>②:只有这个子网在中央服务器和设备之间传送。</li><li>③:只有各个设备的子网间相交的参数在服务器计算平均,不处理不相交的参数。例如图3中,设备1、2和N的子网的前两个部分是相交的,其它的部分不重叠。</li><li>④:然后更新的子网会发布到各个设备。</li><li>重复以上四个步骤。</li><li>⑤:最终,每个设备学习到一个个性化的模型。</li></ul><p> <img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223185533707.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223185533707.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223185533707"></p><ul><li>$C={C_{1},…,C_{N}}$代表N个devices,$C_{k}$代表第k个设备。</li><li>$S_{c}\subset C$,代表每一个训练轮次选出的设备。</li><li>$W_{k}$代表$C_k$的局部模型参数。</li><li>每一个设备 $C_k$ 同时学习一个local的mask $M_k \in {\left{0,1\right}}^{|W_k|}$,表示运用了结构化剪枝后的子网。$W_k\odot 
<h3 id="Learn-Subnetwork-for-Joint-Efficiency-and-Personalization(学习出一个既有效率又可以个性化的子网)"><a href="#Learn-Subnetwork-for-Joint-Efficiency-and-Personalization(学习出一个既有效率又可以个性化的子网)" class="headerlink" title="Learn Subnetwork for Joint Efficiency and Personalization(学习出一个既有效率又可以个性化的子网)"></a>Learn Subnetwork for Joint Efficiency and Personalization</h3><p>Pruning aims to find the parameters that matter most for inference on the training data.</p><ul><li><p>unstructured pruning</p></li><li><p>structured pruning</p></li><li><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223191944891.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223191944891.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223191944891"></p><ul><li>(a) is the unpruned model; (b) is a structurally pruned model, where some channels on each device are pruned away; (c) is unstructured pruning.</li><li>Structured pruning is usually channel-wise or filter-wise; it is less flexible than unstructured pruning but far more hardware-friendly.</li><li>Unstructured pruning is usually parameter-wise, with better flexibility and compression, but it needs special hardware support.</li></ul><p>Hermes uses <strong>structured pruning</strong>. Without loss of generality, take convolution as an example: while optimizing the local model with SGD, each device seeks <strong>channel-wise sparsity in convolutional layers</strong> and <strong>row-wise and column-wise sparsity in fully connected layers</strong>. The structured pruning objective: <strong>?</strong></p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223192719808.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223192719808.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223192719808"></p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223195537942.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223195537942.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223195537942"></p><ul><li>The post-pruning regularization term is shown above. <strong>?</strong></li></ul><p>With structured pruning applied, the training loss for mask optimization is:</p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223195857179.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223195857179.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223195857179"></p><ul><li>$F_D(W)$ is the loss on local data, and $\lambda$ is the coefficient of the structured sparsity regularizer.</li><li>By optimizing Equation (3), each device can derive M from the zero and non-zero entries of W: zeros in W are marked 0 in M, and non-zeros are marked 1.</li></ul><p>In communication round T, each participating device downloads its subnetwork $W_k^T=W\odot M_k^T$, where $M_k^T$ was computed from Equation 3 in round (T-1). The device first validates on its local validation set $D_k^{val}$: if the current subnetwork $W_k^T$ beats a predefined threshold $acc_{threshold}$ and the current pruning rate $r_k^T$ has not reached the target rate $r_{target}$, the device prunes the low-magnitude weights of $W_k^T$ at a fixed pruning rate $r_p$. After structured pruning, the device obtains $M_k^{T+1}$, which encodes the sparse structure of $W_k^{T+1}$ and embeds data-dependent features. The device then runs a few epochs of mini-batch training on $D_k^{train}$ starting from $W_k^T\odot M_k^{T+1}$ to obtain $W_k^{T+1}$. Besides the subnetwork parameters $W_k^{T+1}$, each client also sends the corresponding binary mask $M_k^{T+1}$ to the server, which tells the server where pruning happened and which parameters to keep. The binary mask costs only 1 bit per parameter, negligible next to the floating-point model parameters being transmitted. Finally, the device uploads $W_k^{T+1}$ and $M_k^{T+1}$ to the server. (A sketch of the mask derivation follows this section.)</p></li></ul>
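<p>A minimal sketch of the two local steps just described, under the stated rule "zeros of W map to 0 in M"; the channel-wise grouping of a conv weight is my own illustration of how the structured part could be checked, not the paper's exact code:</p><pre><code class="python">
import numpy as np

def derive_mask(W, tol=0.0):
    """After optimizing Eq.(3): zeros of W map to 0 in M, non-zeros to 1."""
    return (np.abs(W) > tol).astype(np.uint8)

def prune_channels(W, rate):
    """Channel-wise structured pruning sketch: zero the lowest-magnitude
    fraction `rate` of output channels of a conv weight (out, in, kh, kw)."""
    norms = np.abs(W).sum(axis=(1, 2, 3))      # one magnitude score per channel
    k = int(rate * len(norms))
    W = W.copy()
    if k:
        W[np.argsort(norms)[:k]] = 0.0         # kill the weakest channels
    return W
</code></pre>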
<h3 id="Personalization-Preserving-Aggregation(中央服务器的聚合)"><a href="#Personalization-Preserving-Aggregation(中央服务器的聚合)" class="headerlink" title="Personalization-Preserving Aggregation(中央服务器的聚合)"></a>Personalization-Preserving Aggregation (aggregation at the central server)</h3><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223202009346.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223202009346.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223202009346"></p><ul><li>Only the intersecting parameters are aggregated; the aggregated parameters are then sent back to each device according to what that device needs.</li><li>The concrete algorithm is as follows (a code sketch follows this list):</li><li><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223202124245.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223202124245.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223202124245"></li></ul>
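<p>A minimal sketch of this server step, assuming masks and flattened parameters arrive as NumPy arrays: each parameter is averaged only over the clients whose mask retains it, so a position kept by a single client stays exactly that client's value (names are mine; see the paper's algorithm for the authoritative version):</p><pre><code class="python">
import numpy as np

def aggregate(subnet_params, masks):
    """Hermes-style server step sketch: average each parameter over the
    clients whose mask retains it; positions no client retains stay zero."""
    m = np.stack(masks).astype(float)          # (n_clients, n_params)
    p = np.stack(subnet_params)                # zeros where pruned away
    coverage = m.sum(axis=0)                   # clients retaining each param
    avg = np.zeros_like(coverage)
    kept = coverage > 0
    avg[kept] = (p * m).sum(axis=0)[kept] / coverage[kept]
    return avg                # each device then downloads avg * its own mask
</code></pre>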
srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223203431216"></p></li></ul><h3 id="对比实验设置:"><a href="#对比实验设置:" class="headerlink" title="对比实验设置:"></a>对比实验设置:</h3><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203522071.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203522071.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223203522071"></p><h3 id="评测指标:"><a href="#评测指标:" class="headerlink" title="评测指标:"></a>评测指标:</h3><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203601000.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203601000.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223203601000"></p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203609211.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203609211.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223203609211"></p><h3 id="实验结果"><a href="#实验结果" class="headerlink" title="实验结果"></a>实验结果</h3><h4 id="Convergence-Speed(收敛的速度)"><a href="#Convergence-Speed(收敛的速度)" class="headerlink" title="Convergence Speed(收敛的速度)"></a>Convergence Speed(收敛的速度)</h4><p>又快又好。</p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203814814.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203814814.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223203814814"></p><h4 id="Inference-Accuracy-vs-Communication-Cost(推理精度和通信开销)"><a href="#Inference-Accuracy-vs-Communication-Cost(推理精度和通信开销)" class="headerlink" title="Inference Accuracy vs. Communication Cost(推理精度和通信开销)"></a>Inference Accuracy vs. 
<p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203926053.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223203926053.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223203926053"></p><h4 id="Hyper-Parameter-Evaluation(超参的评测)"><a href="#Hyper-Parameter-Evaluation(超参的评测)" class="headerlink" title="Hyper-Parameter Evaluation(超参的评测)"></a>Hyper-Parameter Evaluation</h4><ul><li><p>Number of Participating Devices:</p></li><li><p>Data Volume and Unbalance Rate:</p></li><li><p>Target Pruning Rate:</p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204246817.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204246817.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223204246817"></p></li></ul><h4 id="Runtime-Performance(运行时的性能)"><a href="#Runtime-Performance(运行时的性能)" class="headerlink" title="Runtime Performance(运行时的性能)"></a>Runtime Performance</h4><p><strong>Reduction of Memory Footprint</strong>:</p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204303888.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204303888.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223204303888"></p><p><strong>Inference Speedup</strong>:</p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204342244.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204342244.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223204342244"></p><p><strong>Reduction on Energy Consumption:</strong></p><p><img src="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204406232.png" class="lazyload placeholder" data-srcset="/2022/12/23/mobicom21-hermes-an-efficient-federated-learning-framework-for-heterogeneous-mobile-clients/image-20221223204406232.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221223204406232"></p><h2 id="DISCUSSION(讨论)"><a href="#DISCUSSION(讨论)" class="headerlink" title="DISCUSSION(讨论)"></a>DISCUSSION</h2><ul><li>Generality of Hermes</li><li>Defending Against Privacy Leakage</li></ul><h2 id="RELATED-WORK(相关工作)"><a href="#RELATED-WORK(相关工作)" class="headerlink" title="RELATED WORK(相关工作)"></a>RELATED WORK</h2><ul><li>Communication-Efficient Distributed Deep Learning</li><li>Personalization for FL</li></ul><h2 id="CONCLUSION(结论)"><a href="#CONCLUSION(结论)" class="headerlink" title="CONCLUSION(结论)"></a>CONCLUSION</h2>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About Prune </tag>
</tags>
</entry>
<entry>
<title>IMC17-Complexity vs. Performance: Empirical Analysis of Machine Learning as a Service</title>
<link href="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/"/>
<url>/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/</url>
<content type="html"><![CDATA[<h1 id="IMC17-Complexity-vs-Performance-Empirical-Analysis-of-Machine-Learning-as-a-Service"><a href="#IMC17-Complexity-vs-Performance-Empirical-Analysis-of-Machine-Learning-as-a-Service" class="headerlink" title="IMC17-Complexity vs. Performance: Empirical Analysis of Machine Learning as a Service"></a>IMC17-Complexity vs. Performance: Empirical Analysis of Machine Learning as a Service</h1><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222135251608.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222135251608.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222135251608"></p><p>Empirial Study的findings、goal和metholody基于的RQ很多时候并不是一致的。goal和RQ相关,根据RQ提出方法,在方法做出的实验结果中总结出findings。</p><p>本文做的是一个MLaaS服务的一个实证性分析,告诉了人们应该怎么选这些平台,这些平台的表现如何。</p><p>MLaaS (Machine Learning as a Service platforms),发现:</p><ul><li>用户的自主决定权越高,得到不好的结果的风险越大(with more user control comes greater risk).</li><li>服务端的优化优于一般的平台,但是仍然不如精心设计的机器学习平台(server side optimizations help fully-automated systems outperform default settings on competitors, but still lag far behind well-tuned MLaaS systems which compare favorably to standalone ML libraries).</li><li>选择什么样的分类器是决定模型性能的主要因素,用户可以随机尝试很少的分类器就可以得到一个较好的结果(classifier choice is the dominating factor in determining model performance, and that users can approximate the performance of an optimal classifier choice by experimenting with a small subset of random classifiers).</li></ul><h2 id="Backend(为什么做这个事情,怎么做这个事情)"><a href="#Backend(为什么做这个事情,怎么做这个事情)" class="headerlink" title="Backend(为什么做这个事情,怎么做这个事情)"></a>Backend(为什么做这个事情,怎么做这个事情)</h2><p><strong>为什么做这个事情?</strong></p><ul><li>机器学习分类器是当前数据分析的主要工具(Machine learning (ML) classifiers are now common tools for data analysis).</li><li>机器学习工具商业化,大多数网络的研究者关心这些ML系统优化的咋样(As ML tools are increasingly commoditized, most network researchers are interested in them as black box tools, and lack the resources to optimize their deployments and configurations of ML systems).</li><li>当前的机器学习工具很多类别,大体可以分为fully-automated, turnkey systems to fully-customizable systems。</li><li>对当前的MLaaS知之甚少(MLaaS today are opaque systems, with little known about their efficacy)。</li></ul><p><strong>怎么做这个事情?</strong></p><p><strong>6</strong> of the <strong>most popular MLaaS platform</strong>, a large number (<strong>119</strong>) of <strong>labeled datasets</strong> for <strong>binary classification</strong>.</p><p><strong>goals</strong>:</p><ul><li>MLaaS互相比,性能咋样(how MLaaS systems compare in performance against each other, and against a fully customized and tuned local ML library).</li><li>使用MLaaS的过程中,complexity、performance 和performance variability之间的关系怎么样(better understand the correlations between complexity, performance and performance variability).</li><li>什么样的因素对性能的影响最大(which key knobs have the biggest impact on performance, and try to design generalized techniques to optimize those knobs).</li></ul><p><strong>findings</strong>:</p><ul><li>(current MLaaS systems cover the full range of tradeoffs between ease of use and user-control)</li><li>选择什么样的分类器是决定模型性能的主要因素,用户可以随机尝试很少的分类器就可以得到一个较好的结果(classifier choice accounts for much of the benefits of customization,a user can achieve nearoptimal results by experimenting with a small random set of classifiers)</li><li>大机构搞得分类器偶尔还是会出现问题(fully automated (blackbox) systems like 
<h2 id="UNDERSTANDING-MLAAS-PLATFORMS(当前的ML平台是什么样的)"><a href="#UNDERSTANDING-MLAAS-PLATFORMS(当前的ML平台是什么样的)" class="headerlink" title="UNDERSTANDING MLAAS PLATFORMS(当前的ML平台是什么样的)"></a>UNDERSTANDING MLAAS PLATFORMS (what today's platforms look like)</h2><p><strong>How an ML platform processes a task:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222145238395.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222145238395.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222145238395"></p><p><strong>Six mainstream ML platforms were surveyed:</strong></p><ul><li>Amazon Machine Learning (Amazon)</li><li>Automatic Business Modeler (ABM)</li><li>BigML</li><li>Google Prediction API (Google)</li><li>Microsoft Azure ML Studio (Microsoft)</li><li>PredictionIO</li></ul><p><strong>How the six platforms support</strong> Preprocessing, Feature selection, Classifier selection, and Parameter tuning (see the paper).</p><h2 id="METHODOLOGY(RQ?具体的实验方法?)"><a href="#METHODOLOGY(RQ?具体的实验方法?)" class="headerlink" title="METHODOLOGY(RQ?具体的实验方法?)"></a>METHODOLOGY (RQs and concrete experimental methods)</h2><p><strong>Research questions:</strong></p><ul><li>How does the complexity (or control) of ML systems correlate with ideal model accuracy?</li><li>Can increased control lead to higher risks of building a poorly performing ML model?</li><li>How much can MLaaS systems optimize the automated portions of their pipeline?</li></ul><h3 id="Datasets(数据集)"><a href="#Datasets(数据集)" class="headerlink" title="Datasets(数据集)"></a>Datasets</h3><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222153511346.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222153511346.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222153511346"></p><ul><li>119 datasets: 94 from the UCI machine learning repository, 16 synthesized with scikit-learn, and 9 from other ML studies.</li><li>The datasets span wide ranges of class counts, sample sizes and feature dimensions.</li><li>Some preprocessing was applied (see the paper for details).</li></ul><h3 id="回答三个RQ的具体实验设置"><a href="#回答三个RQ的具体实验设置" class="headerlink" title="回答三个RQ的具体实验设置"></a>Experimental setup for the three RQs</h3><p><strong>Setup:</strong></p><p>The ML pipeline stages are split and merged according to what the platforms actually expose, yielding <strong>three dimensions</strong> (see the paper for the rationale):</p><ul><li>Preprocessing (data transformation) and Feature Selection (FEAT)</li><li>Classifier Choice (CLF)</li><li>Parameter Tuning (PARA)</li></ul><p>For each dimension, <strong>all configurable options</strong> of each platform were analyzed:<br><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222153942187.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222153942187.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222153942187"></p><ul><li>Table 1 shows all selectable configurations of the first two dimensions on each platform; <strong>every combination</strong> is run.</li></ul>
data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222153951637.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222153951637"></p><ul><li>表2前两列表示实验选用的特征,第三列是根据平台给的默认参数自己生成的一些配置参数,最后一列代表在这个平台上做的实验的个数。</li></ul><p><strong>baseline</strong>:</p><ul><li>一组完全不用用户的控制:使用平台的默认参数。</li><li>一组对所有的dimensions做人为的控制:使用scikit-learn做测试。</li><li>由于所有平台都有的操作只有<strong>逻辑回归</strong>:使用<strong>逻辑回归</strong>做测试。</li></ul><p><strong>评测指标</strong>:</p><p>F-score,unbiased,其它的平台都不支持。</p><p>为了验证F-score一个指标是不是就足够了,计算了所有数据在平台的Friedman ranking。Friedman ranking statistically ranks platforms by considering a given metric (e.g. F-score) across all datasets. <strong>A platform with a higher Friedman rank exhibits statistically better performance when considering all datasets</strong>, compared to a lower ranked platform. We observed that the platform ranking based on average F-score is consistent with the Friedman ranking (using F-score), suggesting that average F-score is a representative metric.</p><h2 id="COMPLEXITY-VS-PERFORMANCE(RQ1的回答)"><a href="#COMPLEXITY-VS-PERFORMANCE(RQ1的回答)" class="headerlink" title="COMPLEXITY VS. PERFORMANCE(RQ1的回答)"></a>COMPLEXITY VS. PERFORMANCE(RQ1的回答)</h2><p><strong>平台内部:baseline和优化后的结果比较:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161015289.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161015289.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161015289"></p><ul><li>得到一些结论。(具体的结论看原文)</li></ul><p><strong>平台之间:不同平台之间的比较</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161045097.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161045097.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161045097"></p><ul><li>得到一些结论。(具体的结论看原文)</li></ul><p><strong>不同dimensions之间(三个dimensions)</strong>:</p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161521625.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161521625.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161521625"></p><ul><li>得到一些结论。(具体的结论看原文)</li></ul><p><strong>不同的分类器之间:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161737326.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161737326.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161737326"></p><ul><li>得到一些结论。(具体的结论看原文)</li></ul><h2 id="RISKS-OF-INCREASING-COMPLEXITY(RQ2的回答)"><a href="#RISKS-OF-INCREASING-COMPLEXITY(RQ2的回答)" class="headerlink" title="RISKS OF INCREASING COMPLEXITY(RQ2的回答)"></a>RISKS OF INCREASING 
<h2 id="COMPLEXITY-VS-PERFORMANCE(RQ1的回答)"><a href="#COMPLEXITY-VS-PERFORMANCE(RQ1的回答)" class="headerlink" title="COMPLEXITY VS. PERFORMANCE(RQ1的回答)"></a>COMPLEXITY VS. PERFORMANCE (answering RQ1)</h2><p><strong>Within each platform: baseline vs. optimized results:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161015289.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161015289.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161015289"></p><ul><li>Several conclusions are drawn (see the paper).</li></ul><p><strong>Across platforms:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161045097.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161045097.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161045097"></p><ul><li>Several conclusions are drawn (see the paper).</li></ul><p><strong>Across the three dimensions:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161521625.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161521625.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161521625"></p><ul><li>Several conclusions are drawn (see the paper).</li></ul><p><strong>Across classifiers:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161737326.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222161737326.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222161737326"></p><ul><li>Several conclusions are drawn (see the paper).</li></ul><h2 id="RISKS-OF-INCREASING-COMPLEXITY(RQ2的回答)"><a href="#RISKS-OF-INCREASING-COMPLEXITY(RQ2的回答)" class="headerlink" title="RISKS OF INCREASING COMPLEXITY(RQ2的回答)"></a>RISKS OF INCREASING COMPLEXITY (answering RQ2)</h2><p><strong>Platform complexity vs. stability of prediction results:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222162146128.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222162146128.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222162146128"></p><ul><li>The wider a platform's accuracy range, the worse the result a bad configuration can produce.</li><li>Several conclusions are drawn (see the paper).</li></ul><p><strong>Which dimension's tuning affects accuracy the most?</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222162459638.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222162459638.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222162459638"></p><p><strong>Trying just a few random classifiers already gets a user close to optimal:</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222162627835.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222162627835.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222162627835"></p><h2 id="HIDDEN-OPTIMIZATIONS(RQ3的回答)"><a href="#HIDDEN-OPTIMIZATIONS(RQ3的回答)" class="headerlink" title="HIDDEN OPTIMIZATIONS(RQ3的回答)"></a>HIDDEN OPTIMIZATIONS (answering RQ3)</h2><p>What does this teach us about choosing a vendor's platform? <strong>Whether the vendor optimizes classifier selection.</strong></p><p><strong>One can even predict whether a vendor optimizes its classifier choice!</strong></p><p><strong>Verified that vendors do optimize classifier selection (Google and ABM were tested):</strong></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163334957.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163334957.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222163334957"></p><ul><li>Distribution of the chosen datasets.</li></ul><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163351740.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163351740.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222163351740"></p><ul><li>Google and ABM indeed use different methods to classify non-linear and linear datasets.</li></ul><p><strong>Two pieces of information suffice to tell whether a vendor optimizes classifier selection: knowledge of the training dataset, and prediction results from the platform.</strong></p><p>It is possible to accurately infer the broad classifier family, more specifically, linear or non-linear classifiers.</p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163752237.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163752237.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222163752237"></p><p><img src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163824075.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163824075.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222163824075"></p><ul><li>The performance (F-score) of the two categories of classifiers on the two datasets.</li></ul><p><strong>A very interesting experiment: first run all datasets with local classifiers and on the platforms that expose classifier choice; then build a meta-dataset from these runs, whose label is which classifier was used and whose features are the training accuracy and other chosen features; finally, train a random forest to predict which classifier family (linear vs. non-linear) a black-box platform used.</strong></p>
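<p>A minimal sketch of that meta-classification idea, assuming a meta-dataset where each row describes one (dataset, platform) run and the label is the classifier family; the features and random stand-in data are my own illustration, not the paper's exact setup:</p><pre><code class="python">
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy meta-dataset: features of each run (e.g., train accuracy, test
# accuracy, dataset size, a linearity score); label: 0=linear, 1=non-linear.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

meta_clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(meta_clf, X, y, cv=5).mean())  # family-inference accuracy
</code></pre>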
src="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163824075.png" class="lazyload placeholder" data-srcset="/2022/12/21/imc17-complexity-vs-performance-empirical-analysis-of-machine-learning-as-a-service/image-20221222163824075.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221222163824075"></p><ul><li>the performance (F-score) of the two categories of classifiers on the two datasets</li></ul><p><strong>很有意思的实验:首先使用本地的分类器和允许选择分类器种类的平台上用所有数据集做了测试,然后用这些测试的结果造了一个数据集:label为用了什么分类器,特征为训练的精度等选择的特征,使用随机森林预测这些平台用了什么类型的分类器(线性分类器和非线性分类器)。</strong></p><p>具体的实验细节和结果看原文。</p>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About Empirical Study </tag>
</tags>
</entry>
<entry>
<title>OSDI22-Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning</title>
<link href="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/"/>
<url>/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/</url>
<content type="html"><![CDATA[<h1 id="OSDI22-Walle-An-End-to-End-General-Purpose-and-Large-Scale-Production-System-for-Device-Cloud-Collaborative-Machine-Learning"><a href="#OSDI22-Walle-An-End-to-End-General-Purpose-and-Large-Scale-Production-System-for-Device-Cloud-Collaborative-Machine-Learning" class="headerlink" title="OSDI22-Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning"></a>OSDI22-Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning</h1><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221220224519633.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221220224519633.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221220224519633"></p><p><strong>联邦学习</strong>是将所有的数据放在边端训练,云只负责协调,<strong>Walle</strong>的目标是是实现边云协同的训练,边端和云可能各存储一部分数据,能在边端做的就在边端做,需要在云做的就在云做,<strong>当前常常采用的</strong>是只用云的计算。</p><p>在介绍系统的实现的时候,常常将<strong>设计目标</strong>和<strong>具体的实现</strong>分开来讲,</p><p>文章的结构安排:</p><ul><li>摘要讲清楚为了解决什么痛点问题设计了什么样的系统。</li><li>introduction讲清楚当前的系统存在什么样的问题,当前有什么有利条件去解决这样的问题,是怎么解决这样的问题的(建立了一个系统),建立这个系统时有什么样的需求,解决了什么样的难点问题。<strong>在introduction中可以不用讲的这么细,且要先讲核心的问题,非核心的设计在后面提到就可以。</strong></li><li>background讲当前ML任务巨大的市场,和当前的痛点问题、解决这个问题遇到的挑战以及当前这个系统的实际的需求。</li></ul><p>Walle:device和cloud合作的ML,已在淘宝上验证。在micro-benchmarks上的表现也非常出色,已经在阿里巴巴大规模使用。</p><ul><li>一个部署平台。a deployment platform.<ul><li>部署平台通过push-then-oull的方法发布ML任务,支持多粒度的部署策略。</li></ul></li><li>一个数据流水线。a data pipeline.<ul><li>on-device的流处理框架,可以在源头处理用户数据。</li></ul></li><li>一个计算容器。a compute container (MNN).<ul><li>MNN是一个张良计算引擎,带有数据处理和模型的执行库。</li><li>MNN通过优化过的Python线程级虚拟机支持各种二氧的ML任务和并行的任务。</li><li>MNN的核心是operator decomposition和semi-auto search,极大的减少了为几十种硬件后端和几百种操作符进行优化的工作量。</li></ul></li></ul><h2 id="Background(当前系统存在的问题,当前的需求,和满足需求需要面对的挑战)"><a href="#Background(当前系统存在的问题,当前的需求,和满足需求需要面对的挑战)" class="headerlink" title="Background(当前系统存在的问题,当前的需求,和满足需求需要面对的挑战)"></a>Background(当前系统存在的问题,当前的需求,和满足需求需要面对的挑战)</h2><h3 id="当前的ML任务很多,用处十分广泛,市场很大"><a href="#当前的ML任务很多,用处十分广泛,市场很大" class="headerlink" title="当前的ML任务很多,用处十分广泛,市场很大"></a>当前的ML任务很多,用处十分广泛,市场很大</h3><p><strong>ML任务</strong>:开发者的视角来看,ML task包含代码、资源和配置。</p><p><strong>ML任务的工作流</strong>:pre-processing,模型的执行,post-processing。</p><p><strong>ML任务巨大的市场</strong>:阿里巴巴有至少上百的ML任务,十亿级别的日活用户,几十种商业场景(CV占30%、NLP占10%、推荐任务占60%),每天运行几十亿,几百亿,几千亿次。</p><h3 id="当前将所有的数据上传云端有什么样的问题?"><a href="#当前将所有的数据上传云端有什么样的问题?" class="headerlink" title="当前将所有的数据上传云端有什么样的问题?"></a>当前将所有的数据上传云端有什么样的问题?</h3><ol><li><p>High Latency:将数据上传云端的时间延迟是s级别的,但很多任务要求几百甚至几十的时延。</p></li><li><p>High Cost and Heavy Load:设备端流量成本,云端的数据负载和处理负载。</p></li><li><p>Data Security and Privacy:用户隐私数据不方便上传,上传云端后可能会泄露。</p></li></ol><h3 id="当前不将数据上传云端有什么条件?"><a href="#当前不将数据上传云端有什么条件?" 
class="headerlink" title="当前不将数据上传云端有什么条件?"></a>当前不将数据上传云端有什么条件?</h3><ol><li>mobile devices的算力的增长。</li></ol><h3 id="当前利用边端数据的研究工作?(边云协作的范式)"><a href="#当前利用边端数据的研究工作?(边云协作的范式)" class="headerlink" title="当前利用边端数据的研究工作?(边云协作的范式)"></a>当前利用边端数据的研究工作?(边云协作的范式)</h3><ol><li>算法决策方面的工作<ul><li>device-cloud task splitting strategy</li><li>collaboration/interaction paradigm</li></ul></li></ol><h3 id="当前缺少一个统一的系统来处理ML任务的每一个阶段"><a href="#当前缺少一个统一的系统来处理ML任务的每一个阶段" class="headerlink" title="当前缺少一个统一的系统来处理ML任务的每一个阶段"></a>当前缺少一个统一的系统来处理ML任务的每一个阶段</h3><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221220235451424.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221220235451424.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221220235451424"></p><p><strong>Walle设计的目标:</strong></p><ul><li>Walle支持ML任务的整个周期(数据预处理、模型训练和推导、后处理):<strong>MNN计算引擎</strong>。<ul><li>将ML任务的迭代周期和APPs的迭代<strong>解耦</strong>:当前App的更新周期几周,大的App甚至几个月,然后ML tasks需要频繁的实验和部署(为了验证ML算法、模型和超参的有效性)。</li><li>支持各种ML tasks、OS、hardware,同时考虑奥手机端有限的资源(淘宝最大的RAM是200MB,包的大小不超过300MB)。</li><li><strong>取得不错的性能</strong>。<ul><li>使用C/C++写tensor compute engine。</li><li>做operator-level和system-level的优化。<ul><li>手动优化(基本上所有的ML engines都是这么干的),工作量很大,❌</li><li>auto tuning,不能支持运行时优化,不能满足工业界的需求(异构设备、频繁的ML任务更新)❌</li></ul></li><li>需要将pre-processing和post-processing和MNN集成为一个框架,可以最充分的进行优化。</li></ul></li></ul></li><li>Walle支持各种各样源数据的处理:<strong>data pipeline</strong>,能将来自不同的源和不同数据格式的数据转化为设备端和云端的ML模型的输入。</li><li>Walle支持应用部署和运行阶段的全管理:deployment platform。<ul><li>给各种各样的手机后端,管理、发布和部署ML的任务。</li></ul></li></ul><p><strong>具体如何实现Walle</strong>:</p><ul><li>MNN计算引擎:<ul><li>使用Python,通过修改CPython实现Python虚拟机。<ul><li>抛弃GIL,支持任务级别的多线程,实现VM isolation和data isolation</li><li>为不同的设备端定制实际的需求。</li></ul></li><li>使用MNN作为底层的计算引擎,将MNN作为Python thread-level的VM。</li></ul></li><li>data pipeline:全新的设备上的流处理框架。<ul><li>一些非经典的数据(例如用户行为数据)的处理。</li><li>通过管理多个流处理任务的触发条件去生成不同的特征,with a trie(字典查找树) structure for concurrent triggering。</li><li>建立一个实时的tunnel,可以把设备端的fresh features送到云端。</li></ul></li><li>deployment platform:使用git管理任务实体,categorize task-related files into shared and exclusive ones to facilitate multi-granularity deployments, and release tasks with an efficient push-then-pull method and in steps<strong>?</strong><ul><li>需要处理:处理大规模的部署需求。</li><li>需要处理:设备的可用性问题。</li><li>需要处理:可能带来的Task Failure问题。</li></ul></li></ul><p><strong>Walle的评测</strong>:</p><ul><li>真实场景(在线直播和推荐)。</li><li>Micro-benmarks of MNN和Python thread-level VM。</li></ul><h2 id="3-Walle-Architecture-and-Design-Rationale(Walle的架构和设计原理)"><a href="#3-Walle-Architecture-and-Design-Rationale(Walle的架构和设计原理)" class="headerlink" title="3 Walle: Architecture and Design Rationale(Walle的架构和设计原理)"></a>3 Walle: Architecture and Design Rationale(Walle的架构和设计原理)</h2><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221004619237.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221004619237.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><figure class="half"> <img 
src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221005206478.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221005206478.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" width="400"> <img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221005218469.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221005218469.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" width="400"></figure><ul><li><p>Compute Container</p><ul><li>通过refine CPython实现了一个<strong>Python VM</strong>,针对实际的mobile APP的需求实现了定制。</li><li>考虑到不同ML任务的独立特征和每个ML任务不同阶段的执行,在实现的Python VM中抛弃了GIL,第一次通过将每一个ML task绑定到线程并执行thread isolation来支持tasl-level的thread 。这样的Python VM-based的设计给Compute Container赋予了动态任务派发的能力,将ML任务的迭代和数月/数周的移动App的更细解绑。</li><li>在底层,使用C/C++实现了跨平台高性能的<strong>tensor compute engine</strong>,core使用了geometric computing和semi-auto search的创新的机制。</li><li><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221142104361.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221142104361.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221142104361"><ul><li><strong>geometric computing</strong> extracts a new atomic operator from transform operators, by leveraging the nature of coordinate transformation as well as the linear mapping between an element’s coordinate and its memory address. 
<strong>As a result</strong>, all the transform and composite operators, accounting for roughly <strong>49% of all the operators, can be decomposed to the atomic operators</strong>, <strong>reducing 46% of the workload</strong> of manually implementing and optimizing 124 operators for 16 kinds of backends from algorithm, ISA, memory, and assembly.</li><li>Then, to <strong>quickly identify the backend available on a mobile device or a cloud server</strong> to execute a computation graph with a series of operators at the minimum cost, <strong>semi-auto search is applied in runtime to find the optimal implementation algorithm with the optimal parameters for each operator on each available backend</strong>.</li><li>Based on the tensor compute engine, we implement <strong>the libraries of scientific computing, image processing, model inference, and model training</strong>, and expose them to Python VM as standard APIs, supporting the whole cycle of diverse ML tasks with standard data input.</li></ul></li></ul></li><li><p>Data Pipeline:</p><ul><li>Input data is mainly <strong>user behavior</strong> (recorded as a time-level event sequence, grouped by page into a page-level event sequence), so a stream processing task's <strong>trigger condition</strong> can be <strong>defined as</strong> a sequence of <strong>event/page ids</strong>.</li><li>To support <strong>multiple triggering</strong>, trigger conditions and event sequences are modeled as a string-matching problem; the trigger conditions are organized into a trie, so that when a new event arrives, all triggered tasks can be executed.</li><li>A given stream processing task may be triggered many times, continually producing event sequences with tiny outputs each time, so a collective storage mechanism is designed to reduce write frequency.</li><li>For low-latency upload of on-device stream outputs, a real-time tunnel built on persistent connections can ship 30KB within 500ms.</li></ul></li></ul><ul><li>Deployment Platform:<ul><li>All task entities are managed with git; task-related files are categorized into shared and exclusive ones.</li><li>For timely task deployment, a novel push-then-pull method over transient connections: the push reuses existing client-side HTTP requests, while the pull goes through CDN (content delivery network) and Alibaba CEN (cloud enterprise network).</li><li>For robustness, task simulation tests on the cloud compute engine run before release, releases are forced to roll out in steps, and a failing task can be rolled back quickly.</li></ul></li></ul><h2 id="4-Compute-Container-in-Walle(计算容器的具体实现)"><a href="#4-Compute-Container-in-Walle(计算容器的具体实现)" class="headerlink" title="4 Compute Container in Walle(计算容器的具体实现)"></a>4 Compute Container in Walle</h2><p>Bottom-up: MNN's tensor compute engine, the Python thread-level VM, and MNN's standard APIs.</p><h3 id="Tensor-Compute-Engine(底层的张量计算引擎怎么实现和优化)"><a href="#Tensor-Compute-Engine(底层的张量计算引擎怎么实现和优化)" class="headerlink" title="Tensor Compute Engine(底层的张量计算引擎怎么实现和优化)"></a>Tensor Compute Engine (implementation and optimization)</h3><p><strong>Tensor compute operators</strong>:</p><ul><li>Atomic operators, the basic units of backend optimization, e.g., unary operations (such as square) and binary operations (such as add/subtract/multiply/divide).</li><li>Transform operators, which change an element's shape or rearrange elements, e.g., transpose, slicing, concatenation, and permutation.</li><li>Composite operators, which can be decomposed into atomic and transform operators, e.g., 3D convolution and pooling, normalization, the LSTM cell, and the exponential linear unit.</li><li>Control-flow operators, including if and while.</li></ul><p><strong>Geometric Computing</strong>:</p><ul><li>Currently MNN supports $N_{aop} = 61$ atomic operators, $N_{top} = 45$ transform operators, $N_{cop} = 16$ composite operators, and $N_{fop} = 2$ control-flow operators on $N_{ba} = 16$ backends, a manual-optimization workload of $(N_{aop} + N_{top} + N_{cop}) \times N_{ba} + N_{fop} = 1954$ operator-backend cases.</li><li>Since atomic and control-flow operators are unavoidable while transform and composite operators can be composed from them, the idea is to <strong>cut the workload of transform and composite operators</strong>, which account for almost half of it (and whose share will grow as DNNs spread). How? <strong>A new atomic operator called "raster"</strong> is extracted from the transform operators, so that all transform and composite operators decompose into raster plus atomic operators. Since only atomic and raster operators then need optimization, the workload drops to $(N_{aop} + 1) \times N_{ba} + N_{top} + N_{cop} + N_{fop} = 1055$.</li><li><strong>What is "raster"?</strong> A transform operator fundamentally moves elements from one memory address to another, i.e., maps an element from one coordinate (geometry) to another, and the memory address is a fixed linear function of the coordinate. So for a given transform operator, the coordinate-transformation formula is fixed, and hence for any element of the input or output tensor, both the pre- and post-transform memory addresses are determined. <strong>The raster operator moves elements between the input and output tensors according to these memory addresses and the coordinate transformation.</strong> Take slicing as an example: A is a $2\times4$ matrix in contiguous memory with a unique identifier/pointer, and the slice of A keeping only the second row is B, a $1\times4$ matrix. Each element $B_{i,j}$ sits at address $i\times4+j$ relative to B's identifier, which is linear in the coordinate of $B_{i,j}$; the coefficients $(4,1)$ are called strides. By the definition of slicing, $B_{i,j}=A_{i+1,j}$; the coordinate of $A_{i+1,j}$ is $(i+1,j)$, and its address relative to A's identifier is $(i+1)\times4+j=4i+j+4$, i.e., strides $(4,1)$ with offset 4. The raster operator moves each element $A_{i+1,j}$ to $B_{i,j}$ by iterating over $\{(i,j)\mid 0\leq i<1,\ 0\leq j<4,\ i,j\in\mathbb{Z}\}$. (A sketch follows this list.)</li><li><strong>How is "raster" actually implemented?</strong> A concept called a "region" is introduced, consisting of an input tensor, a coordinate range, and linear mappings (called "views", expressed by strides and offsets). After decomposing transform and composite operators, some raster operators can be fused for optimization: vertical merging handles two consecutive raster operations by skipping the indirect reference and operating directly on the original tensor, and horizontal merging keeps only one of two parallel raster operations on the same region.</li></ul>
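<p>A minimal sketch of the raster idea on the slicing example above (my own illustration; MNN's real implementation is in C++ and far more general): a "view" is just (offset, strides), and the raster op copies elements by computing source and destination addresses from coordinates:</p><pre><code class="python">
import numpy as np

def raster(src_flat, dst_flat, shape, src_view, dst_view):
    """Move elements: address = offset + sum(stride[d] * coord[d])."""
    src_off, src_strides = src_view
    dst_off, dst_strides = dst_view
    for coord in np.ndindex(*shape):           # iterate the coordinate range
        s = src_off + sum(st * c for st, c in zip(src_strides, coord))
        d = dst_off + sum(st * c for st, c in zip(dst_strides, coord))
        dst_flat[d] = src_flat[s]

A = np.arange(8.0)                 # a 2x4 matrix, flattened row-major
B = np.empty(4)                    # will hold A's second row
# source view: offset 4, strides (4, 1); destination view: offset 0, strides (4, 1)
raster(A, B, shape=(1, 4), src_view=(4, (4, 1)), dst_view=(0, (4, 1)))
print(B)                           # -> [4. 5. 6. 7.]
</code></pre>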
<p><strong>Atomic Operator Optimization:</strong></p><p>Optimization of the atomic operators, raster included. Hardware heterogeneity is considered comprehensively, and the implementations are optimized from the angles of <strong>algorithm</strong>, <strong>ISA</strong>, <strong>memory</strong>, and <strong>assembly</strong>.</p><ul><li><strong>Algorithm-level optimization</strong>: for compute-intensive operators such as convolution and matrix multiplication, adopt more efficient algorithms like Winograd and Strassen, drastically reducing the number of multiplications.</li><li><strong>ISA-level optimization</strong>: exploit single-instruction-multiple-data (SIMD) units such as ARM Neon and x86 AVX512. To fully exploit data-level parallelism in SIMD, the data layout and data packing are carefully designed; concretely, a new NC/4HW4 layout and a channel-major packing built for convolution.</li><li><strong>Memory-level optimization</strong>: reduce the number of memory reads/writes and improve allocation contiguity; concretely, tiling and memory reordering for matrix multiplication.</li><li><strong>Assembly-based optimization</strong> achieves instruction-level speedup: hand-written assembly for some core operators with careful loop unrolling, software pipelining, and instruction reordering.</li></ul><p><strong>Semi-Auto Search (?):</strong></p><p>Data processing and model execution involve sequences of operations (atomic, raster and control-flow operators); different backends implement and optimize operators differently, and a mobile device or cloud server usually has several backends. <strong>The global goal of semi-auto search is therefore to identify the backend that executes the graph at minimum cost.</strong> A backend's cost is the sum over all its operators under their optimal implementations; to identify the optimal implementation algorithm of an operator on a given backend, the optimal parameters of each candidate algorithm must be found, which becomes a <strong>tractable constrained optimization problem</strong> that can be solved quickly, hence the name semi-auto search. The formalization is as follows:</p><ul><li><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221174812508.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221174812508.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221174812508"></p></li><li><p>Equation (3) is the tricky one:</p><ul><li><p>(1) $Q_{alg}$ is the number of elementary computations of algorithm $alg$; it can be computed once the parameters and the input size are given.</p></li><li><p>(2) $P_{ba}$ is the performance of backend $ba$. In MNN, for a CPU-type backend, $P_{ba}$ is set to 16 times the frequency if $ba$ supports ARMv8.2-FP16 and 8 times the frequency otherwise; for a GPU-type backend, $P_{ba}$ is set to the manually benchmarked floating-point operations per second (FLOPS). $S_{alg,ba}$ is the scheduling cost of algorithm $alg$ on backend $ba$: in MNN it is 0 for CPU-type backends and set empirically for GPU-type backends, mainly accounting for data-transfer time.</p></li><li><p>Now, given operator $op_{i}$, backend $ba$, implementation algorithm $alg$, and the input sizes, determining the algorithm's optimal parameters is abstracted as a constrained optimization 
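<p>A minimal sketch of the kind of small constrained search this boils down to, using a made-up memory-cost model (the true objective is the formula in the figure above; the register budget, SIMD width, and cost expression here are assumptions for illustration only):</p><pre><code class="python">
def best_tile(a, e, b, n_regs=32, simd=4):
    """Brute-force the (t_e, t_b) tile sizes minimizing a toy count of
    memory accesses for C = A(a x e) @ B(e x b) under a register budget."""
    candidates = []
    for t_e in range(1, e + 1):
        for t_b in range(simd, b + 1, simd):            # keep tiles SIMD-aligned
            regs = t_e * t_b // simd + t_e + t_b // simd  # rough register need
            if regs > n_regs:
                continue
            # toy cost: tiles of A are re-read b/t_b times, tiles of B a/t_e times
            cost = a * e * (b / t_b) + e * b * (a / t_e)
            candidates.append((cost, t_e, t_b))
    return min(candidates)                               # (cost, t_e, t_b)

print(best_tile(a=64, e=64, b=64))
</code></pre>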
problem whose objective is to minimize compute and memory cost. The <strong>constraints</strong> mainly include the SIMD unit width, the number of registers, the number of threads, and the input sizes; the parameters optimized are mainly the packing size in SIMD, the tile size of matrix multiplication, the block unit of the Winograd algorithm, and the elementary calculations of the Strassen algorithm. For example, the tile size of matrix multiplication: let A be an $a\times e$ matrix and B an $e\times b$ matrix, $t_e$ the tile size along the shared axis, $t_b$ the tile size along B's columns, and $N_r$ the number of registers; the objective is to minimize the number of memory reads/writes, as follows:</p><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221202952881.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221202952881.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221202952881"></p></li></ul></li><li><p>Compared with manually picking an implementation with some common parameters case by case, semi-auto search not only slashes the workload but is also likelier to find the optimal parameters. Why not TVM-style auto tuning? Because it does not exploit the manual experience baked into operator optimization, spends too long in static compilation due to the huge operator-level and graph-level search space, and cannot optimize at runtime, so it cannot meet industrial needs (heterogeneous devices, frequent ML task updates). Most importantly, given the restriction on executable files and just-in-time (JIT) compilation on iOS devices for security, the models TVM compiles must be linked into the mobile APP, which breaks the decoupling of ML iteration from APP updates; TVM is therefore impractical for industrial use. In contrast, this design fully exploits operator-level optimization across heterogeneous backends to shrink the semi-auto search space, so models can ship as regular resource files, allowing further runtime optimization and daily ML-task iteration in the Python VM. A side benefit: the APP package size does not grow with the number of ML tasks.</p></li></ul><h3 id="Data-and-Model-Related-Libraries-数据和模型相关的库"><a href="#Data-and-Model-Related-Libraries-数据和模型相关的库" class="headerlink" title="Data and Model Related Libraries (数据和模型相关的库)"></a>Data and Model Related Libraries</h3><p><strong>MNN implements</strong></p><ul><li>Scientific computing and image processing for pre-processing: optimized re-implementations of NumPy and OpenCV that must be lightweight and high-performance. Lightweight: the original NumPy 1.9.3 and OpenCV 3.4.3 weigh 2.1MB and 1.2MB, versus only 51KB and 129KB in MNN. High-performance: the optimizations in the underlying tensor compute engine carry over, since all scientific-computing and image-processing operations are supported with atomic, raster and control-flow operators.</li><li>Model training and inference libraries: MNN currently offers two inference modes, session and module, and two training optimizers, SGD and Adam.<ul><li>module mode: supports control flow (needed by transformers and dynamic RNNs), which session mode does not (<strong>in the second step of session mode</strong>, the control-flow operators require the intermediate result to determine the following execution order and thus cannot be supported in the session mode; to solve this, after the first step of session mode, the computation graph is split into multiple modules at the control-flow operators, and each module then executes exactly like a session).</li><li>session mode: four steps:<ul><li>(1) load a model, create a session, arrange all the operators in the computation graph according to the topological ordering, and apply for the tensors that all the operators need;</li><li>(2) given the shape of each input tensor and the definition of each operator, compute the shapes of all the tensors;</li><li>(3) perform geometric computing, particularly, first decompose the transform and composite operators into the atomic and raster operators, and then do vertical and horizontal merging for raster operators;</li><li>(4) identify the optimal backend with semi-auto search, request memory for each operator and execute in sequence, and return the inference result.</li></ul></li></ul></li><li>Post-processing.</li></ul>
<h3 id="Data-and-Model-Related-Libraries-数据和模型相关的库"><a href="#Data-and-Model-Related-Libraries-数据和模型相关的库" class="headerlink" title="Data and Model Related Libraries (数据和模型相关的库)"></a>Data and Model Related Libraries</h3><p><strong>MNN implements:</strong></p><ul><li>Scientific computing and image processing for pre-processing: the two libraries are optimized re-implementations of NumPy and OpenCV, required to be both lightweight and high-performance. Lightweight: the original NumPy 1.9.3 and OpenCV 3.4.3 take 2.1MB and 1.2MB, versus only 51KB and 129KB in MNN. High-performance: the optimizations of the underlying tensor compute engine carry over to the libraries, which support the various scientific-computing and image-processing operations via the atomic, raster, and control-flow operators.</li><li>Model training and inference libraries: MNN currently provides two inference modes, session and module, and implements two optimizers for training, SGD and Adam.<ul><li>module mode: supports control flow (needed by transformers and dynamic RNNs), which session mode does not (<strong>in the second step of session mode</strong>, the control-flow operators require the intermediate result to determine the following execution order and thus cannot be supported in the session mode; to solve this, after the first step of session mode, the computation graph is split into multiple modules at the control-flow operators, so that each module executes exactly like a session).</li><li>session mode: four steps <ul><li>(1) load a model, create a session, arrange all the operators in the computation graph according to the topological ordering, and apply for the tensors that all the operators need;</li><li>(2) given the shape of each input tensor and the definition of each operator, compute the shapes of all the tensors;</li><li>(3) perform geometric computing, particularly, first decompose the transform and composite operators into the atomic and raster operators, and then do vertical and horizontal merging for raster operators;</li><li>(4) identify the optimal backend with semi-auto search, request memory for each operator and execute in sequence, and return the inference result.</li></ul></li></ul></li><li>post-processing.</li></ul><h3 id="Python-Thread-Level-Virtual-Machine(Python线程级虚拟机的实现)"><a href="#Python-Thread-Level-Virtual-Machine(Python线程级虚拟机的实现)" class="headerlink" title="Python Thread-Level Virtual Machine(Python线程级虚拟机的实现)"></a>Python Thread-Level Virtual Machine</h3><p>CPython has two problems here:</p><ul><li><p>The package size is too large: libraries and modules were tailored to Taobao's actual needs. For example, on ARM64-based iOS, the package size decreases from 10MB+ to only 1.3MB.</p><ul><li>(1) Functionality Tailoring: CPython first compiles Python code into bytecode with the file suffix ".pyc" and then interprets the bytecode for execution. By leaving the compile phase on the cloud and sending only the bytecode to mobile devices for execution, all the compile modules can be deleted, saving 17 scripts in C.</li><li>(2) Library and Module Tailoring: keep 36 necessary libraries (e.g., abc, type, re, and functools) and 32 modules (e.g., zipimport, sys, exceptions, and gc).</li></ul></li><li><p>CPython does not support true multi-threading: abandon the GIL and further design and implement the first Python thread-level interpreter in industry.</p><p> <img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221213241327.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221213241327.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221213241327"></p><ul><li>each task is scheduled to a certain thread, which creates an independent VM and contains the VM runtime and task-related data.<ul><li>(1) VM Isolation: see the original paper</li><li>(2) Data Isolation: see the original paper</li></ul></li></ul></li></ul><p>Each mobile APP has only one process; multi-processing is not allowed.</p><h3 id="4-4-Standard-APIs"><a href="#4-4-Standard-APIs" class="headerlink" title="4.4 Standard APIs"></a>4.4 Standard APIs</h3><p>The scientific-computing and image-processing APIs are consistent with the original NumPy/OpenCV APIs; the training and inference APIs are custom.</p><h2 id="5-Data-Pipeline-in-Walle(数据流水线的具体实现)"><a href="#5-Data-Pipeline-in-Walle(数据流水线的具体实现)" class="headerlink" title="5 Data Pipeline in Walle(数据流水线的具体实现)"></a>5 Data Pipeline in Walle</h2><h3 id="5-1-On-Device-Stream-Processing-Framework(设备上的流处理框架)"><a href="#5-1-On-Device-Stream-Processing-Framework(设备上的流处理框架)" class="headerlink" title="5.1 On-Device Stream Processing Framework(设备上的流处理框架)"></a>5.1 On-Device Stream Processing Framework</h3><p>We introduce on-device stream processing from <strong>event sequence creation</strong>, <strong>trigger management</strong>, <strong>task triggering</strong>, <strong>task execution</strong>, and <strong>collective storage</strong>.</p><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221213805038.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221213805038.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221213805038"></p><ul><li><strong>event sequence creation</strong>: user-APP interaction events are recorded as page enter, page scroll, exposure, click, and page exit; each event carries a unique event id, a page id, a timestamp, and event contents (e.g., for an exposure event the contents are an item id; for a click event, the graphical widget). Events are grouped by page.</li><li><strong>trigger management</strong>: a stream processing task consists of scripts and configurations; the scripts implement the data-processing algorithms, and the configurations mainly specify the trigger conditions. Trigger conditions are handled with a string-matching algorithm over a Trie (prefix tree). (See the corresponding paragraph of the original paper for the exact algorithm.)</li><li><strong>task triggering</strong>: when a new event arrives (carrying an event id and a page id), the set of triggered tasks is returned. (See the original paper for the exact triggering algorithm.)</li><li><strong>task execution</strong>: once a task is triggered, its scripts are executed. To make it convenient to first extract the relevant events from the event sequence and process their contents, the stream processing framework provides several functions (see the sketch after this list):<ul><li>(1) KeyBy</li><li>(2) TimeWindow</li><li>(3) Filter</li><li>(4) Map</li></ul></li><li><strong>collective storage</strong>: the output of a stream processing task, usually features, is stored in SQLite; but a task may be triggered many times, so a collective data storage is set up. (See the original paper for the exact algorithm.)</li></ul>
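<p>A toy sketch of what these four helpers might look like over page-scoped event dicts; the function signatures and event fields are assumptions for illustration, not Walle's actual API:</p><pre><code class="python"># Toy event stream; the four helpers mirror the names above, but their
# signatures and the event fields are invented for illustration.
events = [
    {"event_id": "click",    "page_id": "home", "ts": 1.0, "content": 42},
    {"event_id": "exposure", "page_id": "home", "ts": 1.5, "content": 7},
    {"event_id": "click",    "page_id": "cart", "ts": 2.0, "content": 42},
]

def key_by(stream, key):                  # KeyBy: group events by a field
    groups = {}
    for e in stream:
        groups.setdefault(e[key], []).append(e)
    return groups

def time_window(stream, start, end):      # TimeWindow: keep events with ts in [start, end)
    return [e for e in stream if start <= e["ts"] < end]

def filter_(stream, pred):                # Filter: keep events matching a predicate
    return [e for e in stream if pred(e)]

def map_(stream, fn):                     # Map: transform each event into a feature
    return [fn(e) for e in stream]

clicks = filter_(time_window(events, 0.0, 2.5), lambda e: e["event_id"] == "click")
features = map_(clicks, lambda e: (e["page_id"], e["content"]))
print(key_by(events, "page_id").keys(), features)
</code></pre>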
<h3 id="5-2-Real-Time-Device-Cloud-Tunnel(实时的设备云的Tunnel)"><a href="#5-2-Real-Time-Device-Cloud-Tunnel(实时的设备云的Tunnel)" class="headerlink" title="5.2 Real-Time Device-Cloud Tunnel(实时的设备云的Tunnel)"></a>5.2 Real-Time Device-Cloud Tunnel</h3><p>The results of on-device stream processing can be uploaded to the cloud for real-time use. A tunnel is built on persistent connections, using an optimized SSL; data is compressed before transmission, and an asynchronous service framework is deployed on the cloud.</p><h2 id="6-Deployment-Platform(部署平台的具体实现)"><a href="#6-Deployment-Platform(部署平台的具体实现)" class="headerlink" title="6 Deployment Platform(部署平台的具体实现)"></a>6 Deployment Platform</h2><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221220948303.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221220948303.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221220948303"></p><ul><li>Management, release, and deployment of ML tasks.</li><li><strong>Task Management</strong>: (see the original paper for details)</li><li><strong>Task Release & Deployment</strong>: (see the original paper for details)</li></ul><h2 id="7-Evaluation-of-Walle(Walle的评测)"><a href="#7-Evaluation-of-Walle(Walle的评测)" class="headerlink" title="7 Evaluation of Walle(Walle的评测)"></a>7 Evaluation of Walle</h2><p>(See the original paper for details.)</p><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221221415753.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221221415753.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221221415753"></p><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221221425099.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221221425099.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221221425099"></p><p><img src="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221221432292.png" class="lazyload placeholder" data-srcset="/2022/12/20/osdi22-walle-an-end-to-end-general-purpose-and-large-scale-production-system-for-device-cloud-collaborative-machine-learning/image-20221221221432292.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221221221432292"></p>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
</tags>
</entry>
<entry>
<title>arXiv21-FEDLAB: A FLEXIBLE FEDERATED LEARNING FRAMEWORK</title>
<link href="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/"/>
<url>/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/</url>
<content type="html"><![CDATA[<h1 id="arXiv21-FEDLAB-A-FLEXIBLE-FEDERATED-LEARNING-FRAMEWORK"><a href="#arXiv21-FEDLAB-A-FLEXIBLE-FEDERATED-LEARNING-FRAMEWORK" class="headerlink" title="arXiv21-FEDLAB-A-FLEXIBLE-FEDERATED-LEARNING-FRAMEWORK"></a>arXiv21-FEDLAB-A-FLEXIBLE-FEDERATED-LEARNING-FRAMEWORK</h1><p><img src="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218145512372.png" class="lazyload placeholder" data-srcset="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218145512372.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218145512372"></p><p>设计的目标:</p><ul><li><strong>轻量化</strong>的进行FL的<strong>模拟</strong>,a lightweight open-source framework for FL simulation,只为开发者实现必要的工具。</li><li>聚焦于提高算法效率和通信效率,focuses on FL algorithm effectiveness and communication efficiency。</li></ul><p><strong>缺少验证</strong></p><h2 id="Background(当前FL的现状,存在什么问题)"><a href="#Background(当前FL的现状,存在什么问题)" class="headerlink" title="Background(当前FL的现状,存在什么问题)"></a>Background(当前FL的现状,存在什么问题)</h2><p>FL分为几个步骤:</p><ul><li>i) local update on client’s model using their own localized data;</li><li>ii) clients upload their local trained model parameters to server;</li><li>iii) server performs aggregation strategy on collected clients’ model parameters to obtain global model</li><li>iv) server selects a subset of clients and distributes the latest global model to them</li></ul><p>当前FL的研究集中于优化上面提到的几个步骤中的某些策略。</p><p><img src="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218145837760.png" class="lazyload placeholder" data-srcset="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218145837760.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218145837760"></p><p>当前大多数的研究者都自己从头到尾去实现这样的一个东西,所以希望做出一个框架出来。</p><p>面向<strong>工业生产</strong>的FL框架:</p><ul><li>FATE,微众银行</li><li>PaddleFL,百度</li><li>FedLearner,字节</li></ul><p>适合实际的应用,不适合实验室的模拟。</p><p>聚焦于安全多方计算:</p><ul><li>Rosetta</li><li>PySyft</li></ul><p>支持模拟但仅仅只支持单机:</p><ul><li>TFF</li></ul><p>FedML很全面,聚焦于Research</p><p>Flower后端支持不同的框架。</p><p><strong>这些框架都太重量级了,不轻量?</strong></p><p><strong>FedLab</strong>:</p><ul><li>接口灵活,supports standalone, cross machine and hierarchical simulation paradigms</li><li>提供various data partition tools</li><li>实现了一些FL的scheme</li></ul><h2 id="Framework-Overview(框架的整体架构)"><a href="#Framework-Overview(框架的整体架构)" class="headerlink" title="Framework Overview(框架的整体架构)"></a>Framework Overview(框架的整体架构)</h2><h3 id="单机的框架"><a href="#单机的框架" class="headerlink" title="单机的框架"></a>单机的框架</h3><p><img src="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218150221243.png" class="lazyload placeholder" data-srcset="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218150221243.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218150221243"></p><ul><li>使用torch.distributed作为communication backend</li><li>使用NetworkManager管理网络的拓扑结构</li><li>each role of FedLab is represented by single system process</li></ul><p><strong>Network Manager</strong>(传递向量,可以使用Pytorch轻松的实现):</p><ul><li><p>传递的消息为Package</p><ul><li>header tensor存储必要的控制信息</li><li>content tensor包含打包好的tensor list</li></ul></li><li><p>PackageProcessor提供打包tensor list,解包tensor list必要的函数。</p></li><li><p>Package的内容可以自定义。</p></li><li><p>支持同步和异步的通信模式。<img 
src="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218151243254.png" class="lazyload placeholder" data-srcset="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218151243254.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218151243254"></p></li></ul><h3 id="多机的框架"><a href="#多机的框架" class="headerlink" title="多机的框架"></a>多机的框架</h3><ul><li><p>Schedular来解决需要模拟很多nodes的问题,只能在LAN中使用。</p></li><li><p>多机的框架和单机的框架结合起来就形成了一个层次的结构。</p><p> <img src="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218151728451.png" class="lazyload placeholder" data-srcset="/2022/12/18/arxiv21-fedlab-a-flexible-federated-learning-framework/image-20221218151728451.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218151728451"></p></li></ul><h3 id="其它"><a href="#其它" class="headerlink" title="其它"></a>其它</h3><ul><li>ClientSGDTrainer is a standard implementation of Trainer for users</li><li>提供了数据划分的工具。</li></ul><h2 id="Pipeline-and-Examples(怎么使用这个框架)"><a href="#Pipeline-and-Examples(怎么使用这个框架)" class="headerlink" title="Pipeline and Examples(怎么使用这个框架)"></a>Pipeline and Examples(怎么使用这个框架)</h2><ul><li>The first part is definition of communication agreements. The prototype of synchronous and asynchronous communication patterns have been implemented for users.</li><li>The second part is ParameterServerHandler module of server and Trainer module of client, which represents FL optimization process.</li></ul>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
</tags>
</entry>
<entry>
<title>arXiv20-FLOWER: A FRIENDLY FEDERATED LEARNING FRAMEWORK</title>
<link href="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/"/>
<url>/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/</url>
<content type="html"><![CDATA[<h1 id="arXiv20-FLOWER-A-FRIENDLY-FEDERATED-LEARNING-FRAMEWORK"><a href="#arXiv20-FLOWER-A-FRIENDLY-FEDERATED-LEARNING-FRAMEWORK" class="headerlink" title="arXiv20-FLOWER-A-FRIENDLY-FEDERATED-LEARNING-FRAMEWORK"></a>arXiv20-FLOWER-A-FRIENDLY-FEDERATED-LEARNING-FRAMEWORK</h1><p><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217221517671.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217221517671.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221217221517671"></p><p><strong>解决的问题</strong>:Although there are a number of research frameworks available to simulate FL algorithms, they do not support <strong>the study of scalable FL workloads</strong> on <strong>heterogeneous edge devices</strong>.</p><ul><li>FL workloads的可拓展问题。</li><li>异构的边缘设备问题。</li></ul><p><strong>实验的结果</strong>:Our experiments show Flower can perform FL experiments <strong>up to 15M in client size using only a pair of high-end GPUs</strong>. Researchers can then seamlessly migrate experiments to real devices to examine other parts of the design space.</p><ul><li>仅使用一对高性能的GPU,Flowers可以模拟出15M的client size。</li></ul><h2 id="Background(当前的框架存在问题)"><a href="#Background(当前的框架存在问题)" class="headerlink" title="Background(当前的框架存在问题)"></a>Background(当前的框架存在问题)</h2><h3 id="当前的框架存在问题:"><a href="#当前的框架存在问题:" class="headerlink" title="当前的框架存在问题:"></a><strong>当前的框架存在问题</strong>:</h3><ul><li><p>TFF、PySyft和LEAF这些框架支持FL算法的小规模的模拟实验,但不支持在<strong>边缘设备上的真实的联邦学习</strong>。</p></li><li><p>当前的研究很少使用超过100个clients的,即使使用了超过100个,也依赖于模拟而不是实际的实现。<strong>工业上的clients的规模十分的巨大。</strong></p><p> <img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217224332105.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217224332105.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221217224332105"></p><ul><li>图片显示Flower比FedScale的规模大。</li></ul></li></ul><h3 id="Flower(Flower的优点)"><a href="#Flower(Flower的优点)" class="headerlink" title="Flower(Flower的优点)"></a>Flower(Flower的优点)</h3><p>针对这两个问题,<strong>Flower</strong>:</p><ul><li><strong>在云环境下模拟了真实的设备场景</strong>。Flower provides builtin tools to simulate many of these challenging conditions in a cloud environment and allows for a realistic evaluation of FL algorithms.</li><li><strong>scalability考虑的非常好,支持大面积的训练和评测</strong>。Flower is designed with scalability in mind and enables large-cohort research that leverages both a large number of connected clients and a large number of clients training concurrently.</li></ul><p>具体的说,Flower:</p><ul><li>支持<strong>真实的设备</strong>、<strong>单节点</strong>和<strong>多节点的计算集群</strong>。(clients的规模可以模拟的很大)</li><li><strong>Flower可拓展性好,能够支持新的算法、训练方法和通信代理,模拟和真实设备之间的迁移很丝滑</strong>。</li><li>clients上支持各种各样的ML框架。</li><li>做了算法和系统的实验,验证了Flower可以将规模拓展到 15million。</li></ul><p>Flower预先实现的一些算法:</p><p><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217225255713.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217225255713.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221217225255713"></p><h2 id="Flower(Flower的架构)"><a href="#Flower(Flower的架构)" class="headerlink" 
title="Flower(Flower的架构)"></a>Flower(Flower的架构)</h2><p><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217225322970.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217225322970.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221217225322970"></p><ul><li><p>在Server上给一个Stategy指示Server怎么做:client selection、configuration和parameter update aggregation等等。</p></li><li><p>Client分为真实的Edge Client和模拟出的Virtual Client,真实的Client和Client Manager之间通过Client Manager连接。</p></li><li><p>Client Manager管理着一系列的Client Proxy,这是Client Manager和Client之间存在的一层中间层,抽象掉Client上使用的不同的ML框架。</p></li><li><p>Server端包括的components:Client Manager、FL loop和一个用户自定义的Strategy。<strong>服务器端的components</strong>从ClientManager采样clients,<strong>ClientManager</strong>管理着一系列的ClientProxy对象,每一个<strong>ClientProxy对象</strong>表示一个连接到服务器的客户端,这些ClientProxy负责服务器和Clients之间的消息的传递。<strong>FL loop</strong>在FL process的heart,负责协调整个学习的过程。<strong>FL Process</strong>不负责配置、算法,这些<strong>由Stategy负责</strong>。FL loop要求Stategy配置下一轮次的FL,把这些配置发送给对应的clients,接受这些clients的更新,并把聚合的结果返回给Strategy。</p></li></ul><h3 id="Virtual-Client-Engine(核心框架上的一个组件)"><a href="#Virtual-Client-Engine(核心框架上的一个组件)" class="headerlink" title="Virtual Client Engine(核心框架上的一个组件)"></a>Virtual Client Engine(核心框架上的一个组件)</h3><p>a tool that enables the virtualization of Flower Clients to maximise utilization of the available hardware。</p><h3 id="Edge-Client-Engine(核心框架上的一个组件)"><a href="#Edge-Client-Engine(核心框架上的一个组件)" class="headerlink" title="Edge Client Engine(核心框架上的一个组件)"></a>Edge Client Engine(核心框架上的一个组件)</h3><ul><li>Raspberry Pi或者NVIDIA Jetson这样的设备可以直接当做Flower Clients.</li><li>手机这种commodity设备由更严格的、受限的、有时专有的软件栈。为了解决这个问题:<strong>Flower提供了一个low-level的集成(直接处理在Client上的Flower Protocol)</strong>。<strong>没太看懂为什么这样可以解决这样的问题</strong>。</li></ul><h3 id="Secure-Aggregation(Flower为保护隐私实现的算法)"><a href="#Secure-Aggregation(Flower为保护隐私实现的算法)" class="headerlink" title="Secure Aggregation(Flower为保护隐私实现的算法)"></a>Secure Aggregation(Flower为保护隐私实现的算法)</h3><ul><li>实现了SecAgg、SecAgg+。</li><li>安全聚合的计算和任何特殊的软件和ML framework无关,对client的dropouts十分的鲁棒、通信和计算的理论瓶颈都很低。</li></ul><h3 id="FL-Framework-Comparison(和其它FL框架的比较)"><a href="#FL-Framework-Comparison(和其它FL框架的比较)" class="headerlink" title="FL Framework Comparison(和其它FL框架的比较)"></a>FL Framework Comparison(和其它FL框架的比较)</h3><p><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217232626708.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221217232626708.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221217232626708"></p><ul><li>Single-node simulation: 都支持单节点的模拟。</li><li>Multi-node execution: Syft和Flower支持多机器的模拟。FedScale支持多机器的模拟,但不支持实际环境中的部署。TFF计划支持。</li><li>Scalability: TFF和LEAF截止写文的时候,只支持单节点,FedScale支持多节点,但是只拓展到了100个并行的clients,Syft支持网络间的通信,但是只能通过连接作为服务器持有数据的clients,限制了scalability。</li><li>Heterogeneous clients:Flower支持通过language-agnostic和communication-agnostic实现异构client pool。TFF和Syft需要framework-provided client runtime,FedScale和LEAF集中于Python实现的模拟。</li><li>ML framework-agnostic:可以用各种各样的ML framework。TFF和TensorFlow结合,experimentally支持JAX,LEAF也依赖于TensorFlow,Syft和Pytorch以及Keras关系密切。</li><li>Language-agnostic:Flower通过使用protocol-level的集成支持多种语言,其它的框架都只支持python。</li><li>Baselines:LEAF和FedScale支持大量的Baseline,TFF提供了使用相同的数据集建立baseline的库,Flower当前实现了一些主流的FL Method和ML 
benchmarks.</li></ul><h2 id="IMPLEMENTATION(各个部件如何实现)"><a href="#IMPLEMENTATION(各个部件如何实现)" class="headerlink" title="IMPLEMENTATION(各个部件如何实现)"></a>IMPLEMENTATION</h2><p>Flower has an extensive implementation of FL averaging algorithms, a robust communication stack, and various examples of deploying Flower on real and simulated clients.</p><h3 id="Communication-stack(通信栈是如何实现的)"><a href="#Communication-stack(通信栈是如何实现的)" class="headerlink" title="Communication stack(通信栈是如何实现的)"></a>Communication stack</h3><ul><li>The current communication protocol is implemented on bi-directional <strong>gRPC</strong>.</li><li><strong>Why</strong>: an efficient binary serialization format matters especially over low-bandwidth mobile connections, and bi-directional streaming allows exchanging multiple messages without the extra overhead of re-establishing connections.</li></ul><h3 id="Serialization(序列化是如何实现的)"><a href="#Serialization(序列化是如何实现的)" class="headerlink" title="Serialization(序列化是如何实现的)"></a>Serialization</h3><ul><li>Flower clients receive raw byte sequences, deserialize the instructions, and execute them; the results are serialized and sent back to the server, and the message format is language-independent.</li><li><strong>Why:</strong> users can customize how information is transferred.</li></ul><h3 id="Alternative-communication-stacks(可替代的通信栈)"><a href="#Alternative-communication-stacks(可替代的通信栈)" class="headerlink" title="Alternative communication stacks(可替代的通信栈)"></a>Alternative communication stacks</h3><ul><li>Although messages currently travel over gRPC, users can plug in their own communication method: Flower's internals use a modular abstraction, so the gRPC layer is replaceable.</li></ul><h3 id="ClientProxy(客户端代理)"><a href="#ClientProxy(客户端代理)" class="headerlink" title="ClientProxy(客户端代理)"></a>ClientProxy</h3><ul><li>Each ClientProxy represents a single client; offline clients or clients that do not meet the requirements have no ClientProxy. All server-side logic is written against ClientProxy objects.</li><li>ClientProxy is only an abstract interface, not an implementation, so its implementation can be customized.</li></ul><h3 id="Virtual-Client-Engine-VCE"><a href="#Virtual-Client-Engine-VCE" class="headerlink" title="Virtual Client Engine (VCE)"></a>Virtual Client Engine (VCE)</h3><ul><li>The VCE creates a ClientProxy for every client, but defers instantiating the actual client object (including the local model and data) until a task actually needs to execute, avoiding a large waste of resources (see the sketch after this list).</li><li>The VCE is implemented on top of the Ray framework (Moritz et al., 2018), which schedules and serializes client-side execution, enabling large-scale runs on commodity hardware.</li></ul>
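<p>A minimal sketch of this deferred instantiation, with invented class and method names; it shows the pattern rather than Flower's actual code:</p><pre><code class="python"># Toy VCE idea: a ClientProxy only builds the heavyweight client
# (model + data) when a task actually needs it. All names are assumptions.
class ClientProxy:
    def __init__(self, cid, client_factory):
        self.cid = cid
        self._factory = client_factory
        self._client = None                 # nothing allocated yet

    def _materialize(self):
        if self._client is None:            # lazily create exactly once
            self._client = self._factory(self.cid)
        return self._client

    def fit(self, parameters):
        return self._materialize().fit(parameters)

class ToyClient:
    def __init__(self, cid):
        self.cid = cid                      # imagine loading model + data partition here

    def fit(self, parameters):
        return [p + 1 for p in parameters]  # stand-in for local training

proxies = [ClientProxy(cid, ToyClient) for cid in range(100_000)]  # cheap to hold
print(proxies[7].fit([1, 2, 3]))            # only client 7 is ever materialized
</code></pre>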
class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218001939134.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218001939134"><ul><li>10-500的收敛都很快,1000的收敛很慢,作者说可能是因为数据的不平衡分布。</li></ul></li></ul><h3 id="Single-Machine-Experiments(验证可以帮助研究者快速实现idea,跑得快)"><a href="#Single-Machine-Experiments(验证可以帮助研究者快速实现idea,跑得快)" class="headerlink" title="Single Machine Experiments(验证可以帮助研究者快速实现idea,跑得快)"></a>Single Machine Experiments(验证可以帮助研究者快速实现idea,跑得快)</h3><p>为了证明Flower可以帮助research,能够快速的给出新的idea的结果,Flower needs to be fast at providing reliable results when experimenting new ideas, e.g. a new aggregation strategy。</p><ul><li>和不同的框架做了个端到端的比较:FedScale,TFF,FedJax和原先的LEAF。</li><li>三个FL的设置(Caldas et al., 2018):FEMNIST数据集。<ul><li>number of clients (c) and local epochs per round change (l) 变化的时候,轮次和客户端的数量控制在2000和179.</li></ul></li><li>设备:Intel Xeon E5-2680 CPU (2.40GHz) equipped with two NVIDIA RTX2080 GPUs and 20GB of RAM</li></ul><p><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218002728311.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218002728311.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218002728311"></p><ul><li>c=3,l=1的时候,the overhead of having a multi-task system, like the Virtual Client Engine (VCE), causes Flower to sightly under-perform in comparison to loop-based simulators, like LEAF.</li><li>l增加,c增加的时候,这个系统的好处就体现出来了,速度更快。</li><li>VCE允许给每一个client指定分配的GPU memory:<ul><li>The VCE allows us to specify the amount of GPU memory we want to associate with each client, this allows for more efficient data and model loading of different clients on the same GPU, making the overall training considerably faster.</li></ul></li><li>FedJax在l很小的时候很高效,当通讯成为瓶颈、FedJax的性能稍微下降,FedScale的表现则一直都不是很好。</li></ul><h3 id="Flower-enables-FL-evaluation-on-real-devices(验证可以在真实设备上跑)"><a href="#Flower-enables-FL-evaluation-on-real-devices(验证可以在真实设备上跑)" class="headerlink" title="Flower enables FL evaluation on real devices(验证可以在真实设备上跑)"></a>Flower enables FL evaluation on real devices(验证可以在真实设备上跑)</h3><h4 id="可以在很多设备上跑"><a href="#可以在很多设备上跑" class="headerlink" title="可以在很多设备上跑"></a>可以在很多设备上跑</h4><p>deploying Flower on six types of heterogeneous real-world mobile and embedded devices, including Java-based <strong>Android smartphones</strong> and Python-based Nvidia <strong>Jetson series devices</strong> and <strong>Raspberry Pi</strong>.</p><ul><li>FedAvg</li><li>云虚拟机上跑。</li><li>Python实现的clients</li><li>TensorFlow(TensorFlow Lite),While TFLite is primarily designed for on-device inference, we leverage its capabilities to do on-device model personalization to implement a FL client application (Lite, 2020)</li></ul><p><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218004537772.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218004537772.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218004537772"></p><ul><li>DeepConvLSTM</li><li>a human activity recognition task on Python-enabled Jetson and Raspberry Pi devices</li><li>client上的代码可以不用改就能跑。</li><li>启发:By comparing the relative energy consumption and training times across various devices, FL researchers can devise 
more informed client selection policies that can tradeoff between FL convergence time and overall energy consumption.</li></ul><h4 id="可以在实际设备上实现细粒度的分析"><a href="#可以在实际设备上实现细粒度的分析" class="headerlink" title="可以在实际设备上实现细粒度的分析"></a>可以在实际设备上实现细粒度的分析</h4><ul><li>10 Android clients</li><li>2 convolutional layers and 3 fully-connected layers</li><li>CIFAR10 dataset</li><li>TensorFlow Lite is used as the training ML framework on the devices</li><li>We measure the time taken for various FL operations, such as local SGD training, communication between the server and client, local evaluation on the client, and the overhead due to the Flower framework:</li><li><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218005254632.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218005254632.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218005254632"></li><li>大部分的时间都是device上的训练时间,Flower整个框架的overhead很小的。</li></ul><h3 id="Realism-in-Federated-Learning(验证可以帮助研究者开发更好的优化算法)"><a href="#Realism-in-Federated-Learning(验证可以帮助研究者开发更好的优化算法)" class="headerlink" title="Realism in Federated Learning(验证可以帮助研究者开发更好的优化算法)"></a>Realism in Federated Learning(验证可以帮助研究者开发更好的优化算法)</h3><p>举了两个设计样例去验证了一下。</p><h4 id="Computational-Heterogeneity-across-Clients"><a href="#Computational-Heterogeneity-across-Clients" class="headerlink" title="Computational Heterogeneity across Clients"></a>Computational Heterogeneity across Clients</h4><p>比如根据Flower上观测的结果去微调了FedAvg算法,牺牲了部分的精度换来了更快的收敛时间。</p><h4 id="Heterogeneity-in-Network-Speeds"><a href="#Heterogeneity-in-Network-Speeds" class="headerlink" title="Heterogeneity in Network Speeds"></a>Heterogeneity in Network Speeds</h4><p>客户端的选择策略、通信都会影响训练。</p><p><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218010543925.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218010543925.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218010543925"></p><ul><li>可以观测到,然后针对性的进行优化。</li></ul><h3 id="Secure-Aggregation-Overheads"><a href="#Secure-Aggregation-Overheads" class="headerlink" title="Secure Aggregation Overheads"></a>Secure Aggregation Overheads</h3><p>实现了SecAgg和SecAgg+</p><p>模型的向量size和clients发生dropouts时候,在服务器端评测了计算和通信开销。we evaluate its impact on server-side computation and communication overhead with the model vector size and clients dropouts。</p><ul><li>Intel Xeon E-2136 CPU (3.30GHz), with 256 GB of RAM</li><li>all entries of our local vectors are of size 24 bits</li><li>ignore communication latency</li><li>all dropouts simulated happen after stage 2, i.e. Share Keys Stage. 
This is because this imposes the most significant overhead as the server not only needs to regenerate dropped-out clients’ secrets, but also compute their pairwise masks generated between their neighbours.</li><li>the n and t parameters of the t-outof-n secret-sharing scheme are set to 51 and 26,These parameters are chosen to reference SecAgg+’s proven correctness and security guarantees</li><li>Fixing the number of sampled clients to 100,</li><li>CPU running times through aggregating a vector of size 100k entries to aggregating one of size 500k entries in Figure 8:</li><li><img src="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218011017894.png" class="lazyload placeholder" data-srcset="/2022/12/17/arxiv20-flower-a-friendly-federated-learning-framework/image-20221218011017894.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221218011017894"></li></ul><h2 id="Conclusion(最终结论)"><a href="#Conclusion(最终结论)" class="headerlink" title="Conclusion(最终结论)"></a>Conclusion(最终结论)</h2>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
</tags>
</entry>
<entry>
<title>arXiv20-FedML-A Research Library and Benchmark for Federated Machine Learning</title>
<link href="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/"/>
<url>/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/</url>
<content type="html"><![CDATA[<h1 id="arXiv20-FedML-A-Research-Library-and-Benchmark-for-Federated-Machine-Learning"><a href="#arXiv20-FedML-A-Research-Library-and-Benchmark-for-Federated-Machine-Learning" class="headerlink" title="arXiv20-FedML-A Research Library and Benchmark for Federated Machine Learning"></a>arXiv20-FedML-A Research Library and Benchmark for Federated Machine Learning</h1><p><a href="https://fedml.ai/">FedML:an open research <strong>library</strong> and <strong>benchmark</strong></a></p><p>three computing paradigms:</p><ul><li>on-device training for edge devices.</li><li>distributed computing.</li><li>single-machine simulation.</li></ul><h2 id="Significance(当前别人实现的框架都不行)"><a href="#Significance(当前别人实现的框架都不行)" class="headerlink" title="Significance(当前别人实现的框架都不行)"></a>Significance(当前别人实现的框架都不行)</h2><h3 id="distributed-training-in-data-centers"><a href="#distributed-training-in-data-centers" class="headerlink" title="distributed training in data centers"></a>distributed training in data centers</h3><ul><li>PyTorch</li><li>TensorFlow</li><li>MXNet</li><li>Horovod (distributed training-specialized libraries)</li><li>BytePS (distributed training-specialized libraries)</li></ul><h3 id="simulation-oriented-FL-libraries"><a href="#simulation-oriented-FL-libraries" class="headerlink" title="simulation-oriented FL libraries"></a>simulation-oriented FL libraries</h3><ul><li>TensorFlow-Federated (TFF)</li><li>PySyft</li><li>LEAF</li></ul><p>only support <strong>centralized topology-based FL algorithms</strong> with <strong>simulation</strong> <strong>in a single machine</strong></p><ul><li>FedAvg</li><li>FedProx</li></ul><h3 id="production-oriented-FL-libraries"><a href="#production-oriented-FL-libraries" class="headerlink" title="production-oriented FL libraries"></a>production-oriented FL libraries</h3><ul><li>FATE</li><li>PaddleFL</li></ul><p><strong>not</strong> designed as <strong>flexible</strong> frameworks that aim to support algorithmic innovation for open FL problems</p><h3 id="summary"><a href="#summary" class="headerlink" title="summary"></a>summary</h3><p><img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211212248490.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211212248490.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211212248490"></p><h2 id="Background(当前别人实现的框架哪里不行)"><a href="#Background(当前别人实现的框架哪里不行)" class="headerlink" title="Background(当前别人实现的框架哪里不行)"></a>Background(当前别人实现的框架哪里不行)</h2><ul><li><p><strong>Lack of support of diverse FL computing paradigms.</strong></p></li><li><p><strong>Lack of support of diverse FL configurations.</strong> FL is diverse in network topology, exchanged information, and training procedures.</p></li><li><p><strong>Lack of standardized FL algorithm implementations and benchmarks.</strong></p><p> <img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211212649188.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211212649188.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211212649188"></p></li></ul><h2 id="FedML(我的框架行)"><a href="#FedML(我的框架行)" class="headerlink" 
title="FedML(我的框架行)"></a>FedML(我的框架行)</h2><p>前面四点是针对当前的问题来说的。</p><ul><li><p>(i) Support of diverse FL computing paradigms.</p><ul><li><p>on-device training for edge devices including smartphones and Internet of Things (IoT).</p><ul><li>FedML-Mobile: Android smartphones</li><li>FedML-IoT: Raspberry PI 4 and NVIDIA Jetson Nano</li><li><strong>smoothly transplant</strong> the distributed computing code <strong>to the FedML-Mobile and FedML-IoT platforms</strong></li></ul></li><li><p>distributed computing.</p></li><li><p>single-machine simulation to meet algorithmic and system-level research requirements under different system deployment scenarios.</p></li></ul></li><li><p>(ii) Support of diverse FL configurations.</p><ul><li>a worker/client-oriented programming interface to enable diverse network topologies.</li></ul></li><li><p>(iii) Standardized FL algorithm implementations.</p><ul><li>FedML includes standardized implementations of status quo FL algorithms.</li></ul></li><li><p>(iv) Standardized FL benchmarks.</p><ul><li>FedML provides standardized benchmarks with well-defined evaluation metrics, multiple synthetic and real-world non-I.I.D. datasets, as well as verified baseline results to facilitate fair performance comparison.</li></ul></li><li><p>(v) Fully open and evolving.</p></li></ul><h3 id="FedML-Library"><a href="#FedML-Library" class="headerlink" title="FedML Library"></a>FedML Library</h3><p><img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211213555125.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211213555125.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211213555125"></p><ul><li><p>FedML-API:high-level API</p><ul><li>FedML-API: new algorithms in distributed version can be easily implemented by adopting the <strong>client-oriented programming interface</strong>.<ul><li><strong>is essential for scenarios</strong> in which <strong>large DNN training cannot be handled by standalone simulation due to GPU memory and training time constraints.</strong></li><li><strong>separates</strong> the implementations of <strong>models, datasets, and algorithms</strong>.</li></ul></li></ul></li><li><p>FedML-core:low-level API</p><p> <img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211220013821.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211220013821.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211220013821"></p><ul><li>The distributed communication module is responsible for low-level communication among different workers/clients (自底向上分别为).<ul><li><strong>The communication backend is based on MPI (message passing interface).</strong>[MPI,OpenMPI 与深度学习](<a href="https://zhuanlan.zhihu.com/p/158584571">MPI,OpenMPI 与深度学习 - 知乎 (zhihu.com)</a>)</li><li><strong>TopologyManager supports a variety of network topologies that can be used in many existing FL algorithms.</strong></li><li></li><li><strong>security/privacy-related functions are also supported?</strong></li></ul></li></ul></li></ul><h2 id="FedML-Library-Programming-Interface(我的框架怎么用)"><a href="#FedML-Library-Programming-Interface(我的框架怎么用)" class="headerlink" title="FedML Library: Programming 
Interface(我的框架怎么用)"></a>FedML Library: Programming Interface</h2><p>Summary:</p><ul><li><p>The programming model is worker/client-oriented: custom per-worker code is written by inheriting WorkerManager.</p></li><li><p>The messages exchanged between workers can be customized.</p></li><li><p>Each worker can query its neighbours' worker IDs through the TopologyManager.</p></li><li><p>Non-essential pieces such as the <strong>trainer and coordinator</strong> are deliberately not implemented for the user; users implement them themselves.</p></li><li><p>On privacy protection: low-level APIs for common <strong>cryptographic primitives</strong>, a planned <strong>Lagrange Coded Computing (LCC)</strong> implementation, a <strong>sample secure aggregation implementation</strong>, and several robust aggregation methods (details in the corresponding item below).</p></li></ul><p>Details:</p><ul><li><p>FedML-API: high-level API: <strong>worker/client-oriented programming</strong> (see the sketch after this list).</p><p> <img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211225128831.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211225128831.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211225128831"></p><ul><li>Previous frameworks such as torch.distributed (the standard distributed training library) follow the left-hand example: the complete training procedure must be written on one client, which is inflexible.</li><li>This approach follows the right-hand example: one only focuses on what each client concretely does, which is more flexible. It is realized by inheriting the WorkerManager class and implementing methods such as register_message_receive_handlers.</li></ul></li><li><p><strong>Message definition beyond gradient and model.</strong> (Any custom message can be exchanged, not just gradients and models.)</p></li><li><p><strong>Topology management.</strong></p><ul><li>FedML provides TopologyManager to manage the topology and allows users to send messages to arbitrary neighbors during training.</li><li>after the initial setting of TopologyManager is completed, for each trainer in the network, <strong>the neighborhood worker ID can be queried via the TopologyManager.</strong></li></ul></li><li><p><strong>Trainer and coordinator.</strong> (Examples are given; users implement their own.)</p><ul><li>For the trainer and coordinator, FedML does not over-design. Rather, it gives the implementation completely to the developers, reflecting the flexibility of our framework.</li></ul></li><li><p><strong>Privacy, security, and robustness.</strong></p><ul><li><p>we include <strong>low-level APIs</strong> that <strong>implement common cryptographic primitives</strong> such as secret sharing, key agreement, digital signature, and public key infrastructure.</p></li><li><p><strong>plan to</strong> include an implementation of <strong>Lagrange Coded Computing (LCC)</strong>.</p></li><li><p><strong>a sample implementation</strong> of the <strong>secure aggregation algorithm</strong>.</p></li><li><p>To <strong>accelerate generating benchmark results on new types of adversarial attacks</strong> in FL, we include the latest robust aggregation methods presented in literature; <strong>our APIs are easily extendable to support newly developed types of robust aggregation methods</strong>. Since <strong>most of the existing attacks are highly task-specific</strong>, it is challenging to provide general adversarial attack APIs; the backdoor with model replacement attack presented in [20] and the edge-case backdoor attack presented in [118] are supported to <strong>provide a reference for researchers to develop new attacks</strong>. Implemented robust aggregation methods:</p><ul><li>(i) norm difference clipping; weak differential private (DP)</li><li>(ii) RFA (geometric median)</li><li>(iii) KRUM and (iv) MULTIKRUM</li></ul></li></ul></li></ul>
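<p>A hedged sketch of this worker-oriented style: subclass a manager and register per-message handlers rather than writing one monolithic training loop. WorkerManager and register_message_receive_handlers follow the names used above; the message types and payloads are invented:</p><pre><code class="python"># Sketch of worker/client-oriented programming: each worker declares how it
# reacts to each message type. Only the two class/method names from the text
# are real; everything else is an assumption for illustration.
class WorkerManager:
    def __init__(self):
        self._handlers = {}
        self.register_message_receive_handlers()

    def register_message_receive_handlers(self):
        raise NotImplementedError           # each subclass declares its handlers

    def register(self, msg_type, handler):
        self._handlers[msg_type] = handler

    def on_receive(self, msg_type, payload):
        self._handlers[msg_type](payload)   # dispatch to the registered handler

class ClientWorkerManager(WorkerManager):
    def register_message_receive_handlers(self):
        self.register("SYNC_MODEL", self.handle_sync)

    def handle_sync(self, payload):
        # local training against the received global model would go here
        print("round", payload["round"], "-> train locally, then upload update")

ClientWorkerManager().on_receive("SYNC_MODEL", {"round": 3})
</code></pre>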
<h2 id="FedML-Benchmark-Algorithms-Models-and-Datasets-我用我的框架实现了哪些算法、运用了哪些模型、提供了哪些数据集"><a href="#FedML-Benchmark-Algorithms-Models-and-Datasets-我用我的框架实现了哪些算法、运用了哪些模型、提供了哪些数据集" class="headerlink" title="FedML Benchmark: Algorithms, Models, and Datasets (我用我的框架实现了哪些算法、运用了哪些模型、提供了哪些数据集)"></a>FedML Benchmark: Algorithms, Models, and Datasets</h2><h3 id="Algorithms-Federated-Optimizer(当前用此框架实现的算法)"><a href="#Algorithms-Federated-Optimizer(当前用此框架实现的算法)" class="headerlink" title="Algorithms: Federated Optimizer(当前用此框架实现的算法)"></a>Algorithms: Federated Optimizer</h3><ul><li>Federated Averaging (FedAvg)</li><li>Decentralized FL</li><li>Vertical Federated Learning (VFL)</li><li>Split learning</li><li>Federated Neural Architecture Search (FedNAS)</li><li>Turbo-Aggregate</li><li><strong>will keep following the latest algorithm</strong> to be published at top-tier machine learning conferences</li><li>will continuously add new FL algorithms such as:<ul><li>Adaptive Federated Optimizer</li><li>FedNova</li><li>FedProx</li><li>FedMA</li></ul></li></ul><h3 id="Models-and-Datasets(当前此框架提供的模型和数据集)"><a href="#Models-and-Datasets(当前此框架提供的模型和数据集)" class="headerlink" title="Models and Datasets(当前此框架提供的模型和数据集)"></a>Models and Datasets</h3><ul><li><p>A summary of the non-I.I.D. datasets and models used at top venues in the past two years (2019-2020):</p><p> <img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211232944169.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211232944169.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211232944169"></p></li><li><p>The models are grouped into three classes:</p><ul><li><p>linear models (convex optimization)</p><p> <img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211233126326.png" class="lazyload placeholder" 
data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211233126326.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211233126326"></p></li><li><p>lightweight shallow neural networks (non-convex optimization)</p><p> <img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211233139976.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211233139976.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211233139976"></p></li><li><p>deep neural networks (non-convex optimization)</p><p> <img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211233407833.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211233407833.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211233407833"></p><p> Given the resource constraints of edge devices, large DNN models are usually trained under the cross-organization FL (also called cross-silo FL) setting.</p></li></ul></li></ul><h2 id="Experiments(我们增加新的benchmark的时候还会给大家提供一个能训练成什么效果的参考)"><a href="#Experiments(我们增加新的benchmark的时候还会给大家提供一个能训练成什么效果的参考)" class="headerlink" title="Experiments(我们增加新的benchmark的时候还会给大家提供一个能训练成什么效果的参考)"></a>Experiments(我们增加新的benchmark的时候还会给大家提供一个能训练成什么效果的参考)</h2><p>例如使用ResNet-56和MobileNet去测试的结果<strong>精度随着epoch的变化</strong>如下。</p><p><img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211234055821.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211234055821.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211234055821"></p><p>单GPU模拟和多GPU模拟的区别:</p><p><img src="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211234358774.png" class="lazyload placeholder" data-srcset="/2022/12/11/arxiv20-fedml-a-research-library-and-benchmark-for-federated-machine-learning/image-20221211234358774.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221211234358774"></p><ul><li>Table reveals that when training large CNNs, <strong>the standalone simulation is about 8 times slower than distributed computing with 10 parallel workers</strong>. </li><li>Therefore, <strong>when training large DNNs, we suggest using FedML’s distributed computing paradigm, which is not supported by existing FL libraries such as PySyft [28], LEAF [40], and TTF [39]</strong>. </li><li>Moreover, <strong>FedML supports multiprocessing in a single GPU card</strong> which enables FedML to run a large number of training workers by using only a few GPU cards. As an example, when training ResNet on CIFAR-10, FedML can run 112 workers in a server with 8 GPUs.</li></ul>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> Benchmark </tag>
</tags>
</entry>
<entry>
<title>Data-Free Knowledge Distillation for Heterogeneous Federated Learning</title>
<link href="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/"/>
<url>/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/</url>
<content type="html"><![CDATA[<h1 id="PMLR21-Data-Free-Knowledge-Distillation-for-Heterogeneous-Federated-Learning"><a href="#PMLR21-Data-Free-Knowledge-Distillation-for-Heterogeneous-Federated-Learning" class="headerlink" title="PMLR21-Data-Free-Knowledge-Distillation-for-Heterogeneous-Federated-Learning"></a>PMLR21-Data-Free-Knowledge-Distillation-for-Heterogeneous-Federated-Learning</h1><h2 id="1-Overview"><a href="#1-Overview" class="headerlink" title="1. Overview"></a>1. Overview</h2><p><a href="https://github.com/zhuangdizhu/FedGen">code</a></p><ul><li><p><strong>User heterogeneity</strong> has imposed significant challenges to FL, which can incur drifted global models that are slow to converge.</p></li><li><p>past Knowledge Distillation: <strong>Knowledge Distillation</strong> has recently emerged to tackle this issue, by <strong>refining the server model using aggregated knowledge from heterogeneous users</strong>, other than directly aggregating their model parameters. <strong>the ensemble knowledge is not fully utilized to guide local model learning</strong></p></li><li><p>current Knowledge Distillation: <strong>the server learns a lightweight generator to ensemble user information in a data-free manner, which is then broadcasted to users, regulating local training using the learned knowledge as an inductive bias.</strong></p></li><li><p>benefits:</p><ul><li>i) It extracts the knowledge out of users which was otherwise mitigated after model averaging, without depending on any external data.</li><li>ii) Contrary to certain prior work that only refines the global model, our approach directly regulates local model updating using the extracted knowledge</li></ul></li></ul><h2 id="2-Notations-and-Preliminaries"><a href="#2-Notations-and-Preliminaries" class="headerlink" title="2. Notations and Preliminaries"></a>2. Notations and Preliminaries</h2><ul><li>可结合<a href="https://blog.csdn.net/qq_45478482/article/details/127513032">csdn1</a>和<a href="https://blog.csdn.net/Shawn2134123/article/details/122502649">csdn2</a>一起看。</li></ul><h3 id="Overview-of-FEDGEN"><a href="#Overview-of-FEDGEN" class="headerlink" title="Overview of FEDGEN"></a>Overview of FEDGEN</h3><ul><li><img src="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/123.png" class="lazyload placeholder" data-srcset="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/123.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></li></ul><h3 id="Algorithm"><a href="#Algorithm" class="headerlink" title="Algorithm"></a>Algorithm</h3><p><img src="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/123-1668951855007-2.png" class="lazyload placeholder" data-srcset="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/123-1668951855007-2.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><ul><li>G和client是如何配合起来工作的?</li></ul><h3 id="FEDGEN-Data-Free-Federated-Distillation-via-Generative-Learning"><a href="#FEDGEN-Data-Free-Federated-Distillation-via-Generative-Learning" class="headerlink" title="FEDGEN: Data-Free Federated Distillation via Generative Learning"></a>FEDGEN: Data-Free Federated Distillation via Generative Learning</h3><ul><li><p>3.1. Knowledge Extraction</p></li><li><p>3.2. Knowledge Distillation</p></li><li><p>3.3. 
<h2 id="FEDGEN-Analysis"><a href="#FEDGEN-Analysis" class="headerlink" title="FEDGEN Analysis"></a>FEDGEN Analysis</h2><h3 id="4-1-Knowledge-Distillation-for-Inductive-Bias"><a href="#4-1-Knowledge-Distillation-for-Inductive-Bias" class="headerlink" title="4.1. Knowledge Distillation for Inductive Bias"></a>4.1. Knowledge Distillation for Inductive Bias</h3><p><img src="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/image-20221121000905303.png" class="lazyload placeholder" data-srcset="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/image-20221121000905303.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221121000905303"></p><p><img src="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/image-20221121000941292.png" class="lazyload placeholder" data-srcset="/2022/11/20/pmlr21-data-free-knowledge-distillation-for-heterogeneous-federated-learning/image-20221121000941292.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221121000941292"></p><h3 id="4-2-Knowledge-Distillation-for-Distribution-Matching"><a href="#4-2-Knowledge-Distillation-for-Distribution-Matching" class="headerlink" title="4.2. Knowledge Distillation for Distribution Matching"></a>4.2. Knowledge Distillation for Distribution Matching</h3><ul><li>See the original paper.</li></ul><h3 id="4-3-Knowledge-Distillation-for-Improved-Generalization"><a href="#4-3-Knowledge-Distillation-for-Improved-Generalization" class="headerlink" title="4.3. Knowledge Distillation for Improved Generalization"></a>4.3. Knowledge Distillation for Improved Generalization</h3><ul><li>See the original paper.</li></ul>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About Knowledge Distillation </tag>
</tags>
</entry>
<entry>
<title>FL</title>
<link href="/2022/11/12/fl/"/>
<url>/2022/11/12/fl/</url>
<content type="html"><![CDATA[<h1 id="FL"><a href="#FL" class="headerlink" title="FL"></a>FL</h1><h2 id="Statistical-heterogeneity"><a href="#Statistical-heterogeneity" class="headerlink" title="Statistical heterogeneity"></a>Statistical heterogeneity</h2><h3 id="Adapt-the-global-model-to-accommodate-personalized-local-models-for-non-IID-data"><a href="#Adapt-the-global-model-to-accommodate-personalized-local-models-for-non-IID-data" class="headerlink" title="Adapt the global model to accommodate personalized local models for non-IID data"></a>Adapt the global model to accommodate personalized local models for non-IID data</h3><h4 id="Meta-learning"><a href="#Meta-learning" class="headerlink" title="Meta learning"></a>Meta learning</h4><ul><li><p><strong>arXiv’19</strong>-Improving federated learning personalization via model agnostic meta learning-(Yihan Jiang-uw, Jakub Konečný-Google research, Keith Rush, Sreeram Kannan-uw)</p></li><li><p><strong>NIPS’19</strong>-Adaptive gradient-based metalearning methods-(Mikhail Khodak-CMU, Maria-Florina Balcan-CMU, Ameet Talwalkar-CMU)</p></li><li><p><strong>NIPS’20</strong>-Personalized federated learning with theoretical guarantees: A modelagnostic meta-learning approach</p></li><li><p><strong>ICLR’20</strong>-Differentially private metalearning</p></li></ul><h4 id="Multi-task-learning"><a href="#Multi-task-learning" class="headerlink" title="Multi-task learning"></a>Multi-task learning</h4><ul><li><p><strong>NIPS’17</strong>-MOCHA: Federated multi-task learning-(Virginia Smith-stanford, Chao-Kai Chiang-usc, Maziar Sanjabi-usc, Ameet S. Talwalkar-CMU)</p></li><li><p><strong>arXiv’19</strong>-Variational federated multi-task learning</p></li><li><p><strong>ICML’19</strong>-Semi-cyclic stochastic gradient descent</p></li><li><p><strong>NIPS’19</strong>-Adaptive gradient-based metalearning methods</p></li></ul><h4 id="Transfer-learning"><a href="#Transfer-learning" class="headerlink" title="Transfer learning"></a>Transfer learning</h4><ul><li><p><strong>arXiv’19</strong>-Federated evaluation of on-device personalization-(Kangkang Wang-Google, Rajiv Mathews-Google, Chloé Kiddon-Google, Hubert Eichner-Google, Françoise Beaufays-Google, Daniel Ramage-Google)</p></li><li><p><strong>arXiv’20</strong>-Three approaches for personalization with applications to federated learning-(Yishay Mansour-Google, Mehryar Mohri-Google, Jae Ro-Google, Ananda Theertha Suresh-Google)</p></li></ul><h4 id="Knowledge-distillation"><a href="#Knowledge-distillation" class="headerlink" title="Knowledge distillation"></a>Knowledge distillation</h4><ul><li><input disabled type="checkbox"> <strong>arXiv’19</strong>-Fedmd: Heterogenous federated learning via model distillation-(Daliang Li, Junpu Wang Harvard-Yale-Pennsylvania)</li></ul><h4 id="Lottery-ticket-hypothesis"><a href="#Lottery-ticket-hypothesis" class="headerlink" title="Lottery ticket hypothesis"></a>Lottery ticket hypothesis</h4><ul><li><input disabled type="checkbox"> <strong>arXiv’20</strong>-Lotteryfl: Personalized and communication-efficient federated learning with lottery ticket hypothesis on non-iid datasets-(Ang Li-Duke, Jingwei Sun-Duke, Binghui Wang-Duke, Lin Duan-Duke, Sicheng Li-Alibaba, Yiran Chen-Duke, Hai Li-Duke)</li></ul><h3 id="Client-clustering"><a href="#Client-clustering" class="headerlink" title="Client clustering"></a>Client clustering</h3><ul><li><input disabled type="checkbox"> <strong>NIPS’20</strong>-An efficient framework for clustered federated learning-()-()-Client clustering.</li></ul><h3 
id="Other"><a href="#Other" class="headerlink" title="Other"></a>Other</h3><ul><li><input disabled type="checkbox"> <strong>NIPS’19Workshop</strong>-Federated learning with local and global representations-(Paul Liang-CMU, Terrance Liu-CMU)</li><li><input disabled type="checkbox"> <strong>arXiv’21</strong>-Fed-ensemble: Improving generalization through model ensembling in federated learning</li><li><input disabled type="checkbox"> <strong>ICML’20</strong>-SCAFFOLD: Stochastic controlled averaging for federated learning</li></ul><h2 id="System-heterogeneity"><a href="#System-heterogeneity" class="headerlink" title="System heterogeneity"></a>System heterogeneity</h2><h3 id="Asynchronous-communication"><a href="#Asynchronous-communication" class="headerlink" title="Asynchronous communication"></a>Asynchronous communication</h3><h3 id="Active-sampling-of-clients"><a href="#Active-sampling-of-clients" class="headerlink" title="Active sampling of clients"></a>Active sampling of clients</h3><ul><li><input checked disabled type="checkbox"> <strong>SysML’19</strong>-Towards federated learning at scale: System design-(Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander)</li><li><input disabled type="checkbox"> <strong>ICC’19</strong>-Client selection for federated learning with heterogeneous resources in mobile edge-(Takayuki Nishio-Kyoto, Ryo Yonetani-Kyoto)</li><li><input disabled type="checkbox"> <strong>OSDI’21</strong>-Oort: Efficient federated learning via guided participant selection</li><li><input disabled type="checkbox"> <strong>NSDI’20</strong>-Sol: A federated execution engine for fast distributed computation over slow networks</li></ul><h2 id="Communication"><a href="#Communication" class="headerlink" title="Communication"></a>Communication</h2><h3 id="Data-compression-techniques"><a href="#Data-compression-techniques" class="headerlink" title="Data compression techniques"></a>Data compression techniques</h3><h4 id="Quantization-and-sketching"><a href="#Quantization-and-sketching" class="headerlink" title="Quantization and sketching"></a>Quantization and sketching</h4><ul><li><input disabled type="checkbox"> <strong>arXiv’16</strong>-Federated learning: Strategies for improving communication efficiency-(Jakub Konečný-Google, H. Brendan McMahan-Google, Felix X. Yu-Google, Peter Richtárik-KAUST, Ananda Theertha Suresh-Google, Dave Bacon-Google)</li><li><input disabled type="checkbox"> <strong>NIPS’17</strong>-Qsgd: Communication-efficient sgd via gradient quantization and encoding-(Dan Alistarh-ETH, Demjan Grubic-Eth&Google, Jerry Z. 
Li-MIT, Ryota Tomioka-Microsoft Research, Milan Vojnovic-London School of Economics)</li><li><input disabled type="checkbox"> <strong>NIPS’19</strong>-Communication-efficient distributed SGD with sketching-(Nikita Ivkin-Amazon, Daniel Rothchild-UCB, Enayat Ullah-JHU, Vladimir Braverman-JHU, Ion Stoica-UCB, Raman Arora-JHU)</li><li><input disabled type="checkbox"> <strong>arXiv’19</strong>-Error feedback fixes SignSGD and other gradient compression schemes</li><li><input disabled type="checkbox"> <strong>arXiv’18</strong>-Expanding the reach of federated learning by reducing client resource requirements</li><li><input disabled type="checkbox"> <strong>NIPS’18</strong>-ATOMO: Communication-efficient learning via atomic sparsification</li><li><input disabled type="checkbox"> <strong>ISIT’18</strong>-Gradient coding using the stochastic block model</li></ul><h3 id="Local-training"><a href="#Local-training" class="headerlink" title="Local training"></a>Local training</h3><ul><li><p><strong>AISTATS’17</strong>-Communication-efficient learning of deep networks from decentralized data</p></li><li><p><strong>ICLR’19</strong>-Local SGD converges fast and communicates little</p></li><li><p><input disabled type="checkbox"> CoCoA: A general framework for communication-efficient distributed optimization</p></li><li><p><strong>NSDI’20</strong>-Sol: A federated execution engine for fast distributed computation over slow networks</p></li></ul><h3 id="Split-learning"><a href="#Split-learning" class="headerlink" title="Split learning"></a>Split learning</h3><ul><li><input disabled type="checkbox"> <strong>arXiv’20</strong>-SplitFed: When federated learning meets split learning-(Chandra Thapa-CSIRO Data61, M.A.P. 
Chamikara-CSIRO Data61, Seyit Camtepe-CSIRO Data61, Lichao Sun-Lehigh University)</li></ul><h3 id="Decentralized-training"><a href="#Decentralized-training" class="headerlink" title="Decentralized training"></a>Decentralized training</h3><ul><li><input disabled type="checkbox"> <strong>ICLR’19</strong>-Anytime Minibatch: Exploiting stragglers in online distributed optimization</li><li><input disabled type="checkbox"> <strong>NIPS’19</strong>-Robust and communication-efficient collaborative learning</li></ul><h3 id="Asynchronous-and-synchronous"><a href="#Asynchronous-and-synchronous" class="headerlink" title="Asynchronous and synchronous"></a>Asynchronous and synchronous</h3><ul><li><input disabled type="checkbox"> <strong>NIPS’15</strong>-Deep learning with elastic averaging SGD</li><li><input disabled type="checkbox"> <strong>NIPS’11</strong>-HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent</li></ul><h2 id="Data-privacy"><a href="#Data-privacy" class="headerlink" title="Data privacy"></a>Data privacy</h2><h3 id="Break-privacy"><a href="#Break-privacy" class="headerlink" title="Break privacy"></a>Break privacy</h3><ul><li><input disabled type="checkbox"> <strong>NIPS’20</strong>-Inverting gradients - how easy is it to break privacy in federated learning</li><li><input disabled type="checkbox"> <strong>NIPS’20</strong>-Attack of the tails: Yes, you really can backdoor federated learning</li><li><input disabled type="checkbox"> <strong>arXiv’19</strong>-Can you really backdoor federated learning?</li></ul><h3 id="Differentially-private"><a href="#Differentially-private" class="headerlink" title="Differentially private"></a>Differentially private</h3><ul><li><p><strong>CCS’17</strong>-Practical secure aggregation for privacy-preserving machine learning</p></li><li><p><strong>NIPS’17</strong>-Differentially private federated learning: A client level perspective</p></li><li><p><strong>ICLR’18</strong>-Learning differentially private recurrent language models</p></li><li><p><strong>MobiCom’20</strong>-Billion-Scale Federated Learning on Mobile Clients: A Submodel Design with Tunable Privacy</p></li></ul><h3 id="Others"><a href="#Others" class="headerlink" title="Others"></a>Others</h3><ul><li><p><strong>NIPS’17</strong>-Can decentralized algorithms outperform centralized algorithms? 
A case study for decentralized parallel stochastic gradient descent</p></li><li><p><strong>arXiv’18</strong>-Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates</p></li><li><p><strong>SP’19</strong>-Exploiting unintended feature leakage in collaborative learning</p></li><li><p><strong>NIPS’19</strong>-Deep leakage from gradients</p></li><li><p><strong>arXiv’20</strong>-iDLG: Improved deep leakage from gradients</p></li><li><p><strong>arXiv’20</strong>-Threats to federated learning: A survey</p></li><li><p><strong>arXiv’21</strong>-Practical and private (deep) learning without sampling or shuffling</p></li></ul><h2 id="Scales"><a href="#Scales" class="headerlink" title="Scales"></a>Scales</h2><ul><li><input disabled type="checkbox"> <strong>arXiv’18</strong>-Applied federated learning: Improving Google keyboard query suggestions.</li></ul><h2 id="FL-customization"><a href="#FL-customization" class="headerlink" title="FL customization"></a>FL customization</h2><ul><li><input checked disabled type="checkbox"> <strong>ICLR’21</strong>-HETEROFL: computation and communication efficient federated learning for heterogeneous clients-(Enmao Diao-Duke, Jie Ding-UMN, Vahid Tarokh-Duke)-<a href="https://github.com/dem123456789/HeteroFL-Computation-and-Communication-Efficient-Federated-Learning-for-Heterogeneous-Clients">Code</a></li><li><input checked disabled type="checkbox"> <strong>MobiCom’20</strong>-Billion-Scale Federated Learning on Mobile Clients: A Submodel Design with Tunable Privacy</li></ul><h2 id="FL-systems"><a href="#FL-systems" class="headerlink" title="FL systems"></a>FL systems</h2><h4 id="Benchmark"><a href="#Benchmark" class="headerlink" title="Benchmark"></a>Benchmark</h4><ul><li><input disabled type="checkbox"> <strong>NIPS’19</strong>-LEAF: A benchmark for federated settings-<a href="https://leaf.cmu.edu/">Data</a></li><li><input disabled type="checkbox"> <strong>TensorFlow Federated (TFF)</strong>-“TensorFlow federated: Machine learning on decentralized data.”-<a href="https://www.tensorflow.org/federated">Data</a></li><li><input disabled type="checkbox"> <strong>arXiv’20</strong>-FedML: A research library and benchmark for federated machine learning</li><li><input disabled type="checkbox"> <strong>arXiv’21</strong>-Flower: A friendly federated learning framework</li><li><input disabled type="checkbox"> <strong>MLSys’20</strong>-MLPerf training benchmark</li><li><input disabled type="checkbox"> PySyft</li><li><input checked disabled type="checkbox"> <strong>ICML’22</strong>-FedScale: Benchmarking Model and System Performance of Federated Learning at Scale</li></ul><h4 id="Others-1"><a href="#Others-1" class="headerlink" title="Others"></a>Others</h4><ul><li><input disabled type="checkbox"> <strong>MLSys’20</strong>-Federated optimization in heterogeneous networks-FedProx</li><li><input disabled type="checkbox"> <strong>arXiv’20</strong>-Adaptive federated optimization-FedYoGi</li><li><input disabled type="checkbox"> <strong>OSDI’21</strong>-Oort: Efficient federated learning via guided participant selection</li></ul><h2 id="Summarize"><a href="#Summarize" class="headerlink" title="Summarize"></a>Summarize</h2><ul><li><input checked disabled type="checkbox"> <strong>IEEE Signal Processing Magazine’20</strong>-Federated learning: Challenges, methods, and future directions-(Tian Li-CMU, Anit Kumar Sahu, Ameet Talwalkar, Virginia Smith)</li><li><input disabled type="checkbox"> <strong>arXiv’21</strong>-Advances and Open Problems in 
Federated Learning</li><li><input disabled type="checkbox"> <strong>ACM Computing Surveys (CSUR)’19</strong>-Demystifying parallel and distributed deep learning: An in-depth concurrency analysis</li></ul><h2 id="Basic"><a href="#Basic" class="headerlink" title="Basic"></a>Basic</h2><ul><li><input disabled type="checkbox"> <strong>arXiv’18</strong>-Federated learning for mobile keyboard prediction</li><li><input disabled type="checkbox"> <strong>arXiv’18</strong>-LoAdaBoost: Loss-based AdaBoost federated machine learning on medical data</li></ul><h2 id="Problems"><a href="#Problems" class="headerlink" title="Problems"></a>Problems</h2><h3 id="Novel-models-of-asynchrony"><a href="#Novel-models-of-asynchrony" class="headerlink" title="Novel models of asynchrony"></a>Novel models of asynchrony</h3><p>Beyond bulk synchronous and fully asynchronous approaches, it is worth studying the effects of a more realistic device-centric communication scheme.</p><h3 id="Extreme-communication-schemes"><a href="#Extreme-communication-schemes" class="headerlink" title="Extreme communication schemes"></a>Extreme communication schemes</h3><p>Optimization methods used for machine learning can tolerate a lack of precision; this error can, in fact, help with generalization.</p><h3 id="Communication-reduction-and-the-Pareto-frontier"><a href="#Communication-reduction-and-the-Pareto-frontier" class="headerlink" title="Communication reduction and the Pareto frontier"></a>Communication reduction and the Pareto frontier</h3><p>How techniques such as local updating and model compression compose with one another, and how to systematically analyze the tradeoff between accuracy and communication for each approach. In particular, the most useful techniques will demonstrate improvements at the Pareto frontier.</p><h3 id="Heterogeneity-diagnostics"><a href="#Heterogeneity-diagnostics" class="headerlink" title="Heterogeneity diagnostics"></a>Heterogeneity diagnostics</h3><p>Statistical heterogeneity can be quantified through metrics such as local dissimilarity; however, these metrics cannot easily be calculated over the federated network before training occurs.</p><h3 id="Granular-privacy-constraints"><a href="#Granular-privacy-constraints" class="headerlink" title="Granular privacy constraints"></a>Granular privacy constraints</h3><h3 id="Beyond-supervised-learning"><a href="#Beyond-supervised-learning" class="headerlink" title="Beyond supervised learning"></a>Beyond supervised learning</h3><h3 id="Productionizing-federated-learning"><a href="#Productionizing-federated-learning" class="headerlink" title="Productionizing federated learning"></a>Productionizing federated learning</h3><ul><li><input disabled type="checkbox"> concept drift (when the underlying data-generation model changes over time)</li><li><input disabled type="checkbox"> diurnal variations (when the devices exhibit different behavior at different times of the day or week)<ul><li><input disabled type="checkbox"> <strong>ICML’19</strong>-Semi-cyclic stochastic gradient descent</li></ul></li><li><input disabled type="checkbox"> cold-start problems (when new devices enter the network)</li></ul><h3 id="Benchmarks"><a href="#Benchmarks" class="headerlink" title="Benchmarks"></a>Benchmarks</h3>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
</tags>
</entry>
<entry>
<title>ICLR21-HETEROFL-COMPUTATION AND COMMUNICATION EFFICIENT FEDERATED LEARNING FOR HETEROGENEOUS CLIENTS</title>
<link href="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/"/>
<url>/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/</url>
<content type="html"><![CDATA[<h1 id="ICLR21-HETEROFL-COMPUTATION-AND-COMMUNICATION-EFFICIENT-FEDERATED-LEARNING-FOR-HETEROGENEOUS-CLIENTS"><a href="#ICLR21-HETEROFL-COMPUTATION-AND-COMMUNICATION-EFFICIENT-FEDERATED-LEARNING-FOR-HETEROGENEOUS-CLIENTS" class="headerlink" title="ICLR21-HETEROFL-COMPUTATION AND COMMUNICATION EFFICIENT FEDERATED LEARNING FOR HETEROGENEOUS CLIENTS"></a>ICLR21-HETEROFL-COMPUTATION AND COMMUNICATION EFFICIENT FEDERATED LEARNING FOR HETEROGENEOUS CLIENTS</h1><ul><li><p>作者:</p><ul><li><a href="https://diaoenmao.com/%EF%BC%88Enmao">https://diaoenmao.com/(Enmao</a> Diao)(刁恩茂)(Duke University)</li><li><a href="https://jding.org/%EF%BC%88Jie">https://jding.org/(Jie</a> Ding)(University of Minnesota-Twin Cities)</li><li><a href="https://ece.duke.edu/faculty/vahid-tarokh%EF%BC%88Vahid">https://ece.duke.edu/faculty/vahid-tarokh(Vahid</a> Tarokh)(Duke University)</li></ul></li><li><p>中央服务器用的模型和client用的模型这样的一个FL的前提假设会使得FL的应用得到极大的限制,同时会带来client的无谓的计算和通信开销。于是作者提出在clients上部署几种不同复杂度的模型,不同复杂度的模型分别更新server上模型的不同部分,建立的数学公式较为优美。</p></li><li><p>HETEROFL:address <strong>heterogeneous clients</strong> equipped with very different computation and communication capabilities</p></li><li><p><strong>For the first time</strong>, our method <strong>challenges the underlying assumption of existing work that local models have to share the same architecture as the global model</strong></p></li><li><p><strong>several strategies</strong> to <strong>enhance FL training</strong> and conduct extensive empirical evaluations, <strong>including five computation complexity levels of three model architecture on three datasets</strong></p></li></ul><h2 id="INTRODUCTION"><a href="#INTRODUCTION" class="headerlink" title="INTRODUCTION"></a>INTRODUCTION</h2><ul><li><p>A widely accepted assumption is that local models have to share the same architecture as the global model (Li et al., 2020b).</p></li><li><p>It is crucial to address heterogeneous clients equipped with very different computation and communication capabilities.</p></li><li><p><strong>HeteroFL</strong>: This model heterogeneity differs significantly from the classical distributed machine learning framework where local data are trained with the same model architecture.</p></li><li><p>Contributions:</p><ul><li><p>模型自定义:propose an easy-to-implement framework <strong>HeteroFL</strong> that can train <strong>heterogeneous local models</strong> and aggregate them stably and effectively into a single global inference model. Outperforms <strong>state-of-the-art results</strong> <strong>without</strong> introducing <strong>additional computation overhead</strong>.</p></li><li><p>设备异构:addresses <strong>various heterogeneous settings</strong>. the learning result <strong>stable and effective</strong>, <strong>the communication costs</strong> are small. 
It <strong>allows local clients to adaptively contribute to the training of global models</strong>, so system heterogeneity and communication efficiency can be well addressed.</p></li><li><p>Data heterogeneity: several strategies that are <strong>robust against balanced non-IID statistical heterogeneity</strong> and <strong>reduce the number of communication rounds</strong>.</p><p> A <strong>“masking trick”</strong> for balanced non-IID data partitions in classification problems</p><p> a modification of Batch Normalization (BN)</p></li></ul></li></ul><h2 id="RELATED-WORK"><a href="#RELATED-WORK" class="headerlink" title="RELATED WORK"></a>RELATED WORK</h2><ul><li>train massively distributed models at a large scale (Bonawitz et al., 2019)</li><li><strong>FedAvg</strong> by McMahan et al. (2017) is currently <strong>the most widely adopted FL baseline</strong>, which <strong>reduces communication cost by allowing clients to train multiple iterations locally</strong>.</li><li><strong>communication efficiency</strong>:<ul><li>data compression techniques (quantization and sketching), split learning</li></ul></li><li><strong>system heterogeneity</strong><ul><li>asynchronous communication</li><li>active sampling of clients</li></ul></li><li><strong>statistical heterogeneity</strong> (major battleground)<ul><li><strong>adapt the global model to accommodate personalized local models for non-IID data</strong><ul><li>integrating FL with other frameworks such as assisted learning, meta-learning, multi-task learning, transfer learning, knowledge distillation, and the lottery ticket hypothesis, but these <strong>often introduce additional computation and communication overhead</strong> that may not be necessary.</li></ul></li></ul></li><li><strong>privacy</strong><ul><li>model gradient updates can reveal sensitive information or even local training data</li></ul></li></ul><h2 id="HETEROGENEOUS-FEDERATED-LEARNING"><a href="#HETEROGENEOUS-FEDERATED-LEARNING" class="headerlink" title="HETEROGENEOUS FEDERATED LEARNING"></a>HETEROGENEOUS FEDERATED LEARNING</h2><h3 id="HETEROGENEOUS-MODELS"><a href="#HETEROGENEOUS-MODELS" class="headerlink" title="HETEROGENEOUS MODELS"></a>HETEROGENEOUS MODELS</h3><ul><li>consider <strong>local models</strong> to <strong>have similar architecture</strong> but can <strong>shrink their complexity within the same model class</strong>.</li><li><strong>new challenges</strong>: the optimal way to select subsets of global model parameters, compatibility with state-of-the-art model architectures, and minimum modification of the existing FL framework</li><li>we can modulate the size of deep neural networks by <strong>varying the width and depth of networks</strong> (Zagoruyko & Komodakis, 2016; Tan & Le, 2019):<ul><li>Because we <strong>aim to reduce the computation complexity of local models</strong>, we choose to <strong>vary the width of hidden channels</strong></li><li><strong>locally distributed data</strong>: ${X_1,…,X_m}$ over m clients.</li><li><strong>model parameters</strong>: ${W_1, …, W_m}$ for the m clients.</li><li><strong>global model</strong>: $W_g$</li><li>each round:<ul><li>$W_g^t=\frac{1}{m}\sum_{i=1}^{m}W_i^t$</li><li>$W_i^{t+1}=W^t_g$</li></ul></li><li>For a hidden layer $W_l$ of the global parameters $W_g \in \mathbb{R}^{d_g \times k_g}$, where $d_g$ and $k_g$ are the input and output channel sizes, the per-level parameter subsets of $W_l$ are nested, $W_l^p \subset W_l^{p-1} \subset … \subset W_l^{1}$, and each level shrinks relative to the full layer: $d_l^p=r^{p-1}d_g$ and $k_l^p=r^{p-1}k_g$, hence $|W_l^p|=r^{2(p-1)}|W_l^1|$ and the shrinkage ratio is $R=\frac{|W_l^p|}{|W_l^1|}=r^{2(p-1)}$.</li><li>The figure below illustrates assigning models of different sizes to different clients:</li>
src="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123.png" class="lazyload placeholder" data-srcset="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></li><li>全局聚合:<ul><li><img src="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123-1668237191694-2.png" class="lazyload placeholder" data-srcset="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123-1668237191694-2.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></li><li><img src="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123-1668237570920-4-1668237816488-6.png" class="lazyload placeholder" data-srcset="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123-1668237570920-4-1668237816488-6.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></li><li><img src="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123-1668237829864-8.png" class="lazyload placeholder" data-srcset="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123-1668237829864-8.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></li><li>算力或者网络状况不好的机器就跑小模型,好的就跑大模型,公平高效。</li></ul></li></ul></li></ul><h3 id="STATIC-BATCH-NORMALIZATION"><a href="#STATIC-BATCH-NORMALIZATION" class="headerlink" title="STATIC BATCH NORMALIZATION"></a>STATIC BATCH NORMALIZATION</h3><p>classical FedAvg and most recent works <strong>avoid BN</strong>. A major concern of BN is that <strong>it requires running estimates of representations at every hidden layer</strong>(?). <strong>Uploading these statistics to the server will cause higher communication costs and privacy issues</strong> Andreux et al. (2020) proposes to track running statistics locally。</p><p>batch normalization在non-IID数据上的表现并不好:<a href="https://zhuanlan.zhihu.com/p/374432534%EF%BC%8Chttps://zhuanlan.zhihu.com/p/309381344%EF%BC%8C%E5%90%8C%E6%97%B6%E5%AD%98%E5%9C%A8privacy">https://zhuanlan.zhihu.com/p/374432534,https://zhuanlan.zhihu.com/p/309381344,同时存在privacy</a> concerns,因为计算用到了全局的数据。</p><p>**static Batch Normaliztion (sBN)**:</p><ul><li>During the training phase, <strong>sBN</strong> does not track running estimates and <strong>simply normalize batch data</strong>.</li><li>We do not track the local running statistics as the size of local models may also vary dynamically. (?)</li><li>This method is suitable for HeteroFL as every communication round is independent. (?)</li><li>After the training process finishes, the server sequentially query local clients and cumulatively update global BN statistics. 
<h3 id="STATIC-BATCH-NORMALIZATION"><a href="#STATIC-BATCH-NORMALIZATION" class="headerlink" title="STATIC BATCH NORMALIZATION"></a>STATIC BATCH NORMALIZATION</h3><p>Classical FedAvg and most recent works <strong>avoid BN</strong>. A major concern of BN is that <strong>it requires running estimates of representations at every hidden layer</strong> (?). <strong>Uploading these statistics to the server causes higher communication costs and privacy issues</strong>; Andreux et al. (2020) propose to track running statistics locally.</p><p>Batch normalization does not behave well on non-IID data (see <a href="https://zhuanlan.zhihu.com/p/374432534">https://zhuanlan.zhihu.com/p/374432534</a> and <a href="https://zhuanlan.zhihu.com/p/309381344">https://zhuanlan.zhihu.com/p/309381344</a>), and it also raises privacy concerns, since its statistics are computed from the global data.</p><p><strong>Static Batch Normalization (sBN)</strong>:</p><ul><li>During the training phase, <strong>sBN</strong> does not track running estimates and <strong>simply normalizes the batch data</strong>.</li><li>We do not track the local running statistics as the size of local models may also vary dynamically. (?)</li><li>This method is suitable for HeteroFL as every communication round is independent. (?)</li><li>After the training process finishes, the server sequentially queries local clients and cumulatively updates the global BN statistics. (?)</li><li>empirically found this trick <strong>significantly outperforms other forms of normalization methods</strong> including InstanceNorm (Ulyanov et al., 2016), GroupNorm (Wu & He, 2018), and LayerNorm (Ba et al., 2016)</li></ul><h3 id="SCALER"><a href="#SCALER" class="headerlink" title="SCALER"></a>SCALER</h3><ul><li>Quite interesting:</li><li>local model parameters at different computation complexity levels will digress to various scales: local models each pick up some personalized traits, and directly running local inference with the global model's parameters would ignore them:<ul><li><strong>To directly use the full model during the inference phase</strong>: inverted dropout with dropout rate q scales representations by $\frac{1}{1-q}$ during the training phase. Dropout usually sits after the activation; analogously, a <strong>Scaler</strong> layer inserted between <strong>sBN and the activation</strong> scales the representation up by $\frac{1}{r^{p-1}}$, so that after global aggregation each client can directly run inference on its local data. An <strong>ablation study</strong> supports this.</li><li><img src="/2022/11/12/iclr21-heterofl-computation-and-communication-efficient-federated-learning-for-heterogeneous-clients/123-1668240923021-10.png"></li></ul></li></ul><h2 id="EXPERIMENTAL-RESULTS"><a href="#EXPERIMENTAL-RESULTS" class="headerlink" title="EXPERIMENTAL RESULTS"></a>EXPERIMENTAL RESULTS</h2><ul><li><p>600 individual models</p></li><li><p>datasets</p><ul><li>MNIST and CIFAR10 image classification tasks</li><li>WikiText2 language modeling task</li></ul></li><li><p>three different models</p><ul><li>CNN for MNIST</li><li>preactivated ResNet (PreResNet18) for CIFAR10</li><li>Transformer for WikiText2</li></ul></li><li><p>replace BN in CNN and PreResNet18 with our proposed sBN and attach the Scaler module after each convolution layer</p></li><li><p>data partition the same as in (McMahan et al., 2017; Liang et al., 2020).</p></li><li><p><strong>100</strong> clients, fraction <strong>C</strong> of active clients per communication round is <strong>0.1</strong> throughout our experiments</p><ul><li>For <strong>IID data partition</strong>, we uniformly assign the same number of data examples to each client</li><li>For <strong>balanced non-IID data partition</strong>, we assume that <strong>the label distribution is skewed</strong>, where clients will only have examples from at most two classes and the number of examples per class is balanced.</li><li>other kinds of non-IID data partition: the unbalanced non-IID data partition, where clients may <strong>hold unbalanced labeled datasets, and the feature distribution skew</strong>, where clients may hold different features.</li><li><strong>masked language modeling task</strong> with a <strong>15% masking rate</strong> and <strong>balanced data examples for each client</strong>; each client will roughly have 3000 different words in its local dataset</li></ul></li><li><p><strong>five different computation complexity levels</strong> {a, b, c, d, e} with the hidden channel shrinkage <strong>ratio r = 0.5</strong>; we found it most illustrative to use the discrete complexity levels <strong>0.5, 0.25, 0.125, and 0.0625</strong></p></li><li><p><strong>Each local client</strong> is assigned an <strong>initial computation complexity level</strong></p></li><li><p>To <strong>demonstrate the effect of dynamically varying computation and communication capabilities</strong>, we uniformly sample from various combinations of computation complexity levels</p></li></ul><h3 id="Masked-Cross-Entropy-Loss"><a href="#Masked-Cross-Entropy-Loss" class="headerlink" title="Masked Cross-Entropy Loss"></a>Masked Cross-Entropy Loss</h3><ul><li><strong>instead of</strong> a full Cross-Entropy Loss <strong>for all classes</strong>, we are motivated to <strong>train each local model only with its corresponding classes</strong>; each local model will train a <strong>sub-task</strong> given locally available label information</li><li><strong>Masked Cross-Entropy Loss</strong>: <strong>mask out the output of the model before passing it to the Cross-Entropy Loss</strong></li><li>We experimented with several different ways of masking; we find that <strong>replacing the last-layer outputs that are not associated with local labels with zero achieves both stable and comparable local and global results</strong></li><li>When aggregating local model parameters, we do not aggregate the untrained parameters in the last classification layers</li><li>Masked Cross-Entropy Loss <strong>significantly improves local performance and moderately improves global performance on the balanced non-IID data partition task</strong></li></ul>
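<p>A minimal PyTorch-style sketch of this masking trick as I read it (hypothetical names, not the authors' code): logits of classes absent from the client's local label set are replaced with zero before the standard cross-entropy is applied.</p><pre><code class="python">import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, local_classes):
    """Zero out logits of classes the client never sees, then apply CE.

    local_classes: LongTensor with the class ids present on this client.
    """
    mask = torch.zeros(logits.size(-1), device=logits.device)
    mask[local_classes] = 1.0
    return F.cross_entropy(logits * mask, targets)

# Example: a client holding only classes {2, 4} of a 10-class problem.
logits = torch.randn(8, 10)
targets = torch.tensor([2, 4, 2, 2, 4, 4, 2, 4])   # only local labels occur
loss = masked_cross_entropy(logits, targets, torch.tensor([2, 4]))
</code></pre>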
<h2 id="CONCLUSIONS-AND-FUTURE-WORK"><a href="#CONCLUSIONS-AND-FUTURE-WORK" class="headerlink" title="CONCLUSIONS AND FUTURE WORK"></a>CONCLUSIONS AND FUTURE WORK</h2>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About Model </tag>
</tags>
</entry>
<entry>
<title>MobiCom20-Billion-Scale Federated Learning on Mobile Clients A Submodel Design with Tunable Privacy</title>
<link href="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/"/>
<url>/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/</url>
<content type="html"><![CDATA[<h1 id="MobiCom20-Billion-Scale-Federated-Learning-on-Mobile-Clients-A-Submodel-Design-with-Tunable-Privacy"><a href="#MobiCom20-Billion-Scale-Federated-Learning-on-Mobile-Clients-A-Submodel-Design-with-Tunable-Privacy" class="headerlink" title="MobiCom20-Billion-Scale Federated Learning on Mobile Clients A Submodel Design with Tunable Privacy"></a>MobiCom20-Billion-Scale Federated Learning on Mobile Clients A Submodel Design with Tunable Privacy</h1><p>作者:</p><ul><li><a href="https://niuchaoyue.github.io/%EF%BC%88Chaoyue">https://niuchaoyue.github.io/(Chaoyue</a> Niu)()(Shanghai Jiao Tong University)</li><li><a href="https://www.cs.sjtu.edu.cn/~fwu/%EF%BC%88Fan">https://www.cs.sjtu.edu.cn/~fwu/(Fan</a> Wu)(吴帆)(Shanghai Jiao Tong University)</li><li>etc</li></ul><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221112220418688.png" class="lazyload placeholder" data-srcset="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221112220418688.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221112220418688"></p><ul><li>针对这样的淘宝推荐模型做的一个FL的改编。</li><li>淘宝的商品太多,本地用户想知道自己有关的商品的词嵌入,肯定不能一次性把几十亿的商品id全部放到自己的本地。</li><li>用户直接向server要和自己相关的商品的id会泄露隐私。本篇文章就是在探讨<strong>如何不泄露用户的隐私</strong>。</li></ul><h2 id="Motivitation"><a href="#Motivitation" class="headerlink" title="Motivitation"></a>Motivitation</h2><ul><li>当前的联邦学习存在一些<strong>问题</strong>:要求客户端充分的<strong>利用整个模型</strong>去训练,在<strong>大规模学习任务和资源受限的手机设备中</strong>效率极低。</li><li>深度学习的输入十分稀疏,因此常常用一个<strong>嵌入层</strong>(embedding layer),去把输入变到低维,使得相似的输入较为接近,且保留输入的信息较多(淘宝的输入保留了$98.22%$的信息,Google的安卓键盘Gboard保留了超过$2/3$)的信息。淘宝有20亿的商品,比Google的10000词多多了,给所有商品ID的词嵌入需要20亿行,134G空间(嵌入向量的维度为18),<strong>每一个client不可能使用完整的模型</strong>,注意到<strong>每一个client子需要输入的很小的特征空间(例如一个用户可能只有300个商品的浏览记录)</strong>,因此可以在client上只要着一点点的特征空间。</li></ul><h3 id="Submodel"><a href="#Submodel" class="headerlink" title="Submodel"></a>Submodel</h3><p>提出<strong>子模型框架</strong>,客户端只从server下载需要的部分,也就是子模型,并且<strong>只上传子模型($1.99%$)的更新</strong>。</p><p>上传子模型的更新会带来新的问题:客户端真实需要的<strong>子模型暴露了用户的隐私数据</strong>,违背FL初衷。首先,client下载和上传子模型的时候需要制定一个索引集(index set)作为辅助信息,索引集指示的模型的参数部分可能就是这个用户的输入或者一些其它隐含用户信息的词向量。其次,有了这个索引之后后续的更新也会被服务器知道,可以根据这个更新和模型的参数重建出用户的隐私数据。其次,<strong>淘宝每一个用户的子模型也很难和其它用户的对齐</strong>。</p><p>因此设计一个<strong>安全联邦子模型学习策略,附带一个隐私集联合代理作为基石</strong>。(a secure federated submodel learning scheme coupled with a private set union protocol as a cornerstone)。这个策略的属性:<strong>随机响应,安全聚合,布鲁姆过滤器,定制的可信否认</strong>(randomized response, secure aggregation, and Bloom filter, and endows each client with customized plausible deniability (in terms of local differential privacy) against the position of its desired submodel)</p><h2 id="解决子模型暴露用户隐私的问题"><a href="#解决子模型暴露用户隐私的问题" class="headerlink" title="解决子模型暴露用户隐私的问题"></a>解决子模型暴露用户隐私的问题</h2><ol><li><p><strong>客户端如何下载矩阵的一些行而不需要告诉服务器我需要哪一行和哪一个行的索引。</strong></p><ul><li>客户端直接下载服务器完整的模型,本地把自己需要的模型拉出来(❌)。</li><li><strong>不能下载完整的模型</strong>,private information retrieval(PIR,只读模式,获取到的数据的隐藏(concealment of the retrieved elements))</li></ul></li><li><p><strong>客户端如何修改矩阵的一些行而不让服务器知道哪一行被修改了或者替换了。</strong></p><ul><li>客户端要是直接去修改模型上的行和列,那么服务器肯定知道它改了什么(❌)。</li><li><strong>首先</strong>进行安全聚合,即先把所有的修改相加组成模型,<strong>然后</strong>使用聚合修改,即把所有修改相加组成的模型应用到整个模型上,这样就不知道谁改了什么。可用的加密策略:很多,例如,the protocol specific to the FL setting in [8] and additively homomorphic encryption [9, 
<p>Uploading submodel updates raises new problems: the submodel a client truly needs <strong>exposes the user's private data</strong>, defeating the purpose of FL. First, to download and upload a submodel the client must specify an index set as auxiliary information, and the model parameters that this index set points to may be exactly the embeddings of the user's inputs or of other user-revealing information. Second, once the server has the index set, it also observes the later updates at those positions and can reconstruct the user's private data from the updates and the model parameters. Moreover, <strong>each Taobao user's submodel is hard to align with other users' submodels</strong>.</p><p>The paper therefore designs "a secure federated submodel learning scheme coupled with a private set union protocol as a cornerstone". The scheme builds on <strong>randomized response, secure aggregation, and Bloom filters, and endows each client with customized plausible deniability (in terms of local differential privacy) against the position of its desired submodel</strong>.</p><h2 id="Keeping-the-submodel-from-exposing-user-privacy"><a href="#Keeping-the-submodel-from-exposing-user-privacy" class="headerlink" title="Keeping the submodel from exposing user privacy"></a>Keeping the submodel from exposing user privacy</h2><ol><li><p><strong>How can a client download some rows of the matrix without telling the server which rows, or their indices?</strong></p><ul><li>The client downloads the server's complete model and pulls out what it needs locally (❌).</li><li><strong>The full model cannot be downloaded</strong>; use private information retrieval (PIR: read-only access with concealment of the retrieved elements).</li></ul></li><li><p><strong>How can a client modify some rows of the matrix without the server learning which rows were modified or replaced?</strong></p><ul><li>If a client directly edits rows of the server-side matrix, the server obviously sees what it changed (❌).</li><li><strong>First</strong> run secure aggregation, i.e., sum up all clients' modifications, and <strong>then</strong> apply the aggregate modification to the full model, so nobody can tell who changed what. Many encryption strategies are usable, e.g., the protocol specific to the FL setting in [8] and additively homomorphic encryption [9, 43]. Then, as long as every client modifies at least one value in its corresponding rows, one cannot tell who changed what. One extreme strategy, secure federated learning (SFL), forces all chosen clients to participate in every round's modification whether or not they actually want to modify anything (the best privacy protection); the other extreme lets only the clients that truly want to modify their rows participate (the best efficiency). <strong>However</strong>, different clients tend to modify <strong>highly differentiated or even mutually exclusive rows</strong>; when several clients jointly modify some rows, the probability that only a single client participates is high, and under that condition secure aggregation <strong>loses its effect</strong>.</li></ul></li><li><p>Secure federated submodel learning (SFSL), the scheme proposed in this paper:</p><p> It first fixes the scope of each round's joint modification so that the diverged submodels can be aligned. The key difficulty is hiding the position of the submodel a client wants to modify. Plain SFL works over the huge full index set; SFSL instead aligns over the union of the chosen clients' real index sets, computed via private set union (PSU).</p><p> Each chosen client <strong>generates a randomized index set</strong> to stand in for and protect its real index set, produced by applying randomized response twice. <strong>The randomized-response parameters are set by the client itself</strong>; local differential privacy (LDP) rigorously quantifies the strength of the deniability, and the probability that the server can infer a client's real intent from the aggregated modifications can also be computed. The cardinality of the randomized index set determines the client's overhead, while the intersection of the real and randomized index sets controls the local training cost, so every client can fine-tune its own privacy level and efficiency.</p></li></ol><h2 id="Related-work"><a href="#Related-work" class="headerlink" title="Related work"></a>Related work</h2><h2 id="Preliminaries"><a href="#Preliminaries" class="headerlink" title="Preliminaries"></a>Preliminaries</h2><h3 id="Security-requirements"><a href="#Security-requirements" class="headerlink" title="Security requirements"></a>Security requirements</h3><p>Each client should have plausible deniability of whether a certain index is or is not in its real index set. For <strong>the strength of plausible deniability</strong>, we adopt <strong>LDP</strong> (local differential privacy). It protects not only against external attackers but also against an untrusted data curator.</p><ol><li><p><strong>Definition of LDP</strong>:</p><p> <img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221110221433864.png" alt="image-20221110221433864"></p></li></ol><p>If a client's input changes, the output distribution does not change by much.</p><ol start="2"><li><strong>Clients are allowed to customize their own privacy levels.</strong></li></ol><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221110233043387.png" alt="image-20221110233043387"></p><ol start="3"><li><p><strong>Randomized response</strong></p><p> Suppose we want to survey a sensitive question, say the rate of infidelity among married people. Asking every respondent to answer truthfully would invade their privacy; but we only want the statistic, not any individual's answer, so we can build the following randomized-response mechanism. The respondent flips a fair coin: on heads, answer truthfully; on tails, flip again and answer "yes" on heads, "no" on tails. Whatever a respondent answers reveals little, because everyone answers "yes" with probability at least one quarter, so an answer does not expose the truth. The researcher can still recover the proportion p by a simple calculation:</p></li></ol><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221110233329380.png" alt="image-20221110233329380"></p><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221110233823708.png" alt="image-20221110233823708"></p>
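<p>A small simulation of the coin-flip mechanism above (my own sketch): each respondent answers truthfully with probability 1/2 and otherwise answers uniformly at random, so P(yes) = 0.5·p + 0.25, and the researcher recovers p as 2·P(yes) − 0.5.</p><pre><code class="python">import random

def randomized_answer(truth):
    """First coin: heads = answer truthfully; tails = second coin decides."""
    if random.random() > 0.5:
        return truth
    return random.random() > 0.5    # second coin: heads = yes, tails = no

def estimate_rate(answers):
    """Invert P(yes) = 0.5*p + 0.25 to recover the true rate p."""
    yes_rate = sum(answers) / len(answers)
    return 2 * yes_rate - 0.5

truths = [random.random() > 0.9 for _ in range(100_000)]   # true rate ~ 0.1
answers = [randomized_answer(t) for t in truths]
print(estimate_rate(answers))   # close to 0.1
</code></pre>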
srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221110233823708"></p><h2 id="DESIGN-OF-SFSL"><a href="#DESIGN-OF-SFSL" class="headerlink" title="DESIGN OF SFSL"></a>DESIGN OF SFSL</h2><h3 id="Design-Rationale"><a href="#Design-Rationale" class="headerlink" title="Design Rationale"></a>Design Rationale</h3><p><strong>key design principles</strong> through how to handle the <strong>two fundamental problems</strong> raised in Section 1.4 and how to resolve <strong>several practical issues</strong></p><h4 id="Solve-two-fundamental-problems"><a href="#Solve-two-fundamental-problems" class="headerlink" title="Solve two fundamental problems"></a>Solve two fundamental problems</h4><p><strong>two fundamental problems</strong> :</p><ul><li>How a client can <strong>download</strong> a row of a matrix, which represents the global/full model and is maintained by an untrusted cloud server, <strong>without revealing which row or the row index to the cloud server</strong>;</li><li>how a client can <strong>modify</strong> a row of the matrix, still <strong>without revealing which row was modified and the altered content</strong>. </li></ul><p>Design:</p><ul><li>During the <strong>download and upload phases</strong>, a client consistently <strong>uses a randomized index set in place of its real index set</strong></li><li>during the <strong>local training phase</strong>, the client <strong>leverages the intersection of its real and randomized index sets</strong>, the client holds <strong>plausible deniability</strong> of some index being in or not in its real index set.</li></ul><p>examine the feasibility of index set randomization in handling two problems:</p><ul><li><strong>For the first problem</strong> in the download phase, if a client intends (resp., does not intend) to download a certain row, and it actually downloads (resp., does not download), it can blame its action to randomization</li><li><strong>Regarding the second problem</strong> in the upload phase, the usage of the randomized index set still empowers a client to deny its real intention of modifying or not modifying some row, even if the cloud server observes its binary action of modifying or not modifying</li></ul><p>for a concrete row, there are two different groups of clients involved in the joint modification:</p><ul><li>(1) One group consists of <strong>those clients who intend to modify the row and contribute nonzero modifications</strong>;</li><li>(2) the other group comprises <strong>those clients who do not intend to modify the row and pretend to modify by submitting zero modifications</strong>. </li><li><strong>With the secure aggregation guarantee</strong>(???),it is <strong>hard to identify any individual modification</strong> and further to infer whether some client originally intends to perform a modification or not</li><li>The <strong>hardness</strong> is controlled by <strong>the sizes of two groups</strong>:<strong>the probabilities of an index in and not in the real index set</strong>. <strong>Tunable</strong></li></ul><h4 id="two-practical-issues"><a href="#two-practical-issues" class="headerlink" title="two practical issues"></a>two practical issues</h4><ul><li>The first issue regards <strong>whether it is practical and necessary for the cloud server to ask “Do you have a certain index?” for each index in the full index set</strong>? <strong>Impractical</strong> and <strong>Unnecessary</strong> . 
<ul><li>We turn to <strong>narrowing down the scope to the union of 𝑛 chosen clients’ real index sets</strong>, which is normally far smaller than the full index set (i.e., <strong>the union of the whole clients’ real index sets</strong>); it is <strong>unnecessary to cover any index outside the union of the 𝑛 chosen clients’ real index sets</strong>. (Could this keep some items from ever being surfaced to users?)</li><li>how multiple clients can obtain the union of their real index sets under the mediation of an untrusted cloud server without revealing any client’s real index set (private set union, <strong>PSU</strong>):<ul><li><strong>a novel design of PSU</strong> based on <strong>Bloom filter</strong> and <strong>secure aggregation</strong></li></ul></li></ul></li><li>The second issue regards <strong>the longitudinal privacy</strong> when a client is chosen to participate in multiple rounds of FSL<ul><li>we need to extend the initial version to <strong>allow repeated responses from the same client to those already answered indices</strong> (key principles from randomized aggregatable privacy-preserving ordinal response (RAPPOR) [19, 21]): the noisy answers generated by the inner (permanent) randomized response will <strong>be memoized and permanently replace the real answers</strong> in the outer (instantaneous) randomized response</li></ul></li></ul>
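<p>A toy sketch of that Bloom-filter idea (my own illustration; the filter size, hash scheme, and the plain summation standing in for secure aggregation are all assumptions): each client encodes its real index set as a bit array, the arrays are summed so the server only sees the aggregate, and any candidate index whose hash positions are all nonzero is placed in the union.</p><pre><code class="python">import hashlib

M, K = 1024, 3        # filter length and number of hash functions (assumed)

def positions(index):
    """K hash positions for one index."""
    return [int(hashlib.sha256(f"{k}:{index}".encode()).hexdigest(), 16) % M
            for k in range(K)]

def bloom(index_set):
    bits = [0] * M
    for i in index_set:
        for pos in positions(i):
            bits[pos] = 1
    return bits

client_sets = [{1, 2, 4}, {2, 4, 7}, {9}]
aggregate = [sum(col) for col in zip(*map(bloom, client_sets))]  # secure-agg stand-in

union = [i for i in range(20)                    # candidate index space
         if all(aggregate[pos] > 0 for pos in positions(i))]
print(union)   # a superset of {1, 2, 4, 7, 9}, up to Bloom false positives
</code></pre>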
<h3 id="Design-Details"><a href="#Design-Details" class="headerlink" title="Design Details"></a>Design Details</h3><h4 id="Secure-Federated-Submodel-Learning-SFSL"><a href="#Secure-Federated-Submodel-Learning-SFSL" class="headerlink" title="Secure Federated Submodel Learning (SFSL)"></a>Secure Federated Submodel Learning (SFSL)</h4><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221112212046250.png" alt="image-20221112212046250"></p><ul><li>At the initial stage, the cloud server randomly initializes the global model (Line 1). In each communication round, the cloud server selects 𝑛 clients to participate (Lines 2 and 3) and also maintains an up-to-date set of the chosen clients who are alive throughout the whole round, denoted by $\hat{C}$. A chosen client determines its real index set based on its local data, which specifies the “position” of its truly required submodel (Line 10).</li><li>Then, the cloud server launches PSU to obtain the union of the chosen clients’ real index sets while keeping each individual client’s real index set secret (Lines 4 and 11).</li><li>The union is delivered to live clients for them to generate randomized index sets with customized LDP guarantees against the cloud server (Line 12)</li><li>Each live client will use its randomized index set, rather than its real index set, to download the submodel and to securely upload the submodel update (Lines 13 and 19)</li><li>Upon receiving the randomized index set from a client, the cloud server stores it for later usage and returns the corresponding submodel and training hyperparameters to the client (Line 6)</li><li>the client extracts a succinct submodel and prepares involved data as the succinct training set (Line 14). For example, if a Taobao user’s real index set is {1, 2, 4}, and his/her randomized index set is {2, 4, 6, 9}, he/she receives a submodel with the row indices {2, 4, 6, 9} from the cloud server, but just needs to train the succinct submodel with the row indices {2, 4} over his/her local data <strong>involving the goods IDs {2, 4}</strong>. <strong>(why???)</strong></li><li>After training under the preset hyperparameters, the client obtains the update of the succinct submodel (Line 15)</li><li>it prepares the submodel update to be uploaded with the randomized index set by adding the update of the succinct submodel to the rows with the succinct indices and padding zero vectors to the other rows (Line 16)</li><li>To help the cloud server average multiple submodel updates according to the size of relevant local training data, <strong>each chosen client also needs to count the number of its training samples involving every index</strong> in the randomized index set (Line 17). The numbers of samples involving the indices outside the succinct index set are all zeros.</li></ul><h4 id="Index-Set-Randomization"><a href="#Index-Set-Randomization" class="headerlink" title="Index Set Randomization"></a>Index Set Randomization</h4><p>how a client can generate a randomized index set in each participating round</p><ul><li>basic design</li></ul><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221112214156196.png" alt="image-20221112214156196"></p><ul><li>let client 𝑖 maintain two index sets with “Yes” and “No” answers in the permanent randomized response</li></ul><ul><li>Client 𝑖’s Index Set Randomization</li></ul><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221112214520078.png" alt="image-20221112214520078"></p><ul><li>This mainly addresses how the randomized index set should evolve across rounds: as rounds go by, the indices the user really wants must still not become guessable. See the paper for details.</li></ul><h4 id="Private-Set-Union-PSU"><a href="#Private-Set-Union-PSU" class="headerlink" title="Private Set Union (PSU)"></a>Private Set Union (PSU)</h4><p><img src="/2022/11/12/mobicom20-billion-scale-federated-learning-on-mobile-clients-a-submodel-design-with-tunable-privacy/image-20221112214740791.png" alt="image-20221112214740791"></p><h2 id="THEORETICAL-ANALYSIS"><a href="#THEORETICAL-ANALYSIS" class="headerlink" title="THEORETICAL ANALYSIS"></a>THEORETICAL ANALYSIS</h2><h3 id="Security-and-Privacy-Analyses"><a href="#Security-and-Privacy-Analyses" class="headerlink" title="Security and Privacy Analyses"></a>Security and Privacy Analyses</h3><h3 id="Complexity-Analysis-and-Comparison"><a href="#Complexity-Analysis-and-Comparison" class="headerlink" title="Complexity 
Analysis and Comparison"></a>Complexity Analysis and Comparison</h3><h2 id="EVALUATION"><a href="#EVALUATION" class="headerlink" title="EVALUATION"></a>EVALUATION</h2><ul><li>Dataset</li><li>Model, Hyperparameters, and Metric</li><li>Prototypes and Configurations</li></ul><h3 id="Model-Accuracy-and-Convergency"><a href="#Model-Accuracy-and-Convergency" class="headerlink" title="Model Accuracy and Convergency"></a>Model Accuracy and Convergency</h3><h3 id="Communication-Overhead"><a href="#Communication-Overhead" class="headerlink" title="Communication Overhead"></a>Communication Overhead</h3><h3 id="Computation-Overhead"><a href="#Computation-Overhead" class="headerlink" title="Computation Overhead"></a>Computation Overhead</h3><h3 id="Memory-and-Disk-Loads"><a href="#Memory-and-Disk-Loads" class="headerlink" title="Memory and Disk Loads"></a>Memory and Disk Loads</h3><h3 id="Discussion-on-Billion-Scale-Issues"><a href="#Discussion-on-Billion-Scale-Issues" class="headerlink" title="Discussion on Billion-Scale Issues"></a>Discussion on Billion-Scale Issues</h3><h2 id="CONCLUSION"><a href="#CONCLUSION" class="headerlink" title="CONCLUSION"></a>CONCLUSION</h2>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About Model </tag>
</tags>
</entry>
<entry>
<title>OSDI21-Oort Efficient Federated Learning via Guided Participant Selection</title>
<link href="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/"/>
<url>/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/</url>
<content type="html"><![CDATA[<h1 id="OSDI21-Oort-Efficient-Federated-Learning-via-Guided-Participant-Selection"><a href="#OSDI21-Oort-Efficient-Federated-Learning-via-Guided-Participant-Selection" class="headerlink" title="OSDI21-Oort Efficient Federated Learning via Guided Participant Selection"></a>OSDI21-Oort Efficient Federated Learning via Guided Participant Selection</h1><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224210212604.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224210212604.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224210212604"></p><p><a href="https://zhuanlan.zhihu.com/p/507487766">一种新型的联邦分布式架构(Oort)及未来研究方向 - 知乎 (zhihu.com)</a></p><p><strong>优先选择既有优质的数据(快速的让loss下降)又算的快的client</strong>。(prioritizes the use of those clients who have both data that offers the greatest utility in improving model accuracy and the capability to run training quickly)</p><p><strong>Oort可以强制要求参与者数据的分布,这样开发者可以较方面的解释他的结果。</strong>(To enable FL developers to interpret their results in model testing, Oort enforces their requirements on the distribution of participant data while improving the duration of federated testing by cherry-picking clients)</p><p><strong>Result</strong>: <strong>time-to-accuracy performance</strong> by <strong>1.2×-14.1×</strong> and <strong>final model accuracy</strong> by <strong>1.3%-9.8%</strong>, while efficiently enforcing developer-specified model testing criteria at the scale of millions of clients.</p><p><strong>方法</strong>:通过client最近的loss(utility)去选择一个client的子集(惩罚会导致全局聚合时间增加的client)去优化statistical model efficiency和system efficiency。</p><p><strong>实现</strong>:<strong>PySyft</strong>上,真实的workloads上测试。</p><ul><li>选择client的策略同时考虑了设备异构性、数据异构型和通信开销的问题。</li><li>通过选择client的策略来优化训练的效率,缺少了个性化。</li></ul><h2 id="Background(为什么做这件事)"><a href="#Background(为什么做这件事)" class="headerlink" title="Background(为什么做这件事)"></a>Background(为什么做这件事)</h2><p>联邦学习通常每一轮几百台设备,花好几天训练完成。</p><p><strong>训练模型的wall clock time仍然是一个关键的性能指标,需要提升</strong>。</p><p><strong>开发者测试的时候有挑选出一部分数据集做训练和测试的需求</strong>:“50k representative samples”、“x samples of class y”、“a subset with less than X% data deviation from the global”。</p><p><strong>当前的FL有什么挑战?</strong></p><ul><li><p>数据异构性:展示了一些数据集上不同clients之间数据规模和pairwise data divergence的CDF图。</p><p> <img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224213011528.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224213011528.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224213011528"></p></li><li><p>设备异构性:用mobilenet在几百个设备上跑,计算了<strong>推理事件</strong>和<strong>网络的吞吐率</strong>画了两张图。</p><p> <img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224213023703.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224213023703.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221224213023703"></p></li><li><p>Enormous population and pervasive uncertainty</p></li><li><p>Privacy concerns</p></li></ul><p><strong>当前的解决办法有什么不足(Limitations of Existing FL 
Solutions)</strong></p><ul><li><p>the potential for curbing these disadvantages by cherry-picking participants before execution has largely been overlooked</p></li><li><p>Suboptimality in maximizing efficiency: MobileNet and ShuffleNet were trained on a 1.6-million-image OpenImage dataset, randomly selecting 100 participants per round out of more than 14k clients, with training on images spread uniformly over 100 clients as the upper bound. Even with optimizations such as YoGi and Prox, ignoring system heterogeneity significantly increases per-round latency.<img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224213308401.png" alt="image-20221224213308401"></p></li><li><p>Inability to enforce data selection criteria: <strong>Figure 4(a)</strong> shows that even with the same participants, random selection causes significant data deviations; although this deviation shrinks as the number of participants grows, capping it still matters. Figure 4(b) shows that developers cannot specify the distribution they want when selecting participants, so test results deviate over a wide range.</p><p> <img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221224213431960.png" alt="image-20221224213431960"></p></li></ul><h2 id="Oort-Overall-Design"><a href="#Oort-Overall-Design" class="headerlink" title="Oort Overall Design"></a>Oort Overall Design</h2><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111134439590.png" alt="image-20221111134439590"></p><ul><li><strong>Job submission</strong>: the developer submits and specifies the participant selection criteria to the FL coordinator in the cloud. 
</li><li><strong>Participant selection</strong>: the coordinator enquires the clients meeting eligibility properties (e.g., battery level), and forwards their characteristics (e.g., liveness) to Oort.</li><li><strong>Given the developer requirements (and execution feedback in the case of training, 2a).</strong></li><li><strong>Oort selects participants based on the given criteria and notifies the coordinator of this participant selection (2b)</strong></li><li>Execution: the coordinator distributes relevant profiles (e.g., model) to these participants, and then each participant independently computes results (e.g., model weights in training) on her data</li><li>Aggregation: when participants complete the computation, the coordinator aggregates updates from participants</li></ul><h3 id="Interface-Oort-interface-design"><a href="#Interface-Oort-interface-design" class="headerlink" title="Interface (Oort's interface design)"></a>Interface (Oort's interface design)</h3><p><strong>Training selector</strong>:</p><ul><li>improve the time-to-accuracy performance of federated training</li><li>it captures the utility of clients in training, and efficiently explores and selects high-utility clients at runtime</li></ul><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111135103438.png" alt="image-20221111135103438"></p><ul><li><strong>When the individual client data characteristics (e.g., categorical distribution) are not provided</strong>, the testing selector determines the number of participants needed to cap the data deviation of participants <strong>from the global</strong></li><li>it cherry-picks participants to <strong>serve the exact developer-specified requirements</strong> on data while minimizing the duration of testing</li></ul><h2 id="Federated-Model-Training"><a href="#Federated-Model-Training" class="headerlink" title="Federated Model Training"></a>Federated Model Training</h2><h3 id="the-trade-off-in-selecting-participants-for-FL-training"><a href="#the-trade-off-in-selecting-participants-for-FL-training" class="headerlink" title="the trade-off in selecting participants for FL training"></a>the trade-off in selecting participants for FL training</h3><p>Client data and system performance are coupled by nature, so both must be weighed together.</p><p>MobileNet on the OpenImage dataset is again used to visualize the separate effects of the two kinds of efficiency (Figure 7).</p><ul><li>Optimizing system efficiency (“Opt-Sys. Efficiency”) shortens each round (e.g., by picking the fastest clients), but may take more training rounds than random selection, because a fast client’s data may already have been represented by other participants’ data in past rounds (<strong>why?</strong>).</li><li>Picking clients with high statistical utility (“Opt-Stat. 
Efficiency”) can lengthen each round (if those clients are the system bottleneck (<strong>why</strong>?)).</li><li><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111141201330.png" alt="image-20221111141201330"></li></ul><h3 id="how-Oort-quantifies-the-client-utility-while-respecting-privacy"><a href="#how-Oort-quantifies-the-client-utility-while-respecting-privacy" class="headerlink" title="how Oort quantifies the client utility while respecting privacy"></a>how Oort quantifies the client utility while respecting privacy</h3><h4 id="Client-Statistical-Utility:"><a href="#Client-Statistical-Utility:" class="headerlink" title="Client Statistical Utility:"></a>Client Statistical Utility:</h4><p>be able to efficiently capture the client data utility toward improving model performance for various training tasks, and respect privacy</p><p><strong>leverage importance sampling</strong>: i.e., pick the clients whose updates change the gradient the most.</p><p><strong>impractical</strong>:</p><ul><li>Computing these gradient norms adds extra time on the client side.</li><li>The gradient norm keeps changing as the model is updated. (Why does that make it impractical?)</li></ul><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111141548161.png" alt="image-20221111141548161"></p><p><strong>Therefore</strong>: an approximation cuts the time cost: use the loss rather than the weights (this reads like a constructed story: pose a hard formulation first, argue it is bad, then present the simple one as the fix). Insight: larger gradients correspond to larger loss (why?).</p><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111142603020.png" alt="image-20221111142603020"></p><p><strong>How Oort respects privacy:</strong></p><ul><li>Training loss measures the prediction confidence of a model without revealing the raw data and is often collected in real FL deployments. (Deployments already collect it, so Oort gathers no additional data.)</li><li>Three privacy considerations (essentially glosses on the formula above, not much extra work):<ul><li>Clients are selected from the loss alone; the loss is computed locally and does not reveal the distribution of a client's samples.</li><li>As with LDP, even if the loss raised privacy concerns, differential-privacy techniques could address them.</li><li>Oort can flexibly swap in other client-selection strategies.</li></ul></li><li>The technical report gives further theoretical analysis of various strategies (e.g., using the gradient norm over a batch of data) and shows experimentally that performance stays good when the data is noisy (evidence that techniques like differential privacy really can be applied here).</li></ul><h4 id="Trading-off-Statistical-and-System-Efficiency"><a href="#Trading-off-Statistical-and-System-Efficiency" class="headerlink" title="Trading off Statistical and System Efficiency"></a>Trading off Statistical and System Efficiency</h4><p>Simply selecting clients with high statistical utility can hamper the system efficiency (chasing statistical utility lengthens training time, which can hurt system efficiency; good system efficiency means comparatively fast rounds). The computation shown in the figure below is adopted.</p><ul><li>t is each client's training time.</li><li>Clients that would become a system bottleneck are penalized.</li></ul><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111150040681.png" alt="image-20221111150040681"></p>
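<p>A sketch of the utility computation in the figure as I read it (hypothetical code, not Oort's implementation): the statistical term aggregates the client's recent training losses, and the result is discounted by $(T/t_i)^{\alpha}$ only when the client's completion time $t_i$ exceeds the developer-preferred round duration $T$.</p><pre><code class="python">import math

def client_utility(losses, t_i, T, alpha=2.0):
    """Statistical utility, discounted when the client would slow the round."""
    stat = len(losses) * math.sqrt(sum(l * l for l in losses) / len(losses))
    penalty = (T / t_i) ** alpha if t_i > T else 1.0   # only slow clients pay
    return stat * penalty

# A fast client with moderate losses vs. a slow client with high losses.
print(client_utility([0.8, 1.1, 0.9], t_i=20, T=60))    # no penalty
print(client_utility([2.5, 2.2, 2.7], t_i=180, T=60))   # discounted by (1/3)^2
</code></pre>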
data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111150040681.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221111150040681"></p><ul><li><p><strong>Navigating the trade-off</strong>:</p><p> <img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111153210944.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111153210944.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221111153210944"></p><p> 随着round的进行,模型的loss会逐渐下降,在<strong>模型的训练的后期,应该减小对能够取得高statistical utility的clients的选择</strong>:</p><p> <img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111153506139-1668152108109-1.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111153506139-1668152108109-1.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221111153506139"></p></li></ul><h4 id="how-it-selects-high-utility-clients-at-scale-despite-staleness-in-client-utility-as-training-evolves"><a href="#how-it-selects-high-utility-clients-at-scale-despite-staleness-in-client-utility-as-training-evolves" class="headerlink" title="how it selects high-utility clients at scale despite staleness in client utility as training evolves"></a>how it selects high-utility clients at scale despite staleness in client utility as training evolves</h4><p>还有一些其它的考虑:</p><ul><li><strong>Scalability</strong>: a client’s utility can only be determined after it has participated in training; <strong>how to choose from clients at scale without having to try all clients once</strong>? (为了提高选择器的效率)</li><li><strong>Staleness</strong>: since not every client participates in every round, <strong>how to account for the change in a client’s utility since its last participation</strong>?(客户端并不是每一轮都参加的,如何处理这样的问题)。</li><li><strong>Robustness</strong>: how to <strong>be robust to outliers in the presence of corrupted clients</strong> (e.g., with noisy data)? 
<p>Oort addresses these as follows (a toy rendering of the selection loop appears after this section):</p><p><strong>Online exploration-exploitation of high-utility clients</strong></p><ul><li>Client selection is cast as a multi-armed bandit problem: clients are the “arms” and utility is the “reward”. Compared with reinforcement-learning approaches, the bandit formulation is scalable and flexible, even when the solution space (e.g., the number of clients) varies greatly over time (this is essentially the justification for using a bandit). <strong>The bandit can both exploit previously observed utilities and explore clients that have never been selected:</strong><ul><li>At the start of each selection round, Oort receives feedback from the previous training round and updates every client's statistical utility and system performance.</li><li>For clients that have already been explored, Oort computes their utility and narrows the selection toward high-utility clients.</li><li>It then samples a fraction $\epsilon \in [0,1]$ of the participants from clients that have not been selected before.</li></ul></li></ul><p><strong>Exploitation under staleness in client utility</strong></p><p>Borrowing from bandit confidence intervals, Oort adds an incentive term that gradually raises a client's utility the longer the client has been neglected, so a long-ignored client slowly regains a chance of being selected.</p><p>Instead of directly taking the top-k utilities, Oort applies a confidence interval c on the cut-off utility (by default $95\%$): it samples the top $(1-\epsilon) \times K$ clients from those whose utility exceeds $c\%$ of the cut-off, which removes the interference of uncertainty in client utilities.</p><p><strong>Robust exploitation under outliers</strong></p><p>Selecting purely by utility would be vulnerable to outliers in unfavorable settings, so:</p><ul><li><p>Oort removes a client after it has been selected for a number of rounds.</p></li><li><p>It caps the upper bound of client utility.</p></li></ul><p><strong>Accommodation to diverse selection criteria</strong></p><p>Developers can customize how utility is computed.</p>
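<p>A minimal sketch of this exploration-exploitation loop (my own simplification, not Oort's code; the staleness bonus and cut-off arithmetic are assumptions standing in for the paper's exact terms):</p><pre><code class="python">import random

def select_participants(clients, k, round_no, eps=0.1, conf=0.95):
    # clients: {client_id: {"utility": float or None, "last_round": int}}
    explored = {c: v for c, v in clients.items() if v["utility"] is not None}
    unexplored = [c for c, v in clients.items() if v["utility"] is None]

    def score(cid):
        v = explored[cid]
        staleness = round_no - v["last_round"]
        return v["utility"] * (1.0 + 0.1 * staleness)  # incentive term: my guess

    picked = []
    n_exploit = int((1 - eps) * k)
    ranked = sorted(explored, key=score, reverse=True)
    if ranked and n_exploit > 0:
        # Admit every client within conf of the cut-off utility, then sample,
        # rather than deterministically taking the exact top-k.
        cutoff = conf * score(ranked[min(n_exploit, len(ranked)) - 1])
        pool = [c for c in ranked if score(c) >= cutoff]
        picked = random.sample(pool, min(n_exploit, len(pool)))

    # Exploration: fill remaining slots with never-selected clients.
    room = k - len(picked)
    picked += random.sample(unexplored, min(room, len(unexplored)))
    return picked
</code></pre>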
<h2 id="联邦学习的模型测试"><a href="#联邦学习的模型测试" class="headerlink" title="Federated model testing"></a>Federated model testing</h2><h3 id="开发者未自定义数据的分布时默认给出的数据具有代表性"><a href="#开发者未自定义数据的分布时默认给出的数据具有代表性" class="headerlink" title="When the developer does not specify a data distribution, the default selection is representative"></a>When the developer does not specify a data distribution, the default selection is representative</h3><p>Oort enables guided participant selection by determining the number of participants needed to guarantee a deviation target.</p><p><strong>This appears to bound only the distribution of sample counts drawn from clients, so that the drawn data is more representative.</strong></p><h4 id="Preserving-Data-Representativeness"><a href="#Preserving-Data-Representativeness" class="headerlink" title="Preserving Data Representativeness"></a>Preserving Data Representativeness</h4><p>The L1 distance measures the deviation of all participants from the global data distribution; for a category X it is computed as:</p><p>$\left| \overline{X} - E[\overline{X}] \right|$, i.e., the gap between the participants' sample mean and its expectation over all clients.</p><p>Since the number of samples on one client does not affect the number on any other, each client's sample count can be viewed as an independent draw from the population X. Given a confidence level and a deviation tolerance, the <strong>goal</strong> is to determine how many participants are needed so that the deviation of the selected samples' distribution is bounded, i.e.,</p><p>$\Pr\left[\left|\overline{X}-E[\overline{X}]\right| > \delta\right] \le \epsilon$. This becomes a random-sampling problem, and the <strong>Hoeffding bound</strong> is applied to bound the data deviation across different participant samples. (The technical report contains the theoretical proofs and results.)</p><h4 id="Estimating-the-number-of-participants-to-cap-deviation"><a href="#Estimating-the-number-of-participants-to-cap-deviation" class="headerlink" title="Estimating the number of participants to cap deviation"></a>Estimating the number of participants to cap deviation</h4><p>Even when individual data characteristics are unavailable, the developer can specify a tolerance $\epsilon$, and Oort outputs the number of participants required to satisfy it.</p><p>The model asks the developer for the maximum and minimum number of samples per client, and the total number of clients (why the total number of clients?). The developer may also set a loose bound (e.g., based on device model capacity).</p><p>The developer can first deploy the model to the selected participants, and after collecting their results, verify the representativeness of the data.</p><h3 id="开发者自定义自己的数据分布"><a href="#开发者自定义自己的数据分布" class="headerlink" title="When the developer specifies a custom data distribution"></a>When the developer specifies a custom data distribution</h3><p>For each client — given how many developer-required samples the client can provide, the client's compute power, the size of the data it must transfer, and its bandwidth — the maximum completion time over all participants can be computed, and the objective is to minimize that maximum time.</p><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111222910741.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111222910741.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221111222910741"></p><p>This mixed-integer linear programming (MILP) model yields good solutions, but its <strong>time complexity is high</strong> (the paper does not say how high).</p><ul><li><p>A greedy heuristic tackles the high time complexity:</p><p> <img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111223455902.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111223455902.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221111223455902"></p></li></ul><h2 id="Implementation"><a href="#Implementation" class="headerlink" title="Implementation"></a>Implementation</h2><p>Oort: 2,617 lines of code, integrated with FL engines (e.g., PySyft and TensorFlow).</p><ul><li>Low overhead</li><li>Periodic backups</li><li>Failure recovery</li><li>The Gurobi solver for the MILP problem</li></ul><h2 id="实验"><a href="#实验" class="headerlink" title="Experiments"></a>Experiments</h2><p><strong>1,300 participants per round were emulated on 68 NVIDIA Tesla P100 GPUs.</strong> Training and testing emulate the system performance and data of real heterogeneous client systems, using an open-source FL benchmark:</p><ul><li>Different devices</li><li>Different models</li><li>Different network throughput and connectivity</li></ul><p>Data heterogeneity across clients is introduced manually.</p><p>Communication between the coordinator and the clients uses a parameter-server architecture (as PySyft and real FL deployments do).</p><p>To reduce the impact of stragglers, the same mechanisms used in real FL deployments are adopted.</p><p><strong>Datasets and models</strong>:</p><p><img src="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111224714887.png" class="lazyload placeholder" data-srcset="/2022/11/12/osdi21-oort-efficient-federated-learning-via-guided-participant-selection/image-20221111224714887.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221111224714887"></p><p><strong>Some model-configuration parameters</strong></p><h3 id="FL-Training-Evaluation"><a href="#FL-Training-Evaluation" class="headerlink" title="FL Training Evaluation"></a>FL Training Evaluation</h3><h4 id="End-to-End-Performance"><a href="#End-to-End-Performance" class="headerlink" title="End-to-End Performance"></a>End-to-End Performance</h4><ul><li>Oort improves time-to-accuracy performance</li><li>Oort improves final model accuracy</li></ul><h4 id="Performance-Breakdown"><a href="#Performance-Breakdown" class="headerlink" title="Performance Breakdown"></a>Performance Breakdown</h4><ul><li>Breakdown of time-to-accuracy efficiency</li><li>Oort achieves close to upper-bound statistical performance</li></ul><h4 id="Sensitivity-Analysis"><a href="#Sensitivity-Analysis" class="headerlink" title="Sensitivity Analysis"></a>Sensitivity Analysis</h4><ul><li>Impact of number of participants K</li><li>Impact of penalty factor α on stragglers</li><li>Impact of outliers</li><li>Impact of noisy utility</li><li>Oort can respect developer-preferred fairness</li></ul>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> About Algorithm </tag>
</tags>
</entry>
<entry>
<title>ICML22-FedScale-Benchmarking Model and System Performance of Federated Learning at Scale</title>
<link href="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/"/>
<url>/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/</url>
<content type="html"><![CDATA[<h1 id="ICML22-FedScale-Benchmarking-Model-and-System-Performance-of-Federated-Learning-at-Scale"><a href="#ICML22-FedScale-Benchmarking-Model-and-System-Performance-of-Federated-Learning-at-Scale" class="headerlink" title="ICML22-FedScale-Benchmarking Model and System Performance of Federated Learning at Scale"></a>ICML22-FedScale-Benchmarking Model and System Performance of Federated Learning at Scale</h1><p>这篇文章开始做了一个empirial study,发现了当前benchmark上存在的一些问题(绘制了一些统计分析的图,得到了一些结论),然后根据初步得到的结论做了自己的一个平台。</p><ul><li>a federated learning (FL) benchmarking suite. <a href="http://fedscale.ai/">官网</a><ul><li>realistic datasets, Each dataset comes with a unified evaluation protocol using real-world data splits and evaluation metrics.<ul><li>20个真实的数据集(小规模、中等规模、大规模)(图像分类、目标检测、单词于此、语音识别、强化学习)</li><li>根据真实世界的情况(设备的system speed、connectivity和availability对数据集进行了划分和模拟)</li></ul></li><li><strong>automated evaluation platform</strong>,Fedscale Runtime.<ul><li>mobile backend:enable on-device FL evaluation</li><li>cluster backend: benchmark various practical FL metrics</li></ul></li><li><strong>systematic experiments</strong>(some implications):highlight the pressing need for co-optimizing system and statistical efficiency, especially in tackling system stragglers, accuracy bias, and device energy trade-offs</li></ul></li></ul><h2 id="Background(当前的FL-Benchmark存在各种各样的问题)"><a href="#Background(当前的FL-Benchmark存在各种各样的问题)" class="headerlink" title="Background(当前的FL Benchmark存在各种各样的问题)"></a>Background(当前的FL Benchmark存在各种各样的问题)</h2><h3 id="总结当前FL的Benchmark存在的各种各样的问题"><a href="#总结当前FL的Benchmark存在的各种各样的问题" class="headerlink" title="总结当前FL的Benchmark存在的各种各样的问题"></a>总结当前FL的Benchmark存在的各种各样的问题</h3><p>具体的比较见<a href>附件C</a></p><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212134351748.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212134351748.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><ul><li>当前的benchmark在这些方面做的都不是很好:data heterogeneity、device heterogeneity、heterogeneous connectivity、availability conditions、multiple scales (robustness)、broad variety of ML tasks</li><li>[19-LEAF](Caldas et al., 2019),数据集CIFAR包含合成的数据集,这些数据集通常都是从经典的ML benchmark或者是使用模拟的FL环境例如TFF和PySyft生成的。</li><li><a href>20-FedML</a>和<a href>21-Flower</a>:当前的FL Benchmark忽略system speed, connectivity和availability.</li><li>当前benchmark的数据集都是小规模的,不支持大规模的数据集。</li><li>缺少user-friendly的APIs</li></ul><h3 id="当前的FL的benchmark缺少真实世界的数据和系统行为、且并不是大规模的数据集导致的问题初探"><a href="#当前的FL的benchmark缺少真实世界的数据和系统行为、且并不是大规模的数据集导致的问题初探" class="headerlink" title="当前的FL的benchmark缺少真实世界的数据和系统行为、且并不是大规模的数据集导致的问题初探"></a>当前的FL的benchmark缺少真实世界的数据和系统行为、且并不是大规模的数据集导致的问题初探</h3><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212141014692.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212141014692.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212141014692"></p><ul><li>2(a)展示了,在遇见真实设备行为的时候、statistical performance会变差。</li><li>2(b)展示了,在参与的客户端不是很多的性能表现劣于参与客户端很多时候的性能表现。</li></ul><p>因此,当前的benchmark存在问题,我们要解决这样的一个问题。</p><h2 id="FedScale-Dataset-Realistic-FL-Workloads(我们提出的数据集)"><a 
href="#FedScale-Dataset-Realistic-FL-Workloads(我们提出的数据集)" class="headerlink" title="FedScale Dataset: Realistic FL Workloads(我们提出的数据集)"></a>FedScale Dataset: Realistic FL Workloads(我们提出的数据集)</h2><h3 id="为了解决数据异构性的问题,在搜集的数据集的方法、数据集在不同clients上的划分做了创新。"><a href="#为了解决数据异构性的问题,在搜集的数据集的方法、数据集在不同clients上的划分做了创新。" class="headerlink" title="为了解决数据异构性的问题,在搜集的数据集的方法、数据集在不同clients上的划分做了创新。"></a>为了解决数据异构性的问题,在搜集的数据集的方法、数据集在不同clients上的划分做了创新。</h3><ul><li>提出收集数据集的方法(其实就是收集别人的数据集,做清洗和整理),然后提供了一组API可以比较方便的去利用这个数据集(使用自定义的数据集、使用数据集的某个分布),例如:<ul><li>Puffer dataset (Yan et al., 2020) is from FL video streaming deployed to edge users over the Internet. The raw data of FedScale datasets are collected from different sources and stored in various formats. We clean up the raw data, partition them into new FL datasets, streamline new datasets into consistent formats, and categorize them into different FL use cases.</li></ul></li></ul><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667052732044-2.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667052732044-2.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><ul><li><p>收集数据集的时候要求收集带有client information的数据集,使用独特的客户端标识去划分数据集。</p><ul><li><p>例如,OpenImage是一个vision的数据集,不同的手机用户上传他们自己的图片,作为公共用途,就可以使用OpenImage数据集的AuthorProfileUrl这个属性去把数据映射到每一个client,通过这样的操作我们就可以得到<strong>真实世界的数据的分布</strong>。</p><p> <strong>真实世界的数据的分布(PDF-Probability Density Function):</strong></p><ul><li>不同clients样本数的分布 Figure (a):每一个数据集都存在在clients上分布严重不均匀的问题。</li><li>数据的分布 Figure (b):数据集中,数据集本身(比如类别等)分布也是非常不均匀的。</li></ul> <figure class="half"> <img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212154712574.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212154712574.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" width="500"> <img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212154752591.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212154752591.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" width="500"> </figure></li></ul></li><li><p>根据<a href="arxiv.org/abs/1812.02903">Yang et al., 2018</a>等人的实际的FL的部署,将收集到的数据集根据收集数据时得到的数据的分布(在4个真实世界的数据集video (Charades)、audio (Google Speech)、image (OpenImage)、text (Reddit)上得到的分布,每个数据集都有小至几百多至百万的客户端和百万的数据点),在每一个client上将比如OpenImage数据集划分为训练集、测试集和验证集。</p></li><li><p>将数据集划分成小、中、大等不同的规模。一些数据集可以稍微改改(比如目标检测的数据集可以改一下作为分类的数据集),拓展了任务范围。</p></li></ul><h3 id="为了解决设备异构性的问题,"><a href="#为了解决设备异构性的问题," class="headerlink" title="为了解决设备异构性的问题,"></a>为了解决设备异构性的问题,</h3><p><strong>使用<a href>AI Benchmark, Ignatov et al.</a>和<a href>MobiPerf Measurements, mob</a>去获取了不同clints的运行轨迹。</strong></p><ul><li>AI Benchmark:<strong>不同模型</strong>在<strong>不同设备</strong>的训练和推理的速度。</li><li>MobiPerf:<strong>云端</strong>到<strong>设备</strong>的网络吞吐量(超过100k的收集)。</li></ul><p>可以看到,不同设备的算力和网络的带宽有显著的不同。</p><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212161006092-1670832614889-1.png" 
class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212161006092-1670832614889-1.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212161006092"></p><p>根据真实的部署<a href>Bonawitz et al., 2019</a>和<a href>Yang et al., 2018</a>,选择了<strong>内存>2GB</strong>且<strong>连接WIFI</strong>的设备。</p><p><strong>根据<a href>Yang et al., 2021</a>等人的136k用户一周180 million的行为数据集(例如电池充电情况和屏幕锁屏情况),观察到设备的availability有很大的不同</strong>:</p><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212161556429.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212161556429.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212161556429"></p><ul><li>日夜变化时的设备量。5 (a),可能会导致statistical performance下降<a href>Eichner et al., 2019</a>。</li><li>每一个可用的设备的可用时间也存在问题。5 (b)</li></ul><h2 id="FedScale-Runtime-Evaluation-Platform(我们是怎么做我们的runtime的?)"><a href="#FedScale-Runtime-Evaluation-Platform(我们是怎么做我们的runtime的?)" class="headerlink" title="FedScale Runtime: Evaluation Platform(我们是怎么做我们的runtime的?)"></a>FedScale Runtime: Evaluation Platform(我们是怎么做我们的runtime的?)</h2><h3 id="Mobile-Backend(手机上的backend)"><a href="#Mobile-Backend(手机上的backend)" class="headerlink" title="Mobile Backend(手机上的backend)"></a>Mobile Backend(手机上的backend)</h3><p>FedScale mobile backend (Singapuram et al., 2022) is <strong>built atop the Termux app</strong> (ter)(一个支持linux环境的安卓终端)。</p><ul><li>例如:</li></ul><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212221929723.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212221929723.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212221929723"></p><ul><li>这样实现的好处是开发者不用更改他们写的Python程序(例如Pytorch代码),程序的full-operator set(例如Pytorch程序的完整的操作符)也都可以使用,原本在server的GPUs/CPUs上实现的模型和算法也能够使用FedScale Runtime。</li><li>We are currently implementing the <strong>Google Remote Procedure Call (gRPC)</strong> for distributed mobile devices <strong>to interface with FedScale Runtime cloud server</strong>.</li></ul><h3 id="Cluster-Backend(集群上的backend)"><a href="#Cluster-Backend(集群上的backend)" class="headerlink" title="Cluster Backend(集群上的backend)"></a>Cluster Backend(集群上的backend)</h3><p>集群上的backend支持<strong>真实的部署</strong>和<strong>集群内部的模拟</strong>。</p><h4 id="Deployment-mode-真实的部署"><a href="#Deployment-mode-真实的部署" class="headerlink" title="Deployment mode (真实的部署)"></a>Deployment mode (真实的部署)</h4><ul><li>acts as the <strong>cloud aggregator</strong> and orchestrates FL executions across real devices.</li></ul><h4 id="Simulation-mode-模拟模式"><a href="#Simulation-mode-模拟模式" class="headerlink" title="Simulation mode (模拟模式)"></a>Simulation mode (模拟模式)</h4><ul><li>Providing various <strong>practical FL metrics</strong> by emulating realistic FL behaviors, such as <strong>computation/communication cost, latency and wall clock time</strong>. 
<h2 id="FedScale-Runtime-Evaluation-Platform(我们是怎么做我们的runtime的?)"><a href="#FedScale-Runtime-Evaluation-Platform(我们是怎么做我们的runtime的?)" class="headerlink" title="FedScale Runtime: Evaluation Platform (how the runtime is built)"></a>FedScale Runtime: Evaluation Platform (how the runtime is built)</h2><h3 id="Mobile-Backend(手机上的backend)"><a href="#Mobile-Backend(手机上的backend)" class="headerlink" title="Mobile Backend (the on-phone backend)"></a>Mobile Backend (the on-phone backend)</h3><p>The FedScale mobile backend (Singapuram et al., 2022) is <strong>built atop the Termux app</strong> (ter), an Android terminal that supports a Linux environment.</p><ul><li>For example:</li></ul><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212221929723.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212221929723.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212221929723"></p><ul><li>The benefit is that developers need not change their Python programs (e.g., PyTorch code); the full operator set (e.g., of a PyTorch program) remains usable, and models and algorithms originally implemented on server GPUs/CPUs can also use FedScale Runtime.</li><li>We are currently implementing the <strong>Google Remote Procedure Call (gRPC)</strong> for distributed mobile devices <strong>to interface with FedScale Runtime cloud server</strong>.</li></ul><h3 id="Cluster-Backend(集群上的backend)"><a href="#Cluster-Backend(集群上的backend)" class="headerlink" title="Cluster Backend (the on-cluster backend)"></a>Cluster Backend (the on-cluster backend)</h3><p>The cluster backend supports both <strong>real deployment</strong> and <strong>in-cluster simulation</strong>.</p><h4 id="Deployment-mode-真实的部署"><a href="#Deployment-mode-真实的部署" class="headerlink" title="Deployment mode (real deployment)"></a>Deployment mode (real deployment)</h4><ul><li>acts as the <strong>cloud aggregator</strong> and orchestrates FL executions across real devices.</li></ul><h4 id="Simulation-mode-模拟模式"><a href="#Simulation-mode-模拟模式" class="headerlink" title="Simulation mode"></a>Simulation mode</h4><ul><li>Provides various <strong>practical FL metrics</strong> by emulating realistic FL behaviors, such as <strong>computation/communication cost, latency and wall clock time</strong>.</li><li>The first platform to run practical FL benchmarking simulations on GPUs/CPUs.</li><li>Switching between deployment mode and simulation mode requires only minimal code changes.</li></ul><p>The simulation-mode architecture:</p><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212232807520.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212232807520.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212232807520"></p><ol><li><strong>Aggregator Simulator</strong>: <ul><li>It acts as the aggregator in practical FL, which selects participants, distributes execution profiles (e.g., model weights), and handles result (e.g., model updates) aggregation.</li><li>In each round, its client manager uses the client behavior trace to monitor whether a client is available;</li><li>then it selects the specified number of clients to participate in that round.</li><li>Once receiving new events, the event monitor activates the <strong>handler</strong> (e.g., the <strong>aggregation handler</strong> to perform model aggregation), or the <strong>gRPC communicator</strong> to send/receive messages. <strong>The communicator records the size (cost) of every network transfer, and its runtime latency in FL wall-clock time (traffic_size / client_bandwidth).</strong></li></ul></li><li><strong>Client Simulator</strong>: <ul><li>It works as the FL client.</li><li>The FedScale data loader loads the federated dataset of that client and feeds this data to the compute engine to run real training/testing.</li><li><strong>The computation latency</strong>: #_processed_sample × latency_per_sample. <strong>The communication latency</strong>: traffic_size / client_bandwidth.</li><li><strong>The device monitor</strong> terminates the simulation of a client if the current FL runtime exceeds its available slot (e.g., the client drops out), as indicated by the availability trace.</li></ul></li><li><strong>Resource Manager</strong>: <ul><li>It <strong>orchestrates the available physical resources for evaluation</strong> to maximize resource utilization. For example, when the number of participants per round exceeds the resource capacity (e.g., simulating thousands of clients on a few GPUs), the resource manager queues the overcommitted client tasks and schedules new client simulations from this queue once resources become available. <strong>This queuing does not affect the simulated FL runtime, as that runtime is controlled by a global virtual clock, and the event monitor manages events in the correct runtime order.</strong></li></ul></li></ol><p><strong>Note that capturing runtime performance (e.g., wall clock time) is rather slow and expensive in practical FL</strong> — each mobile device takes several minutes to train a round — but <strong>the simulator enables fast-forward simulation, as training on CPUs/GPUs takes only a few seconds per round, while providing simulated runtime using realistic traces.</strong></p>
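<p>The two latency formulas above are easy to turn into a toy fast-forward round simulator. This sketch is mine, not FedScale code; it also folds in the over-commitment trick quoted later in the experiments section (aggregate once the first N of 1.3N invited participants finish):</p><pre><code class="python">def round_wall_clock(invited, n_target, model_mb):
    # invited: list of per-client dicts with speed/bandwidth numbers
    # (field names are placeholders). Returns the simulated round time:
    # the virtual-clock moment the n_target-th fastest client finishes.
    finish_times = []
    for c in invited:
        compute = c["num_samples"] * c["latency_per_sample"]  # training
        comm = model_mb / c["bandwidth_mbps"]                 # transfer
        finish_times.append(compute + comm)
    finish_times.sort()
    # Over-commit (e.g., invite 1.3*N), then stop at the first n_target.
    return finish_times[min(n_target, len(finish_times)) - 1]
</code></pre>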
<h5 id="Automated-FL-simulation(我们的Runtime还可以自动化的模拟执行)"><a href="#Automated-FL-simulation(我们的Runtime还可以自动化的模拟执行)" class="headerlink" title="Automated FL simulation (the Runtime can also run simulations automatically)"></a>Automated FL simulation (the Runtime can also run simulations automatically)</h5><p>By default the real user traces described earlier drive the simulation; developers can also specify custom simulations (a developer-specified profile from the mobile backend).</p><ol><li>Task submission. The FL developer submits a configuration (e.g., model and dataset, training or testing), and the resource manager initializes the aggregator and client simulators.</li><li>FL simulation. The classic FL workflow runs: in each round the aggregator selects participants while the resource manager distributes client profiles to the client simulators; when each client finishes, its simulator sends the model update to the aggregator, which then aggregates.</li><li>Metrics output. Already during training, the developer can inspect the evaluation metrics (those listed in Figure 8).</li></ol><h5 id="FedScale-Runtime-is-easily-deployable-and-extensible-with-plugins(我们的Runtime还易于部署、可拓展性好)"><a href="#FedScale-Runtime-is-easily-deployable-and-extensible-with-plugins(我们的Runtime还易于部署、可拓展性好)" class="headerlink" title="FedScale Runtime is easily deployable and extensible with plugins"></a>FedScale Runtime is easily deployable and extensible with plugins</h5><ol><li><p>FedScale Runtime provides flexible APIs, which can accommodate different execution backends (e.g., PyTorch) by design, for the developer to quickly benchmark new plugins.</p><p> For example, some of the provided APIs:</p><p> <img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212234807507.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212234807507.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212234807507"></p></li><li><p>Figure 9 shows an example of how these APIs help benchmark a new design of local client training with a few lines of code, by inheriting the base Client module.</p></li></ol><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212234901369.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221212234901369.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221212234901369"></p><ol start="3"><li>FedScale Runtime can <strong>embrace new realistic (statistical client or system behavior) datasets</strong> with the built-in APIs. For example, the developer can import his own dataset of client availability with the API (load_client_availability), and FedScale Runtime will automatically enforce this trace during evaluations.</li></ol><p>More examples and a comparison with other frameworks are in <strong>Appendix D, showing the ease of evaluating various pieces of today's FL work in FedScale</strong> — i.e., how current FL work can be ported to this framework with only minor changes.</p>
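<p>In the spirit of the Figure 9 example just mentioned, a plugin might look roughly like this (the base-class name, method signature, and conf fields are assumptions based on the figure's description and are not verified against FedScale's sources; PyTorch-style tensors are assumed):</p><pre><code class="python"># Hypothetical sketch of the Figure-9 plugin pattern: override local
# training by subclassing an assumed base Client class.
class Client:
    def train(self, model, data, conf):
        raise NotImplementedError

class ProxClient(Client):
    # Example plugin: local training with a FedProx-style proximal term.
    def train(self, model, data, conf):
        global_weights = [w.detach().clone() for w in model.parameters()]
        for batch, labels in data:
            loss = conf.criterion(model(batch), labels)
            # mu/2 * ||w - w_global||^2 keeps local updates near the global model
            prox = sum(((w - g) ** 2).sum()
                       for w, g in zip(model.parameters(), global_weights))
            (loss + 0.5 * conf.mu * prox).backward()
            conf.optimizer.step()
            conf.optimizer.zero_grad()
        return model.state_dict()
</code></pre>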
<h5 id="FedScale-Runtime-is-scalable-and-efficient(我们的Runtime可拓展性强、高效)"><a href="#FedScale-Runtime-is-scalable-and-efficient(我们的Runtime可拓展性强、高效)" class="headerlink" title="FedScale Runtime is scalable and efficient"></a>FedScale Runtime is scalable and efficient</h5><ul><li>In the simulation mode, FedScale Runtime can perform large-scale simulations (thousands of clients per round) in both standalone (single CPU/GPU) and distributed (multiple machines) settings.<ul><li>It uses <a href>GPU sharing techniques, Yu &amp; Chowdhury, 2020</a>, so multiple client simulators can share one GPU.</li><li>The resource manager monitors machine resource usage: over-committed execution requests are queued and adaptively dispatched across machines for load balancing, driven by each client's virtual clock.</li></ul></li><li>Distributed simulation supports large scale (which 21-FedJax does not); FedScale Runtime minimizes overhead (e.g., frequent data serialization) in the fleet training of FL clients. <strong>Compared with FedML and Flower, it supports large-scale evaluation more efficiently.</strong></li></ul><h2 id="Experiments(我们测试了,确实比别人好)"><a href="#Experiments(我们测试了,确实比别人好)" class="headerlink" title="Experiments (we tested it, and it really is better)"></a>Experiments (we tested it, and it really is better)</h2><ul><li><strong>10 NVIDIA Tesla P100 GPUs</strong> are used in the evaluations; the aggregator <strong>collects updates from the first N (default 100) completed participants out of 1.3N participants to mitigate system stragglers in each round</strong> (consistent with real-world training in Bonawitz et al., 2019; Yang et al., 2018); see <strong>Appendix A</strong> for details.</li></ul><h3 id="Statistical-efficiency的实验的结果"><a href="#Statistical-efficiency的实验的结果" class="headerlink" title="Statistical efficiency results"></a>Statistical efficiency results</h3><ol><li><p>Without the optional system traces, performance resembles previous benchmarks: <strong>the performance of existing benchmarks</strong> and <strong>FedScale</strong> are <strong>quite close</strong> in the same settings if we <strong>turn off the optional system traces</strong> in FedScale (because the underlying training and FL protocols in the evaluations are the same).</p></li><li><p>Benchmarking FL statistical efficiency (<strong>IID data vs. Non-IID data</strong>), using the current best optimizations (FedAvg, FedProx and FedYoGi), with and without data heterogeneity:</p><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667055455629-4.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667055455629-4.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221213000524191.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221213000524191.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221213000524191"></p><ul><li>At the same round count, the non-IID groups reach lower accuracy than the IID groups.</li><li>Different tasks may favor different optimizations (FedYoGi performs best on OpenImage, but it is inferior to FedAvg on Google Speech).</li><li>Existing benchmarks may misjudge the results of some current FL methods (Figure 11c).</li></ul><p><strong>FedScale can support 100 participants per round, while FedML supports only 30 at a time.</strong></p></li></ol><h3 id="System-efficiency的实验的结果"><a href="#System-efficiency的实验的结果" class="headerlink" title="System efficiency results"></a>System efficiency results</h3><p>(1). <strong>FedScale Runtime enables fast-forward evaluations of the practical FL wall-clock time with fewer evaluation hours</strong>. Taking different numbers of local steps K in local SGD as an example (McMahan et al., 2017), <strong>Figure 12(a)</strong> and <strong>Table 12(b)</strong> illustrate that FedScale can evaluate this impact of K on practical FL runtime in a few hours.</p>
<p>(2). FedScale Runtime can <strong>dictate the FL execution cost by using realistic system traces</strong>.</p><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667055708884-6.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667055708884-6.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><ul><li>Figures 12(a) and (b) show FedScale can complete these evaluations within a few hours.</li><li>Figure 12(c) shows FedScale can report the communication cost.</li></ul><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221213001425357.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221213001425357.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221213001425357"></p><ul><li>Figure 15 shows the per-client simulation completion time, which helps developers make accuracy-cost trade-offs.</li></ul><h3 id="Benchmarking-FL-privacy-and-security"><a href="#Benchmarking-FL-privacy-and-security" class="headerlink" title="Benchmarking FL privacy and security"></a>Benchmarking FL privacy and security</h3><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667055979760-8.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667055979760-8.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><ul><li><p>FedScale can help evaluate data and device efficiency in support of privacy and security optimizations.</p></li><li><p>For example, the privacy experiments (testing differential privacy; a sketch of the DP recipe follows after this list):</p><ul><li>e.g., DP-SGD (Geyer et al., 2017; Kairouz et al., 2021a): differential privacy is used to protect client privacy; different privacy targets $\sigma$ and different numbers of participants per round (N) are tried.</li><li>Figure 13 shows that the number of participants (N) can mislead privacy evaluations: for σ=0.01, while there is a large degradation (12.8%) in final model accuracy when N=30, the enhancement is viable in practical FL (N=100) with a modest accuracy drop (4.6%).</li></ul></li></ul>
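<p>For orientation, the DP-SGD recipe being benchmarked above boils down to clipping each update and adding Gaussian noise. A minimal sketch of the standard shape (not FedScale's code; schemes differ on whether noise is added at the client or at the server after aggregation — this is the per-update version for illustration):</p><pre><code class="python">import numpy as np

def privatize_update(update, clip_norm=1.0, sigma=0.01, rng=None):
    # Clip a client's model update to bound its L2 sensitivity, then
    # add Gaussian noise scaled by the clipping bound.
    rng = rng or np.random.default_rng()
    flat = np.concatenate([w.ravel() for w in update])
    norm = np.linalg.norm(flat)
    scale = min(1.0, clip_norm / (norm + 1e-12))   # L2 clipping factor
    return [w * scale + rng.normal(0.0, sigma * clip_norm, size=w.shape)
            for w in update]
</code></pre>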
<p><strong>The security experiments</strong> (a backdoor attack tested on the OpenImage dataset, where corrupted clients flip their ground-truth labels to poison the training):</p><ul><li>Backdoor attacks (Sun et al., 2019; Wang et al., 2020); two groups are tested: one without security enhancement, the other clipping the model updates as in (Sun et al., 2019).</li></ul><p> <img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221213002523695.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/image-20221213002523695.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221213002523695"></p><ul><li>As shown in Figure 14, while state-of-the-art optimizations report this can mitigate the attacks without hurting the overall performance on their synthesized datasets, <strong>large accuracy drops can occur in more practical FL settings</strong>.</li></ul><h2 id="Implications(我们实验得到的启发)"><a href="#Implications(我们实验得到的启发)" class="headerlink" title="Implications (lessons from the experiments)"></a>Implications (lessons from the experiments)</h2><ol><li>Heterogeneity-aware co-optimizations of communication and computation</li><li>Co-optimizations of statistical and system efficiency</li><li>FL design decisions that account for the mobile environment</li></ol><h1 id="Related-work"><a href="#Related-work" class="headerlink" title="Related work"></a>Related work</h1><p><img src="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667056105725-11.png" class="lazyload placeholder" data-srcset="/2022/10/28/icml22-fedscale-benchmarking-model-and-system-performance-of-federated-learning-at-scale/123-1667056105725-11.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
<tag> Benchmark </tag>
</tags>
</entry>
<entry>
<title>OSDI22-Automatic Reliability Testing for Cluster Management Controllers</title>
<link href="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/"/>
<url>/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/</url>
<content type="html"><![CDATA[<h1 id="OSDI22-Automatic-Reliability-Testing-for-Cluster-Management-Controllers"><a href="#OSDI22-Automatic-Reliability-Testing-for-Cluster-Management-Controllers" class="headerlink" title="OSDI22-Automatic Reliability Testing for Cluster Management Controllers"></a>OSDI22-Automatic Reliability Testing for Cluster Management Controllers</h1><p><strong>作者根据insight给的一些可能的错误类型,对k8s的调度器调用API的接口进行了一个封装,通过可能的错误类型生成了一些测试计划,找到了一些bug。</strong></p><p>Borg, Omega and Kubernetes 依赖state-reconciliation principle去变得高弹性和可拓展性,所有的<strong>集群管理</strong>的<strong>逻辑</strong>即<strong>控制器</strong>被嵌入在松散耦合的微服务中(这些<strong>集群管理的逻辑就是微服务</strong>)。每一个控制器独立的观测当前的集群的状态,采取特定的措施使得当前的集群满足特定的状态。复杂的分布式的本质是建立稳定性强的正确的控制器很难,控制器面临着<strong>无数的可靠性问题</strong>,导致<strong>数据损失,安全和资源泄漏</strong>。</p><p><strong>Sieve</strong>,测试集群控制器的工具,通过不断的干扰控制器看到的当前集群的状态,然后去比较集群受到干扰和没受到干扰时候的状态去检测安全和liveness的问题。Sieve的设计基于基本<strong>状态恢复系统的基本机会</strong>,这些系统基于<strong>state-centric interfaces between controllers and cluster state</strong>,Sieve找到了46个严重的安全和liveness的bug,35个确认,22个修复,false-positive比例是$3.5%$</p><h2 id="Intro"><a href="#Intro" class="headerlink" title="Intro"></a>Intro</h2><p>控制器遵循<strong>state-reconciliation principle</strong>,reconciles the current state of the cluster to match a desired state. 集群的状态放在逻辑上中心化的,高可用的数据中心,K8S里面的<strong>pods、nodes、volumns和应用实例都被表示为集群中状态的对象</strong>。</p><p><img src="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/123.png" class="lazyload placeholder" data-srcset="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/123.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p>Kubernetes controllers管理Cassandra的时候,删掉pod和finalizing中间如果受到干扰,vol没删掉,就会导致storage的泄漏。</p><p><strong>优点</strong>:高可用,1) 不需要正式的对控制器和集群管理的配置说明文件 2) 不需要假设代码可能在哪里出现bug 3) 高度专业化的测试输入。<strong>只需要创建控制器镜像和基本测试工作流的表示,自动化测试</strong>,即只需要提供manifest说明如何在测试下创建和部署控制器,以及一系列的测试工作流(成熟的控制器一般都有)。</p><p><strong>insight</strong>:认为state-centric interfaces非常适合去观察和干扰控制器的view。<strong>控制器的动作是当前它看到的集群状态的严格函数</strong>。</p><p>1) 自动的修改了控制器声称支持的观测到的集群状态。</p><p>2)自动的使用生成的测试计划,标记安全和liveness的问题。(可以这么做是由于state-centric interfaces的一些特点,这样的系统往往<strong>有简单的高度内聚的</strong>state centric的接口,这种接口基本上只做读写和接收状态改变的操作,并且所有的对象共享策略(例如k8s里面就用同样的域去表示metadata))。</p><p><strong>测试方法:</strong>干扰控制器的view,插入1)中间状态 2)已经失效的状态 3)未观测的状态,根据控制器的状态和行为去生成<strong>测试计划</strong>,避免冗余和无效的测试计划。</p><h2 id="Bg"><a href="#Bg" class="headerlink" title="Bg"></a>Bg</h2><p><strong>k8s的架构</strong>:</p><p><img src="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/123-1667563760827-2.png" class="lazyload placeholder" data-srcset="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/123-1667563760827-2.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p>k8s的核心由API服务器的集成和高可用、高一致性的数据存储(etcd)组成,etcd里面存储了<strong>对象状态(pods、volumns、nodes、groups of applications)</strong>,所有k8s中的其他组件都和k8s进行交互。控制器调用API Server中的API去获得Object的状态。</p><p>好处:可拓展性强,增加新的应用的时候只需要增加新的控制器和新的State Object。</p><p><strong>k8s的控制器的工作方式</strong>:</p><p><img src="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/123-1667564276682-4.png" class="lazyload placeholder" data-srcset="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/123-1667564276682-4.png" 
srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p>部署ZooKeeper集群的时候,用户<strong>首先创建</strong>一个ZooKeeper的对象,这个<strong>对象</strong>指定了用户想要的集群的状态,例如多少个,版本,存储的大小,然后<strong>ZooKeeper Controller</strong>接受到ZooKeeper对象被创建的消息,就想方设法达到用户制定的状态,首先创建一个StatefulSet的对象(有状态应用的抽象),然后一个a StatefulSet controller随后被告知一个StatefulSet的对象被创建了,然后轮流的创建pod和volumn,之后scheduler,storage controller,worker都被创建去带来实际的container和volumn。如果这个时候用户编辑了ZooKeeper对象时候,每一个控制器就都会微调达到对应的状态。</p><p><strong>这个测试很重要,控制器的可靠性很难保证,当前的测试方法不行</strong>。</p><h2 id="Design"><a href="#Design" class="headerlink" title="Design"></a>Design</h2><p><strong>Sieve tests controllers with the following workflow:</strong></p><ul><li>Collecting reference traces:学习控制器在没有错误时候的表现(测试工作流下),记录这个时候的状态转移,从而对控制器和集群状态交互的接口进行检测。</li><li>生成测试计划:根据参考轨迹生成测试计划(描述了具体的干扰),描述了要注入什么错误,什么时候注入错误。</li><li>避免无效的测试计划:删除多余或者无用的测试计划(例如明显不会导致错误的测试计划)。</li><li>执行测试计划:使用test coordinator,执行每一个测试计划,<strong>test coordinator</strong>监控集群测试时候的状态转变,注入错误。</li><li>检查测试结果:generic, effective, differential oracles to automatically check test results</li></ul><h4 id="如何Perturbing控制器的state-view"><a href="#如何Perturbing控制器的state-view" class="headerlink" title="如何Perturbing控制器的state view"></a>如何Perturbing控制器的state view</h4><p>注入特定的错误(crash、delay、connection change)。</p><p><strong>decouple policy from mechanism</strong>:有利于拓展当前的策略,通过编排潜在的干扰机制增加新的策略。<strong>策略</strong>:a view Sieve exposes to the controller at a particular condition. <strong>机制</strong>:说明了如何注入错误去create a view。</p><h4 id="这三种干扰分别是什么?"><a href="#这三种干扰分别是什么?" class="headerlink" title="这三种干扰分别是什么?"></a>这三种干扰分别是什么?</h4><p><strong>Intermediate states</strong>:中间状态指的是控制器还没完成所有状态更新之前中间的一些状态,在controller失败后,k8s就会开一个新的实例,恢复之前的中间状态。</p><p><img src="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/Snipaste_2022-11-04_21-35-34.png" class="lazyload placeholder" data-srcset="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/Snipaste_2022-11-04_21-35-34.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p>例如RabbitMQ控制器,找到了一个bug:</p><p>workload:尝试将存储从10GB->15GB,1)首先更新VolCur为15GB,然后更新VolReq为15GB(触发k8s)resize大小。在更新VolCur和VolReq的时候Sieve挂了,但是却不能正确恢复,已经被700行go代码修复了。</p><p><strong>Stale states</strong>:</p><p>图2可以看出来,控制器不直接和consist data stores去交互,而是和API server去交互,API server的话里面的状态可能会受到delayed notifications的影响。</p><p>假如有多个API Server都可用的时候,可能有一些API Server的状态比较新,有的状态不是很新,要是控制器刚开始连的是一个比较新的API Server,然后又连接了一个比较旧的Server(由于负载均衡等原因),控制器可能会做重新配置,但是控制器不应该这么做,应该正确识别这些错误。</p><p>Percona’s MongoDB,找到一个新bug:</p><p><img src="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/Snipaste_2022-11-04_21-45-19.png" class="lazyload placeholder" data-srcset="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/Snipaste_2022-11-04_21-45-19.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p>关闭MongoDB集群的时候,controller等着看见MongoDB的state object的删除的时间戳,控制器看到这个变化的时候,就会删除所有的pods和volumes。</p><p>Sieve让这个控制器就错误的删除了一个live的MongoDB的集群。有一个工作流首先关闭了MongoDB的集群然后重新创建了一个同样的集群,等到新的集群创建完成后,Sieve就引入了一个这样的错误,使得controller把这个集群给删了。</p><p><strong>Unobserved states</strong>:现在的controller设计成level-triggered systems(opposed to being edge-triggered),和别的设计不同的是,别的设计观测所有的集群状态的改变,做这所有的状态的改变,但是level-triggered systems就只根据当前的状态去控制转变集群的状态.</p><p>Instaclustr’s Cassandra controller,找到了一个bug导致资源泄露和服务失败:</p><p><img 
src="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/Snipaste_2022-11-04_21-56-52.png" class="lazyload placeholder" data-srcset="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/Snipaste_2022-11-04_21-56-52.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p>在scale-down的时候,controller应该得知pods被删除后移除了所有的volumns,而pods的生命周期是由StatefulSet Controller管理的,Sieve就暂停了了发送给Cassandra控制器的消息,因此Cassandra不知道pods被删除了,这导致Cassandra控制器没有删除对应的volumns。</p><h4 id="如何收集Reference-Traces"><a href="#如何收集Reference-Traces" class="headerlink" title="如何收集Reference Traces"></a>如何收集Reference Traces</h4><p>通过封装API Server的函数,封装了10个函数,在这里面注入干扰。</p><p>由此跑一下所有开发者提供的workload,通过学习每一个controller收到的集群状态改变的提醒,和控制器对集群状态的任何读和写的操作,或者是对API Server的client(也就是controller机器自己)维护的local cache去收集两方面的reference traces:</p><ul><li><strong>Controller trace</strong>:一系列使用client的API就可以观测的时间,例如状态改变的notification,reconciliation cycle的entry和exits,client-API的invocation(调用)</li><li><strong>Cluster state trace</strong>:初始化的集群状态和状态改变的序列。</li></ul><h4 id="如何使用Reference-Traces生成测试计划:"><a href="#如何使用Reference-Traces生成测试计划:" class="headerlink" title="如何使用Reference Traces生成测试计划:"></a>如何使用Reference Traces生成测试计划:</h4><p>每一个测试计划做一种干扰。</p><p><strong>测试计划</strong>:每一个测试计划由self-contained的文件构成,这个文件描述了<strong>测试工作流、一些列的注入的错误、注入错误这一行为的触发条件</strong>。</p><p>目前支持:</p><ul><li>crash/restart a controller</li><li>disconnect/reconnect a controller to an API server</li><li>block/unblock a controller from processing events</li><li>block/unblock an API server from processing events</li></ul><p>根据测试计划就可以复现bug。</p><p><img src="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/image-20221104221955496.png" class="lazyload placeholder" data-srcset="/2022/10/28/osdi22-automatic-reliability-testing-for-cluster-management-controllers/image-20221104221955496.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg" alt="image-20221104221955496"></p><h4 id="测试计划生成"><a href="#测试计划生成" class="headerlink" title="测试计划生成"></a>测试计划生成</h4><p>制定了一系列规则,在这些规则里生成计划,<strong>这个说法很有意思,不说如何制定测试状态,因为大家看到你是怎么制定的就感觉就这,但是避免无用的测试计划这个说法让我感觉很牛</strong>。</p><ul><li><strong>Intermediate-state rule</strong>:集群从一个状态更新到另一个状态要若干步骤,对于若干个步骤$U_{1}, U_{2}…$,做一个步骤就让Controller崩溃一遍,生成一个测试计划。</li><li><strong>Stale-state rule</strong>:不停的让controller travel back in time,然后给他看以前的状态,方法就是不断的做一些可能会带来冲突效果的操作,去观察controller是否正常。具体的做法是,首先观测到更新U的N个结果,然后搜索之后可能的N’个结果,尽量用例如先删了一个object,再创建一个object这样可能会导致冲突的操作。</li><li><strong>Unobserved-state rule</strong>:跳过可能的正常状态的改变,对于状态对(N, N’),这样生成测试计划1)先不让控制器看到状态N 2)当N’到来的时候让控制器看到N。</li></ul><h4 id="如何避免无用的测试计划"><a href="#如何避免无用的测试计划" class="headerlink" title="如何避免无用的测试计划"></a>如何避免无用的测试计划</h4><p>依据<strong>测试计划生成产生了大量的测试计划</strong>,太多了,光stale-state就在MongoDB Controller上生成了140000+测试计划。</p><p><strong>a guiding principle</strong>:</p><ul><li>prune a test plan if the test plan does not introduce an intermediate-, a stale- or an unobservedstate that can affect the controller’s outputs.</li><li>the introduced state is identical to states introduced by other test plans.</li></ul><p><strong>3.4.1 Pruning by Causality</strong></p><p>根据<strong>因果关系</strong>(更新U是基于状态N做出来的,那么N和U是因果相关的)筛选。因果关系很难做、现在的k8s不好做因果分析。</p><p>所以制定了<strong>两个规则(个人认为:基于局部性原理)</strong>,可能会导致<strong>false positive</strong>的错误:</p><ul><li>Read-before-update rule:the object pertaining to N is read by the controller before it issues 
<h4 id="如何避免无用的测试计划"><a href="#如何避免无用的测试计划" class="headerlink" title="How unproductive test plans are avoided"></a>How unproductive test plans are avoided</h4><p>The generation rules produce <strong>a huge number of test plans</strong> — far too many; the stale-state rule alone yields 140,000+ plans for the MongoDB controller.</p><p><strong>A guiding principle</strong>:</p><ul><li>prune a test plan if the test plan does not introduce an intermediate, stale, or unobserved state that can affect the controller's outputs;</li><li>or if the introduced state is identical to states introduced by other test plans.</li></ul><p><strong>3.4.1 Pruning by Causality</strong></p><p>Filter by <strong>causality</strong> (if update U was made based on state N, then N and U are causally related). Causal analysis is hard, and today's K8s does not lend itself to it.</p><p>So <strong>two rules were devised (my take: they rest on a locality-style assumption)</strong> that may introduce <strong>false positives</strong>:</p><ul><li>Read-before-update rule: the object pertaining to N is read by the controller before it issues U, i.e., U is issued after reading N. <strong>My understanding:</strong> if the controller updates right after reading a state, then U is very likely related to N.</li><li>Earliest-reconciliation rule: N and U happen in the same or adjacent reconciliation cycles — which is intuitive.</li></ul><p><strong>For example, keep only the test plans in which at least one U is causally related to N.</strong></p><p><strong>3.4.2 Pruning Unsuccessful Updates</strong></p><p><strong>Ignore any update that does not change the current cluster state.</strong></p><p>Rationale: an update that does not change the current cluster state is unlikely to lead to a new state.</p><h4 id="Test-Plan-Execution"><a href="#Test-Plan-Execution" class="headerlink" title="Test Plan Execution"></a>Test Plan Execution</h4><p>Carried out by the Sieve test coordinator.</p><h4 id="Differential-Test-Oracles"><a href="#Differential-Test-Oracles" class="headerlink" title="Differential Test Oracles"></a>Differential Test Oracles</h4><p><strong>3.6.1 Checking End States</strong></p><p><strong>3.6.2 Checking State-Update Summaries</strong></p>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About Serverless </tag>
<tag> About Bug Detection </tag>
</tags>
</entry>
<entry>
<title>SysML19-TOWARDS FEDERATED LEARNING AT SCALE: SYSTEM DESIGN</title>
<link href="/2022/10/28/sysml19-towards-federated-learning-at-scale-system-design/"/>
<url>/2022/10/28/sysml19-towards-federated-learning-at-scale-system-design/</url>
<content type="html"><![CDATA[<h1 id="SysML19-TOWARDS-FEDERATED-LEARNING-AT-SCALE-SYSTEM-DESIGN"><a href="#SysML19-TOWARDS-FEDERATED-LEARNING-AT-SCALE-SYSTEM-DESIGN" class="headerlink" title="SysML19-TOWARDS FEDERATED LEARNING AT SCALE: SYSTEM DESIGN"></a>SysML19-TOWARDS FEDERATED LEARNING AT SCALE: SYSTEM DESIGN</h1><p>Android’s AIDL IPC mechanism?</p><h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>概述</h2><p>把代码放数据那,处理的问题:privacy, ownership, the locality of data</p><h2 id="部分相关工作"><a href="#部分相关工作" class="headerlink" title="部分相关工作"></a>部分相关工作</h2><ol><li><p>general description of FL: McMahan & Ramage (2017)</p></li><li><p>theory of FL: Koneˇ cn ́ y et al. (2016a), McMahan et al. (2017; 2018)</p></li><li><p>Federated Learning infrastructure:</p><ol><li>whether to focus on asynchronous or synchronous training algorithms: asynchronous training: Dean et al. (2012)</li><li>a consistent trend towards synchronous large batch training: (data center)Goyal et al., 2017; Smith et al., 2018</li><li>(Federated Averaging algorithm)McMahan et al. (2017)</li><li>enhancing privacy guarantees: (differential privacy)McMahan et al., 2018, (Secure Aggregation)Bonawitz et al., 2017</li></ol></li><li><p>Our system:</p><ol><li>Federated Averaging</li><li>Secure Aggregation: ensures that on a global level individual updates from phones are uninspectable</li><li>phone keyboard, tens of millions of real-world devices</li><li>issues: <ol><li>device availability that correlates with the local data distribution in complex ways</li><li>unreliable device connectivity and interrupted execution</li><li>orchestration of lock-step execution across devices with varying availability</li><li>limited device storage and compute resources</li></ol></li></ol></li></ol><h2 id="PROTOCOL"><a href="#PROTOCOL" class="headerlink" title="PROTOCOL:"></a>PROTOCOL:</h2><ol><li><p>训练过程:device训练完毕时,服务器向device发送本轮的全局参数和其他必要的状态例如FL Checkpoint,每个参与者在本地根据全局的状态和本地的数据集进行梯度的计算,将更新的数据发送给服务器,服务器再将这些更新全部聚合。</p></li><li><p>选择:设备的选择,满足一定的条件(例如充电和使用wifi),一轮选择小几百的数据量,满足条件后就和服务器建立起一个双向的连接,用于检查设备的liveness并进行multi-step的交流。(选择的数量大概小几百)。未连接的设备服务器告诉他稍后连接。达到足够数量的设备本轮才算是成功。</p></li><li><p>Pace steering: 给device发送重连的时间窗口,FL Populations少的时候,服务器使用无状态的概率算法拒绝一些设备以确保checkin同时到达。多的时候,使用随机的设备check in时间,避免了thundering herd问题,告诉设备按需连接。</p></li><li><p>服务器</p><ol><li>处理10个-上亿个设备的情况,处理10个-几万个参与者的情况,没一轮的更新可能是千K到几十M。</li><li>多个Coordinators:enable global synchronization and advancing rounds in lockstep,每个Coordinator负责一个FL Population,Coordinator用地址和FL Population的名字注册一个共享锁服务,一个FL Population只能有一个owner。接收并处理每个Selector有多少设备参与的信息,生成Aggregator去管理不同FL任务的轮次。</li><li>Selector:和device交流,不断接受Coordinator的信息(FL Population需要多少设备,如何决定要不要一个设备),当Aggregator生成后,将subset的devices发送给Aggregator。</li><li>Master Aggregator:管理每一个FL任务的轮次,</li><li>流水线优化,这样的架构使得上一轮的报告和下一轮的选择可以一起做。</li></ol></li><li><p>设备:</p><ol><li>服务器分反一个a TensorFlow graph and instructions for how to execute it。</li><li>在example store里面选择数据的标准,如何打包data,设备上跑多少个epoch,计算图中节点的标签个数。维护一个example store,实现提供数据的API,不断的移除旧的数据,并按照平台的推荐对数据进行加密。</li><li>设备中的架构(图2),多租户,可以训练多个FL Population。</li><li>保护FL不受到不可信设备的攻击:不能用用户身份授权,所以用了Android’s remote attestation mechanism,并且这样的方式也提供了data poisoning的保护(Bagdasaryan et al., 2018),</li></ol></li><li><p>通过分析数据去监测Device的健康(将数据上传到云):</p><p>Device上传:训练被激活时设备的状态,跑了多频繁和多久,用了多少内存,有什么错误,OS等等,不包含用户信息(PII),记录设备的状态序列。服务器端记录:每一个轮次多少设备被接受和拒绝了等等。</p></li></ol><h2 id="SECURE-AGGREGATION(reporting-phrase)"><a href="#SECURE-AGGREGATION(reporting-phrase)" class="headerlink" title="SECURE AGGREGATION(reporting 
phrase)"></a>SECURE AGGREGATION(reporting phrase)</h2><p> 每个设备发送的梯度加密,服务器端只有在收集到足够多的数据之后才进行总和的计算。时间复杂度随着用户是O($用户数^2$),首先所有的Aggregator做安全聚合(固定的数目k),然后Master Aggregator再做一次非安全聚合。</p><h2 id="工具和工作流:"><a href="#工具和工作流:" class="headerlink" title="工具和工作流:"></a>工具和工作流:</h2><ol><li><p>不能直接推测每一个的训练样本? 做一个工具去查看测试和模拟的数据</p></li><li><p>新修改的模型要编成FL Plan再放到服务器上跑。</p></li><li><p>模型资源的话费和运行时候的兼容性质必须自动的由基础框架验证。</p></li><li><p>如何建立模型?</p><ol><li>定义在手机上跑着的FL任务(训练和评测任务),就是实现对应的函数接口,实现输入向量到loss或者accuracy的映射。部署时候就使用设备上 样本商店 提供的数据。</li><li>除了实现每个设备的FL任务之外,还要给定一个配置文件:一轮多少个设备效果最好,模型的超参(学习率)等等。同时,可以定义多组FLtask,这在比如探索什么样的学习率比较好十分有用。</li><li>使用模型工具可以模拟 FL服务器和一些列设备,使用一些模拟的数据集。 训练出来的结果也可用作预训练的模型。</li></ol></li><li><p>FL Plan生成:</p><ol><li>FL Plan自动由前面的模型和配置生成。这个计划,其实可以由一个python程序表示,python程序会编排一个TF计算图。</li><li>版本、测试和部署管理:<ol><li>模型的更新:<ol><li>必须来自可审计的,同行审议过的代码</li><li>必须在模拟测试集上已经经过测试</li><li>使用的资源限定在一定的范围内</li><li>声称支持的tf上全部经过测试。不同终端上TF版本代码的变换,和数据中心化的训练不同的是,数据中心化的训练可以一直rebuild图,但是终端上的设备的TF runtime版本可能很老。因此由FL infrastructure来对FL Plan进行等价的变换以适应特定的版本,不同的版本经过同样的release test以确保不同的版本的变换是等价的。</li></ol></li><li>数据的写入:每一轮结束之后,模型的聚类的参数就会被写到server的被指定的位置。每个设备的每一次训练由任务名称、论次名称和其他的一些数据标识,对应的一些指标可以由这个系统去分析。</li></ol></li></ol><h2 id="应用"><a href="#应用" class="headerlink" title="应用"></a>应用</h2></li><li><p>On-device item ranking</p></li><li><p>Content suggestions for on-device keyboards</p></li><li><p>Next word prediction</p></li></ol><h2 id="OPERATIONAL-PROFILE(经验)"><a href="#OPERATIONAL-PROFILE(经验)" class="headerlink" title="OPERATIONAL PROFILE(经验)"></a>OPERATIONAL PROFILE(经验)</h2><p> </p>]]></content>
<categories>
<category> About Thesis </category>
</categories>
<tags>
<tag> About Thesis </tag>
<tag> About FL </tag>
</tags>
</entry>
<entry>
<title>Review of Selected Topics in Probability Theory</title>
<link href="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/"/>
<url>/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/</url>
<content type="html"><![CDATA[<h1 id="概率论部分内容复习"><a href="#概率论部分内容复习" class="headerlink" title="概率论部分内容复习"></a>概率论部分内容复习</h1><h2 id="卡方分布"><a href="#卡方分布" class="headerlink" title="卡方分布"></a>卡方分布</h2><p>设$X_{1}, X_{2}, …, X_{n}$是来自总体$N(0, 1)$的样本,则称统计量</p><p>$$\chi^2=X_{1}^{2}+X_{2}^{2}+…+X_{n}^{2}$$</p><p>服从自由度为n的$\chi^{2}$分布。记为$\chi^{2}\sim \chi^{2}(n)$.</p><ol><li><p><strong>满足可加性:即从总体中分两次抽取$n1$和$n2$个样本的平方和,和一次抽取$n1+n2$个样本的平方和是相等的。</strong></p><p>$$\chi_{1}^{2}+ \chi_{2}^{2} \sim \chi^{2}(n_{1}+n_{2})$$</p></li></ol><h2 id="t分布"><a href="#t分布" class="headerlink" title="t分布"></a>t分布</h2><p>若$X \sim N(0, 1)$, $Y\sim\chi^{2}(n)$,且X, Y相互独立,则称随机变量:</p><p>$t=\frac{X}{\sqrt{Y/n}}$服从自由度为n的t分布。记为$t \sim t(n)$。</p><p><strong>t分布怎么来的?</strong></p><p><strong>源自对$\frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\sim N(0, 1)$的研究。</strong></p><p>也就是根据中心极限定理,从总体X中抽出n个样本,样本的均值为$\overline{X}$,总体的均值为$\mu$,方差为$\sigma$,样本的方差为$\sigma_{\overline{X}}$:</p><p>$\overline{X}\sim N(\mu, \sigma_{\overline{X}}^{2})$即$\frac{\overline{X}-\mu}{\sigma_{\overline{X}}}\sim N(0, 1)$即$t=\frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\sim N(0, 1)$,为自由度问$n-1$的t分布。</p><h2 id="中心极限定理"><a href="#中心极限定理" class="headerlink" title="中心极限定理"></a>中心极限定理</h2><p>和的分布收敛于正态分布的定理如:</p><p><strong>样本均值$\overline{X}\sim N(\mu,\frac{\sigma^{2}}{n})$。</strong></p><h2 id="大数定律"><a href="#大数定律" class="headerlink" title="大数定律"></a>大数定律</h2><p><strong>样本均值依概率收敛于期望值</strong></p><h2 id="假设检验"><a href="#假设检验" class="headerlink" title="假设检验"></a>假设检验</h2><ul><li>原假设$H_0$:实验之前已有的假设,AB两次测试的差距为0.</li><li>备择假设$H_{1}$:对立于原假设。</li><li><strong>先对总体的特征作出某种假设,然后通过抽样研究的统计推理,对此假设应该被拒绝还是接受作出判断。</strong></li></ul><h3 id="两类错误"><a href="#两类错误" class="headerlink" title="两类错误"></a>两类错误</h3><ul><li><p>原假设为真,即$\mu=\mu_{0}$,无显著性差异。我们却不接受结果,叫<strong>弃真错误,第一类错误</strong>,犯错概率为$\alpha$。<strong>即实际为真,但是样本却抽取到了让我们判断结果为假的情况。</strong></p></li><li><p><strong>显著性水平(犯错的概率)</strong>越大,那么原假设被拒绝的可能性就越大,犯第一类错误的可能性也就越大。<strong>p值</strong>即为在观测数据下拒绝原假设的最小<strong>显著性水平</strong>。</p></li><li><p>原假设为假,即$\mu\neq\mu_{0}$,有显著性差异。我们却接受结果,叫<strong>取伪错误,第二类错误</strong>,犯错概率为$\beta$。<strong>即实际为假,但是样本却抽取到了让我们判断结果为真的情况。</strong></p></li></ul><p><img src="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123.png" class="lazyload placeholder" data-srcset="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><h3 id="如何减少两类错误"><a href="#如何减少两类错误" class="headerlink" title="如何减少两类错误"></a>如何减少两类错误</h3><p><img src="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667132397600-2.png" class="lazyload placeholder" data-srcset="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667132397600-2.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><h3 id="显著性水平"><a href="#显著性水平" class="headerlink" title="显著性水平"></a>显著性水平</h3><p><img src="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667132490168-4.png" class="lazyload placeholder" data-srcset="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667132490168-4.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><h3 id="p值计算"><a href="#p值计算" class="headerlink" title="p值计算"></a>p值计算</h3><p><img src="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667133450982-6.png" class="lazyload placeholder" data-srcset="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667133450982-6.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><h3 id="置信水平和区间"><a href="#置信水平和区间" class="headerlink" 
title="置信水平和区间"></a>置信水平和区间</h3><p><img src="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667133632548-8.png" class="lazyload placeholder" data-srcset="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667133632548-8.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p><img src="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667133747607-10.png" class="lazyload placeholder" data-srcset="/2022/10/28/gai-lu-lun-yu-shu-li-tong-ji/123-1667133747607-10.png" srcset="https://img2.baidu.com/it/u=2037979560,2772131037&fm=26&fmt=auto&gp=0.jpg"></p><p><strong>理解:置信区间可以理解成总体均值等量用样本表示时候,往往在什么样的范围内,在这个范围内的概率就是置信水平。</strong></p><ul><li>为什么要置信区间,因为<strong>误差不可避免</strong>,<strong>置信区间即为统计量的误差范围</strong>。比如用$[a,b]$表示样本估计总体均值的误差范围,那么这一结果具有的<strong>可信程度</strong>,就是置信度。</li></ul><h3 id="其他"><a href="#其他" class="headerlink" title="其他"></a>其他</h3><p>t检验和z检验都可以用来判断样本,几次样本之间的<strong>均值</strong>是否显著,卡方检验用于判断样本偏差是否合理,即用来判断检验抽样是否合理。</p><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>参考</h2><ul><li>2021.01.18: <<概率论与数理统计>></li><li>2021.08.22: <a href="https://zhuanlan.zhihu.com/p/346602966">https://zhuanlan.zhihu.com/p/346602966</a>, <a href="https://www.zhihu.com/question/24801731">https://www.zhihu.com/question/24801731</a></li></ul>]]></content>
<categories>
<category> About Reading </category>
</categories>
<tags>
<tag> About Reading </tag>
<tag> About Math </tag>
</tags>
</entry>
<entry>
<title>Build a blog like this</title>
<link href="/2021/10/14/build-a-blog-like-this/"/>
<url>/2021/10/14/build-a-blog-like-this/</url>
<content type="html"><![CDATA[<h1 id="How-to-build-a-blog-like-this"><a href="#How-to-build-a-blog-like-this" class="headerlink" title="How to build a blog like this?"></a>How to build a blog like this?</h1><ol><li>How to build a blog like this? What you need is this <a href="https://fuhanshi.github.io/2018/10/03/Hexo-Github%E5%85%8D%E8%B4%B9%E6%90%AD%E5%BB%BA%E7%82%AB%E9%85%B7%E4%B8%AA%E4%BA%BA%E5%8D%9A%E5%AE%A2/">blog</a> .(There may exists some typos, but it’s not a big problem.)</li><li>Then how to replace a theme? What you need is this <a href="https://www.jianshu.com/p/ef7a29e3ee8e">blog</a>.</li><li>Then where to find many many wonderful themes? What you need is this <a href="https://hexo.io/themes/">website</a>.</li><li>About how to config this website? <a href="https://yuang01.github.io/">clicke here</a></li></ol>]]></content>
<categories>
<category> About Me </category>
</categories>
<tags>
<tag> About Blogs </tag>
</tags>
</entry>
<entry>
<title>About Me</title>
<link href="/2021/08/07/about-me/"/>
<url>/2021/08/07/about-me/</url>
<content type="html"><![CDATA[]]></content>
<categories>
<category> About Me </category>
</categories>
<tags>
<tag> About Me </tag>
</tags>
</entry>
</search>