framework laptop 16 hybrid gpu support #101

Open
lamikr opened this issue Jul 5, 2024 · 4 comments · Fixed by #111

Comments

@lamikr
Owner

lamikr commented Jul 5, 2024

I received a Framework 16 laptop for testing and development with AMD's CPUs and GPUs:

  • 7840HS CPU
  • 780M iGPU (gfx1103) with 12 CUs (rocm-smi device id 0x7480)
  • 7700S GPU (gfx1102) with 32 CUs (rocm-smi device id 0x15bf)
  • 32 GB SO-DIMM RAM (need to check whether I can upgrade it to 64 or 96 GB later)

So far tested:

  • gfx1102 (7700S) passes the basic tests. I have not had time to do any benchmarks with it yet.
  • gfx1103 will need more work; I will start debugging it now.

This is the first time I am able to test with hybrid GPUs, and I would like to find ways to test all three scenarios:

  • The 7700S alone or the 780M alone (should be doable by masking the other GPU away from ROCm)
  • Tasks where it would make sense to share the work between both GPUs
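Masking one GPU away from ROCm can be sketched as below: the ROCm runtime honors the `HIP_VISIBLE_DEVICES` (and lower-level `ROCR_VISIBLE_DEVICES`) environment variable, which must be set before the HIP runtime initializes. The device index used in the example is an assumption about this machine; enumeration order varies per system, so check it with `rocminfo` or `rocm-smi` first.

```python
import os

def mask_gpus(visible: str) -> None:
    """Expose only the listed device indices to HIP applications
    launched (or HIP runtimes initialized) after this point."""
    os.environ["HIP_VISIBLE_DEVICES"] = visible

# Hypothetical example: keep only device 0 visible, assuming the
# 7700S dGPU enumerates as device 0 on this machine.
mask_gpus("0")
```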
@lamikr
Owner Author

lamikr commented Jul 5, 2024

The Framework laptop has 2 M.2 SSD slots; I plan to install different Linux distros on one of them and use the second slot as storage for builds. Not sure whether I could also get some distros installed on USB keys and booted from there.

So far I have tested the 7700S functionality with Fedora 40.

@jrl290

jrl290 commented Jul 11, 2024

I will be very interested to see what you find. My 7840U (780M, gfx1103) will operate properly with PyTorch on any gfx11xx build, but it randomly halts. Right now I work around it by restarting the Python script whenever it exits without the expected return value. Not ideal, but it gets me by.
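The restart-on-unexpected-exit workaround described above can be sketched as a small supervisor loop; `run_until_ok` is a hypothetical helper, not part of any library:

```python
import subprocess
import sys
import time

def run_until_ok(cmd, expected_rc=0, max_retries=5, delay=1.0):
    """Re-run a command until it exits with the expected return code,
    as a workaround for scripts that randomly halt mid-run."""
    for attempt in range(1, max_retries + 1):
        rc = subprocess.call(cmd)
        if rc == expected_rc:
            return attempt
        time.sleep(delay)  # brief pause before restarting
    raise RuntimeError(f"{cmd!r} failed {max_retries} times")

# Usage sketch: replace the -c stub with the real training script.
run_until_ok([sys.executable, "-c", "pass"])
```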

lamikr added a commit that referenced this issue Jul 14, 2024
- initial support for gfx1036 and gfx1103 as a build target
- updated also the gfx1010 configuration settings to be
  more similar in composable kernel and miopen

fixes: #101
fixes: #103

Signed-off-by: Mika Laitio <[email protected]>
@lamikr lamikr closed this as completed in 750fe4c Jul 15, 2024
@lamikr
Owner Author

lamikr commented Jul 17, 2024

Initial work is now done: both the integrated 780M (gfx1103) and the discrete 7700S (gfx1102) are
selectable as build targets and can be used. Memory and GPU usage for both of them also show up in nvtop.

More testing with the distro and new Linux 6.10 kernel is however still needed.
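For anyone verifying the result: ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` namespace, so both devices should appear there once the build works. Below is a minimal sketch of picking one of the two GPUs; the `pick_device` helper is hypothetical, and on a real system the device count would come from `torch.cuda.device_count()`.

```python
def pick_device(n_gpus: int, prefer: int = 0) -> str:
    """Return a torch-style device string for the preferred GPU index,
    falling back to the CPU when that index is not available."""
    # ROCm builds of PyTorch reuse the "cuda" device namespace for AMD GPUs.
    if n_gpus > prefer:
        return f"cuda:{prefer}"
    return "cpu"

# With both the 780M and the 7700S visible (2 devices), prefer index 0 --
# which physical GPU that index maps to is an assumption about this machine.
device = pick_device(2, prefer=0)
```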

@lamikr lamikr reopened this Jul 17, 2024
@jrl290

jrl290 commented Jul 17, 2024

That's great! I have downloaded and installed it and am testing now. It seems I am unable to install the official Linux 6.10 kernel, but I am able to use the Linux 6.10-rc4 kernel. That matters, since automatic allocation of shared memory is supported there.

I am getting this warning when loading the pytorch_lightning module, but it doesn't seem to actually affect the processing:
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1103

I am still randomly coming across a fatal error:
HW Exception by GPU node-1 (Agent handle: 0x5d70d5ca5b90) reason :GPU Hang

Interestingly enough, this is only occurring in one section of PyTorch code and not in another, so I'll have to investigate exactly which differences are triggering the error.
