port to loongarch64 #2183

Merged: 6 commits, Jul 19, 2023

@znley znley commented May 24, 2023

This patch allows criu to support the loongarch64 platform. Most features are available, but there are still some known and unknown bugs.

The results of the zdtm test run on my loongarch64 machine are as follows:

################### 16 TEST(S) FAILED (TOTAL 390/SKIPPED 43) ###################
 * zdtm/transition/ipc(ns)
 * zdtm/transition/maps008(h)
 * zdtm/transition/fifo_loop(uns)
 * zdtm/transition/file_read(uns)
 * zdtm/transition/maps007(uns)
 * zdtm/transition/unix_sock(ns)
 * zdtm/transition/socket-tcp(h)
 * zdtm/static/del_standalone_un(h)
 * zdtm/static/shm-mp(uns)
 * zdtm/static/seccomp_filter_inheritance(unknown)
 * zdtm/static/fanotify00(ns)
 * zdtm/static/rlimits00(ns)
 * zdtm/static/shm-unaligned(uns)
 * zdtm/static/shm(ns)
 * zdtm/static/deleted_unix_sock(h)
 * zdtm/static/config_inotify_irmap(uns)
##################################### FAIL #####################################

As you can see this patch is not perfect, so I hope to get your advice on what else I need to do next.

@mihalicyn mihalicyn self-requested a review May 24, 2023 08:08
@znley znley force-pushed the criu-dev-loongarch64 branch 2 times, most recently from 3312460 to 03dd186 Compare June 7, 2023 00:58
@znley
Contributor Author

znley commented Jun 8, 2023

@mihalicyn Are there any more suggestions on this?

@mihalicyn
Member

Hi @znley!

Sorry about the delay with the review.

Thanks for working on porting CRIU to the new architecture.

While I'm looking closely through the code, I want to ask you two things:

  • please split your changes into a few commits (at the very least, you can separate the compel changes from the criu and images changes).
  • how can we validate this without a loongarch64 machine? Can you prepare a ready-to-go qemu image, with appropriate instructions, so that we can run it on an amd64 machine and run CRIU/zdtm?

I'll explain why I'm asking. We already have support for the mips architecture, and it's not a perfect experience, because we are not sure that it's working and we have no way to test it on our own.
See, for instance:
#1619
The original author (@sunny868) of the MIPS port just disappeared and is not replying to our messages at all. (I've even tried to contact him by email, with no luck...)

So, IMHO, it would be great to have some way to test this without hardware. Can you also prepare such a qemu image/configuration and describe how to do that? You can put these instructions in a separate file in the Documentation directory of the CRIU tree.

@codecov-commenter

codecov-commenter commented Jun 8, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.05 ⚠️

Comparison is base (2d6f04c) 70.35% compared to head (ea746a0) 70.31%.

❗ Current head ea746a0 differs from pull request most recent head e4692d0. Consider uploading reports for the commit e4692d0 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2183      +/-   ##
============================================
- Coverage     70.35%   70.31%   -0.05%     
============================================
  Files           133      133              
  Lines         34001    33989      -12     
============================================
- Hits          23923    23900      -23     
- Misses        10078    10089      +11     

see 6 files with indirect coverage changes


@znley
Contributor Author

znley commented Jun 8, 2023

@mihalicyn
Thanks for your review!

First, I plan to split it into image, compel, include, criu and Makefile changes. What do you think?

About how to test: there are two ways.

  • The GCC farm provides a loongarch64 machine, but I haven't tried it and don't think it's a good option.
  • So I'm building a rootfs for loongarch64; all source code is available. It now works on x86 in docker, but at least python and protobuf are still needed to run, so please wait a while.

@mihalicyn
Member

mihalicyn commented Jun 8, 2023

First, I plan to split it into image, compel, include, criu and Makefile changes. What do you think?

The idea is not to split changes to different directories into different commits. What we want is for each commit to be an atomic change that:

  • can be compiled (so if you do git checkout <any commit hash> it should compile and work as expected)
  • brings some new functionality/converts an API/adds some helpers

For instance, you can split your changes to be like:

  • add loongarch64 support to compel (this should compile, and all examples in the compel/test directory should work)
  • add an image format for loongarch64 core
  • add loongarch64 support to parasite/restorer
  • add loongarch64 support to zdtm

Maybe later we will split it into even smaller pieces, but I think this looks optimal for now.

Generally speaking, we have the same policies as the Linux kernel. If you have a series of, say, 60 commits to the Linux kernel, then each commit should work, pass tests, and not bring any degradation (at least none that is known). There is a reason for such a policy: it makes it possible to perform git bisect .... If your git history does not follow this rule, then git rebase will be a pain.

Hint: you can use git rebase -i --exec "make clean && make" <first_commit_sha_that_you_want_to_test>~ to automatically run make clean && make on each commit, or even run tests.

@avagin avagin force-pushed the criu-dev branch 2 times, most recently from f392ea1 to 4d137b8 Compare June 12, 2023 06:33
@znley znley force-pushed the criu-dev-loongarch64 branch 2 times, most recently from dd857fd to 16c6dc0 Compare June 12, 2023 09:41
@znley
Contributor Author

znley commented Jun 15, 2023

Hi, @mihalicyn

Most importantly, I have finished a docker image in which the criu tests can be run. You can use it on an x86 machine with docker run -it --rm merore/archlinux-loongarch64.

About this image:

  • It boots a LoongArch64-based Arch Linux with qemu-system-loongarch64
  • The default user/password is root/loongarch64
  • All dependencies required to build and test criu are pre-installed
  • The built-in criu code may not be up to date; re-fetch it if necessary
  • qemu-system-loongarch64 may run very slowly, so you may want to prepare a cup of coffee

Known issues with criu running on loongarch64:

  • I build criu with make -j4 WERROR=0 because there are some warnings that I haven't dealt with yet.
  • The kernel is gradually deprecating the __NR_fstat syscall and loongarch64 does not provide it, so the seccomp_filter_inheritance test in zdtm needs a fix:
[root@archlinux criu]# git diff --cached
diff --git a/test/zdtm/static/seccomp_filter_inheritance.c b/test/zdtm/static/seccomp_filter_inheritance.c
index 7a86cd8..6c02026 100644
--- a/test/zdtm/static/seccomp_filter_inheritance.c
+++ b/test/zdtm/static/seccomp_filter_inheritance.c
@@ -100,9 +100,6 @@ int main(int argc, char **argv)
                if (filter_syscall(__NR_ptrace) < 0)
                        _exit(1);
 
-               if (filter_syscall(__NR_fstat) < 0)
-                       _exit(1);
-
                zdtm_seccomp = 1;
                test_msg("SECCOMP_MODE_FILTER is enabled\n");
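
Before dropping the filter, it can help to confirm whether the toolchain on a given machine defines __NR_fstat at all. The following is a minimal shell sketch and not part of the patch: the helper name has_macro is made up for illustration, and the usage line assumes a C compiler is installed as cc.

```sh
# has_macro NAME
# Reads a preprocessor macro dump (the output of `cc -dM -E`) on stdin and
# succeeds only if NAME itself is defined. The trailing space in the pattern
# keeps __NR_fstat from matching longer names such as __NR_fstatfs.
has_macro() {
    grep -q "^#define $1 "
}

# Usage (assumption: a C compiler is available as `cc`):
#   printf '#include <sys/syscall.h>\n' | cc -x c -dM -E - | has_macro __NR_fstat \
#       && echo "__NR_fstat is defined" || echo "__NR_fstat is not defined"
```

Running the usage line on both an x86 host and the loongarch64 VM shows whether the two toolchains differ on this syscall number.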

Finally, thank you for your patience and guidance. I split the commit and fixed the linter failure, but it might not be up to standard yet.

@mihalicyn
Member

Hi @znley!

Great job! I'll play with this thing then.

At first look I have a question. Where should I get these files:

Can you try to integrate this into GitHub actions?
You can take one of our pipelines as an example:
https://github.com/checkpoint-restore/criu/blob/criu-dev/.github/workflows/x86-64-gcc-test.yml

My understanding is that you need:

  • cross compile this thing from amd64 runner machine for loongarch64
  • run Qemu loongarch64 VM (with ssh server inside and all dependencies provided with the image by default)
  • copy CRIU and tests binaries to the VM
  • run some basic ZDTM tests inside the VM (let's start from ./test/zdtm.py run -t zdtm/transition/maps007 -t zdtm/static/sigpending -f h for instance)

Thanks for your work on that!
Kind regards,
Alex
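
One practical detail in the steps above is that an emulated VM can take a long time to boot, so a CI job has to wait until SSH answers before it copies any binaries. A minimal sketch of such a wait loop follows; the retry helper and the host alias loongarch-vm are hypothetical and are not taken from the CRIU workflows.

```sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or until ATTEMPTS runs have failed,
# sleeping DELAY_SECONDS between failed runs.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt is still to come.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage (assumption: "loongarch-vm" is an SSH host alias for the QEMU guest):
#   retry 60 10 ssh -o ConnectTimeout=5 loongarch-vm true
```

With 60 attempts and a 10 second delay the job gives up after roughly ten minutes of sleeping plus connection timeouts; both numbers are placeholders to tune against the real boot time.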

@znley
Contributor Author

znley commented Jun 15, 2023

At first look I have a question. Where should I get these files

Sorry, I forgot to explain this. I put all binaries under the archives directory, but didn't commit them to git because the files are large. I don't have a better way yet; maybe I will package the archives and upload them to a github release later. For now it is better to use docker.

The archives consist of:

  • Firmware: download from here
  • Qemu: compiled from source, version 8.0.0
  • Archlinux qcow2 (unofficial)

As for how to integrate into CI, I may need to do some examples first.

@mihalicyn
Member

Sorry, I forgot to explain this. I put all binaries under the archives directory, but didn't commit them to git because the files are large. I don't have a better way yet; maybe I will package the archives and upload them to a github release later. For now it is better to use docker.

Thanks for the clarification!

As for how to integrate into CI, I may need to do some examples first.

Sure, feel free to ask if you have any questions. I would advise you to create a separate branch in your CRIU fork for this CI work, and to create a new separate pipeline that:

The next step is to integrate CRIU.

Even before that you can just add a new cross-compilation target here:
https://github.com/checkpoint-restore/criu/blob/criu-dev/.github/workflows/cross-compile.yml

@znley
Contributor Author

znley commented Jul 5, 2023

@mihalicyn
I'm sorry, I was busy with other things for a while.

With this work (znley@927ae43), a uname -a CI job can run on a loongarch64 vm. https://github.com/znley/criu/actions/runs/5461334951/jobs/9939225089

I have adjusted the previous archlinux image without adopting the fedora you mentioned. Please see if it works.

@znley
Contributor Author

znley commented Jul 6, 2023

@mihalicyn

Build and test are integrated into the CI.
https://github.com/znley/criu/actions/runs/5473006521/jobs/9965902071

There are two problems:

  1. The LoongArch64 qemu vm runs very slowly; it takes at least an hour.
  2. The zdtm test case maps00 hits a serious error, so the entire test run can't complete.

@znley
Contributor Author

znley commented Jul 7, 2023

@mihalicyn

The entire test run is complete. It took 2 hours and 28 minutes.
https://github.com/znley/criu/actions/runs/5482106553/jobs/9987111425

@znley znley force-pushed the criu-dev-loongarch64 branch 2 times, most recently from 43c98e8 to 7fe7798 Compare July 11, 2023 03:20
@znley
Contributor Author

znley commented Jul 11, 2023

  • rebased
  • avoided build warnings; it now builds on loongarch64 without WERROR=0

@mihalicyn
Member

Hi @znley

Excellent work, thanks!

Btw, it seems like the zdtm/static/maps00 test problem is not related to CRIU itself. You can easily reproduce the freeze without CRIU:
./test/zdtm.py run -t zdtm/static/maps00 --nocr (the --nocr option means that the test is run without checkpoint/restore). I've experimented with this test on a loongarch64 VM and it makes the VM unresponsive.

Minimal reproducer:

cd test/zdtm/static
rm -f maps00.pid
make maps00.cleanout
make maps00.pid
kill -SIGTERM $(cat maps00.pid)
# system becomes unresponsive OR freezes totally

Kernel logs:

[  286.144567] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  286.146708] rcu: 	3-...0: (2 ticks this GP) idle=bcd4/1/0x4000000000000000 softirq=3394/3394 fqs=2468
[  286.147023] rcu: 	(detected by 2, t=5252 jiffies, g=6937, q=2780 ncpus=4)
[  286.147274] Sending NMI from CPU 2 to CPUs 3:
[  296.166483] rcu: rcu_sched kthread starved for 2500 jiffies! g6937 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=1
[  296.166545] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  296.166606] rcu: RCU grace-period kthread stack dump:
[  296.166648] task:rcu_sched       state:R  running task     stack:0     pid:13    ppid:2      flags:0x00000800
[  296.167242] Stack : 0000000000000001 9000000000d79020 0000000000000004 0000000000000000
[  296.167394]         9000000000e11eb8 0000000000000000 0000000000000002 c5641674628b6411
[  296.167469]         90000000014ea940 90000000014ed000 90000000014eb000 90000000014ea800
[  296.167518]         90000000014ed418 0000000000000000 90000000014ea940 90000000013de000
[  296.167570]         90000000014ed940 0000000000000000 0000000000000000 9000000000d79020
[  296.167623]         0000000000000000 90000000002b6df4 0000000000000000 90000000014ea940
[  296.167669]         900000000810c600 9000000008173d98 9000000008173d98 c5641674628b6411
[  296.167722]         ffffffff80000000 9000000006007a40 ffffffff80000000 0000000000000007
[  296.167777]         0000000000000008 90000000013de000 0000000000000001 0000000000000002
[  296.167825]         90000000014ed418 90000000014ed940 9000000008147e60 90000000002ba4b0
[  296.167957]         ...
[  296.168058] Call Trace:
[  296.168177] [<9000000000d788e0>] __schedule+0x454/0xa34

[  296.168845] rcu: Stack dump where RCU GP kthread last ran:
[  296.168858] Sending NMI from CPU 2 to CPUs 1:
[  296.168984] NMI backtrace for cpu 1
[  296.169097] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.3.0-13 #1 b259c8dadceac2a1b1c9199f20a87397bd6cb926
[  296.169231] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 2/2/2022
[  296.169271] $ 0   : 0000000000000000 9000000000d72b30 900000000817c000 900000000817fe40
[  296.169322] $ 4   : 0000000000000005 0000000000000001 900000000817c000 9000000000d78f10
[  296.169350] $ 8   : 9000000008173d10 9000000000e13240 0000000000000000 0000000000000000
[  296.169377] $12   : 0000000000000000 000000000002f6dc 90000000013de000 0000000000000001
[  296.169404] $16   : 4000000000000000 0000000722a2f640 9000000008173958 0000000000000001
[  296.169432] $20   : 0000000000000064 900000000810ed48 0000000000000000 0000000000000004
[  296.169460] $24   : 90000000013f7270 90000000013f7168 0000000000000000 0000000000000004
[  296.169486] $28   : 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  296.169519] era   : 90000000002217a0 __arch_cpu_idle+0x20/0x24
[  296.169554] ra    : 9000000000d72b30 arch_cpu_idle+0x1c/0x34
[  296.169575] CSR crmd: 000000b0	
[  296.169582] CSR prmd: 00000004	
[  296.169592] CSR euen: 00000000	
[  296.169600] CSR ecfg: 00071c1c	
[  296.169608] CSR estat: 00001000	
[  296.169626] ExcCode : 0 (SubCode 0)
[  296.169649] PrId  : 0014c010 (Loongson-64bit)
[  296.169689] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.3.0-13 #1 b259c8dadceac2a1b1c9199f20a87397bd6cb926
[  296.169705] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 2/2/2022
[  296.169745] Stack : 00000000000001d7 0000000000000000 90000000002230bc 900000000817c000
[  296.169776]         900000000805fcb0 900000000805fcb8 0000000000000000 0000000000000000
[  296.169804]         900000000805fcb8 0000000000000040 900000000805fd98 900000000805fb10
[  296.169830]         fffffffffffffffe 900000000805fcb8 c5641674628b6411 9000000008113980
[  296.169857]         0000000000000001 00003fffffffffff 0000000000000000 0000000000000000
[  296.169883]         0000000000000000 90000000011e9477 0000000005a04000 0000000000000000
[  296.169910]         0000000000000000 0000000000000000 90000000012a2eb8 90000000013f1000
[  296.169937]         90000000013ff040 900000000817fd00 0000000000000000 0000000000000001
[  296.169965]         0000000000000000 0000000000000000 90000000002230d4 00007ffff2ac766c
[  296.169991]         00000000000000b0 0000000000000004 0000000000000000 0000000000071c1c
[  296.170017]         ...
[  296.170034] Call Trace:
[  296.170059] [<90000000002230d4>] show_stack+0x5c/0x180
[  296.170083] [<9000000000d71648>] dump_stack_lvl+0x60/0x88
[  296.170094] [<9000000000d5121c>] nmi_cpu_backtrace+0x15c/0x160
[  296.170106] [<9000000000223708>] handle_backtrace+0x10/0x40
[  296.170116] [<90000000002e24fc>] __flush_smp_call_function_queue+0xc4/0x250
[  296.170129] [<900000000022d4b4>] loongson_ipi_interrupt+0x80/0xc0
[  296.170141] [<900000000029efcc>] __handle_irq_event_percpu+0x50/0x12c
[  296.170154] [<900000000029f0c0>] handle_irq_event_percpu+0x18/0x6c
[  296.170166] [<90000000002a54f8>] handle_percpu_irq+0x54/0x88
[  296.170176] [<900000000029deac>] generic_handle_domain_irq+0x28/0x40
[  296.170188] [<9000000000734c9c>] handle_cpu_irq+0x6c/0xa8
[  296.170200] [<9000000000d724e8>] handle_loongarch_irq+0x30/0x48
[  296.170210] [<9000000000d72580>] do_vint+0x80/0xb4
[  296.170246] [<90000000002217a0>] __arch_cpu_idle+0x20/0x24
[  296.170275] [<9000000000d72b30>] arch_cpu_idle+0x1c/0x34
[  296.170288] [<9000000000d72bd4>] default_idle_call+0x1c/0x48
[  296.170299] [<90000000002875d8>] do_idle+0xb4/0x118
[  296.170310] [<90000000002877fc>] cpu_startup_entry+0x24/0x28
[  296.170320] [<900000000022db04>] start_secondary+0x9c/0xa4
[  296.170331] [<9000000000d7310c>] smpboot_entry+0x54/0x58

So, do we have a problem with the loongarch64 kernel or with Qemu? Can you try to run this reproducer on real hardware and on a VM just to confirm?

@mihalicyn
Member

The entire test run is complete. It took 2 hours and 28 minutes.
https://github.com/znley/criu/actions/runs/5482106553/jobs/9987111425

Yes, that's why I think it's worth running only a few tests, not the full ZDTM test suite.
We can just run something like this:

./test/zdtm.py run -t zdtm/static/maps02 -t zdtm/static/maps05 -t zdtm/static/maps06 -t zdtm/static/maps10 -t zdtm/static/maps_file_prot -t zdtm/static/memfd00 -t zdtm/transition/fork -t zdtm/transition/fork2 -t zdtm/transition/shmem -f h

I think it will be sufficient for the GitHub Actions environment. But I plan to resurrect our Jenkins and add a CI job for loongarch64 once we merge this, so we will be able to run the full test suite on our Jenkins CI.
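
The limited subset above can also be kept as a plain list and turned into the command line by a small helper, which makes it easier to extend later. This is only a sketch: zdtm_cmd is a hypothetical name, and the only zdtm.py options it relies on are the -t and -f flags already used in this thread.

```sh
# zdtm_cmd FLAVOR TEST...
# Prints a zdtm.py command line with one "-t" per test and a single "-f FLAVOR".
# It prints instead of executing so the command can be inspected first.
zdtm_cmd() {
    flavor=$1
    shift
    cmd="./test/zdtm.py run"
    for t in "$@"; do
        cmd="$cmd -t $t"
    done
    printf '%s -f %s\n' "$cmd" "$flavor"
}

# Print the limited subset proposed for GitHub Actions (host flavor "h").
zdtm_cmd h \
    zdtm/static/maps02 zdtm/static/maps05 zdtm/static/maps06 \
    zdtm/static/maps10 zdtm/static/maps_file_prot zdtm/static/memfd00 \
    zdtm/transition/fork zdtm/transition/fork2 zdtm/transition/shmem
```

From the CRIU source tree the printed line can be piped to sh to run it; extending the CI subset then only means adding a name to the list.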

@mihalicyn
Member

mihalicyn commented Jul 16, 2023

@mihalicyn

Build and test are integrated into the CI. https://github.com/znley/criu/actions/runs/5473006521/jobs/9965902071

There are two problems:

1. The LoongArch64 qemu vm runs very slowly; it takes at least an hour.

2. The zdtm test case `maps00` hits a serious error, so the entire test run can't complete.

perfect! I've played with this thing locally. Thanks a lot for doing this!

IMHO, let's add CI to this branch and run a limited subset of tests like this (we can extend it later):

./test/zdtm.py run -t zdtm/static/maps02 -t zdtm/static/maps05 -t zdtm/static/maps06 -t zdtm/static/maps10 -t zdtm/static/maps_file_prot -t zdtm/static/memfd00 -t zdtm/transition/fork -t zdtm/transition/fork2 -t zdtm/transition/shmem -f h

I think that fixups like "loongarch64 fix: avoid deprecated warning that $sp in a clobber list" and "loongarch64 fix: avoid unused compat variable warning" should be squashed into the appropriate commits.

@znley
Contributor Author

znley commented Jul 18, 2023

@mihalicyn
All fixup patches have been squashed into the previous commits. I think it looks fine now.

IMHO, let's add CI to this branch and run a limited subset of tests like this (we can extend it later):

Do you mean adding CI to the criu-dev branch? I'm not sure about this. If so, should I open another PR or do it in this PR?

About the maps00 test: I need to analyze it again.

@mihalicyn
Member

Do you mean adding CI to the criu-dev branch? I'm not sure about this. If so, should I open another PR or do it in this PR?

Yeah, I think you can add your CI job within the scope of this PR.

About the maps00 test: I need to analyze it again.

Thanks! It would be very interesting if you are able to reproduce this behaviour in QEMU in your environment; if it's reproducible with QEMU, it's worth trying to run this on real hardware. Anyway, this looks like a bug somewhere in QEMU (maybe an interrupt is not properly generated on a page fault) or in the Linux kernel. Testing on hardware would let us isolate the problem and determine whether it is a QEMU or a Linux kernel bug.

@znley
Contributor Author

znley commented Jul 19, 2023

@mihalicyn
It seems that ubuntu 20.04 on the github runner runs into some issues. I did some tests locally; it is probably due to issues with the azure-hosted ubuntu repositories. So I updated the runner to ubuntu 22.04.

Meanwhile CI runs a limited subset of tests now.

 ./test/zdtm.py run -t zdtm/static/maps02 -t zdtm/static/maps05 -t zdtm/static/maps06 -t zdtm/static/maps10 -t zdtm/static/maps_file_prot -t zdtm/static/memfd00 -t zdtm/transition/fork -t zdtm/transition/fork2 -t zdtm/transition/shmem -f h

@mihalicyn
Member

It seems that ubuntu 20.04 on the github runner runs into some issues. I did some tests locally; it is probably due to issues with the azure-hosted ubuntu repositories. So I updated the runner to ubuntu 22.04.

yes, looks like it is.

Meanwhile CI runs a limited subset of tests now.

Thanks a lot. As far as I can see, the tests are all green now.

I think this PR looks good to me and I plan to merge it on the weekend. Maybe I'll make some small changes to the commit descriptions or something like that, but generally it's in perfect shape! Thanks for your work on this; I hope that you'll take care of this port and help us maintain it properly.

@mihalicyn
Member

@znley I've made some small changes in your PR branch (fixed spelling and the commit message format) and also made the linter happy.

I'll merge the PR right now.

It looks like there are some issues with the GitHub Actions runners, and we are experiencing SSL issues like this:

+ curl -fsSL https://download.docker.com/linux/ubuntu/gpg
+ sudo apt-key add -
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
curl: (60) SSL: no alternative certificate subject name matches target host name 'download.docker.com'

https://github.com/checkpoint-restore/criu/actions/runs/5601283935/jobs/10244922111?pr=2183
It looks like we are not alone with this problem:
https://community.codecov.com/t/github-actions-failing-due-to-tls-certificate-expiration/3751

Anyway, it's not related to your changes, and I've seen these tests in a "green" state. I've also done appropriate testing in my local environment to confirm that everything is in perfect shape.

@mihalicyn mihalicyn merged commit f70c782 into checkpoint-restore:criu-dev Jul 19, 2023
28 of 37 checks passed
@znley
Contributor Author

znley commented Jul 20, 2023

@mihalicyn

I think this PR looks good to me and I plan to merge it on the weekend. Maybe I'll make some small changes to the commit descriptions or something like that, but generally it's in perfect shape! Thanks for your work on this; I hope that you'll take care of this port and help us maintain it properly.

I just did a little work. Thanks again for your patient guidance and your time. It has helped me a lot.

I will fix the test issues, including maps00, one by one.

At the same time, I will make the necessary CI improvements as the operating system is updated. Feel free to contact me with any related questions at any time.

Respect!

@znley znley deleted the criu-dev-loongarch64 branch July 21, 2023 03:37
gentoo-bot pushed a commit to gentoo/gentoo that referenced this pull request Aug 22, 2024