Our CI runs TeamCity agents in Docker containers on a number of machines. For each run, I start 100 test-runner processes (Ruby/Cucumber) that drive the tests via 100 instances of Appium.
A couple of weeks ago we moved the 100 test-runner processes and 100 Appium instances into a child Docker container, and one particular machine started going wrong: every so often, all (or very nearly all) 100 Appium processes hang at start-up while reading in their 'require' files - sometimes hanging in fs.closeSync(), sometimes in toString (buffer.js:634:45) called from fs.readFileSync (fs.js:615:40).
All the Appium processes are definitely wedged (I took stack traces twice, one second apart, with different debugger instances). The stack traces fall into a handful of groups, all stuck while 'require'ing files, before any ports are opened and before the server opens its log - though it's not the same require file each time.
Once it has happened, it keeps happening on all subsequent runs until I manually delete the node_modules directory, but after a while it starts locking up again. The node_modules directory lives in a volume shared between the TeamCity agent and the test container where the Appium processes run, so that we don't have to redo the git checkout or the npm install for each run.
This seems to implicate file locking, but when I run lslocks as root in the container - both before I start Appium and just after I capture its stack trace - I see nothing (aside from a lock that all Docker containers show, which seems to relate to a lock on /dev at the top/host level):
$ lslocks
COMMAND PID TYPE SIZE MODE M START END PATH
file_lock 54314 POSIX 0B WRITE 0 0 0 /
(unknown) -1 OFDLCK 0B READ 0 0 0
file_lock is my own test to make sure lslocks is working; I also checked that mandatory locking isn't enabled for that volume.
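For reference, this is roughly how that mandatory-locking check can be done: mandatory locking only applies when the filesystem is mounted with the mand option, so it's enough to inspect the options field in /proc/mounts. A minimal sketch - /ci/shared is a hypothetical stand-in for the shared volume's real mount point:

```shell
# Return success if the given mount point carries the 'mand' mount option.
# $2 in /proc/mounts is the mount point, $4 the comma-separated options.
mount_has_mand() {
  awk -v mp="$1" '$2 == mp && $4 ~ /(^|,)mand(,|$)/ { found = 1 }
                  END { exit !found }' /proc/mounts
}

if mount_has_mand /ci/shared; then   # hypothetical path - substitute the real volume
  echo "mandatory locking enabled"
else
  echo "mandatory locking not enabled"
fi
```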
I wasn't able to trigger the hang from the command line, but I hadn't captured all the instances' environment variables, so I couldn't try starting all 100 up the same way. I'll do that next time the hang happens.
That said, I'm out of reasonable things to look into: I could upgrade node and Appium, but without understanding why the processes are apparently deadlocking each other, I have no guarantee that would be a permanent fix.
The processes go into disk wait, and scripting echo 3 > /proc/sys/vm/drop_caches didn't clear it (though there were hints it might have unclogged things for a fraction of a second).
It seems to be related to using shared Docker volumes, and the hang happened even when only 4 Appiums were started, so it isn't merely lots of processes hammering the overlay fs with read requests.
Changing node_modules into a symlink pointing to an unshared area seems to have resolved things over the past few days.
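That workaround can be sketched roughly as follows. All paths are hypothetical stand-ins (a temporary directory substitutes for the real container-local area and the real shared volume):

```shell
# Keep node_modules in a container-local (unshared) area and point a symlink
# at it from the shared checkout, so npm writes never touch the shared volume.
set -e
BASE=$(mktemp -d)                      # stands in for the real filesystem roots
UNSHARED=$BASE/local/node_modules      # container-local, not on a Docker volume
CHECKOUT=$BASE/shared/workdir          # stands in for the shared TeamCity volume
mkdir -p "$UNSHARED" "$CHECKOUT"
rm -rf "$CHECKOUT/node_modules"        # remove any shared copy first
ln -s "$UNSHARED" "$CHECKOUT/node_modules"
# npm install run from $CHECKOUT now writes through the symlink
readlink "$CHECKOUT/node_modules"
```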
Host: SLES 12 SP 4 - 4.14.75-0.2-default #1 SMP PREEMPT Mon Nov 13 14:53:06 UTC 2017 (c152297) x86_64 x86_64 x86_64 GNU/Linux
/proc/version: Linux version 4.14.75-0.2-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux)) #1 SMP PREEMPT Mon Nov 13 14:53:06 UTC 2017 (c152297)
Docker version 18.09.6, build 481bc7715621
Container: Debian GNU/Linux 9 - 4.14.75-0.2-default #1 SMP PREEMPT Mon Nov 13 14:53:06 UTC 2017 (c152297) x86_64 GNU/Linux
node: v9.11.2
Comments:

- Could you go back to sharing node_modules, and see if e.g. just processes using one particular volume get stuck? Sharing volumes using Linux kernel features is difficult enough that I'd suspect deadlocking could also happen somewhere in the kernel.

- Run ps in the container(s) where the Appium process(es) are stuck: which state are the stuck processes in? D, as in "blocked for disk I/O"? Is the kernel version on the four machines different?

- ps shows disk wait: 4110 ? Dl 0:03 node --inspect-port=0. I've added a find | xargs timeout cksum to see if I can find the clogged file - I noticed that command wedged too in an earlier attempt to track this.

- Could you run echo 3 > /proc/sys/vm/drop_caches in a container with processes stuck in D, and see if it fixes this (at least for some time)? And please do look at the kernel versions.
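The process-state check suggested above can be scripted. A minimal sketch that scans /proc for processes in uninterruptible sleep ('D', the disk-wait state the stuck Appium processes show):

```shell
# List processes currently in state 'D' (uninterruptible sleep / disk wait).
# The comm field in /proc/<pid>/stat is parenthesised and may contain spaces,
# so strip everything up to the last ')' before reading the state field.
for stat in /proc/[0-9]*/stat; do
  pid=${stat#/proc/}; pid=${pid%/stat}
  state=$(sed -e 's/.*) //' -e 's/ .*//' "$stat" 2>/dev/null)
  if [ "$state" = D ]; then
    echo "$pid: D (uninterruptible sleep)"
  fi
done
```

On a healthy system this usually prints nothing; during the hang it should list the wedged node PIDs.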