
Our CI runs TeamCity Agents in Docker containers on a number of machines. For each run, I start 100 test-runner processes (Ruby/Cucumber) that run the tests via 100 instances of Appium.

A couple of weeks ago we moved the launching of the 100 test-runner processes and 100 Appium instances into a child Docker container, and one particular machine started going wrong: every so often, all (or very nearly all) 100 Appium processes hang at start-up while reading in their 'require' files - sometimes hanging in fs.closeSync(), sometimes in toString() at buffer.js:634:45, called from fs.readFileSync at fs.js:615:40.

All the Appium processes are definitely wedged (I took each stack trace twice, one second apart, with different debugger instances) and the stack traces fall into a handful of groups, all stuck while 'require'ing files, before any ports are opened and before the server opens its log, though it's not always the same require file each time.
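
(For anyone wanting to capture the same sort of trace: Node activates its inspector on SIGUSR1, after which any inspector client can attach and print the stack. Roughly - the pid and port here are placeholders; the real port is printed on the process's stderr when the inspector activates:)

$ kill -USR1 <pid>                 # hung node process prints "Debugger listening on ws://127.0.0.1:<port>/..."
$ node inspect 127.0.0.1:<port>    # attach the built-in debugger client, then 'pause' and 'bt' to print the stack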

Once it has happened, it seems to keep on happening on all subsequent runs until I manually delete the node_modules directory, but after a while it starts locking up again. The node_modules directory is in a volume shared between the TeamCity Agent and the test container where the Appium processes run, so we don't have to redo the git checkout or the npm install each time.
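
For context, the sharing looks roughly like this (a minimal sketch with placeholder volume/image names, not our real TeamCity wiring):

$ docker volume create workspace                                      # placeholder volume name
$ docker run -d -v workspace:/workspace teamcity-agent-image          # agent does the git checkout and npm install here
$ docker run --rm -v workspace:/workspace test-image ./run_tests.sh   # child container sees the same node_modules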

This seems to implicate file locking, but when I run lslocks as root in the container, both before I start Appium and just after I get its stack trace, I see nothing (aside from a lock that all Docker containers show, which seems to relate to a lock on /dev at the top/host level):

$ lslocks
COMMAND           PID   TYPE SIZE MODE  M START END PATH
file_lock       54314  POSIX   0B WRITE 0     0   0 /
(unknown)          -1 OFDLCK   0B READ  0     0   0 

file_lock is my test to make sure lslocks is working; I also checked that mandatory locking isn't enabled for that volume.
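
(Checking for mandatory locking just means looking for the mand mount option, along these lines - the mount point is a placeholder:)

$ findmnt -no TARGET,OPTIONS /workspace    # would list 'mand' among the options if mandatory locking were enabled
$ grep -w mand /proc/mounts                # no output: nothing is mounted with mandatory locking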

I wasn't able to trigger the hang from the command-line, but I hadn't captured all the instances' environment variables, so I couldn't try to start all 100 up. I'll do that next time the hang happens.

That said, I'm out of reasonable things to look into: I can upgrade Node and Appium, but without understanding why the processes are apparently deadlocking each other, I have no guarantee that would be a permanent solution.


The processes go into disk wait, and scripting echo 3 > /proc/sys/vm/drop_caches didn't clear it (though there were hints it might have unclogged things for a fraction of a second).
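
(The disk-wait check and the cache drop were along these lines - a rough sketch rather than the exact script:)

$ ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'   # processes in uninterruptible (disk) wait
$ echo 3 > /proc/sys/vm/drop_caches                 # as root: drop page cache, dentries and inodes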

It seems to be related to using shared Docker volumes, and the hang even happened when only 4 Appiums were started, so it wasn't merely lots of processes hammering the overlay filesystem with read requests.

Changing node_modules into a symlink pointing to an unshared area seems to have resolved things over the past few days.
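
(Roughly what that looks like - the paths are placeholders for our real layout:)

$ mv /workspace/project/node_modules /var/local/node_modules      # move the modules out of the shared volume
$ ln -s /var/local/node_modules /workspace/project/node_modules   # leave a symlink behind so nothing else has to change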


Host: SLES 12 SP 4 - 4.14.75-0.2-default #1 SMP PREEMPT Mon Nov 13 14:53:06 UTC 2017 (c152297) x86_64 x86_64 x86_64 GNU/Linux

/proc/version: Linux version 4.14.75-0.2-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux)) #1 SMP PREEMPT Mon Nov 13 14:53:06 UTC 2017 (c152297)

Docker version 18.09.6, build 481bc7715621

Container: Debian GNU/Linux 9 - 4.14.75-0.2-default #1 SMP PREEMPT Mon Nov 13 14:53:06 UTC 2017 (c152297) x86_64 GNU/Linux

node: v9.11.2

  • To clarify: you are starting everything in a single container? And it only goes wrong on one particular physical machine, and works fine on the other machines? Can you vary details, like using two different volumes for node_modules, and see whether e.g. just the processes using one particular volume get stuck? Sharing volumes using Linux kernel features is difficult enough that I'd suspect deadlocking could also happen somewhere in the kernel.
    – dirkt
    Commented Jul 4, 2019 at 12:27
  • @dirkt The TeamCity container does some stuff, then starts the secondary container, which then runs all 100 test scripts, each of which starts its own Appium server. The TeamCity container just waits until the secondary container is finished, so it seems unlikely to be involved. This is the only one of the four machines affected. Thanks for the suggestion - I will add rigging up an unshared additional volume to my list of things to try. Commented Jul 4, 2019 at 13:39
  • If you do a ps in the container(s) where the Appium process(es) are stuck, which state are the stuck processes in? D as in "blocked for disk I/O"? Is the kernel version on the four machines different?
    – dirkt
    Commented Jul 4, 2019 at 13:53
  • @dirkt I can't believe I didn't log that before! In the most recent clog, the processes are indeed in disk wait: 4110 ? Dl 0:03 node --inspect-port=0. I've added a find | xargs timeout cksum to see if I can find the clogged file - I noticed that command wedged too in an earlier attempt to track this. Commented Jul 5, 2019 at 16:22
  • Can you try echo 3 > /proc/sys/vm/drop_caches in a container with processes stuck in D, and see if it fixes this (at least for some time)? And please do look at the kernel versions.
    – dirkt
    Commented Jul 5, 2019 at 17:29

1 Answer


A partial answer: the trigger seemed to be running Appium from a node_modules directory sitting in a Docker volume shared between two containers; moving node_modules to a non-volume area stopped the hanging.

Quite what the underlying problem was, I have no idea, but it looks like a bug in the filesystem logic. Quite how Node.js/Appium triggered it when 100 processes reading at least as much Ruby code later on didn't, I also don't know.

