
deck-chores should trigger rediscovery of containers when task errors out with critical exception container not running #57

Open
nramesh25 opened this issue Jun 7, 2020 · 4 comments

Comments

@nramesh25

We ran into an issue where a deck-chores instance held on to a container that had been restarted, and it was not able to recover from the failure. I am not sure whether Docker emitted an event that deck-chores didn't receive, or whether there was a race condition and the event was lost. At a minimum, if deck-chores already determines that the container is not running, could it remove that container's jobs from the tasks and re-discover them?
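A minimal sketch of the recovery the report proposes, in plain Python. This is hypothetical illustration, not deck-chores' actual code: the function name `prune_jobs_for_container` and the `jobs` dict are invented, but the idea mirrors the job configuration shown in the logs below, where each job carries a `container_id`. When a job fails because its container is gone, all jobs bound to that container id would be dropped so a subsequent rescan can re-add them against the new container.

```python
# Hypothetical sketch (NOT deck-chores' real implementation) of pruning
# scheduled jobs that reference a container which no longer runs, so a
# later container-rescan can register them afresh.

def prune_jobs_for_container(jobs: dict, container_id: str) -> list:
    """Remove every job bound to container_id; return the removed job ids."""
    removed = [
        job_id
        for job_id, config in jobs.items()
        if config.get("container_id") == container_id
    ]
    for job_id in removed:
        del jobs[job_id]
    return removed


# Example with ids shortened from the logs in this issue:
jobs = {
    "c78a823e": {"job_name": "submission-monitor",
                 "container_id": "5c33cd04b1ae"},
    "11112222": {"job_name": "other-task",
                 "container_id": "ffffeeee"},
}
assert prune_jobs_for_container(jobs, "5c33cd04b1ae") == ["c78a823e"]
assert "c78a823e" not in jobs  # only the dead container's job is gone
```

After pruning, deck-chores would still need to re-run its container inspection (the "Inspecting running containers." step visible in the logs below) to pick the job up again from the restarted replica.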

cron    2ssszq9rhbx3  2020-06-07 23:00:00,002|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Executing 'submission-monitor'.
cron    2ssszq9rhbx3  Job "exec_job (trigger: cron[year='*', month='*', day='*', week='*', day_of_week='*', hour='*', minute='*/5', second='0'], next run at: 2020-06-07 23:05:00 UTC)" raised an exception
cron    2ssszq9rhbx3  Traceback (most recent call last):
cron    2ssszq9rhbx3    File "/usr/local/lib/python3.8/site-packages/apscheduler/executors/base.py", line 125, in run_job
cron    2ssszq9rhbx3      retval = job.func(*job.args, **job.kwargs)
cron    2ssszq9rhbx3    File "/usr/local/lib/python3.8/site-packages/deck_chores/jobs.py", line 100, in exec_job
cron    2ssszq9rhbx3      raise AssertionError('Container is not running.')
cron    2ssszq9rhbx3  AssertionError: Container is not running.
cron    2ssszq9rhbx3  2020-06-07 23:00:00,010|CRITICAL|An exception in deck-chores occurred while executing submission-monitor in container 5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2:
cron    2ssszq9rhbx3  2020-06-07 23:00:00,010|ERROR   |Container is not running.
cron    2ssszq9rhbx3  2020-06-07 23:03:06,954|INFO    |ID: c78a823e-a86e-59c2-b71a-c2016cb38d09   Next execution: 2020-06-07 23:05:00+00:00   Configuration:
cron    2ssszq9rhbx3  2020-06-07 23:03:06,955|INFO    |{'command': '/bin/bash -c "/opt/python3/bin/python3.7 /api-gw/scripts/monitoring/submission_monitor.py"', 'name': 'submission-monitor', 'user': '', 'environment': {}, 'jitter': None, 'max': 1, 'timezone': 'UTC', 'trigger': (<class 'apscheduler.triggers.cron.CronTrigger'>, ('*', '*', '*', '*', '*', '*', '*/5', '0')), 'service_id': ('com.docker.swarm.service.name=vm119_api-gw',), 'job_name': 'submission-monitor', 'job_id': 'c78a823e-a86e-59c2-b71a-c2016cb38d09', 'container_id': '5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2'}
@funkyfuture
Owner

am i understanding correctly that this was (probably?) the chain of events:

  • a single(?) container w/ a job definition that represents a service was started
  • that job definition was parsed, the job was added to the scheduler
  • a restart of the container was invoked
  • the Docker daemon stops the container; deck-chores might have missed that event (it should appear in the logs) or was still processing it while
  • the scheduler is trying to execute the job, but the container is stopped atm
  • the Docker daemon starts the container

please correct me if i got a false impression. also, these questions come to my mind:

  • was the job executed five minutes later?
  • are there any log entries around that event that show that the container's jobs were removed or (re-)added?
  • you seem to use the Swarm mode; was the container restarted on the same node that it ran before?
@nramesh25
Author

Thanks for the questions, let me do my best to explain the setup and events.

Environment

  • We are running a Docker Swarm cluster with a single node acting as both manager and worker.

  • deck-chores v1.1.0 runs in global mode with SERVICE_ID_LABELS=com.docker.swarm.service.name

  • we have a service api-gw running in replicated mode (2 replicas) that has the cron job definition added as a label

Activities:

  • Performed an update of all images and restarted the services (docker service update --force)

  • deck-chores restarted at 2020-06-06T12:32:16.044614673Z

2020-06-06 12:32:21,489|INFO    |Inspecting running containers.
2020-06-06 12:32:45,364|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Added 'submission-monitor' (c78a823e-a86e-59c2-b71a-c2016cb38d09).
2020-06-06 12:32:47,766|INFO    |Listening to events.
  • 2 instances of api-gw restarted at 2020-06-06T12:35:23.629945606Z and 2020-06-06T12:35:01.402977676Z

Errors appearing container is not running

2020-06-06 12:40:00,004|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Executing 'submission-monitor'.
2020-06-06 12:40:00,009|CRITICAL|An exception in deck-chores occurred while executing submission-monitor in container 5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2:
2020-06-06 12:40:00,009|ERROR   |Container is not running.
  • deck-chores didn't recover from this. Even after 2 days the cron job is failing every 5 minutes with the same error:
2020-06-08 15:35:00,002|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Executing 'submission-monitor'.
2020-06-08 15:35:00,009|CRITICAL|An exception in deck-chores occurred while executing submission-monitor in container 5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2:
2020-06-08 15:35:00,009|ERROR   |Container is not running.

was the job executed five minutes later? - No.
are there any log entries around that event that show that the container's jobs were removed or (re-)added? - No other log entries after the initial ones.
you seem to use the Swarm mode; was the container restarted on the same node that it ran before? - Yes, running on the same node.

@funkyfuture
Owner

funkyfuture commented Jun 21, 2020

it's clear that the stop of the containers that serve your api service went unnoticed by deck-chores, isn't it?

i have no clue whether the fact that you deploy in Swarm mode is relevant here. in any case, i will not try to recreate it, as i don't intend to support it.

the only thing i can suggest for now is that you run deck-chores at debug logging level, try to recreate the failing behaviour, and submit complete logs.

if we fail to figure out the root cause, an option might be to exit deck-chores when a container can't be found, so that the service manager can restart it if configured to do so.
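The fail-fast fallback described above could look roughly like this. This is a hypothetical sketch, not a patch against deck-chores: the function name `exec_job_or_die` and its parameters are invented, and it stands in for the check in `deck_chores/jobs.py` that currently raises `AssertionError('Container is not running.')`. The only change in behaviour it illustrates is raising `SystemExit(1)` instead, so a restart policy on the deck-chores service brings the process back up and its startup container scan rediscovers all jobs.

```python
# Hypothetical sketch of "exit on missing container": instead of raising
# an AssertionError that leaves a stale job firing forever, terminate the
# process with a non-zero status so the supervisor (e.g. Docker's restart
# policy) restarts deck-chores, which then rescans running containers.
import sys


def exec_job_or_die(container_status: str, job_name: str) -> None:
    if container_status != "running":
        print(
            f"Container for job {job_name!r} is not running; exiting "
            "so the service manager can restart deck-chores.",
            file=sys.stderr,
        )
        raise SystemExit(1)  # non-zero exit -> supervisor restarts us
    # ... otherwise execute the job's command in the container ...
```

With `restart_policy: condition: any` (or an equivalent supervisor setting) on the deck-chores service itself, a restart after this exit is effectively the forced rediscovery asked for in the original report.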

@nramesh25
Author

Sorry I missed this update. I tried figuring out the root cause but had no luck. I think an option to exit the container when a critical error is encountered may be a good idea.
