
deck-chores should trigger rediscovery of containers when task errors out with critical exception container not running #57

Open
nramesh25 opened this issue Jun 7, 2020 · 4 comments

Comments

@nramesh25

We ran into an issue where a deck-chores instance held on to a container that had been restarted, and it was not able to recover from the failure. I am not sure whether Docker emitted an event that deck-chores didn't receive, or whether there was a race condition and the event was lost. At a minimum, if deck-chores already determines that the container is not running, could it remove that container's jobs from the tasks and re-discover them?
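A minimal sketch of the recovery the report proposes, in plain Python. This is hypothetical illustration, not deck-chores' actual code: the function name `prune_jobs_for_container` and the `jobs` dict are invented, but the idea mirrors the job configuration shown in the logs below, where each job carries a `container_id`. When a job fails because its container is gone, all jobs bound to that container id would be dropped so a subsequent rescan can re-add them against the new container.

```python
# Hypothetical sketch (NOT deck-chores' real implementation) of pruning
# scheduled jobs that reference a container which no longer runs, so a
# later container-rescan can register them afresh.

def prune_jobs_for_container(jobs: dict, container_id: str) -> list:
    """Remove every job bound to container_id; return the removed job ids."""
    removed = [
        job_id
        for job_id, config in jobs.items()
        if config.get("container_id") == container_id
    ]
    for job_id in removed:
        del jobs[job_id]
    return removed


# Example with ids shortened from the logs in this issue:
jobs = {
    "c78a823e": {"job_name": "submission-monitor",
                 "container_id": "5c33cd04b1ae"},
    "11112222": {"job_name": "other-task",
                 "container_id": "ffffeeee"},
}
assert prune_jobs_for_container(jobs, "5c33cd04b1ae") == ["c78a823e"]
assert "c78a823e" not in jobs  # only the dead container's job is gone
```

After pruning, deck-chores would still need to re-run its container inspection (the "Inspecting running containers." step visible in the logs below) to pick the job up again from the restarted replica.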

cron    2ssszq9rhbx3  2020-06-07 23:00:00,002|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Executing 'submission-monitor'.
cron    2ssszq9rhbx3  Job "exec_job (trigger: cron[year='*', month='*', day='*', week='*', day_of_week='*', hour='*', minute='*/5', second='0'], next run at: 2020-06-07 23:05:00 UTC)" raised an exception
cron    2ssszq9rhbx3  Traceback (most recent call last):
cron    2ssszq9rhbx3    File "/usr/local/lib/python3.8/site-packages/apscheduler/executors/base.py", line 125, in run_job
cron    2ssszq9rhbx3      retval = job.func(*job.args, **job.kwargs)
cron    2ssszq9rhbx3    File "/usr/local/lib/python3.8/site-packages/deck_chores/jobs.py", line 100, in exec_job
cron    2ssszq9rhbx3      raise AssertionError('Container is not running.')
cron    2ssszq9rhbx3  AssertionError: Container is not running.
cron    2ssszq9rhbx3  2020-06-07 23:00:00,010|CRITICAL|An exception in deck-chores occurred while executing submission-monitor in container 5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2:
cron    2ssszq9rhbx3  2020-06-07 23:00:00,010|ERROR   |Container is not running.
cron    2ssszq9rhbx3  2020-06-07 23:03:06,954|INFO    |ID: c78a823e-a86e-59c2-b71a-c2016cb38d09   Next execution: 2020-06-07 23:05:00+00:00   Configuration:
cron    2ssszq9rhbx3  2020-06-07 23:03:06,955|INFO    |{'command': '/bin/bash -c "/opt/python3/bin/python3.7 /api-gw/scripts/monitoring/submission_monitor.py"', 'name': 'submission-monitor', 'user': '', 'environment': {}, 'jitter': None, 'max': 1, 'timezone': 'UTC', 'trigger': (<class 'apscheduler.triggers.cron.CronTrigger'>, ('*', '*', '*', '*', '*', '*', '*/5', '0')), 'service_id': ('com.docker.swarm.service.name=vm119_api-gw',), 'job_name': 'submission-monitor', 'job_id': 'c78a823e-a86e-59c2-b71a-c2016cb38d09', 'container_id': '5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2'}
@funkyfuture
Owner

am i understanding correctly that this was (probably?) the chain of events:

  • a single(?) container w/ a job definition that represents a service was started
  • that job definition was parsed, the job was added to the scheduler
  • a restart of the container was invoked
  • the Docker daemon stops the container; deck-chores might have missed that event (it should appear in the logs) or was still processing it while
  • the scheduler is trying to execute the job, but the container is stopped atm
  • the Docker daemon starts the container

please correct me if i got a false impression. also, these questions come to my mind:

  • was the job executed five minutes later?
  • are there any log entries around that event that show that the container's jobs were removed or (re-)added?
  • you seem to use the Swarm mode; was the container restarted on the same node that it ran before?
@nramesh25
Author

Thanks for the questions, let me do my best to explain the setup and events.

Environment

  • We are running a Docker Swarm cluster with a single node acting as both manager and worker.

  • deck-chores v1.1.0 runs in global mode with SERVICE_ID_LABELS=com.docker.swarm.service.name

  • we have a service api-gw running in replicated mode (2 replicas) that has the cron job definition added as a label

Activities:

  • Performed an update of all images and restarted the services (docker service update --force)

  • deck-chores restarted at 2020-06-06T12:32:16.044614673Z

2020-06-06 12:32:21,489|INFO    |Inspecting running containers.
2020-06-06 12:32:45,364|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Added 'submission-monitor' (c78a823e-a86e-59c2-b71a-c2016cb38d09).
2020-06-06 12:32:47,766|INFO    |Listening to events.
  • 2 instances of api-gw restarted at 2020-06-06T12:35:23.629945606Z and 2020-06-06T12:35:01.402977676Z

Errors appearing container is not running

2020-06-06 12:40:00,004|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Executing 'submission-monitor'.
2020-06-06 12:40:00,009|CRITICAL|An exception in deck-chores occurred while executing submission-monitor in container 5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2:
2020-06-06 12:40:00,009|ERROR   |Container is not running.
  • deck-chores didn't recover from this. Even after 2 days the cron job is failing every 5 minutes with the same error:
2020-06-08 15:35:00,002|INFO    |vm119_api-gw.2.z1yfdmahe3ut5651ax3kg00pz: Executing 'submission-monitor'.
2020-06-08 15:35:00,009|CRITICAL|An exception in deck-chores occurred while executing submission-monitor in container 5c33cd04b1aeae3e265c0512957a2fb7866164d9de4a489b23fb472e806dc4e2:
2020-06-08 15:35:00,009|ERROR   |Container is not running.

was the job executed five minutes later? - No.
are there any log entries around that event that show that the container's jobs were removed or (re-)added? - No other log entries after the initial ones.
you seem to use the Swarm mode; was the container restarted on the same node that it ran before? - Yes, running on the same node.

@funkyfuture
Owner

funkyfuture commented Jun 21, 2020

it's clear that the stop of the containers that serve your api service went unnoticed by deck-chores, isn't it?

i have no clue whether the fact that you deploy in Swarm mode is relevant here. in any case, i will not try to recreate it, as i don't intend to support it.

the only thing i can suggest for now is that you run deck-chores at debug logging level, try to recreate the failing behaviour, and submit complete logs.

if we fail to figure out the root cause, an option might be to exit deck-chores when a container can't be found, so that the service manager can restart it if configured to do so.
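The fail-fast fallback described above could look roughly like this. This is a hypothetical sketch, not a patch against deck-chores: the function name `exec_job_or_die` and its parameters are invented, and it stands in for the check in `deck_chores/jobs.py` that currently raises `AssertionError('Container is not running.')`. The only change in behaviour it illustrates is raising `SystemExit(1)` instead, so a restart policy on the deck-chores service brings the process back up and its startup container scan rediscovers all jobs.

```python
# Hypothetical sketch of "exit on missing container": instead of raising
# an AssertionError that leaves a stale job firing forever, terminate the
# process with a non-zero status so the supervisor (e.g. Docker's restart
# policy) restarts deck-chores, which then rescans running containers.
import sys


def exec_job_or_die(container_status: str, job_name: str) -> None:
    if container_status != "running":
        print(
            f"Container for job {job_name!r} is not running; exiting "
            "so the service manager can restart deck-chores.",
            file=sys.stderr,
        )
        raise SystemExit(1)  # non-zero exit -> supervisor restarts us
    # ... otherwise execute the job's command in the container ...
```

With `restart_policy: condition: any` (or an equivalent supervisor setting) on the deck-chores service itself, a restart after this exit is effectively the forced rediscovery asked for in the original report.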

@nramesh25
Author

Sorry I missed this update. I tried figuring out the root cause but had no luck. I think an option to exit the container when a critical error is encountered may be a good idea.
