Random exceptions with WorkerManager

ThirumalaiK · February 25, 2023, 9:09am

I’m using Sanic Server with Google App Engine to serve one of our production services. However, we often see some random exceptions while serving the app.

Sanic Version: 22.12.0
Runtime: Python 3.10

It will be great if someone can shed some light on this.

 Traceback (most recent call last):
  File "/layers/google.python.pip/pip/bin/sanic", line 8, in <module>
    sys.exit(main())
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/sanic/__main__.py", line 12, in main
    cli.run(args)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/sanic/cli/app.py", line 119, in run
    serve(app)
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/sanic/mixins/startup.py", line 862, in serve
    manager.run()
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/sanic/worker/manager.py", line 95, in run
    self.monitor()
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/sanic/worker/manager.py", line 197, in monitor
    self._sync_states()
  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/sanic/worker/manager.py", line 288, in _sync_states
    state = self.worker_state[process.name].get("state")
  File "<string>", line 2, in __getitem__
  File "/layers/google.python.runtime/python/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod

sjsadowski · February 26, 2023, 1:41am

Is there any additional context around this? This traceback isn’t much to work with.

ahopkins · February 26, 2023, 9:34am

So, I guess the light that I can shed is that when running Sanic in its default worker-process mode, it will attempt to keeps its worker state in sync with all other processes. Not knowing any more about what your setup looks like it is hard to provide more detail. I am not sure what AppEngine provides if multiple processes are even an option. You might need to use single process mode.

ThirumalaiK · March 27, 2023, 3:28am

Thanks! It works fine with single process, but it doesn’t seem to be an issues with platform’s MultiProcessing support and can also be replicated with a simple helloworld app. GAE shutdowns the instances if it’s idle for 15mins, and this error mostly appears after SIGTERM.

I’ve upgraded runtime to Python 3.11 recently. From the traceback, it appears that it attempts to sync the state of worker process, which has already been shutdown/terminated.

 Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/mixins/startup.py", line 862, in serve
    manager.run()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/worker/manager.py", line 95, in run
    self.monitor()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/worker/manager.py", line 197, in monitor
    self._sync_states()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/worker/manager.py", line 288, in _sync_states
    state = self.worker_state[process.name].get("state")
            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "<string>", line 2, in __getitem__
  File "/layers/google.python.runtime/python/lib/python3.11/multiprocessing/managers.py", line 837, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'Sanic-Server-0-0'

ThirumalaiK · March 27, 2023, 3:34am

@ahopkins
It looks like this has already been addressed in one of your recent commits.

ahopkins · March 27, 2023, 5:10am

LMK please if you still experience problems.

ThirumalaiK · March 27, 2023, 9:04am

@ahopkins
WorkerProcess#set_state internally tries to get the worker state by process name. I can see KeyError there now.

We call this from WorkerManager#_sync_states in case of KeyError. Do we also need to handle this gracefully in WorkerProcess#set_state as well?

def set_state(self, state: ProcessState, force=False):
        if not force and state < self.state:
            raise Exception("...")
        self.state = state
        self.worker_state[self.name] = {
            **self.worker_state[self.name],
            "state": self.state.name,
        }

ahopkins · March 27, 2023, 9:12pm

This is DURING the lifetime of an app? Not shutdown? What OS? Anyway I can recreate this?

ThirumalaiK · April 1, 2023, 3:47pm

@ahopkins
I am not sure if I can help with a reproducible code as this looks like a race condition between monitor and shutdown operations.

Server Logs:

timestamp,severity,textPayload
2023-04-01T14:37:05.977581Z,INFO,[pid1-app] Exit 0 app (pid 11): /bin/sh -c exec sanic main:app --port=${PORT:-8080} --no-motd --no-access-logs --workers 4 [session:HKBFS3X]
2023-04-01T14:37:03.488038Z,,[INFO] [24] [Sanic-Server-2-0] sanic.server: Stopping worker [24]
2023-04-01T14:37:03.488028Z,,[INFO] [23] [Sanic-Server-1-0] sanic.server: Stopping worker [23]
2023-04-01T14:37:03.488017Z,,[INFO] [25] [Sanic-Server-3-0] sanic.server: Stopping worker [25]
2023-04-01T14:37:03.487999Z,,[INFO] [22] [Sanic-Server-0-0] sanic.server: Stopping worker [22]
2023-04-01T14:37:03.479585Z,,[INFO] [11] [MainProcess] sanic.root: Server Stopped
2023-04-01T14:37:03.479572Z,,KeyError: 'Sanic-Server-3-0'
2023-04-01T14:37:03.479337Z,,"    raise convert_to_error(kind, result)"
2023-04-01T14:37:03.479312Z,,"  File ""/layers/google.python.runtime/python/lib/python3.11/multiprocessing/managers.py"", line 837, in _callmethod"
2023-04-01T14:37:03.479266Z,,"  File ""<string>"", line 2, in __getitem__"
2023-04-01T14:37:03.479250Z,,"            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^"
"Traceback (most recent call last):,ERROR
 in _sync_states
2023-04-01T14:37:03.479238Z
2023-04-01T14:37:03.479213Z,,[ERROR] [11] [MainProcess] sanic.root: KeyError occurred while synchronizing worker states. err='Sanic-Server-3-0'
2023-04-01T14:37:03.479198Z,,KeyError: 'Sanic-Server-2-0'
2023-04-01T14:37:03.478801Z,,"    raise convert_to_error(kind, result)"
2023-04-01T14:37:03.478790Z,,"  File ""/layers/google.python.runtime/python/lib/python3.11/multiprocessing/managers.py"", line 837, in _callmethod"
2023-04-01T14:37:03.478777Z,,"  File ""<string>"", line 2, in __getitem__"
2023-04-01T14:37:03.478764Z,,"            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^"
"Traceback (most recent call last):,ERROR
 in _sync_states
2023-04-01T14:37:03.478754Z
2023-04-01T14:37:03.478730Z,,[ERROR] [11] [MainProcess] sanic.root: KeyError occurred while synchronizing worker states. err='Sanic-Server-2-0'
2023-04-01T14:37:03.478716Z,,KeyError: 'Sanic-Server-1-0'
2023-04-01T14:37:03.478074Z,,"    raise convert_to_error(kind, result)"
2023-04-01T14:37:03.478062Z,,"  File ""/layers/google.python.runtime/python/lib/python3.11/multiprocessing/managers.py"", line 837, in _callmethod"
2023-04-01T14:37:03.478049Z,,"  File ""<string>"", line 2, in __getitem__"
2023-04-01T14:37:03.478031Z,,"            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^"
"Traceback (most recent call last):,ERROR
 in _sync_states
2023-04-01T14:37:03.478022Z
2023-04-01T14:37:03.477998Z,,[ERROR] [11] [MainProcess] sanic.root: KeyError occurred while synchronizing worker states. err='Sanic-Server-1-0'
2023-04-01T14:37:03.477982Z,,KeyError: 'Sanic-Server-0-0'
2023-04-01T14:37:03.477269Z,,"    raise convert_to_error(kind, result)"
2023-04-01T14:37:03.477256Z,,"  File ""/layers/google.python.runtime/python/lib/python3.11/multiprocessing/managers.py"", line 837, in _callmethod"
2023-04-01T14:37:03.477240Z,,"  File ""<string>"", line 2, in __getitem__"
2023-04-01T14:37:03.477218Z,,"            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^"
"Traceback (most recent call last):,ERROR
 in _sync_states
2023-04-01T14:37:03.477207Z
2023-04-01T14:37:03.477174Z,,[ERROR] [11] [MainProcess] sanic.root: KeyError occurred while synchronizing worker states. err='Sanic-Server-0-0'
2023-04-01T14:37:03.468257Z,INFO,[pid1-nginx] Exit 0 nginx (pid 18): /usr/sbin/nginx -c /tmp/nginxconf-2913916072/nginx.conf [session:HKBFS3X]
2023-04-01T14:37:03.467932Z,,[INFO] [11] [MainProcess] sanic.root: Terminating a process: Sanic-Server-3-0 [25]
2023-04-01T14:37:03.466299Z,,[INFO] [11] [MainProcess] sanic.root: Terminating a process: Sanic-Server-2-0 [24]
2023-04-01T14:37:03.464831Z,,[INFO] [11] [MainProcess] sanic.root: Terminating a process: Sanic-Server-1-0 [23]
2023-04-01T14:37:03.461914Z,,[INFO] [11] [MainProcess] sanic.root: Terminating a process: Sanic-Server-0-0 [22]
2023-04-01T14:37:03.461724Z,,[INFO] [11] [MainProcess] sanic.root: Received signal SIGTERM. Shutting down.
2023-04-01T14:37:03.457982Z,INFO,[pid1-app] Propagating terminated signal to app (pid 11) [session:HKBFS3X]
2023-04-01T14:37:03.457973Z,INFO,[pid1-nginx] Propagating terminated signal to nginx (pid 18) [session:HKBFS3X]
2023-04-01T14:37:03.457963Z,INFO,[pid1] Waiting 5s to allow shutdown processing [session:HKBFS3X]
2023-04-01T14:37:03.457949Z,INFO,[pid1] Sending terminated signal to processes [session:HKBFS3X]
2023-04-01T14:37:03.457025Z,INFO,[pid1] Received terminated signal [session:HKBFS3X]
2023-04-01T14:36:04.183927Z,,[INFO] [24] [Sanic-Server-2-0] GAE-APP: initializing app configs
2023-04-01T14:35:58.441111Z,,[INFO] [23] [Sanic-Server-1-0] GAE-APP: initializing app configs
2023-04-01T14:35:57.999842Z,,[INFO] [22] [Sanic-Server-0-0] GAE-APP: initializing app configs
2023-04-01T14:35:56.962331Z,,[INFO] [25] [Sanic-Server-3-0] GAE-APP: initializing app configs
2023-04-01T14:35:56.931374Z,,[INFO] [22] [Sanic-Server-0-0] sanic.server: Starting worker [22]
2023-04-01T14:35:56.924851Z,,[INFO] [23] [Sanic-Server-1-0] sanic.server: Starting worker [23]
2023-04-01T14:35:56.923972Z,,[INFO] [25] [Sanic-Server-3-0] sanic.server: Starting worker [25]
2023-04-01T14:35:56.923941Z,,[INFO] [24] [Sanic-Server-2-0] sanic.server: Starting worker [24]
2023-04-01T14:35:55.382907Z,INFO,[pid1-nginx] Starting nginx (pid 18): /usr/sbin/nginx -c /tmp/nginxconf-2913916072/nginx.conf [session:HKBFS3X]
2023-04-01T14:35:55.375349Z,INFO,[pid1-nginx] Successfully connected to localhost:8081 after 1.268327928s [session:HKBFS3X]
2023-04-01T14:35:54.110717Z,INFO,[pid1-app] Starting app (pid 11): /bin/sh -c exec sanic main:app --port=${PORT:-8080} --no-motd --no-access-logs --workers 4 [session:HKBFS3X]
2023-04-01T14:35:54.106710Z,INFO,[pid1-nginx] nginx waiting for any of 3 prerequisite(s): [portbind:tcp:localhost:8080 portbind:tcp:localhost:8081 portbind:unix:/tmp/google-config/app.sock] [session:HKBFS3X]
2023-04-01T14:35:54.106298Z,INFO,"[pid1-app] app has no prerequisites, starting immediately [session:HKBFS3X]"
2023-04-01T14:35:54.106279Z,INFO,[pid1] Starting processes [app nginx] [session:HKBFS3X]
2023-04-01T14:35:54.105804Z,DEBUG,"[pid1-app] Using app start info from /srv/.googleconfig/app_start.json: &main.appStart{Entrypoint:struct { Type string ""json:\""type\""""; UnparsedValue string ""json:\""unparsed_value\""""; Command string ""json:\""command\""""; WorkDir string ""json:\""workdir\"""" }{Type:""User"", UnparsedValue:"""", Command:""sanic main:app --port=${PORT:-8080} --no-motd --no-access-logs --workers 4"", WorkDir:""""}, EntrypointFromAppYAML:"""", EntrypointContents:"""", Runtime:""python311""} [session:HKBFS3X]"
2023-04-01T14:35:54.100296Z,INFO,[pid1] started [session:HKBFS3X]

Looking at the implementation of WorkerManager and WorkerProcess implementation, it seems like they share the same worker_state object. So, it’s obvious to see KeyError if you are calling WorkerProcess#set_state right after seeing KeyError at WorkerManager#_sync_states. I think we should do something like this to handle this gracefully.

--- a/sanic/worker/process.py
+++ b/sanic/worker/process.py
@@ -37,10 +37,11 @@ class WorkerProcess:
         if not force and state < self.state:
             raise Exception("...")
         self.state = state
-        self.worker_state[self.name] = {
-            **self.worker_state[self.name],
-            "state": self.state.name,
-        }
+        if self.name in self.worker_state:
+            self.worker_state[self.name] = {
+                **self.worker_state[self.name],
+                "state": self.state.name,
+            }

Alternatively, the sync task can be skipped if the server is shutting down. I don’t know if it has any adverse effects.

--- a/sanic/worker/manager.py
+++ b/sanic/worker/manager.py
@@ -293,10 +293,10 @@ class WorkerManager:
         self.shutdown()
 
     def shutdown(self):
+        self._shutting_down = True
         for process in self.processes:
             if process.is_alive():
                 process.terminate()
-        self._shutting_down = True
 
     @property
     def pid(self):
@@ -311,11 +311,8 @@ class WorkerManager:
         return all(acked) and len(acked) == self.num_server
 
     def _sync_states(self):
-        for process in self.processes:
-            try:
+        if not self._shutting_down:
+            for process in self.processes:
                 state = self.worker_state[process.name].get("state")
-            except KeyError:
-                process.set_state(ProcessState.TERMINATED, True)
-                continue
-            if state and process.state.name != state:
-                process.set_state(ProcessState[state], True)
+                if state and process.state.name != state:
+                    process.set_state(ProcessState[state], True)

ahopkins · April 2, 2023, 6:29am

I think I need to revisit the safety of skipping, and why we are deleting it here:

github.com

sanic-org/sanic/blob/d62a92fac967097ada646f10cfa55ed2c6488cc4/sanic/worker/process.py#L80

    
      
                  logger.debug(
                      f"{Colors.BLUE}Terminating a process: "
                      f"{Colors.BOLD}{Colors.SANIC}"
                      f"%s {Colors.BLUE}[%s]{Colors.END}",
                      self.name,
                      self.pid,
                  )
                  self.set_state(ProcessState.TERMINATED, force=True)
                  try:
                      os.kill(self.pid, SIGINT)
                      del self.worker_state[self.name]
                  except (KeyError, AttributeError, ProcessLookupError):
                      ...
          
          
def restart(self, restart_order=RestartOrder.SHUTDOWN_FIRST, **kwargs):
              logger.debug(
                  f"{Colors.BLUE}Restarting a process: {Colors.BOLD}{Colors.SANIC}"
                  f"%s {Colors.BLUE}[%s]{Colors.END}",
                  self.name,
                  self.pid,
              )

You are probably right about it being a race condition.

Abu · May 15, 2023, 8:28am

Hello everyone! I am encountering exact same error as described here and https://github.com/sanic-org/sanic/pull/2725 didn’t really help, since it still tries to get the process by name to terminate, which results in the same error. Is there any updates on this?

ahopkins · May 15, 2023, 3:52pm

Can you post the error? When in the cycle do you see this? Startup? While running? Shutdown?

Abu · May 16, 2023, 11:49am

Thank you for responding so quick @ahopkins! Here I will try to provide as many details as possible, feel free to ask for more specifics if I miss anything

We also are using sanic server in a Google Cloud Run for one of our production services. the thing with cloud run is that it can auto scale up and down when needed, but we see these errors specifically when cloud run decides to kill the server by sending shutdown SIGTERM. Here is the example log for one of such cases:

As the result we are getting Experienced exception while trying to serve error log followed by this traceback:

Traceback (most recent call last):
  File "/interview/.venv/lib/python3.10/site-packages/sanic/worker/manager.py", line 316, in _sync_states
    state = self.worker_state[process.name].get("state")
  File "<string>", line 2, in __getitem__
  File "/usr/local/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'Sanic-Server-0-0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/interview/.venv/lib/python3.10/site-packages/sanic/mixins/startup.py", line 853, in serve
    manager.run()
  File "/interview/.venv/lib/python3.10/site-packages/sanic/worker/manager.py", line 122, in run
    self.monitor()
  File "/interview/.venv/lib/python3.10/site-packages/sanic/worker/manager.py", line 224, in monitor
    self._sync_states()
  File "/interview/.venv/lib/python3.10/site-packages/sanic/worker/manager.py", line 318, in _sync_states
    process.set_state(ProcessState.TERMINATED, True)
  File "/interview/.venv/lib/python3.10/site-packages/sanic/worker/process.py", line 41, in set_state
    **self.worker_state[self.name],
  File "<string>", line 2, in __getitem__
  File "/usr/local/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'Sanic-Server-0-0'

As you can see, the _sync_states method actually catches the KeyError exception, but it still get re-raised when calling process.set_state(ProcessState.TERMINATED, True).
There is not much more to it really, just these error logs happening here and there when cloud run is trying to stop the server.
We are using Sanic 23.3.0 and python 3.10.
Here is how we start the server:

loader = AppLoader(factory=partial(create_app))
app = loader.load()
host = os.environ.get('LOCAL_SERVER_HOST', '0.0.0.0')
app.prepare(
        host=host,
        port=int(app.config['PORT']),
        dev=_local,
        debug=_local,
        single_process=not _local,
)
Sanic.serve(primary=app, app_loader=loader)

as you can see, we use single process server, since we don’t really need a worker manager (cloud run already takes care about scaling up the app when needed). However, I was a bit confused that according to the error log above, we are still involving worker manager in the process, but maybe that is expected.

I hope it is clearer now, but let me know if otherwise!

ahopkins · May 22, 2023, 8:01pm

Thanks! This is very helpful.

javierdialpad · December 4, 2023, 2:40pm

Hey folks, getting the same error while running Sanic v23.6 on GAE. Does it have a fix?

ahopkins · December 7, 2023, 7:09pm

When the process is shutting down? Any other details to share?

javierdialpad · December 12, 2023, 2:24pm

Yes, it happens occasionally when closing the server after a SIGTERM is received (coming from App Engine’s auto-scaler I presume).

Some relevant details:

python 3.9
sanic v23.6
entrypoint: sanic main.app --host 0.0.0.0 --port 8080 --workers 2 --no-access-logs

It’s pretty similar to what @Abu posted. Looks like when the monitor process runs and tries to sync state, at least one of the workers was already removed from the dictionary by the terminate method.

After getting a KeyError here https://github.com/sanic-org/sanic/blob/78c44ed5e4947833df3f5a73fa966881bcdebd9f/sanic/worker/manager.py#L386, it tries to retrieve the process again from the same dict, which will inevitably raise an exception.

Relevant logs:

2023-12-12T12:16:07.348083Z INFO:sanic.root:Received signal SIGTERM. Shutting down.

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/sanic/mixins/startup.py", line 880, in serve
    manager.run()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/sanic/worker/manager.py", line 122, in run
    self.monitor()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/sanic/worker/manager.py", line 224, in monitor
    self._sync_states()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/sanic/worker/manager.py", line 318, in _sync_states
    process.set_state(ProcessState.TERMINATED, True)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/sanic/worker/process.py", line 41, in set_state
    **self.worker_state[self.name],
  File "<string>", line 2, in __getitem__
  File "/layers/google.python.runtime/python/lib/python3.9/multiprocessing/managers.py", line 825, in _callmethod
KeyError: 'Sanic-Server-0-0'

Thank you!

ahopkins · December 12, 2023, 2:43pm

Okay, shutdown is getting another bit of an upgrade in the next release. Can you report back after it’s out?

javierdialpad · December 12, 2023, 2:51pm

Of course, will do. Thanks!

javierdialpad · February 26, 2024, 2:42pm

@ahopkins reporting back - trying out Sanic v23.12.1 and still occasionally getting these exceptions on shutdown

  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/mixins/startup.py", line 1146, in serve
    manager.run()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/worker/manager.py", line 189, in run
    self.monitor()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/worker/manager.py", line 294, in monitor
    self._sync_states()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/sanic/worker/manager.py", line 427, in _sync_states
    state = self.worker_state[process.name].get("state")