Long running IO-blocking background tasks and Sanic


I have a web API, using Sanic, that basically serves files from a directory on disk.

The twists are:

  • The list of files changes over time. New files are added, old ones are deleted.
  • For some reason the file system is very slow (%$#$ M$…) and blocks the whole thing for about half a minute on each call.

The simple approach was to start a forever running task which modifies app context.

import asyncio
import glob
from pathlib import Path

async def ten_second_job(app):
    while True:
        file_list = glob.glob(f"{app.config.INCOMING_DIR}/{FILE_GLOB}")
        app.ctx.file_list = sorted([Path(f).stem for f in file_list])
        await asyncio.sleep(11)


Bad idea: the glob blocks the whole app for a really long time because the file system is slow.

So the idea was to start a thread or process in the background so it does not interfere with the main web service.

First try

was something like this:

async def ten_second_job(app):
    loop = asyncio.get_running_loop()
    while True:
        app.ctx.file_list = await loop.run_in_executor(None, read_files_thread, app)
        await asyncio.sleep(11)

and read_files_thread now does the glob call from above.
This works somewhat better, but if I have n workers I also have n background tasks running.
That can’t be good for file system performance, especially as they basically all run at once.
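read_files_thread isn’t shown in the post; here is a minimal sketch of what it presumably does, moving the glob from the first snippet into a plain synchronous function (the FILE_GLOB value is illustrative, the real pattern isn’t shown):

```python
import glob
from pathlib import Path

FILE_GLOB = "*.json"  # illustrative value; the real pattern isn't shown in the post

def read_files_thread(app):
    # plain synchronous function: the blocking glob runs in the
    # executor thread, so the event loop stays free
    file_list = glob.glob(f"{app.config.INCOMING_DIR}/{FILE_GLOB}")
    return sorted(Path(f).stem for f in file_list)
```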

This could be amended by adding a random offset to the sleep() call.
Or by adding a lock around the call (there is a Redis-based lock for another task already implemented). But then the workers would have different lists…
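The random-offset idea could look like this (a sketch; the 0–5 s spread is an arbitrary choice, and read_files_thread here is just a stand-in for the blocking glob helper):

```python
import asyncio
import random

def read_files_thread(app):
    # stand-in for the blocking glob helper from the post
    return []

async def ten_second_job(app):
    loop = asyncio.get_running_loop()
    while True:
        app.ctx.file_list = await loop.run_in_executor(None, read_files_thread, app)
        # random offset so n workers don't all hit the file system at once
        await asyncio.sleep(11 + random.uniform(0, 5))
```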

Second try

The next idea was starting one empty main process, one child process for read_files_thread(), and another for Sanic (there is a question of how to run Sanic in a subprocess here).

But I can’t figure out how to update the app.ctx from the parent process…


So back to the drawing board:

What is the best way to update `app.ctx` with the results from a long-running, blocking background subtask that should only run in a single instance?

You cannot. At least not without some sort of broker in between. Why do you need the list of files? Is it part of an endpoint? Why not just do a lookup when it is requested?

The files are kind of “work items”: they come in from an automatic process, need some manual QA (which is this app), and are then sent off somewhere else.
The list of files is shown to the users.
The list of files, in the backend, also contains their contents (and some meta info), so I need to read the files too, and that took too long to do on request. That’s the half minute that the server is blocked.

A third way I thought about would be creating a simple REST API server that only serves the file list.
But that server would have the same problem of being blocked while reading the files…

The main pain point is to avoid blocking the main web server during the file processing.
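Short of switching libraries, the same executor trick from the second try also works for reading the file contents themselves. A sketch using asyncio.to_thread (Python 3.9+); the helper names are invented for illustration:

```python
import asyncio
from pathlib import Path

def _read_one(path: Path) -> str:
    # plain blocking read; runs in a worker thread
    return path.read_text()

async def read_contents(paths):
    # off-load each blocking read so the event loop keeps serving requests
    return await asyncio.gather(*(asyncio.to_thread(_read_one, p) for p in paths))
```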

If reading is blocking (can you not try aiofiles?) then you definitely need a background process, I agree. A separate service might work. What comes to mind is a small project I presented at a conference last year. I have been meaning to turn it into a full package. It might be helpful.

It is using a file to send info back and forth between the processes. I would just replace that with a pubsub. There are other sync primitives in the multiprocessing lib you could also try, but I would not go that route.
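A minimal sketch of that file-based handoff (the path and function names are mine, not the project’s API): one writer process publishes the list atomically, and any number of workers reload it cheaply.

```python
import json
import tempfile
from pathlib import Path

STATE_FILE = Path(tempfile.gettempdir()) / "file_list.json"  # assumed location

def publish_file_list(file_list):
    # write-then-rename so readers never see a half-written file
    tmp = STATE_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(file_list))
    tmp.replace(STATE_FILE)

def load_file_list():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return []
```

Swapping the file for a pubsub channel, as suggested, keeps the same shape: one publisher, many subscribers.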

I totally didn’t know about this…
I thought file access was always blocking and never looked into that… :((

First test seems really promising.

The saje project:

You can put an object in app.ctx?
And where does it get executed? On each worker, or only once per Sanic app?
I’ll look into this, seems interesting.

You can put anything you want in it. But, Sanic will not be able to sync it between processes. For that, you need to add something.

Not sure I understand the question.