How can I improve performance in a CPU-bound task?

Hi, dear friends!
I am testing Llama with Sanic, with workers = number of CPUs = 2.
The code is as follows:

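from asgiref.sync import sync_to_async
from llama_cpp import Llama  # assumption: Llama is the llama-cpp-python class
from sanic.response import json

@app.listener("before_server_start")
async def load_model(app, loop):
    # Load the model once per worker process.
    app.ctx.llam = Llama(model_path="./model")

@app.get("/get")
async def handler(request):
    question = request.args.get("question")
    # res = app.ctx.llam(question)  # takes about 4 seconds and blocks everything
    res = await sync_to_async(func=app.ctx.llam)(question)  # awaiting gives no improvement here
    return json({"answer": res["choices"]})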

I send 100 requests to the server with JMeter.
The answers always come back two at a time, with a block of about 4 seconds in each round,
and both CPUs are at 100% the whole time.
As we know, Llama is a CPU-bound task, so async/await did not help.
But I would still like to ask: how can I improve the performance?

What’s sync_to_async? I suppose it does essentially the same as this code:

import asyncio
from concurrent.futures import ThreadPoolExecutor
from sanic.response import json

@app.before_server_start
async def main_start(app, loop):
    # One explicit thread pool per worker process.
    app.ctx.threadexec = ThreadPoolExecutor(
        max_workers=2, thread_name_prefix="llam-worker"
    )

@app.after_server_stop
async def server_stop(app, loop):
    app.ctx.threadexec.shutdown()

@app.get("/get")
async def handler(request):
    question = request.args.get("question")
    # Run the blocking call in the pool so the event loop stays free.
    res = await asyncio.get_running_loop().run_in_executor(
        app.ctx.threadexec, app.ctx.llam, question
    )
    return json({"answer": res["choices"]})

This is explicit about the number of threads. It is hard to say why you get answers in 2-by-2 blocks, other than that you have two Sanic workers running as separate processes, but then you should get two at a time, not four. I suppose it is possible that Llama and/or Python GIL locks interfere such that two requests per worker are released at the same time, even if they are calculated sequentially, or maybe that sync_to_async actually runs two threads but no more at a time…
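
To see how the pool size alone can produce batches, here is a minimal sketch using only the standard library. fake_llm is a hypothetical stand-in that sleeps for 0.2 s in place of the roughly 4 s Llama call; because time.sleep releases the GIL, it models the optimistic case where the inference code does the same. With max_workers=2, four requests should finish in two rounds rather than all at once:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm(question: str) -> str:
    """Hypothetical stand-in for the blocking Llama call (0.2 s, not 4 s)."""
    time.sleep(0.2)
    return f"answer to {question}"


async def serve(questions, max_workers):
    """Run all questions through one thread pool; return finish times in seconds."""
    loop = asyncio.get_running_loop()
    finish_times = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        start = time.perf_counter()

        async def handle(question):
            # Each "request" waits for a free thread, then for the call itself.
            await loop.run_in_executor(pool, fake_llm, question)
            finish_times.append(time.perf_counter() - start)

        await asyncio.gather(*(handle(q) for q in questions))
    return finish_times


if __name__ == "__main__":
    times = asyncio.run(serve(["q1", "q2", "q3", "q4"], max_workers=2))
    print([round(t, 1) for t in times])  # finish times, in completion order
```

Raising max_workers in this sketch shortens the total time only because the stand-in sleeps instead of computing. With both CPUs already at 100%, extra threads would most likely just wait for the same two cores.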

llam can take at least 4 seconds to answer a question, and llam is not an async function.
sync_to_async is imported from asgiref; with it, llam can be awaited.

So it could block for 4 seconds, preventing anything else from happening. Sounds like this is likely the issue.
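
A small standard-library sketch of what "blocking" means here, assuming a hypothetical slow_inference function that sleeps for 0.3 s in place of the model call. A heartbeat task counts how often the event loop gets to run while the call is in progress: called directly from a coroutine, the loop is frozen; handed to a thread, the loop keeps ticking.

```python
import asyncio
import contextlib
import time


def slow_inference(prompt: str) -> str:
    """Hypothetical stand-in for a blocking model call."""
    time.sleep(0.3)
    return prompt.upper()


async def call_directly(prompt):
    # Blocking call inside the coroutine: the event loop is frozen meanwhile.
    return slow_inference(prompt)


async def call_in_thread(prompt):
    # Same call handed to the default thread pool: the loop keeps running.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_inference, prompt)


async def ticks_during(call) -> int:
    """Count how often a 50 ms heartbeat fires while `call` is in progress."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.05)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    await call("hello")
    beat.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await beat
    return ticks


if __name__ == "__main__":
    print("direct call:", asyncio.run(ticks_during(call_directly)), "ticks")
    print("in a thread:", asyncio.run(ticks_during(call_in_thread)), "ticks")
```

Keeping the loop responsive does not make the inference itself any faster. It only means other requests can still be accepted and answered while the model is busy.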

Sounds like you are in the classic use case where you need to push work off to a background worker. There are a lot of patterns you could follow, from using a dedicated package like Celery to rolling your own inside Sanic using the worker manager (like this: Pushing work to the background of your Sanic app).
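
As a rough illustration of the background-worker idea, here is a minimal in-process sketch using only the standard library. It is not Celery and not the Sanic worker manager; JobQueue, submit, run_worker, and fake_llm are hypothetical names made up for this example. A request would only enqueue the question and get a job id back immediately, while a single worker task drains the queue and stores the answers for later pickup.

```python
import asyncio
import contextlib
import itertools
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm(question: str) -> str:
    """Hypothetical stand-in for the blocking Llama call."""
    time.sleep(0.1)
    return f"answer to {question}"


class JobQueue:
    """Hypothetical in-process job queue; create it inside a running event loop."""

    def __init__(self, max_workers: int = 1):
        self.queue = asyncio.Queue()
        self.results = {}  # job_id -> answer, or None while still pending
        self._ids = itertools.count(1)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, question: str) -> int:
        """Enqueue a question and return its job id without waiting."""
        job_id = next(self._ids)
        self.results[job_id] = None
        self.queue.put_nowait((job_id, question))
        return job_id

    async def run_worker(self):
        """Process queued jobs one by one, off the event loop thread."""
        loop = asyncio.get_running_loop()
        while True:
            job_id, question = await self.queue.get()
            try:
                self.results[job_id] = await loop.run_in_executor(
                    self._pool, fake_llm, question
                )
            finally:
                self.queue.task_done()

    def close(self):
        self._pool.shutdown()


async def demo():
    jobs = JobQueue()
    worker = asyncio.create_task(jobs.run_worker())
    ids = [jobs.submit(q) for q in ("q1", "q2", "q3")]
    pending = [jobs.results[i] for i in ids]  # nothing has been computed yet
    await jobs.queue.join()  # wait until every queued job is processed
    worker.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await worker
    jobs.close()
    return ids, pending, dict(jobs.results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a web app, submit would sit behind one endpoint that returns the job id, and a second endpoint would look the id up in results. A dedicated package adds what this sketch leaves out, such as persistence, retries, and workers on other machines.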