Hi, dear friends!
I'm trying to test Llama with Sanic, with workers = cpuNum = 2.
The code is as follows:
from asgiref.sync import sync_to_async  # assuming sync_to_async comes from asgiref
from llama_cpp import Llama
from sanic import Sanic
from sanic.response import json

app = Sanic("llama-test")

@app.listener("before_server_start")
async def load_model(app, loop):
    app.ctx.llam = Llama(model_path="./model")

@app.get("/get")
async def handler(request):
    question = request.args.get("question")
    # res = app.ctx.llam(question)  # a direct call takes about 4 seconds and blocks everything
    res = await sync_to_async(func=app.ctx.llam)(question)  # awaiting gives no improvement here
    return json({"answer": res["choices"]})
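For reference, my understanding (an assumption on my part, I have not read the asgiref internals closely) is that the sync_to_async line above just pushes the blocking call onto an executor thread, roughly equivalent to this sketch; the executor and infer names here are mine, not from my real code:

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # illustrative; sync_to_async manages its own threads

async def infer(llm, question):
    loop = asyncio.get_running_loop()
    # the event loop is freed while the call runs in a thread,
    # but the inference itself still needs the same CPU time
    return await loop.run_in_executor(executor, llm, question)

Which matches what I see: awaiting keeps the loop responsive, but each inference still takes about 4 seconds.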
I sent 100 JMeter requests to the server. The answers always come back two at a time, with a block of about 4 seconds in each round, and both CPUs stay at 100%.
As we know, Llama inference is a CPU-bound task, so async/await is of no help here.
But I still want to ask: how can I improve the performance?
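One direction I've been thinking about (just a sketch under my own assumptions, not something I've measured; the _init_worker and _infer helpers are illustrative): keep one Sanic worker for the HTTP side and push inference into a ProcessPoolExecutor, so the event loop stays free while the pool uses both CPUs. Each pool process would load its own copy of the model, which assumes there is enough RAM for two copies:

import asyncio
from concurrent.futures import ProcessPoolExecutor

from llama_cpp import Llama

def _init_worker():
    # runs once in each pool process; loads a private copy of the model
    global _llm
    _llm = Llama(model_path="./model")

def _infer(question):
    return _llm(question)

pool = ProcessPoolExecutor(max_workers=2, initializer=_init_worker)

async def infer(question):
    loop = asyncio.get_running_loop()
    # the handler can await this without blocking other requests,
    # but each inference still costs the same CPU time
    return await loop.run_in_executor(pool, _infer, question)

Would something like this actually help, or am I just moving the same CPU work around?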