How to improve requests per second

When I create a server using the Sanic example in the wiki, have the handler just return empty(), and bombard it with requests.get(), I can only get around 10k requests per second.

I can see on here that people often get 70k+ requests per second on simple requests. I'm not sure how people are achieving this.

I am on Windows; I'm sure that is half the problem.

Sanic itself cannot make Python run faster. With that said, it uses patterns that make it run more efficiently. Benchmarks are entirely dependent upon the architecture you are using, and almost always suspect.

Personally, I use them so that I can judge Sanic’s change in performance. We have a policy that new features cannot slow down its operation unless there is some overwhelming need.

With that said, my base benchmark on my machine is about 120-125k for 20.12 and earlier, and 130-135k for the current master. This is achieved by making use of multiple cores to handle responses. If you are running a single process the numbers will be much lower.

I have tested with about 20 Python instances running requests.get() in a while loop, though throughput stops improving after about 4 or 5 instances.

When you say using multiple cores to handle responses, do you mean multiple “workers” on the Sanic side? If so, what might be the best way for me to at least run 2 different pages on separate threads, especially on Windows?

Let's say /index runs on one thread and /page2 runs on another.

Can you send me your script? I would highly suggest using this: GitHub - wg/wrk: Modern HTTP benchmarking tool

Sanic is not threaded and you cannot assign endpoints to run in threads.

Yes, I take advantage of multiple cores using multiple worker processes.

I tested on Linux (I assume it is using uvloop there) and got 18k requests per second with a single worker; Windows gets 11.2k. With 32 workers, Linux gets 32k. I'm using WSL, so I'm sure performance suffers a bit.

My app is a queue manager, so it's important that requests return in order, and I'm not sure that's possible with multiple workers. The other page of the app receives data about the machines using the queue manager. I might just make a second server and use a relay to deliver them both on the same port. Any suggestions inside of Python? I'd prefer not to use nginx if I don't have to.

I have a 3950X, so it's not like I'm running on some old laptop CPU or anything, haha. But 18k per second is actually acceptable for what I want. 11k is just on the lower end of what I'd like to have.

I run 20 of these for a total of 1 million requests:

import urllib3, time

http = urllib3.PoolManager()
url = 'http://127.0.0.1/'
tasks = 50000

start = int(time.time()*1000)

for t in range(tasks):
    response = http.request('GET', url)
    # Drop all pooled connections so the next request opens a new one
    http.clear()

print(int(time.time()*1000) - start)
time.sleep(5)

against a server that looks like this:

from sanic import Sanic
from sanic.response import empty

app = Sanic("hello_example")

@app.route("/")
async def test(request):
  return empty()

if __name__ == "__main__":
  app.run(host="0.0.0.0", port=80)

Hi @LordOdin,

I tried your code above to see what I can find.

I’m using Ubuntu 20.04 and running these tests on my low-power laptop. It has a 4-core Intel Core i5 mobile CPU @ 1.7GHz, so the results are quite a bit lower than what you’re seeing on your Ryzen.

I’m testing on Sanic 20.12.1, and master (21.03a). The quoted results are for 20.12, with 21.03a results in parentheses after it. I’m using just one Sanic worker.

Firstly, using your Python (urllib3) based method, I ran 1 million requests spread across 20 processes using Python multiprocessing. It took 252.9 seconds (master 287.9), which equates to just 3954 requests per second (master 3484/sec). (We’ll need to check out why master is slower in this test.)
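For reference, the requests-per-second figures in this post are total requests divided by wall-clock time. A minimal sketch of that arithmetic follows; the function names are made up for illustration, and the per-client estimate assumes each of the 20 clients issues its requests strictly one after another.

```python
def requests_per_second(total_requests, elapsed_seconds):
    """Aggregate throughput over the whole run."""
    return total_requests / elapsed_seconds


def mean_ms_per_request(total_requests, elapsed_seconds, clients):
    """Average time one client spends per request, assuming every client
    sends requests sequentially for the full duration of the run."""
    return elapsed_seconds * clients / total_requests * 1000


rps = requests_per_second(1_000_000, 252.9)
per_request_ms = mean_ms_per_request(1_000_000, 252.9, clients=20)
print(f"{rps:.0f} requests/sec, about {per_request_ms:.2f} ms per request per client")
```

The second number is worth watching alongside the first: aggregate throughput can look healthy while each individual client still waits several milliseconds per round trip.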

Next, I tried exactly the same example using wrk for 60 seconds.

wrk -c 20 -d 60 -t 20 http://127.0.0.1

With this test I got 9434 requests/sec (master 9648/sec).
That is about 2.4 times faster. I think one of the differences is that each wrk worker holds its TCP connection open between requests.

It looks to me like you’ve deliberately written your Python tester code to close the TCP connection between requests (with the http.clear() call). This is sensible because it simulates real-world conditions where each connection might be coming from a different client, so Sanic cannot reuse any existing connections.

Putting that aside, you can get the Python benchmark numbers a bit higher by reusing the connection, that is, by taking out the http.clear() line. On my laptop this now completes 1 million requests in 168.1 (master 152.8) seconds, which comes out to 5948 requests/sec (master 6544/sec).

Unfortunately you can’t do the opposite test to force wrk to close connections between requests, because wrk doesn’t work that way.

All of this is a far cry from the approx 20k requests/sec I’d expect to be seeing on Sanic with a single worker, so I might look further into this testing methodology to see where I can find some more gains.

Update:
Testing the same code on my work computer, an 8-core Intel Core i9 @ 3.4GHz:
Python urllib3 test 1 (with connection close), 1 million requests across 20 workers:
133.3 seconds total, 7.5k requests/sec (master 164.6sec, 6k/sec)
Python urllib3 test 2 (no connection close):
97.3 seconds, 10.2k requests/sec (master 97.2sec, 10.3k/sec)
wrk 20 threads for 60 seconds:
12.5k requests/sec (master 12.6k req/sec)
wrk 20 client threads and 8 server workers:
62k requests/sec


The reason I close the connection is that I use Sanic to run a queue manager, which can have over 2000 machines connected for hours. I’ve discovered a few times that having so many machines reusing connections causes them to error out every so often. I'm not sure what causes that, so to be safe we don’t reuse connections at the moment.

When testing on WSL I was able to get 23k requests per second for a single thread. Windows was 11k.

Using wrk results in similar speeds for a single worker. On Windows it gets socket errors, but not on WSL.

I can get 190k requests per second on WSL with 32 workers. But the goal is to make a single thread as fast as possible, mostly because this app will run primarily on Windows machines. I have a dictionary that is constantly being read and written to and is kept in sync across all the pages. I'm not sure I can do that with multiple workers?

I know people use a database for this kind of stuff, but I want the fastest possible speed and low CPU usage. It’s important that I can get a sub-millisecond response time for the entire request.

One way of doing what you’re asking would be to use an external Redis cache. It can act like a dictionary outside of Python, and each of your Sanic workers can connect to that. That might solve the problem you’re seeing.

Thinking about it more, I wonder if Python is really the best tool for your application.
Sanic is really pushing the limits of what a web server in Python is able to do.

If you absolutely require 18k requests/sec and sub-millisecond response times on a single thread on Windows machines (which may not necessarily be running beefy CPUs), then Python is a poor choice.

The numbers we’re seeing in the tests above are for empty responses. When you start to layer on application code and proper responses with real content, your response time will creep up and requests/sec will go down. You’ll be constantly benchmarking your code and doing micro-optimizations everywhere, to keep your throughput up.

Without knowing any more details about your application and other requirements, personally in this case I’d reach for one of the super fast Rust HTTP servers: something very low level like Hyper if you want no extras and super-fast responses, or a proper web server framework like Rocket for more ease of use.

The application is a queue manager: a machine asks for a task and gets the next one in the queue. If a task fails or the machine doing it turns off, the task gets requeued. The reason I am looking for raw requests per second is that at times we will have 5,000 to 10,000 machines connected asking for tasks. With 10,000 machines each reporting status every 10 seconds, that alone is 1,000 requests per second, without doing any actual tasks.

At the moment each request does about 10 to 20 dictionary operations, so doing that with an external DB seems highly unlikely. It looks like Redis has some request limits as well: 70k without pipelining?

We have taken a slightly different approach to get better throughput: making fewer requests in the first place. If anyone has a better idea for how to join multiple endpoints like this into a single request, feel free to share :)
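To make the described behaviour concrete, here is a minimal in-memory sketch of such a queue: hand out the next task, requeue on failure, and requeue tasks whose machine has stopped sending heartbeats. The class name, method names, the 30-second timeout, and the decision to put requeued tasks at the front are all assumptions for illustration; the actual application is not shown in this thread.

```python
import time
from collections import deque


class TaskQueue:
    """Hypothetical single-process task queue with heartbeat-based requeue."""

    def __init__(self, heartbeat_timeout=30.0, clock=time.monotonic):
        self._pending = deque()   # tasks waiting to be handed out, in order
        self._running = {}        # task -> (machine, time of last heartbeat)
        self._timeout = heartbeat_timeout
        self._clock = clock       # injectable so the logic can be tested

    def add(self, task):
        self._pending.append(task)

    def request_task(self, machine):
        """Return the next task in order, or None if the queue is empty."""
        if not self._pending:
            return None
        task = self._pending.popleft()
        self._running[task] = (machine, self._clock())
        return task

    def heartbeat(self, task):
        if task in self._running:
            machine, _ = self._running[task]
            self._running[task] = (machine, self._clock())

    def complete(self, task):
        self._running.pop(task, None)

    def fail(self, task):
        """Put a failed task back at the front (an assumed policy)."""
        if self._running.pop(task, None) is not None:
            self._pending.appendleft(task)

    def requeue_stale(self):
        """Requeue tasks whose machine has been silent for too long."""
        now = self._clock()
        stale = [t for t, (_, seen) in self._running.items()
                 if now - seen > self._timeout]
        for task in reversed(stale):  # reversed keeps their original order
            del self._running[task]
            self._pending.appendleft(task)
        return stale
```

In a single Sanic worker, a handler that calls these methods without awaiting anything in between runs to completion before the next handler starts, which is what keeps the hand-out order intact. With several worker processes each one would hold its own separate `TaskQueue`.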

async def uber_api(request):
    args = request.json
    response_body = {}
    for endpoint in args.items():
        if endpoint[0] == 'request_task':
            response_body[endpoint[0]] = _request_task(endpoint[1])

        elif endpoint[0] == 'task_data':
            response_body[endpoint[0]] = _task_data(endpoint[1])

        elif endpoint[0] == 'heart_beat':
            response_body[endpoint[0]] = _heart_beat(endpoint[1])

        elif endpoint[0] == 'job_list':
            response_body[endpoint[0]] = _job_list()

        elif endpoint[0] == 'active_machines':
            response_body[endpoint[0]] = get_machines()

        elif endpoint[0] == 'summary':
            response_body[endpoint[0]] = _summary()

        elif endpoint[0] == 'job':
            response_body[endpoint[0]] = _job(endpoint[1])

        elif endpoint[0] == 'job_tasks':
            response_body[endpoint[0]] = _job_tasks(endpoint[1])

        elif endpoint[0] == 'job_details':
            response_body[endpoint[0]] = _job_details(endpoint[1])

        elif endpoint[0] == 'job_status':
            response_body[endpoint[0]] = _job_status(endpoint[1])

        elif endpoint[0] == 'node_job':
            response_body[endpoint[0]] = _node_job(endpoint[1])

        elif endpoint[0] == 'heartbeat_task':
            response_body[endpoint[0]] = _heartbeat_task(endpoint[1])

        elif endpoint[0] == 'delete_job':
            response_body[endpoint[0]] = _delete_job(endpoint[1])

    return json(response_body, escape_forward_slashes=False)
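One possible way to shorten the if/elif chain is a dictionary that maps each name to its handler. The sketch below keeps the same behaviour (unknown names are skipped, results are keyed by name). The two handler bodies are hypothetical stand-ins, since the real `_request_task`, `_job_list` and friends are not shown in this thread.

```python
def _request_task(payload):
    # Hypothetical stand-in for the real handler
    return {"task_for": payload}


def _job_list():
    # Hypothetical stand-in; like the original, it takes no arguments
    return ["job-1", "job-2"]


# Every entry takes exactly one argument so the loop can stay uniform.
HANDLERS = {
    "request_task": _request_task,
    "job_list": lambda _payload: _job_list(),
}


def dispatch(args, handlers=HANDLERS):
    """Run each requested sub-call and collect the results by name."""
    response_body = {}
    for name, payload in args.items():
        handler = handlers.get(name)
        if handler is not None:  # unknown names are ignored, as before
            response_body[name] = handler(payload)
    return response_body
```

The Sanic handler body would then reduce to `return json(dispatch(request.json), escape_forward_slashes=False)`. Adding a new sub-call becomes a one-line change to `HANDLERS`, and the lookup cost no longer grows with the position of the name in the chain.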

The endpoint routing is LRU cached. Since it looks like the paths are not dynamic, it should be only the first trip that is a different speed. Any subsequent requests hit the router cache. There is a max number of endpoints, so I guess it depends how many other requests you are handling.

Anyway, one thing to do would be to warm up the cache at start time. Have you tested against master?

Well, the goal here is to eliminate the extra latency imposed by the network by doing multiple API calls in 1 request. I have not tested against master and I’m not entirely sure how to do so. What is this warming up you speak of?

We could technically send a single POST request to the server and get a response from every endpoint on this list without having to do another request.

There are usually over 1000 machines sending requests to the server (the goal is to be over 10k).
request_task and task_data are always submitted back to back: we request a task > compute the task > send data back to the queue manager > request another task. So now we can request a task and submit data in 1 POST request.

After this change we are getting amazing throughput, but we get an error after a few tens of thousands of tasks have run (I assume it has something to do with using a new port for every connection?)

using this code

        http = urllib3.PoolManager()
        r = http.request('POST', url, headers={'Content-Type': 'application/json'}, body=json.dumps(json_data))
        return r

we get this error

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=7777): Max retries exceeded with url: /uber_api/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001CE4C5127F0>: Failed to establish a new connection: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted'))
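A common cause of WinError 10048 on the client side is running out of ephemeral ports: every new connection takes a fresh local port, and closed ones linger for a while before the OS releases them. The sketch below uses only the standard library (`http.client` and `http.server`, as a stand-in for urllib3 and Sanic) to show the difference between opening a connection per request and reusing one. The names `run_demo` and `seen_client_ports` are made up for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_client_ports = set()  # local ports of clients, as seen by the server


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allow keep-alive connections

    def do_GET(self):
        seen_client_ports.add(self.client_address[1])
        self.send_response(204)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo(n_requests=50):
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]

    # Case 1: one connection, reused for every request
    seen_client_ports.clear()
    conn = http.client.HTTPConnection("127.0.0.1", port)
    for _ in range(n_requests):
        conn.request("GET", "/")
        conn.getresponse().read()
    conn.close()
    ports_when_reusing = len(seen_client_ports)

    # Case 2: a brand new connection for every request
    seen_client_ports.clear()
    for _ in range(n_requests):
        conn = http.client.HTTPConnection("127.0.0.1", port)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()
    ports_when_reconnecting = len(seen_client_ports)

    server.shutdown()
    server.server_close()
    return ports_when_reusing, ports_when_reconnecting
```

With urllib3 the analogous change is to create the `PoolManager` once (for example at module level) and call `request` on that same object, rather than constructing a new `PoolManager` inside the function that sends each request.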

By warming up, I meant hitting the endpoints so that they are already cached. Something like this:

@app.listener("after_server_start")
async def warmup(app, loop):
    app.test_client.get("...")

Another thing worth checking into might be to pass in your own version of the router. You can subclass it, and the Sanic() instantiator will accept router as an argument. I am on a tablet now, but can send you code later if you want.

Where is that connection pool being instantiated? Perhaps you can squeeze some juice using a pattern to leverage HTTP/2 pipelining your requests.

https://www.python-httpx.org/http2/

At the moment I have about 20 of them running on my local machine (localhost). We tried keeping connections alive, and my partner in Germany (140ms away) gets this message on his connection. We might be able to check out httpx. It's important that everything is synchronous and completes in order; otherwise we wouldn’t have big performance concerns.

Connection pool is full, discarding connection: xx.xxx.xxx.x

We are only connecting to 1 host (my IP).

We are basically doing a for loop with a new connection each time, and we want that for loop to run as fast as possible. I don't know if we can reuse connections on 10k+ machines connected to a single-threaded Sanic server without hurting the performance of the server.