Proxying a video with range headers

Hi, I’m trying to build a web app that proxies Reddit-hosted video for easier sharing/downloading. However, I’m running into issues with browser requests that ask for a specific range of the video, which they do through the Range header.

Here is a snippet of code that demonstrates the problem:

import aiohttp
from pprint import pprint
from sanic import Sanic
from sanic.response import stream

app = Sanic(__name__)
session = aiohttp.ClientSession()  # in the real app this is created on server start

@app.route("/test.mp4")
async def test_vid(request):
    url = "https://v.redd.it/nki3zh3yvbv31/DASH_480?source=fallback"
    # Copy the browser's headers, drop the ones Reddit shouldn't see,
    # and rewrite the rest so the request looks like it came from reddit.com
    headers = dict(request.headers)
    headers.pop("referer", None)
    headers.pop("Referer", None)
    headers.pop("cookie", None)  # default needed so a missing cookie doesn't raise KeyError
    headers["host"] = "v.redd.it"
    headers["origin"] = "https://www.reddit.com"
    headers["TE"] = "Trailers"
    headers["accept"] = "*/*"
    pprint(headers)

    async def proxy_vid(response):
        # Second request to Reddit; its body is relayed to the browser in 4 KiB chunks
        res: aiohttp.ClientResponse = await session.get(url, headers=headers)
        while True:
            data = await res.content.read(4096)
            if not data:
                res.close()
                break
            await response.write(data)

    # First request, made only to grab Reddit's status and response headers
    async with session.get(url, headers=headers) as resp:
        resp_headers = dict(resp.headers)
        status = resp.status
        print(status)
        pprint(resp_headers)

    resp_headers.pop("Access-Control-Allow-Origin", None)
    resp_headers.pop("Access-Control-Expose-Headers", None)

    return stream(proxy_vid, status=status, headers=resp_headers, content_type="video/mp4")

Most of this is header formatting, which isn’t that important. As you can see, I’m just relaying data and headers, nothing too complicated.

If the browser does not request a range, the request works out fine. If it does, however, a couple of things can happen.

A request from Firefox will send Reddit the following headers

{'TE': 'Trailers',
 'accept': '*/*',
 'accept-language': 'en-US,en;q=0.5',
 'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'host': 'v.redd.it',
 'origin': 'https://www.reddit.com',
 'pragma': 'no-cache',
 'range': 'bytes=2239259-',
 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) '
               'Gecko/20100101 Firefox/72.0'}

and gets back a 416 because it’s asking for out-of-range bytes in the Range header. I’m not sure where Firefox pulls that number from. The video also plays even though it only received 416 responses, so I’m not sure where the data came from, how it got there, or how to stop it from caching.

A request from Chrome will send these headers

{'TE': 'Trailers',
 'accept': '*/*',
 'accept-encoding': 'identity;q=1, *;q=0',
 'accept-language': 'en-US,en;q=0.9',
 'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'host': 'v.redd.it',
 'origin': 'https://www.reddit.com',
 'pragma': 'no-cache',
 'range': 'bytes=0-',
 'sec-fetch-mode': 'no-cors',
 'sec-fetch-site': 'same-origin',
 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/78.0.3886.0 Safari/537.36'}

which results in the requests getting mysterious cancellation errors:

socket.send() raised exception.
socket.send() raised exception.
[2019-12-26 20:45:37 -0500] [13784] [ERROR] Transport closed @ ('127.0.0.1', 60112) and exception experienced during error handling
[2019-12-26 20:45:37 -0500] [13784] [DEBUG] Exception:
Traceback (most recent call last):
  File "C:\Users\pmdevita\Documents\Developer\Python\Experiments\VredditProxy\venv\lib\site-packages\sanic\server.py", line 498, in stream_response
    self.request.version, keep_alive, self.keep_alive_timeout
  File "C:\Users\pmdevita\Documents\Developer\Python\Experiments\VredditProxy\venv\lib\site-packages\sanic\response.py", line 110, in stream
    await self.streaming_fn(self)
  File "C:/Users/pmdevita/Documents/Developer/Python/Experiments/VredditProxy/main.py", line 398, in proxy_vid
    data = await res.content.read(4096)
  File "C:\Users\pmdevita\Documents\Developer\Python\Experiments\VredditProxy\venv\lib\site-packages\aiohttp\streams.py", line 368, in read
    await self._wait('read')
  File "C:\Users\pmdevita\Documents\Developer\Python\Experiments\VredditProxy\venv\lib\site-packages\aiohttp\streams.py", line 296, in _wait
    await waiter
concurrent.futures._base.CancelledError
[2019-12-26 20:45:41 -0500] [13784] [DEBUG] KeepAlive Timeout. Closing connection.
[2019-12-26 20:45:42 -0500] [13784] [DEBUG] KeepAlive Timeout. Closing connection.

When the mp4 is embedded in a <video> tag in a webpage, Firefox also sends requests that hit this error.
I’ve been working on this for at least a week now and I’m absolutely stumped. I have no idea how or why this error is happening. I just need to relay the data from Reddit to the browser exactly as it would normally appear to the browser.

What do you get back when you send the request with the headers using curl or httpie?
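
You could also replay Firefox’s failing request straight from Python, for example:

import requests

# Replay the exact ranged request Firefox made, directly against Reddit
url = "https://v.redd.it/nki3zh3yvbv31/DASH_480?source=fallback"
headers = {"Range": "bytes=2239259-", "Accept": "*/*"}

r = requests.get(url, headers=headers)
print(r.status_code)                    # 206 if the range is valid, 416 if not
print(r.headers.get("Content-Range"))   # a 416 should report the real size as "bytes */<total>"
print(r.headers.get("Accept-Ranges"))   # "bytes" if the server supports range requests

That would show whether the 416 comes from Reddit itself or from your proxy.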

I don’t know if this helps you, but I did something similar (my constraint was only one session per connection). I also only forward whitelisted headers.

Formatted at https://pastebin.com/JU0ezX8d
Use it like return await ForwardUrlStream(stream_link, dict(request.headers))
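
The rough shape of it is something like this (a sketch from memory, not the exact pastebin code; the whitelist here is just illustrative):

import aiohttp
from sanic.response import stream

# Sketch of the idea: forward only whitelisted request headers and use
# one upstream session per connection, closing it when the stream ends
HEADER_WHITELIST = {"range", "accept", "accept-encoding", "user-agent"}

async def ForwardUrlStream(url, request_headers):
    headers = {k: v for k, v in request_headers.items()
               if k.lower() in HEADER_WHITELIST}
    session = aiohttp.ClientSession()
    upstream = await session.get(url, headers=headers)

    async def streaming_fn(response):
        try:
            async for chunk in upstream.content.iter_chunked(4096):
                await response.write(chunk)
        finally:
            upstream.close()
            await session.close()

    # Relay the upstream status and headers back to the browser
    return stream(streaming_fn, status=upstream.status,
                  headers=dict(upstream.headers))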


Sorry for the late reply.

Here is a test case I wrote. It simulates the kind of Range requests a browser makes and checks the data against the original URL.

import requests
import unittest
import hashlib
from random import randrange

BASE_URL = "http://127.0.0.1:8000"


class VidTest(unittest.TestCase):
    def test_mute_range(self):
        # First, we simulate the first request to get more info about the video
        headers = {'Range': "bytes=0-1"}
        ORIGINAL_URL = "https://v.redd.it/nki3zh3yvbv31/DASH_480?source=fallback"
        TEST_URL = BASE_URL + '/test.mp4'

        o = requests.get(ORIGINAL_URL, headers=headers)
        t = requests.get(TEST_URL, headers=headers)

        # We hash them to verify they are the same
        original_hash = hashlib.md5()
        test_hash = hashlib.md5()

        original_hash.update(o.content)
        test_hash.update(t.content)

        self.assertEqual(original_hash.digest(), test_hash.digest())
        self.assertEqual(o.status_code, t.status_code)
        self.assertEqual(o.headers.get('Content-Range', False), t.headers.get('Content-Range', False))

        # The next section downloads the rest of the video in randomly sized chunks
        cursor = 2
        # Get the total size of the video
        total_size = int(o.headers['Content-Range'].split("/")[1])
        # Track how much is left to download
        size_left = total_size - 2
        # How many chunks will we split this up into?
        requests_left = randrange(3, 8)
        while requests_left:
            # Based on how much data is left to download and how many chunks we have left to download it with,
            # download a random amount of data that ballparks what splitting it into equal chunks would be
            median_size = round(size_left / requests_left)
            print(median_size)
            if requests_left > 1:
                request_size = min(randrange(median_size - 4096, median_size + 4096), size_left)
            else:
                request_size = size_left
            # Then make the request
            headers = {'Range': "bytes={}-{}".format(cursor, cursor + request_size - 1)}
            o = requests.get(ORIGINAL_URL, headers=headers)
            t = requests.get(TEST_URL, headers=headers)
            # and update our hashes
            original_hash.update(o.content)
            test_hash.update(t.content)
            # And do one small, possibly unnecessary test
            self.assertEqual(o.status_code, t.status_code)
            # Request complete, prepare for the next or to stop
            cursor += request_size
            size_left -= request_size
            requests_left -= 1
        self.assertEqual(original_hash.digest(), test_hash.digest())

Running this test does not generate any cancellation errors, so browsers must be doing something different. I’m not sure what that might be, though.

Chrome and Firefox have their own different implementations for video streaming, and their behavior doesn’t match. As far as I can remember, Chrome makes multiple requests for the file. The first request is supposed to be with range bytes=0- as you got, and the server is supposed to answer with a 206 HTTP code, which tells the browser that subsequent requests for partial video data can be made. The subsequent requests then use appropriate range bytes as you seek through the video in the browser. You get cancellation errors because Chrome has a pipeline of multiple requests for the same file while it buffers the video, and it abruptly drops the connection to the server after each connection receives some appropriate amount of data. Firefox, on the other hand, uses only a single connection as far as I know.

When you are proxying the Reddit video, make sure you also proxy the HTTP status code of the response from Reddit. I think the stream function returns 200 by default.
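
Something like this is what I mean (a rough, untested sketch of your handler; it forwards Reddit’s status and treats the client hanging up as a normal way for the stream to end):

import asyncio
import aiohttp
from sanic import Sanic
from sanic.response import stream

app = Sanic(__name__)

@app.listener("before_server_start")
async def setup_session(app, loop):
    # One shared upstream session for the whole server
    app.session = aiohttp.ClientSession()

@app.route("/test.mp4")
async def test_vid(request):
    url = "https://v.redd.it/nki3zh3yvbv31/DASH_480?source=fallback"
    headers = {"host": "v.redd.it", "accept": "*/*"}
    if "range" in request.headers:
        headers["range"] = request.headers["range"]

    upstream = await request.app.session.get(url, headers=headers)

    async def proxy_vid(response):
        try:
            async for chunk in upstream.content.iter_chunked(4096):
                await response.write(chunk)
        except asyncio.CancelledError:
            # Chrome aborted this request while buffering; stop reading
            # from Reddit instead of letting the error bubble up
            pass
        finally:
            upstream.close()

    # Pass Reddit's status straight through so a 206 stays a 206
    return stream(proxy_vid, status=upstream.status,
                  headers=dict(upstream.headers), content_type="video/mp4")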


I don’t know if this helps, but a while ago I did a quick axel clone using aiohttp, where you basically check the Accept-Ranges header and then break the download into parts using the Range header. The browser implementation part is particular to each one, so it’s sometimes hard to understand why one browser does this and the other does that (but I think @syfluqs’ answer is quite complete, especially regarding Chrome).
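
The core of it is just one ranged request per part, roughly like this (a simplified sketch, not the original code):

import asyncio
import aiohttp

# Axel-style download: check Accept-Ranges, then fetch the file in
# ranged parts concurrently and join them in order
async def download_in_parts(url, parts=4):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as head:
            assert head.headers.get("Accept-Ranges") == "bytes"
            total = int(head.headers["Content-Length"])

        async def fetch_part(start, end):
            headers = {"Range": "bytes={}-{}".format(start, end)}
            async with session.get(url, headers=headers) as resp:
                assert resp.status == 206  # the server honoured the range
                return await resp.read()

        size = total // parts
        tasks = [
            fetch_part(i * size, total - 1 if i == parts - 1 else (i + 1) * size - 1)
            for i in range(parts)
        ]
        return b"".join(await asyncio.gather(*tasks))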

Now, for the proxy part, one thing is for certain: you must respect almost every single header the browser sends you, as well as have decent control flow for any errors that may occur (I think that’s the principle of a proxy :sweat_smile:).

Anyway, if you need any help, let us know.
