"Do I have to download the sound separately?"
Most likely, yes. Servers that offer several video resolutions (quality selection) usually keep audio and video in separate files. This keeps the files smaller, because the same audio data is not repeated in every video file.
For example, YouTube does not include audio in its 1080p video files. You say you "tried this with different videos", but no 1080p file will have audio, whichever video you pick. Your server/API is possibly doing the same thing.
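If the server does turn out to serve audio and video as separate files, the two downloads can be combined afterwards. The following is only a sketch, assuming the ffmpeg command-line tool is installed and on the PATH; the helper names and the file names in the usage line are hypothetical.

```python
import subprocess


def build_mux_command(video_path, audio_path, output_path):
    """Return the ffmpeg argument list that combines one video and one audio file."""
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it already exists
        "-i", video_path,    # input 0: the video-only file
        "-i", audio_path,    # input 1: the audio-only file
        "-map", "0:v:0",     # take the first video stream from input 0
        "-map", "1:a:0",     # take the first audio stream from input 1
        "-c", "copy",        # copy both streams as-is, without re-encoding
        output_path,
    ]


def mux(video_path, audio_path, output_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_mux_command(video_path, audio_path, output_path), check=True)
```

Usage would be something like mux("video_only.mp4", "audio_only.m4a", "combined.mp4"). Copying the streams only works when the chosen output container supports both codecs; otherwise the audio or video would have to be re-encoded.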
The simplest way, which you may have overlooked, is to load and play the video in a web browser and capture it with screen recording software that can record the screen together with audio, not only from the microphone but from any other sound source, including the one the video uses.
In other words, you do not need to write your own program, in any language, to get the result you want. The advantage of this approach is that any video you can play in your browser can be saved as a file, and a browser usually handles more formats than any single library, since it combines many of them.
And if you really need to automate it, why not use input automation software to deliver the key presses and mouse clicks you have already tested successfully by hand?
requests does not have any logic that would allow it to strip the sound from a video. Use e.g. this link to verify that your code works: sample-videos.com/video321/mp4/720/big_buck_bunny_720p_2mb.mp4 If you can share the actual URL you are having problems with, that would help us resolve this. Or shout if the file from the URL I provided comes without sound.
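To run that check, something like the sketch below could be used. It streams a URL to disk with requests and then looks for a sound handler in the downloaded MP4 data. The function names are hypothetical, and the audio check is a crude heuristic (it scans for an "hdlr" box whose handler type is "soun"), not a full MP4 parser.

```python
def download(url, dest, chunk_size=65536):
    """Stream the response body to a file; the bytes are written unchanged."""
    import requests  # third-party; imported here so the helper below works without it

    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
    return dest


def has_audio_track(data):
    """Heuristic: True if any 'hdlr' box in the MP4 bytes declares a sound handler."""
    pos = data.find(b"hdlr")
    while pos != -1:
        # Layout after the box type: version/flags (4 bytes), pre_defined (4 bytes),
        # then the 4-byte handler type ('vide' for video, 'soun' for audio).
        if data[pos + 12:pos + 16] == b"soun":
            return True
        pos = data.find(b"hdlr", pos + 4)
    return False
```

After download(url, "sample.mp4"), reading the file in binary mode and passing its contents to has_audio_track tells you whether the container declares an audio track. If it returns False for your own URL but True for the sample file, the server is most likely serving a video-only file and requests is not at fault.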