A thought from me:
Modern CPUs and GPUs usually have specialized hardware for decoding (and encoding) video, for speed reasons. "Decoding" basically means "decompressing". If a video is decoded by e.g. the GPU, it might be sent straight to the screen, bypassing the usual video-driver processing such as temporal dithering or frame-rate limiting. Those two are bad examples though - you'd expect skipping them to make matters better, not worse.
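If you want to check whether the hardware decode path is actually involved, mpv lets you toggle it per playback. A minimal comparison, assuming mpv is installed (the `--hwdec` option is part of mpv; `video.mp4` is just a placeholder file name):

```
# force pure software decoding on the CPU (no GPU decode path)
mpv --hwdec=no video.mp4

# let mpv pick a hardware decoder (VA-API, NVDEC, etc.) if one is available
mpv --hwdec=auto video.mp4
```

If the same file looks different (or feels different to your eyes) between the two runs, that points at the decode/display path rather than the file itself.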
Still, one way to test this theory would be to watch animated GIFs. GIF is a "simple" format in the sense that it doesn't need dedicated hardware for efficient decoding, so it's decoded in software on the CPU; if GIFs look fine to you while videos don't, the hardware decode path becomes a suspect. See the ffmpeg sketch below for making a comparable test clip.
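To make the comparison fair, you can convert a clip that bothers you into a GIF and watch the same content both ways. A rough sketch using ffmpeg (the flags below are standard ffmpeg options; `input.mp4` / `output.gif` are placeholder names):

```
# convert the first 10 seconds of a clip to a GIF for comparison
# (low fps and small scale, since GIF is palette-based and gets huge otherwise)
ffmpeg -i input.mp4 -t 10 -vf "fps=15,scale=480:-1" output.gif
```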
Btw, I have the same problem as you (wrt watching videos).