We observed that enabling both use_gzip and dump_self_cuda_time_total in the vLLM torch profiler introduces significant overhead during profiling. For example, when profiling 10 randomly generated ...