NVIDIA AI Infrastructure Sample Questions:
1. An AI server exhibits frequent kernel panics under heavy GPU load. 'dmesg' reveals the following error: 'NVRM: Xid (PCI:0000:3B:00): 79, pid=..., name=..., GPU has fallen off the bus.' Which of the following is the least likely cause of this issue?
A) Overclocking the GPU beyond its stable limits.
B) A driver bug in the NVIDIA drivers, leading to GPU instability.
C) Insufficient power supply to the GPU, causing it to become unstable under load.
D) A faulty CPU.
E) A loose or damaged PCIe riser cable connecting the GPU to the motherboard.
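Note: Xid 79 ("GPU has fallen off the bus") almost always traces to power delivery, PCIe signal integrity, thermals, or a driver fault rather than the CPU. A minimal triage sketch in Python, assuming 'dmesg' and 'nvidia-smi' are on the PATH:

    import subprocess

    # Scan the kernel ring buffer for NVRM Xid events such as Xid 79
    # ("GPU has fallen off the bus"); may require root for full output.
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for line in dmesg.splitlines():
        if "NVRM: Xid" in line:
            print(line)

    # Cross-check live power draw against the board limit and confirm the
    # PCIe link has trained at the expected generation and width; a link
    # stuck at x1 or a draw pinned at power.limit points away from the CPU.
    subprocess.run([
        "nvidia-smi",
        "--query-gpu=index,name,power.draw,power.limit,"
        "temperature.gpu,pcie.link.gen.current,pcie.link.width.current",
        "--format=csv",
    ])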
2. You are deploying an NVIDIA GPU-accelerated application in a virtualized environment using vGPU. How does vGPU technology impact power and cooling considerations compared to a bare-metal deployment, and what specific monitoring metrics become crucial?
A) vGPU deployments require specialized cooling solutions that are not needed for bare metal setups.
B) vGPU deployments eliminate the need for GPU monitoring, as the virtualization layer handles all power and cooling management.
C) vGPU deployments have no significant impact on power and cooling requirements compared to bare-metal. Standard GPU temperature and power draw metrics are sufficient.
D) vGPU deployments can lead to higher overall power consumption and concentrated heat generation on the host server due to resource consolidation. Monitoring metrics like GPU utilization per VM, vGPU frame rate, and host server thermal headroom become crucial.
E) vGPU deployments typically require less power and cooling than bare-metal, as resources are shared. The host server's overall power consumption becomes the primary monitoring metric.
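Note: the host-level metrics called out in option D can be sampled from the hypervisor. A monitoring sketch, assuming the NVIDIA vGPU host driver is installed so the 'nvidia-smi vgpu' subcommand is available (field support and output format vary by vGPU software release):

    import subprocess

    # Physical-GPU power and thermals on the consolidated host: with many
    # VMs sharing one board, this is where concentrated heat shows up.
    subprocess.run([
        "nvidia-smi",
        "--query-gpu=index,power.draw,power.limit,temperature.gpu,utilization.gpu",
        "--format=csv",
    ])

    # Per-vGPU (per-VM) utilization, one row per active vGPU instance;
    # only available with the vGPU host manager, not bare-metal drivers,
    # and it may report in a loop (stop with Ctrl+C).
    subprocess.run(["nvidia-smi", "vgpu", "-u"])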
3. An AI infrastructure uses a combination of air-cooled and liquid-cooled NVIDIA GPUs. You want to optimize cooling performance based on the specific thermal characteristics of each GPU type and their location within the server rack. How can you achieve granular cooling control and monitoring to address these heterogeneous cooling requirements effectively? SELECT TWO.
A) Implement dynamic fan speed control based on individual GPU temperatures, leveraging tools like 'nvidia-smi' and custom scripts, for air-cooled GPUs.
B) Use a centralized monitoring system to track GPU temperatures and power consumption, but apply the same cooling profile to all GPUs regardless of type.
C) Employ liquid cooling only for the highest TDP GPUs and rely on ambient air cooling for all other components.
D) Implement rack-level airflow management solutions, such as blanking panels and cable management, to improve overall airflow uniformity.
E) Deploy per-server cooling solutions with independent fan control for each server node, allowing for tailored airflow adjustments.
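Note: option A's approach can be sketched as a control loop: read per-GPU temperatures with 'nvidia-smi', map them to a fan duty cycle, and hand that to whatever actuation path the chassis exposes. The actuation stub below is hypothetical, because server fan control typically goes through the vendor's BMC (for example, OEM-documented 'ipmitool' raw commands) rather than through 'nvidia-smi' itself:

    import subprocess
    import time

    def gpu_temperatures():
        # index,temperature.gpu as bare CSV, e.g. "0, 62" per line.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        ).stdout
        return {int(idx): int(temp) for idx, temp in
                (line.split(", ") for line in out.strip().splitlines())}

    def duty_for(temp_c):
        # Illustrative thresholds only; tune against the GPU's actual
        # thermal targets and the chassis airflow design.
        if temp_c < 50:
            return 30
        if temp_c < 70:
            return 55
        return 100

    def set_fan_duty(zone, percent):
        # Hypothetical stub: replace with the chassis vendor's documented
        # BMC call (often an OEM-specific 'ipmitool raw' command).
        print(f"fan zone {zone} -> {percent}%")

    while True:  # control loop; stop with Ctrl+C
        for idx, temp in gpu_temperatures().items():
            set_fan_duty(zone=idx, percent=duty_for(temp))
        time.sleep(5)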
4. You are configuring a server with multiple GPUs for CUDA-aware MPI. Which environment variable is critical for ensuring proper GPU affinity, so that each MPI process uses the correct GPU?
A) LD_LIBRARY_PATH
B) CUDA_DEVICE_ORDER
C) MPI_GPU_SUPPORT
D) CUDA_LAUNCH_BLOCKING
E) CUDA_VISIBLE_DEVICES
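Note: the standard pattern with CUDA_VISIBLE_DEVICES is to set it per rank before any CUDA context is created, using the launcher's local-rank variable. A sketch assuming Open MPI, whose local-rank variable is OMPI_COMM_WORLD_LOCAL_RANK (other launchers use different names, e.g. SLURM_LOCALID under Slurm):

    import os

    # Pin this MPI process to one GPU before any CUDA context exists.
    # OMPI_COMM_WORLD_LOCAL_RANK is Open MPI's per-node rank index; swap
    # in your launcher's equivalent if you use a different MPI stack.
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

    # Every process now enumerates exactly one GPU as device 0, so
    # CUDA-aware MPI transfers originate from the intended board.

Launched with, say, 'mpirun -np 4 python app.py' on a four-GPU node, ranks 0 through 3 each claim a distinct GPU.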
5. Which of the following are key benefits of using NVIDIA NVLink Switch in a multi-GPU server setup for AI and deep learning workloads?
A) Increased GPU-to-GPU communication bandwidth.
B) Simplified GPU resource management.
C) Support for larger GPU memory pools than a single server can physically accommodate.
D) Reduced latency in inter-GPU data transfers.
E) Enhanced security features compared to PCIe-based interconnections.
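Note: NVLink connectivity and link health can be confirmed with the driver's own tooling; both subcommands below ship with the standard 'nvidia-smi' binary:

    import subprocess

    # Connection matrix: entries like NV1/NV2/NV18 mark NVLink paths,
    # while PIX/PHB/SYS indicate PCIe hops.
    subprocess.run(["nvidia-smi", "topo", "-m"])

    # Per-link state and line rate for each GPU's NVLink ports.
    subprocess.run(["nvidia-smi", "nvlink", "--status"])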
Solutions:
Question #1 Answer: D | Question #2 Answer: D | Question #3 Answer: A, D | Question #4 Answer: E | Question #5 Answer: A, C, D