As the shape and demands of large-scale computing environments have evolved, so have the needs of those responsible for keeping them in tip-top shape. Demanding AI workloads, such as training and inference, employ tens of thousands to hundreds of thousands of high-end GPUs. Compared with their CPU counterparts in traditional HPC, these systems face far greater scrutiny of utilization and uptime, because their costs, in acquisition as well as power and cooling, are an order of magnitude higher.
When a server full of GPUs costs nearly $500k, every second counts. Jobs that aren't running, or that run at low utilization, can translate into millions of lost GPU-seconds, a metric that has come to the fore in large-scale AI training environments over the past few years.
As such, observability is essential for understanding and optimizing system utilization. One critical aspect of observability is Job ID tracking, which bridges the gap between raw system metrics and actionable insights into user activities.
Familiar scheduling frameworks such as SLURM, LSF, and SGE are still commonly deployed in modern AI training datacenters. However, custom schedulers, including those built exclusively on Kubernetes (k8s), are becoming more widespread. Therefore, a new technique is required to ensure that the metrics generated not only at the host level, but also within each pod or container, are captured and made available for real-time and historical analysis.
The Challenge of User Support in Large Systems
Administrators of large-scale compute, storage, and networking infrastructures often face the challenge of supporting users with varying levels of technical knowledge. Some users may not even know the right questions to ask or what information is needed to diagnose issues.
Imagine a scenario where a user reports: “My job is running slow; can you help?” Without structured Job ID tracking and correlation of system metrics, troubleshooting becomes a time-consuming exercise in piecing together incomplete data.
Why Job ID Tracking Matters
Job IDs provide a unique identifier for each workload in an AI environment. By associating system metrics—such as I/O patterns, resource consumption, and user activities—with specific Job IDs, administrators can:
- Isolate Issues Quickly: Identify bottlenecks and inefficiencies
- Optimize Resource Allocation: Determine if jobs are over- or underutilizing resources
- Enhance User Support: Provide tailored guidance based on granular data
- Improve System Visibility: Gain a detailed understanding of job interactions with infrastructure
From Data Collection to Actionable Insights
At VAST Data, we’ve built an ecosystem of tools to make metrics collection and analysis seamless and actionable.
End-to-End Metrics Collection
One such tool is VAST's vNFS Collector, which tracks I/O metrics and provides a granular view of the I/O requests on any NFS mount (a sample record is sketched after this list), including:
- Counters for every NFSv3 and NFSv4 operation type, for example:
  - READ, WRITE
  - LOOKUP, ACCESS, GETATTR
  - CREATE, RENAME, DELETE
  - and many more
- Mount points
- Process name and PID
- User identifier (UID)
- Export/mount point location
- Admin- and user-defined environment variables, such as SLURM_JOB_ID
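For illustration, a single collector record carrying these fields might look like the sketch below, and the local JSON log can be aggregated per job with a few lines of Python. The field names, values, and log path are assumptions for illustration, not the collector's documented schema.

```python
import json
from collections import defaultdict

# Hypothetical example of a single vNFS Collector record; the field names,
# values, and log location below are illustrative assumptions only.
sample_record = {
    "timestamp": "2025-01-15T10:42:07Z",
    "mount": "/mnt/training_data",
    "process": "python",
    "pid": 41237,
    "uid": 1001,
    "tags": {"SLURM_JOB_ID": "982134"},
    "ops": {"READ": 18453, "WRITE": 9217, "GETATTR": 305, "LOOKUP": 128},
}
print(json.dumps(sample_record, indent=2))

# Aggregate read/write counts per job from a local JSON-lines log
# (assumed path and format).
totals = defaultdict(lambda: {"READ": 0, "WRITE": 0})
with open("/var/log/vnfs_collector.log") as log:
    for line in log:
        rec = json.loads(line)
        job = rec.get("tags", {}).get("SLURM_JOB_ID", "untagged")
        for op in ("READ", "WRITE"):
            totals[job][op] += rec.get("ops", {}).get(op, 0)

for job, counts in sorted(totals.items()):
    print(f"job {job}: {counts['READ']} reads, {counts['WRITE']} writes")
```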
Streamlined Data Forwarding
The vNFS Collector provides flexible data forwarding capabilities, facilitating seamless integration with a wide range of analytics and monitoring tools. In addition to logging locally to the host in JSON format, it can transmit metrics to multiple endpoints (a consumer sketch follows this list), including:
- Prometheus: Seamlessly integrates with existing Grafana dashboards for real-time monitoring and visualization
- Kafka: Enables real-time, event-driven pipelines to support dynamic system automation and data streaming
- VAST DataBase: Enables historical tracking and in-depth analysis within the VAST UI, with seamless integration into advanced analytics tools like Trino, Spark, and Grafana for comprehensive data exploration and visualization
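As a sketch of the Kafka path, a downstream service could consume the metric stream and react to it in real time. The topic name, broker address, and record fields below are assumptions for illustration.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Consume vNFS metric records from Kafka and flag write-heavy jobs.
# Topic name, broker address, and field names are illustrative assumptions.
consumer = KafkaConsumer(
    "vnfs-metrics",
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

WRITE_THRESHOLD = 1_000_000  # arbitrary example threshold, in operations

for message in consumer:
    record = message.value
    job_id = record.get("tags", {}).get("SLURM_JOB_ID", "untagged")
    writes = record.get("ops", {}).get("WRITE", 0)
    if writes > WRITE_THRESHOLD:
        print(f"job {job_id} issued {writes} writes in this interval")
```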
Job ID Dashboards for Quick Analysis
To make collected data actionable, we developed dedicated Job ID dashboards (an example query is sketched after this list) that allow administrators to:
- Identify users running multiple jobs simultaneously
- Compare resource usage across jobs
- Analyze I/O patterns and success rates at a granular level
- And more
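As an example of the kind of question such a dashboard answers, the same data exported to Prometheus can be queried directly. The metric name and job_id label below are assumptions about how the exported series might be named.

```python
import requests

# Query Prometheus for per-job write throughput over the last 5 minutes.
# The metric name (vnfs_write_bytes_total) and the job_id label are
# illustrative assumptions about how the exported metrics might be named.
PROM_URL = "http://prometheus.example.com:9090/api/v1/query"
query = 'sum by (job_id) (rate(vnfs_write_bytes_total[5m]))'

resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    job_id = series["metric"].get("job_id", "untagged")
    _, value = series["value"]
    print(f"job {job_id}: {float(value) / 1e6:.1f} MB/s written")
```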
The VAST Data Approach to Job ID Tracking
To provide a scalable and efficient solution, VAST developed an innovative method for capturing and analyzing job-related metrics using eBPF (Extended Berkeley Packet Filter).
eBPF is an extremely lightweight mechanism for safely capturing low-level data as applications interact directly with the OS kernel. In this case, we leverage eBPF to monitor all NFS operations occurring on the Linux client host. Its versatility allows it to operate seamlessly with any NFS client across a wide range of Linux kernel versions, ensuring broad compatibility in diverse environments. eBPF also provides complete fault isolation from running workloads, guaranteeing that the vNFS Collector will never cause a kernel crash or disrupt application performance. Furthermore, it supports non-disruptive upgrades, allowing updates to be applied without any impact on active workloads, ensuring continuous system reliability and uptime.
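To give a feel for the mechanism (this is a sketch of the technique, not the vNFS Collector's actual implementation), the minimal bcc program below counts NFS client write calls per process by attaching a kprobe to the kernel's nfs_file_write function. The probe point varies by kernel version, and bcc plus root privileges are assumed.

```python
from time import sleep

from bcc import BPF  # requires the bcc toolkit and root privileges

# Minimal illustration of eBPF-based NFS observation: count calls to the
# kernel's NFS client write path per process. This is a sketch of the
# technique, not the vNFS Collector's implementation; the probe point
# (nfs_file_write) may differ across kernel versions.
bpf_program = r"""
BPF_HASH(write_counts, u32, u64);

int trace_nfs_write(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    write_counts.increment(pid);
    return 0;
}
"""

b = BPF(text=bpf_program)
b.attach_kprobe(event="nfs_file_write", fn_name="trace_nfs_write")

print("Tracing NFS writes for 10 seconds...")
sleep(10)

# Print per-PID write call counts, busiest processes first.
for pid, count in sorted(b["write_counts"].items(), key=lambda kv: -kv[1].value):
    print(f"pid {pid.value}: {count.value} NFS write calls")
```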
Introducing vNFS Collector
The vNFS Collector, developed by the VAST R&D team, is a highly efficient, easy-to-install eBPF-based program that instantly captures and distributes key system metrics. As part of our commitment to fostering community collaboration and innovation, we plan to make the vNFS Collector an open-source project. This initiative will allow the broader technical community to contribute to its development, adapt it to diverse environments, and accelerate observability advancements across the HPC and AI ecosystems.
Key Features
- Simple Installation: Available as RPM and DEB packages, as a Docker container, or as a source bundle for compilation against any kernel
- No Impact: Adds zero load to production workloads and does not slow I/O or require a reboot or unmount
- Customizable Tracking: Supports tagging I/O metrics with JOB_ID, PROJECT_NAME, and any other user-defined environment variable (see the launcher sketch after this list)
- Kubernetes Deployment: Deployable as a DaemonSet for containerized applications
- Universal Compatibility: Works with any NFS server, not just VAST
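For instance, a job launcher only needs to export the tags it wants attached to its I/O before starting the workload. In the sketch below, SLURM_JOB_ID is normally set by the scheduler, while PROJECT_NAME, train.py, and the mount path are hypothetical examples.

```python
import os
import subprocess

# Launch a training job with the environment variables the collector is
# configured to track. SLURM_JOB_ID is normally set by the scheduler;
# PROJECT_NAME here is an arbitrary user-defined tag for illustration.
env = dict(os.environ)
env.setdefault("SLURM_JOB_ID", "local-test")   # placeholder outside SLURM
env["PROJECT_NAME"] = "llm-pretraining-demo"   # hypothetical project tag

# Any I/O the child process performs against an NFS mount can then be
# attributed to these tags by the collector.
subprocess.run(["python", "train.py", "--data", "/mnt/training_data"], env=env)
```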
Implementation and Visualization
Administrators can configure the system to send metrics to one or multiple endpoints, depending on their observability needs. Below are examples of Job ID tracking in action.
vNFS Collector configuration
This configuration file defines environment variables such as JOBID, SCHEDID, and POD_NAME for tracking, alongside configurations for logging, database storage, and integration with Prometheus and Kafka.
Also, as previously mentioned, an optional target for these metrics is the VAST DataBase, a highly scalable, real-time, transactional database that enables both real-time insight and historical analytics. When combined with VAST's persistent data layer, the VAST DataBase and vNFS Collector deliver a fully integrated solution that seamlessly serves both unstructured and structured data to demanding applications, while also providing comprehensive, built-in metrics for enhanced visibility and performance.
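As a sketch of that historical path, metrics landed in the VAST DataBase could be queried through Trino's Python client. The host, catalog, schema, table, and column names below are illustrative assumptions about how such a table might be laid out, not a documented schema.

```python
import trino  # pip install trino

# Query historical vNFS metrics stored in the VAST DataBase through Trino.
# Host, catalog, schema, table, and column names are illustrative assumptions.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="admin",
    catalog="vast",
    schema="metrics",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT job_id, SUM(write_bytes) AS total_write_bytes
    FROM vnfs_metrics
    WHERE ts > current_timestamp - INTERVAL '7' DAY
    GROUP BY job_id
    ORDER BY total_write_bytes DESC
    LIMIT 10
    """
)

# Report the ten jobs that wrote the most data over the past week.
for job_id, total in cur.fetchall():
    print(f"job {job_id}: {total / 1e9:.2f} GB written in the last 7 days")
```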
Example Metrics Visualization
JSON Output with Job ID and Process ID
This JSON output shows real-time job activity, including the Process Name (e.g., PyTorch), Process ID, Job ID, and various I/O statistics.
Grafana Dashboard Showing Writes by Process Name
This visualization highlights data writes categorized by processes—such as PyTorch, TensorFlow, Bash, and MATLAB—giving insights into how different applications interact with storage.
Grafana Dashboard Showing Writes by Job ID
This graph categorizes write operations by specific Job IDs, making it easy to track storage performance and resource usage across different computational tasks.
Unified Observability
To further streamline observability, we are integrating these dashboards into the VAST Management System (VMS), providing a unified interface that eliminates the need to switch between tools while maintaining compatibility with Grafana and other platforms.
Observability in Action on Cosmos
Job ID tracking elevates observability from a reactive troubleshooting tool to a proactive strategy for optimizing system performance and user experience. By adopting this approach, administrators can not only address user issues faster but also drive greater efficiency in large-scale systems.
For those managing large scale AI and HPC environments, Job ID tracking is not just a feature—it’s a necessity.
Ready to learn more? Join us on Cosmos, the community we’re building to transform how organizations simplify and accelerate AI. You can access and explore the features of a real VAST cluster in the VAST Data Labs on Cosmos, including the observability features and client metrics outlined above.