As the shape and demands of large-scale computing environments have evolved, so have the needs of those responsible for keeping them in tip-top shape. Demanding AI workloads, such as training and inference, employ tens of thousands to hundreds of thousands of high-end GPUs. Compared with their CPU counterparts in traditional HPC, these systems face far greater scrutiny of utilization and uptime, because their costs, in acquisition as well as power and cooling, are an order of magnitude higher.
When a server full of GPUs costs nearly $500k, every second counts. Jobs that aren't running, or that run at low utilization, can translate into millions of lost GPU-seconds, a metric that has come to the fore in large-scale AI training environments over the past few years.
As such, observability is essential for understanding and optimizing system utilization. One critical aspect of observability is Job ID tracking, which bridges the gap between raw system metrics and actionable insights into user activities.
Familiar scheduling frameworks such as SLURM, LSF, and SGE are still commonly deployed in modern AI training datacenters. However, custom schedulers, including those built exclusively on Kubernetes (k8s), are becoming more widespread. Therefore, a new technique is required to ensure that the metrics generated not only at the host level, but also within each pod or container, are captured and made available for real-time and historical analysis.
The Challenge of User Support in Large Systems
Administrators of large-scale compute, storage, and networking infrastructures often face the challenge of supporting users with varying levels of technical knowledge. Some users may not even know the right questions to ask or what information is needed to diagnose issues.
Imagine a scenario where a user reports: “My job is running slow; can you help?” Without structured Job ID tracking and correlation of system metrics, troubleshooting becomes a time-consuming exercise in piecing together incomplete data.
Why Job ID Tracking Matters
Job IDs provide a unique identifier for each workload in an AI environment. By associating system metrics—such as I/O patterns, resource consumption, and user activities—with specific Job IDs, administrators can:
- Isolate Issues Quickly: Identify bottlenecks and inefficiencies
- Optimize Resource Allocation: Determine if jobs are over- or underutilizing resources
- Enhance User Support: Provide tailored guidance based on granular data
- Improve System Visibility: Gain a detailed understanding of job interactions with infrastructure
From Data Collection to Actionable Insights
At VAST Data, we’ve built an ecosystem of tools to make metrics collection and analysis seamless and actionable.
End-to-End Metrics Collection
One such tool is VAST's vNFS Collector, which tracks I/O metrics and provides a granular view of the I/O requests on any NFS mount (a sample record is sketched after this list), including:
- Counters for every NFSv3 and NFSv4 operation type, for example:
  - READ, WRITE
  - LOOKUP, ACCESS, GETATTR
  - CREATE, RENAME, DELETE
  - and many more
- Mount points
- Process name and PID
- User identifier (UID)
- Export/mount point location
- Admin- and user-defined environment variables, such as SLURM_JOB_ID
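For illustration, a single collector record carrying these fields might look like the sketch below, and the local JSON log can be aggregated per job with a few lines of Python. The field names, values, and log path are assumptions for illustration, not the collector's documented schema.

```python
import json
from collections import defaultdict

# Hypothetical example of a single vNFS Collector record; the field names,
# values, and log location below are illustrative assumptions only.
sample_record = {
    "timestamp": "2025-01-15T10:42:07Z",
    "mount": "/mnt/training_data",
    "process": "python",
    "pid": 41237,
    "uid": 1001,
    "tags": {"SLURM_JOB_ID": "982134"},
    "ops": {"READ": 18453, "WRITE": 9217, "GETATTR": 305, "LOOKUP": 128},
}
print(json.dumps(sample_record, indent=2))

# Aggregate read/write counts per job from a local JSON-lines log
# (assumed path and format).
totals = defaultdict(lambda: {"READ": 0, "WRITE": 0})
with open("/var/log/vnfs_collector.log") as log:
    for line in log:
        rec = json.loads(line)
        job = rec.get("tags", {}).get("SLURM_JOB_ID", "untagged")
        for op in ("READ", "WRITE"):
            totals[job][op] += rec.get("ops", {}).get(op, 0)

for job, counts in sorted(totals.items()):
    print(f"job {job}: {counts['READ']} reads, {counts['WRITE']} writes")
```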
Streamlined Data Forwarding
The vNFS Collector provides flexible data forwarding capabilities, facilitating seamless integration with a wide range of analytics and monitoring tools. In addition to logging locally to the host in JSON format, it can transmit metrics to multiple endpoints (a consumer sketch follows this list), including:
- Prometheus: Seamlessly integrates with existing Grafana dashboards for real-time monitoring and visualization
- Kafka: Enables real-time, event-driven pipelines to support dynamic system automation and data streaming
- VAST DataBase: Enables historical tracking and in-depth analysis within the VAST UI, with seamless integration into advanced analytics tools like Trino, Spark, and Grafana for comprehensive data exploration and visualization
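As a sketch of the Kafka path, a downstream service could consume the metric stream and react to it in real time. The topic name, broker address, and record fields below are assumptions for illustration.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Consume vNFS metric records from Kafka and flag write-heavy jobs.
# Topic name, broker address, and field names are illustrative assumptions.
consumer = KafkaConsumer(
    "vnfs-metrics",
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

WRITE_THRESHOLD = 1_000_000  # arbitrary example threshold, in operations

for message in consumer:
    record = message.value
    job_id = record.get("tags", {}).get("SLURM_JOB_ID", "untagged")
    writes = record.get("ops", {}).get("WRITE", 0)
    if writes > WRITE_THRESHOLD:
        print(f"job {job_id} issued {writes} writes in this interval")
```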
Job ID Dashboards for Quick Analysis
To make collected data actionable, we developed dedicated Job ID dashboards (an example query is sketched after this list) that allow administrators to:
- Identify users running multiple jobs simultaneously
- Compare resource usage across jobs
- Analyze I/O patterns and success rates at a granular level
- And more
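As an example of the kind of question such a dashboard answers, the same data exported to Prometheus can be queried directly. The metric name and job_id label below are assumptions about how the exported series might be named.

```python
import requests

# Query Prometheus for per-job write throughput over the last 5 minutes.
# The metric name (vnfs_write_bytes_total) and the job_id label are
# illustrative assumptions about how the exported metrics might be named.
PROM_URL = "http://prometheus.example.com:9090/api/v1/query"
query = 'sum by (job_id) (rate(vnfs_write_bytes_total[5m]))'

resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    job_id = series["metric"].get("job_id", "untagged")
    _, value = series["value"]
    print(f"job {job_id}: {float(value) / 1e6:.1f} MB/s written")
```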
The VAST Data Approach to Job ID Tracking
To provide a scalable and efficient solution, VAST developed an innovative method for capturing and analyzing job-related metrics using eBPF (Extended Berkeley Packet Filter).
eBPF is an extremely lightweight mechanism for safely capturing low-level data as applications interact directly with the OS kernel. In this case, we leverage eBPF to monitor all NFS operations occurring on the Linux client host. Its versatility allows it to operate seamlessly with any NFS client across a wide range of Linux kernel versions, ensuring broad compatibility in diverse environments. eBPF also provides complete fault isolation from running workloads, guaranteeing that the vNFS Collector will never cause a kernel crash or disrupt application performance. Furthermore, it supports non-disruptive upgrades, allowing updates to be applied without any impact on active workloads, ensuring continuous system reliability and uptime.
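To give a feel for the mechanism (this is a sketch of the technique, not the vNFS Collector's actual implementation), the minimal bcc program below counts NFS client write calls per process by attaching a kprobe to the kernel's nfs_file_write function. The probe point varies by kernel version, and bcc plus root privileges are assumed.

```python
from time import sleep

from bcc import BPF  # requires the bcc toolkit and root privileges

# Minimal illustration of eBPF-based NFS observation: count calls to the
# kernel's NFS client write path per process. This is a sketch of the
# technique, not the vNFS Collector's implementation; the probe point
# (nfs_file_write) may differ across kernel versions.
bpf_program = r"""
BPF_HASH(write_counts, u32, u64);

int trace_nfs_write(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    write_counts.increment(pid);
    return 0;
}
"""

b = BPF(text=bpf_program)
b.attach_kprobe(event="nfs_file_write", fn_name="trace_nfs_write")

print("Tracing NFS writes for 10 seconds...")
sleep(10)

# Print per-PID write call counts, busiest processes first.
for pid, count in sorted(b["write_counts"].items(), key=lambda kv: -kv[1].value):
    print(f"pid {pid.value}: {count.value} NFS write calls")
```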
Introducing vNFS Collector
The vNFS Collector, developed by the VAST R&D team, is a highly efficient, easy-to-install eBPF-based program that instantly captures and distributes key system metrics. As part of our commitment to fostering community collaboration and innovation, we plan to make the vNFS Collector an open-source project. This initiative will allow the broader technical community to contribute to its development, adapt it to diverse environments, and accelerate observability advancements across the HPC and AI ecosystems.
Key Features
- Simple Installation: Available as RPM and DEB packages, as a Docker container, or as a source bundle for compilation against any kernel
- No Impact: Adds zero load to production workloads and does not slow I/O or require a reboot or unmount
- Customizable Tracking: Supports tagging I/O metrics with JOB_ID, PROJECT_NAME, and any other user-defined environment variable (see the launcher sketch after this list)
- Kubernetes Deployment: Deployable as a DaemonSet for containerized applications
- Universal Compatibility: Works with any NFS server, not just VAST
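For instance, a job launcher only needs to export the tags it wants attached to its I/O before starting the workload. In the sketch below, SLURM_JOB_ID is normally set by the scheduler, while PROJECT_NAME, train.py, and the mount path are hypothetical examples.

```python
import os
import subprocess

# Launch a training job with the environment variables the collector is
# configured to track. SLURM_JOB_ID is normally set by the scheduler;
# PROJECT_NAME here is an arbitrary user-defined tag for illustration.
env = dict(os.environ)
env.setdefault("SLURM_JOB_ID", "local-test")   # placeholder outside SLURM
env["PROJECT_NAME"] = "llm-pretraining-demo"   # hypothetical project tag

# Any I/O the child process performs against an NFS mount can then be
# attributed to these tags by the collector.
subprocess.run(["python", "train.py", "--data", "/mnt/training_data"], env=env)
```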
Implementation and Visualization
Administrators can configure the system to send metrics to one or multiple endpoints, depending on their observability needs. Below are examples of Job ID tracking in action.
vNFS Collector configuration
This configuration file defines environment variables such as JOBID, SCHEDID, and POD_NAME for tracking, alongside configurations for logging, database storage, and integration with Prometheus and Kafka.
Also, as previously mentioned, an optional target for these metrics is the VAST DataBase, a highly scalable, real-time, transactional database that enables both real-time insight and historical analytics. When combined with VAST's persistent data layer, the VAST DataBase and vNFS Collector deliver a fully integrated solution that seamlessly serves both unstructured and structured data to demanding applications, while also providing comprehensive, built-in metrics for enhanced visibility and performance.
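As a sketch of that historical path, metrics landed in the VAST DataBase could be queried through Trino's Python client. The host, catalog, schema, table, and column names below are illustrative assumptions about how such a table might be laid out, not a documented schema.

```python
import trino  # pip install trino

# Query historical vNFS metrics stored in the VAST DataBase through Trino.
# Host, catalog, schema, table, and column names are illustrative assumptions.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="admin",
    catalog="vast",
    schema="metrics",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT job_id, SUM(write_bytes) AS total_write_bytes
    FROM vnfs_metrics
    WHERE ts > current_timestamp - INTERVAL '7' DAY
    GROUP BY job_id
    ORDER BY total_write_bytes DESC
    LIMIT 10
    """
)

# Report the ten jobs that wrote the most data over the past week.
for job_id, total in cur.fetchall():
    print(f"job {job_id}: {total / 1e9:.2f} GB written in the last 7 days")
```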
Example Metrics Visualization
JSON Output with Job ID and Process ID
This JSON output shows real-time job activity, including the Process Name (e.g., PyTorch), Process ID, Job ID, and various I/O statistics.
Grafana Dashboard Showing Writes by Process Name
This visualization highlights data writes categorized by processes—such as PyTorch, TensorFlow, Bash, and MATLAB—giving insights into how different applications interact with storage.
Grafana Dashboard Showing Writes by Job ID
This graph categorizes write operations by specific Job IDs, making it easy to track storage performance and resource usage across different computational tasks.
Unified Observability
To further streamline observability, we are integrating these dashboards into the VAST Management System (VMS), providing a unified interface that eliminates the need to switch between tools while maintaining compatibility with Grafana and other platforms.
Observability in Action on Cosmos
Job ID tracking elevates observability from a reactive troubleshooting tool to a proactive strategy for optimizing system performance and user experience. By adopting this approach, administrators can not only address user issues faster but also drive greater efficiency in large-scale systems.
For those managing large scale AI and HPC environments, Job ID tracking is not just a feature—it’s a necessity.
Ready to learn more? Join us on Cosmos, the community we’re building to transform how organizations simplify and accelerate AI. You can access and explore the features of a real VAST cluster in the VAST Data Labs on Cosmos, including the observability features and client metrics outlined above.