In Part 1 of our VAST Catalog exploration we examined different ways the Catalog could be used to purge cluster space according to corporate data retention policies.
Now we’ll expand the Catalog’s applications and explore use cases across storage utilization, permissions reporting, attribute changes, and more.
Report 2: Utilization Report (by owner)
Here’s an easy one: the storage administrator would like a report on the space used by each user in their environment. The report should include data only in a specific area of the storage system.
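As a rough, runnable sketch of what such a report could look like, the snippet below uses Python's built-in `sqlite3` module as a stand-in for Trino and a tiny hypothetical `catalog` table in place of the real Catalog. Only `owner_name` comes from this article; the `search_path` and `size` column names and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical stand-in for the Catalog table. Apart from owner_name, the
# column names (search_path, size) are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (search_path TEXT, owner_name TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        ("/projects/alpha/", "harry", 4096),
        ("/projects/alpha/", "sally", 1024),
        ("/projects/beta/", "harry", 2048),
        ("/scratch/tmp/", "sally", 9999),  # outside the area of interest
    ],
)


def usage_by_owner(conn, area_prefix):
    """Return (owner, total bytes) for entries under area_prefix, largest first."""
    query = """
        SELECT owner_name, SUM(size) AS total_bytes
        FROM catalog
        WHERE search_path LIKE ? || '%'
        GROUP BY owner_name
        ORDER BY total_bytes DESC
    """
    return conn.execute(query, (area_prefix,)).fetchall()


report = usage_by_owner(conn, "/projects/")
for owner, total in report:
    print(f"{owner}: {total} bytes")
```

The `WHERE` clause restricts the report to one area of the namespace, and the `GROUP BY` column decides what the usage is broken down by.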
You can create other reports by replacing “owner_name” with any other column you might want to group by (e.g., gid, extension, or tag data).
See? That one was easy.
Report 3: Attribute Change Report
An administrator would like to report on recently changed files where attributes such as permissions, rather than the actual content, were altered. In this example we look for elements that have been modified since their last permission/ownership update. While fairly abstract, a query like this could help locate security problems. File and directory updates are grouped separately:
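One way to sketch this comparison, again with `sqlite3` standing in for Trino and a hypothetical `catalog` table: the `mtime` and `ctime` column names and the `'FILE'` value for `element_type` are assumptions, and the direction of the comparison follows the wording above (modified after the last attribute change).

```python
import sqlite3

# Hypothetical stand-in for the Catalog table; mtime/ctime column names and
# the integer timestamps are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog (path TEXT, element_type TEXT, mtime INTEGER, ctime INTEGER)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        ("/data/a.txt", "FILE", 200, 100),  # modified after attribute change
        ("/data/b.txt", "FILE", 100, 100),  # no difference
        ("/data/c.txt", "FILE", 300, 250),  # modified after attribute change
        ("/data/dir1", "DIR", 500, 400),    # modified after attribute change
        ("/data/dir2", "DIR", 100, 400),    # attribute change is newer
    ],
)


def modified_since_attr_change(conn):
    """Count elements whose mtime is newer than ctime, grouped by type."""
    query = """
        SELECT element_type, COUNT(*) AS elements
        FROM catalog
        WHERE mtime > ctime
        GROUP BY element_type
        ORDER BY element_type
    """
    return conn.execute(query).fetchall()


counts = modified_since_attr_change(conn)
for element_type, elements in counts:
    print(f"{element_type}: {elements}")
```

Flipping the comparison to `ctime > mtime` would instead surface elements whose attributes changed after their content was last written.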
A variation on this might be to report on objects that have been accessed since an attribute change:
Of course, you’ll probably want to list those files at some point:
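A listing along these lines might look like the sketch below. It assumes hypothetical `path`, `mtime`, `atime`, and `ctime` columns, uses `sqlite3` in place of Trino, and takes the timestamp column to compare against `ctime` as a parameter so the same helper covers both the modified-since and accessed-since variants.

```python
import sqlite3

# Hypothetical stand-in for the Catalog table; all column names except
# element_type are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog "
    "(path TEXT, element_type TEXT, mtime INTEGER, atime INTEGER, ctime INTEGER)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?, ?)",
    [
        ("/data/a.txt", "FILE", 200, 150, 100),  # modified and accessed since
        ("/data/b.txt", "FILE", 100, 300, 200),  # accessed since, not modified
        ("/data/c.txt", "FILE", 100, 100, 100),  # untouched
        ("/data/dir1", "DIR", 500, 500, 400),    # directory, not listed
    ],
)

# Only these column names may be interpolated into the SQL text.
ALLOWED_COLUMNS = {"mtime", "atime"}


def files_touched_since_attr_change(conn, newer="mtime"):
    """List file paths whose `newer` timestamp is later than ctime."""
    if newer not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported column: {newer}")
    query = f"""
        SELECT path
        FROM catalog
        WHERE element_type = 'FILE' AND {newer} > ctime
        ORDER BY path
    """
    return [row[0] for row in conn.execute(query)]


modified = files_touched_since_attr_change(conn, "mtime")
accessed = files_touched_since_attr_change(conn, "atime")
print(modified)
print(accessed)
```

The column name is checked against a fixed set before it is placed in the query string, since column names cannot be passed as bound parameters.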
Report 4: Data Growth
An administrator would like to locate the directories that hold the most recently written data, ordered by utilization. This query was recently used to locate runaway capacity consumption by a rogue process at a VAST customer deployment.
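The sketch below shows one possible shape for such a query, not the one used at that deployment. It runs on `sqlite3` rather than Trino, and the `parent_path`, `size`, and `mtime` column names are assumptions. `dir_prefix` is a hypothetical helper registered for this example; on Trino the truncation would have to be done with that engine's own string functions.

```python
import sqlite3


def dir_prefix(path, depth):
    """Truncate a path to its first `depth` directory levels."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])


# Hypothetical stand-in for the Catalog table; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.create_function("dir_prefix", 2, dir_prefix)
conn.execute("CREATE TABLE catalog (parent_path TEXT, size INTEGER, mtime INTEGER)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        ("/projects/alpha/run1/", 1000, 150),
        ("/projects/alpha/run2/", 3000, 200),
        ("/projects/beta/", 500, 180),
        ("/projects/beta/old/", 9000, 50),  # written before the cutoff
        ("/scratch/tmp/x/", 2500, 120),
    ],
)


def top_growth(conn, since, depth=2, limit=10):
    """Sum bytes written after `since` per directory prefix, largest first."""
    query = """
        SELECT top_dir, SUM(size) AS bytes_written
        FROM (
            SELECT dir_prefix(parent_path, ?) AS top_dir, size
            FROM catalog
            WHERE mtime > ?
        )
        GROUP BY top_dir
        ORDER BY bytes_written DESC
        LIMIT ?
    """
    return conn.execute(query, (depth, since, limit)).fetchall()


growth = top_growth(conn, since=100)
for top_dir, written in growth:
    print(f"{top_dir}: {written} bytes")
```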
This query is a little more interesting than the others. Here’s a breakdown:
The inner query truncates each path in the storage namespace to a depth of N directories (two levels in this case). The more levels you include, the more lines there will be in the report.
The outer query sums the content size of those directories and reports the top 10 directories in terms of data written since the supplied timestamp.
You can use a similar query structure to show the number of recently used, changed, or modified files/objects in the directory hierarchy down to N-depth of resolution.
Report 5: Permissions Reporting
An administrator would like to list files for a given user that have executable permissions and exist outside that user’s home directory. The same logic can be applied to find readable files and writable directories or really anything that involves permissions. Since POSIX permissions are housed in the Catalog using 9-bit integers, we need to use bitwise operations to match the permissions we’re looking for - bitwise AND in this case. The “bitwise_and()” function in Trino does this for us. Here’s a cheat-sheet:
| Bit position | Attr | Match entries with this bit set |
|---|---|---|
| r - - - - - - - - | user/read | bitwise_and(nfs_mode_bits, 256) |
| - w - - - - - - - | user/write | bitwise_and(nfs_mode_bits, 128) |
| - - x - - - - - - | user/execute | bitwise_and(nfs_mode_bits, 64) |
| - - - r - - - - - | group/read | bitwise_and(nfs_mode_bits, 32) |
| - - - - w - - - - | group/write | bitwise_and(nfs_mode_bits, 16) |
| - - - - - x - - - | group/execute | bitwise_and(nfs_mode_bits, 8) |
| - - - - - - r - - | other/read | bitwise_and(nfs_mode_bits, 4) |
| - - - - - - - w - | other/write | bitwise_and(nfs_mode_bits, 2) |
| - - - - - - - - x | other/execute | bitwise_and(nfs_mode_bits, 1) |
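To make the masks concrete, here is a small Python sketch that applies the same bitwise AND to a 9-bit mode value. The helper names are made up for this example. A bitwise AND returns the mask value when the bit is set and 0 when it is not, so a match is a test for a nonzero result.

```python
# Masks from the cheat-sheet, ordered from user/read down to other/execute.
MASKS = [256, 128, 64, 32, 16, 8, 4, 2, 1]


def has_bit(mode_bits, mask):
    """True when the permission bit selected by `mask` is set."""
    return (mode_bits & mask) != 0


def mode_string(mode_bits):
    """Render a 9-bit mode as the familiar rwxrwxrwx string."""
    return "".join(
        letter if has_bit(mode_bits, mask) else "-"
        for letter, mask in zip("rwxrwxrwx", MASKS)
    )


mode = 0o754  # user: rwx, group: r-x, other: r--
print(mode_string(mode))   # rwxr-xr--
print(has_bit(mode, 64))   # user/execute is set
print(has_bit(mode, 1))    # other/execute is not set
```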
Let’s find all of the files that user “harry” can execute, that are not in his home directory, and that are not otherwise executable by everyone. We know harry has group memberships 4422, 42, and 25343:
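A simplified sketch of such a query follows, using `sqlite3` as a stand-in for Trino. Because SQLite has no `bitwise_and()` function, the example registers one so the SQL reads like the cheat-sheet. The `path` column, the use of `gid` as the owning group, the `'FILE'` element type, and the `/home/harry/` home directory are assumptions made for illustration, and the logic ignores POSIX subtleties such as owner bits taking precedence over group bits.

```python
import sqlite3

# Hypothetical stand-in for the Catalog table; path and gid usage are assumptions.
conn = sqlite3.connect(":memory:")
# SQLite lacks bitwise_and(), so register an equivalent for this sketch.
conn.create_function("bitwise_and", 2, lambda a, b: a & b)
conn.execute(
    "CREATE TABLE catalog "
    "(path TEXT, element_type TEXT, owner_name TEXT, gid INTEGER, nfs_mode_bits INTEGER)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?, ?)",
    [
        ("/home/harry/run.sh", "FILE", "harry", 100, 0o750),    # in home dir
        ("/opt/tools/build.sh", "FILE", "harry", 100, 0o750),   # owner exec
        ("/opt/tools/deploy.sh", "FILE", "root", 42, 0o750),    # group exec
        ("/opt/tools/ls", "FILE", "root", 0, 0o755),            # world exec
        ("/opt/tools/notes.txt", "FILE", "harry", 100, 0o640),  # not executable
        ("/opt/tools/world.sh", "FILE", "harry", 4422, 0o751),  # world exec
        ("/opt/shared", "DIR", "harry", 4422, 0o770),           # directory
    ],
)

QUERY = """
    SELECT path
    FROM catalog
    WHERE element_type = 'FILE'
      AND path NOT LIKE '/home/harry/%'
      AND bitwise_and(nfs_mode_bits, 1) = 0          -- not world-executable
      AND (
            (owner_name = 'harry' AND bitwise_and(nfs_mode_bits, 64) > 0)
         OR (gid IN (4422, 42, 25343) AND bitwise_and(nfs_mode_bits, 8) > 0)
      )
    ORDER BY path
"""

harry_executables = [row[0] for row in conn.execute(QUERY)]
for path in harry_executables:
    print(path)
```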
A variation on this might be to identify world-executable files (or world-writable directories with element_type = ‘DIR’) in the entire VAST namespace.
I’d love to do this forever, but I told myself five cases. It’s worth noting that the power of the Catalog extends to domains entirely unrelated to the administration examples shown today.
Unexplored here are use cases for tagging data, which can allow for virtual arrangements of data based on tag contents. Also unexplored is the fact that the database table housing the Catalog is a snapshot of file system metadata state at the most recent snapshot timestamp.
Older snapshots are addressable as well. Imagine the reporting you can do by comparing snapshot tables from different dates.
We’re going to need a Part Three.