This is a work in progress
The actual location of the content that the Storage Manager will manage is referred to here as $ROOT. $ROOT is installation-specific, so the Storage Manager should be configurable to root the storage at any path.
For the IU instances, $ROOT must not reside on the same filesystem as the system root. IU VMs are limited to 40G for a system volume, and they cannot be grown. In these cases, a new filesystem under /srv/amp would be the preferred path.
For efficiency, the content managed by the Storage Manager must reside on a single POSIX-style filesystem. This provides several advantages:
- The filesystem can be exported to other local machines with the same semantics
- Rename operations are constant-time
- Link operations are constant-time
- Ownership and Permissions are well-understood
- Avoids out-of-disk issues when moving large files around
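As a rough illustration of the single-filesystem requirement, a deployment could check at startup that two directories really do share a filesystem. This is only a sketch: the function name same_filesystem is hypothetical, and comparing device IDs is one possible sanity check, not something this design mandates.

```python
import os


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if two existing paths report the same device ID.

    st_dev identifies the device holding each path. Rename and hard-link
    operations only work within a single filesystem, so a mismatch here
    means those operations would fail or degrade into a copy.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

A check like this could run once when the Storage Manager starts, so that a misconfigured mount is reported before any files are moved.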
The directory tree rooted at $ROOT has this structure:
This breaks the storage into three areas:
The data directory is the storage for all of the files managed by the storage manager. This would include master files, intermediate files, etc.
The structure within the data directory is implementation-dependent, but for efficiency it would ideally include a directory hashing mechanism of some sort.
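One possible directory hashing mechanism is sketched below. Everything here is an assumption for illustration: the function name hashed_path, the use of SHA-256, and the two-level, two-character fan-out are not specified by this design.

```python
import hashlib
from pathlib import Path


def hashed_path(data_root: Path, file_id: str, levels: int = 2, width: int = 2) -> Path:
    """Map a file identifier to a nested location under data_root.

    The leading hex characters of the identifier's SHA-256 digest become
    directory names, which spreads files across many small directories
    instead of one very large one.
    """
    digest = hashlib.sha256(file_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return data_root.joinpath(*parts, file_id)
```

With the defaults, each file lands two directories deep, and the same identifier always maps to the same path, so no lookup table is needed to find a file again.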
The incoming directory is effectively a dropbox for injecting data into the Storage Manager at the user's request (either directly or via a mechanism outside the scope of AMP). Using a mechanism that is TBD, the Storage Manager will be made aware of new files in this directory and move them into the $ROOT/data hierarchy for management.
The structure within this directory is TBD, but it could be something like the following (or completely different):
- A flat namespace where all files go, regardless of ownership, collection, etc. That data would need to come from another source.
- Per-collection directories where files placed into a collection directory will trigger this file's association with the collection
- Per-user directories working similarly to the above
The $WORKING hierarchy is where the MGM Adapters will store their work-in-progress files, state information or whatever:
Each directory at the top level of $WORKING corresponds to a job execution that is currently in progress. In the example above, two jobs (job-0001 and job-0002) are currently running. The naming is implementation-dependent.
Within each job directory, each node in the workflow will get a separate directory that a specific MGM Adapter instance is free to use in whatever manner necessary: temporary files, downloads from S3 storage, output of local MGMs, or whatever. Like the job directories, the node directory naming is up to the implementation.
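The job and node layout described here could be created on demand along these lines. This is a minimal sketch: node_workdir is a hypothetical helper, and the identifiers passed to it are placeholders, since naming is left to the implementation.

```python
from pathlib import Path


def node_workdir(working_root: Path, job_id: str, node_id: str) -> Path:
    """Create (if needed) and return the private directory for one workflow node."""
    path = working_root / job_id / node_id
    # parents=True creates the job directory on first use;
    # exist_ok=True makes repeated calls for the same node harmless.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

An MGM Adapter instance would then keep all of its scratch files under the returned path, which keeps cleanup down to removing a single job directory.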
Files appearing in the $ROOT/incoming tree will be moved into the $ROOT/data tree with rename() upon successful ingest.
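A rename-based ingest might look roughly like the following. The function name ingest and the way the destination is chosen are assumptions; only the use of rename comes from this design.

```python
import os
from pathlib import Path


def ingest(incoming_file: Path, destination: Path) -> Path:
    """Move a file from the incoming tree to its final place under data."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # On POSIX, rename within one filesystem is atomic and does not copy
    # any file data. If the two paths are on different filesystems the call
    # raises OSError instead of silently copying. Note that an existing
    # file at the destination is replaced without warning.
    os.rename(incoming_file, destination)
    return destination
```

Because no data is copied, the move takes the same time for a 1K file as for a 500G file, and it cannot run out of disk space partway through.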
Passing files for MGM Adapter Input
The Storage Manager will pass absolute path names of input files (residing within $ROOT/data) to MGM Adapters for processing. The MGM Adapters can read the files, send the data to an S3 bucket for processing, or do whatever else they need directly from the stored location, without an additional transfer. Additionally, since the $ROOT/data space is available to all MGM Adapters, only one copy of the data exists (unless an MGM Adapter copies it).
Capturing MGM Adapter Output
When an MGM Adapter has created new output, the path within the $WORKING tree is passed to the Storage Manager for ingest. The Storage Manager will ingest the output file in roughly the same manner as ingesting a new master: by moving the file into $ROOT/data once it is known to be correct.
The Storage Manager may set up the basic $ROOT directory tree, if one doesn't already exist in the $ROOT location.
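Setting up the basic tree could be as simple as the sketch below. The names data and incoming appear in this document; the name working for the $WORKING hierarchy is an assumption, as is the helper name ensure_tree.

```python
from pathlib import Path

# "data" and "incoming" are named in this document; "working" is an
# assumed directory name for the $WORKING hierarchy.
TOP_LEVEL_DIRS = ("data", "incoming", "working")


def ensure_tree(root: Path) -> None:
    """Create any missing top-level directories under root."""
    for name in TOP_LEVEL_DIRS:
        # exist_ok=True leaves an existing tree (and its contents) untouched.
        (root / name).mkdir(parents=True, exist_ok=True)
```

Calling this on every start is safe, because directories that already exist are left alone.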
Starting / Restarting
It is implementation-dependent whether the $WORKING tree is cleared on startup.
$WORKING tree maintenance
When a job is complete and the output data has been ingested into the Storage Manager, an AMP component should erase the job directory in $WORKING.
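Erasing a finished job directory might be sketched as follows. Which AMP component does this is not decided here; remove_job_dir is a hypothetical helper, and the guard against paths outside the working tree is an extra precaution rather than a stated requirement.

```python
import shutil
from pathlib import Path


def remove_job_dir(working_root: Path, job_id: str) -> None:
    """Delete one job directory, including every node directory inside it."""
    job_dir = (working_root / job_id).resolve()
    # Refuse to delete anything that is not a direct child of the working
    # root, e.g. when job_id contains "..".
    if job_dir.parent != working_root.resolve():
        raise ValueError(f"{job_id!r} is not a job directory under {working_root}")
    if job_dir.exists():
        shutil.rmtree(job_dir)
```

Removing a job that has already been cleaned up is treated as a no-op, so the call can be retried safely after a restart.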
Disk Space Management
It is assumed that the system administrator will monitor overall disk usage and add capacity as necessary for continued operation. However, it is the responsibility of AMP components to clean up transient data to keep disk usage low. Additionally, if AMP can reasonably foresee that an operation will exhaust the allocated disk space, it should inform the user (and admins) and abort the request. For example, if someone wants to upload a 500G file and only 400G is actually available on the disk, aborting the transaction early is far preferable to truncating the data.
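The early-abort check could be approximated with a free-space query before accepting an upload. This is a sketch under assumptions: has_room_for and the optional reserve_bytes margin are illustrative and not part of any agreed interface.

```python
import shutil


def has_room_for(path: str, incoming_bytes: int, reserve_bytes: int = 0) -> bool:
    """Return True if the filesystem holding path can take incoming_bytes.

    reserve_bytes is headroom to leave free for other activity.
    """
    free = shutil.disk_usage(path).free
    return incoming_bytes + reserve_bytes <= free
```

In the 500G upload scenario, this check would return False while only 400G is free, so the request can be rejected before any data is written. The result is only a snapshot: other writers can consume space after the check, so a failed write still has to be handled.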
Three components need direct access to this data (each of which corresponds to a top-level $ROOT directory):
- A source (outside the scope of AMP) will need to provide file data that will need to be managed by the Storage Manager
- The Storage Manager itself
- MGM Adapters (shims) will need access to the stored data and provide file data results
For the pilot, the same system user can be used for all components, but for a production system there should be more protection against malicious or accidental modification.
Using separate system users, all belonging to a common group, is one method that can be used to isolate file access. Specifically, only the system user that created a file (via an ingest or an MGM Adapter output) can write to it; the other system users cannot write to the file, but they can all read it because they share a common group.
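The owner-writes, group-reads scheme corresponds to file mode 0640. A minimal sketch follows; the helper name protect is hypothetical, and denying all access to users outside the group is an assumption that goes slightly beyond what is stated above.

```python
import os
import stat

# Owner may read and write, group may read, others get nothing (octal 0640).
FILE_MODE = stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP


def protect(path: str) -> None:
    """Apply the owner-write, group-read mode to a newly created file."""
    os.chmod(path, FILE_MODE)
```

For this to work, the files must also be owned by the common group, which depends on how the system users and directories are set up on the host.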