
Simple steps for depositing a large dataset


Note to all depositors

To make "big data" accessible via the IUScholarWorks repository, the data must be stored on the Scholarly Data Archive (SDA) and linked to from IUScholarWorks. In the process described below, we create a permanent URL (called a "PURL") that will remain stable and accessible in perpetuity, which is then listed under the "External Files" section of an item record in the repository. (See this example to better understand how it works.) We recognize that this is an imperfect solution, but rest assured it is only a temporary fix to an acknowledged challenge: providing free, stable, long-term access to research data. We will update this documentation with less onerous instructions for preparing your data for deposit once such a solution has been reached. - SRK, 12/11/2013

Detailed steps for depositing a large dataset

  1. Organize and describe the dataset

    a. Create a README file to accompany your data

The README file should also be uploaded as a regular bitstream into the DSpace record, and it should always be named 'README.txt'. The file contents should look similar to the following:

README FILE FOR LEADII Vortex2 Archive Dataset
Created by:
URI [IUScholarWorks Handle, e.g.,]
DOI [in URL format, e.g.,]

[List any special considerations for software/websites used to create the files, file formats, etc. For example, "These tar files were produced by a ///////////////// and can be viewed with ////////////// software packages."]


[Variable names or abbreviations that aren't self-evident, units of measurement, definitions for codes or symbols, etc.]

[each time you take a new step that will eventually change your output files from the source text, document it here.]
0. Cleaned up data
1. Uploaded files to
2. Named stopwords list using Options menu
3. etc

Source data was created by [data source name & contact information here].

This data is licensed for reuse under a Creative Commons Attribution 4.0 license [or other relevant data license--remember that data cannot be copyrighted in most cases].

[Note any limitations of your methodology, problems encountered with the software, etc.]

A template for the README can be found here; recommendations for writing READMEs can be found on Dryad and StackOverflow. If your files are from the same research project, you should note the differences between each file. (Was the data collected on different dates? In different areas? From different subjects? Determine what makes each dataset distinct, and record it.) Also identify any files that may contain sensitive information. You'll use this information in later steps. When writing your README file, imagine that you are writing for a user 20 years in the future, who may not know much about your research (or even your discipline). This will help you to describe your data in the clearest possible way.

Your README file, once completed, should be uploaded both to the Scholarly Data Archive (see Step 3) and separately to the DSpace item record (described in Step 2).

b. Organize your files

Once you have identified which files you plan to preserve and store, think about how you can organize them for a) ease of upload and b) understandability by other researchers. For example, if your project is large and has many datasets collected over a period of years, you may wish to create several separate compressed (zipped) files for upload, each containing nested folders of interrelated instrumentation data files, images, and data dictionaries. Alternatively, for smaller research projects, you may wish to simply create one zipped file for upload that includes a small number of clearly labeled, interrelated documents, images, or spreadsheets.
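As a sketch of this kind of organization (all file and directory names below are hypothetical placeholders, not part of the deposit workflow), related files can be grouped into subfolders and bundled into a single compressed archive with standard command-line tools:

```shell
# Hypothetical layout: group interrelated files into named subfolders.
mkdir -p myproject/instrument_data myproject/images
printf 'site,reading\nA,42\n' > myproject/instrument_data/run01.csv
printf 'placeholder image notes\n' > myproject/images/site_map.txt

# Bundle the whole project into one compressed (tar.gz) file for upload.
tar -czf myproject.tar.gz myproject

# List the archive contents to confirm everything was included.
tar -tzf myproject.tar.gz
```

Listing the archive before upload is a quick way to confirm that nothing was accidentally left out of the bundle.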

c. Create a checksum and manifest file for your dataset(s).

1) For files on a workstation, you can use the MD5 program (command line option here) to generate a checksum. For files already on the Scholarly Data Archive, you can generate a checksum by following these instructions.
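For example, on a Linux workstation with GNU coreutils, `md5sum` can generate and later verify a checksum (the filename below is a placeholder; on macOS, `md5 -r` produces similar output):

```shell
# Placeholder file standing in for your real dataset archive.
printf 'example dataset contents\n' > mydataset.tar.gz

# Generate the MD5 checksum and save it for later verification.
md5sum mydataset.tar.gz > mydataset.md5
cat mydataset.md5

# Later, confirm the file has not changed; prints "mydataset.tar.gz: OK".
md5sum -c mydataset.md5
```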

2) Create a manifest file that looks like this (download a template)


#manifest file for LEADII Vortex2 Archive Dataset (*) 

LEADII-Vortex2-dataset.tar.gz 30000000000 f2b425d71600959318e214450b7321f4

LEADII-Vortex2-dataset.tar.gz 30000000000 cc90092d9fdffa6466c6a8bfc3b8ce63 


The first line (preceded by "#") is a comment. To ensure that the manifest file is eventually linked to the correct DSpace item, insert a Handle link to the item record and the full title, identical to the title you will use in the item record to describe the dataset. The remaining lines correspond to each file ('bitstream') in the SDA, listing the filename, file size (in raw bytes), and an MD5 checksum.

Save your manifest file as "uniquefilename_manifest.txt".

* This Handle should be the item record handle, which will be created in Step 2. (For example, this item record includes the Handle in the URI field). You will edit your manifest file to include the Handle once the item record has been approved and the Handle is confirmed by your collection administrator.
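The manifest lines can be generated rather than typed by hand. A minimal sketch using GNU tools (the filenames here are hypothetical placeholders; `stat -c %s` is GNU-specific, and BSD/macOS uses `stat -f %z` instead):

```shell
# Placeholder dataset file standing in for the real archive.
printf 'example data\n' > Example-dataset.tar.gz

# Write the "#" comment line, then one line per file:
# filename, size in raw bytes, MD5 checksum.
{
  echo '#manifest file for Example Dataset [add Handle link once assigned]'
  for f in Example-dataset.tar.gz; do
    printf '%s %s %s\n' "$f" "$(stat -c %s "$f")" "$(md5sum "$f" | cut -d' ' -f1)"
  done
} > Example-dataset_manifest.txt

cat Example-dataset_manifest.txt
```

Remember to edit the comment line to include the real Handle once your item record has been approved (per the footnote above).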

d. Rename your files for understandability.

If you do not already follow file-naming conventions, you should rename your dataset file using a unique title so that it is clearly associated with you (as the data creator) and your research project. (For example, "Konkiel_2012Census&".) If you plan to upload several files as part of the same project, be sure they are easily differentiated (i.e. Konkiel_2012Census&, Konkiel_2012Census&, etc or Plale_LEADII-Vortex2-dataset_09242010.tar, Plale_LEADII-Vortex2-dataset_09252010.tar, etc).

Once your dataset has been renamed, update your manifest file's name to match the dataset's. (For example, I would want my file "Konkiel_2012Census&" to be accompanied by "Konkiel_2012Census&MaritalStatusStudy_manifest.txt".) The idea is for these files to be easily associated by anyone who would download your dataset from the repository. It will also help you stay organized if you are creating multiple manifest files at one time.

2. Create an item record in the repository

Follow the steps outlined on the Self-submission workflow page, uploading your README file directly to the item record during Step 3 of that process.

Collection Administrators

You will need to approve the item record at this point.

Once the item record has been approved, send a follow up email to the submitter noting the Handle and reminding them to send you the manifest file for the item record with the Handle included.

Once your item record has been approved, you will be asked to send the Collection Administrator your manifest file, with the Handle for your item record included. That updated manifest file will be uploaded to your item record by the Collection Administrator. You will also deposit that manifest file to the SDA in Step 3.

Collection Administrators

Upon receipt of the manifest file, login to IUScholarWorks repository and find the item record related to this dataset. On the left hand navigation bar, click "Edit this item" and then on the "Item bitstreams" tab. Click the "Upload a new bitstream" link to add the manifest file. Click "Return" to revisit the item record as it appears normally in the IR. Check to ensure that the manifest file has uploaded correctly and appears below the README file in the "Files in this item" section of the item record.

Email the submitter to let them know that they may now move on to uploading files to the SDA.

3. Upload the data files and accompanying manifest files to your Scholarly Data Archive account (if they are not there already).

You will need to upload the dataset(s), manifest file, and README file for the item record, using any of the supported upload options; the Karst-to-SDA workflow is described below.

4. Copy your data to the IUScholarWorks SDA 'dropbox' & notify

 Karst-to-SDA Workflow

    • $ ssh
    • (enter Karst username)
    • (enter password)
    • Make sure the file permissions are correct for items you plan to move:
      • directories: $ chmod 775 dir_name (drwxrwxr-x)
      • files: $ chmod 664 file_name (rw-rw-r--)
      • (Note: file permissions need to be set correctly whether you're moving Karst > SDA or SDA > SDA.)
    • $ module load hpss (to access HSI)
    • $ hsi (get an HSI shell prompt)
    • ? "Kerberos Principal:" (enter SDA username)
    • ? "Password for:" (enter SDA password)
    • ? umask 000 (this ensures that file permissions are not changed by HSI)
    • ? put /gpfs/home/i/u/user/Karst/filename : /hpss/i/u/iuswdata/dropbox/filename (use the PUT command)
      (Anyone can deposit to the dropbox folder, but they can't see what's in the directory once it's put there.)

Note: the 'dropbox' is write-only. You will not be able to see the file(s) once you've put or moved them to the 'dropbox.'

Once your data has been copied to the IUScholarWorks dropbox, send an email to (CC'ing your collection administrator) notifying IUSW that you've made a transfer to the dropbox.

In your email, include the following:

  • Full filename (extension included) for each file copied to the IUScholarWorks dropbox
  • The Handle for the item record to which you'd like the dataset added

IUSW will create a PURL for your dataset and then notify you and the collection administrator of that PURL. The collection administrator will add this PURL to your item record under the "External Files" section so that data stored on the SDA can be downloaded. You will be notified when your item record has been updated and your data is accessible on the repository.

Collection Administrators

Once you receive the PURL from IUSW, login to IUScholarWorks repository and find the item record related to this dataset. On the left hand navigation bar, click "Edit this item" and then on the "Item metadata" tab.

On the drop-down menu under "Add new metadata > Name", choose "dc.relation.uri" and paste the PURL into the "Value" field. Click "Add new metadata," then "Return" to revisit the item record as it appears to the end user.

Once you've confirmed that the README file, manifest file, and dataset PURL appear correctly, email the submitter to notify them that their item record is finalized and accessible on the IR.

5. Your data is now deposited!

You can now cite your dataset and accompanying documentation in journal articles using the Handle for your item record; add your dataset to systems like ImpactStory and VIVO to track citations of your work by others; and link to your dataset using the Handle, a persistent identifier designed to prevent broken links.

Questions? Email for help.
