Effective Data Management
Richard Brittain
Research Computing, Dartmouth College
This presentation can be found at:
http://www.dartmouth.edu/~rc/classes/data_management/
Course Handout: (last update Tuesday, 28-Feb-2012 11:00:07 EST)
With many thanks to MIT for allowing us to borrow content from their data management bootcamp.
http://libraries.mit.edu/data-management
Dartmouth Office of Sponsored Programs web site:
http://www.dartmouth.edu/~osp/resources/data_management_planning/
Thanks to Anne Graham, Amy Stout and Katherine McNeill of the MIT library staff
for the original version of this presentation (January 2010)
Permission is granted to download and use these notes, as long as
all copyright notices are kept intact.
Richard Brittain, Dartmouth College.
Why are you here?
- You're managing research data (your own, or your lab's)
- You're not sure how to do that
- You're not sure if you should worry about it
- You want some clues and pointers
- NSF now (Jan 2011) requires that you think about this stuff
What We Will Cover
- Data planning checklist
- Security and backups
- Directory structures and naming conventions
- Good file formats and long-term access
- Documentation and metadata
- Data sharing and citation
- Best practices for data retention and archiving
- Online Resources
Data Planning Checklist (1)
- What type of data will be produced?
- Will it be reproducible? What would happen if it got lost or became unusable later?
- How much of it, and at what growth rate? (MB/GB/TB)
- Will it change frequently?
- Who is it for?
- Who controls it (PI, student, lab, Dartmouth, funder)?
- How long should it be retained?
- e.g. 3-5 years, 10-20 years, permanently
Data Planning Checklist (2)
- Are there tools or software needed to create/process/visualize the data?
- Any privacy requirements from the funders or lab?
- e.g. human subjects (HIPAA requirements), personal data, high security data
- Any sharing requirements from the funders or lab?
- Any other funder requirements?
- International collaboration/sharing issues
What is Data? (1)
- Observational data captured around the time of the event
- Examples: Sensor readings, telemetry, survey results, neuroimages
- Usually irreplaceable
- Experimental data from lab equipment
- Examples: gene sequences, chromatograms, toroid magnetic field readings
- Often reproducible, but can be lengthy and expensive
What is Data? (2)
- Simulation data generated from test models
- Examples: climate models, economic models
- Models and metadata (inputs) more important than output data.
- Reproducible, but possibly expensive
- Derived or compiled data
- Examples: text and data mining, compiled database, 3D models
- Reproducible, but possibly expensive
- Samples and other non-digital data forms
- Samples, physical collections, notebooks etc. may all be considered data
for the purposes of presenting a data management plan
Security and Backups
- Assurance -- don't lose the data
- Integrity -- don't let the data be corrupted
- Privacy -- keep out unauthorized access
Data Storage Options
- Personal Computer
- internal or external drive
- CDs or DVDs - not built to last
- laptops/mobile devices vulnerable, but might be initial data acquisition tool
- Departmental or College server
- RStor (AFS) data volumes
(20GB default. 100GB-2TB volumes on request. 50TB currently in system. Off-campus access)
- ThayerFS
- MyFiles
- "Cloud" storage (e.g. Amazon S3)
- Occasional access, Read/Write
- Subject archive (e.g. Genbank)
RStor (AFS)
Campuswide storage for researchers
- 20GB free allocation to anyone in DND
- Data volumes up to 2TB may be purchased
- Small volumes backed up daily. Large volumes replicated daily
- Off-campus access; sponsored accounts; custom access controls for shared space
- Kerberos-based authentication
- Scalable, relatively cheap, currently 5 file servers
MyFiles / OurFiles
Campus file server for CIFS (Windows) shares. MyFiles is personal/private
and OurFiles is departmental shared space (custom access control)
- Provisionally 10GB Students, 20GB Staff, 25GB Faculty (free)
- More space can be purchased (price TBD)
- Available to anyone in the DND
- Accessible via Active Directory, or through WEBDAV
- Snapshots every 2 hr, all kept for 55 weeks
- Replicated to remote location (except snapshots)
- OurFiles in production, MyFiles in testing
Backup Your Data
- Complete system failure or loss
- "Disaster Recovery"; complete rebuild of a system
- Operating System, Applications; configuration and tuning; special purpose hardware; data
- Carbon Copy Cloner (Mac); Clonezilla (Linux, Windows)
- Hardware failure in storage technology
- RAID, local or remote mirrors
- User error
- Accidental deletion or corruption
- Reversion to older versions
- Mac Time Machine; folder sync utilities
- Make 3 copies
- e.g. original + external/local + external/remote
- Geographically distributed
- Local vs. remote depends on recovery time needed
Data Backup Options (1)
Option #1: external hard drive or tape backup system (local)
e.g. Windows backup, Mac TimeMachine, UNIX dumps, rsync
- Hard drive (2TB for <$200)
- Tape backup system (~ $5k for 10TB)
- CDs or DVDs are not built to last
Data Backup Options (3)
Option #3: Cloud Storage
- Amazon S3
- S3-based Remote Hard Drive Services
- Mozy (EMC mozy.com/)
- free client software, 448-bit blowfish encryption or AES key
- ~$5/mo unlimited storage for home; server license ~$6.95 + $0.50/GB per month.
- Carbonite (www.carbonite.com/)
- free client software, 1024-bit blowfish encryption
- ~$55/yr, unlimited storage
Data Backup Options (4)
Option #4: "Duplicate Archive" (darch):
Proposed Readonly/Offline Data Archiving solution
- low cost, using multiple commodity drives (SATA, USB3)
- write-once, readonly, mostly offline
- long term storage (7+ years)
- multiple copies, stored separately, checksummed and verified periodically
- possible cost < $1000/TB, minimum 4 copies, for 7+ years.
- may become an offered service on campus if enough people are interested, but the code for
managing the mirroring, checksumming and verification will all be available open-source
Files may be imported to the system by various means: AFS, NFS, CIFS, sftp, or by directly attaching data drives.
One convenient mechanism may be to use an AFS volume for preparation of content, then freeze it and import
directly to darch.
File level deduplication within a drive will make it efficient for snapshots of dynamic data. Some subset of
the data can be published as read-only shares for subsequent access. Files may be compressed and/or encrypted
before importing.
Recovery, Compression, Encryption
- Unencrypted ideally, encrypted if sensitive (such as for remote copy)
- Uncompressed ideally; consider compression for the remote copy
- may make a significant difference to data storage requirements
- makes files inherently less resilient to minor corruption
- compression may improve performance of data analysis software
- many image/video/audio formats are inherently compressed
- Test File Recovery
- At setup time and on a regular schedule
- To secure data
- protect your hardware (especially portable/mobile systems)
- if sensitive, use file encryption (e.g. PGP (Pretty Good Privacy))
- passwords and keys on paper (2 copies, secure) and in a PGP-encrypted file
- don't rely on 3rd party encryption alone (e.g., provided by cloud storage)
Places your data may be (local)
- Intentional - local disk (workstation, laptop)
- Memory - running programs, shared memory
- "trashcans", deleted files
- Swap/paging file, disk buffers, disk controller caches
- Hibernate file
- Browser caches, thumbnail resources, network file cache
- Filenames and other metadata also in registry, "recently used" lists, application and folder properties, cookies, history lists
- Forgotten duplicate copies on local disk
Places your data may be (external)
- Server disk (remote share)
- External drive backup copies - USB/FW disks, TimeMachine
- Disaster Recovery backups (e.g. CCC, Clonezilla)
- Network backup server, e.g. Netbackup (# of copies ?)
- Explicit replicate (offsite/cloud)
- USB thumb drive, pocket drive (e.g. transfer medium)
- Mobile devices (i*, MP3 player, camera), SD-card
- Backup staging server, print server
Directory Structure / Naming
Directory Structure (Folder Hierarchy) and File Naming Conventions
Good Directory Structure
- Use them!
- Top level directory/folder should include the project title,
unique identifier, and date (e.g. year).
- Substructure should have clear, documented naming convention
- e.g. each run of an experiment, each version of a dataset, each person in the group
- Use tools to help browse and display complex hierarchies
- e.g. tree (Unix/Linux): displays a directory hierarchy in tree form, with many
options; can create HTML output.
- Example: the Auroral Radio Noise data archive at Dartmouth, generated with:
tree -C -T "High Latitude Auroral Radio Noise Archive" -o tree.html -d -H http://caligari.dartmouth.edu/~radio/
File Version Control
Strategies include
- file-naming conventions
- standard file headers (inside the file), listing creation date, version number, status
- log files
- version control software (e.g. RCS, SVN (subversion))
Always record every change to a file, no matter how small.
Discard obsolete versions if no longer needed after making backups.
File Naming Conventions (1)
- Avoid non-portable characters in file names; stick to a standard character set (ASCII, Unicode)
- Assume the files may be copied to a different operating system
- Reserve the file "extension" for application-specific codes, e.g. formats like
WRL, MOV, TIFF
- Identify the activity or project in the file name, e.g. use the unique project name
or identifier.
Example:
Project_instrument_location_YYYYMMDD[hh][mm][ss][...extras].ext
- Use tools to index files and locate them quickly
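The naming pattern above can be built and checked by a short script. The following is a minimal sketch, not a standard: it assumes that the name components contain no underscores, that any "extras" are joined with an underscore, and that the project, instrument and location names in the usage note are hypothetical.

```python
import re
from datetime import datetime

# Assumed reading of: Project_instrument_location_YYYYMMDD[hh][mm][ss][_extras].ext
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z0-9-]+)_(?P<instrument>[A-Za-z0-9-]+)_"
    r"(?P<location>[A-Za-z0-9-]+)_(?P<stamp>\d{8}(?:\d{2}){0,3})"
    r"(?:_(?P<extras>[A-Za-z0-9_-]+))?\.(?P<ext>[A-Za-z0-9]+)$"
)

# Timestamp formats keyed by the number of digits in the stamp.
STAMP_FORMATS = {8: "%Y%m%d", 10: "%Y%m%d%H", 12: "%Y%m%d%H%M", 14: "%Y%m%d%H%M%S"}


def make_name(project, instrument, location, when, ext, extras=""):
    """Build a file name with a full 14-digit timestamp."""
    parts = [project, instrument, location, when.strftime("%Y%m%d%H%M%S")]
    if extras:
        parts.append(extras)
    return "_".join(parts) + "." + ext


def parse_name(filename):
    """Return the name's fields as a dict, or None if it breaks the convention."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    stamp = fields.pop("stamp")
    try:
        fields["timestamp"] = datetime.strptime(stamp, STAMP_FORMATS[len(stamp)])
    except ValueError:
        return None  # digits present but not a real date/time
    return fields
```

For example, make_name("ARN", "LFB", "Churchill", datetime(2003, 8, 19, 19, 59, 49), "raw") gives "ARN_LFB_Churchill_20030819195949.raw", and parse_name() on a file that does not follow the convention returns None, which makes stray files easy to list.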
File Naming Conventions (2)
Many academic disciplines have specific recommendations, e.g.
- DOE's Atmospheric Radiation Measurement (ARM) Program
- GIS datasets from Massachusetts State
Data file toolkit
- File indexing/searching
- Bulk renaming -- use free tools to help you
- Duplicate file finders
- File format verification
Bulk renaming -- use free tools to help you
- manpages.ubuntu.com/manpages/dapper/man1/prename.1.html -- rename command line utility (linux).
- thunar.xfce.org/pwiki/documentation/bulk_renamer -- Thunar bulk file rename GUI tool (linux)
- www.bulkrenameutility.co.uk/ -- Bulk Rename Utility (Windows)
- manytricks.com/namemangler/ -- NameMangler (Mac)
- renamer4mac.com/ - Renamer (Mac)
- www.powersurgepub.com/products/psrenamer.html -- psrenamer (Mac, Windows, Linux (Java))
Duplicate file finders
- www.stearns.org/freedups/README -- Freedups
- freedup.org/ -- Freedup
- duplicatefilessearcher.net/ -- Duplicate File Searcher
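For a rough idea of how such tools can work, here is a minimal sketch (not the method of any tool listed above): group files by size first, then confirm candidates by checksum. The function names are illustrative, and SHA-256 is an assumption; any strong hash would do.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    # Second pass: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

The sketch only reports duplicates; deciding which copy to keep is left to the person who knows the data.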
File format verification
- hul.harvard.edu/jhove/ -- jhove
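A much cruder check than a full validator such as jhove is to confirm that a file's leading bytes match what its extension claims. The sketch below assumes a small hand-picked table of well-known signatures (PDF, PNG, JPEG, TIFF); it does not validate the rest of the file.

```python
from pathlib import Path

# Leading "magic" bytes for a few common formats (illustrative subset only).
SIGNATURES = {
    "pdf": [b"%PDF-"],
    "png": [b"\x89PNG\r\n\x1a\n"],
    "jpg": [b"\xff\xd8\xff"],
    "jpeg": [b"\xff\xd8\xff"],
    "tif": [b"II*\x00", b"MM\x00*"],
    "tiff": [b"II*\x00", b"MM\x00*"],
}


def check_signature(path):
    """Return True or False for known extensions, None for unknown ones."""
    extension = Path(path).suffix.lower().lstrip(".")
    expected = SIGNATURES.get(extension)
    if expected is None:
        return None
    with open(path, "rb") as handle:
        head = handle.read(8)  # longest signature in the table is 8 bytes
    return any(head.startswith(signature) for signature in expected)
```

A False result usually means a mislabeled or truncated file and is worth investigating before the file goes into an archive.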
File Formats for Long-Term Access
File Formats for Long-Term Access (1)
Principles
- Unencrypted
- Uncompressed
- Non-proprietary
- Open, documented standard
- Common usage by research community
- Standard representation (ASCII, Unicode)
- Embedded metadata if possible
File Formats for Long-Term Access (2)
Examples
File Formats for Long-Term Access (3)
Discipline Standards
e.g.
Common File Formats
- Text: e.g. ASCII, Word, PDF
- Numerical: e.g. ASCII, SAS, Stata, Excel, netCDF, HDF
- Database: e.g. MySQL, MS Access, Oracle
- Multimedia: e.g. JPEG, TIFF, Dicom, MPEG, Quicktime
- Models: e.g. 3D VRML, X3D
- Software: e.g. Java, C, Fortran
- Domain-specific: e.g. FITS in Astronomy, CIF in Chemistry, ESRI in GIS
- Vendor-specific: e.g. Varian NMR data format, LeCroy digital oscilloscope format
Documentation and Metadata
Project Documentation (1)
- Title
- name of the dataset or research project that produced it
- Creator
- names and addresses of the organization or people who created the data,
including all significant contributors
- Identifier
- the identification number used to identify the data, even if it is just an internal
project reference number
- Subject
- keywords or phrases describing the subject or content of the data
Project Documentation (2)
- Dates
- key dates associated with the data, including project start and end date;
release date; other dates associated with the data lifespan, e.g. maintenance cycle,
update schedule
- Funders
- organizations or agencies who funded the research
- Language
- language(s) of the intellectual content of the resource, when relevant
Project Documentation (3)
- Location
- where the data relates to a physical location, record information about
its spatial coverage
- Rights
- description of any known intellectual property rights held for the data
- List of file names and relationships
- list of all digital files in the archive, with their names and file extensions
(e.g. 'NWPalaceTR.WRL', 'stone.mov')
Other Metadata (1)
- Formats
- format(s) of the data, e.g. FITS, SPSS, HTML, JPEG
- Methodology
- how the data was generated, including equipment or software used,
experimental protocol, other things you would include in your lab notebook.
Can reference a published article, if it covers everything.
- Workflows or analyses
- to be able to reproduce your work
- Sources
- references to source material for data derived from other sources, including details
of where the source data is held, how identified and accessed
Other Metadata (2)
- Versions
- date/time stamped, and use a separate ID (e.g. version number) for each version
- Checksums
- to test if your file has changed over time
- Explanation of codes used in file names
- brief explanation of any naming conventions or abbreviations used to label the files
- List of codes used in files
- list of any special values used in the data
(e.g. codes for categorical survey responses, '999' indicates a dummy value in the data, etc.)
- Store metadata in a text file (such as a README file or codebook) in the same directory as the data, if not using a format with integrated metadata.
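Recording the codes pays off when the codebook can be applied mechanically. The following is a minimal sketch: the column names, the codes and the CODEBOOK layout are all hypothetical, chosen only to illustrate the '999' dummy value and categorical codes mentioned above.

```python
import csv
import io

# Hypothetical codebook: per-column dummy values and category labels.
CODEBOOK = {
    "age": {"missing": {"999"}},
    "response": {"missing": {"9"}, "labels": {"1": "agree", "2": "disagree"}},
}


def decode_rows(csv_text, codebook):
    """Yield rows with dummy values replaced by None and codes by labels."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        decoded = {}
        for column, value in row.items():
            rules = codebook.get(column, {})
            if value in rules.get("missing", set()):
                decoded[column] = None
            else:
                # Fall back to the raw value when no label is defined.
                decoded[column] = rules.get("labels", {}).get(value, value)
        yield decoded
```

Without the codebook, a value such as 999 in an "age" column would be silently averaged into the results; with it, the value is dropped explicitly.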
Metadata tools
- Bag-it
- Library of Congress created
- Wrap up a directory, with file manifest, checksums, and optional additional metadata
- Verifiable
www.digitalpreservation.gov/partners/resources/tools/index.html#bagit
Bag-it Example (1)
bagit.txt
BagIt-Version: 0.96
Tag-File-Character-Encoding: UTF-8
manifest-md5.txt
4f4530ec94573e2d3f7dfe318ced1628 data/awst_0001_0001_0_00012.xml
c383d44f134032bb79c3a698f7f66b16 data/awst_0001_0001_0_00022.xml
ab02245e8e63758e080ef5f14c7804f8 data/awst_0001_0001_0_00025.xml
c0e11bac29f385f6757b76efeb48a67e data/awst_0001_0001_0_00010.xml
d0d408fd3142de7d3da69a8e7d93b29b data/awst_0001_0001_0_00029.xml
2b5ef9b643ea8961428b4fa0eaceb662 data/awst_0001_0001_0_00003.xml
3ede3352cac161ba23078162b3ed135f data/awst_0001_0001_0_00027.xml
edd78d7f2a2e5f588bb4b47038c64872 data/awst_0001_0001_0_00023.xml
....
Bag-it Example (2)
bag-info.txt
External-Identifier: library_escrow_ocn220883270
OCLC-Number: ocn220883270
ISBN: 9781599046501
External-Description: Agent and web service technologies in virtual....
Source-Organization: Gale Research Group
Organization-Address: IGI Global, 701 E. Chocolate Ave. Suite 200, Hershey, Pennsylvania
Contact-Name: A. N. Other
Contact-Email: another@igi-global.com
Payload-Oxum: 1479718.36
Bagging-Date: 2010-08-17
Bag-Size: 1.4 MB
Internal-Sender-Description: Millennium bibliographic record number b42328408.....
tagmanifest-md5.txt
5e9eb293148f6e98989a12398a512744 bagit.txt
be43f575526f3fc0806756fad5725b86 bag-info.txt
19fbfc7389058716d8ea770b2c944cb5 manifest-md5.txt
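The manifest shown in these examples is simple enough to produce and check by hand. The sketch below is not the Library of Congress tool; it is a minimal illustration that assumes a bag directory with a data/ subdirectory, writes a manifest-md5.txt in the "checksum path" layout shown above, and returns the payload byte count and file count in the "bytes.files" form that the Payload-Oxum line appears to use.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir):
    """Write manifest-md5.txt for bag_dir/data and return 'bytes.files'."""
    bag_dir = Path(bag_dir)
    payload = sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file())
    lines = [f"{md5_of(p)}  {p.relative_to(bag_dir).as_posix()}" for p in payload]
    manifest = bag_dir / "manifest-md5.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    total_bytes = sum(p.stat().st_size for p in payload)
    return f"{total_bytes}.{len(payload)}"


def verify_manifest(bag_dir):
    """Return the manifest paths that are missing or fail their checksum."""
    bag_dir = Path(bag_dir)
    failures = []
    manifest = bag_dir / "manifest-md5.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        target = bag_dir / rel_path
        if not target.is_file() or md5_of(target) != expected:
            failures.append(rel_path)
    return failures
```

Running verify_manifest() on a schedule, and after every copy or migration, is what makes the bag "verifiable" in practice: an empty list means every listed file is present and unchanged.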
Embedded Metadata Example
START_HEADER
FILE_TYPE RAW
SPECTRA_COUNT 39600
LOCATION Churchill, Manitoba
CHANNEL 0 LF-B
FILENAME g:\data\ch\CH03231L.R00
VERSION 4.33
DATE Tue Aug 19 19:59:49 2003
BYTES_PER_SAMPLE 1
FREQUENCY_COUNT 498
YEAR 2003
SPECTRUM_INTERVAL 2000
BYTES_PER_SPECTRUM 502
BEGIN_FREQUENCY_LIST
30.00000
40.00000
...
END_FREQUENCY_LIST
END_HEADER
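A header like this is easy to read back with a few lines of code, which is much of the appeal of embedded metadata. The sketch below assumes the layout seen in the example (one "KEY value" pair per line, with frequencies listed one per line between the BEGIN/END markers) and that the header has already been read as text; the real file format may differ.

```python
def parse_header(text):
    """Parse a START_HEADER ... END_HEADER block into a dict of strings."""
    fields = {}
    frequencies = []
    in_header = False
    in_frequency_list = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "START_HEADER":
            in_header = True
        elif line == "END_HEADER":
            break
        elif not in_header or not line:
            continue
        elif line == "BEGIN_FREQUENCY_LIST":
            in_frequency_list = True
        elif line == "END_FREQUENCY_LIST":
            in_frequency_list = False
        elif in_frequency_list:
            frequencies.append(float(line))
        else:
            # Everything after the first space is the value,
            # e.g. "LOCATION Churchill, Manitoba".
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    fields["FREQUENCY_LIST"] = frequencies
    return fields
```

A useful sanity check on a parsed header is that FREQUENCY_COUNT equals the length of the frequency list.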
Sample Metadata Standards
Data Sharing and Citation
Why Share Your Data
- Promote your research
- Replication
- Enable new discoveries
- Store your data in a reliable archive
- Comply with funding requirements
Example of a Data Sharing Policy
"The NIH expects and supports the
timely release and sharing of final research data from NIH-supported studies
for use by other researchers.
Starting with the October 2003 receipt date, investigators submitting an NIH application seeking $500,000
or more in direct costs in any single year are expected to include a plan for data sharing,
or state why data sharing is not possible"
Data Sharing
Minimal
- Share via email
- Post to a local web site
- Post to Google/Amazon/Microsoft shared cloud storage
- Place in AFS, world readable
Ideal
- Deposit to a data archive (domain or institution)
- GenBank, Protein Database
Data Identifiers
Must be globally unique and persistent
Many different schemes exist (discipline specific)
Intellectual Property Issues (1)
Sharing Data which you Produced Yourself
- Data is not copyrightable
- An original expression of data is copyrightable (e.g. chart or table in a book)
- Data can be licensed
- Culture of data sharing: you can make your data available under a CC0 declaration to make this explicit
Note: Laws about data vary outside the U.S.
Intellectual Property Issues (2)
Sharing data that you have collected from other sources
- You may or may not have the rights to do so
- It depends upon whether that data was licensed and has terms of use
- Most databases to which the libraries subscribe are licensed and prohibit redistribution outside
of Dartmouth
- Dartmouth researchers who are uncertain of their rights to disseminate data can consult
the Office of General Counsel
Note: Laws about data vary outside the U.S.
Citing Data (1)
- Cite a publication describing the data, e.g.
Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable
Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers,
20(1), 6-11.
Citing Data (2)
- Cite the data itself, e.g.
- Subject archive entry
e.g. Genbank ACCESSION number (e.g. Genbank: NP_002070)
- Complete citation
Bachman, Jerald G., Lloyd D. Johnston, and Patrick M. O'Malley, 2008
"Monitoring the Future: A Continuing Study of American Youth (12th-Grade Survey)",
ICPSR, DOI:1902.1/ICPSR02751-v1, http://dx.doi.org/10.3886/ICPSR25382
Citing Data (3)
- ISO 690-2 bibliographic referencing standard
- Can include: author, title, size, edition, language, publisher, publication date, publication place
- Assumes a unique identifier for the dataset
Citing Data (4)
Include
- Contributing investigator/authors
- Title of the work
- If the work is part of a larger work, give the title of the part
- Year of publication, issue, release
- The date when the dataset was published, issued, or released, not the
date when the data were collected, created, or processed, nor the date of the phenomena
characterized by the data
Citing Data (5)
Include
- Publisher
- the data center /repository /institution
- Identifier (including Edition/Version)
- Availability and access
- URL or other site where data is located
Data Retention and Archiving
Data Retention and Archiving (1)
From the checklist
Data Retention and Archiving (2)
- Keep all versions? Just final version? First and last?
- Depends on re-processing costs. If you can re-process the data, it is probably better to do so,
but keep all software and protocol/methodology information to support that
Remember
- Documentation is the most important thing
- Don't lose the bits
- Be neat (formats, file names)
- Think about what you want to accomplish
Over Time
- Test data restore from backup
- Check documentation and metadata
- Are files still readable?
- Still accessible at the published URL?
- Migrate files to newer formats
- Update software to read/write data
- Weed out obsolete data (and destroy where appropriate)
Additional Online Resources (1)
Guides to Data Management
Additional Online Resources (2)
Guides to Data Management
- www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf -- ICPSR Guide to Social Science Data Preparation and Archiving (pdf)
- daac.ornl.gov/PI/bestprac.html -- Oak Ridge National Laboratory
- www.data-archive.ac.uk/create-manage/ -- UK Data Archive: Manage and Share Data
- dmponline.hatii.arts.gla.ac.uk/ -- DMP Online: Data Management plan preparation tool
Additional Online Resources (3)
NSF discipline-specific data management plan guidelines
- http://nsf.gov/eng/general/ENG_DMP_Policy.pdf Engineering
- http://www.nsf.gov/geo/ear/2010EAR_data_policy_9_28_10.pdf Earth Sciences
- http://www.nsf.gov/pubs/2004/nsf04004/start.htm Ocean Sciences
- http://www.nsf.gov/bfa/dias/policy/dmpdocs/ast.pdf Astronomy
- http://www.nsf.gov/bfa/dias/policy/dmpdocs/che.pdf Chemistry
- http://www.nsf.gov/bfa/dias/policy/dmpdocs/dmr.pdf Materials Research
- http://www.nsf.gov/bfa/dias/policy/dmpdocs/dms.pdf Mathematical Sciences
- http://www.nsf.gov/bfa/dias/policy/dmpdocs/phy.pdf Physics
- http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf Social, Behavioral, and Economic Sciences
Additional Online Resources (4)
Metadata Standards and Digital Repositories
Where data management is concerned...
"Perfection is the Enemy of the Good"
just do the best you can