Effective Data Management

Richard Brittain

Research Computing, Dartmouth College

This presentation can be found at:

http://www.dartmouth.edu/~rc/classes/data_management/

Course Handout: (last update Tuesday, 28-Feb-2012 11:00:07 EST)


With many thanks to MIT for allowing us to borrow content from their data management bootcamp.

http://libraries.mit.edu/data-management

Dartmouth Office of Sponsored Programs web site: http://www.dartmouth.edu/~osp/resources/data_management_planning/

Thanks to Anne Graham, Amy Stout, and Katherine McNeill of the MIT library staff for the original version of this presentation (January 2010).

Permission is granted to download and use these notes, as long as all copyright notices are kept intact.

Richard Brittain, Dartmouth College.

Why are you here?


What We Will Cover

Data Planning Checklist (1)


Data Planning Checklist (2)


What is Data? (1)


What is Data? (2)

Security and Backups



Data Storage Options

RStor (AFS)

Campuswide storage for researchers

MyFiles / OurFiles

Campus file server for CIFS (Windows) shares. MyFiles is personal, private space; OurFiles is shared departmental space with custom access control.

Backup Your Data

Data Backup Options (1)


Option #1: external hard drive or tape backup system (local)
e.g. Windows Backup, Mac Time Machine, UNIX dump, rsync

Data Backup Options (2)


Option #2: Dartmouth's NetBackup service
http://www.dartmouth.edu/comp/soft-comp/datastorage/backups/

Data Backup Options (3)

Option #3: Cloud Storage

Data Backup Options (4)

Option #4: "Duplicate Archive" (darch)
Proposed read-only/offline data archiving solution
Files may be imported to the system by various means: AFS, NFS, CIFS, sftp, or by directly attaching data drives. One convenient mechanism may be to use an AFS volume to prepare the content, then freeze it and import it directly into darch.
File-level deduplication within a drive makes it efficient to store snapshots of dynamic data. A subset of the data can be published as read-only shares for subsequent access. Files may be compressed and/or encrypted before importing.
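
To illustrate what file-level deduplication means, the sketch below groups files by a SHA-256 digest of their contents. Files that share a digest are byte-identical, so an archive would need to keep only one copy of each group. This is only an illustration of the idea: the notes do not say how darch implements deduplication, and the function names `file_digest` and `find_duplicates` are made up for this example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map digest -> sorted list of paths, for digests seen more than once."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                groups[file_digest(path)].append(path)
    return {d: sorted(paths) for d, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates("/path/to/project")` (a placeholder path) returns one entry per set of identical files, which is also roughly what a duplicate file finder reports.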

Recovery, Compression, Encryption


Places your data may be (local)

Places your data may be (external)

Erasing Data

Example utilities for FAT and NTFS filesystems

Directory Structure / Naming






Directory Structure (Folder Hierarchy) and File Naming Conventions

Good Directory Structure


File Version Control


Strategies include
Always record every change to a file, no matter how small. Once backups have been made, discard obsolete versions that are no longer needed.
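
One simple way to follow this strategy without a full version control system is to save a dated, numbered copy of a file before each round of edits. The sketch below does that. The `versions` subdirectory, the `name_YYYYMMDD_vNN.ext` naming pattern, and the function name `save_version` are assumptions chosen for this example, not a convention prescribed by the course.

```python
import shutil
from datetime import date
from pathlib import Path


def save_version(path, versions_dir="versions", today=None):
    """Copy `path` to versions_dir/name_YYYYMMDD_vNN.ext and return the copy's path."""
    src = Path(path)
    dest_dir = src.parent / versions_dir
    dest_dir.mkdir(exist_ok=True)
    stamp = (today or date.today()).strftime("%Y%m%d")
    n = 1
    while True:
        # Use the first version number that is not already taken for this date.
        dest = dest_dir / f"{src.stem}_{stamp}_v{n:02d}{src.suffix}"
        if not dest.exists():
            break
        n += 1
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest
```

Because earlier copies are never overwritten, obsolete versions can be reviewed and discarded deliberately later on.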

File Naming Conventions (1)

File Naming Conventions (2)


Many academic disciplines have specific recommendations, e.g.

Data file toolkit

  • Bulk renaming -- use free tools to help you
    • manpages.ubuntu.com/manpages/dapper/man1/prename.1.html -- rename command-line utility (Linux)
    • thunar.xfce.org/pwiki/documentation/bulk_renamer -- Thunar bulk file rename GUI tool (Linux)
    • www.bulkrenameutility.co.uk/ -- Bulk Rename Utility (Windows)
    • manytricks.com/namemangler/ -- NameMangler (Mac)
    • renamer4mac.com/ -- Renamer (Mac)
    • www.powersurgepub.com/products/psrenamer.html -- psrenamer (Java; Mac, Windows, Linux)
  • Duplicate file finders
    • www.stearns.org/freedups/README -- Freedups
    • freedup.org/ -- Freedup
    • duplicatefilessearcher.net/ -- Duplicate File Searcher
  • File format verification
    • hul.harvard.edu/jhove/ -- jhove
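
For simple cases, a short script can do the same job as the bulk renaming tools listed above. The sketch below replaces runs of whitespace in file names with underscores, and by default only reports what it would change. The renaming rule and the function names `normalized_name` and `bulk_rename` are illustrative assumptions; adapt the rule to your own naming convention.

```python
import re
from pathlib import Path


def normalized_name(name):
    """Replace each run of whitespace in a file name with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())


def bulk_rename(directory, dry_run=True):
    """Return (old, new) name pairs; rename on disk only when dry_run is False."""
    changes = []
    planned = set()
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        new_name = normalized_name(path.name)
        if new_name == path.name:
            continue
        target = path.with_name(new_name)
        # Refuse to clobber an existing file or another planned rename.
        if target.exists() or new_name in planned:
            raise FileExistsError(f"{target} already exists; refusing to overwrite")
        planned.add(new_name)
        changes.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return changes
```

Running it once with the default `dry_run=True` and checking the returned list before running it again with `dry_run=False` is a cheap safeguard against an unintended mass rename.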






    File Formats for Long-Term Access

    File Formats for Long-Term Access (1)


    Principles

    File Formats for Long-Term Access (2)


    Examples

    File Formats for Long-Term Access (3)


    Discipline Standards e.g.

    Common File Formats

    Data Conversion Firms


    Documentation and Metadata







    Project Documentation (1)

    Project Documentation (2)

    Project Documentation (3)

    Other Metadata (1)

    Other Metadata (2)

    Metadata tools

    Bag-it Example (1)

    bagit.txt
    BagIt-Version: 0.96
    Tag-File-Character-Encoding: UTF-8
    
    manifest-md5.txt
    4f4530ec94573e2d3f7dfe318ced1628  data/awst_0001_0001_0_00012.xml
    c383d44f134032bb79c3a698f7f66b16  data/awst_0001_0001_0_00022.xml
    ab02245e8e63758e080ef5f14c7804f8  data/awst_0001_0001_0_00025.xml
    c0e11bac29f385f6757b76efeb48a67e  data/awst_0001_0001_0_00010.xml
    d0d408fd3142de7d3da69a8e7d93b29b  data/awst_0001_0001_0_00029.xml
    2b5ef9b643ea8961428b4fa0eaceb662  data/awst_0001_0001_0_00003.xml
    3ede3352cac161ba23078162b3ed135f  data/awst_0001_0001_0_00027.xml
    edd78d7f2a2e5f588bb4b47038c64872  data/awst_0001_0001_0_00023.xml
    ....
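
Each line of manifest-md5.txt pairs an MD5 checksum with the path of a payload file relative to the top of the bag, as in the listing above. The sketch below produces lines in that layout for every file under a bag's data/ directory. The function names `md5_of` and `manifest_lines` are made up for this example; in practice a dedicated BagIt tool would normally write the manifest.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def manifest_lines(bag_dir):
    """Return 'checksum  data/relative/path' lines for each payload file."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            # Paths are written relative to the bag, with forward slashes.
            rel = path.relative_to(bag).as_posix()
            lines.append(f"{md5_of(path)}  {rel}")
    return lines
```

Recomputing these lines later and comparing them with the stored manifest shows whether any payload file has been altered, added, or lost.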
    

    Bag-it Example (2)

    bag-info.txt
    External-Identifier: library_escrow_ocn220883270
    OCLC-Number: ocn220883270
    ISBN: 9781599046501
    External-Description: Agent and web service technologies in virtual....
    Source-Organization: Gale Research Group
    Organization-Address: IGI Global, 701 E. Chocolate Ave. Suite 200, Hershey, Pennsylvania
    Contact-Name: A. N. Other
    Contact-Email: another@igi-global.com
    Payload-Oxum: 1479718.36
    Bagging-Date: 2010-08-17
    Bag-Size: 1.4 MB
    Internal-Sender-Description: Millennium bibliographic record number b42328408.....
    
    tagmanifest-md5.txt
    5e9eb293148f6e98989a12398a512744  bagit.txt
    be43f575526f3fc0806756fad5725b86  bag-info.txt
    19fbfc7389058716d8ea770b2c944cb5  manifest-md5.txt
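
The Payload-Oxum value in bag-info.txt has the form octet-count.stream-count: the total number of bytes in the payload, a dot, then the number of payload files. The value 1479718.36 above therefore describes 36 files totalling 1,479,718 bytes, which agrees with the 1.4 MB Bag-Size. A minimal sketch for computing the value from a bag on disk follows; the function name `payload_oxum` is hypothetical.

```python
from pathlib import Path


def payload_oxum(bag_dir):
    """Return 'total_bytes.file_count' for the files under the bag's data/ directory."""
    files = [p for p in (Path(bag_dir) / "data").rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return f"{total_bytes}.{len(files)}"
```

Comparing this value with the one recorded in bag-info.txt is a quick completeness check that can be made before verifying any checksums.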
    

    Embedded Metadata Example

    START_HEADER
    FILE_TYPE RAW
    SPECTRA_COUNT 39600
    LOCATION Churchill, Manitoba
    CHANNEL 0 LF-B
    FILENAME g:\data\ch\CH03231L.R00
    VERSION 4.33
    DATE Tue Aug 19 19:59:49 2003
    BYTES_PER_SAMPLE 1
    FREQUENCY_COUNT 498
    YEAR 2003
    SPECTRUM_INTERVAL 2000
    BYTES_PER_SPECTRUM 502
    BEGIN_FREQUENCY_LIST   
    30.00000   
    40.00000   
    ...
    END_FREQUENCY_LIST
    END_HEADER
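
A header like the one above is easy for a program to read, which is much of the value of embedding metadata in the data file itself. The sketch below parses it under assumptions inferred only from this example: one `KEY value` pair per line between START_HEADER and END_HEADER, and one frequency per line between BEGIN_FREQUENCY_LIST and END_FREQUENCY_LIST. The function name `parse_header` is hypothetical, and the real file format may have rules that this example does not show.

```python
def parse_header(lines):
    """Parse header lines into (fields, frequencies)."""
    fields = {}
    frequencies = []
    in_header = False
    in_freq_list = False
    for raw in lines:
        line = raw.strip()
        if line == "START_HEADER":
            in_header = True
        elif line == "END_HEADER":
            break
        elif not in_header or not line:
            continue  # ignore blank lines and anything before the header
        elif line == "BEGIN_FREQUENCY_LIST":
            in_freq_list = True
        elif line == "END_FREQUENCY_LIST":
            in_freq_list = False
        elif in_freq_list:
            frequencies.append(float(line))
        else:
            # Split on the first space only, so values may contain spaces.
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    return fields, frequencies
```

With the example header, `fields["LOCATION"]` would be `Churchill, Manitoba` and `fields["CHANNEL"]` would be `0 LF-B`. All values are kept as strings and left for the caller to convert.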
    

    Sample Metadata Standards

    Data Sharing and Citation







    Why Share Your Data


    Example of a Data Sharing Policy



    "The NIH expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers

    Starting with the October 2003 receipt date, investigators submitting an NIH application seeking $500,000 or more in direct costs in any single year are expected to include a plan for data sharing, or state why data sharing is not possible"

    Data Sharing

    Minimal

    Ideal

    Data Identifiers


    Must be globally unique, persistent

    Many different schemes (discipline specific)

    Intellectual Property Issues (1)


    Sharing Data which you Produced Yourself

    Note: Laws about data vary outside the U.S.

    Intellectual Property Issues (2)


    Sharing data that you have collected from other sources

    Note: Laws about data vary outside the U.S.

    Citing Data (1)


    Citing Data (2)


    Citing Data (3)


    Citing Data (4)


    Include

    Citing Data (5)


    Include

    Data Retention and Archiving







    Data Retention and Archiving (1)


    From the checklist

    Data Retention and Archiving (2)


    Remember


    Over Time

    Additional Online Resources (1)

    Guides to Data Management

    Additional Online Resources (2)

    Guides to Data Management
    • www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf -- ICPSR Guide to Social Science Data Preparation and Archiving (pdf)
    • daac.ornl.gov/PI/bestprac.html -- Oak Ridge National Laboratory
    • www.data-archive.ac.uk/create-manage/ -- UK Data Archive: Manage and Share Data
    • dmponline.hatii.arts.gla.ac.uk/ -- DMP Online: Data Management plan preparation tool

    Additional Online Resources (3)

    NSF discipline-specific data management plan guidelines
    • http://nsf.gov/eng/general/ENG_DMP_Policy.pdf -- Engineering
    • http://www.nsf.gov/geo/ear/2010EAR_data_policy_9_28_10.pdf -- Earth Sciences
    • http://www.nsf.gov/pubs/2004/nsf04004/start.htm -- Ocean Sciences
    • http://www.nsf.gov/bfa/dias/policy/dmpdocs/ast.pdf -- Astronomy
    • http://www.nsf.gov/bfa/dias/policy/dmpdocs/che.pdf -- Chemistry
    • http://www.nsf.gov/bfa/dias/policy/dmpdocs/dmr.pdf -- Materials Research
    • http://www.nsf.gov/bfa/dias/policy/dmpdocs/dms.pdf -- Mathematical Sciences
    • http://www.nsf.gov/bfa/dias/policy/dmpdocs/phy.pdf -- Physics
    • http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf -- Social, Behavioral, and Economic Sciences

    Additional Online Resources (4)

    Metadata Standards and Digital Repositories

    Where data management is concerned...





    "Perfection is the Enemy of the Good"

    Just do the best you can.


    These notes may be found at http://www.dartmouth.edu/~rc/classes/data_management (multi-frame version), or http://www.dartmouth.edu/~rc/classes/data_management/s5.shtml (S5 slide show).
    The online version has many links to additional information and may be more up to date than the printed notes.
    (last update   Tuesday, 28-Feb-2012 11:00:07 EST)   ©Dartmouth College