Category Archives: Data Curation

Meet Data and Visualization Services

The fall of 2014 marks the completion of the first five years of the libraries’ Data and GIS Services Department. In 2009, when Mark Thomas and I formed the department, the name accurately reflected our staffing and services as Mark focused on GIS-related issues and I focused on data-related issues. As an increasing number of scholars have embraced data-driven research over the last five years, our services and staff have grown to support an increasingly diverse set of research needs at Duke.

In the 2010-2011 academic year, the Libraries launched services around data management and sharing plans in anticipation of new funding rules surrounding research data. In 2012, the library expanded data services in collaboration with OIT’s Research Computing to offer one of the first data visualization consulting positions in the country. In 2013 and 2014, we expanded services and staff to include consultations on research computing and big data.

At this year’s Data and GIS Services annual retreat, we decided that the time has come to change the name of the department to reflect the broader range of staff and consulting services available. While we continue to support our traditional dimensions of data and GIS research, we intend to support a range of data needs across the following five themes:

Data and Visualization Services Themes

Data Sources
Get the data you need. Data and Visualization Services consultants can help you locate and license a diverse range of data sources.  We also provide long term storage for Duke data collections through Duke’s institutional repository.

Data Storage and Management
Need help on a data management plan, want advice on archiving, or struggling with “big data” analytics?  We are happy to consult!

Data Cleaning and Analysis
From Google Refine to the command line, we can help with data cleaning and analysis.

Mapping and GIS
Mapping and spatial analysis remain a core service for the data and visualization program.

Data Visualization
Our data visualization service can help with the most effective way to represent your data for both analysis and communication.

 

We appreciate the research community’s support as we’ve grown over the last five years.  We look forward to working with you on a larger range of data challenges in the future!

Top 10 List – Data and GIS Edition

As we begin our summer in Data and GIS Services, we spend this post reflecting on some of the services, software, and tools that made data work this spring more productive and more visible.  We proudly present our top 10 list for the Spring 2014 semester:

10. DMPTool
While we enjoy working directly with researchers crafting data management plans, we realize that some data management needs arise outside of consultation hours.  Fortunately, the Data Management Planning Tool (DMPTool) is there 24/7 to provide targeted guidance on data management plans for a range of granting agencies.

9. Fusion Tables
A database in the cloud that allows you to query and visualize your data, Fusion Tables has proven a powerful tool for researchers who need database functionality but don’t have time for a full-featured database.  We’ve worked with many groups to map their data in the cloud; see the Digital Projects blog for an example.  Fusion Tables is the subject of a regular workshop in Data and GIS.

8. Open Refine
You could learn the UNIX command line and a scripting language to clean your data, but Open Refine opens data cleaning to a wider audience that is more concerned with simplicity than syntax.  Open Refine is also a regular workshop in Data and GIS.

7. R and RStudio
A programming language that excels at statistics and data visualization, R offers a powerful, open source solution to running statistics and visualizing complex data.  RStudio provides a clean, full-featured development environment for R that greatly enhances the analysis process.

6. Tableau Public
Need a quick, interactive data visualization that you can share with a wide audience?  Tableau Public excels at producing dynamic data visualizations from a range of different datasets and provides intuitive controls for letting your audience explore the data.

5. ArcOnline
ArcGIS has long been a core piece of software for researchers working with digital maps.  ArcOnline extends the rich mapping features of ArcGIS into the cloud, allowing a wider audience to share and build mapping projects.

4. Pandas
A Python library for data analysis and modeling, Pandas brings the ease and power of Python to a range of data management and analysis challenges.

3. RAW
Paste in your spreadsheet data, choose a layout, drag and drop your variables… and your visualization is ready.  Raw makes it easy to go from data to visualization using an intuitive, minimal interface.

2. Stata 13
Another core piece of software in the Data and GIS Lab (and at Duke), Stata 13 brought new features and flexibility (automatic memory management — “hello big data”) that were greatly appreciated by Duke researchers.

1. R Markdown
While many librarians tell people to “document your work,” R Markdown makes it easy to document your research data, explain results, and embed your data visualizations using a minimal markup language that works in any text editor and ties nicely into the R programming language.   For pulling it all together, R Markdown is number one in our top ten list!

We hope you’ve enjoyed the list!  If you are interested in these or other data tools and techniques, please contact us at askdata@duke.edu!

Scaling Support: Designing Data for a Growing Statistics Program

How do you support 57,860 online students learning R and statistics?  Late last fall, Data and GIS Services shared this challenge with Professor Mine Çetinkaya-Rundel and the staff of CIT as we sought to translate Professor Çetinkaya-Rundel’s successful Statistics 101 course to a Coursera class on Data Analysis and Statistical Inference.  While Data and GIS Services has supported Statistics 101 students for several years in identifying appropriate data and using the R statistical language for their assignments, the scale of the Coursera course introduced new challenges of trying to provide engaging data to a very large audience without having the opportunity to provide direct support to everyone in the class.

In our initial meetings with Professor Çetinkaya-Rundel, she requested that Data and GIS create data collections for the course that would provide easy access in R and would include a range of statistical measures that would appeal to the diverse audience in the class.  The first challenge — easy access to R — required some translation work.  While R excels in its flexibility, graphics, and statistical power, it lacks some of the built in data documentation features present in other statistical packages.  This project prompted Data and GIS to reconsider how to provide documentation and pre-formatted R data to an audience that would likely be unfamiliar with R and data documentation.

The second challenge, finding data that covered a wide range of interesting topics, proved much easier.  The General Social Survey with its diverse and engaging questions on a wide range of topics proved to be an easy choice for the class.  The American National Election Studies also offered a diverse set of measures of public opinion that suited the course well.  With these challenges identified and addressed, we spent the end of 2013 selecting portions of the data for class (subsetting), abridging the data documentation for instructional use, and transforming the data to address its usage in an online setting (processing missing values for R, creating factor variables).
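The kind of transformation described above (turning survey "missing" codes into true missing values and numeric codes into labeled categories) can be sketched in a few lines. This is only an illustration in Python using the standard library; the course data were prepared for R, and the codes, labels, and the `recode` helper below are hypothetical, not taken from the General Social Survey codebook.

```python
# Hypothetical codebook: these codes and labels are illustrative assumptions,
# not the actual General Social Survey coding.
MISSING_CODES = {0, 8, 9}  # e.g. "not applicable", "don't know", "no answer"
LABELS = {1: "Very happy", 2: "Pretty happy", 3: "Not too happy"}


def recode(value):
    """Return the label for a raw code, or None for missing or unknown codes."""
    if value in MISSING_CODES:
        return None
    return LABELS.get(value)


raw = [1, 2, 9, 3, 0, 2]
clean = [recode(v) for v in raw]
print(clean)
```

In R, the analogous steps would typically set such codes to `NA` and store the variable as a factor, so that summaries and plots show labels rather than numbers.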

As Professor Çetinkaya-Rundel’s class launches on February 17th, this project has given us a new appreciation of providing data and statistical services in a MOOC while also building course materials that we are using in Statistics 101 at Duke.  While students begin the Coursera course on Data Analysis and Statistical Inference, students in Professor Kari Lock Morgan’s Statistics 101 class will use these data in their on-campus Duke course as well.  We hope that both collections will reduce some of the technological hurdles that often confront courses using R and improve statistical literacy at Duke and beyond.

Data and GIS Services Spring 2014 Workshop Series

Explore network analysis, text mining, online mapping, data visualization, and statistics in our spring 2014 workshop series.  Our workshops provide a chance to explore new tools or refresh your memory on effective strategies for managing digital research.  Interested in keeping up to date with workshops and events in Data and GIS?  Subscribe to the dgs-announce listserv or follow us on Twitter (@duke_data).

Currently Scheduled Workshops

 Thu, Jan 9 2:00 PM – 3:30 PM  Data Management Plans – Grants, Strategies, and Considerations

 Mon, Jan 13 2:00 PM – 3:30 PM Webinar: Social Science Data Management and Curation

 Mon, Jan 13 3:00 PM – 4:00 PM Google Fusion Tables

 Tue, Jan 14 3:00 PM – 4:00 PM Open (aka Google) Refine 

 Wed, Jan 15 1:00 PM – 3:00 PM Stata for Research

 Thu, Jan 16 3:00 PM – 5:00 PM Analysis with R

 Tue, Jan 21 1:00 PM – 3:00 PM Introduction to ArcGIS

 Wed, Jan 22 1:00 PM – 3:00 PM ArcGIS Online

 Wed, Jan 22 3:00 PM – 4:00 PM Open (aka Google) Refine 

 Mon, Jan 27 2:00 PM – 3:30 PM Introduction to Text Analysis

 Wed, Jan 29 1:00 PM – 3:00 PM Analysis with R

 Thu, Jan 30 2:00 PM – 4:00 PM Stata for Research

 Mon, Feb 3 1:00 PM – 2:00 PM  Data Visualization on the Web

 Mon, Feb 3 2:00 PM – 3:00 PM  Data Visualization on the Web (Advanced)

 Tue, Feb 11 2:00 PM – 4:00 PM Using Gephi for Network Analysis and Visualization

 Wed, Feb 12 1:00 PM – 3:00 PM Introduction to ArcGIS

 Tue, Feb 18 2:00 PM – 3:30 PM Introduction to Tableau Public 8

 Tue, Feb 25 1:00 PM – 3:00 PM ArcGIS Online

 Thu, Feb 27 1:00 PM – 3:00 PM Historical GIS

 Mon, Mar 3 2:00 PM – 3:30 PM  Designing Academic Figures and Posters

 Tue, Mar 4 1:00 PM – 3:00 PM  Useful R Packages: Extensions for Data Analysis, Management, and Visualization

Access your Duke-Cloud from ANYWHERE

Say you’ve been making hella maps or data stories all day. Now you need to move to your comfy work spot and you need your data to come with you.  If you use Duke’s CIFS, moving around is easy, and all of your files are already backed up.

In this example we follow the researcher, Ms. Stu Fac-Staff.  Stu is part student, part faculty, and part staff at Duke University.  She needs a portable place for her data and wants easy access from her home, lab, and devices.  Stu also needs to easily share data with colleagues.  No problem!  Stu uses CIFS.

Here’s the scenario.  Ms. Stu Fac-Staff walks into the Data & GIS Lab in the Duke University Libraries with a flash drive full of data tables.  She gathers more supporting data and some advice about crunching the numbers.  Stu finishes her day with a visualization and map. (Proudly, Stu imagines this is going to get the A.  “Is this grant worthy?” Stu asks herself.  “You bet your NSF Application it is!”)  Meanwhile, her flash drive is now full and all she wants is to SAVE THE DATA, CONVENIENTLY for later retrieval back home. So Stu stores the data on the Duke Cloud (CIFS).

How do I get the free CIFS Space and how much can I use/access?

  • Duke University provides 5 GB (at least!) of easily accessible Cloud-storage space to all faculty, students, and staff
  • If you need more space, larger quantities are available upon request
  • The space is called CIFS and is an OIT supported personal home directory of portable file space; CIFS is a mappable drive on your device and the files are backed up
  • Students are provisioned CIFS space automatically.  Faculty & Staff must request the space through the OIT Service Desk

How do I access the data from my device?

  • In the Data & GIS Lab, after using your NetID to login, open the Windows File Explorer and your CIFS space will be mapped as drive Z.
  • After you leave our Data & GIS Lab, all you have to do is “map the drive” on your own machine
  • Web – For easy distribution to colleagues, you might want to access or distribute your files through the web.  To do this, store the files in your ‘public_html’ directory inside of your CIFS space.  Now the files can be downloaded via a web browser.  This method is, by default, open to the world; you may want to take additional steps to secure this public_html directory (see below).

    http://people.duke.edu/~NetID

     

Can I Secure the Data?

  • Are you trying to access your mapped drive from off campus?
    • Use the VPN directions
    • The CIFS protocol encrypts NetID/password but it does not encrypt your data stream over the Internet.  If you’re connecting from an unencrypted or untrusted network (e.g. wireless in the coffee shop), the VPN allows for a secure connection.
  • Did you put files in your public_html folder?
    • Unlike the default CIFS space, placing files in the ‘public_html’ directory means they become accessible to the world
    • You can control and limit access by following OIT’s “htaccess” instructions

Data and GIS Fall 2013 Newsletter

Analyze, discover, manage, map, and visualize your data with Duke Libraries Data and GIS Services.  Our team of five consultants provides a broad range of support in areas including data analysis, data visualization, geographic information systems, financial data, statistical software, and data storage and management.  Our lab provides 12 workstations with the latest data software and three Bloomberg Professional workstations nearly 24/7 for the Duke community.

Data and GIS Workshop Series

All are welcome to the Data and GIS Workshop Series.  Analyze, communicate, clean, map, represent and visualize your data with a wide range of workshops on data based research methods and tools.  Details and registration for each class are available at the links that follow.  (Interested in keeping up to date with workshops and events in Data and GIS?  Just go to https://lists.duke.edu/sympa/info/dgs-announce and click on the “Subscribe” link at the bottom left.)

    Tue, Sep 3, 2013      1:00 PM - 3:00 PM    Introduction to ArcGIS    
    Wed, Sep 4, 2013     10:00 AM - 11:30 AM   Stata for Research    
    Wed, Sep 11, 2013    10:00 AM - 11:00 AM   Open (aka Google) Refine     
    Thu, Sep 12, 2013     1:00 PM - 3:00 PM    Analysis with R    
    Tue, Sep 17, 2013     1:00 PM - 2:30 PM    Introduction to Tableau Public 8    
    Thu, Sep 19, 2013    10:00 AM - 11:00 AM   Google Fusion Tables    
    Mon, Sep 23, 2013     1:00 PM - 2:30 PM    Introduction to Tableau Public 8    
    Tue, Sep 24, 2013     1:00 PM - 2:30 PM    Stata for Research    
    Mon, Sep 30, 2013    10:00 AM - 11:00 AM   Top 10 Dos and Don'ts for Charts and Graphs    
    Mon, Sep 30, 2013     1:00 PM - 3:00 PM    Introduction to ArcGIS    
    Tue, Oct 8, 2013      1:00 PM - 2:30 PM    Introduction to Text Analysis    
    Thu, Oct 10, 2013     1:00 PM - 3:00 PM    ArcGIS Special Topics: Geocoding & Proximity Analysis    
    Thu, Oct 17, 2013     1:00 PM - 3:00 PM    Historical GIS    
    Mon, Oct 28, 2013     1:00 PM - 2:00 PM    Designing Academic Figures and Posters    
    Tue, Oct 29, 2013     1:00 PM - 3:00 PM    Web GIS Applications

Data and GIS also offers instruction tailored to courses or research teams. Please contact askdata@duke.edu to schedule a session!

Data Management

Data Management Planning – DMPTool – Get 24/7 online help for your next data management plan, including information about Duke resources available for your data work.

Statistical Software Updates

Explore all of our Data and GIS Lab resources on our site at http://library.duke.edu/data/about/lab.html or come visit us on the second floor of Perkins Library.

Job Opportunities in Data and GIS Services

Data & GIS Services is hiring!  We have two open positions for student web programmers interested in working on data visualization projects.  See the Library Student employment page (http://library.duke.edu/jobs/students.html) for more information on how to apply.  (The job can be found by searching for requisition number “DUL14-AMZ02”.)

New Data and Map Collections

CPS on Web (CPS Utilities Online)
CPS on Web is a set of utilities enabling you to access CPS data and documentation from this website.   You may make tables and graphs from the CPS data, download data extractions, make estimations, get summaries and statistical measures, search the documentation, and make your own variables as functions of the existing ones.

Global Financial Data
Global Financial Data is a collection of financial and economic data provided in ASCII or Excel format. Data includes: long-term historical indices on stock markets; Total Return data on stocks, bonds, and bills; interest rates; exchange rates; inflation rates; bond indices; commodity indices and prices; consumer price indices; gross domestic product; individual stocks; sector indices; treasury bill yields; wholesale price indices; and unemployment rates covering over 200 countries.

LandScan Global
The LandScan Global Population Database provides global population distribution in a gridded GIS format at 30 arc-second resolution (approximately 1×1 km cells). Oak Ridge National Laboratory developed modeling techniques to disaggregate and interpolate census data within administrative boundaries to create a GIS layer showing population distribution as accurately and as timely as possible. EastView provides this data to use in GIS software as a WMS (Web Mapping Service) or as a WCS (Web Coverage Service) to allow a user to incorporate population distribution into GIS mapping and analysis.
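As a rough check on the "approximately 1×1 km" figure, the ground size of a 30 arc-second cell can be estimated from the Earth's radius. The sketch below assumes a spherical Earth with a mean radius of 6371.0088 km; the `cell_size_km` function is a hypothetical helper for illustration and is not part of LandScan or any GIS package.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius; spherical Earth is an assumption


def cell_size_km(lat_deg, arc_seconds=30.0):
    """Return (north_south_km, east_west_km) for a grid cell at a latitude."""
    angle_rad = math.radians(arc_seconds / 3600.0)
    north_south = EARTH_RADIUS_KM * angle_rad
    # Meridians converge toward the poles, so east-west width shrinks
    # with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


for lat in (0, 36, 60):
    ns, ew = cell_size_km(lat)
    print(f"latitude {lat:>2}: {ns:.3f} km x {ew:.3f} km")
```

Under these assumptions a cell is a little under 1 km on a side at the equator and narrows east to west at higher latitudes, which is worth keeping in mind when comparing population counts per cell across regions.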

Contact Us

email: askdata@duke.edu
twitter: duke_data or duke_vis

Data Management Planning Advice – DMPTool @ Duke

Data and GIS Services is happy to announce the launch of a new service designed to provide detailed data management planning help online.  As an increasing number of granting agencies require a data management plan as part of the grant application process, the DMPTool provides “an open source, web application that assists researchers in producing data management plans and delivering them to funders.” For Duke researchers, the tool provides constantly updated advice about how to complete a data management plan while simultaneously highlighting Duke resources available from a variety of data support providers for the planning, maintenance, and sharing of research data.

We hope that the DMPTool will streamline the grant writing process and help researchers make the appropriate connections to resources available both at Duke and beyond for data management planning.  We welcome your comments and suggestions on this resource.

DMPTool

Catching up on computational biology resources

With the arrival of summer, now is a great time to catch up on these resources in computational biology and bioinformatics:

BioStar: Have a question on bioinformatics, computational genomics and biological data analysis but not sure who to ask? Try BioStar, which is an online open community of biologists ready to answer questions, even from “newbies”. You are also welcome to answer and comment on the questions. The more you do, the more reputation points you can earn toward your BioStar badge.

OpenHelix: The site provides a searchable collection of tutorials,  training materials, and exercises on the most popular genomic resources. The folks at OpenHelix also contract with resource providers to offer onsite, hands-on workshops at institutions. While most of their tutorials and training materials require a subscription, they do provide a suite of free tutorials, including ones on the UCSC Genome Browser and the RCSB Protein Data Bank.

Database: The Journal of Biological Databases and Data Curation: While maybe not beach reading, Database is a nice complement to the Nucleic Acids Research annual database issue. This open-access journal, launched in 2009, aims to provide a “platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.”

Have a computational biology resource you would like to recommend? Please leave a comment.