Category Archives: Statistics

2020 RStudio Conference Livestream Coming to Duke Libraries

Interested in attending the 2020 RStudio Conference, but unable to travel to San Francisco? With the generous support of RStudio and the Department of Statistical Science, Duke Libraries will host a livestream of the annual RStudio conference starting on Wednesday, January 29th at 11 AM. See the latest in machine learning, data science, data visualization, and R. Registration is required for the first session and the keynote presentations; please see the registration links and session information in the agenda that follows.

Wednesday, January 29th

Location: Rubenstein Library 249 – Carpenter Conference Room

11:00 – 12:00 RStudio Welcome – Special Live Opening Interactive Event for Watch Party Groups
12:00 – 1:00 Welcome from Hadley Wickham and Opening Keynote – Open Source Software for Data Science (JJ Allaire)
1:00 – 2:00 Data, visualization, and designing with AI (Fernanda Viegas and Martin Wattenberg, Google)
2:30 – 4:00 Education Track (registration is not required)
Meet you where you R (Lauren Chadwick, RStudio)
Data Science Education in 2022 (Carl Howe and Greg Wilson, RStudio)
Data science education as an economic and public health intervention in East Baltimore (Jeff Leek, Johns Hopkins)
Of Teacups, Giraffes, & R Markdown (Desiree Deleon, Emory)

Location: Edge Workshop Room – Bostock 127

5:15 – 6:45 All About Shiny  (registration is not required)
Production-grade Shiny Apps with golem (Colin Fay, ThinkR)
Making the Shiny Contest (Duke’s own Mine Cetinkaya-Rundel)
Styling Shiny Apps with Sass and Bootstrap 4 (Joe Cheng, RStudio)
Reproducible Shiny Apps with shinymeta (Carson Sievert, RStudio)
7:00 – 8:30 Learning and Using R (registration is not required)
Learning and using R: Flipbooks (Evangeline Reynolds, U Denver)
Learning R with Humorous Side Projects (Ryan Timpe, Lego Group)
Toward a grammar of psychological experiments (Danielle Navarro, University of New South Wales)
R for Graphical Clinical Trial Reporting (Frank Harrell, Vanderbilt)

Thursday, January 30th

Location: Edge Workshop Room – Bostock 127

12:00 – 1:00 Keynote: Object of type closure is not subsettable (Jenny Bryan, RStudio)
1:23 – 3:00 Data Visualization Track (registration is not required)
The Glamour of Graphics (William Chase, University of Pennsylvania)
3D ggplots with rayshader (Dr. Tyler Morgan-Wall, Institute for Defense Analyses)
Designing Effective Visualizations (Miriah Meyer, University of Utah)
Tidyverse 2019-2020 (Hadley Wickham, RStudio)
3:00 – 4:00 Livestream of RStudio Conference Sessions (registration is not required)
4:00 – 5:30 Data Visualization Track 2 (registration is not required)
Spruce up your ggplot2 visualizations with formatted text (Claus Wilke, UT Austin)
The little package that could: taking visualizations to the next level with the scales package (Dana Seidel, Plenty Unlimited)
Extending your ability to extend ggplot2 (Thomas Lin Pedersen, RStudio)
5:45 – 6:30 Career Advice for Data Scientists Panel Discussion (registration is not required)
7:00 – 8:00 Keynote: NSSD Episode 100 (Hilary Parker, Stitch Fix and Roger Peng, JHU)

Where can I find data (or statistics) on ___________?

Helping Duke students, staff and faculty to locate data is something that we in Data and Visualization Services often do.  In this blog post I will walk you through a sample search and share some tips that I use when I search for data and statistics.

“Hi there, I am looking for motorcycle registration numbers and sales volumes by age and sex for the United States.”

BREAKING DOWN THE QUESTION:

There are two types of data needed: motorcycle registration data and motorcycle sales data. There are two criteria that the data should be differentiated by: owner’s age and owner’s gender.
There is a geographic component: United States.

One criterion that is not given is time.  When a time frame isn’t provided, I assume that what is needed is the most current data available.  Something to consider is that “current” data will often still be a year or more old. It takes time for data to be gathered, cleaned and published.

***Pro-tip: When you are looking for data consider who/what/when and where – adding in those components makes it easier to construct your search.***

WHERE AND HOW DO I SEARCH?

If I do not immediately have a source in mind (and sometimes even if I do, just to hit all the bases) I will use Google and structure my search as follows: motorcycle sales and registration by age and gender united states.
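As a minimal sketch of that habit, the who/what/when/where components can be written down separately and then joined into one search string. The function name `build_query` and its parameters are hypothetical, chosen only for this illustration; they are not part of any search engine's interface.

```python
def build_query(what, who=None, where=None, when=None):
    """Join the non-empty components of a data question into one search string."""
    parts = [what, who, where, when]
    # Skip components that were not supplied (for example, no time frame given).
    return " ".join(p.strip() for p in parts if p and p.strip())


# The sample question from this post, broken into its components.
query = build_query(
    what="motorcycle sales and registration",
    who="by age and gender",
    where="united states",
)
print(query)
```

Writing the components out this way makes it obvious when one of them (here, the time frame) is missing from the original question.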

***Pro-tip: You can use Google (or the search engine of your choice) to search across both the resources we subscribe to and the open Web, but you will need to be connected via a Duke IP address to reach the subscribed content.***

EVALUATING RESULTS

One of the first results returned is from a database we subscribe to called Statista. This source gives me the number of motorcycle owners by age in 2018, which answers part of the question, but does not include sales information or a gender breakdown.

Another top result is a report on Motorcycle Trends in the United States from the Bureau of Transportation Statistics (BTS). Unfortunately, the report is from 2009 and the data cited in the article are from 2003-2007.  A search of the BTS site does not yield anything more current. However, when I check the source list at the bottom of the report, there are several sources listed that I will check directly once I’ve finished looking through my search results.

***Pro-tip: Always look for sources of data in reports and figures, even if the data are old. Heading to the source can often yield more current information.***

A third result that looks promising is from a motorcycling magazine: Motorcycle Statistics in America: Demographics Change for 2018. The article reports on statistics from the 2018 owner surveys conducted by the Motorcycle Industry Council (which is one of the sources that the Bureau of Transportation Statistics report listed). This article provides the percent of males and females that own motorcycles as well as the median age of motorcycle owners.  While this is pretty close to the data needed, it is worthwhile to look into the Motorcycle Industry Council. Experience has taught me, however, that industry data typically is neither open nor freely available.

CHECKING THE COMMON SOURCE

When I go to the Motorcycle Industry Council (MIC) Web site I find that they do, indeed, have a statistical report that comes out every year which gives a comprehensive overview of the motorcycle industry.  If you are not a member, you can buy a copy of the report, but it is expensive (nearly $500).

***Pro-tip: Always check the original source even if you anticipate that there may be a paywall – it’s a good idea to evaluate all sources to ensure that they are credible and authoritative.***

MAKING A DECISION

In this instance, I would ultimately advise the person to use the statistics reported in the article Motorcycle Statistics in America: Demographics Change for 2018. Secondary sources aren’t ideal, and can sometimes be complicated to cite, but when you can’t get access to the primary source and that primary source is the authority, it is your best bet.

***Pro-tip: If you are using a secondary source, you should name the original source in text, for example: Data from the 2018 Motorcycle Industry Council Owner Survey (as cited by Ultimate Motorcycling, 2019). Include a citation to the secondary source in your reference list, formatted according to the style you are using.***

PARTING THOUGHTS

In closing, the data you want might not always be the data you use, either because the data are proprietary or restricted, or because they simply don’t exist or don’t exist in a form you need and are able to use.  When this happens, take a moment to think about your research question and determine if you have the time and the resources needed to continue pursuing your question as it stands (purchasing, requesting, applying for, or collecting your own data), or if you need to broaden or change your focus to incorporate the resources you do find in a meaningful way.

Data and Visualization Spring 2016 Workshops

SPRING 2016: Data and Visualization Workshops

Interested in getting started in data-driven research or exploring a new approach to working with research data?  Data and Visualization Services’ spring workshop series features a range of courses designed to showcase the latest data tools and methods.  Begin working with data in our Basic Data Cleaning/Analysis or the new Structuring Humanities Data workshop.  Explore data visualization in the Making Data Visual class.  Our workshops offer a variety of approaches for meeting the challenges of 21st century data-driven research.  Please join us!

Workshop by Theme

DATA SOURCES

DATA CLEANING AND ANALYSIS

DATA ANALYSIS

MAPPING AND GIS

DATA VISUALIZATION

* – For these workshops, no prior experience with data projects is necessary!  These workshops are great introductions to basic data practices.

DVS Fall Workshops

Data and Visualization Services is happy to announce its Fall 2015 Workshop Series.  With workshops covering everything from basic data skills to data visualization, we have courses for a wide range of interests and skill levels.  New (and redesigned) workshops include:

  • OpenRefine: Data Mining and Transformations, Text Normalization
  • Historical GIS
  • Advanced Excel for Data Projects
  • Analysis with R
  • Web Scraping and Gathering Data from Websites

Workshop descriptions and registration information are available at:

library.duke.edu/data/news

 

    Workshop                                                           Date
    OpenRefine: Data Mining and Transformations, Text Normalization    Sep 9
    Basic Data Cleaning and Analysis for Data Tables                   Sep 15
    Introduction to ArcGIS                                             Sep 16
    Easy Interactive Charts and Maps with Tableau                      Sep 18
    Introduction to Stata                                              Sep 22
    Historical GIS                                                     Sep 23
    Advanced Excel for Data Projects                                   Sep 28
    Easy Interactive Charts and Maps with Tableau                      Sep 29
    Analysis with R                                                    Sep 30
    ArcGIS Online                                                      Oct 1
    Web Scraping and Gathering Data from Websites                      Oct 2
    Advanced Excel for Data Projects                                   Oct 6
    Basic Data Cleaning and Analysis for Data Tables                   Oct 7
    Introduction to Stata                                              Oct 14
    Introduction to ArcGIS                                             Oct 15
    OpenRefine: Data Mining and Transformations, Text Normalization    Oct 20
    Analysis with R                                                    Oct 20

 

Top 10 List – Data and GIS Edition

As we begin our summer in Data and GIS Services, we spend this post reflecting on some of the services, software, and tools that made data work this spring more productive and more visible.  We proudly present our top 10 list for the Spring 2014 semester:

10. DMPTool
While we enjoy working directly with researchers crafting data management plans, we realize that some data management needs arise outside of consultation hours.  Fortunately, the Data Management Planning Tool (DMPTool) is there 24/7 to provide targeted guidance on data management plans for a range of granting agencies.

9. Fusion Tables
A database in the cloud that allows you to query and visualize your data, Fusion Tables has proven a powerful tool for researchers who need database functionality but don’t have time for a full featured database.  We’ve worked with many groups to map their data in the cloud; see the Digital Projects blog for an example.  Fusion Tables is a regular workshop in Data and GIS.

8. Open Refine
You could learn the UNIX command line and a scripting language to clean your data, but Open Refine opens data cleaning to a wider audience that is more concerned with simplicity than syntax.  Open Refine is also a regular workshop in Data and GIS.

7. R and RStudio
A programming language that excels at statistics and data visualization, R offers a powerful, open source solution to running statistics and visualizing complex data.  RStudio provides a clean, full-featured development environment for R that greatly enhances the analysis process.

6. Tableau Public
Need a quick, interactive data visualization that you can share with a wide audience?  Tableau Public excels at producing dynamic data visualizations from a range of different datasets and provides intuitive controls for letting your audience explore the data.

5. ArcOnline
ArcGIS has long been a core piece of software for researchers working with digital maps.  ArcOnline extends the rich mapping features of ArcGIS into the cloud, allowing a wider audience to share and build mapping projects.

4. Pandas
A Python library for data analysis and modeling, Pandas brings the ease and power of Python to a range of data management and analysis challenges.

3. RAW
Paste in your spreadsheet data, choose a layout, drag and drop your variables… and your visualization is ready.  Raw makes it easy to go from data to visualization using an intuitive, minimal interface.

2. Stata 13
Another core piece of software in the Data and GIS Lab (and at Duke), Stata 13 brought new features and flexibility (automatic memory management — “hello big data”) that were greatly appreciated by Duke researchers.

1. R Markdown
While many librarians tell people to “document your work,” R Markdown makes it easy to document your research data, explain results, and embed your data visualizations using a minimal markup language that works in any text editor and ties nicely into the R programming language.   For pulling it all together, R Markdown is number one in our top ten list!

We hope you’ve enjoyed the list!  If you are interested in these or other data tools and techniques, please contact us at askdata@duke.edu!

Scaling Support: Designing Data for a Growing Statistics Program

How do you support 57,860 online students learning R and statistics?  Late last fall, Data and GIS Services shared this challenge with Professor Mine Çetinkaya-Rundel and the staff of CIT as we sought to translate Professor Çetinkaya-Rundel’s successful Statistics 101 course to a Coursera class on Data Analysis and Statistical Inference.  While Data and GIS Services has supported Statistics 101 students for several years in identifying appropriate data and using the R statistical language for their assignments, the scale of the Coursera course introduced new challenges of trying to provide engaging data to a very large audience without having the opportunity to provide direct support to everyone in the class.

In our initial meetings with Professor Çetinkaya-Rundel, she requested that Data and GIS create data collections for the course that would provide easy access in R and would include a range of statistical measures that would appeal to the diverse audience in the class.  The first challenge, easy access in R, required some translation work.  While R excels in its flexibility, graphics, and statistical power, it lacks some of the built-in data documentation features present in other statistical packages.  This project prompted Data and GIS to reconsider how to provide documentation and pre-formatted R data to an audience that would likely be unfamiliar with R and data documentation.

The second challenge, finding data that covered a wide range of interesting topics, proved much easier.  The General Social Survey, with its diverse and engaging questions on a wide range of topics, proved to be an easy choice for the class.  The American National Election Studies also offered a diverse set of measures of public opinion that suited the course well.  With these challenges identified and addressed, we spent the end of 2013 selecting portions of the data for class (subsetting), abridging the data documentation for instructional use, and transforming the data to address its usage in an online setting (processing missing values for R, creating factor variables).
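The three preparation steps named above (subsetting, processing missing values, creating factor variables) can be sketched in a few lines. The course data were prepared for R; this sketch uses Python with only the standard library, purely for illustration. The records, variable names, missing-value codes, and labels below are all invented and are not taken from the General Social Survey or the American National Election Studies.

```python
# Hypothetical raw records; every name and code here is invented for illustration.
RAW = [
    {"id": 1, "year": 2012, "age": 34, "sex": 1, "extra": "x"},
    {"id": 2, "year": 2012, "age": 99, "sex": 2, "extra": "y"},
    {"id": 3, "year": 2010, "age": 51, "sex": 9, "extra": "z"},
]

KEEP = ["id", "year", "age", "sex"]            # subsetting: variables to retain
MISSING_CODES = {"age": {98, 99}, "sex": {9}}  # assumed numeric codes meaning "missing"
LABELS = {"sex": {1: "Male", 2: "Female"}}     # assumed labels for a categorical variable


def prepare(records, keep, missing_codes, labels):
    """Subset variables, recode missing-value codes to None, and label categories."""
    cleaned = []
    for rec in records:
        row = {}
        for var in keep:
            value = rec[var]
            if value in missing_codes.get(var, set()):
                value = None  # plays the role of NA in R
            elif var in labels:
                # Similar in spirit to an R factor; unknown codes become None.
                value = labels[var].get(value)
            row[var] = value
        cleaned.append(row)
    return cleaned


for row in prepare(RAW, KEEP, MISSING_CODES, LABELS):
    print(row)
```

Keeping the missing-value codes and labels in small lookup tables, rather than scattered through the code, also doubles as a compact piece of data documentation.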

As Professor Çetinkaya-Rundel’s class launches on February 17th, this project has given us a new appreciation of providing data and statistical services in a MOOC while also building course materials that we are using in Statistics 101 at Duke.  While students begin the Coursera course on Data Analysis and Statistical Inference, students in Professor Kari Lock Morgan’s Statistics 101 class will use these data in their on-campus Duke course as well.  We hope that both collections will reduce some of the technological hurdles that often confront courses using R and improve statistical literacy at Duke and beyond.

Data and GIS Services Spring 2014 Workshop Series

Explore network analysis, text mining, online mapping, data visualization, and statistics in our spring 2014 workshop series.  Our workshops provide a chance to explore new tools or refresh your memory on effective strategies for managing digital research.  Interested in keeping up to date with workshops and events in Data and GIS?  Subscribe to the dgs-announce listserv or follow us on Twitter (@duke_data).

Currently Scheduled Workshops

 Thu, Jan 9 2:00 PM – 3:30 PM  Data Management Plans – Grants, Strategies, and Considerations

 Mon, Jan 13 2:00 PM – 3:30 PM Webinar: Social Science Data Management and Curation

 Mon, Jan 13 3:00 PM – 4:00 PM Google Fusion Tables

 Tue, Jan 14 3:00 PM – 4:00 PM Open (aka Google) Refine 

 Wed, Jan 15 1:00 PM – 3:00 PM Stata for Research

 Thu, Jan 16 3:00 PM – 5:00 PM Analysis with R

 Tue, Jan 21 1:00 PM – 3:00 PM Introduction to ArcGIS

 Wed, Jan 22 1:00 PM – 3:00 PM ArcGIS Online

 Wed, Jan 22 3:00 PM – 4:00 PM Open (aka Google) Refine 

 Mon, Jan 27 2:00 PM – 3:30 PM Introduction to Text Analysis

 Wed, Jan 29 1:00 PM – 3:00 PM Analysis with R

 Thu, Jan 30 2:00 PM – 4:00 PM Stata for Research

 Mon, Feb 3 1:00 PM – 2:00 PM  Data Visualization on the Web

 Mon, Feb 3 2:00 PM – 3:00 PM  Data Visualization on the Web (Advanced)

 Tue, Feb 11 2:00 PM – 4:00 PM Using Gephi for Network Analysis and Visualization

 Wed, Feb 12 1:00 PM – 3:00 PM Introduction to ArcGIS

 Tue, Feb 18 2:00 PM – 3:30 PM Introduction to Tableau Public 8

 Tue, Feb 25 1:00 PM – 3:00 PM ArcGIS Online

 Thu, Feb 27 1:00 PM – 3:00 PM Historical GIS

 Mon, Mar 3 2:00 PM – 3:30 PM  Designing Academic Figures and Posters

 Tue, Mar 4 1:00 PM – 3:00 PM  Useful R Packages: Extensions for Data Analysis, Management, and Visualization

Data and GIS Fall 2013 Newsletter

Analyze, discover, manage, map, and visualize your data with Duke Libraries Data and GIS Services.  Our team of five consultants provides a broad range of support in areas including data analysis, data visualization, geographic information systems, financial data, statistical software, and data storage and management.  Our lab provides 12 workstations with the latest data software and three Bloomberg Professional workstations nearly 24/7 for the Duke community.

Data and GIS Workshop Series

All are welcome to the Data and GIS Workshop Series.  Analyze, communicate, clean, map, represent and visualize your data with a wide range of workshops on data-based research methods and tools.  Details and registration for each class are available at the links that follow.  (Interested in keeping up to date with workshops and events in Data and GIS?  Just go to https://lists.duke.edu/sympa/info/dgs-announce and click on the “Subscribe” link at the bottom left.)

    Tue, Sep 3, 2013      1:00 PM - 3:00 PM    Introduction to ArcGIS    
    Wed, Sep 4, 2013     10:00 AM - 11:30 AM   Stata for Research    
    Wed, Sep 11, 2013    10:00 AM - 11:00 AM   Open (aka Google) Refine     
    Thu, Sep 12, 2013     1:00 PM - 3:00 PM    Analysis with R    
    Tue, Sep 17, 2013     1:00 PM - 2:30 PM    Introduction to Tableau Public 8    
    Thu, Sep 19, 2013    10:00 AM - 11:00 AM   Google Fusion Tables    
    Mon, Sep 23, 2013     1:00 PM - 2:30 PM    Introduction to Tableau Public 8    
    Tue, Sep 24, 2013     1:00 PM - 2:30 PM    Stata for Research    
    Mon, Sep 30, 2013    10:00 AM - 11:00 AM   Top 10 Dos and Don'ts for Charts and Graphs    
    Mon, Sep 30, 2013     1:00 PM - 3:00 PM    Introduction to ArcGIS    
    Tue, Oct 8, 2013      1:00 PM - 2:30 PM    Introduction to Text Analysis    
    Thu, Oct 10, 2013     1:00 PM - 3:00 PM    ArcGIS Special Topics: Geocoding & Proximity Analysis    
    Thu, Oct 17, 2013     1:00 PM - 3:00 PM    Historical GIS    
    Mon, Oct 28, 2013     1:00 PM - 2:00 PM    Designing Academic Figures and Posters    
    Tue, Oct 29, 2013     1:00 PM - 3:00 PM    Web GIS Applications

Data and GIS also offers instruction tailored to courses or research teams. Please contact askdata@duke.edu to schedule a session!

Data Management

Data Management Planning – DMPTool – Get 24/7 online help for your next data management plan, including information about Duke resources available for your data work.

Statistical Software Updates

Explore all of our Data and GIS Lab resources on our site at http://library.duke.edu/data/about/lab.html or come visit us on the second floor of Perkins Library.

Job Opportunities in Data and GIS Services

Data & GIS Services is hiring!  We have two open positions for student web programmers interested in working on data visualization projects.  See the Library Student employment page (http://library.duke.edu/jobs/students.html) for more information on how to apply.  (The job can be found by searching for requisition number “DUL14-AMZ02”.)

New Data and Map Collections

CPS on Web (CPS Utilities Online)
CPS on Web is a set of utilities enabling you to access CPS data and documentation from this website.   You may make tables and graphs from the CPS data, download data extractions, make estimations, get summaries and statistical measures, search the documentation, and make your own variables as functions of the existing ones.

Global Financial Data
Global Financial Data is a collection of financial and economic data provided in ASCII or Excel format. Data includes: long-term historical indices on stock markets; Total Return data on stocks, bonds, and bills; interest rates; exchange rates; inflation rates; bond indices; commodity indices and prices; consumer price indices; gross domestic product; individual stocks; sector indices; treasury bill yields; wholesale price indices; and unemployment rates covering over 200 countries.

LandScan Global
The LandScan Global Population Database provides global population distribution in a gridded GIS format at 30 arc-second resolution (approximately 1×1 km cells). Oak Ridge National Laboratory developed modeling techniques to disaggregate and interpolate census data within administrative boundaries to create a GIS layer showing population distribution as accurately and as timely as possible. EastView provides this data to use in GIS software as a WMS (Web Mapping Service) or as a WCS (Web Coverage Service) to allow a user to incorporate population distribution into GIS mapping and analysis.
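As a rough check on the "30 arc-second, approximately 1×1 km" figure, the cell size can be worked out from the angle alone. This is a minimal sketch that assumes a spherical Earth with a mean radius of 6,371 km; the function name `cell_size_km` is hypothetical and is not part of LandScan or any GIS package.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def cell_size_km(arc_seconds, latitude_deg=0.0):
    """Return (north-south, east-west) cell dimensions in km for a given angular size."""
    degrees = arc_seconds / 3600.0
    # Arc length along a meridian is the angle in radians times the radius.
    north_south = math.radians(degrees) * EARTH_RADIUS_KM
    # Circles of latitude shrink by cos(latitude), so cells narrow toward the poles.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = cell_size_km(30)  # at the equator
print(f"{ns:.2f} km x {ew:.2f} km")
```

At the equator this gives a cell of roughly 0.93 km on each side, which is where the "approximately 1×1 km" description comes from. Away from the equator the east-west dimension is smaller, so the cells cover less ground area at higher latitudes.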

Contact Us

email: askdata@duke.edu
twitter: duke_data or duke_vis

Upcoming MATLAB Training at Duke

MATLAB is an integrated technical computing environment that combines numeric computation, advanced graphics and visualization, and a high-level programming language.  Duke’s license agreement offers MATLAB licenses to faculty and staff for work or personal computers, as well as to students for on-campus use.  The Duke Office of Information Technology (OIT) maintains instructions on installing MATLAB at Duke.  MATLAB is used by many communities at Duke, including Engineering, Econometrics, Medical Sciences, Computational Biology, and Business.

On Tuesday, June 18, OIT in partnership with Duke University Libraries will host a one-day course on MATLAB that focuses on using this software for Data Processing and Visualization.  The course will cover importing data, organizing data, and visualizing data in a hands-on format (detailed outline).  Seats are limited to 20; please register soon to reserve your spot.

MATLAB for Data Processing and Visualization
(outline)
Laura Proctor, Academic Training Engineer at MathWorks
Tuesday, June 18
8:30 a.m. to 4:30 p.m. (lunch break from 12:00 p.m. to 1:00 p.m., lunch not provided)
Library Computer Classroom, Bostock 023
Registration (seats limited to 20)

The course assumes some existing familiarity with MATLAB.  New potential MATLAB users may want to attend an overview seminar on the software that will be held on Thursday, May 30.  This overview will not be hands on, but it will include live demonstrations and examples of both MATLAB and Simulink, an environment for multi-domain simulation and model-based design.

Introduction to Data Analysis and Visualization with MATLAB & Simulink
(details and registration)
Mehernaz Savai, Applications Engineer at MathWorks
Thursday, May 30
1:00 p.m. to 4:00 p.m.
FCIEMAS Building, Schiciano Auditorium – side A

If you would like to begin learning to use MATLAB, MathWorks offers a self-directed MATLAB Fundamentals course, and the Duke library collection also includes several introductory MATLAB texts, such as MATLAB Primer and MATLAB: A Practical Approach.