Got Data? Data Publishing Services at Duke Continue During COVID-19

While the library may be physically closed, the Duke Research Data Repository (RDR) is open and accepting data deposits. If you have a data sharing requirement you need to meet for a journal publisher or funding agency, we’ve got you covered. If you have COVID-19 data that can be openly shared, we can help make these vital research materials available to the public and the research community today. Or if you have data that needs to be under access restrictions, we can connect you to partner disciplinary repositories that support clinical trials data, social science data, or qualitative data.

Speaking of the RDR, we just completed a refresh on the platform and added several features!

In line with data sharing standards, we also assign a digital object identifier (DOI) to all datasets, provide structured metadata for discovery, curate data to further enhance datasets for reuse and reproducibility, provide safe archival storage, and supply a standardized citation for proper acknowledgement.

Openness supports the acceleration of science and the generation of knowledge. Within the libraries we look forward to partnering with Duke researchers to disseminate their research data! Visit https://research.repository.duke.edu/ to learn more or contact datamanagement@duke.edu with any questions.

Maps in Tableau

Making Maps with Tableau

One of the attractive features of Tableau for visualization is that it can produce maps in addition to standard charts and graphs. While Tableau is far from being a full-fledged GIS application, it continues to expand its mapping capabilities, making it a useful option to show where something is located or to show how indicators are spatially distributed.

Here, we’re going to go over a few of Tableau’s mapping capabilities. We’ve recorded a workshop with examples relating to this blog post’s discussion:

For a more general introduction to Tableau (including some mapping examples), you should check out one of these other past CDVS workshops:

Concepts to Keep in Mind

Tableau is a visualization tool: Tableau can quickly and effectively visualize your data, but it will not do specialized statistical or spatial analysis.

Tableau makes it easy to import data:  A big advantage of Tableau is the simplicity of tasks such as changing variable definitions between numeric, string, and date, or filtering out unneeded columns. You can easily do this at the time you connect to the data (“connect” is Tableau’s term for importing data into the program).

Tableau is quite limited for displaying multiple data layers: Tableau wants to display one layer, so you need to use join techniques to connect multiple tables or layers together. You can join data tables based on common attribute values, but to overlay two geographic layers (stack them), you must spatially join one layer to one other layer based on their common location.

Tableau uses a concept that it calls a “dual-axis” map to allow two indicators to display on the same map or to overlay two spatial layers. If, however, you do need to overlay a lot of data on the same map, consider using proper GIS software.

Dual-Axis map
Overlay spatial files using dual-axis maps

Displaying paths on a map requires a special data structure:  In order for tabular data with coordinate values (latitude/longitude) to display as lines on a map, you need to include a field that indicates drawing order. Tableau constructs the lines like connect-the-dots, each row of data being a dot, and the drawing order indicating how the dots are connected.

Lines
Using drawing order to create lines from points

You might use this, for instance, with hurricane tracking data, each row representing measurements and location collected sequentially at different times. The illustration above shows Paris metro lines with the station symbol diameter indicating passenger volume. See how to do this in Tableau’s tutorial.
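To make the drawing-order idea concrete, here is a minimal Python sketch that prepares tabular point data for a path map. The storm observations, the column names (track_id, point_order, and so on), and the helper functions are all hypothetical and exist only for illustration; the point is simply that each row gets a sequence number within its track before the table is handed to the mapping tool.

```python
import csv
import io

# Hypothetical storm-track observations: (track id, ISO timestamp, latitude, longitude).
# The rows are deliberately out of order, as raw data often is.
observations = [
    ("STORM_A", "2019-09-01T12:00", 26.5, -76.8),
    ("STORM_A", "2019-09-01T00:00", 26.3, -74.7),
    ("STORM_B", "2019-09-03T06:00", 15.1, -45.2),
    ("STORM_A", "2019-09-02T00:00", 26.6, -77.9),
    ("STORM_B", "2019-09-03T00:00", 14.8, -44.0),
]


def add_drawing_order(rows):
    """Sort rows by track and time, then number the points 1, 2, 3... within each track."""
    ordered = []
    counters = {}
    # ISO 8601 timestamps sort correctly as plain strings.
    for track, when, lat, lon in sorted(rows, key=lambda r: (r[0], r[1])):
        counters[track] = counters.get(track, 0) + 1
        ordered.append((track, counters[track], when, lat, lon))
    return ordered


def to_csv(rows):
    """Render the ordered rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["track_id", "point_order", "observed_at", "latitude", "longitude"])
    writer.writerows(rows)
    return buffer.getvalue()


path_rows = add_drawing_order(observations)
print(to_csv(path_rows))
```

In a table like this, the track identifier says which line a point belongs to, and the order column says where along that line it falls.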

You can take advantage of Tableau’s built-in geographies: Tableau has many built-in geographies (e.g., counties, states, countries), making it easy to plot tabular data that has an attribute with values for these geographic locations, even if you don’t have latitude/longitude coordinates or geographic files — Tableau will look up the places for you!  (It won’t, however, look up addresses.)

Tableau also has several built-in base maps available for your background.

Tableau uses the “Web Mercator” projection: This is the same projection that Google Maps uses. Small-scale maps (i.e., those covering a large area) may look stretched out in an unattractive way, since the projection greatly exaggerates the size of areas near the poles.
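To get a feel for how much a Mercator-style projection stretches things, here is a small Python sketch. It assumes the standard spherical Mercator relationship, in which lengths at a given latitude are scaled by 1 / cos(latitude) and areas by the square of that factor; the function names are ours, not anything from Tableau.

```python
import math


def mercator_linear_scale(latitude_deg):
    """Length scale factor of the spherical Mercator projection (1.0 at the equator)."""
    return 1.0 / math.cos(math.radians(latitude_deg))


def mercator_area_scale(latitude_deg):
    """Area scale factor: the linear factor applies in both map directions."""
    return mercator_linear_scale(latitude_deg) ** 2


for lat in (0, 30, 60, 75):
    print(
        f"{lat:>2} degrees: lengths x{mercator_linear_scale(lat):.2f}, "
        f"areas x{mercator_area_scale(lat):.2f}"
    )
```

At 60 degrees latitude, lengths are doubled and areas quadrupled relative to the equator, which is why high-latitude regions look so large on world maps.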

Useful Mapping Capabilities

Plot points: Tableau works really well for plotting coordinate data (Longitude (X) and Latitude (Y) values) as points.  The coordinates must have values in decimal degrees, with negative longitudes being west of Greenwich and negative latitudes being south of the equator.
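If your coordinates are recorded as degrees, minutes, and seconds with a hemisphere letter, they need converting to signed decimal degrees first. The following Python sketch shows one way to do that; the function name and the sample point (chosen to fall near Durham, North Carolina) are illustrative assumptions.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees.

    South latitudes and west longitudes come out negative.
    """
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"Unknown hemisphere: {hemisphere!r}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


# Illustrative point near Durham, North Carolina.
latitude = dms_to_decimal(36, 0, 0, "N")
longitude = dms_to_decimal(78, 56, 24, "W")
print(latitude, round(longitude, 4))
```

Because the point lies west of Greenwich, its longitude comes out negative.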

Points with time slider
Point data with time slider

Time slider: If you move a categorical “Dimension” variable onto Tableau’s Pages Card, you can get a value-based slider to filter your data by that variable’s values (date, for instance, as in Google Earth). This is shown in the image above.

Heatmap of point distribution: You can choose Tableau’s “Density” option on its Marks card to create a heatmap, which may display the concentration of your data locations in a smoother manner.

Filter a map’s features: Tableau’s Filter card is akin to ArcGIS’s Definition Query, to allow you to look at just a subset of the features in a data table.

Shade polygons to reflect attribute values: Choropleth maps (polygons shaded to represent values of a variable) are easy to make in Tableau. Generally, you’ll have a field with values that match a built-in geography, like countries of the world or US counties.  But you can also connect to spatial files (e.g., Esri shapefiles or GeoJSON files), which is especially helpful if the geography isn’t built into Tableau (US Census Tracts are an example).

Choropleth Map
Filled map using color to indicate values
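For readers who have not seen inside a spatial file, here is a minimal Python sketch that builds a tiny GeoJSON FeatureCollection with two polygons, each carrying a value that could drive the shading. The zone names, values, and bounding coordinates are invented for illustration; real boundary files would come from a data provider rather than being typed by hand.

```python
import json


def square_feature(name, value, west, south, east, north):
    """Build a GeoJSON Feature for a rectangular polygon with two attributes."""
    # GeoJSON positions are [longitude, latitude]; the ring must end where it starts.
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],
    ]
    return {
        "type": "Feature",
        "properties": {"name": name, "value": value},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


# Hypothetical zones and values, purely for illustration.
collection = {
    "type": "FeatureCollection",
    "features": [
        square_feature("Zone A", 12.5, -79.0, 35.9, -78.9, 36.0),
        square_feature("Zone B", 48.0, -78.9, 35.9, -78.8, 36.0),
    ],
}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text[:200])
```

Saved with a .geojson extension, text like this is the kind of spatial file described above: each feature pairs a shape with the attribute values a filled map would use for color.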

Display multiple indicators: Visualizing two variables on the same map is always problematic because the data patterns often get hidden in the confusion, but it is possible in Tableau.  Use the “dual-axis” map concept mentioned above.  An example might be pies for one categorical variable (with slices representing the categories) on top of choropleth polygons that visualize a continuous numeric variable.

Multiple variables
Two variables using filled polygons and pies

Draw lines from tabular data: Tableau can display lines if your data is structured right, as discussed and illustrated previously, with a field for drawing order. You could also connect to a spatial line file, such as a shapefile or a GeoJSON file.

Help Resources

We’ve just given an overview of some of Tableau’s capabilities regarding spatial data. The developers are adding features in this area all the time, so stay tuned!

2020 RStudio Conference Livestream Coming to Duke Libraries

Interested in attending the 2020 RStudio Conference, but unable to travel to San Francisco? With the generous support of RStudio and the Department of Statistical Science, Duke Libraries will host a livestream of the annual RStudio conference starting on Wednesday, January 29th at 11AM. See the latest in machine learning, data science, data visualization, and R. Registration is required for the first session and keynote presentations; registration links and session information appear in the agenda that follows.

Wednesday, January 29th

Location: Rubenstein Library 249 – Carpenter Conference Room

11:00 – 12:00 RStudio Welcome – Special Live Opening Interactive Event for Watch Party Groups
12:00 – 1:00 Welcome from Hadley Wickham and Opening Keynote – Open Source Software for Data Science (JJ Allaire)
1:00 – 2:00 Data, visualization, and designing with AI (Fernanda Viegas and Martin Wattenberg, Google)
2:30 – 4:00 Education Track (registration is not required)
Meet you where you R (Lauren Chadwick, RStudio)
Data Science Education in 2022 (Carl Howe and Greg Wilson, RStudio)
Data science education as an economic and public health intervention in East Baltimore (Jeff Leek, Johns Hopkins)
Of Teacups, Giraffes, & R Markdown (Desiree Deleon, Emory)

Location: Edge Workshop Room – Bostock 127

5:15 – 6:45 All About Shiny  (registration is not required)
Production-grade Shiny Apps with golem (Colin Fay, ThinkR)
Making the Shiny Contest (Duke’s own Mine Cetinkaya-Rundel)
Styling Shiny Apps with Sass and Bootstrap 4 (Joe Cheng, RStudio)
Reproducible Shiny Apps with shinymeta (Carson Sievert, RStudio)
7:00 – 8:30 Learning and Using R (registration is not required)
Learning and using R: Flipbooks (Evangeline Reynolds, U Denver)
Learning R with Humorous Side Projects (Ryan Timpe, Lego Group)
Toward a grammar of psychological Experiments (Danielle Navarro, University of New South Wales)
R for Graphical Clinical Trial Reporting (Frank Harrell, Vanderbilt)

Thursday, January 30th

Location: Edge Workshop Room – Bostock 127

12:00 – 1:00 Keynote: Object of type closure is not subsettable (Jenny Bryan, RStudio)
1:23 – 3:00 Data Visualization Track (registration is not required)
The Glamour of Graphics (William Chase, University of Pennsylvania)
3D ggplots with rayshader (Dr. Tyler Morgan-Wall, Institute for Defense Analyses)
Designing Effective Visualizations (Miriah Meyer, University of Utah)
Tidyverse 2019-2020 (Hadley Wickham, RStudio)
3:00 – 4:00 Livestream of RStudio Conference Sessions (registration is not required)
4:00 – 5:30 Data Visualization Track 2 (registration is not required)
Spruce up your ggplot2 visualizations with formatted text (Claus Wilke, UT Austin)
The little package that could: taking visualizations to the next level with the scales package (Dana Seidel, Plenty Unlimited)
Extending your ability to extend ggplot2 (Thomas Lin Pedersen, RStudio)
5:45 – 6:30 Career Advice for Data Scientists Panel Discussion (registration is not required)
7:00 – 8:00 Keynote: NSSD Episode 100 (Hilary Parker, Stitch Fix and Roger Peng, JHU)

Duke University Libraries Partners with the Qualitative Data Repository

Duke University Libraries has partnered with the Qualitative Data Repository (QDR) as an institutional member to provide qualitative data sharing, curation, and preservation services to the Duke community. QDR is located at Syracuse University and has staff and infrastructure in place to specifically address some of the unique needs of qualitative data including curating data for future reuse, providing mediated access, and assisting with Data Use Agreements.

Duke University Libraries has long been committed to helping our scholars make their research openly accessible and stewarding these materials for the future. Over the past few years, this has included launching a new data repository and curation program, which accepts data from any discipline, as well as joining the Data Curation Network. Now, through our partnership with QDR, we can further enhance our support for sharing and archiving qualitative data.

Qualitative data come in a variety of forms including interviews, focus groups, archival materials, textual documents, observational data, and some surveys. QDR can help Duke researchers have a broader impact through making these unique data more widely accessible.

“Founded and directed by qualitative researchers, QDR is dedicated to helping researchers share their qualitative data,” says Sebastian Karcher, QDR’s associate director. “Informed by our deep understanding of qualitative research, we help researchers share their data in ways that reflect both their ethical commitments and do justice to the richness and diversity of qualitative research. We couldn’t be more excited to continue our already fruitful partnership with Duke University Libraries.”

Through this partnership, Duke University Libraries will have representation on the governance board of QDR and be involved in the latest developments in managing and sharing qualitative data. The libraries will also be partnering with QDR to provide virtual workshops in the spring semester at Duke to enhance understanding around the sharing and management of qualitative research data.

If you are interested in learning more about this partnership, contact datamanagement@duke.edu.

Introducing Felipe Álvarez de Toledo, 2019-2020 Humanities Unbounded Digital Humanities Graduate Assistant

Felipe Álvarez de Toledo López-Herrera is a Ph.D. candidate at the Art, Art History, and Visual Studies Department at Duke University and a Digital Humanities Graduate Assistant for Humanities Unbounded, 2019-2020.  Contact him at askdata@duke.edu.

Over the 2019-2020 academic year, I am serving as a Humanities Unbounded graduate assistant in Duke Libraries’ Center for Data and Visualization Sciences. As one of the three Humanities Unbounded graduate assistants, I will partner on Humanities Unbounded projects and focus on developing skills that are broadly applicable to support humanities projects at Duke. In this blog post, I would like to introduce myself and give readers a sense of my skills and interests. If you think my profile could address some of the needs of your group, please reach out to me through the email above!

My own dissertation project began with a data dilemma. Four hundred years ago, paintings were shipped across the Atlantic by the thousands.  They were sent by painters and dealers in places like Antwerp or Seville, for sale in the Spanish colonies. But most of these paintings were not made to last. Cheap supports and shifting fashions guaranteed a constant renewal of demand, and thus more work for painters, in a sort of proto-industrial planned obsolescence.[1] As a consequence, the canvas, the traditional data point of art history, was not a viable starting point for my own research, rendering powerless many of the tools that art history has developed for studying painting. I was interested in examining the market for paintings as it developed in Seville, Spain from 1500-1700; it was a major productive center which held the idiosyncratic role of controlling all trade to the Spanish colonies for more than 200 years. But what could I do when most of the work produced within it no longer exists?

This problem drives my research here at Duke, where I apply an interdisciplinary, data-driven approach. My own background is the product of two fields: I obtained a bachelor’s degree in Economics in my hometown of Barcelona, Spain in 2015 from the Universitat Pompeu Fabra, and simultaneously attended art history classes at the University of Barcelona. This combination found a natural mid-way point in the study of art markets. I came to Duke to be a part of DALMI, the Duke Art, Law and Markets Initiative, led by Professor Hans J. Van Miegroet, where I was introduced to the methodologies of data-driven art historical research.

Documents in Seville’s archives reveal a stunning diversity of production that encompasses the religious art for which the city is known, but also includes still lifes, landscapes and genre scenes whose importance has been understated and of which few examples remain [Figures 1 & 2]. But analysis of individual documents, or small groups of them, yields limited information. Aggregation, with an awareness of the biases and limitations in the existing corpus of documents, seems to me a way to open up alternative avenues for research. I am creating a database of painters in the city of Seville from 1500-1699, where I pool known archival documentation relating to painters and painting in this city and extract biographical, spatial and productive data to analyze the industry. I explore issues such as the industry’s size and productive capacity, its organization within the city, reactions to historical change and, of course, its participation in transatlantic trade.

This approach has obliged me to become familiar with a wide range of digital tools. I use OpenRefine for cleaning data, R and Stata for statistical analysis, Tableau for creating visualizations and ArcGIS for visualizing and generating spatial data (see examples of my own work below [Figures 3-4]). I have also learned the theory behind relational databases and am learning to use MySQL for my own project; similarly, for the data-gathering process I am interested in learning data-mining techniques through machine learning. I have been using a user-friendly software called RapidMiner to simplify some of my own data gathering.

Thus, I am happy to help any groups that have a data set and want to learn how to visualize it graphically, whether through graphs, charts or maps. I am also happy to help groups think about their data gathering and storage. I like to consider data in the broadest terms: almost anything can be data, if we correctly conceptualize how to gather and utilize it realistically within the limits of a project. I would like to point out that this does not necessarily need to result in visualization; this is also applicable if a group has a corpus of documents that they want to store digitally. If any groups have an interest in text mining and relational databases, we can learn simultaneously—I am very interested in developing these skills myself because they apply to my own project.

I can:

  • Help you consider potential data sources and the best way to extract the information they contain
  • Help you make them usable: teach you to structure, store and clean your data
  • And of course, help you analyze and visualize them
    • With Tableau: for graphs and infographics that can be interactive and can easily be embedded into dashboards on websites.
    • With ArcGIS: for maps that can also be interactive and embedded onto websites or in their Stories function.
  • Help you plan your project through these steps, from gathering to visualization.

Once again, if you think any of these areas are useful to you and your project, please do not hesitate to contact me. I look forward to collaborating with you!

[1]Miegroet, Hans J. Van, and Marchi, ND. “Flemish Textile Trade and New Imagery in Colonial Mexico (1524-1646).” Painting for the Kingdoms. Ed. J Brown. Fomento Cultural BanaMex, Mexico City, 2010. 878-923.

 

Boost Your Energy

Energy at Duke

With the launch of the Duke University Energy Initiative (EI) several years ago, the Center for Data and Visualization Sciences (CDVS) has seen an increased demand for all sorts of data and information related to energy generation, distribution, and pricing.  The EI is a university-wide, interdisciplinary hub that advances an accessible, affordable, reliable, and clean energy system.  It involves researchers and students from the Pratt School of Engineering, the Nicholas School of the Environment, the Sanford School of Public Policy, the Duke School of Law, the Fuqua School of Business, and departments in the Trinity College of Arts & Sciences.

The creation of the EI included development of an Undergraduate Certificate in Energy and Environment and an Undergraduate Minor in Energy Engineering in the Pratt School.  An Energy Data Analytics PhD Student Fellows program is affiliated with the EI’s Energy Data Analytics Lab, and Duke’s Bass Connections program includes several Energy & Environment teams led by the Energy Initiative.

The EI website provides links to energy-related data sources, particularly datasets that have proven useful in Duke energy research projects. We will discuss below some more key sources for finding energy-related data.

Energy resources and potentials

The sources for locating energy data will vary depending on the type of energy and the spot on the source-to-consumption continuum that interests you.

The US Department of Energy’s (DoE’s) Energy Information Administration (EIA) has a nice outline of energy sources, with explanations of each, in their Energy Explained web pages. These include nonrenewable sources such as petroleum, gas, gas liquids, coal, and nuclear.  The EIA also discusses a number of renewable sources such as hydropower (e.g., dams, tidal, or wave action), biomass (e.g., waste or wood), biofuels (e.g., ethanol or biodiesel), wind, geothermal, and solar. Hydrogen is another fuel source discussed on these pages.

Besides renewability, you might be interested in a source’s carbon footprint. Note that some of the sources the EIA lists as renewables may be carbon creating (such as biomass or biofuels), and some non-renewables may be carbon neutral (such as nuclear).  Any type of energy source clearly has environmental implications, and the Union of Concerned Scientists has a discussion of the Environmental Impacts of Renewable Energy Technologies.

The US Geological Survey’s Energy Resources Program measures resource potentials for all types of energy sources.  The Survey is a great place to find data relating to their traditional focus of fossil fuel reserves, but also for some renewables such as geothermal.  The EIA provides access to GIS layers relating to energy, not only reserves and renewable potentials, but also infrastructure layers.

The DOE’s Office of Scientific and Technical Information (OSTI) is well known as a repository of technical reports, but it also hosts the DOE Data Explorer. This includes hidden gems like the REPLICA database (Rooftop Energy Potential of Low Income Communities in America), which has geographic granularity down to the Census Tract level.

For more on renewables, check out the NREL (National Renewable Energy Laboratory), which disseminates GIS data relating to renewable energy in the US (e.g., wind speeds, wave energy, solar potential), along with some international data. The DoE’s Open Data Catalog is also particularly strong on datasets (tabular and GIS) relating to renewables.  The data ranges from very specific studies to US nationwide data.

REexplorer, showing wind speed in Kenya

For visualizing energy-related map layers from selected non-US countries, the Renewable Energy Data Explorer (REexplorer) provides an online mapping tool. Most layers can be downloaded as GIS files. The International Renewable Energy Agency (IRENA) also has statistics on renewables. Besides downloadable data, summary visualizations can be viewed online using Tableau Dashboards.

Price and production data

The US DOE “Energy Economy” web pages will introduce you to all things relating to the economics of energy, and their EIA (mentioned above) is the main US source for fossil fuel pricing, from both the production and the retail standpoint.

Internationally, the OECD’s International Energy Agency (IEA) collects supply, demand, trade, production and consumption data, including price and tax data, relating to oil, gas, and coal, as well as renewables.  In the OECD iLibrary, go to the Statistics tab to find many detailed IEA databases as well as PDF book series such as World Energy Balances, World Energy Outlook, and World Energy Statistics. For more international data (particularly in the developing world), you might want to try Energydata.info.  This includes geospatial data and a lot on renewables, especially solar potential.

Finally, a good place to locate tabular data of all sorts is the database ProQuest Statistical Insight. It indexes publications from government agencies at all levels, IGOs and NGOs, and trade associations, usually providing the data tables or links to the data.

Infrastructure (Generation, Transportation/Distribution, and Storage)

ArcGIS Pro using EPA’s eGRID data

Besides the EIA’s GIS layers relating to energy, mentioned above, another excellent source for US energy infrastructure data is the Homeland Infrastructure Foundation-Level Data (HIFLD), which includes datasets on energy infrastructure from many government agencies. These include geospatial data layers (GIS data) for pipelines, power plants, electrical transmission and more. For US power generation, the Environmental Protection Agency has their Emissions & Generation Resource Integrated Database (eGRID).  eGRID data includes the locations of all types of US electrical power generating facilities, including fuel used, generation capacity, and detailed emissions data. For international power plant data, the World Resources Institute’s (WRI’s) Global Power Plant Database includes data on around 30,000 plants, and some of WRI’s other datasets also relate to energy topics.

Energy storage can include the obvious battery technologies, but also pumped hydroelectric systems and even more novel schemes.  The US DoE has a Global Energy Storage Database with information on “grid-connected energy storage projects and relevant state and federal policies.”

Businesses

For data or information relating to individual companies in the energy sector, as well as for more qualitative assessments of industry segments, you can begin with the library’s Company and Industry Research Guide. This leads to some of the key business sources that the Duke Libraries provide access to.

Trade Associations

Trade associations that promote the interests of companies in particular industries can provide effective leads to data, particularly when you’re having trouble locating it from government agencies and IGOs/NGOs. If they don’t provide data or much other information on their websites, be sure to contact them to see what they might be willing to share with academic researchers. Most of the associations below focus on the United States, but some are global in scope.

These are just a few of the sources and strategies for locating data on energy.  For more assistance, please contact the Center for Data and Visualization Sciences: askdata@duke.edu

R Open Labs – open hours to learn more R

New this fall…

R fun: An R Learning Series
An R workshop series by the Center for Data and Visualization Sciences.

You are invited to stop by the Edge Workshop Room on Mondays for a new Rfun program, the R Open Labs, 6-7pm, Sept. 16 through Oct. 28. No need to register, although you are encouraged to double-check the R Open Labs schedule/hours. Bring your laptop!

This is your chance to polish R skills in a comfortable and supportive setting.  If you’re a bit more advanced, come and help by demonstrating the supportive learning community that R is known for.

No prerequisites, but please bring your laptop with R/RStudio installed. No skill level is expected: beginners, intermediate users, and advanced users are all welcome. One of the great characteristics of the R community is its supportive culture. While we hope you have attended our Intro to R workshop (or watched the video, or equivalent), this is an opportunity to learn more about R and to demystify some part of R that you find confusing.

FAQ

What are Open Labs?

Open labs are semi-structured workshops designed to help you learn R. Each week brief instruction will be provided, followed by time to practice, work together, ask questions and get help. Participants can join the lab any time during the session, and are welcome to work on unrelated projects.

The Open Labs model was established by our colleagues at Columbia and adopted by UNC Chapel Hill. We’re giving this a try as well. Come help us define our direction and structure. Our goal is to connect researchers and foster a community for R users on campus.

How do I Get Started?

Attend an R Open Lab. Labs occur on Mondays, 6pm-7pm in the Edge Workshop Room in the Bostock Library. In our first meeting we will decide, as a group, which resource will guide us. We will pick one of the following resources…

  1. R for Data Science by Hadley Wickham & Garrett Grolemund (select chapters, workbook problems, and solutions)
  2. The RStudio interactive R Primers
  3. Advanced R by Hadley Wickham (select chapters and workbook problems)
  4. Or, the interactive dataquest.io learning series on R

Check our upcoming Monday schedule and feel free to RSVP.  We will meet for 6 nearly consecutive Mondays during the fall semester.

Please bring a laptop with R and RStudio installed.  If you have problems installing the software, we can assist you with installation as time allows. Since we’re just beginning with R Open Labs, we think there will also be time for one-on-one attention alongside group learning and community building.

How to install R and RStudio

If you are getting started with R and haven’t already installed anything, consider using these installation instructions.  Or simply skip the installation and use one of these free cloud environments:

Begin Working in R

We’ll start at the beginning; however, R Open Labs recommends that you attend our Intro to R workshop or watch the recorded video. Being a beginner makes you part of our target audience, so come ready to learn and ask questions. We also suggest working through materials from our other workshops, or any of the resource materials listed in the Attend an R Open Lab section (above).  But don’t let lack of experience stop you from attending.  The resources mentioned above will be the target of our learning and exploration.

Is R help available outside of Open Labs?

If you require one-on-one help with R outside of the Open Labs, in-person assistance is available from the Library’s Center for Data & Visualization Sciences, our Center’s Rfun workshops, or our walk-in consulting in the Brandaleone Data and Visualization Lab (1st Floor, Bostock Library).

 

Introducing Duke Libraries Center for Data and Visualization Sciences

As data-driven research has grown at Duke, Data and Visualization Services has received an increasing number of requests for partnerships, instruction, and consultations. These requests have deepened our relationships with researchers across campus such that we now regularly interact with researchers in all of Duke’s schools, disciplines, and interdepartmental initiatives.

In order to expand the Libraries’ commitment to partnering with researchers on data-driven research at Duke, Duke University Libraries is elevating the Data and Visualization Services department to the Center for Data and Visualization Sciences (CDVS). The change is designed to enable the new Center to:

  • Expand partnerships for research and teaching
  • Augment the ability of the department to partner on grant, development, and funding opportunities
  • Develop new opportunities for research, teaching, and collections – especially in the areas of data science, data visualization, and GIS/mapping research
  • Recognize the breadth of and demand for the Libraries’ expertise in data-driven research support
  • Enhance the role of CDVS activities within Bostock Library’s Edge Research Commons

We believe that the new Center for Data and Visualization Sciences will enable us to partner with an increasingly large and diverse range of data research interests at Duke and beyond through funded projects and co-curricular initiatives. We look forward to working with you on your next data-driven project!

Minding Your Business: Locating Company and Industry Data

The Data and Visualization Services (DVS) Department can help you locate and extract many types of data, including data about companies and industries.  These may include data on firm location, aggregated data on the general business climate and conditions, or specific company financials.  In addition to some freely available resources, Duke subscribes to a host of databases providing business data.

Directories of Business Locations

You may need to identify local outlets and single-location companies that sell a particular product or provide a particular service.  You may also need information on small businesses (e.g., sole proprietorships) and private companies, not just publicly traded corporations or contact information for a company’s headquarters.  A couple of good sources for such local data are the ReferenceUSA Businesses Database and SimplyAnalytics.

From these databases, you can extract lists of locations with geographic coordinates for plotting in GIS software, and SimplyAnalytics also lets you download data already formatted as GIS layers. Researchers often use this data when needing to associate business locations with the demographics and socio-economic characteristics of neighborhoods (e.g., is there a lack of full-service grocery stores in poor neighborhoods?).
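As a rough sketch of that kind of analysis, the Python below counts grocery stores per neighborhood from an exported list and sets the counts beside a poverty rate. Everything here is assumed for illustration: the column names (`name`, `latitude`, `longitude`, `tract`), the tract identifiers, and all of the numbers are made up, and a real export from either database will be laid out differently. It also assumes each record already carries a tract identifier; in practice you would usually assign one in GIS software from the coordinates.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for an exported business list (hypothetical columns).
EXPORT = """name,latitude,longitude,tract
Corner Grocer,36.001,-78.901,T1
Family Market,36.004,-78.905,T1
Fresh Foods,36.020,-78.930,T3
"""

# Made-up poverty rates (percent) per tract, standing in for demographic data.
POVERTY_RATE = {"T1": 8.5, "T2": 31.0, "T3": 12.2}


def stores_per_tract(export_text):
    """Count exported business records in each tract."""
    reader = csv.DictReader(io.StringIO(export_text))
    return Counter(row["tract"] for row in reader)


def tracts_without_stores(counts, demographics):
    """Return tracts present in the demographic data that have no stores."""
    return sorted(t for t in demographics if counts.get(t, 0) == 0)


counts = stores_per_tract(EXPORT)
for tract in sorted(POVERTY_RATE):
    # One row per tract: identifier, store count, poverty rate
    print(tract, counts.get(tract, 0), POVERTY_RATE[tract])
print("No stores:", tracts_without_stores(counts, POVERTY_RATE))
```

Starting from the demographic table rather than the business list matters here: tracts with no stores never appear in the export, so they only show up if you look for them explicitly.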

SimplyAnalytics

When searching these resources (or any business data source), it often helps to use an industry classification code to focus your search. Examples are the North American Industry Classification System (NAICS) and the Standard Industrial Classification (SIC) (no longer revised, but still commonly used). You can determine a code using a keyword search or drilling down through a hierarchy.
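To illustrate the hierarchy, here is a minimal Python sketch that filters records by the leading digits of a NAICS code. NAICS codes run from two digits (sector) to six digits (national industry), so a shorter prefix selects a broader group. The business names are invented, the codes are only illustrative, and the `naics` field name is an assumption rather than the layout of any particular database; check the current NAICS manual before relying on a specific code.

```python
# Names for each level of the NAICS hierarchy, keyed by number of digits.
LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}

# Invented records with a hypothetical "naics" field.
businesses = [
    {"name": "Corner Grocer", "naics": "445110"},
    {"name": "Main Street Bakery", "naics": "445291"},
    {"name": "Quick Lube", "naics": "811191"},
]


def naics_level(code):
    """Return the hierarchy level implied by the length of a NAICS code."""
    return LEVELS[len(code)]


def filter_by_naics(records, prefix):
    """Keep records whose NAICS code falls under the given prefix."""
    return [r for r in records if r["naics"].startswith(prefix)]


# A three-digit prefix selects a whole subsector.
food_stores = filter_by_naics(businesses, "445")
print(naics_level("445"), [r["name"] for r in food_stores])
```

Note that a few sectors span a range of two-digit codes (Retail Trade is 44-45, for example), so a sector-level search may need more than one prefix.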

Aggregated Business and Marketing Data

Government surveys ask questions of businesses or samples of businesses. The data are aggregated by industry, location, size of company, and other criteria, and typically include information on the characteristics of each industry, such as employment, wages, and productivity.

Sample Government Resources

Macroeconomic indicators relate to the overall business climate, and a good source for macro data is Global Financial Data. Its data series includes many stock exchange and bond indexes from around the world.

Private firms also collect market research data through sample surveys. These are often from a consumer perspective, for instance to help gauge demand for specific products and services. Be aware that the numbers for small geographies (e.g., Census Tracts or Block Groups) are typically imputed from small nationwide samples, based on correlations with demographic and socioeconomic indicators. Examples of resources with such data are SimplyAnalytics (with data from EASI and Simmons) and Statista (mostly national-level data).

Firm-Level Data

You may be interested in comparing numbers between companies, ranking them based on certain indicators, or gathering time-series data on a company to follow changes over time.  Always be aware of whether the company is a publicly traded corporation or is privately held, as the data sources and availability of information may vary.

For firm-level financial detail, public corporations traded in the US are required to submit data to the U.S. Securities and Exchange Commission (SEC).

SEC’s EDGAR Service

The SEC’s EDGAR service is the source of the corporate financials repackaged by commercial data providers, and you might find additional context and narrative analysis with products such as Mergent Online, Thomson One, or S&P Global NetAdvantage.  The Bloomberg Professional Service in the DVS computer lab contains a vast amount of data, news, and analysis on firms and economic conditions worldwide. You can find many more sources for firm- and industry-specific data in the library’s guide on Company and Industry Research, and of course at the Ford Library at the Fuqua School of Business.

All of these sources provide tabular download options.

For help finding any sort of business or industry data, don’t hesitate to contact us at askdata@duke.edu.

Where can I find data (or statistics) on ___________?

Helping Duke students, staff and faculty to locate data is something that we in Data and Visualization Services often do.  In this blog post I will walk you through a sample search and share some tips that I use when I search for data and statistics.

“Hi there, I am looking for motorcycle registration numbers and sales volumes by age and sex for the United States.”

BREAKING DOWN THE QUESTION:

  • There are two types of data needed: motorcycle registration data and motorcycle sales data.
  • There are two criteria that the data should be differentiated by: owner’s age and owner’s gender.
  • There is a geographic component: United States.

One criterion that is not given is time.  When a time frame isn’t provided, I assume that what is needed is the most current data available.  Keep in mind that “current” data will often still be a year or more old, because it takes time for data to be gathered, cleaned, and published.

***Pro-tip: When you are looking for data consider who/what/when and where – adding in those components makes it easier to construct your search.***

WHERE AND HOW DO I SEARCH?

If I do not immediately have a source in mind (and sometimes even if I do, just to hit all the bases) I will use Google and structure my search as follows: motorcycle sales and registration by age and gender united states.
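The way the who/what/when/where components combine into a search string can be sketched in a few lines of Python. The function name `build_query` and its parameters are hypothetical and purely for illustration; the point is only that each component you can name adds a term to the search.

```python
def build_query(what, who=None, where=None, when=None):
    """Assemble a search string from the components of a data question."""
    parts = [what]
    if who:
        # "Who" is the breakdown requested, e.g. by age and gender
        parts.append("by " + who)
    if where:
        parts.append(where)
    if when:
        # Left out when no time frame is given
        parts.append(when)
    return " ".join(parts)


query = build_query(
    "motorcycle sales and registration",
    who="age and gender",
    where="united states",
)
print(query)
```

Leaving `when` empty mirrors the sample question, which gives no time frame; adding a year narrows the search once you know which period you need.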

***Pro-tip: You can use Google (or the search engine of your choice) to search across both the resources we subscribe to and the open Web, but you will need to be connected via a Duke IP address.***

EVALUATING RESULTS

One of the first results returned is from a database we subscribe to called Statista. This source gives me the number of motorcycle owners by age in 2018, which answers part of the question, but does not include sales information or a gender breakdown.

Another top result is a report on Motorcycle Trends in the United States from the Bureau of Transportation Statistics (BTS). Unfortunately, the report is from 2009 and the data cited in it are from 2003-2007.  A search of the BTS site does not yield anything more current. However, when I check the source list at the bottom of the report, there are several sources listed that I will check directly once I’ve finished looking through my search results.

***Pro-tip: Always look for sources of data in reports and figures, even if the data are old. Heading to the source can often yield more current information.***

A third result that looks promising is from a motorcycling magazine: Motorcycle Statistics in America: Demographics Change for 2018. The article reports on statistics from the 2018 owner surveys conducted by the Motorcycle Industry Council (which is one of the sources that the Bureau of Transportation report  listed). This article provides the percent of males and females that own motorcycles as well as the median age of motorcycle owners.  While this is pretty close to the data needed, it is worthwhile to look into the Motorcycle Industry Council. Experience has taught me, however, that industry data typically is neither open nor freely available.

CHECKING THE COMMON SOURCE

When I go to the Motorcycle Industry Council (MIC) Web site I find that they do, indeed, have a statistical report that comes out every year which gives a comprehensive overview of the motorcycle industry.  If you are not a member, you can buy a copy of the report, but it is expensive (nearly $500).

***Pro-tip: Always check the original source even if you anticipate that there may be a paywall – it’s a good idea to evaluate all sources to ensure that they are credible and authoritative.***

MAKING A DECISION

In this instance, I would ultimately advise the person to use the statistics reported in the article Motorcycle Statistics in America: Demographics Change for 2018. Secondary sources aren’t ideal, and can sometimes be complicated to cite, but when you can’t get access to the primary source and that primary source is the authority, it is your best bet.

***Pro-tip: If you are using a secondary source, name the original source in the text, for example: “Data from the 2018 Motorcycle Industry Council Owner Survey (as cited by Ultimate Motorcycling, 2019).” Then include a citation to the secondary source in your reference list, formatted according to the style you are using.***

PARTING THOUGHTS

In closing, the data you want might not always be the data you use, whether because the data are proprietary or restricted, or because they simply don’t exist or don’t exist in a form you need and are able to use.  When this happens, take a moment to think about your research question and determine whether you have the time and the resources needed to continue pursuing your question as it stands (purchasing, requesting, applying for, or collecting your own data), or whether you need to broaden or change your focus to incorporate the resources you do find in a meaningful way.