Where There’s Smoke …

A team of Duke undergraduates participating in the Global Health Capstone course was awarded the “Outstanding Capstone Research Project” for their examination of state and congressional district characteristics that might influence the outcome of legislative efforts to raise cigarette excise taxes in North Carolina, South Carolina, and Mississippi.  Sarah Chapin and Gregory Morrison used GIS mapping tools in the Library’s Data & GIS Services Department to illuminate the relationships between county demographics and state legislators’ votes for or against cigarette tax hikes. Brian Clement, Alexa Monroy, and Katherine Roemer were other members of the research group.  Congratulations!

Regional Focus
The recent cigarette excise tax increases in Mississippi (2009), North Carolina (2009), and South Carolina (2010) served as case studies from which to draw components of successful strategies to develop a regional legislative toolkit for those wishing to increase cigarette excise taxes in the Southeast. In all of these states, the tax increase was controversial. The Southeast in general is tax averse, which presents a systemic challenge to those who advocate raising taxes on cigarettes.

Senate Votes & Poverty by County

The researchers examined state characteristics that might influence the outcome of efforts to raise excise taxes, such as coalitions for and against proposed increases, the facts each side brought to bear and the nature of the discourse mobilized by different groups, the economic impact in each state of both smoking and the proposed excise taxes, and local political realities. The students restricted the area of interest to the Southeast because this region has a shared history and, consequently, similar challenges when it comes to race, poverty, and rural populations. The states in the region are also, broadly speaking, politically similar and have had similar experiences with both tobacco use and government regulation.

This multi-disciplinary analysis provides a reference point for state legislators or interest groups wishing to pass cigarette tax increases.  The deliverable provided a model of past voting trends, suggestions for framing political dimensions of the issue, and strategies to overcome opposition in state legislatures.

Comparing Legislative Districts and County Data
Senate Votes & Party Affiliation

The bulk of the research involved mapping the political landscape surrounding cigarette tax legislation. In doing so, researchers looked at voting records, interest group politics, campaigns, and state ideology. Broadly, the research entailed charting the electoral geography by overlaying state house and senate districts with county-level data. Districts were coded based on voting history, party affiliation, smoking rates, and constituent demographics. State legislature websites were used to find representatives’ voting histories, allowing the researchers to match legislators by county when constructing a GIS dataset. County party affiliations are available through the state board of elections. Finally, county demographics came from the 2010 Census data.
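The step of matching legislators to counties and then summarizing votes can be illustrated with a short Python sketch. It is not the students’ actual workflow: the county names, senators, votes, and rates below are invented placeholders, and the field names are assumptions chosen for the example.

```python
# Hypothetical vote records, one per state senator, keyed by home county.
votes = [
    {"county": "Alpha", "senator": "Senator A", "party": "D", "vote": "yes"},
    {"county": "Beta", "senator": "Senator B", "party": "R", "vote": "no"},
    {"county": "Gamma", "senator": "Senator C", "party": "R", "vote": "yes"},
]

# Hypothetical county attributes (placeholder values, not Census figures).
demographics = {
    "Alpha": {"poverty_rate": 0.21, "smoking_rate": 0.24},
    "Beta": {"poverty_rate": 0.12, "smoking_rate": 0.19},
}


def join_votes_to_counties(vote_rows, county_attrs):
    """Attach county attributes to each vote record.

    Rows whose county has no attributes are kept, with None values,
    so that gaps in the county data stay visible.
    """
    joined = []
    for row in vote_rows:
        attrs = county_attrs.get(row["county"])
        merged = dict(row)
        merged["poverty_rate"] = attrs["poverty_rate"] if attrs else None
        merged["smoking_rate"] = attrs["smoking_rate"] if attrs else None
        joined.append(merged)
    return joined


def yes_share_by_party(rows):
    """Return the share of 'yes' votes within each party."""
    totals, yeses = {}, {}
    for row in rows:
        party = row["party"]
        totals[party] = totals.get(party, 0) + 1
        if row["vote"] == "yes":
            yeses[party] = yeses.get(party, 0) + 1
    return {party: yeses.get(party, 0) / totals[party] for party in totals}


joined = join_votes_to_counties(votes, demographics)
shares = yes_share_by_party(joined)
```

In a GIS package the same join would normally be done on a shared county identifier in the attribute table, which is what lets vote outcomes be shaded on top of county demographics.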

Senate Votes & Percent Black by County

Overcoming Ideology
Besides using GIS mapping to illustrate these relationships, the researchers analyzed lobbying expenditures and campaign contributions to map the involvement of both pro- and anti-tobacco interest groups. Additionally, they examined the impact of state ideology on the framing of political dimensions, looking at editorials, opinion pieces, newspapers, and committee markups, as well as interviews (both previous interviews and ones they conducted) with state legislators and interest groups. Overcoming state ideology, both political and social, is a major factor in passing cigarette excise tax legislation, especially in a region with such dominant tobacco influence.

Again, the purpose of the research is not merely to understand the political landscapes surrounding the passage of cigarette tax bills, but to apply these findings to the creation of a legislative toolbox for representatives or interest groups concerned with pushing similar legislation.

Swimming in a Sea of Data

This post comes from Erika Kociolek, a second year Master in Environmental Management student at the Nicholas School.  The Data and GIS staff want to congratulate Erika on successfully defending her project!

For about 4 months, I’ve been swimming in a proverbial sea of data related to hypoxia (low dissolved oxygen concentrations) and landings in the Gulf of Mexico brown shrimp fishery.  I’m a second year master of environmental management (MEM) student at the Nicholas School, focusing on Environmental Economics and Policy.  I’ve been working with my advisor, Dr. Lori Bennear, to complete my master’s project (MP), an analysis attempting to estimate the effect of hypoxia  on landings and other economic outcomes of interest.

To do this, we are using data from the Southeast Monitoring and Assessment Program (SEAMAP), NOAA/NMFS, and a database of laws and policies related to brown shrimp that I compiled in Fall 2010.  By running regressions that difference out all variation in catch except for that attributable to hypoxia, we can isolate its effect on economic outcomes of interest.  I’ve found that catch, revenue, catch per unit effort, and revenue per unit effort are all larger in the presence of summer hypoxia.  However, if we look at catch for different sizes of shrimp, we see that in the presence of summer hypoxia, catch of larger shrimp decreases and catch of smaller shrimp increases significantly.
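One common way to “difference out” unrelated variation is a within-group (fixed effects) transformation: subtract each zone’s own average from its observations, then relate what is left of catch to what is left of the hypoxia indicator. The post does not give the exact specification, so the Python sketch below is only an illustration under that assumption, run on invented numbers rather than SEAMAP or NMFS data.

```python
from collections import defaultdict

# Invented panel: (zone, year, hypoxic indicator, catch). Not real data.
panel = [
    ("zone1", 2001, 0, 10.0),
    ("zone1", 2002, 1, 13.0),
    ("zone2", 2001, 0, 20.0),
    ("zone2", 2002, 1, 23.0),
    ("zone3", 2001, 0, 5.0),
    ("zone3", 2002, 0, 5.0),
]


def demean_by_zone(rows):
    """Subtract each zone's mean from its hypoxia indicator and its catch."""
    groups = defaultdict(list)
    for zone, _year, hypoxic, catch in rows:
        groups[zone].append((hypoxic, catch))
    demeaned = []
    for observations in groups.values():
        mean_h = sum(h for h, _ in observations) / len(observations)
        mean_c = sum(c for _, c in observations) / len(observations)
        demeaned.extend((h - mean_h, c - mean_c) for h, c in observations)
    return demeaned


def within_estimate(rows):
    """Least-squares slope of demeaned catch on demeaned hypoxia."""
    demeaned = demean_by_zone(rows)
    numerator = sum(h * c for h, c in demeaned)
    denominator = sum(h * h for h, _ in demeaned)
    return numerator / denominator


effect = within_estimate(panel)  # 3.0 for the invented panel above
```

In this toy panel, zone1 and zone2 have very different catch levels, but that difference drops out after demeaning. Zone3 never changes status, so it contributes nothing to the estimate.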

Getting to the point of discussing results has required a lot of data analysis, cleaning, management, and visualization. I have used R, Stata, ArcGIS, and even video editing software to make dynamic graphics representing my results that have improved my own understanding of the raw data. As an example, the video below, showing the change in hypoxia over time (1997-2004), was created using ArcGIS 10.

Note: The maps in the video above use data from the Southeast Monitoring and Assessment Program (SEAMAP).

Hypoxia is a dynamic and complex phenomenon, varying in severity, over time, and in space; hypoxia in Gulf waters is more severe and widespread in summer. The model I’m using actually takes advantage of this variation to obtain an estimate of the effect of hypoxia on catch and other economic outcomes. To show people the source of variation I’m exploiting, I created this video. The maps display dissolved oxygen concentration data spatially.

We have dissolved oxygen measurements for most of the Gulf in the summer (June) and fall (December).  Each subarea-depth zone (see related map) that changes from salmon shading (not hypoxic) to red (hypoxic), or vice-versa, is variation in hypoxia that the models I’m running use to get an estimate of the hypothesized effect.
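The idea of zones switching between “not hypoxic” and “hypoxic” can be made concrete with a small Python sketch. The readings are invented, and the 2 mg/L cutoff is an assumption (a commonly used definition of hypoxia); the post does not state which threshold the project used.

```python
HYPOXIA_THRESHOLD_MG_L = 2.0  # assumed cutoff; not stated in the post

# Invented dissolved oxygen readings (mg/L) by zone and year.
readings = {
    "zone_a": {1997: 4.1, 1998: 1.6, 1999: 3.8},
    "zone_b": {1997: 5.0, 1998: 4.7, 1999: 4.9},
    "zone_c": {1997: 1.2, 1998: 1.9, 1999: 1.5},
}


def is_hypoxic(do_mg_l, threshold=HYPOXIA_THRESHOLD_MG_L):
    """Classify a reading as hypoxic when it falls below the threshold."""
    return do_mg_l < threshold


def count_status_changes(series):
    """Count year-to-year switches between hypoxic and not hypoxic."""
    statuses = [is_hypoxic(series[year]) for year in sorted(series)]
    return sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)


changes = {zone: count_status_changes(series) for zone, series in readings.items()}

# Zones that switch status at least once supply the variation described above.
switching_zones = sorted(zone for zone, n in changes.items() if n > 0)
```

Here only zone_a switches status. zone_b is never hypoxic and zone_c always is, so neither would change color in the video.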

Many thanks are due to my advisor, Dr. Bennear, as well as to the helpful folks at the Data/GIS lab, who have provided invaluable assistance with the data management and data visualization components of this project!

This research was funded by NOAA’s National Center for Coastal Ocean Science, Award #NA09NOS4780235.

Surveying Our Researchers

Understanding library users’ research goals remains a key element of the Perkins Library’s Strategic Plan.  As part of the Library’s User Studies Initiative, Teddy Gray surveyed the Biology Department in the Fall of 2010 to discover what tools and resources departmental members use in their research, researchers’ data management needs, and the impact of the BES Library closing in 2009.

From the 18 interviews of faculty, graduate students, postdocs, and lab managers, we learned–not surprisingly–that nearly all the interviewees use data in their research, most of which they generate themselves. Half incorporate data from others into their work with nearly a third using sequence data from GenBank. Out of the 12 interviewees who generate data in their labs, two-thirds archive their data in existing repositories.

In addition to the interviews, this survey also examined research articles produced by Duke biologists in 2009, paying special attention to their methods sections and citation patterns. From analyzing departmental research articles, we found that nearly 40% of the authors deposited their research data into either GenBank or a journal archive. Only one author deposited data into another existing scientific repository. Again, nearly 40% of the authors used a general statistical package in their work (SAS and R being the most popular), while nearly half used a biology-specific statistical tool.

Almost everyone interviewed uses statistical tools in their research with over half now using R. Many also use biology-specific statistical programs.

All but one of the interviewees prefer the online versions of library material over the print. A third use image databases–primarily Google Images–in their teaching and presentations; however, only one interviewee knew of subject specific image databases such as the Biology Image Library. And while some interviewees missed the convenience of easy shelf browsing with the BES Library so close by, all are happy with the daily document delivery to the building.

We are grateful to the Biology Department for their support (and time) in conducting this survey and plan to use the results as the basis for library services. Data and GIS Services is always interested in hearing more from Duke researchers about the nature of your research! Please let us know if you would like to discuss your research interests and/or library needs.

Wrangle, Refine, and Represent

Data visualization and data management represented the core themes of the 2011 Computer Assisted Reporting (CAR) Conference that met in Raleigh from February 24-27.  Bringing together journalists, computer scientists, and faculty, the conference united a number of communities that share a common interest in gathering and representing empirical evidence online (and in print).

While the conference featured luminaries in data visualization (Amanda Cox, David Huynh, Michal Migurski, Martin Wattenberg) who gave sage advice on how to best represent data online, web-based data visualization tools provided a central focus for the conference.

Notable tools that may be of interest to the Duke research (and teaching) community include:

DataWrangler – An interactive data cleaning tool much like Google Refine (see below)

Google Fusion Tables – “manage large collections of tabular data in the cloud” – Fusion Tables provides convenient access to Google’s data visualization and mapping services. The service also allows groups to annotate data online.

Google Refine – Refine is primarily a data cleaning tool that simplifies the process of preparing data for further processing or analysis. While users of existing data management tools may not be convinced to switch, Refine provides a rich suite of features that will likely attract many new converts.

Many Eyes – One of the premier online visualization tools, hosted by IBM. Visualizations range from pie charts to digital maps to text analysis. Many Eyes’ versatility is one of its key strengths.

Polymaps – Billed as a “javascript library for image- and vector-tiled maps” – Polymaps allows the creation of custom lightweight map services on the web.

SIMILE Project (Semantic Interoperability of Metadata and Information in unLike Environments) – The SIMILE Project is a collection of different research projects designed to “enhance inter-operability” among digital assets.  At the conference, the Exhibit Project received particular attention for its ability to produce data rich visualization with very little coding required.

Timeflow – Presented by Sarah Cohen and designed by Martin Wattenberg – Timeflow provides a convenient application for visualizing temporal data.

What’s hot in molecular biology databases

The journal Nucleic Acids Research has just published its 18th annual database issue. The current issue summarizes 96 new and 83 previously reviewed molecular biology databases, including GenBank, ENA, DDBJ, and GEO. Also included in the issue is an editorial advocating the creation of a “community-defined, uniform, generic description of the core attributes of biological databases,” which would be known as the BioDBCore checklist. Such a checklist would benefit both database users and providers: users would have a much easier time finding the appropriate resource, and providers would be able to highlight specialized resources and the lesser-known functionality of established databases.

Besides the databases reviewed in the current issue, Nucleic Acids Research maintains a select list of 1330 molecular biology databases that have been profiled in various database issues over the past 18 years.

SimplyMap! – Census and business data made easier

Online mapping and data access have become even easier with the launch of SimplyMap 2.0. A longtime favorite of Economics and Public Policy courses (and faculty) at Duke, this web-based mapping and data extraction application provides a straightforward interface that lets users create thematic maps and reports using US census, business, and marketing data.

SimplyMap 2.0 map interface

Version 2.0 includes improvements designed to make it easier to find and analyze data and create professional looking GIS-style thematic maps.

Significant changes include:

  • A new multi-tab interface to allow you to easily switch between your projects
  • Interactive wizards to guide you through making maps and reports
  • An option to automatically select the geographic unit displayed on a map based on the zoom level
  • Easier searching and browsing to choose data variables
  • Assign keyword tags to organize your maps and reports
  • Share your work with other users of SimplyMap (send a URL that lets them open a copy of your map or report)
  • Data filters (greater than, less than, etc.) can now be applied to both maps and reports
  • More export options: Data: Excel, DBF, CSV;  Maps: GIF, PDF, Shapefiles (boundaries only, no attributes)
  • Faster performance

Give SimplyMap 2.0 a try and let us know what you think.  Support is always available in Perkins Data and GIS.

Policy Paradox: Mapping Residential Restrictions

Do residential restrictions placed on convicted sex offenders serve to protect the public? Duke Economics Ph.D. candidate Songman Kang has been using the analytical capabilities of geographic information software to help determine the extent to which the restrictions affect residential locations of sex offenders: computing the area covered by a restriction and determining which offenders had to relocate due to a restriction.
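The two computations mentioned here, the area covered by a restriction and the set of residences that fall inside it, can be sketched in plain Python. This is only an illustration of the general buffer idea, not Mr. Kang’s method: the coordinates, the 300-meter radius, and the study-area bounds are all hypothetical, and a GIS package would use real parcel and street geometry instead of a sampling grid.

```python
import math

RESTRICTION_RADIUS = 300.0  # hypothetical buffer distance, in meters

# Invented planar coordinates in meters.
protected_sites = [(0.0, 0.0), (1000.0, 0.0)]  # e.g. schools or parks
residences = {"r1": (100.0, 100.0), "r2": (500.0, 0.0), "r3": (1000.0, 250.0)}


def within_restriction(point, sites, radius=RESTRICTION_RADIUS):
    """True if the point lies within `radius` of any protected site."""
    return any(math.dist(point, site) <= radius for site in sites)


def must_relocate(homes, sites, radius=RESTRICTION_RADIUS):
    """Names of residences that fall inside a restricted zone."""
    return sorted(
        name for name, point in homes.items()
        if within_restriction(point, sites, radius)
    )


def covered_share(sites, bounds, radius=RESTRICTION_RADIUS, step=10.0):
    """Approximate the share of a rectangular study area inside any buffer.

    The rectangle is split into square cells of side `step`, and a cell
    counts as covered when its center is inside a restricted zone.
    """
    xmin, ymin, xmax, ymax = bounds
    nx = int((xmax - xmin) / step)
    ny = int((ymax - ymin) / step)
    inside = 0
    for i in range(nx):
        for j in range(ny):
            center = (xmin + (i + 0.5) * step, ymin + (j + 0.5) * step)
            if within_restriction(center, sites, radius):
                inside += 1
    return inside / (nx * ny)


relocating = must_relocate(residences, protected_sites)
share = covered_share(protected_sites, bounds=(-500.0, -500.0, 1500.0, 500.0))
```

Because overlapping buffers are counted once, the grid approach avoids double-counting area where restricted zones intersect.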

According to Kang, the residential restrictions are designed to reduce recidivism among sex offenders and prevent their presence near places where children regularly congregate.  Neither of these claims has been found consistent with empirical evidence though, and it is unclear whether the restrictions have been successful in reducing the rates of repeat sex offenses.  On the other hand, the restrictions severely limit residential location choices, and may force offenders to relocate away from employment opportunities and supportive networks of family and friends.  As a result of the deteriorated economic conditions, the offenders who had to relocate may become more likely to commit non-sex offenses.

The following maps illustrate some of the restricted zones in Miami and in the Triangle area of North Carolina studied by Mr. Kang.

Figure 1: Residential Restricted Zones in Miami

Figure 2: Triangle Restricted Residences

Rolling with R in 2011

Interest in the open source statistical package R has grown over the last few years as researchers discover its powerful graphic capabilities, a suite of packages that extend its functionality, and its data import capabilities.  While several courses use R to teach introductory statistics, most researchers arrive at R with some statistical experience.  The following selected resources represent a growing number of books and websites designed to help orient users to the capabilities of R.

Quick-R Homepage
This website tries to provide a quick overview of basic data management and statistical capabilities of R for current SAS, SPSS, Stata, and Systat users.  The stress is on providing a brief overview of R commands for common data analysis needs.

R for Stata Users
A comprehensive guide for getting started in R using Stata as point of reference.

R for SAS and SPSS Users (not pictured)
Similar concept as R for Stata users.

Using R for Introductory Statistics
John Verzani’s Using R for Introductory Statistics is one of several introductions to using R for basic statistics. Examples are available as an R package.

Do you have other sources that you like for R?  Let us know in the comments.

Making Data Flow

As water quality and questions of water supply have grown more salient in the Triangle, Duke researchers have tried to contribute to the growing debate over water quality using the latest digital mapping (GIS) tools.  In the fall of 2009, Data and GIS Services in Perkins Library provided GIS analysis support for a stream and watershed assessment project that developed strategies to reverse the impact of poor urban stormwater management, degraded water quality, and the loss of natural habitats on the Duke campus.

Data/GIS helped the researchers access critical spatial data for the characterization of the contributing watershed’s current land use patterns.  This data enabled the students to analyze the watershed’s area of impervious surface and hydrologic flow paths, and helped inform the understanding of the water quality issues faced at the stream site.
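Both quantities mentioned here, impervious surface area and flow paths, can be illustrated on a toy raster. The Python sketch below uses invented land cover and elevation grids, an assumed list of impervious classes, and a D8-style steepest-descent rule, a common simplification. The post does not say which method or data the students used.

```python
# Assumed set of land cover classes treated as impervious.
IMPERVIOUS_CODES = {"road", "roof", "parking"}

# Invented 4x4 rasters: land cover classes and elevation in meters.
land_cover = [
    ["forest", "forest", "road", "roof"],
    ["forest", "lawn", "road", "parking"],
    ["lawn", "lawn", "road", "lawn"],
    ["stream", "lawn", "lawn", "forest"],
]

elevation = [
    [9.0, 8.0, 7.0, 8.0],
    [8.0, 6.0, 5.0, 7.0],
    [6.0, 4.0, 3.0, 6.0],
    [1.0, 2.0, 3.5, 5.0],
]


def impervious_fraction(grid):
    """Share of cells whose land cover class is in IMPERVIOUS_CODES."""
    cells = [cell for row in grid for cell in row]
    return sum(cell in IMPERVIOUS_CODES for cell in cells) / len(cells)


def steepest_neighbor(elev, r, c):
    """Return the neighbor with the steepest downhill slope, or None.

    Slope is the elevation drop divided by the distance between cell
    centers (1 for edge neighbors, sqrt(2) for diagonal neighbors).
    """
    best, best_slope = None, 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(elev) and 0 <= nc < len(elev[0]):
                distance = (dr * dr + dc * dc) ** 0.5
                slope = (elev[r][c] - elev[nr][nc]) / distance
                if slope > best_slope:
                    best, best_slope = (nr, nc), slope
    return best


def flow_path(elev, start):
    """Follow steepest descent from `start` until no neighbor is lower."""
    path = [start]
    while True:
        nxt = steepest_neighbor(elev, *path[-1])
        if nxt is None:
            return path
        path.append(nxt)


fraction = impervious_fraction(land_cover)
path = flow_path(elevation, (0, 3))
```

In this toy grid, runoff starting on the roof cell at row 0, column 3 crosses two road cells and a lawn cell before reaching the stream cell in the lower-left corner.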

The GIS map below illustrates how digital mapping tools can be used to summarize a large amount of complex data into a compelling presentation.

Special thanks to the interdisciplinary team of environmental and civil engineers, biology and environmental science majors, and a Nicholas MEM student who shared their project results: Alicia Burtner, Matt Ball, Nari Sohn, Avni Patel, Will Bierbower, Adam Nathan, Mike Schallmo, Justine Jackson-Ricketts, and Jai Singh.


Welcome to the Perkins Data and GIS blog! Our goal is to highlight Duke research, collections, policies, and tools surrounding empirical data and digital maps of interest to the research community. We hope that this blog will serve as a catalyst to link researchers and resources across the Duke community and beyond!