Search Analysis: What We’ve Learned

“Stop searching for records - just flip a lever.” Ad*Access, Item R0712.

We’re taking a user-centered approach to planning the new Digital Collections web interface so that the design meets the needs and expectations of the people who use it. One way to discover those needs is to analyze our web traffic and try to decipher user intent when people search and browse materials on our site. This data contains valuable patterns that can help us optimize the site’s utility and performance by supporting actual information-seeking behaviors. Lou Rosenfeld recently wrote a terrific blog post about this “bottom-up analysis” on A List Apart.

Using aggregated data from Google Analytics, we studied searches performed on our site between May 1st and November 1st of this year. Duke Digital Collections was searched approximately 131,000 times during this six-month period; that’s an average of 717 searches per day. After entering a search query, the average user spent about three minutes on the site and viewed nearly four pages. Visitors also adjusted their searches with keyword refinement 26% of the time.

Only three percent of these unique searches were entered from the homepage. Eight percent (a whopping 11,121 unique searches) were entered directly from the Ad*Access portal page, while other popular start pages included the Historic American Sheet Music collection (5%), the Duke Libraries homepage (5%), and the Emergence of Advertising in America collection (3%). Search engines and referring sites are responsible for the majority (81.8%) of traffic to Duke Digital Collections, which helps explain why so few searches begin at the homepage. Links from search engine results and from social media services like StumbleUpon (19.2% of all referrals), Digg (1.9%), Facebook (1.3%), and Twitter (0.9%) often lead users directly to item pages, bypassing our portals and homepage entirely.

Over 62,000 distinct searches were conducted, though we’ll focus on the top 500. The most frequent search, “beauty,” was entered 643 times; by comparison, the #500 search, “clean,” was entered 24 times. The bulk of keyword searches in our system (421 of the top 500, or 84.2%) were single-term queries. These queries were largely topical and exploratory in nature, allowing the user to browse through various results. In other cases, users entered entire phrases or names of persons when they had a more specific subject in mind. This was especially true for searches conducted in Historic American Sheet Music, where users often looked for specific titles of scores or names of composers.
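The single-term share can be recomputed from any ranked list of queries. The following is a minimal sketch in Python, not the analysis we actually ran: the function names and the sample queries are hypothetical, and treating a “term” as a whitespace-delimited token is an assumption.

```python
def is_single_term(query: str) -> bool:
    """Return True when the query has exactly one whitespace-delimited term."""
    return len(query.split()) == 1


def single_term_share(queries: list[str]) -> float:
    """Fraction of queries made up of a single term (0.0 for an empty list)."""
    if not queries:
        return 0.0
    return sum(is_single_term(q) for q in queries) / len(queries)


# Hypothetical sample; the real input would be the top 500 queries.
sample = ["beauty", "clean", "coca cola", "sheet music", "1920"]
print(f"{single_term_share(sample):.1%}")  # 3 of 5 -> 60.0%

# The figure reported above: 421 single-term queries among the top 500.
print(f"{421 / 500:.1%}")  # 84.2%
```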

Many searches (37 of the top 500) were for years, whether a specific year or an entire decade. Top examples include: 1920 (312), 1950 (147), 1920s (138), 1911 (99), 1850 (88), 1930 (88), 1920’s (87).
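Queries like these are easy to pick out with a pattern match. The sketch below shows one way to do it in Python; the pattern and the function name are our own illustration rather than anything in the current system, and labeling a bare “1920” as a single year instead of a decade is an assumption about user intent.

```python
import re

# Four digits, optionally followed by "s" or "'s" (straight or curly apostrophe).
YEAR_PATTERN = re.compile(r"^(\d{4})(['\u2019]?s)?$")


def classify_year_query(query: str):
    """Return ("year", 1920), ("decade", 1920), or None if the query is not a year."""
    match = YEAR_PATTERN.match(query.strip())
    if match is None:
        return None
    kind = "decade" if match.group(2) else "year"
    return kind, int(match.group(1))


for q in ["1920", "1920s", "1920\u2019s", "beauty"]:
    print(q, classify_year_query(q))
```

Grouping the “year” and “decade” results by their four-digit value would fold variants such as 1920, 1920s, and 1920’s into a single count.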

Other users search for items of a particular format (18 of the top 500 queries).  Examples:  music (187), advertising (127), book (76), poster (74), ad (75), cookbooks (68), sheet music (65), advertisements (58).

Some users search for entire collections by name–e.g. gamble (68), Gamble (50), adviews (44)–though the system currently doesn’t adequately support that function, since it only indexes and returns matching items.

And finally, some users appear to want a way to see everything that can possibly come back in search results, using queries like “a” (71), “all” (41), and the “*” wild card (32); our system does not currently perform this kind of retrieval.

Using the Many Eyes toolkit, we have created three data visualizations of the most frequent search queries: one each for two of our most popular collections, Ad*Access and Historic American Sheet Music, and one for the website at large. The visualizations make it easy to scan the search terms at a glance and compare their relative frequency.
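For readers who want to build a similar frequency table from their own logs, here is a minimal sketch in Python. It is not the process behind the visualizations; the function name, the `fold_case` option, and the sample log are all hypothetical.

```python
from collections import Counter


def query_frequencies(queries, fold_case=False):
    """Count how often each query occurs, most frequent first."""
    normalized = (q.strip() for q in queries)
    if fold_case:
        # Treat "Gamble" and "gamble" as the same query.
        normalized = (q.casefold() for q in normalized)
    return Counter(normalized).most_common()


log = ["gamble", "Gamble", "beauty", "beauty", "gamble"]
print(query_frequencies(log))                  # [('gamble', 2), ('beauty', 2), ('Gamble', 1)]
print(query_frequencies(log, fold_case=True))  # [('gamble', 3), ('beauty', 2)]
```

Folding case changes the ranking whenever a query appears in more than one capitalization, so it is worth deciding up front which view of the data is wanted.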

A couple of caveats about the data that will help to qualify and clarify:

  • Searches performed within the AdViews collection—launched July 21st—are mostly not reflected in this data.  Up until Oct 21st, the AdViews site used a different method of searching (relying on MIT’s SIMILE EXHIBIT code that searched on-the-fly after each keystroke without generating a new URL).
  • Links to search results (canned searches) count as searches.  Going to page 2 of search results counts as doing a new search, too, as does toggling list/grid view.
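The second caveat could be reduced with session-level data, which the aggregated reports we used do not provide. The sketch below, in Python, assumes hits are available as (session ID, URL) pairs and that pagination and view mode are carried in query parameters named `page` and `view`; those names, the `/search` path, and both function names are hypothetical. It does not address canned searches, which look the same as typed ones.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical parameter names; the real site's query string may differ.
IGNORED_PARAMS = {"page", "view"}


def search_key(url: str) -> str:
    """Reduce a search URL to a key that ignores pagination and list/grid view."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query)
        if name not in IGNORED_PARAMS
    )
    return f"{parts.path}?{urlencode(kept)}"


def count_searches(hits) -> int:
    """Count searches in (session_id, url) hits, collapsing consecutive repeats per session."""
    last_key = {}
    total = 0
    for session_id, url in hits:
        key = search_key(url)
        if last_key.get(session_id) != key:
            total += 1
            last_key[session_id] = key
    return total


hits = [
    ("s1", "/search?q=beauty"),
    ("s1", "/search?q=beauty&page=2"),     # paging: same search
    ("s1", "/search?q=beauty&view=grid"),  # view toggle: same search
    ("s2", "/search?q=beauty"),            # different visitor: new search
    ("s1", "/search?q=soap"),              # new query: new search
]
print(count_searches(hits))  # 3
```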

Top 500 Duke Digital Collections Search Terms

Top 100 Historic American Sheet Music Search Terms

Top 50 Ad*Access Search Terms

3 thoughts on “Search Analysis: What We’ve Learned”

  1. An impressive collection of search data! Although for your “many eyes” aggregations I’m curious why you didn’t make the terms case-insensitive so that “beauty” and “Beauty” wouldn’t be treated separately. Ideally they’d be combined as they are essentially the same search.

    Analyzing the terms to find “alike” searches (e.g. combining “television” and “tv”) could also be interesting. So interesting, in fact, that I’ll do it real quick for those top 50 Ad*Access terms. It turns out the ordering becomes significantly different:

    beauty | 302
    1920 | 199
    television | 196
    coca cola | 143
    radio | 132
    beauty and hygiene | 109
    women | 104
    car | 94
    soap | 78
    food | 69
    cigarettes | 64
    1950 | 55
    transportation | 52
    cosmetics | 49
    beer | 44
    fashion | 43
    war | 35
    1911 | 33
    ivory | 32
    woman | 31
    world war II | 28
    propaganda | 28
    perfume | 28
    kotex | 28
    tobacco | 26
    toothpaste | 24
    nike | 24
    music | 24
    hygiene | 24
    coffee | 24
    makeup | 23
    smoking | 22
    shampoo | 22
    health | 22
    men | 21
    keep your beauty on duty | 20
    1940 | 20
    pepsi | 19
    hair | 19
    ford | 19
    alcohol | 18

  2. Very astute observation, Stephen. Thanks for compiling this view of the data. Accounting for variations on the year (“1920,” “1920s,” “1920’s”) jumps out as a significant difference. I wonder what is a more typical intention for a query of “1920”: is it finding items from that particular year or that entire decade? Given the round numbers on most of the year queries entered, I’d infer the latter.

  3. Thank you for this blog. I love browsing digital collections and am planning on (eventually) working in this field, so getting to peek behind the scenes of the collections is very helpful.
    Thanks again
