Visualizations graphically represent information. As I related in connection with my image gallery, "See Connections with Visualization," "Building on the foundation of basic charting, data graphics and dashboard displays, the growing palate of visualization makes analyses more accessible and understandable to a general business audience -- as well as to seasoned BI professionals." Yet visualizations can be no better than the information and analyses that feed them. Let's use Wordle, an excellent "toy," to explore the trade-off between easy-to-use, attractive visualization for the masses and true, deep sense-making. Factoring in semantic considerations discussed over 50 years ago, it is unfortunately clear that Wordle images miss meaning.
Let's revisit an example from my recent InformationWeek Analytics report, "Seeing Connections: Visualization Makes Sense of Data," where I used a Wordle visualization of President Obama's inaugural address as an illustration. Per wordle.net, "Wordle is a toy for generating 'word clouds' from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text." Wordle counts the occurrences of individual words in the text you supply, excluding common "stop words." It creates appealing graphical renderings of the text -- well, really of a "bag of words" reduction of the text -- from this frequency count.
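The frequency count described above can be sketched in a few lines of Python. This is only an illustration of the bag-of-words idea, not Wordle's actual code: the `word_frequencies` function and the tiny `STOP_WORDS` set are my own assumptions, and the sample text is the wordle.net sentence quoted above.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; Wordle's own list is not reproduced here).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is",
              "that", "it", "for", "on", "as", "by", "with"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word occurrences, skipping stop words.

    Word order, grammar, and word forms are all discarded: this is the
    'bag of words' reduction.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

sample = ("The clouds give greater prominence to words that appear "
          "more frequently in the source text.")

# Print the most frequent words first, as a word cloud would size them.
for word, count in word_frequencies(sample).most_common(5):
    print(word, count)
```

Everything a word cloud shows is derived from a table like this one, so whatever the count misses, the picture misses too.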
I applied Wordle to the text of a September 8, 1957 New York Times article, "Chemistry Is Employed in a Search for New Methods to Conquer Mental Illness." Why that article? Because it's featured in a 1958 IBM Journal article by Hans Peter Luhn, "The Automatic Creation of Literature Abstracts," one of several 1950s Luhn papers that laid the foundations for automated text analysis and business intelligence.
(Luhn's and other historical IBM Journal articles, formerly freely accessible, are unfortunately again behind a paywall.)
Wordle images are pretty, but they may give a deceptive picture of word significance. The occurrence-counting approach is simply too simplistic. Author Jonathan Feinberg writes in the Wordle FAQ that Wordle doesn't stem words, that doing so would be beyond his goals, and that there are other linguistic deficiencies. Wordle is a nice effort despite the shortcomings, but as it stands, it pretty much implements the very shallow approach that Luhn presented in his 1958 paper:
"The justification of measuring word significance by use frequency is based on the fact that a writer normally repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject. This means of emphasis is taken as an indicator of significance...
"This rather unsophisticated argument on "significance" avoids such linguistic implications as grammar and syntax. In general, the method does not even propose to differentiate between word forms. Thus the variants differ, differentiate, different, differently, difference and differential could ordinarily be considered identical notions and regarded as the same word. No attention is paid to the logical and semantic relationships the author has established. In other words, an inventory is taken and a word list compiled in descending order of frequency."
What of those "logical and semantic relationships"?
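A brief aside first, on the word forms Luhn lists. A counter that treats every spelling as a separate word sees six words occurring once each, whereas merging the variants yields a single notion occurring six times. The sketch below uses a deliberately crude, hypothetical `crude_stem` helper that simply keeps the first six letters. That truncation is my assumption for illustration only, not the procedure Luhn's paper describes, and a real stemmer such as the Porter stemmer is considerably more careful.

```python
from collections import Counter

def crude_stem(word, prefix_len=6):
    """Hypothetical, deliberately crude stemmer: keep the first prefix_len letters."""
    return word.lower()[:prefix_len]

# The variants Luhn lists in the passage quoted above.
variants = ["differ", "differentiate", "different",
            "differently", "difference", "differential"]

exact = Counter(variants)                           # every form counted separately
merged = Counter(crude_stem(w) for w in variants)   # forms merged by shared prefix

print(exact.most_common(1))
print(merged.most_common(1))
```

Blind truncation will also merge unrelated words that happen to share a prefix, which is one reason real stemming is harder than it looks.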
Luhn's paper includes an image of the 1957 New York Times article, "Chemistry Is Employed in a Search for New Methods to Conquer Mental Illness." Most glaringly, the Wordle reduction does not include the names of the four authors whose paper the Times reported on. Three of those authors, brothers, shared the same surname, Sackler, but the three-fold repetition of that name (as a word) didn't boost the frequency high enough on the list for Wordle to have judged it significant enough to render. A robust tool would have made the connection. (For the third time: I like Wordle, but even the author doesn't pretend it's more than a "toy.") How's that?
Here is an article image with annotations that I have added. Five pink ovals are annotated "anaphora / coreference" to indicate that 1) "four New York City physicians," 2) those physicians by name, and 3) at least six occurrences of the pronoun "they" (some unannotated by me) all refer to the same persons. Resolution of "the logical and semantic relationships the author has established," per Luhn, would have detected the in-article significance of the four physicians, who would then have been prominently featured in the word cloud.
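To make the effect of coreference concrete, here is a toy tally, assuming the mentions have already been linked by hand. The `mentions` list and the `PHYSICIANS` label are hypothetical: the counts follow the description above (one descriptive phrase, three uses of the surname Sackler, six uses of "they") and are not extracted from the article text. The code does no resolution of its own. It only shows how the totals change once a resolver, or a human annotator, has done that work.

```python
from collections import Counter

# Hypothetical hand-labeled mentions: (surface text, entity referred to).
# Counts mirror the description in the text; a real system would need a
# coreference resolver to produce these links automatically.
mentions = (
    [("four New York City physicians", "PHYSICIANS")]
    + [("Sackler", "PHYSICIANS")] * 3
    + [("they", "PHYSICIANS")] * 6
)

surface_counts = Counter(text for text, _ in mentions)      # what word counting sees
entity_counts = Counter(entity for _, entity in mentions)   # what resolution reveals

print(surface_counts)
print(entity_counts)
```

Counted by surface form, the physicians are scattered across a name, a phrase, and a pronoun that any stop-word list would discard. Counted by entity, they account for all ten mentions.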
What of those physicians? One of the three was Arthur M. Sackler. When I read the article, that name struck me as familiar. I realized why: I have visited the Arthur M. Sackler Gallery, part of Washington DC's Smithsonian Institution. I was on the 'L' in Chicago, heading to give a talk on text analytics to the Chicago chapter of the American Statistical Association, when I made the connection. And it so happens that I walked by the central Chicago Public Library branch on my way to the meeting.
The CPL has 10,745,608 volumes in its collection, so I did what any sensible researcher would do: I took advantage of the library's free, public WiFi and pulled up the Wikipedia page on Sackler, which states Dr. Sackler earned his fortune from the rights to import and sell Valium in the United States. It was one and the same Arthur M. Sackler. Now you know, and now you see a connection, just a small, chance relationship, but one that neatly illustrates the sometimes-wide chasm between pretty graphics and knowledge.