Lexalytics Analyzes Wikipedia To Understand How Humans Think

Concepts extracted from the community-created encyclopedia can be used to improve analysis of documents and sentiment in social media.

David F Carr, Editor, InformationWeek Government/Healthcare

April 14, 2011


Academics may frown on citations from Wikipedia because of its social media origins, but for the text mining and sentiment analysis firm Lexalytics, the sprawling community-created encyclopedia was the perfect reference for teaching software how to understand the world.

At its user conference this week in New York, Lexalytics announced that the Salience 5.0 release of its software, due out this summer, will be better able to understand concepts and relationships between concepts, thanks to a close reading of the entire content of Wikipedia. Because of the open source nature of the Web encyclopedia, Lexalytics was able to index it freely. A footnote to the press release cautions that no endorsement by the Wikimedia Foundation is implied.

"Wikipedia represents a very, very large corpus of information, and, importantly, it's human edited--which means it shows the way humans think about information," CEO Jeff Catlin said. "We used it as a source for how people think about the organization of information and for perspective on how bits of information are related to each other."

Lexalytics is best known for technology that produces automated summaries of documents, as well as sentiment analysis capabilities that can be used for social media monitoring. Catlin said his firm's technology is used "behind the scenes" by companies like Radian6 (recently acquired by Salesforce.com) and also licensed directly by some websites, such as TripAdvisor. But the core Lexalytics technology is general purpose--like a search engine that can be adapted to search specialized types of content.

The "concept matrix" Lexalytics created on the basis of its Wikipedia analysis may factor into improved sentiment analysis, but it's broader than that, Catlin said. In some ways, this was more similar to the work that went into creating IBM's computerized Jeopardy champ, Watson, which also had to be fed large volumes of news articles and reference sources. One thing the Watson team had in its favor was that answering trivia questions is a very specific task, focused on the kind of "myopic detail" that computers are good at handling. "So if they can figure out the question, there is a good chance they are going to have the right answer," he said.

Just as the process of building Watson's knowledge base started long before Alex Trebek stepped on stage, the compilation of the Lexalytics concept matrix was a distributed computing analytics job run across many servers--many of them procured through Amazon's cloud services. "We basically did boil the ocean, so this required a lot of hardware behind the scenes and a lot of Amazon computing time," he said. But by the end of the process his team had boiled it down to a summary of concepts that fits on a laptop or a modestly sized server.

The result is a piece of computer software that "understands that a rose and a daisy are both flowers, which up until now has been a really tough model," Catlin said. "If someone writes that a device runs for three days without a recharge, the system can figure out that 'runs for three days without a recharge' is a battery event," even though the word "battery" was never mentioned. Using this technology, a marketing application could read through hundreds of news articles about a company to see how many of the key messages from its latest press release made their way into that coverage--even though each of the news writers used different words and phrases to tell the story.
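To make the idea concrete, here is a minimal Python sketch of concept-level matching. It is not Lexalytics' method or the Salience API: the `CONCEPT_PARENTS` table, the function names, and the word lists are all hypothetical and hand-written for illustration, whereas the real concept matrix was derived from Wikipedia at scale.

```python
# Hypothetical, hand-written concept table (illustration only; the real
# concept matrix was built from Wikipedia, not typed in by hand).
CONCEPT_PARENTS = {
    "rose": "flower",
    "daisy": "flower",
    "recharge": "battery",
    "charge": "battery",
    "battery": "battery",
}


def concepts_in(text):
    """Return the set of parent concepts triggered by words in the text."""
    words = {w.strip(".,!?'\"").lower() for w in text.split()}
    return {CONCEPT_PARENTS[w] for w in words if w in CONCEPT_PARENTS}


def share_concept(a, b):
    """True if the two texts trigger at least one common parent concept."""
    return bool(concepts_in(a) & concepts_in(b))


if __name__ == "__main__":
    print(concepts_in("The device runs for three days without a recharge."))
    print(share_concept("She bought a rose", "He picked a daisy"))
```

A lookup table like this only catches words it already lists. The appeal of a corpus-derived matrix, as Catlin describes it, is that such relationships come from how people actually organize information rather than from a manually curated list.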

Sentiment analysis is a relatively mature branch of text analytics, but automated systems still get confused by things like sarcasm and double meanings. One improvement Lexalytics is making in this upcoming release of its software is a filter for subjective versus objective understanding, or direct versus second-hand knowledge. For example, "I heard that movie was great"--a comment from someone who hasn't actually seen the movie--could be scored differently from "That movie was great!" even though both are positive sentiments.
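The direct versus second-hand distinction can be sketched in a few lines of Python. This is an assumption-laden toy, not the filter Lexalytics ships: the hearsay phrases, the positive word list, the `score` function, and the 0.5 discount are all invented here for illustration.

```python
import re

# Hypothetical cue phrases that mark a comment as second-hand knowledge.
HEARSAY = re.compile(
    r"^\s*(i heard|i was told|they say|people say|apparently)\b",
    re.IGNORECASE,
)

# Hypothetical, tiny positive lexicon.
POSITIVE_WORDS = {"great", "good", "excellent"}


def score(comment, hearsay_weight=0.5):
    """Return (sentiment score, is_hearsay) for a single comment.

    The base score is 1.0 if any positive word appears, else 0.0.
    Second-hand comments are discounted by hearsay_weight.
    """
    words = set(re.findall(r"[a-z']+", comment.lower()))
    base = 1.0 if words & POSITIVE_WORDS else 0.0
    is_hearsay = bool(HEARSAY.search(comment))
    return base * (hearsay_weight if is_hearsay else 1.0), is_hearsay


if __name__ == "__main__":
    print(score("I heard that movie was great"))
    print(score("That movie was great!"))
```

Both comments register as positive, but the second-hand one carries less weight, which is the kind of difference in scoring the article describes.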

The Wikipedia concept matrix "is just one more piece we're using to try to crank up the accuracy of these things, and it's wonderful because it's so good for general knowledge and gives us a broad and relatively deep look at the world," Catlin said.

About the Author(s)

David F Carr

Editor, InformationWeek Government/Healthcare

David F. Carr oversees InformationWeek's coverage of government and healthcare IT. He previously led coverage of social business and education technologies and continues to contribute in those areas. He is the editor of Social Collaboration for Dummies (Wiley, Oct. 2013) and was the social business track chair for UBM's E2 conference in 2012 and 2013. He is a frequent speaker and panel moderator at industry events. David is a former Technology Editor of Baseline Magazine and Internet World magazine and has freelanced for publications including CIO Magazine, CIO Insight, and Defense Systems. He has also worked as a web consultant and is the author of several WordPress plugins, including Facebook Tab Manager and RSVPMaker. David works from a home office in Coral Springs, Florida. Contact him at [email protected] and follow him at @davidfcarr.
