Big Data Terminology Mess Needs Cleanup - InformationWeek


09:06 AM
Mark E. Johnson

Big Data Terminology Mess Needs Cleanup

Big data needs a coherent and unified vocabulary of terms -- or we can't share solutions to problems across disciplines.

As the big data trend continues to grow due to positive media buzz, the ongoing proliferation of data generation and collection, and financial success stories, a nasty pest nibbles like a sand gnat at the practitioners. This pest is big data terminology -- or rather, the lack of a coherent and unified vocabulary of terms used in the big data arena.

The essential problem: Many concepts are floating around, and different folks understand these concepts according to their own technical backgrounds, disciplines, and work environments. Although one can presumably operate within one's own isolated framework, the terminology issue inhibits sharing of methodologies or recognizing that solutions may exist elsewhere. The terminology mess affects both business and academia.

As one example, consider the two technical areas of neural networks and nonlinear regression. Both statisticians and computer scientists/engineers are heavy users of these tools. The following pairs of terms are equivalent:

Statistical Term        Neural Network Term
coefficient             weight
observation             exemplar
parameter estimation    training
steepest descent        back-propagation
intercept               bias term
derived predictor       hidden node
penalty function        weight decay

The mathematical formulations in nonlinear regression and neural networks are essentially equivalent, but the terminology is entirely different. (The correspondence above is found in Applied Linear Statistical Models, 2005.) The translation table provides some hope for navigating the statistical and computer science literatures on this methodology, but working away from one's home discipline can still be uncomfortable or unproductive.
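To make the correspondence concrete, here is a minimal sketch in Python of one small model described in both vocabularies at once. It is an illustration, not a formulation taken from the textbook cited above: the tanh activation, the synthetic data, the starting values, the step size, and every function name (make_data, predict, penalized_loss, fit) are assumptions chosen for the example.

```python
import math
import random

# Statistical term -> neural network term, as listed in the table above.
STAT_TO_NN = {
    "coefficient": "weight",
    "observation": "exemplar",
    "parameter estimation": "training",
    "steepest descent": "back-propagation",
    "intercept": "bias term",
    "derived predictor": "hidden node",
    "penalty function": "weight decay",
}

# Illustrative starting values for (b0, b1, a0, a1); an assumption.
START = (0.0, 0.5, 0.0, 0.5)


def make_data(n=40, seed=0):
    """Synthetic observations (exemplars); the generating curve is made up."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        y = 1.0 + 2.0 * math.tanh(0.5 + 1.5 * x) + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def predict(params, x):
    """Nonlinear regression mean function, i.e. a one-hidden-node network."""
    b0, b1, a0, a1 = params
    derived = math.tanh(a0 + a1 * x)  # derived predictor / hidden node
    return b0 + b1 * derived          # b0 is the intercept / bias term


def penalized_loss(params, data, lam):
    """Mean squared error plus a penalty function (weight decay)."""
    _, b1, _, a1 = params
    mse = sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)
    return mse + lam * (b1 ** 2 + a1 ** 2)


def fit(data, lam=1e-3, step=0.05, iterations=2000):
    """Parameter estimation by steepest descent / training by back-propagation."""
    b0, b1, a0, a1 = START
    n = len(data)
    for _ in range(iterations):
        g_b0 = g_b1 = g_a0 = g_a1 = 0.0
        for x, y in data:
            h = math.tanh(a0 + a1 * x)
            resid = (b0 + b1 * h) - y
            dh = 1.0 - h * h  # derivative of tanh at the hidden node
            g_b0 += 2.0 * resid / n
            g_b1 += 2.0 * resid * h / n
            g_a0 += 2.0 * resid * b1 * dh / n
            g_a1 += 2.0 * resid * b1 * dh * x / n
        # Gradient of the penalty term.
        g_b1 += 2.0 * lam * b1
        g_a1 += 2.0 * lam * a1
        b0 -= step * g_b0
        b1 -= step * g_b1
        a0 -= step * g_a0
        a1 -= step * g_a1
    return (b0, b1, a0, a1)


if __name__ == "__main__":
    data = make_data()
    fitted = fit(data)
    print("loss before:", penalized_loss(START, data, 1e-3))
    print("loss after: ", penalized_loss(fitted, data, 1e-3))
    for stat_term, nn_term in STAT_TO_NN.items():
        print(f"{stat_term} = {nn_term}")
```

A statistician would read fit() as estimating four coefficients by minimizing a penalized least-squares criterion; a neural network practitioner would read the same loop as training four weights with weight decay. The code does not change between the two readings, only the words.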

These issues do not reside only with these two academic departments. Specialty areas such as machine learning, artificial intelligence, Bayesian reasoning, graphical models, probabilistic networks, and pattern recognition all have their own flavors of terms, with some overlap and some contradictions.

Collecting the relevant terms, figuring out the underlying concepts, diagramming their relationships (subordinate and superordinate), and ultimately arriving at coherent and technically correct definitions is a monumental task.

This terminology situation in big data is precisely like the one that occurs in international standards, so there is the possibility of learning from the standards community. The International Organization for Standardization (ISO) produces standards that provide requirements, specifications, guidelines, or characteristics that can be used consistently to ensure that materials, products, processes, and services are fit for their purpose. Each technical committee of ISO in turn has a subcommittee (SC1) that deals with terminology and definitions.

Thus, there are mechanisms in place to deal with the big data terminology mess. ISO TC69 (Applications of Statistical Methods), under the lead of its terminology subcommittee (I am Chair of TC69/SC1), has for the past 15 years produced core terminology documents on general statistical terms and terms used in probability (ISO 3534-1), applied statistics (ISO 3534-2), design of experiments (ISO 3534-3), and survey sampling (ISO 3534-4).

The process used to develop these documents could be used to develop a coherent vocabulary system for predictive analytics. (I'm using this more scientific phrase rather than big data terminology since predictive analytics has a superior je ne sais quoi -- we are dealing with international standards here!)

The gloomy state of affairs noted at the outset of this piece could be addressed under the aegis of TC69/SC1. The group is currently preparing for a new work item on predictive analytics. The subcommittee has experts from numerous countries but could benefit from additional expert participants from the US. I encourage those interested to contact the American Society for Quality for possible inclusion in the process. The technical work requires the consent and support of the expert's employer. As a volunteer effort (at least from the US participation side), the process is a bit glacial, but it could eventually reach an international consensus on a terminology document.

Ultimately, the efforts could also lead to technical standards on select methodologies in big data analysis -- but first and foremost, the pesky terminology problem needs to be tackled.

Which frequently misused big data terms bug you? Tell us in the comments section.


Alex Kane Rudansky
11/18/2013 | 10:24:56 AM
I've seen the same issue arise in healthcare. Doctors have different terms for the same illnesses and medications, making big data analytics a headache. For example: Hypertension and high blood pressure. Same illness, different name. Until a standardized nomenclature is put in place, it will be hard to accurately and effectively mine electronic health record data.
11/18/2013 | 9:46:28 AM
Big Data Terminology Mess
Vocabulary will be a big deal as data scientists try to communicate with people on the business side and even with people in other IT disciplines. As we just went through a large dev project here, I heard myself say several times, "I think we're just not speaking the same language." We each knew what we wanted, but the lexicon was completely different. You have an opportunity at this time, big data gurus, to set the tone in vocabulary. Which terms should we ban early?