Big Data Terminology Mess Needs Cleanup

Big data needs a coherent and unified vocabulary of terms -- or we can't share solutions to problems across disciplines.

Mark E. Johnson, Contributor

November 18, 2013

As the big data trend continues to grow due to positive media buzz, the ongoing proliferation of data generation and collection, and financial success stories, a nasty pest nibbles like a sand gnat at the practitioners. This pest is big data terminology -- or rather, the lack of a coherent and unified vocabulary of terms used in the big data arena.

The essential problem: Many concepts are floating around, and different folks understand these concepts according to their own technical backgrounds, disciplines, and work environments. Although one can presumably operate within one's own isolated framework, the terminology issue inhibits sharing of methodologies or recognizing that solutions may exist elsewhere. The terminology mess affects both business and academia.

As one example, consider the two technical areas of neural networks and nonlinear regression. Statisticians and computer scientists/engineers are both heavy users of these tools, yet each group names the same ideas differently. The following pairs of terms are equivalent:

Statistical Term        Neural Network Term
coefficient             weight
observation             exemplar
parameter estimation    training
steepest descent        back-propagation
intercept               bias term
derived predictor       hidden node
penalty function        weight decay

The mathematical formulations in nonlinear regression and neural networks are essentially equivalent, but the terminology is entirely different (the above correspondence is found in Applied Linear Statistical Models, 2005). The translation table above offers some hope for navigating the statistical and computer science literatures on this methodology, but working outside one's home discipline can still be uncomfortable or unproductive.
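To make the correspondence concrete, here is a minimal Python sketch (not taken from the cited text) of a single model that can be read in either vocabulary. The function names, the tanh activation, the starting values, and the step size are all illustrative assumptions. The comments give the statistical term first and the neural network term second.

```python
import math
import random


def init_params(n_hidden, seed=0):
    """Starting values for (b0, b, a0, a); the ranges are arbitrary assumptions."""
    rng = random.Random(seed)
    draw = lambda: [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return 0.0, draw(), draw(), draw()


def predict(x, params):
    """Mean function of a nonlinear regression / output of a one-hidden-layer network."""
    b0, b, a0, a = params  # b0, a0: intercepts / bias terms; b, a: coefficients / weights
    # Each tanh value is a derived predictor / hidden node.
    hidden = [math.tanh(a0j + aj * x) for a0j, aj in zip(a0, a)]
    return b0 + sum(bj * hj for bj, hj in zip(b, hidden)), hidden


def penalized_sse(data, params, lam):
    """Sum of squared errors plus a penalty function / weight decay term."""
    _, b, _, a = params
    sse = sum((predict(x, params)[0] - y) ** 2 for x, y in data)
    return sse + lam * sum(w * w for w in b + a)


def fit(data, params, lam=1e-3, step=0.02, n_iter=2000):
    """Parameter estimation / training by steepest descent on penalized_sse."""
    b0, b, a0, a = params[0], list(params[1]), list(params[2]), list(params[3])
    k, n = len(b), len(data)
    for _ in range(n_iter):
        g_b0, g_b, g_a0, g_a = 0.0, [0.0] * k, [0.0] * k, [0.0] * k
        for x, y in data:  # each (x, y) pair is an observation / exemplar
            y_hat, hidden = predict(x, (b0, b, a0, a))
            r = y_hat - y
            g_b0 += 2 * r
            for j in range(k):
                g_b[j] += 2 * r * hidden[j]
                # Chain rule through tanh: the back-propagated error.
                d = 2 * r * b[j] * (1 - hidden[j] ** 2)
                g_a0[j] += d
                g_a[j] += d * x
        for j in range(k):  # gradient of the penalty / weight decay
            g_b[j] += 2 * lam * b[j]
            g_a[j] += 2 * lam * a[j]
        b0 -= step * g_b0 / n
        for j in range(k):
            b[j] -= step * g_b[j] / n
            a0[j] -= step * g_a0[j] / n
            a[j] -= step * g_a[j] / n
    return b0, b, a0, a
```

Calling `fit(data, init_params(3))` on a list of `(x, y)` pairs returns the estimated coefficients, or equivalently the trained weights. Nothing in the arithmetic changes between the two readings; only the labels do.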

These issues do not reside only with these two academic departments. Specialty areas such as machine learning, artificial intelligence, Bayesian reasoning, graphical models, probabilistic networks, and pattern recognition all have their own flavors of terms, with some overlap and some contradictions.

Collecting the relevant terms, figuring out the underlying concepts, diagramming their relationships (subordinate and superordinate), and ultimately arriving at coherent and technically correct definitions is a monumental task.
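As a rough illustration of what such a concept system might look like in machine-readable form, the Python sketch below stores each concept once, attaches the term each discipline uses for it, and records its superordinate concept. The class, the field names, the small hierarchy, and the placeholder definitions are hypothetical; they are not drawn from any ISO document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    definition: str                                       # placeholder wording only
    terms: Dict[str, str] = field(default_factory=dict)   # discipline -> term in use
    broader: Optional[str] = None                         # key of the superordinate concept


# Hypothetical three-concept system built from two rows of the table above.
CONCEPTS = {
    "model_parameter": Concept("placeholder: quantity fixed by fitting a model"),
    "coefficient": Concept(
        "placeholder: multiplier applied to a predictor",
        {"statistics": "coefficient", "neural networks": "weight"},
        broader="model_parameter",
    ),
    "intercept": Concept(
        "placeholder: constant added to the weighted sum",
        {"statistics": "intercept", "neural networks": "bias term"},
        broader="model_parameter",
    ),
}


def translate(term: str, source: str, target: str) -> Optional[str]:
    """Return the target discipline's term for the concept that `term` names."""
    for concept in CONCEPTS.values():
        if concept.terms.get(source) == term:
            return concept.terms.get(target)
    return None


def narrower(key: str) -> List[str]:
    """List the subordinate concepts recorded directly under `key`."""
    return sorted(k for k, c in CONCEPTS.items() if c.broader == key)
```

Keying the structure on concepts rather than on terms is the design choice that matters here: two disciplines can disagree on a word while still pointing at the same entry.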

This terminology situation in big data closely resembles the one that occurs in international standards, so there is the possibility of learning from the standards community. The International Organization for Standardization (ISO) produces standards that provide requirements, specifications, guidelines, or characteristics that can be used consistently to ensure that materials, products, processes, and services are fit for their purpose. Each technical committee of ISO in turn has a subcommittee (SC1) that deals with terminology and definitions.

Thus, there are mechanisms in place to deal with the big data terminology mess. ISO TC69 (Applications of Statistical Methods), under the lead of its terminology subcommittee (I am Chair of TC69/SC1), has for the past 15 years produced core terminology documents on general statistical terms and terms used in probability (ISO 3534-1), applied statistics (ISO 3534-2), design of experiments (ISO 3534-3), and survey sampling (ISO 3534-4).

The process used to develop these documents could be used to develop a coherent vocabulary system for predictive analytics. (I'm using this more scientific phrase rather than big data terminology since predictive analytics has a superior je ne sais quoi -- we are dealing with international standards here!)

The gloomy state of affairs noted at the outset of this piece could be addressed under the aegis of TC69/SC1. The group is currently preparing for a new work item on predictive analytics. The subcommittee has experts from numerous countries but could benefit from additional expert participants from the US. I encourage those interested to contact the American Society for Quality for possible inclusion in the process. The technical work requires the consent and support of the expert's employer. As a volunteer effort (at least on the US side), the process is a bit glacial, but it could eventually reach an international consensus on a terminology document.

Ultimately, the efforts could also lead to technical standards on select methodologies in big data analysis -- but first and foremost, the pesky terminology problem needs to be tackled.

About the Author(s)

Dr. Mark E. Johnson is Professor of Statistics at the University of Central Florida in Orlando. He is a Fellow of the American Statistical Association, an elected member of the International Statistical Institute, and a Chartered Statistician with the Royal Statistical Society. He is the author of Multivariate Statistical Simulation (Wiley Applied Probability and Statistics Series) and has published in such journals as Bulletin of the American Meteorological Society, Technometrics, Biometrics, JASA, The American Statistician, J. of Statistical Planning and Inference, and Risk Analysis. Mark does extensive consulting in the area of catastrophic risks (especially hurricanes) and regularly is retained as an expert witness in legal cases.
