Add Derived Data To Your DBMS Strategy - InformationWeek

Commentary
12/14/2010
09:22 AM
Curt Monash

Add Derived Data To Your DBMS Strategy

Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.

Text analytics requires a lot of processing per document. You need to tokenize (among other things, identify the boundaries of) the words, sentences and paragraphs; identify the words' meaning; map out the grammar; resolve references such as pronouns; and often do more besides (e.g. sentiment analysis).

There is a double-digit number of steps to all that, many of them expensive. No way are you going to redo the whole process each time you run a query. (Not coincidentally, MarkLogic -- which does a huge fraction of its business in text-oriented uses -- thinks heavily in terms of the enhancement and augmentation of data.)
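To make the pattern concrete, here is a minimal sketch of "enrich once, query many times." It is not how MarkLogic or any particular product works. The `enrich` function is a deliberately crude stand-in for a real text-analytics pipeline, and the table and function names (`doc_stats`, `doc_tokens`, `build_store`, `docs_mentioning`) are hypothetical, chosen only for illustration.

```python
import re
import sqlite3


def enrich(text):
    """Crude stand-in for an expensive text-analytics pipeline (assumption)."""
    # Split into sentences after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Lowercased word tokens; a real tokenizer would do far more.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "sentence_count": len(sentences),
        "token_count": len(tokens),
        "tokens": tokens,
    }


def build_store(docs):
    """Run the enrichment once per document and store the derived data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_stats ("
        "doc_id TEXT PRIMARY KEY, sentence_count INTEGER, token_count INTEGER)"
    )
    conn.execute("CREATE TABLE doc_tokens (doc_id TEXT, token TEXT)")
    for doc_id, text in docs.items():
        derived = enrich(text)
        conn.execute(
            "INSERT INTO doc_stats VALUES (?, ?, ?)",
            (doc_id, derived["sentence_count"], derived["token_count"]),
        )
        conn.executemany(
            "INSERT INTO doc_tokens VALUES (?, ?)",
            [(doc_id, token) for token in derived["tokens"]],
        )
    conn.commit()
    return conn


def docs_mentioning(conn, word):
    """Query step: reads only stored derived data, never re-parses the text."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM doc_tokens WHERE token = ? ORDER BY doc_id",
        (word.lower(),),
    )
    return [row[0] for row in rows]
```

The query function touches only the derived tables, so the cost of the analysis is paid at load time rather than on every request.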

If you look through a list of actual Hadoop or other MapReduce use cases, you'll see that a lot of them boil down to "crunch data in a big batch job to get it ready for further processing." Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can also happen to time series or traditional relational tables. (See, for example, the use cases in two recent Aster Data slide decks.) Generally, those are not processes that you want to try to run in real time.

Scientists have a massive need to adjust or "cook" data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:

Log processing, not unlike that done in various commercial sectors.

Assigning data to different kinds or densities of coordinate grids -- "regridding" -- often through a process of interpolation/approximation/estimation.

Adjusting/normalizing data for all kinds of effects (such as weather cycles).

Examples where data adjustment is needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.
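As one small illustration of the regridding mentioned above, the sketch below moves samples from one one-dimensional grid to another by linear interpolation. Real scientific regridding is usually multidimensional and often uses more sophisticated schemes; the function name `regrid_linear` and its parameters are hypothetical.

```python
from bisect import bisect_right


def regrid_linear(src_x, src_y, dst_x):
    """Linearly interpolate samples (src_x, src_y) onto the points in dst_x.

    Assumes src_x is strictly increasing with at least two points, and that
    every destination point lies within [src_x[0], src_x[-1]].
    """
    if len(src_x) < 2 or len(src_x) != len(src_y):
        raise ValueError("need at least two source points with matching values")
    regridded = []
    for x in dst_x:
        if not src_x[0] <= x <= src_x[-1]:
            raise ValueError(f"{x} is outside the source grid")
        # Index of the right-hand neighbor, clamped so the last point works.
        i = min(bisect_right(src_x, x), len(src_x) - 1)
        x0, x1 = src_x[i - 1], src_x[i]
        y0, y1 = src_y[i - 1], src_y[i]
        regridded.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return regridded


# Coarse 10-unit grid resampled onto a 5-unit grid.
coarse_x = [0.0, 10.0, 20.0]
coarse_y = [0.0, 100.0, 50.0]
fine_x = [0.0, 5.0, 10.0, 15.0, 20.0]
fine_y = regrid_linear(coarse_x, coarse_y, fine_x)
```

In the spirit of the article, `fine_y` is derived data: once computed it would typically be stored alongside the raw samples rather than re-interpolated on every read.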

Similar issues arise in marketing applications, even beyond the kind of straightforward, predictive-analytics-based scoring and psychographic/clustering results one might expect.

For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when, in fact, you're a 32-year-old male with 400 Facebook friends who are mostly in your age range. Let's say you tend to look at Web sites about cars, poker, and video games and have a propensity to click on ads featuring scantily clad females.

Increasingly, analytic systems presented with this scenario would be smart enough to treat you as somebody other than your grandmother. But those too are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad serving response times.
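The split between advance analysis and fast serving can be sketched as follows. This is a toy, not a description of any real ad platform: the inference rule (trust the friends' median age when it is far from the declared age), the 20-year threshold, and all names (`infer_age`, `build_profiles`, `pick_ad`) are assumptions made for illustration.

```python
from statistics import median


def infer_age(declared_age, friend_ages, max_gap=20):
    """Toy rule (assumption): if the declared age is far from the friends'
    median age, believe the friends instead of the registration form."""
    if not friend_ages:
        return declared_age
    typical = median(friend_ages)
    return typical if abs(declared_age - typical) > max_gap else declared_age


def build_profiles(users):
    """Batch step, run in advance: derive one compact profile per user."""
    profiles = {}
    for user_id, user in users.items():
        profiles[user_id] = {
            "age": infer_age(user["declared_age"], user["friend_ages"]),
            "interests": sorted(set(user["sites"])),
        }
    return profiles


def pick_ad(profiles, user_id, default="generic"):
    """Serving step: a single lookup against stored derived data."""
    profile = profiles.get(user_id)
    if profile is None or not profile["interests"]:
        return default
    return profile["interests"][0]


users = {
    "u1": {
        "declared_age": 90,
        "friend_ages": [30, 31, 33, 34, 35],
        "sites": ["poker", "cars", "video games", "cars"],
    }
}
profiles = build_profiles(users)
```

All the expensive reasoning lives in `build_profiles`; `pick_ad` does no analysis at request time, which is what makes sub-second responses plausible.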

Curt Monash runs Monash Research, which provides strategic advice to users and vendors of advanced information technology. He also writes the blogs DBMS 2, Text Technologies, and Strategic Messaging. Write him at [email protected]
