Add Derived Data To Your DBMS Strategy

Do you have a plan for managing more than just raw data? These five kinds of data can change the demands on your database management system.

Curt Monash, Contributor

December 14, 2010

3 Min Read

Text analytics requires a lot of processing per document. You need to tokenize the text (among other things, identify the boundaries of words, sentences, and paragraphs); identify the words' meanings; map out the grammar; resolve references such as pronouns; and often do more besides (e.g., sentiment analysis).

There are a double-digit number of steps in all that, many of them computationally expensive. No way are you going to redo the whole process every time you run a query. (Not coincidentally, MarkLogic -- which does a huge fraction of its business in text-oriented uses -- thinks heavily in terms of the enhancement and augmentation of data.)
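As a concrete illustration of that "enrich once, query many times" idea, here is a minimal Python sketch. It is not MarkLogic's actual approach; the annotate() step is a deliberately toy stand-in for a real text-analytics pipeline, and SQLite stands in for whatever DBMS you actually use. The point is simply that the derived annotations get stored next to the raw documents, so queries never rerun the expensive processing:

```python
import json
import re
import sqlite3

def annotate(doc_text):
    """Toy stand-in for a real text-analytics pipeline: tokenize and
    compute a crude sentiment score. In practice this is the expensive,
    many-step part you only want to run once per document."""
    tokens = re.findall(r"\w+", doc_text.lower())
    positive = {"great", "good", "love"}
    negative = {"bad", "poor", "hate"}
    sentiment = sum((t in positive) - (t in negative) for t in tokens)
    return {"tokens": tokens, "token_count": len(tokens), "sentiment": sentiment}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, raw TEXT, derived TEXT)")

# Ingest time: run the pipeline once, persist raw text plus derived annotations.
for doc in ["I love this product", "Poor quality, would not buy again"]:
    conn.execute("INSERT INTO docs (raw, derived) VALUES (?, ?)",
                 (doc, json.dumps(annotate(doc))))

# Query time: read the stored derived data instead of re-annotating.
for raw, derived in conn.execute("SELECT raw, derived FROM docs"):
    print(raw, "->", json.loads(derived)["sentiment"])
```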

If you look through a list of actual Hadoop or other MapReduce use cases, you'll see that a lot of them boil down to "crunch data in a big batch job to get it ready for further processing." Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can also happen to time series or traditional relational tables. (See, for example, the use cases in two recent Aster Data slide decks.) Generally, those are not processes that you want to try to run in real time.
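A minimal sketch of that pattern, using plain Python in place of an actual Hadoop job and invented weblog lines, shows map and reduce phases turning raw logs into a small, query-ready derived table:

```python
from collections import Counter

# Hypothetical raw weblog lines: "timestamp ip url status".
weblog = [
    "2010-12-14T10:00:01 10.0.0.1 /products/42 200",
    "2010-12-14T10:00:02 10.0.0.2 /products/42 200",
    "2010-12-14T10:00:03 10.0.0.1 /checkout 500",
]

def map_phase(line):
    """Emit a (url, 1) pair per log line, MapReduce-style."""
    _, _, url, _ = line.split()
    return (url, 1)

def reduce_phase(pairs):
    """Sum counts per key; the output is the derived, query-ready table."""
    totals = Counter()
    for url, count in pairs:
        totals[url] += count
    return totals

page_views = reduce_phase(map_phase(line) for line in weblog)

# In practice you would load `page_views` into the warehouse as a batch job;
# downstream queries then hit the derived table and never reparse raw logs.
print(page_views.most_common())
```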

Scientists have a massive need to adjust or "cook" data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:

Log processing, not unlike that done in various commercial sectors.

Assigning data to different kinds or densities of coordinate grids -- "regridding" -- often through a process of interpolation/approximation/estimation (a simple sketch follows below).

Adjusting/normalizing data for all kinds of effects (such as weather cycles).

Examples where data adjustment is needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.
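To make the regridding and adjustment steps concrete, here is a small sketch using NumPy's linear interpolation. The observation values and the straight-line "trend" adjustment are invented for illustration, not drawn from any real scientific pipeline:

```python
import numpy as np

# Hypothetical irregular observations: sample times (hours) and sensor readings.
obs_times = np.array([0.0, 1.7, 3.2, 6.5, 9.8])
obs_values = np.array([14.1, 14.9, 16.3, 18.0, 15.2])

# "Regrid" onto a regular hourly grid by linear interpolation.
grid_times = np.arange(0.0, 10.0, 1.0)
grid_values = np.interp(grid_times, obs_times, obs_values)

# A simple adjustment step: subtract a fitted linear trend, standing in for
# normalizing out effects such as weather or seasonal cycles.
trend = np.polyval(np.polyfit(grid_times, grid_values, 1), grid_times)
adjusted = grid_values - trend

# The regridded and adjusted arrays are the derived data you store and query;
# recomputing them on the fly for a petabyte-scale archive is not realistic.
print(np.round(adjusted, 2))
```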

Similar issues arise in marketing applications, even beyond the kind of straightforward, predictive-analytics-based scoring and psychographic/clustering results one might expect.

For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when, in fact, you're a 32-year-old male with 400 Facebook friends who are mostly in your age range. Let's say you tend to look at Web sites about cars, poker, and video games and have a propensity to click on ads featuring scantily clad females.

Increasingly, analytic systems presented with this scenario are smart enough to treat you as somebody other than your grandmother. But those, too, are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad-serving response times.
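A sketch of that division of labor -- hypothetical user attributes and a crude rule standing in for a real predictive model -- shows the expensive inference happening offline, with the ad-serving path reduced to a lookup of the stored score:

```python
# Offline batch step: infer a corrected profile from behavioral signals.
# All attribute names, values, and rules here are hypothetical illustrations.
profiles = {
    "user_123": {
        "declared_age": 90,
        "friend_median_age": 31,
        "top_site_categories": ["cars", "poker", "video games"],
    }
}

def infer_segment(profile):
    """Crude rule-based stand-in for a real predictive-analytics model."""
    if abs(profile["declared_age"] - profile["friend_median_age"]) > 20:
        likely_age = profile["friend_median_age"]  # trust the social signal
    else:
        likely_age = profile["declared_age"]
    segment = ("young_male_gamer"
               if "video games" in profile["top_site_categories"]
               else "general")
    return {"likely_age": likely_age, "segment": segment}

# Derived scores, precomputed and written back to the database ahead of time.
scores = {uid: infer_segment(p) for uid, p in profiles.items()}

# Online ad-serving path: a single key lookup, no model evaluation per request.
def serve_ad(user_id):
    segment = scores.get(user_id, {}).get("segment", "general")
    return f"ad_for_{segment}"

print(serve_ad("user_123"))
```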

Curt Monash runs Monash Research, which provides strategic advice to users and vendors of advanced information technology. He also writes the blogs DBMS 2, Text Technologies, and Strategic Messaging. Write him at [email protected]


About the Author(s)

Curt Monash

Contributor

Curt Monash has been an industry, product, and/or stock analyst since 1981, specializing in the areas of database management, application development tools, online services, and analytic technologies.
