Big data must eliminate a bottleneck on its journey from buzzword to mainstream acceptance: the person called the data scientist.
This person has a PhD in math or statistics and is trained to fish for insights in the data lake. This person crafts algorithms like fishing flies, casting queries out like a line, luring insights to the surface where they can be hooked like trout.
To be a data scientist, one must combine domain experience, a deep background in statistics and math, and programming skills, noted Leon Kutsnelson, director and CTO for IBM Analytics emerging technologies. "We call them 'unicorns' because they don't exist," he said. If industry had to depend on PhDs to do big data, "we [would] continue to sit on mountains of data," he added.
Businesses don't have the five years or more it takes for a fresh crop of data scientists to earn their PhDs. That means hiring people with some training may be the best option for companies that need data science skills.
"Several analyst firms have predicted six-figure talent gaps in terms of open jobs compared to qualified individuals. This has translated to our customers not being able to find qualified individuals to add to their teams." said Mark Morrissey, senior director of education services at Cloudera.
That is why companies selling big data solutions also provide an array of free online tutorials and videos to expand that user base. The "graduates" may not be as skilled as actual data scientists, but they'll come away knowing enough about big data to get some of the job done.
Which brings up the next question: How do you teach people to be "big data practitioners"?
When it comes to big data training, traditional on-site classroom instruction has strengths and drawbacks. One problem is geography. "There are not a lot of seats distributed across the globe," said Suzanne Ferry, vice president for global education and eLearning at MapR.
MapR can offer three-day classes in London or Tokyo, she said. But that doesn't do much good if the prospective trainee is in Milan or Manila. There is an opportunity cost in pulling someone out of an office and sending them to another location to take a class, Ferry noted. MapR instructors also must travel to teach in-person classes, and that costs money and time.
To break the tyranny of geography, MapR offers private classes at the client's site. Or it can turn to the Internet to deliver online classes, which is what many vendors do.
MapR, Cloudera, Hortonworks, and IBM offer free online courses to help aspiring big data professionals become acquainted with skill sets required for Hadoop and big data. Such courses are generally broken down by role, aimed at people seeking proficiency in development, administration, and analysis.
"Competency is determined by your persona," Ferry said. "We teach you how to use the skills you already have."
Courses delivered by video allow users to fit in lesson time at their convenience, and video can be revisited and reviewed as often as needed to grasp a concept. It might take three or four viewings before a concept makes sense, whereas in class, if you don't "get it" the first time, or you miss that particular class session, you have a problem, Ferry explained.
"Our classroom-based offerings are intensive three- to four-day sessions, in which students are able to interact directly with professional instructors. Our courses are developed in a pedagogical style which builds on concepts and delivers an intense, thorough learning experience," said Cloudera's Morrissey, in an e-mail. "Conversely, online (or on-demand) training allows individuals to learn over a longer time-horizon, revisiting topics which are of particular interest."
"The biggest impediment is emotional, not intellectual," observed IBM's Kutsnelson. "People get discouraged quickly … Distraction and lack of discipline are your biggest enemies"
IBM's way around this problem relies on psychology. "We require as little discipline as possible," Kutsnelson said. This seems counterintuitive, but it makes sense once this is coupled with another quirk of human nature -- the "student" has to choose a goal, which IBM then helps him or her achieve.
Kutsnelson likened it to being a high school student. The student learns at home on a computer because of a desire to do "cool" videogames or hack into other computers. This may seem a tad juvenile, but that user has a goal. When you make the coursework narrow and specialized, the trainee can imagine what the end-state will be, much like the high school student, Kutsnelson explained.
At Hortonworks, online coursework is broken into several tiers, the most accessible being Sandbox. "The tutorials themselves are organized by role," said Shaun Connolly, vice president of strategy. A trainee may get the basics of Spark or Scala, get a hands-on tour, and then drill down to the next level, he explained. Students also get to download the app for free to get practical hands-on experience alongside the video lessons.
Another approach is to provide links between introductory Sandbox lessons and practical applications of big data featured in corporate partner sites, Connolly added, where the "student" can see practical, real-life applications of big data. Sandbox and the linked partner sites attract 100,000 to 200,000 users per year, he said, and the structure can act as a feeder to more rigorous online courses or one- to three-day onsite instructor-led classes.
IBM's approach is "5-5-5": structuring courses so that you get five lessons, each illustrated with its own five-minute video. Each lesson is followed by a hands-on exercise to reinforce it, Kutsnelson said.
It's all about improving the absorption of concepts and skills, something every vendor is continually working on.
"We do a lot of research on what people can absorb." said MapR's Ferry. "We do chunk it." A person can pay sharp attention for about eight minutes of class time. Online, video lessons can be "chunked" into logical segments that can be absorbed and reviewed," she explained. Coursework is solution-based. But certifying someone to be a data scientist is "too tough a nut to crack," Ferry said.
"No one becomes a data scientist overnight. It requires a mix of hacking skills, a deep knowledge of statistics, and substantial experience." Cloudera's Morrissey said in an e-mail. "Those who take our Data Science class should already have some of these skills, be a Python developer, and have interfaced with Hadoop in the past. Our goal for the data science track is to take folks with backgrounds as an analyst or statistician, and teach them how to utilize their skills with massive data sets."
So is there a diploma at the end of the rainbow? That depends.
Most IT certificates that are well-regarded aren't graded on a continuum from A to F or some other system, according to MapR's Ferry. Classes are offered on a pass/fail basis.
Offering a certificate means hitting a "sweet spot." Some courses take participants a few tries to pass, Connolly noted. But if you make the course too challenging and the washout rate shoots up, people will become reluctant to commit the time and energy. On the other hand, it can't be too easy, either. "If it is too slam-dunk, then the certification gets a bad brand. People talk," Connolly said. "Enterprises are looking for certification. They need a quality brand promise."
IBM offers a certificate of completion, with a digital badge that can be attached to a LinkedIn profile to vouch for the user's experience, Kutsnelson said. That ties in with the nature of the typical 20-hour online course, which is to turn out practitioners who are "good enough," he said.
"That is the reality of online learning," Kutsnelson said.
William Terdoslavich is an experienced writer with a working understanding of business, information technology, airlines, politics, government, and history, having worked at Mobile Computing & Communications, Computer Reseller News, Tour and Travel News, and Computer Systems ...