Recently (November 2016), I was given the opportunity to take an in-person training course from Cloudera. The course covered different aspects of the Hadoop ecosystem for about half of the time, and then focused on Spark for the other half. In this post, I'm going to give the positives and negatives that I found from taking the course (TL;DR at the bottom).
To give some context, I went into the course with one or two months of Spark development experience, along with a graduate-level college course in Hadoop.
The course was in person and held in a small room that sat about 12 people. There was only one other student in the room with me, along with about four or five more students in a different city connected via a video conference. Luckily for me, the instructor was also in my room. I'm not sure how I would have felt about having the instructor in the other office and only talking to him via video.
When we started up, we all signed up through Cloudera's website and got access to all of the course material. We used a VM for the course which gave us access to a "full" Hadoop cluster. From there, we just started going through the slides and doing the exercises.
The main topics covered in the course were:
- Spark RDDs
- Spark DataFrames
I really enjoyed the section on Flume, as it was the one area I had zero exposure to. I thought the examples in the course were pretty good and I felt like I learned a lot during it. I haven't actually used Flume since the course, but it's still nice to have in my toolbelt.
The Hive and Impala sections I thought were just okay. We went over all the basics of creating tables and partitions, writing queries, etc. This was one of the sections that I felt lacked any real depth. We didn't go much past the very basics and it was a lot more of "here's what Hive can do", rather than really understanding how or why, which is what I was more after.
The Spark section I thought was decent as well. It was mostly focused on operations around RDDs as opposed to DataFrames. I think this might have changed since I took the course, but I was hoping for a bit more on DataFrames since they are used a bit more now. However, I did think the RDD work was interesting. Some of the examples got pretty technical, but unfortunately we kind of glossed over those sections, as the instructor and other students didn't really care much about them (more on that in a bit).
I was also disappointed that we didn't cover anything on Spark Streaming. This is also an area that I think is now covered, but wasn't when I took it.
My main disappointment with all of this course content is that it all seemed too high-level. I had a bit of experience in Hadoop/Hive/Spark coming in, and I was hoping to solidify that knowledge, as well as learn a whole lot more. It didn't help that all but one of the other students had more "manager" type roles and were looking to get their first exposure to Hadoop. So, when we got into the really technical pieces, like I mentioned before, no one really cared about them. The instructor would mention them very briefly, mutter something like "you guys won't need this", and then move on. With the course title of "Developer Training", I thought everyone would have prior programming experience and that it would be a very technical training, but that just wasn't the case.
One really nice thing about the content, however, was the slides and VM. Not only were they great, but the fact that you still have access to them after the course made it even better. The slides had some good reference content, and the VM we used was really well structured. A lot of the activities in the course built off of each other. So for example, I couldn't go do exercise 7 without doing exercises 1 through 6 first. However, the nice part about the VM is that it had a "catch-up" script on it, where you could basically automatically finish all of the previous activities up to the one you wanted to do, which was a great feature.
Overall, I thought the instructor did a good job at teaching the course content, but I found he failed to answer any in-depth technical questions at all. He contradicted himself quite a bit, and I just got the feeling that he was making up answers off the top of his head instead of admitting he didn't know the answer and getting back to us. This, plus the fact that he was skipping over the hardest technical pieces, really rubbed me the wrong way. As a course sponsored by Cloudera, I expected someone who was truly an expert, and that's not what we got.
Overall, I thought it was a decent course with a few big flaws. Depth and technical content are what was really missing from the course.
- "Take home" content is great: you keep access to the slides and VM after the course
- Content was too high level to get any expertise
- Content was not very technical
- Questionable instructor