Tuesday, January 28, 2014

Big Data course at CVUT

Thanks to support from IBM, we will open a new course on Big Data at CVUT FEL. The course will offer hands-on experience with standard Big Data tools such as Hadoop. For practice, we have prepared several text-processing tasks.

Why is it important to teach Big Data? The volume of processed data is constantly growing. Internet portals, insurance companies, banks, GSM providers, the health industry, the automotive industry and others are accumulating enormous amounts of data on their servers. This data contains a great deal of varied information. New methods for processing and understanding the data are being developed at a similar pace. It is clear that companies without a large analytical department and access to Big Data will not be able to make good decisions, and so will not be competitive. This leads companies to look for experienced people capable of processing and interpreting the data. The role of the university is to stay ahead and respond to this demand.

The motivation for this course is clear to both IBM and the university, and this is why we joined forces.

The objective of our course is to teach students the basics of Big Data and offer some hands-on experience. The course will focus on methods for data extraction and analysis, as well as on the selection of hardware infrastructure for managing persistent data. In the second half of the course we will show how to process streamed data, such as data from social networks. In the exercises we will introduce standard analytical methods for text processing.

The course is split into 13 weeks. We want to cover five main topics:

  1. Hadoop overview - all components and how they work together. Installing Hadoop, HW and SW requirements, administration, and an introduction to the basic setup of our cluster.
  2. MapReduce and how to use the pre-installed data. The bag-of-words model, TF-IDF, SVD, LDA.
  3. HDFS, NoSQL databases, HBase, SQL access, Hive. How to scale HDFS up and down.
  4. What Mahout is and what its basic algorithms are. Running a random forest classification task using the Mahout algorithms.
  5. Streamed data - Storm or InfoSphere, real-time processing of Twitter data, a simple sentiment algorithm.
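
To give a flavour of topic 2, here is a minimal sketch of bag-of-words counting written in a map/reduce style, followed by a TF-IDF score. It runs in memory in plain Python, not on Hadoop, and is not the course material itself. The function names (`map_phase`, `reduce_phase`, `tf_idf`) are made up for illustration, and the TF-IDF variant used (term frequency normalized by document length, natural-log inverse document frequency) is only one of several common definitions.

```python
import math
from collections import Counter, defaultdict


def map_phase(doc_id, text):
    """Map step: emit ((doc_id, word), 1) for every whitespace-separated token."""
    for word in text.lower().split():
        yield (doc_id, word), 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def tf_idf(docs):
    """Compute TF-IDF for every (doc_id, word) pair.

    docs is a dict mapping a document id to its text (an assumed input format).
    """
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    term_counts = reduce_phase(pairs)  # (doc_id, word) -> count, the bag of words

    doc_lengths = Counter()  # total number of tokens per document
    doc_freq = Counter()     # number of documents containing each word
    for (doc_id, word), n in term_counts.items():
        doc_lengths[doc_id] += n
        doc_freq[word] += 1

    n_docs = len(docs)
    return {
        (doc_id, word): (n / doc_lengths[doc_id]) * math.log(n_docs / doc_freq[word])
        for (doc_id, word), n in term_counts.items()
    }
```

With this definition, a word that appears in every document gets a score of zero, which is the usual intuition behind TF-IDF: such words say little about any single document.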

We will put all the presentations on the web with public access; they will all be in English. You can follow us on the course web pages. Keep your fingers crossed for us: it will be a lot of work, but we are all looking forward to playing with the latest technologies.