Wednesday, August 28, 2013

Summer projects

Summer is great: I can finally focus on real research and development. We have pursued several interesting machine learning projects, and we designed an Android application in the Cloud Computing Center.

Tomas has mostly been focusing on document categorization. His task: for a given document, find the best category. In practice, this kind of algorithm allows automatic tagging of documents, is good for grouping similar documents, and can be used for pairing ads with content in web advertising. To categorize the documents, we need to build models for all categories. We generated a large number of models with the Latent Dirichlet Allocation (LDA) algorithm. The final classification into the best category was done with a Random Forest (RF) classifier trained on a set of manually categorized documents. The training set was kindly provided by Seznam.cz. LDA requires a lot of computational power. To accommodate this, we selected the Cloudera virtual images with preinstalled Hadoop. The LDA is written in Python and was run in MapReduce. The work is in progress and the results are promising.
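
To give a feel for what "run in MapReduce" means here, below is a minimal single-process sketch of the map, shuffle and reduce phases. It is not Tomas's LDA code: it only builds the per-document word counts that a topic model such as LDA takes as input. The function names, the tokenizer and the toy corpus are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import groupby

TOKEN = re.compile(r"[a-z]+")  # assumed, very crude tokenizer

def mapper(doc_id, text):
    """Emit ((doc_id, word), 1) for every token in one document."""
    for word in TOKEN.findall(text.lower()):
        yield (doc_id, word), 1

def reducer(key, values):
    """Sum the counts emitted for one (doc_id, word) key."""
    return key, sum(values)

def run_job(documents):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    emitted = [pair for doc_id, text in documents.items()
               for pair in mapper(doc_id, text)]
    emitted.sort(key=lambda kv: kv[0])  # "shuffle": bring equal keys together
    counts = defaultdict(dict)
    for key, group in groupby(emitted, key=lambda kv: kv[0]):
        (doc_id, word), total = reducer(key, (v for _, v in group))
        counts[doc_id][word] = total
    return dict(counts)

# Hypothetical toy corpus, not the Seznam.cz training set.
docs = {
    "d1": "Cheap flights and cheap hotels",
    "d2": "Football results: football league table",
}
bags = run_job(docs)
```

On a real Hadoop cluster the mapper and reducer would run as separate processes on many machines; the sort in `run_job` only stands in for the framework's shuffle step.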

Ondrej and other students are looking into another interesting problem: modeling online gamers. We have received data from the Pool Live Tour game made by Geewa. The data shows user behaviour in detail. The task is simple to state: model users well enough to predict their readiness to buy a new cue. Players gain skills, proceed to higher levels, spend more time in the game, etc. All this has to be captured to estimate the right moment for a cue update. The biggest problem in such projects is to get through the data and extract the right features. Mirek from Geewa helped us decode how it works. The data includes gamers who bought a cue as well as non-buying users, which is good for supervised training. Again we decided to use our popular RF algorithm for classification. Currently we are planning tests. The models we have developed are also useful for clustering users into groups, for uncovering cheating, etc. Similar models can be used for many other games or activities. These algorithms may uncover the necessary insight and help us make the game or activity more challenging and engaging. This field leaves room for many further improvements. The greatest advantage: there are a great many gamers producing a lot of data, and more data gives the models more power.
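
The feature extraction step might look roughly like the sketch below. The event format, the field names and the chosen features are assumptions for illustration only; the real Pool Live Tour data is structured differently and is much richer. The idea is simply to collapse a stream of raw events into one row per user, with the purchase flag serving as the label for supervised training.

```python
from collections import defaultdict

def extract_features(events):
    """Aggregate raw events into one feature row per user."""
    acc = defaultdict(lambda: {"games": 0, "wins": 0, "minutes": 0.0,
                               "max_level": 0, "bought_cue": False})
    for e in events:
        row = acc[e["user"]]
        if e["type"] == "game":
            row["games"] += 1
            row["wins"] += 1 if e["won"] else 0
            row["minutes"] += e["minutes"]
            row["max_level"] = max(row["max_level"], e["level"])
        elif e["type"] == "purchase" and e["item"] == "cue":
            row["bought_cue"] = True  # label for supervised training

    features = {}
    for user, row in acc.items():
        games = row["games"]
        features[user] = {
            "games": games,
            "win_rate": row["wins"] / games if games else 0.0,
            "minutes": row["minutes"],
            "max_level": row["max_level"],
            "bought_cue": row["bought_cue"],
        }
    return features

# Hypothetical events, invented for this example.
events = [
    {"user": "u1", "type": "game", "won": True, "minutes": 5.0, "level": 2},
    {"user": "u1", "type": "game", "won": False, "minutes": 7.0, "level": 3},
    {"user": "u1", "type": "purchase", "item": "cue"},
    {"user": "u2", "type": "game", "won": True, "minutes": 4.0, "level": 1},
]
table = extract_features(events)
```

Each row in `table` could then be passed to a classifier such as a random forest, with `bought_cue` as the target and the remaining fields as inputs.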

In June we started to work on an Android application for CTU students. The app allows them to search for faculties and students from a mobile device. Students can find their classes with detailed descriptions. The app also offers detailed information about the university. The built-in RSS feed reader aggregates the most important university information sources. Finally, the application allows users to check today's menu in the canteens. We are almost finished: we are starting testing and final debugging. Our plan is to release it to students next month. It will be downloadable through Google Play for Android owners. We plan to introduce an iPhone version later, since the iPhone is less common among students. Thanks go to the group of programmers at the FIT faculty of CTU who prepared the required KOS API.
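
For the curious, aggregating several RSS feeds boils down to parsing each feed and merging the items by publication date. The sketch below shows the idea in Python rather than in the app's Android code, and the feed contents and example.org links are made up; it assumes plain RSS 2.0 feeds where every item carries a `pubDate`.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def parse_feed(xml_text):
    """Return the items of one RSS 2.0 feed as a list of dicts."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # pubDate uses the RFC 822 date format, e.g. "Mon, 26 Aug 2013 ..."
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items

def aggregate(feeds):
    """Merge the items of several feeds, newest first."""
    merged = [it for xml_text in feeds for it in parse_feed(xml_text)]
    merged.sort(key=lambda it: it["published"], reverse=True)
    return merged

# Hypothetical feeds, invented for this example.
FEED_A = """<rss version="2.0"><channel><title>News</title>
<item><title>Enrollment opens</title><link>http://example.org/a1</link>
<pubDate>Mon, 26 Aug 2013 10:00:00 +0200</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Notices</title>
<item><title>Canteen hours change</title><link>http://example.org/b1</link>
<pubDate>Tue, 27 Aug 2013 09:00:00 +0200</pubDate></item>
<item><title>Library closed</title><link>http://example.org/b2</link>
<pubDate>Sun, 25 Aug 2013 18:00:00 +0200</pubDate></item>
</channel></rss>"""

latest = aggregate([FEED_A, FEED_B])
```

A real reader would also have to fetch the feeds over the network and cope with missing or malformed dates, which this sketch leaves out.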

There are some more projects cooking in our lab, but I will report on them next time. If you are interested in our projects or want to join our group, let me know. We have plenty of interesting tasks.

2 comments:

  1. Why did the student develop his own LDA in Python when you can use lightning-fast off-the-shelf LDAs such as:

    - http://radimrehurek.com/gensim/ (Czech)
    - Vowpal Wabbit - see http://www.machinedlearnings.com/2010/12/lightning-fast-lda.html
