Recently I was asked to review the latest development in our group and I realised how much work we have done. I have also noticed I am forgetting about my blog. Let’s fix it!
First the best news, our group has grown during the last year to five PhD and around 10 MSc students working in machine learning.
Today I would like to start with part one and mention some of our progress in machine learning. In the second part I will describe our IoT effort. The main machine learning topics can be broken in the following categories:
- Natural Language Processing
- Question answering YodaQA
- Intelligent assistants
- Sentence pair similarity
- Multinomial classification
- Information retrieval
- Learning to rank
- Information extraction
- Focused crawling
- Convolutional Neural Networks
- Combining text and images
- Image labeling
I’ll start with the major achievement, the YodaQA answering machine. It is an open source question answering system. It implements state-of-art methods of information extraction and natural language understanding — to answer human-phrased questions! You can try the live demo.
Along with YodaQA we have worked also on simpler Intelligent Assistants acting on a smaller number of commands. They take advantage from simpler algorithms finding the most similar answer for a given query. The sentence pair similarity is another topic of interest. The algorithms can help solving not only Answer Sentence Selection, but also other interesting problems, such as Next Utterance Ranking, Semantic Textual Similarity, Paraphrase Identification, Recognizing Textual Entailment etc. We have tested and developed a series of algorithms based on word embeddings and different architecture of Neural Networks.
To the NLP category belongs also the multinomial classification algorithm. The use case we are testing is the products categorization to a hierarchical directory structure. Typically the e-shops are categorizing products, such as a “14 inch screen notebook” under notebooks, computers, electronics etc. This process is handled by human beings and they do mistakes, our algorithm can find problematically categorized entries or suggest correct category.
An exclusive position in our group has the adaptive ranking research. The web content, its information relevancy, authority and the users interests are changing constantly. The goal of any search engine is to provide exactly what the users look for. The newly developed algorithm relies on users to find currently the best ranking. It constantly observes on what links are users clicking and it adapts based on this feedback. This is very relevant for information search and recommendation services.
Information extraction is the next topic completing our portfolio. Initially we have looked at basics of focused crawling, a strategy how to crawl internet and extract for example all mentions about AT&T and Linux. This leads to a design of a crawler with programmable search policy. Currently we work on even more sophisticated algorithm for extracting content from e-shop pages, segmentation, price, product name etc. extraction. These are known problems typically solved semi-manually constructing scripts and then running the extraction. Our goal is a high accuracy, general algorithms without any customization or training working for all e-shops.
In the part two of this blog I will review our efforts in the Internet of Things.