published: 2014-06-08 14:34:50, updated: 2017-11-12 15:53:33
Different approaches to implement a topic-based news aggregator
In computer science, an implementation is a realization of a technical specification or algorithm as a program, software component, or other computer system through computer programming and deployment. (Wikipedia)
The main aspect of the aggregator is its content curation, which essentially means we have to sort all the feeds into topics or categories based on what they are about. Two fields of an RSS item are of interest here: the title and the description. Using these, it is possible to analyze the feed and make a good guess at what it is about.
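As a rough illustration of pulling those two fields out of a feed, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample feed and the `extract_items` helper are made up for this example, and a real aggregator would probably need a more forgiving parser for malformed feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Google I/O keynote highlights</title>
      <description>A recap of the announcements.</description>
    </item>
    <item>
      <title>Local bakery wins award</title>
      <description>A neighbourhood story.</description>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return a list of (title, description) pairs, one per <item>."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        items.append((title, description))
    return items


for title, description in extract_items(SAMPLE_RSS):
    print(title, "|", description)
```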
Another approach we can take is to analyze the consecutive words in a feed title and compare them with those of other feed titles. Common words, such as generic verbs and adjectives, will need to be filtered out. An example is the recent Google I/O, where many feed titles contained that term, which would allow the algorithm to categorize them correctly. This would be the work of a tokenizer and a loop comparing each result with every other title.
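The tokenizer and comparison loop could look something like the sketch below. It is only an outline under a few assumptions: the stop-word list is a tiny hypothetical one, the helper names `tokenize` and `shared_words` are invented here, and for simplicity it compares individual words rather than runs of consecutive words.

```python
import re
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list; a real one would be far larger.
STOP_WORDS = {"a", "an", "the", "at", "in", "on", "of", "for", "to", "and", "is", "with"}


def tokenize(title):
    """Lower-case a title and return its set of words, minus stop words."""
    words = re.findall(r"[a-z0-9/]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}


def shared_words(titles):
    """Compare every pair of titles and record the words each pair has in common."""
    tokens = [tokenize(t) for t in titles]
    result = {}
    for i, j in combinations(range(len(titles)), 2):
        common = tokens[i] & tokens[j]
        if common:
            result[(i, j)] = common
    return result


titles = [
    "Google I/O keynote highlights",
    "What to expect at Google I/O",
    "Local bakery wins award",
]
print(shared_words(titles))
```

Here only the first two titles share any words ("google" and "i/o"), so they would end up grouped together. Comparing every pair is quadratic in the number of titles, which is fine for a handful of feeds but not for a large collection.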
Another way would be to create a set of recognized categories as we go along and search each title for these words. If the title contains words occurring in existing categories, the feed is categorized under the matching category. If not, the feed's words are used to identify a new category. The rank of a category can be built up as we move through the feeds. After the feeds have been categorized, a single further loop to re-evaluate each category is necessary to weed out categorization errors. If a feed has been categorized as something else in spite of containing words that belong to a better-ranked category, this pass will correct it.
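A minimal sketch of this two-pass idea follows. Several details are assumptions made for the example rather than part of the description above: a category is represented by a single word, its rank is the number of titles containing that word, a title with no known words is filed under its first word, and ties go to the word that appears first in the title.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "new", "out", "in", "on", "of", "to", "and"}


def words_of(title):
    """Split a title on whitespace and drop stop words, keeping word order."""
    return [w for w in title.lower().split() if w not in STOP_WORDS]


def categorize(titles):
    """Return (first_pass, final) category lists for the given titles."""
    rank = Counter()  # category word -> number of titles seen containing it
    token_lists = [words_of(t) for t in titles]

    # First pass: use the best-ranked known category, or start a new one.
    first_pass = []
    for words in token_lists:
        known = [w for w in words if w in rank]
        if known:
            first_pass.append(max(known, key=lambda w: rank[w]))
        else:
            first_pass.append(words[0] if words else None)
        rank.update(set(words))

    # Second pass: re-evaluate every title against the final ranks.
    final = [max(words, key=lambda w: rank[w]) if words else None
             for words in token_lists]
    return first_pass, final


titles = [
    "Android update rolls out",
    "Google announces Android update",
    "Google unveils new phone",
    "Google maps redesign",
]
first_pass, final = categorize(titles)
print(first_pass)
print(final)
```

In this run the second title is first filed under "android", because "google" has not been seen yet, and the re-evaluation pass moves it to "google" once that category has the higher rank.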
A further, more scalable approach is to store the feeds in a database on the server and then query the tables to identify the relationships. The words extracted from the feeds would be stored in tables, and each row would be joined with the corresponding rows containing those words. The results would then be entered into a new category. Doing such operations on the server allows these curated feeds to be provided to all subscribing users. However, the preferences and storage for each user will still have to be on the device. Only the performance-intensive operations will be performed on the server.
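To make the join concrete, here is a small sketch using an in-memory SQLite database through Python's `sqlite3` module. The table names `feed` and `feed_word`, the stop-word list, and the `related_feeds` helper are all hypothetical; a production server would likely use a different database and schema.

```python
import sqlite3

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "at", "in", "on", "of", "to", "and"}


def related_feeds(titles):
    """Store titles and their words, then self-join to find feeds sharing a word."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feed (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE feed_word (feed_id INTEGER REFERENCES feed(id), word TEXT);
    """)
    for title in titles:
        cur = conn.execute("INSERT INTO feed (title) VALUES (?)", (title,))
        words = {w for w in title.lower().split() if w not in STOP_WORDS}
        conn.executemany(
            "INSERT INTO feed_word (feed_id, word) VALUES (?, ?)",
            [(cur.lastrowid, w) for w in words],
        )
    # Join each word row with rows from other feeds containing the same word.
    rows = conn.execute("""
        SELECT a.feed_id, b.feed_id, a.word
        FROM feed_word AS a
        JOIN feed_word AS b
          ON a.word = b.word AND a.feed_id < b.feed_id
        ORDER BY a.feed_id, b.feed_id, a.word
    """).fetchall()
    conn.close()
    return rows


titles = [
    "Google I/O keynote highlights",
    "What to expect at Google I/O",
    "Local bakery wins award",
]
for first_id, second_id, word in related_feeds(titles):
    print(first_id, second_id, word)
```

Each returned row names two feeds and a word they share, which could then be written to a category table. With a large number of feeds, an index on `feed_word(word)` would probably be needed to keep the join fast.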