Configuration system for the Apache Nutch spider: practical application in the Orion search engine

Published in: Proceedings of the 13th Latin American and Caribbean Conference for Engineering and Technology: Engineering Education Facing the Grand Challenges, What Are We Doing?
Date of Conference: July 29 - 31, 2015
Location of Conference: Santo Domingo, Dominican Republic
Authors: Yulio Aleman Jimenez
Yoniel Jorge Thomas Sosa
Aylin Estrada Velazco
Eyeris Rodríguez Rueda
Refereed Paper: #27

Abstract:

The steady increase in the amount of digital information publicly available on computer networks around the world has made it difficult for users to find what they really need at any given time. Information Retrieval Systems were designed to locate the required information, but their functionality comes with a large number of configuration options that are difficult to administer. Apache Nutch is a free spider with significant advantages for collecting and finding information on the web; however, it lacks a system for configuring it visually, without console commands, and for working with multiple instances simultaneously. The Orion search engine was developed at the University of Informatics Sciences in Havana, Cuba, but it has many shortcomings that hinder the configuration of its Nutch-based crawling mechanism. This paper presents the essential elements considered in implementing a system that improves usability and eases the work of administrators in configuration tasks. Through web interfaces, the implemented system provides features that increase control over configuration changes, streamline the configuration process, and supply information about the settings that was previously impossible or difficult to obtain.

Keywords-- Apache Nutch, configuration, Information Retrieval System, Orión, web interface.