Monday, May 26, 2003
Business
THIS WEEK NetSpeak focuses on the concepts and issues behind web scraping, a technique for reading and extracting specific data from web pages. Though most netizens use a browser to access a web page, this is, as mentioned in this column many times, not the only way to access web content. A browser is a program that talks to the web server (another program running on a remote or local machine) in a mutually agreed upon language, or protocol, and fetches the relevant content from the web site. The protocol used to download web pages from a web server is called HyperText Transfer Protocol, or HTTP. Anyone who has some knowledge of HTTP can create a tool that reads web content.

When we view a web page with a browser, we are forced to see all of its content, including text, graphics and banner advertisements. Another constraint of browser-based web travel is the need to repeat several tasks. Suppose you regularly visit a few specific sites to read articles on a particular subject. To do this, every day you need to access each of these sites, check for new content and read the appropriate pages. But if you have a tool that can automatically visit these web sites, collect the relevant data, combine it into one file and store it on your local disk, you can read it at your convenience. Again, if you want only the text content to be downloaded, rather than everything the page contains as a web browser displays it, you need a tool that can independently read web pages and weed out unwanted elements.

In fact, there are several instances in personal and business life where you need to read data from web pages and process it further to obtain the required output. This process of reading and extracting information from a web page is called web scraping. Most web-programming languages have the features a programmer needs to implement a web-scraping tool easily. Scripting languages such as PERL, Python and REBOL (discussed in the last issue of this column) are good examples. And for those of you who do not want to write programs, many useful packages are also available. For example, sitescooper, a program written in PERL, demonstrates the power and flexibility of web scraping.
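To make the idea concrete, here is a minimal sketch in Python, one of the scripting languages named above, of a tool that fetches a page over HTTP and keeps only its text. It uses only the standard library. The names TextExtractor, extract_text and scrape_text are made up for this illustration, and the choice to skip script and style blocks is an assumption about what counts as an unwanted element.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects the text of a page and ignores script and style blocks."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: treated as unwanted

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Images and banners carry no text data, so they drop out here.
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def scrape_text(url):
    """Fetch a page over HTTP, as a browser would, and return only its text."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_text(html)
```

Calling scrape_text with the address of a page returns its text content as a single string, which could then be appended to a local file for reading later.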
SiteScooper
The program SiteScooper (http://sitescooper.org/) enables you to automatically read several web pages, extract information from them and store it on the local disk in different formats so that it can be read off-line later. Using the SiteScooper script, you can retrieve web content from several news web sites and convert the extracted data into formats such as plain text and HTML. That is, if you need to read some web pages regularly without visiting the sites on which they are hosted, just run this script with appropriate options.

To use SiteScooper you need to download and install PERL. If you are a Windows user, download PERL from http://www.activestate.com/. Then download the SiteScooper archive and extract it to a directory. Move to the SiteScooper directory and start the script `sitescooper.pl'. When you run the script for the first time, it presents a list of sites from which you pick the ones you like. You can even create your own site list.

Once the site list file is properly updated, run sitescooper.pl from the command prompt. If you are a Windows user and want to extract only the HTML content from all the selected pages, enter the command: perl sitescooper.pl -html. The script will visit all the specified sites, retrieve the content, store it in different directories and create an index file with links to the web data stored locally. You can read all the pages at any time by opening this index file. Again, if you want only the text content of a single web page, type the command: perl sitescooper.pl -text http://your-site.com/page.htm
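The store-and-index step described above can be illustrated with a short Python sketch. This is not SiteScooper's own code or file layout. The function name save_pages_with_index, the page file names and the shape of the input are all assumptions made for the example, which also assumes that each site name is safe to use as a directory name.

```python
import html
from pathlib import Path


def save_pages_with_index(pages, out_dir):
    """Store pages in one directory per site and write an index of links.

    pages maps a site name to a list of (title, html_text) pairs.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    for site, articles in pages.items():
        site_dir = out / site
        site_dir.mkdir(exist_ok=True)
        for number, (title, body) in enumerate(articles, start=1):
            page = site_dir / f"page{number}.html"
            page.write_text(body, encoding="utf-8")
            # Links are relative, so the whole directory can be moved.
            relative = page.relative_to(out).as_posix()
            links.append(
                f'<li><a href="{relative}">{html.escape(title)}</a></li>'
            )
    index = out / "index.html"
    index.write_text(
        "<ul>\n" + "\n".join(links) + "\n</ul>\n", encoding="utf-8"
    )
    return index
```

Opening the returned index file in a browser gives one link per stored page, which is the off-line reading pattern the column describes.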
Constraints
Though web scraping makes it easy to filter out the relevant data stored on various web sites and organise it in a suitable format, it is not a technology without troubles. Many web sites restrict users from accessing their content through programs of this kind. So, before using web scraping software, make sure that the target web site does not forbid visitors from using tools that talk directly to its web server.
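One common way a site states such restrictions is a robots.txt file at its root. The column does not mention this file, so treat it as an assumption here, and note that a site's terms of use may impose limits that robots.txt does not express. The sketch below uses Python's standard urllib.robotparser module. The function names and the "my-scraper" agent name in the usage note are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file already in hand."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def site_allows(base_url, user_agent, url):
    """Download the site's robots.txt and check a URL against it."""
    parser = RobotFileParser(base_url.rstrip("/") + "/robots.txt")
    parser.read()  # performs an HTTP request
    return parser.can_fetch(user_agent, url)
```

A scraping tool could call site_allows("http://your-site.com", "my-scraper", page_url) before each download and skip any page for which it returns False.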
Copernic Agent: A Net search tool
Search engines are the primary vehicles through which a netizen reaches Net resources. You will get better search output if you employ the services of multiple search engines instead of using just one. A desktop tool that enables us to search multiple search engines simultaneously in response to a search request is called a meta-search engine. Again, though search engines can be used to locate Net resources, you should not ignore the fact that a large majority of Net resources are beyond the reach of search engines. This part of the web is known as the invisible Web. Another issue related to the Web search process is the inclusion of several links with irrelevant and obsolete information.

Therefore, to spot the appropriate Net locations, a netizen needs a tool that can scan multiple search engines, go deep into the invisible Web and analyse the results to weed out irrelevant information before presenting them to the user. The latest version of the search software Copernic, mentioned earlier, is a good alternative that can be tried out in this connection. To experiment with the program, download the Copernic Agent Basic version, which is available for free at: http://www.copernic.com/

When you enter a search string, Copernic passes it on to several search engines, collects the results from all of them, strikes out duplicates and removes broken links. Next, it scans the invisible Web and unearths the appropriate URLs. After collecting all the possible links, Copernic analyses them and displays them according to relevance. The output generated in this manner is saved automatically and can be read off-line at any time. Another notable feature of the software is the `summarise' tool (not available in the free version), which can be used to "obtain a list of the main key concepts contained in a web page or document text".

J. Murali
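The collect, de-duplicate and rank steps of a meta-search tool, as described in the column above, can be sketched in a few lines of Python. This is an illustration only and not Copernic's method, which the column does not describe. The function names, the URL normalisation rule and the scoring (each engine contributes 1/position, so links that several engines rank highly come first) are all assumptions. Checking for broken links is left out because it needs network access.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to a key so trivially different forms count as one."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    return f"{host}{path}" + (f"?{parts.query}" if parts.query else "")


def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one ranked list.

    results_by_engine maps an engine name to its list of result URLs,
    best first.
    """
    scores = {}
    first_seen = {}
    for urls in results_by_engine.values():
        seen_here = set()
        for rank, url in enumerate(urls):
            key = normalise(url)
            if key in seen_here:
                continue  # count each link once per engine
            seen_here.add(key)
            first_seen.setdefault(key, url)
            scores[key] = scores.get(key, 0.0) + 1.0 / (rank + 1)
    ordered = sorted(scores, key=lambda key: -scores[key])
    return [first_seen[key] for key in ordered]
```

Given the result lists of two or more engines, merge_results returns each distinct link once, with links reported by several engines near the top.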
Copyright © 2003, The Hindu. Republication or redissemination of the contents of this screen are expressly prohibited without the written consent of The Hindu