Explain the types of web crawlers and the different algorithms they use (e.g., BFS, DFS), and implement a sequential and a concurrent crawler using Scrapy or any other tool.
Please explain your code with comments.
Web Crawlers
Types of web crawlers.
COMMERCIAL web crawlers.
These are developed to overcome the limitations of smaller personal-use tools, and building them often requires huge amounts of development time, testing, and real-world use.
Website crawlers like this are more robust, come complete with a wide range of features, and are often able to meet many different needs rather than a single specific purpose.
Types of commercial web crawlers.
SEARCH ENGINE web crawlers
One of the most common and long-standing uses of website crawlers is in search engines such as Google.
Web crawlers form the foundation of search engines: they crawl and scrape the internet, and this crawling and scraping drives the indexing of web content, which in turn produces the search results you find when "Googling" something.
Crawlers of this type are run on vast server farms that span countries and continents, and the data they scrape is stored in equally vast server farms that look more like warehouses. Scraping and storing the enormous amount of data that exists on the internet takes an enormous number of servers and hard drives.
PERSONAL web crawlers.
These are the smaller personal-use tools mentioned above: crawlers built for a single, specific purpose, such as scraping data from one particular site, rather than for general-purpose use at scale.
ALGORITHMS USED IN WEB CRAWLERS
BFS algorithm:
Breadth-first search keeps the frontier of discovered URLs in a FIFO queue: starting from the seed page, it visits every link at the current depth before moving one level deeper, so pages close to the seed are crawled first.
DFS algorithm:
Depth-first search instead goes off into one branch until it reaches a leaf node; this is a problem if one of the goal pages is on another branch, which is why crawlers usually prefer BFS. The sketch below shows that the only difference between the two is how the frontier is managed.
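To make this concrete, here is a minimal illustrative crawler (a sketch, not part of the original answer; the seed URL and the crawl() helper are hypothetical). Switching between BFS and DFS changes only whether the frontier is treated as a FIFO queue or a LIFO stack:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href value of every <a> tag in a page
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, mode="bfs", limit=20):
    # The frontier holds URLs waiting to be visited. BFS pops from the
    # front (oldest first, level by level); DFS pops from the back
    # (newest first, following one branch as deep as it goes).
    frontier = deque([seed])
    visited = set()
    while frontier and len(visited) < limit:
        url = frontier.popleft() if mode == "bfs" else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to download or decode
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http"):
                frontier.append(absolute)
    return visited

# crawl("https://example.com", mode="bfs")  # breadth-first order
# crawl("https://example.com", mode="dfs")  # depth-first order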
Example using Scrapy: running two spiders concurrently in the same process. The spider bodies are minimal placeholder definitions (illustrative names and URLs) so that the script is runnable.
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition: a minimal spider needs a name,
    # start URLs, and a parse() callback for downloaded responses
    name = "spider1"
    start_urls = ["https://example.com"]  # placeholder URL
    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

class MySpider2(scrapy.Spider):
    # Your second spider definition, same shape as the first
    name = "spider2"
    start_urls = ["https://example.org"]  # placeholder URL
    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)  # schedule both spiders on the same Twisted reactor,
runner.crawl(MySpider2)  # so they crawl concurrently
d = runner.join()  # a Deferred that fires when every scheduled crawl is done
d.addBoth(lambda _: reactor.stop())  # stop the reactor once all crawls finish
reactor.run()  # the script will block here until all crawling jobs are finished
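The script above runs both spiders concurrently. For a sequential crawler, the same spiders can be chained so the second only starts after the first has finished; this sketch follows the chained-Deferred pattern from the Scrapy documentation and reuses the MySpider1 and MySpider2 definitions above:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_sequentially():
    # Each yield waits for the previous crawl's Deferred to fire,
    # so the spiders run one after another instead of in parallel
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()  # shut the reactor down after the last crawl finishes

crawl_sequentially()
reactor.run()  # blocks until reactor.stop() is called above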