Question

In: Computer Science

Explain the types of web crawlers and the different algorithms they use (e.g. BFS, DFS), and implement a sequential and a concurrent crawler using Scrapy or another tool.

please explain your code with comments

Solutions

Expert Solution

Web Crawlers

  • Web crawlers are used in many places, most prominently in search and digital marketing.
  • A crawler is a piece of software that automatically searches documents on the Web, crawling and scraping data from web pages, websites, and the files those websites are built from.

Types of web crawlers.

  1. COMMERCIAL web crawlers
  2. SEARCH ENGINE web crawlers
  3. PERSONAL web crawlers

Explanation

COMMERCIAL web crawlers

Commercial crawlers are developed to overcome the limitations of smaller personal-use tools, and they often require large amounts of development time, testing, and real-world use.

Website crawlers like this are more robust, come complete with a wide range of features, and can often meet many different needs rather than a single specific purpose.

Types of commercial web crawlers.

  • DESKTOP web crawlers.
  • CLOUD-BASED web crawlers

SEARCH ENGINE web crawlers

This type of website crawler is run by vast server farms that span countries and continents; the data they scrape is stored in equally vast server farms that look more like warehouses. To scrape and store the enormous amount of data that exists on the internet, you need enormous numbers of servers and hard drives.

Web crawlers form the foundations of search engines as they are used to crawl and scrape the internet. This process of crawling and scraping precipitates the indexation of web content, which itself facilitates the search results you find when “Googling” something.

One of the most common and long-standing implementations of website crawlers are those used by search engines such as Google.

PERSONAL web crawlers.

  • These are built for personal use, or by businesses for use in house.
  • They are often built to perform one or two specific jobs, such as scraping data from search results or monitoring whether specific web pages are down.
  • They are typically good at performing a small, specific job but are likely to fail when deployed to work at scale.
  • They can work well on a small scale, such as scraping paid ads once or twice per day for a small set of keywords.
  • These types of website crawlers are limited in their scalability and functionality, but unlike those used by search engines, you do have control over them.

Some other web crawlers.

  1. General Purpose Web Crawler
  2. Focused Web Crawler
  3. Incremental Web Crawler
  4. Deep Web Crawler

ALGORITHMS USED IN WEB CRAWLERS

BFS algorithm:

  • Initialize a queue Q with the initial set of known URLs.
  • Until Q is empty (or a page or time limit is exhausted):
    1. Pop a URL from the front of Q.
    2. If the URL should not be visited (e.g. it points to a .gif, .jpeg, .ps, .pdf, etc., or has already been visited), continue the loop and get the next URL.
    3. Otherwise, download the page P.
    4. [Index P and/or store a cached copy.]
    5. Parse P to obtain a list of new links.
    6. Append these links to the end of Q.
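The BFS steps above can be sketched in Python. The link graph here is a hypothetical in-memory dict standing in for real HTTP fetching and link extraction, so the sketch stays self-contained:

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
PAGES = {
    "http://a.example/": ["http://b.example/", "http://c.example/", "http://a.example/logo.gif"],
    "http://b.example/": ["http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

def bfs_crawl(seed, page_limit=100):
    """Crawl breadth-first from seed, skipping non-page URLs and repeats."""
    q = deque([seed])          # queue Q of URLs to visit
    visited = []               # pages actually "downloaded", in crawl order
    seen = {seed}              # URLs already queued, to avoid duplicates
    while q and len(visited) < page_limit:
        url = q.popleft()      # step 1: pop a URL from the front of Q
        if url.endswith((".gif", ".jpeg", ".ps", ".pdf")):
            continue           # step 2: skip non-page resources
        visited.append(url)    # steps 3-4: download and index the page
        for link in PAGES.get(url, []):   # step 5: parse P for new links
            if link not in seen:
                seen.add(link)
                q.append(link) # step 6: append new links to the end of Q
    return visited

print(bfs_crawl("http://a.example/"))
```

Because a queue is first-in-first-out, all pages one link away from the seed are visited before any page two links away.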
DFS algorithm:

  • Steps:
    1. Get the first unvisited link from the start page.
    2. Visit that link and get the first unvisited link from the resulting page.
    3. Repeat the previous step until there are no unvisited links left.
    4. Go back to the next unvisited link in the previous level and repeat step 2.

DFS goes off into one branch until it reaches a leaf node; this is a problem if one of the goal nodes is on another branch.
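A depth-first sketch over a hypothetical in-memory link graph (real fetching and link extraction are stubbed out); recursion dives down one branch and backtracks when a page has no unvisited links left:

```python
# Hypothetical link graph standing in for real pages: URL -> links found on that page.
PAGES = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
    "http://c.example/": [],
    "http://d.example/": [],
}

def dfs_crawl(url, visited=None):
    """Crawl depth-first: follow the first unvisited link on each page,
    backtracking when a page has no unvisited links left."""
    if visited is None:
        visited = []
    visited.append(url)                # visit this page
    for link in PAGES.get(url, []):    # steps 1-2: take links in order
        if link not in visited:
            dfs_crawl(link, visited)   # step 3: recurse until no unvisited links
        # step 4: backtracking happens as the recursion unwinds
    return visited

print(dfs_crawl("http://a.example/"))
```

Note the order: the crawler reaches d.example (two links deep) before c.example (one link deep), which is exactly the branch-first behavior described above.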

Example of a concurrent crawler using Scrapy (two spiders running in the same process).

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = "spider1"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Extract the page title as a minimal example
        yield {"title": response.css("title::text").get()}

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = "spider2"
    start_urls = ["http://example.org"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)  # schedule the first spider
runner.crawl(MySpider2)  # schedule the second spider; both run concurrently
d = runner.join()        # deferred that fires when every spider has finished
d.addBoth(lambda _: reactor.stop())  # stop the reactor once all crawls are done

reactor.run()  # the script will block here until all crawling jobs are finished

To run the same spiders sequentially instead, chain the crawls inside a twisted.internet.defer.inlineCallbacks function, yielding each runner.crawl(...) in turn before stopping the reactor.
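Scrapy aside, the sequential-versus-concurrent distinction itself can be sketched with only the standard library. The fetch function below is a stub standing in for a real HTTP download (e.g. urllib.request.urlopen), so the example is deterministic:

```python
from concurrent.futures import ThreadPoolExecutor

URLS = ["http://a.example/", "http://b.example/", "http://c.example/"]

def fetch(url):
    # Stub standing in for a real, network-bound HTTP download.
    return f"<html>page at {url}</html>"

def crawl_sequential(urls):
    """Download one page at a time, in order."""
    return [fetch(u) for u in urls]

def crawl_concurrent(urls, workers=3):
    """Download pages in parallel threads; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

# Both strategies return the same pages; only the scheduling differs.
assert crawl_sequential(URLS) == crawl_concurrent(URLS)
```

With a real network-bound fetch, the concurrent version overlaps the waits for each response, which is where the speedup over the sequential loop comes from.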

