An Improved Generic Crawler using Poisson Fit Distribution

Thangaraj M., Sivagaminathan P. G. Published in Information Sciences.

Communications on Applied Electronics
Year of Publication: 2016
Publisher: Foundation of Computer Science (FCS), NY, USA
Authors: Thangaraj M., Sivagaminathan P. G.
DOI: 10.5120/cae2016652375

Thangaraj M. and Sivagaminathan P. G. An Improved Generic Crawler using Poisson Fit Distribution. Communications on Applied Electronics 6(1):7-13, October 2016.

BibTeX

@article{10.5120/cae2016652375,
	author = {Thangaraj M. and Sivagaminathan P. G.},
	title = {An Improved Generic Crawler using Poisson Fit Distribution},
	journal = {Communications on Applied Electronics},
	issue_date = {October 2016},
	volume = {6},
	number = {1},
	month = {Oct},
	year = {2016},
	issn = {2394-4714},
	pages = {7-13},
	numpages = {7},
	url = {http://www.caeaccess.org/archives/volume6/number1/664-2016652375},
	doi = {10.5120/cae2016652375},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

The remarkable growth of the Internet has filled the World Wide Web with a huge volume of data, much of which remains unexplored by the people it is intended for, even though it is worth extracting and assimilating into knowledge. Retrieving potentially useful information from web data requires a broad-spectrum crawler that collects relevant documents and metadata. A breadth-first crawler algorithm is presented that fetches the related web documents needed to create a web archive for alias extraction. This paper shows that the upgraded crawler, which generates a random crawl depth, performs better than crawling to a predetermined depth. Supplying different mean values to the Poisson-fit-enabled crawler makes it possible to generate a dynamic random depth.
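The idea of a breadth-first crawl whose depth limit is drawn from a Poisson distribution can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy link graph `TOY_WEB`, the function names, and the use of Knuth's multiplication method for sampling are all assumptions made here for illustration, and a real crawler would fetch and parse pages instead of reading an in-memory dictionary.

```python
import math
import random
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": [],
}


def poisson_pmf(k, mean):
    """Probability mass function: P(X = k) for a Poisson variable with this mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def sample_poisson_depth(mean, rng):
    """Draw a non-negative integer depth using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    depth = 0
    product = rng.random()
    while product > threshold:
        depth += 1
        product *= rng.random()
    return depth


def crawl_breadth_first(link_graph, seed, max_depth):
    """Visit pages level by level from the seed, expanding no page at max_depth."""
    visited = [seed]
    seen = {seed}
    frontier = deque([(seed, 0)])  # queue of (page, depth) pairs
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in link_graph.get(page, []):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                frontier.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    rng = random.Random(2016)  # fixed seed so the run is repeatable
    for mean in (1.0, 2.0, 3.0):
        depth = sample_poisson_depth(mean, rng)
        pages = crawl_breadth_first(TOY_WEB, "a", depth)
        print(f"mean={mean} sampled depth={depth} pages={pages}")
```

Changing the `mean` argument shifts the typical sampled depth, which is one way to read the abstract's remark that different mean values yield a dynamic random depth; a fixed-depth crawl corresponds to always passing the same `max_depth`.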

Keywords

Breadth First Search, Parsing, Multi-threading, Probability Mass Function, Frontier, Virtual web