An Improved Generic Crawler using Poisson Fit Distribution
Thangaraj M. and Sivagaminathan P. G. An Improved Generic Crawler using Poisson Fit Distribution. Communications on Applied Electronics 6(1):7-13, October 2016.
BibTeX
@article{10.5120/cae2016652375,
  author = {Thangaraj M. and Sivagaminathan P. G.},
  title = {An Improved Generic Crawler using Poisson Fit Distribution},
  journal = {Communications on Applied Electronics},
  issue_date = {October 2016},
  volume = {6},
  number = {1},
  month = {Oct},
  year = {2016},
  issn = {2394-4714},
  pages = {7-13},
  numpages = {7},
  url = {http://www.caeaccess.org/archives/volume6/number1/664-2016652375},
  doi = {10.5120/cae2016652375},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}
Abstract
The remarkable growth of the Internet has filled the World Wide Web with a huge volume of data, much of which remains unexplored by the users it is intended for, even though it is worth extracting and assimilating into knowledge. Retrieving useful information from web data requires a broad-spectrum crawler that collects relevant documents and metadata. A breadth-first crawler algorithm is presented to fetch the related web documents needed to create a web archive for alias extraction. This paper shows that the upgraded crawler, which generates a random crawl depth, performs better than crawling to a predetermined depth. Supplying different mean values to the distribution function that drives the crawler makes it possible to generate a dynamic random depth.
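The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation: a breadth-first traversal over a small in-memory "virtual web" whose depth limit is drawn from a Poisson distribution with a chosen mean. The names `poisson_sample`, `bfs_crawl`, and `VIRTUAL_WEB`, the example graph, and the mean value 2.0 are all hypothetical, introduced here only for illustration.

```python
import math
import random
from collections import deque


def poisson_sample(mean, rng):
    """Draw one Poisson(mean) variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    # Multiply uniform draws until the running product falls to the threshold.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def bfs_crawl(web, seed, max_depth):
    """Visit pages breadth first from `seed`, following links up to `max_depth` hops."""
    visited = {seed: 0}       # page -> depth at which it was first reached
    frontier = deque([seed])  # FIFO queue of pages still to expand
    while frontier:
        url = frontier.popleft()
        depth = visited[url]
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in web.get(url, []):
            if link not in visited:
                visited[link] = depth + 1
                frontier.append(link)
    return visited


# Hypothetical virtual web: each page maps to the pages it links to.
VIRTUAL_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": ["a"],
}

rng = random.Random(42)           # seeded so repeated runs are reproducible
depth = poisson_sample(2.0, rng)  # assumed mean of 2.0; a larger mean tends to give deeper crawls
pages = bfs_crawl(VIRTUAL_WEB, "a", depth)
print(depth, sorted(pages))
```

Changing the mean passed to `poisson_sample` shifts the typical crawl depth, while individual runs still vary, which is one way to read the "dynamic random depth" described above.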
Keywords
Breadth First Search, Parsing, Multi-threading, Probability Mass Function, Frontier, Virtual web