An Improved Generic Crawler using Poisson Fit Distribution

Thangaraj M.; Sivagaminathan P. G.

Research Article

Communications on Applied Electronics

Foundation of Computer Science (FCS), NY, USA

Volume 6 - Number 1

Year of Publication: 2016

Authors: Thangaraj M., Sivagaminathan P. G.

DOI: 10.5120/cae2016652375

Thangaraj M., Sivagaminathan P. G. An Improved Generic Crawler using Poisson Fit Distribution. Communications on Applied Electronics. 6, 1 (Oct 2016), 7-13. DOI=10.5120/cae2016652375

@article{ 10.5120/cae2016652375,
author = { Thangaraj M., Sivagaminathan P. G. },
title = { An Improved Generic Crawler using Poisson Fit Distribution },
journal = { Communications on Applied Electronics },
issue_date = { Oct 2016 },
volume = { 6 },
number = { 1 },
month = { Oct },
year = { 2016 },
issn = { 2394-4714 },
pages = { 7-13 },
numpages = {9},
url = { https://www.caeaccess.org/archives/volume6/number1/664-2016652375/ },
doi = { 10.5120/cae2016652375 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2023-09-04T19:56:34.923632+05:30
%A Thangaraj M.
%A Sivagaminathan P. G.
%T An Improved Generic Crawler using Poisson Fit Distribution
%J Communications on Applied Electronics
%@ 2394-4714
%V 6
%N 1
%P 7-13
%D 2016
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The rapid growth of the Internet has filled the World Wide Web with a huge volume of data, much of which remains unexplored by the users who could extract it and assimilate it into knowledge. Retrieving useful information from web data requires a general-purpose crawler that collects relevant documents and metadata. This paper presents a breadth-first crawler algorithm that fetches the related web documents needed to build a web archive for alias extraction. It demonstrates that the upgraded crawler works better with a randomly generated crawl depth than with a predetermined depth. Supplying different mean values to the distribution function that drives the crawler makes it possible to generate the random depth dynamically.
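
The idea in the abstract can be illustrated with a minimal sketch. It assumes, based on the title and keywords rather than on the paper's own listing, that the crawl depth is drawn from a Poisson probability mass function with a chosen mean and then used as the depth limit of a breadth-first traversal. The names `poisson_pmf`, `sample_depth`, `bfs_crawl`, `depth_cap` and the in-memory `VIRTUAL_WEB` link graph are hypothetical, introduced here for illustration only; this is not the authors' implementation.

```python
import math
import random
from collections import deque


def poisson_pmf(k, mean):
    """Poisson probability mass function: P(X = k) for the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def sample_depth(mean, rng, depth_cap=50):
    """Draw a crawl depth from Poisson(mean) by inverse-transform sampling.

    depth_cap is a hypothetical safety bound so the loop always terminates.
    """
    u = rng.random()
    depth = 0
    cumulative = poisson_pmf(0, mean)
    while u > cumulative and depth < depth_cap:
        depth += 1
        cumulative += poisson_pmf(depth, mean)
    return depth


def bfs_crawl(links, seed, max_depth):
    """Breadth-first traversal from seed, following links at most max_depth hops."""
    visited = {seed}
    order = [seed]
    frontier = deque([(seed, 0)])  # queue of (page, depth) pairs
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not expand this page
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                order.append(target)
                frontier.append((target, depth + 1))
    return order


# Hypothetical in-memory "virtual web": page -> outgoing links.
VIRTUAL_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
}

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so runs are repeatable
    for mean in (1.0, 2.0, 4.0):
        depth = sample_depth(mean, rng)
        pages = bfs_crawl(VIRTUAL_WEB, "a", depth)
        print(f"mean={mean} depth={depth} pages={pages}")
```

A larger mean shifts the sampled depths upward on average, so changing the mean is one plausible way to vary how far the breadth-first frontier expands from the seed page.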

Index Terms

Computer Science

Information Sciences

Keywords

Breadth First Search, Parsing, Multi-threading, Probability Mass Function, Frontier, Virtual web