Background Reading
- Heydon, A.; Najork, M. Mercator: A Scalable, Extensible Web Crawler (wayback: http://web.archive.org/web/*/http://research.compaq.com/SRC/mercator/papers/www/paper.html), 1999
 - Heydon, A.; Najork, M. High-performance web crawling, 2001
 - Kimpton, Stata, Mohr. Internet Archive Crawler Requirements Analysis for library consortium, 2003
 - Lee, H.; Leonard, D.; Wang, X.; Loguinov, D. IRLbot: Scaling to 6 Billion Pages and Beyond (new from WWW2008)
 
- Najork, M.; Wiener, J. Breadth-First Search Crawling Yields High-Quality Pages, 2001
 - Cho, J.; Garcia-Molina, H.; Page, L. Efficient Crawling Through URL Ordering, 1998
 - Abiteboul, S.; Preda, M.; Cobena, G. Computing web page importance without storing the graph of the web (extended abstract), 2001
 - Olston, C.; Pandey, S. Recrawl Scheduling Based on Information Longevity (new from WWW2008)
 
- Heydon, A.; Najork, M. Performance Limitations of the Java Core Libraries (may not reflect latest Java issues; Heritrix uses a high-performance DNS package)
 
Find these (also may be outdated with respect to current Java and our implementation choices) at the archive-crawler Yahoo Group files page:
- Reddy, G. B. Study of synch vs. asynch IO in Java
 - Reddy, G. B. Study of multi-threaded DNS performance in Java
 
- Archive-crawler group files
 - Cho, J.; Garcia-Molina, H. The Evolution of the Web and Implications for an Incremental Crawler, Conf. on Very Large Data Bases, 2000
 - Focused Crawling: The Quest for Topic-specific Portals
 - Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, 1999, WWW8
 - Intelligent Crawling on the World Wide Web with Arbitrary Predicates, 2001, WWW10
 - Web Crawling High-Quality Metadata using RDF and Dublin Core, 2002, WWW11
 - Stanford WebBase Project
 - An Introduction to Heritrix - Mohr et al, 4th International Web Archiving Workshop 2004
 
- RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
 - Clarifying the fundamentals of HTTP, by Jeffrey Mogul, an author of RFC 2616.
 
 - RFC 3986: Uniform Resource Identifier (URI): Generic Syntax.
 - HTML 4.01 specification (from W3C).
 - Although robots.txt is important for crawling, it's never been officially ratified as an RFC. The de facto minimal spec lives at robotstxt.org. Search engines have made a number of ad hoc extensions; Google recently shared some info about how GoogleBot implements the Robots Exclusion Protocol.
 - RFC 1034: Domain Names - Concepts and Facilities
 - RFC 1035: Domain Names - Implementation and Specification
 
crawler-requirements-2003-03.htm (text/html)
Mohr-et-al-2004.pdf (application/pdf)
1998-Cho-efficient.pdf (application/pdf)
1999-Heydon-javalimits.pdf (application/pdf)
1999-Hirai-webbase.pdf (application/pdf)
1999-Mercator.pdf (application/pdf)
2000-Broder-webgraph.pdf (application/pdf)
2000-Cho-incremental.pdf (application/pdf)
2001-Abiteboul-crawlorder.pdf (application/pdf)
2001-Arasu-search.pdf (application/pdf)
2001-Najork-breadthfirst.pdf (application/pdf)
2001-Najork-highperf.pdf (application/pdf)
2002-Guillaume-webgraph.pdf (application/pdf)
2008-IRLBot.pdf (application/pdf)
2008-Olston-recrawl.pdf (application/pdf)
2002-Shkapenyuk-polybot.pdf (application/pdf)
