Description
See this discussion thread for more detail.
ISSUE: When the rootUrl does not match the urlFilter criteria, recursive scraping terminates at the root page.
RECOMMENDED SOLUTION: Do not apply the urlFilter to the rootUrl, or add an option to exempt the rootUrl from filtering.
DESIRED: I specify a rootUrl and expect the scraper to recurse through all hyperlinks on it. The root page itself does not need to be saved in this scenario. Whenever the scraper finds a hyperlink ending in .abc, it should download that file.
ACTUAL: The rootUrl (see code below) does not match the urlFilter, so the scraper stops immediately with no recursion. The root page contains a hyperlink to http://trillian.mit.edu/~jc/music/book/SCD/Book45.abc, among other .abc urls, but the scraper never reaches it. Note that when I set the rootUrl to an .abc url directly, e.g. the one above, the file downloads as expected.
```javascript
const rootUrl = "http://trillian.mit.edu/~jc/music/book/SCD";

scrape({
  urls: [rootUrl],             // website-scraper expects an array of urls
  recursive: true,
  maxRecursiveDepth: 5,
  urlFilter: function (url) {
    return /\.abc$/.test(url); // only download .abc files
  },
  directory: savePath
});
```