urlFilter terminates recursive scraping #460

@jkanel

Description

See this discussion thread for more detail.

ISSUE: When the rootUrl does not match the urlFilter criteria, recursive scraping terminates at the root page.

RECOMMENDED SOLUTION: Do not apply the urlFilter to the rootUrl; or add an option to exempt the rootUrl from filtering.

DESIRED: I'm specifying a rootUrl and would like the scraper to recurse through all hyperlinks. The rootUrl will not be downloaded in this scenario. When the scraper finds a hyperlink ending in .abc it should download the file.

ACTUAL: The rootUrl (see code below) does not meet the urlFilter criteria, so the scraper stops at the root page with no recursion. The scraper should find a hyperlink to http://trillian.mit.edu/~jc/music/book/SCD/Book45.abc in the rootUrl, among other .abc URLs, but it does not. Note that when I set the rootUrl to an .abc URL directly, e.g. the example above, the file downloads as expected.

const scrape = require('website-scraper');

const rootUrl = "http://trillian.mit.edu/~jc/music/book/SCD";

scrape({
  urls: [rootUrl],
  recursive: true,
  maxRecursiveDepth: 5,
  urlFilter: function(url) {
    // Download only URLs ending in .abc
    return /\.abc$/.test(url);
  },
  directory: savePath // savePath defined elsewhere
});
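Until the recommended solution lands, one possible workaround (a sketch, not library behavior) is a urlFilter that always admits the rootUrl so recursion can start, and applies the .abc check everywhere else. Note this only helps when the .abc links are reachable directly from the root page; intermediate non-matching pages would still be filtered out.

```javascript
const rootUrl = "http://trillian.mit.edu/~jc/music/book/SCD";

// Hypothetical workaround filter: never reject the entry point itself,
// otherwise keep only URLs ending in .abc.
function urlFilter(url) {
  if (url === rootUrl) return true; // admit the root so recursion begins
  return /\.abc$/.test(url);       // elsewhere, download only .abc files
}
```

This function would be passed as the `urlFilter` option in the `scrape` call above.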
