I personally use it to download documentation for offline access, but it can be used with most websites out there.
```
cargo install --git https://github.com/EL-MASTOR/deep
```
Or build from source:
```
git clone https://github.com/EL-MASTOR/deep && cd deep && cargo build --release
```
The order is important!
```
deep URL DIR BASE [FREQ] [-i IGNORED]
```
Or run `deep -a [FREQ]` to retry failed URLs^3.
You can try a quick example with:
```
deep https://doc.rust-lang.org/nightly/clippy/ clippy 2
```
You can see the downloaded files with `eza --icons --tree` or any tree-listing program you have.
You can view a local version of the website by either:
- Going to `file:///path/to/clippy/nightly/clippy/index.html` in a Chromium-based browser.
- Or running `cd clippy && live-server .`, which serves the files on port 8080; then you can go to `http://localhost:8080/nightly/clippy/`.
The program uses an asynchronous mpsc channel that receives URLs and does some work on each one. This works like a queue.
The program first sends the URL that you provided to the channel. The receiver takes that URL and downloads its webpage to a path^2, and then all the new `<a>` tag links in that page that meet certain criteria^1 get sent to the same channel.
This is repeated for each new link until no new links are found.
That's why it is called deep: it dives deep into the website's link tree to find new pages to download.
Once all the web pages have been downloaded, it proceeds to download the js, css, and image links found in those pages.
You can view it in the browser as explained in the Quick try section above.
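To make the queue mechanics concrete, here is a minimal sketch of that loop, assuming a Tokio mpsc channel as the work queue. It is an illustration rather than the actual implementation: `fetch_page` and `extract_links` are hypothetical stand-ins, the real program downloads pages concurrently, and the link criteria described in the notes below are reduced to a simple set lookup.

```rust
use std::collections::HashSet;
use tokio::sync::mpsc;

// Hypothetical stand-in: pretend to download a page and return its HTML.
async fn fetch_page(url: &str) -> String {
    format!("<html><!-- contents of {url} --></html>")
}

// Hypothetical stand-in: pretend to pull <a href="..."> links out of a page.
fn extract_links(_html: &str) -> Vec<String> {
    Vec::new()
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<String>();
    let mut visited: HashSet<String> = HashSet::new();

    // Seed the queue with the URL given on the command line.
    let start = "https://example.com/a/b/".to_string();
    visited.insert(start.clone());
    tx.send(start).unwrap();
    let mut pending = 1usize;

    // Receive URLs, download them, and feed any new links back into the
    // same channel until no work is left.
    while pending > 0 {
        let url = rx.recv().await.expect("sender is still alive");
        pending -= 1;

        let html = fetch_page(&url).await;
        for link in extract_links(&html) {
            // Only links that were never seen before get queued again
            // (the base and ignore checks from the notes below are omitted).
            if visited.insert(link.clone()) {
                tx.send(link).unwrap();
                pending += 1;
            }
        }
        // An optional tokio::time::sleep here would correspond to [FREQ].
    }
    println!("done: {} page(s) visited", visited.len());
}
```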
- The program is asynchronous and concurrent for most of its work.
- ^1. The criteria for picking URL links (a sketch of these checks is shown after this list):
  - The URL link should start with a base. All the URL links found in the pages that don't start with this base are ignored.
    The base is determined by taking the `BASE` argument you provided (which is a number) and using that many path components of the `URL` as the base.
    So if the `URL` you provided is `https://example.com/a/b/c/d` (the pathname here is `/a/b/c/d`) and you specified `BASE` to be 2, the base URL will be `https://example.com/a/b`. And if `BASE` is 0, the base URL is `https://example.com`.
    The JS, CSS and image links are an exception: they are only checked against the host; the pathname isn't included. So they are accepted if they start with `https://example.com`, as in `https://example.com/script.js`, even if `BASE` is 2.
    This also means that external scripts, styles and images that are not related to the website aren't included.
    You can't pick `BASE` to be more than the number of components in the pathname of the `URL` you provided. In the previous example, `/a/b/c/d` has only 4 components, so you can't pick `BASE` to be 5.
  - The URL must be new. After a URL passes the base check, it is checked for newness to avoid repeating the work.
    The program stores each new URL in a concurrent hashset with O(1) search time, so whenever a new URL is found, it checks whether it's already in the hashset (processed) or not (not yet processed). If it's not already present in the hashset, it gets sent to the channel to be processed and downloaded.
    Only the scheme, host and pathname of the URL are compared, meaning if a URL is `https://example.com/a/b/x/y?query=string#hash`, the `?query=string#hash` part is removed and only `https://example.com/a/b/x/y` remains. This makes the program more efficient, since it doesn't re-download URLs that point to the same page but look different.
- The optional argument `[FREQ]` (frequency) is the amount of time in milliseconds to wait between requests.
  So, if you set `[FREQ]` to 10, it will only allow sending 100 requests per second.
  By default, the program sends as many requests as possible, depending on your connection speed.
  This throttling is helpful when using `deep` on websites that block you if you send requests too fast.
- An optional argument `[-i IGNORED]` can be specified to ignore certain links.
  Here's an example to illustrate:
  ```
  deep https://example.com/a/b dest 1 -i c/ d/e/
  ```
  The ignored URLs are formed as base_url + ignored_argument.
  Here, the base URL is `https://example.com/a/`.
  From this, the ignored URLs will be `https://example.com/a/c/` and `https://example.com/a/d/e/`.
  Any link that starts with any of these ignored URLs will be ignored and won't be downloaded (the sketch after this list also shows this prefix check).
  There are many reasons why you might want to ignore some links.
  > [!IMPORTANT]
  > The slash at the end ends the pathname. Whether or not it's present might affect the number of pages downloaded. Take a look at this example to better understand:
  > ```
  > deep https://example.com/x/y dest 1 -i a/
  > ```
  > The ignored link will be `https://example.com/x/a/`.
  > ```
  > deep https://example.com/x/y dest 1 -i a
  > ```
  > The ignored link will be `https://example.com/x/a`.
  > Did you notice the difference? The 2nd example, without an ending `/`, ignores `https://example.com/x/api` and `https://example.com/x/administration/x`, whereas the 1st one doesn't.
  > It is good practice to always end your ignored pathnames with `/` to avoid ignoring links that you didn't intend to ignore. Unless, of course, you want to ignore them too.
  You might specify as many ignored pathnames as you like.
- ^2. The path and filename of the downloaded file are determined from the pathname of the URL. Very straightforward. The page at `https://example.com/a/b/z.html` is downloaded to `DIR/a/b/z.html`. And the page at `https://example.com/a/b/y` is downloaded to `DIR/a/b/y/index.html` if it's an HTML file whose URL doesn't end with ".html"; otherwise it is just downloaded as is. `DIR` here is the directory name you provided. All missing directories are created. (A sketch of this mapping is shown after this list.)
- ^3. Did some pages fail to download? Worry not! You don't have to restart. All you have to do is cd into the `DIR` where you downloaded the pages, then run `deep -a`. This will retry downloading the failed URLs.
  This uses the information stored in `_deep-logs/failsafe.log` and `_deep-logs/visited.log`.
  These files are updated after `deep -a` finishes, so if other URLs failed to download during that run, you can just repeat the command.
  > [!IMPORTANT]
  > `_deep-logs/visited.log` contains both failed and ignored URLs, if any.
  If you wish to get the URLs of the pages downloaded on your computer, you can use either of these methods:
  - From file paths: File names and paths are determined from their URLs.
    The root of `DIR` is the root of the website; you just need to prefix it with the domain name of the website.
    It is important to note that URLs that don't end with ".html" get downloaded to "index.html".
    - "DIR/a/b/c.html" -> `https://example.com/a/b/c.html`
    - "DIR/x/y/z/index.html" -> `https://example.com/x/y/z`
  - From log files: You need to remove the failed and ignored URLs from `_deep-logs/visited.log`. You can get the needed information from `_deep-logs/failsafe.log`. The latter is sectioned by `----`.
    The 1st line is the base URL. From the 2nd line, which starts with `----`, until the next one, there are the ignored URLs, if any. You need to ignore any line in `_deep-logs/visited.log` that starts with any of these.
    The other sections are failed URLs.
- Be careful how you specify `URL`! A trailing `/` can make a huge difference depending on whether it's present or not.
  Make sure it is present in URLs that don't end with a filename, and absent in the ones that do.
  Examples of URLs that end with a filename: `https://example.com/a/b.html`, `https://example.com/a/c.json`.
  Examples of URLs that don't end with a filename: `https://example.com/a/x/`, `https://example.com/a/images/`.
  If you're not sure whether to add a trailing `/` or not, just load the URL in the browser and copy the link from the browser's address bar. The browser will figure out for you whether the `/` should be there or not.
- Be careful picking the `BASE` value. The lower `BASE` is, the more pages are downloaded. So choose it to be as high as it can be while still covering what you need.
  Let's say you want to download the Python documentation at `https://courseswebsite.com/python/default.asp`.
  Here `BASE` should be 1, to download all the sub-URLs of `https://courseswebsite.com/python`.
  If you lower `BASE` by 1, you will download all the sub-URLs of `https://courseswebsite.com/`, which includes other courses and other things, when we only want the Python course.
  Do not set it to 2 either, since "default.asp" is just a single page that has no sub-URLs, and `https://courseswebsite.com/python` is the root of the Python course web pages.
- JavaScript does not get executed. Therefore, content that loads dynamically won't be downloaded.
  Keep that in mind: if something looks wrong in the downloaded pages, check whether the content you want is loaded statically with `curl -o test.html URL`, then open "test.html" to see whether that content is there or not.
- If `[FREQ]` is not specified, the program doesn't put any restrictions on sending requests.
  The number of requests you send is only limited by your connection speed.
  So, you might get IP-banned if a server notices that you are sending too many requests.
  Though, it is not very common. I have run into this situation only once, with one website. I then set `[FREQ]` to 10, and it worked fine with that website.
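To make the notes above concrete, here is a rough sketch, in plain Rust, of the link-selection rules: deriving the base from `URL` and `BASE`, stripping the query and fragment, skipping ignored prefixes, and keeping only never-seen URLs. The function names (`base_url`, `normalize`, `should_download`) are made up for this illustration and are not the program's actual API; the host-only exception for JS, CSS and image links is omitted.

```rust
use std::collections::HashSet;

/// Keep the scheme + host plus the first `base` path components of `url`,
/// e.g. base_url("https://example.com/a/b/c/d", 2) == "https://example.com/a/b".
fn base_url(url: &str, base: usize) -> String {
    let scheme_end = url.find("://").map(|i| i + 3).unwrap_or(0);
    let (host_end, path) = match url[scheme_end..].find('/') {
        Some(i) => (scheme_end + i, &url[scheme_end + i..]),
        None => return url.to_string(), // no pathname at all
    };
    let kept: String = path
        .split('/')
        .filter(|s| !s.is_empty())
        .take(base)
        .map(|s| format!("/{s}"))
        .collect();
    format!("{}{}", &url[..host_end], kept)
}

/// Drop "?query" and "#fragment" so equivalent links are only counted once.
fn normalize(url: &str) -> &str {
    let end = url.find(|c: char| c == '?' || c == '#').unwrap_or(url.len());
    &url[..end]
}

/// Decide whether a freshly found link should be queued for download.
/// The real program uses a concurrent set; a plain HashSet keeps the sketch simple.
fn should_download(
    link: &str,
    base: &str,
    ignored: &[String],
    visited: &mut HashSet<String>,
) -> bool {
    let link = normalize(link);
    link.starts_with(base)
        && !ignored.iter().any(|prefix| link.starts_with(prefix.as_str()))
        && visited.insert(link.to_string()) // false if already processed
}

fn main() {
    // deep https://example.com/a/b dest 1 -i c/ d/e/
    let base = base_url("https://example.com/a/b", 1); // "https://example.com/a"
    let ignored = vec![format!("{base}/c/"), format!("{base}/d/e/")];
    let mut visited = HashSet::new();

    for link in [
        "https://example.com/a/b/page?q=1#top", // kept (query/fragment stripped)
        "https://example.com/a/b/page",         // duplicate of the one above
        "https://example.com/a/c/other",        // matches an ignored prefix
        "https://other.site/",                  // doesn't start with the base
    ] {
        println!("{link} -> {}", should_download(link, &base, &ignored, &mut visited));
    }
}
```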
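And here is a similar sketch of the path mapping from footnote ^2, in both directions (URL to file path, and file path back to URL). Again, these helpers are illustrative only, and how the program decides that a response is an HTML page is not shown here.

```rust
use std::path::PathBuf;

/// Map a URL's pathname to a file path under `dir` (footnote ^2).
/// `is_html` stands in for "the response is an HTML page".
fn url_to_path(dir: &str, url: &str, is_html: bool) -> PathBuf {
    // Keep only the pathname, e.g. "/a/b/z.html" out of "https://example.com/a/b/z.html".
    let after_scheme = url.find("://").map(|i| i + 3).unwrap_or(0);
    let pathname = match url[after_scheme..].find('/') {
        Some(i) => &url[after_scheme + i..],
        None => "/",
    };
    let mut path = PathBuf::from(dir);
    path.push(pathname.trim_matches('/'));
    // HTML pages whose URL doesn't end with ".html" become <path>/index.html.
    if is_html && !pathname.ends_with(".html") {
        path.push("index.html");
    }
    path
}

/// Reverse direction: rebuild the URL from a path relative to DIR.
fn path_to_url(host: &str, relative: &str) -> String {
    let trimmed = relative
        .strip_suffix("/index.html")
        .unwrap_or_else(|| relative.trim_end_matches('/'));
    format!("{host}/{trimmed}")
}

fn main() {
    println!("{}", url_to_path("DIR", "https://example.com/a/b/z.html", true).display()); // DIR/a/b/z.html
    println!("{}", url_to_path("DIR", "https://example.com/a/b/y", true).display());      // DIR/a/b/y/index.html
    println!("{}", path_to_url("https://example.com", "a/b/c.html"));       // https://example.com/a/b/c.html
    println!("{}", path_to_url("https://example.com", "x/y/z/index.html")); // https://example.com/x/y/z
}
```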
```
deep https://doc.rust-lang.org/std/ dist-std 1
```
```
deep https://docs.rs/scraper/latest/scraper/ dist-scraper2 3
```
```
deep https://rust-lang-nursery.github.io/rust-cookbook/web/clients.html dist-rust-cookbook 2
```
```
deep https://developer.mozilla.org/en-US/docs/Web dist-mozilla 2
```
```
deep https://www.w3schools.com/js/default.asp dist-w3-js 1 10 # I had to send a request every 10 ms, because the website blocks IPs that flood it with requests. This is the only website I have encountered that does that. I didn't test it with a value lower than 10 though, so it might still work with lower values.
```
```
deep https://shopify.dev/docs dist-shopify 1 -i api/admin-graphql api/storefront # I had to ignore certain huge sections because the website is huge
```