Introduction
In this article, you will examine how to use wget commands to retrieve or transfer data through a proxy server. Proxy servers are often described as the gateway between you and the world wide web, and they can make accessing data more secure. Feel free to learn more about proxies here, but let's get started!
Prerequisites & Installation
This article is for a wide range of developers, ✨including you juniors!✨ But to get the most out of the material, it is advised to:
✅ Be familiar with Linux/Unix commands and arguments.
✅ Have wget installed.
Check if wget is installed by opening the terminal and typing:
$ wget -V
If it is present, it will return the version. If not, follow the steps below to download wget on Mac or Windows.
Download wget on Mac
The recommended way to install wget on Mac is with a package manager such as Homebrew.
You can install wget with Homebrew by running:
$ brew install wget
You can check for a successful installation by running wget -V again to view the installed version.
Download wget on Windows
To install and configure wget for Windows:
- Download wget for Windows and install the package.
- Copy the wget.exe file into your C:\Windows\System32 folder.
- Open the command prompt (cmd.exe) and run wget to confirm that it was installed successfully.
Still having trouble? Here is an additional video that shows how to install wget on Windows 10.
What is wget?
wget is a GNU command-line utility tool primarily used to download content from the internet. It supports HTTP, HTTPS, and FTP protocols.
wget is designed to be robust over slow or unstable network connections. If a download stops before completion due to a network error, wget automatically continues the same download from where it left off and repeats this process until the whole file is successfully retrieved.
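For example, if a large download keeps getting interrupted, you can combine resuming with unlimited retries; the URL below is just a placeholder:
$ wget -c --tries=0 https://example.com/large-file.iso
Here -c resumes a partially downloaded file instead of starting over, and --tries=0 tells wget to keep retrying until the download finishes.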
The tool also works as a web crawler: it scrapes linked resources from HTML pages and downloads them in sequence, repeating the process until all content is downloaded or a user-specified recursion depth is reached. The retrieved data is saved in a directory structure mirroring the remote server, effectively creating a clone of the webpages over HTTP.
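As a quick illustration (the URL is a placeholder), a recursive crawl limited to two levels of links looks like this:
$ wget -r -l 2 https://example.com/
The -r flag turns on recursive retrieval, and -l caps the recursion depth so wget stops following links after two levels.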
wget is also versatile, which is another reason it became so popular: it works in scripts, terminals, and cron jobs. The tool is non-interactive and runs independently in the background, so it does not matter whether you are actively logged on while downloads occur.
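For instance, assuming a placeholder URL and log file name, you can send a download to the background and log its progress so it keeps running even after you log off:
$ wget -b -o download.log https://example.com/big-archive.zip
The -b flag backgrounds the process right after startup, and -o writes all output to download.log instead of the terminal.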
Speaking of downloads, wget even supports downloads through HTTP proxies. A proxy server is any machine that translates traffic between networks or protocols; it acts as an intermediary server separating end-user clients from the destinations they browse. Proxy servers are common in schools, workplaces, and other institutions, where users need authentication to access the internet and are, in some cases, restricted from accessing certain websites.
When you use a proxy server, traffic flows through it to the requested address. The response then typically returns through that same server, which forwards the data received from the requested webpage back to you.
Thanks to proxies, you can download content more securely from the world wide web. In this post, you will examine how to do exactly that by using wget behind a proxy server.
wget Commands
If you are not familiar with wget, the tool uses a simple, consistent syntax. It takes two kinds of arguments: [OPTION] and [URL].
wget [OPTION]... [URL]...
OPTION: decides what to do with the argument given. To view all available options, run:
wget -h
URL: the address of the file or directory you want to download or synchronize. You can pass multiple OPTIONs or URLs at once, as in the example below.
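For instance, you can combine several options and URLs in one call; the options and URLs below are just placeholders:
$ wget --tries=3 --timeout=10 https://example.com/1 https://example.com/2
This single command downloads both files, retrying each up to three times and giving up on any connection that stalls for more than ten seconds.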
Now that you've learned wget's syntax, it's time to learn some commands! 📣
Download a single file 📃
To download a regular file, run:
$ wget https://example.com/scraping-bee.txt
You can even set wget to retrieve the data only if the version on the server is newer than your local copy. Instead of running the previous command, you would first fetch the file with -S, which prints the server's response headers, including the timestamp of the copy on the server.
$ wget -S https://example.com/scraping-bee.txt
Next, to check whether the file has changed and download it if it has, you can run:
$ wget -N https://example.com/scraping-bee.txt
If you would like to download a page and save it under the title from its HTML, you can chain wget with a small script. One approach is to download to a temporary file, pull the <title> text out with gawk, and then rename the file:
$ wget -O tmp 'https://example.com/path' &&
  name=$(gawk -v RS='</title>' 'match($0, /<title[^>]*>/) { print substr($0, RSTART + RLENGTH); exit }' tmp) &&
  mv tmp "$name"
Escape or replace any / characters in the name, since they are not valid inside a filename.
Download a File to a Specific Directory 📁
Just replace PATH with the directory where you want the file saved.
$ wget -P <PATH> https://example.com/sitemap.xml
Rename a Downloaded File 📝
To rename a file, replace FILENAME with your desired name and run:
$ wget -O <FILENAME.html> https://example.com/file.html
Define Yourself as User-Agent 🧑💻
$ wget --user-agent=Chrome https://example.com/file.html
Limit Speed ⏩
Part of scraping etiquette is not crawling too fast, and thankfully wget can help with that through the --wait and --limit-rate options.
--wait=1          # Wait 1 second between retrievals.
--limit-rate=10K  # Limit the download speed to 10 kilobytes per second.
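Put together, a polite crawl of the example URL might look like this (the values are just a reasonable starting point, not a rule):
$ wget --wait=1 --limit-rate=10K https://example.com/path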
Extract as Google bot 🤖
$ wget --user-agent="Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://example.com/path
Convert Links on a Page 🖇️
Convert the links in the downloaded HTML so they still work in your local version, e.g. a link to example.com/path becomes a relative link that resolves locally (such as when serving the mirror at localhost:8000/path).
$ wget --convert-links https://example.com/path
Mirroring Single Webpages 📑
You can run this command to mirror a single web page and view it on your local device. Here -E adjusts file extensions (e.g. adding .html), -H allows spanning to other hosts for page requisites, -k converts links for local viewing, -K backs up the original files before link conversion, and -p downloads everything needed to display the page:
$ wget -E -H -k -K -p https://example.com/path
Extract Multiple URLs 🗂️
First create a urls.txt file and add all desired URLs to it.
https://example.com/1
https://example.com/2
https://example.com/3
Next, run the following command to extract all the URLs.
$ wget -i urls.txt
That covers the most commonly used wget commands, but feel free to check out more!
How to Configure a Proxy with wget
First, locate the wget initialization file at /usr/local/etc/wgetrc (global, for all users) or $HOME/.wgetrc (for a single user). You can also view the documentation here to see a sample .wgetrc initialization file.
- Inside the initialization file add the lines:
https_proxy = http://[Proxy_Server]:[port]
http_proxy = http://[Proxy_Server]:[port]
ftp_proxy = http://[Proxy_Server]:[port]
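For example, with a hypothetical proxy listening at 127.0.0.1:3128, the entries could look like this; you can also add use_proxy = on to make sure proxy support is switched on:
http_proxy = http://127.0.0.1:3128
https_proxy = http://127.0.0.1:3128
ftp_proxy = http://127.0.0.1:3128
use_proxy = on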
- Set the proxy environment variables.
wget recognizes the following environment variables to specify proxy location:
http_proxy/https_proxy: should contain the URLs of the proxies for HTTP and HTTPS connections respectively.
ftp_proxy: should contain the URL of the proxy for FTP connections. It is common that http_proxy and ftp_proxy are set to the same URL.
no_proxy: should contain a comma-separated list of domain extensions that the proxy should not be used for.
In addition to the environment variables, proxy location and settings may be specified from within wget itself, using the --no-proxy command-line option and the ‘proxy = on/off’ .wgetrc setting. Note that these may suppress the use of a proxy even if the correct environment variables are in place.
In the shell you can set the variables by running:
$ export http_proxy=http://[Proxy_Server]:[port]
$ export https_proxy=$http_proxy
$ export ftp_proxy=$http_proxy
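If some hosts should bypass the proxy, you can also set no_proxy; the domains below are only placeholders:
$ export no_proxy="localhost,127.0.0.1,.example.com"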
- Lastly, add the following line(s) in either your ~/.bash_profile or /etc/profile:
export http_proxy=http://[Proxy_Server]:[port]
export https_proxy=http://[Proxy_Server]:[port]
export ftp_proxy=http://[Proxy_Server]:[port]
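After saving the file, reload it and make a quick test request (the URL is a placeholder); the -S flag prints the response headers, which can help confirm the request went through the proxy:
$ source ~/.bash_profile
$ wget -S https://example.com/file.html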
Some proxy servers require authorization before they can be used, usually a username and password, which wget can send for you. As with HTTP authorization, several authentication schemes exist, but only the Basic authentication scheme is currently implemented.
You may enter your username and password through the proxy URL or via command-line options. If, as is not uncommon, a company's proxy is located at proxy.company.com on port 8001, a proxy URL containing authorization data might look like this:
http://hniksic:mypassword@proxy.company.com:8001/
Alternatively, you can use the --proxy-user and --proxy-password options, or the equivalent .wgetrc settings proxy_user and proxy_password, to set the proxy username and password.
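For example, reusing the hypothetical credentials and proxy from above, the command-line version looks like this:
$ wget --proxy-user=hniksic --proxy-password=mypassword https://example.com/file.html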
- You did it! Now wget your data using your proxy. 🎉
Conclusion
Now that you are a wget proxy pro, you have free rein to extract almost whatever you want from websites. wget is a free and user-friendly tool that does not look like it is going anywhere anytime soon, so go ahead and get familiar with it. Hopefully, this article helped you start your journey and wget all the data you need!
As always, there are alternatives to wget, such as aria2 and cURL, but each comes with its own benefits. cURL also supports proxy use, and you can see how to do that in the article How to set up a proxy with cURL?
If you have enjoyed this article on setting up a proxy with wget, give ScrapingBee a try and get the first 1000 requests free. Check out the getting started guide here! 🐝
Scraping the web is challenging, given that anti-scraping mechanisms are growing by the day, so getting it done right can be quite a tedious task. ScrapingBee allows you to skip the noise and focus only on what matters the most: data.