Web scraping is a technique for retrieving large amounts of data and storing it on your system. Some websites discourage web scraping. Such websites can still be scraped, but it should be done responsibly so that it does not have a detrimental effect on the website being scraped. Web crawlers can retrieve data quickly and in depth, so it is important to ‘take care’ of the target website.
Most websites do not use anti-scraping techniques because these could hamper a normal user's access to the website. However, some sites still block scraping because they do not want their data to be openly accessed.
This article explains how websites can tell that a spider, rather than a human, is interacting with them, and describes ways to overcome those barriers.
How can websites detect web scraping?
There are mechanisms that allow websites to detect search engine bots in action. Some mechanisms are:
- Unusual traffic or a high download rate from a single client or IP address in a limited time.
- Detecting repetitive tasks on a website not done by a human. A human won’t perform the same tasks all the time.
- Honeypot traps set up for detection, which are usually invisible to a normal user.
List of 9 Proven Ways to Bypass Anti Scraping Techniques In 2019
To overcome detection, and bypass anti-scraping techniques, follow these practices:
1) Adhere to robots.txt files
Owners of a website have the right to decide whether their website may be crawled or scraped. Some websites disallow scraping by bots entirely, while others allow only certain crawlers. Web spiders should stick to the robots.txt file of a website while scraping. This file has a set of rules that you should respect, such as how frequently you can scrape and which pages you can scrape.
The robots.txt file can be found at the root of the website, at the path /robots.txt.
If it contains the lines “User-agent: *” and “Disallow: /”, it means the site does not want to be scraped.
As most websites want to be listed on Google, the largest scraper of websites, owners do allow crawlers to access the websites.
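As a minimal sketch of such a check, Python's standard urllib.robotparser module can read robots.txt rules and tell you whether a given URL may be fetched. The rules, the bot name MyScraperBot and the example.com URLs here are made-up illustrations, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether our (hypothetical) bot may fetch each URL.
allowed_public = parser.can_fetch("MyScraperBot", "https://example.com/products")
allowed_private = parser.can_fetch("MyScraperBot", "https://example.com/private/data")

# Crawl-delay, if the site declares one, is the requested pause in seconds.
delay = parser.crawl_delay("MyScraperBot")
```

Against a live site you would call set_url() with the site's /robots.txt address followed by read() instead of parsing an inline string.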
2) Slow crawling helps
If you are using bots, they fetch and scrape data very fast, for example placing a large number of requests within 30 seconds, which is unusual for a human. So it is easy for a website to detect that a scraper is at work. Fast scraping also means the website receives too many requests, which can make it unresponsive.
To make your spider look real, try to mimic human behavior.
For example, add random programmatic sleep calls between requests, or add a delay after crawling a certain number of pages. Basically, treat the website you are scraping nicely and do not put much load on it.
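The sketch below shows one way to do this in Python: a random pause between requests plus a longer pause after every batch of pages. The function names, the 2 to 6 second range and the batch size are assumptions chosen for illustration; fetch stands for whatever function you use to download a page.

```python
import random
import time

def polite_delay(min_seconds=2.0, max_seconds=6.0, rng=random, sleep=time.sleep):
    """Sleep for a random duration between min_seconds and max_seconds."""
    pause = rng.uniform(min_seconds, max_seconds)
    sleep(pause)
    return pause

def crawl(urls, fetch, batch_size=10, batch_pause=30.0, sleep=time.sleep):
    """Call fetch(url) for each URL in the list, pausing randomly between
    requests and pausing longer after every batch_size pages."""
    results = []
    for index, url in enumerate(urls, start=1):
        results.append(fetch(url))
        if index == len(urls):
            break  # no need to wait after the last page
        if index % batch_size == 0:
            sleep(batch_pause)         # longer rest after a batch
        else:
            polite_delay(sleep=sleep)  # short random rest between pages
    return results
```

Passing sleep as a parameter keeps the sketch easy to try out without actually waiting.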
There are auto-throttle mechanisms that automatically reduce crawling speed if you are loading the website too much. The website environment also changes over time, so adjust the bot to an optimum crawling speed after a few trial runs.
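To illustrate the idea of throttling, here is a hand-rolled sketch that lengthens the delay when the server responds slowly and shortens it when the server responds quickly. It is a simplified assumption of how such a mechanism could work, not the algorithm of any particular framework; the function name and the limits are hypothetical.

```python
def next_delay(current_delay, response_seconds, min_delay=1.0, max_delay=60.0):
    """Move the delay halfway toward the last observed response time,
    keeping it within [min_delay, max_delay]."""
    target = max(response_seconds, min_delay)
    new_delay = (current_delay + target) / 2.0
    return min(max(new_delay, min_delay), max_delay)
```

After each request, feed the measured response time into next_delay and sleep for the returned number of seconds before the next request.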
3) Change crawling pattern
Humans tend to vary what they do and do not repeat the same task over and over. They show random actions while browsing. Bots, on the other hand, follow the same crawling pattern and do not change it unless programmed to do so.
Websites have anti-crawling mechanisms that can easily detect whether a bot or a human is behind the requests. Therefore, incorporate some random clicks or mouse movements in the program to make the spider look like a human. Changing the crawling pattern is an efficient way to get past anti-scraping mechanisms.
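A small sketch of one part of this idea: visiting pages in a shuffled order with a different pause before each one. It covers only order and timing; random clicks or mouse movements would need a browser automation tool. The function name and pause range are illustrative assumptions.

```python
import random

def randomized_plan(urls, rng=None, min_pause=1.0, max_pause=8.0):
    """Return (url, pause_seconds) pairs in shuffled order with varied pauses."""
    rng = rng or random.Random()
    shuffled = list(urls)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [(url, rng.uniform(min_pause, max_pause)) for url in shuffled]
```

The spider would then walk through the plan, sleeping for each pause before requesting the paired URL.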
4) Rotate IPs and Proxies
Sending many requests from the same IP address can get that IP address blocked. Your IP address is visible to the website while you scrape, so it can easily see what you are doing. To prevent this, use multiple IP addresses. Requests routed through proxy servers are harder to detect. Pick a random IP address from a pool of IPs for each request.
There are many ways to change your outgoing IP. VPNs, shared proxies and Tor are common options. There are also commercial providers that offer automatic IP rotation. This technique also distributes the load across various exit points.
As websites know this technique well, some of them have blocked massively used IP ranges such as those of AWS.
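Picking a random proxy per request could look like the following sketch, which uses only Python's standard urllib. The proxy addresses are hypothetical placeholders from a documentation-only IP range; you would substitute the proxies you actually have access to.

```python
import random
import urllib.request

# Hypothetical proxy addresses (203.0.113.0/24 is reserved for documentation).
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def pick_proxy(pool, rng=random):
    """Choose one proxy at random from the pool."""
    return rng.choice(pool)

def opener_for(proxy_url):
    """Build an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def fetch_via_random_proxy(url, pool=PROXY_POOL, timeout=10):
    """Fetch url through a randomly chosen proxy and return the response body."""
    proxy = pick_proxy(pool)
    with opener_for(proxy).open(url, timeout=timeout) as response:
        return response.read()
```

Because a new opener is built for every call, consecutive requests are likely to leave through different exit points.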
5) User Agent rotation
A user agent is a string, sent in the User-Agent request header, that tells a server which web browser is being used. If you have not set a user agent, some websites won’t allow you to view their content. To find out your user agent, you can simply type “what is my user agent” in the Google search bar.
You can also check your user agent string at WhatsMyUserAgent.
Every request that comes from a browser contains a user agent header, so a missing or unusual one can lead to the detection of a bot. To escape detection, fake the user agent so that it looks like a real browser's.
To spoof a user agent:
Create a list of user agents and, for each request, pick a random user agent so that you don’t get blocked. Set your user agent to a common web browser instead of using the default user agent (such as wget/version or urllib/version). You could even pretend to be the Google Bot: Googlebot/2.1 if you want to have some fun!
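Here is a minimal sketch of user agent rotation with Python's standard urllib. The user agent strings are examples of common browser formats; treat the exact version numbers as placeholders and keep your own list up to date.

```python
import random
import urllib.request

# Example browser user agent strings; the versions are placeholders.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def build_request(url, rng=random):
    """Create a request whose User-Agent header is picked at random."""
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)
```

Each call to build_request returns a request object that can be passed to urllib.request.urlopen in place of a plain URL.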
6) Watch out for websites changing layouts
Some websites have a dynamic layout and keep changing it, making things tricky for scrapers. For instance, the first 20 pages may have one format and the rest may use a different layout.
To scrape data from such websites, use XPaths or CSS selectors for data mining. If you are not using these, check the difference in layout and add a condition in your code that scrapes those pages differently.
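One way to handle two layouts is to try a list of selectors in order and use the first one that matches. The sketch below does this with the limited XPath support in Python's standard xml.etree.ElementTree, which only accepts well-formed markup; for real-world HTML you would likely use a dedicated HTML parser. The selectors and sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical selectors for two page layouts, tried in order.
TITLE_PATHS = [
    ".//h1[@class='product-title']",  # layout used on the first pages
    ".//div[@id='title']/span",       # alternative layout on later pages
]

def extract_title(markup, paths=TITLE_PATHS):
    """Return the text of the first path that matches, or None."""
    root = ET.fromstring(markup)
    for path in paths:
        node = root.find(path)
        if node is not None and node.text:
            return node.text.strip()
    return None
```

When a page yields None, that is a useful signal that the layout has changed again and a new selector is needed.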
7) Use a headless browser
Some websites render their content with JavaScript, so when doing any kind of data mining, the content you need is produced by the JS code and is not in the raw HTML response the server delivers.
In such a case, blocking can be prevented using a headless browser. A headless browser is one that is not displayed on the desktop, so there is no graphical interface. Instead of interacting with elements on screen, you can automate everything from a command-line interface or a program. This can help you stay undetected while web scraping.
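As a rough sketch, Chrome can be run headless from the command line with its --headless and --dump-dom flags, which print the page's DOM after JavaScript has run. The Python wrapper below assumes a Chrome binary named google-chrome is on your PATH; that name varies by system and is only an assumption here.

```python
import subprocess

def build_dump_dom_command(url, browser="google-chrome"):
    """Build the command line for a headless run that prints the rendered DOM."""
    return [browser, "--headless", "--dump-dom", url]

def render_page(url, browser="google-chrome", timeout=60):
    """Run the headless browser and return the rendered HTML as text."""
    completed = subprocess.run(
        build_dump_dom_command(url, browser),
        capture_output=True,  # collect stdout instead of printing it
        text=True,
        timeout=timeout,
        check=True,           # raise if the browser exits with an error
    )
    return completed.stdout
```

The returned HTML can then be parsed with the same XPath or CSS selector logic you use for ordinary pages.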
8) Protect yourself from honeypot traps
Websites take the utmost care to prevent hacking. They set up honeypots to lure hackers and detect hacking attempts on the website. A honeypot is usually an application that imitates the behaviour of a real system. For example, some websites install honeypot links that are invisible to normal users but can be accessed by web scrapers.
To avoid falling into this trap, when following links always make sure that the link has proper visibility and no nofollow tag. Some honeypot links meant to detect spiders will have the CSS style display:none or will be colour-disguised to blend in with the page’s background colour.
This detection is not easy and requires a significant amount of programming work to do properly. As a result, the technique is not widely used on either side, whether the server side or the bot or scraper side.
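A simple first line of defence can be sketched with Python's standard html.parser: collect only links that are not hidden by an inline style and are not marked nofollow. This is a partial check under stated assumptions; it does not catch links hidden through external stylesheets or colour disguise. The class and function names are illustrative.

```python
from html.parser import HTMLParser

class VisibleLinkParser(HTMLParser):
    """Collect hrefs of links that are not obviously hidden or nofollow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # Normalise the inline style so "display: none" and "display:none" both match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        rel = (attr.get("rel") or "").lower().split()
        hidden = "display:none" in style or "visibility:hidden" in style
        if href and not hidden and "nofollow" not in rel:
            self.links.append(href)

def safe_links(html):
    """Return the links in html that pass the visibility and nofollow checks."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

Only the links returned by safe_links would then be queued for crawling.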
9) Scrape Behind Login
A few websites do not allow access to their content without a login, for example Facebook and Indeed.
Login-protected pages require some additional information or cookies with each request to access the page. This gives the target website a chance to see requests coming from proxy servers and hence block your account.
Therefore, it is advisable to avoid scraping websites that require a login, as you will be blocked easily. If you must scrape such websites, imitate a human using a browser when authentication is required so that you can get the targeted data.
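To show what "cookies with each request" means in practice, here is a sketch using Python's standard library: an opener that stores cookies from the login response and sends them back automatically on later requests. The form field names username and password and the example.com URL are hypothetical; real sites use their own field names and often extra tokens.

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_session():
    """Return an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def build_login_request(login_url, username, password):
    """Build a POST request carrying the (hypothetical) login form fields."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=form.encode("utf-8"))
```

You would open the login request with the session's opener first, then use the same opener for the protected pages so the session cookies travel with every request.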
How to address Web Scraping detection?
When building a spider, spend some time investigating which anti-scraping mechanisms the website uses, and then program your spider accordingly. This will lead to better data and a more robust spider in the long run.
How do you find out if a website has blocked you?
Look for the following alarms on a website while crawling. If you see any of them, they are indicators of you being banned or blocked.
– CAPTCHA pages
– Unusual content delivery delays
– Frequent responses with HTTP 404, 301 or 503 status codes
Also, if these HTTP status codes appear frequently, consider yourself blocked.
– 301 Moved Permanently
– 401 Unauthorized
– 403 Forbidden
– 404 Not Found
– 408 Request Timeout
– 429 Too Many Requests
– 503 Service Unavailable
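A spider can watch for these codes automatically. The sketch below flags a probable block when a large share of recent responses carry one of the status codes listed above; the 50% threshold and the function name are assumptions you should tune for your own crawl.

```python
# Status codes from the list above that may indicate blocking.
BLOCK_STATUS_CODES = {301, 401, 403, 404, 408, 429, 503}

def looks_blocked(status_codes, threshold=0.5):
    """Return True when at least `threshold` of recent responses
    carry a suspicious status code."""
    if not status_codes:
        return False
    suspicious = sum(1 for code in status_codes if code in BLOCK_STATUS_CODES)
    return suspicious / len(status_codes) >= threshold
```

Feed it the status codes of the last few responses and slow down or stop the crawl when it returns True.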
Conclusion: Proven Ways To Bypass Anti-scraping Techniques
To bypass anti-scraping techniques, the basic rule remains constant, i.e. be nice to the target website and use a proxy server. Don’t overload it with requests that its server cannot handle. Build a stable and robust spider to crawl and collect data efficiently rather than getting blocked. These points will help you build your own solution for dealing with anti-scraping.
Are you a data scientist, marketer or publisher who uses a lot of techniques to get past anti-scraping measures and collect relevant data? Tell us about your experience with bots.