HubSpot Links Crawler 2.0 user agent hitting protected pages

#1

We are being hit by a crawler with the user agent "HubSpot Links Crawler 2.0 http://www.hubspot.com/" on a gated page. The page in question is the first page a user lands on after signing into the application, so any unauthenticated request to it triggers a 401. The crawler is most likely hitting that page because we make an identify() call there to associate an unknown visitor with an email address. The crawler's requests are flooding our error tracking software with 401s. How can we turn this off or prevent it from happening?
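For reference, the call looks roughly like this (a minimal sketch using the standard HubSpot tracking-code queue; the email value and surrounding code are placeholders, not our actual implementation):

// Runs on the first page after sign-in, which is gated behind auth.
const _hsq: unknown[][] = (window as any)._hsq || ((window as any)._hsq = []);
_hsq.push(["identify", { email: "user@example.com" }]); // placeholder address
_hsq.push(["trackPageView"]);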


#2

Hi @terrance,

The HubSpot links crawlers respect robots.txt files; are you able to implement a robots.txt file for the page(s) you're referring to?
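In the meantime, if the immediate pain is the 401 noise in your error tracker, a stopgap on your side is to answer requests from that user agent before they reach your auth layer. Here's a rough sketch, assuming an Express app (the framework and middleware are my assumption, not something from your stack):

import express from "express";

const app = express();

// Short-circuit the HubSpot crawler before authentication runs, so the
// gated page never raises a 401 that lands in error tracking.
app.use((req, res, next) => {
  const ua = req.get("User-Agent") || "";
  if (ua.includes("HubSpot Links Crawler")) {
    res.status(403).send("Crawling not permitted");
    return;
  }
  next();
});

That said, robots.txt is the supported way to keep the crawler out entirely.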


#3

@Derek_Gervais Yes, it has been in place since day one, since the page path requires an authenticated account. Here is what it looks like:

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# Ban all spiders from the entire site:
User-agent: *
Disallow: /

# AdsBot-Google ignores the wildcard rule, so it is disallowed explicitly:
User-agent: AdsBot-Google
Disallow: /

#4

@Derek_Gervais Is this something that would be better asked of Support rather than on the forums?


#5

Hi @terrance,

This might be appropriate for Support, but I'm happy to help out here as well. Can you give me the URL of the restricted page the crawler is hitting? I'd like to dig in with the team to see whether we're having trouble accessing your robots.txt.


#6

@Derek_Gervais I PM'd you. That data is kind of sensitive.