Episode 79: Google HTTP2, cheap hosting and rankings and ITP in iOS14

In this episode, you will hear Mark Williams-Cook talking about Googlebot moving to HTTP/2 and what this means for SEOs and webmasters, cheap hosting and SEO and ITP in iOS14.

What's in this episode?

In this episode, you will hear Mark Williams-Cook talking about:

Googlebot moving to HTTP/2: What does this mean for SEOs and webmasters?

Cheap hosting and SEO: A recent study suggests that cheap, shared hosting could be bad for your rankings.

ITP in iOS14: The latest headline notes on Intelligent Tracking Prevention and iOS14

Show notes

https://webmasters.googleblog.com/2020/09/googlebot-will-soon-speak-http2.html

https://www.rebootonline.com/blog/long-term-shared-hosting-experiment/

https://www.simoahava.com/privacy/intelligent-tracking-prevention-ios-14-ipados-14-safari-14/

Transcription

MC: Welcome to episode 79 of the Search with Candour podcast, recorded on Friday the 18th of September 2020. My name is Mark Williams-Cook and today we'll be covering Googlebot speaking HTTP/2 and what this means for you, and a brilliant study by Reboot Online asking whether cheap hosting makes your SEO worse. And we'll be ending the show talking about Intelligent Tracking Prevention - ITP - in iOS14, so all of the Apple devices, and what that means.

This episode is sponsored by Sitebulb. Sitebulb is a desktop-based crawler for Windows and Mac machines, which will allow you to do SEO audits for your websites. Sitebulb is a tool I've used for quite a long time now, and we use it throughout the Candour agency. Every episode I just talk about a new feature that I like in Sitebulb, or perhaps something new they've brought out, and one of the things that Sitebulb, I think, is famous for is their site visualisation. So this is a really, really cool part of Sitebulb. Like many tools, you'll pop in your URL, their spider will crawl your site and it will start giving you things they're finding that maybe you need to look at, and we've talked before about how they'll prioritise and explain those issues. The site visualisations are really helpful and I'll tell you why. There are several different types they can generate for you - crawl maps, crawl trees, directory maps, directory trees - but what you're going to get is a visual representation of how your pages are linked together and a really quick way to overview your site architecture. I literally used this today in a call with a client, because it immediately highlights where they have sets of pages that are maybe three, four clicks away and there's only one route to get to them, so it immediately warns me that they've perhaps got areas of their site they need to look at in terms of internal linking. In some extreme cases, we've done crawls where the crawl map looks like an exploding sun, and then you've definitely got an information architecture or internal linking issue. Just being able to share your screen and show those to clients immediately, especially now that we do so many Zoom calls, is really powerful.
Sitebulb have got a special offer for Search with Candour listeners, so if you go to sitebulb.com/swc, you will be able to get a 60 day - it's a very generous, 60 day trial of Sitebulb, no credit card required, you can just sign up at sitebulb.com/swc, so go and check it out.

Just yesterday, on Thursday the 17th of September, the Google Webmaster Central blog had a post called "Googlebot will soon speak HTTP/2". So from the beginning of November, Googlebot will actually start crawling some sites over HTTP/2. This is really interesting from an SEO point of view. There have been various hoops that we've jumped through over the years, especially with performance optimisation. When we were running HTTP/1, it was quite common, when delivering say image assets, to use a technique called sprite sheets, which was essentially where you would have your site images delivered to the browser as one big image, and you'd cut out of that sheet the images you wanted to use on the page. The reason for this was that over HTTP/1 we had to basically send these things in a linear, one-at-a-time fashion over a given connection. So with 30 or 40 images, one would have to be sent and then received, and then the next one and the next one, and this caused latency. So one of the ways around that was people would just put all of the images into one big image, deliver it and then cut it up from there. This changed with HTTP/2 because it actually allows multiple streams to come over in parallel, so we've gone the other way now. We don't want to deliver big sprite sheets where maybe there are images we don't need on that page, or don't need yet. So actually, we're quite comfortable now delivering these assets in small chunks because that can be done in parallel, so it's really helpful, and there are lots of other advantages with HTTP/2.
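To put rough numbers on the "one at a time" versus "in parallel" point, here is a deliberately simplified toy model in Python. It is only a sketch: it assumes every asset costs exactly one round trip, ignores bandwidth and connection setup, and treats HTTP/1 as a single connection (in practice browsers usually open several). The function names and the figures are made up for illustration.

```python
import math


def sequential_time(n_assets: int, round_trip_s: float) -> float:
    """One request at a time: each asset waits for the previous one to finish."""
    return n_assets * round_trip_s


def multiplexed_time(n_assets: int, round_trip_s: float, max_streams: int) -> float:
    """Up to max_streams requests share one connection and run in parallel."""
    batches = math.ceil(n_assets / max_streams)
    return batches * round_trip_s


if __name__ == "__main__":
    # Hypothetical figures: 40 images, 50 ms round trip, 100 parallel streams.
    print("one at a time:", sequential_time(40, 0.05), "seconds")
    print("multiplexed:  ", multiplexed_time(40, 0.05, max_streams=100), "seconds")
```

Under those assumptions the sprite sheet trick stops paying for itself, because the 40 small requests no longer queue behind each other.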

So Google asks, why are we making this change to Googlebot? And they said: in general, we expect this change to make crawling more efficient in terms of server resource usage. With h2 - and I'll call it h2 instead of HTTP/2 to save myself - Googlebot is able to open a single TCP connection to the server and efficiently transfer multiple files over it in parallel, instead of requiring multiple connections. The fewer connections open, the fewer resources the server and Googlebot have to spend on crawling. I think that's really interesting from a crawl budget perspective. I was just talking to a client today, with a particularly large site, about crawl budget and how Google does have finite resources, and about how Google decides how many pages on a particular site they're going to crawl. And you know, if we're not getting the coverage we need on particularly large sites, we can use tools like robots.txt, or actually change how the site is structured a bit, to try and herd these bots to where we need them to go, so they're looking at important pages regularly and they're discovering new important content.
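As a small illustration of herding bots with robots.txt, the sketch below uses Python's standard `urllib.robotparser` to check which URLs a set of rules would leave crawlable. The rules, paths and example.com URLs are hypothetical, and Python's parser does plain prefix matching, so it will not behave exactly like Googlebot on wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of internal search and filter pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /filter/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawlable(parser: RobotFileParser, urls, agent: str = "Googlebot"):
    """Return only the URLs the given user agent is allowed to fetch."""
    return [url for url in urls if parser.can_fetch(agent, url)]


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    candidates = [
        "https://example.com/products/widget",
        "https://example.com/search?q=widget",
        "https://example.com/filter/colour/red",
    ]
    for url in crawlable(parser, candidates):
        print("crawlable:", url)
```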

So it seems, to me, maybe there's a hint there that if they're spending less resources on crawling, that potentially they can do more crawling. So the crawl budget may become, as we'd expect over time, actually less of an issue. We'll put a link in the show notes at search.withcandour.co.uk to this post. It gives you some other information, for instance, you can opt out of this if there's a reason you'd like to do that.

They've answered some other questions here, which I think are useful. So they said, why are you upgrading Googlebot now? And Google says, the software we use to enable Googlebot to crawl over h2 has matured enough that it can be used in production. Do I need to upgrade my server asap? Google answered, it is really up to you, however, we will only switch to crawling over h2 sites that support it and will clearly benefit from it. If there's no clear benefit for crawling over h2, Googlebot will still continue to crawl over http1. So I'm guessing maybe they will try and do crawls, using both h1 and h2, and see if there's any particular gain for Googlebot in terms of resources for crawling over h2. I'm guessing as well that means it’s not replacing; there's still separate infrastructure there and it's about load balancing and you know, if the site's simple and it's delivered pretty much the same, takes the same amount of time over h1 or h2, they'll keep it on the legacy h1 stuff. If you have got sites doing some new funky things with h2, and that's making it easier for Google, it sounds like that's a thing they're going to do.

How do I test if my site supports h2? And how do I upgrade my site to h2? For testing, Google points to a Cloudflare blog post with a whole range of different methods to test whether your site supports h2, and the answer to how to upgrade is basically go and ask your devs, or your site admin, or your server hosting provider. The other detail they've given here is how different features in h2 will help with crawling. Google's answer is that some of the many, but most prominent, benefits of h2 include multiplexing and concurrency, so fewer TCP connections open means fewer resources spent. Header compression as well, so drastically reduced HTTP header sizes will save resources. And server push: they've said this feature is not enabled yet, it's still in the evaluation phase, it may be beneficial for rendering but they don't have anything specific to say at this point. I do like the last question that they've answered here on h2, which is: is there any ranking benefit for a site in being crawled over h2? No, is the answer. So they're saying that's obviously not going to be a factor. It's possible that, with a reduction in the cost of crawling, they may crawl more. Obviously how much Google decides to crawl anyway isn't just down to their finite resources, it's also down to your website: how many links you have, how big the site is, if your site's returning errors, things like that.
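One possible way to check h2 support yourself, sketched in Python with only the standard library, is to offer both protocols during the TLS handshake and see which one the server picks (this negotiation is called ALPN). This is not necessarily one of the methods in the blog post Google links to, and the function names and example.com host are placeholders.

```python
import socket
import ssl


def build_alpn_context() -> ssl.SSLContext:
    """TLS context that offers HTTP/2 first and HTTP/1.1 as a fallback."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    return ctx


def negotiated_protocol(host: str, port: int = 443, timeout: float = 5.0):
    """Return the protocol the server chose: 'h2', 'http/1.1', or None if it chose nothing."""
    ctx = build_alpn_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()


if __name__ == "__main__":
    # Placeholder host; swap in the site you want to check.
    print(negotiated_protocol("example.com"))
```

A result of `h2` only tells you the server is willing to speak HTTP/2 on that hostname; whether Googlebot then chooses to crawl over it is Google's decision, as described above.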

Here is something that I found really interesting. So there is a blog post on the Reboot Online blog, which I will link to in the show notes at search.withcandour.co.uk, and the title of this blog post is “Long term shared hosting experiment” - I saw some people tweeting about this and I think it's worth going into. So essentially, and I'll give you some details of how they've done it, they have set up an experiment to try and determine if putting your website on shared hosting, that's shared with lots of other websites, especially websites that are in “ill repute” we’ll say with Google, could potentially impact your rankings. It's something I've seen talked about over many years, especially when people talk about things like private blog networks. One of the most basic mistakes you see real amateurs make when they're trying to manipulate Google rankings is they'll make their network of websites and blogs and they'll stick them all, in the worst case, on the same server; so they've got the same IP, same information, all that kind of stuff, so those are very obvious footprints. You see people now that sell private blog network infrastructure advertising how you can have every website on a separate class C IP address block, so that you're really separating these things and they look natural. And it was taken as understood that that's a requirement, because even if one of the sites got rumbled and banned, you don't want the rest of the network exposed, and there's lots and lots of things you need to do to ensure that.

But anyway, this is a similar experiment, I think it's fair to say. So the background to this is their hypothesis, especially around the AI side to Google; Google's algorithms are pretty good now at picking up on a variety of signals, both on page and off page, to determine how high-quality, trustworthy, and authoritative a site is. It's not always known exactly which factors Google are looking at because they do have this, what they call, black box approach almost, and the hypothesis in the post is talking about those algorithms trying to look for patterns which allow them to identify lower quality websites. So the thinking is that really cheap shared hosting will attract lots of lower quality websites like spam and PBNs, so could this actually be a factor that Google is taking in? With all other things being equal, if your site is on cheap hosting, is it maybe seen as not as good, and therefore not deserving to rank?

So, this experiment, their method was quite interesting. They came up with an entirely new keyword, which they were going to use as their baseline, if you like, which was head headgenestio, and they checked that this didn't return any results. Then they bought up 20 domain names and again, they were pretty thorough - using search operators to check they didn't get any results that had been previously indexed when they searched in Google. They used Majestic and Ahrefs to ensure none of the domains had any links already, so nothing that points to them having been previously registered or having things that could impact how they would rank. So now that they could take it that those 20 websites were all starting on an even footing, with no historic ranking signals there, they went about finding hosting. The method was that they hosted 10 websites on dedicated IP addresses on Amazon Web Services - AWS - and the other 10 were hosted on shared IP addresses that they knew also hosted bad neighbourhood type websites. This was quite interesting so, without going into mega depth here, they scoured through a lot of data and they tried to find hosts that had at least 200 other websites hosted on a single IP address. They then shortlisted those using string matching on keywords in the domain name, to try and find domains that included things like sex, pharma, casino, poker, escort, cams, slots, all the kind of things that maybe would suggest these weren't reputable websites, or that they're more likely to be doing dodgy link building and things like that. And that's actually another step they took, which is they went through these sites, ran them again through Majestic and Ahrefs, and they were trying to find servers hosting sites that were using link building strategies that did go against Google's guidelines, so doing things like buying links.
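The shortlisting step described there can be sketched in a few lines of Python. This is not Reboot's actual code: the keyword list and the 200-site threshold come from the description above, while the function names, the `min_flagged` parameter and the input format (a mapping of IP address to the domains hosted on it) are assumptions made for illustration.

```python
# Keywords mentioned in the write-up as hints of a "bad neighbourhood".
SPAM_KEYWORDS = ("sex", "pharma", "casino", "poker", "escort", "cams", "slots")

# Only consider IPs that host at least this many sites.
MIN_SITES_PER_IP = 200


def flagged_domains(domains):
    """Return the domains whose name contains any of the keywords (plain substring match)."""
    return [d for d in domains if any(k in d.lower() for k in SPAM_KEYWORDS)]


def shortlist_ips(domains_by_ip, min_flagged=1):
    """Map each crowded IP to its flagged domains, keeping only IPs with enough of them."""
    shortlist = {}
    for ip, domains in domains_by_ip.items():
        if len(domains) < MIN_SITES_PER_IP:
            continue  # not crowded enough to count as cheap shared hosting
        flagged = flagged_domains(domains)
        if len(flagged) >= min_flagged:
            shortlist[ip] = flagged
    return shortlist
```

Plain substring matching is crude, for example it would also flag a domain like essex-plumbers.example, which may be part of why the experiment followed this up with a manual link profile check.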

So with all this kind of groundwork done, they basically built a very similar static HTML design on each website, and made sure they loaded quickly and had the same functionality, again as close as possible. And I was quite impressed with this, they went quite far here: they set up a job to run an hourly PageSpeed Insights test on every website, every day of the experiment, because they wanted to ensure that speed was not influencing the rankings. You would expect these shared hosts to be slower, so they didn't want to spoil their experiment by having really slow sites, where it could have been argued that was what was causing the difference.

They also monitored the uptime with StatusCake, again to make sure that this wasn't an issue, and the short version is the performance was pretty much the same, with near-identical scores on PageSpeed Insights. StatusCake showed that both the AWS and the super cheap hosting had a hundred percent uptime during the experiment. They did get some other people involved in this as independent observers, but they were very careful not to tell anyone else that they were running the experiment. Because obviously, again, you could argue that if people knew about the experiment and they were Googling things and clicking on sites, that could possibly influence the results.

So they had very similar, but not identical, content on each website for the target key phrase, each website had similar meta information, and then they kicked off, and this was actually at the end of May. Now, this is where it gets really interesting. They set up tracking with SEMrush to track the ranking positions, and essentially left that going. Some precautions they took included not searching, as we mentioned, for the target keyword or visiting the experiment sites over the course of the experiment. They registered all domains using different tags, creating Search Console accounts from different IP addresses, at different dates and times, using different Gmail accounts and names to do so. The domains were fetched in Google Search Console gradually over two weeks, at different times and dates and from different IP addresses. They alternated between shared hosting and AWS websites and started by fetching a shared hosting website first. The keyword was unknown to Google at the start of the experiment. Domain names were unknown to Google at the start of the experiment. There was no historical link data, as I said, on any of the websites. Content on all the websites was the same length, and the keyword density and positioning was kept the same across all the experiment websites. Similar but not identical source code and styling. Multiple daily checks to ensure page speed and uptime were equal, and no external visitors to the websites. So that's a pretty thorough job there.

Now the results, and the results are really, really interesting. They basically showed that websites hosted on the shared IP addresses did rank less strongly than those hosted on a dedicated one. So looking at the rankings, positions 1, 2, 3, 4, 5, 6, 7, 8 and 9 were all taken by the AWS websites, and the 10th AWS site was in position 13. So that left the shared hosting sites in 10th, 11th, 12th, 14th, 15th, 16th, 17th, 18th, 19th and 20th - so apart from one, all of the AWS sites outranked the shared sites. They did some data analysis on this and looked at the statistical significance, which I'll let you read through if that interests you. Their conclusion was that the results of this experiment suggest that cheap, shared hosting options can in fact have a detrimental effect on the organic performance and rankings of the websites hosted there, if your website ends up being hosted alongside lower quality and potentially spammy ones, providing all websites being observed are otherwise on a level playing field. Low-cost shared hosting solutions often attract those looking to publish lower quality, spammy, and toxic websites in a churn and burn fashion.
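To get a feel for how lopsided those positions are, here is a small Python sketch that counts head-to-head comparisons between the two groups using the positions read out above. It is not the analysis from the Reboot write-up, just one simple way to look at the same numbers; the function name is made up.

```python
# Final positions as read out above (lower is better).
AWS_POSITIONS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]
SHARED_POSITIONS = [10, 11, 12, 14, 15, 16, 17, 18, 19, 20]


def pairwise_wins(group_a, group_b):
    """Count (a, b) pairs where the site from group_a ranks above the site from group_b."""
    return sum(1 for a in group_a for b in group_b if a < b)


if __name__ == "__main__":
    total_pairs = len(AWS_POSITIONS) * len(SHARED_POSITIONS)
    aws_wins = pairwise_wins(AWS_POSITIONS, SHARED_POSITIONS)
    shared_wins = pairwise_wins(SHARED_POSITIONS, AWS_POSITIONS)
    print(f"AWS outranks shared in {aws_wins} of {total_pairs} pairings")
    print(f"Shared outranks AWS in {shared_wins} of {total_pairs} pairings")
```

The smaller of those two counts is the statistic a Mann-Whitney U test would start from, if you wanted to take this further, but with one made-up keyword and 20 artificial sites it says nothing about how big the effect would be in a real SERP.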

Google and their AI approach to ranking the search results often relies on finding patterns that lower quality websites, or those looking to manipulate the search results, share. By hosting your website alongside such domains, you risk positioning yourself in a bad neighbourhood that could be seen as part of a pattern that low quality websites share. It’s important to note that these results don't show what effect the type of hosting you use when setting up a website would have on an actual SERP for a keyword with real competitors. In order to test a single variable, the hosting type, we had to minimize the effect of any outside ranking factors. That makes sense. This does, however, mean that we cannot know how much weight is given to the type of hosting as a ranking factor, especially when other factors on page and off page are being taken into account alongside it. Hosting a website on a dedicated server and IP address has many benefits and now, according to the experiment data, ranking higher in the SERPs could very well be one of them. If your budget allows for it, I would invest in a hosting solution that matches the quality of the products, services and information you intend to offer. So super interesting results there.

Now, the discussion online was quite interesting. So Cyrus Shepard was involved as an independent observer, and retweeted the results saying, this is important it needs to be shared, look, cheap hosting can impact rankings. This is where John Mueller from Google came back and replied and John said, artificial websites like this are pretty much never indicative of any particular effect in normal Google search. It's a cool experiment and a good write-up analysis and I love it when folks experiment like this, but it's not useful data. Host where it makes sense for you. So what John's saying there is that because, as the analysis said and as they had to do, because they created this experiment, in a vacuum with zero competition, that Google's ranking algorithm isn't really gonna work in the same way as it should be I guess. So, it's not necessarily meaning that's replicable outside of this experiment. So Cyrus came back and said basically, could you present us with more useful data then to back up your claims, and John said, are you saying that what works for made-up keywords on artificial websites will work for a website in an active niche? That seems like quite a stretch, even aside from the technical aspects of what is shared hosting anyway, as technically AWS is shared hosting.

So John went on to answer some more questions and just said he's not aware of any ranking algorithm that would take IPs like that into account. Look at Blogger: there are great sites hosted there that do well, ignoring on-page limitations etc, and there are terrible sites hosted there. It's all the same infrastructure, the same IP addresses. So I think this is one of those cases again where John's trying to provide the insight that we, as SEOs not working at Google, don't have. He says personally he's not aware of any such ranking algorithm; he's not saying there isn't one, he's saying he's not aware of one, and I think it's just a very interesting point.

So, the last thing he ended up with, which I think is a good place to end it, is John said there are many reasons to pick good hosting over cheap hosting, and having a good, fast, stable site with happy users does reflect in ranking, but the ranking shouldn't be a primary reason. I think it's a fair conclusion that regardless of SEO, obviously you want a fast website, you want it to be reliable, so there isn't really a reason you should ever be going for this super cheap hosting, but that's a very, very interesting study.

We'll finish up this episode talking a little bit about ITP, so Intelligent Tracking Prevention, and if I cast my mind back, this is actually the topic that we spoke about 78 episodes ago. On the very first episode of Search with Candour, we talked about ITP; I think it was version 1.2 maybe, at the time. So ITP is this Intelligent Tracking Prevention, which is something that's been around a while now, and it's about protecting people's privacy by default, especially in terms of things like cookies. At Apple's annual Worldwide Developers Conference in June this year, there were a couple of big announcements. One of these was around tracking prevention in iOS14, iPadOS14, and Safari 14. So this is essentially all of Apple's operating systems. Now, I will give you a link again in the show notes at search.withcandour.co.uk if you want to read in depth about this, because it is a really in-depth topic. What I wanted to do is share with you a really great bullet point list that I received from the TLDR Marketing newsletter - so it's absolutely brilliant, a weekly newsletter about marketing. TLDR, if you don't know, is "too long, didn't read". It's a great email because I'm quite impatient with email, and it just very quickly summarises marketing news. Absolutely brilliant newsletter, I would highly recommend you subscribe if you haven't already.

So the summary of the changes in Intelligent Tracking Prevention is as follows. The privacy report is available in the Safari 14 browser across all of Apple's operating systems; that's macOS, iOS, and iPadOS. So these are the operating systems that are going to be affected. It uses DuckDuckGo's Tracker Radar list to enumerate which known tracking-capable domains have been receiving HTTP requests from sites the user has visited. DuckDuckGo obviously is the very privacy-focused search engine, and this is looking at some of the worst offender domains in terms of tracking. So the report does highlight how some of the most prominent tracking domains, like Facebook and doubleclick.net, have been prevented from accessing the user's browser storage, among other things. And since WebKit blocks all access to cookies in a third-party context, the full list of prevented domains comprises all the cross-site requests done from the sites the user visits, not just those listed in the privacy report.

Interestingly, all of the tracking prevention in WebKit is on by default in all browsers in iOS14 and iPadOS14. There's full third-party cookie blocking, so all cookie access in a third-party context is blocked, and there are no exceptions here. All cross-site referrers have been downgraded to just the origin. So this means where sometimes you'd see in your analytics a full URL for where someone's come to your site from, now you'll just see the domain. All cookies written in JavaScript will have their expiration capped at a maximum of seven days from the time the cookie is written or rewritten. And lastly, Safari does not block requests; it strips them of the capability to access cookies or pass referrer headers etc.
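Two of those changes are easy to picture with a small Python sketch: what a cross-site referrer looks like once it is cut down to the origin, and what a seven-day cap does to a cookie that asked for a longer life. This only models the effect as described above; it is not how WebKit implements it, and the function names and example URL are made up.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit

# Cap described above for cookies written in JavaScript.
JS_COOKIE_MAX_AGE = timedelta(days=7)


def downgrade_referrer(referrer_url: str) -> str:
    """Keep only scheme and host, dropping the path and query string."""
    parts = urlsplit(referrer_url)
    return f"{parts.scheme}://{parts.netloc}/"


def capped_expiry(written_at: datetime, requested_expiry: datetime) -> datetime:
    """Return whichever comes first: the requested expiry or seven days after writing."""
    return min(requested_expiry, written_at + JS_COOKIE_MAX_AGE)


if __name__ == "__main__":
    print(downgrade_referrer("https://news.example/articles/itp-explained?utm_source=newsletter"))
    written = datetime(2020, 9, 18, tzinfo=timezone.utc)
    two_years = written + timedelta(days=730)
    print(capped_expiry(written, two_years))
```

For analytics, the practical upshot is that a returning Safari visitor whose JavaScript-set cookie is not rewritten within seven days will look like a new user.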

So a couple of questions came out of this as well that are useful to answer. First, does Google Tag Manager actually work, isn't that relying on third-party cookies? And the answer there is that GTM, Google Tag Manager, works fine. The preview mode requires third-party cookies, so that might suffer, but they've added, who in their right mind uses preview mode on mobile? And secondly, are all remarketing tags from Facebook, LinkedIn Insight and Google no longer working because of the third-party cookie “thing”? And the answer is, as far as I know Facebook doesn't use third-party cookies anymore, but rather sets up first-party tracking with the fbclid identifier in URLs, like the Google Ads tracking.

So this is a super interesting topic, and it's something we're going through internally with our developers this week just to understand the impact, because there have been a lot of these privacy changes. Again, I was looking at one today with a new website that we had rebuilt that now has, as it should do, analytics cookie opt-ins, and at how that's affecting the amount and fidelity of the data we're getting. So it's really important that developers and marketers start to understand, if they don't already, all of these new privacy concerns, especially in terms of how we can still be effective as marketers.

And that's everything we've got time for in this episode. So we'll be back again, last one for September which will be Monday the 28th of September. As usual, if you're enjoying the podcast, do give us a subscribe or share it with someone, we'd really appreciate that and we'll be back in a week's time. Have a brilliant week.