In this article, I want to explore what we’ve seen of JS indexing behavior, both in the wild and in controlled tests, and share some tentative conclusions I’ve drawn about how it must be working.
A brief introduction to JS indexing
There are some complexities even in this basic definition (answers in brackets as I understand them):
For more on the technical details, I recommend my ex-colleague Justin’s writing on the subject.
These days, if you need or want JS-enhanced functionality, more of the top frameworks have the ability to work the way Rob described in 2012, which is now called isomorphic (roughly meaning “the same”).
I was fascinated by this piece of research published recently — you should go and read the whole study. In particular, you should watch this video (recommended in the post) in which the speaker — who is an Angular developer and evangelist — emphasizes the need for an isomorphic approach:
If you work in SEO, you will increasingly find yourself called upon to figure out whether a particular implementation is correct (hopefully on a staging/development server before it’s deployed live, but who are we kidding? You’ll be doing this live, too).
To do that, here are some resources I’ve found useful:
Some surprising/interesting results
It may be more complicated than that, however. This segment of a thread is interesting. It comes from a Hacker News user who goes by the username KMag and who claims to have worked at Google on the JS execution part of the indexing pipeline from 2006–2010. KMag was replying to another user who speculated that Google would not care about content loaded “async” (i.e. asynchronously: in other words, content fetched by new HTTP requests that are triggered in the background while other assets continue to download):
“Actually, we did care about this content. I’m not at liberty to explain the details, but we did execute setTimeouts up to some time limit.
If they’re smart, they actually make the exact timeout a function of a HMAC of the loaded source, to make it very difficult to experiment around, find the exact limits, and fool the indexing system. Back in 2010, it was still a fixed time limit.”
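To make KMag's suggestion concrete, here is a minimal sketch of what "a timeout as a function of an HMAC of the loaded source" could look like. It is purely illustrative: the secret key, the 3–7 second range, and the function name are all my assumptions, and nothing here reflects what Google actually does.

```python
import hashlib
import hmac

# Hypothetical values for illustration only; the real limits (if any) are unknown.
SECRET_KEY = b"example-secret-key"
MIN_TIMEOUT_MS = 3000
MAX_TIMEOUT_MS = 7000


def timeout_for_source(source: str, key: bytes = SECRET_KEY) -> int:
    """Derive a per-page timeout (in ms) from an HMAC of the page source."""
    digest = hmac.new(key, source.encode("utf-8"), hashlib.sha256).digest()
    # Treat the first 8 bytes of the digest as an integer...
    value = int.from_bytes(digest[:8], "big")
    # ...and map it into the inclusive range [MIN_TIMEOUT_MS, MAX_TIMEOUT_MS].
    span = MAX_TIMEOUT_MS - MIN_TIMEOUT_MS + 1
    return MIN_TIMEOUT_MS + value % span
```

The point of the design is that the same source always gets the same timeout, but without the key an outsider cannot predict how any change to the page will move the limit, which makes it hard to probe for a single cut-off by experiment.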
It matters how your JS is executed
I referenced this recent study earlier. In it, the author found:
It’s definitely worth reading the whole thing and reviewing the performance of the different frameworks. There’s more evidence of Google saving computing resources in some areas, as well as surprising results between different frameworks.
CRO tests are getting indexed
- For users:
  - When a visitor arrives on a page, CRO platforms typically check for the existence of a cookie and, if there isn’t one, randomly assign the visitor to group A or group B
  - A cookie is then set to make sure that the user sees the same version if they revisit that page later
- For Googlebot:
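The user-side flow described above can be sketched roughly as follows. Real CRO platforms do this in client-side JS; this Python version only illustrates the logic, and the cookie name and function name are hypothetical rather than taken from any particular platform.

```python
import random
from typing import Dict, Optional, Tuple

COOKIE_NAME = "cro_variant"  # hypothetical cookie name


def assign_variant(
    cookies: Dict[str, str], rng: Optional[random.Random] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (variant, cookies_to_set) for a visitor with the given cookies."""
    rng = rng or random.Random()
    existing = cookies.get(COOKIE_NAME)
    if existing in ("A", "B"):
        # Returning visitor: keep showing the version they saw before.
        return existing, {}
    # New visitor: assign a group at random and remember it in a cookie.
    variant = rng.choice(["A", "B"])
    return variant, {COOKIE_NAME: variant}
```

A client that does not keep cookies between requests would take the "new visitor" branch on every visit, so it could be shown either version each time.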
I might have expected the platforms to block their JS with robots.txt, but at least the main platforms I’ve looked at don’t do that. With Google being sympathetic towards testing, however, this shouldn’t be a major problem — just something to be aware of as you build out your user-facing CRO tests. All the more reason for your UX and SEO teams to work closely together and communicate well.
Split tests show SEO improvements from removing a reliance on JS
- Googlebot crawls and caches HTML and core resources regularly
- Some pages are indexed with no JS execution. There are many pages that can probably be easily identified as not needing rendering, and others which are such a low priority that it isn’t worth the computing resources.
- Some pages get immediate rendering – or possibly immediate basic/regular indexing, along with high-priority rendering. This would enable the immediate indexation of pages in news results or other QDF results, but also allow pages that rely heavily on JS to get updated indexation when the rendering completes.
- Many pages are rendered asynchronously in a separate process/queue from both crawling and regular indexing. When rendering completes, the page is added to the index for any new words and phrases found only in the JS-rendered version, in addition to the words and phrases found in the unrendered version indexed initially.
- In addition to adding pages to the index, the JS rendering:
  - May make modifications to the link graph
  - May add new URLs to the discovery/crawling queue for Googlebot
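As a rough way to picture the model in the list above, here is a toy simulation of two-phase indexing: the unrendered HTML is indexed immediately, and rendering happens later from a separate queue. Every class, field, and method name is invented for illustration, and it makes no claim about how Google's systems are actually built.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Set


@dataclass
class Page:
    url: str
    raw_words: Set[str]        # words present in the unrendered HTML
    rendered_words: Set[str]   # words present after JS execution
    rendered_links: Set[str] = field(default_factory=set)  # links seen only after rendering
    needs_rendering: bool = True


class ToyIndex:
    def __init__(self) -> None:
        self.index: Dict[str, Set[str]] = {}      # word -> URLs containing it
        self.render_queue: Deque[Page] = deque()  # pages waiting for JS rendering
        self.crawl_queue: Deque[str] = deque()    # newly discovered URLs

    def _add(self, url: str, words: Set[str]) -> None:
        for word in words:
            self.index.setdefault(word, set()).add(url)

    def crawl(self, page: Page) -> None:
        """Phase 1: index the unrendered HTML straight away."""
        self._add(page.url, page.raw_words)
        if page.needs_rendering:
            self.render_queue.append(page)

    def render_next(self) -> None:
        """Phase 2: later, render one queued page and index what is new."""
        if not self.render_queue:
            return
        page = self.render_queue.popleft()
        self._add(page.url, page.rendered_words - page.raw_words)
        # Links found only in the rendered version feed back into crawling.
        self.crawl_queue.extend(sorted(page.rendered_links))
```

In this sketch, a page is findable for its raw HTML words as soon as `crawl` runs, but only becomes findable for JS-only words after a later `render_next` call, which is the delay the model predicts.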
Towards the end of my time there, there was someone in Mountain View working on a heavier, higher-fidelity system that sandboxed much more of a browser, and they were trying to improve performance so they could use it on a higher percentage of the index.”
Run a test, get publicity