Web crawler sounds like the discarded name for a Spider-Man sidekick, but it’s actually a program used by search engines to look at almost every single thing on the internet. Our vote on the name would have been “All Seeing Eye” or “The Bug Which Knows” or something cool like that, but we didn’t create the first web crawler, and strictly speaking neither did Google: crawlers predate it. Google (and other search engines like Bing) did popularize the use of crawlers to determine what search results to give you, though. In fact, Google recently declared that the “spider bot hybrid can jump great distances and sees best when surrounded by green light,” though it still doesn’t have a formal name.
How do web crawlers work?
Web crawlers, also called spiders, crawl bots, or search engine crawlers, work by following the links between sites and indexing them. Indexing is a fancy word for “remembering,” and the crawling process involves a bot arriving on your site and remembering all the copy (i.e., the written words) and the links between the pages, as well as things like meta tags and image alt text (which screen readers use to describe images to visually impaired visitors). Then, once the bot has completed the crawl and index process, it pops back to whatever search engine it came from and relays the information before heading out to crawl another site.
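To make the “remembering” step a little more concrete, here is a minimal sketch in Python of what a bot might pull out of a single page. It uses only the standard library’s html.parser module. The PageIndexer class, the index_page helper, and the sample HTML are all made up for illustration, and real search engine crawlers are far more sophisticated than this.

```python
from html.parser import HTMLParser


class PageIndexer(HTMLParser):
    """Hypothetical helper: collects copy, links, meta tags, and alt text."""

    def __init__(self):
        super().__init__()
        self.links = []      # href values from <a> tags
        self.alt_texts = []  # alt attributes from <img> tags
        self.meta = {}       # name -> content from <meta> tags
        self.words = []      # visible copy, split into words
        self._skip = 0       # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore JavaScript and CSS; keep everything else as copy.
        if not self._skip:
            self.words.extend(data.split())


def index_page(html):
    """Return a dictionary describing what was found on one page."""
    indexer = PageIndexer()
    indexer.feed(html)
    indexer.close()
    return {
        "links": indexer.links,
        "alt_texts": indexer.alt_texts,
        "meta": indexer.meta,
        "words": indexer.words,
    }


# Made-up sample page, used only for this illustration.
SAMPLE_HTML = """
<html><head>
<title>Blue Widgets</title>
<meta name="description" content="Hand-made blue widgets.">
</head><body>
<h1>Blue Widgets</h1>
<p>We sell <a href="/widgets/blue">blue widgets</a>.</p>
<img src="w.png" alt="A blue widget">
<script>var ignored = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    for key, value in index_page(SAMPLE_HTML).items():
        print(key, value)
```

The links list is what the bot would follow next, while the words, meta tags, and alt text are what it would hand back to the search engine.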
Web Crawlers Collect Data & Impact SEO
Web crawlers are one of the tools used by Google (and other search engines like Bing, Baidu, or Yandex) to determine the SEO rank of a given web page. To get that coveted #1 spot in the search rankings, the creepy crawlies need to be able to easily parse your site and identify what information is most important (like the links coming to and from the page, the H1 and H2 headers, and the meta tags).
We’ll briefly explain how those various SEO elements work here, but if you want a more in-depth answer to your SEO questions, read our article about on-page SEO.
Crawl bots use the way the words are arranged on a web page to determine what is most important. In most cases, that refers to the main heading, the H1, and the most prominent subheadings, the H2s, and occasionally H3s. If it helps, think about H1s like the headline of a news article, while the H2 is the most important sentence and an H3 is the caption of the photo.
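If you want to see your page’s headings the way a bot might, the rough sketch below pulls out the H1, H2, and H3 text in document order. It is an illustration only: HeadingOutline and heading_outline are hypothetical names, the sample markup is invented, and this is not how any particular search engine weighs headings.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Hypothetical helper: records (level, text) for each h1, h2, and h3."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._buffer = []   # text pieces seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            self.headings.append((self._level, text))
            self._level = None


def heading_outline(html):
    """Return the page's headings as a list of (level, text) tuples."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.headings


# Made-up sample markup for illustration.
SAMPLE = """
<h1>What Is a <em>Web Crawler</em>?</h1>
<p>Intro copy.</p>
<h2>How do web crawlers work?</h2>
<h3>Seeds</h3>
<h2>Robots.txt</h2>
"""

if __name__ == "__main__":
    for level, text in heading_outline(SAMPLE):
        print("  " * (level - 1) + f"H{level}: {text}")
```

A page whose outline reads sensibly on its own, with one clear H1 at the top, is usually easier for both bots and humans to skim.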
Metadata is the other way the bots determine what is important on a given page. Metadata is the description of a page or a certain section of content, and you can use meta tags to identify items and pages of significance. There are a few things you can do to make your meta tags the best they can be: be brief, descriptive, informative, and accurate, and the bots (and your SEO rankings) will be happy.
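“Be brief” is easier to stick to with a quick automated check. The sketch below flags empty or overly long meta descriptions. The 160-character default is an assumption based on a common rule of thumb, not an official limit from any search engine, and review_meta_description is a hypothetical helper name.

```python
def review_meta_description(description, max_chars=160):
    """Return a list of warnings; an empty list means nothing was flagged.

    max_chars is an assumed rule-of-thumb limit, not an official one.
    """
    warnings = []
    text = " ".join(description.split())  # collapse stray whitespace
    if not text:
        warnings.append("description is empty")
    elif len(text) > max_chars:
        warnings.append(
            f"description is {len(text)} characters; "
            f"the assumed limit is {max_chars}"
        )
    return warnings


if __name__ == "__main__":
    print(review_meta_description("Hand-made blue widgets, shipped worldwide."))
    print(review_meta_description(""))
    print(review_meta_description("widgets " * 40))
```

A check like this can only catch length problems. Whether a description is descriptive, informative, and accurate still needs a human read.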
Why Links Are So Important For SEO & Web Crawlers
The internet is a massive, unknowable thing with a nearly unlimited number of websites and pages on it. A crawler doesn’t have time to access every page, and in fact a sizable share of web pages (estimates vary widely depending on who gives the statistic) belong to the so-called “Deep Web”: pages behind logins, forms, or paywalls that crawlers can’t reach by following ordinary links. Besides those exceptions, links between sites are how the crawlers find your web page.
Crawl bots start with what is called a seed: a known list of accessible websites (like BBC News, Google.com, Wikipedia, etc.). From these seed sites, the bot crawls along the links to the other sites, indexes those, and repeats the process. This is how the known web works, and why links are so important: if your page isn’t linked to by many reputable “seed” sites, the crawl bots will lean towards ranking you less highly. Being linked to frequently by authoritative sites, or being an authority yourself, is a great way to scream to the top of the rankings. To help you get there, check out our list of SEO maintenance tasks you can implement right now.
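The seed-and-follow process described above is, at heart, a breadth-first walk over links. Here is a simplified sketch that runs on a tiny made-up “web” held in a dictionary instead of fetching real pages. The crawl function, the TOY_WEB data, and the max_pages cap are all assumptions for illustration, and a production crawler would add politeness delays, robots.txt checks, and much more.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first from the seeds; return them in visit order.

    get_links is any function that maps a URL to the URLs it links to.
    """
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real bot would index the page here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A made-up miniature web: each site maps to the sites it links to.
TOY_WEB = {
    "seed.example": ["news.example", "shop.example"],
    "news.example": ["shop.example", "blog.example"],
    "shop.example": ["seed.example"],
    "blog.example": [],
    "orphan.example": ["seed.example"],  # links out, but nothing links in
}

if __name__ == "__main__":
    found = crawl(["seed.example"], lambda url: TOY_WEB.get(url, []))
    print(found)
    # → ['seed.example', 'news.example', 'shop.example', 'blog.example']
```

Notice that orphan.example never shows up. It links out to the seed, but nothing links in to it, so the bot has no path to find it. That is the situation a sitemap or crawl request is meant to fix.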
If you’re worried about your site not being connected to anything at all, you can submit a sitemap.xml file to Google, or request a crawl, so that they know you exist and the bots will start including you in their regular crawls. This is a good habit to get into, because it also lets Google know that you want your site to be crawled, which means they’ll tell you if there’s an error with your robots.txt or something else.
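A sitemap.xml file is just a list of your URLs in a small XML format. As a rough sketch, the snippet below builds a bare-bones one with Python’s standard library, using the namespace from the sitemaps.org protocol. The build_sitemap helper and the example.com URLs are placeholders, and optional fields such as last-modified dates are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap.xml document, as a string, for the URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder URLs for illustration.
    print(build_sitemap([
        "https://example.com/",
        "https://example.com/products/blue-widget",
    ]))
```

You would save the output as sitemap.xml at the root of your site before submitting it.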
A Common Web Crawler Mistake: Misusing Robots.txt
Before a spider crawls your web page, it politely asks permission. It does this by requesting your robots.txt file and checking whether you have given it permission to crawl the site. A good bot (like the ones used by the major search engines) will follow the instructions in robots.txt. A naughty one will do whatever it wants, scrape and steal content, and be a nuisance. We’ll talk about how to deal with that in another article.
The problem comes when a page you want crawled is telling the spiders to buzz off. This can result in your search ranking declining and your sales dropping. Google has a convenient, free tool you can use to check if your robots.txt file is working as intended.
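You can also run the same kind of permission check a polite bot performs. Python’s standard library ships urllib.robotparser for reading robots.txt rules. The sketch below feeds it a made-up robots.txt and reports which of your must-be-crawled pages are accidentally blocked. The blocked_urls helper, the “ExampleBot” user agent, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt tells user_agent to skip."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Made-up robots.txt: /admin/ is blocked on purpose, /products by mistake.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /products
"""

# Pages we want search engines to crawl (placeholder URLs).
MUST_BE_CRAWLABLE = [
    "https://example.com/",
    "https://example.com/products/blue-widget",
]

if __name__ == "__main__":
    for url in blocked_urls(ROBOTS_TXT, "ExampleBot", MUST_BE_CRAWLABLE):
        print("Blocked by robots.txt:", url)
```

Here the stray “Disallow: /products” line keeps well-behaved bots away from every product page, which is exactly the kind of mistake that quietly hurts rankings.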
Wrapping things up
Spiders are our friends, and just like our human visitors, we have to be conscious of and empathetic towards them when working on web properties. The best websites optimize for humans and robots, offering a great experience for users and crawlers alike, end to end. Now that you’re equipped with this knowledge, go out there and make something great!