Did you know that Google Sheets can be turned into a web scraper? Albeit a simple one, it’s a web scraper none the less!

You can pull out information from a website by using the =IMPORTXML() command. We’ll dive into the specifics below, but if you want to jump in and start scraping, here’s a quick example.

=IMPORTXML(url, xpath_query)

=IMPORTXML(“https://www.somewebsite.com”, “//span[@class=’my-class-name’]”)

In this example, we call the function importxml and then pass in the cell and target element to the function. Let’s take a deeper dive.

A Real World Example

Let’s say you want to crawl techcrunch.com (sorry TechCrunch) for all headings under the latest news. You first setup the =IMPORTXML() function in your Google Sheets.

Then, insert the URL as the first parameter.

And now for the tricky part, you’ll need to target the html element. The rules for the xpath is as follows.

// – this will grab all elements of the type

//h2 – this will grab all h2 elements on the page

[@class=’name’] – will grab all elements on the page with the class name

//h2[@class=’name’] – then grabs all h2 elements with the class of name

As of this writing, the HTML for the headings under The Latest are:

Based on this, we know that we can target the heading by using h2[@class=’post-block__title’]. And we can finish up our import call with the following:

=IMPORTXML(“https ://techcrunch.com/”,”//h2[@class=’post-block__title’]”)

When you’re done, you should have 20 or so rows that have the titles of the articles from The Latest.

You can expand on this simple scraper by pulling out the links and description for your dev needs. Thanks for reading!

If you want to see the sheet I used, click here.

Post not found !

Leave a Reply

Your email address will not be published. Required fields are marked *
You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>