How to scrape a website

Discover how to scrape data from a website.

Published
June 11, 2018
Reading time
5 min

We often hear about how much data is on the web and how it’s growing exponentially from year to year.

That leads to discussions of Big Data, and Machine Learning, and so on. But in the end, what do YOU do with web data?

The answer is probably nothing, because most websites don't make their data easy to access. To use that information, you need a way to collect it at scale.

Luckily, there's web scraping to the rescue!

Web scraping lets you automatically extract content from web pages. You can scrape virtually anything, from e-commerce shops to GitHub repositories.

How it works

First, you have to understand how a web page is created, and particularly how HTML works.

A web browser renders HTML documents. These documents describe the structure of the page semantically.

Think of it as a tree with branches. In fact, to render a web page, web browsers organize the HTML document into a tree structure called the DOM (Document Object Model).

<!DOCTYPE html>
<html>
    <head>
        <title>This is a title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Hello world!</p>
    </body>
</html>
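
To make the tree idea concrete, here is a minimal sketch that walks the document above with Python's standard-library html.parser and records one indented line per tag or text node. The names OutlinePrinter and outline are hypothetical helpers made up for this illustration, and this is not how a browser builds the DOM; it only shows the nesting.

```python
from html.parser import HTMLParser

HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>This is a title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Hello world!</p>
    </body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Hypothetical helper: records one indented line per tag or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then go one level deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag brings us back up one level.
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.lines.append("  " * self.depth + repr(text))


def outline(html):
    parser = OutlinePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    print("\n".join(outline(HTML)))
```

Each extra level of indentation in the printed outline corresponds to one more level of nesting in the HTML.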

What you need to keep in mind is that everything is nested.

...
<body>
    <h1>Heading</h1>
    <p>
        <span>I'm nested <b>I'm nested and bold!</b>
            <span>Wow, too much nesting for me, I'm getting lost!
                <span>Wait... can you actually do that?</span>
            </span>
        </span>
    </p>
</body>
...

There are some rules to respect, but that's not the topic of this article.

The elements "h1" and "p" are tags. They can be described by attributes:

<h1 class="nice-heading" id="main-heading">Heading</h1>

Attributes further describe the tags (nodes). They are very, very useful, mostly because they let you describe a path to the data.

Indeed, when you say "I want to extract data" from the single line of code above, what you're referring to is the "Heading" value, which is a text value.
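
As a small illustration of the three pieces just mentioned (the tag, its attributes, and its text value), the sketch below parses that single line with Python's standard-library xml.etree.ElementTree. It assumes the snippet is well-formed XML, which holds for this line but not for every real web page.

```python
import xml.etree.ElementTree as ET

# The single line of HTML from above, parsed as a tiny XML document.
h1 = ET.fromstring('<h1 class="nice-heading" id="main-heading">Heading</h1>')

print(h1.tag)     # the tag name
print(h1.attrib)  # the attributes, as a dictionary
print(h1.text)    # the text value you usually want to extract
```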

But how do you access this data?

Accessing Data

Okay, so let's say we have the following code:

...
<body>
 <div class="container">  
   ...
   <div class="card">
     <h3 class="use-case">Repositories</h3>
     <p>Enrich your business database or find new leads to feed your CRM.</p>
   </div>
   ...
 </div>
</body>
...

The previous (simplified) code outputs the following:

[Image: Scraping Example]

In this case, how do you access the text of the first card described by our HTML code?

Easy! You need to use the XPath language.

The XPath language is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria.

Remember how I said that an HTML document is like a tree with branches? Well, it's the same for XML. Both are what we call markup languages.

XPath gives you the ability to navigate the DOM (remember, HTML is organized into a tree structure with branches!).

In the end, it's very simple to access data, because what you get is the following structure:

div.container
 -- div.card
   -- h3.use-case
     #text
   -- p
    #text

Now, if you want to access the text inside the <p> tag by using XPath:

document.xpath('//div[@class="container"]//div[@class="card"]//p/text()')

Basically, what this code says is: "Take the div container, then go to the div card, and extract the text inside the p tag."

This way, you're able to extract the text "Enrich your business database or find new leads to feed your CRM".

Amazing, isn't it?
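
For a version you can run as-is, here is a sketch of the same lookup with Python's standard-library xml.etree.ElementTree. Two assumptions to keep in mind: the HTML is written without the "..." placeholders so that it parses as well-formed XML, and ElementTree supports only a subset of XPath. In particular it has no text() step, so the text is read from the .text attribute of each matched element.

```python
import xml.etree.ElementTree as ET

# The card example from above, without the "..." placeholders.
PAGE = """<body>
  <div class="container">
    <div class="card">
      <h3 class="use-case">Repositories</h3>
      <p>Enrich your business database or find new leads to feed your CRM.</p>
    </div>
  </div>
</body>"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative to the element they are called on,
# hence the leading "." before the first "//".
paragraphs = root.findall(".//div[@class='container']//div[@class='card']//p")

# No text() step in ElementTree's XPath subset: read .text instead.
texts = [p.text for p in paragraphs]
print(texts)
```

Third-party libraries such as lxml accept the full expression, including text(), but the idea is the same: describe a path through the tree, then read the value at the end of it.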

The Next Level

Now that you understand the basics, you need to dive a bit deeper into programming.

For most scraping use cases, I generally recommend using Python.

There's an amazing community and tons of packages and libraries that you can use to scrape web data.
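
Before reaching for a third-party package, it can help to see how far the standard library gets you. The sketch below uses html.parser to turn every card on a page into a dictionary. CardParser and scrape_cards are hypothetical names, the second card in the sample page is made up for illustration, and a real script would get the HTML from an HTTP response rather than from a string.

```python
from html.parser import HTMLParser

# Hypothetical sample page: the first card is the one used earlier in this
# article, the second card is made up for illustration.
PAGE = """
<div class="container">
  <div class="card">
    <h3 class="use-case">Repositories</h3>
    <p>Enrich your business database or find new leads to feed your CRM.</p>
  </div>
  <div class="card">
    <h3 class="use-case">Sample title</h3>
    <p>Sample description.</p>
  </div>
</div>
"""


class CardParser(HTMLParser):
    """Hypothetical helper: collects the title and description of each card."""

    FIELDS = {"h3": "title", "p": "description"}

    def __init__(self):
        super().__init__()
        self.cards = []
        self._field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "card":
            # A new card starts: add an empty record for it.
            self.cards.append({"title": "", "description": ""})
        elif tag in self.FIELDS and self.cards:
            self._field = self.FIELDS[tag]

    def handle_endtag(self, tag):
        if tag in self.FIELDS:
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.cards[-1][self._field] += data


def scrape_cards(html):
    parser = CardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


if __name__ == "__main__":
    for card in scrape_cards(PAGE):
        print(card)
```

This simple parser assumes cards are not nested inside each other and that each card has one h3 and one p; dedicated scraping libraries handle messier pages with far less code.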

Here is an example of how to scrape websites with Python and BeautifulSoup.

Among others:

We've only been talking about basic HTML pages, but you probably know that websites nowadays use more and more JavaScript to build very cool stuff.

Unfortunately, JS does not simplify web scraping. But there's a solution to every problem :)

Some examples of useful libraries:

To help you a bit, here's a great XPath Cheatsheet to use whenever you want to access complicated nested data.

If you need help with web scraping, be sure to get in touch.

Be sure to check out our blog to get a sense of what you can do with web scraping.

Guillaume Odier
Co-Founder & CEO