What do you do when you can't download a website's data? Copy it by hand? Wow, you're brave!
I'm a web developer, so I'm way too lazy to do things manually :)
If you're about to scrape data for the first time, go ahead and read How To Scrape A Website.
Today, let's say that you need to enrich your CRM with company data.
To make it interesting for you, we will scrape AngelList.
More specifically, we'll scrape Uber's company profile.
Please scrape responsibly!
Getting started
Before starting to code, make sure you have Python 3 installed; we won't cover the installation here. Chances are you already have it.
You also need pip, Python's package manager. It normally ships with Python 3, but if it's missing you can install it with:
python3 -m ensurepip --upgrade
The full code and dependencies are available here.
We'll be using BeautifulSoup, a standard Python scraping library.
pip install beautifulsoup4
You could also create a virtual environment and install all the dependencies inside the requirements.txt file:
pip install -r requirements.txt
Inspecting Content
Open https://angel.co/uber in your web browser (I recommend using Chrome).
Right-click and open your browser's inspector.
Hover your cursor over the description.
This example is pretty straightforward: you want the <h2> tag with the js-startup_high_concept class.
Thanks to that class, it's a unique location for our data.
Extracting Data
Let's dive right in with a bit of code:
from urllib import request
from bs4 import BeautifulSoup

# we'll get back to this
headers = {}
# the Uber company page you're about to scrape!
company_page = 'https://angel.co/uber'
# open the page
page_request = request.Request(company_page, headers=headers)
page = request.urlopen(page_request)
# parse the html using beautifulsoup
html_content = BeautifulSoup(page, 'html.parser')
# find the description and print it
description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
print(description)
Let's get into the details:
- We import urllib's request module and BeautifulSoup
- We create a variable headers (more on this very soon)
- The company_page variable is the page we're targeting
- Then we build our request: we pass the company_page and headers variables into the Request object and open the URL with that parameterized request
- We parse the HTML response with BeautifulSoup
- We look for our text content with the find() method
- We print our result!
Save this as script.py and run it in your shell with python script.py.
You should get the following:
urllib.error.HTTPError: HTTP Error 403: Forbidden
Oh :( What happened?
Well, it seems that AngelList has detected that we are a bot. Clever people!
Okay, so replace the headers variable with this one:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
Run the code with python script.py. Now it should be good:
<h2 class="js-startup_high_concept u-fontSize15 u-fontWeight400 u-colorGray3">
The better way to get there
</h2>
Yeah! Our first piece of data :D
Want to find the website? Easy:
# we extract the website
website = html_content.find('a', attrs={'class': 'company_url'})
print(website)
And you get:
<a class="u-uncoloredLink company_url" href="http://www.uber.com/" rel="nofollow noopener noreferrer" target="_blank">uber.com</a>
Ok, but how do I get the value of the website?
Easy. Tell the program to extract the href:
print(website['href'])
Make sure to use the strip() method, otherwise you'll end up with a lot of surrounding whitespace:
description = description.text.strip()
I won't cover in detail all the elements you could extract. If you're having issues, you can always check this amazing XPath cheatsheet.
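If you prefer CSS selectors over find(), BeautifulSoup also supports them through select_one(). Here's a minimal sketch that grabs the same description and website link, assuming the class names we inspected above:
# same extraction with CSS selectors (select_one ships with BeautifulSoup 4)
description_tag = html_content.select_one('h2.js-startup_high_concept')
website_tag = html_content.select_one('a.company_url')
if description_tag and website_tag:
    print(description_tag.text.strip())  # clean text, no surrounding whitespace
    print(website_tag['href'])  # the value of the href attribute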
Save results to CSV
Printing data is pretty useless on its own, right? We should definitely save it!
The Comma-Separated Values (CSV) format is the de facto standard for this purpose.
You can import it very easily in Excel or Google Sheets.
import csv
# open a csv with the append (a) parameter
with open('angel.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([description, website])
What you get is a single line of data. Since we told the program to append every result, new lines won't erase previous results.
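A CSV is also nicer to open in Excel or Google Sheets when it has column names. Here's a minimal sketch, assuming you only want to write the header row when the file is new or empty:
import csv
import os

# write the header only if the file doesn't exist yet or is empty
write_header = not os.path.exists('angel.csv') or os.path.getsize('angel.csv') == 0
with open('angel.csv', 'a', newline='') as csv_file:  # newline='' avoids blank lines on Windows
    writer = csv.writer(csv_file)
    if write_header:
        writer.writerow(['description', 'website'])
    writer.writerow([description, website])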
Check out the whole script
from urllib import request
from datetime import datetime
from bs4 import BeautifulSoup
import csv
# add the correct User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
# the company page you're about to scrape
company_page = 'https://angel.co/uber'
# open the page
page_request = request.Request(company_page, headers=headers)
page = request.urlopen(page_request)
# parse the html using beautiful soup
html_content = BeautifulSoup(page, 'html.parser')
# we parse the title
title = html_content.find('h1')
title = title.text.strip()
print(title)
# we parse the description
description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
description = description.text.strip()
print(description)
# we extract the website
website = html_content.find('a', attrs={'class': 'company_url'})
website = website['href'].strip()
print(website)
# open a csv with the append (a) parameter. We also save the date, which is always a good indicator.
with open('angel.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([title, description, website, datetime.now()])
Conclusion
It wasn't that hard, right?
We covered a very basic example. You could also scrape multiple pages and parse them inside a for loop, as sketched below.
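Here's a minimal sketch of what that loop could look like. The list of company slugs is made up for illustration, and the pause between requests is there to keep things polite:
import time
from urllib import request
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
# hypothetical list of company slugs -- replace it with the profiles you actually need
companies = ['uber', 'stripe', 'airbnb']

for slug in companies:
    page_request = request.Request('https://angel.co/' + slug, headers=headers)
    page = request.urlopen(page_request)
    html_content = BeautifulSoup(page, 'html.parser')
    description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
    if description:
        print(slug, description.text.strip())
    # wait a bit between requests: scrape responsibly!
    time.sleep(2)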
Remember how we got blocked by the website's security and resolved this by adding a custom User-Agent?
We wrote a small paper about anti-scraping techniques. It'll help you understand how websites try to block bots.
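One simple idea along those lines, sketched here rather than taken from the script above, is to rotate between a few User-Agent strings so your requests look less uniform:
import random

# a small, illustrative pool of User-Agent strings (the exact values are just examples)
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
]
# pick a different one for each request
headers = {'User-Agent': random.choice(user_agents)}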