Extracting data from the web using Python

Whether you are working on data science or machine learning projects, you are probably going to need to extract data from the web at some point. So how do we actually pull stuff out of the web?

In this article, we’re going to cover the basics you need to automatically access and extract data from the web using Python.

Prerequisites:

Before we begin, please set up the Python environment on your machine. Head over to the official Python page here to install it if you have not done so.

We will also be installing the Beautiful Soup module in our virtual environment later; urllib already ships with Python’s standard library.

The urllib.request module allows us to easily make HTTP requests, while Beautiful Soup makes parsing the retrieved HTML much easier for us.

Regular expressions using Python:

https://xkcd.com/208/

Regular expressions are a very specialized language that lets us succinctly search strings and extract data from them. They are a language unto themselves; it is not essential to know how to use them, but they can be quite useful and powerful.

Basic patterns:

Here are the most basic patterns, which match single characters (a quick demo follows the list):

  • a, X, 9, < — ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
  • . (a period) — matches any single character except newline ‘\n’
  • \w — (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b — boundary between word and non-word
  • \s — (lowercase s) matches a single whitespace character — space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r — tab, newline, return
  • \d — decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end — match the start or end of the string
  • \ — inhibit the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as ‘@’, you can put a backslash in front of it, \@, to make sure it is treated just as a character.
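Here is a quick sketch of a few of these patterns in action (the sample strings are made up for illustration):

```python
import re

# \d+ matches one or more digits
print(re.search(r'\d+', 'Highway 61').group())             # '61'

# \w+ matches a run of word characters (letters, digits, underscore)
print(re.search(r'\w+', '  hello world').group())          # 'hello'

# ^ anchors the pattern to the start of the string
print(re.search(r'^From:', 'From: csev@umich.edu'))        # a match object (truthy)

# \. matches a literal period
print(re.search(r'\d\.\d', 'version 3.9 is out').group())  # '3.9'
```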

re.search() vs find() vs startswith() vs re.findall()

re.search() : This method takes a regular expression pattern and a string and searches for that pattern within the string.

find(): This string method returns the lowest index at which a substring occurs in the string (the search can be restricted with optional start and end indexes), or -1 if it is not found.

startswith(): This string method returns True if the string starts with the given prefix, and False otherwise.

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.
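A small side-by-side sketch (the sample line is a made-up mail header in the py4e style):

```python
import re

line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# re.search() returns a match object for the first occurrence, or None
match = re.search(r'\S+@\S+', line)
if match:
    print(match.group())              # stephen.marquard@uct.ac.za

# str.find() returns the index of a substring, or -1 if it is absent
print(line.find('@'))                 # 21

# str.startswith() returns True or False
print(line.startswith('From'))        # True

# re.findall() returns every match as a list of strings
print(re.findall(r'\S+@\S+', line))   # ['stephen.marquard@uct.ac.za']
```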

Greedy vs non-greedy extraction:

There is an extension to regular expressions where you add a ? after a repetition, as in .*? or .+?, changing it to be non-greedy: it now stops matching as soon as it can.
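For example, on a made-up snippet of HTML:

```python
import re

html = '<p>First</p><p>Second</p>'

# Greedy: .* matches as much as it can, swallowing both paragraphs
print(re.findall(r'<p>.*</p>', html))    # ['<p>First</p><p>Second</p>']

# Non-greedy: .*? stops at the first closing tag it reaches
print(re.findall(r'<p>.*?</p>', html))   # ['<p>First</p>', '<p>Second</p>']
```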

For more information about using regular expressions in Python, see the official Python documentation here, and you can download a RegEx “cheat sheet” right here.

Networks and sockets:

Instead of just talking to a disk drive, we’re going to go right outside the computer and talk across the Internet, on the web.

Python’s awesome. Say you want to use the millions of lines of code in the link, network, and transport layers of your computer, as well as the entire Internet, to talk to some server on the other side.

All that complexity boils down to three lines of Python. That’s it.

A pair (host, port) is used for the AF_INET address family, where host is a string representing either a hostname in Internet domain notation like 'daring.cwi.nl' or an IPv4 address like '100.50.200.5', and port is an integer.

SOCK_STREAM represents the socket type. More constants may be available depending on the system. (Only SOCK_STREAM and SOCK_DGRAM appear to be generally useful.)
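The original post showed this code as a screenshot; here is a sketch along the lines of the py4e socket example. The host data.pr4e.org comes from the text below, but the exact document path (romeo.txt) is an assumption:

```python
import socket

# AF_INET + SOCK_STREAM = a TCP connection addressed by (host, port)
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

# Write the HTTP request by hand; encode() turns the Unicode string
# into the UTF-8 bytes that actually travel over the wire
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

# Read the response 512 bytes at a time until the server closes
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()
```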

What we’re doing here is we’re simulating what is going to happen in a web browser and the cool thing about the HTTP protocol is that we can do this by hand. This is going to go to data.pr4e.org and retrieve a document.

It will make the connection, and if port 80 is listening, we send the HTTP command. The only other thing that’s a bit odd here is the encode() call: strings inside Python are Unicode, and we have to send them out as what’s called UTF-8; the encode method converts from Python’s internal Unicode to UTF-8.

Output:
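The exact bytes depend on the server, but the response looks roughly like this (illustrative, with values elided):

```
HTTP/1.1 200 OK
Date: ...
Server: ...
Content-Length: ...
Content-Type: text/plain

(the document text follows the blank line)
```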

All the lines before the extracted text are what we call the headers, which contain the metadata of the response; you’ll get these exact same headers in Python, with Telnet, or in the browser’s developer console.

And now we successfully opened a socket, sent a command, and then retrieved the data from the web.

Surfing the web:

https://xkcd.com/353/

So it might be amazing that we can use sockets to write a ten-line program that retrieves a web page, but, hey, this is Python, we’re trying to make this as easy as possible.

So there’s another library that wraps that socket stuff and does it for us automatically, called urllib.

Urllib:

In the code below, we import the urllib.request module and then call its urlopen() method.

A plain old URL suffices. urlopen() automatically parses the URL, figures out which server to talk to, which document to retrieve, which HTTP version to use, the GET request, and all that stuff, and it returns us a file handle.
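The snippet was an image in the original post; here is a sketch in the same spirit (the sample URL is an assumption):

```python
import urllib.request

# urlopen() gives us an object we can loop over like a file handle
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

# Count how often each word appears in the document
counts = dict()
for line in fhand:
    words = line.decode().split()   # bytes -> str, then split into words
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
```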

The code above lets us open the URL like a file, read through all the words in it, and count them up.

Now we don’t have to just read text files, we can read HTML files.

Just tell it, go give me that HTML file, and then write a loop.
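Something like this (the page URL is a placeholder):

```python
import urllib.request

# Retrieve an HTML page and print it line by line
fhand = urllib.request.urlopen('http://data.pr4e.org/page1.htm')
for line in fhand:
    print(line.decode().strip())
```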

So now we have basically built a web browser in four lines of code, and it will print out the content of the web page.

Parsing web pages:

What scraping really is, is when we write a Python program that fakes or pretends to be a browser as we just did. It retrieves the web pages, it looks at them, and maybe extracts some information.

Now, why might you scrape? Well, you might scrape data that you can’t get any other way. Now, you’ve got to be careful, because it might not be legal to do this.

Beautiful Soup:

BeautifulSoup is a Python library that lets us go through HTML code with ease, because it turns out that HTML in the wild is so ugly and inconsistent that things like regular expressions don’t always work very well on it.

I think the name is a tongue-in-cheek nod to what a mess HTML is. Instead of calling it an HTML super-parser, they called it something silly, because it’s kind of a silly problem: HTML on the web is just so bad.

The code snippet below is a simple use of the BeautifulSoup library to retrieve and parse any HTML page and pull out anchor tags, which is really sort of the beginning of a web crawler.
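A sketch of that snippet, assuming Beautiful Soup 4 (the bs4 package) is installed; the URL is a placeholder for whatever page you want to crawl:

```python
import urllib.request
from bs4 import BeautifulSoup

url = 'http://data.pr4e.org/page1.htm'   # placeholder URL
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags and print their href attributes
for tag in soup('a'):
    print(tag.get('href', None))
```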

SSL certificates:

SSL stands for Secure Sockets Layer and is designed to create a secure connection between client and server.

The client connects to the server over SSL, fetches its certificate, and checks that the certificate is valid (properly signed) and belongs to that server (the server name matches).

To keep things simple, we will disable certificate verification when connecting via HTTPS.
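A minimal sketch of how that is typically done with Python’s ssl module (fine for learning, not for production):

```python
import ssl
import urllib.request

# Build an SSL context that skips certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Pass the context so HTTPS URLs open without certificate errors
url = 'https://data.pr4e.org/romeo.txt'   # placeholder URL
html = urllib.request.urlopen(url, context=ctx).read()
print(html.decode())
```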

In surprisingly few lines of code you can do the hard part of a web crawler; BeautifulSoup really simplifies it.

And that’s one of the things people really like about Python: you can have a basic web crawler working in less than 30 minutes.

The built-in support for HTML, HTTP, and sockets is one of the language’s most charming features.

Conclusion:

In this article, we learned how to use Python to access web pages, read them, and extract specific data using urllib and BeautifulSoup.

I hope this was useful to those who, like me, have just started learning about web scraping.

Do share your thoughts and questions; I welcome feedback and constructive criticism regarding this article. Happy learning!

References and further reading:

“Python for Everybody: Exploring Data In Python 3”

https://www.py4e.com/

https://www.coursera.org/learn/python-network-data

