Web Scraping with Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

A Gentle Introduction to the Overall Steps

HTML tags, enclosed in angle brackets, provide structural information (and sometimes formatting). We probably don't care about the tags in and of themselves, but they are useful for selecting only the content relevant to our needs.


  1. Parse an HTML or XML file, or a web page, to generate a nested tree data structure.

  2. Loop through the structure and filter out the information you need.

  3. One common task is to extract all the URLs found within a page's <a> tags, as in the university list example below.

  4. If needed, use Firebug or other browser tools to locate the desired information in the HTML source code.

  5. Enjoy your Beautiful Soup.
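
The steps above can be sketched roughly as follows. This is a minimal illustration, not a script from the original post: the sample HTML, the example.edu URLs, and the helper name extract_links are all made up for the demonstration, and the built-in "html.parser" is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; a real script would download or read the HTML instead.
SAMPLE_HTML = """
<html><body>
  <h1>Universities</h1>
  <ul id="schools">
    <li><a href="http://example.edu/a">Alpha University</a></li>
    <li><a href="http://example.edu/b">Beta College</a></li>
    <li><a name="no-href">Anchor without a link</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    # Step 1: parse the markup into a nested tree.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Steps 2 and 3: walk the tree and keep only <a> tags that carry an href.
    for a in soup.find_all("a", href=True):
        links.append((a.get_text(strip=True), a["href"]))
    return links


links = extract_links(SAMPLE_HTML)
for text, href in links:
    print(text, "->", href)
```

Passing href=True to find_all skips anchors that have no link target, which avoids a KeyError when reading a["href"].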

Installation and Setting up

I had some problems with the installation. Here is what worked for me.

If you get this error: ImportError: bad magic number in 'urllib': b'\x03\xf3\r\n', here is the solution I found:

The error is caused by a stale .pyc file. Look in the directory that holds your scripts: you may have run them under two different versions of Python (for example 2.x and 3.x). Delete the .pyc file that causes the problem (rm -f) and the error goes away.
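
If you would rather do the cleanup from Python than with rm -f, something like the sketch below should work. The function name remove_stale_pyc is my own invention, and it assumes you are happy to delete every .pyc file under the chosen directory (Python simply recompiles them on the next import).

```python
from pathlib import Path


def remove_stale_pyc(root):
    """Delete every *.pyc file under root and return the paths that were removed."""
    removed = []
    # Collect the matches first so the tree is not modified while it is being walked.
    for pyc in list(Path(root).rglob("*.pyc")):
        if pyc.is_file():
            pyc.unlink()
            removed.append(pyc)
    return removed
```

Calling remove_stale_pyc(".") from the directory that holds your scripts clears the compiled files there and in its subdirectories.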

Make some Beautiful Soup. Enjoy!

How Many Universities in the USA?

The following script is adapted from a source whose origin I can no longer recall.

So, there are 2163 universities in the USA as of this moment, 11/18/2015 12:07:18 AM.
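
For readers who want to try a count like this themselves, here is a rough sketch of the idea. It is not the script that produced the number above: the markup, the "universities" id, and the helper names fetch_html and count_linked_items are hypothetical stand-ins, and the real listing page will be structured differently.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (Beautiful Soup detects the encoding)."""
    with urlopen(url) as response:
        return response.read()


def count_linked_items(html, list_id):
    """Count the <li> entries that contain a link inside the element with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=list_id)
    if container is None:
        return 0  # the page has no element with that id
    return sum(1 for li in container.find_all("li") if li.find("a", href=True))


# Hypothetical markup standing in for the real listing page.
SAMPLE = """
<ul id="universities">
  <li><a href="/u/1">First University</a></li>
  <li><a href="/u/2">Second University</a></li>
  <li>Section header, not a school</li>
</ul>
<ul id="footer"><li><a href="/about">About</a></li></ul>
"""

print(count_linked_items(SAMPLE, "universities"))
```

Restricting the search to one container keeps navigation and footer links out of the count. For a live page, pass the result of fetch_html(url) as the html argument.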

Local Climate for the Last 100 Years?

Weather Underground is a wonderful site, rich in climate information.

  1. The following script can be used to look back at historic climate fluctuations (temperature only here).

  2. Some Python code using matplotlib to plot it.

  3. And here is the plot, with quite a few bad data points (I think).
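
A sketch of how the three steps might fit together is shown below. Everything specific in it is an assumption on my part: the table markup is invented and is not the actual Weather Underground page, the temperatures are made-up values assumed to be in Fahrenheit, and the plausibility limits in drop_bad_points are arbitrary bounds chosen only to show one way of screening out bad data points.

```python
from bs4 import BeautifulSoup

# Hypothetical table; not the actual Weather Underground markup or data.
SAMPLE_TABLE = """
<table id="history">
  <tr><th>Year</th><th>Mean Temp (F)</th></tr>
  <tr><td>1915</td><td>51.2</td></tr>
  <tr><td>1916</td><td>-9999</td></tr>
  <tr><td>1917</td><td>49.8</td></tr>
  <tr><td>1918</td><td></td></tr>
</table>
"""


def parse_temperatures(html):
    """Return (year, temperature) pairs from every two-cell data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 2:
            continue  # header row (only <th> cells) or malformed row
        try:
            rows.append((int(cells[0]), float(cells[1])))
        except ValueError:
            continue  # empty or non-numeric cell
    return rows


def drop_bad_points(rows, low=-80.0, high=130.0):
    """Keep only readings inside an assumed plausible range (Fahrenheit)."""
    return [(year, temp) for year, temp in rows if low <= temp <= high]


def plot_temperatures(rows):
    """Plot temperature against year with matplotlib."""
    import matplotlib.pyplot as plt  # imported here so parsing works without matplotlib

    years = [year for year, _ in rows]
    temps = [temp for _, temp in rows]
    plt.plot(years, temps, marker="o")
    plt.xlabel("Year")
    plt.ylabel("Mean temperature (F)")
    plt.show()


rows = parse_temperatures(SAMPLE_TABLE)
clean = drop_bad_points(rows)
print(clean)
```

Calling plot_temperatures(clean) draws the filtered series. Plotting rows instead shows how a single sentinel value such as -9999 flattens the rest of the curve.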

