
Ian London

Data scientist at Metis, NYC.


Requirements:

You need to have selenium installed first: run pip install selenium in the terminal, or run !pip install selenium from within iPython.
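To confirm the install worked before going any further, a quick check along these lines may help. This is only a sketch: is_installed is a hypothetical helper name (not part of selenium), and it relies on Python 3's importlib.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Remind ourselves what to run if the package is not there yet
if not is_installed('selenium'):
    print('selenium is missing: run pip install selenium')
```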

Getting an HTML selection with selenium

First, set up a Firefox webdriver and point it to our URL of interest.

from selenium import webdriver

driver = webdriver.Firefox()

driver.get('https://en.wikipedia.org/wiki/International_Space_Station')

Let’s select the first <table> element from Wikipedia’s ISS article as an example.

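# find_element_by_xpath was removed in Selenium 4.3; find_element('xpath', ...) works in old and new versions
iss_table = driver.find_element('xpath', '//table')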

Now we want to see if our XPath selector got us what we were looking for.

We can look at the raw HTML of that first <table> and see if it’s what we wanted. To get the raw HTML of a selected element, we can get its outerHTML attribute:

iss_table_html = iss_table.get_attribute('outerHTML')

# print the first and last 200 characters (this form runs on Python 2 and 3)
print(iss_table_html[:200])
print('\n. . .\n')
print(iss_table_html[-200:])
<table class="infobox" style="font-size:88%; width:22em; text-align:left">
<caption>International Space Station</caption>
<tbody><tr>
<td colspan="3" style="text-align:center;"><a href="/wiki/File:Int

. . .

ex.php?title=International_Space_Station&amp;action=edit">[update]</a></sup><br>
(<a href="/wiki/Exploded_view" title="Exploded view" class="mw-redirect">exploded view</a>)</td>
</tr>
</tbody></table>
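If you find yourself peeking at the head and tail of many selections, the slicing above could be wrapped in a small helper. The following is a minimal sketch; preview_html and its parameters are hypothetical names chosen for illustration, and the sample string is made up to stand in for iss_table_html.

```python
def preview_html(html, n=200, separator='\n. . .\n'):
    """Return the first and last n characters of html, joined by separator.

    Strings of length 2 * n or less are returned whole, so the head and
    the tail never overlap.
    """
    if len(html) <= 2 * n:
        return html
    # html[len(html) - n:] (rather than html[-n:]) stays empty when n is 0
    return html[:n] + separator + html[len(html) - n:]


# Made-up HTML standing in for a real selection such as iss_table_html
sample = '<table>' + '<tr><td>x</td></tr>' * 50 + '</table>'
print(preview_html(sample, n=20))
```

With the real selection, this would be called as print(preview_html(iss_table_html)).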

Rendering the selected HTML in the notebook

Reading raw HTML isn’t very nice.

Let’s take advantage of some iPython Notebook magic: since we’re viewing the notebook in a web browser, we can also render HTML content directly in the notebook.

We lose whatever CSS styling was in the scraped website, as well as anything loaded from relative links, but we can still see the general structure, which is often all we want anyway.

This can make it much easier to see what our XPath selectors are actually pulling from the site. Is it what we intended? Scraping HTML is a messy business and selectors often surprise you, so it’s nice to be able to get visual feedback.

Here is the same table as above, rendered in HTML in the iPython notebook. Relative links won’t work, but in the example below the image of the ISS shows up correctly because its src is an absolute link.

# for ipython notebook display
from IPython.display import display, HTML

display(HTML(iss_table_html))
International Space Station
A rearward view of the International Space Station backdropped by the limb of the Earth. In view are the station's four large, gold-coloured solar array wings, two on either side of the station, mounted to a central truss structure. Further along the truss are six large, white radiators, three next to each pair of arrays. In between the solar arrays and radiators is a cluster of pressurised modules arranged in an elongated T shape, also attached to the truss. A set of blue solar arrays are mounted to the module at the aft end of the cluster.
The International Space Station on 23 May 2010 as seen from the departing Space Shuttle Atlantis during STS-132.
Station statistics
COSPAR ID 1998-067A
Call sign Alpha, Station
Crew Fully crewed: 6
Currently aboard: 6
(Expedition 47)
Launch 20 November 1998
Launch pad Baikonur 1/5 and 81/23
Kennedy LC-39
Mass Appx. 419,455 kg (924,740 lb)[1]
Length 72.8 m (239 ft)
Width 108.5 m (356 ft)
Height c. 20 m (c. 66 ft)
nadir–zenith, arrays forward–aft
(27 November 2009)[dated info]
Pressurised volume 916 m3 (32,300 cu ft)
(3 November 2014)
Atmospheric pressure 101.3 kPa (29.91 inHg, 1 atm)
Perigee 409 km (254 mi) AMSL[2]
Apogee 416 km (258 mi) AMSL[2]
Orbital inclination 51.65 degrees[2]
Average speed 7.66 kilometres per second (27,600 km/h; 17,100 mph)[2]
Orbital period 92.69 minutes[2]
Orbit epoch 25 January 2015[2]
Days in orbit 6353
(12 April)
Days occupied 5640
(12 April)
Number of orbits 95912[2]
Orbital decay 2 km/month
Statistics as of 9 March 2011
(unless noted otherwise)
References: [1][2][3][4][5][6]
Configuration
The components of the ISS in an exploded diagram, with modules on-orbit highlighted in orange, and those still awaiting launch in blue or pink
Station elements as of May 2015
(exploded view)

Much nicer!
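As noted above, relative links stop working once the HTML is rendered outside its original page. If they matter, one possible workaround is to rewrite them against the page URL before rendering. The sketch below uses only the Python 3 standard library. absolutize_links is a hypothetical helper, and it assumes attribute values are double-quoted (which is how browsers serialize outerHTML). It only touches href and src, so srcset and links inside CSS are left alone, and it would also rewrite any literal href="..." that appears in visible text.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." but not e.g. data-href="..."
_LINK_ATTR = re.compile(r'((?<![\w-])(?:href|src)=")([^"]*)(")')


def absolutize_links(html, base_url):
    """Return html with href and src values resolved against base_url."""
    def _fix(match):
        # urljoin leaves absolute URLs as they are and resolves the rest
        return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)
    return _LINK_ATTR.sub(_fix, html)


page_url = 'https://en.wikipedia.org/wiki/International_Space_Station'
snippet = '<a href="/wiki/Exploded_view" title="Exploded view">exploded view</a>'
print(absolutize_links(snippet, page_url))
```

In the notebook, the result of absolutize_links(iss_table_html, page_url) would be passed to display(HTML(...)) in place of the raw string.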