Sunday, November 29, 2015

Box office plot (the other kind)

Tables can obfuscate understanding

We have a follow-up question on the Pandas read_html Wikipedia James Bond article: how do we quickly visualize the data?

It is true that the table in the previous article doesn't make the trend easy to see (that wasn't its purpose). Even trimming it down to just the Bond movie title and the box office normalized to 2005 dollars (in millions) doesn't help much, even with the minimum and maximum highlighted:


   | Title                                 | Box office.1
 1 | Dr. No                                | 448.8
 2 | From Russia with Love                 | 543.8
 3 | Goldfinger                            | 820.4
 4 | Thunderball                           | 848.1
 5 | You Only Live Twice                   | 514.2
 6 | On Her Majesty's Secret Service       | 291.5
 7 | Diamonds Are Forever                  | 442.5
 8 | Live and Let Die                      | 460.3
 9 | man with !The Man with the Golden Gun | 334.0
10 | spy who !The Spy Who Loved Me         | 533.0
11 | Moonraker                             | 535.0
12 | For Your Eyes Only                    | 449.4
13 | Octopussy                             | 373.8
14 | view !A View to a Kill                | 275.2
15 | living !The Living Daylights          | 313.5
16 | Licence to Kill                       | 250.9
17 | GoldenEye                             | 518.5
18 | Tomorrow Never Dies                   | 463.2
19 | world !The World Is Not Enough        | 439.5
20 | Die Another Day                       | 465.4
21 | Casino Royale                         | 581.5
22 | Quantum of Solace                     | 514.2
23 | Skyfall                               | 879.8
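For reference, a trimmed table like this one, and its extremes, could be produced along these lines. This is only a minimal sketch: the bond dataframe is a hypothetical stand-in holding four rows copied by hand from the scraped table, and the column names follow the "Box office.1" and "Budget.1" labels that pandas assigned there.

```python
import pandas as pd

# Hypothetical stand-in for the scraped dataframe: four rows copied by hand,
# values in 2005 million dollars.
bond = pd.DataFrame(
    {
        "Title": ["Dr. No", "Thunderball", "Licence to Kill", "Skyfall"],
        "Box office.1": [448.8, 848.1, 250.9, 879.8],
        "Budget.1": [7.0, 41.9, 56.7, 158.1],
    }
)

# Keep only the title and the normalized box office.
trimmed = bond[["Title", "Box office.1"]]

# idxmin / idxmax return the index label of the smallest and largest value.
lowest = trimmed.loc[trimmed["Box office.1"].idxmin()]
highest = trimmed.loc[trimmed["Box office.1"].idxmax()]

print("Lowest: ", lowest["Title"], lowest["Box office.1"])
print("Highest:", highest["Title"], highest["Box office.1"])
```

In a notebook, the Styler methods trimmed.style.highlight_min() and trimmed.style.highlight_max() can colour those two cells instead of printing them.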


Same data, only better

The absolute minimum to get pandas to plot a graph in a Jupyter notebook (I'm assuming you have enabled inline graphics with %matplotlib inline) is to call the dataframe's plot method, i.e. df.plot():


The result of df.plot() with no other options

It usually does the right thing, including a readable legend, and picking only the columns that have numbers or dates. We could also have done a bar graph, but in this case it would not have been as readable.

Now, let's add a few elements to this. For one thing, I'd like to not only plot a line graph (default) but to add a table directly under the graph. Since we'll have this table under the X axis, let's remove the tick values on X by using xticks=[].

I think we all can agree that a graph should have a title, so we will add that too. And finally, let's make it a bit larger.

ax = df.plot(table=True, xticks=[], title="Bond movies in 2005 dollars (million)", figsize=(17,11))

And how about adding the average value as a dotted horizontal line for the box office?

ax.hlines(y=df.mean().iloc[0], xmin=0, xmax=23, color='b', alpha=0.5, linestyle='dashed', label='Box office average')

Might as well do it for the budget too. So in all we have:

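# line graph with the data table under the X axis, no X ticks, a title and a larger figure
ax = df.plot(table=True, xticks=[], title="Bond movies in 2005 dollars (million)", figsize=(17,11))
# dashed horizontal lines at the mean of the first (box office) and second (budget) columns
ax.hlines(y=df.mean().iloc[0], xmin=0, xmax=23, color='b', alpha=0.5, linestyle='dashed', label='Box office average')
ax.hlines(y=df.mean().iloc[1], xmin=0, xmax=23, color='g', alpha=0.5, linestyle='dashed', label='Budget average')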

So, how does it look?
Final result
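As a hedged variation on the code above, here is a self-contained sketch that runs outside a notebook. It assumes matplotlib is installed and uses a small sample dataframe (four rows typed in from the Bond table, not the full data). It looks the averages up by column name rather than by position, and it calls ax.legend() again at the end: the legend is built when plot() runs, so lines added afterwards only show up in it once it is redrawn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import pandas as pd

# Sample data only: the first four films, in 2005 million dollars.
sample = pd.DataFrame(
    {
        "Box office": [448.8, 543.8, 820.4, 848.1],
        "Budget": [7.0, 12.6, 18.6, 41.9],
    },
    index=[1, 2, 3, 4],
)

ax = sample.plot(title="Bond movies in 2005 dollars (million)", figsize=(8, 5))

# One dashed horizontal line per column, drawn at that column's mean
# and spanning the plotted index range.
means = sample.mean()
for column, color in zip(sample.columns, ["b", "g"]):
    ax.hlines(
        y=means[column],
        xmin=sample.index.min(),
        xmax=sample.index.max(),
        color=color,
        alpha=0.5,
        linestyle="dashed",
        label=f"{column} average",
    )

# Rebuild the legend so the two average lines are listed too.
ax.legend()
ax.figure.savefig("bond_sample.png")
```

The output file name bond_sample.png is arbitrary; in a notebook with inline graphics the savefig call can simply be dropped.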


The Jupyter notebook can be found on GitHub: pandas_bond.ipynb

Francois Dion
@f_dion

Thursday, November 26, 2015

Bridging the digital divide, $5 at a time

Going in the right direction


About 3 years ago, I wrote a piece titled "Going in the wrong direction" (well worth your time, go ahead and read it). In it, I highlighted the issue of the high cost of computing for experimentation and innovation, particularly when it comes to students. This obviously has an impact on STEAM and school budgets too. I suggested that we'd see $20 and even sub-$20 computers very soon.

The Raspberry Pi had established itself as a great option for exactly this. In mid-2012, during a Python conference in North Carolina (USA), I demoed a simple project using a Raspberry Pi controlling a laser. Everybody in attendance was sold on the concept of a $35 computer.

The $35, $25, I mean $20 computer

What I imagined for the price trend

I had to revisit the original story at the beginning of 2015, because the price of each iteration of the Raspberry Pi entry-level model kept going down. It looked like sub-$20 was close, at least as I was picturing it in my mind. At the same time, the higher-spec model kept getting better (see my article on 3D Future Tech as to why that is possible).

I'll CHIP in $9


Earlier this year, a Kickstarter campaign introduced the CHIP, a $9 computer. According to http://getchip.com they will sell it this coming Monday for $8!

How low can you go?

Meet the $5 #pizero


The Raspberry Pi Foundation is now selling a $5 version of the Raspberry Pi. It is half the size of the Model A+ and a quarter of the price...

And yet another price model that totally disrupts the field. Just look at that:


So now we've reached the price level where distribution and shipping costs matter more than the cost of the computer itself. This is the next problem to solve in bridging the digital divide.

We live in interesting times...

Francois Dion
@f_dion

Wednesday, November 25, 2015

Bond. James "import pandas" Bond

It all started when...

    [friend] I'm trying to get this table on wikipedia from python...

[me] Sure. What module are you using?

    [friend] BeautifulSoup, but man, this is hard. It's this url...

[me] Wait, this is not a Coursera assignment you are asking me to do, is it?

    [friend] No, no. I saw this thing using a different programming language and I want to do it in Python.

[me] Ok, sounds reasonable.

The URL

The basic URL that documents James Bond movies on Wikipedia is https://en.wikipedia.org/wiki/List_of_James_Bond_films but the URL he sent me was https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363, which is why it looked like an assignment.


Let me pause for a brief second on this subject. I'm a big fan of reproducible research, and selecting a specific revision of a document is an excellent idea. This revision will never change, whereas the page behind a normal Wikipedia URL can change at any time.
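To make the pinning explicit, such a revision URL can be assembled from the page title and the revision id. This is a small sketch using only the standard library; the helper name revision_url is made up for this example.

```python
from urllib.parse import urlencode

WIKI_INDEX = "https://en.wikipedia.org/w/index.php"


def revision_url(title, oldid):
    """Return the URL of one fixed revision of a Wikipedia page."""
    return WIKI_INDEX + "?" + urlencode({"title": title, "oldid": oldid})


# The revision my friend sent me.
pinned = revision_url("List_of_James_Bond_films", 688916363)
print(pinned)
```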

I'll have some of that BeautifulSoup

My friend mentioned he was trying to use BeautifulSoup but was facing some challenges. BeautifulSoup and lxml are the usual suspects when it comes to web scraping (with requests to pull the data in). But I have to admit, most of the time I don't use any of these. You see, I'm lazy, and typically these solutions require too much work. If you want to see what I'm talking about, you can check using-python-beautifulsoup-to-scrape-a-wikipedia-table

I don't like to type more code than I need to. At any rate, the goal was to get the web page, parse two tables and then load the data into a pandas dataframe to do further analysis, plots etc.

Enter the Pandas

And it's not even the Kung Fu Panda, just good old Pandas, the data wrangling tool par excellence (https://pypi.python.org/pypi/pandas/0.17.1).

Everybody knows, I hope, that it has superb support for loading Excel and CSV files. It's why Python is the number 1 data wrangling programming language.

But what about loading tables from Wikipedia web pages? Surely there is nothing that can simplify this, is there? If you've attended all the PYPTUG meetings, you already know the answer.

 import pandas as pd  
   
 wiki_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363", header=0)  

read_html returns a list of dataframes, one per table found on the web page. To access the box office table on this page, we look at the second dataframe, the first being the warning table at the top of the page. Since the list is zero-indexed, we refer to it as wiki_df[1]. We don't want row 0 because it holds the sub-headers, and we don't want the last two rows: one is a movie that has just been released, so the numbers are not in yet, and the other is a totals row. How do we do this? Good old Python slices:

 df = wiki_df[1][1:24]  

And that's it, seriously. One line to ingest, one line to clean up.
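To try the same two steps without hitting the network, here is a toy sketch. The HTML below is made up for illustration (a notice table followed by a data table with a sub-header row and a total row), and it assumes an HTML parser that read_html can use, such as lxml, is installed. It also shows one thing to watch for: in this toy table the sub-header row makes the numeric column come in as text, so it has to be converted before doing any math. I'm assuming the real page behaves similarly, which is worth checking with df.dtypes.

```python
from io import StringIO

import pandas as pd

# Made-up page: a notice table, then the table we actually want.
html = """
<table>
  <tr><th>Notice</th></tr>
  <tr><td>This is an old revision of this page</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Box office</th></tr>
  <tr><td>Title</td><td>Actual $</td></tr>
  <tr><td>Dr. No</td><td>59.5</td></tr>
  <tr><td>From Russia with Love</td><td>78.9</td></tr>
  <tr><td>Total</td><td>138.4</td></tr>
</table>
"""

# One dataframe per <table>; header=0 uses the first row as column names.
tables = pd.read_html(StringIO(html), header=0)

# Second table, minus the sub-header row (row 0) and the total row (row 3).
films = tables[1][1:3]

# The sub-header text kept this column from being parsed as numbers.
box_office = films["Box office"].astype(float)

print(films)
print(box_office.sum())
```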

The result

   | Title | Year | Bond actor | Director | Box office | Budget | Salary of Bond actor | Box office.1 | Budget.1 | Salary of Bond actor.1
1 | Dr. No | 1962 | Connery, SeanSean Connery | Young, TerenceTerence Young | 59.5 | 1.1 | 0.1 | 448.8 | 7.0 | 0.6
2 | From Russia with Love | 1963 | Connery, SeanSean Connery | Young, TerenceTerence Young | 78.9 | 2.0 | 0.3 | 543.8 | 12.6 | 1.6
3 | Goldfinger | 1964 | Connery, SeanSean Connery | Hamilton, GuyGuy Hamilton | 124.9 | 3.0 | 0.5 | 820.4 | 18.6 | 3.2
4 | Thunderball | 1965 | Connery, SeanSean Connery | Young, TerenceTerence Young | 141.2 | 6.8 | 0.8 | 848.1 | 41.9 | 4.7
5 | You Only Live Twice | 1967 | Connery, SeanSean Connery | Gilbert, LewisLewis Gilbert | 101.0 | 10.3 | 0.8 + 25% net merch royalty | 514.2 | 59.9 | 4.4 excluding profit participation
6 | On Her Majesty's Secret Service | 1969 | Lazenby, GeorgeGeorge Lazenby | Hunt, Peter R.Peter R. Hunt | 64.6 | 7.0 | 0.1 | 291.5 | 37.3 | 0.6
7 | Diamonds Are Forever | 1971 | Connery, SeanSean Connery | Hamilton, GuyGuy Hamilton | 116.0 | 7.2 | 1.2 + 12.5% of gross (14.5) | 442.5 | 34.7 | 5.8 excluding profit participation
8 | Live and Let Die | 1973 | Moore, RogerRoger Moore | Hamilton, GuyGuy Hamilton | 126.4 | 7.0 | n/a | 460.3 | 30.8 | n/a
9 | man with !The Man with the Golden Gun | 1974 | Moore, RogerRoger Moore | Hamilton, GuyGuy Hamilton | 98.5 | 7.0 | n/a | 334.0 | 27.7 | n/a
10 | spy who !The Spy Who Loved Me | 1977 | Moore, RogerRoger Moore | Gilbert, LewisLewis Gilbert | 185.4 | 14.0 | n/a | 533.0 | 45.1 | n/a
11 | Moonraker | 1979 | Moore, RogerRoger Moore | Gilbert, LewisLewis Gilbert | 210.3 | 34.0 | n/a | 535.0 | 91.5 | n/a
12 | For Your Eyes Only | 1981 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 194.9 | 28.0 | n/a | 449.4 | 60.2 | n/a
13 | Octopussy | 1983 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 183.7 | 27.5 | 4.0 | 373.8 | 53.9 | 7.8
14 | view !A View to a Kill | 1985 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 152.4 | 30.0 | 5.0 | 275.2 | 54.5 | 9.1
15 | living !The Living Daylights | 1987 | Dalton, TimothyTimothy Dalton | Glen, JohnJohn Glen | 191.2 | 40.0 | 3.0 | 313.5 | 68.8 | 5.2
16 | Licence to Kill | 1989 | Dalton, TimothyTimothy Dalton | Glen, JohnJohn Glen | 156.2 | 36.0 | 5.0 | 250.9 | 56.7 | 7.9
17 | GoldenEye | 1995 | Brosnan, PiercePierce Brosnan | Campbell, MartinMartin Campbell | 351.9 | 60.0 | 4.0 | 518.5 | 76.9 | 5.1
18 | Tomorrow Never Dies | 1997 | Brosnan, PiercePierce Brosnan | Spottiswoode, RogerRoger Spottiswoode | 338.9 | 110.0 | 8.2 | 463.2 | 133.9 | 10.0
19 | world !The World Is Not Enough | 1999 | Brosnan, PiercePierce Brosnan | Apted, MichaelMichael Apted | 361.8 | 135.0 | 12.4 | 439.5 | 158.3 | 13.5
20 | Die Another Day | 2002 | Brosnan, PiercePierce Brosnan | Tamahori, LeeLee Tamahori | 431.9 | 142.0 | 16.5 | 465.4 | 154.2 | 17.9
21 | Casino Royale | 2006 | Craig, DanielDaniel Craig | Campbell, MartinMartin Campbell | 594.2 | 150.0 | 3.4 | 581.5 | 145.3 | 3.3
22 | Quantum of Solace | 2008 | Craig, DanielDaniel Craig | Forster, MarcMarc Forster | 576.0 | 200.0 | 8.9 | 514.2 | 181.4 | 8.1
23 | Skyfall | 2012 | Craig, DanielDaniel Craig | Mendes, SamSam Mendes | 1108.6[20] | 150.0[21][22]—200.0[20] | 17.0[23] | 879.8 | 158.1 | 13.5


Francois Dion
@f_dion