Lab 3: Tables
Welcome to lab 3! This week, we'll learn about tables, which let us work with multiple arrays of data about the
same things. Tables are described in Chapter 6 (http://www.inferentialthinking.com/chapters/06/tables.html) of
the text.
First, set up the tests and imports by running the cell below.
In [2]: import numpy as np
from datascience import *
# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('lab03.ok')
_ = ok.auth(inline=True)
1. Introduction
For a collection of things in the world, an array is useful for describing a single attribute of each thing. For
example, among the collection of US States, an array could describe the land area of each. Tables extend this
idea by describing multiple attributes for each element of a collection.
In most data science applications, we have data about many entities, but we also have several kinds of data
about each entity.
For example, in the cell below we have two arrays. The first one contains the world population in each year (as
estimated (http://www.census.gov/population/international/data/worldpop/table_population.php) by the US
Census Bureau), and the second contains the years themselves (in order, so the first elements in the population
and the years arrays correspond).
In [3]: population_amounts = Table.read_table("world_population.csv").column("Population")
years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)
Population column: [2557628654 2594939877 2636772306 2682053389 2730228104 2782098943
2835299673 2891349717 2948137248 3000716593 3043001508 3083966929
3140093217 3209827882 3281201306 3350425793 3420677923 3490333715
3562313822 3637159050 3712697742 3790326948 3866568653 3942096442
4016608813 4089083233 4160185010 4232084578 4304105753 4379013942
4451362735 4534410125 4614566561 4695736743 4774569391 4856462699
4940571232 5027200492 5114557167 5201440110 5288955934 5371585922
5456136278 5538268316 5618682132 5699202985 5779440593 5857972543
5935213248 6012074922 6088571383 6165219247 6242016348 6318590956
6395699509 6473044732 6551263534 6629913759 6709049780 6788214394
6866332358 6944055583 7022349283 7101027895 7178722893 7256490011]
Years column: [1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015]
Suppose we want to answer this question:
When did world population cross 6 billion?
You could answer this question just by staring at the arrays, but it's a bit convoluted: you would have to
count the position where the population first crossed 6 billion, then find the element at the same position
in the years array. In cases like these, it might be easier to put the data into a Table, a 2-dimensional
type of dataset.
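The array-only approach described above can be sketched as follows. This is only an illustration under stated assumptions: the names years_excerpt and population_excerpt are made up for this sketch, and the values are the 1996 through 2001 entries copied from the output above rather than the full arrays.

```python
import numpy as np

# Hypothetical excerpt: the 1996-2001 entries copied from the printed arrays.
years_excerpt = np.arange(1996, 2001 + 1)
population_excerpt = np.array([
    5779440593, 5857972543, 5935213248,
    6012074922, 6088571383, 6165219247,
])

# True at every position where the population exceeds 6 billion.
crossed = population_excerpt > 6e9

# np.argmax returns the position of the first True value.
# (It also returns 0 when no value is True, so this relies on a crossing existing.)
first_position = int(np.argmax(crossed))

# Look up the year stored at the same position in the other array.
first_year = int(years_excerpt[first_position])
print(first_year)  # prints 1999
```

The two arrays are connected only by position, so any reordering of one without the other would silently break the lookup.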
The expression below:
creates an empty table using the expression Table(),
adds two columns by calling with_columns with four arguments,
assigns the result to the name population, and finally
evaluates population so that we can see the table.
The strings "Year" and "Population" are column labels that we have chosen. The names
population_amounts and years were assigned above to two arrays of the same length. The function
with_columns (you can find the documentation here (http://data8.org/datascience/tables.html)) takes in
alternating strings (to represent column labels) and arrays (representing the data in those columns), which are all
separated by commas.
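To see why four arguments produce two columns, here is a plain-Python sketch of how alternating labels and arrays pair up. The helper pair_labels_with_columns is hypothetical and written only for this illustration; it is not how the datascience library implements with_columns, and it uses short lists in place of the real arrays.

```python
def pair_labels_with_columns(*labels_and_values):
    """Hypothetical helper: group alternating (label, values) arguments."""
    if len(labels_and_values) % 2 != 0:
        raise ValueError("expected alternating labels and arrays")
    labels = labels_and_values[0::2]   # arguments 1, 3, ...: the column labels
    columns = labels_and_values[1::2]  # arguments 2, 4, ...: the column data
    # Every column must describe the same number of rows.
    if len({len(values) for values in columns}) > 1:
        raise ValueError("all columns must have the same length")
    return dict(zip(labels, columns))


# Four arguments: two labels, each followed by its data.
columns = pair_labels_with_columns(
    "Population", [2557628654, 2594939877, 2636772306],
    "Year", [1950, 1951, 1952],
)
print(list(columns))  # prints ['Population', 'Year']
```

Each label is matched with the argument that immediately follows it, which is why the order of the arguments matters.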
In [4]: population = Table().with_columns(
"Population", population_amounts,
"Year", years
)
population
Out[4]: Population Year
2557628654 1950
2594939877 1951
2636772306 1952
2682053389 1953
2730228104 1954
2782098943 1955
2835299673 1956
2891349717 1957
2948137248 1958
3000716593 1959
... (56 rows omitted)
Now the data are all together in a single table! They are much easier to read: if you need to know what
the population was in 1959, for example, you can tell at a single glance. We'll revisit this table later.
2. Creating Tables
Question 2.1. In the cell below, we've created 2 arrays. Using the steps above, assign top_10_movies to a
table that has two columns called "Rating" and "Name", which hold top_10_movie_ratings and
top_10_movie_names respectively.