Databases
Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.
Error on line 3 of Python tag (line 5 of source): student_servername = BASE_SERVER+"/"+STUDENT_SERVERS[cs_username]['endpoint'] KeyError: 'None'
On the previous exercise you made a very simple FastAPI application. It was stateless meaning that it had no memory. When you provided two inputs to it, it responded a certain way, and it would continue to respond that same way forever. That might make sense in some situations, but in other situations we want to remember things for the purposes of data logging, or for the purposes of having stateful behavior. For this we need to remember things. And the proper way to do that is a database.
Persistent Storage
There are many different types of databases that exist, and while we do not mean to trivialize them in saying this, they all do pretty much the same thing, which is to provide a means to store and access information efficiently1. In this class we'll use SQLite2 which is a database management system. In order to interact with SQLite, we'll use SQL, which is a language used for managing databases. We will interact with SQLite and generate SQL commands in Python3 using the sqlite3
library which provides a convenient interface.
With SQLite, databases are files with the .db
file extension. Within the database are tables, and within those tables are rows of data that we append to, extract from, and do whatever we want with. These files are automatically saved to the actual hard drive so if you have a Python script that messes with them in some way, and you run that Python script, after the script is finished, the changes made will live on into the future in that database, far beyond the lifespan of that Python file's runtime.
SQLite
SQLite has a pretty easy-to-use Python interface once you get the basics of it. What's nice is that it is very modular so you can create databases and move/copy them as needed. Let's make a database. You'll want to start a Python process on your local computer. In general our order of operations will be
- import
sqlite3
- specify/create a database
- establish a connection to the database
- establish a cursor to the database (how we send it SQL commands)
- do stuff with the database (via the cursor)
- properly close connection to the database
In line with this approach, the snippet of code below will create a database called example.db
and create a table within it. Create a file called db_experiment.py
, place the code below in it, and run it:
1import sqlite32example_db = "example.db" # just come up with name of database3 4def create_database():5 conn = sqlite3.connect(example_db) # connect to that database (will create if it doesn't already exist)6 c = conn.cursor() # move cursor into database (allows us to execute commands)7 c.execute('''CREATE TABLE test_table (user text,favorite_number int);''') # run a CREATE TABLE command8 conn.commit() # commit commands (VERY IMPORTANT!!)9 conn.close() # close connection to database10 11create_database() #call the function!
Now in your terminal locally on your computer run (in the same directory where your Python process was) ls
. You should see a file in the list called example.db
. What we've done is create a database that includes one table in it. We made this table by using a SQL command. Specifically, the command used was the following:
CREATE TABLE test_table (user text,favorite_number int);
which says: "create a table called 'test_table', and have it contain two fields, one which contains a text entry called 'user' and one which contains an integer entry called 'favorite_number'". (there are other data types of entries as well see here You may find the real
data type to be useful for storing conventional floating point numbers in this exercise!).
After execution and commitment, the structure of our database looks like the following:
If you happen to run the same program again you'll get:
1Traceback (most recent call last):2 File "db_experiment.py", line 11, in <module>3 create_database() #call the function!4 File "db_experiment.py", line 7, in create_database5 c.execute('''CREATE TABLE test_table (user text,favorite_number int);''') # run a CREATE TABLE command6sqlite3.OperationalError: table test_table already exists
And this is because when the function create_database
runs the second time, it clashes with the fact that a table already exists in that database called test_table
. You cannot have identically named tables in the same database. You can avoid this issue with the following command:
CREATE TABLE IF NOT EXISTS test_table (user text,favorite_number int);
If you're wondering where the syntax/grammar of the SQL command we used came from, it is standard SQL, which has numerous tutorials online like this one. SQL applies to a lot of the SQL family of databases out there3. Let's go and do more stuff with SQL in our database.
Inserting
OK this database is pretty useless until we have some actual information in it. Let's add some stuff to the database then! Make another Python function called insert_into_database
and add it to your db_experiment.py
file. Copy-paste the lines of code below into that file, save it, and run it. Totally feel free to change the name and number too.
1def insert_into_database():2 conn = sqlite3.connect(example_db)3 c = conn.cursor()4 c.execute('''INSERT into test_table VALUES ('joe',5);''')5 conn.commit()6 conn.close()7 8insert_into_database() #call this function!
When it runs you'll see nothing happen probably...and that's good. In looking at the code, you should see a very similar structure to the first one, except the SQL command we run is the INSERT operation. In particular:
INSERT into test_table VALUES ('joe',5);
which says: "Insert into test_table the values (in order with the table structure) of 'joe', then 5". This will result in the following new state to our database:
Selecting
It would now be great to see what is in the database, and/or have the ability to extract information from it. For this we will need the SELECT
operation. So, let's create another function called lookup_database
with the structure shown below. Copy-paste the code below into the file we've been working in, comment out previous function calls, save it, and run it.
1def lookup_database():2 conn = sqlite3.connect(example_db)3 c = conn.cursor()4 things = c.execute('''SELECT * FROM test_table;''').fetchall()5 for row in things:6 print(row)7 conn.commit()8 conn.close()9 10lookup_database()#call it!
When you run it you should see (assuming you've only run the file once), and taking into account any name changes you might have made.
('joe', 5)
And this should make sense, since what we've done is run the SQL command:
SELECT * FROM test_table;
which says: "Select everything from test_table." (The *
symbol is the wildcard, meaning 'everything') By default in SQLite, execution of a SELECT query will return a SQLite iterator, and you can use either the fetchone()
or fetchall()
methods to get the first returned value from the query or all returned values in a Python list, respectively, allowing us the ability to manipulate them as needed. Alternatively, we can use the iterator directly in a for
loop, just like we do for the range()
operator—if you just need to iterate over all the elements!
If you just want to see only the favorite number of each users in your database, instead of using *
you could specify the columns you want to see such as:
SELECT favorite_number FROM test_table;
It is arguably good practice to avoid using *
, but since we're just getting started we'll use it here. Just keep in mind in it is asking for everything and that may not be what you want or need. A safer way to ask for everything in our current table would be:
SELECT user, favorite_number FROM test_table;
More Inserting
When we inserted data into our database above, we had a pre-made (hard-coded) Python string that we used for our SQL command. In particular we did c.execute('''INSERT into test_table VALUES ('joe',5);''')
which runs the SQL command INSERT into test_table VALUES ('joe',5);
. What if we'd instead had some values we generated programatically that we wanted to be inserted into the database? For example, if we had a Python variable user
and another variable favorite_of_user
. How can we incorporate them into the SQL string we need to generate?
Well as we know in Python, we could create a string from other datatypes by casting and some concatenation action. For example we could do this (the old-school way):
'''
to denote our strings though you could also use """
or likely even just "
for what we're doing. Remember tripled-quoted strings allow us to specify multi-line stringssql_command = '''INSERT into test_table VALUES ('%s',%d);''' %(user,favorite_of_user)
c.execute(sql_command)
or if we did it the newer Python way, we could do it this way:
sql_command = '''INSERT into test_table VALUES ('{}',{});'''.format(user,favorite_of_user)
c.execute(sql_command)
or if we did it the even newer Python way with f-strings, we could do it like this:
sql_command = f'''INSERT into test_table VALUES ('{user}',{favorite_of_user});'''
c.execute(sql_command)
While not apparent nor a threat when running these databases on a local machine, this approach to building SQL commands can be a huge security flaw when our database is living on a server and the Python variables we're using to construct our SQL command are coming from user-specified values from GET or POST values. If a visitor to your site finds a way to manipulate the user
variable to itself be a SQL command they can cause bad things to happen such as damage or destruction of your database. This is bad and is known as a SQL-Injection Attack and is the technical angle behind the comic below. Now you can say, "I get it."
In order to protect ourselves against this, inputs coming from "the outside" (anything you do not have complete control over such as user-specified inputs) should be sanitized and this can be done using the sqlite3
functionality shown below where insertion points of outside values are demarcated by question marks, and a following tuple provides the values to be inserted (in order). The SQLite Python library checks these values to make sure nothing SQL-like is in them, thus preventing injection:
c.execute('''INSERT into test_table VALUES (?,?);''',(user,favorite_of_user))
You can use SQL command syntax for all SQL queries we need to generate (creation of tables, selection of rows, etc...)
execute
is a tuple, so if you have only one value to insert, you still need to make sure your tuple is of the form (value,)
(yes that is a trailing comma there), otherwise you'll get an error. Please pay attention for this! I've wasted hours on this before and it is the least satisfying of bugs to figure out. You do not feel better after figuring this one out.
Moving On
OK, I'm getting restless. Let's insert some more things into our database. Make a fourth Python function called, lotsa_inserts
, copy-paste the code below into it, and run it. What this script does is create 100,000 random usernames and favorite numbers and inserts them one-after-the-other into the database:
1#maybe put imports at top like one should do:2import random3import string4 5def lotsa_inserts():6 conn = sqlite3.connect(example_db)7 c = conn.cursor()8 for x in range(100000):9 number = random.randint(0,1000000) # generate random number from 0 to 100000010 user = ''.join(random.choice(string.ascii_letters + string.digits) for i in range(10)) # make random user name11 c.execute('''INSERT into test_table VALUES (?,?);''',(user,number))12 conn.commit()13 conn.close()14 15lotsa_inserts() #Call this function!
If you now go back and rerun lookup_database
, you'll get waaay more things that print because there are now waaay more entries in this database.
What if we wanted to return the user/favorite number pair with the largest favorite number in the database? One solution would be to build on the script we had, and get the entire database in a Python list, and then write some Python script for us to return the max. This can be ok for small databases, but as your database grows, having Python sort 500,000 entries or something can start to get pretty slow, and this is where more refined SELECTs and sortings can be done on the database side.4
The first thing we can do is modify our SQL query with a sort command, ORDER BY
. In order to return the entire database in ascending order based on the value of favorite_number
you'd run the following command:
SELECT * FROM test_table ORDER BY favorite_number ASC;
If you modify your lookup_database
to have that command instead of the one we originally had in place, when you run it, you'll now see the entries in the database fly by in ascending order. If we want in descending order we could do this SQL command (try it to watch the values fly by in descending order)
SELECT * FROM test_table ORDER BY favorite_number DESC;
So far we've been running a query, generating a Python list from that query and then printing them all. If we wanted to print just the top entry, we could instead of a for
loop, just do print(things[0])
, but even then this can be slow because we're generating a really large Python list for no real reason other than it is in Python and that makes people feel safe. If we instead use the fetchone()
method on an ordered response, we can get just a single value, and this can have significant speedup benefits. Consider the code below (which you could implement in a separate file if you'd like, so long as you keep the same directory so example.db
is still accessible directly). If you run it on your currently existing database you should see a speedup of approximately three-fold between the two approaches. Please make sure you understand and appreciate why this is.
1import sqlite32import time3example_db = "example.db" #make sure this is run in the same directory as your other file.4conn = sqlite3.connect(example_db)5c = conn.cursor()6 7start = time.time()8things = c.execute('''SELECT * FROM test_table ORDER BY favorite_number DESC;''').fetchall()9print(things[0])10print(time.time()-start)11 12start = time.time()13thing = c.execute('''SELECT * FROM test_table ORDER BY favorite_number DESC;''').fetchone()14print(thing)15print(time.time()-start)16 17conn.commit()18conn.close()
You can also refine what is returned (and gain efficiency benefits) within the SQL query itself. For example the following SQL command is identical to the other two above, but limits what is done using SQL rather than the sqlite3
interface (and will give a speedup of about 20% or so on my machine with our current database as compared to our fast one above).
SELECT * FROM test_table ORDER BY favorite_number DESC LIMIT 1;
More details on the syntax of ORDER BY
can be found here.
WHERE
We can add further or alternative sorting mechanisms using the WHERE
keyword. For example, if we wanted a list of users that had favorite numbers only between 1132 and 1185 (to identify weird people since only weird people have favorite numbers in that range, amirite?) we could do the following SQL command (try this out with the lookup_database
function, and you should see a much smaller portion of your database print out)
SELECT * FROM test_table WHERE favorite_number BETWEEN 1132 AND 1185;
This command says, select all full rows in test_table
where favorite_number
is between 1132 and 1185.
We can nest our specifications as needed as well. The operation above will return a Python list of the entries that meet that condition, but what if we want that in order? We could do this by appending the ORDER BY
operator from earlier, like below:
SELECT * FROM test_table WHERE favorite_number BETWEEN 1132 AND 1185 ORDER BY favorite_number DESC;
More details on syntax of WHERE
can be found here.
Timestamps
Another really useful type which we can use with our databases are timestamps. SQLite sort of has undocumented support for timestamps, but when we use SQLite in conjunction with Python and its datetime
library, we're able to remember and pull elements from arrays based on times (possibly of when that row was inserted or when that row was modified). We'll do that using a new type of entry, the timestamp
. Consider the following code which creates a new database time_example.db
, with a single table in it called dated_table
that has three entries in it: a text
field called "user", an int
field called "favorite_number", and a timestamp
field called "timing".
1import sqlite32import datetime3example_db = "time_example.db" # just come up with name of database4conn = sqlite3.connect(example_db) # connect to that database (will create if it doesn't already exist)5c = conn.cursor() # move cursor into database (allows us to execute commands)6c.execute('''CREATE TABLE IF NOT EXISTS dated_table (user text,favorite_number int, timing timestamp);''') # run a CREATE TABLE command7conn.commit() # commit commands8conn.close() # close connection to database
What is the SQL query in the above code snippet?
We'll use this new database like the old one, but instead here, we'll time-stamp entries as they are made using the datetime
library, which is documented here.
We'd now like to populate that database with a few (5) fictitious users and their favorite numbers. We'll space when we enter these into the database by two seconds.
1import sqlite32import datetime3import string4import random5import time6example_db = "time_example.db" # name of database from above7conn = sqlite3.connect(example_db) # connect to that database8c = conn.cursor() # move cursor into database (allows us to execute commands)9for x in range(5):10 number = random.randint(0,1000000) # generate random number from 0 to 100000011 user = ''.join(random.choice(string.ascii_letters + string.digits) for i in range(10)) # make random user name12 c.execute('''INSERT into dated_table VALUES (?,?,?);''',(user,number,datetime.datetime.now())) #with time13 print("inserting {}".format(user)) #notification message14 time.sleep(2) #pause for 2 seconds before next insert15conn.commit() # commit commands16conn.close() # close connection to database
What is the SQL query in the above code snippet?
Just like the database before, we could then go and access and pull things out as needed:
1import sqlite32import datetime3example_db = "time_example.db" # name of database from above4conn = sqlite3.connect(example_db) # connect to that database5c = conn.cursor() # move cursor into database (allows us to execute commands)6things = c.execute('''SELECT * FROM dated_table;''').fetchall()7for x in things:8 print(x)9conn.commit() # commit commands10conn.close() # close connection to database
But where this gets cool is we can start to sort off of timestamps. Below we create a datetime
object that expresses "15 minutes ago" and then use that in combination with our SELECT
query to get items from the last 15 minutes.
1import sqlite32import datetime3example_db = "time_example.db" # name of database from above4conn = sqlite3.connect(example_db) # connect to that database5c = conn.cursor() # move cursor into database (allows us to execute commands)6fifteen_minutes_ago = datetime.datetime.now()- datetime.timedelta(minutes = 15) # create time for fifteen minutes ago!7things = c.execute('''SELECT * FROM dated_table WHERE timing > ? ORDER BY timing DESC;''',(fifteen_minutes_ago,)).fetchall()8for x in things:9 print(x)10conn.commit() # commit commands11conn.close() # close connection to database
What is the SQL query in the above code snippet?
Further documentation on the Python datetime
library can be found here.
In the Application
Ok everything we've done so far can totally be done in the context of a FastAPI application. A very, very simple deviation on the FastAPI application from the previous exercise could be that it logs all of the sent rh (relative humidity), t (temperature), and heat indices and then will return not just the current result, but rather the amount
previous results. So something like: http://<div><font color='red'><b>A Python Error Occurred:</b><p>
Error on line 1 of Python tag (line 367 of source):
print('%s' % (student_servername,))
NameError: name 'student_servername' is not defined
<p></font></div>/heat_index?rh=90&t=100&amount=4
wouldn't just yield 158.0
but rather the last four of those reults like:
rh: 90.0 t: 100.0 heat_index: 176.0! rh: 90.0 t: 100.0 heat_index: 176.0! rh: 85.0 t: 100.0 heat_index: 167.0! rh: 80.0 t: 100.0 heat_index: 158.0!
A behavior like that could be implemented with the following:
1RHT_DB = "/home/joe_test/rht_log.db" # you'd obvi change this2@app.get("/heat_index2")3def heat_index_function2(rh: float, t: float, amount: int | None): #last thing does default argument4 heat_index = round(compute_heat_index(rh,t)) #previously defined5 conn = sqlite3.connect(RHT_DB) # connect to that database (will create if it doesn't already exist)6 c = conn.cursor() # move cursor into database (allows us to execute commands)7 c.execute('''CREATE TABLE IF NOT EXISTS rht_table (rh real,t real, heat_index real);''') #jodalyst test8 c.execute('''INSERT into rht_table VALUES (?,?,?);''', (rh,t,heat_index))9 if amount is not None:10 prev_data = c.execute('''SELECT * FROM rht_table ORDER BY rowid DESC LIMIT ?;''',(amount,)).fetchall()11 else:12 prev_data = c.execute('''SELECT * FROM rht_table ORDER BY rowid DESC;''').fetchall()13 #rowid is an implicit 64 bit ordered row counter column that exists in all sqlite tables by default.14 # you can use it to access most recents15 conn.commit() # commit commands16 conn.close() # close connection to database17 outs = ""18 for t in prev_data:19 outs += f"rh: {t[0]} t: {t[1]} heat_index: {t[2]}! "20 return outs
Couple things of note in this example code:
You can create databases on the server automatically, but also add the caveat to only do so if the database doesn't already exist. This prevents a crash when you try to make a new database that already is there. The syntax we use above tries to create the database and table if it doesn't exist, and then adds a new entry to the table and returns what is currently in it.
All tables made in sqlite will have an implicit auto-indexing counting row called "rowid". This can be very useful when sorting on oldest or newest data. More info here.
Deliverable
Ok what we want for a deliverable for this exercise is very similar to the code above, except the number of previous results it provides should be based on timestamps rather than the raw index count.
We want something like: http://<div><font color='red'><b>A Python Error Occurred:</b><p>
Error on line 1 of Python tag (line 408 of source):
print('%s' % (student_servername,))
NameError: name 'student_servername' is not defined
<p></font></div>/heat_index2?rh=80&t=100&history=3
where rh
and t
are the usual suspects and history
takes in an argument and lists out the previous inputs that were within history
seconds in the past.
So starting from a pause of a long time: http://<div><font color='red'><b>A Python Error Occurred:</b><p>
Error on line 1 of Python tag (line 410 of source):
print('%s' % (student_servername,))
NameError: name 'student_servername' is not defined
<p></font></div>/heat_index2?rh=80&t=100&history=2
should yield rh: 80.0 t: 100.0 heat_index: 158.0!
.
Then if one second later you do:http://<div><font color='red'><b>A Python Error Occurred:</b><p>
Error on line 1 of Python tag (line 412 of source):
print('%s' % (student_servername,))
NameError: name 'student_servername' is not defined
<p></font></div>/heat_index2?rh=85&t=100&history=2
you should get rh: 85.0 t: 100.0 heat_index: 167.0! rh: 80.0 t: 100.0 heat_index: 158.0!
.
Then if 1.1 seconds later you did http://<div><font color='red'><b>A Python Error Occurred:</b><p>
Error on line 1 of Python tag (line 414 of source):
print('%s' % (student_servername,))
NameError: name 'student_servername' is not defined
<p></font></div>/heat_index2?rh=90&t=100&history=2
you should get rh: 90.0 t: 100.0 heat_index: 176.0! rh: 85.0 t: 100.0 heat_index: 167.0!
. (since the first submission was greater than 2 seconds ago).
When you think you have it working, make sure to activate (over ssh like before) your reverse proxy tunnel on efpi-10.mit.edu, spin up your application locally and run the checker below:
Footnotes
1There are numerous pros and cons to different databases. Some people will go to war for MongoDB or something else, and we're not passing judgement here.
2SQLite is pronounced either "S-Q-L-lite" or "sequelite", whichever you prefer, we don't care.
3SQLite, MySQL, PostgreSQL, etc
4the reason performing these sorts on the database side can be faster is that they'll be performed in the native C that the database is written in rather than using the bulkier higher-level structures of Python, which, while they can make our handling of data "easier", do add significant bulk to calculations and operations and make a poorer overall product.