Automating data analysis and workflows with pyODK
by Hélène Martin on October 19th, 2022
- Slides
- Recording (demo starts here)
- Participants survey XLSForm
Jupyter Lab is the interactive computing environment used in this session. This file is a Jupyter Notebook, which can be opened and edited in many platforms, including Jupyter Lab. Even GitHub knows how to show Jupyter Notebooks! When you first open a notebook, or if you see it in a view-only platform like GitHub, the output shown is static, saved from the last run.
If you have any questions or comments, please share them on the forum.
Since many are new to Python and to Jupyter Lab, we started with Hello, world!
print("Hello, world!")
Hello, world!
Configuring pyodk and calling library methods
We looked at how to build a pyodk client and make our first request to the ODK Central backend. This client used Hélène's configuration in her home directory. The credentials she entered have access to three projects that we can list using the pyodk client's list method in the projects accessor. In pyodk, functionality is organized according to the resource (projects, forms, submissions) it acts on.
If you create your own .pyodk_config.toml in your home directory and run the cell below, you will see the projects that your credentials have access to.
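A minimal configuration file looks something like the following (the values here are placeholders for your own server and credentials; pyodk reads the connection details from a [central] section, and default_project_id is optional):
[central]
base_url = "https://your-central-server.example.com"
username = "your.username@example.com"
password = "your-password"
default_project_id = 52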
from pyodk.client import Client
client = Client()
client.open()
client.projects.list()
[Project(id=38, name='Hélène encrypted', createdAt=datetime.datetime(2021, 4, 26, 18, 36, 51, 785000, tzinfo=datetime.timezone.utc), description=None, archived=None, keyId=19, appUsers=None, forms=None, lastSubmission=None, updatedAt=datetime.datetime(2021, 4, 26, 18, 37, 13, 924000, tzinfo=datetime.timezone.utc), deletedAt=None), Project(id=4, name='Impact++', createdAt=datetime.datetime(2021, 1, 31, 21, 52, 30, 606000, tzinfo=datetime.timezone.utc), description=None, archived=None, keyId=None, appUsers=None, forms=None, lastSubmission=None, updatedAt=datetime.datetime(2022, 3, 9, 0, 55, 21, 658000, tzinfo=datetime.timezone.utc), deletedAt=None), Project(id=52, name='pyODK webinar', createdAt=datetime.datetime(2022, 10, 18, 16, 17, 33, 980000, tzinfo=datetime.timezone.utc), description='For Oct 19 session on pyODK.', archived=None, keyId=None, appUsers=None, forms=None, lastSubmission=None, updatedAt=datetime.datetime(2022, 10, 19, 4, 42, 1, 824000, tzinfo=datetime.timezone.utc), deletedAt=None)]
This looks a bit magical! We also wrote a config.toml file together and explicitly built a client using that configuration. We specified a cache path as well so that credentials for this configuration get saved separately from those for the client we created above.
We can use multiple clients that connect to multiple servers or use different credentials on the same server in the same program.
viewer_client = Client(config_path="config.toml", cache_path="cache.toml")
viewer_client.open()
viewer_client.projects.list()
[Project(id=52, name='pyODK webinar', createdAt=datetime.datetime(2022, 10, 18, 16, 17, 33, 980000, tzinfo=datetime.timezone.utc), description='For Oct 19 session on pyODK.', archived=None, keyId=None, appUsers=None, forms=None, lastSubmission=None, updatedAt=datetime.datetime(2022, 10, 19, 4, 42, 1, 824000, tzinfo=datetime.timezone.utc), deletedAt=None)]
We can list any type of resource. We get back a list of Python objects with appropriately-typed fields. These objects provide access to the resource metadata.
client.forms.list()
[Form(projectId=52, xmlFormId='foods', name='Favorite foods', version='2022102202', enketoId='DWTgFxdpHhMatjKjFPtAg7pRnvzwMnM', hash='3a0b4f0dc731adca7e6683b1081b8686', state='open', createdAt=datetime.datetime(2022, 10, 19, 5, 34, 26, 659000, tzinfo=datetime.timezone.utc), keyId=None, updatedAt=datetime.datetime(2022, 10, 19, 5, 45, 47, 999000, tzinfo=datetime.timezone.utc), publishedAt=datetime.datetime(2022, 10, 19, 5, 45, 47, 997000, tzinfo=datetime.timezone.utc)), Form(projectId=52, xmlFormId='participants', name='pyODK webinar participant survey', version='2022101802', enketoId='pfRYfxj0bOmGyCZnaZ0umsk0AsunD6N', hash='521961ede019cc24ceed078905506fea', state='open', createdAt=datetime.datetime(2022, 10, 19, 4, 39, 46, 62000, tzinfo=datetime.timezone.utc), keyId=None, updatedAt=datetime.datetime(2022, 10, 19, 4, 42, 1, 834000, tzinfo=datetime.timezone.utc), publishedAt=datetime.datetime(2022, 10, 19, 4, 40, 59, 355000, tzinfo=datetime.timezone.utc)), Form(projectId=52, xmlFormId='simple_repeat', name='simple_repeat', version='2022101001', enketoId='qlPrrhsIjqPdT5FnsLVDdh3nBFAszaW', hash='8bd777cb7b66660beafc034eee16b09b', state='open', createdAt=datetime.datetime(2022, 10, 18, 20, 49, 53, 997000, tzinfo=datetime.timezone.utc), keyId=None, updatedAt=datetime.datetime(2022, 10, 19, 4, 42, 1, 842000, tzinfo=datetime.timezone.utc), publishedAt=datetime.datetime(2022, 10, 18, 20, 49, 58, 484000, tzinfo=datetime.timezone.utc))]
Learning more about available functionality
We listed submissions as well. For these notes, we've commented out the call with a # because there are a lot of submissions!
To learn more about the functionality available for each resource type in Jupyter Lab (and most development environments), we can type the name of our client followed by the name of the resource, add a period, and then use the TAB key to see what is available. For example, to learn more about methods available for submissions, we would type client.submissions. and then a tab.
We can also learn more about a specific method by typing its name (or selecting it from the suggestions given as above) and then typing SHIFT+TAB.
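Outside of an environment with tab completion, Python's built-ins give similar information. For example (a quick sketch using the client created above):
# See what's available on the submissions accessor (similar to pressing TAB)
dir(client.submissions)
# Read a method's documentation (similar to SHIFT+TAB)
help(client.submissions.get_table)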
# client.submissions.list(form_id='participants')
Listing submissions gives us a list of objects representing submission metadata. There's a lot we can do with that but typically what we really want is the submission content.
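For example, the metadata alone is enough to check how much data has come in and when. Here is a sketch we didn't run in the session; the createdAt field name is an assumption based on what we saw for projects and forms:
subs = client.submissions.list(form_id='participants')
print(len(subs))                       # how many submissions there are
print(max(s.createdAt for s in subs))  # when the most recent submission arrived (field name assumed)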
Getting submission data into pandas
We can get our submission data by using the get_table method. This will give us back our data as JSON in a top-level value key (per the OData standard). We could use this directly, but it's even more convenient to get it into pandas, a Python library for data manipulation and analysis.
Some good resources for learning about pandas and using Python for data analysis and manipulation are:
- The pandas Getting Started guide - great tutorials on focused topics
- Kaggle - well-structured longer tutorials with exercises to check your understanding
There are also a large number of courses through Udemy, EdX, etc., some of which provide certificates.
pyodk helps build a bridge into pandas directly from ODK Central so that we don't have to manage versioning of our dataset and can use Central as the ultimate source of truth. This is helpful for monitoring data as it comes in and should work smoothly even for over a million values (submission count times field count). With large submission sets, we will be limited by our Internet connection and server RAM and CPU.
Once we're in pandas, there's nothing special about ODK data! Note that we can also use all of these techniques on a CSV export from Central.
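For example, a CSV export downloaded from Central could be loaded as an alternative to the OData pull below (the filename here is hypothetical):
import pandas as pd
# Load a CSV export downloaded from Central instead of pulling it over the API
df_from_export = pd.read_csv("participants.csv")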
get_table has several parameters that can be passed in to do things like filter the submissions we request. Use SHIFT+TAB as described above to see what they are.
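For instance, the filter parameter accepts an OData filter expression, so something like the following would only request recent submissions. This is a sketch we didn't run in the session; check the method signature with SHIFT+TAB for the exact parameter names.
# Only fetch submissions received since October 1st, 2022 (sketch; date is made up)
recent = client.submissions.get_table(
    form_id='participants',
    filter="__system/submissionDate ge 2022-10-01T00:00:00.000Z",
)['value']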
json = client.submissions.get_table(form_id='participants')['value']
import pandas as pd
df = pd.json_normalize(json, sep='-')
df.head(3)
| | __id | note_welcome | height_units | height_feet | height_meters | height | note_height_meters | pets | book | liked-type | ... | __system-updatedAt | __system-submitterId | __system-submitterName | __system-attachmentsPresent | __system-attachmentsExpected | __system-status | __system-reviewState | __system-deviceId | __system-edits | __system-formVersion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | uuid:670fa9de-9191-4968-a08e-1aba8061d0e0 | None | ft | 7.99 | NaN | 2.43 | None | n | 3 | Point | ... | None | 711 | Participant | 0 | 0 | None | None | None | 0 | 2022101802 |
| 1 | uuid:afe70664-6670-4ff1-8efc-a23eb0c6ec89 | None | ft | 3.01 | NaN | 0.92 | None | y | Mr Muscle | Point | ... | None | 711 | Participant | 0 | 0 | None | None | None | 0 | 2022101802 |
| 2 | uuid:f75e9730-89d8-407d-ba2a-9990ce90233f | None | m | NaN | 1.89 | 1.89 | None | n | Gamba | Point | ... | None | 711 | Participant | 0 | 0 | None | None | None | 0 | 2022101802 |
3 rows × 24 columns
Once our data is in pandas, we have access to powerful data cleaning and analysis tools.
We looked at quick ways to make plots:
df.pets.value_counts().plot(kind='pie')
<AxesSubplot:ylabel='pets'>
We also plotted the height_units column and noticed that it has the exact same shape! Are the two correlated?
# Encode the categorical answers as integer codes so we can compute a correlation
df['height_code'] = df.height_units.astype('category').cat.codes
df['pets_code'] = df.pets.astype('category').cat.codes
df.pets_code.corr(df.height_code)
-0.4035087719298247
No, that's a pretty weak correlation, so it's really a coincidence that the number of people who prefer meters is the same as the number of people who don't have pets. Too bad, I had made up a whole story in my head about Americans and pets.
Anyway, hopefully this illustrates that it's relatively quick and fun to explore data in this way.
pandas can also give us a really nice standard summary of numeric columns:
df.height.describe()
count    34.000000
mean      1.747941
std       0.211914
min       0.920000
25%       1.700000
50%       1.760000
75%       1.820000
max       2.430000
Name: height, dtype: float64
And we can do analysis on metadata as well as customize the plots:
df['__system-submitterName'].value_counts().plot(kind='bar', rot=45)
<AxesSubplot:>
The examples above focus on analysis. We can also manipulate the data to clean it. For example, if we prefix all notes in our XLSForm with note_, we can remove them from the data table in a single step:
# Drop every column whose name contains note_
df = df.drop(df.filter(regex='note_'), axis=1)
Using HTTP verb methods to make direct calls to the API
Our goal is to make many common workflow automation tasks directly available in pyodk as nice library methods. We haven't implemented most of those yet, though. You can still access the full Central API with the convenience of a configured pyodk client by using HTTP verb methods on your Client object.
We briefly looked at how to use the Central API docs to make the right calls. You can see a worked example here.
There's also a longer sample script here that shows provisioning App Users from a file of names and generating custom QR codes for each of them.
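As a rough sketch of the pattern (the app user listing endpoint is used here only as an illustration, and the assumption is that the verb methods return standard Response objects; see the Central API docs and the pyodk docs for the details):
# Call a Central API endpoint that doesn't have a dedicated pyodk method yet
response = client.get("projects/52/app-users")
response.json()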