Projects/Portfolio

My personal research interests center on a human-centered approach to organizing and providing access to digital data and research resources.  That is a mouthful of academic buzzwords, but it is the most general umbrella for the type of programming, teaching, advising, and librarianship that I perform.  I am a digitally and technologically oriented librarian, and I fit into many slots.

So when it comes to…

Programming:  I have written data profiling programs that enable users to quickly get a sense of their data, produce machine-readable summaries of it, and template documentation for it.  Summaries act as guides for data cleaning assessment, giving users an instant picture of the extent of missing and invalid data.  Autogenerated documentation populated with summary results follows the idea that it is easier to edit than to create, and lets users focus on the content that can only come from a human.  I believe in gentle automation, where human time is saved in smart and appropriate ways.
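
As a minimal sketch of what such a profiling pass looks like (not code from any particular tool of mine; the file name and the rule that "missing" means an empty string are placeholders for illustration):

```python
import csv
from collections import Counter

def profile_field(rows, field):
    """Tally empty vs. populated values for one field."""
    counts = Counter()
    for row in rows:
        value = (row.get(field) or "").strip()
        counts["missing" if value == "" else "present"] += 1
    return counts

# "survey.csv" is a placeholder input file for this sketch
with open("survey.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

for field in rows[0]:
    print(field, dict(profile_field(rows, field)))
```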

Teaching: I want all workers to have equal access to data processing so they can participate in data-driven discovery and decision making.  Reducing barriers to data analysis and transformation lets more data stakeholders join the technical table, deploying and critically interpreting our tools for more and different kinds of needs.  High quality, accessible programming education should not be a luxury or a matter of chance.  To this end, I strive to constantly improve my own teaching skills and to share what I have learned with others.

Mentorship: Providing robust and healthy learning opportunities grows communities and affirms one’s identity and membership within them.  Mentoring takes many forms to meet many needs, and it is something people at all levels need both to receive and to provide. Many students are eager to learn new technical skills or progress further into data science, but prior toxic or inadequate instruction can create barriers that only positive and personal mentorship can break down.

Librarianship:  This is a complex area where service and scholarship intersect at many points.  My desire is that no research question be blocked by a technical limitation.  Much of this is solved by strong data literacy, end-user programming skills, data organization, and ready access to training and consultations. Creating these opportunities is a difficult and lengthy process, but it ultimately creates a stronger group of researchers.  The same goes for corporate organizations: empowering everyone with strong information and data skills opens up opportunities for new insights, better organization, and awareness of resources.  My librarianship includes:  crafting a balance of information tidiness and sustainability, assessing how/if/when data tasks can be automated, identifying resources for learning new things, documenting data workflows, leading team building exercises around data, and more.

Data:  I love assembling and manipulating data in preparation for analysis. Getting unstructured or semistructured data, especially text, into a format suitable for large scale investigation and interpretation is my favorite kind of data puzzle.  Much of my data project work has focused on web scraping and data cleaning.  I developed a class called Open Data Mashups that takes students through the development of their own novel data project, in which they must identify and gather 3-5 real datasets in preparation for a research project. The class incorporates elements of reproducibility, open science, data management, data documentation, data cleaning, and data munging.
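
As a tiny illustration of that kind of transformation (the lines, pattern, and output file below are all invented for the example):

```python
import csv
import re

# Invented semistructured citation-style lines for illustration
lines = [
    "Smith, Jane (1998) A Study of Things",
    "Doe, John (2004) Another Study",
]

# Pull last name, first name, year, and title out of each line
pattern = re.compile(r"(?P<last>[^,]+), (?P<first>.+?) \((?P<year>\d{4})\) (?P<title>.+)")

rows = []
for line in lines:
    match = pattern.match(line)
    if match:
        rows.append(match.groupdict())

# Write the now-rectangular data out for analysis
with open("parsed.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["last", "first", "year", "title"])
    writer.writeheader()
    writer.writerows(rows)
```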

HIGHLIGHTED WORKS

Much of my research work cannot be highlighted here because the data are private and I do not have approval to make them public.  This is partly due to privacy concerns and partly to the prepublication nature of the data.

Data Workflows Workshop

This workshop was originally created for a large machine learning research team whose many subteams needed a starting point for documenting the data workflows within and between them.  The workshop has been run over a dozen times across a variety of audiences and is easily adapted for many types of projects.

The goal of the workshop is to prompt participants to think critically about the physical and digital products being created by and used in the various stages of a project.  It can be used prospectively, retrospectively, or in the middle of a project. Once a workflow is sketched, various review passes are completed to add pain points, due dates, assignments, etc.

You can find a recording of the workshop on YouTube, as recorded for the NCSA Blue Waters Data Workflows webinar series: https://www.youtube.com/watch?v=hseix4TH0eU

The materials for this workshop are all CC-BY and available here: http://hdl.handle.net/2142/91639

I use this recording and workshop as a class assignment in preparation for final project development. Student feedback has been very positive, and all the materials are open for you to adapt.

Data profiling tools

JSON Profile tool

Tools: Python 3.x and pandas. Most of the work uses only the Python standard library; pandas is necessary only for the HTML and Excel output.

The JSON profiling tool (https://github.com/elliewix/json-profile-tool) was developed in two parts: a set of helper functions that can be imported to generate summary descriptions and statistics about data stored in JSON structures, and a GUI application (built with tkinter) to run these analyses on a local file.  There are functions to write these summaries out to HTML tables, CSVs, JSON files, and Excel workbooks.

This tool is useful because many APIs dump out JSON payloads with little to no documentation of their schemas, and because you may need to perform data cleaning/profiling on non-rectangular data.  In these cases, loading the data into Excel, OpenRefine, or another tool designed for rectangular data may not appropriately represent the data’s structure.

The most powerful part of this tool is the Excel output, where each observed field in the original data file becomes a new sheet in the workbook.  These sheets are populated with all the unique values seen within that field, along with their counts.  For example, if there are thousands of IDs within the records and one is duplicated, you can easily sort or filter the list of IDs by count to identify those with multiple observations.
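
The real implementation lives in the repository above, but the core tallying idea looks roughly like this sketch (the input file name and the assumption of a top-level list of objects are placeholders):

```python
import json
from collections import Counter, defaultdict

def tally_fields(records):
    """Count every unique value observed under each top-level field."""
    tallies = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            # Serialize so lists/dicts can be counted alongside scalars
            tallies[field][json.dumps(value, sort_keys=True)] += 1
    return tallies

# "api_dump.json" is a placeholder; assumes a top-level list of objects
with open("api_dump.json", encoding="utf-8") as f:
    records = json.load(f)

for field, counts in tally_fields(records).items():
    duplicates = {v: n for v, n in counts.items() if n > 1}
    print(field, "values seen more than once:", duplicates)
```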

AutoDocish (for CSV data)

Tools: Python 3.7. This project solely uses the Python 3 Standard Library.

Developing a deep understanding of a dataset that is new to you can be a cumbersome task, and one that usually requires documentation to be created at the end. So why not use a tool that creates documentation giving you a field- or column-level understanding of the data?  A lot of exploratory data analysis produces important metadata for documentation along the way, and these two tools take that as an inspiration point for further adaptation.

AutoDocish (https://github.com/elliewix/data-profile-tool) was created to calculate summary information about tabular data, and write it out to a markdown document that looks a lot like a data codebook.
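
A standard-library-only sketch of that idea (this is not AutoDocish’s actual code; the file names are placeholders):

```python
import csv
from collections import Counter

def write_codebook(csv_path, out_path="codebook.md"):
    """Summarize each column of a CSV into a markdown codebook stub."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    lines = [f"# Codebook for {csv_path}", ""]
    for field in rows[0]:
        values = Counter((row.get(field) or "").strip() for row in rows)
        lines += [
            f"## {field}",
            f"- rows: {len(rows)}",
            f"- unique values: {len(values)}",
            f"- empty values: {values.get('', 0)}",
            "- description: *(to be filled in by a human)*",
            "",
        ]
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines))

write_codebook("mydata.csv")
```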

You can see a presentation about this project and tool from PyData Chicago 2016:  https://www.youtube.com/watch?v=Hb7nvHbwNAw

J!Archive Scraping

This was one of my first projects after learning how to program.  I went on a Jeopardy! kick and found that I was quite curious about the backgrounds of the contestants.  Luckily, I found J! Archive as a data source, which has reasonably structured pages to scrape.  (A rough sketch of the scraping pattern appears after the research questions below.)

Initial research questions:

  • Are players with certain occupations likely to be more successful compared to the average player?
  • Are there certain “hot spots” among reported player hometowns?  Have these hot spots changed over time?
  • Are players who defeat long running champions more likely to be long running champions as well?
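
The general shape of the scraping looks something like this sketch using requests and BeautifulSoup; it is not the original project code, and the URL pattern and CSS selector are my assumptions about J! Archive’s markup, so verify them against the live pages before relying on them:

```python
import requests
from bs4 import BeautifulSoup

def get_contestants(game_id):
    """Fetch one game page and pull out the contestant blurbs."""
    # Assumed URL pattern and selector; check the live markup first
    url = f"https://j-archive.com/showgame.php?game_id={game_id}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Each blurb holds a name, occupation, and hometown
    return [p.get_text(" ", strip=True) for p in soup.select("p.contestants")]

print(get_contestants(1))
```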

Workshops and tutorials

IN PERSON

  • Co-instructor, HackCulture Data Analysis Challenge (February 2019)
    • Designed a half-day workshop and hackathon for data analysis
  • Co-instructor, Hesitant Data Scientist Workshop (September 2018 and January 2019)
    • Two-day workshop around foundational skills for data analysis/science
    • Instructed and provided lessons for other modules
  • Instructor, Intermediate Python for Data Automation (April 2018)
    • Workshop for Reaching Across Illinois Library System
    • 1.5-day workshop with 4 weekly follow-up sessions covering Python programming for data processing, common data formats, working with APIs, technical project management, and version control
  • Instructor, Learn Tech to Teach Tech (February 2018)
    • Workshop for code{4}lib 2018 Conference
    • Full day workshop on personal learning plans, lesson development, and technical instruction
  • Co-instructor, Humanities Data: A Hands On Approach (July 2017 & July 2018)
    • Workshop for Digital Humanities at Oxford Summer School 2017 & 2018
    • Led sessions on: learning strategies for technical skills, Python, SQL, and ethics & social implications of data
  • Instructor, Data Management Workshop Series (January 2016 – May 2017)
    • Instructed and updated the University Library’s three part data management workshop
  • Workshops for University Library (Illinois)
    • Redesigned a series of four workshops on research data management and data publication
  • Instructor, Introduction to Practical Programming Workshop (January 2015)
    • Full day workshop presented as a LITA Institute workshop for the ALA Midwinter Meeting

DIGITAL RESOURCES

  • Wickes, E. (2018, April). Pandas for Archivists. https://doi.org/10.5281/zenodo.2567537
  • Wickes, E. (2018, July). SQL for Archivists. https://doi.org/10.5281/zenodo.1308113
  • Wickes, E. (2018, January). Getting Started with GitHub using GitHub Desktop. https://github.com/elliewix/github-training-brain-dumps/blob/master/github_directions.md
  • Wickes, E. (2017, September). Tour of Python for Illini Stats Club. https://doi.org/10.5281/zenodo.1166151
  • Wickes, E. (2017, July). Python Tour for Humanists. https://doi.org/10.5281/zenodo.2567543
  • Wickes, E. (2017, July). Ways of Installing Python. https://doi.org/10.5281/zenodo.1166149
  • Wickes, E. (2016, June). Python for Humanities Research. https://doi.org/10.5281/zenodo.1166147
