A search interface for the Performing Patents Otherwise publication as part of the Politics of Patents case study (part of Copim WP6): this parses data from the archive of RTF files and provides additional data from the European Patent Office OPS API. https://patents.copim.ac.uk

on combining databases and books

an essay by Simon Bowie

Performing Patents Otherwise is an experimental publication drawing together a database in the form of a book. As part of the COPIM project, we turned a dataset of patents into a searchable website and book publication. In doing so, we deliberately subverted many of the traditional principles behind relational databases and search engines in order to expose the inherent poetics of the archive and to cultivate a sense of serendipity through computational randomness. This chapter outlines the tools we used to build this database book and offers some reflections on subverting traditional models of presenting data.

Databases are so integrated into the computational systems behind contemporary life that their functions are often hidden from us. Every time you log in to a website, your login details are pulled from a database. Every time you look at a tweet or an Instagram post, the content is pulled from a database. Every time you access a file in a document management platform like Nextcloud or Microsoft SharePoint, the link to the file is retrieved from a database. Relational databases are the silent background to our modern computer interactions.

A database is an organised collection of data. In computer systems, data is often stored in a relational database based on the relational model of data proposed by IBM computer scientist E. F. Codd (1970, p. 377) in his 1970 article in Communications of the ACM. Codd proposed “12 Principles of Relational Databases” (Kline, Gould, and Zanevsky, 1999, p. 5), which specify how data can be organised in rows and columns in tables and then linked to other pieces of data through their relationships (though since the rules are numbered from 0, as ordered lists in computer systems often are, there are actually thirteen of them). This relational model forms the basis of the modern relational database management systems that power websites and web applications: open source MySQL used by Facebook, Twitter, and YouTube; open source MariaDB used by Mozilla, Google, and Wikipedia; open source PostgreSQL used by OpenStreetMap, Reddit, Instagram, and the International Space Station; and closed source systems like Oracle or Microsoft SQL Server.

But if we go back and consider databases using their most basic definition as an ‘organised collection of data’, then a book can also be considered a database of sorts. Chris Kubica (2010) considers the analogy of books as relational databases and imagines the data relationships between parts of a book like chapters, table of contents, and index. He further drills down on the chapters to discover the sentences within and the words nested within the sentences.

book-as-database diagram from Kubica (2010)

The Performing Patents Otherwise publication offers one way of conceiving of database as book. As Open Source Software Developer for the COPIM project, I was asked to help consider how the dataset of patents curated by the Politics of Patents research project could be turned into a publication as part of COPIM’s work package supporting experimental open access publishing. It’s a wonderful dataset to browse through to discover fascinating and unexpected clothing designs and wearable inventions and we wanted to preserve that sense of randomly opening a file to discover something wonderfully weird while also making the data accessible in a modern way alongside published chapters. We’ve turned the curated and enhanced Politics of Patents dataset into a hybrid digital publication that combines a robust search engine with textual ‘book-type’ content to facilitate conversations and creative interventions with the data itself.

The dataset consists of approximately 320,000 patents for clothing and wearable technology from the European Patent Office. It comprises RTF files and accompanying PDF files arranged in a directory structure by year of patent registration followed by country code of originating country. Each RTF file contains basic metadata about the patent including patent title, application ID, publication ID, a short description or abstract describing the patent, and International Patent Classification (IPC) number. Each ID is itself structured data containing an ISO 3166-1 alpha-2 country code (e.g. GB for the United Kingdom) and an ISO 8601 date of application or publication.

To make the dataset accessible and searchable, I wanted to put the data into a search engine. In my previous work on library catalogue and search systems, I’d worked with Apache Solr, a reliable open source search engine that provides full-text search, faceted search, and advanced customisation. Solr is able to index RTF files using Apache Tika’s framework for extracting metadata and text from a range of document formats. We’ve made some customisations to Solr’s config to extract and index ‘year’ and ‘country’ data for each document using regular expressions to find the data in the documents.
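As a rough illustration of this kind of regular-expression extraction, here is a minimal Python sketch. It is not the project's Solr configuration, which is not reproduced in this chapter: the pattern, the function name extract_year_country, and the example paths are all assumptions, and the sketch reads the year and country from the directory layout described above rather than from the document text.

```python
import re

# Hypothetical pattern: assumes paths shaped like <year>/<country>/<file>.rtf,
# following the directory structure described for the dataset.
PATH_PATTERN = re.compile(
    r"(?:^|/)(?P<year>\d{4})/(?P<country>[A-Z]{2})/[^/]+\.rtf$"
)


def extract_year_country(path):
    """Return the year and country code found in a file path, or None."""
    match = PATH_PATTERN.search(path)
    if match is None:
        return None
    return {"year": match.group("year"), "country": match.group("country")}
```

A path that does not follow the assumed layout returns None, so a caller can decide whether to skip the file or index it without those two fields.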

The Solr search engine allows us to run search queries against the text but it doesn’t provide a built-in frontend for doing so. Using Flask, a web framework written in Python, I built a website frontend with a searchbox that queries the Solr search engine and presents the results in an accessible way, parsing out title, abstract, year of publication, country of origin, and document ID. Additionally, by using the document ID to query the European Patent Office’s Open Patent Services API, I was able to enhance the data we presented for each record by automatically pulling data like original language title, original language abstract, and images of the original patents. The images, such as drawings from the patents, often provide a new perspective on the inventions and clarify what is described in the abstracts.
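The query-and-parse step behind the searchbox might look something like the following sketch. It leaves out Flask and the network request and uses only the Python standard library. The function names and the field names id, title, and abstract are assumptions made for illustration; only ‘year’ and ‘country’ are named above as indexed fields.

```python
from urllib.parse import urlencode


def build_search_url(hostname, port, core, query, rows=10):
    """Build a Solr select URL for a free-text query (hypothetical helper)."""
    params = urlencode({"q": query, "q.op": "OR", "wt": "json", "rows": rows})
    return f"http://{hostname}:{port}/solr/{core}/select?{params}"


def parse_results(response):
    """Pull display fields out of a decoded Solr JSON response.

    Solr wraps matching documents in response -> docs. The field names
    used here are assumptions, so missing fields come back as None.
    """
    docs = response.get("response", {}).get("docs", [])
    return [
        {
            "id": doc.get("id"),
            "title": doc.get("title"),
            "abstract": doc.get("abstract"),
            "year": doc.get("year"),
            "country": doc.get("country"),
        }
        for doc in docs
    ]
```

In a Flask view, the URL would be fetched, the JSON decoded, and the list returned by parse_results passed to a template for display.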

diagram of the application architecture

Solr provides a reliable and robust search engine but we wanted to experiment with the idea of a search engine to reconceptualise how the data in this database could be represented. The traditional function of a search engine is to help lead a user towards a specific document that matches their search query. This is the case with library catalogues and archive search engines. As Marshall Breeding (2015) writes:

One dimension of this experience relates to satisfying those with something specific in mind. These patrons may have a favorite author or topic in mind and want specific resources, which might include the next book by that author or an exhaustive set of materials in the bibliography of a research topic. The catalogs and discovery services that libraries present to their users are designed especially for these kinds of information fulfillment activities.

But what if a user doesn’t have something specific in mind? This is a question I came across as a library systems developer in relation to browsing and serendipity. Part of the joy of going to a physical library is browsing the shelves and serendipitously discovering texts that you might otherwise never have come across: “How many times do we visit a library or bookstore expecting that something interesting will catch our attention?” (Breeding, 2015). Replicating this browsing experience in automated library catalogues and search engines has posed a challenge for library system developers.

To infuse the search engine for the patents data with serendipity, we inserted a large degree of randomness into the design of the system. You can inject randomness into a computer system by inserting a random number into an algorithm, but unfortunately computers are not good at generating random numbers on their own. A computer system is designed to follow instructions to the letter. Erin Herzstein (2021) sums this up by asking “How does a computer, a machine defined by its adherence to instructions and formulae, generate a random outcome?”

Computers can create randomness in two major ways. First, there are true random number generators, which produce random numbers by “harvesting entropy” (Herzstein, 2021): they take unpredictable input from some external source, such as a user randomly moving their finger around a laptop trackpad or mashing buttons on a keyboard, or atmospheric noise like static. Second, there are pseudo-random number generators, which use algorithms to generate numbers that appear random to a human but aren’t truly mathematically random. Pseudo-random number generators apply mathematical formulae repeatedly to produce an output that appears random but is actually determined by the formulae and their starting value.
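To make the idea of a pseudo-random number generator concrete, here is a minimal sketch of one of the simplest kinds, a linear congruential generator. It is purely illustrative: Python’s own random module uses a different algorithm (the Mersenne Twister), and the class name and choice of constants here are assumptions for the example.

```python
class LinearCongruentialGenerator:
    """A toy pseudo-random number generator.

    Each number is computed from the previous one with a fixed formula,
    so the same seed always yields the same sequence.
    """

    MODULUS = 2**32
    MULTIPLIER = 1664525      # widely published constants for this formula
    INCREMENT = 1013904223

    def __init__(self, seed):
        self.state = seed % self.MODULUS

    def next_int(self):
        """Advance the state once and return it."""
        self.state = (self.MULTIPLIER * self.state + self.INCREMENT) % self.MODULUS
        return self.state

    def randint(self, low, high):
        """Return an integer between low and high inclusive."""
        return low + self.next_int() % (high - low + 1)
```

Two generators created with the same seed produce identical sequences, which is exactly what makes the output pseudo-random rather than truly random.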

To inject randomness into our search results from Solr, we use a pseudo-random number generator in the form of the randint() function from Python’s random module. The line rand = str(random.randint(0, 9999999)) produces a random integer in the range 0 to 9,999,999 and converts it to a string. We then make a request to the Solr search engine, inserting that random number as part of a parameter for sorting the results: solrurl = 'http://' + solr_hostname + ':' + solr_port + '/solr/' + core + '/select?q.op=OR&q=*%3A*&wt=json&sort=random_' + rand + '%20asc&rows=1'. In essence, this sorts the search results differently every time and we pick the top result, like picking the top card from a freshly shuffled deck of cards.
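The two lines quoted above can be wrapped into a small reusable function, sketched below. The function name random_sort_url and its rows and rng parameters are assumptions added for illustration; the URL itself follows the one given in the text. The sort=random_<number> parameter relies on the Solr core defining a random sort field for names beginning with random_, which is assumed here.

```python
import random


def random_sort_url(hostname, port, core, rows=1, rng=random):
    """Build a Solr URL that matches every document in a random order.

    A fresh random number in the sort field name gives a different
    ordering on each call; rows controls how many results are taken
    from the top of that ordering.
    """
    seed = rng.randint(0, 9999999)
    return (
        f"http://{hostname}:{port}/solr/{core}/select"
        f"?q.op=OR&q=*%3A*&wt=json&sort=random_{seed}%20asc&rows={rows}"
    )
```

Passing rows=2 or rows=10 would return several records from the same shuffled ordering in a single request, while passing a seeded random.Random instance as rng makes the output repeatable for testing.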

Building on this basic bit of code, we were able to build a number of interesting functions to randomly explore the dataset and facilitate serendipitous discovery of interesting patents. First and foremost, we can pull out one random patent from the entire dataset. Using the ‘A random entry’ button, a user could get a set of “motion gloves” from China in one click or a full-length “body suit” from the United States of America in the next. This allows the user to skip through the entire archive randomly like flipping through a book and randomly choosing a page to read.

We were also able to double this feature to pull up two random patents and present them side-by-side for comparison. ‘A juxtaposition of two’ simply performs the randomising function twice and puts both entries next to one another so a user can compare two entirely random patents from the archive. This often brings up interesting and poetic juxtapositions generated entirely fortuitously across the database.

The randomness functions also allow us to pull out random elements of data. ‘A poetics of titles’ retrieves the titles for 10 random patents and presents them in a quasi-poetic form. Each refresh brings a new arrangement of titles, which is interesting both as a comparison of the disparate patents in the database and on its own merits as a poetic arrangement of clothing names. Similarly, ‘a handful of fragments’ retrieves 10 random abstracts and arranges them in a continuous form for a long poetics that removes the descriptions from their titles and their context to form an abstraction of abstracts.
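One way to turn a set of randomly sorted records into such an arrangement is sketched below. This is an illustration under assumptions rather than the site’s actual code: the function name arrange_titles and the title field name are hypothetical, and the input is a decoded Solr JSON response.

```python
def arrange_titles(response, limit=10):
    """Arrange the titles from a decoded Solr JSON response one per line.

    Assumes each document may carry a 'title' field (a hypothetical name).
    Documents without a title are skipped.
    """
    docs = response.get("response", {}).get("docs", [])
    lines = []
    for doc in docs[:limit]:
        title = doc.get("title")
        # Multi-valued Solr fields arrive as lists; take the first value.
        if isinstance(title, list):
            title = title[0] if title else None
        if title:
            lines.append(str(title).strip())
    return "\n".join(lines)
```

The same approach would serve for abstracts by reading a different field and joining the fragments with spaces instead of line breaks.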

Finally, ‘a scattering of images’ retrieves random patent IDs and uses these to pull image data from the European Patent Office’s Open Patent Services API so that random images from these patents can be presented side-by-side, again shorn of their context to stand alone as fascinating images in their own right.

These functions subvert the traditional functions of a search engine by emphasising the retrieval of records that the user does not expect and does not even necessarily want. They allow each user to uncover their own cross-section of the database without curation or mediation and in that way they emphasise the poetics inherent in the archive. This all stands alongside the traditional search functions that Solr allows so that a user can choose between the more directed experience of search or the randomised experience of browsing serendipitously.

With this basic application structure in place, we worked with designer and developer Joana Chicau to apply a design that brings out this poetics of the archive and presents the patent records in a visually interesting way that balances database retrieval functions with aesthetics. Joana’s design not only draws the user into the database book but uses JavaScript to toggle the display of more information about the publication structure. On some pages, this JavaScript also reveals the code behind the website, further subverting user expectations about how the back-end development of a site is hidden behind the front-end and making a statement about the openness built into our open source development. With a click, the code can become exposed to the user, revealing how we connect to the database and where the data comes from. All our code is already open source, available publicly on GitHub under an MIT License, but this intervention moves the code into the foreground and exposes the data hidden behind the façade of the book.

In transforming this dataset into a book, we looked to many of the features of a print book that are difficult to replicate in the form of an electronic publication, especially one built on a traditional relational data model. It is possible to browse through books on a shelf to find things serendipitously and then to flick through the pages to discover something unexpected. We have tried to draw on this by creating randomising functions to flick through the dataset automatically. However, a book may also have an index and a table of contents for finding precise information quickly. We’ve created a traditional table of contents for our database book and indexed all the contents in a search engine so that precise searches can instantly pull up relevant results. In this way, we have tried to produce a database book that combines the computational advantages of a database with the poetics of a book.

bibliography

Breeding, Marshall. 2015. ‘Serendipity: The Virtual-Library Experience’. Computers in Libraries 35 (9): 9–11.

Brin, Sergey, and Lawrence Page. 1998. ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’. Computer Networks and ISDN Systems 30 (1–7): 107–17. https://doi.org/10.1016/S0169-7552(98)00110-X.

Codd, E. F. 1970. ‘A Relational Model of Data for Large Shared Data Banks’. Communications of the ACM 13 (6): 377–87. https://doi.org/10.1145/362384.362685.

Herzstein, Erin. 2021. ‘How Do Computers Generate Random Numbers?’ Medium. 30 January 2021. https://levelup.gitconnected.com/how-do-computers-generate-random-numbers-a72be65877f6.

Kline, Kevin, Lee Gould, and Andrew Zanevsky. 1999. Transact-SQL Programming: Covers Microsoft SQL Server 6.5/7.0 and Sybase Adaptive Server 11.5. O’Reilly Media, Inc.

Kubica, Chris. 2010. ‘Your Book as a Database: A Primer’. Publishing Perspectives (blog). 10 June 2010. https://publishingperspectives.com/2010/06/your-book-as-a-database-a-primer/.