
added full draft of making-of essay

main
Simon Bowie 1 year ago
parent
commit
8e512c7e41
2 changed files with 81 additions and 13 deletions
  1. web/app/static/styles/custom.css (+34, -0)
  2. web/content/section_5/on-combining-databases-and-books.md (+47, -13)

web/app/static/styles/custom.css (+34, -0)

@@ -485,6 +485,40 @@ canvas > * {
}

/* MAKING */

.making {
  margin: 0% 20%;
}

.making h1 {
  text-align: center;
}

.making p {
  margin: 2rem 0rem;
}

.making img {
  width: 100%;
}

.making h2, .making h3 {
  margin: 2rem 0rem;
  padding: 0rem 1rem;
}

.making h2 {
  font-size: 2.25rem;
  background-color: var(--color-lightyellow);
  border-radius: 0.5rem;
}

.making h3 {
  border-bottom: 0.25rem dashed var(--color-lightyellow);
}


/***************** MOBILE ****************/

@media screen and (min-width:0px) and (max-width: 768px) {

web/content/section_5/on-combining-databases-and-books.md (+47, -13)

@@ -1,27 +1,61 @@
# on combining databases and books

## an essay by [Simon Bowie](https://simonxix.com)

Performing Patents Otherwise is an experimental publication drawing together a database in the form of a book. As part of the [COPIM project](https://www.copim.ac.uk/), we turned a dataset of patents into a searchable website and book publication. In doing so, we deliberately subverted many of the traditional principles behind relational databases and search engines in order to expose the inherent poetics of the archive and to cultivate a sense of serendipity through computational randomness. This chapter outlines the tools we used to build this database book and offers some reflections on subverting traditional models of presenting data.

Databases are so integrated into the computational systems behind contemporary life that their functions are often hidden from us. Every time you log in to a website, your login details are pulled from a database. Every time you look at a tweet or an Instagram post, the content is pulled from a database. Every time you access a file in a document management platform like Nextcloud or Microsoft SharePoint, the link to the file is retrieved from a database. Relational databases are the silent background to our modern computer interactions.

A database is an organised collection of data. In computer systems, data is often stored in a relational database based on the relational model of data proposed by IBM computer scientist E. F. Codd (1970, p. 377) in his 1970 article in *Communications of the ACM*. Codd proposed “12 Principles of Relational Databases” (Kline, Gould, and Zanevsky, 1999, p. 5) which specify how data can be organised in rows and columns in tables and then linked to other pieces of data through their relationships (though since lists in computer systems often treat 0 as the start of an ordered list, there are actually thirteen rules). This relational model forms the basis of the modern relational database management systems that power websites and web applications: open source MySQL used by Facebook, Twitter, and YouTube; open source MariaDB used by Mozilla, Google, and Wikipedia; open source PostgreSQL used by OpenStreetMap, Reddit, Instagram, and the International Space Station; and closed source systems like Oracle or Microsoft SQL Server.

But if we go back and consider databases using their most basic definition as an ‘organised collection of data’, then a book can also be considered a database of sorts. Chris Kubica (2010) considers the analogy of books as relational databases and imagines the data relationships between parts of a book like chapters, table of contents, and index. He further drills down on the chapters to discover the sentences within and the words nested within the sentences.

![book-as-database diagram from Kubica (2010)](/static/images/bd2.png)

The Performing Patents Otherwise publication offers one way of conceiving of database as book. As Open Source Software Developer for the COPIM project, I was asked to help consider how the dataset of patents curated by the [Politics of Patents](https://www.politicsofpatents.org/) research project could be turned into a publication as part of COPIM’s work package supporting experimental open access publishing. It’s a wonderful dataset to browse through, full of fascinating and unexpected clothing designs and wearable inventions, and we wanted to preserve that sense of randomly opening a file to discover something wonderfully weird while also making the data accessible in a modern way alongside published chapters. We’ve turned the curated and enhanced Politics of Patents dataset into a hybrid digital publication that combines a robust search engine with textual ‘book-type’ content to facilitate conversations and creative interventions with the data itself.

The dataset consists of approximately 320,000 patents for clothing and wearable technology from the European Patent Office. It comprises RTF files and accompanying PDF files arranged in a directory structure by year of patent registration and then by the country code of the originating country. Each RTF file contains basic metadata about the patent, including patent title, application ID, publication ID, a short description or abstract describing the patent, and International Patent Classification (IPC) number. Each ID is itself structured data containing an [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) country code (e.g. GB for the United Kingdom) and an [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) date of application or publication.
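
For illustration, here is a minimal sketch of how such an ID might be unpacked in Python. The exact layout shown (a two-letter country code followed by a `YYYYMMDD` date) is a simplifying assumption for the example, not the EPO's precise ID format:

```python
import re
from datetime import date

def parse_patent_id(patent_id: str):
    """Split an ID like 'GB20010315' into its country code and date.

    Assumes a simplified layout: an ISO 3166-1 alpha-2 code followed
    by an ISO 8601 basic-format date (YYYYMMDD).
    """
    match = re.match(r"^([A-Z]{2})(\d{4})(\d{2})(\d{2})", patent_id)
    if not match:
        raise ValueError(f"unrecognised patent ID: {patent_id!r}")
    country, year, month, day = match.groups()
    return country, date(int(year), int(month), int(day))
```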

To make the dataset accessible and searchable, I wanted to put the data into a search engine. In my previous work on library catalogue and search systems, I’d worked with [Apache Solr](https://solr.apache.org/), a reliable open source search engine that provides full-text search, faceted search, and advanced customisation. Solr is able to index RTF files using [Apache Tika’s](https://tika.apache.org/) framework for extracting metadata and text from a range of document formats. We’ve made some customisations to Solr’s config to extract and index ‘year’ and ‘country’ data for each document using regular expressions to find the data in the documents.
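
As a sketch of the kind of regular expressions involved (the real extraction happens in Solr's configuration; the sample text and patterns here are illustrative, assuming the country code and year can be read from a publication ID embedded in the document text, following the ID structure described earlier):

```python
import re

# Illustrative pattern: a two-letter country code followed by a year
# (19xx or 20xx) and the rest of a YYYYMMDD date, as in 'GB20010315'.
ID_RE = re.compile(r"\b([A-Z]{2})((?:19|20)\d{2})\d{4}\b")

def extract_fields(text: str) -> dict:
    """Pull 'country' and 'year' index fields out of extracted document text."""
    match = ID_RE.search(text)
    if not match:
        return {"country": None, "year": None}
    return {"country": match.group(1), "year": match.group(2)}
```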

The Solr search engine allows us to run search queries against the text but it doesn’t provide a built-in frontend for doing so. Using [Flask](https://flask.palletsprojects.com/en/2.2.x/), a web framework written in Python, I built a website frontend with a search box that queries the Solr search engine and presents the results in an accessible way, parsing out title, abstract, year of publication, country of origin, and document ID. Additionally, by using the document ID to query the European Patent Office’s [Open Patent Services API](https://www.epo.org/searching-for-patents/data/web-services/ops.html), I was able to enhance the data we presented for each record by automatically pulling data like original language title, original language abstract, and images of the original patents. The images, such as drawings from the patents, often provide a new perspective on the inventions and clarify what is described in the abstracts.
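
A minimal sketch of that query-and-parse step, using only the Python standard library. The hostname, core name, and field names are assumptions for illustration, and in the real application this logic sits inside a Flask route:

```python
import json
import urllib.parse
import urllib.request

# Assumed Solr location and core name, for illustration only.
SOLR_URL = "http://localhost:8983/solr/patents/select"

def parse_docs(docs):
    """Extract the fields shown on each result card from raw Solr docs."""
    return [
        {
            "title": doc.get("title"),
            "abstract": doc.get("abstract"),
            "year": doc.get("year"),
            "country": doc.get("country"),
            "id": doc.get("id"),
        }
        for doc in docs
    ]

def search(query):
    """Query Solr and return parsed result records."""
    params = urllib.parse.urlencode({"q": query, "wt": "json", "rows": 10})
    with urllib.request.urlopen(f"{SOLR_URL}?{params}") as response:
        data = json.load(response)
    return parse_docs(data["response"]["docs"])
```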

Solr provides a reliable and robust search engine but we wanted to experiment with the idea of a search engine to reconceptualise how the data in this database could be represented. The traditional function of a search engine is to help lead a user towards a specific document that matches their search query. This is the case with library catalogues and archive search engines. As Marshall Breeding (2015) writes:

> One dimension of this experience relates to satisfying those with something specific in mind. These patrons may have a favorite author or topic in mind and want specific resources, which might include the next book by that author or an exhaustive set of materials in the bibliography of a research topic. The catalogs and discovery services that libraries present to their users are designed especially for these kinds of information fulfillment activities.

But what if a user doesn’t have something specific in mind? This is a question I came across as a library systems developer in relation to browsing and serendipity. Part of the joy of going to a physical library is browsing the shelves and serendipitously discovering texts that we might otherwise never have come across: “How many times do we visit a library or bookstore expecting that something interesting will catch our attention?” (Breeding, 2015). Replicating this browsing experience in automated library catalogues and search engines has posed a challenge for library system developers.

To infuse the search engine for the patents data with serendipity, we inserted a large degree of randomness into the design of the system. You can inject randomness into a computer system by inserting a random number into an algorithm, but unfortunately computers are not good at creating randomness on their own. A computer system is designed to follow instructions to the letter. Erin Herzstein (2021) sums this up by asking “How does a computer, a machine defined by its adherence to instructions and formulae, generate a random outcome?”

Computers can create randomness in two major ways. First, true random number generators which produce random numbers by “harvesting entropy” (Herzstein, 2021): they take unpredictable input from some external source such as a user randomly moving their finger around a laptop trackpad or mashing buttons on a keyboard or atmospheric noise like static. Second, pseudo-random number generators which use algorithms to generate numbers that appear to be random to a human but aren’t truly mathematically random. Pseudo-random number generators perform mathematical formulae a number of times to produce an output that appears random but is actually determined by the formulae.
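
Python itself offers both kinds: the `random` module is a pseudo-random generator (CPython uses the Mersenne Twister algorithm), while the `secrets` module draws on the operating system's entropy pool. A small sketch shows the difference:

```python
import random
import secrets

# Pseudo-random: a deterministic algorithm. Seeding it with the same
# value reproduces the same 'random' sequence, showing the output is
# actually determined by formulae.
random.seed(42)
first = random.randint(0, 9999999)
random.seed(42)
second = random.randint(0, 9999999)
assert first == second  # same seed, same sequence

# Entropy-based: secrets draws on unpredictable input harvested by the
# operating system, so its output cannot be reproduced by re-seeding.
unpredictable = secrets.randbelow(10000000)
```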

To inject randomness into our search results from Solr, we use a pseudo-random number generator in the form of Python’s [randint() method](https://docs.python.org/3/library/random.html). The line `rand = str(random.randint(0, 9999999))` produces a random integer in the range 0 to 9,999,999. We then make a request to the Solr search engine, inserting that random number as part of a parameter for sorting the results: `solrurl = 'http://' + solr_hostname + ':' + solr_port + '/solr/' + core + '/select?q.op=OR&q=*%3A*&wt=json&sort=random_' + rand + '%20asc&rows=1'`. In essence, this sorts the search results differently every time and we pick the top result like picking the top card from a freshly shuffled deck of cards.
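
The snippet above can be wrapped into a small self-contained helper. This is a sketch using only the standard library; the default hostname, port, and core name are illustrative placeholders:

```python
import json
import random
import urllib.request

def build_random_url(solr_hostname="localhost", solr_port="8983", core="patents"):
    """Build a Solr query URL that sorts on a random dynamic field."""
    rand = str(random.randint(0, 9999999))
    return (
        "http://" + solr_hostname + ":" + solr_port + "/solr/" + core
        + "/select?q.op=OR&q=*%3A*&wt=json&sort=random_" + rand + "%20asc&rows=1"
    )

def random_patent():
    """Fetch one document: the top card of a freshly shuffled deck."""
    with urllib.request.urlopen(build_random_url()) as response:
        return json.load(response)["response"]["docs"][0]
```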

Building on this basic bit of code, we were able to build a number of interesting functions to randomly explore the dataset and facilitate serendipitous discovery of interesting patents. First and foremost, we can pull out one random patent from the entire dataset. Using the ‘A random entry’ button, a user could get a set of “motion gloves” from China in one click or a full-length “body suit” from the United States of America in the next. This allows the user to skip through the entire archive randomly like flipping through a book and randomly choosing a page to read.

We were also able to double this feature to pull up two random patents and present them side-by-side for comparison. ‘A juxtaposition of two’ simply performs the randomising function twice and puts both entries next to one another so a user can compare two entirely random patents from the archive. This often brings up interesting and poetic juxtapositions generated entirely fortuitously across the database.

The randomness functions also allow us to pull out random elements of data. ‘A poetics of titles’ retrieves the titles for 10 random patents and presents them in a quasi-poetic form. Each refresh brings a new arrangement of titles which is interesting both for comparison of the disparate patents in the database but also on its own merits as a poetic arrangement of clothing names. Similarly, ‘a handful of fragments’ retrieves 10 random abstracts and arranges them in a continuous form for a long poetics that removes the descriptions from their titles and their context to form an abstraction of abstracts.
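
These multi-record functions follow the same pattern as the single random patent, changing only the number of rows requested and the fields retrieved. A sketch, with illustrative field names and parameters:

```python
import random

def random_sort_params(rows):
    """Query parameters for pulling `rows` random documents from Solr,
    using the same random-sort trick as the single-patent function."""
    rand = str(random.randint(0, 9999999))
    return {
        "q": "*:*",
        "wt": "json",
        "sort": "random_" + rand + " asc",
        "rows": rows,
        "fl": "title",  # field name is an illustrative assumption
    }

def poetics_of_titles(docs):
    """Arrange the titles of the returned documents as lines of a poem."""
    return "\n".join(doc.get("title", "") for doc in docs)
```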

Finally, ‘a scattering of images’ retrieves random patent IDs and uses these to pull image data from the European Patent Office’s Open Patent Services API so that random images from these patents can be presented side-by-side, again shorn of their context to stand alone as fascinating images in their own right.

These functions subvert the traditional functions of a search engine by emphasising the retrieval of records that the user does not expect and does not even necessarily want. They allow each user to uncover their own cross-section of the database without curation or mediation and in that way they emphasise the poetics inherent in the archive. This all stands alongside the traditional search functions that Solr allows so that a user can choose between the more directed experience of search or the randomised experience of browsing serendipitously.

With this basic application structure in place, we worked with designer and developer Joana Chicau to apply a design that brings out this poetics of the archive and presents the patent records in a visually interesting way, balancing database retrieval functions with aesthetics. Joana’s design not only draws the user into the database book but uses JavaScript to toggle the display of more information about the publication structure. On some pages, this JavaScript also reveals the code behind the website, further subverting user expectations about how the back-end development of a site is hidden behind the front-end and making a statement about the openness built into our open source development. With a click, the code is exposed to the user, revealing how we connect to the database and where the data comes from. All our code is already open source, available [publicly on GitHub](https://github.com/COPIM/politics_of_patents) under an MIT License, but this intervention moves the code into the foreground and exposes the data hidden behind the façade of the book.

In transforming this dataset into a book, we looked to many of the features of a print book that are difficult to replicate in an electronic publication, especially one built on a traditional relational data model. It is possible to browse through books on a shelf to find things serendipitously and then to flick through the pages to discover something unexpected. We have tried to draw on this by creating randomising functions to flick through the dataset automatically. However, a book may also have an index and a table of contents for finding precise information quickly. We’ve created a traditional table of contents for our database book and indexed all the contents in a search engine so that precise searches can instantly pull up relevant results. In this way, we have tried to produce a database book that combines the computational advantages of a database with the poetics of a book.

## bibliography

Breeding, Marshall. 2015. ‘Serendipity: The Virtual-Library Experience’. *Computers in Libraries* 35 (9): 9–11.

Brin, Sergey, and Lawrence Page. 1998. ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’. *Computer Networks and ISDN Systems* 30 (1–7): 107–17. https://doi.org/10.1016/S0169-7552(98)00110-X.

Codd, E. F. 1970. ‘A Relational Model of Data for Large Shared Data Banks’. *Communications of the ACM* 13 (6): 377–87. https://doi.org/10.1145/362384.362685.

Herzstein, Erin. 2021. ‘How Do Computers Generate Random Numbers?’ Medium. 30 January 2021. https://levelup.gitconnected.com/how-do-computers-generate-random-numbers-a72be65877f6.

Kline, Kevin, Lee Gould, and Andrew Zanevsky. 1999. *Transact-SQL Programming: Covers Microsoft SQL Server 6.5/7.0 and Sybase Adaptive Server 11.5*. O’Reilly Media, Inc.

Kubica, Chris. 2010. ‘Your Book as a Database: A Primer’. *Publishing Perspectives* (blog). 10 June 2010. https://publishingperspectives.com/2010/06/your-book-as-a-database-a-primer/.
