Channel: OUseful.Info, the blog... » Arcadia

What’s Inside a Book?


A couple of months ago, when I started looking at the idea of emergent social positioning in online social networks, I was focussing on trying to model the positioning of certain brands and companies, in part with a view to trying to identify ones that were associated with innovation, or future thinking in some way.

Based on absolutely no evidence at all, I surmised that one useful signal in this regard might be the context in which companies or brands are mentioned in popular, MBA-related business books, the sort of thing that Harvard Business Review publish, for example.

Here’s how my thinking went then:

- generate a bipartite network graph that connects the book’s index terms with page numbers of the pages they appear on based on the index entries* in a given book. A bipartite graph is one that contains two sorts or classes of node (in this case, index term nodes and book page number nodes). The index terms are likely to include companies, brands, people and ideas/concepts. Sometimes, particular index terms may be identified as companies, names, etc, through presentational mark up – a bold font, or italics, for example. These presentational conventions can often be mapped onto semantic equivalents. Terms might also be passed through something like the Reuters’ Open Calais service, or TSO’s Data Enrichment Service.

- collapse the network graph by generating links between things that are connected to the same page number and remove the page number nodes from the graph. You now have a graph that connects brands, people and other index terms with each other, where edges represent the relation “is on the same page in the same book as”. If companies and other index terms appear on several pages together, we might reflect this by increasing the weight of the edge that connects them, for example by using edge weight to represent the number of pages where the two terms co-exist.

(*This will be obvious to some, but not to others. To a certain extent, a book index provides a faceted/search term limited search engine interface to a book, that returns certain pages as results to particular queries…)

Note that we can generate a network for a specific book, in which case we can render a graphical summary of the content, relations within and structure of that book, or we can generate more comprehensive networks that summarise the index term relations across several books.
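As a rough illustration of the two steps above, here's a minimal sketch in plain Python. It assumes the index has already been parsed into a mapping from index term to page numbers, which is exactly the bit I don't have; the function name `cooccurrence_edges` and the sample entries are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(index):
    """Collapse a term -> pages index into weighted term-term edges.

    `index` is assumed to be a dict mapping each index term to the
    page numbers listed for it (a hypothetical, already-parsed form).
    """
    # Bipartite step: invert term -> pages into page -> terms.
    terms_by_page = defaultdict(set)
    for term, pages in index.items():
        for page in pages:
            terms_by_page[page].add(term)

    # Projection step: link every pair of terms that share a page and
    # drop the page nodes. The edge weight counts the shared pages.
    weights = defaultdict(int)
    for terms in terms_by_page.values():
        for a, b in combinations(sorted(terms), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Made-up index entries, purely for illustration.
sample_index = {
    "Apple": [12, 13, 40],
    "innovation": [12, 40, 77],
    "Toyota": [40, 90],
}
edges = cooccurrence_edges(sample_index)
for (a, b), weight in sorted(edges.items()):
    print(a, "--", b, "weight", weight)
```

To build a network across several books, one option would be to run this once per book and sum the weights for matching term pairs, though that assumes the same term is spelled the same way in each index.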

My thinking then was that if we can grab the indexes of a set of business books, we could map which companies and brands were being associated either with each other or with particular concepts in MBA land.

Which is where the problem lies – because I haven’t found anywhere where I can readily get hold of the indexes of business books in a sensible machine readable format. Given an electronic copy of a book, I guess I could run some text processing algorithms over it looking for word pairs in close association with each other and generating my own view over the book. But the reason for using an actual book index is at least twofold: firstly, because there has presumably been a quality process that determines what terms are entered into the index; secondly, because the index, if used by a human reader, will be influencing which parts of the book (and hence which related terms) they will be exposed to.
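For what it's worth, the text processing fallback might look something like the sketch below: count how often two terms of interest turn up within a few words of each other. This is only one crude reading of "close association"; the function name `window_cooccurrences`, the window size and the sample text are all assumptions for the example, and it only handles single-word terms.

```python
import re
from collections import Counter


def window_cooccurrences(text, terms, window=10):
    """Count pairs of distinct terms appearing within `window` tokens.

    Hypothetical helper: a simple sliding-window count over lowercased
    word tokens, not a substitute for a curated book index.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    wanted = {t.lower() for t in terms}
    # Positions of the tokens we care about, in reading order.
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in wanted]

    counts = Counter()
    for n, (i, a) in enumerate(hits):
        for j, b in hits[n + 1:]:
            if j - i > window:
                break  # later hits are even further away
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
    return counts


# Made-up text, purely for illustration.
sample_text = (
    "Apple is often cited for innovation. Years later, in a very "
    "different market and under very different pressures, Toyota "
    "is praised for process."
)
pairs = window_cooccurrences(
    sample_text, ["Apple", "innovation", "Toyota"], window=5
)
print(pairs)
```

The resulting pair counts could be used as edge weights in the same way as the shared page counts, but they carry none of the editorial judgement that goes into a real index.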

(It’s maybe also worth noting that books also contain a lot of other structured metadata – tables of contents, lists of figures, titles, headings, subheadings, emphasis, lists, captions, and so on, all of which provide cues as to how the book is structured and how ideas and entities contained within it relate to each other.)

As to why I’m posting this now? I first floated this idea with @edchamberlain following a JISC bibliography data event, and he reminded me of it at the Arcadia Project review a couple of days ago ;-)

Related, sort of: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags, which looked at mapping corporate mentions in the BBC/OU co-pro business programme The Bottom Line:

First attempt at tagging BBC/OU 'The Bottom Line' progs using opencalais

Also Citation Positioning.

PS this is clever – and related – via @ostephens: http://www.eatyourbooks.com/ (“‘Tell us which books you own’ We have indexed the most popular cookbooks & magazines so recipes become instantly searchable.”).


