July 9, 2009

POS Frequencies in the MONK Corpus, with Additional Musings

This post is on the work I presented at DH ‘09, plus some thoughts on what’s next for my project. It’s related to this earlier post on preliminary part-of-speech frequencies across the entire MONK corpus, but includes new material and figures based on some data pruning and collection as mentioned in this post (details below).

A word, first, on why I’m working on this. I don’t really care, of course, about the relative frequencies of various parts of speech across time, any more than chemists care about, say, the absorption spectra of molecules. What I’m looking for are useful diagnostics of things that I do care about but that are hard to measure directly (like, say, changes in the use of allegory across historical time or, more broadly, in rhetorical cues of literary periodization).

My hypothesis is that allegory should be more prominent and widespread in the short intervals between literary-historical periods than during the periods themselves. Since we also suspect that allegorical writing should be “simpler” on its face than non-allegorical writing (because it needs to sustain an already complicated set of rhetorical mappings over large parts of its narrative), it makes sense (in the absence of a direct measure of “allegoricalness”) to look for markers of comparative narrative simplicity/complexity as proxies for allegory itself. I think part-of-speech frequency might be one such measure. In any case, if I’m right about allegory and periodization and if I’m also right about specific POS frequencies as indicators of allegory, then we should expect certain POS frequencies to exhibit significant (in the statistical sense) fluctuations around periodizing moments and events. (I wish there were fewer ifs in that last sentence; I’ll say a bit more below about how one could eliminate them.)

So … what do we see in the MONK case? Recall that the results from the full dataset looked like this:

[Figure: POS Frequencies, Full MONK Corpus]

But that’s messy and not of much use. It doesn’t focus on the few POS types that I think might be relevant (nouns, verbs, adjectives, adverbs); it includes a bunch of texts that aren’t narrative fiction (drama, sermons, etc.); and it’s especially noisy because I didn’t make any attempt to control for years in which very few texts (or authors) were published. (Note that the POS types listed are the reduced set of so-called “word classes” from NUPOS.)

Here’s what we get if we limit the POSs (PsOS?) in question, exclude texts that aren’t narrative fiction, and group together the counts from nearby years with low quantities of text:

[Figure: POS Frequencies, Reduced and Consolidated MONK Corpus]

And here’s the same figure with the descriptive types (adjectives and adverbs) added together:

[Figure: POS Frequencies, Reduced and Consolidated MONK Corpus (Adj + Adv)]

[Some data details, skippable if you don't care. First, note that the x axes in all three figures need to be fixed up; they're just bins by year label, rather than proper independent variables. I'll fix this soon, but it doesn't make much difference in the results. You can download the raw POS counts for the full corpus (not sorted by year of publication), as well as those restricted to texts with genre = fiction. These are interesting, I guess, but more useful are the same figures split out by year of publication, both for the whole corpus, and just for fiction (presented as frequencies rather than counts). Finally, there are the fiction-only, year-consolidated numbers (back to counts for these, because I'm lazy). The table of translations between the full NUPOS tags and the (very reduced) word classes presented here is also available.]
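For anyone working from the count files rather than the frequency files, here is a minimal sketch of the conversion, including the combined adjective-plus-adverb series. The word-class labels (`"noun"`, `"adjective"`, and so on) and the `to_frequencies` helper are hypothetical stand-ins, not the actual NUPOS word-class names or the code behind the figures, and the numbers in the example are invented.

```python
def to_frequencies(counts, classes=("noun", "verb", "adjective", "adverb")):
    """Convert raw word-class counts for one year bin into relative frequencies.

    counts maps a word-class label to its token count and should include
    every class, so that the sum is the bin's full token count.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens in this bin")
    freqs = {c: counts.get(c, 0) / total for c in classes}
    # Combined "descriptive" series: adjectives and adverbs added together.
    freqs["adj+adv"] = (counts.get("adjective", 0) + counts.get("adverb", 0)) / total
    return freqs


# Invented counts for a single bin, for illustration only.
example = {"noun": 200, "verb": 150, "adjective": 60, "adverb": 40, "other": 550}
print(to_frequencies(example))
```

Dividing by the full token count (rather than by the four selected classes only) keeps the frequencies comparable across bins of different sizes.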

So what does this all mean? The first thing to notice is that there’s no straightforward confirmation of my hypotheses in these figures. There’s some meaningful fluctuation in noun and verb frequency over the first half of the nineteenth century—which I think might be an interesting indication of the kind of writing that was dominant at the time (see the noun and verb frequency section of this post)—but no corresponding movement in the combined frequency of adjectives and adverbs. This might mean several things: I might be wrong about the correlation between such frequencies and periodizing events, or I might not be looking at the right POS types, or (quite likely, regardless of other factors) I might not have low enough noise levels to distinguish what one would expect to be fairly small variations in POS frequency.

Where to go from here? A few directions:

I’ll keep working on a bigger corpus. The fiction holdings from MONK are only about 1000 novels, spread (unevenly) over 120+ (or 150+) years. So we’re looking at eight or fewer books on average in any one year, and that’s just not very much if we want good statistics.

There are a couple of ways to go about doing this. Gutenberg has around 10,000 works of fiction in English, so it’s an order of magnitude larger. There are issues with their cataloging and bibliographic quality, but I think they’re addressable and I’m at work on them now. The Open Content Alliance has hundreds of thousands of high-quality digitizations from research libraries, though there are some cataloging issues and I’m not sure about final text quality (which relies on straight OCR rather than on hand-correction, as Gutenberg’s does). Still, OCA (or Google Books, depending on what happens with the proposed settlement, or Hathi) would offer the largest possible corpus for the foreseeable future. I’ve been talking to Tim Cole at UIUC about the OCA holdings and will report more as things come together.

But I think it’s also worth asking whether or not POS frequencies are the right way to go; I started down that path on a hunch, and it would be nice to have some promising data before I put too much more effort into pursuing it. What I need, really, are some exploratory descriptive statistics comparing known allegorical and nonallegorical texts. One of the reasons I’ve held off on doing that is that it seems like a big project. The time span I have in mind (several centuries), plus the range of styles, genres, national origins, genders, etc., suggests that the test corpus would need to be large (on the order of hundreds of books, say) if it’s not to be dominated by any one author/nation/gender/period/subject/etc. But how much reading and thinking would I have to do to identify, with high confidence, at least 100 major works of allegorical fiction and another 100 of comparable nonallegorical fiction? And would even that be enough? A daunting prospect, though it’s something that I’m probably going to have to do at some point.

But I got an interesting suggestion from Jan Rybicki (who works in authorship attribution, not coincidentally) at DH. Maybe it would suffice, at least preliminarily, to pick a handful of individual authors who wrote both allegorical and nonallegorical works reasonably close together in time, and to look for statistical distinctions between them. Since I’d be dealing with the same author, many of the problems about variations in period, national origin, gender, and so forth would go away, or at least be minimized. I suspect this wouldn’t do very well for finding distinctive keywords, which I imagine would be too closely tied to the specific content of each work (which is a problem that the larger training set is intended to overcome), but it might turn up interesting lower-level phenomena like (just off the top of my head) differences in character n-grams or sentence length. It would take some work to slice and dice the texts in every conceivably relevant statistical way, but I’m going to need to do that anyway and it’s hardly prohibitive.
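To make the “lower-level phenomena” concrete, here is a rough sketch of two such measurements: character n-gram counts and mean sentence length. Both functions are hypothetical helpers, and the sentence splitter is deliberately naive (it breaks on runs of `.`, `!`, and `?`, so abbreviations and dialogue will confuse it). Treat it as a starting point, not a worked-out method.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence (naive splitting)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


sample = "One two three. Four five!"
print(char_ngrams(sample, n=2).most_common(3))
print(mean_sentence_length(sample))
```

Run over an author’s allegorical and nonallegorical works separately, the two outputs could then be compared pairwise; how to test whether any difference is meaningful is a separate question.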

So that’s one easy, immediate thing to do. In the longer run, what I really want is to see what people in the field have understood to be allegorical and what not, which would have the great advantage, at least as a reference point, of eliminating some of the problems of individual selection bias. One way to do that would be to mine JSTOR, looking, for example, for collocates of “allegor*” or (more ambitiously) trying to do sentiment analysis on actual judgments of allegoricalness. I suspect the latter is out of the question at the moment (as I understand it, the current state of the art is something like determining whether or not customer product reviews are positive or negative, which seems much, much easier than determining whether or not an arbitrary scholarly article considers any one of the several texts it discusses to be allegorical or not). But the former—finding terms that go along with allegory in the professional literature, seeing how the frequency of the term itself and of specific allegorical works and authors changes over (critical) time, and so on—might be both easy and helpful; at the very least, it would be immensely interesting to me. So that’s something to do soon, too, depending on the details of JSTOR access. (JSTOR is one of the partners for the Digging into Data Challenge and they’ve offered limited access to their collection through a program they’re calling “data for research,” so I know they’re amenable to sharing their corpus in at least some circumstances. I was told at THATCamp by Loretta Auvil that SEASR is working with them, too.)
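The collocate idea is simple enough to sketch. The following is a toy version that assumes plain text is already in hand; it says nothing about how JSTOR actually exposes its data, and the `collocates` function, the window size, and the sample sentence are all invented for illustration. It counts the words that fall within a few tokens of anything matching “allegor*”.

```python
import re
from collections import Counter


def collocates(text, pattern=r"allegor\w*", window=5):
    """Count tokens appearing within `window` tokens of any token matching `pattern`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = re.compile(pattern)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if target.fullmatch(tok):
            lo = max(0, i - window)
            neighbors = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            # Skip other forms of the target term itself.
            counts.update(t for t in neighbors if not target.fullmatch(t))
    return counts


text = "The poem is a political allegory of empire. Its allegorical structure is political."
print(collocates(text, window=2).most_common(3))
```

Two caveats: the window ignores sentence boundaries, and there is no stopword filtering, so on real text the top collocates would mostly be function words until those are removed or the counts are compared against a baseline.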

[Incidentally, SEASR is something I've been meaning to check out more closely for a long time now. The idea of packaged but flexible data sources, analytics, and visualizations could be really powerful and could save me a ton of time.]

Finally (I had no idea I was going to go on so long), there are a couple of things I should read: Patrick Juola’s “Measuring Linguistic Complexity” (J Quant Ling 5:3 [1998], 206-13)—which might have some pointers on distinguishing complex nonallegorical works from simpler allegorical ones—plus newer work that cites it. And Colin Martindale’s The Clockwork Muse, which has been sitting on my shelf for a while and which was (re)described to me at DH as “brilliant and infuriating and wacky.” Sign me up.

July 5, 2009

Some POS Frequency Factoids

I’ll be posting a couple of times in the next few days about DH ‘09, THATCamp, and the state of my project. First, though, a handful of (mildly) interesting plots concerning part-of-speech frequency correlations from the MONK corpus.

MONK contains about 1,000 novels and novel-like works spread over the eighteenth, nineteenth, and twentieth centuries. (The full corpus is larger and covers a longer timespan; it includes drama, witchcraft narratives, some nonfiction, etc.) I’ve counted occurrences of the major POS types across just the narrative fiction, divided them up by year of publication, and then grouped together a few nearby years in which few or no books were included. In the end, there’s coverage from 1742 through 1905, with all years (or groups of years) containing at least 500,000 words by four or more authors and no group spanning more than five years. This is the same dataset from which I’ll construct some POS frequency vs. time graphs in a later post (where I’ll also link to the raw counts).
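For concreteness, here is a sketch of one way a consolidation with these thresholds could be done. It is not the procedure actually used for this dataset: the greedy left-to-right strategy, the `per_year` input shape, and the choice to set aside bins that can’t reach the thresholds are all assumptions, and the example numbers are invented.

```python
from dataclasses import dataclass, field


@dataclass
class YearBin:
    years: list = field(default_factory=list)
    words: int = 0
    authors: set = field(default_factory=set)

    def is_full(self, min_words, min_authors):
        return self.words >= min_words and len(self.authors) >= min_authors


def consolidate(per_year, min_words=500_000, min_authors=4, max_span=5):
    """Greedily merge consecutive years into bins.

    per_year maps year -> (word_count, set_of_author_names).
    Returns (bins, dropped): bins meeting both thresholds, and leftover
    bins that could not meet them within max_span years.
    """
    bins, dropped = [], []
    current = YearBin()
    for year in sorted(per_year):
        words, authors = per_year[year]
        # Set the open bin aside if adding this year would make it span too many years.
        if current.years and year - current.years[0] + 1 > max_span:
            dropped.append(current)
            current = YearBin()
        current.years.append(year)
        current.words += words
        current.authors |= set(authors)
        if current.is_full(min_words, min_authors):
            bins.append(current)
            current = YearBin()
    if current.years:
        dropped.append(current)
    return bins, dropped


# Invented data: two thin years that merge, one rich year, one isolated thin year.
demo = {
    1800: (300_000, {"a", "b"}),
    1801: (300_000, {"c", "d"}),
    1802: (600_000, {"a", "b", "c", "d"}),
    1810: (100_000, {"x"}),
}
bins, dropped = consolidate(demo)
print([b.years for b in bins], [b.years for b in dropped])
```

A greedy pass like this is order-dependent; merging leftovers backward into the previous bin would be a reasonable alternative when the span limit allows it.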

First, two cases that are easy to anticipate and serve as a kind of check that things aren’t too far off:

[Figure: Adjective frequency vs. noun frequency]

[Figure: Adverb frequency vs. verb frequency]

About what you’d expect: a decent positive correlation between the frequency of nouns or verbs and the frequency of words that modify them. Slightly weaker correlation in the adverb case, presumably because adverbs don’t always modify verbs.
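To put a number on “decent” or “slightly weaker,” one could compute a Pearson correlation coefficient over the per-bin frequencies. The plots here came from GGobi, not from this code; the sketch below is just a plain-Python version of the standard formula, and the two short series in the example are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented per-bin frequencies, for illustration only.
noun_freq = [0.20, 0.21, 0.19, 0.22]
adj_freq = [0.060, 0.064, 0.058, 0.066]
print(pearson_r(noun_freq, adj_freq))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate little or none, though with only a few dozen bins the uncertainty on any single coefficient is considerable.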

Then there’s an interesting case that I think I can explain, but wouldn’t have predicted:

[Figure: Noun frequency vs. verb frequency]

Noun and verb frequency are inversely correlated. This makes sense, I suppose, if you think of novels as tending toward portraiture or action (and for all I know it may be a well-known phenomenon). But I expected to see more nouns imply more verbs, since you’d need more things for those subjects and objects to do. In any case, I learned something here from my few minutes with GGobi.

Finally, one that leaves me at a loss:

[Figure: Adjective frequency vs. adverb frequency]

How can adjectives and adverbs be apparently uncorrelated? Shouldn’t there be flowery novels rich in both of them and plain ones rich in neither? I’ll investigate, but in the meantime I’d love to be told that this, too, is already accounted for.

Last note: GGobi is really nifty, even if it doesn’t produce beautiful figures out of the box (see above).

July 1, 2009

The Shakespeare Industry

Loosely apropos Ed Finn’s panel at DH on Pynchon, Matt Jockers and I were trying to guess the most-published-upon author in English. I figured Shakespeare, he suggested Joyce. This morning I ran a couple of quick queries on the MLA database and came up with the following:

              Shakespeare   Joyce
2008+                 716     151
2004+                3826     937
1999+                8159    2135
All (1923+)         35489    9315

There are some details to explain, but the take-away point is that Shakespeare seems to be the object of about four times more scholarship than Joyce.

The details: These are raw result counts for the subject queries “Shakespeare William” and “Joyce James,” both of which are defined subject headings in MLA. The counts are total matching items of all types (journal articles, refereed journal articles, books, chapters, and other) published from the listed year to the present. I didn’t make any attempt to distinguish major from minor works (e.g., books from articles), nor single-subject studies from multi-subject ones. This is obviously pretty non-rigorous, but it was good enough to satisfy my passing curiosity.
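Since the counts are cumulative (each row includes everything in the rows above it), it can help to look at the ratios both as given and for the non-overlapping slices between rows. A quick sketch using only the numbers in the table; the helper names are made up for this example.

```python
# (label, Shakespeare count, Joyce count), copied from the table above.
cumulative = [
    ("2008+", 716, 151),
    ("2004+", 3826, 937),
    ("1999+", 8159, 2135),
    ("All (1923+)", 35489, 9315),
]


def ratio_table(rows):
    """Shakespeare-to-Joyce ratio for each row."""
    return {label: s / j for label, s, j in rows}


def disjoint_windows(rows):
    """Subtract each narrower cumulative row from the next wider one."""
    out = [rows[0]]
    for (_, s_prev, j_prev), (label, s, j) in zip(rows, rows[1:]):
        out.append((f"{label} minus previous row", s - s_prev, j - j_prev))
    return out


for label, r in ratio_table(cumulative).items():
    print(f"{label:32s} {r:.2f}")
for label, r in ratio_table(disjoint_windows(cumulative)).items():
    print(f"{label:32s} {r:.2f}")
```

On these counts the cumulative ratios run from about 3.8 (all years) to about 4.7 (2008 onward), and the non-overlapping slices stay between roughly 3.6 and 4.0 apart from that most recent one, which rests on the smallest counts.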

This is interesting and at least a little unexpected to me. I figured Shakespeare would be in the lead, especially over the full history of criticism, but I thought things would be much closer, especially in recent years. I wonder if part of the gap might be explained by a higher likelihood of talking about Shakespeare in any given English renaissance context than about Joyce in any given modernist one?

April 16, 2009

A Pretty (?) Picture

This graph is absurd, I know, but it’s a hint of what I’ve been working on. Click for largeness. More to come soon.

[Figure: POS-Graph-Small.jpg]

April 7, 2009

The Formal Charge Against Lurie

A bit more on Disgrace. Talking things over with Liz Evans, she pointed out that the specific charge leveled against Lurie by Melanie Isaacs isn’t entirely clear; we’re never told anything beyond the fact that it involves an alleged breach of “article 3.1 of the university’s Code of Conduct,” which “addresses victimization or harassment of students by teachers” and is a subsection of article 3, concerning “victimization or harassment on grounds of race, ethnic group, religion, gender, sexual preference, or disability” (38-39). Nor do we see the content of Melanie’s statement to the committee (which statement Lurie claims not to have read, though it has been provided to him). [Footnote: There's also the technical charge of irregularity in grading and recordkeeping, but that is obviously a subsidiary matter, probably best understood as a gesture toward bureaucratic verisimilitude and an attempt to raise the probability of conviction by including a lesser but more easily proven allegation.] The members of the committee refer to the charges alternately as involving “harassment,” “abuse,” and “exploitation.” But it seems unlikely—this was Liz’s point—that they involve rape; if they did, it’s hard to imagine the committee entertaining the possibility that Lurie would retain his position at the university (which does appear to be the suggestion, provided he is willing to make a sincere apology and undergo counseling, etc.).

Why is this important? Because it’s part of the analogy between Melanie’s mistreatment and Lucy’s, hence of the structural and allegorical parallel between colonial violence and retributive justice. I hadn’t noticed this fact concerning Melanie’s accusation, but it adds another important way in which she resembles Lucy; they both present a legal claim to the authorities, but withhold from their accounts any mention of rape. This strengthens the parallel between the two women, and thus reinforces our obligation to make sense of the similarities and differences in the way they’re treated, in the ways they respond to that treatment, and in their respective social and historical positions.

April 7, 2009

Klaaste on Disgrace

I just received a copy (via ILL, on microfilm, from Johannesburg) of an opinion piece on Disgrace from the Sowetan (by Aggrey Klaaste, 3 April 2000, p. 9). It’s one that I had seen referenced in a couple of places, but had never before been able to read. Nothing of interest as literary criticism, but it’s a potentially useful fragment of documentation concerning the novel’s initial political reception in South Africa. If you’re interested, I’ve put up a marginal-quality PDF copy (what can I say, it’s a scan of a printout from microfilm).

Also: The fact that it took two months to get a hold of this article—and that there were many more that were simply unobtainable at a major U.S. research university with a diligent ILL department—illustrates part of the problem with doing politically and culturally informed work on Coetzee outside South Africa. It’s certainly not impossible, and I don’t mean to overplay the difficulty, but the primary sources are much trickier to track down than I expected them to be.

Also also: Microfilm apparently still exists.

April 4, 2009

Debt and Punishment

A bit more on the function of debt in Disgrace, especially with respect to the TRC.

I’ve been trying to figure out how punishment works as a form of compensation for the victims of ethical and legal wrongs (which are not the same thing). In an earlier post, I moved away from the idea that legal (i.e., state-sanctioned, tribunal-mediated) punishment was intended to provide a compensatory satisfaction to those who have been wronged. I think this is generally true, at least as a theoretical principle of modern law; it’s one of the reasons, for instance, that “victims’ rights” remains a marginal concept. (The other being, of course, that the state is now understood to intervene between perpetrator and victim, so that the victim isn’t a proper party to the exercise of legal justice.)

But what about institutions like the TRC, which I claimed in my ACLA paper do serve a significantly compensatory function? How so? Well, it’s partially that they provide something of value to the victim in a context where more direct compensation in the form of reparations, significant socioeconomic restructuring, etc. is unlikely. But there’s more to it than that, I think. The desire of victims to receive a “sincere” apology (illustrated at some length in Coetzee’s novel through both the reactions of Melanie’s family and David’s souring relationship with Petrus) is a desire to see the perpetrator suffer the pangs of conscience. The victim in such cases enjoys, takes satisfaction in, the perpetrator’s self-punishment, which is harsher than most of what can be otherwise imposed on the perpetrator (as evinced by the obvious inadequacy, to most of those involved, of Lurie merely losing his job absent any sincere expression of contrition). So there’s a sense in which legal punishment, even today, is intended to provide a kind of repayment to the victim, and this is especially true in the case of (pseudo) tribunals like the TRC that are intended to address large-scale historical wrongs.

Note, too, that all this is a theory of law and rights that’s presupposed by and enables the bad reading of Coetzee developed and critiqued in my paper, not something that I’m insisting is necessarily the basis of all contemporary law. In any case, though, Coetzee is in a way law-agnostic; his real analysis is of ethics or morality, which he is at pains to show work differently. That’s the point of Lurie’s encounter with the commission, which mixes law and ethics in a way that doesn’t work well for either one.

April 3, 2009

MorphAdorner Release

The first public release of MorphAdorner—version 0.9, released April 3, 2009—is now available. There’s full documentation, too. Congratulations and many thanks to Phil Burns – this is great news.

I discussed MorphAdorner as part of my series of posts on part-of-speech taggers a couple of months back, and will be using it for much of my upcoming work.

My understanding is that Phil intends to leave MorphAdorner mostly as-is for the time being, unless it’s taken up by another project; MONK has been funding current development, I think, and it (MONK) is winding down. Which reminds me: A public version of the MONK workbench, with a bevy of analytical tools and access to several thousand texts across four-plus centuries, should be available soon. Will post here when it’s up, though I’m not involved in making that happen.

March 31, 2009

Disgrace and Debt

A quick follow-up to this past weekend’s ACLA conference. My seminar was on literature and law; it was interesting and useful indeed, despite being a bit to the side of what I usually do. If you’re interested, I’ve posted a copy of my paper (PDF); a more formal treatment is in the works.

The talk argues that while a legal framework in which ethical wrongs are treated as analogous to economic debts is probably inevitable, Coetzee’s novel shows how this economic treatment is inadequate as a basis of moral action. Specifically, the economic analogy enables—maybe even requires—bad readings of Disgrace, ones in which Lurie becomes the true victim insofar as he pays out of all proportion for his offense against Melanie. If we drop the debt model, we can also avoid this problem.

Joey Slaughter, one of the co-conveners of the seminar, objected that contemporary understandings of the law are not in fact based on such an economic model, which struck him as “premodern” (of the eye-for-an-eye type). I see the point, which is similar to Foucault’s analysis in Discipline and Punish; we no longer think that the task of punishment is to provide a compensatory enjoyment for the victim, whether the victim in question is understood to be the directly harmed individual or the state/corporate body as a whole. All true, and I may have drifted perilously close to suggesting something along those lines. But my point had less to do with victims or with punishment as payment than it did with an “account balance” of sorts for the perpetrator. The idea—which I was suggesting underlies the principle of proportionality in sentencing and that Coetzee rejects as an adequate account of morality (but not necessarily of law)—is that punishments should deprive the perpetrator of any surplus or advantage accumulated through his offense (plus an additional deterrent amount, though I didn’t raise that point in the talk for lack of time).

The easier version is when the offense is straightforwardly economic, though even then it’s not dead simple. If I steal $10 from you, I’ll need to repay that amount, plus a deterrent amount, plus whatever we collectively deem appropriate for the inherent damage caused by a violation of the law (related, for instance, to the fact that we all feel less secure once we’ve experienced the fact of theft). It’s harder—and this is one of the novel’s points—when the violation in question is non- or supra-economic (as with rape, exploitation, etc.). But in either case, the idea isn’t to repay the victim by allowing her to enjoy the perpetrator’s suffering (which plainly doesn’t work, as the novel demonstrates at length), it’s to deprive the perpetrator of his illegitimately accrued advantage. What the proper balance should be is a tricky question, but it’s also what the law must do. Ethics, on the other hand (and this is my reading of Coetzee), doesn’t let you off even after you’ve paid a compensatory amount; there is nothing you can do to fix or to balance your sins, and no amount of your suffering offsets them. If you’ve been wronged in turn, you don’t break even at some point, ethically speaking—you just go on being wrong. True, you’ve now been wronged, too, but that’s of a different order; there’s no universal ethical as opposed to legal account to settle.

Depressing stuff, perhaps, but then Coetzee isn’t the author for joy. More on this to come at some point.

March 30, 2009

Contemporary U.S. Novel Syllabus

I’m finished with a draft of my Contemporary U.S. Novel syllabus for next semester. I’ve posted a copy of the full syllabus (PDF) and a little flier about the course (also PDF).

The primary texts, with dates of publication and page counts:

  • David Foster Wallace, Infinite Jest (1996, 1104 pp.)
  • Barbara Kingsolver, The Poisonwood Bible (1998, 576 pp.)
  • Colson Whitehead, John Henry Days (2001, 389 pp.)
  • Jonathan Safran Foer, Extremely Loud and Incredibly Close (2005, 368 pp.)
  • Junot Díaz, The Brief Wondrous Life of Oscar Wao (2007, 352 pp.)
  • Rivka Galchen, Atmospheric Disturbances (2008, 256 pp.)

There will be a handful of theoretical and critical readings as well (Jameson, Chow, Hayles, Hardt, Zadie Smith, others). The initial list of primary texts was about five times as long, but, well, semesters are short.

A post on Coetzee tomorrow, then back to DH/computational stuff for a while.