Google's Ngram Viewer Goes Wild

With the addition of wildcard search-term capabilities, Google's fabulous language-analysis tool gets even more powerful.

It's been nearly three years since Google rolled out its Ngram Viewer, allowing armchair historians to plot the trajectories of words and phrases over time based on an enormous corpus of data extracted from the Google Books digitization project. Since then, there have been numerous studies seeking to glean some cultural significance from the graphs of falling and rising word usage. And the graphs themselves have inspired imitators: Recently, the engineering team behind Rap Genius introduced Ngram-style graphing of historical word frequency in rap lyrics, and, more bizarrely, New York Times wedding announcements. (You can even compare the hip-hop and matrimonial datasets.)
As the Ngram model extends its influence, Google continues to tinker, making improvements to the Ngram Viewer's already slick interface. Last year saw a major upgrade, with a sizable increase in the underlying data spanning English and seven other languages, as well as the introduction of part-of-speech tagging and mathematical operators that allowed for more sophisticated searches. Today, meet Ngram Viewer 3.0. While the corpus itself hasn't expanded in this version, the search features have become even more useful, especially now that wildcards are in the mix.
Anyone who has spent time delving into databases knows how much flexibility you can get with wildcards: use an asterisk to stand in for any word, and suddenly your search horizons have expanded. In the new Ngram Viewer, using the asterisk as a wildcard will display the top ten most frequently appearing words that fill the slot over the range of time you have selected. The asterisk can be combined with parts of speech, too, so "*_NOUN" will find only the nouns that could appear in the sequence of words you're searching on.
Now if you type "*_NOUN 's theorem" into the Ngram Viewer, you will see a graph with the ten most common names (which count as nouns) that have spawned eponymous theorems: names like Gödel, Bayes, and Euler. (Right-clicking will toggle back and forth between a view tracking the different variants and one showing a single line encompassing all the variants.)
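
For readers curious about what a wildcard search is doing under the hood, here is a rough Python sketch of the idea: take a phrase with one asterisk slot, add up how often each candidate word fills that slot within the chosen year range, and keep the top ten. This is only an illustration, not Google's implementation. The function name top_fillers is hypothetical, the counts in TOY_COUNTS are invented placeholder numbers rather than real Ngram data, and the sketch ranks by raw counts and ignores part-of-speech tags for simplicity.

```python
from collections import Counter

# Toy data: (year, phrase-as-tuple) -> count.
# These numbers are invented for illustration; they are NOT real Ngram counts.
TOY_COUNTS = {
    (1950, ("ragtag", "army")): 5,
    (1980, ("ragtag", "army")): 7,
    (1980, ("ragtag", "group")): 4,
    (1990, ("ragtag", "band")): 3,
    (2005, ("ragtag", "band")): 20,
    (1980, ("media", "mogul")): 9,
    (1980, ("movie", "mogul")): 6,
}


def top_fillers(counts, pattern, start_year, end_year, limit=10):
    """Rank the words that fill the single '*' slot in `pattern`.

    `pattern` is a tuple of words containing exactly one '*'.
    Only entries whose year falls in [start_year, end_year] are counted.
    Returns a list of (word, total_count) pairs, most frequent first.
    """
    slot = pattern.index("*")  # position of the wildcard in the phrase
    totals = Counter()
    for (year, ngram), n in counts.items():
        if not (start_year <= year <= end_year):
            continue  # outside the selected year range
        if len(ngram) != len(pattern):
            continue  # different phrase length, cannot match
        # Every non-wildcard position must match the pattern exactly.
        if all(p == "*" or p == w for p, w in zip(pattern, ngram)):
            totals[ngram[slot]] += n
    return totals.most_common(limit)


# Which words follow "ragtag" in the toy data between 1900 and 2000?
print(top_fillers(TOY_COUNTS, ("ragtag", "*"), 1900, 2000))
# Which words precede "mogul" in the same range?
print(top_fillers(TOY_COUNTS, ("*", "mogul"), 1900, 2000))
```

Changing the year range changes the ranking, which is the point of the year selector: widening the toy search to 1900-2010 pulls in the 2005 entry and moves "band" to the top.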

When the Google project team (Jon Orwant, Slav Petrov, and Dipanjan Das) gave me a sneak peek at the new version of the Ngram Viewer, I had no shortage of wildcard searches to test out. On Twitter, I've fielded questions like "Besides media moguls, what other moguls are there?" and "What can be ragtag other than a bunch?" It's possible to answer these questions using the publicly available corpora compiled by Mark Davies at Brigham Young University, but the peculiar interface can be off-putting to casual users. With the Ngram Viewer, you just need to enter a search like "*_NOUN mogul" or "ragtag *_NOUN" and select a year range. It turns out that in 20th-century sources, media moguls are joined by movie moguls, real estate moguls, and Hollywood moguls, while the most likely things to be ragtag are armies, groups, and bands.



By Ben Zimmer
Source and more:
http://www.theatlantic.com/technology/archive/2013/10/googles-ngram-viewer-goes-wild/280601/
