Adrien Barbaresi & Kay-Michael Würzner
BBAW
"Venn diagrams of lexical variables for quality assessment and corpus
comparison"
Given the "opportunistic" nature of web corpora, questions arise regarding their intrinsic coherence and quality. Naive approaches to web crawling and web texts may yield positive results when text quantity is more important than text quality, but they are bound to impede proper linguistic research. In fact, there are (corpus) linguists who advocate a meticulous selection and extraction of web texts, since size cannot necessarily compensate for a lack of quality. In our daily work, we find it convenient to visualize overlaps between corpora or specific corpus subsets using (proportional) Venn diagrams, which illustrate the relationships between a finite collection of sets. More specifically, they allow for a clear breakdown of the intersections between the type inventories (i.e. lexicons) of multiple text corpora. We exemplify this method by comparing four German corpora from the web, which raise different expectations in terms of genre, quality, and internal structure. We wish to give a reasonable picture of some of their differences and similarities. The first is a web version of a "traditional" newspaper corpus (Die ZEIT, years 2010 to 2013); the second a large collection of blog data split into posts and corresponding comments; the third a corpus of TV, movie, and computer game subtitles. All four corpora have been tokenized, morphologically analyzed, and part-of-speech tagged. The visualizations have been created using the Vennerable R package, which features proportional Venn diagrams for up to nine sets using the Chow-Ruskey algorithm. We believe that this kind of visualization can help to answer everyday questions regarding corpus adjustments as well as more general research questions such as the delimitation of web genres.

Stirling Chow and Frank Ruskey. Drawing area-proportional Venn and Euler diagrams. In Giuseppe Liotta, editor, Graph Drawing, volume 2912 of Lecture Notes in Computer Science, pages 466-477. Springer, Berlin, Heidelberg, 2004.
Jonathan Swinton. Vennerable, 2009.
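
The breakdown of type-inventory intersections described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' workflow, which relies on the Vennerable R package; it only computes the region sizes that a proportional Venn diagram would display and draws nothing. The function name venn_regions, the corpus labels, and the toy word lists are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def venn_regions(inventories):
    """Map each non-empty combination of corpus names to the set of
    types that occur in exactly those corpora and in no others.

    `inventories` is a dict from corpus name to a set of types.
    """
    names = sorted(inventories)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Types shared by every corpus in the combination ...
            inside = set.intersection(*(inventories[n] for n in combo))
            # ... minus types that also occur in any remaining corpus.
            others = [inventories[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions[combo] = inside - outside
    return regions


# Hypothetical toy inventories, not data from the corpora in the abstract.
toy = {
    "news": {"der", "und", "Regierung", "Haus"},
    "blogs": {"der", "und", "Haus", "lol"},
    "subtitles": {"der", "und", "lol", "hey"},
}

regions = venn_regions(toy)
for combo, types in regions.items():
    print(" & ".join(combo), len(types))
```

Because the regions are mutually exclusive, their sizes add up to the size of the union of all inventories, which is the property an area-proportional diagram depends on.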