Stylometry is an application of the study of linguistic style, usually to written texts, but it has also been applied successfully to music[1] and to fine art (painting)[2] as well[3]. According to another view, it can be defined as a linguistic discipline that applies statistical analysis to literary texts, evaluating authors' styles through various quantitative features.[4]
Stylometry is often used to attribute authorship to works of unknown or disputed authorship.[5] It has legal as well as academic and literary applications, ranging from the Shakespeare authorship studies to forensic linguistics.
Stylometry grew out of earlier methods of textual analysis that aimed to establish authenticity, authorial identity, and similar questions.
The modern practice of the discipline received a major impetus from research into authorship problems in English Renaissance drama. Researchers and readers noticed that certain works of the period show clearly recognisable linguistic habits, and they tried to use these to identify the authors of anonymous or collaborative works. The earliest attempts were not always successful: in 1901, one researcher tried to use one of John Fletcher's linguistic preferences, the contracted form "'em" in place of "them", to separate Fletcher's share of the collaborations from the parts written by Philip Massinger, but he mistakenly used an edition in which the editor had expanded all of Massinger's "'em" forms to "them".[6]
The foundations of stylometry were laid by the Polish philosopher Wincenty Lutosławski in Principes de stylométrie (1890). Lutosławski used his method to establish a chronology of Plato's dialogues.[7]
The development of computers and of their capacity to process large quantities of data has improved the success of this kind of work by orders of magnitude. The great data-analysis capacity of computers does not, however, guarantee results of good quality. In the early 1960s, Rev. A. Q. Morton used computer analysis to conclude that the fourteen New Testament epistles attributed to St Paul came from six different authors. A check of his method, applied to the works of James Joyce, gave the result that Ulysses, Joyce's multi-perspective, multi-style masterpiece, was written by five different people, none of whom apparently took part in creating Joyce's first novel, A Portrait of the Artist as a Young Man.[8]
In time, however, researchers refined their methods and approaches to obtain better results. One successful early study was the resolution, by Frederick Mosteller and David Wallace, of the disputed authorship of twelve of The Federalist Papers.[9] Although questions about the initial assumptions and the methodology still arise (and perhaps always will), few now dispute the basic premise that linguistic analysis of written texts can yield valuable information and insight. (This was already apparent before the advent of computers: the successful application of a textual/linguistic approach to the Fletcher canon by Cyrus Hoy and others produced clear results in the late 1950s and early 1960s.)
Applications of stylometry include literary studies, historical studies, social studies, gender studies, and many forensic cases and studies.[10][11] It can also be applied to computer code.[12] Stylometry can also be used to predict whether someone is a native or non-native English speaker through their typing speed.[13]
Stylometry as a method is vulnerable to the distortion of text during revision.[14] There is also the case of an author adopting different styles over the course of a career, as was demonstrated in the case of Plato, who chose different stylistic policies, such as those adopted for the early and middle dialogues addressing the Socratic problem.[15]
Current research
Modern stylometry draws heavily on the aid of computers for statistical analysis, artificial intelligence and access to the growing corpus of texts available via the Internet.[16] Software systems such as Signature[17] (freeware produced by Dr Peter Millican of Oxford University), JGAAP[18] (the Java Graphical Authorship Attribution Program, freeware produced by Dr Patrick Juola of Duquesne University), stylo[19][20] (an open-source R package for a variety of stylometric analyses, including authorship attribution, developed by Maciej Eder, Jan Rybicki and Mike Kestemont) and Stylene[21] for Dutch (online freeware by Prof Walter Daelemans of the University of Antwerp and Dr Véronique Hoste of the University of Ghent) make its use increasingly practicable, even for the non-expert.
Academic venues and events
Stylometric methods are discussed in several academic fields, mostly as a tangential field of application as with machine learning, natural language processing, and lexicography.
Forensic linguistics
The International Association of Forensic Linguists (IAFL) organises the Biennial Conference of the International Association of Forensic Linguists (13th edition in 2016 in Porto) and publishes The International Journal of Speech, Language and the Law with forensic stylistics as one of its central topics.
The Association for the Advancement of Artificial Intelligence (AAAI) has hosted several events on subjective and stylistic analysis of text.[22][23][24]
PAN workshops (originally "plagiarism analysis, authorship identification, and near-duplicate detection", later more generally a workshop on "uncovering plagiarism, authorship, and social software misuse") have been organised since 2007, mainly in conjunction with information access conferences such as ACM SIGIR, FIRE, and CLEF. PAN formulates shared challenge tasks for plagiarism detection,[25] authorship identification,[26] author gender identification,[27] author profiling,[28] vandalism detection,[29] and other related text analysis tasks, many of which hinge on stylometry.
Case studies of interest
- Around 1370 to 1070 BC, as recorded in the Book of Judges, one tribe identified members of another tribe in order to kill them by asking them to say the word Shibboleth, which in the dialect of the intended victims sounded like "sibboleth".[30]
- In 1439, Lorenzo Valla showed that the Donation of Constantine was a forgery, an argument based partly on a comparison of the Latin with that used in authentic 4th-century documents.
- In 1952, the Swedish priest Dick Helander was elected bishop of Strängnäs. The campaign was competitive, and Helander was accused of writing a series of a hundred or so anonymous libelous letters about other candidates to the electorate of the bishopric of Strängnäs. Helander was first convicted of writing the letters and lost his position as bishop, but was later partially exonerated. The letters were studied using a number of stylometric measures (and also typewriter characteristics), and the various court cases and further examinations, many contracted by Helander himself during the years up to his death in 1978, discussed stylometric methodology and its value as evidence in some detail.[31][32]
- In 1975, after Ronald Reagan had served as governor of California, he began giving weekly radio commentaries syndicated to hundreds of stations. After his personal notes were made public on his 90th birthday in 2001, a study to determine which of those talks were written by him and which were written by various aides used stylostatistical methods.[33]
- In 1996, the stylometric analysis of the controversial, pseudonymously authored book Primary Colors, performed by Vassar College professor Donald Foster,[34] brought the field to the attention of a wider audience after correctly identifying the author as Joe Klein. (This case was only resolved after a handwriting analysis confirmed the authorship.)
- In 1996, stylometric methods were used to compare the Unabomber manifesto with letters written by one of the suspects, Theodore Kaczynski, to his brother, which led to his apprehension and later conviction.[35]
- In April 2015, researchers using stylometry techniques identified a play, Double Falsehood, as being the work of William Shakespeare.[36] Researchers analyzed 54 plays by Shakespeare and John Fletcher and compared average sentence length, studied the use of unusual words and quantified the complexity and psychological valence of its language.
- In 2016, MacDonald P. Jackson, Emeritus Professor of English at the University of Auckland, New Zealand and a Fellow of the Royal Society of New Zealand, who had spent his entire academic career analyzing authorship attribution, wrote a book titled Who Wrote "the Night Before Christmas"?: Analyzing the Clement Clarke Moore Vs. Henry Livingston Question,[20] in which he evaluates the opposing arguments and, for the first time, uses the author-attribution techniques of modern computational stylistics to examine the long-standing controversy. Jackson employs a range of tests and introduces a new one, statistical analysis of phonemes; he concludes that Livingston is the true author of the classic work.
- In 2017, Simon Fuller and James O'Sullivan published a study claiming that bestselling author James Patterson does not do any writing in his co-authored novels.[37][38][39] According to O'Sullivan, his collaboration with former U.S. president Bill Clinton, The President is Missing, is an exception to this rule.[40]
- In 2017, a group of linguists, computer scientists, and scholars analysed the authorship of Elena Ferrante. Based on a corpus created at University of Padua containing 150 novels written by 40 authors, they analyzed Ferrante's style based on seven of her novels. They were able to compare her writing style with 39 other novelists using, for example, stylo.[19] The conclusion was the same for all of them: Domenico Starnone is the secret hand behind Elena Ferrante.[41]
- In 2018, Mark Glickman, senior lecturer in statistics at Harvard University, working with Ryan Song, a former statistics student at Harvard, and Jason Brown, a professor at Dalhousie University in Nova Scotia, applied stylometry to find that, most likely, The Beatles' song "In My Life" was composed by John Lennon, but with a 50% chance that Paul McCartney wrote the middle eight.[42]
- In 2019, the ETSO project (Stylometry applied to the Spanish Golden Age Theater), led by Álvaro Cuéllar González and Germán Vega García-Luengos (University of Valladolid), gathered more than 1200 plays of the Spanish Golden Age. After stylometric analysis was applied, the attribution of Mujeres y criados to Lope de Vega[43][44] was ratified, and an authorship problem was detected in La monja alférez, a play attributed to Pérez de Montalbán which, thanks to these analyses and to historical and philological research, was eventually attributed to Juan Ruiz de Alarcón.[45][46][47][48]
- In 2020, Rachel McCarthy and James O'Sullivan proved that Emily Brontë is the true author of Wuthering Heights, ending speculation by some critics that the novel might have been written by one of her siblings, specifically either Branwell or Charlotte.[49]
Data and methods
Since stylometry has both descriptive use cases, which characterise the content of a collection, and identificatory use cases, e.g. identifying authors or categories of texts, the methods used to analyse the data and features range from those built to classify items into sets to those that distribute items in a space of feature variation. Most methods are statistical in nature, such as cluster analysis and discriminant analysis; they are typically based on philological data and features, and are fruitful application domains for modern machine learning approaches.
Whereas in the past, stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech. Most systems are based on lexical statistics, i.e. using the frequencies of words and terms in the text to characterise the text (or its author). In this context, unlike in information retrieval, the observed occurrence patterns of the most common words are more interesting than the topical terms which are less frequent.[50][51]
The primary stylometric method is the writer invariant: a property held in common by all texts written by a given author, or at least by all texts long enough for analysis to yield statistically significant results. An example of a writer invariant is the frequency of function words used by the writer.
In one such method, the text is analyzed to find the 50 most common words. The text is then broken into 5,000-word chunks, and each chunk is analyzed to find the frequency of those 50 words in that chunk. This generates a 50-number identifier for each chunk. These numbers place each chunk of text at a point in a 50-dimensional space. This 50-dimensional space is flattened into a plane using principal components analysis (PCA). This results in a display of points that correspond to an author's style. If two literary works are placed on the same plane, the resulting pattern may show whether both works were by the same author or by different authors.
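The chunk-and-project procedure described above can be sketched roughly as follows. This is a minimal illustration and not a reference implementation: the function names `tokenize`, `chunk_profiles` and `pca_2d` are hypothetical, the tokenizer is a deliberately crude assumption that only handles unaccented Latin letters, and PCA is computed here with a plain NumPy singular value decomposition rather than a dedicated statistics package.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude, ASCII only)."""
    return re.findall(r"[a-z']+", text.lower())


def chunk_profiles(text, n_words=50, chunk_size=5000):
    """Return (vocab, matrix): one row per full chunk, one column per word.

    vocab holds the n_words most common words of the whole text; each
    matrix entry is the relative frequency of that word in that chunk.
    A trailing partial chunk is dropped (an assumption of this sketch).
    """
    tokens = tokenize(text)
    vocab = [word for word, _ in Counter(tokens).most_common(n_words)]
    rows = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        counts = Counter(chunk)
        rows.append([counts[word] / chunk_size for word in vocab])
    return vocab, np.array(rows)


def pca_2d(matrix):
    """Project the rows of matrix onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Calling `pca_2d(chunk_profiles(text)[1])` gives one 2-D point per chunk; plotting the points for two works on the same axes gives the kind of display the paragraph describes. To compare two works, the vocabulary would normally be fixed once (for example from the combined text) so that both works share the same columns.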
1. Gaussian statistics
Stylometric data are distributed according to the Zipf-Mandelbrot law. The distribution is extremely spiky and leptokurtic, which is why researchers have tended to turn away from classical statistics when solving, for example, authorship attribution problems. Nevertheless, Gaussian statistics can be used after applying a data transformation.[52]
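For reference, the Zipf-Mandelbrot law is commonly written in the form below, where $r$ is the frequency rank of a word and $a$ and $b$ are parameters fitted to the corpus. The symbol names are chosen here for illustration and are not taken from the cited source.

```latex
f(r) \propto \frac{1}{(r + b)^{a}}
```

With $b = 0$ this reduces to the ordinary Zipf law.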
2. Neural networks
Neural networks, a special case of statistical machine learning methods, have been used to analyze authorship of texts. Texts of undisputed authorship are used to train the neural network through processes such as backpropagation, where training error is calculated and used to update the process to increase accuracy. Through a process akin to non-linear regression, the network gains the ability to generalize its recognition ability to new texts to which it has not yet been exposed, classifying them to a stated degree of confidence. Such techniques were applied to the long-standing claims of collaboration of Shakespeare with his contemporaries Fletcher and Christopher Marlowe,[53][54] and confirmed the view, based on more conventional scholarship, that such collaboration had indeed taken place.
A 1999 study showed that a neural network program reached 70% accuracy in determining authorship of poems it had not yet analyzed. This study from Vrije Universiteit examined identification of poems by three Dutch authors using only letter sequences such as "den".[55]
Another study used deep belief networks (DBN) to build an authorship verification model applicable to continuous authentication (CA).[56]
One problem with this method of analysis is that the network can become biased based on its training set, possibly selecting authors the network has more often analyzed.[55]
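The letter-sequence features mentioned above (such as "den") can be illustrated with a short sketch. It covers only feature extraction, not the neural network itself, and it is not the method of the cited study: the function name `letter_ngram_frequencies` and the choice to count sequences only inside words are assumptions made for this example.

```python
import re
from collections import Counter


def letter_ngram_frequencies(text, n=3):
    """Relative frequencies of n-letter sequences found inside words."""
    counts = Counter()
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}
```

A vector of such frequencies, taken over a fixed list of sequences, is the kind of input a classifier could be trained on.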
3. Genetic algorithms
The genetic algorithm is another machine learning technique used in stylometry. This involves a method that starts out with a set of rules. An example rule might be, "If but appears more than 1.7 times in every thousand words, then the text is author X". The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts and each rule is given a fitness score. The 50 rules with the lowest scores are thrown out. The remaining 50 rules are given small changes and 50 new rules are introduced. This is repeated until the evolved rules correctly attribute the texts.
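The loop described above can be sketched as follows, under several assumptions that are not in the source: a rule is the hypothetical triple (word, threshold, author), thresholds are drawn from 0 to 100 occurrences per thousand words, a "small change" is Gaussian noise on the threshold, and fitness is the fraction of known texts on which the rule fires exactly when the text is by its author. All function names are illustrative.

```python
import random


def rate_per_thousand(tokens, word):
    """Occurrences of word per thousand tokens."""
    return 1000.0 * tokens.count(word) / max(len(tokens), 1)


def random_rule(rng, words, authors):
    """Rule: if word occurs more than threshold times per 1000, say author."""
    return (rng.choice(words), rng.uniform(0.0, 100.0), rng.choice(authors))


def fitness(rule, samples):
    """Fraction of (tokens, author) samples the rule judges correctly."""
    word, threshold, author = rule
    correct = 0
    for tokens, true_author in samples:
        fires = rate_per_thousand(tokens, word) > threshold
        correct += fires == (true_author == author)
    return correct / len(samples)


def mutate(rule, rng, scale=5.0):
    """Apply a small random change to the rule's threshold."""
    word, threshold, author = rule
    return (word, max(0.0, threshold + rng.gauss(0.0, scale)), author)


def evolve(samples, words, authors, pop_size=100, max_generations=50, seed=0):
    """Evolve rules; return the best rule seen and its fitness."""
    rng = random.Random(seed)
    population = [random_rule(rng, words, authors) for _ in range(pop_size)]
    half = pop_size // 2
    best_rule, best_score = None, -1.0
    for _ in range(max_generations):
        ranked = sorted(population, key=lambda r: fitness(r, samples),
                        reverse=True)
        top_score = fitness(ranked[0], samples)
        if top_score > best_score:
            best_rule, best_score = ranked[0], top_score
        if best_score == 1.0:
            break  # a rule now attributes every known text correctly
        # Drop the lower half, perturb the survivors, add fresh rules.
        survivors = ranked[:half]
        population = [mutate(r, rng) for r in survivors] + [
            random_rule(rng, words, authors) for _ in range(pop_size - half)
        ]
    return best_rule, best_score
```

With the default `pop_size=100`, each generation discards 50 rules, perturbs the remaining 50 and introduces 50 new ones, matching the numbers in the description. A realistic system would evolve sets of rules over many words and score them on held-out texts; this sketch evolves single rules only.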
4. Rare pairs
One method for identifying style is called "rare pairs", and relies upon individual habits of collocation. The use of certain words may, for a particular author, idiosyncratically entail the use of other, predictable words.
Authorship attribution in instant messaging
The spread of the Internet has shifted the attention of authorship attribution towards online texts (web pages, blogs, etc.), electronic messages (e-mails, tweets, posts, etc.), and other types of written information that are far shorter than an average book, much less formal and more diverse in terms of expressive elements such as colors, layout, fonts, graphics, emoticons, etc. Efforts to take such aspects into account at the level of both structure and syntax have been reported.[57] In addition, content-specific and idiosyncratic cues (e.g., topic models and grammar checking tools) were introduced to unveil deliberate stylistic choices.[58]
Standard stylometric features have been employed to categorize the content of a chat over instant messaging,[59] or the behavior of the participants,[60] but attempts to identify chat participants are still few and at an early stage. Furthermore, the similarity between spoken conversations and chat interactions has been neglected, even though it is a key difference between chat data and any other type of written information.
See also
See also the academic journal Literary and Linguistic Computing (published by the University of Oxford) and the Language Resources and Evaluation journal.
External links
