Electronic resources for Forensic Linguistics (Friday 3:15-4:45)

  • Carole Chaski, Justice Department, Washington, DC

An electronic parsing system for document authentication

Document authentication can play an important investigative role in many types of crime, from threatening letters and patent fraud to homicide. As one means of authenticating documents, language-based author identification has been developing over the last twenty years, including Morton's (1990, 1991a, 1991b) "stylometry," McMenamin's (1993) "forensic stylistics," and other techniques such as type-token ratio and prescriptive error analysis. Previous language-based author identification methods either fail to classify documents correctly when tested independently (Totty et al. 1987; Hardcastle 1993; Chaski 1997), use statistics inappropriately (Herdan 1964; Sanford et al. 1994; Hilton and Holmes 1993), are not explicated well enough to be replicable (Tiersma 1993; Finegan 1990; Crystal 1995; Goutsos 1995), or lack a theoretical grounding in linguistics (Crystal 1995; Chaski 1996).
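
For readers unfamiliar with the type-token ratio mentioned above, the following is a minimal sketch of how it is usually computed: the number of distinct word forms (types) divided by the total number of words (tokens). The function name and the tokenisation rule (lowercased runs of letters and apostrophes) are assumptions made for illustration; they are not taken from any of the cited studies.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by total words (tokens)."""
    # Assumption: a token is a lowercased run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


# Hypothetical sample text, invented for this example.
sample = "the letter said the money was in the bag"
print(round(type_token_ratio(sample), 3))  # 7 types / 9 tokens -> 0.778
```

Because the ratio falls as a text gets longer, values are only comparable across samples of similar length, which is one reason a single figure of this kind is a weak basis for attributing authorship.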

During my fellowship at the National Institute of Justice (the research branch of the US Department of Justice), I have been developing an Electronic Parsing Authentication System in order to test previous methods and to develop a parsing-based method which meets the Daubert criteria for admitting scientific evidence in US courts.

  • The Writing Sample Database

In 1995, I collected a database of writing samples. The database was designed to take into account both statistical sampling issues and linguistic performance. The database currently includes 92 subjects of similar dialect and educational levels, representing both genders and several ethnicities. Writing tasks were designed to evoke specific text-types similar to those encountered in actual questioned document cases and to evoke different registers. Some subjects contributed as few as 50 words, while others produced several thousand. The database currently contains over 100,000 words. The Writing Sample Database will be available in electronic format to other researchers through the National Criminal Justice Reference Service at the conclusion of my fellowship.

  • The EPAS

The Electronic Parsing Authentication System has been developed within Filemaker Pro 3.0 as a standalone application. It can be accessed on both Macintosh and Windows platforms; its data can be ported to spreadsheet and statistical applications. The database is relational, and contains segment databases for different linguistic levels (morpheme, phrase, clause, sentence and discourse) whose analytical results can be accessed by other segment databases as needed. The programming language contained in Filemaker Pro 3.0 is sophisticated enough to allow me to write programs which perform parsing routines while still permitting human intervention. Natural language parsing has not yet reached the 100% accuracy required in forensic investigation, so human intervention permits the highest degree of accuracy. Finally, because the parsing is mostly conducted by the computer system, the typical sources of human error (fatigue, boredom, repetition bias, etc.) are avoided.

  • The Search for Idiolectal Markers

Currently, I am using EPAS to search for markers which are able both to discriminate between and to cluster together documents authored by a small subset of the subjects in the Writing Sample Database. Linguistic structures which allow for optionality are the first class I am investigating for idiolectal markers. I have been conducting pilot studies of various markers, and these demonstrate that a parsing methodology has potential for finally providing a scientific method of author identification.

  • A. R. Gray, P. J. Sallis, and S. G. MacDonell, University of Otago, New Zealand

Software Forensics: Extending Authorship Analysis to Computer Programs

Computer programs are written in source code that can be treated as a form of language. While source code is much more formal than spoken or written language, it still contains considerable information that provides evidence of authorship characteristics. In fact, much of the work already carried out in computational linguistics for text corpus authorship analysis has parallels for source code. Similarly, many of the techniques used in forensic linguistics for determining the authorship of written documents also apply to computer programs. In this way software forensics, when determining the author of malicious software, can be seen as a new and exciting area of forensics.

Measurements, including some from conventional software metrics, can be extracted from source code based on such attributes as the types of data structures used, the control flow of the program, comments in the code, and names used within the program. These metrics are often similar to stylistic tests used in computational linguistics. While not part of source code analysis, some environmental measurements can also be extracted from executable code such as the hardware platform and the compiler employed for its production. These data can also be used to determine authorship origins.
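
As a rough illustration of the kind of measurement described above, the sketch below extracts three simple stylistic figures from a piece of source code: comment density, mean identifier length, and the share of identifiers containing an underscore. The choice of Python as the language being analysed, the function name `style_metrics`, and the three metrics themselves are assumptions made for this example; the abstract does not specify which metrics the authors use.

```python
import io
import keyword
import tokenize


def style_metrics(source: str) -> dict:
    """Compute a few illustrative style measurements for Python source."""
    comments = 0
    names = []
    # The standard-library tokenizer separates comments from identifiers.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments += 1
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            names.append(tok.string)  # identifiers only, keywords excluded
    code_lines = [ln for ln in source.splitlines() if ln.strip()]
    n_lines = len(code_lines) or 1  # guard against empty input
    n_names = len(names) or 1
    return {
        "comments_per_line": comments / n_lines,
        "mean_name_length": sum(len(n) for n in names) / n_names,
        "underscore_names": sum("_" in n for n in names) / n_names,
    }


# Hypothetical three-line program, invented for this example.
sample = "# add two numbers\ndef add_two(a, b):\n    return a + b  # sum\n"
metrics = style_metrics(sample)
```

Figures like these could be collected for programs of known authorship and compared with those of a disputed program, in the same way that stylistic counts are compared for written documents.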

Programmers tend to have coding styles that are fairly distinct, and often recognisable to their colleagues, but the issue of how well this can be hidden, or mimicked, is also of obvious importance when ascribing authorship to an individual. There is a question of whether or not there is sufficient information available using these combined techniques to provide adequate authorship evidence for use within a legal context. In other words, can authorship identification or characterisation be performed at levels of sufficient certainty for the results to be presented as legal argument?

Source code authorship analysis could be, and in some cases has been, used for a number of tasks including software forensics as mentioned above, the detection of plagiarism in educational settings, and the resolution of disputed claims of code authorship. Each of these application areas requires a specialised approach.

While all three areas will be mentioned, the main focus for this paper is on software forensics. The association of characteristics with source code style can be used to suggest the gender, background, or some personality characteristics of the author, as well as authorship discrimination and matching. The use of source code measurements to determine the author of malicious code is made much more difficult for code that has already been compiled into an executable format.

In this paper the authors look at how authorship analysis for text can be extended to source code, working from a basis in computational linguistics and forensics. They also consider some of the special issues of source code analysis, distinguish between the different application areas for such analysis, and discuss the legal implications of using such techniques for some areas, with a focus on the software forensics application.

  • David G. Hale, Olin Corporation, and Bethany K. Dumas, University of Tennessee

Electronic resources for forensic linguistics: creating a web journal

This paper will give a detailed history of "Language in the Judicial Process" (LJP) <http://ljp.la.utk.edu>, an electronic publication created for use by academics, attorneys, and other professionals interested in language and the law, and suggest ways in which such low-cost electronic resources can serve the interests of linguists, lawyers, litigants, students, and computer scientists.

The paper provides a comprehensive framework of requirements and considerations for building a web-based journal.

Topics covered include:

  • Why the web?

  • preparation checklist

  • guiding principles

  • hardware requirements

  • tools used (software, etc.)

  • quagmires & considerations