|
Electronic resources for Forensic Linguistics (Friday
3:15-4:45)
An electronic parsing system for document authentication
Document authentication can play an important investigative role in many types of crime, from threatening letters and patent fraud to homicide. As one means of authenticating documents, language-based author identification has been developing over the last twenty years, including Morton's (1990, 1991a, 1991b) "stylometry," McMenamin's (1993) "forensic stylistics," and other techniques such as type-token ratio and prescriptive error analysis. Previous language-based author identification methods either fail to classify documents correctly when tested independently (Totty et al. 1987; Hardcastle 1993; Chaski 1997), use statistics inappropriately (Herdan 1964; Sanford et al. 1994; Hilton and Holmes 1993), are not explicated well enough to be replicable (Tiersma 1993; Finegan 1990; Crystal 1995; Goutsos 1995), or lack a theoretical grounding in linguistics (Crystal 1995; Chaski 1996).
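As an illustration of one of the earlier techniques named above, the type-token ratio can be sketched as follows. This is a minimal sketch, not the procedure used in any of the cited studies: the function name `type_token_ratio` and the tokenisation rule (lowercased runs of letters and apostrophes) are assumptions made for the example.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by total words (tokens)."""
    # Assumption: a token is a lowercased run of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample text: 11 tokens, 8 distinct types.
sample = "the cat sat on the mat and the dog sat too"
ratio = type_token_ratio(sample)
```

The ratio tends to fall as a text gets longer, because common words repeat, so values from documents of very different lengths are not directly comparable.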
During my fellowship at the National Institute of Justice (the
research branch of the US Department of Justice), I have been developing
an Electronic Parsing Authentication System (EPAS) in order to test
previous methods and to develop a parsing-based method which meets the
Daubert criteria for admitting scientific evidence in US courts.
The Writing Sample Database. In 1995, I collected a database of writing
samples. The database was designed to take into account both statistical
sampling issues and linguistic performance. The database currently
includes 92 subjects of similar dialect and educational levels,
representing both genders and several ethnicities. Writing tasks were
designed to evoke specific text-types similar to those encountered in
actual questioned document cases and to evoke different registers. Some
subjects contributed as few as 50 words, while others produced several
thousand. The database currently contains over 100,000 words. The
Writing Sample Database will be available in electronic format to other
researchers through the National Criminal Justice Reference Service at
the conclusion of my fellowship.
The Electronic Parsing Authentication System has been developed within
Filemaker Pro 3.0 as a standalone application. It can be accessed on both
Macintosh and Windows platforms; its data can be ported to spreadsheet
and statistical applications. The database is relational, and contains
segment databases for different linguistic levels (morpheme, phrase,
clause, sentence and discourse) whose analytical results can be accessed
by other segment databases as needed. The programming language contained
in Filemaker Pro 3.0 is sophisticated enough to allow me to write
programs which perform parsing routines while still permitting human
intervention. Natural language parsing has not yet reached the 100%
accuracy required in forensic investigation, so human intervention
permits the highest degree of accuracy. Finally, because the parsing is
mostly conducted by the computer system, the typical sources of human
error (fatigue, boredom, repetition bias, etc.) are largely avoided.
Currently, I am using EPAS to search for markers which are able both to
discriminate between and to cluster together documents authored by a
small subset of the subjects in the Writing Sample Database. Linguistic
structures which allow for optionality are the first class I am
investigating for idiolectal markers. I have been conducting pilot
studies of various markers which demonstrate that a parsing methodology
has potential for finally providing a scientific method of author
identification.
-
A. R. Gray, P. J. Sallis, and S. G. MacDonell, University of Otago, New Zealand
Software Forensics: Extending Authorship Analysis to Computer Programs
Computer programs are written in source code that can be treated as a
form of language. While source code is much more formal than spoken or
written language, it still contains considerable information that
provides evidence of authorship characteristics. In fact, much of the work
already carried out in computational linguistics for text corpus
authorship analysis has parallels for source code. Similarly, many of the
techniques used in forensic linguistics for determining the authorship of
written documents also apply to computer programs. In this way software
forensics, when determining the author of malicious software, can be seen as a
new and exciting area of forensics.
Measurements, including some from conventional software metrics, can be
extracted from source code based on such attributes as the types of data
structures used, the control flow of the program, comments in the code,
and names used within the program. These metrics are often similar to
stylistic tests used in computational linguistics. While not part of
source code analysis, some environmental measurements, such as the
hardware platform and the compiler employed for its production, can
also be extracted from executable code. These data can also be used to
determine authorship origins.
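The kinds of measurement described above (comments, names, control flow) can be sketched for a single source file as follows. This is a minimal sketch under stated assumptions, not the authors' metric set: the abstract names neither a target programming language nor specific metrics, so Python source, the function name `style_metrics`, and the four measurements are all chosen here purely for illustration.

```python
import ast
import io
import keyword
import tokenize


def style_metrics(source: str) -> dict:
    """Compute a few illustrative style measurements for Python source text."""
    comments = 0
    names = []
    # Walk the token stream to count comments and collect identifiers.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments += 1
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            names.append(tok.string)

    # Walk the syntax tree to count simple control-flow constructs.
    tree = ast.parse(source)
    branches = sum(
        isinstance(node, (ast.If, ast.For, ast.While)) for node in ast.walk(tree)
    )

    nonblank = [line for line in source.splitlines() if line.strip()]
    return {
        "comments_per_line": comments / len(nonblank) if nonblank else 0.0,
        "mean_name_length": sum(map(len, names)) / len(names) if names else 0.0,
        "underscore_name_share": (
            sum("_" in n for n in names) / len(names) if names else 0.0
        ),
        "branches": branches,
    }


# Hypothetical sample program used only to exercise the function.
sample = '''\
# add up the even numbers
def sum_even(values):
    total = 0
    for v in values:
        if v % 2 == 0:  # keep evens only
            total += v
    return total
'''
metrics = style_metrics(sample)
```

A profile like this would only be a starting point; whether such measurements separate authors reliably is exactly the open question the abstract raises.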
Programmers tend to have coding styles that are fairly distinct, and
often recognisable to their colleagues, but the issue of how well this
can be hidden, or mimicked, is also of obvious importance when ascribing
authorship to an individual. There is a question of whether or
not there is sufficient information available using these combined
techniques to provide adequate authorship evidence for use within a legal
context; in other words, whether authorship identification, or
characterisation, can be performed at levels of sufficient certainty for
these results to then be presented as legal argument.
Source code authorship analysis could be, and in some cases has been,
used for a number of tasks including software forensics as mentioned
above, the detection of plagiarism in educational settings, and the
resolution of disputed claims of code authorship. Each of these
application areas requires a specialised approach.
While all three areas will be mentioned, the main focus for this paper
is on software forensics. The association of characteristics with source
code style can be used to suggest the gender, background, or some
personality characteristics of the author, as well as to support
authorship discrimination and matching. The use of source code
measurements to determine the author of malicious code is made much more
difficult for code that has already been compiled into an executable
format.
In this paper the authors look at how authorship analysis for text can
be extended to source code, working from a basis in computational
linguistics and forensics. They also consider some of the special issues of
source code analysis, distinguish between the different application areas
for such analysis, and discuss the legal implications of using such
techniques for some areas, with a focus on the software forensics application.
Electronic resources for forensic linguistics: creating a web journal
This paper will give a detailed history of "Language in the Judicial
Process" (LJP) <http://ljp.la.utk.edu>, an electronic publication
created for use by academics, attorneys, and other professionals
interested in language and the law, and suggest ways in which such
low-cost electronic resources can serve the interests of linguists,
lawyers, litigants, students, and computer scientists.
The paper provides a comprehensive framework of requirements and considerations for building a web-based journal.
Topics covered include:
|