Chapter 3

 

Methodology

 

This chapter describes the creation of the Annual Reports of Irish Companies (ARIC) corpus and discusses the tools and types of linguistic analysis that were applied.

3.1 Corpus design and data collection
The Annual Reports of Irish Companies (ARIC) corpus, which was built for the purposes of the present study, is composed of 20 Annual Reports (ARs) of public companies listed on Euronext Dublin, previously known as the Irish Stock Exchange, which in 2018 merged with the cross-border European exchange Euronext (operating markets in Amsterdam, Brussels, Dublin, Lisbon, London, Milan, Oslo, and Paris). The companies chosen for this project make up the ISEQ 20, a benchmark stock market index that tracks companies trading on the Dublin stock exchange. The ISEQ 20 comprises the 20 public companies with the highest trading volume and market capitalisation within the Irish Stock Exchange Quotient (ISEQ) Overall Index (Euronext Dublin, 2021; Investopedia, 2020).
Four main considerations were taken into account in corpus design:

(a) authenticity: selecting authentic texts for the corpus, representing the actual language that BE learners and users will encounter;

(b) pedagogical value and credibility: finding trustworthy resources that offer instructional benefits. This is based on the premise that the language of the ARs offers insights into how language is used in the context of business, as it comprises specific lexis and unique phrases typical of this setting. As Nickerson and Planken (2015: 105) observe, ARs offer vast pedagogical contributions in the advanced BE environment;

(c) diversity: the annual reports have multiple authors, which ensures linguistic variety and helps avoid a focus on idiosyncratic language;

(d) availability: the reports identified for this project are publicly accessible on the respective companies’ websites and can easily be downloaded as PDF documents. For practical reasons, the reports for the year 2019 were selected, as full reports for the year 2020 had not yet been made available by all the companies subject to this investigation at the time of commencement of this research project, i.e. April 2021. The examined documents provide lexically rich material and are substantial in size: the ARs range from 81 pages (Kingspan, Glenveagh) to 389 pages (AIB).

3.3 Rationale for choosing Annual Reports
There are many reasons justifying the decision to focus on the genre of Annual Reports (ARs) in this project. ARs are described as “corporate disclosure documents” (Nickerson et al., 2016), and the genre has attracted considerable academic attention in recent years (de Groot, 2012). In respect of authenticity and reliability, ARs “are the most important external documents and the most used channels for communication between organisations and stakeholders” (Cao et al., 2012); they are therefore an effective tool for promoting the company. As annual reports “are the main channel that companies use to communicate with stakeholders” (Moreno and Casasola, 2015), it can be inferred that they receive plenty of attention from internal departments and external consultancies and, as a consequence, are subject to careful scrutiny.
As regards their availability, ARs are readily accessible online in PDF format, which also makes them a practical choice for Business English research. The fact that they have already received considerable attention from acclaimed linguists (e.g. Bhatia; Cao et al.; de Groot) attests to their linguistic and pedagogical value.

3.4 Text Identification
Unlike Rutherford (2005), whose study examines solely ARs’ narratives with the objective of contributing to a better understanding of the genre in its own right, this study’s ambition was to conduct a linguistic analysis that would inform the design of BE teaching materials. Therefore, the focus is on linguistic items contained within the reports which bear pedagogical importance.

Given this objective, the corpus consists of nearly the full content of the analysed annual reports. The material included: (a) captions and notes within graphical material; (b) non-numerical details from tables standing apart from the main text; (c) slogans, quotes, and mission statements; (d) headlines, repeated headings, and the main body of text; (e) lists and bullet-pointed material. The full list of annual reports used to build the ARIC corpus can be found in Table 1. The corpus creation procedure was first tested on a single report (Kerry Group); the same process was subsequently applied to the rest of the material. All the reports were downloaded as PDF files from the respective companies’ websites (available in the Investor Relations sections). The reports were then individually submitted to optical character recognition (OCR) using the “OCRmyPDF” software in order to convert them from PDF to text format (.txt). Once in text format, the reports were edited manually to correct any errors resulting from the OCR. Manual editing involved removing titles and footnotes, company names, and numerical data such as results, percentages, and indicators.
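Part of this numerical clean-up lends itself to simple automation. The sketch below is a minimal illustration of how figures and percentages could be stripped from an OCR-ed report before manual review; the file names and the regular expression are hypothetical, and in the present study this stage was largely carried out by hand.

```python
import re
from pathlib import Path

def strip_numbers(path: str) -> str:
    """Remove numerical data (figures, percentages, currency amounts) from an OCR-ed report."""
    text = Path(path).read_text(encoding="utf-8")
    # Delete percentages and plain or currency-prefixed figures, e.g. "12.5%", "1,234m", "2019"
    text = re.sub(r"€?\d[\d,.]*\s*%?", " ", text)
    # Collapse the whitespace left behind by the deletions
    return re.sub(r"[ \t]+", " ", text)

# Hypothetical file names; one cleaned file is written per report
cleaned = strip_numbers("kerry_group_2019.txt")
Path("kerry_group_2019_clean.txt").write_text(cleaned, encoding="utf-8")
```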

As Sinclair (2004) points out: “It is important to avoid perfectionism in corpus building. It is an inexact science, and no one knows what an ideal corpus would be like.” Accordingly, he advises treating the results as indicative rather than definitive. In line with his comment, an attempt was made to create the best corpus possible in the given circumstances; nevertheless, the creation of the ARIC corpus took several attempts.
As to the size of a corpus, Bowker and Pearson (2002: 10) observe that there are no “hard and fast rules about how large a corpus needs to be”. By the end of the process the target corpus contained 918,436 tokens, understood as “the number of words in a text or corpus” (Milton, 2009: 9) or “single linguistic units” (Baker et al., 2006: 161), and 16,057 word types, referring to “the number of different words” or “the total number of unique words” used in the corpus (Milton, 2009: 9; Baker et al., 2006: 162).
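For illustration, token and type counts of this kind can be approximated with a few lines of Python. Note that AntConc applies its own tokenisation rules, so the sketch below (with a hypothetical word pattern and folder layout) will not reproduce the exact figures reported above.

```python
import re
from pathlib import Path

def tokens_and_types(folder: str) -> tuple[int, int]:
    """Count running words (tokens) and distinct word forms (types) across all corpus files."""
    words = []
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8").lower()
        # A simple word pattern: letter sequences, allowing internal hyphens/apostrophes
        words.extend(re.findall(r"[a-z]+(?:[-'][a-z]+)*", text))
    return len(words), len(set(words))

tokens, types = tokens_and_types("aric_corpus")   # hypothetical folder of edited .txt reports
print(f"{tokens:,} tokens, {types:,} types")
```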
As mentioned previously, the corpus comprises reports written by multiple, different authors – often marketing agencies. This, together with the fact that the ARs discussed represent companies belonging to different industries, prevents the analysis of idiosyncratic language. Table 1 lists the companies whose ARs were used in this study and identifies their respective numbers of tokens and word types:

Table 1: Details of the specific reports, number of tokens, word types and business sector.

 

In content analysis it is common to “increase the efficiency of word searches by excluding from the text to be examined frequently occurring words and other language units of no significance to the analysis” (Rutherford, 2005: 361), and thus to focus on content words, understood as “a set of words in a language consisting of nouns, adjectives, main verbs and adverbs” (Baker et al., 2006: 47). A similar procedure was adopted here. The Whole 1990s BNC Stop List (Scott, n.d.) was used to eliminate function words, also referred to as ‘grammatical words’, understood as a “set of words consisting of pronouns, prepositions, determiners, conjunctions, auxiliary and modal verbs” (Baker et al., 2006: 76). Additionally, the most prominent function words, free-standing letters, company names, the names of months that appeared more frequently than others, and frequently occurring abbreviations in the target corpus were added to the list of eliminated words. These are listed in Table 2. Therefore, excluded from the analysis were: (a) frequently occurring structural (i.e. function) words, such as articles, conjunctions, pronouns, and common auxiliary verbs (Milton, 2009: 12); (b) days of the week and months of the year; (c) numerical denominations; (d) identifiable company and product names; (e) individual names of board members and directors; (f) addresses, as well as any geographic names. The full Stop List used in the study is available to view in Appendix 2. Details regarding the content material can be found in the table below:

 

Table 2: Content excluded from and included in the corpus.

 

Furthermore, files that appeared corrupted, i.e. that included whole phrases spelt together without separation, were either remedied by manual editing, i.e. separation (Dalata report), or deleted, as in the case of Smurfit Kappa’s AR, where corruption occurred frequently in the form of nonsensical, unidentifiable strings of letters and characters such as PLOOLRQDQGDUHVXPPDULVHGLQ or Êøçì÷Ìòððì÷÷è. Three attempts were made to OCR this report using different formats and software, with no improvement, and the corrupted lines were deleted due to time constraints. Considering that these lines represent less than 1% of the running text (0.6%), which is not statistically significant, their removal has no real impact on the overall quality of the corpus.
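Corrupted lines of the two kinds just described could also be flagged automatically. The sketch below is a hypothetical heuristic, not the procedure used in this study (where the lines were identified and deleted manually); the thresholds of 15 consecutive capitals and 30% non-ASCII characters are illustrative choices.

```python
import re

def looks_corrupted(line: str) -> bool:
    """Heuristically flag the two kinds of OCR corruption described above."""
    if not line.strip():
        return False
    # Long unbroken runs of capitals, e.g. "PLOOLRQDQGDUHVXPPDULVHGLQ"
    if re.search(r"[A-Z]{15,}", line):
        return True
    # A high proportion of non-ASCII characters, e.g. "Êøçì÷Ìòððì÷÷è"
    return sum(ord(ch) > 127 for ch in line) / len(line) > 0.3

# Keep only the lines that pass the heuristic (file name hypothetical)
with open("smurfit_kappa_2019.txt", encoding="utf-8") as f:
    kept = [line for line in f if not looks_corrupted(line)]
```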

Other manual editing consisted of separating words or phrases that were mistakenly joined in the OCR process, and checking words that the text editor did not recognise as standard English and automatically highlighted. Whenever in doubt as to the correctness of a word, the AR PDF document was consulted and the original version was adopted (e.g. businessto-busines vs. business-to-business).

3.5 The Software
The ARIC corpus was built using AntConc software (Anthony, 2020), a freely available and versatile corpus tool. AntConc is a “multi-platform, multi-purpose corpus analysis toolkit, designed specifically for use in the classroom” (Anthony, 2004), and it hosts a comprehensive set of tools including a concordancer, word and keyword frequency generators, and tools for cluster/lexical bundle analysis. This section outlines the linguistic analyses applied in the present study and identifies the tools that facilitated each investigation. The manual (Anthony, 2014) and various video tutorials (Anthony, 2014; Bednarek, 2011) were consulted for training purposes. The following tools were used to identify the respective linguistic features of ARs:
(a) the Word List Tool was used to identify the 50 most frequent content words,
(b) the Keyword List Tool facilitated the top 50 key content words analysis,
(c) the Collocates Tool identified the 20 most significant collocations of the 50 most frequent content words,
(d) the Clusters/N-Grams Tool was consulted to identify the 50 most frequent 4-word clusters.

3.6 Types of corpus-based linguistic analysis
This section examines in detail the steps involved in the different types of analysis of the lexical features of ARIC.

3.6.1 Frequency
The first step of the corpus analysis was to obtain a frequency list of content words in the corpus using the AntConc Word List Tool (Anthony, 2020). The BNC-based stop list (Scott, n.d.), which contained frequent function words, was used to exclude these items from the frequency analysis, removing entirely grammatical/function words from the investigation. In addition, proper nouns, personal names and abbreviations were removed manually from this list if they occurred with high frequency; details are discussed in section 3.4.
First, the .txt files comprising the 20 edited ARs were uploaded to the concordancer to create the corpus. Subsequently, the Stop List (Scott, n.d.) was uploaded and the Tool Preferences settings were adjusted to allow for the use of the Stop List. The Word List Tool was run and the output, consisting of 16,057 entries, was saved into a dedicated folder.
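Conceptually, this step amounts to counting word frequencies while skipping every item on the stop list. The sketch below illustrates the idea under hypothetical assumptions (a plain-text stop list file with one word per line and a folder of edited reports); AntConc’s own tokenisation and settings may differ in detail.

```python
import re
from collections import Counter
from pathlib import Path

# Load the stop list (assumed one word per line): the BNC-based list
# plus the custom additions described in section 3.4
stop_words = {
    w.strip().lower()
    for w in Path("stoplist.txt").read_text(encoding="utf-8").splitlines()
    if w.strip()
}

counts = Counter()
for path in Path("aric_corpus").glob("*.txt"):
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", path.read_text(encoding="utf-8").lower())
    counts.update(w for w in words if w not in stop_words)

# Print the 50 most frequent content words with their frequencies
for word, freq in counts.most_common(50):
    print(f"{word}\t{freq}")
```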
For practical reasons, the current research focuses on the 50 most frequent content words. This frequency analysis has a dual purpose: it serves as a basis for understanding which vocabulary should be prioritised in BE instruction, and as a starting point for further investigations.

3.6.2 Keywords
The keyword analysis allows for the identification of words that are unusually frequent (or infrequent) in the target corpus in comparison with a reference corpus. This allows the identification of characteristic words in the corpus which may play a major role in BE and BE acquisition. For the purposes of this research project the Brown corpus (1979) was used as a reference and uploaded into the tool settings. The identification involved the automatic comparison of word lists using the AntConc Keyword List Tool (Anthony, 2020), a feature available within the concordancer. The output of the analysis – the positive keyword list of 1,729 entries – was also saved in the dedicated folder.
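AntConc computes keyness internally; for readers interested in the underlying statistic, the sketch below shows a log-likelihood calculation of the kind commonly used for keyness (following Rayson and Garside’s formulation). It is an illustration only and is not claimed to reproduce AntConc’s exact settings; the example figures are hypothetical apart from the ARIC token count.

```python
import math

def log_likelihood(freq_target: int, freq_ref: int, size_target: int, size_ref: int) -> float:
    """Log-likelihood keyness of a word in a target corpus against a reference corpus."""
    total = size_target + size_ref
    # Expected frequencies if the word were distributed evenly across both corpora
    expected_target = size_target * (freq_target + freq_ref) / total
    expected_ref = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical example: a word occurring 500 times in ARIC (918,436 tokens)
# and 20 times in a reference corpus of roughly 1,000,000 tokens
print(round(log_likelihood(500, 20, 918_436, 1_000_000), 2))
```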

3.6.3. Collocations
The investigation of collocations proved more time-consuming and required manual editing. Following the recommendations arising from the frequency analysis described in section 3.6.1, collocations in the target ARIC corpus were analysed from the perspective of the 50 most frequent content words as search words/nodes. Again, for practical reasons, the focus here was specifically on the top 20 collocates of the top 50 highly frequent content words.

As to the statistical measure, Mutual Information (MI) was applied, understood as “a measure that compares the probability of finding two items together and of finding each item on its own” and used as a measure of the strength of a collocation (Baker et al., 2006: 120). An MI score of 3 or above is typically taken to indicate “a strong collocate” (Hunston, 2002: 75, cited by Ackerman and Chen, 2013). However, information on statistical significance has not been included in the table, in order to focus on the frequency data.
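In line with Baker et al.’s definition, MI compares the observed co-occurrence frequency of a node n and a collocate c with the frequency expected if the two were independent. A common formulation (tools differ in how they normalise for span size, so this should be read as illustrative rather than as AntConc’s exact implementation) is:

\[
\mathrm{MI}(n, c) = \log_2 \frac{f(n, c)\,N}{f(n)\,f(c)}
\]

where f(n, c) is the number of co-occurrences of n and c within the chosen span, f(n) and f(c) are their individual corpus frequencies, and N is the total number of tokens in the corpus.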

A window span of 5 (L and R), indicating how many words to the left and right of the search word are examined, was applied; this is a “common setting” and also the default in AntConc (Anthony, 2014; Bednarek, 2011). A span of five words on either side of the node/search word is not only quite typical; a wider span may also lower relevance, as “less useful information emerges when searching for collocates beyond the four or five-word span” (Kostopoulou, 2015: 75). The cut-off point, understood as the minimum collocate frequency, was set to 2, following Bednarek’s (2011) recommendation.
The first step of the analysis was to obtain a list of content words in the corpus using the AntConc concordancer software (Anthony, 2020). The previously created list of the 50 most frequent content words of the ARIC corpus served as the basis for the subsequent analysis of the collocates of these highly frequent words. As in the two previous analyses, an output file corresponding to each word was saved in the dedicated folder (a total of 50 .txt files). A list of these collocations, constituting 1,000 entries, was compiled and is available in Appendix 3.
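To make the procedure concrete, the sketch below shows how collocates within a ±5-word window could be extracted and ranked by the MI formula given above, with the minimum collocate frequency of 2 used in this study. The corpus folder and node word are hypothetical, and AntConc’s tokenisation and ranking options may differ in detail.

```python
import math
import re
from collections import Counter
from pathlib import Path

SPAN = 5        # words examined on either side of the node (AntConc default)
MIN_FREQ = 2    # minimum collocate frequency (Bednarek, 2011)

tokens = []
for path in Path("aric_corpus").glob("*.txt"):
    tokens += re.findall(r"[a-z]+(?:[-'][a-z]+)*", path.read_text(encoding="utf-8").lower())

N = len(tokens)
word_freq = Counter(tokens)

def collocates(node: str, top: int = 20):
    """Rank the collocates of `node` within a +/-SPAN-word window by MI score."""
    co_occurrences = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - SPAN):i] + tokens[i + 1:i + 1 + SPAN]
            co_occurrences.update(window)
    scored = [
        (c, f_nc, math.log2(f_nc * N / (word_freq[node] * word_freq[c])))
        for c, f_nc in co_occurrences.items()
        if f_nc >= MIN_FREQ
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:top]

print(collocates("group"))   # hypothetical node word
```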

Next, the most significant lexical collocations were manually selected to inform the development of the teaching materials, following the advice of Ackerman and Chen (2012: 1), who argue that “only with human intervention can a data-driven collocation listing be of much pedagogical use while still taking advantage of statistical information to help identify and prioritise the corpus-derived collocational items that traditional manual examination is unable to manage”.

3.6.4 Four-word clusters
As Biber et al. (2005: 376) write, “the actual frequency cut-off used to identify lexical bundles is arbitrary”; therefore, for practical reasons, a frequency-based approach was used in the present study to identify the 50 most frequent four-word sequences. As in the previous analyses, AntConc was used, specifically its Clusters/N-Grams Tool. This analysis was independent of the previous content word and keyword analyses. A total of 9,380 clusters were identified with a minimum frequency setting of 10, that is, clusters that occur at least 10 times within the target corpus. The output was saved, and clusters that did not consist of four meaningful words, such as ‘of the group’s’ and ‘in the group’s’, were subsequently eliminated manually due to their limited pedagogical value. Finally, the list of 50 meaningful phrases is presented in Appendix 3.
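The n-gram extraction itself is a simple sliding-window count. The sketch below illustrates it with the minimum frequency of 10 used in this study; the folder name is hypothetical, and for simplicity the sketch concatenates all files into one token stream, ignoring file boundaries, so its counts would differ slightly from AntConc’s.

```python
import re
from collections import Counter
from pathlib import Path

MIN_FREQ = 10   # keep only clusters occurring at least 10 times

tokens = []
for path in Path("aric_corpus").glob("*.txt"):
    tokens += re.findall(r"[a-z]+(?:[-'][a-z]+)*", path.read_text(encoding="utf-8").lower())

# Slide a four-word window over the running text and count each sequence
four_grams = Counter(" ".join(tokens[i:i + 4]) for i in range(len(tokens) - 3))

frequent = [(gram, freq) for gram, freq in four_grams.most_common() if freq >= MIN_FREQ]
for gram, freq in frequent[:50]:
    print(f"{gram}\t{freq}")
```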

3.7 Materials development
The target corpus was analysed and the learning objectives were defined in terms of the acquisition of the most frequently occurring lexical items, presented in the context of the business genre of the AR. The findings informed the development of teaching resources presented as worksheets with a variety of language tasks and exercises, which may also allow teachers to further develop their own corpus-based resources to assist learners in acquiring the respective lexical items (vocabulary, collocations and four-word clusters).
Additionally, elements of the genre approach, as recommended by Frendo (2005:81), were integrated into the content of some of these materials and activities for BE learners. He suggests genre-based teaching may particularly benefit more advanced learners of BE. In a genre approach, learners are given the chance to study model texts and to base their own production on what they have noticed.

Materials created in this project may be seen as a component of a larger syllabus/programme rather than as stand-alone resources, and were developed to accommodate teaching at CEFR level B2+, as they are aimed at learners who “are highly competent in his or her own first language and will often be striving for the same level of competency in English foreign language (…)” (Walker, 2011). This description reflects the CEFR global scale descriptor for this level, according to which a learner at level B2 is classified as an independent user who:

“Can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. Can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without strain for either party. (…)” (CoE, 2010).

"The literature provides various examples of adopting CL in language teaching and specifically in reference to BE."