EUROPRO Digital Corpus

Characteristics and description of EUROPRO


Type: EUROPROwebs is a specialised, ad-hoc, static corpus of websites created and maintained by research projects participating in the Horizon 2020 programme.

Size: The pilot corpus consists of 30 research project websites. Our aim is to build a database of 100 websites in total, all belonging to research projects in which Spanish institutions and universities participate.

Sampling: Convenience sampling was chosen because it best suited our purposes: researchers from our institution, the University of Zaragoza, participate in each of the projects. This will enable us to obtain further qualitative data from informants.

Representativeness: This corpus captures current professional digital practices in international research collaboration, in this case through projects. The H2020 programme is one of the leading European frameworks for research and knowledge dissemination, so its project websites may point to salient linguistic and discursive strategies and practices that could be replicated.


Type: The EUROPROtweets corpus also contains static digital texts that have been downloaded, in this case tweets.

Size: The Twitter accounts of 20 of the 30 research projects that make up the EUROPROwebs corpus were collected and downloaded.

Sampling: In addition to the convenience sampling employed to compile the EUROPROwebs corpus, an observational study of the research projects' preferences regarding social network use was carried out. The results showed that Twitter was the most frequently used social network.

Representativeness: This corpus illustrates the predominant uses and tendencies in social media communication, as applied to international academic contexts. Twitter has been regarded as the most suitable social network for increasing research impact and e-visibility, so analysing the language employed in tweets may help identify complementary discursive and pragmatic patterns in how research groups communicate project information.

For further info about the methodological decisions we made together for the design and compilation of the digital corpus, click here.

Conundrums in tagging and coding EUROPRO


The websites selected for the EUROPROwebs pilot were compiled from April to May 2019. The specific start and end dates of each project were retrieved to determine the stage of development of the corresponding research project at the time of corpus compilation. In the EUROPROtweets pilot, tweets and retweets from the research projects' Twitter accounts were also coded and downloaded at a set date, June 2019.

Hypermediality and hypermodality

The EUROPROwebs corpus was tagged for hyperlinks (external, internal and peripheral), for visuals (such as tables, figures, pictures and logos), and for video and audio. In the case of EUROPROtweets, information on the hyperlinks and multimodal elements (pictures, videos, GIFs) in the tweets was coded.
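As an illustration of the hyperlink tagging described above, the following is a minimal Python sketch, not the project's actual tagging procedure. It assumes that a link counts as internal when it points to the same host as the page it appears on and as external otherwise. The function name classify_link and the example URLs are hypothetical, and the "peripheral" category is left out because its definition is not given here.

```python
from urllib.parse import urljoin, urlparse


def classify_link(href: str, page_url: str) -> str:
    """Label a hyperlink relative to the page it appears on.

    Assumption: 'internal' means same host as the page, 'external' means
    a different host. Non-web links (mailto:, tel:, ...) are labelled 'other'.
    """
    # Resolve relative links such as "/about" against the page URL.
    target = urlparse(urljoin(page_url, href))
    source = urlparse(page_url)
    if target.scheme not in ("http", "https"):
        return "other"
    if target.netloc.lower() == source.netloc.lower():
        return "internal"
    return "external"


# Hypothetical example page and links.
page = "https://project.example.org/home"
labels = [classify_link(h, page) for h in ("/about", "https://twitter.com/example")]
```

A host-based rule like this is only one possible operationalisation; subdomains or partner sites may need to be handled separately.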

Layout and web design

Besides downloading the texts contained in each website, screenshots of every page were taken and saved. Information was also recorded on the extent to which each text could be accessed directly from the website menu.


Information was retrieved on the number of likes obtained by each tweet and the number of times it was retweeted by other users. Information was also retrieved on the number of hashtags used by the research group in its tweets and the number of mentions (@) of other Twitter users.
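The per-tweet counts described above could be recorded along the following lines. This is a minimal Python sketch under stated assumptions, not the coding scheme actually used: the function name code_tweet, the field names and the sample tweet are hypothetical, and hashtags and mentions are approximated with simple regular expressions rather than Twitter's own entity rules.

```python
import re

# Assumption: a hashtag or mention is '#' or '@' followed by word characters,
# and not preceded by a word character (so e-mail addresses are not counted).
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
MENTION = re.compile(r"(?<!\w)@(\w+)")


def code_tweet(text: str, likes: int, retweets: int) -> dict:
    """Return one coded record for a tweet: popularity and interaction counts."""
    return {
        "likes": likes,
        "retweets": retweets,
        "hashtags": len(HASHTAG.findall(text)),
        "mentions": len(MENTION.findall(text)),
    }


# Hypothetical tweet text and counts, for illustration only.
record = code_tweet(
    "New results out! #H2020 #OpenScience thanks @EU_H2020, contact a@b.org",
    likes=5,
    retweets=2,
)
```

Here the likes and retweet figures are passed in as already-retrieved values; only the hashtag and mention counts are derived from the tweet text.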

To account for all of the above aspects of websites and Twitter, specific measures were taken for tagging and labelling the texts before proceeding to the analysis. Full disclosure in our report.