From Clio Infrastructure
Further details
What would be done in the lifetime of the present proposal?
The aims of the first three-year project would be as follows:
- To inventory and collect as much as possible of the data that have already been made machine-readable by different research projects across Europe.
- To supplement this strategically by very limited digitisation of some national level census material.
- To build up a network of collaborating scholars working in the area.
- To facilitate the development of national projects aimed at creating pre-census and census data at national and sub-national levels.
- To ensure consistent methodologies across projects and build a common repository for consistently coded data open to all scholars.
How this would be achieved
- One researcher (an economic historian with good IT, people and administrative skills) would be employed as a project manager. His or her job would be to manage the project; to inventory existing datasets for all of Europe; to make contact with existing research groups; to persuade them to join the hub; and to negotiate access to existing datasets. This would involve substantial travel.
- All collected data would be made available for download from a website. This would also be the project manager’s responsibility, but he or she would have assistance and advice for two days per month from Ms Gill Newton, an IT specialist at the Cambridge Group for the History of Population and Social Structure. All data would be coded to a common coding scheme: the Historical International Classification of Occupations (HISCO). This work would be undertaken in collaboration with another hub led by Professor Marco van Leeuwen at the International Institute for Social History in Amsterdam and Professor Ineke Maas at the University of Utrecht.
- We would hold an annual workshop in Cambridge, to which we would invite researchers working on the creation and use of hub datasets.
- A core aim of the hub would be to encourage and facilitate the formation of national research projects within the hub and to ensure that they produce commensurable datasets. One practical form of assistance would be to offer extended technical guidance on how to run such a project and how to collect and manage the data. The Cambridge team now has a substantial body of expertise in this area. We would invite interested parties to visit Cambridge for periods of a week or more, during which they could liaise with the principal investigator, the project co-ordinator, the IT specialist, and others. Such training/liaison sessions could be run on an ad-hoc basis or timed to immediately precede or follow the annual workshop. Further advice and assistance could of course be provided by phone and email, and there might be a case for site visits.
- A second form of assistance would be to make small grants, up to £5,000, available to credible applicants who wish to carry out survey work on source availability prior to making a funding application to a relevant body. I have budgeted for six such grants (or a larger number of smaller grants).
- A share of the budget has been set aside for data digitisation. The sum is sufficient to guarantee that, at minimum, by the end of three years we would have machine-readable nation-state level census data for most of Europe for the whole census period (c.1870 to the present).
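As an illustration of what coding all data to a common scheme involves in practice, the following minimal Python sketch recodes source-specific occupational labels to a shared code and sums the head counts. It is a sketch under stated assumptions, not the hub's actual procedure: the lookup table, the labels, the grouping of labels, and the codes `X001` and `X002` are all hypothetical placeholders and are not real HISCO codes.

```python
from collections import defaultdict

# Hypothetical lookup from source-specific occupational labels to a common code.
# Labels, groupings and codes are invented placeholders, NOT real HISCO codes.
COMMON_CODES = {
    "agricultural labourer": "X001",
    "farm servant": "X001",
    "jornalero": "X001",
    "weaver": "X002",
    "tejedor": "X002",
}


def recode(rows, lookup=COMMON_CODES):
    """Sum head counts by common code and collect labels with no mapping.

    rows is an iterable of (label, count) pairs taken from one census table.
    """
    totals = defaultdict(int)
    unmapped = []
    for label, count in rows:
        code = lookup.get(label.strip().lower())
        if code is None:
            # Keep unmapped labels for manual review rather than guessing a code.
            unmapped.append(label)
        else:
            totals[code] += count
    return dict(totals), unmapped


# Invented example rows; the counts are not drawn from any census.
sample_rows = [
    ("Agricultural Labourer", 120),
    ("Jornalero", 80),
    ("Weaver", 40),
    ("Cordwainer", 5),
]
totals, unmapped = recode(sample_rows)
print(totals)
print(unmapped)
```

Returning the unmapped labels separately, rather than dropping them or forcing them into a residual category, is one way to keep the recoding auditable across national projects.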
Further information on digitisation plans
From c.1870 most European states collected and published comprehensive data through national censuses, which can therefore be digitised relatively cheaply.
The digitisation objectives of this hub over the first three years will be:
- To collect, standardise and disseminate machine-readable datasets of occupational census material, principally at the national level, for as many countries as possible within the area covered by the current EU27, from the date at which censuses began to record occupations down to the present. As much age-specific, sex-specific and marital-status-specific data would be collected as is presented in the original printed sources. It is difficult, for several reasons, to estimate exactly how much material can be digitised for a given budget before all of the material has been surveyed. Firstly, the starting date varies from one state to another and will be determined by the date at which the census first began to record full occupational data. For the UK this would be 1841, but for most European states the starting date would come later in the nineteenth century. Secondly, the states making up this area have varied over the period. Thirdly, the level of occupational and age-specific detail varies both between states and from one census year to another. In the UK in 1851 data was provided for 20 different age groups in around 400 occupational categories, while in 1871 data was provided for only two age groups, and in the early Spanish censuses fewer than a dozen occupational categories were employed. Fourthly, there is sometimes more than one relevant census, and there can be as many as three (population censuses, occupational censuses and industrial censuses). However, work done in digitising the UK census does give us a clear indication of the volume of material which can be digitised by a data editor each week. On this basis we expect to be able to digitise around 600 census tables (around 60 megabytes of data). This would be sufficient to generate national level data for males and females for an average of 14 censuses (c.1870 to the present) for a minimum of 20 countries, but we would hope to be able to cover the whole of Europe.
All the data would be coded to the Historical International Classification of Occupations (HISCO).
- To put the UK county level tables, all coded to HISCO, on the hub (c.1360 tables and c.1.4 gigabytes of data) plus any further data which become available through independent projects.
- To collect, standardise and disseminate machine-readable census-derived population datasets at sub-national level (at a spatial level equivalent to the UK counties or French departments) for all census years for each state within the current EU area. This would be a relatively small dataset of perhaps 1 megabyte.
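The volume figures quoted above can be cross-checked with a little arithmetic. The short Python sketch below uses only the numbers stated in the text; the derived ratios are illustrative and are not estimates made by the proposal itself. It assumes 1 gigabyte = 1,000 megabytes.

```python
# Figures quoted in the text above.
national_tables = 600        # national level census tables to be digitised
national_mbytes = 60         # approximate size of those tables in megabytes
censuses_per_country = 14    # average number of censuses, c.1870 to the present
min_countries = 20           # minimum number of countries covered
uk_county_tables = 1360      # UK county level tables already coded to HISCO
uk_county_mbytes = 1.4 * 1000  # 1.4 gigabytes, assuming 1 gigabyte = 1000 megabytes

# Number of country-census combinations implied by the minimum coverage.
country_censuses = censuses_per_country * min_countries

# Average number of tables available per country-census within the 600-table budget.
tables_per_country_census = national_tables / country_censuses

# Average size of one table in each collection, in megabytes.
mbytes_per_national_table = national_mbytes / national_tables
mbytes_per_county_table = uk_county_mbytes / uk_county_tables

print(f"country-censuses: {country_censuses}")
print(f"tables per country-census: {tables_per_country_census:.2f}")
print(f"megabytes per national table: {mbytes_per_national_table:.2f}")
print(f"megabytes per UK county table: {mbytes_per_county_table:.2f}")
```

On these figures the minimum coverage amounts to 280 country-censuses, or roughly two tables each, which is consistent with separate national tables for males and females.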
