Core Research And Teaching Datasets In Universities

The government is seeking ways to improve public services, encourage competition and promote economic growth by giving access to, and free use of, publicly funded datasets. The college is concerned, though, that applying these principles to universities (which are private charitable bodies, though partly publicly funded) will have unintended consequences that work in the very opposite direction to the one the policy intends.

Whether higher education institutions are included in scope will be a crucial decision. The generally accepted understanding of a public service would not normally include post-compulsory education or research. Universities are autonomous charitable bodies not subject to direction by a government minister.  With the forthcoming change in the funding of undergraduate students, grants will form only a minor part of the cost of higher education for most universities. With the majority of income coming from or on behalf of students as individuals, the government has already strengthened the power that consumer choice will have on competition between institutions.

Universities have already published, or are about to publish, large sets of data about their performance. Shortly, each university will publish a key information set, prepared against a common standard and intended to inform potential applicants about the attributes of each course, such as learning and assessment requirements, staff contact time, and the likely employment and salary prospects after graduation.

We would be concerned if an extension to FOIA as envisaged in the consultation resulted in further demands for datasets when, in our view, the accountability, competition, choice and quality enhancement policy imperatives are already being met through existing public datasets. The consultation asks what threshold would be appropriate to determine the range of public services in scope. We do not think that higher education and research is a public service in the normally understood meaning of that term and so universities should not be in scope. A helpful additional criterion might be the proportion of funding a private body receives from the government through grant or contract.

Currently a (probably large) number of authorities only collect data that they have to. Submitting data to the government could mean more work for some public bodies.

Data collection for economic development is vital especially in our current economic climate. Data needs provenance/metadata which needs to be cited on any data that is published. It can be difficult/expensive for the public sector to get hold of/collect the core dataset and this is often why the data is not published. Often, data is not collected at a low enough level to make it meaningful for re-use and to target services/support in the right places.

The possible implications of releasing “raw” data before checking its quality are huge. Unchecked data can produce skewed or erroneous results, for example because values were not input correctly. A commonly cited example is the medical research on iron: for decades people thought spinach had a much higher iron content than anything else when, as the story goes, the decimal point had been put in the wrong place. If the public body has to release the data, it is better that it is released once after it has been checked and validated than released twice because the first release contained errors.

There have been numerous stories in recent years about computer hackers who have gained access to personal information, bank account details and even national security systems. Whilst making educational and medical records accessible is a good idea, making them available over the Internet could create a hacker’s paradise. It could allow unscrupulous individuals to find out what medications are prescribed to a person or their children, for example. Patients already have the right to view their medical records at the surgery.

If this does go ahead, then people should also be given the opportunity to opt out. There are a lot of older people, for example, who do not have access to a computer or the Internet and they will not want their information being made available in this way. The public has a right to say what happens with their personal information.

If data is to be provided free to the public, and local authorities are presumably expected to collect, maintain and report this data, then how is the cost to be borne when local authorities are experiencing significant budget cuts and are concentrating on service delivery? Data collection and publication has traditionally been seen as a back-office function. Collection of data can be expensive, and thought needs to be given to which datasets are collected and whether they will provide value for money. In view of the reduced budgets that public bodies are experiencing, this additional requirement to proactively publish data on services will add a further burden.

Metadata about datasets will need to be recorded and released with the data. If the government wants consistency, then it will need to explain what datasets it wants collected and the method of collection otherwise consistency will be lost.

If a business intends to make money from public data the source should be acknowledged and money made from using this data should be shared with the organizations providing the data. It is unethical to allow others to make money without at least acknowledging the work and contribution made by others.

Yes, a public provider should have the right to refuse to publish because of unreasonable cost. Yes, if the data requester is prepared to meet the cost and the data is not sensitive in some way then it should be provided. Yes, this could have an impact on the service provider delivering its core functions so delivering this data would have to be discussed with the requester, a time scale agreed and if the service provider is not able to provide the data themselves, then the work can be put out to an external organization.

Changing procurement rules will not make any difference. It’s the specification in each IT procurement project that needs changing to ensure that it is straightforward to extract data.

Through a new initiative the public sector has been steadily working towards publishing information online. Availability of resources is the usual reason why some organizations have been slower than others. Publishing data on the Internet is as much about having the resources to get the data into the system as it is purchasing the system. Putting the data into the system is more expensive and time consuming than buying software.

The public sector does not have a track record of being a demanding customer. Widespread change in the medium term is only achievable if the public sector receives some financial support. This raises the question of what will happen to lower quality data. If poorer quality data is compared with higher quality data, that will be like comparing apples with pears. If the data re-user knows the data quality is poor, they will not use it. Set a standard from the start, give the public sector enough time to organize collection and place liability on the public body to provide quality data.

Metadata provides the user with who, what, where, how, when of the dataset so the re-user can make a judgement about the quality and appropriateness.

Commitment to open data should be a corporate responsibility, with a senior person taking overall responsibility for compliance. An experienced and trained person, who understands all aspects of data management, should deal with data protection and privacy. Responsibility would have to be added to job descriptions. Legislation will be required, making clear the levels of responsibility and the punishment for not meeting these requirements. This means that monitoring will be necessary to ensure that public bodies are collecting data and publishing it. Perhaps regular downloads to a national hub and regular updates will be required to meet agreed standards. Data from the providers should be held centrally so that the public knows where to go.

If businesses and individuals are to use data produced by public providers, then they must also be responsible for publishing accurate results and not manipulating data to gain a specific desired result. Just as the scientific world carries out peer review, something similar could perhaps be set up. The re-user can look on the hub to locate the information they require and then use a hyperlink to go to the website where the data is held. Discovery level metadata could be recorded using one of the existing national or international standards.

Metadata could be uploaded each time there is a change to a dataset or a new one is published. This will make the data more meaningful.

Data has intellectual property rights. Create an inventory of the datasets that are already known and in use. Then get each sector to look at their service area, define which datasets are required and prioritize them. Public bodies spend their time and money collecting data for matters where they are measured, not necessarily on areas that are important locally.

Yes, data should always be high quality. It is far more costly to publish lower quality data, refine it and then republish. There is always an element of compiling and formatting even poor data before it is released, so this work would be duplicated if the same dataset is published twice or more. It will also mean that re-users are less likely to use the poor quality data and will instead wait for the good quality data, as they would otherwise have to compile the data or run their analysis twice too. This is not cost effective. Defining quality will depend on the dataset in question. Polishing data implies tweaking or massaging the figures to make them look good. This is not a good idea if the government wants quality.

Releasing public data will also highlight areas of incompetence as well as good practice. The government should be asking what the information will be used for, and each time a re-user asks for data they should have to sign a declaration about what they want the data for. People undervalue data and don’t see it as important. It is possible that data can be used for criminal or immoral activity. The public has a right to know who has accessed what data and for what purpose. It’s their data, paid for from the public purse.

Any underlying data behind advice and decisions should be published with, and at the same time as, the report or document. In principle, publishing datasets along with the analysis is a good idea. It will mean that the public will question government decisions, but it will also show that policies and decisions have been made on matters of fact, and it will improve transparency. The government would need to ensure that the published datasets are accurate and have not been manipulated to gain a specific desired result. Inaccurate data would be embarrassing for the government and give the public another reason to mistrust politicians.

Prioritization of datasets should be based on need. Each sector will have different priorities and whilst some steer from government will be required, each public body will have its own priorities so some flexibility about local priorities should be allowed. Publish evidence of existing datasets behind regular statements first. Cut your teeth on what is familiar and tried and tested first. Take a phased approach to allow time for data creators and statisticians in public bodies to become familiar with what is required and the standards they must meet. Then gradually introduce new dataset requirements around new initiatives.

Often, it is not possible to put a patient’s full medical records onto the system as older handwritten information can be impossible to read. Even if this information can be read, it will take an extremely long time to get all medical records onto a computer system because of the quantities involved. So, this needs to be seen as a long term project; it is simply not possible to get all this information accessible online in the medium term.

Releasing data quickly is not a good idea as it will compromise quality. It also means a duplication of work and re-users will view it with suspicion. It may also impact on best practice.

Users should register for access to data. Current providers of free data require users to register. This will allow the government to monitor who is accessing the data and to ascertain which datasets are more popular, whether the uptake in the business community has been as anticipated, what sort of data is being used by the business community, and so on.

Those who use public data should also share their results and findings with the public/government as it may be useful for policy making, public health, economic development etc.

Data is like cars. You don’t sell cars to the public in kit form for them to put together, or they will end up with a few screws and bits left over. You provide them with the completed article, which is safe to use and quality controlled. Car manufacturers only release their cars for sale after a level of quality control and testing has been done. Most people do not know how to use data or interpret it. It is not taught in school, and unless it is provided with an explanation it is unlikely to be meaningful to many people. Anecdotal evidence is that public bodies concentrate their energies and funding on areas where they are monitored. Service areas which are not under scrutiny receive a lower priority.

Careful consideration needs to be given to third party intellectual property rights and copyright. Thought also needs to be given to who will use the data and for what purpose. What are the implications of making this data available? Whilst it may stimulate the economy, there can be adverse implications too. For example, publishing crime data might affect house prices or the desirability of living in a specific area. There is already anecdotal evidence that estate agents are advising clients not to report crime as it may affect the value of their house.

High quality data can be expensive and it takes time and money to make sure that data meets a certain standard. That means that there will be an up-front cost attached to getting things set up ready to meet these proposed requirements and afterwards a maintenance cost. This means that there will be a cost attached to public bodies providing this information and hence a cost to the public purse.

Private companies and academic bodies should also submit data. Utilities could provide data on how much electricity is going into the grid from wind farms and solar panel generation. Truly open data should include the private sector making datasets they have collected available for use by the public sector too. This information would be extremely useful for analysis of the economic market.

We are extremely concerned that core research and teaching datasets in universities, which are not “by-products of service delivery”, would become subject to open access and be available for commercial use. Notwithstanding the definition of “data-set”, which excludes information in non-government bodies about aspects unrelated to public service, the risk is that, because public universities are subject to the Freedom of Information Act, such data will be brought into scope unless explicitly excluded. One of the aims of open access is to promote economic growth, yet the very opposite is likely to happen in some sectors if this policy is applied without modification. Much university research, whether funded publicly or privately, will result in the creation of large datasets. All the outcomes of research are published, usually in peer-reviewed journals, and publicly available (increasingly in open-access journals). In many cases, though, the datasets themselves will form the basis of a commercially exploitable opportunity. It is essential that these datasets are protected so that they can be commercialized by the university (or its agent); that is, universities should be encouraged to continue to follow the very path that government is advocating in this consultation.

The core datasets should not be taken and exploited free of charge by a third party who had no interest or risk in their creation. Crucially, companies sponsoring university research which generates data may cease to do so if they cannot protect the products of the research. Such companies would be likely to take their funding overseas to jurisdictions which allowed them to protect their assets. The loss of such income and the threat that research data might be freely open for exploitation by others would precipitate a loss of many key staff members with consequent damage. We have similar concerns about datasets used in teaching, produced at considerable cost, which would then be available without charge for use by a competitor. Core research and teaching datasets need to be excluded from these proposals as of right, not on a case-by-case basis using specific exemptions.

The consultation makes hardly any mention of the overheads that free access brings. These are not simply the cost of making them publicly available but the subsequent cost of supporting them when in the public domain.  As well as access for public benefit, these proposals would give access for private benefit. Many single-issue pressure groups and journalists on “fishing trips”, for example, would seize the opportunity of accessing specific data, perhaps when not complete. The cost to the university would be in explaining the context, justifying the data, and responding to queries which can only divert academics and others from their productive core work.  The presumption in the consultation is for early release over improved quality of data. The risk (which may be considerable) is that users make the wrong decisions to their own detriment because of poor data.


Jeff C. Palmer is a teacher, success coach, trainer and Certified Master of Web Copywriting. Jeff is a prolific writer, Senior Research Associate and Infopreneur, having written many eBooks, articles and special reports.


