The Environment Agency (EA) has around 10,000 years' worth of hydrological data on river levels and flows. However, it is stored on materials that are rapidly degrading.
The risk of flooding and drought in England is a priority for the EA, which strives to protect and enhance the environment, contribute to sustainable development and help protect the nation’s security in the face of emergencies.
Over the years, a vast amount of hydrological data has been collected manually, building a physical archive of approximately 10,000 years’ worth of river level and flow information. This data could be used to build more accurate climate and flood models and to help forecast and minimise the impact of future adverse weather events.
However, a significant challenge is that much of this historical environmental surveillance data is stored on perishable materials such as paper charts, microfilm and punch tape. These documents face the risk of irreversible degradation and therefore need cataloguing urgently. Adding to this challenge, the EA is losing the expertise needed to interpret the archive as experienced staff retire.
Manual data extraction is underway, but plotting the physical data onto graphs is so time-consuming that the process is currently estimated to take 40 years. That timescale is unsustainable, so a new, faster solution was needed.
The Department for Environment, Food & Rural Affairs (Defra) approached the Accelerated Capability Environment (ACE) on behalf of the EA to explore the feasibility of using cutting-edge artificial intelligence (AI) and machine-learning technology to digitise, read and interpret the physical data significantly faster while maintaining accuracy.
An initial options analysis, carried out with domain specialists and data users from across the EA, identified two suitable open-source tools to take forward to the proof-of-concept (PoC) stage, rather than the single tool expected: one fully automated, the other with a human in the loop.
The first PoC, the fully automated tool, showed low feasibility for effective digitisation. It struggled to accurately read the handwritten information that is crucial for interpretation, such as axis labels and chart metadata including location and start date, and to adapt to different chart types, such as those with gaps in the trace or smudges caused by water damage. It is recommended that the approach be reassessed in future as Optical Character Recognition (OCR) performance improves.
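To illustrate the kind of unattended step the fully automated approach depends on, the sketch below attempts to OCR the header region of a scanned chart, where details such as station name and start date are often handwritten. It is a minimal illustration rather than the PoC tool itself: the file name, the crop region and the use of the pytesseract and Pillow packages are assumptions made for the example.

```python
# A minimal sketch of automated metadata extraction from a scanned chart,
# assuming Tesseract OCR and the pytesseract/Pillow packages are installed.
# The file name and the header crop region are illustrative assumptions,
# not details of the EA archive or the PoC tool.
from PIL import Image
import pytesseract


def read_chart_header(path: str) -> str:
    """OCR the top strip of a scanned chart, where station name and start
    date are typically handwritten; this is the step that proved unreliable."""
    image = Image.open(path)
    header = image.crop((0, 0, image.width, int(image.height * 0.15)))
    return pytesseract.image_to_string(header).strip()


if __name__ == "__main__":
    print(read_chart_header("river_level_chart.png"))
```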
Pivoting to the second PoC, the human-in-the-loop tool for AI-assisted data rescue, produced better results. Recommendations were also made for feature changes that would increase the tool's effectiveness on live datasets, including integrating additional AI elements from the first PoC.
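As a rough illustration of the human-in-the-loop pattern, the sketch below flags implausible jumps in digitised river levels and asks a reviewer to confirm or correct them. The data format, jump threshold and console prompts are assumptions made for the example, not features of the chosen tool.

```python
# A rough sketch of a human-in-the-loop review step, assuming digitised
# readings arrive as (timestamp, level-in-metres) pairs. The jump threshold
# and the prompts are illustrative assumptions, not features of the EA tool.
from typing import List, Tuple

Reading = Tuple[str, float]


def flag_suspect_readings(readings: List[Reading], max_jump: float = 2.0) -> List[int]:
    """Return indices where the level changes implausibly between consecutive readings."""
    return [i for i in range(1, len(readings))
            if abs(readings[i][1] - readings[i - 1][1]) > max_jump]


def review(readings: List[Reading]) -> List[Reading]:
    """Ask a reviewer to confirm or correct each flagged value; accept the rest automatically."""
    for i in flag_suspect_readings(readings):
        timestamp, level = readings[i]
        answer = input(f"{timestamp}: extracted level {level} m looks suspect. Keep? [y/n] ")
        if answer.strip().lower() != "y":
            readings[i] = (timestamp, float(input("Corrected level (m): ")))
    return readings
```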