
Pro Search from Earthmark: AI-Powered Data Collection from Sustainability Reports

Written by Katja Ovchinnikova, Technical Lead at Earthmark


In the fast-paced world of sustainability, accurate and actionable data is the foundation for meaningful change. At Earthmark, we are committed to empowering businesses and consumers with the tools they need to drive change, and our latest innovation, Pro Search, is a testament to this mission.


Pro Search is Earthmark’s newly developed environmental data extraction and aggregation tool, designed to overcome the challenges of disclosure and communications. It uses large language models (LLMs) to gather and analyse data from sustainability reports. By integrating Pro Search into our backend infrastructure, we are boosting our capability to provide reliable, standardised, and comprehensive environmental performance data worldwide in a simple, understandable way.


The Problem: Fragmented Data in Sustainability Reports


The surge in corporate sustainability commitments has led to an overwhelming volume of environmental data, but accessing and processing this information remains a significant challenge:


  • Inconsistent formats: Environmental data is often buried in lengthy sustainability reports, fragmented across PDFs, HTML pages, and other document types.

  • Access barriers: Data retrieval is further complicated by paywalls and subscription-based platforms, restricting accessibility for SMEs and consumers.

  • Incomplete reporting: Some companies share breakdowns of Scope 1-3 emissions, waste management, and energy usage, but other reports omit key data points or refer to outdated figures.


This lack of standardisation and transparency hampers efforts to monitor progress. We’re now bridging this gap.


How it Works


Pro Search leverages an innovative workflow that blends automation and AI to extract, process, and standardise environmental data. Its key features include:


  1. Search functionality: Locates public sustainability reports online and skips any report whose site restricts access for web crawlers.

  2. Preprocessing: Text extraction from PDFs and HTML pages, retrieving environmental metrics like Scope 1-3 emissions, waste (total, disposed, recycled), energy consumption, and revenue.

  3. AI-powered analysis: Data extraction and standardisation.

  4. Revenue and ESG integration: Environmental analysis is supplemented with revenue and ESG data to provide a comprehensive overview of each company.

  5. Parent company and sector mapping: The tool identifies company ownership structures and industries, mapping them to respective parent companies and sectors using the LLM knowledge base and NAICS code classification.
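
As an illustration of the first step, here is a minimal sketch of how candidate report URLs could be filtered against a site's crawler rules. It uses Python's standard `urllib.robotparser` module. The bot name `ExampleReportBot`, the URLs, and the robots.txt content are all hypothetical, and this is not a description of how Pro Search itself performs the check.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "ExampleReportBot") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as a list of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt for an imaginary company website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /investors/private/
"""

# Hypothetical candidate URLs returned by a search step.
candidates = [
    "https://example.com/sustainability/report-2023.pdf",
    "https://example.com/investors/private/esg-data.pdf",
]

# Keep only the reports that crawlers are permitted to fetch.
allowed = [url for url in candidates if is_crawlable(ROBOTS_TXT, url)]
print(allowed)
```

In a real pipeline the robots.txt text would be downloaded from each site before the filter runs. Keeping the check as a pure function over text makes it easy to test without network access.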


Highlights


Initial experiments with a random selection of 80 brands have achieved impressive results:


  • 49% ESG data coverage

  • 61% revenue data coverage

  • 71% mapped to parent companies

  • 75% mapped to sectors
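
To put these percentages in context, the short sketch below converts each reported rate back into an approximate number of brands out of the 80 in the sample. The counts are inferred by rounding and are not figures published by Earthmark, so treat them as indicative only.

```python
SAMPLE_SIZE = 80  # number of brands in the initial experiment, as stated above

# Coverage rates as reported in the list above.
reported_coverage = {
    "ESG data": 0.49,
    "revenue data": 0.61,
    "parent company mapping": 0.71,
    "sector mapping": 0.75,
}


def implied_count(rate: float, total: int) -> int:
    """Approximate number of brands implied by a rounded coverage rate."""
    return round(rate * total)


# Inferred counts (an assumption: the published rates are rounded percentages).
implied = {
    metric: implied_count(rate, SAMPLE_SIZE)
    for metric, rate in reported_coverage.items()
}

for metric, count in implied.items():
    print(f"{metric}: about {count} of {SAMPLE_SIZE} brands")
```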


Tens of thousands of companies globally share environmental performance via sustainability reports. With a success rate of 0.45 in finding ESG reports, and the correct data extracted from every report found, Pro Search outperformed our expectations given the current reporting landscape.


It hasn't all been plain sailing, though. Here are some of the challenges we’ve come up against:


  • Initial report retrieval: While we were able to identify many reports, finding the most recent and complete documents remains an area for improvement. Search engine APIs operate at the keyword level and are susceptible to mistaken identities (for example, a search for the clothing retail brand USC returning results for the University of Southern California).

  • Data extraction from images: Thorough PDF processing is still in beta with most LLMs and doesn’t always work. It can handle some smaller files, but larger files remain a challenge.

  • Self-calculation: The LLM can calculate the correct values from percentages and sums, although mistakes are still common when the phrasing is complicated.

  • Parent company and revenue estimation: The LLM can find the parent company using its general knowledge and estimate the company’s revenue, although the estimate is only approximate.
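
One way to catch self-calculation mistakes is to recompute derived values deterministically and compare them with what the model returned. The sketch below does this for waste figures. It assumes that total waste is the sum of recycled and disposed waste, which will not hold for reports that use additional categories, and the class and function names are hypothetical rather than part of Pro Search.

```python
import math
from dataclasses import dataclass


@dataclass
class WasteFigures:
    """Waste figures in tonnes (hypothetical record structure)."""
    total_t: float
    recycled_t: float
    disposed_t: float


def from_recycling_rate(total_t: float, recycled_share: float) -> WasteFigures:
    """Derive recycled and disposed tonnage from a total and a recycled share (0 to 1)."""
    if not 0.0 <= recycled_share <= 1.0:
        raise ValueError("recycled_share must be between 0 and 1")
    recycled = total_t * recycled_share
    return WasteFigures(total_t=total_t, recycled_t=recycled, disposed_t=total_t - recycled)


def is_consistent(figures: WasteFigures, rel_tol: float = 0.01) -> bool:
    """Check that recycled + disposed matches the total within a relative tolerance.

    Assumption: recycled and disposed waste together make up the total.
    """
    return math.isclose(
        figures.recycled_t + figures.disposed_t, figures.total_t, rel_tol=rel_tol
    )


# A report states "1,200 tonnes of waste, of which 25% was recycled" (made-up numbers).
expected = from_recycling_rate(1200.0, 0.25)

# Made-up example of a model output that does not add up.
llm_output = WasteFigures(total_t=1200.0, recycled_t=300.0, disposed_t=950.0)

print(expected)
print(is_consistent(expected), is_consistent(llm_output))
```

Values that fail the check can be flagged for a second extraction attempt or for manual review instead of being stored directly.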


Overcoming Challenges in Environmental Data Collection


This new capability addresses key limitations in sustainability disclosure and communication:


  • Enhanced accuracy: LLMs like Claude and GPT can calculate values and aggregate data accurately in most cases, although particularly complex phrasing can still cause mistakes.

  • Improved data coverage: Combining multiple data sources and types ensures broader and more consistent coverage.

  • Cost efficiency: By integrating search engine APIs and scalable cloud services, Pro Search keeps data extraction costs below those of manual processes or premium subscriptions.

  • Comparability: The varied reporting landscape creates inconsistencies, which Pro Search helps to resolve, enabling straightforward comparisons regardless of business size, region or industry.
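
Comparability depends on every figure being expressed in the same unit. As a minimal sketch of that kind of standardisation, the function below converts emissions reported in different units into tonnes of CO2e. The unit table and function name are illustrative assumptions and do not describe the actual Pro Search implementation.

```python
# Conversion factors to tonnes of CO2e. Keys are case-sensitive on purpose:
# "Mt" (megatonnes) must not be confused with other spellings such as "mt".
TO_TONNES_CO2E = {
    "kgCO2e": 0.001,
    "tCO2e": 1.0,
    "ktCO2e": 1_000.0,
    "MtCO2e": 1_000_000.0,
}


def to_tonnes_co2e(value: float, unit: str) -> float:
    """Convert an emissions value to tonnes of CO2e.

    Unknown units raise an error instead of being guessed, so that
    ambiguous figures are surfaced rather than silently mis-scaled.
    """
    key = unit.replace(" ", "")  # tolerate spacing such as "kt CO2e"
    if key not in TO_TONNES_CO2E:
        raise ValueError(f"Unrecognised emissions unit: {unit!r}")
    return value * TO_TONNES_CO2E[key]


# Made-up figures as they might appear in three different reports.
reported = [(12.5, "kt CO2e"), (8_400.0, "tCO2e"), (2.0, "MtCO2e")]
standardised = [to_tonnes_co2e(value, unit) for value, unit in reported]
print(standardised)
```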


A Step Towards Greater Transparency


At Earthmark, our mission is to enable businesses to transition from sustainability intent to action, and Pro Search is a game-changer in achieving this. With its capability to extract and standardise environmental data, it provides businesses with the tools to measure, monitor, and communicate their environmental performance with greater transparency.


Regulations such as CSRD have been introduced to create greater data availability for businesses of all sizes operating in Europe. But we can’t wait for regulation alone: this data still needs to be made actionable for the right people in the right places.


As we deploy Pro Search into production, our team is already working on the next steps: enhancing report retrieval accuracy, reducing LLM costs, and integrating new data sources to improve coverage.


What This Means for Customers and Partners


This new capability offers a streamlined solution to navigate the complexities of environmental reporting, enabling brands to showcase their sustainability efforts with confidence. Pro Search underscores Earthmark’s dedication to continuous innovation and building a sustainable future.


Earthmark’s customers can rest assured that, despite the vast sea of complex environmental data, the latest and most robust information is being used to create a fair, representative view of a brand’s environmental performance. Confidence in the data empowers people to make educated decisions in their everyday lives.


Stay tuned as we roll this out to support our customers and stakeholders in their journey towards measurable sustainability impact. Together, we’re bridging the gap between intent and action.


Ready to lead with transparency?


Connect with us to learn how Earthmark’s innovations can empower your business to take the next step in its sustainability journey.


Work with Earthmark

Learn more about how Earthmark can help you embrace, understand and communicate environmental performance for your brand. 
