High-quality data is the foundation for training and applying large AI models, and the "fuel" for enterprises that want to transform and upgrade with AI. Yet many companies struggle to build AI applications because large models have difficulty understanding unstructured data.
Can more enterprise users gain access to powerful data tools and achieve AI-ready data freedom?
Recently, OpenDataLab and DingTalk jointly launched DLU (Document Language Understanding), a document parsing tool for enterprise users, based on MinerU. DLU aims to help enterprises overcome challenges in obtaining AI-ready data, lower the barrier to AI application development, and accelerate the large-scale adoption of AI technology across industries.
MinerU is an intelligent document parsing engine developed by OpenDataLab at the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab). Known for its precise parsing and broad format compatibility, MinerU has earned more than 40,000 stars on GitHub.
As an internationally recognized AI research institution, the Shanghai AI Lab has deep technical expertise in large models and data intelligence. Its self-developed OpenDataLab platform is a leading AI large-model data platform in China, bringing together more than 7,700 open-source, high-quality datasets and providing over 2 million data services to more than 100,000 users. The recently released MinerU 2.0 delivers significant improvements in both parsing speed and accuracy, achieving performance comparable to mainstream 72B large models with just 0.98B parameters.
DingTalk, the enterprise-level smart mobile office platform under Alibaba Group, boasts a rich suite of enterprise document products and a massive user base. Products such as DingTalk Docs and AI Tables have already deeply integrated MinerU's capabilities and provide document parsing functions to ecosystem developers through an open platform, laying a solid technical and scenario-based foundation for the joint development of DLU.
Built on MinerU, DLU is set to be open-sourced soon. It features outstanding file format compatibility, deep content understanding, and precise structured output. DLU supports mainstream Office documents, PDFs, Markdown, and code files, as well as DingTalk's native document, spreadsheet, and AI Table formats. It can extract plain text content, accurately parse complex visual elements such as charts, formulas, illustrations, and even chemical molecular formulas, and efficiently convert them into high-quality corpora suitable for large-model training.
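As a rough illustration of what "structured output" for a training corpus could look like, the sketch below groups input files into the format families named above and serializes already-parsed elements into one JSON Lines record. This is not DLU's or MinerU's actual API, which the announcement does not describe. The names `FORMAT_FAMILIES`, `format_family`, `to_corpus_record`, the element schema, and the chosen file extensions are all hypothetical.

```python
import json
from pathlib import Path

# Hypothetical mapping from file extension to the format families the article names.
# DingTalk-native formats are left out because the article gives no extensions for them.
FORMAT_FAMILIES = {
    ".docx": "office", ".xlsx": "office", ".pptx": "office",
    ".pdf": "pdf",
    ".md": "markdown",
    ".py": "code", ".java": "code", ".ts": "code",
}


def format_family(filename: str) -> str:
    """Return the assumed format family for a file name, or 'unsupported'."""
    return FORMAT_FAMILIES.get(Path(filename).suffix.lower(), "unsupported")


def to_corpus_record(filename: str, elements: list[dict]) -> str:
    """Serialize parsed elements into a single JSON Lines record.

    `elements` follows an illustrative schema: {"type": ..., "content": ...}.
    """
    record = {
        "source": filename,
        "family": format_family(filename),
        "elements": elements,
        # Plain-text view: element contents joined in reading order.
        "text": "\n".join(e["content"] for e in elements),
    }
    return json.dumps(record, ensure_ascii=False)


# Example: two elements assumed to come from an upstream parser.
line = to_corpus_record(
    "report.PDF",
    [
        {"type": "text", "content": "Quarterly revenue grew."},
        {"type": "formula", "content": "E = mc^2"},
    ],
)
```

Keeping the typed elements alongside a flattened `text` field is one possible design choice: downstream training code can read the plain text directly, while annotation tools can still tell formulas, charts, and prose apart.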
DLU will be deeply integrated into DingTalk's office collaboration ecosystem, creating a closed loop for the entire AI application workflow
In the future, DLU will leverage DingTalk's strengths in enterprise service scenarios, deeply integrating into the office collaboration ecosystem. It will enable users to complete the entire process—from document creation and parsing/extraction to knowledge base management, data annotation, and customized model training—within the same platform, significantly enhancing both AI application development and daily office productivity.
He Conghui, a young scientist at the Shanghai Artificial Intelligence Laboratory and the founder of the OpenDataLab/MinerU open-source project, said: "MinerU already has a broad user base. We aim to further expand its applications in enterprise scenarios, fully leveraging the value of the OpenDataLab platform. Together with our partners, we want to create a 'PyTorch of data tools,' helping more enterprises achieve AI-ready data freedom."
Zhu Hong, CTO of DingTalk, stated: "By open-sourcing DLU, we can effectively address the challenges enterprises face in preparing data in the AI era, laying a solid foundation for intelligent transformation. DingTalk is actively building a new AI ecosystem and looks forward to partnering with more technology leaders and industry players to provide strong support for the digital and intelligent upgrades of industries across the board."