Mathdoc’s IT team focuses on different areas of development:
Full stack development is the main activity. It serves the following purposes:
All developments are hosted on a shared GitLab instance, and Mathdoc’s computer scientists rely on it for continuous integration.
Bibliographic records contain information about the nature of the document (e.g. monograph, periodical), its source (e.g. title, author, date, subject, publisher), its informational content (descriptors, keywords, summary) and its physical location (classification number). For digital documents, these records are called metadata: markers embedded in files or expressed in an appropriate language, typically an XML markup language. They facilitate access to the informational content of a computer resource, make information searches more efficient than full-text searches and enable the interoperability of digital resources.
Documentalists from the Numdam project, editors of certain centre Mersenne journals and individual contributors to the semi-automatic translation project have all helped develop specific applications that enable the editing of data in JATS and BITS formats.
The application used for Numdam is written in XQuery and XForms. It is based on eXist-db, an open-source native XML database (a document-oriented NoSQL database). The one for centre Mersenne is built with Python, Django and Vue.js.
LaTeX is the pivot format used for the editorial process at centre Mersenne: an article’s source code provides all the information about that article. Compilation of a document produces both the PDF file and a metadata file. Tralics is used to create the metadata XML file and for the conversion of mathematical formulas to MathML.
The Geodesic project aims to build a global digital library by listing and promoting the move to open access for the entire mathematical corpus. Documentalists aggregate the source content through web services using the OAI-PMH protocol or by web crawling.
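As an illustration of the OAI-PMH side of this aggregation, here is a minimal sketch in Python. It only builds a ListRecords request URL and parses a response that has already been fetched; the base URL, the sample record and the function names are hypothetical and are not taken from Mathdoc’s own tools.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 protocol and by Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An absent or empty token means the list is complete.
    return records, (token.strip() or None)


# Hypothetical response, shortened to a single record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:article-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_list_records(SAMPLE)
```

A harvester would call list_records_url again with each returned token until parse_list_records reports that none is left.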
Several journals and partner publishers transfer their data to Numdam for archiving. To handle these data, computer scientists have developed processing pipelines known as acquisition chains. Each chain is usually specific to one data provider, so that it can adapt to that provider’s formats, and in most cases combines XSLT transformations with other file operations.
Links are created within article bibliographies through a matching process that pairs each reference with its entries in the mathematical databases zbMATH and MathSciNet. Where they exist, links are also established with Crossref, EuDML and Numdam, or with a website providing the full text of the cited article.
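The text does not describe how the matching itself is done. Purely as an illustration, the sketch below shows one common approach: normalise the titles, require the same publication year, and accept the most similar candidate above a threshold. The record layout, the identifiers, the threshold and the use of difflib are all assumptions, not Mathdoc’s method.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case, strip accents and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_match(reference, candidates, threshold=0.9):
    """Return the candidate whose title is closest to the reference title.

    Candidates from a different year are skipped; None is returned when
    no remaining candidate reaches the (hypothetical) similarity threshold.
    """
    ref_title = normalise(reference["title"])
    best, best_score = None, 0.0
    for cand in candidates:
        if cand["year"] != reference["year"]:
            continue
        score = SequenceMatcher(None, ref_title, normalise(cand["title"])).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None


# Hypothetical bibliography entry and database records.
reference = {"title": "Sur les équations différentielles.", "year": 1950}
candidates = [
    {"id": "hyp-1", "title": "Sur les equations differentielles", "year": 1950},
    {"id": "hyp-2", "title": "Topologie générale", "year": 1950},
]
match = best_match(reference, candidates)
```

A production matcher would normally compare authors, journal and pagination as well; the title and year are used here only to keep the example short.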
A project has recently been initiated to retrieve author records from the IdRef database in order to resolve the issues of duplicates and homonyms in the Numdam authors database. Since the zbMATH and ORCID author identifiers are linked to IdRef, Mathdoc can also retrieve this information.
The Numdam project includes a set of tools for quality control of the digitisation work carried out by an external service provider. They check the state of the images and PDF files, and the compliance of the metadata tagging with Mathdoc guidelines.
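To give an idea of what such checks can look like, here is a minimal sketch in Python. The two functions and the list of required elements are hypothetical; the actual Mathdoc guidelines are not reproduced here, and the element names are simply standard JATS ones.

```python
import xml.etree.ElementTree as ET


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: a PDF file starts with the '%PDF-' signature."""
    return data.startswith(b"%PDF-")


def metadata_problems(xml_text, required=("article-title", "contrib", "pub-date")):
    """Return a list of problems found in a metadata record.

    The record must be well-formed XML and contain each required element
    somewhere below the root. The default list is an assumption.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    return [f"missing <{tag}>" for tag in required
            if root.find(f".//{tag}") is None]


# Hypothetical record lacking contributors and a publication date.
record = (
    "<article><front><article-meta><title-group>"
    "<article-title>A title</article-title>"
    "</title-group></article-meta></front></article>"
)
problems = metadata_problems(record)
```

Real controls would go further, for example by validating against the JATS schema and inspecting image resolution, but the structure (one small check per deliverable, each returning a list of problems) stays the same.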
Mathdoc is committed to the mission of long-term data preservation. A sustainable archiving solution, complete with quality control, has been developed for this purpose.
Making data interoperable is one of the FAIR principles of open science. The metadata produced by Mathdoc are distributed in open access under a CC0 licence, which makes it possible to enrich the content and improve its visibility. They are available in XML from the Numdam and centre Mersenne OAI-PMH servers and they populate other platforms such as zbMATH, Gallica, BASE and EuDML. Articles published by centre Mersenne are also registered via web services in external databases: Crossref, DOAJ, PubMed, CLOCKSS.
Mathdoc helps populate France’s national knowledge base (Base de Connaissance Nationale – BACON), a CC0-licensed reference metadata warehouse managed by Abes. Its objective is to optimise the reporting of electronic resources in order to facilitate access and promote the sharing of metadata between science communication stakeholders. To this end, Mathdoc provides its own data to Abes in the form of KBART files via an automatically generated spreadsheet containing all the data related to a journal.
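A KBART file is a tab-separated text file with one row per title and a fixed set of column names. The sketch below writes such a file in Python; it uses only a subset of the KBART columns for brevity (a real file carries the full list in the prescribed order), and the journal data and function name are hypothetical.

```python
import csv
import io

# Subset of the KBART column names, chosen for brevity. A real KBART file
# contains the complete field list defined by the recommended practice.
KBART_FIELDS = [
    "publication_title", "print_identifier", "online_identifier",
    "date_first_issue_online", "date_last_issue_online",
    "title_url", "title_id", "publisher_name",
    "publication_type", "access_type",
]


def to_kbart(journals):
    """Serialise journal dictionaries as tab-separated KBART-style text.

    Missing fields are written as empty cells; unknown keys are ignored.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=KBART_FIELDS, delimiter="\t",
        lineterminator="\n", restval="", extrasaction="ignore",
    )
    writer.writeheader()
    writer.writerows(journals)
    return buffer.getvalue()


# Hypothetical journal entry.
journals = [{
    "publication_title": "Hypothetical Journal of Examples",
    "online_identifier": "0000-0000",
    "publication_type": "serial",
    "access_type": "F",  # "F" marks freely accessible content in KBART
}]
kbart_text = to_kbart(journals)
```

Generating the file from a single list of field names keeps the header and the rows aligned even when a journal lacks some values.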
Mathdoc has developed and maintains a dedicated LaTeX class that enables automatic extraction of metadata during compilation. Named “cedram” after the predecessor of centre Mersenne and based on amsart, it remains, from the author’s point of view, very close to other article classes.
The document remains compilable, even in an incomplete environment, to produce the PDF file for an author, for example. A set of scripts makes this process transparent for those who monitor standards compliance for centre Mersenne. The integrated compilation of articles can be controlled using the editorial management tool.
In 2021, Mathdoc started a project to extract the full text of LaTeX sources under the same conditions. This will improve accessibility of the content by HTML rendering on each journal’s website and ensure better interoperability by making the text available in XML in JATS format.
All centre Mersenne journals use this class and adapt it to their specific needs, in particular by means of a dedicated template. A few journals outside centre Mersenne also use it. The same class can be used in a special mode to acquire content that requires some transcription in the Numdam project.
Unlike backups, which are usually large and have a fairly short lifespan, the purpose of an archive is to preserve data over a long period of time. Mathdoc follows the OAIS model recommendations, in particular:
The archive is built and populated automatically, in particular whenever new articles are published. Quality control scripts are available to check the archive. The data is stored on several geographically distributed media: servers replicated at several national sites (GRICAD and, via Mathrice, different establishments). In addition to the in-house archiving solution, data from centre Mersenne is also sent to CLOCKSS for distributed archiving, so that it can be made available to users in the extreme case of a publisher ceasing distribution.
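One common form of archive quality control is a fixity check: a checksum is recorded for every file when it enters the archive and recomputed later to detect corruption or loss. The sketch below illustrates the idea in Python; the choice of SHA-256, the manifest layout and the function names are assumptions and do not describe Mathdoc’s scripts.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*")) if path.is_file()
    }


def verify(root, manifest):
    """Return the sorted names of files that changed, vanished or appeared."""
    current = build_manifest(root)
    return sorted(
        name for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

The manifest would be stored alongside each replica, so that the same verify call can be run independently at every site and the results compared.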