
Monday 23 April 2018

09:00 - 10:30

Chair: Christoph Haxel, Dr. Haxel CEM, Austria

Simplifying Knowledge Discovery with State-of-the-Art Technologies

The global economy is moving towards a knowledge economy, in which the consumption and production of intellectual capital is the new order. Organizations that embrace this are far more profitable than their counterparts in traditional industries. A knowledge-centric company prioritizes managing its internal knowledge as a virtuous cycle in which knowledge is produced, shared, consumed and then produced again.

Information and knowledge have exploded in this digital age, both in size (terabytes within the enterprise, petabytes in the external world) and in complexity (numerous unstructured formats such as documents, images, video and software code). This creates the challenge of knowledge discovery: the more time and effort a person needs to find the relevant piece of knowledge or information, the lower that individual's productivity and the higher the likelihood that the organization wastes money duplicating work and reinventing the wheel.

Current state-of-the-art technologies for mining knowledge, combined with the right processes for capturing it, can help alleviate the challenges of knowledge discovery. Natural language processing can capture the intent of the user rather than just matching disconnected words. Semantic search can ensure that highly relevant documents appear on the first page of results. Visualization techniques can represent documents graphically and show how they are linked to various topics and categories, making it easier for a user to zoom, pan and navigate to the most relevant content. Data mining and knowledge graph technologies can provide direct answers to the user's queries rather than requiring a browse through the documents. Summarization techniques can surface snippets from documents in response to a query instead of forcing the user to read through several documents.
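As a minimal illustration of the semantic search idea above (documents, queries and the synonym table are all invented for this sketch), a query can be expanded with known synonyms so that documents using different wording still match, before ranking by vector similarity:

```python
from collections import Counter
import math

# Hypothetical synonym table; a real system would derive this from
# a thesaurus, ontology or word embeddings.
SYNONYMS = {"car": {"automobile", "vehicle"}, "fix": {"repair"}}

def expand(terms):
    """Add known synonyms to the query terms (captures intent, not just words)."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

def cosine(a, b):
    """Cosine similarity of two term-frequency vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rank(query, docs):
    """Return documents ordered by similarity to the expanded query."""
    qvec = Counter(expand(query.lower().split()))
    scored = [(cosine(qvec, Counter(d.lower().split())), d) for d in docs]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]

docs = ["how to repair an automobile engine", "baking bread at home"]
print(rank("fix car", docs))
```

A plain keyword match for "fix car" would miss the first document entirely; the synonym expansion is what lets it rank at all.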

An equally important factor in overcoming the challenges of knowledge discovery is a quality knowledge base behind the tools. The quality of knowledge documents matters more than their quantity. Knowledge-centric companies implement a review process to ensure that only high-quality documents enter the knowledge base. Peer review (such as comments and ratings on documents, or lists of the most accessed documents) helps a knowledge seeker distinguish quality documents from the rest.


A journey in Open Data

From the legal obligation to publish data to its usage in various public apps, we will explore how the Open Data movement can help companies build a disruptive approach in their data analysis projects.
We will show how public entities can store their data in central repositories, how anyone with a simple smartphone can access those data, and how private entities are trying to monetize their data.

Technologies covered in this presentation include CKAN, Drupal and Data4Citizen, along with some of the most famous APIs, such as Uber, Weather and Waze.

10:30 - 11:00

Exhibition and Networking Break

11:00 - 12:30

Chair: Wolfgang Thielemann, Bayer, Germany

AI Technologies in Practice at Libraries

AGI - Information Management Consultants moved in 2002 from corporate business libraries to academic libraries. They built a turn-key application, named intelligentCAPTURE, that fits the major library management systems. It merges knowledge about workflows, library standards and data capturing technologies in a simple user interface. Its goal is to increase the indexing depth of library catalogues by mining the tables of contents of books and the abstracts of articles. Both are already highly condensed content provided by the author, in natural language and often not in the language of the user. The application provides a dialog for scanning the parts of a printed publication; handling the paper and selecting the relevant parts to be captured and processed is the only human work, taking one to two minutes on average. Everything else is done automatically and in parallel in the background. The printed text in the images is recognized with Abbyy FineReader Engine, a strong AI application. The library systems provide the languages of the scanned documents; based on that information, one or more of the 200 supported languages are selected automatically. The text analytics engine used is built for German, so machine translation via the Google API is inserted between OCR output and text analytics. The linguistic text analytics engine extracts about 10 % of the table of contents or abstract as normalized descriptors, free descriptors, noun phrases, names and countries. Title, OCR text and all indexing results can finally be translated into one or more target languages and forwarded to the library management systems or other facilities.
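A much-simplified stand-in for the descriptor extraction step described above (the stopword list and sample table of contents are invented; the real engine is a full linguistic analyzer, not a frequency count): keep the most frequent content words of a table of contents, capped at roughly 10 % of its tokens.

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a production system would use
# language-specific linguistic resources.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "for", "on"}

def extract_descriptors(text, ratio=0.1):
    """Return the most frequent content words, capped at ~ratio of all tokens."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    content = [t for t in tokens if t not in STOPWORDS and len(t) > 3]
    limit = max(1, int(len(tokens) * ratio))
    return [word for word, _ in Counter(content).most_common(limit)]

toc = ("Introduction to machine learning. Machine learning for libraries. "
       "Indexing and cataloguing with machine learning.")
print(extract_descriptors(toc))
```

Even this naive version shows why indexing depth improves: the catalogue record gains searchable terms that the title alone does not carry.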

AGI additionally provides a Lucene-based search application with built-in thesaurus searching. dandelon.com is an implementation of this search application, holding about 1.5 million thesaurus entries in up to 25 languages from multiple subject areas, such as MeSH for medicine and EUROVOC for political science, as well as thesauri developed by AGI clients. dandelon.com is public and grows by 6,000 books per month, including 1,000 ebooks. For ebooks, scanning and OCR are replaced by a PDF text extraction facility. Beyond this, complete articles, books and other kinds of papers can be processed and forwarded to data containers and search facilities.

Data and Text Mining @ Springer

Springer Nature advances discovery by publishing robust and insightful research, supporting the development of new areas of knowledge and making ideas and information accessible. Key to this is our ability to provide the best possible service to the whole research community: helping authors to share their discoveries; enabling researchers to find, access and understand the work of others; supporting librarians and institutions with innovations in technology and data.

As the volume of scientific publications increases along with the coverage of full-text XML, Springer Nature recognizes the importance of research techniques like Text and Data Mining (TDM). In his presentation, Henning Schoenenberger outlines the TDM tools Springer Nature has on offer. The focus is on APIs for full text, metadata, open access, ontologies and citations, which enable researchers to find patterns, discover relationships, and semantically understand and analyze scientific content.
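As a hedged sketch of how such an API can be approached (the endpoint path and parameter names below follow the publicly documented Springer Nature Metadata API, but treat them as assumptions and check the current documentation and your API key's terms before use):

```python
from urllib.parse import urlencode

def metadata_query_url(query, api_key, page_size=10):
    """Build a query URL for the Springer Nature Metadata API (JSON format).

    Parameter names ("q", "p", "api_key") are taken from the public API docs;
    verify them against the current documentation.
    """
    params = urlencode({"q": query, "p": page_size, "api_key": api_key})
    return "https://api.springernature.com/metadata/json?" + params

# "YOUR_KEY" is a placeholder; a real key is issued on registration.
print(metadata_query_url("text mining", "YOUR_KEY"))
```

Fetching that URL returns structured metadata records that TDM pipelines can mine for patterns and relationships without scraping HTML.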



12:30 - 14:00

Lunch, Exhibition and Networking

14:00 - 16:10

Chair: Patrick Beaucamp, Bpm-Conseil, France

Automatic Categorization of Patent Documents in the International Patent Classification (IPCCAT)

Since 2003, the World Intellectual Property Organization has been developing, through its partner Simpleshift, a system (IPCCAT) for assisting users in categorizing patent documents in the International Patent Classification (IPC). IPCCAT supports the classification of documents in several languages and aims to help users locate relevant IPC symbols through a convenient web-based service. The approach taken between 2004 and 2009 relied on powerful machine learning algorithms, trained on manually classified documents to recognize IPC topics, and was essentially limited by the available computing power and by limited coverage of the deepest level of the IPC. In 2017, WIPO decided to resume IPCCAT research and development, targeting categorization across the full IPC, which includes 72,981 subgroups in its 2017.01 version.


We now detail the in-house results of this research and development, which build on the increase in computing power as well as on larger training collections and improved extraction algorithms, applying a custom-built computer-assisted categorizer to English- and French-language patent documents. The use of n-grams further improved IPCCAT accuracy, justifying the 2017 decision to keep building on the legacy algorithms rather than exploring others, such as Deep Learning. We find that reliable automated categorization in the full IPC is now achievable with the statistical methods employed here. With a combination of around 10,000 neural networks and increased coverage of the IPC, IPCCAT, when suggesting three IPC symbols per document, can predict the most relevant IPC class correctly for around 90% of documents, the most relevant IPC subclass for about 85% and the most relevant IPC subgroup for 80%. IPCCAT opens new horizons for the IPC community, e.g. speeding up the reclassification of patent documents after a revision of the IPC.
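A toy sketch of the n-gram idea mentioned above (the class labels and training snippets are invented, and a simple n-gram overlap score stands in for the thousands of neural networks IPCCAT actually combines):

```python
from collections import Counter

def ngrams(text, n=2):
    """Word n-gram counts for a text (here bigrams)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def train(samples):
    """Aggregate one n-gram profile per class from (label, text) pairs."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(ngrams(text))
    return profiles

def classify(text, profiles, top=1):
    """Rank classes by overlapping n-gram mass and return the top suggestions."""
    feats = ngrams(text)
    scores = {label: sum(min(feats[g], prof[g]) for g in feats)
              for label, prof in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Invented two-class training set with IPC-like symbols.
profiles = train([
    ("A61K", "pharmaceutical composition for treating disease"),
    ("H04L", "transmission of digital information over a network"),
])
print(classify("a composition for treating heart disease", profiles))
```

Returning the top three labels instead of one (`top=3`) mirrors IPCCAT's practice of suggesting three IPC symbols per document.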


Using Machine Learning for Automatic Classification of Companies

Focusing on the significance of targets is one of the key drivers of quality in web search.

Filtering targeted companies based on the significance of their business model for the expected search results was one of our “nice to haves” last year.

Evaluating a number of artificial intelligence approaches based on neural networks, classical machine learning and semantic technologies led us to a working hybrid approach.

Challenges in Visualizing Pharmaceutical Information - Part III: Smart Strategy Dashboards

This paper revisits the issues discussed in our 2013 and 2016 presentations "Challenges in Visualizing Pharmaceutical Business Information," where we analyzed some of the unique challenges in visualizing competitive intelligence information for the pharmaceutical industry. A key challenge for pharmaceutical companies is to evaluate the competitive landscapes for drug launches many years in the future, based on a combination of publicly available drug pipeline and clinical trials data and internal company knowledge. This information is often conveyed in hand-drawn PowerPoint slides, which are very time-consuming to create and to update as the competitive landscape changes. In this year's presentation, we'll discuss the development of a "dashboard" of visualizations (piano charts, trial completion timelines, maps, etc.) to facilitate the analysis and visualization of competitive drug landscapes.


Exhibition and Networking Break

16:10 - 17:40

Chair: Thorsten Zank, BASF, Germany

When Artificial Intelligence Joins Intellectual Property

Three years ago, we thought this would be impossible to accomplish. Today, it is a reality: Artificial Intelligence is gaining more and more importance in the value chain of Intellectual Property. The "red button" that delivers a full technology landscape from a simple technical fact sheet, with very limited interaction from human intelligence, is at hand. This talk assesses the progress made so far.

The journey continues: the addition of French, Russian, Chinese, Korean and Japanese Patents to PATENTSCOPE ChemSearch

PATENTSCOPE is a free patent search system offered by the World Intellectual Property Organization (WIPO), covering over 66 million patent documents.

The last big enhancement of the system was the addition of chemical search capabilities, accomplished using InfoChem’s automatic text- and image-mining technologies.

An automatic workflow was developed and put into operation allowing real-time, multi-modal chemical text annotation and image recognition.
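A deliberately naive sketch of the chemical text annotation step (the suffix list is illustrative only; production systems like the one described here rely on large dictionaries, morphological grammars and structure-aware recognition, not suffix matching):

```python
import re

# Illustrative suffixes that often end chemical names; far from complete.
CHEM_SUFFIXES = ("ole", "ine", "ane", "ene", "yne", "acid", "amide", "azole")

def annotate_chemicals(text):
    """Return (start, end, surface form) spans of candidate chemical names."""
    spans = []
    for m in re.finditer(r"[A-Za-z][A-Za-z-]+", text):
        if m.group().lower().endswith(CHEM_SUFFIXES):
            spans.append((m.start(), m.end(), m.group()))
    return spans

sentence = "The sample contained benzene and a trace of toluene."
print(annotate_chemicals(sentence))
```

Keeping character offsets alongside the surface forms is what allows annotations to be written back into the source document, which matters when the same workflow must also cope with OCR output of varying quality.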

Since the number of patent applications in some Asian countries is increasing rapidly, this process was enhanced in the most recent project phase to also handle Russian, Korean, Japanese and Chinese files.

This talk will report the results and address the technical challenges encountered, such as OCR quality, heterogeneity of sources, scalability, performance and parallelization.


Navigating to new shores: the Biopharma Navigator

We present the latest developments around the Biopharma Navigator, a consolidated large-scale search, analysis and reporting application covering tens of millions of biomedical documents. In its latest version the application has expanded to include yet more document sources and now offers real-time data-driven dashboards, an enhanced API that facilitates integration into third-party environments, advances in expert identification, the extension of the pharmacovigilance approach to new sources from news and social media, as well as live extension of drug name repositories and clinical trial monitoring.


The Biopharma Navigator is used by a growing number of experts in the industry for their daily analyses and can be employed either on a simple subscription basis or as an on-premise installation. It is our answer to the question of how big data, cognitive computing analysis and intuitive web front-ends can be combined to provide broad, up-to-date information access to Life Science professionals.