Automatic Records Classification with Microsoft SharePoint Premium (formerly Syntex)

Introduction

Cadence Solutions was engaged by the Content Management unit within the Technology Department of a Public Sector Organization (“Organization” or “PSO”) to examine the functionality and suitability of Microsoft SharePoint Premium as it relates to document classification within the PSO’s SharePoint Online (SPO) environment. Specifically, PSO desired to test whether the machine teaching capabilities of SharePoint Premium would be sufficient to auto-apply a Functional Classification Taxonomy (FCT) metadata element to a document.

The FCT is an Information Management (IM) tool to help users functionally classify records. Functional classification is categorizing records based on which business activities created the records. The FCT is a functional classification scheme that groups all PSO business activities as a hierarchy of functions (top level) and activities (bottom level). Users must functionally classify a record in SPO with a combination of a function and one of its activities (a functional class), like FIN - Budgeting and Forecasting.
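
As an illustration only, the function/activity structure of a functional class can be sketched in a few lines of Python. The `FunctionalClass` type and `parse_functional_class` helper are hypothetical names introduced here; they are not part of SharePoint Premium or the PSO's tooling, and the sketch assumes labels always use the "FUNCTION - Activity" form shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalClass:
    """A function (top level) paired with one of its activities (bottom level)."""
    function: str
    activity: str

    def label(self) -> str:
        # Rebuild the label in the "FUNCTION - Activity" form used in this report.
        return f"{self.function} - {self.activity}"


def parse_functional_class(label: str) -> FunctionalClass:
    """Split a label such as 'FIN - Budgeting and Forecasting' into its two parts."""
    function, separator, activity = label.partition(" - ")
    if not separator or not function.strip() or not activity.strip():
        raise ValueError(f"Not a 'FUNCTION - Activity' label: {label!r}")
    return FunctionalClass(function.strip(), activity.strip())


example = parse_functional_class("FIN - Budgeting and Forecasting")
print(example.function)  # FIN
print(example.activity)  # Budgeting and Forecasting
```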

Cadence Solutions was engaged by the Department to perform the configuration work, provide training to the Department, and deliver this report as a perspective on the project. Cadence Solutions performed this work pro bono using trial licenses provided directly to the Organization through its enterprise agreement with Microsoft.

Cadence Solutions was provided a set of sample documents fitting each of the five (5) selected FCT terms. These sample documents were stored on the SharePoint site established for the project in the Organization’s test environment. Cadence Solutions’ technical team proceeded to establish the connection to SharePoint Premium and train five (5) models.

At the closure of Cadence Solutions’ configuration effort, the Organization was left with five (5) trained SharePoint Premium document classification models that were able to apply the FCT to test documents.

The following report captures the highlights of this project and suggests future areas of investigation for the Department.

What is Microsoft SharePoint Premium?

Microsoft SharePoint Premium is a set of tools integrated in the Microsoft 365 environment providing a wide set of features related to unstructured content. While the brand is much broader than the scope of this project, it is valuable to consider the breadth of the features available. This project focused on a subset of the available features, but for clarity, we have copied the broad descriptions from Microsoft in the following paragraphs as an introduction to the Syntex platform as a whole.

Microsoft SharePoint Premium is Content AI integrated in the flow of work. It puts people at the center, with content seamlessly integrated into collaboration and workflows, turning content from a cost into an advantage. SharePoint Premium automatically reads, tags, and indexes high volumes of content and connects it where it’s needed—in search, in applications, and as reusable knowledge. It manages your content throughout its lifecycle with robust analytics, security, and automated retention.

Whether you’re focused on customer transactions, processing invoices, writing a contract that requires a signature, or struggling to understand the flood of unstructured content, Content AI with Syntex can help.

Learn more about Syntex here.

Business driver?

In the current state, much of the classification of documents is manual. For net new documents, that creates extra work for the user, who must ensure a document is classified appropriately at the time of upload. Gaps in training or general human error raise the risk of misclassification, which has downstream impacts on the retention policy and on overall compliance with record keeping legislation. As for existing documents, there are likely documents currently in the SharePoint environment that are misclassified or not classified at all with an FCT. Additionally, migration of existing content from other repositories into SharePoint requires substantial manual effort to ensure an appropriate FCT is applied to all in-scope documents. The gaps in this process are a significant barrier to enforcing appropriate content management, and they carry significant compliance risk.

In summary, the manual and error-prone process of classifying documents with an appropriate FCT is a risk to the Organization, preventing an effective and efficient transition to unified digital record keeping in SharePoint.

How was SharePoint Premium applied?

Using the Document Processing features of SharePoint Premium, we trained five (5) Document Processing* models to apply a defined FCT based on similarity to the training documents within each model.

* An ‘Unstructured Document Processing Model’ is a Machine Teaching model that is trained on sample documents, with human input, to recognize key phrases, which it then uses to apply a classification.

Five Models trained using SharePoint Premium

The following are the 5 models and their respective FCTs. Each model is represented in SharePoint Premium as a Content Type with the same name as the model:

Model/Content Type	FCT
Annual Report	GOV - Strategic Reporting
Director’s Orders – Consumer Protection	MOC - Enforcement
Environmental Assessment	RM - Strategic Risk Assessment
Mineral Assessment Reports	NR - Mineral Rights Administration
Departmental Order	GOV - Legislation
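
Because each model corresponds to exactly one FCT, the table above can be read as a simple lookup. The sketch below expresses it as a Python dictionary purely for illustration; `MODEL_TO_FCT` and `fct_for_model` are hypothetical names, and nothing here reflects how SharePoint Premium stores the mapping internally.

```python
from typing import Optional

# Model (content type) name -> FCT, copied from the table in this report.
MODEL_TO_FCT = {
    "Annual Report": "GOV - Strategic Reporting",
    "Director’s Orders – Consumer Protection": "MOC - Enforcement",
    "Environmental Assessment": "RM - Strategic Risk Assessment",
    "Mineral Assessment Reports": "NR - Mineral Rights Administration",
    "Departmental Order": "GOV - Legislation",
}


def fct_for_model(model_name: str) -> Optional[str]:
    """Return the FCT mapped to a model, or None if the model is unknown."""
    return MODEL_TO_FCT.get(model_name)


print(fct_for_model("Annual Report"))  # GOV - Strategic Reporting
```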

SharePoint Premium model classification and FCT extraction process

There are two main processes to link the models with the FCT. The first process is to classify a document as a model (content type) using the SharePoint Premium classifier. After the document is classified as a model, the next step is to extract the FCT value out from the model.

The process of classifying the models and mapping them to the FCT involved the following steps:

Configure SharePoint Premium Model Classifier

We created 5 models from the SharePoint Premium Content Center using the Unstructured Document Processing Model, also known as the Document Teaching method. SharePoint Premium models have a 1-1 mapping to the SharePoint Content Type, so a unique Content Type was created for each model.
Once the models were created, we trained each model’s classifier on the training documents that were provided. The training leveraged explanations provided by the user to teach the model to correctly identify its positive and negative examples. The explanations for each model had key phrases and/or proximities defined to teach the model to associate those terms with the model.
Once training was done, we performed a test on the model classifier to validate that the models were being classified correctly as per our training.
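
To make the idea of key-phrase and proximity explanations more concrete, the following is a toy Python sketch of what such a rule expresses: two phrases appearing within a given number of words of each other. It is not SharePoint Premium's implementation, which is not described in this report; the function names, the tokenization, and the sample sentence are all assumptions made for illustration.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_phrase(tokens: List[str], phrase: str) -> List[int]:
    """Return the token positions at which the phrase starts."""
    words = tokenize(phrase)
    if not words:
        return []
    size = len(words)
    return [i for i in range(len(tokens) - size + 1) if tokens[i:i + size] == words]


def phrases_within(text: str, phrase_a: str, phrase_b: str, max_gap: int) -> bool:
    """True if the two phrases start within max_gap tokens of each other."""
    tokens = tokenize(text)
    starts_a = find_phrase(tokens, phrase_a)
    starts_b = find_phrase(tokens, phrase_b)
    return any(abs(a - b) <= max_gap for a in starts_a for b in starts_b)


sample = "This Annual Report covers the fiscal year ending March 31."
print(phrases_within(sample, "annual report", "fiscal year", max_gap=5))  # True
print(phrases_within(sample, "annual report", "fiscal year", max_gap=2))  # False
```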

Configure SharePoint Premium Model Extractor

After the model classification was fully configured, we proceeded with the next step, which was extracting the FCT. Since the FCT provided by PSO is stored in the SharePoint Term Store, we needed to utilize an extractor that would extract a value from the model to associate with the FCT.

Test Models on a Document Library

We tested the model classification/extraction on the SPO site and derived the following statistical results:
- Classification Accuracy: 97%
- Confusion Matrix: 97%
- Precision: 0.98
- Average Confidence Score: 99.45%

SharePoint Premium model test results

Model Configuration Result

The following is a summary of the results of the models’ configuration on the FM1 (SharePoint Premium) Content Center:

Model	Classifier Accuracy (/100)	FCT Accuracy (%)
Annual Report	100	88
Director’s Orders – Consumer Protection	100	87
Environmental Assessment	94	81
Mineral Assessment Reports	100	85
Departmental Order	100	100
Average	98.8	88.2

Classifier Accuracy: determined by the result of the SharePoint Premium model classifier’s training.

FCT Accuracy: determined by the result of the SharePoint Premium model FCT extractor’s training.

Classifiers for most models had an accuracy of 100, and the average across all models was 98.8.
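
The averages in the table can be checked with a short Python calculation. The figures are copied from the table above; the `results` variable name is only illustrative.

```python
from statistics import mean

# (classifier accuracy out of 100, FCT accuracy in %) per model, from the table above.
results = {
    "Annual Report": (100, 88),
    "Director’s Orders – Consumer Protection": (100, 87),
    "Environmental Assessment": (94, 81),
    "Mineral Assessment Reports": (100, 85),
    "Departmental Order": (100, 100),
}

classifier_avg = mean(classifier for classifier, _ in results.values())
fct_avg = mean(fct for _, fct in results.values())

print(round(classifier_avg, 1), round(fct_avg, 1))  # 98.8 88.2
```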

Lower classifier accuracy was correlated with models whose documents did not have a standard format.
The FCT extractor’s accuracy during training was lower, averaging 88.2%. Lower FCT extraction accuracy was also correlated with models that had multiple unstructured formats.
The FCT extraction for Environmental Assessment was improved in the actual extraction process by implementing Method B (assignment of a default FCT value) for FCT extraction. Method B does not rely on the FCT Accuracy to extract the FCT value.
Overall, the configuration of the model surpassed the original expectation.

Model Classification/Extraction Result

The following is the model classification/extraction result of SharePoint Premium model tests on the SPO site:

Legend: (+) = Positive, (-) = Negative. Where a third value is shown in the Actual column, it is the number of misclassified documents: false (-) = False Negative, false (+) = False Positive.

Model	Expected	Actual
Annual Report	42 (+), 4 (-)	36 (+), 4 (-), 6 false (-)
Director’s Orders – Consumer Protection	50 (+), 5 (-)	48 (+), 5 (-), 2 false (-)
Environmental Assessment	94 (+), 5 (-)	107 (+), 4 (-), 13 false (+)
Mineral Assessment Reports	50 (+), 1 (-)	50 (+), 1 (-)
Departmental Order	61 (+), 3 (-)	61 (+), 3 (-)

Result Statistics

The following are the statistics of the SharePoint Premium model test results on the SPO site:

Statistic	Formula	Result
Classification Accuracy	Number of correct records / Total number of document samples	97%
Confusion Matrix	(True Positives + True Negatives) / Total Samples	97%
Precision	Expected True Positives / (Actual True Positives + Actual False Positives)	0.98
Average Confidence Score	Average Syntex confidence score based on the model explanations	99.45%

Using statistical measures such as precision and accuracy, we can analyze the observed data from the SharePoint Premium models to derive a meaningful output. Classification accuracy is the proportion of documents whose assigned class matches the expected class, while precision compares the positives that were expected with the documents the models actually labelled as positive, including false positives. Ideally, precision should be high (close to 1), with a set of actual results made up mostly of true positives and true negatives. A precision score of 1 is achieved when the numerator and denominator are equal. The effectiveness of a model in making predictions can be evaluated using a confusion matrix, which tabulates true positives, true negatives, false positives, and false negatives. The average confidence score can also be used as a supporting indicator of how reliable the model's predictions are. Overall, the statistical measures suggest that the models are effective in classifying documents.
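
For readers less familiar with these measures, the sketch below computes accuracy and precision from confusion-matrix counts using the conventional textbook definitions (precision as true positives over all predicted positives), which are worded slightly differently from the formula in the table above. The counts are invented for illustration and are not the project's results; `ConfusionCounts`, `accuracy`, and `precision` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts of each outcome in a two-class confusion matrix."""
    true_positive: int
    true_negative: int
    false_positive: int
    false_negative: int


def accuracy(counts: ConfusionCounts) -> float:
    """(TP + TN) / total samples."""
    total = (counts.true_positive + counts.true_negative
             + counts.false_positive + counts.false_negative)
    return (counts.true_positive + counts.true_negative) / total if total else 0.0


def precision(counts: ConfusionCounts) -> float:
    """TP / (TP + FP): the share of predicted positives that are correct."""
    predicted_positive = counts.true_positive + counts.false_positive
    return counts.true_positive / predicted_positive if predicted_positive else 0.0


# Hypothetical counts, chosen only to show the arithmetic.
illustrative = ConfusionCounts(true_positive=90, true_negative=8,
                               false_positive=1, false_negative=1)
print(accuracy(illustrative))             # 0.98
print(round(precision(illustrative), 3))  # 0.989
```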

What Worked?

The following items summarize the successful results experienced by the team during this project:

The trained models were able to successfully apply an FCT to documents stored in SharePoint Document Libraries.
Models with a templated structure were classified more easily, and their FCT values were extracted more accurately.
Extraction leveraging the assignment of a default FCT value (Extraction Method B) was able to extract the FCT values from documents without any templated structure. This method does not depend on the explanations or the accuracy of the FCT extractor, because it relies on the extractor failing so that the default value is assigned. It also does not rely on the synonyms of the terms in the Term Store. Overall, Method B seems to be a solid method to use when there is a 1-1 mapping between the model (content type) and the extracted value (FCT).
SharePoint Premium reported a classification accuracy of 97%, confusion matrix of 97%, and precision of 0.98. This means that the models were effective in classifying the documents.
Model classifiers’ explanations are best kept simple unless stress testing shows that improvements are needed.
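
The fallback logic behind Extraction Method B can be sketched as follows. This is only our illustration of the idea (no extracted value, so the default FCT for the model is used); it does not show how SharePoint Premium applies default values, and `resolve_fct` is a hypothetical helper.

```python
from typing import Optional


def resolve_fct(extracted_value: Optional[str], default_fct: str) -> str:
    """Use the extracted value if one exists; otherwise fall back to the default FCT."""
    if extracted_value is not None and extracted_value.strip():
        return extracted_value.strip()
    # Method B expects to land here: the extractor returns nothing,
    # so the default value for the model's content type is assigned.
    return default_fct


print(resolve_fct(None, "RM - Strategic Risk Assessment"))  # RM - Strategic Risk Assessment
```

With a 1-1 mapping between model and FCT, the default is always the correct value, which is why the method does not depend on extractor accuracy.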

Cadence Solutions’ Opinion on the suitability of SharePoint Premium

On the initial question of whether SharePoint Premium can auto-apply an FCT classification to a document based on its contents, we have found that SharePoint Premium is able to do so successfully. Additionally, we have found that the reported FCT extraction results indicate that the approach of training the models was very successful.

FAQ from Project Team

The following questions were asked by the project team, with responses in-line:

What are the machine learning components of the Document Understanding Model to refer to it as Content AI?

Document understanding is based on an AI approach called machine teaching. You can learn more about machine teaching here.

How are negative examples used in the Document Understanding Model?

Both positive and negative examples are used to train the model. For instance, if you are training a model to identify contracts, you may want to add service agreements as negative samples. These are documents that may look like contracts but are not contracts. The machine teaching overview above explains this in detail.

If one negative example is identified, would similar negative examples be identified without changing the Explanation?

Yes, the idea here is to teach SharePoint Premium using negative samples. When evaluating documents, if SharePoint Premium comes across documents similar to the negative samples, those will not be considered positive matches. The machine teaching overview above explains this as well.

How is the confidence score calculated?

The confidence score from the explanations is determined by the Phrase list, Regular expression, and Proximity explanations. The better these are set up and the more accurately they identify the unstructured data, the better the confidence score. Read more here.

Is there a limit of how many SharePoint Premium models we can apply to a single Document Library? Are there any other limitations?

One library can have more than one model deployed and files will be evaluated against every model. Whichever model produces the highest confidence score will be assigned to the document. For now, there is no defined upper limit of how many models you can apply.
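
The selection rule described above (the model with the highest confidence score is assigned) amounts to taking a maximum over the per-model scores. A minimal sketch, with made-up scores and a hypothetical `pick_model` helper:

```python
from typing import Dict, Optional


def pick_model(confidence_by_model: Dict[str, float]) -> Optional[str]:
    """Return the model with the highest confidence score, or None if no model scored."""
    if not confidence_by_model:
        return None
    return max(confidence_by_model, key=confidence_by_model.get)


# Illustrative scores only; these are not values observed in the project.
scores = {"Annual Report": 0.91, "Departmental Order": 0.99, "Environmental Assessment": 0.40}
print(pick_model(scores))  # Departmental Order
```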

Download the infographic below for more information.