Last year we partnered with a FinTech firm that wanted to implement a new natural language processing (NLP) model and transition to a new cloud provider. To accomplish this, we chose Azure Functions for cloud hosting and spaCy as the foundation for the NLP model. Our initial builds ran into only minor configuration issues. However, as soon as our application began to grow, we hit major problems: unstable builds, crashing CI/CD pipelines, and failed deployments. With these issues hindering our progress, it was time to take the system apart and find the cause. Everything was working perfectly except spaCy. That was a serious problem, because spaCy was key to this project: it gave us the best results of the machine learning tools we compared, and our pipeline was calibrated to the client's use case.

After many frustrating hours, and what proved to be numerous trials and, ultimately, errors, we discovered the most reliable way to deploy a large spaCy model to Azure Functions: commit the model's data directory directly to the application's repository.

TL;DR

  • Download and extract: en_core_web_lg-2.2.5.tar.gz
  • Copy model data directory to your app folder: __app__
  • Import spaCy into your function file: import spacy
  • Load the model using: nlp = spacy.load('<filepath>/en_core_web_lg-2.2.5')

Background

Why did we choose spaCy? spaCy is a production-ready, incredibly fast, and accurate NLP tool, especially when using its large general-purpose models. We were able to conduct accurate part-of-speech tagging and noun chunking out of the box. From there, we developed a proprietary processing algorithm that consumes spaCy's dependency parser, tailored to the specific type of text documents the client was analyzing. All in all, we ended up with a very clean and representative collection of data.
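
Here is a minimal sketch of that out-of-the-box analysis (the sample sentence is our own; any of the en_core_web_* models exposes the same pipeline):

import spacy

# Load the large English model (assumes it is installed and importable).
nlp = spacy.load('en_core_web_lg')

doc = nlp('The quarterly report flagged unusual transfers between accounts.')

# Part-of-speech tags and dependency relations come out of the box.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Noun chunks group base noun phrases into usable spans.
print([chunk.text for chunk in doc.noun_chunks])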

Why did we choose Azure Functions? Since spaCy is just one part of our Azure Function that handles data processing, we needed a system without cold-start lag that could also scale dynamically. Azure Functions on a dedicated App Service plan fit what we were trying to accomplish better than AWS Lambda. A dedicated App Service plan let us run the Function application like a traditional web app, with dedicated, scalable memory and compute, so we could keep the application, including the large spaCy model, in memory and get hot starts. Hosted this way, our app is cached in memory, has an unbounded timeout duration, flexible memory usage (up to 14 GB), and flexible compute.

Different Methods We Tried

Download as a build step

We tried to incorporate the model download as one of the build steps. The benefit would have been a smaller repo and access to the most recent model on each update. However, due to the size of the download, our builds began to fail randomly. At the same time, our build time was exploding, making it impossible to iterate quickly. Builds were averaging 20+ minutes.

$ python -m spacy download en_core_web_lg

>>> import spacy
>>> nlp = spacy.load("en_core_web_lg")
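
Wired into an Azure Pipelines YAML, the idea looks roughly like this (a sketch; the displayName and step ordering are illustrative):

- bash: |
    pip install --upgrade pip
    pip install -r requirements.txt
    python -m spacy download en_core_web_lg
  workingDirectory: $(workingDirectory)
  displayName: 'Install dependencies and download model'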

Installation via pip

Similar to the previous method, this would give us a smaller repo, though we would lose the ability to automatically pick up the most recent model. Nonetheless, we needed a stable build and deployment pipeline, so we incorporated the model's external URL into our requirements.txt. Our builds now succeeded; however, the deployments started to fail randomly. Build times improved to under 10 minutes on average, yet deployments took 15 to 20 minutes, when they worked at all.

# With external URL
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz
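
In requirements.txt, that amounts to listing the release URL alongside the other dependencies (the surrounding pins are illustrative):

# requirements.txt
azure-functions
spacy>=2.2.2,<2.3.0
# pip installs the model package directly from the release archive
https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz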

Our Final Implementation! Wooooo, it is working

Our last-ditch effort was to manually download and extract the language model and commit it as part of our application's repo. This meant drastically increasing the repo size. It also meant we would have to manually replace the model whenever a new one became available. With nothing else working, we weighed our options and decided these downsides were manageable for us.

We used the most recent large English model available at the time, en_core_web_lg-2.2.5. After downloading and extracting the archive, we saved the model data directory in our app folder.
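
The manual steps look roughly like this (a sketch; the nested path reflects how spaCy 2.x model packages are typically laid out, so verify it against the archive you actually download):

# Download the release archive from the spacy-models releases page
curl -LO https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz

# Extract it; the model data directory is nested inside the Python package
tar -xzf en_core_web_lg-2.2.5.tar.gz

# Copy the model data directory into the Function app folder
cp -r en_core_web_lg-2.2.5/en_core_web_lg/en_core_web_lg-2.2.5 __app__/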

[Screenshot: the en_core_web_lg-2.2.5 model directory inside the __app__ folder]

To use the model in our function, we loaded it directly from disk. For us, the model directory sat in the parent directory of our utility function's module. Below is our code.

import pathlib

import spacy

def get_spacy_path():
    # The model data directory sits in the parent directory of this module.
    current_path = pathlib.Path(__file__).parent.parent
    return str(current_path / 'en_core_web_lg-2.2.5')

nlp = spacy.load(get_spacy_path())
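
Because nlp is created at module import time, the model is loaded once per worker and then reused across invocations, which is what makes the hot starts described above possible. Here is a minimal sketch of how the loaded model might be consumed in an HTTP-triggered function (the query parameter name and noun-chunk response are our own illustration):

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # `nlp` is the module-level model loaded in the snippet above,
    # so each request reuses the in-memory model.
    text = req.params.get('text', '')
    doc = nlp(text)
    chunks = [chunk.text for chunk in doc.noun_chunks]
    return func.HttpResponse(', '.join(chunks))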

Infrastructure

  • App Service: B2 (3.5 GB memory; 1:1 vCPU:Core)
  • Git repo: Azure Repos
  • Azure Function: Python Function App version 2.0
  • Python: version 3.7
  • CI/CD: Azure DevOps Pipelines (YAML deployment)

Closing Thoughts

Our approach was not ideal. Our expectations for how these technologies could and should be combined were far from reality, and we could have saved ourselves a lot of pain by testing various deployment methods upfront. Our first priority was testing and building the application; DevOps was secondary. As we approach new projects, our testing framework will include production deployment from day one, treating it as equally important as the application code itself.


Final Deployment YAML

# Production ----- Python Function App to Linux on Azure
# Build a Python function app and deploy it to Azure as a Linux function app.
# Add steps that analyze code, save build artifacts, deploy, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
- master

variables:
  # Azure Resource Manager connection created during pipeline creation
  azureSubscription: <subscription>

  # Function app name
  functionAppName: <appName>

  # Agent VM image name
  vmImageName: 'ubuntu-latest'

  # Working Directory
  workingDirectory: '$(System.DefaultWorkingDirectory)/__app__'

stages:
- stage: Build
  displayName: Build stage

  jobs:
  - job: Build
    displayName: Build
    pool:
      vmImage: $(vmImageName)

    steps:
    - bash: |
        if [ -f extensions.csproj ]
        then
            dotnet build extensions.csproj --runtime ubuntu.16.04-x64 --output ./bin
        fi
      workingDirectory: $(workingDirectory)
      displayName: 'Build extensions'

    - task: UsePythonVersion@0
      displayName: 'Use Python 3.7'
      inputs:
        versionSpec: 3.7

    - bash: |
        pip install --upgrade pip
        pip install -t .python_packages/lib/site-packages -r requirements.txt
      workingDirectory: $(workingDirectory)
      displayName: 'Install application dependencies'

    - task: ArchiveFiles@2
      displayName: 'Archive files'
      inputs:
        rootFolderOrFile: '$(workingDirectory)'
        includeRootFolder: false
        archiveType: zip
        archiveFile: $(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip
        replaceExistingArchive: true

    - publish: $(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip
      artifact: drop

- stage: Deploy
  displayName: Deploy stage
  dependsOn: Build
  condition: succeeded()

  jobs:
  - deployment: Deploy
    displayName: Deploy
    environment: 'production'
    pool:
      vmImage: $(vmImageName)

    strategy:
      runOnce:
        deploy:

          steps:
          - task: AzureFunctionApp@1
            displayName: 'Azure functions app deploy'
            inputs:
              azureSubscription: '$(azureSubscription)'
              appType: functionAppLinux
              appName: $(functionAppName)
              package: '$(Pipeline.Workspace)/drop/$(Build.BuildId).zip'