feat: Enhance error handling in Azure document embedder #8941

Open
wants to merge 4 commits into
base: main

mdrazak2001

Related Issues

Proposed Changes:

  • Add error handling in _embed_batch to continue processing remaining documents
  • Log failed embeddings with batch range information
  • Match error handling behavior with OpenAIDocumentEmbedder
  • Add unit tests for graceful error handling

How did you test it?

  • Added unit tests for graceful error handling in the AzureOpenAIDocumentEmbedder class

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@mdrazak2001 mdrazak2001 requested a review from a team as a code owner February 26, 2025 17:52
@mdrazak2001 mdrazak2001 requested review from anakin87 and removed request for a team February 26, 2025 17:52
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Feb 26, 2025
@mdrazak2001 mdrazak2001 requested a review from a team as a code owner February 26, 2025 17:58
@mdrazak2001 mdrazak2001 requested review from dfokina and removed request for a team February 26, 2025 17:58
@coveralls
Collaborator

Pull Request Test Coverage Report for Build 13550304131

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 18 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.07%) to 90.047%

Files with Coverage Reduction | New Missed Lines | %
components/embedders/azure_document_embedder.py | 18 | 69.74%

Totals
Change from base Build 13542952701: 0.07%
Covered Lines: 9581
Relevant Lines: 10640

💛 - Coveralls

Member

@anakin87 anakin87 left a comment

Thanks for the contribution!

I found a small opportunity for improving this PR.

Comment on lines 213 to 237
try:
    if self.dimensions is not None:
        response = self._client.embeddings.create(
            model=self.azure_deployment, dimensions=self.dimensions, input=batch
        )
    else:
        response = self._client.embeddings.create(model=self.azure_deployment, input=batch)

    # Append embeddings to the list
    all_embeddings.extend(el.embedding for el in response.data)

    # Update the meta information only once if it's empty
    if not meta["model"]:
        meta["model"] = response.model
        meta["usage"] = dict(response.usage)
    else:
        # Update the usage tokens
        meta["usage"]["prompt_tokens"] += response.usage.prompt_tokens
        meta["usage"]["total_tokens"] += response.usage.total_tokens

except Exception as e:
    # Log the error but continue processing
    batch_range = f"{i} - {i + batch_size}"
    logger.exception(f"Failed embedding of documents in range: {batch_range} caused by {e}")
    continue
Member

Could you please align this implementation with that of the OpenAIDocumentEmbedder?

I think that it is better for a few reasons:

  • groups args for the embedding creation API call
  • uses the more specific APIError instead of Exception
  • logs the IDs of the Documents for which the embedding generation failed
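
The three points above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual OpenAIDocumentEmbedder code: `APIError` here is a local stand-in for the client library's error type, `FakeEmbeddingsClient` and `embed_batch` are hypothetical names, and the response is a plain dict rather than a real API response object.

```python
import logging
from typing import Any, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)


class APIError(Exception):
    """Stand-in for the client library's API error type (assumption)."""


class FakeEmbeddingsClient:
    """Hypothetical client for illustration: any batch containing "boom" fails."""

    def create(self, model: str, input: List[str], **kwargs: Any) -> Dict[str, Any]:
        if "boom" in input:
            raise APIError("simulated service failure")
        return {
            "model": model,
            "data": [[float(len(text))] for text in input],
            "usage": {"prompt_tokens": len(input), "total_tokens": len(input)},
        }


def embed_batch(
    client: FakeEmbeddingsClient,
    texts_to_embed: Dict[str, str],  # document id -> text
    model: str,
    batch_size: int,
    dimensions: Optional[int] = None,
) -> Tuple[List[List[float]], Dict[str, Any]]:
    all_embeddings: List[List[float]] = []
    meta: Dict[str, Any] = {}
    items = list(texts_to_embed.items())

    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]

        # Group the call arguments once; add optional ones only when set.
        args: Dict[str, Any] = {"model": model, "input": [text for _, text in batch]}
        if dimensions is not None:
            args["dimensions"] = dimensions

        try:
            response = client.create(**args)
        except APIError as exc:
            # Catch only the API error, name the affected documents, move on.
            ids = ", ".join(doc_id for doc_id, _ in batch)
            logger.exception("Failed embedding of documents with ids %s caused by %s", ids, exc)
            continue

        all_embeddings.extend(response["data"])
        if "model" not in meta:
            # First successful batch: record the model and initial usage.
            meta["model"] = response["model"]
            meta["usage"] = dict(response["usage"])
        else:
            # Later batches: accumulate token usage.
            meta["usage"]["prompt_tokens"] += response["usage"]["prompt_tokens"]
            meta["usage"]["total_tokens"] += response["usage"]["total_tokens"]

    return all_embeddings, meta
```

Keeping the `try` block limited to the API call means an unrelated bug in the bookkeeping code is not silently swallowed. Note that in this sketch a skipped batch makes the returned list shorter than the input; mapping embeddings back to documents is left out.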

Author

@mdrazak2001 mdrazak2001 Mar 3, 2025

Thanks for the review @anakin87.
For points 1 and 2, I can update the implementation to:

  • Group args for the embedding creation API call
  • Use the more specific APIError instead of Exception

For point 3 (logging document IDs), I notice this would require changing the signature of _prepare_texts_to_embed from:

def _prepare_texts_to_embed(self, documents: List[Document]) -> List[str]

to:

def _prepare_texts_to_embed(self, documents: List[Document]) -> Dict[str, str]

Would you be okay with this signature change to align it with OpenAIDocumentEmbedder's implementation? This would help improve error logging by identifying which specific documents failed during embedding.
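
A rough sketch of what the `Dict[str, str]` variant could look like, keyed by document ID so that a failed batch can be reported by ID. `Doc` is a minimal stand-in for the real Document class, and the `meta_fields_to_embed` and `separator` parameters are assumptions made for illustration, not the exact internals of either embedder.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Doc:
    """Minimal stand-in for a document with an id, text content, and metadata."""

    id: str
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def prepare_texts_to_embed(
    documents: List[Doc],
    meta_fields_to_embed: Iterable[str] = (),
    separator: str = "\n",
) -> Dict[str, str]:
    """Return a mapping of document id -> text to embed (hypothetical helper)."""
    texts: Dict[str, str] = {}
    for doc in documents:
        # Prepend the selected metadata values that are present and not None.
        meta_values = [
            str(doc.meta[key]) for key in meta_fields_to_embed if doc.meta.get(key) is not None
        ]
        texts[doc.id] = separator.join(meta_values + [doc.content])
    return texts
```

Because dicts preserve insertion order, iterating over the result yields the texts in the same order as the input documents, with each text still tied to its ID.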

Member

I'm totally OK with changing the signature of _prepare_texts_to_embed.
It's an internal method (_something), so changing its signature and behavior is not considered a breaking change.

@mdrazak2001 mdrazak2001 requested a review from anakin87 March 3, 2025 18:42
Successfully merging this pull request may close these issues.

AzureOpenAIDocumentEmbedder fails entire run when one document throws error
3 participants