Carbon Changelog

Monthly updates with the latest features added, improvements made and bugs squashed.

New Embedding Models

  • We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.

  • To choose the embedding model, set the embedding_model parameter in the POST body for /embeddings and other API endpoints. If no model is specified, the system defaults to OPENAI (the original Ada-2).

  • Find more details on the models available here.
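As a minimal sketch of the embedding_model bullet above, the helper below adds embedding_model to a request body only when a model is chosen, so omitting it leaves the default (OPENAI) in effect. The helper name and the query field are hypothetical placeholders. COHERE_MULTILINGUAL_V3 is one of the values listed under Carbon Connect Updates; the identifiers accepted for the text-embedding-3 models should be taken from the models documentation.

```python
import json


def with_embedding_model(body, embedding_model=None):
    """Return a copy of a POST body, adding embedding_model only if one is given."""
    request_body = dict(body)
    if embedding_model is not None:
        request_body["embedding_model"] = embedding_model
    return request_body


# "query" is a hypothetical placeholder field, not a documented parameter.
default_body = with_embedding_model({"query": "refund policy"})
cohere_body = with_embedding_model({"query": "refund policy"}, "COHERE_MULTILINGUAL_V3")

print(json.dumps(default_body))  # no embedding_model key, so the default applies
print(json.dumps(cohere_body))
```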

Return HTML for Webpages

  • presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content for each web page.

  • parsed_text_url field still returns a pre-signed URL for the corresponding plain text.

  • Find more details here.

Return Website Tags in File Metadata

  • file_metadata field under user_files_v2 now returns og:image and og:description for each web page.

  • Find more details here.

Omit Content by CSS Selector 

  • You can now exclude elements matching specific CSS selectors from web scraping, so that text inside those elements does not appear in the parsed plaintext, chunks, or embeddings. This is useful for omitting irrelevant elements, such as headers or footers, that might affect semantic search results.

  • The web_scrape request object supports a new field:

    • css_selectors_to_skip: Optional[list[str]] = []

  • Find more details here.
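The effect of css_selectors_to_skip can be illustrated with a small standard-library sketch. This is not Carbon's scraper: it only handles bare tag-name selectors such as header or footer, and the class and function names are hypothetical. It shows how text inside skipped elements drops out of the parsed plaintext.

```python
from html.parser import HTMLParser


class SkippingTextExtractor(HTMLParser):
    """Collects page text, ignoring anything inside elements whose tag is skipped."""

    def __init__(self, tags_to_skip):
        super().__init__()
        self.tags_to_skip = set(tags_to_skip)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_skip:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags_to_skip and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_text(html, css_selectors_to_skip):
    """Return the plain text of html, minus text inside the skipped tags."""
    parser = SkippingTextExtractor(css_selectors_to_skip)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<header>Site navigation</header>"
    "<main><p>Pricing details</p></main>"
    "<footer>Copyright notice</footer>"
)
request_fields = {"css_selectors_to_skip": ["header", "footer"]}
plain_text = extract_text(page, request_fields["css_selectors_to_skip"])
print(plain_text)
```

With header and footer skipped, only the text of the main element remains for chunking and embedding.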

JSON File Support

  • We’ve added support for JSON files via local upload and 3rd party connectors.

  • How It Works:

    • The parser iterates through each object in a file and flattens it. Top-level keys remain the same, but each nested key is replaced by the dot-separated path to its value. Each component of the path is either a string (for a key in a nested object) or an integer (for an index in a nested list).

    • max_items_per_chunk is a parameter that determines how many JSON objects to include in a single chunk.

    • A new chunk is created when either the max_items_per_chunk or the chunk_size limit is reached. For example:

      • If each JSON object is 250 tokens, chunk_size of 800 tokens and no max_items_per_chunk set, then each chunk will contain 3 JSON objects.

      • If each JSON object is 250 tokens, chunk_size of 800 tokens and max_items_per_chunk set to 1, then each chunk will contain 1 JSON object.

  • Learn more details here.
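The flattening and chunking rules above can be sketched as follows. This is an illustration under stated assumptions, not Carbon's parser: the function names are hypothetical, token counts are passed in directly rather than computed by a tokenizer, an object that would overflow chunk_size is assumed to start a new chunk rather than be split, and max_items_per_chunk is assumed to be a positive integer when set.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {dot.separated.path: leaf_value}."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list indexes become integer path components
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def chunk_objects(token_counts, chunk_size, max_items_per_chunk=None):
    """Group consecutive objects into chunks; return the object count per chunk."""
    chunks = []
    current_items = 0
    current_tokens = 0
    for tokens in token_counts:
        full_by_items = (
            max_items_per_chunk is not None and current_items >= max_items_per_chunk
        )
        full_by_tokens = current_items > 0 and current_tokens + tokens > chunk_size
        if full_by_items or full_by_tokens:
            chunks.append(current_items)
            current_items, current_tokens = 0, 0
        current_items += 1
        current_tokens += tokens
    if current_items:
        chunks.append(current_items)
    return chunks


record = {"id": 7, "author": {"name": "Ada"}, "tags": ["a", "b"]}
flat_record = flatten(record)
print(flat_record)

# Six objects of 250 tokens each, chunk_size of 800 tokens.
print(chunk_objects([250] * 6, chunk_size=800))
print(chunk_objects([250] * 6, chunk_size=800, max_items_per_chunk=1))
```

The two chunk_objects calls mirror the two examples above: three objects per chunk when only chunk_size applies, and one object per chunk when max_items_per_chunk is 1.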

Freshdesk Connector is Live

  • All Published articles from an end user’s Freshdesk knowledge base are synced when connected to Carbon.

  • The Carbon Connect enabledIntegrations value is FRESHDESK.

  • You can find more info here.

Speed Improvements to Hybrid Search

  • We made hybrid search 10x faster by creating sparse vector indexes at file upload time instead of at query time.

    • Steps to Enable:

      • Pass the following body to the /modify_user_configuration endpoint: { "configuration_key_name": "sparse_vectors", "value": { "enabled": true } }

      • Set the parameter generate_sparse_vectors to true via the /uploadfile endpoint.

  • We’ll be rolling out faster hybrid search support across 3rd party connectors in the upcoming weeks.

  • Find more details here and here.

Deleting Files based on Sync Status

  • You can now delete file(s) based on sync_status via the delete_files endpoint.

  • We added 2 parameters:

    • sync_statuses - parameter to pass a list of sync statuses for file deletion.

      • For example, { "sync_statuses": ["SYNC_ERROR", "QUEUED_FOR_SYNC"] }. When this value is passed, we delete all files in the SYNC_ERROR or QUEUED_FOR_SYNC status that belong to the end user identified by the customer-id header of the request.

    • delete_non_synced_only - boolean parameter that limits deletion to files that have never been synced before.

      • For example, a previously synced Google Drive file enters the QUEUED_FOR_SYNC status again during a scheduled re-sync. Setting delete_non_synced_only to true prevents this file from being deleted.

  • Files can be deleted in any status except SYNCING, EVALUATING_RESYNC, and QUEUED_FOR_OCR. Including any of these three statuses in the list results in an error response; files in these statuses can only be deleted after they transition to another status.

  • Find more details here.
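A request body for status-based deletion might be assembled as in the sketch below. The helper name is hypothetical and the client-side check is only a convenience that mirrors the rule above; the API itself returns the error response for non-deletable statuses.

```python
# Statuses the changelog lists as not deletable.
NON_DELETABLE_STATUSES = {"SYNCING", "EVALUATING_RESYNC", "QUEUED_FOR_OCR"}


def build_delete_body(sync_statuses, delete_non_synced_only):
    """Build a delete_files body, rejecting statuses that cannot be deleted."""
    blocked = sorted(NON_DELETABLE_STATUSES.intersection(sync_statuses))
    if blocked:
        raise ValueError(f"files cannot be deleted in status: {', '.join(blocked)}")
    return {
        "sync_statuses": list(sync_statuses),
        "delete_non_synced_only": delete_non_synced_only,
    }


body = build_delete_body(["SYNC_ERROR", "QUEUED_FOR_SYNC"], delete_non_synced_only=True)
print(body)

try:
    build_delete_body(["SYNCING"], delete_non_synced_only=False)
except ValueError as error:
    rejected_message = str(error)
    print(rejected_message)
```

Validating before sending avoids a round trip that is known to fail, but the server response remains the source of truth.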

Carbon Connect Updates

  • Added support for the following functionalities in Carbon Connect (React component + JavaScript SDK):

    • Additional embedding models (OPENAI, AZURE_OPENAI, COHERE_MULTILINGUAL_V3 for text and audio files, and VERTEX_MULTIMODAL for image files).

    • Audio and image file support. See the reference documentation for supported file formats.

    • OCR support for PDFs from local file uploads via Carbon Connect.

    • Hybrid search supported.

Remove Customer-Id on Select Endpoints

  • We’re removing customer-id as a required header for the following endpoints, which do not need it:

    • /auth/v1/white_labeling

    • /user

    • /webhooks

    • /add_webhook

    • /delete_webhook/{webhook_id}

    • /organization

Vector Database Integration

  • We are starting to build out direct integrations with vector database providers!

  • What this means:

    • After authenticating a vector database provider via API key, Carbon automatically synchronizes between user data sources and the embeddings within your vector database. Whenever a user file is processed, we handle the seamless update of your vector database with the latest embeddings.

    • You’ll have full access to all of Carbon’s API endpoints, including hybrid search if your vector database supports sparse vector storage.

    • Migrating between vector databases is simple, since Carbon provides a unified API to interface with all providers.

  • The first vector database integration we’re announcing is with Turbopuffer. Many more to come!

S3 Connector 

  • We launched our S3 connector today that enables syncing objects from buckets.

  • The Carbon Connect enabledIntegrations value for S3 is S3.

  • See more specifics about our S3 connector here.

File + Account Management Component (BETA)

  • Users can add and revoke access to accounts under each connection.

  • Users can view and select specific folders and files for sync.

  • The aim is to offer a pre-built file selector for integrations without their own.

  • The component is currently offered in React but we’ll add support for other frameworks soon.

  • You can find the npm package here. Please note it’s still in BETA so your feedback is much appreciated!

Expanding sort for user_files_v2

  • You can now sort by name, file_size, and last_sync using the order_by field in the user_files_v2 request body.

  • See more details here.

Support for audio file uploads via connectors

  • We’ve enabled support for audio files via the following connectors: S3, Google Drive, OneDrive, SharePoint, Box, Dropbox, Zotero.

  • See the list of supported audio file formats here.

Google Verification

  • Carbon’s Google Connector is officially Google-verified. This means users will no longer see the warning screen when authenticating with Carbon’s Google connector.

OCR Public Preview

  • We’ve been rolling out support for OCR, starting with PDFs uploaded locally (images and data connectors to follow).

Exposing Sync Error Reasons

  • We are now exposing error messages under the sync_error_reason field for files entering SYNC_ERROR status.

  • You can find a list of common errors here and we’ll be updating this on an ongoing basis.

List and Sync Items from Data Sources

  • We’re introducing new functionality that lets customers sync and list items such as files, folders, collections, and articles from a user’s data source. This lets you build an in-house file selection flow, and lets Carbon provide a file selector UI and helper methods in our SDK.

  • You can find more details here.

Upload Chunks and Embeddings

  • Added /upload_chunks_and_embeddings endpoint to enable uploading of chunks and vectors to Carbon directly.

  • See more specific details here.