Carbon Changelog

Monthly updates with the latest features added, improvements made and bugs squashed.

Guru Connector

  • The Guru connector allows users to sync collections, folders, and cards from their Guru account.

  • CCv3 support for Guru will be coming soon and the enabledIntegration value is GURU.

  • See more details here.

Sync Filter for Email Attachments

  • Customers can specifically select to sync only emails that contain attachments.

  • You still need to set sync_attachments to true and also apply the following filter:

{ "key": "has", "value": "attachment" }
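A minimal sketch of how the two settings might be combined in one set of sync options. Only sync_attachments and the filter key/value pair come from this changelog; the surrounding layout (a top-level "filters" key, modeled on the Gmail filter example) and the helper name are assumptions, so check the API reference for the real schema.

```python
import json


def build_attachment_sync_options() -> dict:
    """Return sync options that keep only emails with attachments.

    The layout of this dict is an assumption made for illustration.
    """
    return {
        "sync_attachments": True,  # still required alongside the filter
        "filters": {"key": "has", "value": "attachment"},
    }


if __name__ == "__main__":
    print(json.dumps(build_attachment_sync_options(), indent=2))
```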

Auto-Refresh Synced Files List in CCv3

  • We now automatically refresh the synced file list whenever users select additional files using our in-house or third-party file picker view. This eliminates the need for users to manually refresh the view.

Updated Children Prop

  • The children prop of the CCv3 component now accepts any valid React node as the children of the modal, from a simple <div> to an entire component.

  • Here’s an example of how the children prop can be used:

children={ <button onClick={() => setOpen((prev) => !prev)}> Toggle Connect </button> }

Custom Styling for Carbon Connect

  • Users can now control styling of CCv3 by targeting the specific class names we’ve provided. This allows for complete customization to match the desired look and feel of the application.

  • For example, class names include:

    • cc-modal: Applies to the entire modal component

    • cc-modal-header: Targets the header section of the modal

    • cc-modal-footer: Targets the footer section of the modal

    • cc-modal-close: Applies to the close button of the modal

    • cc-modal-overlay: Targets the overlay background of the modal

  • By utilizing these class names, users can easily override the default styles and apply their own CSS rules to achieve the desired appearance.

OCR Support for JPG and PNG

  • We now support jpg, jpeg and png file formats for OCR.

  • In addition to the normal steps for enabling OCR, please set media_type to TEXT (via file upload and /integrations/oauth_url) so Carbon knows to process the image via OCR (versus generating image embeddings via our image embedding model).

HTML for Confluence Articles

  • We now return the raw HTML output for each Confluence article via the file_metadata.saved_filename object under user_files_v2.

Cancel Source Items Sync

  • We added an endpoint /integrations/items/sync/cancel to cancel data source syncs that are initiated via /integrations/items/sync.

  • This allows customers to manually stop syncing for user data sources where sync_status = SYNCING.

New Gmail Filter

  • We added a new Gmail filter to sync all emails sent from a given account. Example:

{ "filters": { "key": "in", "value": "sent" } }

Return Raw Notion Blocks

  • We now return the raw output (blocks) for each Notion page via saved_filename under user_files_v2 when include_raw_file: true.

Shared Google Drive Source Items

  • We now return shared Google Drive files and folders via /integrations/items/list.

Clearer Error Message for SYNC_ERROR Status

  • When a file goes into SYNC_ERROR after re-syncing via /resync_file because it has been deleted at the source, sync_error_message will now say: File not found in data source.

  • The webhook sent for that error will also contain sync_error_message in additional_information.

Slack UI in Carbon Connect v3 (3.0.0-beta32)

  • Select Conversations to Sync

    • After authenticating, users have full control over which conversations they want to sync via CCv3, including:

      • Public channels

      • Private channels

      • Direct messages (DMs)

      • Group DMs

  • Manage Synced Conversations

    • Users can manage their list of synced conversations at any time via CCv3.

    • Easily add or remove channels and DMs to adjust what gets synced between Slack and Carbon.

Carbon Connect Enhancements

  • Synced URLs for Web Scrapes (CCv3 beta30)

    • We now display synced URLs in a dedicated list view under the WEB_SCRAPE integration.

    • The default columns displayed in the list view are name, status, created_at.

    • Parent URLs will be displayed as “folders” and children URLs will be displayed as “files” within the folder.

  • When showFilesTab is set to false we surface a Select files button in the account drop-down for users to sync new files.

  • Data Source Polling Interval

    • Added a new configuration property at the component level called dataSourcePollingInterval.

    • This property controls how frequently data sources are polled for any updates and events.

    • The value is specified in milliseconds (ms) and the minimum allowed value for this property is 3000 ms. The default is 8000 ms.

  • Speaker Diarization

    • Added includeSpeakerLabels for LOCAL_FILES integration and file extensions.

    • Added include_speaker_labels to fileSyncConfig for third-party connectors.

  • openFilesTabTo Param

    • The openFilesTabTo prop is set on the component level and determines which tab (FILE_PICKER or FILES_LIST) the user is taken to by default when they select an integration.

    • The prop takes a string value of either "FILE_PICKER" | "FILES_LIST".

    • This prop only applies when the customer has enabled Carbon’s in-house file picker.

  • We now display a banner when data source items are being synced. The user will still be able to select previously synced items for upload in the meantime.

  • Guru support in CCv3 has been added. The enabledIntegration is GURU.

  • We improved the file list view to be better optimized for mobile devices and ensured that the column headers and values align properly.
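The dataSourcePollingInterval bounds described above (minimum 3000 ms, default 8000 ms) can be checked before the value is passed to the component. This is a hedged sketch: the function name is hypothetical, and raising on values below the minimum is an assumption, since the changelog does not say whether the component rejects or clamps them.

```python
MIN_POLLING_INTERVAL_MS = 3000      # minimum stated in the changelog
DEFAULT_POLLING_INTERVAL_MS = 8000  # default stated in the changelog


def resolve_polling_interval(value_ms=None):
    """Return a polling interval in ms that respects the documented bounds.

    Rejecting (rather than clamping) small values is an assumption.
    """
    if value_ms is None:
        return DEFAULT_POLLING_INTERVAL_MS
    if value_ms < MIN_POLLING_INTERVAL_MS:
        raise ValueError(
            f"dataSourcePollingInterval must be >= {MIN_POLLING_INTERVAL_MS} ms"
        )
    return value_ms
```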


Pongo Reranking Model

  • We’ve added Pongo as a supported reranker model alongside Jina and Cohere.

  • Similar to Cohere and Jina reranking, users can now use PONGO_RERANKER in the following manner on the embeddings endpoint: { "query": "how is anime made?", "k": 5, "rerank": {"model": "PONGO_RERANKER"} }
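As a rough sketch, the request body shown above can be assembled with a small helper. The function name is hypothetical; only the query, k, and rerank fields are taken from the example, and sending the payload to the embeddings endpoint is left out.

```python
def build_search_payload(query: str, k: int, rerank_model=None) -> dict:
    """Build an embeddings search body, optionally with a reranker model."""
    payload = {"query": query, "k": k}
    if rerank_model is not None:
        # Same shape as the PONGO_RERANKER example in this changelog.
        payload["rerank"] = {"model": rerank_model}
    return payload


payload = build_search_payload("how is anime made?", 5, "PONGO_RERANKER")
```

Leaving rerank_model unset omits the rerank field entirely, which corresponds to the default ranking.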

Third-Party File Picker Behavior

  • We added a new parameter automatically_open_file_picker to the external file sync URLs: /integrations/oauth_url and /integrations/connect. When true, the file picker for Google Drive, Box, OneDrive, SharePoint, and Dropbox will automatically open when the user lands on the successful connection page.

  • It’s important to note that some users’ browsers may have popup blockers that could prevent this parameter from functioning. In such cases, the user may receive a prompt from their browser asking for permission to allow popups from the platform. If the user grants permission, the feature will work as intended for future syncs.

  • It’s worth mentioning that OneDrive and SharePoint behave differently due to Microsoft treating the file picker as a separate app. Instead of directly opening the file picker, it will trigger another OAuth prompt. If the user consents to the file picker OAuth, the file picker will then automatically open afterwards.

Speaker Diarization

  • Speaker diarization has been added for audio transcription models. This allows us to format chunks so that the text is organized by utterances and each utterance will be labeled with the speaker. It’ll take this format:

[Speaker A] speaker A's utterance

[Speaker B] speaker B's utterance

  • For local file uploads, there is a new parameter include_speaker_labels. For external file uploads, the file_sync_config object accepts a new property, include_speaker_labels. When either is set to true, speaker diarization is enabled for the audio transcription service.

  • Minor note: Speaker labels may appear differently depending on the transcription service. Deepgram uses numbers to label speakers while AssemblyAI uses letters.
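A minimal sketch of how a client could split a diarized chunk back into speaker/utterance pairs. It assumes each utterance starts on its own line with a "[Speaker X]" label, where X is a letter or a number; the exact label text produced for numeric labels is an assumption.

```python
import re

# Matches "[Speaker X] text"; X may be a letter or a number.
_UTTERANCE = re.compile(r"^\[Speaker ([^\]]+)\]\s*(.*)$")


def parse_utterances(chunk_text: str):
    """Split a diarized chunk into (speaker, utterance) pairs."""
    pairs = []
    for line in chunk_text.splitlines():
        stripped = line.strip()
        match = _UTTERANCE.match(stripped)
        if match:
            pairs.append((match.group(1), match.group(2)))
        elif pairs and stripped:
            # Treat an unlabeled line as a continuation of the previous utterance.
            speaker, text = pairs[-1]
            pairs[-1] = (speaker, f"{text} {stripped}")
    return pairs
```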

request_id on Additional Webhooks

  • request_id is now included in the following webhook events under the additional_information object for external files: UPDATE, FILES_CREATED, FILE_READY, FILE_ERROR, FILES_SKIPPED, and FILE_SYNC_LIMIT_REACHED.

Cold Storage for Files (Beta)

  • Overview

    • Carbon supports moving file embeddings between hot and cold storage. This feature allows you to optimize storage costs and improve performance by keeping embeddings for frequently accessed files in hot storage (vector storage) while moving less frequently used files to cold storage (object storage).

  • Enabling Cold Storage

    • By default, the cold storage feature is not enabled. Once enabled, files will automatically be moved to cold storage after a set period of inactivity. To enable cold storage, you must set a flag at file upload time. Currently cold storage is only available for local file uploads via /uploadfile, /upload_text and /upload_file_from_url.

      • Moving Files from Hot to Cold Storage

        • Once enabled, files will be automatically moved from hot to cold storage after a specified period of inactivity. This period is determined by the time_to_move_to_cold_storage parameter, which represents the number of seconds a file must be inactive before it’s moved to cold storage. There is no manual way to move files to cold storage.

          • You can make an API request to the /modify_cold_storage_parameters endpoint which allows customers to update existing files to use cold storage.

      • Moving Files from Cold to Hot Storage

        • To move files from cold to hot storage, you must make an API request to /move_to_hot_storage. The request will take filters similar to /user_files_v2, and all files matching the provided filters will be moved to hot storage.

        • To avoid a single request hogging resources, there is a limit of 200 files that can be moved in one request. If the number of files matching the filters exceeds 200, the files will be processed in batches of 200 over a longer period of time.

    • /embeddings Endpoint Behavior

      • If a request is made to /embeddings that involves files in cold storage, an error will be returned that includes all file_ids for the affected files. This allows the client to know which files need to be moved to hot storage before the request can be processed.

      • However, if exclude_cold_storage_embeddings is set to true, any files in cold storage will be ignored, and no error will be thrown for requests involving files in cold storage. The search will then naturally exclude those files.

      • In the future, we may enable a way to allow /embeddings to work with files that are in both cold and hot storage.

  • File Object Information

    • Activity is defined as when a file was last used, which currently includes file re-syncs, queries involving that file, and updates to file tags.

    • The following fields under the file object (under user_files_v2) are related to cold storage:

      • last_use: A timestamp indicating when a file was last used (i.e., when it last had activity).

      • supports_cold_storage: A flag indicating whether or not a file can be moved to cold storage.

      • time_to_move_to_cold_storage: An integer representing the number of seconds a file must be inactive before it’s moved to cold storage.

      • embedding_storage_status: The storage status of the embeddings for a file, indicating whether they are in cold or hot storage.

  • New Cold Storage Webhooks

    • MOVED_TO_COLD_STORAGE: This event is fired when a file is moved to cold storage.

    • MOVED_TO_HOT_STORAGE: This event is fired when a file is moved to hot storage.

You can find our documentation on cold storage here.
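The file object fields listed above are enough to estimate when a file becomes eligible for cold storage: last_use plus time_to_move_to_cold_storage seconds. The sketch below is hedged: the helper name is hypothetical, and it assumes last_use is an ISO 8601 string with a UTC offset, which the changelog does not specify.

```python
from datetime import datetime, timedelta


def cold_storage_eta(file_obj: dict):
    """Estimate when a file becomes eligible to move to cold storage.

    Returns None when the file does not support cold storage.
    The ISO 8601 format of `last_use` is an assumption.
    """
    if not file_obj.get("supports_cold_storage"):
        return None
    last_use = datetime.fromisoformat(file_obj["last_use"])
    return last_use + timedelta(seconds=file_obj["time_to_move_to_cold_storage"])
```

Because any activity (re-syncs, queries, tag updates) resets last_use, the estimate is only valid until the file is next used.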

Warnings Object to API Responses

  • In the next two weeks, we plan to add a warnings object to our API responses to display warning messages.

  • Here’s an example of how it looks:

{ "documents": [], "warnings": [ { "warning_type": "FILES_IN_COLD_STORAGE", "object_type": "FILE_LIST", "object_id": [ 47058 ], "message": "These files won't be queried because they are not in hot storage." } ] }
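A short sketch of how a client might read the warnings object, using the example response above as input. The helper name is hypothetical, and the code assumes object_id is always a list for this warning type, as it is in the example.

```python
import json

response_text = """
{ "documents": [], "warnings": [ { "warning_type": "FILES_IN_COLD_STORAGE",
  "object_type": "FILE_LIST", "object_id": [ 47058 ],
  "message": "These files won't be queried because they are not in hot storage." } ] }
"""


def cold_storage_file_ids(response: dict) -> list:
    """Collect file IDs from FILES_IN_COLD_STORAGE warnings, if any."""
    ids = []
    for warning in response.get("warnings", []):
        if warning.get("warning_type") == "FILES_IN_COLD_STORAGE":
            ids.extend(warning.get("object_id", []))
    return ids


print(cold_storage_file_ids(json.loads(response_text)))  # → [47058]
```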

Carbon Connect 3.0 (CCv3) Enhancements

  • We’ve added 3 new props to CCv3:

    • The showFilesTab (boolean) prop has been reintroduced to CCv3 with a default value of true. As a quick reminder, this prop allows customers to hide the file selector and file list view from the CCv3 component. It can be enabled or disabled at both the component and integration levels. If specified for a specific integration, it will override the component-level configuration.

    • The filesTabColumns (array) prop has been added on both the component and integration levels. This prop controls which columns are displayed and hidden in the file list view and accepts an array of strings with values “name”, “status”, “created_at”, and “external_url”.

    • The transcription_service (enum) prop has been added under fileSyncConfig and transcriptionService for LOCAL_FILES integration to specify which speech-to-text model to use for transcriptions. You can specify the enum as ASSEMBLYAI or DEEPGRAM but the prop defaults to DEEPGRAM.

Google Cloud Storage Connector 

  • We launched our GCS connector that enables syncing files from buckets.

  • The Carbon Connect enabledIntegrations value for GCS is GCS.

  • See more specifics about our GCS connector here.

DigitalOcean Storage Connector

  • We launched our DigitalOcean Storage connector that enables syncing files from buckets.

  • The Carbon Connect enabledIntegrations value for DigitalOcean Spaces is S3 (CC support will be launched tomorrow).

  • The Spaces API is interoperable with the AWS S3 API, so DigitalOcean Spaces makes use of the existing S3 endpoints.

  • This means that the source of DigitalOcean files is S3. To differentiate between data sources and files from Spaces Object Storage, additional metadata has been added:

    • Data Source Metadata

      • data_source_metadata: Indicates the type of data source. Possible values include:

        • S3: Represents an Amazon S3 data source.

        • DigitalOcean Space: Represents a DigitalOcean Spaces data source.

    • File Metadata

      • file_metadata: Specifies the type of file. Possible values include:

        • S3 File: Represents a file stored in Amazon S3.

        • DigitalOcean Space File: Represents a file stored in DigitalOcean Spaces.

        • S3 Bucket: Represents a file representation for an S3 bucket.

        • DigitalOcean Space Bucket: Represents a file representation for a DigitalOcean Space Bucket.

  • See more specifics about our DigitalOcean Spaces connector here.

New file_types_at_source Filter for /user_files_v2 and /embeddings

  • Introduced a new optional field file_types_at_source for /user_files_v2 and /embeddings.

  • The file_types_at_source field is an array type that currently accepts the following values:

    • TICKET

    • ARTICLE

  • This new field allows users to specify whether we return tickets, articles or both when retrieving content (files and embeddings) from Zendesk, Intercom and Freshdesk.

    • If file_types_at_source contains TICKET, ticket content from Zendesk, Intercom and Freshdesk is returned.

    • If file_types_at_source contains ARTICLE, article content from Zendesk, Intercom and Freshdesk is returned.
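A hedged sketch of validating file_types_at_source on the client before adding it to a request. The helper name is hypothetical, and placing the field alongside other filters is an assumption; only the field name and its two accepted values come from this changelog.

```python
ALLOWED_FILE_TYPES_AT_SOURCE = {"TICKET", "ARTICLE"}  # values listed above


def with_file_types_at_source(filters: dict, types: list) -> dict:
    """Return a copy of `filters` with a validated file_types_at_source list."""
    unknown = set(types) - ALLOWED_FILE_TYPES_AT_SOURCE
    if unknown:
        raise ValueError(f"unsupported file_types_at_source: {sorted(unknown)}")
    return {**filters, "file_types_at_source": list(types)}
```

Passing both values requests tickets and articles together; passing one restricts results to that type.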

AssemblyAI Integration for Audio Transcriptions

  • We are excited to announce that Carbon now supports multiple audio transcription services. In addition to our existing integration with Deepgram, we have added support for AssemblyAI, providing our users with more options and flexibility when transcribing audio files.

  • To accommodate the new transcription service, we have updated the following endpoints to accept a new parameter, transcription_service, that allows you to specify which service to use. Valid values are deepgram and assemblyai. If no value is specified, Deepgram will be used as the default transcription service.

  • For local files, the endpoints are:

    • /uploadfile

    • /upload_file_from_url

  • For external files, transcription_service is set within the file_sync_config parameter, under:

    • /integrations/oauth_url

    • /integrations/connect

    • /integrations/files/sync

  • Similar to files transcribed by Deepgram, files transcribed by AssemblyAI also have an additional saved file containing the full JSON response from the AssemblyAI service. To access the transcription response, query the files using the user_files_v2 endpoint with the include_additional_files parameter set to true.

Carbon Webhook Libraries

  • We have released our official webhook libraries for handling the verification of webhook signatures. You can find our updated documentation here, and access our libraries on GitHub here.

Zendesk Auto-Sync Update

We are thrilled to announce that the Zendesk connector now supports auto-sync.

  • Carbon can now sync any new articles with auto-sync enabled.

    • Help Center Categories are now synced into Carbon as files, and Help Center Categories and articles form a parent-child relationship.

  • Reconnecting Existing Zendesk Connections:

    • If you have existing Zendesk connections in Carbon, please note that you will need to reconnect them to enable the updates above.

Organization Connector Settings

  • The /organization endpoint now includes connector_settings in the response, providing additional information about the organization’s connector configurations, starting with permitted file formats.

  • The /organization/update endpoint has been updated to accept the data_source_config parameter, allowing customers to configure permitted file formats for organization users. The data_source_config parameter should be provided in the following format:

{ "data_source_configs": { "GOOGLE_DRIVE": { "allowed_file_formats": ["PDF", "DOCX"] }, "DROPBOX": { "allowed_file_formats": ["XLSX", "CSV"] }, "DEFAULT": { "allowed_file_formats": ["PDF", "DOCX", "XLSX", "NOTION"] } } }

  • DEFAULT is applied to all data sources that do not have configs defined.

  • If the data_source_config parameter includes file formats that are not supported by Carbon, those formats will be ignored, and only the supported formats from each data source will be synced.
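The DEFAULT fallback described above can be illustrated on the client side. This is a sketch of the lookup rule only, not Carbon's server logic; the helper name is hypothetical, and the sample config mirrors the example payload.

```python
data_source_configs = {
    "GOOGLE_DRIVE": {"allowed_file_formats": ["PDF", "DOCX"]},
    "DROPBOX": {"allowed_file_formats": ["XLSX", "CSV"]},
    "DEFAULT": {"allowed_file_formats": ["PDF", "DOCX", "XLSX", "NOTION"]},
}


def allowed_formats(configs: dict, data_source: str) -> list:
    """Return the formats for a data source, falling back to DEFAULT."""
    config = configs.get(data_source) or configs.get("DEFAULT", {})
    return config.get("allowed_file_formats", [])
```

With the sample config, DROPBOX resolves to its own list, while any data source without an entry resolves to the DEFAULT list.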

Carbon Self-Hosting on AWS

  • Starting today, customers have the option to host a Carbon instance on their own cloud, with full access to all features of our managed solution, including data connectors, hybrid search, and more.

  • We’re launching on Microsoft Azure and Google Cloud later next month!

  • Book a demo if you’re interested to learn more: https://cal.com/carbon-ai/30min

Confluence Enhancements

We’ve made improvements to the Confluence Connector related to the following:

  • Auto-Sync Improvements

    • The auto-sync process will now index new pages that are added to a previously synced parent page. If a user syncs their entire Confluence account, then the space will be the top-most file.

    • If pages are deleted from a synced parent page in Confluence, the scheduled sync will remove them from the synced content.

  • File Metadata Enhancements

    • The file_metadata property now includes additional information about the type of Confluence item each file represents (spaces and pages).

    • The file_metadata property will also record the external_id of the file’s parent and root, providing better context and hierarchy information.

  • To take advantage of these updates, users will need to reconnect their Confluence account and re-sync their Confluence files.

Reranker Models for Search

We are excited to introduce native support for reranker models. With this release, customers now have the option to rerank search result chunks to provide more relevant and accurate results.
How it works:

  • When making a search query via the embeddings endpoint, customers can control the reranking behavior by setting the rerank parameter in the payload.

    • If rerank is set to "JINA_MULTILINGUAL_BASE_V2" the search result chunks will be reranked using the Jina reranking algorithm.

    • If rerank is set to "COHERE_RERANK_MULTILINGUAL_V3", the search result chunks will be reranked using the Cohere reranking algorithm.

    • If the rerank parameter is not specified or set to any other value, the default ranking will be used.

  • The response format from the embeddings endpoint remains consistent regardless of whether rerank is enabled or not.

We’ll be adding support for more reranker models in the weeks to come!

New Webhook: WEBSCRAPE_URLS_READY

We’ve added a new webhook named WEBSCRAPE_URLS_READY that triggers each time a specific web page from a web scrape request is finished processing.

Introducing Carbon Connect 3.0

We’re thrilled to announce the beta release of Carbon Connect 3.0, packed with exciting updates and improvements based on customer feedback.

Key Features and Improvements

1. Seamless File and Folder Uploads
Carbon Connect 3.0 now supports both file and folder uploads by default, eliminating the need for the filePickerMode property. Uploading entire folder directories is now a breeze with our new drag-and-drop functionality.

2. Carbon’s In-House File Picker
We’re excited to introduce Carbon’s in-house file picker, now available for all connectors except Slack, Gmail, and Outlook (currently in development). To use Carbon’s file picker instead of the source’s file picker, simply set the new useCarbonFilePicker property to true.

3. Enhanced In-Modal Notifications
We’ve completely replaced toast notifications with in-modal notifications, providing a more cohesive and user-friendly experience. As a result, the enableToasts property has been removed.

4. Customizable Theme Options
Personalize your Carbon Connect experience with our new theme options. Use the theme property to set the application’s theme to light, dark, or auto (default). When set to auto, Carbon Connect will automatically adapt to your system’s theme.

5. Simplified File Limit Control
Limiting the number of files is now easier than ever. Simply set the maxFilesCount property to 1 to restrict uploads to a single file. The allowMultipleFiles property has been removed for a more straightforward approach.

Upcoming Enhancements
We’re continuously working to improve Carbon Connect and have exciting plans for the near future:

1. Enhanced Customization Options
We’re working on bringing back customization options from Carbon Connect 2.0, including loadingIconColor, primaryBackgroundColor, primaryTextColor, secondaryBackgroundColor, and secondaryTextColor.

2. Expanded In-House File Pickers
In the coming weeks, we’ll be launching Carbon’s in-house file pickers for Outlook, Slack, and Gmail, providing a consistent and seamless experience across all connectors.

Installation
You can install the new component for testing via the command npm install carbon-connect@beta. We plan to bring 3.0 out of beta by the end of the month!

Here’s a Loom video providing a quick walkthrough of the new modal: https://www.loom.com/share/b7b241fa5e5e4d0a92fb5e748d3d6ec3

External URLs Filter

A new external_urls filter has been added to the user_files_v2 endpoint. This filter allows you to refine the results returned by the endpoint based on a list of external_urls passed.

File Deletion Enhancements 

  • When a customer deletes a file from Carbon (via delete_files_v2), they can control whether the file row in the database is preserved or marked as deleted.

    • This behavior is managed by the preserve_file_record flag. If preserve_file_record is set to true, then we delete the files stored in our S3/GCS while keeping the file record and metadata to allow for re-syncs and auto-syncs.

    • We also added a file_contents_deleted field to the user_files_v2 endpoint. If the field is returned as true, then the file record still exists, but the stored file content is deleted.

  • Find more details here.

High Accuracy Mode 

  • We’ve introduced a new optional boolean parameter to the /embeddings endpoint called high_accuracy. If set to true, vector search may give more accurate results at a slight performance penalty. By default, it’s false.

  • Find more details here.

To And From Filters for Outlook and Gmail

  • We added 2 more filters for syncing emails from Outlook and Gmail:

  • Note: Outlook only supports from filters.

Intercom Auto-Sync Update

  • We are thrilled to announce 2 updates to our Intercom connector:

    • Carbon can now sync multiple Intercom Help Centers:

      • Help Centers are now synced into Carbon as files, and Help Center and articles form a parent-child relationship.

      • Just as only published articles are synced, only activated Help Centers will be synced.

    • Carbon can now sync any newly published articles when auto-sync is enabled.

  • Reconnecting Existing Intercom Connections:

    • If you have existing Intercom connections in Carbon, please note that you will need to reconnect them to enable the updates above.

New Endpoint: /list_users

  • A new endpoint has been added to list all users under your organization.

    • Filters: Filter by a list of customer_id values.

    • Pagination: Request body needs pagination limit and offset.

    • Sorting: Sort by created_at and updated_at and ascending/descending.

    • Find more details here.

More Chunk Metadata

  • We’ve added chunk metadata for the following data sources to the /embeddings and /list_chunks_and_embeddings endpoints:

    • Websites:

"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "title": "RGB to HEX", "description": "Convert RGB color codes to HEX HTML format for use in web design and CSS. Also converts RGBA to HEX." } } }

    • Email:

"content_metadata": { "file_metadata": { "file_level_embedding_properties": { "cc": "Swapnil Banga <swapnilbanga@outlook.com>, swapnil.banga@squareboat.com", "sender": "Swapnil Banga <swapnil.galaxy@gmail.com>", "timestamp": "2024-06-21T14:52:24Z" } } }

New Endpoint: /list_chunks_and_embeddings

  • A new endpoint has been added /list_chunks_and_embeddings. This endpoint is similar to the existing /text_chunks endpoint but with some key differences:

    • Retrieve chunks for multiple files that match the filter criteria instead of just a single file.

    • Filters: The filters for this endpoint are the same as those found via user_files_v2, allowing for more granular filtering of chunks based on file-level data.

    • Ordering: The order_by values enable sorting based on file-specific attributes.

    • Pagination Behavior: The order_by, limit, and offset parameters now correspond to the initial query that filters files, before chunks and embeddings are fetched for the filtered files.

    • The count parameter still refers to the total count of all embeddings for all filtered files, not the total count of filtered files.

Introducing /fetch_webpage Endpoint

  • We’re excited to announce a new and improved endpoint for fetching webpage URLs: /fetch_webpage. This endpoint offers an asynchronous way to retrieve webpage data and URLs.

    • Fetch URLs: The /fetch_webpage endpoint accepts a POST request with the url parameter as input.

    • Webhook Notifications: Upon completion of a webpage request, one of the following webhooks will be sent:

      • WEBPAGE_ERROR (object type: WEBPAGE): Indicates that the request failed. The webhook payload includes the corresponding webpage ID.

      • WEBPAGE_READY (object type: WEBPAGE): Indicates that the request succeeded. The webhook payload includes the corresponding webpage ID.

  • User Webpage History: Users can access their webpage request history by querying the /user_webpages endpoint. This endpoint returns the results of all the user’s webpage requests.

split_rows for Third-Party Data Sources and Carbon Connect

  • split_rows has been added for Excel and CSV files uploaded from third-party data sources and Carbon Connect.

    • For Carbon Connect, it is available as a parameter (splitRows) on the integration and file extension level for LOCAL_FILES and part of the fileSyncConfig field for third-party integrations.

Notion Updates

  • Notion now supports the sync_files_on_connection parameter.

    • When set to true (default), selected files via the Notion file picker will be synced immediately.

    • When set to false, permissions to access the selected files will still be granted, but users need to use Carbon’s file picker or the /integrations/items/list and /integrations/files/sync endpoints to sync the files.

  • The root_external_id field under /integrations/items/list is now returned for Notion files as well.

Slack Connector Launch

  • We’ve officially launched our Slack Connector.

  • How It Works:

    • OAuth and Token Refresh

      • Slack integration will use OAuth for authentication.

    • Conversation Listing

      • Users can list their Slack conversations using the /integrations/slack/conversations endpoint.

      • The endpoint supports filtering by conversation type:

        • Public channels

        • Private channels

        • Private messages

        • Group conversations

  • Conversation Sync

    • There will be no automatic or global file sync for Slack. Instead, a dedicated sync endpoint /integrations/slack/sync is available, which accepts conversation_id (required) and after (date) filter parameters.

    • Messages are synced in 15-minute blocks. For example, all messages between 2:15 and 2:30 will be synced together. Replies to messages outside the block will still be synced in the same block.

    • Currently, only message content is synced. Attachments and reactions are not included.
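To make the 15-minute blocks concrete, the sketch below computes the block that contains a given message timestamp. It assumes blocks are aligned to quarter-hour clock boundaries, as the 2:15 to 2:30 example suggests; the function name is hypothetical and whether the end boundary is inclusive is not stated in the changelog.

```python
from datetime import datetime, timedelta

BLOCK_MINUTES = 15


def sync_block(ts: datetime):
    """Return the (start, end) of the 15-minute block containing `ts`."""
    start = ts.replace(
        minute=ts.minute - ts.minute % BLOCK_MINUTES, second=0, microsecond=0
    )
    return start, start + timedelta(minutes=BLOCK_MINUTES)
```

For example, a message sent at 2:20:30 falls in the block that starts at 2:15 and ends at 2:30.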

Chunk Metadata for CSV and Excel Files

  • We added row metadata to each chunk result. Moving forward, the beginning and ending row number in the table corresponding to a specific chunk will be returned under content_metadata.

  • Similar to the page numbers and x/y coordinates we return for PDF chunks, the beginning and ending row number allows you to directly reference where in a file the chunk was found.

FILE_DELETED Webhook Update

  • We pushed an update to fire the FILE_DELETED webhook when you manually re-sync a file via the /integrations/oauth_url, /integrations/connect, /resync_file and /integrations/files/sync endpoints.

  • This means that if you pass in the external ID of a folder and we find that an item in the folder has been deleted, we will delete the file and fire the webhook.

New CSV Parameter: split_rows

  • We have introduced a new optional query parameter called split_rows that accepts a boolean value. This parameter provides more flexibility when handling CSV rows that exceed max token limits.

  • Here’s how it works:

    • If split_rows is set to true:

      • CSV rows will be automatically split if they exceed either the specified chunk size or the maximum token limit of the embedding model.

      • This allows for processing of larger CSV rows without encountering errors.

    • If split_rows is set to false (default value):

      • The behavior remains the same as before.

      • If a CSV row exceeds the limits, an error will be thrown.

  • The default value of split_rows is set to false to ensure backwards compatibility.

  • This param is currently available for local file uploads via API and will be rolled out for external data sources and Carbon Connect shortly.
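The two behaviors can be sketched as follows. This illustrates the rule described above and is not Carbon's implementation: tokens are approximated by whitespace-separated words, a single max_tokens limit stands in for both the chunk size and the model limit, and the function name is hypothetical.

```python
def chunk_csv_row(row: str, max_tokens: int, split_rows: bool = False) -> list:
    """Return the chunks produced for one CSV row.

    Whitespace tokenization is a stand-in for a real model tokenizer.
    """
    tokens = row.split()
    if len(tokens) <= max_tokens:
        return [row]
    if not split_rows:
        # Default behavior: an oversized row is an error.
        raise ValueError("CSV row exceeds the token limit and split_rows is false")
    # split_rows=True: break the row into pieces that fit the limit.
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```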

Dropbox Business Support

  • We’ve launched support for Dropbox Business. Users with Dropbox Business accounts can now access and sync files from team folders shared with them.

  • Users will need to reconnect their existing Dropbox connections if they want to start syncing files from their teams.

Inclusion and Exclusions for Sitemaps and Web Scrapes

  • We’ve added a new feature that allows you to filter web and sitemap scrapes by specific URL paths. This enhancement gives you greater control over the data you collect, enabling you to focus on the most relevant content for your needs.

    • For sitemaps, you can now include or exclude URLs based on their paths using the following parameters:

      • url_paths_to_include

        • Description: Filters sitemap URLs that contain any of the specified paths.

        • Value: A list of up to 10 strings representing the URL paths to include.

        • Example: url_paths_to_include: ["/products", "/collections"]

      • url_paths_to_exclude

        • Description: Filters out sitemap URLs that contain any of the specified paths.

        • Value: A list of up to 10 strings representing the URL paths to exclude.

        • Example: url_paths_to_exclude: ["/products", "/collections"]

    • For web scrapes, you can now specify the starting paths based on the URL paths:

      • url_paths_to_include

        • Description: The scrape will start at the specified paths, and if a recursion depth is set, it will only include links that also contain these paths.

        • Value: A list of up to 10 strings representing the URL paths to include.

        • Example: url_paths_to_include: ["/products", "/collections"]
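
As a rough illustration of the sitemap filters, the sketch below applies both parameters to a list of URLs. It assumes plain substring matching against the URL path and assumes that an exclusion wins when a URL matches both lists; the function name filter_sitemap_urls is hypothetical, and the starting-path behavior for web scrapes is not modeled.

```python
from urllib.parse import urlparse

MAX_PATHS = 10  # each parameter accepts a list of up to 10 strings


def filter_sitemap_urls(urls, url_paths_to_include=None, url_paths_to_exclude=None):
    """Keep URLs whose path matches the include list and not the exclude list."""
    include = list(url_paths_to_include or [])
    exclude = list(url_paths_to_exclude or [])
    if len(include) > MAX_PATHS or len(exclude) > MAX_PATHS:
        raise ValueError("at most 10 paths are allowed per parameter")

    kept = []
    for url in urls:
        path = urlparse(url).path
        # An empty include list means "include everything".
        if include and not any(p in path for p in include):
            continue
        # Assumption: exclusions take precedence over inclusions.
        if any(p in path for p in exclude):
            continue
        kept.append(url)
    return kept
```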

Webhook Health Monitoring

  • We added a more robust health check logic for webhook URLs.

    • If a URL is flagged as unhealthy (and marked as status FLAGGED), the system will automatically poll the URL every 10 seconds to check its status and fire a new webhook event called CHECKUP per poll request.

      • For CHECKUP events, there is no requirement to verify the signature, although you still have the option to do so if desired.

      • When receiving a CHECKUP event, it is safe to simply return a 200 response without any additional processing.

  • If a successful response is received during the health check, the URL will be re-activated.
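
A receiver can short-circuit CHECKUP events before doing any real work. The sketch below is framework-agnostic and makes two assumptions: that the event type arrives in a JSON field named webhook_type, and that the caller supplies its own process_event function. Both names are illustrative, not confirmed by this changelog.

```python
import json
from typing import Callable


def handle_webhook(raw_body: bytes, process_event: Callable[[dict], None]) -> int:
    """Return the HTTP status code to send back for one webhook delivery."""
    event = json.loads(raw_body)
    # "webhook_type" is an assumed field name for the event type.
    if event.get("webhook_type") == "CHECKUP":
        # Health-check poll: acknowledge with 200 and skip all processing.
        return 200
    # Any other event goes through the normal handling path.
    process_event(event)
    return 200
```

Signature verification is left out here because it is optional for CHECKUP events; for other event types it would run before process_event.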

Notifications via Email

  • We are excited to announce the launch of email notifications to keep our customers informed about important events and actions occurring on our platform. In this initial release, we have implemented the following email notifications:

    • Webhook Events Paused

      • Trigger: This notification is sent when a webhook has been temporarily paused due to failing to return a response 20 times within a 60-second window.

      • Purpose: To alert customers about any interruptions in webhook functionality and provide them with timely information to investigate and resolve the issue.

    • Webhook Events Unpaused

      • Trigger: This notification is sent when a previously paused webhook has been unpaused after our system’s polling mechanism (which runs every 10 seconds) determines that the webhook is healthy and responsive again.

      • Purpose: To inform customers that the webhook has resumed normal operation and that data flow has been restored.

Video Embeddings Support

  • We now support embedding generation for videos, allowing you to run semantic search on the video content based on the similarity of a video snippet to the search query or the text within the video frames, similar to OCR.

    • /uploadfile now takes a new optional parameter called media_type, whose value comes from the FileContentTypes enum. If media_type isn’t provided, all video file formats default to audio processing.

    • Currently videos are supported via the uploadfile and upload_file_from_url endpoints but we’ll be adding support for third-party connectors and in Carbon Connect soon.

  • We support the following video file formats:

    • AVI

    • FLV

    • MKV

    • MOV

    • MP4

    • MPEG

    • MPG

    • WEBM

    • WMV

  • The maximum file size is 1 GB, but it can be increased upon request.

  • See more details here.

  • Please note that video embedding generation takes much longer than text and image embeddings. For example, it took 60-90s to embed a 3-minute video.

Intercom Tickets Integration

  • We’re thrilled to announce that our Intercom connector now has support for tickets.

  • The /integrations/oauth_url and /integrations/connect endpoints sync articles by default. To customize the sync behavior, use the file_sync_config parameter.

  • You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.

  • To start syncing ticket content, the Intercom scope should include:

    • To sync user articles only, add these scopes:

      • Read one admin

      • Read and List Articles

    • To sync both user articles and tickets, also add:

      • Read and list users and companies

      • Read tickets

  • The following ticket information is available as tags for filtering:

{ "ticket_type": "Support Request", "ticket_status": "resolved", "ticket_category": "Customer", "ticket_submitter": "example.user@projectmap.com", "ticket_assigned_team": "Technical", "ticket_assigned_admin": "swapnil@carbon.ai" }

  • Text chunks will include the conversation history (comments on the ticket).

  • You can find more details here.

New Webhook Statuses

  • Each created webhook now has a status of either ACTIVE or FLAGGED, which is returned in the response of the webhooks endpoint.

  • ACTIVE: The webhook is operating normally and successfully receiving events.

  • FLAGGED: The webhook URL failed to return a response more than 20 times within a 60-second window. This indicates a potential issue with your webhook URL that you should check. If a webhook is moved to the FLAGGED status, please contact us to update it.

Incremental Syncs for Gmail and Outlook

  • We have introduced incremental syncs for the following endpoints for Gmail and Outlook:

    • /integrations/items/sync

    • /integrations/connect

    • /integrations/oauth_url

  • How It Works

    • By setting incremental_sync to true, only new or updated files since the last sync will be re-synced. This means that if a file has already been synced and hasn’t been modified, it will be skipped during the next sync.

    • If the embedding properties or tags of a file change between sync requests, those specific files will be re-synced.

    • Carbon sends a FILE_SKIPPED webhook event for files skipped during the incremental sync. The body of the webhook will contain a list of file_ids for files and a reason in additional_information.

  • This update addresses a common problem where files would be re-synced if a user went through the 3rd-party file selector to select files that had already been synced. With incremental syncs, this issue is resolved, ensuring that only truly new or updated files are synchronized.

  • Note: Incremental syncs are already enabled for Box, Dropbox, OneDrive and Google Drive.
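
The skip-or-re-sync decision described under How It Works can be sketched as a simple comparison. This is an illustration only: the FileSnapshot type and its field names are hypothetical, and it assumes that every file is re-synced when incremental_sync is false.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileSnapshot:
    """Hypothetical record of what was known about a file at sync time."""
    modified_at: str             # last-modified marker from the source
    embedding_properties: tuple  # e.g. chunk size, embedding model
    tags: tuple


def should_resync(
    previous: Optional[FileSnapshot],
    current: FileSnapshot,
    incremental_sync: bool = True,
) -> bool:
    """Return True if the file should be synced again, False if skipped."""
    if not incremental_sync or previous is None:
        # No incremental mode, or the file was never synced before.
        return True
    # Re-sync only when the file, its embedding properties, or its tags changed.
    return (
        current.modified_at != previous.modified_at
        or current.embedding_properties != previous.embedding_properties
        or current.tags != previous.tags
    )
```

Files for which this returns False correspond to the ones reported in the skipped-files webhook event.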

Aggregated Usage Metrics Update

  • We’re excited to announce several improvements to how we aggregate and expose file statistics across the API.

  • The following metrics will now be returned via the /organization and /user endpoints:

    • aggregate_file_size

    • aggregate_num_characters

    • aggregate_num_tokens

    • aggregate_num_embeddings

    • aggregate_num_files_by_source

    • aggregate_num_files_by_file_format

  • To fetch the most up-to-date metrics via the organization endpoint moving forward, take the following steps:

    1. Call the /organization/statistics endpoint, which takes no parameters and submits a request to asynchronously re-aggregate organization file statistics.

    2. When the re-aggregation is complete, a webhook of the event type FILE_STATISTICS_AGGREGATED will be sent.

    3. After receiving that event, making a request to /organization will return the updated file statistics in the response body.

    4. Additionally, a timestamp of when the file statistics were last updated can be found in file_statistics_aggregated_at.
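
The ordering of these steps matters because the re-aggregation is asynchronous. The sketch below shows the sequence with the HTTP calls and webhook wait injected as plain callables, so nothing about the real transport is assumed. The function and parameter names are hypothetical, and it assumes file_statistics_aggregated_at appears as a top-level field of the /organization response.

```python
from typing import Callable, Optional, Tuple


def refresh_org_statistics(
    request_reaggregation: Callable[[], None],
    wait_for_webhook: Callable[[str], None],
    fetch_organization: Callable[[], dict],
) -> Tuple[dict, Optional[str]]:
    """Run the four steps in order and return (organization, aggregated_at)."""
    # Step 1: ask for re-aggregation via /organization/statistics.
    request_reaggregation()
    # Step 2: block until the completion webhook has been received.
    wait_for_webhook("FILE_STATISTICS_AGGREGATED")
    # Step 3: only now does /organization reflect the new statistics.
    organization = fetch_organization()
    # Step 4: read the timestamp of the last aggregation (assumed location).
    return organization, organization.get("file_statistics_aggregated_at")
```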

fileSyncConfig Property for Carbon Connect

  • We have added a new fileSyncConfig prop for Carbon Connect that is set at the component or integration level and accepts the following properties:

  • auto_synced_source_types (AutoSyncedSourceTypes array): An array specifying the types of sources to automatically sync files from.

  • sync_attachments (boolean): Set to true to enable synchronization of attachments, or false to disable attachment syncing. Applies to helpdesk tickets currently.

  • detect_audio_language (boolean): Set to true to enable automatic detection of audio language during file upload, or false to disable audio language detection.

Deepgram Audio Language Detection

  • This feature easily enables automatic language detection for audio file uploads.

    • Added a new optional query parameter detect_audio_language

    • When set to true, Deepgram will automatically detect the language of the uploaded audio file

    • Defaults to false if not specified

    • Applies to the upload_files_from_url and uploadfile endpoints.

Updated Webhook Event: FILE_SYNC_LIMIT_REACHED

  • We have improved the functionality of the FILE_SYNC_LIMIT_REACHED webhook event to provide more granular information when users exceed file upload limits. This event will now be triggered in the following scenarios:

    • When a user attempts to upload files that would cause them to exceed the maximum number of allowed files (max_files).

    • When a user tries to upload more files than the maximum allowed per upload (max_files_per_upload).

    • When a user exceeds the daily 2.5GB file sync limit (existing functionality).

  • To differentiate between the three different limit scenarios, we have introduced a new reason property in the event’s additional information. The reason property will have one of the following values:

    • Max files per upload limit exceeded.

    • Max files limit exceeded.

    • Organization daily limit for file sync has been reached.
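
One way to branch on the new reason property is to map the three strings onto an enum. The sketch below assumes the reason is delivered exactly as written above, including the trailing period, under a reason key in the event's additional information. LimitKind and classify_limit are hypothetical names.

```python
from enum import Enum
from typing import Optional


class LimitKind(Enum):
    MAX_FILES_PER_UPLOAD = "Max files per upload limit exceeded."
    MAX_FILES = "Max files limit exceeded."
    ORG_DAILY_SYNC = "Organization daily limit for file sync has been reached."


def classify_limit(additional_information: dict) -> Optional[LimitKind]:
    """Return which limit was hit, or None if the reason is not recognized."""
    reason = additional_information.get("reason")
    try:
        # Enum lookup by value; raises ValueError for unknown or missing reasons.
        return LimitKind(reason)
    except ValueError:
        return None
```

Returning None for unrecognized reasons keeps the handler from failing if further reasons are added later.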

HTML File Support

  • We now support uploading .html files from local and third-party data sources.

  • Similar to other file formats, we provide the original .html file as well as a plain text version of the file as pre-signed URLs via the user_files_v2 endpoint.

Freshdesk Tickets Integration

  • We’re thrilled to announce that our Freshdesk connector now has support for tickets.

  • The /integrations/freshdesk and /integrations/connect endpoints sync articles by default. To customize the sync behavior, use the file_sync_config parameter.

  • You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.

  • To start syncing ticket content, the Freshdesk API key should belong to a user with access to agents and tickets permissions.

  • The following ticket information is available as tags for filtering:

{ "ticket_type": "incident", "ticket_status": "open", "ticket_assignee": "swapnil+zen1@carbon.ai", "ticket_priority": "normal", "ticket_requester": "customer@example.com" }

  • Text chunks will include the conversation history (comments on the ticket).

  • You can find more details here.

New Webhook Type: SPARSE_VECTOR_GENERATION

  • We have introduced a new webhook event type, SPARSE_VECTOR_GENERATION, that is triggered when the queued status of sparse vector generation for a file changes. It is called SPARSE_VECTOR_QUEUE_STATUS and has object type CHUNK_LIST.

  • This new webhook includes an object in the additional_information with the key-name sparse_vector_queue_status. The object has two fields:

    • sparse_vector_queue_status, which can be queued, aborted, or failed

    • sparse_vector_queue_error, which is null unless sparse_vector_queue_status is aborted or failed

  • See more details here.

parent_file_id for Embeddings

  • The embeddings response now includes a parent_file_id field for each chunk returned.

  • This field can contain an integer value representing the ID of the parent file, or null if there is no parent file associated with the embedding.

SharePoint and OneDrive Folder Selection and Syncing

  • You can now select an entire folder for upload, and Carbon will automatically include all nested subfolders and files. This brings our SharePoint and OneDrive functionality in line with popular services like Google Drive, Dropbox and Notion.

  • We have also introduced auto-sync for SharePoint and OneDrive folders. Any new folders and files added to your selected parent folder will be automatically detected and synced by Carbon. To enable auto-sync on folders, the user will need to re-upload the folders through the 3rd-party file picker.

Dropbox Folder Selection and Syncing

  • You can now select an entire folder for upload, and Carbon will automatically include all nested subfolders and files.

  • We have also introduced auto-sync for Dropbox folders. Any new folders and files added to your selected parent folder will be automatically detected and synced by Carbon, which brings our Dropbox functionality in line with popular services like Google Drive and Notion.

Webhook for Files Skipped

  • To improve visibility into your file processing pipeline, we’ve added a new webhook event: FILES_SKIPPED.

  • This event is triggered whenever Carbon skips processing for one or more files, such as when a file exceeds the size limits imposed by a third-party integration. The webhook payload will include a list of external_file_ids for the affected files, as well as an additional_information field with details on why processing was skipped. This allows you to easily identify and handle files that couldn’t be processed.

Zendesk Tickets Integration

  • We’re thrilled to announce that our Zendesk connector now has support for tickets.

  • The /integrations/oauth_url and /integrations/connect endpoints now sync articles by default. To sync only tickets or both articles and tickets, use the file_sync_config parameter. The file_sync_config parameter can also enable syncing attachments from ticket comments.

  • You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.

  • To start syncing ticket content, users must disconnect and reconnect their accounts with the new scopes. Don’t worry, disconnecting won’t affect your files.

  • The following ticket information is available as tags for filtering:

{ "ticket_type": "incident", "ticket_status": "open", "ticket_assignee": "swapnil+zen1@carbon.ai", "ticket_priority": "normal", "ticket_requester": "customer@example.com", "ticket_submitter": "swapnil+zen1@carbon.ai" }

  • Text chunks will include the conversation history (comments on the ticket).

  • You can find more details here.

Carbon Connect 2.0 Exits Beta

  • Carbon Connect 2.0 has officially exited beta as version 2.0.0.

Incremental Syncs for Data Sources

  • We have introduced incremental syncs for the following endpoints:

    • /integrations/items/sync

    • /integrations/connect

    • /integrations/oauth_url

  • How It Works

    • By setting incremental_sync to true, only new or updated files since the last sync will be re-synced. This means that if a file has already been synced and hasn’t been modified, it will be skipped during the next sync.

    • If the embedding properties or tags of a file change between sync requests, those specific files will be re-synced.

    • Carbon sends a FILE_SKIPPED webhook event for files skipped during the incremental sync. The body of the webhook will contain a list of file_ids for files and a reason in additional_information.

  • This update addresses a common problem where files would be re-synced if a user went through the 3rd-party file selector to select files that had already been synced. With incremental syncs, this issue is resolved, ensuring that only truly new or updated files are synchronized.

  • Note: Incremental syncs are only enabled on certain sources to start, including Box, Dropbox, OneDrive and Google Drive.

Re-Sync Child Files Via Resync_File Endpoint

  • When a file-id that belongs to a parent file (i.e., a folder) is submitted for re-sync via the resync_file endpoint, the associated child files will now also be re-synced.

  • This enhancement ensures that all related files within a folder hierarchy are properly synced when the parent file is re-synced.

Post Messages for Third-Party File Pickers

  • External data sources that utilize third-party file pickers will now post messages containing data of the selected file to the parent window when they are used in an iframe.

  • The message will be structured in the following format:

{ "event": "SELECTED", "data": list[{ "external_id": str, "parent_external_id": str | null, "name": str, "url": str | null, "is_folder": bool, "file_format": str | null, }], }

  • Note: Not all of the properties in the data list are available for every data source. For example, GDrive will have parent_external_id, but parent_external_id will always be null for Microsoft because its file picker does not return that data.

New Parameter include_containers

  • A new optional boolean parameter filters.include_containers has been added to the user_files_v2 API endpoint. This parameter allows you to control whether containers (folders) should be included in the API response.

    • When include_containers is set to false, the API will exclude folders from the response. This means that only files with actual content will be returned.

    • In addition to folders, the following types of files will also be excluded when include_containers is false:

      • RSS feed URLs

      • Email queries

      • GitBook spaces

      • GitHub directories

  • These excluded files typically group other files together but do not have any content themselves.

  • The default behavior of user_files_v2 remains unchanged. If the include_containers parameter is not provided or is set to true, folders will be included in the API response as before.

File Statistics Now Include MIME Type

  • file_statistics under the user_files_v2 endpoint now return the MIME type of the file, providing more detailed information about each file.

Organization-Level User Settings

  • Introduced the ability to configure user settings at the organization level.

  • Use the /organization/update endpoint with the global_user_config parameter to set the following organization-wide user settings:

    • auto_sync_enabled_sources

    • max_files

    • max_files_per_upload

  • Find more details here.

Customizable Sync Page Copy

  • Organizations now have the ability to customize the copy on the sync page after a user has connected to an external source.

  • Customizable elements include:

    • Header text

    • Subheader text

    • Button text

  • To update the sync page copy, DM us with the requested changes. This is a white-label-specific feature.

  • Please note that success and error messages are not customizable at this time.

File List for Local File Uploads

  • Added a new screen in Carbon Connect 2.0 (2.0.0-beta25) that displays a list of files uploaded locally by the user.

  • Use the showFilesTab configuration option to control whether this view is visible.

Limit File Uploads by Type

  • Organizations can now restrict the types of files that can be uploaded to Carbon.

    • File extension restrictions can be set per data source or globally for a given organization.

    • Users can still select disallowed file formats from the file picker, but these files will be ignored during the upload process.

  • To enable this feature, provide Carbon with a list of allowed file extensions, which must be a subset of Carbon’s supported file formats. A dedicated API endpoint will be coming soon!

New GitHub Endpoints

  • We’ve added two new endpoints to enhance the usability of the GitHub connector:

    • /integrations/github/repos: This endpoint allows users to retrieve a list of their GitHub repositories.

    • /integrations/github/sync_repos: This endpoint accepts a list of GitHub repository IDs, enabling users to list items from the specified repositories.

  • These new endpoints provide a more streamlined and efficient way to interact with GitHub repositories within Carbon.

GitHub Repository Selection Screen

  • We’ve introduced a dedicated screen in Carbon Connect 2.0 (2.0.0-beta24) for selecting GitHub repositories.

  • This new feature allows users to easily choose the repositories they want to sync and list items from. The repository selection screen is automatically displayed whenever a user connects their GitHub account.

  • This enhancement simplifies the process of managing GitHub repositories within Carbon Connect, providing a more intuitive and user-friendly experience.

Enhancements to Item Listing

  • We’ve added a new parameter called sync_source_items (or syncSourceItems in Carbon Connect) to give users more control over item syncing. By setting this parameter to false, users can prevent listing items from the corresponding connector.

  • By default, sync_source_items is set to true for all connectors, except for GitHub, where it is set to false. This default behavior for GitHub helps prevent rate limit-related sync issues with GitHub.

  • This enhancement provides users with greater flexibility in managing item syncing across different connectors.

Sorting Options for Source Items

  • We’ve introduced new sorting parameters, order_by and order_dir, for source items (/integrations/items/list). Users can now choose to sort items by the following criteria:

    • id: Sort items by their unique identifier.

    • name: Sort items alphabetically by their name.

    • directories_first: Sort folders first, followed by the remaining items. Both folders and files are sorted by name.

  • By default, items are sorted by name in ascending order (asc), maintaining the existing behavior. Please note that when directories_first is selected, the order_dir parameter is ignored.
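
The three orderings can be reproduced locally with a short sketch. It is illustrative only: the item keys id, name and is_folder are hypothetical, and desc is assumed to be the counterpart of asc for order_dir.

```python
def sort_items(items, order_by="name", order_dir="asc"):
    """Sort source items the way order_by and order_dir are described above."""
    if order_by == "directories_first":
        # order_dir is ignored: folders come first, and both folders and
        # files are sorted by name within their group.
        return sorted(items, key=lambda it: (not it["is_folder"], it["name"]))
    if order_by not in ("id", "name"):
        raise ValueError(f"unsupported order_by: {order_by!r}")
    return sorted(
        items,
        key=lambda it: it[order_by],
        reverse=(order_dir == "desc"),
    )
```

Calling sort_items(items) with no extra arguments matches the documented default of sorting by name in ascending order.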

External URLs in Salesforce

  • We now return the external URL for Salesforce Knowledge articles for Lightning users.

Support for Solar Embeddings

  • Exciting news! We’ve integrated Upstage’s Solar Embeddings into our platform, offering you a powerful new embedding model on Carbon.

  • To use this embedding model, specify the slug SOLAR for embedding_model.

  • You can find more details here.

FILE_CREATED for Web Scrape

  • We have expanded the FILE_CREATED webhook events to fire when files are generated from web scraping requests.

IS_RESYNC for FILE_READY Webhook

  • We’ve added a new boolean property additional_information.is_resync to the FILE_READY webhook event.

    • When it is false, the file was synced for the first time.

    • When it is true, the file was already synced previously so the current sync is a re-sync.

Carbon Connect 2.0 Is Exiting Beta

  • Carbon Connect 2.0 is exiting beta by this Friday!

  • This means if you run npm install carbon-connect moving forward and do not specify a version, we’ll install 2.0 by default.

  • If you need help or have any questions moving over to Carbon Connect 2.0, DM me.

Loading Screen for Carbon Connect 2.0 (carbon-connect@2.0.0-beta22)

  • We added a new component level prop loadingIconColor which defines the color of the loader icon. This can be specified using standard CSS color names, or directly as either a Hexadecimal (Hex) code or RGB color values.

Support for Google Drive Shortcuts

  • Users can now seamlessly sync Google Drive shortcuts to reference the files and folders they point to.

    • How It Works:

      • For shortcuts within folders, a file object will be generated. When this shortcut file is synced, it will also synchronize its targeted file separately, though not as a child. Please note, there is no hierarchical relationship between a shortcut and its target.

      • If the shortcut is directly selected from Google’s file picker, a shortcut file object will not be created. Instead, the target will be synced directly.

      • Importantly, the shortcut file itself will not contain any parsed text or chunks. Instead, it acts as a pointer, with the file_metadata.target_external_file_id attribute identifying the file the shortcut targets.

New Webhook Events

  • We’ve introduced 2 additional webhook events to help track file sync statuses:

    • FILE_CREATED: This event is fired when a user queues up a file to be synced for the first time. The body of the webhook will contain a list of file_ids for files that were created in the same upload, and multiple events could fire for the same upload if a lot of files were queued.

    • ALL_UPLOADED_FILES_QUEUED: This event is fired when every single item in an upload has been queued for sync, including all children of folders in an upload. The body will contain the upload’s request_id.

  • A couple of notes:

    • Both file_ids and request_ids can be used to filter for the files in /user_files_v2.

    • A request_id is now always generated for an upload to support the ALL_UPLOADED_FILES_QUEUED webhook. Previously, it was only generated by the user (unless you’re using Carbon Connect) and passed to us as a parameter. You may still do that and we’ll use your generated request_id, but if you don’t, we’ll generate a request_id on behalf of the user’s upload.

    • These two webhooks currently are supported for 3rd party data sources only. Support for web scrapes and local file uploads will be coming soon.

  • You can find more details here.

GitHub Connector

  • We launched our GitHub integration today, which syncs pages from both public and private repositories.

  • The Carbon Connect enabledIntegration slug for GitHub is GITHUB. You’ll need to update to 2.0.0-beta19 to access the new screen.

  • Users should first submit their GitHub username and access token to our integration endpoint at /integrations/github. You can then use our global endpoints for listing and syncing specific files in different repositories:

    • List files from repositories with the global endpoint /integrations/items/list

    • Sync files from repositories with the global endpoint /integrations/files/sync

  • See more specifics about our GitHub integration here.

Set Max Files Per Upload

  • A new user-level parameter, max_files_per_upload, has been introduced that can be modified via the /update_users endpoint. It determines the maximum number of files a user can upload in a single request.

    • Files that exceed the maximum number of files will be moved into the SYNC_ERROR status with webhooks being fired to alert you.

  • You can check the file_single_upload_limit set for a particular user via the user endpoint.

  • Find more details here.

  • Important Update: The parameter max_files now serves to establish the overall file upload limit for a user across all uploads.

Add include_all_children to Embeddings Endpoint

  • Added param include_all_children to the embeddings endpoint. When this param is set to true, the search is run over all filtered files as well as their children.

  • Filters applied to the endpoint extend to the returned child files.

In-House File Picker for Confluence and Salesforce

  • We’re excited to introduce our in-house file picker, starting with Confluence and Salesforce. Our in-house file picker is still in beta, but you can test it out by manually running npm install carbon-connect@2.0.0-beta13

  • With this update, end users gain the ability to directly select and upload specific files from Confluence and Salesforce. Previously, this functionality was unavailable as neither platform offered their own dedicated file pickers.

  • Our file picker is enabled when syncFilesOnConnection is set to false.

  • Here’s a quick walkthrough I recorded.

Hiding 3rd-Party File Picker

  • The endpoints /integrations/oauth_url and /integrations/connect now support a new boolean parameter named enable_file_picker.

    • When enable_file_picker is set to true (default behavior), a button will be displayed on the success page. Clicking this button will open the file picker associated with the respective source. This is the standard behavior.

    • Conversely, setting enable_file_picker to false will hide the file picker button on the success page. In such cases, end users will be directed to use custom or in-house file pickers for file selection.

Sync Outlook and Gmail Attachments

  • We’ve introduced a new property called sync_attachments, which can be specified when syncing via /integrations/gmail/sync and /integrations/outlook/sync endpoints. By default, this property is set to false.

  • Setting sync_attachments to true enables Carbon to automatically sync file attachments from corresponding emails. This includes not only traditional file attachments but also files (such as images) that are added in-line within emails.

  • Each file attachment will be assigned a unique file_id, with the parent_id corresponding to the email the file was attached to.

  • Please note that the same rules that apply to our file uploads also apply to attachments in terms of file size and supported extensions.

Set User File Limits

  • You have the flexibility to set the maximum number of files that a unique customer ID can upload using the file_upload_limit field on the update_users endpoint.

  • This value can be adjusted as needed, allowing you to tailor it according to your own plan limits.

  • Then you can check the upload limit set for a specific user via the custom_limits object on the user endpoint.

  • See details here.

Flags for OCR

  • Added ocr_job_started_at to the user_files_v2 response to denote whether OCR was enabled for a particular file.

  • Added additional OCR properties to be returned via ocr_properties, including whether table parsing was enabled.

  • See details here.

Role Management in Customer Portal

  • You now have the ability to manage who in your organization can create, delete, and view API keys.

  • Here’s a breakdown of the current roles available:

    • Admin: This role is empowered to both create and delete API keys.

    • User: Users with this role can view API keys.

  • Moving forward, these roles will determine user permissions and access across different sections of the Carbon Customer Portal.

  • You can access the customer portal via portal.carbon.ai

Expanded OCR Support in Carbon Connect

  • The prop useOCR can now be enabled on the integration level for the following connectors (in addition to local files):

    • OneDrive

    • Dropbox

    • Box

    • Google Drive

    • Zotero

    • SharePoint

  • The prop parsePdfTablesWithOcr can now be enabled on the integration level to parse tables with OCR when useOCR is set to true.

  • Please note OCR support is only applicable for PDFs at the moment.

  • You can find more details here.

Return chunk_index on the /embeddings Endpoint

  • We now return the chunk_index for specific chunks returned via the /embeddings endpoint.

  • You can find more details here.

Migrations between Embedding Models

  • You can now request migrations between embedding models with minimal downtime.

  • Email me if you’re interested. The cost per migration (not including embedding token costs) starts at $850 one-time.

New request_id Field

  • Carbon now accommodates the inclusion of a request_id within OAuth URLs, global sync endpoints, and custom sync endpoints (such as Gmail, Outlook, etc.), allowing users to define it as needed. Non-OAuth URL endpoints that auto-sync upon connection (e.g., Freshdesk, Gitbook) also support this value. The request_id can then be used to filter files through user_files_v2.

  • With Carbon Connect, setting the useRequestIds parameter to true will trigger automatic assignment of the request_id. This request_id will be returned in INITIATE and ADD/UPDATE callbacks.

    • It’s essential to note that this configuration adjustment is applicable at the component level rather than the integration level.

    • This enhancement is part of version 2.0.0-beta8.

    • Find more details here.

syncFilesOnConnection For More Data Sources

  • We’ve added the sync_files_on_connection parameter to the oauth_url endpoint for the following data sources: Intercom, Salesforce, Zendesk, Confluence, Freshdesk, and Gitbook.

  • This parameter is also accessible for each enabledIntegration in Carbon Connect. You can find more information about this here.

  • By default, this parameter is set to true. When enabled, all files will be synchronized automatically after a user connects their account. This is particularly useful when a user connects a data source that doesn’t have a built-in file picker.

Delete Child Files Based on Parent ID

  • Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false.

  • Find more details here.

upload_chunks_and_embeddings Updates

  • You can now upload only chunks to Carbon via the upload_chunks_and_embeddings endpoint, and we can generate the embeddings for you. This is useful when you want to migrate between embedding models and vector databases.

  • In the API request, you can exclude embeddings and set chunks_only to true. Then, include your embedding model API key (OpenAI or Cohere) under custom_credentials.

{ "api_key": "lkdsjflds" }

  • Make sure to include some delay between requests. There are also stricter limits on how many embeddings/chunks can be uploaded per request if chunks_only is true. Each request can only include 100 chunks.
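
  • A minimal Python sketch of the batching described above: split chunks into groups of at most 100 and pause between requests. The send callback, the "chunks" key, and the overall body layout are assumptions made for illustration; only chunks_only, custom_credentials, and the 100-chunk limit come from the notes above.

```python
import time

MAX_CHUNKS_PER_REQUEST = 100  # limit stated above for chunks_only requests


def batch_chunks(chunks, batch_size=MAX_CHUNKS_PER_REQUEST):
    """Yield consecutive slices of `chunks`, each at most `batch_size` long."""
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


def upload_in_batches(chunks, send, api_key, delay_seconds=1.0):
    """Call `send(body)` once per batch, sleeping between requests.

    `send` is a caller-supplied function (hypothetical) that performs the
    actual HTTP request. Returns the number of requests made.
    """
    requests_made = 0
    for index, batch in enumerate(batch_chunks(chunks)):
        if index > 0:
            time.sleep(delay_seconds)  # space requests out
        send({
            "chunks_only": True,
            "custom_credentials": {"api_key": api_key},
            "chunks": batch,  # assumed field name, check the API reference
        })
        requests_made += 1
    return requests_made
```

  • Passing the request function in as send keeps the batching logic independent of whichever HTTP client or SDK you use.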

Data Source Connections with Pre-Existing Auth

  • If you’re using our white labeling add-on, we added a new POST endpoint /integrations/connect so customers can bypass the authentication flow on Carbon by directly passing in an access token.

  • The request takes an authentication object that contains all the data needed to connect to a user’s account. The object varies by data source, and a list of the required keys can be found in our docs. If the connection is successful, the upserted data source will be returned.

  • This endpoint also returns a sync url for some data source types that will initiate the sync process.

Improvements to CSV, TSV, XLSX, GSheet Parsing

  • You now have the option to chunk CSV, TSV, XLSX, and Google Sheets files by tokens via chunk_size and/or by rows via max_items_per_chunk. When a file is processed, we add rows to a chunk until adding the next row would exceed chunk_size or max_items_per_chunk.

  • If a single row exceeds chunk_size or the embedding model’s limit for number of tokens, then the file’s sync_error_message will point out which row has too many tokens.

  • For example:

  • If each CSV row is 250 tokens, chunk_size of 800 tokens and no max_items_per_chunk set, then each chunk will contain 3 CSV rows.

  • If each CSV row is 250 tokens, chunk_size of 800 tokens and max_items_per_chunk set to 1, then each chunk will contain 1 CSV row.

  • Consequently, it is essential to ensure that the number of tokens in a CSV row does not surpass the token limits established by the embedding models. Token counting is currently only supported for OpenAI models.

  • You can find more details here.
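
  • To make the chunking rule above concrete, here is a small Python sketch that packs rows into chunks from their token counts. It is a simplified model written for illustration, not Carbon's parser; the function name and the choice to raise an error for an oversized row are assumptions.

```python
def pack_rows(row_token_counts, chunk_size, max_items_per_chunk=None):
    """Group row indices into chunks.

    A chunk is closed before adding a row that would push it past
    `chunk_size` tokens or past `max_items_per_chunk` rows.
    """
    chunks, current, current_tokens = [], [], 0
    for index, tokens in enumerate(row_token_counts):
        if tokens > chunk_size:
            # Mirrors the documented behavior of reporting the offending row.
            raise ValueError(f"row {index} has {tokens} tokens, over chunk_size {chunk_size}")
        too_many_tokens = current_tokens + tokens > chunk_size
        too_many_rows = (max_items_per_chunk is not None
                         and len(current) >= max_items_per_chunk)
        if current and (too_many_tokens or too_many_rows):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


print(pack_rows([250] * 6, chunk_size=800))  # [[0, 1, 2], [3, 4, 5]]
print(pack_rows([250] * 3, chunk_size=800, max_items_per_chunk=1))  # [[0], [1], [2]]
```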

Improvements to OCR

  • Table parsing in PDFs has been improved significantly with this most recent OCR update.

  • In order to use the enhanced table parsing features, you need to set parse_pdf_tables_with_ocr to true when uploading PDFs (use_ocr must also be true).

    • Any tables parsed when parse_pdf_tables_with_ocr is true have their own chunk(s) assigned to them. These chunks can be identified by the presence of the string TABLE in embedding_metadata.block_types.

    • The format of these tabular chunks will be the same format as CSV-derived chunks.

    • Using this table-parsing feature in conjunction with hybrid search should provide much better search results than before (assuming the PDF has tables that need to be searched over).

  • If you’re using OCR we now also return metadata such as coordinates and page numbers even if set_page_as_boundary is set to false.

    • Specifically, we will return the bounding box coordinates as well as the start and end page number of the chunk.

    • If pg_start < pg_end, the bounding box coordinates should be interpreted slightly differently. x1 and x2 correspond to the minimum x1 and maximum x2 over all pages for the chunk. y1 corresponds to the upper-most coordinate of the part of the chunk on pg_start, and y2 corresponds to the bottom-most coordinate of the part of the chunk on pg_end.
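
  • The multi-page bounding box rule above can be sketched in Python as follows. The input shape (a dict mapping page number to a per-page box) and the function name are hypothetical, and the snippet only illustrates how the combined coordinates relate to the per-page ones.

```python
def merge_bounding_boxes(page_boxes):
    """Combine the per-page boxes of one chunk into a single box.

    `page_boxes` maps page number -> {"x1", "x2", "y1", "y2"} for the part
    of the chunk that sits on that page (hypothetical input shape).
    """
    pg_start, pg_end = min(page_boxes), max(page_boxes)
    return {
        "pg_start": pg_start,
        "pg_end": pg_end,
        "x1": min(box["x1"] for box in page_boxes.values()),
        "x2": max(box["x2"] for box in page_boxes.values()),
        "y1": page_boxes[pg_start]["y1"],  # upper-most coordinate on the first page
        "y2": page_boxes[pg_end]["y2"],    # bottom-most coordinate on the last page
    }


boxes = {
    3: {"x1": 0.10, "x2": 0.80, "y1": 0.70, "y2": 0.95},  # chunk starts low on page 3
    4: {"x1": 0.05, "x2": 0.90, "y1": 0.05, "y2": 0.30},  # and ends near the top of page 4
}
print(merge_bounding_boxes(boxes))
```

  • Note that y2 can be smaller than y1 for a chunk spanning pages, because the two values refer to different pages.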

Carbon Connect 2.0 (Beta)

  • We are thrilled to announce the beta launch of Carbon Connect 2.0, with the following improvements:

  • Support multiple active accounts per data source.

  • Improved data source list.

  • Built-in user interface for users to view and re-sync files per account.

  • Ability for users to directly disconnect active accounts.

  • To install Carbon Connect 2.0, please run npm install carbon-connect@2.0.0-beta5. It is not treated as the latest version of Carbon Connect, so you won’t get this version automatically.

  • A few other important updates for Carbon Connect 2.0:

  • We’ve made a change to remove file details from the payload of UPDATE callbacks. If you used to get files in this way, you’ll now need to switch to using our SDK or API to get the updated files when a data source updates.

  • When specifying embedding models, make sure to use this format: embeddingModel={EmbeddingGenerators.OPENAI_ADA_LARGE_1024}, rather than a plain string.

  • You can hide our built-in UI for viewing and re-syncing files using the showFilesTab param on either the global component or enabledIntegration level.

Scheduled Syncs Per User and Data Source

  • Control user and data source syncing using the /update_users endpoint, which lets organizations enable syncing for particular users and data source types. The endpoint accepts a list of user IDs and data sources, with an option to enable syncing for all sources using the string 'ALL'.

    • Each request supports up to 100 customer IDs.

  • In the following example, future Gmail accounts for specified users will automatically have syncing enabled according to the provided settings.

{ "customer_ids": ["swapnil@carbon.ai", "swapnil.galaxy@gmail.com"], "auto_sync_enabled_sources": ["GMAIL"] }

  • Find more details in our documentation here.

  • Note: This update is meant to replace our file-level sync logic and any existing auto-syncs have been migrated over to use this updated logic.
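
  • A hedged Python sketch of preparing /update_users request bodies that respect the 100-customer-ID limit mentioned above. The helper name is hypothetical, and passing the string 'ALL' in place of the list of sources is an assumption about how that option is expressed.

```python
def build_auto_sync_requests(customer_ids, sources, max_ids_per_request=100):
    """Split customer IDs into request bodies of at most 100 IDs each.

    `sources` is either a list of data source names such as ["GMAIL"] or
    the string "ALL" (assumed to replace the list).
    """
    if sources != "ALL" and not isinstance(sources, list):
        raise ValueError('sources must be a list of names or the string "ALL"')
    bodies = []
    for start in range(0, len(customer_ids), max_ids_per_request):
        bodies.append({
            "customer_ids": customer_ids[start:start + max_ids_per_request],
            "auto_sync_enabled_sources": sources,
        })
    return bodies


ids = [f"user-{n}" for n in range(230)]
print([len(body["customer_ids"]) for body in build_auto_sync_requests(ids, ["GMAIL"])])
# [100, 100, 30]
```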

Delete Files Based on Filters

  • We added the /delete_files_v2 endpoint, which allows customers to delete files via the same filters as /user_files_v2.

  • We plan to deprecate the /delete_files endpoint in a month.

  • Find more details in our documentation here.

Filtering for Child Files

  • We added the ability to include all descendant (child) files on both /delete_files_v2 and /user_files_v2 when filtering.

  • Filters applied to the endpoint extend to the returned child files.

  • We plan to deprecate the parent_file_ids filter on the /user_files_v2 endpoint in a month.

Customer Portal v1

  • We’ve officially launched v1 of our Customer Portal - portal.carbon.ai

  • You can currently manage your API keys directly via the Portal, and we plan to release the following functionality next quarter:

    • User management

    • Usage monitoring

    • Billing management

  • For current customers, you can reset your password with the email provided to Carbon to gain access. If you don’t know the email you have on file, DM me!

integration/items/list Improvements

  • We are implementing four distinct filters: external_ids, ids, root_files_only, and name, each meant to filter data based on their respective fields.

    • The root_files_only filter will exclusively return top-level files. However, if a parent_id is specified, then root_files_only can’t be specified and vice versa.

  • The external_url has been added to the response body of the integrations/items/list endpoint.

  • See more details here.
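
  • As an illustration of the rule that root_files_only and parent_id cannot be combined, the Python sketch below collects the filter values and rejects the invalid combination before a request is made. The helper name is hypothetical, and where each value belongs in the real request body is not shown here.

```python
def build_items_list_filters(external_ids=None, ids=None, name=None,
                             root_files_only=None, parent_id=None):
    """Collect the specified filter values into one dict.

    Raises ValueError if root_files_only and parent_id are both given,
    since the two options are mutually exclusive.
    """
    if root_files_only is not None and parent_id is not None:
        raise ValueError("root_files_only and parent_id cannot both be specified")
    candidates = {
        "external_ids": external_ids,
        "ids": ids,
        "name": name,
        "root_files_only": root_files_only,
        "parent_id": parent_id,
    }
    # Drop anything the caller did not set.
    return {key: value for key, value in candidates.items() if value is not None}


print(build_items_list_filters(name="report", root_files_only=True))
# {'name': 'report', 'root_files_only': True}
```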

Multiple Active Accounts Per Data Source

  • Carbon now supports multiple active accounts per data connection!

  • We’ve introduced two new parameters across various API endpoints to support this functionality across all our connectors. While these parameters are optional for users with a single data source of each type, they become mandatory when managing multiple accounts.

    • /integrations/oauth_url

      • data_source_id: Specifies the data source from which synchronization should occur when dealing with multiple data sources of the same type.

      • connecting_new_account: This parameter is utilized to consistently generate an OAuth URL as opposed to a sync URL. A sync URL is the destination where users are redirected after a successful OAuth authentication to synchronize their files. While this parameter can be skipped when adding the first data source of that type, it should be explicitly specified for subsequent additions.

    • /integrations/s3/files, /integrations/outlook/sync, /integrations/gmail/sync

      • data_source_id: Used to specify the data source for synchronization when managing multiple data sources of the same type.

    • /integrations/outlook/user_folders, /integrations/outlook/user_categories, /integrations/gmail/user_labels

      • data_source_id: Specifies the data source to be utilized when there are multiple data sources of the same type.

  • Note that the following endpoints already have a mandatory requirement to pass in a data_source_id: /integrations/items/sync, /integrations/items/list, /integrations/files/sync, /integrations/gitbook/spaces, /integrations/gitbook/sync
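
  • The "optional with one account, mandatory with several" rule for data_source_id can be sketched in Python like this. The data source records (dicts with "id" and "type" keys) and the helper name are assumptions for illustration only.

```python
def resolve_data_source_id(data_sources, source_type, data_source_id=None):
    """Return the data_source_id to send, or None when it can be omitted.

    `data_sources` is a list of dicts with hypothetical keys "id" and "type"
    describing the accounts a user has already connected.
    """
    matching = [ds["id"] for ds in data_sources if ds["type"] == source_type]
    if data_source_id is not None:
        if data_source_id not in matching:
            raise ValueError(f"{data_source_id} is not a {source_type} data source")
        return data_source_id
    if len(matching) > 1:
        raise ValueError(
            f"multiple {source_type} accounts connected; data_source_id is required"
        )
    return None  # zero or one account of this type: the parameter is optional


connected = [
    {"id": 1, "type": "GMAIL"},
    {"id": 2, "type": "GMAIL"},
    {"id": 3, "type": "OUTLOOK"},
]
print(resolve_data_source_id(connected, "OUTLOOK"))                 # None
print(resolve_data_source_id(connected, "GMAIL", data_source_id=2))  # 2
```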

New Embedding Models

  • We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.

  • To define the embedding model, utilize the embedding_model parameter in the POST body for the /embeddings and other API endpoints. By default, if no specific model is provided, the system will use OPENAI (the original Ada-2).

  • Find more details on the models available here.

Return HTML for Webpages

  • presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content for each web page.

  • parsed_text_url field still returns a pre-signed URL for the corresponding plain text.

  • Find more details here.

Return Website Tags in File Metadata

  • file_metadata field under user_files_v2 now returns og:image and og:description for each web page.

  • Find more details here.

Omit Content by CSS Selector 

  • You can now exclude specific CSS selectors from web scraping. This ensures that text content within these elements does not appear in the parsed plaintext, chunks, and embeddings. Useful for omitting irrelevant elements, such as headers or footers, which might affect semantic search results.

  • The web_scrape request object supports a new field:

  •  css_selectors_to_skip: Optional[list[str]] = []

  • Find more details here.

JSON File Support

  • We’ve added support for JSON files via local upload and 3rd party connectors.

  • How It Works:

    • The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.

    • max_items_per_chunk is a parameter that determines how many JSON objects to include in a single chunk.

    • A new chunk is created when either the max_items_per_chunk or the chunk_size limit is reached. For example:

      • If each JSON object is 250 tokens, chunk_size of 800 tokens and no max_items_per_chunk set, then each chunk will contain 3 JSON objects.

      • If each JSON object is 250 tokens, chunk_size of 800 tokens and max_items_per_chunk set to 1, then each chunk will contain 1 JSON object.

  • Learn more details here.
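
  • The flattening step described above can be approximated with a short recursive function. This is a sketch of the general technique (dot-separated paths, with integer components for list positions), not Carbon's actual parser, and the function name is hypothetical.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dot-separated paths.

    Top-level keys are kept as-is; nested keys become "parent.child", and
    list elements use their integer index as the path component.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


record = {"title": "Doc", "author": {"name": "Ada", "tags": ["x", "y"]}}
print(flatten(record))
# {'title': 'Doc', 'author.name': 'Ada', 'author.tags.0': 'x', 'author.tags.1': 'y'}
```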

Gitbook Connector

  • We launched our Gitbook integration today that syncs pages from any public and shared spaces.

  • The Carbon Connect enabledIntegrations value for Gitbook is GITBOOK.

  • Gitbook does not come with a pre-built file selector so we added 2 endpoints for listing and syncing Gitbook spaces:

    • List all Gitbook spaces with /integrations/gitbook/spaces (API Reference)

    • Sync multiple spaces at once with /integrations/gitbook/sync (API Reference)

  • You can also use our global endpoints for listing and syncing specific pages in Gitbook spaces:

    • List pages in spaces with the global endpoints /integrations/items/list

    • Sync pages in spaces with the global endpoint /integrations/files/sync

    • Note: Spaces are treated like folders via the Carbon API.

  • See more specifics about our Gitbook integration here.

  • Note: our Gitbook page parser is still in beta so feedback is much appreciated!

Delete Endpoint Update

  • We’re transitioning file deletion from sync to async processing.

  • This means that the FILE_DELETED webhook event will not fire immediately and instead fire when the file is actually deleted.

  • We are also limiting each /delete_files request to 50 files to reduce the load on our servers. We advise spacing out delete requests every 24 hours.

Pinecone Integration 

  • We’ve launched our Pinecone destination connector! We offer support for both pod-based and serverless offerings.

  • Carbon seamlessly updates your Pinecone instance with the latest embeddings upon processing user files. Users gain full access to Carbon’s API endpoints, including hybrid search for supported sparse vector storage.

  • Find more details here.

New Carbon SDKs

  • Moving forward, we will be able to provide support for a greater number of SDKs and promptly release SDK support for API updates. If there is a language for which you want us to add SDK support, we should be able to turn that around in less than a week.

  • We’re adding support for the following languages today:

  • The current Javascript SDK will continue to be supported for the next month, and it will be available longer term. However, new features that are introduced will only be supported in the new Typescript SDK moving forward.

Delete Users Endpoint

  • Added an endpoint /delete_users that takes an array of customer IDs and deletes all those users.

  • Deleting a user revokes all of the user’s oauth connections and deletes all their files, embeddings and chunks.

  • The request format is:

{ "customer_ids": ["USER_1", "USER_2", "USER_3"] }

  • Find more details here.

Salesforce Connector is Live

  • All articles from an end user’s Salesforce Knowledge can be listed and synced via the global API endpoints /integrations/items/list and /integrations/files/sync.

  • The Carbon Connect integration (launching tomorrow) will sync all articles by default.

  • The enabledIntegrations value is SALESFORCE.

  • You can find more info here.

Outlook Folders 

  • After connecting your Outlook account, you can use the /integrations/outlook/user_folders endpoint to list all of your folders in Outlook.

  • This includes both system folders like inbox and user-created folders.

  • Find more details here.

Gmail Labels 

  • After connecting a Gmail account, you can use the /integrations/gmail/user_labels endpoint to list all of your labels.

  • User-created labels will have the type user and Gmail’s default labels will have the type system.

  • Find more details here.

Carbon Connect Updates 

  • Added support for JSON file formats and maxItemsPerChunk param to specify the number of items to include in a specific chunk.

  • Added cssSelectorsToSkip to WEB_SCRAPE to define CSS Selectors to exclude when converting HTML to plaintext.

  • Added SALESFORCE as an enabledIntegration on Carbon Connect.

  • For Salesforce, we added a param syncFilesOnConnection that defaults to true and will automatically sync all pages from a user’s Salesforce account.

  • We’ll be adding this param to other connectors too, meaning you can automatically sync all files from connectors that don’t have built-in file selectors (Gitbook, Confluence, etc).

  • This parameter is also added to the /integrations/oauth_url endpoint as sync_files_on_connection and also defaults to true.


Freshdesk Connector is Live

  • All published articles from an end user’s Freshdesk knowledge base are synced when connected to Carbon.

  • The Carbon Connect enabledIntegrations value is FRESHDESK.

  • You can find more info here.

Speed Improvements to Hybrid Search

  • We improved the speed of hybrid search by a factor of 10 by creating sparse vector indexes on file upload vs. query time.

    • Steps to Enable:

      • Pass the following body to the /modify_user_configuration endpoint: { "configuration_key_name": "sparse_vectors", "value": { "enabled": true } }

      • Set the parameter generate_sparse_vectors to true via the /uploadfile endpoint.

  • We’ll be rolling out faster hybrid search support across 3rd party connectors in the upcoming weeks.

  • Find more details here and here.

Deleting Files based on Sync Status

  • You can now delete file(s) based on sync_status via the delete_files endpoint.

  • We added 2 parameters:

    • sync_statuses - parameter to pass a list of sync statuses for file deletion.

      • For example, { "sync_statuses": ["SYNC_ERROR", "QUEUED_FOR_SYNC"] }. When this parameter value is passed we will delete all files in the SYNC_ERROR and QUEUED_FOR_SYNC status that belong to the end user identified by customer-id in headers that made the request.

    • delete_non_synced_only - boolean parameter that limits deletion to files that have not been re-synced before.

      • For example, a previously synced Google Drive file enters the QUEUED_FOR_SYNC status again during a scheduled re-sync. Setting delete_non_synced_only to true would prevent this file from being deleted as well.

  • Files can be deleted in any status except SYNCING, EVALUATING_RESYNC, and QUEUED_FOR_OCR. Including any of these three statuses in the list will result in an error response; files in these statuses can only be deleted once they transition out of them.

  • Find more details here.
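
  • A small Python sketch of building a delete_files body from the two parameters above, with a client-side check for the three statuses that cannot be deleted. The helper name and the False default for delete_non_synced_only are assumptions.

```python
NON_DELETABLE_STATUSES = {"SYNCING", "EVALUATING_RESYNC", "QUEUED_FOR_OCR"}


def build_delete_by_status_body(sync_statuses, delete_non_synced_only=False):
    """Build a request body for deleting files by sync status.

    Raises ValueError if any requested status is one the endpoint rejects,
    so the mistake is caught before the request is sent.
    """
    blocked = sorted(NON_DELETABLE_STATUSES.intersection(sync_statuses))
    if blocked:
        raise ValueError(f"files in these statuses cannot be deleted: {blocked}")
    return {
        "sync_statuses": list(sync_statuses),
        "delete_non_synced_only": delete_non_synced_only,
    }


print(build_delete_by_status_body(["SYNC_ERROR", "QUEUED_FOR_SYNC"], delete_non_synced_only=True))
# {'sync_statuses': ['SYNC_ERROR', 'QUEUED_FOR_SYNC'], 'delete_non_synced_only': True}
```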

Carbon Connect Updates

  • Added support for the following functionalities in Carbon Connect (React component + JavaScript SDK):

    • Additional embedding models (OPENAI, AZURE_OPENAI, COHERE_MULTILINGUAL_V3 for text and audio files, and VERTEX_MULTIMODAL for image files).

    • Enable audio and image file support. Reference documentation on file formats available.

    • OCR support for PDFs from local file uploads via Carbon Connect.

    • Hybrid search supported.

Remove Customer-Id on Select Endpoints

  • We’re removing customer-id as a required header for the following endpoints where it is not required:

    • /auth/v1/white_labeling

    • /user

    • /webhooks

    • /add_webhook

    • /delete_webhook/{webhook_id}

    • /organization

Vector Database Integration

  • We are starting to build out direct integrations with vector database providers!

  • What this means:

    • After authenticating a vector database provider via API key, Carbon automatically synchronizes between user data sources and the embeddings within your vector database. Whenever a user file is processed, we handle the seamless update of your vector database with the latest embeddings.

    • You’ll have full access to all of Carbon’s API endpoints, including hybrid search if sparse vector storage is supported by your vector database.

    • Migrations between vector databases are made simple since Carbon provides a unified API to interface with all providers.

  • The first vector database integration we’re announcing is with Turbopuffer. Many more to come!

S3 Connector 

  • We launched our S3 connector today that enables syncing objects from buckets.

  • The Carbon Connect enabledIntegrations value for S3 is S3.

  • See more specifics about our S3 connector here.

File + Account Management Component (BETA)

  • Users can add and revoke access to accounts under each connection.

  • Users can view and select specific folders and files for sync.

  • The aim is to offer a pre-built file selector for integrations without their own.

  • The component is currently offered in React but we’ll add support for other frameworks soon.

  • You can find the npm package here. Please note it’s still in BETA so your feedback is much appreciated!

Expanding sort for user_files_v2

  • You can now sort by name, file_size and last_sync via the order_by field in the user_files_v2 body.

  • See more details here.

Support for audio file uploads via connectors

  • We’ve enabled support for audio files via the following connectors: S3, Google Drive, OneDrive, SharePoint, Box, Dropbox, Zotero.

  • See list of supported audio files here.

Google Verification

  • Carbon’s Google Connector is officially Google-verified. This means users will no longer see the warning screen when authenticating with Carbon’s Google connector.

OCR Public Preview

  • We’ve been rolling out support for OCR, starting with PDFs uploaded locally (images and data connectors to follow).

Exposing Sync Error Reasons

  • We are now exposing error messages under the sync_error_reason field for files entering SYNC_ERROR status.

  • You can find a list of common errors here and we’ll be updating this on an ongoing basis.

List and Sync Items from Data Sources

  • We’re introducing new functionalities that allow customers to synchronize and retrieve a comprehensive list of items such as files, folders, collections, articles, and more from a user’s data source. This enhancement empowers you to create an in-house file selection flow, while enabling Carbon to also provide a user-friendly file selector UI and convenient helper methods within our SDK.

  • You can find more details here.

Upload Chunks and Embeddings

  • Added /upload_chunks_and_embeddings endpoint to enable uploading of chunks and vectors to Carbon directly.

  • See more specific details here.

CARBON

Data Connectors for LLMs

COPYRIGHT @ 2024 JCDT DBA CARBON