Carbon Changelog

Monthly updates with the latest features added, improvements made and bugs squashed.

Video Embeddings Support

  • We now support embedding generation for videos, allowing you to run semantic search on the video content based on the similarity of a video snippet to the search query or the text within the video frames, similar to OCR.

    • /uploadfile now takes a new optional parameter called media_type, whose value comes from the FileContentTypes enum. If media_type isn’t provided, video files are processed as audio by default.

    • Videos are currently supported via the uploadfile and upload_file_from_url endpoints, but we’ll be adding support for third-party connectors and Carbon Connect soon.

  • We support the following video file formats:

    • AVI

    • FLV

    • MKV

    • MOV

    • MP4

    • MPEG

    • MPG

    • WEBM

    • WMV

  • The maximum file size is 1 GB, but it can be increased upon request.

  • See more details here.

  • Please note that video embedding generation takes much longer than text and image embeddings. For example, it took 60-90s to embed a 3-minute video.
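A pre-upload check against the format list and size limit above might look like the following. This is a minimal sketch, not part of Carbon's SDK: the function name check_video_upload is hypothetical, and it assumes the 1 GB limit means 1024^3 bytes and that formats are identified by file extension.

```python
from pathlib import Path

# Formats and size limit as listed in this changelog entry.
SUPPORTED_VIDEO_EXTENSIONS = {
    ".avi", ".flv", ".mkv", ".mov", ".mp4", ".mpeg", ".mpg", ".webm", ".wmv",
}
MAX_VIDEO_BYTES = 1024**3  # assumption: "1 GB" read as binary gigabytes


def check_video_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_VIDEO_EXTENSIONS:
        problems.append(f"unsupported video format: {extension or '(none)'}")
    if size_bytes > MAX_VIDEO_BYTES:
        problems.append("file exceeds the default 1 GB limit")
    return problems
```

Running a check like this client-side avoids waiting on a long upload that would be rejected anyway.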

Intercom Tickets Integration

  • We’re thrilled to announce that our Intercom connector now has support for tickets.

  • The /integrations/oauth_url and integrations/connect endpoints sync articles by default. To customize the sync behavior, use the file_sync_config parameter.

  • You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.

  • To start syncing ticket content, the Intercom scope should include:

    • To sync user articles only, add these scopes:

      • Read one admin

      • Read and List Articles

    • To sync both user articles and tickets, also add:

      • Read and list users and companies

      • Read tickets

  • The following ticket information is available as tags for filtering:

{ "ticket_type": "Support Request", "ticket_status": "resolved", "ticket_category": "Customer", "ticket_submitter": "example.user@projectmap.com", "ticket_assigned_team": "Technical", "ticket_assigned_admin": "swapnil@carbon.ai" }

  • Text chunks will include the conversation history (comments on the ticket).

  • You can find more details here.

New Webhook Statuses

  • Each created webhook now has a status of either ACTIVE or FLAGGED, which is returned in the response of the webhooks endpoint.

  • ACTIVE: The webhook is operating normally and successfully receiving events.

  • FLAGGED: The webhook URL failed to return a response more than 20 times within a 60-second window. This indicates a potential issue with your webhook URL that you should check. If a webhook is moved to the FLAGGED status, please contact us to update it.
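To make the threshold concrete, here is a small sliding-window counter that applies the same rule (more than 20 failures inside 60 seconds). It is an illustrative sketch only: the FailureWindow class is hypothetical, it is not Carbon's implementation, and it does not model the manual reset that requires contacting Carbon.

```python
from collections import deque


class FailureWindow:
    """Tracks delivery failures inside a sliding time window."""

    def __init__(self, max_failures: int = 20, window_seconds: float = 60.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, timestamp: float) -> str:
        """Record one failed delivery and return the resulting status."""
        self._failures.append(timestamp)
        # Drop failures that have fallen out of the window.
        while timestamp - self._failures[0] >= self.window_seconds:
            self._failures.popleft()
        # "More than 20 times" means the 21st failure in the window flags it.
        return "FLAGGED" if len(self._failures) > self.max_failures else "ACTIVE"
```

A counter like this can be useful on your side for alerting before a receiver gets close to the threshold.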

Incremental Syncs for Gmail and Outlook

  • We have introduced incremental syncs for the following endpoints for Gmail and Outlook:

    • /integrations/items/sync

    • /integrations/connect

    • /integrations/oauth_url

  • How It Works

    • By setting incremental_sync to true, only new or updated files since the last sync will be re-synced. This means that if a file has already been synced and hasn’t been modified, it will be skipped during the next sync.

    • If the embedding properties or tags of a file change between sync requests, those specific files will be re-synced.

    • Carbon sends a FILES_SKIPPED webhook event for files skipped during the incremental sync. The body of the webhook will contain a list of file_ids for files and a reason in additional_information.

  • This update addresses a common problem where files would be re-synced if a user went through the 3rd-party file selector to select files that had already been synced. With incremental syncs, this issue is resolved, ensuring that only truly new or updated files are synchronized.

  • Note: Incremental syncs are already enabled for Box, Dropbox, OneDrive and Google Drive.
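The skip-or-resync rule described under "How It Works" can be sketched as below. The SyncedFile record and its field names are hypothetical, chosen only to illustrate the decision; they are not Carbon's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SyncedFile:
    """Hypothetical snapshot of a file as seen at sync time."""
    modified_at: float            # source-side last-modified timestamp
    embedding_properties: tuple   # e.g. (embedding model, chunk size)
    tags: tuple                   # sorted (key, value) pairs


def needs_resync(previous: Optional[SyncedFile], current: SyncedFile) -> bool:
    """Return True if the file should be synced, False if it can be skipped."""
    if previous is None:
        return True  # never synced before
    if current.modified_at > previous.modified_at:
        return True  # file was updated at the source
    # Unmodified files are still re-synced if embedding properties or tags changed.
    return (current.embedding_properties != previous.embedding_properties
            or current.tags != previous.tags)
```

Files for which this returns False correspond to the ones reported in the skipped-files webhook.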

Aggregated Usage Metrics Update

  • We’re excited to announce several improvements to how we aggregate and expose file statistics across the API.

  • The following metrics will now be returned via the /organization and /user endpoints:

    • aggregate_file_size

    • aggregate_num_characters

    • aggregate_num_tokens

    • aggregate_num_embeddings

    • aggregate_num_files_by_source

    • aggregate_num_files_by_file_format

  • To fetch the most up-to-date metrics via the organization endpoint moving forward, take the following steps:

    1. Call the /organization/statistics endpoint, which takes no parameters, to submit a request to asynchronously re-aggregate organization file statistics.

    2. When the re-aggregation is complete, a webhook of the event type FILE_STATISTICS_AGGREGATED will be sent.

    3. After receiving that event, making a request to /organization will return the updated file statistics in the response body.

    4. Additionally, a timestamp of when the file statistics were last updated can be found in file_statistics_aggregated_at.
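Steps 2 to 4 can be sketched as a webhook handler that reads /organization only after the aggregation event arrives. This is a hedged sketch: the webhook_type key and the assumption that the aggregate fields sit at the top level of the /organization response are guesses to be checked against the API reference, and fetch_organization is a caller-supplied function standing in for the actual HTTP request.

```python
from typing import Callable, Optional

# Metric names as listed in this changelog entry.
AGGREGATE_FIELDS = (
    "aggregate_file_size",
    "aggregate_num_characters",
    "aggregate_num_tokens",
    "aggregate_num_embeddings",
    "aggregate_num_files_by_source",
    "aggregate_num_files_by_file_format",
)


def on_webhook(event: dict, fetch_organization: Callable[[], dict]) -> Optional[dict]:
    """Return fresh statistics once re-aggregation completes, else None."""
    # Assumption: the event type is carried under a "webhook_type" key.
    if event.get("webhook_type") != "FILE_STATISTICS_AGGREGATED":
        return None
    organization = fetch_organization()  # e.g. a GET request to /organization
    stats = {name: organization.get(name) for name in AGGREGATE_FIELDS}
    stats["file_statistics_aggregated_at"] = organization.get(
        "file_statistics_aggregated_at"
    )
    return stats
```

Passing the fetch function in keeps the handler independent of any particular HTTP client and easy to test.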

fileSyncConfig Property for Carbon Connect

  • We have added a new fileSyncConfig prop for Carbon Connect that is set at the component or integration level and accepts the following properties:

    • auto_synced_source_types (AutoSyncedSourceTypes array): An array specifying the types of sources to automatically sync files from.

    • sync_attachments (boolean): Set to true to enable synchronization of attachments, or false to disable attachment syncing. Applies to helpdesk tickets currently.

    • detect_audio_language (boolean): Set to true to enable automatic detection of audio language during file upload, or false to disable audio language detection.

Deepgram Audio Language Detection

  • This feature enables automatic language detection for audio file uploads.

    • Added a new optional query parameter detect_audio_language

    • When set to true, Deepgram will automatically detect the language of the uploaded audio file

    • Defaults to false if not specified

    • Applies to the upload_file_from_url and uploadfile endpoints.

Updated Webhook Event: FILE_SYNC_LIMIT_REACHED

  • We have improved the functionality of the FILE_SYNC_LIMIT_REACHED webhook event to provide more granular information when users exceed file upload limits. This event will now be triggered in the following scenarios:

    • When a user attempts to upload files that would cause them to exceed the maximum number of allowed files (max_files).

    • When a user tries to upload more files than the maximum allowed per upload (max_files_per_upload).

    • When a user exceeds the daily 2.5GB file sync limit (existing functionality).

  • To differentiate between the three different limit scenarios, we have introduced a new reason property in the event’s additional information. The reason property will have one of the following values:

    • Max files per upload limit exceeded.

    • Max files limit exceeded.

    • Organization daily limit for file sync has been reached.
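A handler can branch on the new reason property roughly as follows. This is a sketch under assumptions: the event is taken to be a dict with the reason nested under additional_information, the short labels on the right are made up for illustration, and trailing periods are stripped because it is unclear whether they are part of the literal value.

```python
# Reason strings from this changelog entry, stored without the trailing period.
LIMIT_REASONS = {
    "Max files per upload limit exceeded": "max_files_per_upload",
    "Max files limit exceeded": "max_files",
    "Organization daily limit for file sync has been reached": "daily_sync_limit",
}


def classify_limit_event(event: dict) -> str:
    """Map a FILE_SYNC_LIMIT_REACHED event to a short label, or 'unknown'."""
    info = event.get("additional_information") or {}
    reason = str(info.get("reason", "")).strip().rstrip(".")
    return LIMIT_REASONS.get(reason, "unknown")
```

Falling back to "unknown" keeps the handler safe if further reasons are introduced later.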

HTML File Support

  • We now support uploading .html files from local and third-party data sources.

  • Similar to other file formats, we provide the original .html file as well as a plain text version of the file as pre-signed URLs via the user_files_v2 endpoint.

Freshdesk Tickets Integration

  • We’re thrilled to announce that our Freshdesk connector now has support for tickets.

  • The /integrations/freshdesk and integrations/connect endpoints sync articles by default. To customize the sync behavior, use the file_sync_config parameter.

  • You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.

  • To start syncing ticket content, the Freshdesk API key should belong to a user with access to agents and tickets permissions.

  • The following ticket information is available as tags for filtering:

{ "ticket_type": "incident", "ticket_status": "open", "ticket_assignee": "swapnil+zen1@carbon.ai", "ticket_priority": "normal", "ticket_requester": "customer@example.com" }

  • Text chunks will include the conversation history (comments on the ticket).

  • You can find more details here.

New Webhook Type: SPARSE_VECTOR_GENERATION

  • We have introduced a new webhook event type, SPARSE_VECTOR_GENERATION, that is triggered when the queued status of sparse vector generation for a file changes. It is called SPARSE_VECTOR_QUEUE_STATUS and has object type CHUNK_LIST.

  • This new webhook includes an object in the additional_information with the key-name sparse_vector_queue_status. The object has two fields:

    • sparse_vector_queue_status, which can be one of queued, aborted, or failed

    • sparse_vector_queue_error, which is null unless sparse_vector_queue_status is aborted or failed

  • See more details here.

parent_file_id for Embeddings

  • The embeddings response now includes a parent_file_id field for each chunk returned.

  • This field can contain an integer value representing the ID of the parent file, or null if there is no parent file associated with the embedding.

SharePoint and OneDrive Folder Selection and Syncing

  • You can now select an entire folder for upload, and Carbon will automatically include all nested subfolders and files. This brings our SharePoint and OneDrive functionality in line with popular services like Google Drive, Dropbox and Notion.

  • We have also introduced auto-sync for SharePoint and OneDrive folders. Any new folders and files added to your selected parent folder will be automatically detected and synced by Carbon. To enable auto-sync on folders, the user will need to re-upload the folders through the 3rd-party file picker.

Dropbox Folder Selection and Syncing

  • You can now select an entire folder for upload, and Carbon will automatically include all nested subfolders and files.

  • We have also introduced auto-sync for Dropbox folders. Any new folders and files added to your selected parent folder will be automatically detected and synced by Carbon, which brings our Dropbox functionality in line with popular services like Google Drive and Notion.

Webhook for Files Skipped

  • To improve visibility into your file processing pipeline, we’ve added a new webhook event: FILES_SKIPPED.

  • This event is triggered whenever Carbon skips processing for one or more files, such as when a file exceeds the size limits imposed by a third-party integration. The webhook payload will include a list of external_file_ids for the affected files, as well as an additional_information field with details on why processing was skipped. This allows you to easily identify and handle files that couldn’t be processed.

Zendesk Tickets Integration

  • We’re thrilled to announce that our Zendesk connector now has support for tickets.

  • The integrations/oauth_url and integrations/connect endpoints now sync articles by default. To sync only tickets or both articles and tickets, use the file_sync_config parameter. The file_sync_config parameter can also enable syncing attachments from ticket comments.

  • You can now also view and sync tickets via the global endpoints /integrations/items/list and /integrations/files/sync.

  • To start syncing ticket content, users must disconnect and reconnect their accounts with the new scopes. Don’t worry, disconnecting won’t affect your files.

  • The following ticket information is available as tags for filtering:

{ "ticket_type": "incident", "ticket_status": "open", "ticket_assignee": "swapnil+zen1@carbon.ai", "ticket_priority": "normal", "ticket_requester": "customer@example.com", "ticket_submitter": "swapnil+zen1@carbon.ai" }

  • Text chunks will include the conversation history (comments on the ticket).

  • You can find more details here.

Carbon Connect 2.0 Exits Beta

  • Carbon Connect 2.0 has officially exited beta as version 2.0.0.

Incremental Syncs for Data Sources

  • We have introduced incremental syncs for the following endpoints:

    • /integrations/items/sync

    • /integrations/connect

    • /integrations/oauth_url

  • How It Works

    • By setting incremental_sync to true, only new or updated files since the last sync will be re-synced. This means that if a file has already been synced and hasn’t been modified, it will be skipped during the next sync.

    • If the embedding properties or tags of a file change between sync requests, those specific files will be re-synced.

    • Carbon sends a FILES_SKIPPED webhook event for files skipped during the incremental sync. The body of the webhook will contain a list of file_ids for files and a reason in additional_information.

  • This update addresses a common problem where files would be re-synced if a user went through the 3rd-party file selector to select files that had already been synced. With incremental syncs, this issue is resolved, ensuring that only truly new or updated files are synchronized.

  • Note: Incremental syncs are only enabled on certain sources to start, including Box, Dropbox, OneDrive and Google Drive.

Re-Sync Child Files Via Resync_File Endpoint

  • When a file-id that belongs to a parent file (i.e., a folder) is submitted for re-sync via the resync_file endpoint, the associated child files will now also be re-synced.

  • This enhancement ensures that all related files within a folder hierarchy are properly synced when the parent file is re-synced.

Post Messages for Third-Party File Pickers

  • External data sources that utilize third-party file pickers will now post messages containing data of the selected file to the parent window when they are used in an iframe.

  • The message will be structured in the following format:

{
  "event": "SELECTED",
  "data": list[{
    "external_id": str,
    "parent_external_id": str | null,
    "name": str,
    "url": str | null,
    "is_folder": bool,
    "file_format": str | null,
  }],
}

  • Note: Not all of the properties in the data list are available for every data source. For example, GDrive will have parent_external_id, but parent_external_id will always be null for Microsoft because its file picker does not return that data.

New Parameter include_containers

  • A new optional boolean parameter filters.include_containers has been added to the user_files_v2 API endpoint. This parameter allows you to control whether containers (folders) should be included in the API response.

    • When include_containers is set to false, the API will exclude folders from the response. This means that only files with actual content will be returned.

    • In addition to folders, the following types of files will also be excluded when include_containers is false:

      • RSS feed URLs

      • Email queries

      • GitBook spaces

      • GitHub directories

  • These excluded files typically group other files together but do not have any content themselves.

  • The default behavior of user_files_v2 remains unchanged. If the include_containers parameter is not provided or is set to true, folders will be included in the API response as before.

File Statistics Now Include MIME Type

  • file_statistics under the user_files_v2 endpoint now return the MIME type of the file, providing more detailed information about each file.

Organization-Level User Settings

  • Introduced the ability to configure user settings at the organization level.

  • Use the /organization/update endpoint with the global_user_config parameter to set the following organization-wide user settings:

    • auto_sync_enabled_sources

    • max_files

    • max_files_per_upload

  • Find more details here.

Customizable Sync Page Copy

  • Organizations now have the ability to customize the copy on the sync page after a user has connected to an external source.

  • Customizable elements include:

    • Header text

    • Subheader text

    • Button text

  • To update the sync page copy, DM us with the requested changes. This is a white label-specific feature.

  • Please note that success and error messages are not customizable at this time.

File List for Local File Uploads

  • Added a new screen in Carbon Connect 2.0 (2.0.0-beta25) that displays a list of files uploaded locally by the user.

  • Use the showFilesTab configuration option to control whether this view is visible.

Limit File Uploads by Type

  • Organizations can now restrict the types of files that can be uploaded to Carbon.

    • File extension restrictions can be set per data source or globally for a given organization.

    • Users can still select disallowed file formats from the file picker, but these files will be ignored during the upload process.

  • To enable this feature, provide Carbon with a list of allowed file extensions, which must be a subset of Carbon’s supported file formats. A dedicated API endpoint will be coming soon!

New GitHub Endpoints

  • We’ve added two new endpoints to enhance the usability of the GitHub connector:

    • /integrations/github/repos: This endpoint allows users to retrieve a list of their GitHub repositories.

    • /integrations/github/sync_repos: This endpoint accepts a list of GitHub repository IDs, enabling users to list items from the specified repositories.

  • These new endpoints provide a more streamlined and efficient way to interact with GitHub repositories within Carbon.

GitHub Repository Selection Screen

  • We’ve introduced a dedicated screen in Carbon Connect 2.0 (2.0.0-beta24) for selecting GitHub repositories.

  • This new feature allows users to easily choose the repositories they want to sync and list items from. The repository selection screen is automatically displayed whenever a user connects their GitHub account.

  • This enhancement simplifies the process of managing GitHub repositories within Carbon Connect, providing a more intuitive and user-friendly experience.

Enhancements to Item Listing

  • We’ve added a new parameter called sync_source_items (or syncSourceItems in Carbon Connect) to give users more control over item syncing. By setting this parameter to false, users can prevent listing items from the corresponding connector.

  • By default, sync_source_items is set to true for all connectors, except for GitHub, where it is set to false. This default behavior for GitHub helps prevent rate limit-related sync issues with GitHub.

  • This enhancement provides users with greater flexibility in managing item syncing across different connectors.

Sorting Options for Source Items

  • We’ve introduced new sorting parameters, order_by and order_dir, for source items (/integrations/items/list). Users can now choose to sort items by the following criteria:

    • id: Sort items by their unique identifier.

    • name: Sort items alphabetically by their name.

    • directories_first: Sort folders first, followed by the remaining items. Both folders and files are sorted by name.

  • By default, items are sorted by name in ascending order (asc), maintaining the existing behavior. Please note that when directories_first is selected, the order_dir parameter is ignored.
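The three sort modes can be mimicked client-side as in the sketch below. It assumes each item is a dict with id, name and is_folder keys (hypothetical field names) and uses plain case-sensitive string comparison, which may differ from the ordering the API applies.

```python
def sort_items(items: list, order_by: str = "name", order_dir: str = "asc") -> list:
    """Return items sorted the way order_by and order_dir are described above."""
    if order_by == "directories_first":
        # order_dir is ignored here: folders first, each group sorted by name.
        return sorted(items, key=lambda item: (not item["is_folder"], item["name"]))
    if order_by not in ("id", "name"):
        raise ValueError(f"unsupported order_by: {order_by}")
    return sorted(items, key=lambda item: item[order_by],
                  reverse=(order_dir == "desc"))
```

The tuple key places folders first because False sorts before True, then orders each group by name.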

External URLs in Salesforce

  • We now return the external URL for Salesforce Knowledge articles for Lightning users.

Support for Solar Embeddings

  • Exciting news! We’ve integrated Upstage’s Solar Embeddings into our platform, offering you a powerful new embedding model on Carbon.

  • To use this embedding model, specify the slug SOLAR for embedding_model.

  • You can find more details here.

FILE_CREATED for Web Scrape

  • We have expanded the FILE_CREATED webhook events to fire when files are generated from web scraping requests.

IS_RESYNC for FILE_READY Webhook

  • We’ve added a new boolean property additional_information.is_resync to the FILE_READY webhook event.

    • When it is false, the file was synced for the first time.

    • When it is true, the file was already synced previously so the current sync is a re-sync.

Carbon Connect 2.0 Is Exiting Beta

  • Carbon Connect 2.0 is exiting beta by this Friday!

  • This means if you run npm install carbon-connect moving forward and do not specify a version, we’ll install 2.0 by default.

  • If you need help or have any questions moving over to Carbon Connect 2.0, DM me.

Loading Screen for Carbon Connect 2.0 (carbon-connect@2.0.0-beta22)

  • We added a new component-level prop, loadingIconColor, which defines the color of the loader icon. It can be specified as a standard CSS color name, a hexadecimal (hex) code, or RGB color values.

Support for Google Drive Shortcuts

  • Users can now seamlessly sync Google Drive shortcuts to reference the files and folders they point to.

    • How It Works:

      • For shortcuts within folders, a file object will be generated. When this shortcut file is synced, it will also synchronize its targeted file separately, though not as a child. Please note, there is no hierarchical relationship between a shortcut and its target.

      • If the shortcut is directly selected from Google’s file picker, a shortcut file object will not be created. Instead, the target will be synced directly.

      • Importantly, the shortcut file itself will not contain any parsed text or chunks. Instead, it acts as a pointer, with the file_metadata.target_external_file_id attribute identifying the file the shortcut targets.

New Webhook Events

  • We’ve introduced 2 additional webhook events to help track file sync statuses:

    • FILE_CREATED: This event is fired when a user queues up a file to be synced for the first time. The body of the webhook will contain a list of file_ids for files that were created in the same upload, and multiple events could fire for the same upload if a lot of files were queued.

    • ALL_UPLOADED_FILES_QUEUED: This event is fired when every single item in an upload has been queued for sync, including all children of folders in an upload. The body will contain the upload’s request_id.

  • A couple of notes:

    • Both file_ids and request_ids can be used to filter for the files in /user_files_v2.

    • A request_id is now always generated for an upload to support the ALL_UPLOADED_FILES_QUEUED webhook. Previously, a request_id only existed if you generated one (unless you’re using Carbon Connect) and passed it to us as a parameter. You may still do that and we’ll use your generated request_id, but if you don’t, we’ll generate a request_id on behalf of the user’s upload.

    • These two webhooks currently are supported for 3rd party data sources only. Support for web scrapes and local file uploads will be coming soon.

  • You can find more details here.

GitHub Connector

  • We launched our GitHub integration today that syncs pages from both public and private repositories.

  • The Carbon Connect enabledIntegration slug for GitHub is GITHUB. You’ll need to update to 2.0.0-beta19 to access the new screen.

  • Users should first submit their GitHub username and access token to our integration endpoint at /integrations/github. You can then use our global endpoints for listing and syncing specific files in different repositories:

    • List files from repositories with the global endpoint /integrations/items/list

    • Sync files from repositories with the global endpoint /integrations/files/sync

  • See more specifics about our GitHub integration here.

Set Max Files Per Upload

  • A new user-level parameter, max_files_per_upload, has been introduced that can be modified via the /update_users endpoint. It determines the maximum number of files a user can upload in a single request.

    • Files that exceed the maximum number of files will be moved into the SYNC_ERROR status with webhooks being fired to alert you.

  • You can check the file_single_upload_limit set for a particular user via the user endpoint.

  • Find more details here.

  • Important Update: The parameter max_files now serves to establish the overall file upload limit for a user across all uploads.

Add include_all_children to Embeddings Endpoint

  • Added param include_all_children to the embeddings endpoint. When this param is set to true, the search is run over all filtered files as well as their children.

  • Filters applied to the endpoint extend to the returned child files.

In-House File Picker for Confluence and Salesforce

  • We’re excited to introduce our in-house file picker, starting with Confluence and Salesforce. Our in-house file picker is still in beta, but you can test it out by manually running npm install carbon-connect@2.0.0-beta13.

  • With this update, end users gain the ability to directly select and upload specific files from Confluence and Salesforce. Previously, this functionality was unavailable as neither platform offered their own dedicated file pickers.

  • Our file picker is enabled when syncFilesOnConnection is set to false.

  • Here’s a quick walkthrough I recorded.

Hiding 3rd-Party File Picker

  • The endpoints /integrations/oauth_url and /integrations/connect now support a new boolean parameter named enable_file_picker.

    • When enable_file_picker is set to true (default behavior), a button will be displayed on the success page. Clicking this button will open the file picker associated with the respective source. This is the standard behavior.

    • Conversely, setting enable_file_picker to false will hide the file picker button on the success page. In such cases, end users will be directed to use custom or in-house file pickers for file selection.

Sync Outlook and Gmail Attachments

  • We’ve introduced a new property called sync_attachments, which can be specified when syncing via /integrations/gmail/sync and /integrations/outlook/sync endpoints. By default, this property is set to false.

  • Setting sync_attachments to true enables Carbon to automatically sync file attachments from corresponding emails. This includes not only traditional file attachments but also files (such as images) that are added in-line within emails.

  • Each file attachment will be assigned a unique file_id, with the parent_id corresponding to the email the file was attached to.

  • Please note that the same rules that apply to our file uploads also apply to attachments in terms of file size and supported extensions.

Set User File Limits

  • You have the flexibility to set the maximum number of files that a unique customer ID can upload using the file_upload_limit field on the update_users endpoint.

  • This value can be adjusted as needed, allowing you to tailor it according to your own plan limits.

  • Then you can check the upload limit set for a specific user via the custom_limits object on the user endpoint.

  • See details here.

Flags for OCR

  • Added ocr_job_started_at to the user_files_v2 response to denote whether OCR was enabled for a particular file.

  • Added additional OCR properties to be returned via ocr_properties, including whether table parsing was enabled.

  • See details here.

Role Management in Customer Portal

  • You now have the ability to manage who in your organization can create, delete, and view API keys.

  • Here’s a breakdown of the current roles available:

    • Admin: This role is empowered to both create and delete API keys.

    • User: Users with this role can view API keys.

  • Moving forward, these roles will determine user permissions and access across different sections of the Carbon Customer Portal.

  • You can access the customer portal via portal.carbon.ai.

Expanded OCR Support in Carbon Connect

  • The prop useOCR can now be enabled on the integration level for the following connectors (in addition to local files):

    • OneDrive

    • Dropbox

    • Box

    • Google Drive

    • Zotero

    • SharePoint

  • The prop parsePdfTablesWithOcr can now be enabled on the integration level to parse tables with OCR when useOCR is set to true.

  • Please note OCR support is only applicable for PDFs at the moment.

  • You can find more details here.

Return chunk_index on the /embeddings Endpoint

  • We now return the chunk_index for specific chunks returned via the /embeddings endpoint.

  • You can find more details here.

Migrations between Embedding Models

  • You can now request migrations between embedding models with minimal downtime.

  • Email me if you’re interested. The cost per migration (not including embedding token costs) starts at $850 one-time.

New request_id Field

  • Carbon now accepts a request_id in OAuth URLs, global sync endpoints, and custom sync endpoints (such as Gmail, Outlook, etc.), allowing users to define it as needed. Non-OAuth URL endpoints that auto-sync upon connection (e.g., Freshdesk, GitBook) also support this value. The request_id serves as a filter for files through user_files_v2.

  • With Carbon Connect, setting the useRequestIds parameter to true will trigger automatic assignment of the request_id. This request_id will be returned in INITIATE and ADD/UPDATE callbacks.

    • It’s essential to note that this configuration adjustment is applicable at the component level rather than the integration level.

    • This enhancement is part of version 2.0.0-beta8.

    • Find more details here.

syncFilesOnConnection For More Data Sources

  • We’ve added the sync_files_on_connection parameter to the oauth_url endpoint for the following data sources: Intercom, Salesforce, Zendesk, Confluence, Freshdesk, and GitBook.

  • This parameter is also accessible for each enabledIntegration in Carbon Connect. You can find more information about this here.

  • By default, this parameter is set to true. When enabled, all files will be synchronized automatically after a user connects their account. This is particularly useful when a user connects a data source that doesn’t have a built-in file picker.

Delete Child Files Based on Parent ID

  • Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false.

  • Find more details here.

upload_chunks_and_embeddings Updates

  • You can now upload only chunks to Carbon via the upload_chunks_and_embeddings endpoint, and we can generate the embeddings for you. This is useful for migrations where you want to migrate between embedding models and vector databases.

  • In the API request, you can exclude embeddings and set chunks_only to true. Then, include your embedding model API key (OpenAI or Cohere) under custom_credentials.

{ "api_key": "lkdsjflds" }

  • Make sure to include some delay between requests. There are also stricter limits on how many embeddings/chunks can be uploaded per request if chunks_only is true. Each request can only include 100 chunks.

Data Source Connections with Pre-Existing Auth

  • If you’re using our white labeling add-on, we added a new POST endpoint /integrations/connect so customers can bypass the authentication flow on Carbon by directly passing in an access token.

  • The request takes an authentication object that contains all the data needed to connect to a user’s account. The object varies by data source, and a list of the required keys can be found in our docs. If the connection is successful, the upserted data source will be returned.

  • This endpoint also returns a sync url for some data source types that will initiate the sync process.

Improvements to CSV, TSV, XLSX, GSheet Parsing

  • You can now chunk CSV, TSV, XLSX, and Google Sheets files by tokens via the chunk_size parameter and/or by rows via the max_items_per_chunk parameter. When a file is processed, we add rows to a chunk until adding the next row would exceed chunk_size or max_items_per_chunk.

  • If a single row exceeds chunk_size or the embedding model’s limit for number of tokens, then the file’s sync_error_message will point out which row has too many tokens.

  • For example:

  • If each CSV row is 250 tokens, chunk_size of 800 tokens and no max_items_per_chunk set, then each chunk will contain 3 CSV rows.

  • If each CSV row is 250 tokens, chunk_size of 800 tokens and max_items_per_chunk set to 1, then each chunk will contain 1 CSV row.

  • Consequently, make sure that the number of tokens in a CSV row does not exceed the token limit of the embedding model. Token counting is currently only supported for OpenAI models.

  • You can find more details here.
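The row-packing rule described above can be sketched as follows. This is only an illustration, not Carbon's implementation: the function name chunk_rows is hypothetical, and per-row token counts are assumed to be computed elsewhere and passed in.

```python
from typing import Optional


def chunk_rows(
    row_token_counts: list[int],
    chunk_size: int,
    max_items_per_chunk: Optional[int] = None,
) -> list[list[int]]:
    """Group row indexes into chunks, greedily, in file order."""
    chunks: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for index, tokens in enumerate(row_token_counts):
        # A single row larger than chunk_size can never fit in any chunk.
        if tokens > chunk_size:
            raise ValueError(
                f"row {index} has {tokens} tokens, exceeding chunk_size {chunk_size}"
            )
        too_many_tokens = current_tokens + tokens > chunk_size
        too_many_rows = (
            max_items_per_chunk is not None and len(current) >= max_items_per_chunk
        )
        # Close the current chunk when adding this row would break either limit.
        if current and (too_many_tokens or too_many_rows):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


rows = [250] * 6
by_tokens = chunk_rows(rows, chunk_size=800)
by_rows = chunk_rows(rows, chunk_size=800, max_items_per_chunk=1)
```

With six 250-token rows, by_tokens holds two chunks of three rows each and by_rows holds six single-row chunks, matching the two examples above.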

Improvements to OCR

  • Table parsing in PDFs has been improved significantly with this most recent OCR update.

  • In order to use the enhanced table parsing features, you need to set parse_pdf_tables_with_ocr to true when uploading PDFs (use_ocr must also be true).

    • Any tables parsed when parse_pdf_tables_with_ocr is true have their own chunk(s) assigned to them. These chunks can be identified by the presence of the string TABLE in embedding_metadata.block_types.

    • The format of these tabular chunks will be the same format as CSV-derived chunks.

    • Using this table-parsing feature in conjunction with hybrid search should provide much better search results than before (assuming the PDF has tables that need to be searched over).

  • If you’re using OCR we now also return metadata such as coordinates and page numbers even if set_page_as_boundary is set to false.

    • Specifically, we will return the bounding box coordinates as well as the start and end page number of the chunk.

    • If pg_start < pg_end, interpret the bounding box coordinates slightly differently. x1 and x2 correspond to the minimum x1 and maximum x2 over all pages for the chunk. y1 corresponds to the upper-most coordinate of the part of the chunk on pg_start, and y2 corresponds to the bottom-most coordinate of the part of the chunk on pg_end.
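To make the multi-page rule concrete, here is a small sketch of how a single reported box relates to the per-page pieces of a chunk. The PageBox type and merge_chunk_boxes helper are hypothetical names for illustration, and the sketch assumes one box per page; only pg_start, pg_end, x1, x2, y1, and y2 come from the notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageBox:
    """The part of a chunk that falls on one page (hypothetical type)."""
    page: int
    x1: float
    y1: float
    x2: float
    y2: float


def merge_chunk_boxes(boxes: list[PageBox]) -> dict:
    """Combine per-page boxes into one box using the rule described above."""
    ordered = sorted(boxes, key=lambda box: box.page)
    first, last = ordered[0], ordered[-1]
    return {
        "pg_start": first.page,
        "pg_end": last.page,
        # Horizontal extent spans every page the chunk touches.
        "x1": min(box.x1 for box in ordered),
        "x2": max(box.x2 for box in ordered),
        # y1 is taken from the first page, y2 from the last page.
        "y1": first.y1,
        "y2": last.y2,
    }


merged = merge_chunk_boxes([
    PageBox(page=3, x1=50, y1=600, x2=500, y2=780),
    PageBox(page=4, x1=40, y1=20, x2=480, y2=150),
])
```

Note that for a chunk spanning pages, y2 can be smaller than y1 because the two values refer to different pages.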

Carbon Connect 2.0 (Beta)

  • We are thrilled to announce the beta launch of Carbon Connect 2.0, with the following improvements:

  • Support multiple active accounts per data source.

  • Improved data source list.

  • Built-in user interface for users to view and re-sync files per account.

  • Ability for users to directly disconnect active accounts.

  • To install Carbon Connect 2.0, run npm install carbon-connect@2.0.0-beta5. It is not tagged as the latest version of Carbon Connect, so you won’t get this version automatically.

  • A few other important updates for Carbon Connect 2.0:

  • We’ve made a change to remove file details from the payload of UPDATE callbacks. If you used to get files in this way, you’ll now need to switch to using our SDK or API to get the updated files when a data source updates.

  • When you’re specifying embedding models, just make sure to use the format like this: embeddingModel={EmbeddingGenerators.OPENAI_ADA_LARGE_1024}, instead of just writing out a string.

  • You can hide our built-in UI for viewing and re-syncing files using the showFilesTab param on either the global component or enabledIntegration level.

Scheduled Syncs Per User and Data Source

  • Control scheduled syncing per user and data source with the /update_users endpoint, which lets organizations enable syncing for particular users and data source types. The endpoint accepts a list of user IDs and data sources, with an option to enable syncing for all sources using the string 'ALL'.

    • Each request supports up to 100 customer IDs.

  • In the following example, future Gmail accounts for specified users will automatically have syncing enabled according to the provided settings.

{ "customer_ids": ["swapnil@carbon.ai", "swapnil.galaxy@gmail.com"], "auto_sync_enabled_sources": ["GMAIL"] }

  • Find more details in our documentation here.

  • Note: This update is meant to replace our file-level sync logic and any existing auto-syncs have been migrated over to use this updated logic.
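Because each request supports up to 100 customer IDs, larger user lists need to be split across requests. A minimal sketch, assuming the body shape from the example above; the helper name is hypothetical and sending the requests is left to your HTTP client or the SDK.

```python
from typing import Iterator, Union

MAX_CUSTOMER_IDS_PER_REQUEST = 100  # limit stated above


def build_update_users_bodies(
    customer_ids: list[str],
    sources: Union[list[str], str],
) -> Iterator[dict]:
    """Yield one /update_users body per batch of at most 100 customer IDs.

    `sources` is either a list such as ["GMAIL"] or the string "ALL".
    """
    for start in range(0, len(customer_ids), MAX_CUSTOMER_IDS_PER_REQUEST):
        yield {
            "customer_ids": customer_ids[start:start + MAX_CUSTOMER_IDS_PER_REQUEST],
            "auto_sync_enabled_sources": sources,
        }


ids = [f"user-{n}@example.com" for n in range(230)]
bodies = list(build_update_users_bodies(ids, ["GMAIL"]))
```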

Delete Files Based on Filters

  • We added the /delete_files_v2 endpoint, which allows customers to delete files via the same filters as /user_files_v2.

  • We plan to deprecate the /delete_files endpoint in a month.

  • Find more details in our documentation here.

Filtering for Child Files

  • We added the ability to include all descendant (child) files on both /delete_files_v2 and /user_files_v2 when filtering.

  • Filters applied to the endpoint extend to the returned child files.

  • We plan to deprecate the parent_file_ids filter on the /user_files_v2 endpoint in a month.

Customer Portal v1

  • We’ve officially launched v1 of our Customer Portal - portal.carbon.ai

  • You can currently manage your API keys directly via the Portal, and we plan to release the following functionality next quarter:

    • User management

    • Usage monitoring

    • Billing management

  • For current customers, you can reset your password with the email provided to Carbon to gain access. If you don’t know the email you have on file, DM me!

integrations/items/list Improvements

  • We are implementing four distinct filters: external_ids, ids, root_files_only, and name, each meant to filter data based on their respective fields.

    • The root_files_only filter will exclusively return top-level files. However, if a parent_id is specified, then root_files_only can’t be specified and vice versa.

  • The external_url has been added to the response body of the integrations/items/list endpoint.

  • See more details here.
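The mutual exclusion between root_files_only and parent_id can be checked on the client before a request is sent. This is a sketch only: the filter names come from the notes above, but the helper name and the flat dictionary shape are assumptions, so consult the API reference for where these fields sit in the real request body.

```python
from typing import Optional


def build_items_list_filters(
    *,
    external_ids: Optional[list[str]] = None,
    ids: Optional[list[int]] = None,
    name: Optional[str] = None,
    root_files_only: Optional[bool] = None,
    parent_id: Optional[str] = None,
) -> dict:
    """Return only the filters that were set, rejecting invalid combinations."""
    if root_files_only is not None and parent_id is not None:
        raise ValueError("root_files_only and parent_id cannot be specified together")
    candidates = {
        "external_ids": external_ids,
        "ids": ids,
        "name": name,
        "root_files_only": root_files_only,
        "parent_id": parent_id,
    }
    # Drop unset filters so the request carries only what the caller asked for.
    return {key: value for key, value in candidates.items() if value is not None}


filters = build_items_list_filters(name="report", root_files_only=True)
```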

Multiple Active Accounts Per Data Source

  • Carbon now supports multiple active accounts per data connection!

  • We’ve introduced two new parameters across various API endpoints to support this functionality across all our connectors. While these parameters are optional for users with a single data source of each type, they become mandatory when managing multiple accounts.

    • /integrations/oauth_url

      • data_source_id: Specifies the data source from which synchronization should occur when dealing with multiple data sources of the same type.

      • connecting_new_account: This parameter is utilized to consistently generate an OAuth URL as opposed to a sync URL. A sync URL is the destination where users are redirected after a successful OAuth authentication to synchronize their files. While this parameter can be skipped when adding the first data source of that type, it should be explicitly specified for subsequent additions.

    • /integrations/s3/files, /integrations/outlook/sync, /integrations/gmail/sync

      • data_source_id: Used to specify the data source for synchronization when managing multiple data sources of the same type.

    • /integrations/outlook/user_folders, /integrations/outlook/user_categories, /integrations/gmail/user_labels

      • data_source_id: Specifies the data source to be utilized when there are multiple data sources of the same type.

  • Note that the following endpoints already require a data_source_id: /integrations/items/sync, /integrations/items/list, /integrations/files/sync, /integrations/gitbook/spaces, /integrations/gitbook/sync

New Embedding Models

  • We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.

  • To define the embedding model, utilize the embedding_model parameter in the POST body for the /embeddings and other API endpoints. By default, if no specific model is provided, the system will use OPENAI (the original Ada-2).

  • Find more details on the models available here.

Return HTML for Webpages

  • presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content for each web page.

  • parsed_text_url field still returns a pre-signed URL for the corresponding plain text.

  • Find more details here.

Return Website Tags in File Metadata

  • file_metadata field under user_files_v2 now returns og:image and og:description for each web page.

  • Find more details here.

Omit Content by CSS Selector 

  • You can now exclude specific CSS selectors from web scraping. This ensures that text content within these elements does not appear in the parsed plaintext, chunks, and embeddings. Useful for omitting irrelevant elements, such as headers or footers, which might affect semantic search results.

  • The web_scrape request object supports a new field:

  • css_selectors_to_skip: Optional[list[str]] = []

  • Find more details here.

JSON File Support

  • We’ve added support for JSON files via local upload and 3rd party connectors.

  • How It Works:

    • The parser iterates through each object in a file and flattens it. Keys on the topmost level remain the same, but nested keys are transformed into the dot separated path to reach the key’s value. Each component of the path can either be a string for a nested object or integer for a nested list.

    • max_items_per_chunk is a parameter that determines how many JSON objects to include in a single chunk.

    • A new chunk is created when either the max_items_per_chunk or the chunk_size limit is reached. For example:

      • If each JSON object is 250 tokens, chunk_size of 800 tokens and no max_items_per_chunk set, then each chunk will contain 3 JSON objects.

      • If each JSON object is 250 tokens, chunk_size of 800 tokens and max_items_per_chunk set to 1, then each chunk will contain 1 JSON object.

  • Learn more details here.
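The flattening step described under How It Works can be illustrated with a short sketch. It follows the stated rule (top-level keys unchanged, nested keys become dot-separated paths, list positions become integer components) but is not Carbon's parser, and the exact output format of the real parser may differ.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into a single-level dict of dotted paths."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # list positions become integer path components
    else:
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


record = {"id": 1, "user": {"name": "Ada", "tags": ["a", "b"]}}
flat_record = flatten(record)
```

Here flat_record has the keys id, user.name, user.tags.0, and user.tags.1. In this sketch, empty nested objects and lists produce no keys.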

Gitbook Connector

  • We launched our Gitbook integration today that syncs pages from any public and shared spaces.

  • The Carbon Connect enabledIntegrations value for Gitbook is GITBOOK.

  • Gitbook does not come with a pre-built file selector so we added 2 endpoints for listing and syncing Gitbook spaces:

    • List all Gitbook spaces with /integrations/gitbook/spaces (API Reference)

    • Sync multiple spaces at once with /integrations/gitbook/sync (API Reference)

  • You can also use our global endpoints for listing and syncing specific pages in Gitbook spaces:

    • List pages in spaces with the global endpoint /integrations/items/list

    • Sync pages in spaces with the global endpoint /integrations/files/sync

    • Note: Spaces are treated like folders via the Carbon API.

  • See more specifics about our Gitbook integration here.

  • Note: our Gitbook page parser is still in beta so feedback is much appreciated!

Delete Endpoint Update

  • We’re transitioning file deletion from sync to async processing.

  • This means that the FILE_DELETED webhook event will not fire immediately and instead fire when the file is actually deleted.

  • We are also limiting each /delete_files request to 50 files to reduce the load on our servers. We advise spacing out delete requests every 24 hours.

Pinecone Integration 

  • We’ve launched our Pinecone destination connector! We offer support for both pod-based and serverless offerings.

  • Carbon seamlessly updates your Pinecone instance with the latest embeddings upon processing user files. Users gain full access to Carbon’s API endpoints, including hybrid search for supported sparse vector storage.

  • Find more details here.

New Carbon SDKs

  • Moving forward, we will be able to provide support for a greater number of SDKs and promptly release SDK support for API updates. If there is a language for which you want us to add SDK support, we should be able to turn that around in less than a week.

  • We’re adding support for the following languages today:

  • The current Javascript SDK will continue to be supported for the next month, and it will be available longer term. However, new features that are introduced will only be supported in the new Typescript SDK moving forward.

Delete Users Endpoint

  • Added an endpoint /delete_users that takes an array of customer IDs and deletes all those users.

  • Deleting a user revokes all of the user’s OAuth connections and deletes all of their files, embeddings, and chunks.

  • The request format is:

{ "customer_ids": ["USER_1", "USER_2", "USER_3"] }

  • Find more details here.

Salesforce Connector is Live

  • All articles from an end user’s Salesforce Knowledge can be listed and synced via the global API endpoints /integrations/items/list and /integrations/files/sync.

  • The Carbon Connect integration (launching tomorrow) will sync all articles by default.

  • The enabledIntegrations value is SALESFORCE.

  • You can find more info here.

Outlook Folders 

  • After connecting your Outlook account, you can use the /integrations/outlook/user_folders endpoint to list all of your folders on Outlook.

  • This includes both system folders like inbox and user-created folders.

  • Find more details here.

Gmail Labels 

  • After connecting a Gmail account, you can use the /integrations/gmail/user_labels endpoint to list all of your labels.

  • User-created labels will have the type user, and Gmail’s default labels will have the type system.

  • Find more details here.

Carbon Connect Updates 

  • Added support for JSON file formats and maxItemsPerChunk param to specify the number of items to include in a specific chunk.

  • Added cssSelectorsToSkip to WEB_SCRAPE to define CSS Selectors to exclude when converting HTML to plaintext.

  • Added SALESFORCE as an enabledIntegration on Carbon Connect.

  • For Salesforce, we added a param syncFilesOnConnection that defaults to true and will automatically sync all pages from a user’s Salesforce account.

  • We’ll be adding this param to other connectors too, meaning you can automatically sync all files from connectors that don’t have built-in file selectors (Gitbook, Confluence, etc).

  • This parameter is also added to the /integrations/oauth_url endpoint as sync_files_on_connection and also defaults to true.


Freshdesk Connector is Live

  • All published articles from an end user’s Freshdesk knowledge base are synced when connected to Carbon.

  • The Carbon Connect enabledIntegrations value is FRESHDESK.

  • You can find more info here.

Speed Improvements to Hybrid Search

  • We improved the speed of hybrid search by a factor of 10 by creating sparse vector indexes at file upload time rather than at query time.

    • Steps to Enable:

      • Pass the following body to the /modify_user_configuration endpoint: { "configuration_key_name": "sparse_vectors", "value": { "enabled": true } }

      • Set the parameter generate_sparse_vectors to true via the /uploadfile endpoint.

  • We’ll be rolling out faster hybrid search support across 3rd party connectors in the upcoming weeks.

  • Find more details here and here.

Deleting Files based on Sync Status

  • You can now delete file(s) based on sync_status via the delete_files endpoint.

  • We added 2 parameters:

    • sync_statuses - parameter to pass a list of sync statuses for file deletion.

      • For example, { "sync_statuses": ["SYNC_ERROR", "QUEUED_FOR_SYNC"] }. When this parameter value is passed we will delete all files in the SYNC_ERROR and QUEUED_FOR_SYNC status that belong to the end user identified by customer-id in headers that made the request.

    • delete_non_synced_only - boolean parameter that limits deletion to files that have not been re-synced before.

      • For example, a previously synced Google Drive file enters the QUEUED_FOR_SYNC status again during a scheduled re-sync. Setting delete_non_synced_only to true would prevent this file from being deleted as well.

  • Files can be deleted in any status except SYNCING, EVALUATING_RESYNC, and QUEUED_FOR_OCR. Including any of these three statuses in the list will result in an error response; files in these statuses must transition out of them before they can be deleted.

  • Find more details here.
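Since passing a non-deletable status results in an error response, a client may want to validate the list before calling delete_files. A minimal sketch, assuming a request body that carries only the two parameters described above; the helper name is hypothetical.

```python
# Statuses that cannot be used for deletion, per the note above.
NON_DELETABLE_STATUSES = frozenset({"SYNCING", "EVALUATING_RESYNC", "QUEUED_FOR_OCR"})


def build_delete_by_status_body(
    sync_statuses: list[str],
    delete_non_synced_only: bool = False,
) -> dict:
    """Build a delete_files body, rejecting statuses the API would refuse."""
    blocked = sorted(NON_DELETABLE_STATUSES.intersection(sync_statuses))
    if blocked:
        raise ValueError(f"cannot delete files in statuses: {', '.join(blocked)}")
    return {
        "sync_statuses": list(sync_statuses),
        "delete_non_synced_only": delete_non_synced_only,
    }


body = build_delete_by_status_body(
    ["SYNC_ERROR", "QUEUED_FOR_SYNC"], delete_non_synced_only=True
)
```

Setting delete_non_synced_only to true here keeps previously synced files that re-entered QUEUED_FOR_SYNC during a scheduled re-sync from being deleted.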

Carbon Connect Updates

  • Added support for the following functionalities in Carbon Connect (React component + JavaScript SDK):

    • Additional embedding models (OPENAI, AZURE_OPENAI, COHERE_MULTILINGUAL_V3 for text and audio files, and VERTEX_MULTIMODAL for image files).

    • Audio and image file support. See the reference documentation for the supported file formats.

    • OCR support for PDFs from local file uploads via Carbon Connect.

    • Hybrid search support.

Remove Customer-Id on Select Endpoints

  • We’re removing customer-id as a required header for the following endpoints, where it is not needed:

    • /auth/v1/white_labeling

    • /user

    • /webhooks

    • /add_webhook

    • /delete_webhook/{webhook_id}

    • /organization

Vector Database Integration

  • We are starting to build out direct integrations with vector database providers!

  • What this means:

    • After authenticating a vector database provider via API key, Carbon automatically synchronizes between user data sources and the embeddings within your vector database. Whenever a user file is processed, we handle the seamless update of your vector database with the latest embeddings.

    • You’ll have full access to all of Carbon’s API endpoints, including hybrid search if sparse vector storage is supported by your vector database.

    • Migrations between vector databases are made simple since Carbon provides a unified API to interface with all providers.

  • The first vector database integration we’re announcing is with Turbopuffer. Many more to come!

S3 Connector 

  • We launched our S3 connector today that enables syncing objects from buckets.

  • The Carbon Connect enabledIntegrations value for S3 is S3.

  • See more specifics about our S3 connector here.

File + Account Management Component (BETA)

  • Users can add and revoke access to accounts under each connection.

  • Users can view and select specific folders and files for sync.

  • The aim is to offer a pre-built file selector for integrations without their own.

  • The component is currently offered in React but we’ll add support for other frameworks soon.

  • You can find the npm package here. Please note it’s still in BETA so your feedback is much appreciated!

Expanding sort for user_files_v2

  • You can now sort by name, file_size, and last_sync via the order_by field in the user_files_v2 request body.

  • See more details here.

Support for audio file uploads via connectors

  • We’ve enabled support for audio files via the following connectors: S3, Google Drive, OneDrive, SharePoint, Box, Dropbox, Zotero.

  • See list of supported audio files here.

Google Verification

  • Carbon’s Google Connector is officially Google-verified. This means users will no longer see the warning screen when authenticating with Carbon’s Google connector.

OCR Public Preview

  • We’ve been rolling out support for OCR, starting with PDFs uploaded locally (images and data connectors to follow).

Exposing Sync Error Reasons

  • We are now exposing error messages under the sync_error_reason field for files entering SYNC_ERROR status.

  • You can find a list of common errors here and we’ll be updating this on an ongoing basis.

List and Sync Items from Data Sources

  • We’re introducing new functionalities that allow customers to synchronize and retrieve a comprehensive list of items such as files, folders, collections, articles, and more from a user’s data source. This enhancement empowers you to create an in-house file selection flow, while enabling Carbon to also provide a user-friendly file selector UI and convenient helper methods within our SDK.

  • You can find more details here.

Upload Chunks and Embeddings

  • Added /upload_chunks_and_embeddings endpoint to enable uploading of chunks and vectors to Carbon directly.

  • See more specific details here.

CARBON

Data Connectors for LLMs

COPYRIGHT @ 2024 JCDT DBA CARBON