Carbon Changelog

Monthly updates with the latest features added, improvements made and bugs squashed.

XLSM File Support

  • Added support for XLSM files for both third-party and local file uploads.

  • Similar to XLSX files, each row is split on its own line, and each element within the row has the header of its corresponding column added to it as a prefix. Our parser assumes that the first row and only the first row is the header. Macros, images, and charts aren’t supported yet.

Webscrape Improvements

  • We now immediately abort the web scrape run when the file is deleted in Carbon, freeing up resources to submit another web scrape request.

  • Auto-sync for web scrapes can now be managed at the user and organization level by setting WEB_SCRAPE in auto_sync_enabled_sources. The default auto-sync schedule is 2 weeks (as opposed to daily for other data sources).

Store Files Without Parsing

  • The sync endpoints now take a new parameter store_file_only (file_sync_config.store_file_only for external files) to allow users to skip parsing during the sync. This means the file will have a presigned_url but not a parsed_text_url.

  • Because we are skipping parsing, we won’t be able to count the number of characters in the file. That means the only metrics we’ll report to Stripe are bytes uploaded (and URLs scraped if it’s a web scrape file).
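
  • As a rough sketch of the store_file_only flag for external files (only the relevant fragment of the sync payload is shown; surrounding request fields are omitted):

"file_sync_config": {
    "store_file_only": true
  }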

Sync Additional GitHub Data

  • In addition to syncing files from repositories, you can fetch data directly from GitHub via the following endpoints:

    • /integrations/data/github/pull_requests: Lists all pull requests for a repository

    • /integrations/data/github/pull_requests/{pull_number}: Retrieves a specific pull request

    • /integrations/data/github/pull_requests/comments: Fetches comments on a pull request

    • /integrations/data/github/pull_requests/files: Retrieves files that were changed

    • /integrations/data/github/pull_requests/commits: Retrieves a list of commits on a pull request

    • /integrations/data/github/issues: Lists repository issues

    • /integrations/data/github/issues/{issue_number}: Retrieves a specific issue

  • By default, we return responses with mappings applied, but there is an option to include the entire GitHub response on every endpoint (include_remote_data).

  • Find more details in our documentation here.

/user_files_v2: New upload_id Property

  • User files now contain a new property called upload_id which is generated internally by Carbon. This property groups together files that were part of the same upload. Each upload from a third-party file picker will have its own unique upload_id, even if the files were uploaded in the same session. Sessions are still identified by the request_id. If the same file is uploaded multiple times, only the most recent upload_id is saved.

  • Webhooks that send the request_id will now also send the upload_id.

  • The /user_files_v2 endpoint now accepts a new filter called upload_ids.
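
  • For example, a /user_files_v2 request filtering by upload ID might look roughly like the following (the nesting of the filter and the ID value are illustrative assumptions):

{
  "pagination": { "limit": 10 },
  "filters": { "upload_ids": ["hypothetical-upload-id"] }
}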

New ALL_FILES_PROCESSED Webhook

  • The new webhook ALL_FILES_PROCESSED will be sent when all files in an upload have moved into the “READY,” “SYNC_ERROR,” “READY_TO_SYNC,” or “RATE_LIMITED” status. It includes the request_id as the sent object and the upload_id as additional information.

API Update

  • Starting next Tuesday (10/15), the hot_storage_time_to_live field under file upload endpoints will no longer take values in seconds. Instead it will need to be a discrete number of days from the list: [1, 3, 7, 14, 30]. Anything else will raise an exception.
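
  • For example, after the change a payload that previously passed seconds would instead pass one of the allowed day values (fragment only; other upload fields omitted):

"hot_storage_time_to_live": 7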

Self-Hosting Updates

  • You can now bring your own S3-compatible object storage instead of using S3 (AWS) or Google Cloud Storage (GCP).

  • Added a flag DISABLE_RATE_LIMITS to disable all of Carbon’s rate limits listed here.

Premium Proxies for Web Scraping

  • We have introduced a new feature called use_premium_proxies for web scraping and sitemap scraping that can be enabled upon request. This feature aims to enhance the success rate when scraping websites that utilize captchas or bot blockers.

  • Please note that enabling this feature may result in longer web scraping durations.

Limit File Syncs by Character Count

  • Initial file syncs now include the option to limit based on the number of characters. There are three levels of character limits:

    • max_characters_per_file: A single file from the user cannot exceed this character limit.

    • max_characters_per_upload: Custom character limit for the user across a single upload request.

    • max_characters: Custom character limit for the user across all of the user’s files. Please note that in this case, the value can slightly exceed the limit.

  • These limits can be configured using the user (/update_user) or organization (/organization/update) update endpoints. If these limits are exceeded, the file that surpasses the threshold will be moved to SYNC_ERROR, and the corresponding webhook (FILE_ERROR) will be sent. Please be aware that files that have already synced from the same upload request will not be rolled back.
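
  • As a hedged sketch, an /update_user request configuring these limits might look like the following (the customer_id field name and the placement of the limit fields are assumptions, and the numbers are placeholders):

{
  "customer_id": "example-user-id",
  "max_characters_per_file": 100000,
  "max_characters_per_upload": 1000000,
  "max_characters": 5000000
}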

Email Notifications for Usage Limits

  • You can now enable the following emails (currently upon request) to be sent to admins and users under your portal.carbon.ai account:

    • Daily Limit Reached: Your organization has reached the 2.5GB (or custom) upload limit across all users and data sources. We’ll return the organizationName, uploadLimit, and resetTime.

    • User Exceeded Their Upload Limit: A certain user has exceeded the upload limits you set via max_files or max_files_per_upload. We’ll return the customerId, limitType, and dateTime.

    • User Exceeded Their Upload Limit: A certain user has exceeded the upload limits you set via max_characters_per_file, max_characters_per_upload, or max_characters. We’ll return the customerId, limitType, and dateTime.

Self-Hosting Updates

  • We have added two new environment variables:

    • HASH_BEARER_TOKEN: when set to true, we store only hashed bearer tokens in the database. This is optional and adds an additional layer of security if your database is already encrypted at rest.

    • DATA_SOURCE_ENCRYPTION_KEY: enables encryption of client secrets and access tokens when set. This key should be a URL-safe, base64-encoded 32-byte key. Refresh tokens are not encrypted because they are not useful without the client secret. Encrypted values can be decrypted and rolled back using a migration that does this for all tokens.

  • You can now use your own SQS-compatible queue instead of using SQS (AWS) or PubSub (GCP). Currently we’ve implemented elasticmq as the open-source SQS alternative.

Carbon Connect Enhancements

  • Users can now opt to have the updated_at column displayed in filesTabColumns instead of created_at, allowing for sorting by this column.

  • Status updates for files in a syncing state have been implemented. The file view will automatically refresh when the file changes to either Ready or Sync Error.

API Endpoint for White Labeling

  • If white-labeling is enabled for your organization, you can directly manage your OAuth credentials for white-labeling via the following endpoints:

    • /white_label/create: Add credentials to white label data sources.

    • /white_label/delete: Delete credentials for white-labeled data sources.

    • /white_label/update: Update credentials for a white-labeled data source.

    • /white_label/list: List credentials for white-labeled data sources.

  • Below is a list of data sources that can be white-labeled:

NOTION GOOGLE_DRIVE BOX ONEDRIVE SHAREPOINT INTERCOM SLACK ZENDESK OUTLOOK GMAIL SERVICENOW SALESFORCE ZOTERO CONFLUENCE DROPBOX GOOGLE_CLOUD_STORAGE GONG

  • For all these data source types, client_id and redirect_uri are required credentials. client_secret is optional for those who want to create data sources with access tokens obtained outside of Carbon. For data-source-specific credentials:

    • Google Drive optionally takes an api_key for those who want to use Google’s file picker.

    • OneDrive and Sharepoint take a file_picker_client_id and file_picker_redirect_uri for those who want to use Microsoft’s file picker.

  • Note: Carbon will encrypt client secrets in our database, but return them unencrypted in the API responses.

Disabling File Formats in CCv3 File Picker (3.0.21)

  • You can now disable the selection of unsupported or disabled file formats in the CCv3 in-house file picker for the following integrations:

GOOGLE_DRIVE ONEDRIVE SHAREPOINT BOX DROPBOX S3 (includes Digital Ocean Spaces) ZOTERO AZURE_BLOB_STORAGE GOOGLE_CLOUD_STORAGE

  • By default, all file formats supported by Carbon are enabled. Users can set the allowed_file_formats under connector settings at the user (update_users) or organization level (organization/update) to control which file formats are enabled.

Self-Hosting Updates (1.3.18)

  • We now allow environment variables for the Carbon application to be passed as a yaml file. The config.yaml file is a configuration file that stores all the environment variables for the application. It can have multiple environments such as dev, prod, etc. Each environment should be placed under a key with the name of the environment, and the key must be at the top level of the file. The environment that is used is determined by the global_env key-value pair (e.g. global_env: dev). It’s important to note that the variables in this file are converted into environment variables. Essentially, every key-value pair at the leaf level is extracted. The key becomes the key of the environment variable in all caps, and the value remains the same.

    • For instance, the microsoft_file_picker_client_id variable under prod.data_connects.onedrive would be converted to the env variable: MICROSOFT_FILE_PICKER_CLIENT_ID=test_id_here.

  • Here is an example of the .yaml file for reference.
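
  • A minimal sketch of such a file, assuming a prod environment and the onedrive key path from the example above (the value is a placeholder):

global_env: prod

prod:
  data_connects:
    onedrive:
      microsoft_file_picker_client_id: test_id_here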

Custom Metadata for Data Sources

  • We added the functionality to add custom tags to data sources, similar to those currently supported for files.

  • You can add tags to any data source via the following endpoints:

    • data_sources/tags/add: Add tags to a data source.

    • data_sources/tags/remove: Remove tags from a data source.

  • All endpoints for connecting data sources (i.e., /integrations/connect and /integrations/oauth_url) take a data_source_tags param for adding tags.

  • The tags must be added as a key-value pair (same as file tags). Example: {"userId": "swapnil@carbon.ai"}

  • We have also introduced two parameters in Carbon Connect (3.0.23), allowing customers to add and filter displayed data sources for users:

    • dataSourceTags: These are key-value pairs that will be added to all data sources connected through CCv3.

    • dataSourceTagsFilterQuery: This parameter filters for tags when querying data sources. It functions similarly to our documented file filters. If not provided, all data sources will be returned. Example: {"key": "userId", "value": "swapnil@carbon.ai"}

Sharepoint Team Site Support

  • We now support Sharepoint team sites. To connect a Sharepoint team site, leave sharepoint_site_name undefined when calling /integrations/oauth_url.

Cursor-Based Pagination

  • We have begun to implement a more efficient pagination system for our API endpoints, starting with the /user_files_v2 endpoint.

  • We introduced a new parameter called starting_id in the pagination block. It is recommended to use a combination of limit and starting_id instead of limit and offset. This not only reduces the load on our backend but also leads to significantly faster response times for you. The limit-starting_id approach is essentially cursor-based pagination, with the cursor being the starting_id in this case.

  • To use it, if you are unsure about which ID to use for starting_id, you should initially make a query with just a limit, order direction, and field to order by. For example:

{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10 } }

  • You will receive a list of results (files in the case of /user_files_v2), ordered by id in descending order. From here, you can use the last ID in the list as the starting ID for the next API call. For instance:

{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10, "starting_id": 25032 } }

  • This assumes that the last ID of the first API call was 25032. By following this method, you can retrieve the next 10 files. You can continue this process as needed.

  • We aim to eventually phase out offset-based pagination in favor of this cursor-based pagination, as offset-based pagination performs significantly worse at the database level.

Self-Hosting Updates

  • Azure Blob Storage has been added as an object storage backend, alongside S3 (AWS), Google Cloud Storage (GCP), and S3-compatible open source alternatives.

Customer Portal v2

  • We have completely redesigned our customer portal UI (portal.carbon.ai) and have a roadmap to significantly enhance the functionality.

    • You can now manage the following through the portal:

      • Webhooks

      • API keys

      • Admin and User Permissions

      • Subscription Plans

Drives Listed As Top-Level Items

  • Personal and Shared Drives are now listed as top-level source items via both the API and the in-house Carbon file picker.

  • Drives themselves cannot be selected for syncing, but you can click in to select folders and files within the Drives.

Self-Hosting Updates

  • We added the following environment variables for self-hosted deployments:

    • default_request_timeout: This is the default timeout for all requests made to external APIs and URLs. Defaults to 7 seconds.

    • web_scrape_request_timeout: This timeout is specifically for requests made during web scraping. Defaults to 60 seconds.

    • <data_source_name>_request_timeout: This allows you to customize the request timeout for specific data sources. Replace <data_source_name> with the actual name, such as notion_request_timeout or google_drive_request_timeout. Defaults to 7 seconds.

Custom Scopes for Connectors

  • You can now directly pass in custom scopes to request from the OAuth provider via /integrations/oauth_url. The scopes will be used as-is, not combined with the default scopes that Carbon requests.

  • The scopes must be passed in as an array, example:

"scopes": [
    "https://www.googleapis.com/auth/userinfo.profile",
    "https://www.googleapis.com/auth/userinfo.email",
    "https://www.googleapis.com/auth/drive.readonly"
  ]

  • Support for custom scopes has also been added to Carbon Connect 3.0.26. The prop is called scopes and is an array that can be set only on the integration level.

Presigned URL Expiry Time

  • We added a new, optional field on the /user_files_v2 endpoint called presigned_url_expiry_time_seconds that can be used to set the expiry time for generated presigned URLs. The default is 3600 seconds.
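
  • For instance, a /user_files_v2 request could extend the expiry to two hours with the following fragment (other request fields omitted):

"presigned_url_expiry_time_seconds": 7200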

List Sharepoint Sites

  • You can now list all the SharePoint sites associated with a user’s SharePoint account.

  • After connecting to a SharePoint account, you can use the endpoint /integrations/sharepoint/sites/list to retrieve a list of all sites in the account.

    • This endpoint has two optional parameters:

      • data_source_id: This must be provided if there are multiple SharePoint connections under the same customer ID.

      • cursor: This is used for pagination.

    • Each site will return three properties:

      • site_url

      • site_display_name

      • site_name: This value is used for the sharepoint_site_name when connecting sites with integrations/oauth_url.

  • Please note that this endpoint requires an additional scope, Sites.Read.All, which Carbon does not request by default. In order to list sites, connect sites, and sync files from connected sites, you must include Sites.Read.All in the /integrations/oauth_url through the scopes parameter, along with the required scopes: openid, offline_access, User.Read, and Files.Read.All.
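
  • Putting that together, the scopes array passed to /integrations/oauth_url would look like this:

"scopes": [
    "openid",
    "offline_access",
    "User.Read",
    "Files.Read.All",
    "Sites.Read.All"
  ]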

New Filters for Source Items

  • We added two new optional filters for /integrations/items/list:

    • file_formats: Filter based on all file formats supported by Carbon. This is a new feature that won’t be backfilled, so it will only apply to data sources that are synced or re-synced moving forward.

    • item_types: Filter on different item types at the source; for example, help centers will have TICKET and ARTICLE, while Google Drive will have FILE and FOLDER.
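
  • As a loose sketch (the exact placement of these fields in the /integrations/items/list request body is an assumption, and the values are only examples), a request could combine both filters like this:

"file_formats": ["PDF"],
"item_types": ["FILE", "FOLDER"]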

Return external_url for Freshdesk

  • We now return the external_url value for Freshdesk articles.

Sync Outlook Emails Across All Folders

  • We have introduced support for syncing Outlook emails across all user folders. Users can specify the folder as null to achieve this, with the default being the inbox if this input is excluded.

Support for .eml and .msg Files

  • We’ve added support for .eml and .msg files for both local and third-party file uploads.

Return Document Chunks without Embeddings

  • We added a new flag, generate_chunks_only, under file_sync_config for third-party connectors and at the top level (as generateChunksOnly) for web scrapes, file uploads, and raw text.

  • When this flag is set to true, documents will be chunked without generating embeddings, and the /list_chunks_and_embeddings endpoint will list chunks only.

  • If generate_chunks_only is set to true, it overrides skip_embedding_generation: embeddings will not be generated regardless of the value passed for skip_embedding_generation.
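
  • For a third-party connector, the relevant fragment of the sync configuration would look like this (other fields omitted):

"file_sync_config": {
    "generate_chunks_only": true
  }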

ServiceNow Connector

  • The ServiceNow connector allows customers to synchronize incidents and attachments from their accounts, and support for knowledge articles and catalogs will be added soon!

  • Carbon Connect support is coming tomorrow. The enabledIntegration will be SERVICENOW.

  • You can find more details here.

Carbon Connect Enhancements

  • If a synced file in the “Synced File” list view is in ERROR status, an error message will be displayed when hovering over the Error status label.

  • If a file is re-synced via the “Synced File” list view, a success or error message will be provided based on the outcome.

  • The ServiceNow connector has been added to CCv3. The slug for the enabledIntegration is SERVICENOW.

Gong Connector

  • Just launched our Gong connector for syncing Gong calls and retrieving the call transcripts.

    • CCv3 support for the Gong Connector will be added later this week with the enabledIntegration slug being GONG.

  • By default, the Gong connector will sync all of your workspaces and calls. However, you can customize this behavior:

    • To turn off automatic syncing of all workspaces and calls, set the sync_files_on_connection parameter to false when configuring the connector.

    • To manually sync specific workspaces or calls, use the global endpoints (/integrations/items/list and /integrations/files/sync).

  • To include speaker names and emails (when available), set the include_speaker_labels flag under file_sync_config to true.

  • New calls are auto-synced from existing workspaces, but any new workspaces created later will require manual syncing.

  • Find more details here.
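
  • As a hedged sketch combining the parameters above (their exact placement within the /integrations/connect or /integrations/oauth_url payload may differ), a Gong connection that skips the automatic full sync but labels speakers might include:

"sync_files_on_connection": false,
"file_sync_config": {
    "include_speaker_labels": true
  }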

External URL for Gmail and Outlook

  • The external_url field is now returned for both Gmail and Outlook email files under user_files_v2.

Return Raw Slack Messages

  • We now return the individual Slack messages under the additional_presigned_urls->messages_json field when you set the include_additional_files parameter to true for user_files_v2.

  • The pre-signed file will contain the raw Slack response for all the messages in that file. The JSON will have one entry per conversation, with the conversation timestamp as the key.

Improved Search for Carbon Connect (3.0.12)

  • The search functionality in CCv3 has been enhanced to enable searching through all items in the directory or selected folder, rather than just what is displayed in the front-end.

Improved Notion Parsing

  • We’ve improved our Notion parser to support parsing for the following blocks:

    • Toggle lists

    • In-line tables, text, code blocks, and lists

    • Numbered and bullet lists

    • Synced blocks

    • Multi-column blocks

    • Text with links

Syncing Intercom Conversations

  • In addition to articles and tickets, Carbon now syncs Intercom Conversations.

  • You can specify CONVERSATION under file_sync_config to enable syncing conversations:

"file_sync_config": { "auto_synced_source_types": ["CONVERSATION"], "sync_attachments": true }

  • The following conversation information is available as tags for filtering:

{ "conversation_status": "open", "conversation_priority": "not_priority", "conversation_submitter": "example.user@projectmap.com", "conversation_assigned_team": "Support", "conversation_assigned_admin": "swapnil+int2@carbon.ai" }

Notion Database Properties

  • Notion database properties are now returned per page within the database.

    • All Notion database properties are supported except for relation.

    • Properties are parsed per page in a database. They are parsed in a key-value format (property_name: property_value) and are added to the beginning of the parsed page (parsed_text_url) as a newline-separated list.

    • The file returned by presigned_url also now contains the JSON representation of the Notion page. The page’s properties and child blocks can be found in the object.

Sync Files Without Processing

  • We now allow new file records to be created in Carbon (and displayed via /user_files_v2) without processing and saving the actual file. The remote file content will not be downloaded, and no chunks or embeddings will be generated. Only some metadata, such as the name, external ID, and external URL (depending on the source being synced from), will be stored.

  • This feature can be enabled by setting the flag skip_file_processing to true under file_sync_config for a given data source, and the sync_status of files in this state will be READY_TO_SYNC.

  • It’s important to note that this flag overrides both the skip_embedding_generation and generate_chunks_only flags.
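
  • For example, the relevant fragment of a data source’s sync configuration would be (other fields omitted):

"file_sync_config": {
    "skip_file_processing": true
  }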

apiURL prop for CCv3 (3.0.14)

  • For customers that self-host Carbon, we added the prop apiURL to CCv3 which defaults to https://api.carbon.ai but can be set to another URL value. This URL value then acts as the base path for all of the requests made through Carbon Connect.

Qdrant Destination Connector

  • You can now “bring your own” Qdrant index to use with Carbon.

  • Carbon can automatically synchronize embeddings generated from customer data sources with any Qdrant index.

  • To enable, we’ll require your Qdrant API key, a URL, and a mapping of embedding generators (i.e., OPENAI) to collection names:

{ "api_key": "API_KEY", "url": "URL", "collection_names": { "EMBEDDING_GENERATOR_1": "COLLECTION_NAME_1", "EMBEDDING_GENERATOR_2": "COLLECTION_NAME_2" } }

Azure Blob Storage Connector

  • We launched our Azure Blob Storage connector that enables syncing files and folders from blobs.

  • The Carbon Connect enabledIntegrations value for Azure Blob Storage is AZURE_BLOB_STORAGE, and CCv3 support will launch tomorrow.

  • Find more details on our Azure Blob Storage connector here.

Business OneDrive Support for Microsoft File Picker

  • The file picker button will now appear on the successful connection page for Business OneDrive accounts.

  • In order to open the file picker, the tenant name of the business account is required. Carbon will try to find it through Microsoft’s API by default. If it can’t be found, the file picker button won’t appear, and the successful connection page will instruct the user to close the tab.

Carbon Self-Hosting on Google Cloud Platform

  • Starting today, customers have the option to host a Carbon instance within their own GCP instance, with full access to all features of our managed solution, including data connectors, hybrid search, and more.

  • As a reminder, we’re already live on AWS and launching on Azure next month!

  • Book a demo if you’re interested to learn more: https://cal.com/carbon-ai/30min

Unified API for CRMs

  • We are introducing a unified API to access standardized data directly from CRM systems, starting with Salesforce.

  • To start, you can now sync data from the following CRM objects:

    • Accounts

    • Leads

    • Contacts

    • Opportunities

  • You can find more details in our documentation here.

Google Sheets Update

  • The file returned in presigned_url for Google Sheets has been changed from txt to xlsx. The txt file is still available in parsed_text_url.


Sync Filter for Email Attachments

  • Customers can specifically select to sync only emails that contain attachments.

  • You will still need to set sync_attachments to true and also set the following filter:

{ "key": "has", "value": "attachment" }

Auto-Refresh Synced Files List in CCv3

  • We now automatically refresh the synced file list whenever users select additional files using our in-house or third-party file picker view. This eliminates the need for users to manually refresh the view.

Updated Children Prop

  • The children prop of the CCv3 component now accepts any valid React node as the children of the modal, from a simple <div> to an entire component.

  • Here’s an example of how the children prop can be used:

children={ <button onClick={() => setOpen((prev) => !prev)}> Toggle Connect </button> }

Custom Styling for Carbon Connect

  • Users can now control styling of CCv3 by targeting the specific class names we’ve provided. This allows for complete customization to match the desired look and feel of the application.

  • For example, class names include:

    • cc-modal: Applies to the entire modal component

    • cc-modal-header: Targets the header section of the modal

    • cc-modal-footer: Targets the footer section of the modal

    • cc-modal-close: Applies to the close button of the modal

    • cc-modal-overlay: Targets the overlay background of the modal

  • By utilizing these class names, users can easily override the default styles and apply their own CSS rules to achieve the desired appearance.

OCR Support for JPG and PNG

  • We now support jpg, jpeg and png file formats for OCR.

  • In addition to the normal steps for enabling OCR, please set media_type to TEXT (via file upload and /integrations/oauth_url) so Carbon knows to process the image via OCR (versus generating image embeddings via our image embedding model).

HTML for Confluence Articles

  • We now return the raw HTML output for each Confluence article via the file_metadata.saved_filename object under user_files_v2.

Cancel Source Items Sync

  • We added an endpoint /integrations/items/sync/cancel to cancel data source syncs that are initiated via /integrations/items/sync.

  • This allows customers to manually stop syncing for user data sources where sync_status = SYNCING.

New Gmail Filter

  • We added a new Gmail filter to sync all emails sent from a given account. Example:

{ "filters": { "key": "in", "value": "sent" } }

Return Raw Notion Blocks

  • We now return the raw output (blocks) for each Notion page via saved_filename under user_files_v2 when include_raw_file: true.

Shared Google Drive Source Items

  • We now return shared Google Drive files and folders via /integrations/items/list.

Clearer Error Message for SYNC_ERROR Status

  • When a file goes into SYNC_ERROR from re-syncing via /resync_file because it has been deleted in the source, sync_error_message will now say “File not found in data source”.

  • The webhook sent for that error will also contain sync_error_message in additional_information.

Slack UI in Carbon Connect v3 (3.0.0-beta32)

  • Select Conversations to Sync

    • After authenticating, users have full control over which conversations they want to sync via CCv3, including:

      • Public channels

      • Private channels

      • Direct messages (DMs)

      • Group DMs

  • Manage Synced Conversations

    • Users can manage their list of synced conversations at any time via CCv3.

    • Easily add or remove channels and DMs to adjust what gets synced between Slack and Carbon.

Carbon Connect Enhancements

  • Synced URLs for Web Scrapes (CCv3 beta30)

    • We now display synced URLs in a dedicated list view under the WEB_SCRAPE integration.

    • The default columns displayed in the list view are name, status, and created_at.

    • Parent URLs will be displayed as “folders” and children URLs will be displayed as “files” within the folder.

  • When showFilesTab is set to false, we surface a “Select files” button in the account drop-down for users to sync new files.

  • Data Source Polling Interval

    • Added a new configuration property at the component level called dataSourcePollingInterval.

    • This property controls how frequently data sources are polled for any updates and events.

    • The value is specified in milliseconds (ms) and the minimum allowed value for this property is 3000 ms. The default is 8000 ms.

  • Speaker Diarization

    • Added includeSpeakerLabels for LOCAL_FILES integration and file extensions.

    • Added include_speaker_labels to fileSyncConfig for third-party connectors.

  • openFilesTabTo Param

    • The openFilesTabTo prop is set on the component level and determines which tab (FILE_PICKER or FILES_LIST) the user is taken to by default when they select an integration.

    • The prop takes a string value of either "FILE_PICKER" | "FILES_LIST".

    • This prop only applies when the customer has enabled Carbon’s in-house file picker.

  • We now display a banner when data source items are being synced. The user will still be able to select previously synced items for upload in the meantime.

  • Guru support in CCv3 has been added. The enabledIntegration is GURU.

  • We improved the file list view to be better optimized for mobile devices and ensured that the column headers and values align properly.


Pongo Reranking Model

  • We’ve added Pongo as a supported reranker model alongside Jina and Cohere.

  • Similar to Cohere and Jina reranking, users can now use PONGO_RERANKER in the following manner on the embeddings endpoint: { "query": "how is anime made?", "k": 5, "rerank": {"model": "PONGO_RERANKER"} }

Third-Party File Picker Behavior

  • We added a new parameter automatically_open_file_picker to the external file sync urls: /integrations/oauth_url and /integrations/connect. When true, the file picker for Google Drive, Box, OneDrive, Sharepoint, Dropbox will automatically open when the user lands on the successful connection page.

  • It’s important to note that some users’ browsers may have popup blockers that could prevent this parameter from functioning. In such cases, the user may receive a prompt from their browser asking for permission to allow popups from the platform. If the user grants permission, the feature will work as intended for future syncs.

  • It’s worth mentioning that OneDrive and SharePoint behave differently due to Microsoft treating the file picker as a separate app. Instead of directly opening the file picker, it will trigger another OAuth prompt. If the user consents to the file picker OAuth, the file picker will then automatically open afterwards.

Speaker Diarization

  • Speaker diarization has been added for audio transcription models. This allows us to format chunks so that the text is organized by utterances and each utterance will be labeled with the speaker. It’ll take this format:

[Speaker A] speaker A's utterance

[Speaker B] speaker B's utterance

  • For local file uploads, there is a new parameter include_speaker_labels. For external file uploads, the file_sync_config object can take a new include_speaker_labels property. When either is set to true, speaker diarization will be enabled for the audio transcription services.

  • Minor note: Speaker labels may appear differently depending on the transcription service. Deepgram uses numbers to label speakers while AssemblyAI uses letters.

request_id on Additional Webhooks

  • request_id is now included in the following webhook events under the additional_information object for external files: UPDATE, FILES_CREATED, FILE_READY, FILE_ERROR, FILES_SKIPPED, FILE_SYNC_LIMIT_REACHED

Cold Storage for Files (Beta)

  • Overview

    • Carbon supports moving file embeddings between hot and cold storage. This feature allows you to optimize storage costs and improve performance by keeping embeddings for frequently accessed files in hot storage (vector storage) while moving less frequently used files to cold storage (object storage).

  • Enabling Cold Storage

    • By default, the cold storage feature is not enabled. Once enabled, files will automatically be moved to cold storage after a set period of inactivity. To enable cold storage, you must set a flag at file upload time. Currently cold storage is only available for local file uploads via /uploadfile, /upload_text and /upload_file_from_url.

      • Moving Files from Hot to Cold Storage

        • Once enabled, files will be automatically moved from hot to cold storage after a specified period of inactivity. This period is determined by the time_to_move_to_cold_storage parameter, which represents the number of seconds a file must be inactive before it’s moved to cold storage. There is no manual way to move files to cold storage.

          • You can make an API request to the /modify_cold_storage_parameters endpoint which allows customers to update existing files to use cold storage.

      • Moving Files from Cold to Hot Storage

        • To move files from cold to hot storage, you must make an API request to /move_to_hot_storage. The request will take filters similar to /user_files_v2, and all files matching the provided filters will be moved to hot storage.

        • To avoid a single request hogging resources, there is a limit of 200 files that can be moved in one request. If the number of files matching the filters exceeds 200, the files will be processed in batches of 200 over a longer period of time.

    • /embeddings Endpoint Behavior

      • If a request is made to /embeddings that involves files in cold storage, an error will be returned that includes a list of file_ids for the affected files. This allows the client to know which files need to be moved to hot storage before the request can be processed.

      • However, if exclude_cold_storage_embeddings is set to true, any files in cold storage will be ignored, and no error will be thrown for requests involving files in cold storage. The search will then naturally exclude those files.

      • In the future, we may enable a way to allow /embeddings to work with files that are in both cold and hot storage.

  • File Object Information

    • Activity is defined as when a file was last used, which currently includes file re-syncs, queries involving that file, and updates to file tags.

    • The following fields under the file object (under user_files_v2) are related to cold storage:

      • last_use: A timestamp indicating when a file was last used (i.e., when it last had activity).

      • supports_cold_storage: A flag indicating whether or not a file can be moved to cold storage.

      • time_to_move_to_cold_storage: An integer representing the number of seconds a file must be inactive before it’s moved to cold storage.

      • embedding_storage_status: The storage status of the embeddings for a file, indicating whether they are in cold or hot storage.

  • New Cold Storage Webhooks

    • MOVED_TO_COLD_STORAGE: This event is fired when a file is moved to cold storage.

    • MOVED_TO_HOT_STORAGE: This event is fired when a file is moved to hot storage.

You can find our documentation on cold storage here.
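
As a loose illustration of the file object fields listed above (every value below is a placeholder, and the timestamp format and storage-status value are assumptions), the cold-storage-related fields might look like:

"last_use": "2024-09-01T12:00:00Z",
"supports_cold_storage": true,
"time_to_move_to_cold_storage": 604800,
"embedding_storage_status": "COLD_STORAGE"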

Warnings Object to API Responses

  • In the next two weeks, we plan to add a warnings object to our API responses to display warning messages.

  • Here’s an example of how it looks:

{ "documents": [], "warnings": [ { "warning_type": "FILES_IN_COLD_STORAGE", "object_type": "FILE_LIST", "object_id": [ 47058 ], "message": "These files won't be queried because they are not in hot storage." } ] }

Carbon Connect 3.0 (CCv3) Enhancements

  • We’ve added 3 new props to CCv3:

    • The showFilesTab (boolean) prop has been reintroduced to CCv3 with a default value of true. As a quick reminder, this prop allows customers to hide the file selector and file list view from the CCv3 component. It can be enabled or disabled at both the component and integration levels. If specified for a specific integration, it will override the component-level configuration.

    • The filesTabColumns (array) prop has been added on both the component and integration levels. This prop controls which columns are displayed and hidden in the file list view and accepts an array of strings with values “name”, “status”, “created_at”, and “external_url”.

    • The transcription_service (enum) prop has been added under fileSyncConfig (and as transcriptionService for the LOCAL_FILES integration) to specify which speech-to-text model to use for transcriptions. You can specify the enum as ASSEMBLYAI or DEEPGRAM; the prop defaults to DEEPGRAM.

Google Cloud Storage Connector 

  • We launched our GCS connector that enables syncing files from buckets.

  • The Carbon Connect enabledIntegrations value for GCS is GCS.

  • See more specifics about our GCS connector here.

DigitalOcean Storage Connector

  • We launched our DigitalOcean Storage connector that enables syncing files from buckets.

  • The Carbon Connect enabledIntegrations value for Digital Ocean Spaces is S3 (CC support will be launched tomorrow).

  • The Spaces API is interoperable with AWS S3, so Digital Ocean Spaces makes use of the existing S3 endpoints.

  • This means that the source of Digital Ocean files is S3. To differentiate between data sources and files from Spaces Object Storage, additional metadata has been added:

    • Data Source Metadata

      • data_source_metadata: Indicates the type of data source. Possible values include:

        • S3: Represents an Amazon S3 data source.

        • DigitalOcean Space: Represents a DigitalOcean Spaces data source.

    • File Metadata

      • file_metadata: Specifies the type of file. Possible values include:

        • S3 File: Represents a file stored in Amazon S3.

        • DigitalOcean Space File: Represents a file stored in DigitalOcean Spaces.

        • S3 Bucket: Represents a file representation for an S3 bucket.

        • DigitalOcean Space Bucket: Represents a file representation for a DigitalOcean Space Bucket.

  • See more specifics about our DigitalOcean Spaces connector here.

New file_types_at_source Filter for /user_files_v2 and /embeddings

  • Introduced a new optional field file_types_at_source for /user_files_v2 and /embeddings.

  • The file_types_at_source field is an array type that currently accepts the following values:

    • TICKET

    • ARTICLE

  • This new field allows users to specify whether we return tickets, articles or both when retrieving content (files and embeddings) from Zendesk, Intercom and Freshdesk.

    • If file_types_at_source contains TICKET, ticket content from Zendesk, Intercom and Freshdesk are returned.

    • If file_types_at_source contains ARTICLE, article content from Zendesk, Intercom and Freshdesk are returned.

AssemblyAI Integration for Audio Transcriptions

  • We are excited to announce that Carbon now supports multiple audio transcription services. In addition to our existing integration with Deepgram, we have added support for AssemblyAI, providing our users with more options and flexibility when transcribing audio files.

  • To accommodate the new transcription service, we have updated the following endpoints to accept a new parameter, transcription_service, that allows you to specify which service to use. Valid values are deepgram and assemblyai. If no value is specified, Deepgram will be used as the default transcription service.

  • For local files, the endpoints are:

    • /uploadfile

    • /upload_file_from_url

  • For external files, transcription_service is set within the file_sync_config parameter, under:

    • /integrations/oauth_url

    • /integrations/connect

    • /integrations/files/sync

  • Similar to files transcribed by Deepgram, files transcribed by AssemblyAI also have an additional saved file containing the full JSON response from the AssemblyAI service. To access the transcription response, query the files using the user_files_v2 endpoint with the include_additional_files parameter set to true.

Carbon Webhook Libraries

  • We have released our official webhook libraries for handling the verification of webhook signatures. You can find our updated documentation here, and access our libraries on GitHub here.

Zendesk Auto-Sync Update

We are thrilled to announce that the Zendesk connector now supports auto-sync.

  • Carbon can now sync any new articles with auto-sync enabled.

    • Help Center Categories are now synced into Carbon as files, and Help Center Categories and articles form a parent-child relationship.

  • Reconnecting Existing Zendesk Connections:

    • If you have existing Zendesk connections in Carbon, please note that you will need to reconnect them to enable the updates above.

Organization Connector Settings

  • The /organization endpoint now includes connector_settings in the response, providing additional information about the organization’s connector configurations, starting with permitted file formats.

  • The /organization/update endpoint has been updated to accept the data_source_config parameter, allowing customers to configure permitted file formats for organization users. The data_source_config parameter should be provided in the following format:

{ "data_source_configs": { "GOOGLE_DRIVE": { "allowed_file_formats": ["PDF", "DOCX"] }, "DROPBOX": { "allowed_file_formats": ["XLSX", "CSV"] }, "DEFAULT": { "allowed_file_formats": ["PDF", "DOCX", "XLSX", "NOTION"] } } }

  • DEFAULT is applied to all data sources that do not have configs defined.

  • If the data_source_config parameter includes file formats that are not supported by Carbon, those formats will be ignored, and only the supported formats from each data source will be synced.

Carbon Self-Hosting on AWS

  • Starting today, customers have the option to host a Carbon instance on their own cloud, with full access to all features of our managed solution, including data connectors, hybrid search, and more.

  • We’re launching on Microsoft Azure and Google Cloud later next month!

  • Book a demo if you’re interested to learn more: https://cal.com/carbon-ai/30min

Confluence Enhancements

We’ve made improvements to the Confluence Connector related to the following:

  • Auto-Sync Improvements

    • The auto-sync process will now index new pages that are added to a previously synced parent page. If a user syncs their entire Confluence account, then the space will be the top-most file.

    • If pages are deleted from a synced parent page in Confluence, the scheduled sync will remove them from the synced content.

  • File Metadata Enhancements

    • The file_metadata property now includes additional information about the type of Confluence item each file represents (spaces and pages).

    • The file_metadata property will also record the external_id of the file’s parent and root, providing better context and hierarchy information.

  • To take advantage of these updates, users will need to reconnect their Confluence account and re-sync their Confluence files.

Reranker Models for Search

We are excited to introduce native support for reranker models. With this release, customers now have the option to rerank search result chunks to provide more relevant and accurate results.
How it works:

  • When making a search query via the embeddings endpoint, customers can control the reranking behavior by setting the rerank parameter in the payload.

    • If rerank is set to "JINA_MULTILINGUAL_BASE_V2" the search result chunks will be reranked using the Jina reranking algorithm.

    • If rerank is set to "COHERE_RERANK_MULTILINGUAL_V3", the search result chunks will be reranked using the Cohere reranking algorithm.

    • If the rerank parameter is not specified or set to any other value, the default ranking will be used.

  • The response format from the embeddings endpoint remains consistent regardless of whether rerank is enabled or not.
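
  • For reference, a request using the rerank parameter might look like the following, mirroring the payload shape used for the Pongo reranker elsewhere in this changelog (the query and k values are placeholders):

{
  "query": "how is anime made?",
  "k": 5,
  "rerank": { "model": "COHERE_RERANK_MULTILINGUAL_V3" }
}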

We’ll be adding support for more reranker models in the weeks to come!

New Webhook: WEBSCRAPE_URLS_READY

We’ve added a new webhook named WEBSCRAPE_URLS_READY that triggers each time a specific web page from a web scrape request is finished processing.

Introducing Carbon Connect 3.0

We’re thrilled to announce the beta release of Carbon Connect 3.0, packed with exciting updates and improvements based on customer feedback.

Key Features and Improvements

1. Seamless File and Folder Uploads
Carbon Connect 3.0 now supports both file and folder uploads by default, eliminating the need for the filePickerMode property. Uploading entire folder directories is now a breeze with our new drag-and-drop functionality.

2. Carbon’s In-House File Picker
We’re excited to introduce Carbon’s in-house file picker, now available for all connectors except Slack, Gmail, and Outlook (currently in development). To use Carbon’s file picker instead of the source’s file picker, simply set the new useCarbonFilePicker property to true.

3. Enhanced In-Modal Notifications
We’ve completely replaced toast notifications with in-modal notifications, providing a more cohesive and user-friendly experience. As a result, the enableToasts property has been removed.

4. Customizable Theme Options
Personalize your Carbon Connect experience with our new theme options. Use the theme property to set the application’s theme to light, dark, or auto (default). When set to auto, Carbon Connect will automatically adapt to your system’s theme.

5. Simplified File Limit Control
Limiting the number of files is now easier than ever. Simply set the maxFilesCount property to 1 to restrict uploads to a single file. The allowMultipleFiles property has been removed for a more straightforward approach.

Upcoming Enhancements
We’re continuously working to improve Carbon Connect and have exciting plans for the near future:

1. Enhanced Customization Options
We’re working on bringing back customization options from Carbon Connect 2.0, including loadingIconColor, primaryBackgroundColor, primaryTextColor, secondaryBackgroundColor, and secondaryTextColor.

2. Expanded In-House File Pickers
In the coming weeks, we’ll be launching Carbon’s in-house file pickers for Outlook, Slack, and Gmail, providing a consistent and seamless experience across all connectors.

Installation
You can install the new component for testing via the command npm install carbon-connect@beta. We plan to bring 3.0 out of beta by the end of the month!

Here’s a Loom video providing a quick walkthrough of the new modal: https://www.loom.com/share/b7b241fa5e5e4d0a92fb5e748d3d6ec3

External URLs Filter

A new external_urls filter has been added to the user_files_v2 endpoint. This filter allows you to refine the results returned by the endpoint based on a list of external_urls passed.

File Deletion Enhancements 

  • When a customer deletes a file from Carbon (via delete_files_v2), they have the flexibility to control whether the file row in the database is preserved or marked as deleted when deleting a file.

    • This behavior is managed by the preserve_file_record flag. If preserve_file_record is set to true, then we delete the files stored in our S3/GCS while keeping the file record and metadata to allow for re-syncs and auto-syncs.

    • We also added a file_contents_deleted field to the user_files_v2 endpoint. If the field is returned as true, then the file record still exists, but the stored file content is deleted.

  • Find more details here.

High Accuracy Mode 

  • We’ve introduced a new optional boolean parameter to the /embeddings endpoint called high_accuracy. If set to true, then vector search may give more accurate results at a slight performance penalty. By default, it’s false.

  • Find more details here.
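
  • For example, an /embeddings request enabling it might look like this (the query and k values are placeholders, following the payload shape used elsewhere in this changelog):

{
  "query": "how is anime made?",
  "k": 5,
  "high_accuracy": true
}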

To And From Filters for Outlook and Gmail

  • We added 2 more filters, to and from, for syncing emails from Outlook and Gmail (see the sketch below):

  • Note: Outlook only supports from filters.
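
  • A hedged sketch of the from filter, following the same filter shape as the Gmail sent filter shown elsewhere in this changelog (the address is a placeholder):

{ "filters": { "key": "from", "value": "someone@example.com" } }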

Intercom Auto-Sync Update

  • We are thrilled to announce 2 updates to our Intercom connector:

    • Carbon can now sync multiple Intercom Help Centers:

      • Help Centers are now synced into Carbon as files, and Help Centers and articles form a parent-child relationship.

      • Just as only published articles are synced, only activated Help Centers will be synced.

    • Carbon can now sync any newly published articles when auto-sync is enabled.

  • Reconnecting Existing Intercom Connections:

    • If you have existing Intercom connections in Carbon, please note that you will need to reconnect them to enable the updates above.
