March 2024

syncFilesOnConnection For More Data Sources

  • We’ve added the sync_files_on_connection parameter to the oauth_url endpoint for the following data sources: Intercom, Salesforce, Zendesk, Confluence, Freshdesk, and Gitbook.

  • This parameter is also accessible for each enabledIntegration in Carbon Connect. You can find more information about this here.

  • By default, this parameter is set to true. When enabled, all files will be synchronized automatically after a user connects their account. This is particularly useful when a user connects a data source that doesn’t have a built-in file picker.

Delete Child Files Based on Parent ID

  • Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false.

  • Find more details here.

upload_chunks_and_embeddings Updates

  • You can now upload only chunks to Carbon via the upload_chunks_and_embeddings endpoint, and we will generate the embeddings for you. This is useful for migrations between embedding models and vector databases.

  • In the API request, you can exclude embeddings and set chunks_only to true. Then, include your embedding model API key (OpenAI or Cohere) under custom_credentials:

{ "api_key": "lkdsjflds" }

  • Make sure to include some delay between requests. Stricter limits also apply when chunks_only is true: each request can include at most 100 chunks.
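
The 100-chunk limit above can be handled by batching on the client side. The following is a minimal sketch, not Carbon's client code: the helper name build_chunk_only_requests and the top-level "chunks" field are assumptions made for illustration, and the real request schema may nest chunks differently. Only chunks_only and custom_credentials come from the notes above.

```python
from typing import Any, Dict, Iterator, List

MAX_CHUNKS_PER_REQUEST = 100  # limit stated above for chunks_only requests


def build_chunk_only_requests(
    chunks: List[Dict[str, Any]],
    embedding_api_key: str,
    batch_size: int = MAX_CHUNKS_PER_REQUEST,
) -> Iterator[Dict[str, Any]]:
    """Yield one request body per batch of at most `batch_size` chunks."""
    if not 1 <= batch_size <= MAX_CHUNKS_PER_REQUEST:
        raise ValueError("batch_size must be between 1 and 100")
    for start in range(0, len(chunks), batch_size):
        yield {
            "chunks_only": True,
            "custom_credentials": {"api_key": embedding_api_key},
            # Hypothetical field name: check the API reference for the real layout.
            "chunks": chunks[start:start + batch_size],
        }
```

When sending the yielded bodies, pause between requests (for example with time.sleep); the right delay is not specified here, so treat any value as a starting point.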

Data Source Connections with Pre-Existing Auth

  • If you’re using our white labeling add-on, we added a new POST endpoint /integrations/connect so customers can bypass the authentication flow on Carbon by directly passing in an access token.

  • The request takes an authentication object that contains all the data needed to connect to a user’s account. The object varies by data source; a list of the required keys can be found in our docs. If the connection is successful, the upserted data source is returned.

  • For some data source types, this endpoint also returns a sync URL that initiates the sync process.

Improvements to CSV, TSV, XLSX, GSheet Parsing

  • You now have the option to chunk CSV, TSV, XLSX, and Google Sheets files by tokens via the chunk_size parameter and/or by rows via the max_items_per_chunk parameter. When a file is processed, we add rows to a chunk until adding the next row would exceed chunk_size or max_items_per_chunk.

  • If a single row exceeds chunk_size or the embedding model’s limit for number of tokens, then the file’s sync_error_message will point out which row has too many tokens.

  • For example:

    • If each CSV row is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is not set, then each chunk will contain 3 CSV rows.

    • If each CSV row is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is set to 1, then each chunk will contain 1 CSV row.

  • Consequently, it is essential to ensure that the number of tokens in a CSV row does not exceed the token limit of the embedding model. Token counting is currently only supported for OpenAI models.

  • You can find more details here.
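
The row-packing rule above can be sketched in a few lines. This is an illustrative model of the described behavior, not Carbon's implementation: the function name chunk_rows is hypothetical, and it takes precomputed token counts per row rather than tokenizing anything itself.

```python
from typing import List, Optional


def chunk_rows(
    row_token_counts: List[int],
    chunk_size: int,
    max_items_per_chunk: Optional[int] = None,
) -> List[List[int]]:
    """Group row indices into chunks under a token budget and an optional row cap."""
    chunks: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for index, tokens in enumerate(row_token_counts):
        # A single oversized row cannot fit in any chunk, so report which row it is.
        if tokens > chunk_size:
            raise ValueError(
                f"row {index} has {tokens} tokens, over chunk_size {chunk_size}"
            )
        too_many_tokens = current_tokens + tokens > chunk_size
        too_many_rows = (
            max_items_per_chunk is not None and len(current) >= max_items_per_chunk
        )
        # Close the current chunk before adding a row that would break a limit.
        if current and (too_many_tokens or too_many_rows):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
```

With six 250-token rows and chunk_size of 800, chunk_rows([250] * 6, 800) groups the rows three per chunk, matching the first example above; adding max_items_per_chunk=1 yields one row per chunk.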

Improvements to OCR

  • Table parsing in PDFs has been improved significantly with this most recent OCR update.

  • In order to use the enhanced table parsing features, you need to set parse_pdf_tables_with_ocr to true when uploading PDFs (use_ocr must also be true).

    • Any tables parsed when parse_pdf_tables_with_ocr is true have their own chunk(s) assigned to them. These chunks can be identified by the presence of the string TABLE in embedding_metadata.block_types.

    • The format of these tabular chunks will be the same format as CSV-derived chunks.

    • Using this table-parsing feature in conjunction with hybrid search should provide much better search results than before (assuming the PDF has tables that need to be searched over).

  • If you’re using OCR we now also return metadata such as coordinates and page numbers even if set_page_as_boundary is set to false.

    • Specifically, we will return the bounding box coordinates as well as the start and end page number of the chunk.

    • If pg_start < pg_end, interpret the bounding box coordinates slightly differently: x1 and x2 correspond to the minimum x1 and maximum x2 over all pages for the chunk, y1 corresponds to the upper-most coordinate of the part of the chunk on pg_start, and y2 corresponds to the bottom-most coordinate of the part of the chunk on pg_end.
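
To make the multi-page rule concrete, here is a small sketch that derives the combined box from per-page pieces of a chunk. The PageFragment structure and the function name are hypothetical and exist only for this illustration; the sketch also assumes one fragment per page, with y1 as the top edge and y2 as the bottom edge of each fragment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageFragment:
    """The part of a chunk that sits on one page (hypothetical structure)."""
    page: int
    x1: float
    y1: float
    x2: float
    y2: float


def merged_bounding_box(fragments: List[PageFragment]) -> Dict[str, float]:
    """Combine per-page fragments following the multi-page rule described above."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    ordered = sorted(fragments, key=lambda f: f.page)
    first, last = ordered[0], ordered[-1]
    return {
        "pg_start": first.page,
        "pg_end": last.page,
        "x1": min(f.x1 for f in ordered),  # minimum x1 over all pages
        "x2": max(f.x2 for f in ordered),  # maximum x2 over all pages
        "y1": first.y1,                    # top of the part on pg_start
        "y2": last.y2,                     # bottom of the part on pg_end
    }
```

Because y1 and y2 come from different pages, y2 can be numerically smaller than y1 for a multi-page chunk, so the box should not be drawn as a single rectangle on one page.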

Carbon Connect 2.0 (Beta)

  • We are thrilled to announce the beta launch of Carbon Connect 2.0, with the following improvements:

    • Support for multiple active accounts per data source.

    • Improved data source list.

    • Built-in user interface for users to view and re-sync files per account.

    • Ability for users to directly disconnect active accounts.

  • To install Carbon Connect 2.0, run npm install carbon-connect@2.0.0-beta5. It is not treated as the latest version of Carbon Connect, so you won’t get this version automatically.

  • A few other important updates for Carbon Connect 2.0:

    • We’ve removed file details from the payload of UPDATE callbacks. If you previously read files from that payload, you’ll now need to use our SDK or API to fetch the updated files when a data source updates.

    • When specifying embedding models, use the format embeddingModel={EmbeddingGenerators.OPENAI_ADA_LARGE_1024} instead of writing out a string.

    • You can hide our built-in UI for viewing and re-syncing files using the showFilesTab param on either the global component or enabledIntegration level.

Scheduled Syncs Per User and Data Source

  • Control user and data source syncing using the /update_users endpoint, which lets organizations enable syncing for particular users and data source types. The endpoint accepts a list of user IDs and data sources, with an option to enable syncing for all sources using the string 'ALL'.

    • Each request supports up to 100 customer IDs.

  • In the following example, future Gmail accounts for specified users will automatically have syncing enabled according to the provided settings.

{ "customer_ids": ["", ""], "auto_sync_enabled_sources": ["GMAIL"] }

  • Find more details in our documentation here.

  • Note: This update is meant to replace our file-level sync logic, and any existing auto-syncs have been migrated over to use the updated logic.
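
Since each /update_users request supports up to 100 customer IDs, longer lists need to be split across requests. A minimal sketch, assuming the request body has exactly the two fields shown in the example above; the helper name build_update_users_requests is hypothetical, and the sources value is passed through unchanged because these notes do not say whether 'ALL' is sent as a bare string or inside a list.

```python
from typing import Any, Dict, List, Union

MAX_CUSTOMER_IDS_PER_REQUEST = 100  # limit stated above


def build_update_users_requests(
    customer_ids: List[str],
    sources: Union[str, List[str]],
) -> List[Dict[str, Any]]:
    """Split customer IDs into request bodies of at most 100 IDs each."""
    step = MAX_CUSTOMER_IDS_PER_REQUEST
    return [
        {
            "customer_ids": customer_ids[start:start + step],
            "auto_sync_enabled_sources": sources,
        }
        for start in range(0, len(customer_ids), step)
    ]
```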

Delete Files Based on Filters

  • We added the /delete_files_v2 endpoint, which allows customers to delete files using the same filters as /user_files_v2.

  • We plan to deprecate the /delete_files endpoint in a month.

  • Find more details in our documentation here.

Filtering for Child Files

  • We added the ability to include all descendant (child) files on both /delete_files_v2 and /user_files_v2 when filtering.

  • Filters applied to the endpoint extend to the returned child files.

  • We plan to deprecate the parent_file_ids filter on the /user_files_v2 endpoint in a month.

Customer Portal v1

  • We’ve officially launched v1 of our Customer Portal.

  • You can currently manage your API keys directly via the Portal, and we plan to release the following functionality next quarter:

    • User management

    • Usage monitoring

    • Billing management

  • For current customers, you can reset your password with the email provided to Carbon to gain access. If you don’t know the email you have on file, DM me!

integrations/items/list Improvements

  • We are implementing four distinct filters: external_ids, ids, root_files_only, and name, each of which filters on its respective field.

    • The root_files_only filter will exclusively return top-level files. However, if a parent_id is specified, then root_files_only can’t be specified and vice versa.

  • The external_url has been added to the response body of the integrations/items/list endpoint.

  • See more details here.
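
The mutual exclusion between root_files_only and parent_id can be checked on the client before a request is sent. This is a sketch under assumptions: the helper name build_items_list_filters is hypothetical, and the flat dictionary it returns is only an illustration, since these notes do not describe where the filters sit in the request body.

```python
from typing import Any, Dict, List, Optional


def build_items_list_filters(
    external_ids: Optional[List[str]] = None,
    ids: Optional[List[int]] = None,
    name: Optional[str] = None,
    root_files_only: Optional[bool] = None,
    parent_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect the filters that were set, rejecting the disallowed combination."""
    if root_files_only is not None and parent_id is not None:
        raise ValueError("root_files_only and parent_id cannot be specified together")
    candidates = {
        "external_ids": external_ids,
        "ids": ids,
        "name": name,
        "root_files_only": root_files_only,
        "parent_id": parent_id,
    }
    # Leave out anything the caller did not set.
    return {key: value for key, value in candidates.items() if value is not None}
```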

Multiple Active Accounts Per Data Source

  • Carbon now supports multiple active accounts per data connection!

  • We’ve introduced two new parameters across various API endpoints to support this functionality across all our connectors. While these parameters are optional for users with a single data source of each type, they become mandatory when managing multiple accounts.

    • /integrations/oauth_url

      • data_source_id: Specifies the data source from which synchronization should occur when dealing with multiple data sources of the same type.

      • connecting_new_account: This parameter is utilized to consistently generate an OAuth URL as opposed to a sync URL. A sync URL is the destination where users are redirected after a successful OAuth authentication to synchronize their files. While this parameter can be skipped when adding the first data source of that type, it should be explicitly specified for subsequent additions.

    • /integrations/s3/files, /integrations/outlook/sync, /integrations/gmail/sync

      • data_source_id: Used to specify the data source for synchronization when managing multiple data sources of the same type.

    • /integrations/outlook/user_folders, /integrations/outlook/user_categories, /integrations/gmail/user_labels

      • data_source_id: Specifies the data source to be utilized when there are multiple data sources of the same type.

  • Note that the following endpoints already require a data_source_id: /integrations/items/sync, /integrations/items/list, /integrations/files/sync, /integrations/gitbook/spaces, /integrations/gitbook/sync.
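
The "optional with one account, mandatory with several" rule for data_source_id can be enforced with a small client-side guard. This is only a sketch: the function name resolve_data_source_id is hypothetical, the list of connected IDs is assumed to be known by the caller, and how the API itself responds to a missing data_source_id is not described here.

```python
from typing import List, Optional


def resolve_data_source_id(
    connected_ids: List[int],
    data_source_id: Optional[int] = None,
) -> Optional[int]:
    """Return the data_source_id to send, or None when it can be omitted.

    `connected_ids` holds the IDs of the user's data sources of one type.
    """
    if data_source_id is not None:
        if data_source_id not in connected_ids:
            raise ValueError(f"unknown data_source_id: {data_source_id}")
        return data_source_id
    if len(connected_ids) > 1:
        raise ValueError(
            "data_source_id is required when multiple accounts of the same type are connected"
        )
    # Zero or one account of this type: the parameter is optional.
    return None
```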

