March 2024
syncFilesOnConnection For More Data Sources
We’ve added the sync_files_on_connection parameter to the oauth_url endpoint for the following data sources: Intercom, Salesforce, Zendesk, Confluence, Freshdesk, and Gitbook. This parameter is also accessible for each enabledIntegration in Carbon Connect. You can find more information about this here.
By default, this parameter is set to true. When enabled, all files will be synchronized automatically after a user connects their account. This is particularly useful when a user connects a data source that doesn’t have a built-in file picker.
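For illustration, a minimal request-body sketch with automatic syncing turned off might look like the following; sync_files_on_connection is the documented parameter, while the service field and the overall payload shape are assumptions.
# Sketch of an oauth_url request body with automatic syncing disabled.
# sync_files_on_connection is the documented parameter; the "service" field
# name and the overall payload shape are assumptions for illustration.
payload = {
    "service": "INTERCOM",              # assumed field: one of the data sources listed above
    "sync_files_on_connection": False,  # skip the automatic sync after the user connects
}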
Delete Child Files Based on Parent ID
Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false. Find more details here.
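A rough sketch of such a request is shown below; delete_child_files is the documented flag, and the file_ids field is an assumption used for illustration.
# Sketch of a delete_files request that also removes child files.
# delete_child_files is the documented flag; the "file_ids" field is an assumption.
payload = {
    "file_ids": [1234],          # assumed field: the file(s) submitted for deletion
    "delete_child_files": True,  # also delete the related child files
}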
upload_chunks_and_embeddings Updates
You can now upload only chunks to Carbon via the upload_chunks_and_embeddings endpoint and we can generate the embeddings for you. This is useful for migrations between embedding models and vector databases.
In the API request, exclude the embeddings and set chunks_only to true. Then, include your embedding model API key (OpenAI or Cohere) under custom_credentials:
{ "api_key": "lkdsjflds" }
Make sure to include some delay between requests. There are also stricter limits on how many embeddings/chunks can be uploaded per request if chunks_only is true: each request can include at most 100 chunks.
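Putting this together, a chunks-only request might look roughly like the sketch below; chunks_only and custom_credentials are documented, while the chunk list’s field names are illustrative assumptions.
# Rough sketch of a chunks-only upload_chunks_and_embeddings request.
# chunks_only and custom_credentials are documented; the "chunks" field name
# and the chunk object's shape are illustrative assumptions. At most 100
# chunks are allowed per request when chunks_only is true.
payload = {
    "chunks_only": True,                          # Carbon generates the embeddings
    "custom_credentials": {"api_key": "sk-..."},  # your OpenAI or Cohere key
    "chunks": [                                   # assumed field name
        {"chunk": "First chunk of text..."},
    ],
}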
Data Source Connections with Pre-Existing Auth
If you’re using our white labeling add-on, we added a new POST endpoint /integrations/connect so customers can bypass the authentication flow on Carbon by directly passing in an access token. The request takes an authentication object that contains all the necessary pieces of data to connect to a user’s account. The object will vary by data source, and a list specifying the required keys can be found in our docs. If the connection is successful, the upserted data source will be returned.
For some data source types, this endpoint also returns a sync URL that will initiate the sync process.
Improvements to CSV, TSV, XLSX, GSheet Parsing
You now have the option to chunk CSV, TSV, XLSX, and Google Sheets files by tokens via the chunk_size parameter and/or by rows via the max_items_per_chunk parameter. When a file is processed, we will add rows to a chunk until adding the next row would exceed chunk_size or max_items_per_chunk.
If a single row exceeds chunk_size or the embedding model’s token limit, then the file’s sync_error_message will point out which row has too many tokens. For example:
If each CSV row is 250 tokens and chunk_size is 800 tokens with no max_items_per_chunk set, then each chunk will contain 3 CSV rows.
If each CSV row is 250 tokens, chunk_size is 800 tokens, and max_items_per_chunk is set to 1, then each chunk will contain 1 CSV row.
Consequently, it is essential to ensure that the number of tokens in a CSV row does not surpass the token limits established by the embedding models. Token counting is currently only supported for OpenAI models.
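To reproduce the second scenario above, the relevant parameters would look something like the sketch below; chunk_size and max_items_per_chunk are the documented parameters, and the surrounding request shape is assumed.
# Sketch of the chunking parameters for a CSV/TSV/XLSX/Google Sheets upload;
# the surrounding request shape is assumed. With 250-token rows, chunk_size=800
# and no max_items_per_chunk gives 3 rows per chunk, while max_items_per_chunk=1
# forces 1 row per chunk.
upload_params = {
    "chunk_size": 800,         # token budget per chunk
    "max_items_per_chunk": 1,  # cap on rows per chunk
}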
You can find more details here.
Improvements to OCR
Table parsing in PDFs has been improved significantly with this most recent OCR update.
In order to use the enhanced table parsing features, you need to set parse_pdf_tables_with_ocr to true when uploading PDFs (use_ocr must also be true). Any tables parsed when parse_pdf_tables_with_ocr is true have their own chunk(s) assigned to them. These chunks can be identified by the presence of the string TABLE in embedding_metadata.block_types. The format of these tabular chunks is the same as that of CSV-derived chunks.
Using this table-parsing feature in conjunction with hybrid search should provide much better search results than before (assuming the PDF has tables that need to be searched over).
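As a sketch, the relevant upload flags look like this (both parameters are documented above; the surrounding request shape is assumed):
# Sketch of the PDF upload flags needed for enhanced table parsing; both flags
# are documented, and the surrounding request shape is assumed.
upload_params = {
    "use_ocr": True,                    # required for OCR-based parsing
    "parse_pdf_tables_with_ocr": True,  # only takes effect when use_ocr is true
}
# Resulting table chunks carry the string TABLE in embedding_metadata.block_types.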
If you’re using OCR, we now also return metadata such as coordinates and page numbers even if set_page_as_boundary is set to false. Specifically, we will return the bounding box coordinates as well as the start and end page numbers of the chunk.
In the event that pg_start < pg_end, you should interpret the bounding box coordinates slightly differently: x1 and x2 will correspond to the minimum x1 and maximum x2 over all pages for the chunk, y1 will correspond to the upper-most coordinate of the part of the chunk on pg_start, and y2 will correspond to the bottom-most coordinate of the part of the chunk on pg_end.
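As an illustrative sketch, metadata for a chunk spanning two pages might look like the following; the coordinate and page fields are documented above, but the exact key layout, units, and coordinate conventions are assumptions.
# Illustrative sketch of the metadata returned for a chunk spanning pages 3-4.
# pg_start, pg_end, x1, y1, x2 and y2 are the documented fields; the exact key
# layout, the units, and the coordinate origin are assumptions.
chunk_metadata = {
    "pg_start": 3,
    "pg_end": 4,
    "x1": 72.0,   # minimum x1 across all pages of the chunk
    "x2": 523.0,  # maximum x2 across all pages of the chunk
    "y1": 96.0,   # upper-most coordinate of the chunk's portion on pg_start
    "y2": 680.0,  # bottom-most coordinate of the chunk's portion on pg_end
}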
Carbon Connect 2.0 (Beta)
We are thrilled to announce the beta launch of Carbon Connect 2.0, with the following improvements:
Support for multiple active accounts per data source.
Improved data source list.
Built-in user interface for users to view and re-sync files per account.
Ability for users to directly disconnect active accounts.
To install Carbon Connect 2.0, run npm install carbon-connect@2.0.0-beta5. It is not treated as the latest version of Carbon Connect, so you won’t get this version automatically.
A few other important updates for Carbon Connect 2.0:
We’ve removed file details from the payload of UPDATE callbacks. If you used to get files this way, you’ll now need to use our SDK or API to fetch the updated files when a data source updates.
When specifying embedding models, make sure to use the format embeddingModel={EmbeddingGenerators.OPENAI_ADA_LARGE_1024} instead of just writing out a string.
You can hide our built-in UI for viewing and re-syncing files using the showFilesTab param at either the global component or enabledIntegration level.
Scheduled Syncs Per User and Data Source
Control user and data source syncing using the /update_users endpoint, which allows organizations to enable syncing for particular users and data source types. The endpoint accepts a list of user IDs and data sources, with an option to enable syncing for all sources using the string 'ALL'. Each request supports up to 100 customer IDs.
In the following example, future Gmail accounts for specified users will automatically have syncing enabled according to the provided settings.
{ "customer_ids": ["swapnil@carbon.ai", "swapnil.galaxy@gmail.com"], "auto_sync_enabled_sources": ["GMAIL"] }
Find more details in our documentation here.
Note: This update is meant to replace our file-level sync logic and any existing auto-syncs have been migrated over to use this updated logic.
Delete Files Based on Filters
We added the /delete_files_v2 endpoint, which allows customers to delete files via the same filters as /user_files_v2. We plan to deprecate the /delete_files endpoint in a month. Find more details in our documentation here.
Filtering for Child Files
We added the ability to include all descendant (child) files on both /delete_files_v2 and /user_files_v2 when filtering. Filters applied to the endpoint extend to the returned child files. We plan to deprecate the parent_file_ids filter on the /user_files_v2 endpoint in a month.
Customer Portal v1
We’ve officially launched v1 of our Customer Portal - portal.carbon.ai
You can currently manage your API keys directly via the Portal, and we plan to release the following functionality next quarter:
User management
Usage monitoring
Billing management
For current customers, you can reset your password with the email provided to Carbon to gain access. If you don’t know the email you have on file, DM me!
integrations/items/list Improvements
We are implementing four distinct filters: external_ids, ids, root_files_only, and name, each meant to filter data based on its respective field. The root_files_only filter will exclusively return top-level files. However, if a parent_id is specified, then root_files_only can’t be specified, and vice versa.
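A request for top-level items might look roughly like the sketch below; the filter names are documented, while the exact payload nesting is an assumption.
# Sketch of an integrations/items/list request using the new filters.
# external_ids, ids, root_files_only, name, parent_id and data_source_id are
# documented; nesting the filters under a "filters" object is an assumption.
# root_files_only and parent_id cannot be combined.
payload = {
    "data_source_id": 42,        # the connected data source to list items from
    "filters": {                 # assumed nesting
        "root_files_only": True, # only return top-level files
        "name": "Q1 report",     # filter items by name
    },
}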
The external_url field has been added to the response body of the integrations/items/list endpoint. See more details here.
Multiple Active Accounts Per Data Source
Carbon now supports multiple active accounts per data connection!
We’ve introduced two new parameters across various API endpoints to support this functionality across all our connectors. While these parameters are optional for users with a single data source of each type, they become mandatory when managing multiple accounts.
/integrations/oauth_url
data_source_id: Specifies the data source from which synchronization should occur when dealing with multiple data sources of the same type.
connecting_new_account: Used to consistently generate an OAuth URL as opposed to a sync URL. A sync URL is the destination where users are redirected after a successful OAuth authentication to synchronize their files. This parameter can be skipped when adding the first data source of a given type, but should be explicitly specified for subsequent additions.
/integrations/s3/files, /integrations/outlook/sync, /integrations/gmail/sync
data_source_id: Used to specify the data source for synchronization when managing multiple data sources of the same type.
/integrations/outlook/user_folders, /integrations/outlook/user_categories, /integrations/gmail/user_labels
data_source_id: Specifies the data source to be utilized when there are multiple data sources of the same type.
Note that the following endpoints already require a data_source_id: /integrations/items/sync, /integrations/items/list, /integrations/files/sync/, /integrations/gitbook/spaces, /integrations/gitbook/sync.
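For example, connecting a second account of the same type might look roughly like the sketch below; data_source_id and connecting_new_account are the documented parameters, while the service field is an assumption.
# Sketch of an /integrations/oauth_url request when the user already has an
# account of this type connected. data_source_id and connecting_new_account are
# the documented parameters; the "service" field name is an assumption.
payload = {
    "service": "GMAIL",              # assumed field: data source type
    "connecting_new_account": True,  # return an OAuth URL instead of a sync URL
}
# To get a sync URL for one specific existing account instead, pass its
# data_source_id rather than connecting_new_account.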