February 2024

We now support embedding generation using OpenAI’s text-embedding-3-small and text-embedding-3-large models.
To choose the embedding model, set the embedding_model parameter in the POST body for /embeddings and other API endpoints. If no model is specified, the system defaults to OPENAI (the original Ada-2).
Find more details on the models available here.
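As a rough sketch of how the parameter might be set, the helper below builds a POST body for /embeddings. Only embedding_model and the OPENAI default come from the notes above; the query field is an assumed name used for illustration, and the values for the newer models should be taken from the API reference.

```python
import json
from typing import Optional


def build_embeddings_body(query: str, embedding_model: Optional[str] = None) -> dict:
    """Build a JSON-serializable POST body for /embeddings.

    `query` is a hypothetical field name used for illustration only.
    """
    body = {"query": query}
    # Omit embedding_model entirely to fall back to the server default (OPENAI).
    if embedding_model is not None:
        body["embedding_model"] = embedding_model
    return body


# "OPENAI" is the default named in the notes; values for the newer models
# are not guessed here.
default_body = build_embeddings_body("refund policy")
explicit_body = build_embeddings_body("refund policy", embedding_model="OPENAI")
print(json.dumps(explicit_body))
```

Leaving the key out, rather than sending an empty value, keeps the request on whatever default the server applies.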

The presigned_url field under user_files_v2 now returns a pre-signed URL to the raw HTML content of each web page.
The parsed_text_url field still returns a pre-signed URL to the corresponding plain text.
Find more details here.

The file_metadata field under user_files_v2 now returns og:image and og:description for each web page.
Find more details here.

You can now exclude specific CSS selectors from web scraping. Text content within the matching elements does not appear in the parsed plaintext, chunks, or embeddings. This is useful for omitting irrelevant elements, such as headers or footers, which might affect semantic search results.
The web_scrape request object supports a new field:
css_selectors_to_skip: Optional[list[str]] = []
Find more details here.
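A minimal sketch of a request object carrying the new field might look like the following. The css_selectors_to_skip name, type, and empty default are taken from the notes above; the WebScrapeRequest class, the url field, and the selectors are assumptions for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class WebScrapeRequest:
    # `url` is an assumed field name; only css_selectors_to_skip is from the notes.
    url: str
    # Mirrors the documented signature: Optional[list[str]] = []
    css_selectors_to_skip: Optional[list[str]] = field(default_factory=list)


request = WebScrapeRequest(
    url="https://example.com/docs",
    css_selectors_to_skip=["header", "footer", ".cookie-banner"],
)
payload = asdict(request)  # plain dict, ready to serialize as JSON
```

Using default_factory avoids sharing one mutable list between request objects while keeping the documented empty-list default.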

We launched our Gitbook integration today. It syncs pages from any public and shared spaces.
The Carbon Connect enabledIntegrations value for Gitbook is GITBOOK.
Gitbook does not come with a pre-built file selector, so we added two endpoints for listing and syncing Gitbook spaces:
- List all Gitbook spaces with /integrations/gitbook/spaces (API Reference)
- Sync multiple spaces at once with /integrations/gitbook/sync (API Reference)
You can also use our global endpoints for listing and syncing specific pages in Gitbook spaces:
- List pages in spaces with the global endpoint /integrations/items/list
- Sync pages in spaces with the global endpoint /integrations/files/sync
- Note: Spaces are treated like folders via the Carbon API.
See more specifics about our Gitbook integration here.
Note: our Gitbook page parser is still in beta so feedback is much appreciated!

We’re transitioning file deletion from sync to async processing.
This means that the FILE_DELETED webhook event will no longer fire immediately; it will fire once the file has actually been deleted.
We are also limiting each /delete_files request to 50 files to reduce the load on our servers. We advise spacing out delete requests every 24 hours.
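One way to stay under the 50-file limit is to split a larger list of files into batches before calling /delete_files. The sketch below only builds the request bodies; the file_ids body field is a hypothetical name, and sending and spacing out the requests is left to the caller.

```python
from typing import Iterator

MAX_FILES_PER_DELETE = 50  # per-request limit stated in the notes


def delete_batches(file_ids: list[int], size: int = MAX_FILES_PER_DELETE) -> Iterator[dict]:
    """Yield one request body per batch of at most `size` file IDs."""
    for start in range(0, len(file_ids), size):
        # "file_ids" is a hypothetical body field name.
        yield {"file_ids": file_ids[start:start + size]}


# 120 files split into batches of 50, 50, and 20.
bodies = list(delete_batches(list(range(1, 121))))
batch_sizes = [len(body["file_ids"]) for body in bodies]
```

Because deletion is now asynchronous, a caller would wait for the FILE_DELETED webhook instead of treating the response to each request as confirmation.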

We’ve launched our Pinecone destination connector! We offer support for both pod-based and serverless offerings.
Carbon seamlessly updates your Pinecone instance with the latest embeddings upon processing user files. Users gain full access to Carbon’s API endpoints, including hybrid search for supported sparse vector storage.
Find more details here.

Moving forward, we will be able to provide support for a greater number of SDKs and promptly release SDK support for API updates. If there is a language for which you want us to add SDK support, we should be able to turn that around in less than a week.
We’re adding support for the following languages today:
The current JavaScript SDK will continue to be supported for the next month, and it will be available longer term. However, new features will only be supported in the new TypeScript SDK moving forward.

Added an endpoint /delete_users that takes an array of customer IDs and deletes all those users.
Deleting a user revokes all of the user’s OAuth connections and deletes all of their files, embeddings, and chunks.
The request format is:

{ "customer_ids": ["USER_1", "USER_2", "USER_3"] }
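For illustration, the request body above could be produced as follows. The customer_ids key is taken from the format shown; the helper name and the empty-list check are local assumptions, not documented API behavior.

```python
import json


def build_delete_users_body(customer_ids: list[str]) -> str:
    """Serialize the /delete_users request body in the format shown above."""
    # Local guard only; how the API treats an empty list is not documented here.
    if not customer_ids:
        raise ValueError("customer_ids must contain at least one ID")
    return json.dumps({"customer_ids": list(customer_ids)})


body = build_delete_users_body(["USER_1", "USER_2", "USER_3"])
```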

All articles from an end user’s Salesforce Knowledge can be listed and synced via the global API endpoints /integrations/items/list and /integrations/files/sync.
The Carbon Connect integration (launching tomorrow) will sync all articles by default.
The enabledIntegrations value is SALESFORCE.
You can find more info here.

After connecting your Outlook account, you can use this endpoint to list all of your folders in Outlook.
This includes both system folders, such as the inbox, and user-created folders.
Find more details here.

After connecting a Gmail account, you can use the /integrations/gmail/user_labels endpoint to list all of your labels.
User-created labels have the type user, and Gmail’s default labels have the type system.
Find more details here.

Added a flag named delete_child_files to the delete_files endpoint. When set to true, it will delete all files that have the same parent_file_ids as the file submitted for deletion. This flag defaults to false.
Find more details here.

Added support for the JSON file format and a maxItemsPerChunk param that specifies the number of items to include in each chunk.
Added cssSelectorsToSkip to WEB_SCRAPE to define CSS selectors to exclude when converting HTML to plaintext.
Added SALESFORCE as an enabledIntegration on Carbon Connect.
For Salesforce, we added a param syncFilesOnConnection that defaults to true and will automatically sync all pages from a user’s Salesforce account.
We’ll be adding this param to other connectors too, meaning you can automatically sync all files from connectors that don’t have built-in file selectors (Gitbook, Confluence, etc).
This parameter was also added to the /integrations/oauth_url endpoint as sync_files_on_connection, where it also defaults to true.
