So we continue to have guests on our show to talk to us about interesting things... This time it's about Apache Tika, an incredible tool for search-oriented file processing and metadata extraction. Imagine you have tons of unstructured files, like emails or documents, and you want to extract, index, and then search them. This is Tika's purpose. And who better to walk us through how it does its magic than its Project Management Committee (PMC) Chair, Tim Allison!
So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).
http://www.javapubhouse.com/datadog
We thank DataDogHQ for sponsoring this podcast episode
Don't forget to SUBSCRIBE to our cool NewsCast OffHeap!
http://www.javaoffheap.com/
Apache Tika
* https://tika.apache.org/
OpenSearch Project and OpenSearch Neural Plugin Tutorials
* https://opensearch.org/
* https://opensearch.org/docs/latest/search-plugins/neural-search/
* https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/
* https://opster.com/guides/opensearch/opensearch-machine-learning/opensearch-hybrid-search/
* https://sease.io/2024/01/opensearch-knn-plugin-tutorial.html
* https://sease.io/2024/04/opensearch-neural-search-tutorial-hybrid-search.html
Selected Advanced File Processing toolkits/services
* https://unstructured.io/
* https://aws.amazon.com/textract/
* https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence
Selected Hybrid Search/RAG toolkits (there are _MANY_ others!)
* Haystack: https://haystack.deepset.ai/
* LangChain: https://www.langchain.com/
* LangStream: https://langstream.ai/
Search/Relevance Conferences
* https://haystackconf.com/
* https://2024.berlinbuzzwords.de/
* https://mices.co/
Tim's personal project
* JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2
Do you like the episodes? Want more? Help us out! Buy us a beer!
https://www.javapubhouse.com/beer
And Follow us!
https://www.twitter.com/javapubhouse
Episode 103. Let's share data cross-language with Apache Arrow! (among other things)
Episode 102. Oh my... Spring Boot 3 is out! An interview with Dan Vega from the Pivotal Team!
Episode 101. All right, let's talk about Kafka
Episode 100. To the CLOUD... Which one? All of them!
Episode 99. SHHH! It's a secret! (Storing API Keys / Passwords / tokens!)
Episode 98. It's HERE, FINALLY HERE! Java 17 LTS Release
Episode 97. Hey there Scala 3! Looking good with those new Features!
Episode 96. Watching Metrics w/Micrometer and Statsd
Episode 95. Ludicrous speed! Practical GraalVM
Episode 94. Oh, put on your hat Dr. Watson, we are sleuthing this Heap Dump
Episode 93. Not your Grandpa's Serialization Part DEUX!
Episode 92. Not your Grandpa's Serialization!
Episode 91. OracleJDK? OpenJDK?, Zulu? Corretto? So many!
Episode 90. Let's get Recording (AND VIDEO!)
Episode 89. Kubernetes! (Oh container orchestration)
Episode 88. Logging! (An Interview w/Renaud from DataDog)
Episode 87. Ok, it's time to get Reactive!
Episode 86. Move Over Slow Startup times, GraalVM...IS...HERE. (and cross-language support, and less memory footprint...)
Episode 85. Monitor the World with JMX!