A backend engineer's journey of learning and growth.
by kan01234
Recently, we encountered and resolved a high memory usage problem in our Apache Camel application, which processes large batches of records for CSV export. This post breaks down the problem, root causes, and the steps we took to improve performance and stability.
Our application needed to export a large number of database records to a CSV file using Apache Camel. However, during execution, the application showed severe memory pressure, often leading to GC thrashing or even OutOfMemoryError in high-load environments.
After profiling the app, we identified two major bottlenecks:
The SQL query used a UNION and an ORDER BY, which prevented pagination or streaming at the JDBC level.
The Camel route used marshal().csv() and to("file://...") to write the CSV output.
The query used:
(SELECT ... FROM A ...)
UNION
(SELECT ... FROM B ...)
ORDER BY timestamp;
This design forced the database to materialize the entire result set, preventing any form of lazy fetching or cursor-based streaming. This meant Spring Data had no choice but to load everything at once.
Camel’s CSV marshalling and file writing behavior assumes the whole message is available and does not stream row-by-row unless explicitly told to do so.
Refactored the Query
Removed the UNION. We realized the combined data could be retrieved through a single, more general query using a discriminator column.
This change allowed Spring Data to use streaming query execution.
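For illustration, here is a minimal sketch of what the refactored repository query could look like; the MyEntity name, the sourceType discriminator column, and the field names are assumptions rather than our actual schema:
import java.util.Collection;
import java.util.stream.Stream;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface MyEntityRepository extends JpaRepository<MyEntity, Long> {

    // One general query with a discriminator column replaces the old
    // (SELECT ... FROM A) UNION (SELECT ... FROM B) ORDER BY timestamp
    @Query("SELECT e FROM MyEntity e WHERE e.sourceType IN :sources ORDER BY e.timestamp")
    Stream<MyEntity> streamBySources(@Param("sources") Collection<String> sources);
}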
Enabled Streaming in Spring Data
Instead of returning a List<MyEntity>, the repository method now returns a Stream:
@Query("SELECT e FROM MyEntity e")
Stream<MyEntity> streamAll();
Combined with a read-only transaction on the method that consumes the stream, so the JDBC connection stays open while the rows are read:
@Transactional(readOnly = true)
This enabled MySQL streaming result sets through Spring Data JPA (using ResultSet.TYPE_FORWARD_ONLY and fetchSize=Integer.MIN_VALUE under the hood).
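To make the flow concrete, here is a rough sketch of how the stream can be consumed inside a read-only transaction and handed to the Camel route; the service name, the ProducerTemplate wiring, and the repository method are illustrative assumptions on top of what is described above:
import java.util.stream.Stream;
import org.apache.camel.ProducerTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class CsvExportService {

    private final MyEntityRepository repository;
    private final ProducerTemplate producer; // provided by camel-spring-boot

    public CsvExportService(MyEntityRepository repository, ProducerTemplate producer) {
        this.repository = repository;
        this.producer = producer;
    }

    @Transactional(readOnly = true) // keeps the JDBC connection open while rows are read
    public void export() {
        // try-with-resources closes the stream (and the underlying cursor) when the route finishes
        try (Stream<MyEntity> rows = repository.streamAll()) {
            // hand an Iterator to the route so split(body()).streaming() can consume it lazily
            producer.sendBody("direct:start", rows.iterator());
        }
    }
}
Depending on the driver and Hibernate configuration, the repository method may also need an explicit fetch-size query hint for MySQL to stream rather than buffer the result set.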
⚠️ Note: Avoid using offset-based pagination in large datasets — it leads to performance degradation and doesn’t scale well in production.
We changed the route from:
from("direct:start")
.marshal().csv()
.to("file://output");
To a custom streaming approach:
from("direct:start")
.split(body()).streaming()
.process(exchange -> {
MyEntity entity = exchange.getIn().getBody(MyEntity.class);
String line = convertToCsv(entity);
Files.write(outputPath, (line + "\n").getBytes(), StandardOpenOption.APPEND);
});
This way, each line is written immediately, avoiding large in-memory CSV representations.
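One refinement worth mentioning: appending with Files.write re-opens the file for every record. A buffered variant, which matches the buffered writing mentioned at the end of this post, could look roughly like this (the writer handling, file location, and convertToCsv body are illustrative assumptions):
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.apache.camel.builder.RouteBuilder;

public class CsvExportRoute extends RouteBuilder {

    private final Path outputPath = Path.of("output/export.csv"); // assumed output location

    @Override
    public void configure() {
        from("direct:start")
            // open the writer once per export and share it via an exchange property
            .process(exchange -> exchange.setProperty("csvWriter",
                    Files.newBufferedWriter(outputPath, StandardCharsets.UTF_8,
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND)))
            .split(body()).streaming()
                .process(exchange -> {
                    MyEntity entity = exchange.getIn().getBody(MyEntity.class);
                    BufferedWriter writer = exchange.getProperty("csvWriter", BufferedWriter.class);
                    writer.write(convertToCsv(entity));
                    writer.newLine();
                })
            .end()
            // flush and close the writer once the whole split has completed
            .process(exchange -> exchange.getProperty("csvWriter", BufferedWriter.class).close());
    }

    private String convertToCsv(MyEntity entity) {
        // illustrative: join whichever fields the export needs
        return entity.getId() + "," + entity.getTimestamp();
    }
}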
Heap usage dropped significantly (from 32 GB to 4 GB).
GC frequency and duration were reduced by over 80%.
The system can now handle millions of records without OOM risk.
Our batch job runtime decreased by 35% thanks to reduced memory pressure.
Review SQL queries carefully; set-based operations like UNION + ORDER BY can kill streamability.
Use streaming where possible in both your data source (JPA/JDBC) and data sink (Camel/file output).
Apache Camel is powerful, but you must explicitly configure it to process large datasets efficiently.
Prefer cursor-based iteration (e.g. a Stream<MyEntity> repository method) over loading whole result lists into memory.
Don’t rely on offset/limit pagination for high-volume jobs; use keyset pagination or streaming instead (see the sketch after this list).
Always profile and monitor heap usage before deploying batch jobs in production.
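Where true streaming is not an option, keyset (seek) pagination is a workable middle ground. A rough sketch with Spring Data, assuming a timestamp-ordered export and hypothetical method and field names:
import java.time.Instant;
import java.util.List;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface MyEntityKeysetRepository extends JpaRepository<MyEntity, Long> {

    // each batch starts strictly after the last timestamp already exported,
    // so the database never has to skip over OFFSET rows
    @Query("SELECT e FROM MyEntity e WHERE e.timestamp > :lastSeen ORDER BY e.timestamp ASC")
    List<MyEntity> findNextBatch(@Param("lastSeen") Instant lastSeen, Pageable page);
}
The caller would pass PageRequest.of(0, batchSize) and remember the last timestamp of each batch; in practice a unique tiebreaker column (such as the id) is added to the ordering so rows sharing a timestamp are not skipped.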
Of course, there are many other architectural solutions to handle large-scale batch processing — like pre-calculating results, splitting data by partition, or offloading processing to distributed systems. But those approaches often require more time, complexity, and effort to develop and maintain.
For us, optimizing the existing flow with streaming, query refinement, and buffered writing provided the most practical balance between performance and development cost.
tags: apache-camel - spring-boot - spring-data-jpa - batch-processing - csv-export - memory-optimization - java - streaming - performance - mysql - backend