
Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volume. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

0 votes
0 answers
51 views

fread() takes 60GB of RAM to load a 22GB CSV dataset [duplicate]

I am loading a CSV file into RStudio using fread(), and although the file is only 22GB, my memory usage reaches 60GB of my 64GB. Why is that? This becomes a problem right after as I need to join ...
Marti's user avatar
0 votes
0 answers
44 views

Dealing with too many columns in R [closed]

I am facing issues due to low memory caused by the use of too many columns. I have 900 data frames (df) each with 2 million rows. Each df contains values for one individual. I tried to merge all the ...
Kostas's user avatar
1 vote
0 answers
23 views

Hash a sparse vector from CountVectorizer

I am very new to Spark, so bear with me. I am currently trying to hash feature vectors generated by CountVectorizer. So for the following example with a hash size of 50: +---+--------------------+--...
EyedBread's user avatar
0 votes
0 answers
10 views

ES: Why does Refresh delete a segment with "committed=false"?

Background: There is an index named "i_dm_f_da_enterprise" in ES, and I disabled auto-refresh by setting "refresh_interval" : -1. I notice that the refresh operation of the ...
Junbo Wang's user avatar
0 votes
0 answers
35 views

Finding Top Users with Common Records in a Growing Dataset

I’m working on a project where I have a large dataset containing billions of records. Each user can have one or more records, and each record can be associated with multiple users. Given a specific ...
Herman Streltsov's user avatar
0 votes
0 answers
19 views

How to append time-series data with PyArrow Datasets?

Problem: I'm looking to store time-series data that's being aggregated live to Parquet Datasets via PyArrow. I receive live batched data, for example, video view count each hour for the last 24 hours. ...
humanlikely's user avatar
0 votes
0 answers
31 views

Optimizing Data Model for Frequent IoT Sensor Data Updates in Data Warehouse

I am working on a data engineering project involving IoT sensor data. In this project, I have a use case where: We have a table in the data warehouse to store sensor data generated every second by 10 ...
Harshit Chandani's user avatar
0 votes
0 answers
24 views

Standalone Spark 3.3.0 Java application throws access denied exception when reading from files on mounted drive

I'm using Spark 3.3.0 on a standalone cluster, and I have a mounted drive from which I need to read some files that arrive periodically. As the Spark application is running as the spark user and the mounted drive is ...
Petar Markov's user avatar
1 vote
0 answers
39 views

How to load a .dat file into Hive with additional columns?

I want to load a .dat file (without headers) into a Hive external table. But the Hive table has extra columns, such as cob_date, region, and file_name, which are not present in the .dat file. cob_date will be the ...
Big data Pyspark's user avatar
-1 votes
0 answers
14 views

How to get MetricQueryService URL?

@Override public CompletableFuture<MetricQueryServiceGateway> retrieveService(String rpcServiceAddress) { return rpcService.connect(rpcServiceAddress, MetricQueryServiceGateway.class); } I ...
seedoilz's user avatar
-1 votes
0 answers
14 views

Assistance with Integrating Open Data Cube and Cesium for 3D Topographical Earth Model

I am currently working on a project that involves developing an integrated big data processing system and visualizing the data in a 3D topographical Earth model. I am seeking assistance with the ...
Angel Zaldivar's user avatar
-1 votes
0 answers
10 views

Why doesn't a Livy Server delete request terminate the YARN application if the session is in the starting state?

I am seeing the following Livy Server behavior: if the Livy session state is starting and I call the delete API for this session, Livy Server returns a success status saying the session was deleted, and the Livy Server UI also doesn't ...
agarwal_achhnera's user avatar
0 votes
0 answers
21 views

How to build ActorSystem in Flink 1.13.5?

This is how I build ActorSystem in Flink 1.8.5. public static ActorSystem createNewActorSystem() throws Exception { String ip = HostPortUtil.getLocalIp(); Configuration configuration = new ...
seedoilz's user avatar
-1 votes
0 answers
17 views

Save data into Hive from worker nodes using Apache Spark

I am working on a data backloading task using Apache Spark and Hive. I am loading the data from Hive, mapping it to another schema, and storing the new result in Hive. Now there might be a few failures ...
Destravna's user avatar
0 votes
1 answer
28 views

Comparing two types of data in BigQuery

We have a very big dataset, and I need to get all the values that map from the source attributes to normalized attributes in my JSON. The relation between normalized and source is that if the ...
Ankita Prasad's user avatar
